SciPy 2025

Real-time ML: Accelerating Python for inference (< 10ms) at scale
07-11, 13:15–13:45 (US/Pacific), Room 315

Real-time machine learning depends on features and data that, by definition, can't be pre-computed. Detecting fraud or acute diseases like sepsis requires processing events that emerged seconds ago. How do we build an infrastructure platform that executes complex data pipelines end-to-end, on-demand, in under 10 ms? All while meeting data teams where they are: in Python, the language of ML!
Learn how we built a symbolic interpreter that accelerates ML pipelines by transpiling Python into DAGs of static expressions. These expressions are optimized in C++ and ultimately run in production workloads at scale with Velox, an open-source (~4k stars) unified query engine written in C++ and developed at Meta.


Relevance to SciPy audience and high-level overview

  • Offers insights into parsing Python's abstract syntax tree using Python's ast module
  • Demonstrates how we parse type annotations to build a DAG that's later leveraged to dynamically create query plans at inference time
  • Filters and projections are pushed down end-to-end (E2E)
  • Parses Python functions to generate static expressions that are executed either:
    • in C++ after fetching data, or
    • as SQL UDFs at query time, achieving orders-of-magnitude speedups
      • We also built our own SQL driver in C++ that we interface with to replace SQLAlchemy
  • Python functions that can't be converted into static expressions run in isolated processes (a Ray cluster or custom subprocess) for parallelism
  • Executes the query plan using Velox
  • Velox is an extensible C++ execution engine library for building data management systems
    • Open source with ~4k stars on GitHub
  • Contributes upstream to Velox, as it wasn't built with ML use cases in mind
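As a concrete (and deliberately simplified) illustration of the AST-parsing idea above, the sketch below uses Python's ast module to read a hypothetical feature function, collect its type annotations, and render its return expression as a static expression tree. The function name, operator names, and expression format are illustrative assumptions, not Chalk's actual API.

```python
import ast

# Hypothetical feature function; in practice the source would come from
# user-defined resolver code.
SOURCE = """
def spend_ratio(amount: float, avg_spend: float) -> float:
    return amount / (avg_spend + 1.0)
"""

def to_expression(node: ast.AST) -> str:
    """Recursively render an AST node as a static expression string."""
    if isinstance(node, ast.BinOp):
        ops = {ast.Add: "plus", ast.Sub: "minus",
               ast.Mult: "multiply", ast.Div: "divide"}
        op = ops[type(node.op)]
        return f"{op}({to_expression(node.left)}, {to_expression(node.right)})"
    if isinstance(node, ast.Name):
        return node.id            # becomes a column reference in the plan
    if isinstance(node, ast.Constant):
        return repr(node.value)   # a literal
    raise NotImplementedError(type(node).__name__)

fn = ast.parse(SOURCE).body[0]

# Type annotations inform the DAG's schema.
arg_types = {a.arg: a.annotation.id for a in fn.args.args}

# The body of the return statement becomes a static expression.
ret = fn.body[0]
assert isinstance(ret, ast.Return)
expr = to_expression(ret.value)
print(arg_types)  # {'amount': 'float', 'avg_spend': 'float'}
print(expr)       # divide(amount, plus(avg_spend, 1.0))
```

A static expression like this can be handed to a C++ engine or compiled into a SQL UDF, since it no longer depends on the Python interpreter to evaluate.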

Background & Motivation

  • Real-time machine learning demands rapid response times
  • Python’s speed is often a bottleneck
  • Overview of the strategies we’ve employed to circumvent this
  • Need for real-time is only growing
  • Demand for contextually aware systems that can process and react to information instantly is increasing
  • Demand also grows as the individual touchpoints between humans and AI systems multiply

What are the real-world applications for real-time ML?

Inference, recommendations, reranking, etc.

  • Predictive monitoring + health
    • Predictive monitoring with sensor data (pathogens, pollution, IoT, etc.)
    • Acute disease detection
    • Operations, e.g. ambulance routing
  • Fraud: Is this transaction fraudulent? Is this website phishing (AI-generated content)?
    • Lots of features (which are only meaningful when meticulously combined)
  • Ecommerce: Recommendations, e.g. other shoppers also bought X
    • Collaborative filtering
  • Marketplaces: Re-ranking matches between buyers and sellers
    • Event-driven similarity and vector search at huge scale
    • A single user with real-time preferences vs. many sellers/products with real-time preferences

Methods & Approach

  • DAG construction using Python’s AST
  • Automatic parallel execution of concurrent pipeline stages
  • Vectorization of pipeline stages that are written using scalar syntax
  • Low-latency key/value stores like Redis, Bigtable, and DynamoDB to minimize cached feature fetch time
  • Statistics-informed join planning
  • JIT transformation of Python code into native code
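To illustrate the "automatic parallel execution of concurrent pipeline stages" point, here is a minimal sketch that groups the stages of a dependency DAG into waves that can run concurrently. The stage names and the `parallel_waves` helper are hypothetical, not part of the system described in the talk.

```python
from collections import defaultdict

def parallel_waves(deps: dict[str, set[str]]) -> list[set[str]]:
    """Group DAG nodes into waves: every stage in a wave has all of its
    dependencies satisfied by earlier waves, so a wave can run in parallel."""
    indegree = {n: len(d) for n, d in deps.items()}
    dependents = defaultdict(set)
    for node, node_deps in deps.items():
        for dep in node_deps:
            dependents[dep].add(node)
    ready = {n for n, k in indegree.items() if k == 0}
    waves = []
    while ready:
        waves.append(ready)
        nxt = set()
        for node in ready:
            for m in dependents[node]:
                indegree[m] -= 1
                if indegree[m] == 0:
                    nxt.add(m)
        ready = nxt
    return waves

# Hypothetical pipeline: two independent fetches, then a derived feature,
# then a model score that consumes both branches.
deps = {
    "fetch_user": set(),
    "fetch_txns": set(),
    "avg_spend": {"fetch_txns"},
    "score": {"fetch_user", "avg_spend"},
}
waves = parallel_waves(deps)
# Three waves: {fetch_user, fetch_txns} -> {avg_spend} -> {score}
```

The two fetch stages land in the same wave and can be dispatched concurrently (e.g. to a Ray cluster or subprocesses), which is where the latency win comes from.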

Results & Effects

  • Throughput of hundreds of millions of features per second
  • Sub-second latencies (E2E)
  • Zero-ETL - Our ML models can now be fully decoupled from ETL, unlocking affordances similar to infrastructure-as-code
  • Single source of truth
    • Features and logic are shared and consistent across teams, e.g. another team can reuse the exact same logic for computing creditworthiness
    • Version control - easily roll back to a previous iteration without needing to backfill data
  • Branch deploys - Data scientists can experiment in their own branches
    • Easily simulate model runs with historical production data
    • Experiment, A/B test, and evaluate models
  • Prevent drift between training/serving, staging/prod, etc.
    • It’s the same underlying data
  • Integrate more data sources
    • Minimize dependency and schema requests to your data engineering (DE) teams, since data is transformed post-fetch
    • Broaden context by pulling from Postgres, Kafka, AWS Glue, Snowflake, or a microservice/third-party API

Elliot Marx is one of the co-founders of Chalk. He started his career at Affirm, where he built the early risk and credit data infrastructure system (the inspiration for Chalk). He then co-founded Haven Money, which Credit Karma acquired to power its banking products. He holds a B.S. and M.S. in Computer Science from Stanford University.