SciPy 2025

Real-time ML: Accelerating Python for inference (< 10ms) at scale
07-11, 13:15–13:45 (US/Pacific), Room 315

Real-time machine learning depends on features and data that, by definition, can't be pre-computed. Detecting fraud or acute diseases like sepsis requires processing events that emerged seconds ago. How do we build an infrastructure platform that executes complex data pipelines end-to-end, on-demand, in under 10 ms? All while meeting data teams where they are: in Python, the language of ML!
Learn how we built a symbolic interpreter that accelerates ML pipelines by transpiling Python into DAGs of static expressions. These expressions are optimized in C++ and ultimately run in production workloads at scale with Velox, an open-source (~4k stars) unified query engine written in C++ and developed at Meta.


Relevance to SciPy audience and high-level overview

  • Offers insights into parsing Python's abstract syntax tree using Python's ast module
  • Demonstrates how we parse type annotations to build a DAG that's later leveraged to dynamically create query plans at inference time
  • Filters and projections are pushed down end-to-end (E2E)
  • Parses Python functions to generate static expressions that are executed either:
    • in C++ after fetching data, or
    • as SQL UDFs at query time, achieving orders-of-magnitude speedups
      • We also built our own SQL driver in C++ that we interface with to replace SQLAlchemy
  • Python functions that can't be converted into static expressions run in isolated processes (a Ray cluster or custom subprocess) for parallelism
  • Executes the query plan using Velox
  • Velox is an extensible C++ execution engine library for building data management systems
    • Open source with ~4k stars on GitHub
  • Contributes upstream to Velox, as it wasn't built with ML use cases in mind
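As a concrete (and deliberately simplified) illustration of the AST-parsing idea above, the sketch below uses Python's ast module to read a hypothetical feature function, collect its type annotations, and render its return expression as a static expression tree. The function name, operator names, and expression format are illustrative assumptions, not Chalk's actual API.

```python
import ast

# Hypothetical feature function; in practice the source would come from
# user-defined resolver code.
SOURCE = """
def spend_ratio(amount: float, avg_spend: float) -> float:
    return amount / (avg_spend + 1.0)
"""

def to_expression(node: ast.AST) -> str:
    """Recursively render an AST node as a static expression string."""
    if isinstance(node, ast.BinOp):
        ops = {ast.Add: "plus", ast.Sub: "minus",
               ast.Mult: "multiply", ast.Div: "divide"}
        op = ops[type(node.op)]
        return f"{op}({to_expression(node.left)}, {to_expression(node.right)})"
    if isinstance(node, ast.Name):
        return node.id            # becomes a column reference in the plan
    if isinstance(node, ast.Constant):
        return repr(node.value)   # a literal
    raise NotImplementedError(type(node).__name__)

fn = ast.parse(SOURCE).body[0]

# Type annotations inform the DAG's schema.
arg_types = {a.arg: a.annotation.id for a in fn.args.args}

# The body of the return statement becomes a static expression.
ret = fn.body[0]
assert isinstance(ret, ast.Return)
expr = to_expression(ret.value)
print(arg_types)  # {'amount': 'float', 'avg_spend': 'float'}
print(expr)       # divide(amount, plus(avg_spend, 1.0))
```

A static expression like this can be handed to a C++ engine or compiled into a SQL UDF, since it no longer depends on the Python interpreter to evaluate.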

Background & Motivation

  • Real-time machine learning demands rapid response times
  • Python’s speed is often a bottleneck
  • Overview of the strategies we’ve employed to circumvent this
  • Need for real-time is only growing
  • Demand for contextually aware systems that can process and react to information instantly is increasing
  • Demand also grows as the individual touchpoints between humans and AI systems multiply

What are the real-world applications for real-time ML?

Inference, recommendations, reranking, etc.

  • Predictive monitoring + health
    • Predictive monitoring with sensor data (pathogens, pollution, IoT, etc.)
    • Acute disease detection
    • Operations, e.g. ambulance routing
  • Fraud: Is this transaction fraudulent? Is this website phishing (AI-generated content)?
    • Lots of features (which are only meaningful when meticulously combined)
  • Ecommerce: Recommendations, e.g. other shoppers also bought X
    • Collaborative filtering
  • Marketplaces: Re-ranking matches between buyers and sellers
    • Event-driven similarity and vector search at huge scale
    • A single user with real-time preferences vs. many sellers/products with real-time preferences

Methods & Approach

  • DAG construction using Python’s AST
  • Automatic parallel execution of concurrent pipeline stages
  • Vectorization of pipeline stages that are written using scalar syntax
  • Low-latency key/value stores like Redis, Bigtable, and DynamoDB to minimize cached feature fetch time
  • Statistics-informed join planning
  • JIT transformation of Python code into native code
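To illustrate the "automatic parallel execution of concurrent pipeline stages" point, here is a minimal sketch that groups the stages of a dependency DAG into waves that can run concurrently. The stage names and the `parallel_waves` helper are hypothetical, not part of the system described in the talk.

```python
from collections import defaultdict

def parallel_waves(deps: dict[str, set[str]]) -> list[set[str]]:
    """Group DAG nodes into waves: every stage in a wave has all of its
    dependencies satisfied by earlier waves, so a wave can run in parallel."""
    indegree = {n: len(d) for n, d in deps.items()}
    dependents = defaultdict(set)
    for node, node_deps in deps.items():
        for dep in node_deps:
            dependents[dep].add(node)
    ready = {n for n, k in indegree.items() if k == 0}
    waves = []
    while ready:
        waves.append(ready)
        nxt = set()
        for node in ready:
            for m in dependents[node]:
                indegree[m] -= 1
                if indegree[m] == 0:
                    nxt.add(m)
        ready = nxt
    return waves

# Hypothetical pipeline: two independent fetches, then a derived feature,
# then a model score that consumes both branches.
deps = {
    "fetch_user": set(),
    "fetch_txns": set(),
    "avg_spend": {"fetch_txns"},
    "score": {"fetch_user", "avg_spend"},
}
waves = parallel_waves(deps)
# Three waves: {fetch_user, fetch_txns} -> {avg_spend} -> {score}
```

The two fetch stages land in the same wave and can be dispatched concurrently (e.g. to a Ray cluster or subprocesses), which is where the latency win comes from.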

Results & Effects

  • Throughput of hundreds of millions of features per second
  • Sub-second latencies (E2E)
  • Zero-ETL - Our ML models can now be fully decoupled from ETL, unlocking affordances similar to infrastructure-as-code
  • Single source of truth
    • Features and logic are shared and consistent across teams, e.g. another team can reuse the exact same logic for computing creditworthiness
    • Version control - easily roll back to a previous iteration without needing to backfill data
  • Branch deploys - Data scientists can experiment in their own branches
    • Easily simulate model runs with historical production data
    • Experiment, A/B test, and evaluate models
  • Prevent drift between training/serving, staging/prod, etc.
    • It’s the same underlying data
  • Integrate more data sources
    • Minimize dependency and schema requests to your data engineering (DE) teams, since data is transformed post-fetch
    • Broaden context by pulling from Postgres, Kafka, AWS Glue, Snowflake, or a microservice/third-party API

Elliot Marx is one of the co-founders of Chalk. He started his career at Affirm, where he built the early risk and credit data infrastructure system (the inspiration for Chalk). He then co-founded Haven Money, which Credit Karma acquired to power its banking products. He holds a B.S. and M.S. in Computer Science from Stanford University.