SciPy 2025

Building machine learning pipelines that scale: a case study using Ibis and IbisML
07-07, 13:30–17:30 (US/Pacific), Ballroom A

Abstract

Pandas and scikit-learn have become staples in the machine learning toolkit for processing and modeling tabular data in Python. However, when data size scales up, these tools become slow or run out of memory. Ibis provides a unified, Pythonic, dataframe-like interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Ibis enables users to leverage these powerful tools without rewriting their data engineering code (or learning SQL). IbisML extends the benefits of using Ibis to the ML workflow by letting users preprocess their data at scale on any Ibis-supported backend.

In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability after each move during chess games.


Description

Tabular data is everywhere. As Python has become the language of choice for data science, pandas and scikit-learn have become staples in the machine learning (ML) toolkit for processing and modeling this data. However, when data size scales up, these tools become unwieldy (slow) or altogether untenable (running out of memory). Ibis provides a unified, Pythonic, dataframe interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Local backends, such as Polars, DuckDB, and DataFusion, perform orders of magnitude faster than pandas while using less memory. Ibis further enables users to scale using distributed backends like Spark or cloud data warehouses like Snowflake and BigQuery without changing their code, giving them the power to choose the right engine for any scale. With Ibis, scientific Python users enjoy the performance of SQL from the comfort and familiarity of Python.

IbisML extends the intrinsic benefits of using Ibis to the ML workflow. It lets users preprocess their data at scale on any Ibis-supported backend: users create IbisML recipes that define sequences of last-mile preprocessing steps to get their data ready for modeling. A recipe and any scikit-learn estimator can be chained together into a pipeline, so IbisML seamlessly integrates with scikit-learn, XGBoost (using the scikit-learn estimator interface), and PyTorch (using skorch) models. At inference time, Ibis/IbisML once again takes the feature preprocessing to the efficient backend (instead of bringing the data to the preprocessor), and user-defined functions (UDFs) enable prediction while minimizing data transfer. This completes an end-to-end ML workflow that scales with data size.

In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability at any given move during a chess game. You’ll be using actual, recent games from the largest free chess server (Lichess).

Learning Goals

During this tutorial, you’ll:
- Gain an appreciation for the principles underlying Ibis (deferred execution, unified interface, etc.) and the advantages they confer in different use cases.
- Learn the basics of Ibis and apply them to create features from a real-world database.
- Learn IbisML constructs (including Steps, Recipes, and Pipelines) and apply this knowledge to process features before training your live win probability model.
- Observe inference at scale (on a distributed backend) using the same model, gaining an appreciation of how an end-to-end ML workflow that scales is possible.


Installation Instructions

We’ll be using GitHub Codespaces for the tutorial, so you don’t need to download or configure anything; however, please remember to bring a laptop!

Prerequisites

We expect attendees to have basic knowledge of Python, but the focus of the tutorial is on Ibis/IbisML functionality and syntax. If you have any questions about the Python syntax used, please ask! No SQL knowledge is necessary—though we may occasionally discuss or draw comparisons to SQL, Ibis itself is a pure Python library that lets you enjoy the performance-at-scale of SQL.

Exposure to basic machine learning/statistical learning with tabular data (not necessarily in Python) will be helpful for following the second half of the tutorial. We’ll be using logistic regression and XGBoost. Overall, we’re striving to make this tutorial as accessible as possible.

Deepyaman is a software engineer at Dagster Labs. He joined from Voltron Data, where he was a Senior Staff Software Engineer on the Ibis team. Before Claypot AI's acquisition by Voltron Data, he was a Founding Machine Learning Engineer there, working on its real-time feature engineering platform. Prior to that, he led data engineering teams and asset development across a range of industries at QuantumBlack, AI by McKinsey.

Deepyaman is passionate about building and contributing to the broader open-source data ecosystem. Outside of his day job, he helps maintain Kedro, an open-source Python framework for building production-ready data science pipelines.
