SciPy 2025

Building machine learning pipelines that scale: a case study using Ibis and IbisML
07-07, 13:30–17:30 (US/Pacific), Ballroom A

Abstract

Pandas and scikit-learn have become staples in the machine learning toolkit for processing and modeling tabular data in Python. However, when data size scales up, these tools become slow or run out of memory. Ibis provides a unified, Pythonic, dataframe-like interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Ibis enables users to leverage these powerful tools without rewriting their data engineering code (or learning SQL). IbisML extends the benefits of using Ibis to the ML workflow by letting users preprocess their data at scale on any Ibis-supported backend.

In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability after each move during chess games.


Description

Tabular data is everywhere. As Python has become the language of choice for data science, pandas and scikit-learn have become staples in the machine learning (ML) toolkit for processing and modeling this data. However, when data size scales up, these tools become unwieldy (slow) or altogether untenable (running out of memory). Ibis provides a unified, Pythonic, dataframe interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Local backends, such as Polars, DuckDB, and DataFusion, perform orders of magnitude faster than pandas while using less memory. Ibis further enables users to scale using distributed backends like Spark or cloud data warehouses like Snowflake and BigQuery without changing their code, giving them the power to choose the right engine for any scale. With Ibis, scientific Python users enjoy the performance of SQL from the comfort and familiarity of Python.

IbisML extends the intrinsic benefits of using Ibis to the ML workflow. It lets users preprocess their data at scale on any Ibis-supported backend: users create IbisML recipes that define sequences of last-mile preprocessing steps to get their data ready for modeling. A recipe and any scikit-learn estimator can be chained together into a pipeline, so IbisML seamlessly integrates with scikit-learn, XGBoost (using the scikit-learn estimator interface), and PyTorch (using skorch) models. At inference time, Ibis/IbisML once again takes the feature preprocessing to the efficient backend (instead of bringing the data to the preprocessor), and user-defined functions (UDFs) enable prediction while minimizing data transfer. This completes an end-to-end ML workflow that scales with data size.

In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability at any given move during a chess game. You’ll be using actual, recent games from the largest free chess server (Lichess).

Learning Goals

During this tutorial, you’ll:
- Gain an appreciation for the principles underlying Ibis (deferred execution, unified interface, etc.) and the advantages they confer in different use cases.
- Learn the basics of Ibis and apply them to create features from a real-world database.
- Learn IbisML constructs (including Steps, Recipes, and Pipelines) and apply this knowledge to process features before training your live win probability model.
- Observe inference at scale (on a distributed backend) using the same model, gaining an appreciation of how an end-to-end ML workflow that scales is possible.


Installation Instructions

We’ll be using GitHub Codespaces for the tutorial, so you don’t need to download or configure anything; however, please remember to bring a laptop!

Prerequisites

We expect attendees to have basic knowledge of Python, but the focus of the tutorial is on Ibis/IbisML functionality and syntax. If you have any questions about the Python syntax used, please ask! No SQL knowledge is necessary—though we may occasionally discuss or draw comparisons to SQL, Ibis itself is a pure Python library that lets you enjoy the performance-at-scale of SQL.

Exposure to basic machine learning/statistical learning with tabular data (not necessarily in Python) will be helpful for following the second half of the tutorial. We’ll be using logistic regression and XGBoost. Overall, we’re striving to make this tutorial as accessible as possible.

Deepyaman is a software engineer at Dagster Labs. He joined from Voltron Data, where he was a Senior Staff Software Engineer on the Ibis team. Before Claypot AI's acquisition by Voltron Data, he was a Founding Machine Learning Engineer there, working on its real-time feature engineering platform. Prior to that, he led data engineering teams and asset development across a range of industries at QuantumBlack, AI by McKinsey.

Deepyaman is passionate about building and contributing to the broader open-source data ecosystem. Outside of his day job, he helps maintain Kedro, an open-source Python framework for building production-ready data science pipelines.
