SciPy 2024

Intro to Ibis: blazing fast analytics with DuckDB, Polars, Snowflake, and more, from the comfort of your Python repl.
07-08, 08:00–12:00 (US/Pacific), Ballroom A

Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) can analyze this same tabular data orders of magnitude faster than pandas, all while using less memory. Many of these systems provide only a SQL interface, though, which is far different from pandas’ dataframe interface and requires a rewrite of your analysis code.

This is where Ibis comes in. Ibis is a pure-Python open-source library that provides a dataframe interface to many popular databases and analytics tools (DuckDB, Polars, Snowflake, Spark, etc.). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL. No more painful rewrites of pandas code when you run into performance issues; write your code once using Ibis and run it on any supported backend.

https://ibis-project.org/
https://github.com/ibis-project/ibis


This tutorial is open to all. If you have ever

  • been thwarted by SQL or data stored somewhere, or
  • been stuck trying to translate a pandas POC to PySpark for "production", or
  • wanted to write blazing fast analytics code that uses all of the cores on your laptop without running into memory limits (and without writing any SQL)

then this tutorial is for you!

We’ll cover:

  • The basic operations of Ibis (select, filter, group_by, order_by, join, and aggregate), and how these operations may be composed to form more complicated queries.
  • How Ibis may be used on a number of different local and remote backend engines to execute the same queries on different systems.
  • How to quickly compare performance between different backends without changing your code.
  • How Ibis integrates into the larger Python data ecosystem, including tools like Scikit-Learn, Matplotlib, PyArrow, pandas, Shapely, Altair, hvPlot, and VegaFusion.

Prerequisites

This is a hands-on tutorial presented with Jupyter notebooks, with numerous examples to get your hands dirty. Participants should ideally have some experience using Python and pandas, but no SQL experience is necessary.

Installation Instructions

We intend to use GitHub Codespaces for quick environment setup. Detailed instructions for Codespaces setup and for local installation (for attendees who prefer it) are available in the tutorial repo at https://github.com/ibis-project/ibis-tutorial.

Gil Forsyth is a software engineer at Voltron Data. He followed the common career path of Japanese language specialist -> administrative assistant -> mechanical engineer -> computational fluid dynamicist -> data scientist -> software engineer -> machine learning engineer -> software engineer. Gil contributes to several projects in the PyData ecosystem and is a core maintainer of xonsh and Ibis. He served as the program chair for the Scientific Computing with Python (SciPy) conference from 2017 to 2020.

I'm Phillip Cloud, a software engineer. I work on Ibis full-time at Voltron Data. I like a lot of things, including Dune, jazz and puns. Let's chat!

Naty is a senior software engineer at Voltron Data. She is a former academic with a master's in Physics and a PhD in Mechanical and Aerospace Engineering to her name. She is currently contributing to Ibis, and in the past has also contributed to and maintained Dask. She is also an active member of PyLadies and one of the directors of Women Who Code DC.
