SciPy 2024

Building Daft: Python + Rust = a better distributed query engine
07-11, 15:00–15:30 (US/Pacific), Ballroom

Python is a popular language for data engineering workloads. In data engineering, developers rely on a "query engine" to efficiently retrieve data, process it, and send the results back out to a destination storage system or application.

The Python API for Apache Spark (PySpark) is currently the most popular framework for data engineering at large scale. However, PySpark depends heavily on the JVM, which causes high friction during development.

In this talk, we discuss our work on the Daft Python Dataframe (www.getdaft.io), a distributed Python query engine built with Rust. We will dive deep into Daft's architecture and discuss how the strong synergy between Python and Rust gives Daft key advantages as a query engine.


Background

The scientific method is predicated on empirical observations, and therefore data is at the heart of scientific computing.

The process of producing high-quality data (data engineering) has become one of the key practices powering scientific computing across all domains, especially in AI/ML. Because of its ubiquity in data science, Python is also emerging as the preferred language for data engineering.

As Reynold Xin, co-creator of Apache Spark, puts it: "... these days probably 80% of the Spark users use Python"

However, most data engineering technology was built for enterprise systems such as Hadoop, within the JVM ecosystem. JVM interoperability with Python is difficult and leads to high friction in development.

Other Python data frameworks (Pandas, Polars, etc.) are built and optimized for a local development experience rather than for data engineering at larger scales. Notably, Dask DataFrame uses Pandas under the hood to provide a "distributed Pandas API".

Daft Dataframe

Rust has emerged as an excellent language to pair with Python because their characteristics are highly complementary: Python provides a fast, iterative interface for users, while Rust provides memory safety and performance. Together, these two languages form a solid foundation for building a better solution for data engineering.

The Daft open-source project (https://www.getdaft.io) leverages the strengths of both languages to provide a better distributed data engineering experience:

  1. The API is a familiar Python Dataframe, with a declarative Expressions API for type-checked operations on data (see the sketch after this list)
  2. The distributed scheduler is written in Python to leverage Ray, a framework for distributed Python applications, but it can also run on a Python multithreading pool for performant local execution
  3. Core execution code is written in Rust for fast vectorized execution of kernels with SIMD, and for extremely efficient I/O using Rust's highly performant async runtimes
  4. Query planning and optimization are written in Rust for efficiency and to take advantage of Rust's pattern-matching syntax
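
As a rough illustration of points 1 and 2, here is a minimal sketch of the user-facing Python side: a Dataframe built from in-memory data, declarative Expressions, and an optional switch from the default local runner to Ray. The column names and cluster address are hypothetical placeholders, and the runner call assumes Daft's daft.context API.

    import daft
    from daft import col

    # Optional: execute on a Ray cluster instead of the default local runner.
    # (Assumes the daft.context API; the address is a hypothetical placeholder.)
    # daft.context.set_runner_ray(address="ray://head-node:10001")

    # Build a small Dataframe from in-memory data (hypothetical columns).
    df = daft.from_pydict({
        "name": ["a", "b", "c"],
        "value": [1, 2, 3],
    })

    # Expressions are declarative and type-checked when the query is built;
    # nothing executes until .show() or .collect() is called.
    df = (
        df.where(col("value") > 1)
          .with_column("value_doubled", col("value") * 2)
    )

    df.show()  # triggers planning, optimization, and execution on the selected runner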

Talk Proposal

In this talk, we will expand upon the key architectural decisions in Daft that enable it to bring out the best of both worlds, Python and Rust. We will also focus on the key interaction points between our Python and Rust code, especially in the context of supporting execution in a distributed computing environment.
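
One concrete example of such an interaction point is user-defined functions, where the Rust execution engine calls back into user-supplied Python code for each batch of data. The following is a hedged sketch of the user-facing side based on Daft's documented UDF API; the function and column names are hypothetical.

    import daft
    from daft import col, DataType

    # A Python UDF: the execution engine hands this function a batch of values
    # (a daft.Series) and receives Python results for that batch in return.
    @daft.udf(return_dtype=DataType.int64())
    def plus_one(values):
        return [v + 1 for v in values.to_pylist()]

    df = daft.from_pydict({"value": [1, 2, 3]})
    df = df.with_column("value_plus_one", plus_one(col("value")))
    df.show()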

Attendees should come away with a strong understanding of how to leverage Rust within their own distributed Python applications.

Other Resources

Jay is a co-founder of Eventual and a primary contributor to the Daft open-source project. Prior to Eventual, he was a software engineer building large-scale ML data systems for computational biology at Freenome and for self-driving cars at Lyft. He hails from the sunny island nation of Singapore and used to command a platoon of tanks in the Singapore military.