SciPy 2023

Modern compute stack for scaling large AI/ML workloads
07-14, 11:25–11:55 (America/Chicago), Zlotnik Ballroom

Existing production machine learning systems often suffer from various problems that make them hard to use. For example, data scientists and ML practitioners often spend most of their time stitching and managing bespoke distributed systems to build end-to-end ML applications and push models to production.

To address this, the Ray community has built Ray AI Runtime (Ray AIR), an open-source toolkit for building large-scale end-to-end ML applications.

Ray is a distributed compute framework that powers large-scale machine learning models such as OpenAI's ChatGPT. By leveraging Ray's distributed compute substrate and library ecosystem, the Ray AI Runtime brings scalability and programmability to ML platforms.
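
As a taste of the programming model, here is a minimal sketch of Ray's core task API (the function and values are toy placeholders):

    import ray

    ray.init()  # starts a local cluster on a laptop, or connects to an existing one

    @ray.remote
    def square(x):
        return x * x

    # Each .remote() call schedules a task across the cluster's workers;
    # ray.get() gathers the results.
    futures = [square.remote(i) for i in range(8)]
    print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]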

Ray AIR's main focus is to provide the compute layer for Python-based AI/ML workloads; it is designed to interoperate with popular ML frameworks and with other systems for storage and metadata needs.

In this session, we’ll explore and discuss the following:

  • Why Ray exists and what it is
  • How AIR, built atop Ray, lets you program and scale your machine learning workloads easily (see the training sketch after this list)
  • AIR’s interoperability and easy integration points with other systems for storage and metadata needs
  • AIR’s cutting-edge features for accelerating the machine learning lifecycle, such as data preprocessing, last-mile data ingestion, tuning and training, and serving at scale
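
To make that lifecycle concrete, here is a condensed sketch in the style of the Ray AIR (Ray 2.x) quickstart, combining distributed data loading, preprocessing, and training; the S3 dataset path and column names follow the docs example and are illustrative:

    import ray
    from ray.air.config import ScalingConfig
    from ray.data.preprocessors import StandardScaler
    from ray.train.xgboost import XGBoostTrainer

    # Illustrative public example dataset; substitute your own tabular data.
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    train_ds, valid_ds = dataset.train_test_split(test_size=0.3)

    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=2),  # scale out by raising num_workers
        label_column="target",
        params={"objective": "binary:logistic"},
        datasets={"train": train_ds, "valid": valid_ds},
        # The preprocessor is fit on the train split and applied to all splits.
        preprocessor=StandardScaler(columns=["mean radius", "mean texture"]),
    )
    result = trainer.fit()
    print(result.metrics)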

Key takeaways for attendees:

  • Understand Ray as a general-purpose framework for distributed computing
  • Understand how the Ray AI Runtime can be used to implement scalable, programmable machine learning workflows
  • Learn how to pass and share data across distributed trainers and Ray-native libraries such as Tune, Serve, Train, and RLlib (see the sketch after this list)
  • Learn how to scale Python-based workloads across supported public clouds
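
For the data-sharing takeaway, a minimal Ray Tune sketch (Ray 2.x APIs) shows one common pattern: tune.with_parameters places a large object in Ray's shared object store so every trial reads the same copy. The objective function here is a toy stand-in:

    from ray import tune
    from ray.air import session

    def train_fn(config, data=None):
        # Toy objective; `data` arrives via Ray's shared object store.
        score = sum(data) * config["lr"]
        session.report({"score": score})

    # with_parameters stores `data` once in the object store, so all
    # trials share a single copy instead of re-serializing it per trial.
    trainable = tune.with_parameters(train_fn, data=list(range(1000)))

    tuner = tune.Tuner(
        trainable,
        param_space={"lr": tune.loguniform(1e-4, 1e-1)},
        tune_config=tune.TuneConfig(num_samples=4, metric="score", mode="max"),
    )
    results = tuner.fit()
    print(results.get_best_result().config)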

Jules S. Damji is a lead developer advocate at Anyscale Inc., an MLflow contributor, and co-author of Learning Spark, 2nd Edition. He is a hands-on developer with over 25 years of experience and has worked at leading companies such as Sun Microsystems, Netscape, @Home, Opsware/LoudCloud, VeriSign, ProQuest, Hortonworks, and Databricks, building large-scale distributed systems. He holds a B.Sc. and an M.Sc. in computer science (from Oregon State University and Cal State, Chico, respectively), and an MA in political advocacy and communication (from Johns Hopkins University).

Amog is a Senior Software Engineer at Anyscale, where he works on the Ray open source project, building solutions for distributed machine learning workloads, including distributed model training and offline inference.