SciPy 2023

Scalable machine learning workloads with Ray AI Runtime
07-10, 13:30–17:30 (America/Chicago), Classroom 103

Machine learning (ML) pipelines involve a variety of computationally intensive stages. As state-of-the-art models and systems demand ever more compute, there is an urgent need for adaptable tools to scale ML workloads. This need drove the creation of Ray: an open source, distributed compute framework for ML that powers systems like ChatGPT and pushes the limits of large-scale computing benchmarks. Ray AIR is especially useful for parallelizing ML workloads such as image pre-processing, model training and fine-tuning, and batch inference. In this tutorial, participants will learn about AIR's composable APIs through hands-on coding exercises.


State-of-the-art machine learning (ML) models require exponentially increasing amounts of compute, making it necessary to use the full capacity of a laptop or workstation and, beyond that, a cloud cluster. However, scaling introduces challenges with orchestration, integration, and maintenance. What's more, ML systems change quickly: if you rely on piecemeal solutions to parallelize individual stages such as pre-processing, training, tuning, and inference, stitching these evolving systems together requires significant overhead.

This context drove the development of Ray: a solution that enables researchers and developers to scale Python code to the full capacity of a laptop or a cluster without implementing complex distributed computing logic themselves.

This hands-on tutorial introduces Ray AI Runtime (AIR), an open source, Python-based set of libraries that equip researchers and developers with a toolkit for parallelizing ML workloads. We will use a popular computer vision (CV) use case, image segmentation, to guide participants through common ML workloads, including data pre-processing, model training and fine-tuning, and parallel batch inference.

Resources

  • GitHub repository with relevant resources, including notebooks, setup instructions, reference implementations of the coding exercises, and a README with an overview.

  • Participants will be able to use a pre-configured compute cluster for the duration of the tutorial.

Audience

  • Intermediate-level Python and ML researchers and developers.

  • Those interested in scaling ML workloads from a laptop's full capacity up to a cluster.

Prerequisites

  • Familiarity with basic ML concepts and workflows.

  • No prior experience with Ray or distributed computing is required.

  • (Optional) The Overview of Ray notebook provides helpful background material.

Key Takeaways

  • Understand common challenges and trade-offs when scaling CV pipelines from laptop to cluster.

  • Hands-on skills in using Ray AIR to scale CV workloads, including model training, fine-tuning, and inference.

Outline

Challenges with scaling ML systems (10 min)

  • Why are distributed systems so important to ML in general and CV pipelines in particular? How does Ray provide a common compute layer for scaling ML from laptop to cluster? (See the minimal sketch below.)
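
To make this concrete, here is a minimal sketch (not taken from the tutorial materials) of how Ray's core API lets ordinary Python functions run in parallel, whether across a laptop's cores or a cluster's nodes. The preprocess function is a hypothetical placeholder.

    import ray

    ray.init()  # starts a local Ray instance, or connects to an existing cluster

    # Hypothetical stand-in for a CPU-heavy preprocessing step.
    @ray.remote
    def preprocess(image_id):
        return image_id * 2

    # Launch tasks in parallel; Ray schedules them across available CPUs or nodes.
    futures = [preprocess.remote(i) for i in range(8)]
    print(ray.get(futures))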

Hands-on lab 1: Composing CV pipelines (60 min)

  • Examples introducing the Ray Data, Train, and Tune libraries. Participants will practice composing components to scale an end-to-end ML workload (a condensed sketch of the composition follows this list).

  • Ray Data - Ingest, shard, and preprocess the data.

  • Ray Train - Train a model on the preprocessed training set.

  • Ray Tune - Run a hyperparameter tuning experiment.

  • BatchPredictor - Perform batch inference on the test set.
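
To give a sense of how these pieces compose, below is a condensed, hypothetical sketch of an AIR-style pipeline. It is not the tutorial's reference implementation: the bucket paths, normalization step, and training loop are placeholders, and the API names (ray.data.read_images, TorchTrainer, Tuner, BatchPredictor) reflect the Ray 2.x AIR libraries current at the time of the tutorial and may differ in newer releases.

    import ray
    from ray import tune
    from ray.air import session
    from ray.air.config import ScalingConfig
    from ray.train.torch import TorchTrainer, TorchPredictor
    from ray.train.batch_predictor import BatchPredictor

    ray.init()

    # Ray Data: ingest and preprocess images in parallel (placeholder path).
    ds = ray.data.read_images("s3://example-bucket/segmentation/train")

    def normalize(batch):
        batch["image"] = batch["image"] / 255.0
        return batch

    ds = ds.map_batches(normalize)

    # Ray Train: a per-worker training loop; the real lab trains a segmentation
    # model and also reports a checkpoint so it can be reloaded for inference.
    def train_loop_per_worker(config):
        shard = session.get_dataset_shard("train")
        loss = 0.0
        for batch in shard.iter_batches(batch_size=32):
            loss += 0.0  # placeholder for a forward/backward pass
        session.report({"loss": loss})

    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
        datasets={"train": ds},
    )

    # Ray Tune: wrap the trainer to search over hyperparameters.
    tuner = tune.Tuner(
        trainer,
        param_space={"train_loop_config": {"lr": tune.grid_search([1e-4, 1e-3])}},
    )
    best = tuner.fit().get_best_result(metric="loss", mode="min")

    # BatchPredictor: reload the best checkpoint and run parallel batch inference.
    predictor = BatchPredictor.from_checkpoint(best.checkpoint, TorchPredictor)
    test_ds = ray.data.read_images("s3://example-bucket/segmentation/test")
    predictions = predictor.predict(test_ds)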

(10-minute break)

Hands-on lab 2: Model training and fine-tuning (60 min)

  • Learn about approaches to scaling model training.

  • Code: Implement transformer model fine-tuning with Ray Train and evaluate performance (a hedged sketch of the pattern follows this list).
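
As a rough illustration of the pattern Ray Train uses for distributed fine-tuning, here is a hedged sketch built on TorchTrainer and ray.train.torch.prepare_model. The tiny linear model and synthetic batches are placeholders for the tutorial's vision transformer, and the API details reflect Ray 2.x at the time of the tutorial.

    import torch
    from ray.air import session
    from ray.air.config import ScalingConfig
    from ray.train.torch import TorchTrainer, prepare_model

    def train_loop_per_worker(config):
        # prepare_model wraps the model for DistributedDataParallel and device placement.
        model = prepare_model(torch.nn.Linear(16, 2))  # stand-in for a pretrained transformer
        optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
        loss_fn = torch.nn.CrossEntropyLoss()

        for epoch in range(config["epochs"]):
            # Synthetic batch; the lab iterates over session.get_dataset_shard("train") instead.
            x = torch.randn(32, 16)
            y = torch.randint(0, 2, (32,))
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            session.report({"epoch": epoch, "loss": loss.item()})

    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        train_loop_config={"lr": 3e-4, "epochs": 2},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    )
    result = trainer.fit()
    print(result.metrics)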

(10-minute break)

Hands-on lab 3: Batch inference (60 min)

  • Learn about and evaluate several distributed batch inference design patterns.

  • Implement distributed batch inference through hands-on coding exercises.

  • Code: Run batch inference using a vision transformer and evaluate performance (one common pattern is sketched after this list).
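
One common design pattern, sketched below under stated assumptions, maps a stateful callable over a Ray Dataset so that each actor in a pool loads the model once and reuses it across batches. The tiny convolutional "model" and synthetic images are placeholders for the tutorial's vision transformer, and the map_batches/ActorPoolStrategy arguments reflect Ray 2.x.

    import numpy as np
    import torch
    import ray

    class SegmentationPredictor:
        def __init__(self):
            # Load the (placeholder) model once per actor, not once per batch.
            self.model = torch.nn.Conv2d(3, 1, kernel_size=1)
            self.model.eval()

        def __call__(self, batch):
            images = torch.as_tensor(batch["image"], dtype=torch.float32)
            with torch.no_grad():
                masks = self.model(images)
            return {"mask": masks.numpy()}

    ray.init()

    # Synthetic stand-in for the test set; the lab reads real images with ray.data.read_images.
    ds = ray.data.from_items(
        [{"image": np.random.rand(3, 64, 64).astype("float32")} for _ in range(8)]
    )

    predictions = ds.map_batches(
        SegmentationPredictor,
        batch_size=4,
        batch_format="numpy",
        compute=ray.data.ActorPoolStrategy(min_size=1, max_size=2),  # small actor pool
    )
    print(predictions.take(1))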

Next steps (10 min)

  • How to get involved with Ray and access further resources.

Installation Instructions

Anyscale login credentials will be sent via email. Optionally, review the Overview of Ray notebook beforehand.

Emmy is a technical trainer at Anyscale Inc. She holds a B.Sc. in Physics from Stanford University, where she contributed to computational astrophysics research at the Stanford Linear Accelerator Laboratory and NASA’s Jet Propulsion Laboratory. Emmy is passionate about creating high-quality educational materials and sharing them with the broader Ray community.

Adam Breindel is a member of the Anyscale training team and he consults and teaches on large-scale data engineering and AI/machine learning. He has served as technical reviewer for numerous O'Reilly titles covering Ray, Apache Spark, and other topics. Adam's 20 years of engineering experience include numerous startups and large enterprises with projects ranging from AI/ML systems and cluster management to web, mobile, and IoT apps. He holds a BA (Mathematics) from University of Chicago and a MA (Classics) from Brown University. Adam's interests include hiking, literature, and complex adaptive systems.