SciPy 2025

GPUs & ML – Beyond Deep Learning
07-10, 10:45–11:15 (US/Pacific), Ballroom

This talk explores methods to accelerate traditional machine learning pipelines built on scikit-learn, UMAP, and HDBSCAN using GPUs. We contrast the experimental Array API Standard support layer in scikit-learn with the cuML library from the NVIDIA RAPIDS Data Science stack, including its zero-code-change acceleration capability. ML and data science practitioners will learn how to seamlessly accelerate their machine learning workflows, see the resulting performance benefits, and receive practical guidance for different problem types and sizes. We also share insights into minimizing cost and runtime by effectively mixing hardware across tasks, as well as the current implementation status and future plans for these acceleration methods.


Using GPUs to accelerate machine learning model training and inference offers the potential for enormous speedups, addressing the growing computational demands of modern data science. However, writing dedicated GPU-accelerated pipelines can be challenging and time-consuming, often requiring specialized algorithms and introducing overheads such as memory migration and just-in-time compilation.

The Python data science community has largely adopted scikit-learn as the standard for traditional machine learning algorithms. Its estimator paradigm allows for easily composable pipelines, yielding high reusability and reproducibility. Users now have multiple avenues to accelerate these scikit-learn-style ML pipelines on GPUs without explicitly implementing GPU compute kernels.
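The estimator paradigm mentioned above composes preprocessing and modeling steps into a single reusable object. A minimal sketch using scikit-learn's `Pipeline` on synthetic data (the specific steps and sizes here are illustrative, not taken from the talk):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic data standing in for a real workload.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))

# A composable pipeline: every step follows the estimator API
# (fit/transform/predict), so the whole object is reusable and
# reproducible as a unit.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("cluster", KMeans(n_clusters=8, n_init=10, random_state=0)),
])
labels = pipe.fit_predict(X)
print(labels.shape)  # one cluster label per sample
```

Because each step conforms to the same interface, a GPU-accelerated drop-in replacement for any step leaves the surrounding pipeline code unchanged.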

We demonstrate how to accelerate exemplary ML pipelines using two methods: scikit-learn's experimental Array API Standard support layer, and cuML, part of the NVIDIA RAPIDS Data Science stack, which provides accelerated estimator algorithms mirroring those in scikit-learn, umap-learn, and hdbscan. The recently released cuML zero-code-change acceleration enables pipeline acceleration without any code modifications, via a transparent API intercept layer that dispatches to accelerated estimator variants where beneficial.
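The first of these methods is enabled through scikit-learn's configuration: with Array API dispatch turned on, supported estimators consume and return arrays from whichever Array-API-compliant library you pass in (e.g. CuPy arrays on a GPU). A minimal sketch; it stays on NumPy so it runs anywhere, and guards the config call because dispatch requires the optional `array-api-compat` package:

```python
import numpy as np
from sklearn import set_config
from sklearn.decomposition import PCA

# Enabling dispatch requires the optional array-api-compat package;
# fall back silently so this sketch also runs without it.
try:
    set_config(array_api_dispatch=True)
except ImportError:
    pass

# Assumption: with dispatch enabled and a CUDA-capable device available,
# swapping `np` for `cupy` here would move the computation to the GPU
# with no other code changes.
xp = np
X = xp.asarray(np.random.default_rng(0).normal(size=(500, 32)))

pca = PCA(n_components=4, svd_solver="full")  # a solver with Array API support
X_red = pca.fit_transform(X)
print(X_red.shape)  # arrays come back in the same namespace they went in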

We highlight how these methods differ in approach and in performance at varying data sizes, and provide guidance to help practitioners choose a suitable approach for different problem types and sizes, showing a natural progression from small-scale prototyping to large-scale accelerated production pipelines. We show how to minimize cost and runtime by effectively mixing hardware for different problems, e.g., performing model training on GPUs and inference on CPUs. We further show that GPU acceleration can be highly beneficial even at the prototype stage, where reduced iteration times pay off significantly.
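One common shape for the train-on-GPU, infer-on-CPU pattern is to serialize the fitted model in the training environment and load it on the inference host. A minimal sketch of that hand-off using joblib; the snippet itself is plain scikit-learn so it runs anywhere, and whether a GPU-trained model deserializes cleanly on a CPU-only host depends on the accelerating library's serialization support (the file name is illustrative):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
y_train = (X_train[:, 0] > 0).astype(int)

# "GPU side": under an acceleration layer this fit would be
# dispatched to the GPU; the calling code is unchanged.
model = LogisticRegression().fit(X_train, y_train)
joblib.dump(model, "model.joblib")  # illustrative path

# "CPU side": load the fitted model and serve predictions on the host.
model_cpu = joblib.load("model.joblib")
preds = model_cpu.predict(rng.normal(size=(10, 8)))
print(preds.shape)
```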

cuML zero-code-change acceleration is designed to integrate seamlessly with existing machine learning workflows. Users invoke their unaltered Python scripts through a simple command-line interface, or enable the acceleration mode in notebooks via a Jupyter magic command. This mode automatically accelerates all supported estimators using cuML's GPU-optimized implementations, while gracefully falling back to CPU execution for unsupported methods. This is achieved by transparently and selectively intercepting class instantiations and function calls and rerouting them to their GPU-accelerated counterparts, analogous to the zero-code-change modes that another RAPIDS Data Science library, cuDF, provides for pandas and Polars. This seamless integration lets users reuse existing code and focus on their machine learning tasks without worrying about the underlying hardware or the intricacies of GPU programming.
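Concretely, enabling the acceleration mode changes how the code is invoked rather than the code itself. A sketch of the two documented entry points (the script name is illustrative):

```shell
# Run an unaltered scikit-learn/umap-learn/hdbscan script under the accelerator:
python -m cuml.accel train_pipeline.py

# Or, in a Jupyter notebook, load the extension before importing the libraries:
#   %load_ext cuml.accel
```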

We conclude the talk by discussing the current implementation status of cuML in terms of algorithmic coverage and parity with the scikit-learn, umap-learn, and hdbscan APIs, as well as future plans for expanding the cuML zero-code-change acceleration capability to further algorithms. By leveraging these advancements, practitioners can achieve significant performance improvements, enabling faster prototyping and more efficient production pipelines. Seamless integration with existing scikit-learn, UMAP, and HDBSCAN workflows ensures that users can transition easily to GPU-accelerated computing and make the most of their hardware.