SciPy 2025

Bring Accelerated Computing to Data Science in Python
07-08, 13:30–17:30 (US/Pacific), Room 318

As data science continues to evolve, the ever-growing size of datasets poses significant computational challenges. Traditional CPU-based processing often struggles to keep pace with the demands of data science workflows. Accelerated computing with GPUs offers a solution by enabling massive parallelism and significantly reducing processing times for data-heavy tasks. In this session, we will explore GPU computing architecture, how it differs from CPUs, and why it is particularly well-suited for data science workloads. This hands-on lab will dive into the different approaches to GPU programming, from low-level CUDA coding to high-level Python libraries within RAPIDS such as CuPy, cuDF, cuGraph, and cuML.


In this 4-hour hands-on lab, participants will dive into an end-to-end data science project focused on a fictional epidemic scenario. Using a synthetic dataset of infection rates, population demographics, and mobility patterns, attendees will harness GPU-accelerated tools to process, analyze, and model large-scale data efficiently. The lab is structured into six sections, blending practical coding with key insights into high-performance data science workflows.

Section one will take 15 minutes. The lab begins with a presentation on the fundamental concepts of leveraging GPUs for data science workflows. We'll explore the differences between CPU and GPU computing architectures and delve into the various approaches to GPU programming.

Section two will take 60 minutes. In this section, participants will use CuPy, a GPU-accelerated drop-in replacement for NumPy and SciPy, to handle a massive dataset of infection records (e.g., timestamps, locations, and case counts). Attendees will learn to perform array operations, statistical computations, and matrix manipulations at scale, comparing CuPy’s performance against traditional CPU-based NumPy. For example, they’ll calculate infection growth rates across regions, leveraging CuPy’s speed to process millions of data points in seconds. This section emphasizes how GPU parallelism accelerates foundational numerical tasks.
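
To give a flavor of this section, here is a minimal sketch (not the lab's actual notebook) that computes day-over-day infection growth rates on a hypothetical array of case counts with both NumPy and CuPy; the array names and sizes are illustrative only:

    import numpy as np
    import cupy as cp

    # Hypothetical case counts: rows are days, columns are regions.
    rng = np.random.default_rng(0)
    cases_cpu = rng.poisson(lam=50, size=(1_000, 10_000)).astype(np.float64)
    cases_gpu = cp.asarray(cases_cpu)  # copy the array to GPU memory

    # Day-over-day growth rate per region, written identically for both devices.
    growth_cpu = np.diff(cases_cpu, axis=0) / cases_cpu[:-1]
    growth_gpu = cp.diff(cases_gpu, axis=0) / cases_gpu[:-1]

    print(growth_cpu.mean(), float(growth_gpu.mean()))  # the two results should agree

Because CuPy mirrors NumPy's API, the same expression runs on either device; the main change is where the array lives.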

Section three will take 60 minutes. In this section, participants will transition to data wrangling using cuDF, a GPU-accelerated alternative to pandas. They’ll load a multi-gigabyte dataset of patient demographics and mobility logs into a cuDF DataFrame, performing operations like filtering, grouping, and joining to identify high-risk populations and infection hotspots. For instance, attendees might aggregate cases by age group or merge mobility data with infection records to trace transmission patterns. This section highlights cuDF’s ability to handle large tabular datasets for analysis and visualization.
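
As a rough illustration of the pandas-style operations this section covers, the following sketch joins and aggregates two small hypothetical tables with cuDF (the column names are made up for the example):

    import cudf

    # Hypothetical tables: infection records and patient demographics.
    infections = cudf.DataFrame({
        "person_id": [1, 2, 3, 4],
        "region":    ["A", "A", "B", "B"],
        "cases":     [3, 5, 2, 7],
    })
    demographics = cudf.DataFrame({
        "person_id": [1, 2, 3, 4],
        "age_group": ["0-17", "18-64", "18-64", "65+"],
    })

    # Join the records, then aggregate case counts by age group on the GPU.
    merged = infections.merge(demographics, on="person_id")
    print(merged.groupby("age_group")["cases"].sum())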

Section four will take 30 minutes. In this section, participants will use cuGraph to build and analyze a contact network. Starting with mobility data, they’ll construct a graph where nodes represent individuals or locations and edges denote interactions. Using cuGraph’s accelerated graph algorithms, attendees will compute metrics like centrality (to identify superspreaders) and shortest paths (to trace transmission chains). This section contrasts cuGraph’s performance with NetworkX, demonstrating how GPU acceleration enables rapid analysis of complex networks, critical for real-time epidemic tracking.
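
A minimal sketch of this style of analysis, using a hypothetical contact edge list and a centrality metric as a stand-in for the lab's full workflow:

    import cudf
    import cugraph

    # Hypothetical contact edge list: each row is an interaction between two people.
    edges = cudf.DataFrame({
        "src": [0, 0, 1, 2, 2, 3],
        "dst": [1, 2, 2, 3, 4, 4],
    })

    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source="src", destination="dst")

    # Vertices with high betweenness centrality lie on many contact paths,
    # a rough proxy for potential superspreaders.
    scores = cugraph.betweenness_centrality(G)
    print(scores.sort_values("betweenness_centrality", ascending=False))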

Section five will take 60 minutes. In this section, participants will apply machine learning using cuML, a GPU-accelerated counterpart to scikit-learn. They’ll train models like random forests or logistic regression to predict infection risk based on features like age, mobility, and prior case data. Attendees will also explore clustering (e.g., k-means) to segment populations for targeted interventions. This section showcases cuML’s compatibility with scikit-learn workflows while delivering orders-of-magnitude faster training and inference, essential for iterating models on large epidemic datasets.
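
To show the scikit-learn-style interface this section relies on, here is a minimal sketch with hypothetical, randomly generated features (age, mobility, prior cases) and a synthetic label; it is illustrative, not the lab's model:

    import cupy as cp
    from cuml.linear_model import LogisticRegression
    from cuml.cluster import KMeans

    # Hypothetical feature matrix: age, mobility score, prior case count.
    X = cp.random.rand(10_000, 3).astype(cp.float32)
    y = (X[:, 0] + X[:, 1] > 1.0).astype(cp.int32)  # synthetic infection-risk label

    # The estimator API mirrors scikit-learn: fit, predict, fit_predict.
    clf = LogisticRegression()
    clf.fit(X, y)
    print(clf.predict(X[:5]))

    # Unsupervised segmentation of the population into four clusters.
    km = KMeans(n_clusters=4, random_state=0)
    segments = km.fit_predict(X)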

Section six will take 15 minutes. The lab concludes with a discussion on practical considerations for working with large datasets. Topics include memory management (e.g., avoiding GPU memory overflow), data pipeline optimization, and integrating GPU tools into production workflows. Participants will reflect on trade-offs, such as when to use CPU vs. GPU processing, and learn best practices for scaling data science projects to real-world scenarios.
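
One memory-management habit the discussion may touch on, sketched with CuPy's default memory pool (sizes are illustrative):

    import cupy as cp

    pool = cp.get_default_memory_pool()
    x = cp.zeros((10_000, 10_000), dtype=cp.float32)  # roughly a 400 MB allocation
    print(pool.used_bytes() / 1e9, "GB in use")

    del x                    # drop the last reference to the array...
    pool.free_all_blocks()   # ...and return the cached blocks to the GPU
    print(pool.used_bytes() / 1e9, "GB in use after freeing")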


Prerequisites
  • A fundamental understanding of Python programming, including basic syntax, variables, loops, and functions
  • Some experience working with large datasets and familiarity with machine learning model development is a plus

Kevin Lee is a senior technical content developer on the Deep Learning Institute Team at NVIDIA. Kevin’s work focuses on raising awareness and driving adoption of GPU-accelerated technologies by creating developer-focused hands-on training with an emphasis on Data Science, Computer Vision, and Large Language Models. Prior to NVIDIA, Kevin led a risk analytics team at Morgan Stanley and taught Data Science and Machine Learning at the University of California, Berkeley.