07-09, 16:05–16:35 (US/Pacific), Ballroom
Many scientists rely on NumPy for its simplicity and strong CPU performance, but scaling beyond a single node is challenging. Researchers at SLAC need to process massive datasets under tight beam-time constraints, often modifying code on the fly. This is where cuPyNumeric comes in: a drop-in replacement for NumPy that distributes work across CPUs and GPUs. With its familiar NumPy interface, cuPyNumeric makes it easy to scale computations without rewriting code, helping scientists focus on their research instead of debugging. It’s a great example of how the SciPy ecosystem enables cutting-edge science.
Many data and simulation scientists use NumPy for its ease of use and good performance on CPUs. This approach works well for single-node tasks, but scaling to larger datasets or more resource-intensive computations introduces significant challenges. Moreover, using GPUs to speed up compute-intensive parts of the code adds another layer of complexity.
Scientists at the Stanford Linear Accelerator Center (SLAC) need to process a large amount of data within a fixed time window, called beam time. The full dataset generated during experiments is too large to be processed on a single CPU. Additionally, the code often must be modified during beam time to adapt to changing experimental needs. Being able to use NumPy syntax rather than lower-level distributed computing libraries makes these changes quick and easy, allowing researchers to focus on conducting more experiments rather than debugging or optimizing code.
To address these challenges, we developed cuPyNumeric, an open-source drop-in replacement for NumPy that seamlessly distributes work across CPUs and GPUs. Built on top of Legion, a task-based distributed runtime from Stanford University, it automatically parallelizes NumPy APIs across all available resources, taking care of data distribution, communication, and asynchronous, accelerated execution of compute kernels on both GPUs and multi-core CPUs. In addition, cuPyNumeric can be used alongside other popular Python libraries such as SciPy, Matplotlib, and JAX.
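For readers unfamiliar with the library, here is a minimal sketch of what "drop-in" means in practice (the array sizes and operations below are illustrative, not SLAC's actual workload):

    # Only the import changes relative to plain NumPy; everything else
    # is ordinary NumPy syntax. Each array operation becomes a task that
    # the runtime partitions across the available CPUs and GPUs.
    import cupynumeric as np  # instead of: import numpy as np

    x = np.arange(100_000_000, dtype=np.float64)  # transparently partitioned
    y = np.sin(x) + 0.5 * x                       # element-wise kernels
    total = y.sum()                               # distributed reduction
    print(total)

Multi-GPU and multi-node runs are typically started through the legate driver (for example, legate --gpus 2 script.py); the exact flags depend on the installed version.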
With cuPyNumeric, SLAC scientists successfully ran their data processing code distributed across multiple nodes and GPUs, processing the full dataset with a 6x speed-up compared to the original single-node implementation. This acceleration not only ensured timely processing of the full dataset but also enabled researchers to adapt their code dynamically to changing experimental needs.
This talk is for Python developers who work with large data or want to speed up their code, whether or not they’ve used accelerated libraries before. It will showcase the productivity and performance of the cuPyNumeric library using the example of scaling up SLAC’s signal-processing code. It will also cover some details of the library’s implementation.
We propose the following outline for the talk:
5 minutes: An introduction to cuPyNumeric
5 minutes: Details of the library’s implementation
5 minutes: Overview of the SLAC code and the challenges researchers face when processing experimental data during beam time
5 minutes: Integration of the cuPyNumeric library into the SLAC code
3 minutes: Performance results in detail, including the changes to the original code that improve both code quality and performance at scale (an illustrative sketch follows this outline)
2 minutes: Concluding remarks
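To illustrate the kind of change covered in the performance section: cuPyNumeric, like NumPy itself, performs best when element-by-element Python loops are replaced by whole-array operations, since each array operation is a single task the runtime can partition. The before/after below is a hypothetical sketch, not the actual SLAC code:

    import cupynumeric as np

    data = np.random.rand(10_000_000)
    threshold = 0.9

    # Before: a Python-level loop serializes the work and prevents the
    # runtime from partitioning the computation.
    # count = 0
    # for v in data:
    #     if v > threshold:
    #         count += 1

    # After: one vectorized expression that the runtime can split across
    # all available GPUs and CPUs as a distributed computation.
    count = int((data > threshold).sum())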
Irina Demeshko is a senior software engineer at NVIDIA working on the cuPyNumeric and Legate projects. Before joining NVIDIA, Irina was a research scientist and team leader of the Co-Design team at Los Alamos National Laboratory. Her work and research interests are in the area of new HPC technologies and programming models.