SciPy 2025

Reproducible Machine Learning Workflows for Scientists with pixi
07-07, 13:30–17:30 (US/Pacific), Room 315

Scientific researchers need reproducible software environments for complex applications that can run across heterogeneous computing platforms. Modern open source tools, like pixi, automatically lock all dependencies for reproducibility while providing a high-level interface well suited to researchers.

This tutorial will provide a practical introduction to using pixi to easily create scientific and AI/ML environments that benefit from hardware acceleration, across multiple machines and platforms. The focus will be on applications using the PyTorch and JAX Python machine learning libraries with CUDA enabled, as well as deploying these environments to production settings in Linux container images.


As artificial intelligence (AI) and machine learning (ML) become a standard part of the scientific toolkit, robustly reproducible scientific computing environments that support hardware acceleration, e.g. with CUDA, become more important. Historically, however, just installing a working CUDA environment on a single machine, let alone on multiple platforms with different requirements, was considered a particularly difficult and painful task. This led to many scientific machine learning workflows being reliably runnable only on particular machines and, even worse, in environments that were not reproducible across time.

With significant recent advancements by the NVIDIA open source team and the conda-forge open source community, the entire CUDA stack, from compilers to runtime libraries, is now distributed on conda-forge. This significantly reduces the overhead of installing CUDA dependencies, but packaging and distribution of binaries alone does not solve the problem of reproducibility. pixi provides a missing piece of the scientific researcher's toolkit: automatic multi-platform, hash-level lock file support for all dependencies available on package indexes (such as PyPI and conda-forge), highly efficient solving strategies, and a high-level user interface. With pixi, researchers can specify their hardware acceleration requirements, the multiple computational environments needed for their experiments, and the required software dependencies, and then quickly solve for a multi-platform lock file covering all required dependencies, down to the compiler level. This makes it possible to define multiple hardware-accelerated environments that can run AI/ML workflows across heterogeneous machines with different GPU types and CUDA compatibility.
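As a sketch of what this looks like in practice, a pixi manifest might declare CUDA as a system requirement for a GPU feature and define separate CPU and GPU environments. The manifest keys follow the pixi documentation, but the project name, version pins, and package names here are illustrative assumptions and may differ across pixi versions:

```toml
# pixi.toml -- illustrative sketch; exact keys and package names may vary
[project]
name = "ml-workflows"                      # hypothetical project name
channels = ["conda-forge"]
platforms = ["linux-64", "osx-arm64", "win-64"]

[dependencies]
python = "3.12.*"
numpy = "*"

# GPU feature: assumes a CUDA 12 compatible driver on the host machine
[feature.gpu]
platforms = ["linux-64", "win-64"]

[feature.gpu.system-requirements]
cuda = "12"

[feature.gpu.dependencies]
pytorch-gpu = "*"

[feature.cpu.dependencies]
pytorch-cpu = "*"

[environments]
gpu = ["gpu"]
cpu = ["cpu"]
```

Running `pixi install` against a manifest like this solves all environments and writes a multi-platform `pixi.lock` file that pins every dependency, so the same environments can be recreated on other machines.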

This tutorial is targeted at scientific researchers who use Python for scientific computing and rely on hardware-accelerated workflows in their research, with a particular focus on AI/ML. No prior expertise with hardware accelerator systems is assumed. The tutorial will begin with an introduction to pixi as a computational environment manager and explore how it provides features beyond the package managers more commonly used for Python dependencies. It will then extend to adding CUDA requirements to pixi environments, and provide participants with exercises on solving environments and running simple AI/ML workflows using the PyTorch and JAX machine learning libraries. Later exercises will move on to more complex environment requirements. The tutorial will conclude with examples and exercises focused on deploying pixi workflows to production environments by distributing pixi environments in Linux container images.
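To illustrate the container-deployment pattern mentioned above, a multi-stage container build can install a locked pixi environment and copy it into a slim runtime image. This sketch follows the pattern described in the pixi documentation; the base image tags, flags, and the `train.py` entry point are assumptions that may need adjusting:

```dockerfile
# Sketch of a multi-stage pixi container build; image names, flags, and
# train.py are illustrative assumptions based on the pixi documentation.
FROM ghcr.io/prefix-dev/pixi:latest AS build

WORKDIR /app
COPY pixi.toml pixi.lock ./
# Install exactly the dependencies pinned in pixi.lock
RUN pixi install --locked
# Write an activation script for use in the runtime stage
RUN pixi shell-hook > /shell-hook.sh

FROM ubuntu:24.04 AS runtime
WORKDIR /app
COPY --from=build /app/.pixi /app/.pixi
COPY --from=build /shell-hook.sh /shell-hook.sh
COPY . /app
# Activate the pixi environment, then run the workload
ENTRYPOINT ["/bin/bash", "-c", "source /shell-hook.sh && exec \"$@\"", "--"]
CMD ["python", "train.py"]
```

Because the lock file is copied in before installation, the container build reproduces the same solved environment as the development machine.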

Tutorial participants will code all examples themselves, and will also be given time to explore solutions for their own hardware-accelerated Python workflows. To make the tutorial more practical and interactive, cloud GPU resources will be requested from industry partners so that participants have hardware-accelerated resources on which to run their own examples.


Installation Instructions

pixi is the only tool that needs to be installed prior to the start of the tutorial. Install instructions for pixi are provided on the pixi documentation website, but can be summarized as:

* Linux, macOS: curl -fsSL https://pixi.sh/install.sh | bash
* Windows: powershell -ExecutionPolicy ByPass -c "irm -useb https://pixi.sh/install.ps1 | iex"

Prerequisites

Participants should be familiar with Python programming for science and with using external dependencies in their work. The tutorial will use machine learning workflows as examples, but while familiarity with machine learning may be useful for conceptual understanding of the tasks, no prior machine learning knowledge is required to complete the tutorial. No prior expertise with CUDA is assumed.

Matthew is a research scientist in experimental high energy physics and data science at the University of Wisconsin-Madison Data Science Institute (a "data physicist"). He works as a member of the ATLAS collaboration on searches for physics beyond the standard model with experiments performed at CERN's Large Hadron Collider (LHC) in Geneva, Switzerland. He also serves on the executive board of the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP) where he is a researcher and the Analysis Systems Area lead. He is also a topical editor for physics and data science for the Journal of Open Source Software.

Matthew has served on the SciPy Organizing Committee since 2020, with roles as co-chair of the Physics and Astronomy specialized track and co-chair of the Program Committee.

Former robotics engineer now solving package management, so others don't have to experience what I had to go through. I'm a core maintainer of pixi and love sharing our work through talks, podcasts, or videos.
