SciPy 2025
The advancement of AI systems makes interpretability essential for addressing transparency, biases, risks, and regulatory compliance. The workshop teaches core interpretability techniques, including SHAP (game-theoretic feature attribution), Gini importance (decision tree impurity analysis), LIME (local surrogate models), and Permutation Importance (feature shuffling), which provide global and local explanations for model decisions. Through hands-on building of interpretability tools and visualization techniques, we explore how these methods enable bias detection and clinical trust in healthcare diagnostics and help develop effective strategies in finance. These techniques are essential for building interpretable AI that addresses the challenges of black-box models.
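As a rough illustration of the feature-shuffling idea behind Permutation Importance, the sketch below measures how much a model's error grows when one feature column is shuffled. It uses only the standard library; `toy_model`, the synthetic data, and the function name are assumptions made for this example, not material from the workshop.

```python
import random

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Return the increase in mean squared error when each feature column is shuffled."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    baseline = mse(rows)
    importances = []
    for j in range(n_features):
        # Shuffle column j only, leaving every other feature untouched.
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(shuffled) - baseline)
    return importances

# Hypothetical toy model that depends only on feature 0.
def toy_model(row):
    return 3.0 * row[0]

data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
targets = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, targets, n_features=2)
```

Shuffling feature 0 breaks the link between inputs and targets, so its score is positive, while the ignored feature 1 scores zero.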
Structured Query Language (or SQL for short) is a programming language for managing data in a database system and an essential part of any data engineer’s tool kit. In this tutorial, you will learn how to use SQL to create databases and tables, insert data into them, and extract, filter, and join data or make calculations using queries. We will use DuckDB, a new open source embedded in-process database system that combines cutting edge database research with dataframe-inspired ease of use. DuckDB is only a pip install away (with zero dependencies), and runs right on your laptop. You will learn how to use DuckDB with your existing Python tools like Pandas, Polars, and Ibis to simplify and speed up your pipelines. Lastly, you will learn how to use SQL to create fast, interactive data visualizations, and how to teach your data how to fly and share it via the Cloud.
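The create, insert, filter, join, and aggregate steps mentioned above might look like the following minimal sketch. To keep it runnable with the standard library alone, it swaps in Python's built-in `sqlite3` module as a stand-in for DuckDB; the table names and values are invented for illustration.

```python
import sqlite3

# In-memory database; sqlite3 is used here only as a stand-in for DuckDB.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE readings (city_id INTEGER, temp_c REAL)")
con.executemany("INSERT INTO cities VALUES (?, ?)", [(1, "Tacoma"), (2, "Austin")])
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 18.0), (1, 22.0), (2, 35.0), (2, 31.0)],
)

# Filter rows, join the two tables, and compute an aggregate per city.
query = """
SELECT c.name, AVG(r.temp_c) AS avg_temp
FROM readings AS r
JOIN cities AS c ON c.id = r.city_id
WHERE r.temp_c > 20
GROUP BY c.name
ORDER BY c.name
"""
result = con.execute(query).fetchall()
```

The SQL itself is standard enough that it should carry over to other engines with little or no change.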
In this tutorial, you will learn how to integrate Large Language Models (LLMs) directly into Python programs as thoughtfully designed core components of the program rather than bolt-on additions. This hands-on session teaches design principles and practical techniques for incorporating LLM outputs into program control flow. We will use LlamaBot, an open-source Python interface to LLMs, focusing on local execution with efficient local models.
This tutorial will explore GPU-accelerated clustering techniques using RAPIDS cuML, optimizing algorithms like K-Means, DBSCAN, and HDBSCAN for large datasets. Traditional clustering methods struggle with scalability, but GPU acceleration significantly enhances performance and efficiency.
Participants will learn to leverage dimensionality reduction techniques (PCA, t-SNE, UMAP) for better data visualization and apply hyperparameter tuning with Optuna and cuML. The session also includes real-world applications like topic modeling in NLP and customer segmentation. By the end, attendees will be equipped to implement, optimize, and scale clustering algorithms effectively, unlocking faster and more powerful insights in machine learning workflows.
As general purpose GPU programming has risen in popularity, many Python programmers have expressed a need to use this technology in their libraries and applications. They soon realize that the GPU landscape is vast and sometimes difficult to traverse for Python users.
In this talk, I will demystify the CUDA-enabled Accelerated Python landscape, focusing on the advantages and disadvantages of popular libraries, the common performance issues encountered, and the best practices for getting the most out of your GPU. Topics include CuPy, Numba, nvmath-python, cuDF, and cuML.
This talk is beginner-friendly, but even the most seasoned programmer will gain insight into the Python GPU computing landscape.
Despite its reputation for being slow, Python is the leading language of scientific computing, which generally needs large-scale (fast) computations. This is because most scientific problems can be split into "metadata bookkeeping" and "number crunching," where the latter is performed by array-oriented (vectorized) calls into precompiled routines.
This tutorial is an introduction to array-oriented programming. We'll focus on techniques that are equally useful in any array library, with a particular focus on NumPy and JAX. You'll work in groups on four class projects: Conway's Game of Life using arrays, iterative computations on arrays, just-in-time (JIT) compilation for the Mandelbrot set, and exploring data in ragged arrays.
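To give a flavor of the array-oriented style, here is one possible way to write a Game of Life step with NumPy, using whole-array shifts instead of per-cell loops. This is a sketch and not the tutorial's own solution; the wraparound edges and the blinker starting pattern are choices made for this example.

```python
import numpy as np

def life_step(board):
    """Advance a Game of Life board one generation with wraparound edges."""
    # Count live neighbors by summing the eight shifted copies of the board.
    neighbors = sum(
        np.roll(np.roll(board, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell lives if it has 3 neighbors, or is alive with exactly 2.
    alive = (neighbors == 3) | ((board == 1) & (neighbors == 2))
    return alive.astype(board.dtype)

board = np.zeros((5, 5), dtype=np.int64)
board[2, 1:4] = 1  # horizontal "blinker"
next_board = life_step(board)
```

After one step the horizontal blinker becomes vertical, and a second step restores the original board.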
This tutorial is an introduction to data visualization using the popular Vega-Altair Python library. Vega-Altair provides a simple and expressive API, enabling authors to rapidly create a wide range of interactive charts.
Participants will explore the fundamentals of effective chart design and gain hands-on experience building a variety of visualizations using Vega-Altair's declarative API. Furthermore, this tutorial will introduce users to advanced topics such as data transformations and interaction design. We will finish off by covering practical workflows such as integrating Vega-Altair into dashboarding systems, publishing visualizations, and creating reusable, themed charting libraries. By the end of the session, attendees will have the skills to leverage Vega-Altair for both rapid prototyping and production-ready visualizations in diverse environments.
PyVista is a general purpose 3D visualization library used by more than 2,000 open source projects to visualize everything from computer aided engineering and geophysics to volcanoes and digital artwork.
PyVista exposes a Pythonic API to the Visualization Toolkit (VTK) to provide tooling that is immediately usable without any prior knowledge of VTK and is being built as the 3D equivalent of Matplotlib, with plugins to Jupyter to enable visualization of 3D data using both server- and client-side rendering.
Pandas and scikit-learn have become staples in the machine learning toolkit for processing and modeling tabular data in Python. However, when data size scales up, these tools become slow or run out of memory. Ibis provides a unified, Pythonic, dataframe-like interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Ibis enables users to leverage these powerful tools without rewriting their data engineering code (or learning SQL). IbisML extends the benefits of using Ibis to the ML workflow by letting users preprocess their data at scale on any Ibis-supported backend.
In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability after each move during chess games.
Spreadsheets are one of the most common ways to share and work with data, and they also work well with Python. In this tutorial, we will cover some of the basics and best practices of consuming and producing spreadsheets in Python, and then take a deep dive into how to run Python directly in your spreadsheets. We will introduce and explore the new Python in Excel features as well as the Anaconda Toolbox for Excel add-in.
Working with data can be challenging: it often doesn’t come in the best format for analysis, and understanding it well enough to extract insights requires both time and the skills to filter, aggregate, reshape, and visualize it. This session will equip you with the knowledge you need to effectively use pandas – a powerful library for data analysis in Python – to make this process easier.
Pandas makes it possible to work with tabular data and perform all parts of the analysis from collection and manipulation through aggregation and visualization. While most of this session focuses on pandas, during our discussion of visualization, we will also introduce at a high level Matplotlib (the library that pandas uses for its visualization features, which when used directly makes it possible to create custom layouts, add annotations, etc.) and Seaborn (another plotting library, which features additional plot types and the ability to visualize long-format data).
Scientific researchers need reproducible software environments for complex applications that can run across heterogeneous computing platforms. Modern open source tools, like pixi, provide automatic reproducibility solutions for all dependencies while providing a high level interface well suited for researchers.
This tutorial will provide a practical introduction to using pixi to easily create scientific and AI/ML environments that benefit from hardware acceleration, across multiple machines and platforms. The focus will be on applications using the PyTorch and JAX Python machine learning libraries with CUDA enabled, as well as deploying these environments to production settings in Linux container images.
Large Language Models (LLMs) have revolutionized natural language processing, but they come with limitations such as hallucinations and outdated knowledge. Retrieval-Augmented Generation (RAG) is a practical approach to mitigating these issues by integrating external knowledge retrieval into the LLM generation process.
This tutorial will introduce the core concepts of RAG, walk through its key components, and provide a hands-on session for building a complete RAG pipeline. We will also cover advanced techniques, such as hybrid search, re-ranking, ensemble retrieval, and benchmarking. By the end of this tutorial, participants will be equipped with both the theoretical understanding and practical skills needed to build robust RAG pipelines.
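The retrieve-then-generate flow can be sketched in a few lines. The example below is a toy under stated assumptions: it replaces a real embedding model with bag-of-words counts, ranks documents by cosine similarity, and stops at building the prompt that would be sent to an LLM. The function names and sample documents are illustrative only.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding': lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=1):
    """Assemble retrieved context and the question into a single prompt string."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "DuckDB is an embedded in-process database system.",
    "napari is a Python library for multidimensional image visualization.",
    "Lithops is a serverless data processing toolkit.",
]
question = "Which library visualizes multidimensional images?"
top = retrieve(question, docs)
prompt = build_prompt(question, docs)
```

A production pipeline would typically swap `embed` for a learned embedding model and add steps such as re-ranking, but the overall shape stays similar.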
Ontologies provide a powerful way to structure knowledge, enable reasoning, and support more meaningful queries compared to traditional data models. Recently, interest in ontologies has resurged, driven by advancements in language models, reasoning capabilities, and the growing adoption of platforms like Palantir Foundry.
In this hands-on tutorial, participants will explore ontology development across multiple domains using a variety of Python-based tools such as rdflib, Owlready2, SWI-Prolog, PySpark, Pandas, NetworkX, and SciPy. They will learn how ontologies facilitate semantic reasoning, improve data interoperability, and enhance query capabilities.
Additionally, attendees will build a rudimentary reasoning engine to better understand inference mechanisms.
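A rudimentary reasoning engine of this kind could be sketched as forward chaining over subject-predicate-object triples. The two rules and the sample facts below are assumptions chosen for illustration (loosely modeled on RDFS subclass semantics); they are not the tutorial's actual engine and do not use any of the libraries listed above.

```python
def infer(facts):
    """Forward-chain two RDFS-style rules until no new triples appear."""
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            for s2, p2, o2 in known:
                # Rule 1: subClassOf is transitive.
                if p == p2 == "subClassOf" and o == s2:
                    new.add((s, "subClassOf", o2))
                # Rule 2: instances inherit membership in superclasses.
                if p == "type" and p2 == "subClassOf" and o == s2:
                    new.add((s, "type", o2))
        if new <= known:
            return known
        known |= new

facts = {
    ("Dog", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
    ("rex", "type", "Dog"),
}
closure = infer(facts)
```

Starting from three stated facts, the loop derives three more, including that `rex` is an `Animal`, which was never asserted directly.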
The tutorial emphasizes practical applications and comparisons with conventional data representations, making it ideal for researchers, data engineers, and developers interested in knowledge representation and reasoning.
This hands-on tutorial will guide participants through building an end-to-end AI agent that translates natural language questions into SQL queries, executes them on live databases, and returns coherent responses. Using the Retrieval-Augmented Generation (RAG) approach with modern LLMs, participants will learn how to construct robust NL2SQL systems that understand database schema, respect database constraints, and generate accurate SQL. By the end of this 4-hour session, attendees will have created a working prototype using the Brazilian E-Commerce dataset that they can adapt to their own data sources.
Python packaging can be overwhelming. However, a trusted, community-vetted workflow can make it easier. In this hands-on workshop, you’ll learn a tested approach developed by the pyOpenSci community and vetted by Python packaging maintainers. You’ll create an installable, maintainable, and citable package using a quickstart template. You’ll also receive step-by-step guidance on publishing to TestPyPI (and resources for conda-forge, and adding a DOI with Zenodo). If you can’t install software on your laptop, you can use GitHub Codespaces to participate in the workshop. Join us to package your Python code confidently and to access ongoing support in our community beyond the workshop.
With cameras in everything from microscopes to telescopes to satellites, scientists produce image data in countless formats, shapes, sizes, and dimensions. Python provides a rich ecosystem of libraries to make sense of them. napari is a Python library for multidimensional image visualization, but it does double duty as a standalone application that can be easily extended with GUI tools for analysis, visualization, and annotation. In this tutorial, we'll start with the basics of image visualization and analysis in Python, then show how to extend the napari user interface to make analysis workflows as easy as pushing a button, and finally show how to share these extensions as plugins, which can be easily installed by users and collaborators. If you work with images (particularly multidimensional images), and especially if you work with scientists who may not be comfortable with Python, this tutorial might be for you!
The rapid expansion of the geospatial industry and the accompanying increase in availability of geospatial data present unique opportunities and challenges in data science. As the need for skilled data scientists increases, the ability to manipulate and interpret this data becomes crucial. This workshop introduces the essentials of geospatial data manipulation and data visualisation, emphasizing hands-on techniques to transform, analyze and visualise diverse datasets effectively.
Throughout the workshop, attendees will explore the extensive ecosystem of geospatial Python libraries. Key tools include GeoPandas, Shapely and Cartopy for vector data, and GDAL, Rasterio and rioxarray for raster data. Participants will also learn to integrate these with popular plotting libraries such as Matplotlib, Bokeh, and Plotly for visualizations.
This tutorial will cover three primary topics: visualizing geospatial shapes, managing raster datasets, and synthesizing multiple data types into unified visual representations. Each section will incorporate data manipulation exercises to ensure attendees not only visualize but also deeply understand geospatial data.
Targeting both beginners and advanced practitioners, the workshop will employ real-world examples to guide participants through the necessary steps to produce striking and informative geospatial visualizations. By the end, attendees will be equipped with the knowledge to leverage advanced data science techniques in their geospatial projects, making them proficient in both the analysis and communication of spatial information.
Through the use of NetworkX's API, tutorial participants will learn about the basics of graph theory and its use in applied network science. Starting with a computationally-oriented definition of a graph and its associated methods, we will progress through the following concepts: path and structure finding, visualization, and graph storage on disk. We will also offer tutorial participants the option of one advanced topic overview, including the use of graphs alongside LLMs for knowledge retrieval, scalable alternatives to NetworkX including cuGraph, and the use of linear algebraic translation of graph problems to speed up computations.
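As a small illustration of a computationally oriented graph definition and path finding, the sketch below stores a graph as an adjacency dictionary and finds a shortest path with breadth-first search. It is deliberately written in plain Python and does not use the NetworkX API; the graph and function name are made up for this example.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Breadth-first search on an adjacency-dict graph; returns a node list or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parent links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
path = shortest_path(graph, "a", "e")
```

Because breadth-first search explores nodes in order of distance, the first time it reaches the target it has found a path with the fewest edges.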
Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset.
These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data.
They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations.
In this sense, cloud-optimized data is a nice fit for data-parallel jobs using serverless functions.
Function-as-a-Service (FaaS) platforms provide a data-driven, scalable, and cost-efficient experience with practically no management burden.
Each serverless function reads and processes a small portion of the cloud-optimized dataset in parallel, directly from object storage, which significantly increases the speedup.
In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit.
Lithops is a serverless data processing toolkit that is specially designed to process data from Cloud Object Storage using Serverless functions.
We will also demonstrate the Dataplug library, which enables cloud-optimized data management in scientific settings such as genomics, metabolomics, or geospatial data. We will show different data processing pipelines in the Cloud that demonstrate the benefits of cloud-optimized data management.
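The partition-and-process pattern described in this abstract can be mimicked locally. The sketch below is only an analogy: an in-memory byte string stands in for an object in cloud storage, a thread pool stands in for serverless functions, and none of the names correspond to the Lithops or Dataplug APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an object in cloud storage: 1024 bytes held in memory.
BLOB = bytes(range(256)) * 4

def byte_ranges(size, chunk):
    """Split [0, size) into half-open (start, end) ranges of at most `chunk` bytes."""
    return [(start, min(start + chunk, size)) for start in range(0, size, chunk)]

def process_range(byte_range):
    """Hypothetical worker: read only its own byte range and reduce it to a partial sum."""
    start, end = byte_range
    part = BLOB[start:end]  # a real worker would issue a ranged read against object storage
    return sum(part)

# Partition on the fly, process the partitions in parallel, then combine.
ranges = byte_ranges(len(BLOB), 300)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_range, ranges))
total = sum(partials)
```

Each worker touches only its own slice, so no single worker ever needs the whole dataset.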
Work not shown is work lost. Many excellent scientists and engineers are not always adept at showcasing their work. As a result, many interesting scientific ideas have never seen the light of day.
However, using today's tools, one no longer has to leave the Python ecosystem to create classy, complete prototypes using modern data visualization and web development tools. With over five years of experience building and presenting data solutions at huge science companies, we show it doesn't have to be challenging. We give a walkthrough of the primary web application frameworks and showcase Fast Dash, an open-source Python library we built to solve specific prototyping needs.
This tutorial is meant for all data professionals who find value in quickly turning their science code into web applications. Participants will learn about the leading frameworks, their strengths and limitations, and a framework for picking the best one for a given task. We will go through some day-to-day applications and hands-on tutorials in the final section.
Maintaining code quality can be challenging, no matter the size of your project or number of contributors. Different team members may have different opinions on code styling and preferences for code structure, while solo contributors might find themselves spending a considerable amount of time making sure the code conforms to accepted conventions. However, manually inspecting and fixing issues in files is both tedious and error-prone. As such, computers are much more suited to this task than humans. Pre-commit hooks are a great way to have a computer handle this for you.
Pre-commit hooks are code checks that run whenever you attempt to commit your changes with Git. They can detect and, in some cases, automatically correct code-quality issues before they make it to your codebase. In this tutorial, you will learn how to install and configure pre-commit hooks for your repository to ensure that only code that passes your checks makes it into your code base. We will also explore how to build custom pre-commit hooks for novel use cases.
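A custom hook can be an ordinary script: the pre-commit framework passes the staged file names as command-line arguments, and a non-zero exit status fails the hook. The following sketch checks for trailing whitespace; the function names are illustrative, and the configuration entry that would register the script as a hook is not shown.

```python
def lines_with_trailing_whitespace(text):
    """Return 1-based numbers of lines that end with spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]

def main(paths):
    """Check each file in `paths`; return 1 if any problem is found, else 0."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            bad_lines = lines_with_trailing_whitespace(handle.read())
        for number in bad_lines:
            print(f"{path}:{number}: trailing whitespace")
            status = 1
    return status

# As a hook entry point, a script would end with something like:
#     raise SystemExit(main(sys.argv[1:]))
sample = "a \nb\n\tc\t\n"
flagged = lines_with_trailing_whitespace(sample)
```

Keeping the check in a pure function that takes text makes it easy to unit test separately from the file handling.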
As data science continues to evolve, the ever-growing size of datasets poses significant computational challenges. Traditional CPU-based processing often struggles to keep pace with the demands of data science workflows. Accelerated computing with GPUs offers a solution by enabling massive parallelism and significantly reducing processing times for data-heavy tasks. In this session, we will explore GPU computing architecture, how it differs from CPUs, and why it is particularly well-suited for data science workloads. This hands-on lab will dive into the different approaches to GPU programming, from low-level CUDA coding to high-level Python libraries within RAPIDS such as CuPy, cuDF, cuGraph, and cuML.
This workshop is designed to equip software engineers with the skills to build and iterate on generative AI-powered applications. Participants will explore key components of the AI software development lifecycle through first principles thinking, including prompt engineering, monitoring, evaluations, and handling non-determinism. The session focuses on using multimodal AI models to build applications, such as querying PDFs, while providing insights into the engineering challenges unique to AI systems. By the end of the workshop, participants will know how to build a PDF-querying app, but all techniques learned will be generalizable for building a variety of generative AI applications.
If you're a data scientist, machine learning practitioner, or AI enthusiast, this workshop can also be valuable for learning about the software engineering aspects of AI applications, such as lifecycle management, iterative development, and monitoring, which are critical for production-level AI systems.
Satellite-based air quality products (e.g., NO₂, PM2.5/AOD, CO) are valuable for environmental monitoring but often have coarse resolution and significant gaps, especially under cloudy conditions. This tutorial guides participants through the end-to-end process of generating high-resolution air quality maps from coarse-resolution satellite data using AI/ML techniques. The tutorial features practical exercises utilizing Python's robust ecosystem (Xarray, Rasterio, scikit-learn, TensorFlow/Keras, PyTorch, GeoPandas, Folium, etc.), enabling participants to produce accurate, validated, and interactive maps suitable for local-level air quality assessments.
Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets. Many real-world datasets often have hierarchical or heterogeneous structure, and are best organized through groups of related data arrays. Through xarray.DataTree, the xarray data model now supports opening datasets with a hierarchical structure of groups, such as HDF5 files and Zarr stores. This expanded data model is now general enough to manage data across different scientific disciplines, including geosciences and biosciences. This hands-on tutorial focuses on intermediate and advanced workflows using xarray to analyze real-world hierarchical data.
Artificial intelligence has been successfully applied to bioimage understanding and has achieved significant results in the last decade. Advances in imaging technologies have also allowed the acquisition of higher-resolution images. This has increased not only the magnification at which images are captured, but also the size of the acquired images. This poses a challenge for deep learning inference on large-scale images, since these methods are commonly applied to relatively small regions rather than whole images. This workshop presents techniques to scale up inference of deep learning models to large-scale image data with the help of Dask for parallelization in Python.
Shiny is a framework for building web applications and data dashboards in Python.
In this workshop, you will see how the basic building blocks of Shiny can be extended to create your own scalable, production-ready Python applications.
In particular, this workshop covers:
- Overview of the basic building blocks of a Shiny for Python application
- How to refactor applications into Shiny modules
- How to write tests for your Shiny application
- Deploy and share your application
At the end of this course you will be able to:
- Build a Shiny app in Python
- Refactor your reactive logic into Shiny Modules
- Identify when to write Shiny modules
- Write unit tests and end-to-end tests for your Shiny application
- Deploy and share your application (for free!)
Simulation-Based Inference (SBI) is a powerful class of machine learning based methods for statistical inference, with applications in many scientific domains including particle physics, cosmology and astrophysics. While demonstrating significant promise in small-scale studies over the last 10 years, a full-scale deployment for a particle physics experiment remained elusive due to the computational challenges involved. Using novel ideas, modern distributed computing resources, and tools like JAX and TensorFlow, we built the first end-to-end SBI workflow exclusively using the Scientific Python ecosystem that is scalable for measurements at the Large Hadron Collider (LHC). The new techniques were used to measure the lifetime of the Higgs boson with the ATLAS experiment at the LHC with unprecedented precision.
In this talk, I will present the many challenges encountered while scaling the new analysis method to a full-scale measurement and demonstrate how the power, versatility and the rich set of tools in the Scientific Python ecosystem played a critical role in overcoming them.
The practice of data science in genomics and computational biology is fraught with friction. This is largely due to a tight coupling of bioinformatic tools to file input/output. While omic data is specialized and the storage formats for high-throughput sequencing and related data are often standardized, the adoption of emerging open standards not tied to bioinformatics can help better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems. Here, we present three libraries as short vignettes for composable bioinformatics. First, we present Oxbow, a Rust and Python-based adapter library that unifies access to common genomic data formats by efficiently transforming queries into Apache Arrow, a standard in-memory columnar representation for tabular data analytics. Second, we present Bioframe, a Python library that performs genomic range operations using standard Pandas dataframes. Last, we present Anywidget, an architecture based on modern web standards for sharing interactive visualizations across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, VSCode, and more. Together, we demonstrate the composition of these libraries to build a custom connected genomic analysis and visualization environment. We propose that components such as these, which leverage scientific domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows as well as systems for exploratory data analysis and visualization.
A lot of data scientists use UMAP to help them quickly visualize and explore complex datasets. This could be exploring large unstructured datasets via neural embeddings, or working on LLM explainability by mapping out Sparse Autoencoder features. Making the visualizations good enough, and compelling enough, to present to end users is much harder. However, if done right, a good UMAP plot can be a powerful communication tool, or a rich interactive experience that draws users in. Attendees will come away with a sense of what is possible, and an introduction to open source tools that can make it easy.
Climate models generate a lot of data - and this can make it hard for researchers to efficiently access and use the data they need. The solutions of yesteryear include standardised file structures, sqlite databases, and just knowing where to look. All of these work - to varying degrees - but can leave new users scratching their heads. In this talk, I'll outline how ACCESS-NRI built tooling around Intake and Intake-ESM to make it easy for climate researchers to access available data, share their own, and avoid writing the custom scripts over and over to work with the data their experiments generate.
Data manipulation libraries like Polars allow us to analyze and process data much faster than with native Python, but that’s only true if you know how to use them properly. When the team working on NCEI's Global Summary of the Month first integrated Polars, they found it was actually slower than the original Java version. In this talk, we'll discuss how our team learned how to think about computing problems like spreadsheet programmers, increasing our products’ processing speed by over 80%. We’ll share tips for rewriting legacy code to take advantage of parallel processing. We’ll also cover how we created custom, pre-compiled functions with Numba when the business requirements were too complex for native Polars expressions.
OpenMC is an open source, community-developed Monte Carlo tool for neutron transport simulations, featuring a depletion module for fuel burnup calculations in nuclear reactors and a Python API. Depletion calculations can be expensive as they require solving the neutron transport and Bateman equations in each timestep to update the neutron flux and material composition, respectively. Material properties such as temperature and density govern material cross sections, which in turn govern reaction rates. The reaction rates can affect the neutron population. In a scenario where there is no significant change in the material properties or composition, the transport simulation may only need to be run once; the same cross sections are used for the entire depletion calculation. We recently extended the depletion module in OpenMC to enable transport-independent depletion using multigroup cross sections and fluxes. This talk will focus on the technical details of this feature, its validation, and briefly touch on areas where the feature has been used. Two recent use cases will be highlighted. The first use case calculates shutdown dose rates for fusion power applications, and the second performs depletion for fission reactor fuel cycle modeling.
GBNet
Gradient Boosting Machines (GBMs) are widely used for their predictive power and interpretability, while Neural Networks offer flexible architectures but can be opaque. GBNet is a Python package that integrates XGBoost and LightGBM with PyTorch. By leveraging PyTorch’s auto-differentiation, GBNet enables novel architectures for GBMs that were previously exclusive to pure Neural Networks. The result is a greatly expanded set of applications for GBMs and an improved ability to interpret expressive architectures due to the use of GBMs.
Image analysis is a central tool in modern biology. Cell and developmental biologists generate multidimensional microscopy data, including imaging of cellular, subcellular and tissue structures, in three dimensions, over time, and with multiple molecular markers. Segmentation and tracking of multidimensional microscopy data requires high accuracy across many images (e.g. timepoints) and is a labour-intensive part of biological image processing pipelines. We present ReSCU-Nets, recurrent convolutional neural networks that use the segmentation results from the previous frame as a prompt to segment the current frame. We demonstrate that ReSCU-Nets outperform state-of-the-art segmentation models in different tasks on biological multidimensional microscopy sequences.
This talk provides an overview of several libraries in the open-source JAX ecosystem (such as Equinox, Diffrax, Optimistix, ...). In short, we have been building an "autodifferentiable GPU-capable SciPy". These libraries offer the foundational core of tools that have made it possible for us to train neural networks (e.g. score-based diffusions for image generation), solve PDEs, and smoothly handle hybridisations of the two (e.g. fit neural ODEs to scientific data). By the end of the talk, the goal is for you to be able to walk away with a slew of new modelling tools, suitable for tackling problems both in ML and in science.
Cubed is a framework for distributed processing of large arrays without a cluster. Designed to respect memory constraints at all times, Cubed can express any NumPy-like array operation as a series of embarrassingly-parallel, bounded-memory steps. By using Zarr as persistent storage between steps, Cubed can run in a serverless fashion on both a local machine and on a range of Cloud platforms. After explaining Cubed’s model, we will show how Cubed has been integrated with Xarray and demonstrate its performance on various large array geoscience workloads.
Many notable PyData projects, including JupyterHub, Matplotlib, and JAX, follow a versioning scheme called EffVer, in which, instead of making promises around backward compatibility, they communicate the likelihood and magnitude of the work required to adopt a new version.
In this talk we will dive into EffVer, what it is and what it means for developers and users. We will discuss how to apply EffVer to your own projects and how to depend on projects that use it.
Training Large Language Models (LLMs) requires processing massive-scale datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework designed to curate high-quality datasets for LLMs efficiently. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction—improving data quality and training efficiency.
We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, achieving up to 7% improvement in LLM downstream tasks. Attendees will gain insights into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.
Large language models (LLMs) enable powerful data-driven applications, but many projects get stuck in “proof-of-concept purgatory”—where flashy demos fail to translate into reliable, production-ready software. This talk introduces the LLM software development lifecycle (SDLC)—a structured approach to moving beyond early-stage prototypes. Using first principles from software engineering, observability, and iterative evaluation, we’ll cover common pitfalls, techniques for structured output extraction, and methods for improving reliability in real-world data applications. Attendees will leave with concrete strategies for integrating AI into scientific Python workflows—ensuring LLMs generate value beyond the prototype stage.
One of the most important aspects of developing scientific software is distribution for others. The Scientific Python Development Guide was developed to provide up-to-date best practices for packaging, linting, and testing, along with a versatile template supporting multiple backends, and a WebAssembly-powered repo-review tool to check a repository directly in the guide. This talk, with the guide for reference, will cover key best practices for project setup, backend selection, packaging metadata, GitHub Actions for testing and deployment, and tools for validating code quality. We will even cover tools for packaging compiled components that are simple enough for anyone to use.
For the past decade, SQL has reigned king of the data transformation world, and tools like dbt have formed a cornerstone of the modern data stack. Until recently, Python-first alternatives couldn't compete with the scale and performance of modern SQL. Now Ibis can provide the same benefits of SQL execution with a flexible Python dataframe API.
In this talk, you will learn how Ibis supercharges existing open-source libraries like Kedro and Pandera and how you can combine these technologies (and a few more) to build and orchestrate scalable data engineering pipelines without sacrificing the comfort (and other advantages) of Python.
Over the past few years, Discrete Global Grid Systems (DGGS) that subdivide the earth into (roughly) equally sized faces have seen increased popularity. However, their in-memory representation is different from traditional projection-based data, which comprises either an evenly shaped rectangular grid (aka raster) or discrete geometries (aka vector), and thus requires specialized tooling. In particular, this includes libraries that can work on the numeric cell ids defined by the specific DGGS.
xdggs is a library that provides a unified interface for xarray that allows working with and visualizing a variety of DGGS-indexed data sets.
LLMs are powerful, flexible, easy-to-use... and often wrong. This is a dangerous combination, especially for data analysis and scientific research, where correctness and reproducibility are core requirements. Fortunately, it turns out that by carefully applying LLMs to narrower use cases, we can turn them into surprisingly reliable assistants that accelerate and enhance, rather than undermine, scientific work.
This is not just theory—I’ll showcase working examples of seamlessly integrating LLMs into analytic workflows, helping data scientists build interactive, intelligent applications without needing to be web developers. You’ll see firsthand how keeping LLMs focused lets us leverage their "intelligence" in a way that’s practical, rigorous, and reproducible.
User guides are the piece you often hit right after clicking the "Learn" or "Get Started" button in a package's documentation. They're responsible for onboarding new users, and providing a learning path through a package. Surprisingly, while pieces of documentation like the API Reference tend to be the same, the design of user guides tends to differ across packages.
In this talk, I'll discuss how to design an effective user guide for open source software. I'll explain how the guides for Polars, DuckDB, and FastAPI balance working end-to-end like a course, with being browsable like a reference.
Block-based programming divides inputs into local arrays that are processed concurrently by groups of threads. Users write sequential array-centric code, and the framework handles parallelization, synchronization, and data movement behind the scenes. This approach aligns well with SciPy's array-centric ethos and has roots in older HPC libraries, such as NWChem’s TCE, BLIS, and ATLAS.
In recent years, many block-based Python programming models for GPUs have emerged, like Triton, JAX/Pallas, and Warp, aiming to make parallelism more accessible for scientists and increase portability.
In this talk, we'll present cuTile and Tile IR, a new Pythonic tile-based programming model and compiler recently announced by NVIDIA. We'll explore cuTile examples from a variety of domains, including a new LLAMA3-based reference app and a port of miniWeather. You'll learn the best practices for writing and debugging block-based Python GPU code, gain insight into how such code performs, and learn how it differs from traditional SIMT programming.
By the end of the session, you'll understand how block-based GPU programming enables more intuitive, portable, and efficient development of high-performance, data-parallel Python applications for HPC, data science, and machine learning.
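As a rough illustration of the block-based idea described above, the following sketch computes a matrix product one output tile at a time in plain NumPy on the CPU. It does not use the cuTile, Triton, Pallas, or Warp APIs; the function name and the tile size are assumptions chosen for the example. Each `(i, j)` output tile is independent of the others, which is what lets a tile-based framework hand tiles to separate groups of threads.

```python
import numpy as np


def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 32) -> np.ndarray:
    """Multiply a (m x k) by b (k x n), one output tile at a time."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions do not match")
    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):          # tile row of the output
        for j in range(0, n, tile):      # tile column of the output
            # Local accumulator for this tile; edge tiles may be smaller.
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=out.dtype)
            for p in range(0, k, tile):  # march along the shared dimension
                acc += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
            out[i:i + tile, j:j + tile] = acc
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((70, 45))
    b = rng.standard_normal((45, 50))
    print(np.allclose(tiled_matmul(a, b, tile=16), a @ b))
```

The two outer loops are written sequentially here; in a block-based GPU model they would typically correspond to the grid of blocks that the framework schedules concurrently.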
Tracking and Object-Based Analysis of Clouds (tobac) is a Python package that enables researchers to identify, track, and perform object-based analyses of phenomena in large atmospheric datasets. Over the past four years, tobac’s userbase has grown within atmospheric science, and the package has transitioned from its original life as a small, focused package with few maintainers to a larger package with more robust governance and structure. In this presentation, we will discuss the challenges and lessons learned during the transition to robust governance structures and the future of tobac as we incorporate new techniques for using multiple variables and scales to track the same system.
At CERN (European Organization for Nuclear Research), machine learning models are developed and deployed for various applications, including data analysis, event reconstruction, and classification. These models must not only be highly sophisticated but also optimized for efficient inference. A critical application is in Triggers: systems designed to identify and select interesting events from an immense stream of experimental data. Experiments like ATLAS and CMS generate data at rates of approximately 100 TB/s, requiring Triggers to rapidly filter out irrelevant events. This talk will explore the challenges of deploying machine learning in such high-throughput environments and discuss solutions to enhance their performance and reliability.
X-ray ptychographic imaging is becoming an indispensable tool for visualizing matter at the nanoscale, driving innovation across many fields, including functional materials, electronics, and life sciences. This imaging mode is particularly attractive thanks to its ability to generate a high-resolution view of an extended object without using a lens with high numerical aperture. The technique relies on advanced mathematical algorithms to retrieve the missing phase information that is not directly recorded by a physical detector, and is therefore computationally intensive. Advances in accelerator, optics, and detector technologies have greatly increased data generation rates, making it a major challenge to execute the reconstruction process efficiently enough to support decision-making during an experiment. Here, we demonstrate how efficient GPU-based reconstruction algorithms, deployed at the edge, enable real-time feedback during high-speed continuous data acquisition, increasing the speed and efficiency of the experiments. The developments further pave the way for AI-augmented autonomous microscopic experimentation performed at machine speeds.
Generative Artificial Intelligence (AI) is reshaping engineering education by offering students new ways to engage with complex concepts and content. Ethical concerns including bias, intellectual property, and plagiarism make Generative AI a controversial educational tool. Overreliance on AI may also lead to academic integrity issues, necessitating clear student codes of conduct that define acceptable use. As educators we should carefully design learning objectives to align with transferrable career skills in our fields. By practicing backward design with a focus on career-readiness skills, we can incorporate useful prompt engineering, rapid prototyping, and critical reasoning skills that incorporate generative AI. Engineering students want to develop essential career skills such as critical thinking, communication, and technology. This talk will focus on case studies for using generative AI and rapid prototyping for scientific computing in engineering courses for physics, programming, and technical writing. These courses include assignments and reading examples using NumPy, SciPy, Pandas, etc. in Jupyter notebooks. Embracing generative AI tools has helped students compare, evaluate, and discuss work that was inaccessible before generative AI. This talk explores strategies for using AI in engineering education while accomplishing learning objectives and giving students opportunities to practice career readiness skills.
Scaling artificial intelligence (AI) and machine learning (ML) workflows on high-performance computing (HPC) systems presents unique challenges, particularly as models become more complex and data-intensive. This study explores strategies to optimize AI/ML workflows for enhanced performance and resource utilization on HPC platforms.
We investigate advanced parallelization techniques, such as Data Parallelism (DP), Distributed Data Parallel (DDP), and Fully Sharded Data Parallel (FSDP). Implementing memory-efficient strategies, including mixed precision training and activation checkpointing, significantly reduces memory consumption without compromising model accuracy. Additionally, we examine various communication backends (i.e., NCCL, MPI, and Gloo) to enhance inter-GPU and inter-node communication efficiency. Special attention is given to the complexities of implementing these backends in HPC environments, providing solutions for proper configuration and execution.
Our findings demonstrate that these optimizations enable stable and scalable AI/ML model training and inference, achieving substantial improvements in training times and resource efficiency. This presentation will detail the technical challenges encountered and the solutions developed, offering insights into effectively scaling AI/ML workflows on HPC systems.
This talk presents a candid reflection on integrating generative AI into an Engineering Computations course, revealing unexpected challenges despite best intentions. Students quickly developed patterns of using AI as a shortcut rather than a learning companion, leading to decreased attendance and an "illusion of competence." I'll discuss the disconnect between instructor expectations and student behavior, analyze how traditional assessment structures reinforced counterproductive AI usage, and share strategies for guiding students toward using AI as a co-pilot rather than a substitute for critical thinking while maintaining academic integrity.
Computational needs in high energy physics applications are increasingly met by utilizing GPUs as hardware accelerators, but achieving the highest throughput requires directly reading data into GPU memory. This has yet to be achieved for HEP’s standard domain-specific “ROOT” file formats. Using KvikIO’s Python bindings to cuFile and nvCOMP, KvikUproot is a prototype package to support the reading of ROOT file formats by the GPU. On GPUDirect Storage (GDS) enabled systems, data bypasses the CPU and is loaded directly from storage to the GPU. We will discuss the methodology we developed to read ROOT files into GPUs via RDMA.
Explainable AI (XAI) emerged to clarify the decision-making of complex deep learning models, but standard XAI methods are often uninformative on Earth system models due to their high-dimensional and physically constrained nature. We introduce “physical XAI,” which adapts XAI techniques to maintain physical realism and handle autocorrelated data effectively. Our approach includes physically consistent perturbations, analysis of uncertainty, and the use of variance-based global sensitivity tools. Furthermore, we expand the definition of “physical XAI” to include meaningful interactive data analysis. We demonstrate these methods on two Earth system models: a data-driven global weather model and a winter precipitation type model to show how we can gain more physically meaningful insights.
Many scientists rely on NumPy for its simplicity and strong CPU performance, but scaling beyond a single node is challenging. The researchers at SLAC need to process massive datasets under tight beam time constraints, often needing to modify code on the fly. This is where cuPyNumeric comes in—a drop-in replacement for NumPy that distributes work across CPUs and GPUs. With its familiar NumPy interface, cuPyNumeric makes it easy to scale computations without rewriting code, helping scientists focus on their research instead of debugging. It’s a great example of how the SciPy ecosystem enables cutting-edge science.
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
Designing tomorrow's materials requires understanding how atoms behave – a challenge that's both fascinating and incredibly complex. While machine learning offers exciting speedups in materials simulation, it often falls short, missing vital electronic structure information needed to connect theory with experimental results. This work introduces a powerful solution: Density Functional Tight Binding (DFTB), which, combined with the versatile tools of Scientific Python, allows us to understand the electronic behavior of materials while maintaining computational efficiency. In this talk, I will present our findings demonstrating how DFTB, coupled with readily available Python packages, allows for direct comparison between theoretical predictions and experimental data, such as XPS measurements. I will also showcase our publicly available repository, containing DFTB parameters for a wide range of materials, making this powerful approach accessible to the broader research community.
This talk explores various methods to accelerate traditional machine learning pipelines using scikit-learn, UMAP, and HDBSCAN on GPUs. We will contrast the experimental Array API Standard support layer in scikit-learn with the cuML library from the NVIDIA RAPIDS Data Science stack, including its zero-code change acceleration capability. ML and data science practitioners will learn how to seamlessly accelerate machine learning workflows, see the performance benefits, and receive practical guidance for different problem types and sizes. We will also provide insights into minimizing cost and runtime by effectively mixing hardware for various tasks, as well as the current implementation status and future plans for these acceleration methods.
Rydberg atoms offer unique quantum properties that enable radio-frequency sensing capabilities distinct from any classical analogue; however, large parameter spaces and complex configurations make understanding and designing these quantum experiments challenging. Current solutions are often developed as in-house, closed-source software simulating a narrow range of problems. We present RydIQule, an open-source package leveraging tools of computational Python in novel ways to model the behavior of these systems generally. We describe RydIQule’s approach to representing quantum systems using computational graphs and leveraging NumPy broadcasting to define complete experiments. In addition to discussing the computational challenges RydIQule helps overcome, we outline how collaboration between physics and computational research backgrounds has led to this impactful tool.
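To give a flavor of how NumPy broadcasting can define a whole parameter sweep at once, here is a minimal sketch. It is not RydIQule's API or model: it uses the textbook steady-state excited population of a driven two-level atom as a stand-in, and the function and variable names are hypothetical.

```python
import numpy as np


def excited_population(detuning, rabi, gamma=1.0):
    """Textbook steady-state excited population of a driven two-level atom.

    All quantities share the same units (here, units of the decay rate gamma).
    Inputs may be scalars or arrays; NumPy broadcasts them against each other.
    """
    detuning = np.asarray(detuning, dtype=float)
    rabi = np.asarray(rabi, dtype=float)
    return (rabi**2 / 4) / (detuning**2 + rabi**2 / 2 + gamma**2 / 4)


# One axis per experimental parameter: shapes (101, 1) and (1, 3).
detunings = np.linspace(-5.0, 5.0, 101)[:, np.newaxis]
rabi_frequencies = np.array([0.5, 1.0, 2.0])[np.newaxis, :]

# Broadcasting evaluates every combination without explicit loops: (101, 3).
populations = excited_population(detunings, rabi_frequencies)

if __name__ == "__main__":
    print(populations.shape)
```

Adding another swept parameter only requires another array axis, so the same expression scales from a single point to a full multi-dimensional scan.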
Xarray has enormous potential as a data model and toolkit for labeled N-D arrays in biology. Originally developed within the geosciences community, it is seeing increased usage in biology, with applications ranging from genomics to image analysis and beyond. However, it has not yet been widely adopted. This presentation will investigate what the blockers have been to wider adoption, showcase the power of Xarray in biology through existing use cases, and present a roadmap for the future of Xarray in biological workflows through recent and upcoming improvements in Xarray.
Synthetic aviation fuels (SAFs) offer a pathway to improving efficiency, but high cost and volume requirements hinder property testing and increase risk of developing low-performing fuels. To promote productive SAF research, we used Fourier Transform Infrared (FTIR) spectra to train accurate, interpretable fuel property models. In this presentation, we will discuss how we leveraged standard Python libraries – NumPy, pandas, and scikit-learn – and Non-negative Matrix Factorization to decompose FTIR spectra and develop predictive models. Specifically, we will review the pipeline developed for preprocessing FTIR data, the ensemble models used for property prediction, and how the features correlate with physicochemical properties.
The “napari-activelearning” plugin provides a framework to fine-tune deep learning models for large-scale bioimage analysis, such as digital pathology Whole Slide Images (WSI). This plugin was developed with the motivation of easing the integration of deep learning tools into bioimage analysis workflows. It implements the concept of Active Learning to reduce the time spent labeling samples when fine-tuning models. Because this plugin is integrated into napari and leverages next-generation file formats (Zarr), it is suitable for fine-tuning deep learning models on large-scale images with little image preparation.
AI assistants are evolving from simple Q&A bots to intelligent, multimodal, multilingual, and agentic systems capable of reasoning, retrieving, and autonomously acting. In this talk, we’ll showcase how to build a voice-enabled, multilingual, multimodal RAG (Retrieval-Augmented Generation) assistant using Gradio, OpenAI’s Whisper, LangChain, LangGraph, and FAISS. Our assistant will not only process voice and text inputs in multiple languages but also intelligently retrieve information from structured and unstructured data. We’ll demonstrate this with a flight search use case—leveraging a flight database for retrieval and, when necessary, autonomously searching external sources using LangGraph. You will gain practical insights into building scalable, adaptive AI assistants that move beyond static chatbots to autonomous agents that interact dynamically with users and the web.
This talk presents zfit, a general-purpose distribution fitting library for complicated model building beyond fitting a normal distribution, along with its newest improvements. The talk will cover all aspects of fitting with a focus on the strong model building part in zfit: composable distributions with sums, products and more, and the building and mixing of binned and unbinned, analytic and templated functions in multiple dimensions. This includes the creation of arbitrary, custom distributions with minimal effort to fulfil everyone's needs.
Thanks to the NumPy-like backend provided by TensorFlow, zfit achieves high performance by using JIT-compiled code on CPUs and even GPUs, a showcase for scientific computing faster than NumPy.
Jupyter Book allows researchers and educators to create books and knowledge bases that are reusable, reproducible, and interactive. Jupyter Book 2 has been rebuilt on a new document engine that prioritizes extensibility, machine readability and flexible deployment, allowing us to create and share interactive computational content in new ways. In this talk, we will introduce Jupyter Book 2.0, demonstrate its game-changing features, and showcase real-world examples like The Turing Way, QuantEcon and Project Pythia. We'll conclude with a live demo, taking a folder of notebooks and markdown files and turning them into a deployable, feature-rich website.
The rapidly evolving Python ecosystem presents increasing challenges for adapting code using traditional methods. Developers frequently need to rewrite applications to leverage new libraries, hardware architectures, and optimization techniques. To address this challenge, the Numba team is developing a superoptimizing compiler built on equality saturation-based term rewriting. This innovative approach enables domain experts to express and share optimizations without requiring extensive compiler expertise. This talk explores how Numba v2 enables sophisticated optimizations—from floating-point approximation and automatic GPU acceleration to energy-efficient multiplication for deep learning models—all through the familiar NumPy API. Join us to discover how Numba v2 is bringing superoptimization capabilities to the Python ecosystem.
This track highlights the fantastic scientific applications that the SciPy community creates with the tools we collectively make. Talk proposals to this track should be stories of how using the Scientific Python ecosystem the speakers were able to overcome challenges, create new collaborations, reduce the time to scientific insight, and share their results in ways not previously possible. Proposals should focus on novel applications and problems, and be of broad interest to the conference, but should not shy away from explaining the scientific nuances that make the story in the proposal exciting.
Women remain critically underrepresented in data science and Python communities, comprising only 15–22% of professionals globally and less than 3% of contributors to Python open-source projects. This disparity not only limits diversity but also represents a missed opportunity for innovation and community growth. This talk explores actionable strategies to address these gaps, drawing from my leadership in Women in AI at IBM, TechWomen mentorship, and initiatives with NumFOCUS. Attendees will gain insights and practical steps to create inclusive environments, foster diverse collaboration, and ensure the scientific Python community thrives by unlocking its full potential.
Today’s quantum computers are far noisier than their classical counterparts. Unlike traditional computing errors, quantum noise is more complex, arising from decoherence, crosstalk, and gate imperfections that corrupt quantum states. Error mitigation has become a rapidly evolving field, offering ways to address these errors on existing devices. New techniques emerge regularly, requiring flexible tools for implementation and testing. This talk explores the challenges of mitigating noise and how researchers and engineers use Python to iterate quickly while maintaining reliable and reproducible workflows.
Reproducibility is a major underpinning of the scientific method. In scientific computing, this also includes the ability to reproduce your dependencies. Yet, in 2025 this still remains a challenging topic.
Pixi is a modern package manager built on the Conda ecosystem. It integrates very well with all existing packages on conda-forge. Pixi makes package management reproducible, fast and painless – so that scientists can go back to coding instead of dealing with “dependency hell”. Pixi improves the mix of Conda and PyPI package management by integrating with uv by astral.sh and streamlines automation with a cross-platform task runner. These features combined with a powerful lockfile make creating reproducible projects trivial.
This talk is for people who are interested in new, fast ways to set up their software (dev) environments on different systems – think your coworker's computer, CI, containers, and more.
In today’s world of ever-growing data and AI, learning about GPUs has become an essential part of software carpentry, professional development and the education curriculum. However, teaching with GPUs can be challenging, from resource accessibility to managing dependencies and varying knowledge levels.
During this talk we will address these issues by offering practical strategies to promote active learning with GPUs and share our experiences from running numerous Python conference tutorials that leveraged GPUs. Attendees will learn different options for providing GPU access, tailoring content for different expertise levels, and simplifying package management when possible.
If you are an educator, researcher, and/or developer who is interested in teaching or learning about GPU computing with Python, this talk will give you the confidence to teach topics that require GPU acceleration and quickly get your audience up and running.
mybinder.org has served millions of scientific Python users for 8 years now! It is an experiment in running open source infrastructure as a public good. Sustainability challenges faced by open source software production are magnified here - we need people's time to manage the infrastructure, funding for the computational infrastructure required to run the service, the ability to operate it reliably by responding to outages in a timely fashion, and the means to fight off abuse from malicious actors. This talk covers the lessons learnt over the years, and the new community-oriented experiments we are trying out now to improve sustainability, functionality and reliability.
The Universe isn't always so quiet: neutron stars, fast radio bursts, and potentially alien civilizations emit bursts of electromagnetic energy - radio transients - into the unknown. In some cases, these emissions, like with pulsars, are constant and periodic; but in others, like with fast radio bursts, they're short in duration and infrequent. Classical detection surveys typically rely on dedispersion techniques and human-crafted signal processing filters to remove noise and highlight a signal of interest. But what if we're missing something?
In this talk we will introduce a workflow to avoid classical processing altogether. By feeding RF samples directly from the telescope's digitizers into GPU computing, we can train an AI model to serve as a detector -- not only enabling real-time performance, but also making decisions directly on raw spectrogram data, eliminating the need for classical processing. We will demonstrate how each step of the pipeline works - from AI model training and data curation to real-time inferencing at scale. Our hope is that this new sensor processing architecture can simplify development, democratize science, and process increasingly large amounts of data in real time.
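For context on the classical step this workflow sets out to replace, here is a minimal sketch of incoherent dedispersion using the standard cold-plasma dispersion delay. It is not the speakers' pipeline; the function names and the list-of-lists spectrogram layout are assumptions made for the example.

```python
# Dispersion constant in ms * GHz^2 * cm^3 / pc.
K_DM_MS = 4.148808


def dispersion_delay_ms(dm, freq_ghz, ref_freq_ghz):
    """Arrival delay (ms) at freq_ghz relative to ref_freq_ghz.

    dm is the dispersion measure in pc cm^-3. Lower frequencies arrive later.
    """
    return K_DM_MS * dm * (freq_ghz**-2 - ref_freq_ghz**-2)


def channel_shifts(freqs_ghz, dm, sample_time_ms):
    """Whole-sample delay of each channel relative to the highest frequency."""
    ref = max(freqs_ghz)
    return [round(dispersion_delay_ms(dm, f, ref) / sample_time_ms) for f in freqs_ghz]


def dedisperse(channels, shifts):
    """Undo each channel's delay and sum the channels into one time series."""
    n_samples = len(channels[0])
    summed = [0.0] * n_samples
    for series, shift in zip(channels, shifts):
        for t in range(n_samples - shift):
            summed[t] += series[t + shift]
    return summed


if __name__ == "__main__":
    freqs = [1.0, 1.5, 2.0]
    shifts = channel_shifts(freqs, dm=100.0, sample_time_ms=1.0)
    print(shifts)
```

A classical search repeats this for many trial dispersion measures and looks for the one that produces the sharpest summed pulse, which is part of the hand-tuned processing the talk proposes to bypass.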
The Issaquah Robotics Society (IRS) has been teaching Python and data analysis to high school students since 2016. Our presentation will summarize what we’ve learned from nine years of combining Python, competitive robotics, and high school students with no prior programming experience. We’ll focus on the importance of keeping it fun, learning the tools, and how to provide useful feedback without making learning Python feel like just another class. We’ll also explain how Python helps us win robotics competitions.
The best way to distribute large scientific datasets is via the Cloud, in Cloud-Optimized formats. But often this data is stuck in archival pre-Cloud file formats such as netCDF.
VirtualiZarr makes it easy to create "Virtual" Zarr datacubes, allowing performant access to huge archival datasets as if they were in the Cloud-Optimized Zarr format, without duplicating any of the original data.
We will demonstrate using VirtualiZarr to generate references to archival files, combine them into one array datacube using xarray-like syntax, commit them to Icechunk, and read the data back with zarr-python v3.
Camera traps are an essential tool for wildlife research. Zamba is an open source Python package that leverages machine learning and computer vision to automate time-intensive processing tasks for wildlife camera trap data. This talk will dive into Zamba's capabilities and key factors that influenced its design and development. Topics will include the importance of code-free custom model training, Zamba’s origins in an open machine learning competition, and the technical challenges of processing video data. Attendees will walk away with a better understanding of how machine learning and Python tools can support conservation efforts.
In Python, data analytics users often prioritize convenience, flexibility, and familiarity over pure performance. The cuDF DataFrame library provides a pandas-like experience with 10x to 50x performance improvements, but subtle differences prevent it from being a true drop-in replacement for many users. This talk will showcase the evolution of this library to provide zero-code change experiences, first for pandas users and now for Polars. We will provide examples of this usage and a high-level overview of how users can make use of these today. We will then delve into the details of how GPU acceleration is implemented differently in pandas and Polars, along with a deep dive into some of the different technical challenges encountered for each. This talk will have something for both data practitioners and library developers.
Working with data in grids or spreadsheets is great for collaboration as there are many different tools to view and edit the files. Data science workflows often include packages like openpyxl to create, load, edit, and export spreadsheets that are then shared with others who can use other tools like Excel, Google Sheets, or IDEs to view them. The new Python in Excel feature and the Anaconda Toolbox add-in provide the tools to run Python directly in cells in a spreadsheet, making it easier for Pythonistas to access and collaborate on code. This talk will introduce how these features work, demo collaborating on Python code in a worksheet, and talk about some case studies where these tools have been used to teach and collaborate with Python.
Jdaviz (https://github.com/spacetelescope/jdaviz) is a Jupyter-based data analysis and visualization tool for astronomical data. In this talk, I will demonstrate recently implemented configurations for visualizing ramps and light curves, and discuss how early design decisions allowed us to extend the tool beyond the original use cases to support these data types. I'll also discuss how UI/UX reviews and user testing have driven the evolution of core features, and the benefits of a Jupyter widget-based design for running the tool in a variety of environments.
We illustrate the power and flexibility of a new extension point in Xarray's data model: "custom indexes" that allow Xarray users to neatly handle complex grids, and enable at least one new data model (vector data cubes). We present a whirlwind tour of specific examples to illustrate the power of this feature, and aim to stimulate experimentation during the sprints.
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
As scientific computing increasingly relies on diverse hardware (CPUs, GPUs, etc) and data structures, libraries face pressure to support multiple backends while maintaining a consistent API. This talk presents practical considerations for adding dispatching to existing libraries, enabling seamless integration with external backends. Using NetworkX and scikit-image as case studies, we demonstrate how they evolved to become a common API with multiple implementations, handle backend-specific behaviors, and ensure robustness through testing and documentation. We also discuss technical challenges, differences in approaches, community adoption strategies, and the broader implications for the SciPy ecosystem.
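The general shape of such a dispatching layer can be sketched in a few lines of standard-library Python. This is an illustrative toy, not the mechanism used by NetworkX or scikit-image: the registry, the `dispatchable` decorator, the `backend=` keyword, and the `"compensated"` backend name are all hypothetical.

```python
import functools
import math
from typing import Callable, Dict

# Hypothetical registry: backend name -> {function name -> implementation}.
_BACKENDS: Dict[str, Dict[str, Callable]] = {}


def register_backend(name: str, implementations: Dict[str, Callable]) -> None:
    """Make a backend's implementations available for dispatch."""
    _BACKENDS[name] = dict(implementations)


def dispatchable(func: Callable) -> Callable:
    """Let callers route a public function to a registered backend."""

    @functools.wraps(func)
    def wrapper(*args, backend=None, **kwargs):
        if backend is None:
            # No backend requested: run the library's reference implementation.
            return func(*args, **kwargs)
        try:
            implementation = _BACKENDS[backend][func.__name__]
        except KeyError:
            raise NotImplementedError(
                f"backend {backend!r} does not implement {func.__name__!r}"
            ) from None
        return implementation(*args, **kwargs)

    return wrapper


@dispatchable
def total(values):
    """Reference implementation: plain left-to-right summation."""
    return sum(values)


# An alternative backend with different numerical behavior.
register_backend("compensated", {"total": math.fsum})

if __name__ == "__main__":
    data = [0.1] * 10
    print(total(data))
    print(total(data, backend="compensated"))
```

Even this toy shows one of the issues raised above: the two implementations can return slightly different floating-point results for the same input, so tests and documentation need to state which behaviors are guaranteed across backends.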
The SciPy Proceedings (https://proceedings.scipy.org) have long served as a cornerstone for publishing research in the scientific Python community, with more than 330 peer-reviewed articles published over the last 17 years. In 2024, the SciPy Proceedings underwent a significant transformation, adopting MyST Markdown (https://mystmd.org) and Curvenote (https://curvenote.com) to enhance accessibility, interactivity, and reproducibility, including the publishing of Jupyter Notebooks. The new proceedings articles are web-first, providing features such as deep-dive links for cross-references and previews of GitHub content, interactive 3D visualizations, and rich rendering of Jupyter Notebooks. In this talk, we will (1) present the new authoring and reading capabilities introduced in 2024; (2) highlight connections to prominent open-science initiatives and their impact on advancing computational research publishing; and (3) demonstrate the underlying technologies, how they enhance integrations with SciPy packages, and how to use these tools in your own communication workflows.
Our presentation will give an overview of the revised authoring process for SciPy Proceedings; how we improve metadata standards in a similar way to code-linting and continuous integration; and the integration of live previews of the articles, including auto-generated PDFs and JATS XML (a standard used in scientific publishing). The peer-review process for the proceedings currently happens using GitHub’s peer-review commenting in a similar fashion to the Journal of Open Source Software; we will demonstrate this process as well as showcase opportunities for working with distributed review services such as PREreview (https://prereview.org). The open publishing pipeline has streamlined the submission, review, and revision processes while maintaining high scientific quality and improving the completeness of scholarly metadata. Finally, we will present how this work connects to other high-profile scientific publishing initiatives that have incorporated Jupyter Notebooks and live computational figures as well as interactive displays of large-scale data. These initiatives include Notebooks Now! by the American Geophysical Union, which focuses on ensuring that Jupyter Notebooks can be properly integrated into the scholarly record, and the Microscopy Society of America’s work on interactive publishing and publishing of large-scale microscopy data with interactive visualizations. These initiatives and the SciPy Proceedings are enabled by recent improvements in open-source tools including MyST Markdown, JupyterLab, BinderHub, and Curvenote, which enable new ways to share executable research content. Collectively, these initiatives aim to improve the reproducibility, interactivity, and accessibility of research by providing improved connections between data, software, and narrative research articles.
By embracing open science principles and modern technologies, the SciPy Proceedings exemplify how computational research can be more transparent, reproducible, and accessible. The shift to computational publishing, especially in the context of the scientific Python community, opens new opportunities for researchers to publish not only their final results but also the computational workflows, datasets, and interactive visualizations that underpin them. This transformation aligns with broader efforts in open science infrastructure, such as integrating persistent identifiers (DOIs, ORCID, ROR) and adopting FAIR (Findable, Accessible, Interoperable, Reusable) principles for computational content. Building on these foundations, as well as open tools like MyST Markdown and Curvenote, provides a scalable model for open scientific publishing that bridges the gap between computational research and scholarly communication, fostering a more collaborative, iterative, and continuous approach to scientific knowledge dissemination.
Neuroscientists record brain activity using probes that capture rapid voltage changes ('spikes') from neurons. Spike sorting, the process of isolating these signals and attributing them to specific neurons, faces significant challenges: incompatible file formats, diverse algorithms, and inconsistent quality control. SpikeInterface provides a unified Python framework that standardizes data handling across technologies and enables reproducibility. In this talk, we will discuss: 1) SpikeInterface's modular components for I/O, processing, and sorting; 2) containerized dependency management that eliminates complex installation conflicts between diverse spike sorters; and 3) parallelization tools optimized for the memory-intensive nature of large-scale electrophysiology recordings.
Extreme weather events threaten industries and economic stability. NOAA’s National Centers for Environmental Information (NCEI) addresses this through the Industry Proving Grounds (IPG), which modernizes data delivery by collaborating with sectors like re/insurance and retail to develop practical, data-driven solutions. This presentation explores IPG’s technical innovations, including implementing Polars for efficient data processing, AWS for scalability, and CI/CD pipelines for streamlined deployment. These tools enhance data accessibility, reduce latency, and support real-time decision-making. By integrating scientific computing, cloud technology, and DevOps, NCEI improves climate resilience and provides a model for leveraging open-source tools to address global challenges.
The elasticity of the Cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories due to the lack of efficient parallel data access. Partitioning these data into even-sized chunks to enable parallel processing requires a complete rewrite to storage, which becomes prohibitively expensive at high volumes. In this article we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, which allows users to define dynamically sized partitions using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage, enabling distributed workers to fetch data partitions on the fly directly from large data blobs, efficiently leveraging the high bandwidth of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (by between 65.5% and 71.31%) without imposing significant overheads.
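The following is a simplified sketch of read-only indexing followed by on-the-fly partitioning. It does not use Dataplug's API. It assumes an uncompressed, newline-delimited blob held in memory, whereas the formats named above (gzipped FASTQ, LiDAR) need format-specific indexes. The function names and the `every` parameter are hypothetical.

```python
"""Sketch: index a blob once, then derive byte-range partitions without rewriting it."""
from typing import List, Tuple


def build_index(blob: bytes, every: int = 2) -> List[int]:
    """Scan once and record the byte offset where every `every`-th record starts."""
    offsets = [0]
    count = 0
    pos = blob.find(b"\n")
    while pos != -1 and pos + 1 < len(blob):
        count += 1
        if count % every == 0:
            offsets.append(pos + 1)  # a record starts right after the newline
        pos = blob.find(b"\n", pos + 1)
    return offsets


def plan_partitions(index: List[int], total_size: int, n_parts: int) -> List[Tuple[int, int]]:
    """Group indexed offsets into at most `n_parts` record-aligned byte ranges."""
    step = -(-len(index) // n_parts)  # ceiling division
    starts = index[::step]
    ends = starts[1:] + [total_size]
    return list(zip(starts, ends))


def read_partition(blob: bytes, part: Tuple[int, int]) -> bytes:
    """Stand-in for a ranged read against object storage."""
    start, end = part
    return blob[start:end]


blob = b"r0\nr1\nr2\nr3\nr4\n"          # five 3-byte records
index = build_index(blob, every=2)      # small, stored separately from the data
partitions = plan_partitions(index, len(blob), n_parts=2)
chunks = [read_partition(blob, p) for p in partitions]
```

Because only the index is written, the number of partitions can be changed per job by calling `plan_partitions` again with a different `n_parts`; the original blob is never modified.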
The SciPy library provides objects representing well over 100 univariate probability distributions. These have served the scientific Python ecosystem for decades, but they are built upon an infrastructure that has not kept up with the demands of today’s users. To address its shortcomings, SciPy 1.15 includes a new infrastructure for working with probability distributions. This talk will introduce users to the new infrastructure and demonstrate its many advantages in terms of usability, flexibility, accuracy, and performance.
Would you rather read a “Climate summary” or a “Climate summary for exactly where you live”? Producing documents that tailor your scientific results to an individual or their situation increases understanding, engagement, and connection. But producing many reports can be onerous.
If you are looking for a way to automate producing many reports, or you produce reports like this but find yourself in copy-and-paste hell, come along to learn how Quarto solves this problem with parameterized reports - you create a single Python notebook, but you generate many beautiful customized PDFs.
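As a rough sketch of what generating many reports from one source can look like, the snippet below builds one `quarto render` command per parameter value. The file name `climate.qmd`, the `city` parameter, and the city list are hypothetical. The sketch assumes the document declares a matching parameter and that the Quarto CLI is installed. By default it only builds the commands and runs nothing.

```python
"""Sketch: one parameterized Quarto document, many rendered PDFs."""
import subprocess
from typing import Dict, List


def render_command(qmd: str, params: Dict[str, str], output: str) -> List[str]:
    """Build a `quarto render` invocation with one -P flag per parameter."""
    cmd = ["quarto", "render", qmd, "--to", "pdf", "--output", output]
    for key, value in params.items():
        cmd += ["-P", f"{key}:{value}"]
    return cmd


def render_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical inputs: one report per city.
cities = ["Austin", "Tacoma"]
commands = [
    render_command("climate.qmd", {"city": city}, f"climate-{city.lower()}.pdf")
    for city in cities
]
render_all(commands)  # dry run: prints the commands without calling Quarto
```

Passing `dry_run=False` would invoke Quarto once per city, producing a separately named PDF each time.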
Open-source projects are intricate ecosystems of humans contributing in diverse ways. These contributions are among the essential elements driving the projects and must be encouraged. The humans behind these contributions play a vital role in forming the lively and diverse community of the project. Both the humans and their contributions must be preserved and handled with the utmost care for the success and evolution of the project.
As with every community, certain best practices should be followed to maintain its health, and certain pitfalls should be avoided. In this talk, I’ll share what I have learned from maintaining the vibrant and wonderful Zarr project and its community over the years.
Real-time machine learning depends on features and data that by definition can’t be pre-computed. Detecting fraud or acute diseases like sepsis requires processing events that emerged seconds ago. How do we build an infrastructure platform that executes complex data pipelines end-to-end and on demand in under 10 ms, all while meeting data teams where they are: in Python, the language of ML?
Learn how we built a symbolic interpreter that accelerates ML pipelines by transpiling Python into DAGs of static expressions. These expressions are optimized in C++ and eventually run in production workloads at scale with Velox, an open-source (~4k stars) unified query engine written in C++ from Meta.
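One common way to turn ordinary Python into a static expression graph is to run the function once on placeholder objects that record operations instead of computing them. The sketch below illustrates that general idea only. It is not the speakers' interpreter, it handles no control flow, and every name in it (`Expr`, `col`, `risk_score`, the column names) is hypothetical.

```python
"""Sketch: trace a Python function into a small expression DAG via operator overloading."""
import operator
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Expr:
    """A node in the expression graph: an operation and its arguments."""
    op: str
    args: Tuple[Any, ...] = ()

    def __add__(self, other): return Expr("add", (self, lift(other)))
    def __mul__(self, other): return Expr("mul", (self, lift(other)))
    def __gt__(self, other): return Expr("gt", (self, lift(other)))


def lift(value: Any) -> Expr:
    """Wrap plain Python values as constant nodes."""
    return value if isinstance(value, Expr) else Expr("const", (value,))


def col(name: str) -> Expr:
    """Placeholder for an input column whose value is only known at run time."""
    return Expr("col", (name,))


def render(e: Expr) -> str:
    """Print the graph as a nested, engine-neutral expression string."""
    if e.op == "col":
        return e.args[0]
    if e.op == "const":
        return repr(e.args[0])
    return f"{e.op}({', '.join(render(a) for a in e.args)})"


_OPS = {"add": operator.add, "mul": operator.mul, "gt": operator.gt}


def evaluate(e: Expr, row: Dict[str, Any]) -> Any:
    """Reference evaluator; a real system would hand the graph to a query engine."""
    if e.op == "col":
        return row[e.args[0]]
    if e.op == "const":
        return e.args[0]
    return _OPS[e.op](*(evaluate(a, row) for a in e.args))


def risk_score(amount, n_recent):
    """User code: plain Python with no knowledge of the graph."""
    return amount * 0.01 + n_recent > 5


# Tracing: call the function once with placeholders to capture its structure.
dag = risk_score(col("amount"), col("n_recent"))
```

Here `render(dag)` yields a static expression that no longer depends on the Python interpreter, which is the property that lets a separate engine optimize and execute it.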
The increasing prevalence of AI models necessitates robust mechanisms to ensure their trustworthiness. This talk introduces a standardized, PKI-agnostic approach to verifying the origins and integrity of machine learning models, as built by the OpenSSF Model Signing project. We extend this methodology beyond models to encompass datasets and other associated files, offering a holistic solution for maintaining data provenance and integrity.
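The integrity half of this idea can be sketched with a manifest of file digests. The snippet below is an illustration only and does not use the OpenSSF Model Signing library or its formats. It hashes every file under a directory, serializes the manifest deterministically, and later reports which files no longer match. Signing the serialized manifest with whatever key infrastructure is in use is deliberately left out. All function and file names are hypothetical.

```python
"""Sketch: digest manifest for a model directory, plus a tamper check."""
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large weights fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's relative path to its digest (models, datasets, configs alike)."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def manifest_bytes(manifest: Dict[str, str]) -> bytes:
    """Deterministic serialization: this is the payload a signer would sign."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()


def changed_files(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Files that were added, removed, or modified relative to the manifest."""
    current = build_manifest(root)
    names = set(current) | set(manifest)
    return sorted(n for n in names if current.get(n) != manifest.get(n))


# Demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "weights.bin").write_bytes(b"\x00\x01\x02")
    (root / "data").mkdir()
    (root / "data" / "train.csv").write_text("x,y\n1,2\n")

    manifest = build_manifest(root)
    payload = manifest_bytes(manifest)
    untouched = changed_files(root, manifest)

    (root / "weights.bin").write_bytes(b"\x00\x01\x03")  # simulate tampering
    tampered = changed_files(root, manifest)
```

Because the manifest covers arbitrary files, the same check applies to datasets and other artifacts shipped alongside the model.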
Napari, an open-source viewer for scientific data, has an inviting and well-established community that encourages contribution to its own project and the broader bioimage analysis community. This talk will explore how napari supports non-traditional contributors—especially those without formal software development experience—through its welcoming community, human-centered documentation, and rich plugin ecosystem.
As someone with a pure biology background, I will share my journey into computational bioimage analysis and the scientific Python world, and contributing to napari's community. By sharing my experience writing a plugin and contributing to the core project, I will show how community-driven projects, like napari, lower barriers to entry, empower scientists, and cultivate a diverse, engaged research and developer community.
Python notebooks are a workhorse of scientific computing. But traditional notebooks have problems — they suffer from a reproducibility crisis; they are difficult to use with interactive widgets; their file format does not play well with Git; and they aren't reusable like regular Python scripts or modules.
This talk presents marimo, an open-source reactive Python notebook that addresses these concerns by modeling notebooks as dataflow graphs and storing them as Python files. We discuss design decisions and their tradeoffs, and show how these decisions make marimo notebooks reproducible in execution and packaging, Git-friendly, executable as scripts, and shareable as apps.
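To illustrate what modeling a notebook as a dataflow graph can mean, the sketch below infers dependencies between cells from the variable names each cell defines and reads, then computes which cells become stale when one changes. It is a toy built on the standard `ast` module, not marimo's implementation. The cell names and helper functions are hypothetical.

```python
"""Sketch: derive a cell dependency graph from variable definitions and references."""
import ast
from typing import Dict, Set, Tuple


def defs_and_refs(source: str) -> Tuple[Set[str], Set[str]]:
    """Names a cell assigns, and names it reads but does not define itself."""
    defs: Set[str] = set()
    refs: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            else:
                refs.add(node.id)
    return defs, refs - defs


def build_graph(cells: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each cell to the cells that read a name it defines."""
    info = {name: defs_and_refs(src) for name, src in cells.items()}
    return {
        a: {b for b, (_, refs_b) in info.items() if b != a and defs_a & refs_b}
        for a, (defs_a, _) in info.items()
    }


def downstream(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Cells that must re-run after `start` changes."""
    stale: Set[str] = set()
    frontier = [start]
    while frontier:
        for child in graph[frontier.pop()]:
            if child not in stale:
                stale.add(child)
                frontier.append(child)
    return stale


# Hypothetical notebook: cell "d" is independent of the others.
cells = {
    "a": "x = 1",
    "b": "y = x + 1",
    "c": "z = y * 2",
    "d": "w = 10",
}
graph = build_graph(cells)
stale_after_a = downstream(graph, "a")
```

Editing cell `a` marks `b` and `c` as stale while leaving `d` alone, so execution order follows the data rather than the order in which cells happened to be run.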
The rapid growth of scientific data repositories demands innovative solutions for efficient metadata creation. In this talk, we present our open-source project that leverages large language models to automate the generation of standard-compliant metadata files from raw scientific datasets. Our approach harnesses the capabilities of pre-trained open-source models, fine-tuned with domain-specific data and integrated with LangGraph to orchestrate a modular, end-to-end pipeline capable of ingesting heterogeneous raw data files and outputting metadata conforming to specific standards.
The methodology involves a multi-stage process where raw data is first parsed and analyzed by the LLM to extract relevant scientific and contextual information. This information is then structured into metadata templates that adhere strictly to recognized standards, thereby reducing human error and accelerating the data release cycle. We demonstrate the effectiveness of our approach using the USGS ScienceBase repository, where we have successfully generated metadata for a variety of scientific datasets, including images, time series, and text data.
Beyond its immediate application to the USGS ScienceBase repository, our open-source framework is designed to be extensible, allowing adaptation to other data release processes across various scientific domains. We will discuss the technical challenges encountered, such as managing diverse data formats and ensuring metadata quality, and outline strategies for community-driven enhancements. This work not only streamlines the metadata creation workflow but also sets the stage for broader adoption of generative AI in scientific data management.
Additional Material:
- Project supported by USGS and ORNL
- Codebase will be available on GitHub after paper publication
- Fine-tuned LLMs will be available on Hugging Face after paper publication
Flyte is a Linux Foundation OSS orchestrator built for data and machine learning workflows, focused on scalability, reliability, and developer productivity. Flyte’s Python SDK, Flytekit, empowers developers by shipping their code from their local environments onto a cluster with one simple CLI command. In this talk, you will learn about the design and implementation details that power Flytekit’s core features, such as “fast registration” and “type transformers”, and a plugin system that enables Dask, Ray, or distributed GPU workflows.
PhD students, postdocs, and independent researchers often struggle when trying to run locally developed code in the cloud or on HPC clusters for better performance. This is even more difficult if they can't count on IT staff to set up the necessary infrastructure on the remote machine, which is common in developing countries. Spyder 6.1 will come with a set of improvements that address this limitation, from automatically setting up a server that runs code remotely on behalf of users to managing remote Conda environments and the remote file system from the comfort of a local Spyder installation.
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.