Monday, July 7, 2025

Tuesday, July 8, 2025

Wednesday, July 9, 2025

Thursday, July 10, 2025

Friday, July 11, 2025

Saturday, July 12, 2025

Sunday, July 13, 2025

07:00

60min

Registration

Ballroom A

08:00

240min

A Hands-on Tutorial towards building Explainable Machine Learning using SHAP, GINI, LIME, and Permutation Importance

Dr. Debarshi Datta, Dr. Subhosit Ray

The advancement of AI systems makes interpretability necessary to address transparency, biases, risks, and regulatory compliance. The workshop teaches core techniques in interpretability, including SHAP (game-theoretic feature attribution), GINI (decision tree impurity analysis), LIME (local surrogate models), and Permutation Importance (feature shuffling), which provide global and local explanations for model decisions. Through hands-on building of interpretability tools and visualization techniques, we explore how these methods enable bias detection and clinical trust in healthcare diagnostics and help develop effective strategies in finance. These techniques are essential for building interpretable AI that addresses the challenges of black-box models.
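As a rough illustration of the "feature shuffling" idea behind permutation importance, the sketch below shuffles one feature column at a time and measures how much the prediction error grows. It is not the workshop's material: the `model` function, the synthetic data, and all helper names are hypothetical, and only the Python standard library is used.

```python
import random
from statistics import mean


def model(row):
    """Hypothetical stand-in for a trained model.

    It depends strongly on feature 0, weakly on feature 1, and not at all on feature 2.
    """
    return 3.0 * row[0] + 0.5 * row[1]


def mean_squared_error(rows, targets):
    """Average squared difference between model predictions and targets."""
    return mean((model(r) - t) ** 2 for r, t in zip(rows, targets))


def permutation_importance(rows, targets, n_repeats=20, seed=0):
    """Return, per feature, the average error increase after shuffling that feature."""
    rng = random.Random(seed)
    baseline = mean_squared_error(rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, which breaks its link to the target.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mean_squared_error(shuffled, targets) - baseline)
        importances.append(mean(increases))
    return importances


# Synthetic, noiseless data (an assumption for illustration): baseline error is zero.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [model(r) for r in rows]

importances = permutation_importance(rows, targets)
for index, value in enumerate(importances):
    print(f"feature {index}: error increase {value:.4f}")
```

A feature the model ignores gets an importance of zero, while features the model relies on more heavily show larger error increases.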

All the SQL a Pythonista needs to know: an introduction to SQL and DataFrames with DuckDB

Guen Prawiroatmodjo, Alex Monahan, Jacob Matson

Structured Query Language (or SQL for short) is a programming language to manage data in a database system and an essential part of any data engineer’s tool kit. In this tutorial, you will learn how to use SQL to create databases, tables, insert data into them and extract, filter, join data or make calculations using queries. We will use DuckDB, a new open source embedded in-process database system that combines cutting edge database research with dataframe-inspired ease of use. DuckDB is only a pip install away (with zero dependencies), and runs right on your laptop. You will learn how to use DuckDB with your existing Python tools like Pandas, Polars, and Ibis to simplify and speed up your pipelines. Lastly, you will learn how to use SQL to create fast, interactive data visualizations, and how to teach your data how to fly and share it via the Cloud.
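The operations listed above (creating tables, inserting rows, filtering, joining, and aggregating) can be sketched in a few statements. To keep the example runnable without installing anything, it uses `sqlite3` from the Python standard library as a stand-in rather than DuckDB; the SQL itself is standard and should carry over largely unchanged. The table names and rows are made up for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
con = sqlite3.connect(":memory:")

# Create two tables and insert a few hypothetical rows.
con.executescript("""
CREATE TABLE speakers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE talks (title TEXT, minutes INTEGER, speaker_id INTEGER);
INSERT INTO speakers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO talks VALUES
    ('Intro to SQL', 240, 1),
    ('Joins in depth', 30, 1),
    ('Window functions', 45, 2);
""")

# Filter: keep only talks of at least 45 minutes.
long_talks = con.execute(
    "SELECT title FROM talks WHERE minutes >= 45 ORDER BY title"
).fetchall()

# Join and aggregate: total minutes per speaker.
minutes_per_speaker = con.execute("""
    SELECT s.name, SUM(t.minutes) AS total_minutes
    FROM talks AS t
    JOIN speakers AS s ON s.id = t.speaker_id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()

print(long_talks)
print(minutes_per_speaker)
```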

Building with LLMs Made Simple

Eric Ma

In this tutorial, you will learn how to integrate Large Language Models (LLMs) directly into Python programs as thoughtfully designed core components of the program rather than bolt-on additions. This hands-on session teaches design principles and practical techniques for incorporating LLM outputs into program control flow. We will use LlamaBot, an open-source Python interface to LLMs, focusing on local execution with efficient models.

Scaling Clustering for Big Data: Leveraging RAPIDS cuML

Allison Ding

This tutorial will explore GPU-accelerated clustering techniques using RAPIDS cuML, optimizing algorithms like K-Means, DBSCAN, and HDBSCAN for large datasets. Traditional clustering methods struggle with scalability, but GPU acceleration significantly enhances performance and efficiency.

Participants will learn to leverage dimensionality reduction techniques (PCA, T-SNE, UMAP) for better data visualization and apply hyperparameter tuning with Optuna and cuML. The session also includes real-world applications like topic modeling in NLP and customer segmentation. By the end, attendees will be equipped to implement, optimize, and scale clustering algorithms effectively, unlocking faster and more powerful insights in machine learning workflows.

The Accelerated Python Developer's Toolbox

Katrina Riehl

As general purpose GPU programming has risen in popularity, many Python programmers have expressed a need to use this technology in their libraries and applications. They soon realize that the GPU landscape is vast and sometimes difficult to traverse for Python users.

In this talk, I will demystify the CUDA-enabled Accelerated Python landscape, focusing on the advantages and disadvantages of popular libraries, the common performance issues encountered, and the best practices for getting the most out of your GPU. Topics include CuPy, numba, nvmath-python, cuDF, and cuML.

This talk is beginner-friendly, but even the most seasoned programmer will gain insight into the Python GPU computing landscape.

Jim Pivarski, Peter Fackeldey

Despite its reputation for being slow, Python is the leading language of scientific computing, which generally needs large-scale (fast) computations. This is because most scientific problems can be split into "metadata bookkeeping" and "number crunching," where the latter is performed by array-oriented (vectorized) calls into precompiled routines.

This tutorial is an introduction to array-oriented programming. We'll focus on techniques that are equally useful in any array library, with a particular focus on NumPy and JAX. You'll work in groups on four class projects: Conway's Game of Life using arrays, iterative computations on arrays, just-in-time (JIT) compilation for the Mandelbrot set, and exploring data in ragged arrays.

Vega-Altair: A Structured Way to Build Interactive Charts

Dylan Wootton, Jon Mease

This tutorial is an introduction to data visualization using the popular Vega-Altair Python library. Vega-Altair provides a simple and expressive API, enabling authors to rapidly create a wide range of interactive charts.

Participants will explore the fundamentals of effective chart design and gain hands-on experience building a variety of visualizations using Vega-Altair's declarative API. Furthermore, this tutorial will introduce users to advanced topics such as data transformations and interaction design. We will finish off by covering practical workflows such as integrating Vega-Altair into dashboarding systems, publishing visualizations, and creating reusable, themed charting libraries. By the end of the session, attendees will have the skills to leverage Vega-Altair for both rapid prototyping and production-ready visualizations in diverse environments.

Tutorials

Room 317

12:00

90min

Lunch

Ballroom A

13:30

240min

3D Visualization with PyVista

Tetsuo Koyama, Alexander Kaszynski, Bane Sullivan

PyVista is a general purpose 3D visualization library used by more than 2,000 open source projects for the visualization of everything from computer aided engineering and geophysics to volcanoes and digital artwork.

PyVista exposes a Pythonic API to the Visualization Toolkit (VTK), providing tooling that is immediately usable without any prior knowledge of VTK. It is being built as the 3D equivalent of Matplotlib, with plugins for Jupyter that enable visualization of 3D data using both server- and client-side rendering.

Building machine learning pipelines that scale: a case study using Ibis and IbisML

Anjali Datta, Deepyaman Datta

Pandas and scikit-learn have become staples in the machine learning toolkit for processing and modeling tabular data in Python. However, when data size scales up, these tools become slow or run out of memory. Ibis provides a unified, Pythonic, dataframe-like interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Ibis enables users to leverage these powerful tools without rewriting their data engineering code (or learning SQL). IbisML extends the benefits of using Ibis to the ML workflow by letting users preprocess their data at scale on any Ibis-supported backend.

In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability after each move during chess games.

Develop Pythonic spreadsheets running Python in and out of the grid

Sarah Kaiser, Jim Kitchen

Spreadsheets are one of the most common ways to share and work with data, and they also work well with Python. In this tutorial, we will cover some of the basics and best practices of consuming and producing spreadsheets in Python, as well as a deep dive into how to run Python directly in your spreadsheets. We will introduce and dive deep into the new Python in Excel features as well as the Anaconda Toolbox for Excel add-in.

Introduction to Data Analysis Using Pandas

Stefanie Molin

Working with data can be challenging: it often doesn’t come in the best format for analysis, and understanding it well enough to extract insights requires both time and the skills to filter, aggregate, reshape, and visualize it. This session will equip you with the knowledge you need to effectively use pandas – a powerful library for data analysis in Python – to make this process easier.

Pandas makes it possible to work with tabular data and perform all parts of the analysis from collection and manipulation through aggregation and visualization. While most of this session focuses on pandas, during our discussion of visualization, we will also introduce at a high level Matplotlib (the library that pandas uses for its visualization features, which when used directly makes it possible to create custom layouts, add annotations, etc.) and Seaborn (another plotting library, which features additional plot types and the ability to visualize long-format data).

Reproducible Machine Learning Workflows for Scientists with pixi

Matthew Feickert, Ruben Arts, John Kirkham

Scientific researchers need reproducible software environments for complex applications that can run across heterogeneous computing platforms. Modern open source tools, like pixi, provide automatic reproducibility solutions for all dependencies while providing a high level interface well suited for researchers.

This tutorial will provide a practical introduction to using pixi to easily create scientific and AI/ML environments that benefit from hardware acceleration, across multiple machines and platforms. The focus will be on applications using the PyTorch and JAX Python machine learning libraries with CUDA enabled, as well as deploying these environments to production settings in Linux container images.

Retrieval Augmented Generation (RAG) for LLMs

Sukhada Kulkarni, Siyu Qian, Xinling Luo, Antoni Liria Sala

Large Language Models (LLMs) have revolutionized natural language processing, but they come with limitations such as hallucinations and outdated knowledge. Retrieval-Augmented Generation (RAG) is a practical approach to mitigating these issues by integrating external knowledge retrieval into the LLM generation process.

This tutorial will introduce the core concepts of RAG, walk through its key components, and provide a hands-on session for building a complete RAG pipeline. We will also cover advanced techniques, such as hybrid search, re-ranking, ensemble retrieval, and benchmarking. By the end of this tutorial, participants will be equipped with both the theoretical understanding and practical skills needed to build a robust RAG pipeline.
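The retrieval step at the heart of a RAG pipeline can be sketched without any LLM at all: score stored documents against the question, keep the best matches, and place them in the prompt. The sketch below is a minimal illustration under stated assumptions, not the tutorial's pipeline. It uses bag-of-words cosine similarity in place of learned embeddings, and the document list, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny hypothetical knowledge base.
DOCUMENTS = [
    "Lunch is served in Ballroom A at 12:00.",
    "Registration opens at 07:00 in Ballroom A.",
    "Tutorials run for 240 minutes.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=2):
    """Assemble the augmented prompt that would be sent to an LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "When does registration open?"
top_document = retrieve(question, DOCUMENTS, k=1)[0]
prompt = build_prompt(question, DOCUMENTS)
print(prompt)
```

In a fuller pipeline, the similarity function would typically be replaced by an embedding model and a vector index, and the prompt would be passed to the generator model.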

The-Silmaril: Practice #ontology engineering with Python (and other languages).

Shaurya Agarwal

Ontologies provide a powerful way to structure knowledge, enable reasoning, and support more meaningful queries compared to traditional data models. Recently, interest in ontologies has resurged, driven by advancements in language models, reasoning capabilities, and the growing adoption of platforms like Palantir Foundry.

In this hands-on tutorial, participants will explore ontology development across multiple domains using a variety of Python-based tools such as rdflib, Owlready2, PySpark, Pandas, and SciPy. They will learn how ontologies facilitate semantic reasoning, improve data interoperability, and enhance query capabilities. Additionally, attendees will build a rudimentary reasoning engine to better understand inference mechanisms. The tutorial emphasizes practical applications and comparisons with conventional data representations, making it ideal for researchers, data engineers, and developers interested in knowledge representation and reasoning.

Tutorials

Room 317

07:00

60min

Registration

Ballroom A

08:00

240min

Building an AI Agent for Natural Language to SQL Query Execution on Live Databases

Cainã Max Couto da Silva

This hands-on tutorial will guide participants through building an end-to-end AI agent that translates natural language questions into SQL queries, validates and executes them on live databases, and returns accurate responses. Participants will build a system that intelligently routes between a specialized SQL agent and a ReAct chat agent, implementing RAG for query similarity matching, comprehensive safety validation, and human-in-the-loop confirmation. By the end of this 4-hour session, attendees will have created a powerful and extensible system they can adapt to their own data sources.

Create Your First Python Package: Make Your Python Code Easier to Share and Use

Tetsuo Koyama, Leah Wasser, Inessa Pawson, Carol Willing, Jeremiah Paige

Python packaging can be overwhelming. However, a trusted, community-vetted workflow can make it easier. In this hands-on workshop, you’ll learn a tested approach developed by the pyOpenSci community and vetted by Python packaging maintainers. You’ll create an installable, maintainable, and citable package using a quickstart template. You’ll also receive step-by-step guidance on publishing to TestPyPI (and resources for conda-forge, and adding a DOI with Zenodo). If you can’t install software on your laptop, you can use GitHub Codespaces to participate in the workshop. Join us to package your Python code confidently and to access ongoing support in our community beyond the workshop.

Create custom image visualization and analysis tools with napari

Draga Doncila Pop, Peter Sobolewski, Tim Monko

With cameras in everything from microscopes to telescopes to satellites, scientists produce image data in countless formats, shapes, sizes, and dimensions. Python provides a rich ecosystem of libraries to make sense of them. napari is a Python library for multidimensional image visualization, but it does double duty as a standalone application that can be easily extended with GUI tools for analysis, visualization, and annotation. In this tutorial, we'll start with the basics of image visualization and analysis in Python, then show how to extend the napari user interface to make analysis workflows as easy as pushing a button, and finally show how to share these extensions as plugins, which can be easily installed by users and collaborators. If you work with images (particularly multidimensional images), and especially if you work with scientists who may not be comfortable with Python, this tutorial might be for you!

Geospatial data visualisation in Python

Adam Symington

The rapid expansion of the geospatial industry and the accompanying increase in availability of geospatial data present unique opportunities and challenges in data science. As the need for skilled data scientists increases, the ability to manipulate and interpret this data becomes crucial. This workshop introduces the essentials of geospatial data manipulation and data visualisation, emphasizing hands-on techniques to transform, analyze and visualise diverse datasets effectively.

Throughout the workshop, attendees will explore the extensive ecosystem of geospatial Python libraries. Key tools include GeoPandas, Shapely, and Cartopy for vector data, and GDAL, Rasterio, and rioxarray for raster data. Participants will also learn to integrate these with popular plotting libraries such as Matplotlib, Bokeh, and Plotly for visualizations.

This tutorial will cover three primary topics: visualizing geospatial shapes, managing raster datasets, and synthesizing multiple data types into unified visual representations. Each section will incorporate data manipulation exercises to ensure attendees not only visualize but also deeply understand geospatial data.

Targeting both beginners and advanced practitioners, the workshop will employ real-world examples to guide participants through the necessary steps to produce striking and informative geospatial visualizations. By the end, attendees will be equipped with the knowledge to leverage advanced data science techniques in their geospatial projects, making them proficient in both the analysis and communication of spatial information.

Network Analysis Made Simple

Eric Ma

Through the use of NetworkX's API, tutorial participants will learn about the basics of graph theory and its use in applied network science. Starting with a computationally-oriented definition of a graph and its associated methods, we will progress through the following concepts: path and structure finding, visualization, and graph storage on disk. We will also offer tutorial participants the option of one advanced topic overview, including the use of graphs alongside LLMs for knowledge retrieval, scalable alternatives to NetworkX including cuGraph, and the use of linear algebraic translation of graph problems to speed up computations.
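A computationally oriented definition of a graph can be as small as a mapping from each node to its set of neighbors, with path finding implemented as a breadth-first search over that mapping. The sketch below uses plain Python rather than NetworkX so that it runs without extra installs; the function names and the example edges are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected graph as a dict mapping each node to its neighbors."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_path(graph, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    previous = {source: None}  # Also serves as the visited set.
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through predecessors to reconstruct the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        # Sorting makes the traversal order deterministic.
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d"), ("f", "g")]
graph = build_graph(edges)
path = shortest_path(graph, "a", "d")
unreachable = shortest_path(graph, "a", "g")
print(path, unreachable)
```

Libraries such as NetworkX provide the same ideas behind a richer API, including node and edge attributes, many more algorithms, and file formats for storing graphs on disk.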

Processing Cloud-optimized data in Python with Serverless Functions (Lithops, Dataplug)

Universitat Rovira i Virgili (Pedro Garcia Lopez), Enrique Molina Giménez

Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset. These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data. They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. In this sense, cloud-optimized data is a good fit for data-parallel jobs using serverless functions. Function-as-a-Service (FaaS) provides a data-driven, scalable, and cost-efficient experience with practically no management burden. Each serverless function reads and processes a small portion of the cloud-optimized dataset in parallel, directly from object storage, which significantly speeds up processing.

In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit. Lithops is a serverless data processing toolkit designed specifically to process data from Cloud Object Storage using serverless functions. We will also demonstrate the Dataplug library, which enables cloud-optimized data management in scientific settings such as genomics, metabolomics, and geospatial data. We will show different data processing pipelines in the cloud that demonstrate the benefits of cloud-optimized data management.

Show your work: Tutorial on building and hosting web applications

Kedar Dabhadkar, Archit Datar

TL;DR
Learn how to turn your Python functions into interactive web applications using open-source tools. By the end, each of us will have deployed a portfolio (or store) with multiple web applications and learned how to reproduce it easily later on.

Tell me more
Work not shown is work lost. Many excellent scientists and engineers are not always adept at showcasing their work. This results in many interesting scientific ideas that have never been brought to light.

However, using today's tools, one no longer has to leave the Python ecosystem to create classy, complete prototypes using modern data visualization and web development tools. With over five years of experience building and presenting data solutions at huge science companies, we show it doesn't have to be challenging. We provide a walkthrough of the primary web application frameworks and showcase Fast Dash, an open-source Python library that we built to address specific prototyping needs.

This tutorial is designed for all data professionals who value the ability to quickly convert their scientific code into web applications. Participants will learn about the leading frameworks, their strengths and limitations, and a decision flowchart for picking the best one for a given task. We will go through some day-to-day applications and hands-on Python coding throughout the session. Whether you bring your use-cases and datasets, or pick from our suggestions, you'll have a reproducible portfolio (app store) of deployed web applications by the end!

Tutorials

Ballroom C

12:00

90min

Lunch

Ballroom A

13:30

240min

(Pre-)Commit to Better Code

Stefanie Molin

Maintaining code quality can be challenging, no matter the size of your project or number of contributors. Different team members may have different opinions on code styling and preferences for code structure, while solo contributors might find themselves spending a considerable amount of time making sure the code conforms to accepted conventions. However, manually inspecting and fixing issues in files is both tedious and error-prone. As such, computers are much more suited to this task than humans. Pre-commit hooks are a great way to have a computer handle this for you.

Pre-commit hooks are code checks that run whenever you attempt to commit your changes with Git. They can detect and, in some cases, automatically correct code-quality issues before they make it to your codebase. In this tutorial, you will learn how to install and configure pre-commit hooks for your repository to ensure that only code that passes your checks makes it into your code base. We will also explore how to build custom pre-commit hooks for novel use cases.
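A custom hook can be little more than a script that receives file paths as command-line arguments and returns a non-zero exit status when a check fails. The sketch below is one hypothetical example, assuming the hook is invoked with the staged file names: it flags leftover `breakpoint()` calls. The function names and the chosen check are illustrative and not taken from the tutorial.

```python
import sys


def find_violations(path, marker="breakpoint()"):
    """Return one message per line of the file that contains the marker."""
    violations = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if marker in line:
                violations.append(f"{path}:{number}: found {marker}")
    return violations


def main(argv=None):
    """Check every given file; return 1 if any violation is found, else 0."""
    paths = sys.argv[1:] if argv is None else argv
    violations = [message for path in paths for message in find_violations(path)]
    for message in violations:
        print(message)
    return 1 if violations else 0


# When used as a standalone script, the last line would typically be:
#     raise SystemExit(main())
# so that the return value becomes the process exit status.
```

Returning the status from `main` rather than exiting inside it keeps the check easy to unit test.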

Bring Accelerated Computing to Data Science in Python

Kevin Lee

As data science continues to evolve, the ever-growing size of datasets poses significant computational challenges. Traditional CPU-based processing often struggles to keep pace with the demands of data science workflows. Accelerated computing with GPUs offers a solution by enabling massive parallelism and significantly reducing processing times for data-heavy tasks. In this session, we will explore GPU computing architecture, how it differs from CPUs, and why it is particularly well-suited for data science workloads. This hands-on lab will dive into the different approaches to GPU programming, from low-level CUDA coding to high-level Python libraries within RAPIDS such as CuPy, cuDF, cuGraph, and cuML.

Building LLM-Powered Applications for Data Scientists and Software Engineers

hugo bowne-anderson, Stefan Krawczyk

This workshop is designed to equip software engineers with the skills to build and iterate on generative AI-powered applications. Participants will explore key components of the AI software development lifecycle through first principles thinking, including prompt engineering, monitoring, evaluations, and handling non-determinism. The session focuses on using multimodal AI models to build applications, such as querying PDFs, while providing insights into the engineering challenges unique to AI systems. By the end of the workshop, participants will know how to build a PDF-querying app, but all techniques learned will be generalizable for building a variety of generative AI applications.

If you're a data scientist, machine learning practitioner, or AI enthusiast, this workshop can also be valuable for learning about the software engineering aspects of AI applications, such as lifecycle management, iterative development, and monitoring, which are critical for production-level AI systems.

Hierarchical Data Analysis with Xarray DataTree & Zarr

Tom Nicholas, Deepak Cherian, Joe Hamman, Eniola Awowale, Ian Hunt-Isaak, Justus Magin, Negin Sobhani, Scott Henderson

Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets. Many real-world datasets have hierarchical or heterogeneous structure and are best organized through groups of related data arrays. Through xarray.DataTree, the xarray data model now supports opening datasets with a hierarchical structure of groups, such as HDF5 files and Zarr stores. This expanded data model is now general enough to manage data across different scientific disciplines, including geosciences and biosciences. This hands-on tutorial focuses on intermediate and advanced workflows using xarray to analyze real-world hierarchical data.

Scaling-up deep learning inference to large-scale bioimage data

Fernando Cervantes Sanchez, Peter Sobolewski

Artificial intelligence has been successfully applied to bioimage understanding and has achieved significant results in the last decade. Advances in imaging technologies have also allowed the acquisition of higher resolution images. That has increased not only the magnification at which images are captured, but also the size of the acquired images. This poses a challenge for deep learning inference on large-scale images, since these methods are commonly applied to relatively small regions rather than whole images. This workshop presents techniques to scale up inference of deep learning models to large-scale image data with the help of Dask for parallelization in Python.

Shiny for Python: Building Production-Ready Dashboards in Python

Daniel Chen

Shiny is a framework for building web applications and data dashboards in Python. In this workshop, you will see how the basic building blocks of Shiny can be extended to create your own scalable, production-ready Python applications.

In particular, this workshop covers:

Overview of the basic building blocks of a Shiny for Python application
How to refactor applications into shiny modules
How to write tests for your shiny application
Deploy and share your application

At the end of this course you will be able to:

Build a Shiny app in Python
Refactor your reactive logic into Shiny Modules
Identify when to write Shiny modules
Write unit tests and end-to-end tests for your shiny application
Deploy and share your application (for free!)

Tutorials

Ballroom D

07:30

90min

Registration and Breakfast

Ballroom

09:00

15min

Opening Notes

Ballroom

09:15

45min

What We Maintain, We Defend

Hon. Kathryn D. Huff, PhD

Scientific Python is not only at the heart of discovery and advancement, but also infrastructure. This talk will provide a perspective on how open-source Python tools that are already powering real-world impact across the sciences are also supportive of public institutions and critical public data infrastructure. Drawing on her previous experience leading policy efforts in the Department of Energy as well as her experience in open-source scientific computing, Katy will highlight the indispensable role of transparency, reproducibility, and community in high-stakes domains. This talk invites the SciPy community to recognize its unique strengths and to amplify their impact by contributing to the public good through technically excellent, civic-minded development.

Keynotes

Ballroom

10:00

30min

SciPy Tools Plenary

Ballroom

10:30

15min

Break

Ballroom

10:45

GBNet