2.0 -//Pentabarf//Schedule//EN

PUBLISH PLUNRN@@cfp.scipy.org

-PLUNRN

A Hands-on Tutorial towards building Explainable Machine Learning using SHAP, GINI, LIME, and Permutation Importance en

20250707T080000 20250707T120000 4.00000

A Hands-on Tutorial towards building Explainable Machine Learning using SHAP, GINI, LIME, and Permutation Importance

The rapid adoption of artificial intelligence (AI) systems across industries has created an urgent need for transparency in algorithmic decision-making. As organizations deploy machine learning (ML) models for critical applications ranging from healthcare diagnostics to financial risk assessment, the opacity of these systems poses significant challenges to accountability, fairness, and regulatory compliance. Contemporary AI systems achieve remarkable predictive accuracy at the cost of interpretability. A 2025 analysis of Fortune 500 companies revealed that 78% of deployed ML models function as black boxes, with decision processes inaccessible even to their developers. Explainable AI (XAI) makes AI-based decision processes comprehensible to human stakeholders and is thus a critical bridge between advanced computational capabilities and ethical implementation. In this workshop, we will explore the technical foundations, methodological innovations, and practical implementations of interpretable ML techniques, with a particular focus on SHAP (SHapley Additive exPlanations), GINI impurity-based analysis, LIME (Local Interpretable Model-agnostic Explanations), and Permutation Importance. Through detailed analysis of real-world applications, theoretical frameworks, and emerging research directions, we demonstrate how these tools enable practitioners to maintain model performance while meeting growing demands for explainability in high-stakes environments. Interpretable AI has great potential to improve transparency by highlighting which features drive a black-box model's decisions. This can increase trust in sectors that directly affect human lives, such as healthcare, finance, and criminal justice. By understanding how an AI model makes decisions, stakeholders can ensure it is fair toward minority classes, such as individuals from protected groups (defined by race, religion, gender, disability, or ethnicity).
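As a rough illustration of one technique named above, permutation importance can be sketched in plain Python: shuffle one feature column at a time and record how much the model's accuracy drops. This is a minimal sketch, not the tutorial's own material. The toy data, the `black_box` model, and the helper names are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        # Shuffle only column j, leaving every other feature untouched.
        column = [row[j] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled, labels))
    return importances


# Hypothetical toy data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]


def black_box(row):
    """Stand-in for a trained model; it only looks at feature 0."""
    return int(row[0] > 0.5)


importances = permutation_importance(black_box, rows, labels, n_features=2)
print(importances)
```

Because the stand-in model ignores feature 1, shuffling that column leaves accuracy unchanged, while shuffling feature 0 degrades it. Only model predictions are needed, which is why the approach works with any trained model.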
This course is intended for data scientists and analysts who want to understand how to interpret black-box models, such as ensemble models, decision trees, and random forests. The course outlines different approaches to interpretability, namely model-dependent and model-agnostic techniques. Model-dependent techniques such as GINI rely on the algorithm of the black-box model. In contrast, model-agnostic techniques such as SHAP, LIME, and Permutation Importance can analyze any model after training. The course will provide a thorough hands-on experience by teaching how to code these methods and work through real-world examples. Emphasis will be placed on numerous visualization techniques, such as interpreting Summary Plots, Beeswarm Plots, Waterfall Plots, and Interaction Feature Maps (Dependency Plots), to understand how each feature individually and their interactions influence the model outcomes. Prior experience in model interpretability is not required. Attendees must possess basic knowledge of ML models to maximize the tutorial's benefits. Familiarity with core Python data science libraries, such as NumPy, Pandas, and Scikit-Learn, is essential. The tutorial will be presented in Jupyter Notebook, enabling participants to follow along, execute examples, and finish exercises independently. A GitHub repository will be available after completion of the workshop, providing instructions for setting up the Python environment and the required packages. By the end of the tutorial, attendees will understand how to work with different interpretability techniques, compare their strengths and weaknesses, and develop a strong foundation for applications in real-world scenarios. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/PLUNRN/ Ballroom A Debarshi Datta Dr. Subhosit Ray PUBLISH ZAKQHP@@cfp.scipy.org

-ZAKQHP

Building machine learning pipelines that scale: a case study using Ibis and IbisML en

20250707T133000 20250707T173000 4.00000

Building machine learning pipelines that scale: a case study using Ibis and IbisML

### Description Tabular data is everywhere. As Python has become the language of choice for data science, pandas and scikit-learn have become staples in the machine learning (ML) toolkit for processing and modeling this data. However, when data size scales up, these tools become unwieldy (slow) or altogether untenable (running out of memory). Ibis provides a unified, Pythonic, dataframe interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Local backends, such as Polars, DuckDB, and DataFusion, perform orders of magnitude faster than pandas while using less memory. Ibis further enables users to scale using distributed backends like Spark or cloud data warehouses like Snowflake and BigQuery without changing their code, giving them the power to choose the right engine for any scale. With Ibis, scientific Python users enjoy the performance of SQL from the comfort/familiarity of Python. IbisML extends the intrinsic benefits of using Ibis to the ML workflow. It lets users preprocess their data at scale on any Ibis-supported backend—users create IbisML recipes defining sequences of last-mile preprocessing steps to get their data ready for modeling. A recipe and any scikit-learn estimator can be chained together into a pipeline, so IbisML seamlessly integrates with scikit-learn, XGBoost (using the scikit-learn estimator interface), and PyTorch (using skorch) models. At inference time, Ibis/IbisML once again takes the feature preprocessing to the efficient backend (instead of having to bring the data to the preprocessor), and user-defined functions (UDFs) enable prediction while minimizing data transfer. This completes an end-to-end ML workflow that scales with data size. In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability at any given move during a chess game. You’ll be using actual, recent games from the largest free chess server (Lichess). 
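The idea of deferred execution behind this design can be illustrated with a toy query builder. The sketch below is not the Ibis API: `LazyTable`, the `moves` table, and its columns are hypothetical, and the standard-library `sqlite3` module stands in for a real execution backend. It only shows the pattern of recording operations in Python and handing a single SQL query to the engine at the end.

```python
import sqlite3


class LazyTable:
    """Toy deferred query: methods record steps, nothing runs until execute()."""

    def __init__(self, name, filters=(), columns=("*",)):
        self.name = name
        self.filters = tuple(filters)
        self.columns = tuple(columns)

    def filter(self, condition):
        # Return a new expression; the backend is not touched here.
        return LazyTable(self.name, self.filters + (condition,), self.columns)

    def select(self, *columns):
        return LazyTable(self.name, self.filters, columns)

    def to_sql(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

    def execute(self, connection):
        # Only now is the whole expression compiled and sent to the engine.
        return connection.execute(self.to_sql()).fetchall()


# Hypothetical in-memory data standing in for a real backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE moves (game_id INTEGER, ply INTEGER, eval REAL)")
conn.executemany(
    "INSERT INTO moves VALUES (?, ?, ?)",
    [(1, 1, 0.2), (1, 2, -0.1), (2, 1, 0.5)],
)

expr = LazyTable("moves").filter("eval > 0").select("game_id", "ply")
sql = expr.to_sql()
result = sorted(expr.execute(conn))
print(sql)
print(result)
```

Since the expression is just a description of the work, the same Python code could in principle be compiled for a different engine, which is the property the abstract attributes to Ibis.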
### Learning Goals During this tutorial, you’ll: - Gain an appreciation for the principles underlying Ibis (deferred execution, unified interface, etc.) and the advantages they bring in different use cases. - Learn the basics of Ibis and apply them to create features from a real-world database. - Learn IbisML constructs (including `Step`s, `Recipe`s, and `Pipeline`s) and apply this knowledge to process features before training your live win probability model. - Observe inference at scale (on a distributed backend) using the same model, gaining an appreciation of how an end-to-end ML workflow that scales is possible. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/ZAKQHP/ Ballroom A Anjali Datta Deepyaman Datta PUBLISH 9Y38WQ@@cfp.scipy.org

-9Y38WQ

Building with LLMs Made Simple en

20250707T080000 20250707T120000 4.00000

Building with LLMs Made Simple

This hands-on tutorial teaches practical integration of Large Language Models (LLMs) into Python programs using LlamaBot and Ollama. Working with locally-run models that fit within 16GB RAM, participants will build a git commit message generator while learning core concepts of LLM application development. Through Jupyter notebooks, we'll progress from basic LLM interactions with SimpleBot to structured outputs using Pydantic, culminating in systematic evaluation and practical deployment. The tutorial emphasizes learning-by-doing: participants will experiment with different models, prompting strategies, and temperature settings to understand their effects on output quality. Key learning outcomes include mastering prompt design, implementing structured generation with schema validation, developing systematic evaluation approaches, and integrating LLM-powered features into existing workflows. The session concludes with a class-chosen discussion on broader implications of LLM applications in practice. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/9Y38WQ/ Ballroom C Eric Ma PUBLISH K3DQD9@@cfp.scipy.org

-K3DQD9

Retrieval Augmented Generation (RAG) for LLMs en

20250707T133000 20250707T173000 4.00000

Retrieval Augmented Generation (RAG) for LLMs

RAG is a rapidly growing field with practical applications in AI-powered search, chatbots, and domain-specific knowledge retrieval. This tutorial provides a structured, hands-on learning experience that helps participants implement more reliable and context-aware AI systems. Target Audience: - Data scientists and ML practitioners working with LLMs. - Engineers building AI-driven search and retrieval applications. Expected Outcomes: - Understand the role of retrieval in improving LLM performance. - Implement a functional RAG pipeline using open-source tools. - Learn advanced retrieval and ranking techniques. - Gain insights into scaling RAG for production use cases. Requirements: - Familiarity with Python and basic NLP concepts. - Laptop with Python 3.8+, Jupyter Notebook, and required libraries installed. - Access to an LLM API (e.g., OpenAI, Llama 2). Outline: Part 1: Introduction to RAG - Overview of Retrieval-Augmented Generation. - Why retrieval is essential for LLMs. - Real-world applications and use cases. Part 2: Breaking Down the RAG Pipeline - Key components of a RAG system: - Document ingestion and chunking. - Embedding models and vector databases. - Retrieval strategies: BM25, dense retrieval, hybrid search. - Response generation with LLMs. - Trade-offs between different retrieval methods. Part 3: Hands-on Implementation - Setting up a basic RAG pipeline with LangChain and FAISS. - Implementing hybrid retrieval for better search results. - Evaluating retrieval and generation quality. Part 4: Advanced RAG Techniques - Re-ranking retrieved documents. - Combining multiple retrievers (ensemble retrieval). PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/K3DQD9/ Ballroom C Sukhada Kulkarni Siyu Qian Xinling Antoni Liria Sala PUBLISH FXCRJW@@cfp.scipy.org

-FXCRJW

Vega-Altair: A Structured Way to Build Interactive Charts en

20250707T080000 20250707T120000 4.00000

Vega-Altair: A Structured Way to Build Interactive Charts

This tutorial introduces attendees to Vega-Altair, a Python library for creating beautiful and interactive charts. Over four hands-on sessions, we’ll explore everything from the basics of chart design to advanced techniques like interactivity and custom theming. Part 1 focuses on the why behind great visualizations, covering principles like chart anatomy, perceptual efficiency, and common pitfalls, followed by a critique of famous Vega-Altair charts. Part 2 dives into the how, teaching participants to map data variables to visual properties using Vega-Altair’s API, with exercises to recreate and redesign charts. Part 3 introduces advanced topics like data transformations and interaction design. Finally, Part 4 covers practical workflows, including exporting charts, integrating with dashboarding tools, and creating custom charting libraries. Each part will include a mix of instruction and exercises. By the end of this tutorial, participants will not only understand the theory behind great visualizations but also have the skills to create them. Participants will be equipped to design effective visualizations, apply advanced techniques like data transformations and interactivity, and integrate their work into real-world workflows and systems. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/FXCRJW/ Ballroom D Dylan Wootton PUBLISH MHNTAD@@cfp.scipy.org

-MHNTAD

3D Visualization with PyVista en

20250707T133000 20250707T173000 4.00000

3D Visualization with PyVista

Our tutorial will demonstrate PyVista's latest capabilities and bring a wide range of users to the forefront of 3D visualization in Python. - Use PyVista to create 3D visualizations from a variety of datasets in common formats. - Overview the classes and data structures of PyVista with real-world examples. - Be familiar with the various filters and features of PyVista. - Know which Python libraries are used and can be used by PyVista (meshio, trimesh, etc.). We see this tutorial catering to anyone who wants to visualize data in any domain, and this ranges from basic Python users to advanced power users. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/MHNTAD/ Ballroom D Tetsuo Koyama Alexander Kaszynski Bane Sullivan PUBLISH MP7C33@@cfp.scipy.org

-MP7C33

Thinking in arrays en

20250707T080000 20250707T120000 4.00000

Thinking in arrays

Array-oriented programming is a paradigm in its own right, challenging us to think about problems in a different way. From APL in 1966 to NumPy and ML libraries today, most users of array-oriented programming are scientists, analyzing or simulating data. This tutorial focuses on the thought process: all of the problems are to be solved in an imperative way (for loops) and an array-oriented way. Matplotlib will be used for plotting, but all plotting commands will be given (not prerequisites). We'll interleave four short lectures with four group projects (3‒4 people each), each followed by their solutions, while the problems are still fresh in mind. Tutors will be available for help during each of the group projects. Here is a general outline: * 0:00‒0:15 (15 min) Lecture 1: Array-oriented programming and its benefits * 0:15‒0:35 (20 min) Project 1: Conway’s Game of Life using arrays * 0:35‒0:45 (10 min) Break * 0:45‒1:00 (15 min) Solutions to project 1 * 1:00‒1:15 (15 min) Lecture 2: Disadvantages of array-oriented programming * 1:15‒1:35 (20 min) Project 2: Iterative computations on arrays * 1:35‒1:45 (10 min) Break * 1:45‒2:00 (15 min) Solutions to project 2 * 2:00‒2:15 (15 min) Lecture 3: JIT-compilation with Numba and JAX * 2:15‒2:35 (20 min) Project 3: JIT-compilation of the Mandelbrot set * 2:35‒2:45 (10 min) Break * 2:45‒3:00 (15 min) Solutions to project 3 * 3:00‒3:15 (15 min) Lecture 4: Ragged and deeply nested arrays * 3:15‒3:35 (20 min) Project 4: Exploring data in ragged arrays * 3:35‒3:45 (10 min) Break * 3:45‒4:00 (15 min) Solutions to project 4 Prerequisites: participants should have a basic familiarity with Python and NumPy, such as the content of the "Introduction to Numerical Computing with NumPy" tutorial. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/MP7C33/ Room 315 Jim Pivarski Peter Fackeldey PUBLISH GDN8PN@@cfp.scipy.org

-GDN8PN

Reproducible Machine Learning Workflows for Scientists with pixi en

20250707T133000 20250707T173000 4.00000

Reproducible Machine Learning Workflows for Scientists with pixi

As artificial intelligence (AI) and machine learning (ML) become a modern part of the scientific toolkit, the need to have robustly reproducible scientific computing environments that support hardware acceleration, e.g. with CUDA, becomes more important. However, historically just installing a working CUDA environment on a single machine, let alone on multiple platforms with different requirements, was considered a particularly difficult and painful task. This led to many scientific machine learning workflows being reliably runnable on only particular machines, and, even worse, with environments that were not reproducible across time. With significant recent advancements by the NVIDIA open source team and the conda-forge open source community, the entire CUDA stack, from compilers to runtime libraries, is now distributed on conda-forge. This significantly reduces the overhead to _install_ CUDA dependencies, but packaging and distribution of binaries alone does not solve the problem of reproducibility. With automatic multi-platform hash-level lock file support for all dependencies that are available on package indexes (like PyPI and conda-forge), highly efficient solving strategies, and high level user interfaces, `pixi` provides a missing piece to the scientific researcher toolkit. With `pixi`, researchers are able to easily specify the hardware acceleration requirements they have, multiple different computational environments needed for their experiments, and the required software dependencies, and then quickly solve for a multi-platform lock file of all the dependencies required, down to the compiler level. This makes it possible to have multiple hardware accelerated environments defined that are able to run AI/ML workflows across heterogeneous machines with different GPU types and CUDA compatibility.
This tutorial will be targeted to scientific researchers who use Python for scientific computing and use hardware accelerated workflows in their research, with a particular focus on AI/ML. No prior expertise with hardware accelerator systems is assumed. The tutorial structure will begin with an introduction to `pixi` as a computational environment manager, and explore how it provides features beyond other more common package managers that might be used for Python dependencies. It will then extend to adding CUDA requirements to `pixi` environments, and provide participants with exercises for solving environments and running simple AI/ML workflows using the PyTorch and JAX machine learning libraries. The tutorial will then move towards more complex environment requirements in later exercises. The tutorial will conclude with examples and exercises focusing on deploying `pixi` workflows to production environments by distributing `pixi` environments in Linux container images. Tutorial participants will code all examples themselves. Participants will also be given time to explore solutions to their own hardware accelerated Python workflows. To make the tutorial more practical and interactive, cloud GPU resources will be requested from industry partners, that will allow for participants to have hardware accelerated resources to run their own examples on. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/GDN8PN/ Room 315 Matthew Feickert Ruben Arts John Kirkham PUBLISH KA7ZYR@@cfp.scipy.org

-KA7ZYR

The Accelerated Python Developer's Toolbox en

20250707T080000 20250707T120000 4.00000

The Accelerated Python Developer's Toolbox

For every Python developer who has integrated CUDA into their codebase, there is probably another that has thrown up their hands and exclaimed, “CUDA is hard!”. In this talk, I hope to dispel this misconception and demystify the Python CUDA landscape. Many advances have been made since the introduction of CUDA and with a little guidance, you will find that using CUDA is easier than it’s ever been before. Learn how to pick which library is suited best for your use case, as well as understand when and if you need to compose your own CUDA kernels without resorting to using C++. In this tutorial we will cover: - What is a GPU and why is it different to a CPU? - An overview of the CUDA development model. - Numba: A high performance compiler for Python. - Writing your first GPU code in Python. - Managing memory. - Working with NumPy-style arrays on the GPU. - Working with low-level math libraries. - Working with Pandas dataframes on the GPU. - Performing some scikit-learn style machine learning on the GPU. This tutorial will introduce how to translate Pythonic ways of thinking into CUDA. No Python developer should be left behind as the adoption of massively parallel GPUs spreads through the software ecosystem. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/KA7ZYR/ Room 316 Katrina Riehl PUBLISH TPGZFY@@cfp.scipy.org

-TPGZFY

Introduction to Data Analysis Using Pandas en

20250707T133000 20250707T173000 4.00000

Introduction to Data Analysis Using Pandas

#### Section 1: Getting Started With Pandas We will begin by introducing the Series, DataFrame, and Index classes, which are the basic building blocks of the pandas library, and showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter data. #### Section 2: Data Wrangling To prepare our data for analysis, we need to perform data wrangling. We will learn how to clean and reformat data (e.g. renaming columns, fixing data type mismatches), restructure/reshape it, and enrich it (e.g. discretizing columns, calculating aggregations, combining data sources). #### Section 3: Data Visualization The human brain excels at finding patterns in visual representations of the data; so in this section, we will learn how to visualize data using pandas along with Matplotlib and Seaborn to help us better understand our data. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/TPGZFY/ Room 316 Stefanie Molin PUBLISH WSSAU7@@cfp.scipy.org

-WSSAU7

Scaling Clustering for Big Data: Leveraging RAPIDS cuML en

20250707T080000 20250707T120000 4.00000

Scaling Clustering for Big Data: Leveraging RAPIDS cuML

Clustering is a fundamental machine learning technique widely used across various industries for applications such as customer segmentation, topic modeling, anomaly detection, and more. However, traditional clustering algorithms such as K-Means and HDBSCAN struggle with large datasets due to their computational complexity. This tutorial aims to provide a comprehensive overview of different clustering algorithms, including K-Means, DBSCAN, and HDBSCAN, and demonstrate how to leverage NVIDIA cuML to accelerate these algorithms, achieving higher performance with minimal code changes. Participants will gain insights into the strengths and use cases of each clustering algorithm, enabling data scientists and developers to select the most appropriate method for their specific needs. By harnessing the power of GPUs, we will showcase how common clustering operations can be dramatically accelerated compared to traditional CPU-only systems, significantly reducing computation time and enhancing scalability. NVIDIA cuML offers an intuitive transition for those familiar with popular clustering algorithms in Python, requiring minimal to no code modifications. We will also cover common debugging and profiling techniques to optimize the performance of your clustering applications, ensuring you can fully exploit the capabilities of GPU acceleration. Beyond clustering algorithms, the workshop will explore dimensionality reduction techniques such as PCA, t-SNE, and UMAP for visualizing clusters, making complex data more interpretable. Additionally, we will explain how to perform hyperparameter tuning with Optuna and cuML for optimal clustering results. Lastly, we will delve into a few real-world use cases, such as customer segmentation and Topic Modeling for Natural Language Processing (NLP), which helps identify hidden patterns and insights within textual data.
By the end of the workshop, attendees will be equipped with the knowledge and tools to implement and optimize clustering algorithms effectively, leveraging GPU acceleration to achieve superior performance. Similar Topic was Presented at Other Events: [Accelerate Clustering Algorithms to Achieve the Highest Performance](https://www.nvidia.com/gtc/session-catalog/?search=&tab.day=20250319&search.pdatasciencep=1699468273429001C6Db&search.pdatasciencep=1699468273429002C9NC&search.pdatasciencep=1736267755885001Dqaq&search.pdatasciencep=1736267755885002DToQ&search.pdatasciencep=1699468273429007CLJy&search.pdatasciencep=1699468273429011CRZA&search.pdatasciencep=1699468273429008Cv99&search.sessiontype=1724089713191001eGXX&search.sessiontype=1701905400491001STQ1&search.sessiontype=16573103299710016qZX&search.suggestedaudiencelevel=1732117107498001nAYg#/session/1734676710101001zmBj) PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/WSSAU7/ Room 317 Allison Ding PUBLISH 3YBVVH@@cfp.scipy.org

-3YBVVH

The-Silmaril: Practice #ontology engineering with Python (and other languages). en

20250707T133000 20250707T173000 4.00000

The-Silmaril: Practice #ontology engineering with Python (and other languages).

**Ontologies** make data more structured, meaningful, and machine-readable. This tutorial will guide participants through building and reasoning over ontologies in various domains, such as movies, music, healthcare, finance, and construction. We will use Python libraries like `rdflib`, `Owlready2`, `PySpark`, `Pandas`, `NetworkX`, and `SciPy` to model domain-specific concepts, relationships, and constraints. We will also explore how ontology-driven reasoning enhances queries beyond standard data representation approaches. This tutorial will cover: - **Introduction to Ontologies**: Basics of OWL, RDF, and SPARQL. - **Building Ontologies**: Developing domain-specific models with Python tools. - **Queries and Reasoning**: Writing SPARQL queries and applying inference. - **Comparison with Other Models**: Evaluating ontologies against relational and graph-based models. - **Developing a Rudimentary Reasoning Engine**: Implementing a simple rule-based system in Python. - **Hands-on Development**: Creating ontologies in up to ten domains, including: - Movies - Music - Supply Chain - Property & Casualty Insurance - Construction - Manufacturing - Stock Market / Equities Trading - Healthcare / EHR + Claims - Pharmaceutical Supply Chain (+ Bonus: Ontology Matching) - EPCC / LEMS / PMBoK-based large construction projects **Target Audience:** Designed for anyone interested in knowledge representation, semantic reasoning, and ontology-driven data modeling. Some familiarity with Python is required, and familiarity with data processing tools like Pandas is helpful; prior ontology knowledge is not needed. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/3YBVVH/ Room 317 Shaurya Agarwal PUBLISH DDV9NQ@@cfp.scipy.org

-DDV9NQ

All the SQL a Pythonista needs to know: an introduction to SQL and DataFrames with DuckDB en

20250707T080000 20250707T120000 4.00000

All the SQL a Pythonista needs to know: an introduction to SQL and DataFrames with DuckDB

In this tutorial, you will… - **Learn how to quack SQL with DuckDB.** We will dedicate the first half of the tutorial to a beginner-friendly introduction to SQL. You'll learn to load data, get a table, filter a table by column names, add a calculated column to your table and handy tools like "group by all". We'll also learn how to work with multiple tables by using joins and subqueries. For the exercises, you can use the example data we provide, or BYO data. - **Learn how to use SQL seamlessly with your favorite Python tools.** Here we’ll talk a bit about why we think DuckDB is so helpful for a Pythonista to have in their toolkit! If SQL is the right tool for the job, DuckDB is fluent in the world’s friendliest SQL dialect. DuckDB can also fit seamlessly into any existing dataframe workflow by reading and writing Pandas, Apache Arrow, and Polars dataframes. You'll learn how to use the DuckDB engine with no SQL in sight using either DuckDB’s relational API, PySpark API or Ibis. - **Run SQL queries in live data viz for on-the-fly analytics.** You'll learn how to use DuckDB in the browser and run SQL queries as part of interactive data visualization, and how to plot data directly from DuckDB with Matplotlib. - **Share your data via the Cloud.** Now that you've mastered the basics of SQL, you will teach your data to fly with Cloud providers such as AWS and MotherDuck. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/DDV9NQ/ Room 318 Guen Prawiroatmodjo Alex Monahan PUBLISH LZWWA3@@cfp.scipy.org

-LZWWA3

Develop Pythonic spreadsheets running Python in and out of the grid en

20250707T133000 20250707T173000 4.00000

Develop Pythonic spreadsheets running Python in and out of the grid

This tutorial will start with a brief overview of spreadsheets, including tools for viewing and working with them. Then we will delve into working with spreadsheets in Python using openpyxl and pandas, including opening, streaming, exporting files, editing cell data, and accessing images and metadata. Flipping things around, the next section of the tutorial will explore using Python directly in one of the most common spreadsheet tools, Excel, using the new Anaconda Toolbox and Python in Excel. We will also dig into the ways you can leverage custom data types and Repers to make working with your data in Excel Pythonic and polished. Participants will walk away with tons of new tools, techniques, and best practices that make it easier for Pythonistas and non-Pythonistas alike to collaborate and build on their work! PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/LZWWA3/ Room 318 Sarah Kaiser Jim Kitchen PUBLISH XUYKZZ@@cfp.scipy.org

-XUYKZZ

Building an AI Agent for Natural Language to SQL Query Execution on Live Databases en

20250708T080000 20250708T120000 4.00000

Building an AI Agent for Natural Language to SQL Query Execution on Live Databases

### Overview Natural Language to SQL systems enable non-technical users to access database insights. This tutorial bridges the gap between theoretical understanding and practical implementation through a RAG-based approach. Participants will build an AI agent that can: 1. Ingest and understand database schemas 2. Retrieve relevant context about tables and relationships 3. Generate accurate SQL from natural language questions 4. Execute queries safely on live databases 5. Present results in an understandable format We'll use the Kaggle dataset "[Brazilian E-Commerce dataset by Olist](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)" as our working example, demonstrating how to handle multiple tables with complex relationships. This dataset will be hosted on an EC2 AWS instance for live interaction during the tutorial. The tutorial balances theoretical foundations with hands-on practice. Participants will start from a repository with backbone code and implement the key components during the session. By the end, attendees will have a working prototype they can adapt to their own datasets. ### Tools and Frameworks This tutorial will leverage modern tools and frameworks for efficient development: **Development Tools:** - pyproject.toml for standardized project configuration - UV for fast, reliable package management - Ruff for comprehensive Python linting and formatting - YAML for configuration management **AI and RAG Frameworks:** - LangChain's agent framework for orchestration - OpenAI models (GPT-4) with examples of alternatives - Vector databases (pgvector) for efficient retrieval **Database Tools:** - SQLAlchemy for database interactions - pandas for data manipulation and analysis - PostgreSQL as the database engine for the live dataset PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/XUYKZZ/ Ballroom A Cainã Max Couto da Silva PUBLISH 8HD89Q@@cfp.scipy.org

-8HD89Q

Scaling-up deep learning inference to large-scale bioimage data en

20250708T133000 20250708T173000 4.00000

Scaling-up deep learning inference to large-scale bioimage data

Methods such as U-Net, Cellpose, Stardist, and even adaptations of the Segment Anything Model for microscopy data, Micro-SAM, have been used extensively by the bioimage analysis community. These methods already offer efficient pipelines for their application in sub-regions of high-resolution images, such as Whole Slide Images (WSI). However, limitations in memory capacity of computer systems restrict their applicability to whole images, requiring manual extraction of image tiles for their individual analysis, and subsequent merging of the inference results. On the other hand, advances in image data management and storage, such as Next Generation File Formats (Zarr), and libraries for parallel computation, such as Dask, enable scaling-up the existing pipelines now without these limitations on memory. In this workshop, techniques to scale-up machine learning inference to large-scale images without manual extraction of tiles will be explored. This workshop is focused on applications in microscopy bioimage analysis, but the techniques learned during this workshop can be applied to any other modality. Finally, because the reviewed techniques use high-level functions from the Dask library, these can be executed on common laptops or High Performance Computing environments, depending on the scale of the analyzed images. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/8HD89Q/ Ballroom A Fernando Cervantes Sanchez Peter Sobolewski PUBLISH RYTBM8@@cfp.scipy.org

-RYTBM8

Show your work: Tutorial on building and hosting web applications en

20250708T080000 20250708T120000 4.00000

Show your work: Tutorial on building and hosting web applications

Scientific analysis often remains hidden within code repositories, limiting its impact and accessibility. This tutorial bridges the gap between analysis and dissemination by demonstrating how to transform Python scientific code into engaging web applications without leaving the Python ecosystem. Drawing from five years of experience developing data solutions for major scientific organizations, we will showcase how modern tools make this process accessible to scientists and engineers regardless of web development background. The tutorial begins with an exploration of why web applications are crucial for scientific communication, highlighting real-world examples where interactive visualization significantly enhanced understanding and engagement with complex data. We'll outline common barriers scientists face when trying to showcase their work and how web applications effectively overcome these challenges. Next, we'll navigate the landscape of Python web application frameworks, comparing Dash, Streamlit, Gradio, Shiny, and Quarto. This section will provide a decision framework to help participants select the appropriate tool based on their specific needs, considering factors like required functionality, development time, and deployment context. Through guided examples, attendees will learn to identify which framework best serves different scientific scenarios. The core of the tutorial focuses on Fast Dash, an open-source Python library we developed specifically for scientific prototyping needs. We'll explain how Fast Dash transforms Python functions into interactive web applications with minimal boilerplate code, emphasizing its advantages for geospatial visualization and scientific data presentation. Using a real-world case study of chlorophyll-a monitoring in New York's inland waters, we'll demonstrate how Fast Dash enabled the creation of an interactive dashboard that dramatically improved data accessibility compared to traditional static reporting methods. 
The hands-on portion guides participants through building their own web applications. Starting with simple data visualization, we'll progress to interactive applications incorporating maps, charts, and user controls. Exercises will cover essential patterns like: - Converting analytical functions to web interfaces - Integrating multiple data sources - Building effective interactive visualizations - Handling user input and filtering - Optimizing performance for larger datasets The final section addresses deployment strategies and best practices for scientific web applications. Participants will learn about hosting options ranging from local development servers to cloud platforms, with specific attention to maintaining scientific integrity while enhancing accessibility. Throughout the tutorial, we emphasize practical applications rather than theory. All examples come from real scientific workflows, demonstrating how interactive web applications can transform complex analyses into accessible tools. The methodology presented is applicable across disciplines, from environmental monitoring to genomics, machine learning, and beyond. Participants will leave with a working knowledge of the Python web application ecosystem, hands-on experience with Fast Dash, and a framework for selecting and implementing the right tools for their scientific communication needs. Most importantly, they'll gain the confidence to showcase their scientific work effectively through interactive web applications, ensuring their valuable research reaches broader audiences and achieves greater impact. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/RYTBM8/ Ballroom C Kedar Dabhadkar Archit Datar PUBLISH 7UJEBD@@cfp.scipy.org

-7UJEBD

Building LLM-Powered Applications for Data Scientists and Software Engineers en

20250708T133000 20250708T173000 4.00000

Building LLM-Powered Applications for Data Scientists and Software Engineers

This workshop is designed to equip software engineers with the skills to build and iterate on generative AI-powered applications. Participants will explore key components of the AI software development lifecycle through first principles thinking, including prompt engineering, monitoring, evaluations, and handling non-determinism. The session focuses on using LLMs to build applications, such as querying PDFs, while providing insights into the engineering challenges unique to AI systems. By the end of the workshop, participants will know how to build a PDF-querying app, but all techniques learned will be generalizable for building a variety of generative AI applications. If you're a data scientist, machine learning practitioner, or AI enthusiast, this workshop can also be valuable for learning about the software engineering aspects of AI applications, such as lifecycle management, iterative development, and monitoring, which are critical for production-level AI systems. **What You'll Learn:** * How to integrate AI models and APIs into a practical application. * Techniques to manage non-determinism and optimize outputs through prompt engineering. * How to monitor, log, and evaluate AI systems to ensure reliability. * The importance of handling structured outputs and using function calling in AI models. * The software engineering side of building AI systems, including iterative development, debugging, and performance monitoring. * Practical experience in building an app to query PDFs using multimodal models. **What is Unique About This Session:** This workshop uniquely bridges the gap between software engineering and generative AI development. While most AI workshops focus solely on model usage or tuning, this session emphasizes the entire AI software lifecycle — from prompt engineering to monitoring and tracing. 
Participants will learn how to manage non-determinism and create production-ready AI applications, giving them the knowledge to tackle the software engineering challenges of AI-powered apps. The hands-on approach ensures that attendees walk away with practical skills and a functional app. **Workshop Prerequisite Knowledge:** * Basic programming knowledge in Python. * Familiarity with REST APIs. * Experience working with Jupyter Notebooks or similar environments (preferred but not required). * No prior experience with AI or machine learning is required. * Most importantly, a sense of curiosity and a desire to learn! If you have a background in data science, ML, or AI, this workshop will help you understand the software engineering side of building AI applications. We will introduce you to certain modern frameworks in the workshop, but the emphasis will be on first principles and using vanilla Python and LLM calls to build AI-powered systems. [All tutorial material will be in this GitHub repository](https://github.com/hugobowne/AI-for-SWEs). PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/7UJEBD/ Ballroom C hugo bowne-anderson Stefan Krawczyk PUBLISH MDJVGA@@cfp.scipy.org

-MDJVGA

Create custom image visualization and analysis tools with napari en

20250708T080000 20250708T120000 4.00000

Create custom image visualization and analysis tools with napari

Just like we take more pictures of food than we will ever look at, scientists are using powerful microscopes, telescopes, satellites, MRI machines and myriad other sensors to produce more images than they can ever look at. These images come in different file formats, they might be 3D, contain a timelapse component, many different channels, or other features that increase the complexity of loading them for visualization. Even when specialised viewers provide ways to load these images and look at them, analyzing them and visualizing the results of these analyses can still be a challenge. This tutorial is aimed at folks who have some experience in scientific computing with Python. To get the most out of it, you should be familiar with NumPy arrays, Jupyter notebooks, and Python scripts. Ideally, you should have some idea of how images can be represented as arrays of numbers, and the types of analyses that might be performed on these arrays e.g. filtering and segmentation. You don’t necessarily need to be familiar with how these tools and methods work - it’s enough to know that they are out there! The tutorial will be split into three main parts, each around an hour to 75 minutes long. Each part will cover a different aspect of how napari can be used to simplify your analysis workflows, and the workflows of your colleagues and coworkers. **Part 1: Using Python and napari to view and analyze imaging data** In this section we will look at opening and viewing 2D, 3D and even 4D images in napari. We will see how different layer types can help you display your analysis results, how Jupyter notebooks can streamline your image processing, and how napari’s plugins can help you access different analyses through the napari viewer. 
**Part 2: Customizing your analysis workflow by extending napari’s functionality** We will teach you how to customize your analysis workflow by adding new keybindings and mouse bindings to napari, and adding event handlers that can listen for different layer and viewer events. Finally, we will show you how easy it can be to add your own GUI widgets with minimal code. **Part 3: Distributing your customized functionality with plugins** Once you’re happy with your customized analysis tools, you may want to distribute them to other colleagues and coworkers, or to napari users at large! This section will cover how to package your custom bits of code into pip-installable napari plugins. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/MDJVGA/ Ballroom D Draga Doncila Pop Peter Sobolewski Tim Monko PUBLISH HEHW8W@@cfp.scipy.org

-HEHW8W

Shiny for Python: Building Production-Ready Dashboards in Python en

20250708T133000 20250708T173000 4.00000

Shiny for Python: Building Production-Ready Dashboards in Python

Shiny is a framework for building web applications and data dashboards in Python. In this one-day workshop, you will see how the basic building blocks of Shiny can be extended to create your own scalable production-ready Python applications. In particular, this workshop covers: - 0-50: Overview of the basic building blocks of a Shiny for Python application - How to refactor applications into Shiny modules - How to write tests for your Shiny application - Deploy and share your application At the end of this course you will be able to: - Build a Shiny app in Python - Refactor your reactive logic into Shiny Modules - Identify when to write Shiny modules - Write unit tests and end-to-end tests for your Shiny application - Deploy and share your application (for free!) The workshop will have both a lecture component and a hands-on live coding practical component. We will work together to build and understand one of the Shiny for Python Dashboard Templates: <https://shiny.posit.co/py/templates/> ### Workshop Breakdown: First Hour: Introduction - :00-:20 Overview of the basic building blocks of a Shiny for Python application - :20-:35 Input components - :35-:50 Output components - :50-1:00 Break Second Hour: Build a more complex app - 1:00-1:35 A more complex application with multiple input and output components - 1:35-1:50 Introduction to Shiny's reactivity programming model. - 1:50-2:00 Break 3rd Hour: Refactoring your application and Shiny Modules - 2:00-2:15 Introduction to Shiny modules - 2:15-2:30 Refactor current app into modules - 2:30-2:50 Import your Shiny Modules into the new application - 2:50-3:00 Break 4th Hour: Testing and deployment - 3:00-3:30 Testing your Shiny apps with Playwright - 3:30-4:00 Deploying your application to the web (for free!) ### Workshop preparation: We will be using Positron in the workshop with the VSCode Shiny extension. You can also use VSCode with the Shiny extension. 
- Positron: <https://positron.posit.co/> - VSCode: <https://code.visualstudio.com/> - Shiny Extension: <https://marketplace.visualstudio.com/items?itemName=Posit.shiny> You will need the following Python packages installed. An example `requirements.txt`: ``` faicons shiny shinywidgets plotly pandas ridgeplot ipykernel ``` ### FAQ 1. What if I'm a complete beginner? - You should have a basic understanding of Python and be able to install packages with pip, do basic data manipulation, and draw plots. 2. What if I've never built a Shiny app before? This workshop doesn’t require any Shiny or web application experience. We'll focus more on practical examples in the course. We do have additional resources for you to dive deeper into Shiny details, but we will cover the basics needed to build larger and scalable applications. 3. Why should I learn Shiny if I already know Streamlit or Dash? We believe that Shiny is the best framework for building data applications in Python. Its reactive execution model means that you can build performant applications without explicitly caching data or managing application state. See [this blog post](https://posit.co/blog/why-shiny-for-python/) for more on why we think that Shiny is worth learning. 4. I already know Shiny for R, is this workshop for me? The R and Python Shiny packages are quite similar, so some of the content in this workshop may be familiar to you. That said, it’s a great opportunity to fill in missing pieces and ask questions about Python best practices. We will also talk about Shiny modules and testing in this workshop, which will also be a precursor for you to learn more or incorporate Python packaging into your Shiny applications. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/HEHW8W/ Ballroom D Daniel Chen PUBLISH 7DDV7V@@cfp.scipy.org

-7DDV7V

Network Analysis Made Simple en

20250708T080000 20250708T120000 4.00000

Network Analysis Made Simple

In this tutorial, we will walk you through what we consider the most practical aspects of graph theory using NetworkX. While graph theory can seem abstract at first, having a computational framework like NetworkX makes it much more approachable. We will start with what we think is the most intuitive way to understand graphs - seeing them as computational objects we can manipulate with code. From there, we will show you how we approach common tasks like finding paths between nodes, analyzing graph structure, and creating visualizations that actually make sense. We will also cover how to store and read graphs to/from disk. Based on our experience working with graphs, we've selected three cutting-edge topics that we think are worth exploring: using graphs with LLMs for knowledge retrieval, scaling up to larger datasets with cuGraph and linear algebra, or an introduction to the use of graphs in deep learning. Tutorial participants will get to choose one of these topics live. This tutorial is structured based on what we wished we knew when we first started working with graphs, and is structured in the order that we believe to be most productive for learning. By the end of the tutorial, participants should be able to productively prototype with graphs immediately! PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/7DDV7V/ Room 315 Eric Ma PUBLISH A8D9Z7@@cfp.scipy.org

-A8D9Z7

Hierarchical Data Analysis with Xarray DataTree & Zarr en

20250708T133000 20250708T173000 4.00000

Hierarchical Data Analysis with Xarray DataTree & Zarr

Xarray is an open-source Python project that makes working with complex, multi-dimensional arrays elegant, intuitive, and efficient. Real-world datasets are often a collection of many related variables on a common grid rather than raw numbers. Such datasets are common in the disciplines of earth science, astronomy, biology, healthcare, infrastructure, and finance. These datasets are more than just arrays of values: they have labels describing how array values map to locations in dimensions such as space and time, and metadata that describes how the data was collected and processed. For example, Pandas-inspired label-based syntax `temperature.sel(place="Boston")` is more intuitive and less error-prone compared to NumPy syntax: `temperature[0]`. Xarray recently gained first-class support for hierarchical data through the release of xarray.DataTree, which can be used to analyze data with hierarchical or heterogeneous structure. The datatree model maps to an entire HDF5 file containing many groups, a structure familiar to scientists across many different domains. This model similarly maps onto a multi-group Zarr Store, which enables data-proximate computation on massive cloud-native data repositories. In this hands-on tutorial, users will work with example data from multiple fields of science (including biology and geosciences) to achieve these learning objectives: - Understand xarray’s core data structures - Named arrays and coordinates (Variable) - Groups of arrays with coordinates (DataArray and Dataset) - Hierarchical trees of related groups (DataTree) - Understand how to map typical xarray computations and workflows over hierarchical data, - Understand which common storage formats correspond to the DataTree model, focusing on HDF5 and Zarr, - Open a public Zarr store in the cloud and manipulate the contents, - Use Dask to parallelize the analysis of large hierarchical datasets. 
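As a minimal sketch of the label-based selection described above, the snippet below compares positional and label-based indexing on a toy `temperature` array. The array name, place names, and values are invented for illustration and do not come from the tutorial material.

```python
import numpy as np
import xarray as xr

# Toy values, invented purely for illustration.
temperature = xr.DataArray(
    np.array([[12.0, 14.5, 13.1], [25.3, 27.0, 26.2]]),
    dims=("place", "time"),
    coords={"place": ["Boston", "Austin"], "time": [0, 1, 2]},
    name="temperature",
)

# Positional indexing: the reader has to remember that row 0 is Boston.
by_position = temperature[0]

# Label-based indexing: the intent is explicit and does not depend on row order.
by_label = temperature.sel(place="Boston")

# Both selections return the same one-dimensional slice along "time".
same = bool((by_position == by_label).all())
```

Both expressions select the same data here, but only the label-based form keeps working if the places are later reordered.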
This hands-on tutorial assumes participants have some familiarity with Jupyter Notebooks, NumPy, Pandas, and Xarray, and focuses on intermediate workflows using hierarchical real-world datasets. All material will be presented in curated Jupyter Notebooks with exercises to solidify understanding of key concepts. Tutorial material is available [online](https://tutorial.xarray.dev/) with instructions for running examples on free hosted infrastructure or on a local computer. No specific scientific domain expertise is required to participate effectively in this tutorial. Example datasets will either be small enough to download locally or available as Zarr stores in public cloud buckets. We encourage participants to review last year’s [tutorial](https://tutorial.xarray.dev/workshops/scipy2024/index.html) prior to attending and bring your questions and enthusiasm to make our 4-hour session as interactive as possible! PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/A8D9Z7/ Room 315 Tom Nicholas PUBLISH Z3VBWR@@cfp.scipy.org

-Z3VBWR

Create Your First Python Package: Make Your Python Code Easier to Share and Use en

20250708T080000 20250708T120000 4.00000

Create Your First Python Package: Make Your Python Code Easier to Share and Use

*📦 Python packaging can feel overwhelming with its ever-evolving tools and options. But what if you had a **trusted, community-approved** workflow to follow?* 📦 Our pyOpenSci packaging workflow is a community-vetted approach that has been taught successfully numerous times to first-time package creators and those who have found packaging frustrating. Our resources are co-developed by the pyOpenSci community and vetted by core packaging and Python maintainers. This means you’ll learn best practices from experts, avoid common pitfalls, and gain confidence using modern tools like Hatch. By following this tested workflow, you’ll create packages that are not only installable but also maintainable and citable. In this workshop, you’ll use the pyOpenSci [community-developed template](https://github.com/pyOpenSci/pyos-package-template), which provides a quickstart template approach to packaging. We will then explain all of the pieces of the package. If you don’t have Hatch installed, you can use our GitHub Codespaces environment to follow along without installing anything on your computer. By the end of the workshop, you’ll: * Have a functional Python package. * Understand how to publish your package to TestPyPI. * Have a strong understanding of the essential components of a pure Python package. * *You’ll also get step-by-step resources that cover*: * How to publish to conda-forge. * How to add a DOI to your package using Zenodo. * How to add a `citation.cff` file to a GitHub repo to increase the visibility of your package’s citation information. We also welcome you to join our vibrant community for continued packaging support beyond the workshop. By the end of the workshop, you’ll clearly understand how to create, customize, and publish a pure Python package for better reusability and citability. 
## Setup & Requirements To get the most out of this tutorial, you should be comfortable: * writing Python code, * using functions, and * using Python environments You should also have: * access to your personal GitHub account and * a computer that can connect to the Internet. If you want to follow along on your computer locally, you need to install Python, [Copier](https://pypi.org/project/copier/), and [Hatch](https://www.pyopensci.org/python-package-guide/tutorials/get-to-know-hatch.html). PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/Z3VBWR/ Room 316 Tetsuo Koyama Leah Wasser Inessa Pawson Carol Willing PUBLISH 87VTR7@@cfp.scipy.org

-87VTR7

(Pre-)Commit to Better Code en

20250708T133000 20250708T173000 4.00000

(Pre-)Commit to Better Code

## Section 1: Setting Up Pre-Commit Hooks After laying the foundation with an overview of Git hooks, we will discuss the use cases for hooks at the pre-commit stage (called pre-commit hooks), as well as a high-level explanation of how to set them up without any external tools. We will then introduce the `pre-commit` tool and disambiguate it from pre-commit hooks, before commencing a detailed walkthrough of the pre-commit hooks setup process when using `pre-commit`. ## Section 2: Creating a Pre-Commit Hook While there are a lot of pre-made hooks in existence, sometimes they aren't sufficient for the task at hand. In this section, we will walk step-by-step through the process of creating and distributing a custom hook. After wiring everything up, we will discuss best practices for sharing, documenting, testing, and maintaining the codebase. --- This tutorial is for anyone with intermediate knowledge of Python and basic knowledge of `git`. You must be comfortable writing Python code and working with `git` on the command line and using basic commands (`git clone`, `git add`, `git status`, `git commit`, `git push`). Attendees should have Python and `git` installed on their computers, as well as a text editor for writing code (e.g., Visual Studio Code). PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/87VTR7/ Room 316 Stefanie Molin PUBLISH WHKNQJ@@cfp.scipy.org

-WHKNQJ

Geospatial data visualisation in Python en

20250708T080000 20250708T120000 4.00000

Geospatial data visualisation in Python

This tutorial will give a broad overview of many of the core concepts in geospatial data science. Attendees will learn the skills needed to manipulate, analyse and plot geospatial data as well as combine geospatial datasets, generate new geospatial data from existing sources, generate insightful geospatial data visualisations and come away from the tutorial with the confidence to seek out new datasets to apply their skills to. When it comes to data visualisation, attendees will be encouraged to express themselves: material will be provided to get them to the point where they can generate their own visualisations without help, but styling the plots will be up to them. This tutorial will provide a high-level overview of geospatial data analysis and visualisation and provide a list of open-source datasets that can be used to practice newly learned skills. Furthermore, the packages used are not all specific to geospatial data visualisation and are applicable to a wide range of scientific and data science problems. As such, it is open to everyone and hopefully beginners, intermediates and experts will all come away with a new skill or two. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/WHKNQJ/ Room 317 Adam Symington PUBLISH 9RH89Y@@cfp.scipy.org

-9RH89Y

Downscaling Satellite-Based Air Quality Maps (NO₂, PM2.5/AOD, CO) using Python and AI/ML en

20250708T133000 20250708T173000 4.00000

Downscaling Satellite-Based Air Quality Maps (NO₂, PM2.5/AOD, CO) using Python and AI/ML

Objectives: - Understand the fundamentals of satellite-based air quality data downscaling. - Acquire skills to preprocess and analyze air quality data (NO₂, PM2.5/AOD, CO). - Hands-on experience building and validating AI/ML models. - Create interactive visualizations for practical air quality assessment. --- Expected Outcomes: Participants will: - Successfully process and manage satellite data for NO₂, PM2.5/AOD, and CO. - Build accurate ML/DL downscaling models using Python. - Produce validated high-resolution maps for multiple air quality parameters. - Develop interactive visualizations to effectively communicate air quality data. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/9RH89Y/ Room 317 Gajendra Deshpande PUBLISH FNUDXC@@cfp.scipy.org

-FNUDXC

Processing Cloud-optimized data in Python with Serverless Functions (Lithops, Dataplug) en

20250708T080000 20250708T120000 4.00000

Processing Cloud-optimized data in Python with Serverless Functions (Lithops, Dataplug)

Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud object storage without needing to download the entire dataset. These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data. They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. For example, Dask can efficiently read data in parallel from Object Storage in CO formats like ZARR. Cloud-optimized formats are now widely used in geospatial settings with entire datasets available in the AWS Registry for Open Data like [Sentinel-2 Cloud Optimized GeoTIFFs](https://registry.opendata.aws/sentinel-2-l2a-cogs/). In this line, COPC (Cloud Optimized Point Cloud) was developed to overcome the limitations of LIDAR. Likewise, Cloud Optimized GeoTIFF (COG) was developed to facilitate cloud processing of GeoTIFF files. Nevertheless, there are no cloud optimized versions of widely used formats in genomics (FASTA, FASTQ, VCF, FASTQGZIP) and metabolomics (imzML). Furthermore, costly preprocessing from legacy formats is required (from GeoTIFF to COG, from LIDAR to COPC). In this talk, we will present a novel data processing library called [Dataplug](https://github.com/CLOUDLAB-URV/dataplug) that enables Cloud-optimized access to legacy formats without costly preprocessing and without large data movements. Dataplug covers legacy formats like LIDAR but also major data formats found in bioinformatics (genomics, metabolomics) that lack appropriate Cloud Optimized alternatives. In this talk, you will learn how to process scientific data formats in Python using the [Dataplug library](https://github.com/CLOUDLAB-URV/dataplug) from any Python data analytics platform like Dask or Ray. We will show different data processing pipelines in the Cloud that demonstrate the benefits of cloud-optimized data management. 
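The general idea behind on-the-fly partitioning can be sketched in plain Python. The snippet below is not the Dataplug API: the function names (`record_offsets`, `partition`, `count_bases`) and the tiny in-memory FASTA blob are hypothetical, and a byte slice stands in for an object-storage range request. It only illustrates how record-aligned byte ranges let each worker read just its own subset of the data.

```python
# Illustrative only: this is NOT the Dataplug API, just the general idea of
# record-aligned byte-range partitioning on a tiny in-memory FASTA blob.
from concurrent.futures import ThreadPoolExecutor

# Toy sequences, invented for illustration.
FASTA = (
    b">seq1\nACGTACGT\n"
    b">seq2\nGGGCCC\n"
    b">seq3\nTTTTAAAA\nCCCC\n"
    b">seq4\nACAC\n"
)


def record_offsets(blob: bytes) -> list[int]:
    """Return the byte offset of every record header ('>' at line start)."""
    offsets = []
    pos = 0
    for line in blob.splitlines(keepends=True):
        if line.startswith(b">"):
            offsets.append(pos)
        pos += len(line)
    return offsets


def partition(offsets: list[int], total_size: int, n_parts: int) -> list[tuple[int, int]]:
    """Group records into at most n_parts contiguous (start, end) byte ranges."""
    per_part = -(-len(offsets) // n_parts)  # ceiling division
    starts = offsets[::per_part]
    ends = starts[1:] + [total_size]
    return list(zip(starts, ends))


def count_bases(byte_range: tuple[int, int]) -> int:
    """Worker: read only its own byte range and count sequence characters."""
    start, end = byte_range
    chunk = FASTA[start:end]  # stands in for an object-storage range request
    return sum(len(line) for line in chunk.splitlines() if not line.startswith(b">"))


ranges = partition(record_offsets(FASTA), len(FASTA), n_parts=2)
with ThreadPoolExecutor() as pool:
    per_partition = list(pool.map(count_bases, ranges))
total_bases = sum(per_partition)
```

Because every range starts on a record header, no worker ever sees a truncated record, and the index of offsets is the only thing that has to be computed ahead of time.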
Furthermore, we will demonstrate how cloud optimized on-the-fly data partitioning is especially suited for serverless data processing toolkits launching thousands of functions in parallel. Serverless Computing introduces a novel paradigm of resource disaggregation and elasticity, enabling ephemeral, burstable distributed services that execute tasks on demand with fine-grained billing. This model is particularly advantageous for data-parallel applications that rely on Object Storage for direct data processing. Lithops stands out as a mature serverless data analytics platform with an active [GitHub community](https://github.com/lithops-cloud/lithops) and extensible architecture. It supports compute and storage backends across major public cloud providers, including Amazon, Google, Azure, IBM, and Oracle. Lithops abstracts infrastructure management entirely. For instance, using lithops.map(func, bucket), users can transparently execute parallel tasks without worrying about resource management. The platform partitions data adaptively and allocates compute resources based on application-specific needs, launching a function for each data chunk. This allows Lithops to scale resources on demand for every map call, in stark contrast to static clusters, which rely on preprovisioned resources at the experiment's outset. The adaptability of Lithops underscores the exceptional suitability of serverless platforms for embarrassingly parallel and stateless tasks. These include data staging, Extract-Transform-Load (ETL) processes, and large-scale data preparation. We will demonstrate how Lithops and Dataplug together can excel in elastic data processing, outperforming cluster computing technologies like Dask. We will also show how to run the same Python code in Lithops on different Cloud providers, thus overcoming vendor lock-in. ## Audience The talk is aimed at Python developers interested in processing data in the Cloud. 
In particular, it may be of interest in the following domains: geospatial data (COG, COPC, LIDAR, ZARR, Kerchunk), genomics data (FASTA, FASTQ, VCF, FASTQGZIP) and metabolomics data (imzML). This talk requires basic understanding of Cloud Object Storage and Serverless Functions. ## Objectives By the end of this tutorial, you will be able to: 1. Understand Cloud-Optimized data formats and their benefits for data processing in the Cloud 2. Learn how to process Cloud Optimized Data from Object Storage in Python using Dask 3. Use the Dataplug library to enable on-the-fly partitioning of Cloud Optimized data (COG, ZARR, COPC). 4. Use the Dataplug library to enable on-the-fly partitioning of non-Cloud Optimized formats (LIDAR, FASTQGZIP, FASTA, FASTQ, VCF, imzML) 5. Understand the Lithops Serverless Data Analytics platform and parallel map APIs 6. Configure Lithops to use AWS Lambda and Amazon S3 7. Create and run a simple Python parallel code that can run in the cloud with hundreds of processes 8. Run the same Python parallel Map code in different Cloud providers (AWS, Google, IBM, Azure) 9. Process massive data in parallel in the Cloud with Python and Dataplug ## Outline Block 1: Cloud Optimized Data and Dataplug Introduction (10 minutes) 1. Understanding Cloud-Optimized data formats and Cloud Object storage 2. Processing Cloud-Optimized data in Dask Processing Cloud-optimized data in the Cloud with Python (90 minutes) 1. Processing COG (Cloud-Optimized GeoTIFFs) in Python in the [NDVI pipeline](https://github.com/cloudbutton/geospatial-usecase/tree/main/ndvi-diff) 2. On-the-fly processing of compressed genomic data (FASTQGZIP) with [Dataplug](https://github.com/CLOUDLAB-URV/dataplug/blob/master/examples/fastqgz_example.py) 3. On-the-fly processing of metabolomics data (imzML) with [Dataplug](https://github.com/CLOUDLAB-URV/dataplug/blob/master/examples/imzml_processed_example.py) 4. 
Comparing LIDAR and COPC processing with Dataplug library in Dask ([code](https://github.com/CLOUDLAB-URV/dataplug/blob/master/examples/lidar_example.py)) Exercises (20 minutes) TBD Block 2: Serverless Data Processing with Lithops Introduction (10 minutes) 1. Overview of [Lithops](https://github.com/lithops-cloud/lithops) 2. Understand [Lithops APIs](https://github.com/lithops-cloud/lithops/blob/master/docs/api_futures.md) and the [storage API](https://github.com/lithops-cloud/lithops/blob/master/docs/api_storage.md). 3. Understand Lithops [backends](https://lithops-cloud.github.io/docs/source/compute_backends.html) and runtimes Serverless Data processing in the Cloud with Lithops (90 minutes) 1. [Configure Lithops](https://github.com/lithops-cloud/lithops/tree/master/config) to use AWS Lambda and S3. 2. Run [Word Count simple example](https://github.com/lithops-cloud/lithops/blob/master/examples/map_reduce_url.py) in the Cloud with Lithops 3. Run Word Count in different Cloud providers (AWS, Google, IBM) 4. Run Word Count in K8s in your own cluster 5. Run Word Count in K8s in Managed K8s services (AWS Fargate, IBM Code Engine) 6. Compare performance in different Clouds (AWS, Google, Azure, IBM) using [Lithops compute and storage benchmarks](https://github.com/lithops-cloud/applications/tree/master/benchmarks) 7. Show speedups with a Python parallel code in the Cloud ([Pi estimation](https://github.com/lithops-cloud/applications/tree/master/montecarlo/pi_estimation)) that can run in the cloud with hundreds of processes. 8. Learn how to combine Lithops and Dataplug for parallel data processing 9. Execute complex pipelines in Lithops (metabolomics, genomics, astronomy) Exercises (20 minutes) TBD PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/FNUDXC/ Room 318 Pedro Garcia Lopez Enrique Molina-Giménez PUBLISH RPN9U9@@cfp.scipy.org

-RPN9U9

Bring Accelerated Computing to Data Science in Python en

20250708T133000 20250708T173000 4.00000

Bring Accelerated Computing to Data Science in Python

In this 4-hour hands-on lab, participants will dive into an end-to-end data science project focused on a fictional epidemic scenario. Using a synthetic dataset of infection rates, population demographics, and mobility patterns, attendees will harness GPU-accelerated tools to process, analyze, and model large-scale data efficiently. The lab is structured into six sections, blending practical coding with key insights into high-performance data science workflows. Section one will take 15 minutes. The lab begins with a presentation on the fundamental concepts of leveraging GPUs for data science workflows. We'll explore the difference between CPU and GPU computing architecture, and delve into the various approaches to GPU programming. Section two will take 60 minutes. In this section, participants will use CuPy, a GPU-accelerated drop-in replacement for NumPy and SciPy, to handle a massive dataset of infection records (e.g., timestamps, locations, and case counts). Attendees will learn to perform array operations, statistical computations, and matrix manipulations at scale, comparing CuPy’s performance against traditional CPU-based NumPy. For example, they’ll calculate infection growth rates across regions, leveraging CuPy’s speed to process millions of data points in seconds. This section emphasizes how GPU parallelism accelerates foundational numerical tasks. Section three will take 60 minutes. In this section, participants will transition to data wrangling using cuDF, a GPU-accelerated alternative to pandas. They’ll load a multi-gigabyte dataset of patient demographics and mobility logs into a cuDF DataFrame, performing operations like filtering, grouping, and joining to identify high-risk populations and infection hotspots. For instance, attendees might aggregate cases by age group or merge mobility data with infection records to trace transmission patterns. This section highlights cuDF’s ability to handle large tabular datasets for analysis and visualization. 
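As a rough illustration of the growth-rate calculation mentioned for section two, here is a minimal sketch written against NumPy so it runs on a CPU. The function name `growth_rates` and the sample numbers are hypothetical and not part of the lab materials. Because the abstract describes CuPy as a drop-in replacement, the assumption is that replacing the import with `import cupy as np` would run the same calls on a GPU.

```python
import numpy as np  # assumption: `import cupy as np` would run the same calls on a GPU


def growth_rates(cases):
    """Day-over-day relative growth per region.

    cases: array-like of shape (n_regions, n_days) holding cumulative case counts.
    Returns an array of shape (n_regions, n_days - 1).
    """
    cases = np.asarray(cases, dtype=np.float64)
    previous = cases[:, :-1]
    increase = np.diff(cases, axis=1)
    # Avoid dividing by zero for regions that had no cases on the previous day.
    safe_previous = np.where(previous > 0, previous, 1.0)
    return np.where(previous > 0, increase / safe_previous, 0.0)


# Hypothetical data: two regions, three days of cumulative counts.
cases = [[100, 110, 121], [0, 5, 10]]
rates = growth_rates(cases)
```

The whole computation is expressed as array operations with no Python-level loop over regions or days, which is the style that benefits from GPU parallelism.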
Section four will take 30 minutes. In this section participants will use cuGraph to build and analyze a contact network. Starting with mobility data, they’ll construct a graph where nodes represent individuals or locations and edges denote interactions. Using cuGraph’s accelerated graph algorithms, attendees will compute metrics like centrality (to identify superspreaders) and shortest paths (to trace transmission chains). This section contrasts cuGraph’s performance with NetworkX, demonstrating how GPU acceleration enables rapid analysis of complex networks, critical for real-time epidemic tracking. Section five will take 60 minutes. In this section, participants will apply machine learning using cuML, a GPU-accelerated counterpart to scikit-learn. They’ll train models like random forests or logistic regression to predict infection risk based on features like age, mobility, and prior case data. Attendees will also explore clustering (e.g., k-means) to segment populations for targeted interventions. This section showcases cuML’s compatibility with scikit-learn workflows while delivering orders-of-magnitude faster training and inference, essential for iterating models on large epidemic datasets. Section six will take 15 minutes. The lab concludes with a discussion on practical considerations for working with large datasets. Topics include memory management (e.g., avoiding GPU memory overflow), data pipeline optimization, and integrating GPU tools into production workflows. Participants will reflect on trade-offs, such as when to use CPU vs. GPU processing, and learn best practices for scaling data science projects to real-world scenarios. PUBLIC CONFIRMED Tutorial https://cfp.scipy.org//scipy2025/talk/RPN9U9/ Room 318 Kevin Lee PUBLISH YC99YM@@cfp.scipy.org

-YC99YM

Breaking New Ground: Scalable Simulation-Based Inference at the LHC with Scientific Python en

20250709T104500 20250709T111500 0.03000

Breaking New Ground: Scalable Simulation-Based Inference at the LHC with Scientific Python

High-energy physics experiments, such as ATLAS at the Large Hadron Collider (LHC), rely on advanced simulation software to accurately model the dense and complex physical interactions occurring within particle detectors. These simulators generate particle collision events using an implicit likelihood model that is analytically intractable. Traditionally, statistical inference in the presence of intractable models relies on dimensionality reduction techniques, which result in a loss of measurement sensitivity. Simulation-Based Inference (SBI) is an emerging new set of techniques that use deep-learning to construct surrogate models for the analytically intractable likelihood functions directly using high-dimensional data. This leads to significantly improved precision in the measurements of the parameters of the model being tested, making the new set of techniques promising for searches of new physics at the energy frontier. While powerful, these techniques are also computationally demanding and have thus remained inaccessible for full measurements at the LHC since their first proposal in 2015. However, in 2024, using SBI techniques, we published the most precise measurement of the Higgs boson lifetime using ATLAS data. For this application, several novel ideas were developed that not only scale better to a full LHC measurement, but also offer efficient computational workflows using the Scientific Python ecosystem. The technique and workflow developed are described in four publications, and apart from offering a deeper insight on the Higgs boson properties, have opened up the applicability of SBI techniques for many other precision measurements at the LHC. In this talk, I will describe the significant role that the open-source Scientific Python ecosystem played in the development of the full framework at the ATLAS experiment. Using a live demo, I will demonstrate how we went from ideas to a working at-scale implementation, using tools like TensorFlow and JAX. 
I will highlight the various challenges, both foreseen and unforeseen, that we encountered along the way and how the versatility and power of the Python tools helped in overcoming each of them and making the final measurement of the Higgs boson lifetime using the full ATLAS model possible. This talk will be of interest to people doing precision measurements in a complex likelihood-free setting, which could include any domain from particle physics to biology. Attendees will primarily learn about the versatility of libraries like JAX for efficient statistical inference, including the use of just-in-time compilation, auto-vectorization, and auto-differentiation. Those outside this target audience will learn how the various tools available in the Scientific Python ecosystem are being used in cutting-edge research and what future developments can help tackle emerging challenges. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/YC99YM/ Ballroom Jay Sandesara PUBLISH LHMFPL@@cfp.scipy.org

-LHMFPL

Burning fuel for cheap! Transport-independent depletion in OpenMC en

20250709T112500 20250709T115500 0.03000

Burning fuel for cheap! Transport-independent depletion in OpenMC

Neutrons can induce nuclear fission in fissile nuclides like 235U. The fission reaction causes the nucleus to break apart, releasing both energy and new nuclides, many of which are radioactive isotopes of smaller elements. In nuclear reactors, which are fueled by fissionable material, this process is referred to as depletion, or burnup. Nuclear engineers model this process to design and license new reactors. Depletion can affect reactor physics and performance, and determines when the fuel must be shuffled or replaced. The typical approach to modeling depletion requires solving the neutron transport equation to obtain the neutron flux, which changes reaction rates in the fuel. These reaction rates inform the material composition at the next time step. This process is repeated each time step and is called transport-coupled depletion. OpenMC is an open source neutron transport code with a built-in depletion module. OpenMC solves the transport equation via Monte Carlo particle transport, which is accurate but expensive. OpenMC's depletion module was recently extended to enable depletion modeling without solving the transport equation at each time step. Instead, the transport equation is solved once. From this solution, the flux of neutrons and their cross sections are discretized in energy and normalized. These are called multigroup fluxes and cross sections, respectively, and are used during every time step of the depletion calculation. The process is called transport-independent depletion. This method is accurate for the first time step, but degrades at further time steps. Testing with a simple model indicates that the error with respect to transport-coupled depletion depends on the nuclide of interest. 
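The contrast between the two approaches can be sketched with a toy model. This is not OpenMC's API or its numerical method: it tracks a single nuclide in one energy group, and every value and function name is hypothetical. The fixed-flux function reuses the flux from one "transport solve" at every step. The updated-flux function rescales the flux each step to hold the reaction rate constant, which is only an assumed stand-in for a fresh transport solve at constant power.

```python
import math


def deplete_fixed_flux(n0, sigma, flux, dt, steps):
    """Transport-independent toy: the same flux is reused at every time step."""
    n = n0
    history = [n]
    for _ in range(steps):
        n *= math.exp(-sigma * flux * dt)
        history.append(n)
    return history


def deplete_updated_flux(n0, sigma, flux0, dt, steps):
    """Transport-coupled toy: the flux is recomputed before every time step.

    Assumption for illustration only: the new flux keeps the reaction rate
    sigma * flux * n equal to its initial value.
    """
    target_rate = sigma * flux0 * n0
    n = n0
    history = [n]
    for _ in range(steps):
        flux = target_rate / (sigma * n)
        n *= math.exp(-sigma * flux * dt)
        history.append(n)
    return history


# Hypothetical values: normalized density, cross section in cm^2,
# flux in n/cm^2/s, and time step in seconds.
N0, SIGMA, FLUX, DT, STEPS = 1.0, 1.0e-24, 1.0e14, 5.0e8, 5

fixed = deplete_fixed_flux(N0, SIGMA, FLUX, DT, STEPS)
updated = deplete_updated_flux(N0, SIGMA, FLUX, DT, STEPS)
rel_error = [abs(f - u) / u for f, u in zip(fixed, updated)]
```

In this toy, the two histories agree after the first step and the relative difference grows with each later step, which mirrors the qualitative behavior described above.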
In this talk, I will cover: - a brief background on the physics and mathematics of depletion, - the accuracy of transport-independent depletion compared to transport-coupled depletion, - and two applications where transport-independent depletion has been used to great effect: shutdown dose-rate calculations for fusion energy applications, and fast depletion for fuel cycle analysis. The intended audience for this talk is people interested in nuclear reactor systems and open source software. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/LHMFPL/ Ballroom Oleksandr Yardas PUBLISH 97EQ39@@cfp.scipy.org

-97EQ39

An introduction to the JAX scientific ecosystem en

20250709T131500 20250709T134500 0.03000

An introduction to the JAX scientific ecosystem

Modern deep learning frameworks have provided a plethora of tools for numerical computing, offering advanced functionality like autodifferentiation, autoparallelism, and GPU support. Many of these new functionalities are of great use for scientific modelling! For example, stiff differential equation solvers (for biochemical reactions, ...) benefit from autodifferentiation to compute Jacobians, whilst large-scale simulations (weather, astrophysics, ...) benefit from autoparallelism across GPU clusters. This fact has already become well-appreciated by the scientific modelling community, and there is now a substantial effort underway to build the necessary tools in these modern deep learning frameworks. JAX offers an excellent computational framework for such efforts: it offers a `jax.numpy` API which provides an easy onboarding experience to users of existing NumPy-based libraries, whilst its autodifferentiation and autoparallelism tools are best-in-class. This talk offers an introduction to the JAX scientific ecosystem, which is a well-developed effort of this sort, offering many of the necessary numerical computing primitives for scientific modelling: libraries like Diffrax offer solvers for ODEs, SDEs, some PDEs, and so on, whilst libraries like Optimistix offer solvers for nonlinear problems like root finding or nonlinear least squares. The primitives of the ecosystem are generally at the same level as the tools provided by SciPy, and so we often refer to this effort as 'autodifferentiable GPU-capable SciPy'. There is a further benefit to this approach: many deep learning ideas are themselves finding direct use in scientific problems -- modern scientific models often feature neural networks somewhere in them! -- and so we will also discuss libraries like Equinox for expressing neural networks, or Optax for first-order gradient methods for minimisation problems. 
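To make concrete why autodifferentiation is useful to numerical solvers such as root finders, here is a minimal sketch in plain Python. It does not use JAX, Optimistix, or any other library named above; the `Dual` class and `newton_root` function are hypothetical names introduced only for illustration. It uses forward-mode differentiation with dual numbers so that Newton's method gets an exact derivative without the user writing one by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value together with its derivative (forward-mode autodifferentiation)."""

    value: float
    deriv: float = 0.0

    @staticmethod
    def lift(other):
        # Treat plain numbers as constants with zero derivative.
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.value + o.value, self.deriv + o.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.value - o.value, self.deriv - o.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.value * o.value, self.value * o.deriv + self.deriv * o.value)

    __rmul__ = __mul__


def newton_root(f, x0, steps=20):
    """Newton's method where f'(x) comes from evaluating f on a Dual number."""
    x = float(x0)
    for _ in range(steps):
        y = f(Dual(x, 1.0))  # seed derivative dx/dx = 1
        x -= y.value / y.deriv
    return x


# Solve x^2 - 2 = 0 starting from x = 1.
root = newton_root(lambda x: x * x - 2.0, 1.0)
```

The user supplies only the function itself; the derivative needed by the solver falls out of evaluating that same function on `Dual` inputs.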
All libraries are available as permissively-licensed open-source projects on GitHub; we refer to https://github.com/patrick-kidger/equinox (2.3k stars) as a starting point, which offers further links to the rest of the ecosystem within. The intended audience for this talk are those already familiar with NumPy and SciPy, but no familiarity with JAX will be assumed. By the end the audience will have gained a basic familiarity with JAX, its abstractions, the suite of NumPy- and SciPy-like tools available within it, and several examples of these being applied to existing domain-specific problems. We expect that they will then feel sufficiently empowered to go forward and deploy this on their own problems! PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/97EQ39/ Ballroom Patrick Kidger PUBLISH LQNTGC@@cfp.scipy.org

-LQNTGC

Python is all you need: an overview of the composable, Python-native data stack en

20250709T135500 20250709T142500 0.03000

Python is all you need: an overview of the composable, Python-native data stack

Over the past year, integrations across the Python data ecosystem unlocked a cohesive vision for the composable Python analytics stack. In this talk, we will begin with a brief overview of the modern data stack, what it offers, and why the idea became so prevalent in data engineering. This will provide a baseline for the capabilities that a useful analytics stack should have. Next, we will introduce Ibis, the data processing workhorse of the Python analytics stack. Ibis is a portable Python dataframe library that supports 20+ execution backends, from local computing engines like Polars and DuckDB to distributed cloud data warehouses like Snowflake and BigQuery. Crucially, its deferred execution model makes it perfect for large-scale in-database data transformation, much like SQL. While we won't weigh in on the never-ending Python versus SQL debate (SQL is prevalent and effective and benefits from a mature data engineering tooling ecosystem), we will cover a few relevant advantages of using Python and specifically Ibis, including portability and suitability for other (e.g. data science, machine learning, and AI) workloads. Then, we will present two key components of the emerging stack: Kedro as the core transformation framework and Pandera for data validation. In both cases, we will highlight how Ibis extends the capabilities of the existing, established tool. Kedro gained popularity as a framework for authoring production-ready data science pipelines. While it has also been used by data engineers since its inception, most data engineering use cases were Spark-based or small data. However, integrating Ibis enabled building data engineering pipelines that scale. Furthermore, it complemented Kedro's concept of dev/prod parity; the exact same code could now be tested locally and deployed in production with just a difference in configuration. We'll demonstrate a Kedro port of the Jaffle Shop project as an example pipeline leveraging the Ibis integration. 
(The Jaffle Shop is the canonical dbt starter project.) Pandera is a lightweight Python data validation library. It already supported a variety of dataframe backends, including pandas, PySpark, and, most recently, Polars. By adding support for Ibis, we extended Pandera's flexible and expressive data testing API to execute natively on the database. We'll show what this looks like by adding validation to our Jaffle Shop pipeline. Finally, we will step back and look at the remaining pieces of the composable analytics stack. We will fill out the picture with Python-native recommendations for ingestion (dlt) and orchestration (Dagster). We will also be transparent about some of the current gaps compared to the more established SQL-first approach and ongoing work to address them. Attendees don't need previous experience with the modern data stack or any of the aforementioned technologies; those who already understand some of the usual limitations of using Python for data engineering workflows will benefit from learning how they can overcome these challenges, while others less familiar will be introduced to popular frameworks like Ibis, Kedro, and Pandera that they can explore and start to use in their day-to-day work or side projects. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/LQNTGC/ Ballroom Deepyaman Datta PUBLISH PBLESZ@@cfp.scipy.org

-PBLESZ

cuTile, the New/Old Kid on the Block: Python Programming Models for GPUs en

20250709T143500 20250709T150500 0.03000

cuTile, the New/Old Kid on the Block: Python Programming Models for GPUs

## Context In block-based programming models, you write seemingly sequential functions that operate on small, local arrays that subdivide your inputs. These functions are then invoked concurrently on multiple instances. Each instance has a group of threads associated with it, and array operations are parallelized across those threads. Concurrency and data movement within groups of threads are implicit and abstracted away, in contrast to models like [SIMT](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-simt-programming-model) where users must explicitly [synchronize and coordinate threads](https://developer.nvidia.com/blog/cooperative-groups/) and [tensor cores](https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/), [pipeline loading of data](https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/), [account for memory coalescing](https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/), etc. Block-based programming has been a staple in numerical and scientific computing for decades. Examples include [NWChem's Tensor Contraction Engine](https://nwchemgit.github.io/TCE.html), [BLIS](https://github.com/flame/blis), and [ATLAS](https://math-atlas.sourceforge.net/). Block-based programming is a form of array programming, and draws inspiration from languages and frameworks such as [APL](https://www.softwarepreservation.org/projects/apl/Books/APROGRAMMING%20LANGUAGE), [MATLAB](https://www.mathworks.com/company/technical-articles/a-brief-history-of-matlab.html), and [NumPy](https://numpy.org/). Recently, there has been an explosion of interest in block-based Python programming models targeting GPUs, driven by the machine learning community. 
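The shape of a block-based program can be sketched in plain Python. This is a conceptual illustration only: it is not the API of any framework discussed in this talk, and `add_tile` and `launch_tiles` are hypothetical names. The kernel is a seemingly sequential function over one small local array, and a launcher invokes it once per block of the inputs. On a GPU those invocations would run concurrently; here they run one after another.

```python
def add_tile(a_tile, b_tile):
    """Kernel: seemingly sequential code operating on one small local block."""
    return [x + y for x, y in zip(a_tile, b_tile)]


def launch_tiles(kernel, a, b, tile_size):
    """Subdivide the inputs into blocks and invoke the kernel once per block.

    A real block-based runtime would run these instances concurrently and
    spread each block's array operations across a group of threads; this
    loop is a sequential stand-in.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    for start in range(0, len(a), tile_size):
        stop = start + tile_size  # the final block may be shorter
        out.extend(kernel(a[start:stop], b[start:stop]))
    return out


result = launch_tiles(add_tile, list(range(10)), [1] * 10, tile_size=4)
```

The kernel never mentions threads, synchronization, or global indices. It only sees its own block, which is the property that distinguishes this style from explicit thread-level programming.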
Many new Python frameworks have been developed, such as [Triton](https://openai.com/index/triton/), [JAX/Pallas](https://docs.jax.dev/en/latest/pallas/index.html), and [Warp](https://nvidia.github.io/warp/modules/tiles.html). In March 2025, NVIDIA announced [a new block-based dialect (cuTile) and compiler stack for CUDA (Tile IR)](https://x.com/blelbach/status/1902113767066103949). ## Motivation This trend towards Pythonic block-based models for GPU programming is due to a variety of factors: - More and more scientists are programming GPUs, including those who are not experts in concurrency and hardware performance. - Block-based code is simpler to design, write, and debug for data-parallel GPU applications. - Compilers can reason about block-based programs without more complex and brittle analysis. - Array-centric paradigms are more intuitive for Python developers familiar with [NumPy](https://numpy.org/). - Block-based GPU frameworks offer better portability even as GPU architectures change more and more between generations. - Block-based models significantly simplify programming machine learning acceleration technology like [tensor cores](https://www.nvidia.com/en-us/data-center/tensor-cores/). Simply put, more scientists have to use GPUs and GPU technology is evolving rapidly, creating a need for higher-level and more portable paradigms. ## Results We'll present the recently announced [cuTile and Tile IR](https://x.com/blelbach/status/1902113767066103949). cuTile is a new block-based programming model for NVIDIA's CUDA platform. It is implemented with Numba and a novel compiler stack and intermediate representation called Tile IR. We will reveal further details about these new technologies for the first time during this SciPy session. We'll explore the use of block-based models for a variety of domains through examples from HPC, data science, and machine learning. 
We'll show a new reference large-language-model GPU application based on [LLAMA3](https://github.com/meta-llama/llama3) and implemented with a variety of different block-based GPU programming frameworks, including cuTile, as well as in traditional SIMT. We'll also present a port of the popular [miniWeather HPC mini-app](https://github.com/mrnorman/miniWeather) to Python and cuTile. We'll analyze performance results and discuss the tradeoffs between programming models. By attending this talk, you will: - Learn the best practices for writing block-based Python applications for GPUs. - Gain insight into the performance of block-based Python GPU code and how it actually gets executed. - Discover how to reason about and debug block-based Python GPU applications. - Understand the differences between block-based and SIMT programming and when each paradigm should be used. - Dive into real examples of block-based Python GPU code. - Explore NVIDIA's new cuTile and Tile IR for the first time. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/PBLESZ/ Ballroom Bryce Adelstein Lelbach PUBLISH DSYLVL@@cfp.scipy.org

-DSYLVL

Edge processing of X-ray ptychography: enabling real-time feedback for high-speed data acquisition en

20250709T152500 20250709T155500 0.03000

Edge processing of X-ray ptychography: enabling real-time feedback for high-speed data acquisition

1. Background The discovery of next-generation materials relies on understanding the structure-function relationships in materials across various length and time scales under realistic conditions. Microscopic imaging is fundamental for visualizing material structures and behaviors. Among the many microscopy modalities, hard X-ray ptychography stands out for delivering high-resolution information of large sample volumes with high detection sensitivity through phase imaging. Advances in accelerator technology, X-ray optics, detectors, and data acquisition methods have made modern X-ray ptychographic experiments increasingly accessible to scientists across multiple domains. However, the growing volumes of data generated at ever-increasing rates surpass traditional data processing methods, which often depend on data transfer to a disk and offline post-analysis. These delays inhibit decision-making, reducing the throughput and quality of data acquired during experiments. Here, we demonstrate how efficient GPU-based reconstruction algorithms deployed at the edge enable real-time feedback in high-speed continuous data acquisition experiments, paving the way for AI-augmented autonomous microscopic experimentation. 2. Methods The edge processing pipeline was developed and deployed at the Hard X-ray Nanoprobe (HXN) 3-ID beamline of the National Synchrotron Light Source II at Brookhaven National Laboratory. The beamline, designed for multimodal microscopy experimentation, features spatial resolution down to 12 nm and combines structural and elemental information from samples. For ptychographic experiments, the beamline employs rapid continuous scanning with samples mounted on fast translational stages, while the Eiger2 1M X-ray camera (DECTRIS Ltd) collects imaging data at a 1-2 kHz frame rate. The ptychographic data is processed using an in-house developed CuPy-based iterative reconstruction algorithm optimized for GPU deployment. 3. 
Results We developed a pipeline where imaging data from the camera and positional data from the translational encoders are streamed into a server equipped with a single A100 NVIDIA GPU for online reconstruction. By deploying an in-house developed reconstruction algorithm in an edge processing pipeline powered by NVIDIA Holoscan, we demonstrate that image reconstruction can be achieved with minimal latencies matching the camera’s frame rate. This enables researchers to gather real-time feedback during experiments, optimizing the throughput and quality of the collected data. The pipeline was further integrated with the EPICS-based beamline controls system via Ophyd library [0] to enable automatic readout of experimental parameters thus improving the user experience. 4. Future work This work builds on recent efforts to increase the efficiency of ptychographic experiments. The application of deep-learning techniques to ptychographic reconstruction has shown significant gains in real-time data reconstruction, despite the time and experimental costs associated with pretraining [1]. Additionally, new AI-driven algorithms for optimal sample scanning have been proposed in electron-based ptychography, further enhancing data acquisition efficiency [2]. With next-generation cameras expected to deliver data at speeds up to 120 kHz [3], integrating recent and current developments will be essential for performing AI-enabled ptychographic imaging at machine speeds, accelerating the discovery of novel materials. [0] https://www.tandfonline.com/doi/full/10.1080/08940886.2019.1608121 [1] https://www.nature.com/articles/s41467-023-41496-z [2] https://www.nature.com/articles/s41598-023-35740-1 [3] https://link.springer.com/article/10.1140/epjp/s13360-024-05224-w PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/DSYLVL/ Ballroom Seher Karakuzu Denis Leshchev PUBLISH ESGGCR@@cfp.scipy.org

-ESGGCR

Scaling NumPy for Large-Scale Science: The cuPyNumeric Approach en

20250709T160500 20250709T163500 0.03000

Scaling NumPy for Large-Scale Science: The cuPyNumeric Approach

Many data and simulation scientists use NumPy for its ease of use and good performance on CPU. This approach works well for single-node tasks, but scaling to handle larger datasets or more resource-intensive computations introduces significant challenges. Moreover, using GPUs to speed up compute-intensive parts of the code adds another level of complexity. Scientists at the Stanford Linear Accelerator Center (SLAC) need to process a large amount of data within a fixed time window, called beam time. The full dataset generated during experiments is too large to be processed on a single CPU. Additionally, the code often must be modified during the beam time to adapt to changing experimental needs. Being able to use NumPy syntax rather than lower-level distributed computing libraries makes these changes quick and easy, allowing researchers to focus on conducting more experiments rather than debugging or optimizing code. To address these challenges, we developed cuPyNumeric, an open-source drop-in replacement for NumPy that seamlessly distributes work across CPUs and GPUs. Built on top of a task-based distributed runtime from Stanford University, it automatically parallelizes NumPy APIs across all available resources, taking care of data distribution, communication, and asynchronous and accelerated execution of compute kernels on both GPUs and multi-core CPUs. In addition, cuPyNumeric can be used alongside other popular Python libraries like SciPy, Matplotlib, and JAX. With cuPyNumeric, SLAC scientists successfully ran their data processing code distributed across multiple nodes and GPUs, processing the full dataset with a 6x speed-up compared to the original single-node implementation. This acceleration not only ensured timely processing of the full dataset but also enabled researchers to adapt their code dynamically to changing experimental needs. 
This talk is for Python developers who work with large data or want to speed up their code, whether or not they’ve used accelerated libraries before. It will showcase the productivity and performance of the cuPyNumeric library using the example of scaling up the signal processing code from SLAC. It will also cover some details of the library's implementation. We propose the following outline for the talk: 5 minutes: An introduction to cuPyNumeric. 5 minutes: Some details on its implementation. 5 minutes: Overview of the SLAC code and the challenges researchers face when processing experimental data during the beam time. 5 minutes: Details on integrating the cuPyNumeric library into the SLAC code. 3 minutes: Performance results in detail (including explanations of changes to the original code that improve both code quality and performance at scale). 2 minutes: Concluding remarks. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/ESGGCR/ Ballroom Irina Demeshko Quynh Nguyen PUBLISH 7KXVCV@@cfp.scipy.org

-7KXVCV

Lightning Talks en

20250709T170000 20250709T180000 1.00000

Lightning Talks

PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/7KXVCV/ Ballroom PUBLISH WDUNLN@@cfp.scipy.org

-WDUNLN

SciPy 2025 Poster Session en

20250709T180000 20250709T190000 1.00000

SciPy 2025 Poster Session

1. Ali Toghani - Enhancing Particle Behavior Analysis through Deep Learning in Biological Multiple Particle Tracking (Machine Learning, Data Science, and Explainable AI) 2. Allen S. Harvey Jr. - Wavefront-Based Visual Acuity Estimation Using Machine Learning (Celebrating the Sci in SciPy) 3. Andy Terrel - Scientific OSS in the Age of GenAI (Maintainers and Community) 4. Axel Sirota - Empowering Learning with Voice: Building an AI-Powered, Accessible Study Assistant (Teaching and Learning) 5. Bobby Jackson - Adaptive scanning strategies for Doppler lidars using edge computing. (Earth, Ocean, Geo, Climate, and Atmospheric Science) 6. Carl Kadie - Explore Solvable and Unsolvable Equations with SymPy (General) 7. Cassia Cai - Ocetrac: Tracking and Quantifying Gridded Structures in Climate Data (Earth, Ocean, Geo, Climate, and Atmospheric Science) 8. Cyril Joly - OptiMask: finding the largest non-contiguous submatrix without NaN (Machine Learning, Data Science, and Explainable AI) 9. Davin Potts - Zarr-dragon-store: a distributed in-memory backend for high-performance data processing (Bioinformatics, Computational Biology, and Neuroscience) 10. Deborah Khider, Julien Emile-Geay - Empowering Learners: Teaching Reproducible Research with Open-Source Tools (Teaching and Learning) 11. Deepak Cherian - Turbocharging Xarray GroupBy, oh my! (Earth, Ocean, Geo, Climate, and Atmospheric Science) 12. Dorota Jarecka - Building Data Models for the BRAIN Initiative Cell Atlas Network Using LinkML ecosystem (Bioinformatics, Computational Biology, and Neuroscience) 13. Elise Chavez - Enabling Innovative Analysis on Heterogeneous Clusters through HTCdaskgateway (Physics and Astronomy) 14. Heberto Mayorquin - Neuroconv: Automating the Path to Data Standardization in Neurophysiology (Bioinformatics, Computational Biology, and Neuroscience) 15. Herve Aniglo - Making Beats with Heartbeats: Converting BioData into Music! (Celebrating the Sci in SciPy) 16. 
Ianna Osborne - Advancing High-Energy Physics Data Analysis with Julia: A Case for JuliaHEP (Physics and Astronomy) 17. Iason Krommydas - Scalable High Energy Physics Analysis with Dask, Awkward, and Distributed Histograms (Physics and Astronomy) 18. Ivan Perez Avellaneda - SciPy Optimize and Reachability of Nonlinear Systems (Celebrating the Sci in SciPy) 19. Jared Oyler, Jenna Epstein, Matthew Woelfle - Enabling integrated regional top-down and local bottom-up views of climate risk for a global high resolution climate dataset (Earth, Ocean, Geo, Climate, and Atmospheric Science) 20. Joe Hamman - Icechunk: Open-source, cloud-native transactional storage engine for multi-dimensional arrays (Earth, Ocean, Geo, Climate, and Atmospheric Science) 21. Johanna Haffner - Modular constrained optimisation in Optimistix (JAX + Equinox) (General) 22. John Kirkham, Akshay Subramaniam, Mads R. B. Kristensen - Unthrottling I/O bottlenecks to accelerate data analysis and machine learning using GPUs and Zarr v3 (General) 23. Jonny, Ryan Ly - LinkML Arrays & Numpydantic: Towards a lingua franca for implementation-agnostic data standards (Bioinformatics, Computational Biology, and Neuroscience) 24. Juanita Gomez - Recipe for Discovery: Building the UC Open Source Repository Browser from Scratch. (Maintainers and Community) 25. Julie Barnum - Science-specific Research Software Engineer Communities: benefits and lessons learned (Maintainers and Community) 26. Julius Boakye - The Python Community's Secret Sauce: Unlocking the Ecosystem's Potential (Maintainers and Community) 27. Justus Magin - Fast and scalable general geospatial regridding (Earth, Ocean, Geo, Climate, and Atmospheric Science) 28. Kevin Lin, Suh Young Choi, Zoe Huang, Arpan Kapoor, Aileen Kuang - Best Practices for Designing and Conducting Code Interviews (Teaching and Learning) 29. Kriyanshi Shah - Empowering Open Science with Scalable Interactive Computing Environments (Celebrating the Sci in SciPy) 30. 
Kyle Sunden - Dynamic Data with Matplotlib (General) 31. Kyohei Sahara - Quantum Chemistry Acceleration: Comparative Performance Analysis of Modern DFT Implementations (Physics and Astronomy) 32. Laura McBride - Chillin' with Polars: Building NCEI Products Faster with Python (Earth, Ocean, Geo, Climate, and Atmospheric Science) 33. Lavanya Gupta - Can long-context LLMs truly use their full context window effectively? (Machine Learning, Data Science, and Explainable AI) 34. Marco Colonna - ReDist: A python tool for model-agnostics binned-likelihood fits in High Energy Physics (Physics and Astronomy) 35. Nicholas McCarty - Performing Object Detection on Drone Orthomosaics with Meta's Segment Anything Model (SAM) (Machine Learning, Data Science, and Explainable AI) 36. Nikhil Pareeshwad - Mapping-and-Analyzing-Deforestation-Using-NASA-Satellite-Data-for-Wildlife Sanctuary (Earth, Ocean, Geo, Climate, and Atmospheric Science) 37. Parul Gupta - Pythonic Potions: Mastering Scientific Sorcery With Efficient AI Coding (Teaching and Learning) 38. Riku Sakamoto - Phlower: A Deep Learning Framework Supporting PyTorch Tensors with Physical Dimensions (Machine Learning, Data Science, and Explainable AI) 39. Ryan Ly, Benjamin Dichter, Stephanie Prince - HDMF: The modular data standardization framework that underlies the Neurodata Without Borders data standard (Bioinformatics, Computational Biology, and Neuroscience) 40. Samantha Obermiller, Beata Meluch, Olivia Hess - A Python Package to Improve Accessibility of the National Microbiome Data Collaborative API (Bioinformatics, Computational Biology, and Neuroscience) 41. Sinclair Combs - PyFaults: a Python tool for stacking fault screening (Celebrating the Sci in SciPy) 42. Stanislaw Jaroszynski - Low-Maintenance C++/Python Interoperability using CPPYY and Python Metaprogramming (General) 43. Suh Young Choi - Carmina: Introducing Programming to Latin Poetry (Teaching and Learning) 44. 
Tad Thurston - Generating Beautiful PDFs from Python pandas DataFrames (General) 45. Tammy Baylis - From new user to community approver: how to grow as an OpenTelemetry contributor (Maintainers and Community) 46. Tetsuo Koyama - PyVista: A Python Library for Interactive 3D Data Visualization and Analysis (General) PUBLIC CONFIRMED Poster https://cfp.scipy.org//scipy2025/talk/WDUNLN/ Ballroom PUBLISH DM3QPX@@cfp.scipy.org

-DM3QPX

DataMapPlot: Rich Tools for UMAP Visualizations en

20250709T104500 20250709T111500 0.03000

DataMapPlot: Rich Tools for UMAP Visualizations

A lot of data scientists use UMAP to help them quickly visualize and explore complex datasets. This could be exploring large unstructured datasets via neural embeddings, or working on LLM explainability by mapping out Sparse Autoencoder features. Making the visualizations good enough, and compelling enough, to present to end users is much harder. However, if done right a good UMAP plot can be a powerful communication tool, or a rich interactive experience that draws users in. Attendees will come away with a sense of what is possible, and an introduction to open source tools that can make it easy. This is a talk focussing on visual presentation of data, and the best way to do that is by example. The talk will provide examples of great visualizations, and give clear visual examples of the many pitfalls in trying to reconstruct them. The aim is to explain how to make UMAP plots compelling to users unfamiliar with the technical details – providing guides via annotation labels, clusters, and spatial palettes. It will introduce the DataMapPlot library, an open source project that specifically aims to make all the difficult parts of UMAP plots easy, and let users focus on the overall aesthetics. **Outline:** * UMAP and Data Maps (6 min) - High dimensional embedding vectors are everywhere + Introduction to data maps - Low dimensional embeddings to explore high dimensional data - Examples of impactful data map visualizations * Plotting Challenges: (8 min) - Common pitfalls of data map visualization + overplotting + color maps + placing text labels + interactivity - Importance of clear and effective visualizations - Real-world examples: comparing "plain" vs. 
well-designed plots * DataMapPlot (8 min) - Simplifying data map visualization creation with DataMapPlot - Generating static visualizations for publication - Generating interactive visualizations for exploration * Examples (6 min) - Showcasing some examples created with DataMapPlot (including code to reproduce) PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/DM3QPX/ Room 315 Leland McInnes PUBLISH 9MUQMM@@cfp.scipy.org

-9MUQMM

GBNet: Gradient Boosting packages integrated into PyTorch en

20250709T112500 20250709T115500 0.03000

GBNet: Gradient Boosting packages integrated into PyTorch

GitHub: https://github.com/mthorrell/gbnet # Audience GBNet significantly expands the functionality of XGBoost and LightGBM, some of the most popular Machine Learning packages. The talk will be of interest to almost any data scientist, ML practitioner, or researcher who uses GBMs. Practitioners primarily using Neural Networks will also be interested because GBM robustness and interpretability may be attractive features in the building blocks they use to approach problems. The audience will learn about GBNet Modules and how to use them, primarily via examples. The examples will focus on model building and interpretability. Forecasting and ordinal regression are examples on the GitHub page (https://github.com/mthorrell/gbnet/tree/main/examples). Embedding examples will be part of the talk. In addition, GBM users will learn more about PyTorch, and PyTorch users will learn more about GBMs. # Outline ## Background & Motivation Gradient Boosting Machines (GBMs) such as XGBoost and LightGBM are among the most popular and powerful general-purpose Machine Learning (ML) algorithms. However, existing implementations of GBMs are architecturally limited. Applications just off the path of standard problems for GBMs (primarily regression, ranking and classification) are not solvable out-of-the-box with standard packages. Deep neural networks (DNNs), on the other hand, offer rich architectural possibilities, but, at least for tabular problem types, lack predictive power, interpretability and robustness. GBNet provides PyTorch modules wrapping XGBoost and LightGBM, enabling new and rich architectural possibilities for users of XGBoost and LightGBM. GBNet allows GBMs to be applied to new problem types, bringing strong predictive performance, better interpretability and improved robustness. ## Software Description GBNet provides PyTorch Modules that wrap XGBoost and LightGBM for insertion into PyTorch’s computational graph. 
The wrappers feed GBM predictions into the PyTorch graph and retrieve resulting gradients and hessians for GBM updates. GBNet provides exact information to both packages efficiently such that GBNet models can fit the same models as XGBoost and LightGBM and are roughly as scalable as those underlying packages. Building with GBNet is nearly the same as building with PyTorch. GBNet Modules can be mixed and combined with standard PyTorch Modules to create expressive architectures that rely on, for example, PyTorch Linear components and an XGBoost component and a LightGBM component simultaneously. Because GBNet wraps XGBoost and LightGBM, native features of those packages also come for free in GBNet. In particular, (1) categorical inputs are supported in GBNet without using PyTorch embeddings and (2) SHAP values can be generated for interpretability. ## Use Cases A limited number of use cases will be covered in the talk: - Forecasting - A hybrid model combining linear trends with seasonal patterns modeled by XGBoost achieves superior performance compared to standard methods like Meta’s Prophet. SHAP values can be used to understand periodicity and trend. - Embeddings & Contrastive Learning - GBNet allows embeddings to be trained using tree-based methods, supporting applications such as contrastive learning and word embeddings—tasks traditionally dominated by neural networks. Low dimensional embeddings can be fit to understand exact model dynamics. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/9MUQMM/ Room 315 Michael Horrell PUBLISH LEHUMF@@cfp.scipy.org

-LEHUMF

Unlocking AI Performance with NeMo Curator: Scalable Data Processing for LLMs en

20250709T131500 20250709T134500 0.03000

Unlocking AI Performance with NeMo Curator: Scalable Data Processing for LLMs

The development and performance of Large Language Models (LLMs) are increasingly constrained by the availability of high-quality, diverse, and representative datasets. Traditional data collection and curation methods suffer from challenges related to cost, scalability, bias, and ethical concerns, often leading to limitations in model performance. Ensuring that training data is clean, deduplicated, and well-structured is critical for achieving superior accuracy and efficiency in LLMs. This session introduces NeMo Curator, an open-source, GPU-accelerated data curation framework designed to scale data processing across multi-node, multi-GPU systems, enabling the efficient preparation of terabyte-scale datasets for AI training. One of the key innovations of NeMo Curator is its modular, scalable pipeline architecture, which provides an end-to-end workflow for data cleaning, filtering, and deduplication. By integrating semantic deduplication, heuristic filtering, classification, and personally identifiable information (PII) redaction, NeMo Curator helps reduce noise and redundancy in training data, ultimately improving LLM performance by up to 7% on downstream tasks. Unlike conventional CPU-based preprocessing workflows, which can be slow and computationally expensive, NeMo Curator leverages NVIDIA RAPIDS and distributed computing to accelerate dataset processing, significantly reducing training bottlenecks. Beyond just removing duplicate data, semantic deduplication ensures that AI models are not overfitting on semantically identical but lexically different text, a common issue in web-scale datasets. Additionally, NeMo Curator supports automated text classification, allowing users to filter out low-quality data and balance dataset distributions for fairer, more robust model training. The PII redaction module ensures compliance with data privacy regulations, making NeMo Curator a valuable tool for industries such as healthcare, finance, and enterprise AI applications. 
This talk will provide a hands-on walkthrough of NeMo Curator’s data processing pipelines, demonstrating how to scale LLM dataset curation across multi-node environments. Attendees will learn how to configure pipelines for text deduplication, classification, and data quality improvement—all implemented using Python-based APIs in Jupyter Notebooks. Through real-world case studies, we will highlight how NeMo Curator enables organizations to preprocess large-scale datasets more efficiently, making LLM training more scalable, cost-effective, and accurate. By the end of this session, attendees will understand the challenges of LLM data processing and why scalable solutions are necessary and learn how NeMo Curator accelerates dataset curation using GPU-based optimizations. Additionally, they can explore semantic deduplication, filtering, and classification techniques to improve dataset quality and gain hands-on experience with configuring, optimizing, and deploying NeMo Curator pipelines for real-world AI applications. Detailed Outlines: 1. Challenges in LLM Data Processing (5 min) 2. Introducing NeMo Curator (10 min) 3. Core Functionalities & Workflow (5 min) 4. Hands-on Demonstration (5 min) 5. Real-World Applications & Q&A (5 min) Similar Topic was Presented at Other Events: [Scaling Data Processing for Training Large Language Models](https://www.linkedin.com/feed/update/urn:li:activity:7286196130395103232/?originTrackingId=hHldvqzWTvOC7XhlLhWX4g%3D%3D) PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/LEHUMF/ Room 315 Allison Ding PUBLISH GJRGVU@@cfp.scipy.org

-GJRGVU

Escaping Proof-of-Concept Purgatory: Building Robust LLM-Powered Applications en

20250709T135500 20250709T142500 0.03000

Escaping Proof-of-Concept Purgatory: Building Robust LLM-Powered Applications

LLMs have transformed the landscape of data-driven software, enabling applications in information retrieval, automated summarization, and intelligent assistants. However, many teams struggle to move beyond early-stage demos—where models appear to work well in controlled environments but fail in production due to **hallucinations, non-determinism, and poor evaluation practices**. This talk addresses that gap. It presents a **structured framework for incorporating LLMs into real-world applications**, grounded in software engineering best practices and scientific computing principles. Rather than focusing solely on model performance, we’ll emphasize how to **design, evaluate, and iterate on AI-powered systems effectively**. Attendees will gain insights into: - **The LLM software development lifecycle (SDLC)**—how it differs from traditional ML and software workflows. - **Evaluating business and scientific value**—ensuring LLM outputs align with real-world needs. - **Handling non-determinism and hallucinations**—logging, monitoring, and structured output techniques. - **Beyond conversations: Automating structured workflows**—using LLMs for knowledge extraction, document processing, and decision support. ### Intended Audience This talk is for **data scientists, software engineers, and AI/ML practitioners** looking to: - Move beyond toy LLM demos and build production-ready systems. - Understand how software engineering principles apply to AI-driven applications. - Learn how to evaluate and iterate on LLM outputs to ensure robustness and reliability. Prior experience with Python and scientific computing is expected, but attendees don’t need prior LLM expertise—this talk focuses on software and systems principles applicable across AI applications. ### Key Takeaways By the end of this talk, attendees will understand: - Why many LLM projects stall in proof-of-concept purgatory. - The key differences between LLM development and traditional software engineering. 
- How to design an iterative LLM SDLC, incorporating monitoring, evaluation, and structured outputs. - Strategies for handling non-determinism and ensuring AI models work reliably in production. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/GJRGVU/ Room 315 hugo bowne-anderson PUBLISH WM9UFJ@@cfp.scipy.org

-WM9UFJ

Keeping LLMs in Their Lane: Focused AI for Data Science and Research en

20250709T143500 20250709T150500 0.03000

Keeping LLMs in Their Lane: Focused AI for Data Science and Research

This talk is for Python data scientists, researchers, and developers looking to integrate AI into their work in a practical, responsible way—or skeptical that it's even possible. In data analysis, correctness and reproducibility are essential, yet general-purpose AI tools lack the structure and determinism needed to ensure reliable results. Instead of treating LLMs as open-ended assistants, we should focus on applying them to well-defined tasks with clear guardrails. When used this way, they can be not just useful, but (relatively) safe and highly effective. This talk will explore how to integrate LLMs into scientific workflows in a controlled and purposeful way. Instead of relying on generic AI assistants, we can build focused tools that guide and enhance research without introducing unnecessary complexity. I’ll discuss design principles for creating AI solutions that combine the creativity of LLMs, the reliability of deterministic software, and the safety of human oversight. Live demos will show how LLMs can be embedded in interactive applications, assisting with real-world data workflows while maintaining transparency and control. These applications produce analyses that can be not only verified and trusted, but even reused and extended. My hope is for attendees to leave the talk inspired to build their own thoughtful LLM solutions that accelerate their research without sacrificing rigor. ### Links to packages used for demos - [Sidebot](https://jcheng.shinyapps.io/sidebot/) - [Chatlas](https://posit-dev.github.io/chatlas/) - [Shiny for Python](https://shiny.posit.co/py/) ### Sample of previous talks - [The Past and Future of Shiny](https://www.youtube.com/watch?v=HpqLXB_TnpI) - [Shiny x AI](https://www.youtube.com/watch?v=AP8BWGhCRZc) PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/WM9UFJ/ Room 315 Joe Cheng PUBLISH 8WQQPV@@cfp.scipy.org

-8WQQPV

Scaling AI/ML Workflows on HPC for Geoscientific Applications. en

20250709T152500 20250709T155500 0.03000

Scaling AI/ML Workflows on HPC for Geoscientific Applications.

Scaling AI and ML workloads on HPC platforms demands a specialized approach to ensure efficiency and accuracy. This presentation focuses on methods for optimizing AI/ML workflows as they grow increasingly complex and data-intensive for geoscientific applications. We explore parallelization strategies, ranging from standard Data Parallelism (DP) to advanced techniques like Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP), that significantly reduce GPU memory footprints through mixed-precision training and activation checkpointing. Moreover, we address the challenges of configuring diverse communication backends (NCCL, MPI, and Gloo) within HPC environments, outlining practical solutions for seamless data exchange across both single-node multi-GPU and multi-node multi-GPU setups. Our presentation demonstrates how these combined optimizations can deliver stable and scalable model training and inference while significantly reducing training time and resource usage. Participants will gain actionable insights into the core technical obstacles and proven strategies for streamlining large-scale AI/ML workflows on HPC infrastructures. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/8WQQPV/ Room 315 Negin Sobhani PUBLISH 8SRJ3V@@cfp.scipy.org

-8SRJ3V

Physical XAI - Going Beyond Traditional XAI Methods in Earth System Science en

20250709T160500 20250709T163500 0.03000

Physical XAI - Going Beyond Traditional XAI Methods in Earth System Science

The growth of deep learning over the last 15 years has given rise to complex non-linear models that are much more difficult to interpret than heuristic or linear regression models, but often have better performance. In response, Explainable AI (XAI) algorithms have been developed to make these models' decisions more understandable. Many of these techniques originated in the computer vision community, driven by the need for interpretability in areas such as medical imaging. While effective for niche image-based tasks, these methods are often not well suited for direct application to Earth system models. Earth system models present unique challenges for XAI. These models are high-dimensional, deal with large-scale data, and incorporate physically meaningful and autocorrelated spatial and temporal relationships. Most existing XAI approaches struggle to handle these types of inputs effectively. They often violate physical laws and fail to account for uncertainty, making their explanations less reliable and insightful for model developers and end users. We argue that “physical XAI” (interpretable methods that aim to verify that the model has learned physically meaningful relationships) should not simply involve applying standard XAI algorithms to Earth system models. Instead, it should include customized methods that respect the unique characteristics of Earth systems. Furthermore, we advocate expanding the definition of “physical XAI” to include meaningful data analysis by domain experts throughout the model development process. To advance this idea of physical XAI, we introduce an approach that adapts and extends existing XAI techniques in three ways. First, we develop novel methods for perturbing features in more physically consistent ways. Many XAI techniques, such as partial dependence plots (PDPs), rely on feature perturbation to assess variable importance. 
However, autocorrelated features can cause standard perturbation methods to overestimate or underestimate a variable's importance. To address this, we propose perturbing groups of variables simultaneously relative to their own distributions, ensuring that the perturbed inputs remain physically realistic. This approach generates more consistent and meaningful explanations of model behavior. Second, we demonstrate our method on uncertainty values in addition to raw probabilities to gather more meaningful insight. Third, we apply our models and feature groups to global sensitivity analysis methods, such as the Sobol Indices method, which are commonly used for dynamic models but rarely used for data-driven models. This can reveal higher-order variable (or group) interactions that are physically meaningful. We believe that XAI does not need to be constrained to an algorithm. Domain experts have invaluable knowledge that can be utilized by examining model training data and output. With expert interrogation of the data during the development process, we can get a much better understanding of what the model has learned. We demonstrate this process in a variety of ways: examination of physical properties of local instances, physical examination of composites, and interactive examination of model output with user-controlled input. This iterative examination often leads to improvements in the data curation, feature engineering, and/or model tuning process. These physical XAI methods are demonstrated on two Earth system models: a winter precipitation type classifier and AI weather prediction models within the CREDIT (Community Research Earth Digital Intelligence Twin) framework, both developed at NSF NCAR. 
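The grouped perturbation idea described above can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: it swaps in plain joint permutation of a feature group as a stand-in for their distribution-aware perturbations, and the function name, the toy data, and the `groups` argument are all hypothetical.

```python
import random

def grouped_permutation_importance(predict, X, y, groups, n_repeats=10, seed=0):
    """Mean increase in squared error when each feature group is shuffled jointly.

    predict: callable mapping a list of rows to a list of predictions
    X: list of rows (each a list of floats); y: list of targets
    groups: dict mapping a group name to the column indices permuted together
    """
    rng = random.Random(seed)

    def mse(rows):
        preds = predict(rows)
        return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

    baseline = mse(X)
    importance = {}
    for name, cols in groups.items():
        increases = []
        for _ in range(n_repeats):
            # One shared row order for the whole group, so correlated columns
            # move together and every perturbed row keeps a combination of
            # values that actually occurred in the data.
            order = list(range(len(X)))
            rng.shuffle(order)
            shuffled = [list(row) for row in X]
            for i, src in enumerate(order):
                for c in cols:
                    shuffled[i][c] = X[src][c]
            increases.append(mse(shuffled) - baseline)
        importance[name] = sum(increases) / n_repeats
    return importance


# Toy data (hypothetical): columns 0 and 1 are strongly correlated, like a
# quantity measured at two adjacent levels; column 2 is noise the model ignores.
_rng = random.Random(42)
X = []
for _ in range(200):
    t = _rng.gauss(0.0, 1.0)
    X.append([t, t + _rng.gauss(0.0, 0.1), _rng.gauss(0.0, 1.0)])
y = [row[0] + row[1] for row in X]

def toy_model(rows):
    return [row[0] + row[1] for row in rows]

scores = grouped_permutation_importance(
    toy_model, X, y, groups={"correlated_pair": [0, 1], "noise": [2]}
)
```

Shuffling columns 0 and 1 independently would pair values that never occur together; moving them as one group avoids that, which is the motivation the abstract gives for group-wise perturbation.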
In the CREDIT framework, we perform physically realistic perturbations of the atmospheric state in models trained with physical constraints versus those trained without, to analyze the effect of the physical constraints on propagating changes through the atmosphere, and we provide a Jupyter notebook that enables anyone to do so. For the precipitation type model, we provide a Jupyter widget that allows a user to interactively adjust the vertical atmospheric profiles and examine how the model probabilities and uncertainties change as a result. Additionally, we provide Jupyter notebook demonstrations of physical explainability using the Sobol Indices variance-based approach and our novel extension to partial dependence plots. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/8SRJ3V/ Room 315 Charlie Becker PUBLISH PRSN9R@@cfp.scipy.org

-PRSN9R

Breaking the silo: composable bioinformatics through cross-disciplinary open standards en

20250709T104500 20250709T111500 0.03000

Breaking the silo: composable bioinformatics through cross-disciplinary open standards

The practice of data science in genomics and computational biology is fraught with friction. This is in large part because bioinformatic tools tend to be tightly coupled to file input/output. As a result, bioinformatic workflows shuffle data through meandering, labor-intensive, and time-consuming transformations in order to accommodate each tool’s requirements. Similarly, genomics visualization tools need to handle various complex file types or require further data conversion by end users. We argue that the **adoption of emerging open standards not tied to bioinformatics** can help alleviate this coupling, freeing authors to focus on problem-specific concerns, and enabling bioinformatic tools to integrate better into the wider data science, visualization, and AI/ML ecosystems. In this talk, we will present three libraries as short vignettes to illustrate the potential of composable bioinformatics. First, we present **oxbow** (https://github.com/abdenlab/oxbow), an adapter library that unifies access to common genomic data formats. Despite their varied on-disk representations, many specialized bioinformatic formats share a fundamentally tabular structure. Oxbow efficiently transforms queries to such files into a common in-memory representation, **Apache Arrow**. Arrow is a standard, columnar and self-describing layout for tabular data for both efficient in-memory analytics and binary transport. It is now widely supported by various open-source data technologies, including popular data frame libraries. Oxbow's core is written in Rust, which provides memory safety, performance, and ease of binding to high-level languages including Python and R. For file connectivity, Oxbow makes use of the noodles[1] implementation of GA4GH formats in Rust (SAM/BAM, VCF/BCF, etc.) as well as the bigtools[2] Rust library for the UCSC Big Binary Indexed formats (bigWig and bigBed). 
The Python API provides a simple interface to lazy and distributed data libraries used in Python, including dask, polars, and duckdb. Second, we present **bioframe**[3] (https://github.com/open2c/bioframe), a Python library from the Open Chromosome Collective (Open2C) for operations on genomic intervals in Pandas data frames, operations that are fundamental to bioinformatic analyses. Bioframe’s design principles emphasize reuse of existing general-purpose data structures. Namely, (1) bioframe does not introduce new custom objects: interval sets are standard Pandas data frames and (2) join operations are performed by reusing NumPy-based primitives rather than implementing interval tree structures, with similar performance. Bioframe facilitates smooth integration with the Python stack, removing the need to convert or serialize data between operations. Finally, we give a high-level overview of **anywidget** [4,5] (https://anywidget.dev). Anywidget is a standard and toolkit for authoring web-based interactive widgets in computational notebooks. Trevor Manz presented anywidget at SciPy last year, where he and I also gave a well-received tutorial on it. In this talk, we will show how anywidget can be leveraged in a bioinformatics context. In practice, third-party Jupyter widgets are cumbersome to author, distribute, and install because widgets must be built, bundled, and installed as individual frontend extensions and the frontends of different Jupyter-compatible platforms (JCPs) – including JupyterLab, Google Colab, and VS Code – install and load extensions in disparate ways. Another consequence of this architecture is that kernel (Python) and frontend (JavaScript) modules for a widget must be distributed separately. 
To address these difficulties, anywidget (1) supplies a single universal extension plugin for all JCP runtimes and (2) defines a narrow frontend widget API based on web standard **ECMAScript modules**, which work natively across all modern browsers without transformation. Consequently, Jupyter widgets authored using anywidget (anywidgets) do not require installation or the use of build toolchains for development. This enables modern conveniences to support rapid development cycles. Furthermore, anywidgets can be pasted and executed live in a code cell or distributed as single Python packages. We will demonstrate how to combine these tools to create a custom connected genomic analysis and visualization environment. We propose that components such as these, which leverage domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data processing workflows, computational analyses, and systems for exploratory data analysis and visualization. - [1] Macias, M. Noodles. Last accessed March, 2025. url: https://github.com/zaeleus/noodles - [2] Huey, J. D., Abdennur, N. (2024). Bigtools: a high-performance BigWig and BigBed library in Rust. Bioinformatics. - [3] Open2C, Abdennur, N., Fudenberg, G., Flyamer, I., M, Galitsyna, A., A, Goloborodko, A., Imakaev, M., & Venev, S. (2024). Bioframe: Operations on Genomic Intervals in Pandas Dataframes. Bioinformatics. - [4] Manz, T., Gehlenborg, N., Abdennur, N. (2024). Any notebook served: authoring and sharing reusable interactive widgets. Proceedings of the 23rd Python in Science Conference. - [5] Manz, T., Abdennur, N., Gehlenborg, N. (2024). anywidget: reusable widgets for interactive analysis and visualization in computational notebooks. JOSS. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/PRSN9R/ Room 317 Nezar Abdennur Trevor Manz PUBLISH TJNNBD@@cfp.scipy.org

-TJNNBD

ReSCU-Nets: recurrent U-Nets for segmentation of multidimensional microscopy data en

20250709T112500 20250709T115500 0.03000

ReSCU-Nets: recurrent U-Nets for segmentation of multidimensional microscopy data

Quantification of microscopy images is a central tool in cell and developmental biology. Quantification pipelines often begin with segmentation and tracking of the structures to be measured. Neural network architectures, such as the U-Net, have improved the accuracy of automated microscopy image segmentation. However, segmentation of multidimensional images remains challenging, as photobleaching and environmental changes can reduce the signal-to-noise ratio during image acquisition, limiting the ability of neural networks to recognize the same object in multiple images of a sequence. The sequential information in timelapse images provides an avenue for improving segmentation. A common way to use temporal information is to add recurrence. In a recurrent network, the output of a layer depends on both the current input and previous outputs. For example, in the Long Short-Term Memory (LSTM) U-Net, convolutional layers are replaced with layers that recall information previously seen during inference. Because recurrence in LSTM U-Nets is within layers, the segmentation results are still solely based on the image being considered. Promptable segmentation methods, such as the Segment Anything Model (SAM), can be used recurrently: the user provides a prompt for the first image of a sequence, and the segmentation masks produced by the network are used as prompts for subsequent images. SAM is built on large transformer models that require an abundance of training data (over 1 billion masks), something not accessible to the standard microscopist. We designed a neural network architecture to accurately segment multidimensional microscopy images with limited training data. Thus, we began from a U-Net, given the low data requirements of U-Net training. To incorporate temporal context, we added a prompt encoder. The prompt encoder used the output mask produced by the network to inform the segmentation of the same object in the following image. 
The image to be segmented and the prompt entered the network through split input streams that were eventually concatenated. We refer to this architecture as Recurrent Split Concatenated U-Net (ReSCU-Net). ReSCU-Nets combine the advantages of recurrent and promptable methods. The network is prompted with a segmentation of the first image in a sequence, produced manually, with a U-Net, or with other methods. The network then uses its outputs as prompts to segment the rest of the images. To assess network performance, we assembled three timelapse datasets, generated by imaging living Drosophila embryos. Datasets included the nuclei of migrating cardiac progenitors, the edge of epidermal wounds, and the membranes of epidermal cells. We compared the performance of ReSCU-Nets to U-Nets, LSTM U-Nets, and the pretrained SAM ViT Huge. ReSCU-Nets produced the most accurate segmentations, with intersection-over-union values of 88 ± 6% (nuclei), 90 ± 5% (wounds), and 89 ± 6% (cells). ReSCU-Nets did not produce false positives due to prompting. The high accuracy of segmentations combined with the lack of false positives resulted in true positive values of 98 ± 2%, 99 ± 1%, and 99 ± 1% for nuclei, wounds, and cells, respectively, significantly higher than any other network. Thus, ReSCU-Nets maximize accuracy and minimize user interventions required to correct false positives, outperforming state-of-the-art models for segmentation of timelapse microscopy images. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/TJNNBD/ Room 317 Rodrigo Fernandez-Gonzalez Raymond Hawkins PUBLISH MHMTLS@@cfp.scipy.org

-MHMTLS

EffVer: Versioning code by the effort required to upgrade en

20250709T131500 20250709T134500 0.03000

EffVer: Versioning code by the effort required to upgrade

Intended Effort Versioning ([EffVer](https://jacobtomlinson.dev/effver/)) is a version scheme in which you simply tell your users what order of magnitude of upgrade effort to expect. Version numbers are hard to get right. Semantic Versioning (SemVer) communicates backward compatibility via version numbers, which often leads to a false sense of security and broken promises. Calendar Versioning (CalVer) sits at the other extreme, communicating almost no useful information at all. Many Python projects follow a looser scheme called EffVer, in which, instead of making promises about backward compatibility, they communicate the likelihood and magnitude of the work required to adopt a new version. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/MHMTLS/ Room 317 Jacob Tomlinson PUBLISH RECJVV@@cfp.scipy.org
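As an illustration of the EffVer abstract above, the sketch below reads a version as MACRO.MESO.MICRO (the segment names used on the linked EffVer page) and reports the effort a bump is meant to signal. The function name and the returned labels are hypothetical wording for this example; the sketch ignores pre-release tags and any special handling of 0.x versions.

```python
def upgrade_effort(old, new):
    """Return the effort implied by moving from `old` to `new` under EffVer.

    Versions are "MACRO.MESO.MICRO" strings; the labels returned here are
    illustrative wording, not an official vocabulary.
    """
    old_parts = tuple(int(p) for p in old.split("."))
    new_parts = tuple(int(p) for p in new.split("."))
    if len(old_parts) != 3 or len(new_parts) != 3:
        raise ValueError("expected versions of the form MACRO.MESO.MICRO")
    if new_parts[0] != old_parts[0]:
        return "significant effort"
    if new_parts[1] != old_parts[1]:
        return "some effort"
    if new_parts[2] != old_parts[2]:
        return "no effort expected"
    return "same version"


examples = {
    ("2.3.1", "3.0.0"): upgrade_effort("2.3.1", "3.0.0"),
    ("2.3.1", "2.4.0"): upgrade_effort("2.3.1", "2.4.0"),
    ("2.3.1", "2.3.2"): upgrade_effort("2.3.1", "2.3.2"),
}
```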

-RECJVV

Packaging a Scientific Python Project en

20250709T135500 20250709T142500 0.03000

Packaging a Scientific Python Project

Preparing scientific software for distribution is a key part of the software lifecycle, but the tools and landscape in Python packaging seem to be constantly evolving. The *Scientific Python Development Guide* was written to provide up-to-date best practices for packaging, linting, and testing scientific software. It continues to be an unparalleled resource, offering a template that supports many different build backends, including several for compiled code. A WebAssembly-powered repo-review tool can even validate a GitHub repository against the recommendations in the guide, alerting users to new best practices as they evolve. We will begin by examining project setup, covering backend selection and how to prepare for distribution. We will go over the structure of a project, explaining why certain decisions were made, such as placing the package inside a `src` directory. Common mistakes and pitfalls will be discussed, along with tips on best practices for typical scenarios, such as loading a data file from the package. We will illustrate these concepts using a specific backend, along with a brief explanation of our choice. Additionally, we will cover setting up packaging metadata, including the recently finalized SPDX license system. The next portion will focus on tooling to facilitate development and distribution. Using the guide as a foundation, we will demonstrate how to use GitHub Actions to test and deploy your code, including recent enhancements like Trusted Publishing and SigStore-signed artifacts. We will also explore related tools for validating code quality and ensuring high-quality packages are reliably produced. Attendees will learn how to use the guide’s template to quickly set up all the discussed components for a new project. We will also showcase how the repo-review tool can assess an existing project’s adherence to best practices and recent packaging updates. 
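One of the typical scenarios mentioned above, loading a data file from the package, can be sketched with the standard library's `importlib.resources`. The package name `demo_pkg` and the file `constants.txt` are hypothetical; the example builds a throwaway `src`-layout project in a temporary directory so that it runs on its own, and it assumes Python 3.9 or newer.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

# Build a throwaway project on disk so the example is self-contained.
# "demo_pkg" and "constants.txt" are hypothetical names.
project = Path(tempfile.mkdtemp())
package_dir = project / "src" / "demo_pkg"
package_dir.mkdir(parents=True)
(package_dir / "__init__.py").write_text("")
(package_dir / "constants.txt").write_text("c = 299792458\n")

# An installed package would already be importable; here we emulate that
# by putting the src directory on sys.path.
sys.path.insert(0, str(project / "src"))
importlib.invalidate_caches()


def read_constants():
    """Read a data file relative to the package, not the working directory."""
    resource = importlib.resources.files("demo_pkg") / "constants.txt"
    return resource.read_text(encoding="utf-8")


text = read_constants()
```

Resolving the file through the package rather than through `__file__` or the current directory keeps the lookup working once the project is installed. Whether the data file is actually shipped in the built distribution depends on the build backend's configuration, which this sketch does not cover.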
Next, we will cover compiled components; this is an aspect often left out of introductions to packaging, but adding compiled code like C++ to a package has become significantly simpler with modern packaging tools. We will explore a collection of tools, including modern backends like scikit-build-core and meson-python, binding tools like pybind11 and nanobind, and creating redistributable wheels with cibuildwheel. Combined, these tools allow packages containing compiled code to target all major platforms, including WebAssembly via Pyodide, with minimal extra effort. We will conclude with a brief look at the new and proposed changes in the packaging ecosystem that will benefit the scientific community. Upcoming features in pybind11 3.0, the culmination of years of development, will enable new interoperability with C++. And cibuildwheel 3.0 will expand support to new targets like iOS wheels. Other topics will include the transition from extras and requirement files to dependency-groups, how to use uv in place of pip and traditional workflows, and new proposals coming from the WheelNext organization, such as default extras. Attendees will leave with a solid understanding of modern Python packaging, practical tools to streamline their workflows, and an optimistic outlook on the future of scientific software distribution. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/RECJVV/ Room 317 Henry Schreiner PUBLISH NRMNDX@@cfp.scipy.org

-NRMNDX

User guides: engaging new users, delighting old ones en

20250709T143500 20250709T150500 0.03000

User guides: engaging new users, delighting old ones

In this talk, I'll draw on my work on user guides for tools like [Great Tables](https://posit-dev.github.io/great-tables/articles/intro.html) and [Shiny for Python](https://shiny.posit.co/py/docs/overview.html)--as well as some great guides for popular libraries--to explore three big focuses for guides: * Onboarding: helping new users get started with your software. * Diving deeper: ensuring users can learn to perform key tasks with your software. * User guides in the wild: lessons learned from great user guides. **Onboarding** First, simplified whole tasks illustrate the big picture. Guiding users through the simplest, whole example of your tool helps them see the big pieces, and how they connect to each other. Second, backwards design lets users eat cake first. Rather than starting from the very first step and working forwards, starting from the end result often shows the most interesting part first. **Diving deeper** In this section, I'll look at how guides can help users take their next steps. First, domain models provide a conceptual backbone for documentation. By organizing guide pages around a diagram or conceptual workflow, users become better at navigating the guide as they learn your tool. Second, guide pages need to be sequenced to work like a course and a reference. Users should be able to read a guide from end to end, or come back and browse it to get help with specific topics. **User guides in the wild** In this section, I'll draw inspiration from three existing documentation sites. First, [DuckDB](https://duckdb.org/docs/stable/) provides examples up top at the very start of each page, with rules detailed below. These examples serve as a quick refresher for people who already have some experience with SQL. Second, [FastAPI](https://fastapi.tiangolo.com/learn/) uses an incredible number of simple, complete examples throughout its guide. 
Finally, [Polars](https://docs.pola.rs/) balances teaching high-level concepts, like its lazy expressions, with diving deeper into specific topics. **Inspiration** Note that this talk was inspired by my work on [quartodoc](https://machow.github.io/quartodoc/get-started/overview.html), a tool for generating API References. I noticed that while API References are often similar to each other, user guides often differ from package to package, and are hard to get right. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/NRMNDX/ Room 317 Michael Chow PUBLISH LS3LFX@@cfp.scipy.org

-LS3LFX

Challenges and Implementations for ML Inference in High-energy Physics en

20250709T152500 20250709T155500 0.03000

Challenges and Implementations for ML Inference in High-energy Physics

Machine learning has paved the way for new discoveries in high-energy physics at the Large Hadron Collider at CERN. While we already have state-of-the-art models for tasks such as data analysis, simulation, and track reconstruction, alongside effective training methodologies, inference with these models remains a challenge. Even if we develop a sophisticated model that captures the intricate patterns of fundamental particles, its impact is limited without efficient inference engines that enable its deployment and practical application. ML inference has become especially critical in high-energy physics, where data influx rates are extremely high. Although popular frameworks like TensorFlow and PyTorch provide robust inference capabilities, their integration into C++ environments presents several challenges, including flexibility constraints when interfacing with external frameworks. ONNX Runtime, which enables fast inference of ONNX models, also has limitations due to its lack of fine-grained control. To address these challenges, SOFIE (System for Optimized Fast Inference code Emit) was developed. It is an inference engine designed to generate highly optimized C++ code from trained ML models. SOFIE converts models in ONNX format into its own intermediate representation and also offers limited support for models trained in Keras, PyTorch, and message-passing GNNs from DeepMind’s Graph Nets library. The key advantage of SOFIE is its ability to generate standalone C++ code that can be directly invoked within C++ applications with low latency and minimal dependencies, requiring only BLAS for numerical computations. This enables seamless integration into high-energy physics workflows and other computationally demanding applications. Additionally, the generated code can be compiled at runtime using Cling Just-In-Time compilation, allowing for flexible execution, including within Python environments. 
By eliminating the need for heavyweight machine learning frameworks during inference, SOFIE provides a highly efficient and easily deployable solution for ML inference. Recent benchmarking demonstrates that SOFIE provides faster inference for event-level evaluations and consumes less memory for smaller models than standards such as ONNXRuntime and LibTorch, but still has room for improvement in its optimization. For GNNs, SOFIE scales better by avoiding overheads from splitting models with a large number of operators. Further ongoing developments in SOFIE now include GPU support via multiple stacks such as SYCL, ALPAKA, and CUDA, along with integration with hls4ml for FPGAs and support for models developed in Flax. In this talk, we will explore machine learning opportunities at CERN and the challenges involved in implementing them. We will then delve into SOFIE’s architecture, use cases, and the latest developments in its optimization methods and extensions. ### Outline ### - Computing challenges at CERN - a brief introduction - How does machine learning solve them? - Limitations and opportunities - Introducing TMVA SOFIE - Motivation - Why does CERN need super-fast inference of ML models with low latency and fewer dependencies? - Why aren't frameworks like TensorFlow or PyTorch much help at CERN for ML inference? - SOFIE Architecture - Parser - Model Storage - Inference Code Generator - SOFIE Parser - ONNX Parser - Keras Parser - PyTorch Parser - SOFIE Inference Code Generator - SOFIE Advanced Models' Inference Support - Graph Neural Networks - Dynamic Computation Graph - SOFIE Optimization Methods - Inference on Accelerators - Benchmarking results - Future Goals ### Pre-requisites ### Intermediate knowledge of machine learning and the underlying mathematics will be helpful. The project is an ML inference engine developed using C++ with Python interfaces through the C-Python API. Thus, a basic understanding of the required libraries will be beneficial. 
Familiarity with mathematical functions such as GEMM, ReLU, matrix multiplication, and hardware accelerators will be useful for following the latest developments of the project. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/LS3LFX/ Room 317 Sanjiban Sengupta PUBLISH RM3UHE@@cfp.scipy.org

-RM3UHE

KvikUproot - Reading and Deserializing High Energy Physics Data with KvikIO and CuPy en

20250709T160500 20250709T163500 0.03000

KvikUproot - Reading and Deserializing High Energy Physics Data with KvikIO and CuPy

High energy physics (HEP) analyses need larger and larger datasets to push the limits of experimental sensitivity to theoretical calculations. To meet these needs, upgrades to detectors and detector infrastructure are increasing data rates, and GPUs are a natural choice for handling the increased data volume. For analysts at the Large Hadron Collider (LHC), it has become necessary to find more efficient ways of processing data. [Awkward Array](https://github.com/scikit-hep/awkward) already has rich functionality for its CUDA backend, which allows users to leverage the high throughput of GPUs without any knowledge of CUDA programming. However, current Python tools for reading and deserializing the particle-physics-specific [ROOT](https://root.cern/) file formats to GPU memory first read data from storage to the CPU and only then copy it to the GPU. This unnecessarily introduces the CPU as a potential bottleneck in an analysis workflow. [KvikUproot](https://github.com/fstrug/kvikUproot) is a prototype module for the Uproot library which uses Python bindings to cuFile and nvCOMP provided by the [KvikIO](https://github.com/rapidsai/kvikio) library for reading and decompressing the ROOT "TTree" and newer "RNTuple" file formats. On systems with GPUDirect Storage (GDS) enabled, cuFile transfers data from storage directly to the GPU. For decompression, nvCOMP provides a backend that decompresses raw data on the GPU. [CuPy](https://cupy.dev/), which is nearly a drop-in replacement for NumPy, provides an interface in Python to buffers stored on the GPU. ROOT data is then deserialized into Awkward Arrays with its CUDA backend. Implementation of cuFile and nvCOMP required a restructuring of Uproot’s current workflow to maximize performance. Currently, chunks of data are sequentially read, decompressed, and deserialized before being concatenated in Uproot. 
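The difference between the sequential workflow just described and a batched one, in which many chunks are read first and then decompressed in parallel, can be illustrated with a CPU-only analogy. This sketch uses threads, `zlib`, and NumPy purely as stand-ins: it does not call cuFile, nvCOMP, CuPy, or Uproot, and every function name in it is hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Fake "baskets": compressed chunks of a float64 column held in memory.
# In a real file these would be byte ranges read from storage.
column = np.arange(1_000, dtype=np.float64)
compressed_chunks = [
    zlib.compress(part.tobytes()) for part in np.array_split(column, 8)
]


def read_chunk(index):
    """Stand-in for a storage read; returns the compressed bytes of one chunk."""
    return compressed_chunks[index]


def deserialize(raw):
    """Interpret decompressed bytes as float64 values."""
    return np.frombuffer(raw, dtype=np.float64)


def load_sequential():
    # One chunk at a time: read, decompress, deserialize, then concatenate.
    parts = [
        deserialize(zlib.decompress(read_chunk(i)))
        for i in range(len(compressed_chunks))
    ]
    return np.concatenate(parts)


def load_batched(max_workers=4):
    # Issue all reads first, then decompress the whole batch in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        raw = list(pool.map(read_chunk, range(len(compressed_chunks))))
        decompressed = list(pool.map(zlib.decompress, raw))
    return np.concatenate([deserialize(d) for d in decompressed])
```

Both functions return the same array; only the order in which the stages are scheduled differs, which is the restructuring the abstract refers to.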
KvikUproot asynchronously reads many chunks of data at a time with cuFile, decompresses them in parallel through nvCOMP once they are read, and then streams buffer deserialization operations with CuPy. Already, KvikUproot can decrease read times by 20% for TTree and 30% for RNTuple file formats without GDS support. There are still challenges to the adoption of tools such as KvikUproot. Enabling GDS involves multiple hardware and software components with limited support for third-party solutions. This requires research and development of our computing infrastructure to fulfill the requirements for activating these high-performance features. Additionally, cuFile, nvCOMP, and CuPy are specific to NVIDIA GPUs. Tools similar to KvikUproot for other GPU types must be developed separately to accommodate the diversity of computing resources available. Despite these challenges, KvikUproot reduces read times of ROOT data without GDS support when compared to Uproot. The continued development of KvikUproot furthers the mission of creating a suite of Python tools that lets HEP physicists run analyses entirely on the GPU without the CPU bottleneck. Development continues on supporting additional data types, creating a more user-friendly API, and optimizing analysis workflow integration. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/RM3UHE/ Room 317 Frank Strug PUBLISH BJL3U3@@cfp.scipy.org

-BJL3U3

Python for Climate Science: Using Intake to provide easy access to Climate Model data en

20250709T104500 20250709T111500 0.03000

Python for Climate Science: Using Intake to provide easy access to Climate Model data

Looking at a folder on a command line for the first time, scratching your head, and writing a bunch of for loops, regexes or globs to interrogate some data is a common rite of passage for new PhD students working with climate data. Unfortunately, this rite of passage can be slow, duplicates effort, and can produce suboptimal results. Enter Intake, and the Intake-ESM plugin – Python packages which allow us to systematically index and catalog the outputs of Earth System Models (ESMs), as well as other similarly structured datasets. Using these tools, we’re able to efficiently generate a single catalog, allowing researchers to seamlessly access and analyse petabytes of data. In this talk, I’ll outline: - The difficulties of writing your own code to collate and analyse climate data output. - How the intake ecosystem can abstract away this issue, freeing up scientists to do science. - What’s necessary to make the tools your users want and need – and how to make your solutions easy to adopt. Expect to learn: - How we leverage Intake and Intake-ESM to make Australia’s climate data tractable to researchers. - The challenges and pitfalls of maintaining a catalog comprising over 1000 datasets from different sources. - How the plugin architecture of intake makes it possible for us to hide the necessary complexity, providing a simple and consistent interface to scientific users, who want to find climate datasets, not become data engineers. - How we expose these tools to the Australian (and global) climate community, making data access easier, workflows faster, and results more reproducible. - How we train users in these tools, collect feedback, and produce the features they need and want. This talk is aimed at: - The international climate science community. - People looking to index and share their large datasets. 
- People interested in reproducibility for scientific workflows with large and complex datasets Source Code: - https://github.com/ACCESS-NRI/access-nri-intake-catalog - https://github.com/ACCESS-NRI/intake-dataframe-catalog PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/BJL3U3/ Room 318 Charles Turner PUBLISH HK3AAQ@@cfp.scipy.org

-HK3AAQ

Breaking Out of the Loop: Refactoring Legacy Software with Polars en

20250709T112500 20250709T115500 0.03000

Breaking Out of the Loop: Refactoring Legacy Software with Polars

In this talk, we will explore best practices for modernizing legacy software using Polars, a popular data manipulation library. Our discussion will feature real-world examples from software engineers working for NOAA’s National Centers for Environmental Information (NCEI) who have successfully refactored climate science applications with Polars. This session provides a unique opportunity to go under the hood of recently updated software projects, including: • Global Summary of the Month (GSOM), which provides monthly weather summaries for over 100,000 weather stations worldwide • International Best Track Archive for Climate Stewardship (IBTrACS), the most comprehensive global tropical cyclone dataset available • Datzilla, a system used to track data issues and corrections within NOAA’s environmental datasets By leveraging Polars, our teams significantly improved on the performance of the original Java programs. GSOM, for example, saw an 80% boost in processing speed! The refactoring wasn’t always straightforward, however. We’ll share the lessons we learned about writing Polars code that takes advantage of multi-core, parallel processing. This talk is for anyone interested in atmospheric science, but will be particularly relevant to software engineers and data professionals interested in learning more about refactoring code in Polars. While prior knowledge of Polars is not required, familiarity with Pandas, SQL, or spreadsheet macros would be helpful. During this session, Brodie will guide attendees through examples using Jupyter Notebook and VSCode. He’ll start with simpler usage cases and gradually build toward advanced techniques, including user-defined functions (UDFs) compiled into machine code using Numba. By the end of this talk, attendees will have a practical understanding of how to migrate legacy workflows to Polars and leverage its full potential to enhance performance. 
While the examples will primarily be related to climate science, the techniques covered in this session will help attendees write faster, more scalable code for any scientific application that requires large-scale data crunching. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/HK3AAQ/ Room 318 Brodie Vidrine PUBLISH P8B77T@@cfp.scipy.org

-P8B77T

Cubed: Scalable array processing with bounded-memory in Python en

20250709T131500 20250709T134500 0.03000

Cubed: Scalable array processing with bounded-memory in Python

Motivation: Serverless computing presents an ambitious vision for computation at scale in which users are relieved of complex cluster management, yet enjoy elastic scaling across arbitrarily large workloads (https://arxiv.org/abs/1702.04024). Scientific users especially want ease of use and predictable performance on unforeseen problems, so their ideal parallel processing system would also be robust to failures, resumable, and have predictable runtime memory usage. Design: We present Cubed (https://cubed-dev.github.io/cubed/), an open-source pure-Python parallel computing framework explicitly designed to provide these features for the case of N-dimensional array analytics. Cubed is a generalization of Pangeo’s Rechunker package (https://github.com/pangeo-data/rechunker), which parallelizes an all-to-all rechunk (shuffle) operation at arbitrary scale whilst respecting a preset memory limit by writing to persistent storage via Zarr. Cubed extends this to support the entire Python Array API Standard (https://data-apis.org/array-api/2023.12/), by expressing all chunked array operations as a series of bounded-memory “blockwise” or rechunk operations. This approach neatly sidesteps the issue of managing memory in designing and running distributed systems, where successful usage of general-purpose systems such as Dask often requires understanding and tuning the memory configuration before the shuffle. With every operation instead now a series of embarrassingly parallel steps, each chunk can be processed by an independent task, without the need for a complex scheduler. This is an ideal fit for a “serverless” computing model, which means there is no cluster for the user to manage. Cubed can run via various execution backends, including on a local machine and in the cloud. For the cloud it uses Lithops (https://github.com/lithops-cloud/lithops) as an abstraction layer to run on various cloud providers’ serverless services (e.g. AWS Lambda). 
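The bounded-memory “blockwise” idea described above can be sketched with plain NumPy: check up front that a single task fits within a memory budget, then process independent chunks one at a time. This is an in-memory toy, not Cubed's implementation (which reads and writes chunks through Zarr); the function `blockwise` and the parameters `chunk_rows` and `allowed_mem` are hypothetical names for this example, and the sketch assumes both inputs and the output share one dtype.

```python
import numpy as np


def blockwise(func, a, b, chunk_rows, allowed_mem):
    """Apply an elementwise `func` to two arrays one row-chunk at a time.

    Raises before doing any work if a single task would need more than
    `allowed_mem` bytes for its two inputs and one output.
    """
    if a.shape != b.shape:
        raise ValueError("inputs must have the same shape")
    rows_per_task = min(chunk_rows, a.shape[0])
    bytes_per_row = a[0:1].nbytes
    projected = 3 * rows_per_task * bytes_per_row  # two inputs + one output
    if projected > allowed_mem:
        raise MemoryError(f"task needs {projected} bytes, limit is {allowed_mem}")

    out = np.empty_like(a)
    for start in range(0, a.shape[0], chunk_rows):
        stop = start + chunk_rows
        # Each iteration is independent, so it could run as a separate task.
        out[start:stop] = func(a[start:stop], b[start:stop])
    return out


a = np.ones((1000, 100))
b = np.full((1000, 100), 2.0)
result = blockwise(np.add, a, b, chunk_rows=100, allowed_mem=1_000_000)
```

Because no chunk depends on another, the loop body is the unit that a serverless backend could hand to an independent worker.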
Cubed’s Plan objects can also be converted to Dask graphs or Apache-Beam pipelines, and then run via a dedicated service. Executors for Ray, Spark, and HPC are also in development. Integration with Xarray allows scientific users to try out Cubed without altering their analysis code, and opens the door to an array ecosystem in which users can seamlessly test different approaches to scaling up computation, as if they were running SQL queries against different query engines. Results: We compare running Cubed in the cloud against other array frameworks such as Dask, for various common array analytics workloads, at TB scales. Using recent graph optimizations within Cubed, we show that comparable performance to cluster-based frameworks can be achieved at reasonable cost, whilst requiring fewer choices from the user, and increasing reliability by respecting memory constraints. Conclusion: Cubed provides a new paradigm for array analytics at scale, using serverless services to parallelize large computations in the cloud, whilst allowing users to focus on scientific questions rather than configuring clusters. 
Links: * Repository: https://github.com/cubed-dev/cubed * Documentation: https://cubed-dev.github.io/cubed/ * Blog posts: https://xarray.dev/blog/cubed-xarray, https://medium.com/pangeo/optimizing-cubed-7a0b8f65f5b7 Previous presentations by the authors: Tom White: * Genomics | Life Science Lightning Talk | Tom White | Dask Summit 2021 (https://www.youtube.com/watch?v=qt6YsHoPpZs) Tom Nicholas: * Cubed: Bounded-Memory Serverless Array Processing in Xarray (https://youtu.be/kYc6hIddjwA?si=AvtCgn7hHJpKvy2u) * Enabling Petabyte-scale Ocean Data Analytics- Thomas Nicholas, Julius Busecke | SciPy 2022 (https://www.youtube.com/watch?v=ftlgOESINvo) Ryan Abernathey: * Pangeo Forge: Crowdsourcing Open Data in the Cloud- Ryan Abernathey | SciPy 2022 (https://youtu.be/sY20UpYCAEE?si=x9TP0VRKb-pa6ugV) PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/P8B77T/ Room 318 Tom White Tom Nicholas PUBLISH ELZLHP@@cfp.scipy.org

-ELZLHP

Using Discrete Global Grid Systems in the Pangeo ecosystem en

20250709T135500 20250709T142500 0.03000

Using Discrete Global Grid Systems in the Pangeo ecosystem

Traditionally, data in the geosciences have been sampled on a projection-based grid. This allowed for very simple and intuitive grids represented as two dimensions in memory, with the most popular being grids based on equirectangular projections (i.e. simple longitude / latitude grids). However, this simplicity comes with a number of downsides: - The cells are geometrically distorted, with the distortion growing with the distance from the projection center - There are discontinuities along the edges of the projection space (for cylindrical projections e.g. dateline, poles) - Data may be oversampled in some parts of the grid (e.g. in polar regions for cylindrical projections) This is usually less relevant for small local grids, but becomes more important as the size of the area of interest grows or when combining multiple local grids. DGGS aim to resolve this by equally and recursively subdividing the earth (approximated as a sphere, which can be extended to an ellipsoid) using flat surfaces such as triangles, rectangles, or hexagons, forming a hierarchy or tree of cells. Since these cells are unique, each cell can be assigned a numeric ID. This ID can be used to efficiently traverse the hierarchy, allowing for operations like up-/downsampling, neighbours search, and alignment / co-location. Working with DGGS cells also allows avoiding the issues caused by discontinuities like the dateline. In addition, the cells are addressed by unique indexing systems that typically follow a space-filling curve. This cell ID serves as a 1-D index in Xarray. Not all of these grid systems or libraries currently implement such a DGG reference system (DGGRS) that allows for seamless traversal or neighbourhood operations. It is important to note, though, that the choice of the concrete DGGS still requires a careful tradeoff: while they are better than planar projections, DGGS still cannot preserve area, shape/angles and distances at the same time. 
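The hierarchy traversal that cell IDs make possible (finding parents and children, and up- or downsampling along the tree) can be illustrated with a toy numbering scheme in which every cell splits into four children and IDs are counted per refinement level. This scheme is an assumption made for the example; real DGGS such as H3, S2, or HEALPix define their own ID encodings and should be handled through their libraries.

```python
def children(cell_id):
    """Four child IDs of a cell at the next refinement level."""
    return [4 * cell_id + k for k in range(4)]


def parent(cell_id):
    """ID of the containing cell one level coarser."""
    return cell_id // 4


def coarsen(values_by_cell):
    """Downsample a {cell_id: value} mapping by averaging sibling cells."""
    sums, counts = {}, {}
    for cell_id, value in values_by_cell.items():
        p = parent(cell_id)
        sums[p] = sums.get(p, 0.0) + value
        counts[p] = counts.get(p, 0) + 1
    return {p: sums[p] / counts[p] for p in sums}


# Cells 8..11 share parent 2; cell 12 belongs to parent 3.
fine = {8: 1.0, 9: 2.0, 10: 3.0, 11: 4.0, 12: 10.0}
coarse = coarsen(fine)
```

Since parent and child lookups are integer arithmetic on the ID alone, no geometry has to be recomputed when moving between levels.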
In particular, some DGGS are specialized on preserving shapes and distances and thus are best used for navigation, while others preserve areas and thus are better for geophysical applications. The most well-known examples for navigational DGGS include H3 and S2, while examples for area-preserving DGGS are HEALPix and various ISEA grids like ISEA7H or ISEA4T. The way DGGS are designed means that new tooling and algorithms are required. Additionally, the geospatial location and refinement level represented by the cell IDs require specialized libraries, which all have a unique API. `xdggs` is a library that extends `xarray`[^1] to provide a unified interface for interacting with various DGGS based on the cell IDs. It implements basic operations like computing the cell centers and cell boundaries from cell ids, refinement level (cf. resolution), aligning datasets on the same grid, selecting cells using geographic coordinates as well as interactive visualization using libraries, such as `lonboard`[^2]. [^1]: https://github.com/pydata/xarray [^2]: https://github.com/developmentseed/lonboard Project links: - github: https://github.com/xarray-contrib/xdggs - docs: https://xdggs.readthedocs.io - earlier presentation: https://discourse.pangeo.io/t/pangeo-showcase-xdggs-using-discrete-global-grid-systems-with-xarray/4728 PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/ELZLHP/ Room 318 Tina Odaka Jean-Marc Delouis Justus Magin Anne Fouilloux Benoît Bovy Alexander Kmoch PUBLISH ETWXLC@@cfp.scipy.org

-ETWXLC

tobac: Tracking Atmospheric Phenomena on Multiscale, Multivariate Diverse Datasets en

20250709T143500 20250709T150500 0.03000

tobac: Tracking Atmospheric Phenomena on Multiscale, Multivariate Diverse Datasets

The identification and tracking of atmospheric phenomena such as clouds has been desired since the first satellite images were collected. Although automated tracking techniques have been around since at least the 1980s, these techniques were generally slow and only worked on the datasets they were originally designed for. The Tracking and Object-Based Analysis of Clouds (tobac) package, an open-source Python package, was designed to allow researchers to identify, track, and perform object-based analyses of atmospheric phenomena on any input variable and on any input grid. This means that identification and tracking can be performed on any user-specified variable, whether from a weather satellite, radar, numerical model, or other data source. The flexible and modular design of tobac allows it to be used on any gridded dataset. For example, tobac has already been used to track features using brightness temperature, vertical velocity, radar reflectivity, dust concentration, trace gases, and lightning. After tobac’s original release in 2019, the original developers moved on to other projects, necessitating a revitalization effort in which formal governance structures were set up and the package structure was modernized to enable future use and growth. Development of tobac is done with open science principles in mind, with nearly all code reviewed through pull requests, much like scientific peer review, rather than committed directly to the repository. Recent updates to tobac have focused on enabling the use of even larger and more diverse datasets and on a pathway toward the identification and tracking of clouds and other atmospheric phenomena using multiple variables simultaneously. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/ETWXLC/ Room 318 Sean W. Freeman PUBLISH XMC8KU@@cfp.scipy.org

-XMC8KU

Generative AI in Engineering Education: A Tool for Learning, Not a Replacement for Skills en

20250709T152500 20250709T155500 0.03000

Generative AI in Engineering Education: A Tool for Learning, Not a Replacement for Skills

This talk was developed through interviews and discussions with engineering teachers and students, but the intended audience is broader: users of AI for education and training. Similar ethical issues arise in code development when users rely on generative AI tools to contribute code. It's important to discuss, as a learning group, the role of users versus the role of generative AI in the development of ideas and products. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/XMC8KU/ Room 318 Ryan C Cooper PUBLISH XWYKHA@@cfp.scipy.org

-XWYKHA

Embracing GenAI in Engineering Education: Lessons from the Trenches en

20250709T160500 20250709T163500 0.03000

Embracing GenAI in Engineering Education: Lessons from the Trenches

Introduction I'll begin by setting the context of my Engineering Computations course, a beginner Python course focused on computational thinking, numerical tasks, and problem-solving. I'll explain my initial motivation for incorporating generative AI through a RAG-enabled chatbot grounded in course materials, aiming to provide students with legitimate AI support rather than fighting against inevitable AI usage. The Experiment: Initial Implementation I'll detail how I introduced AI into the course structure, including: - The design and capabilities of the RAG-enabled chatbot - My expectations for how students would use AI as a productivity enhancer - The initial enthusiasm from students when told AI usage was permitted - My hope that AI would serve as a collaborative learning partner What Went Wrong: Unintended Consequences This section will candidly explore the rapid emergence of problematic student behaviors: - Students using one-shot prompts to solve entire assignments - The iterative trial-and-error approach with AI and autograders - Dramatic decrease in class attendance (down to 30%) - Students deprioritizing the course relative to others with traditional exams - The collapse of class dynamics when assessment changes were proposed The Illusion of Competence I'll analyze the cognitive phenomenon where AI usage created a false sense of mastery: - Definition and psychology behind the illusion of competence - How AI-completed assignments gave students high scores without understanding - Parallels between passive learning methods and superficial AI use - Student resistance to acknowledging learning gaps - The striking disparity between assignment scores and exam performance Assessment Challenges I'll discuss the difficult balance between maintaining academic integrity and embracing AI: - The student backlash when assessment changes were considered - My decision-making process regarding secured exam conditions - The modifications made to the autograder for exams - 
Analysis of exam results despite open AI access - The impact on course evaluations and student satisfaction Lessons Learned and Path Forward This section will outline key insights and strategies for future implementation: - Designing assignments that work with AI rather than against it - Creating regular low-stakes assessments to ensure genuine engagement - Explicitly teaching effective AI collaboration skills - Balancing innovation with structure and accountability - Reframing expectations for both instructors and students Practical Implementation Strategies I'll offer specific, actionable approaches for educators: - Example assignments designed for effective AI collaboration - In-class exercises that leverage AI while ensuring learning - Assessment structures that maintain academic integrity - Methods for monitoring and guiding productive AI use - Approaches to cultivate student buy-in for responsible AI practices Conclusion and Q&A I'll close by reframing this challenging experience as valuable learning for the education community, emphasizing the importance of continued experimentation and honest dialogue about both successes and failures in educational innovation. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/XWYKHA/ Room 318 Lorena Barba PUBLISH A3UYEC@@cfp.scipy.org

-A3UYEC

GPUs & ML – Beyond Deep Learning en

20250710T104500 20250710T111500 0.03000

GPUs & ML – Beyond Deep Learning

The utilization of GPUs to accelerate machine learning model training and inference offers the potential for enormous speedups, addressing the growing computational demands of modern data science. However, writing dedicated GPU-accelerated pipelines can be challenging and time-consuming, often requiring specialized algorithms and dealing with overheads like memory migration and just-in-time compilation. The Python data science community has largely adopted scikit-learn as the standard for traditional machine learning algorithms. Its estimator paradigm allows for easily composable pipelines, yielding high reusability and reproducibility. Users now have multiple avenues to accelerate these scikit-learn-style ML pipelines on GPUs without explicitly implementing GPU compute kernels. We demonstrate how to accelerate exemplary ML pipelines using two methods: scikit-learn’s experimental Array API Standard support layer, and [cuML](https://github.com/rapidsai/cuml), part of the NVIDIA RAPIDS Data Science stack, which provides accelerated estimator algorithms mirroring those in scikit-learn, umap-learn, and hdbscan. The recently released cuML zero code change acceleration enables pipeline acceleration without any code changes through a transparent API intercept layer that dispatches to accelerated estimator variants where beneficial. We demonstrate how to accelerate exemplary ML pipelines using these methods, highlighting differences in approach and performance at varying data sizes. We provide guidance to help practitioners choose suitable approaches for different problem types and sizes, showing a natural progression from small-scale prototyping to large-scale accelerated production pipelines. We show how to minimize cost and runtime by effectively mixing hardware for different problems, e.g., performing model training on GPUs and inference on CPUs. 
We further show that GPU acceleration can be highly beneficial even at the prototype stage where users benefit significantly from reduced iteration times. cuML zero code change acceleration is designed to seamlessly integrate with existing machine learning workflows. Users can invoke their unaltered Python scripts with a simple command-line interface or use a Jupyter magic command for Jupyter notebooks to enable the acceleration mode. This mode automatically accelerates any supported estimators by leveraging cuML's GPU-optimized implementations, while gracefully falling back to CPU execution for unsupported methods. This is achieved by transparently and selectively intercepting class instantiation and function calls, and rerouting them to their GPU-accelerated counterparts. This is analogous to the zero-code change mode provided by another RAPIDS Data Science library, [cuDF for pandas](https://developer.nvidia.com/blog/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes/) and [Polars](https://developer.nvidia.com/blog/polars-gpu-engine-powered-by-rapids-cudf-now-available-in-open-beta/). This seamless integration allows users to reuse existing code and generally focus on their machine learning tasks without worrying about the underlying hardware or the intricacies of GPU programming. We conclude the talk by discussing the current implementation status of cuML in terms of algorithmic coverage and parity with the scikit-learn, umap-learn, and hdbscan APIs, as well as future plans for expanding the cuML zero code change acceleration capability to other algorithms. By leveraging these advancements, practitioners can achieve significant performance improvements, enabling faster prototyping and more efficient production pipelines. The seamless integration with existing scikit-learn, UMAP, and HDBSCAN workflows ensures that users can easily transition to GPU-accelerated computing, maximizing the potential of their hardware. 
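The intercept-and-fall-back idea described above can be illustrated with a toy sketch. This is not cuML's actual mechanism or API. Every name here (`toy_lib`, `Mean`, `Median`, `FastMean`, `install`) is hypothetical, and the "accelerated" class is an ordinary Python class standing in for a GPU implementation.

```python
from types import SimpleNamespace


class Mean:
    """Stand-in for a CPU estimator."""
    backend = "cpu"

    def fit(self, values):
        self.result_ = sum(values) / len(values)
        return self


class Median:
    """Stand-in for a CPU estimator that has no accelerated counterpart."""
    backend = "cpu"

    def fit(self, values):
        ordered = sorted(values)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            self.result_ = ordered[mid]
        else:
            self.result_ = (ordered[mid - 1] + ordered[mid]) / 2
        return self


class FastMean(Mean):
    """Hypothetical accelerated variant exposing the same interface."""
    backend = "accelerated"


# Hypothetical "library" namespace that user code takes estimators from.
toy_lib = SimpleNamespace(Mean=Mean, Median=Median)

# Only Mean has an accelerated counterpart in this toy registry.
ACCELERATED = {"Mean": FastMean}


def install(library, registry):
    """Swap in accelerated classes where available; leave the rest untouched."""
    for name, fast_cls in registry.items():
        if hasattr(library, name):
            setattr(library, name, fast_cls)


def user_pipeline(library, values):
    """Unmodified 'user code': it only knows the library's public names."""
    mean = library.Mean().fit(values)
    median = library.Median().fit(values)
    return (mean.backend, mean.result_), (median.backend, median.result_)


install(toy_lib, ACCELERATED)
print(user_pipeline(toy_lib, [1, 2, 3, 10]))
# (('accelerated', 4.0), ('cpu', 2.5))
```

The user-facing function never changes. Only the names it resolves are rerouted, and anything without a registered counterpart keeps running on the original class.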
PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/A3UYEC/ Ballroom Simon Adorf PUBLISH 7XRAQD@@cfp.scipy.org

-7XRAQD

Polyglot RAG: Building a Multimodal, Multilingual, and Agentic AI Assistant en

20250710T112500 20250710T115500 0.03000

Polyglot RAG: Building a Multimodal, Multilingual, and Agentic AI Assistant

This session will be a deep dive into building a next-gen AI assistant that goes beyond static RAG implementations by integrating voice, multiple languages, and agentic capabilities. We’ll walk through key concepts, architectures, and code implementations, demonstrating live how to build a fully interactive chatbot that can handle multimodal inputs (voice & text), respond in multiple languages, and autonomously retrieve information beyond its dataset. Outline (30 minutes) Introduction & Problem Statement (5 min) What are the limitations of traditional chatbots? Why multimodal, multilingual, and agentic capabilities matter Overview of our AI assistant demo Tech Stack & Architecture Overview (5 min) Using Gradio for the UI (voice + text inputs, multilingual response) Leveraging Whisper for voice input processing and TTS for voice responses Implementing RAG with a vector database for retrieval Introducing LangGraph for agentic workflows Hands-On: Building the Core RAG Assistant (8 min) Implementing retrieval over structured and unstructured data Connecting to a flight database for real-time search Handling multilingual queries with embeddings & tokenization Adding Agentic Capabilities with LangGraph (8 min) Enabling the assistant to autonomously search flights online if not found in the database Creating dynamic workflows for retrieval + API calling Showcasing the agent’s ability to take actions beyond static RAG responses Live Demo: A Fully Functional AI Assistant (3 min) Interactive multilingual, multimodal Q&A session Testing voice queries, text-based queries, and real-time flight searches Takeaways & Future Improvements (1 min) How to extend the system for other industries & use cases The next steps for building even more autonomous AI systems PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/7XRAQD/ Ballroom Axel Sirota PUBLISH XNSXMY@@cfp.scipy.org

-XNSXMY

Numba v2: Towards a SuperOptimizing Python Compiler en

20250710T142000 20250710T145000 0.03000

Numba v2: Towards a SuperOptimizing Python Compiler

In today's landscape of AI/ML-dominated computing and ever-increasing programming complexity, flexible compiler tooling has become more critical than ever. Building on a decade of experience developing Numba, our team is creating a next-generation compiler (Numba v2) that supports composable term rewriting rules, making compiler development modular, extensible, and accessible to domain experts across AI, machine learning, and traditional numerical applications. **Key Challenges in the Current Landscape** Our experience has highlighted three critical challenges: 1. The Python ecosystem's strength lies in its diverse libraries, offering various implementations of core numerical routines and specialized hardware access through different APIs. As these libraries, APIs, and target hardware evolve, developers must continuously adapt their codebases to effectively utilize both existing and emerging capabilities. 2. While numerical codebases rely heavily on compiler technology for performance optimization, current Python compilation faces two significant limitations: * Python's language structure doesn't naturally align with the structured forms expected by common compilation technologies like MLIR/LLVM, complicating optimization efforts. * Traditional compiler technology depends on heuristics—predefined compiler passes optimized for general cases—forcing developers to over-specialize their programs through various flags and implementation "tricks". 3. Domain experts possess valuable optimization knowledge but lack straightforward methods to implement and share these optimizations without extensive source code modifications. **Numba v2: A New Approach** To address these challenges, we're developing a next-generation compiler that broadens access to compiler technology. Beyond modernizing the core compiler, Numba v2 introduces "rewrite rules" that allow users to express adaptations and optimizations for both existing and new code. 
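As a rough illustration of rewrite rules combined with cost-based selection, the sketch below enumerates every variant of a small expression reachable through three rules and keeps the cheapest. It is not Numba v2's API. The tuple encoding, the rules, and the cost table are assumptions made for the example, and the search is a naive exhaustive enumeration rather than an e-graph.

```python
# Expressions are nested tuples such as ("mul", "x", 2); leaves are names or ints.
# Hypothetical per-operation costs used to rank equivalent programs.
COST = {"mul": 4, "add": 2, "shl": 1}


def cost(expr):
    """Sum the operation costs over the whole expression tree."""
    if not isinstance(expr, tuple):
        return 0
    op, *args = expr
    return COST[op] + sum(cost(arg) for arg in args)


def mul2_to_add(expr):
    """x * 2  ->  x + x"""
    if isinstance(expr, tuple) and expr[0] == "mul" and expr[2] == 2:
        return ("add", expr[1], expr[1])
    return None


def mul2_to_shift(expr):
    """x * 2  ->  x << 1"""
    if isinstance(expr, tuple) and expr[0] == "mul" and expr[2] == 2:
        return ("shl", expr[1], 1)
    return None


def add_zero(expr):
    """x + 0  ->  x"""
    if isinstance(expr, tuple) and expr[0] == "add" and expr[2] == 0:
        return expr[1]
    return None


RULES = [mul2_to_add, mul2_to_shift, add_zero]


def rewrites(expr):
    """Yield every expression obtained by applying one rule at one position."""
    for rule in RULES:
        result = rule(expr)
        if result is not None:
            yield result
    if isinstance(expr, tuple):
        op, *args = expr
        for i, arg in enumerate(args):
            for new_arg in rewrites(arg):
                yield (op, *args[:i], new_arg, *args[i + 1:])


def optimize(expr):
    """Collect all variants reachable by rewriting, then pick the cheapest."""
    seen = {expr}
    frontier = [expr]
    while frontier:
        current = frontier.pop()
        for variant in rewrites(current):
            if variant not in seen:
                seen.add(variant)
                frontier.append(variant)
    return min(seen, key=cost)


program = ("mul", ("add", "x", 0), 2)
print(cost(program), optimize(program))
# 6 ('shl', 'x', 1)
```

Because each rule is a small standalone function, a new optimization is added by appending to `RULES`, with no change to the search or the cost model.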
By leveraging equality saturation—which explores all program variants derived from rewriting rules—Numba v2 achieves superoptimization through cost-based extraction. These rules serve as shareable, distributable, domain-specific optimizations that enhance both new and established workflows through simple recompilation. **Practical Applications and Benefits** We will demonstrate how this approach enhances machine learning and numerical computing by unlocking new optimization opportunities, including: * **Numerical Approximation:** High tolerance for floating-point imprecision enables optimizations beyond traditional `-ffast-math`, incorporating ISA-specific techniques that push efficiency further. * **Automatic Hardware Acceleration:** Numba v2 can seamlessly offload NumPy array expressions to GPUs, optimizing performance without requiring explicit user intervention. * **Energy-efficient Computation:** New floating-point optimizations enable replacing fundamental operations—such as multiplication—with more efficient variants (such as L-Mul), potentially reducing power consumption in deep learning models and numerical applications. This talk is essential for numerical codebase maintainers, domain experts interested in sharing optimization knowledge, and anyone working with hardware acceleration or superoptimization techniques. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/XNSXMY/ Ballroom Siu Kwan Lam PUBLISH GJACX9@@cfp.scipy.org

-GJACX9

Reproducible Science Made Easy: Package Management with Pixi en

20250710T150000 20250710T153000 0.03000

Reproducible Science Made Easy: Package Management with Pixi

[Pixi](https://pixi.sh) is a new way of managing Conda and PyPI packages in a project. It's built on all the knowledge gained from building Mamba, and it draws inspiration from, among others, Cargo (Rust), PNPM (Node.js), and Poetry, combining their best features with a cross-platform task system. This allows users to build projects that work seamlessly on Linux, macOS and Windows. Pixi can complement Docker or even replace it in many cases, reducing overhead – which is especially important in cases like large data projects and HPC. With Pixi and the underlying technology, we have completely rebuilt the Conda ecosystem in the Rust programming language, making it faster and more maintainable for the future. It's 100% Open-Source with a permissive BSD-3-Clause License. The key features of pixi are: - `pixi global`: globally install your favorite tools and applications - `pixi project`: separate, isolated projects that come with lockfiles, task descriptions, and their own isolated dependencies. Add the `pixi.toml` and `pixi.lock` files to your git repository for perfect reproducibility of your scientific projects. - `pixi build`: because we know that scientists deal with hairy, compiled dependencies, pixi build will deal with them for you. Compile Fortran, C/C++, Rust projects and Python bindings by using the vast compiler ecosystem from conda-forge. The performance increase over `conda` with `pip` in the worst case is 200% but more often close to 1000% faster or more [1][2]. This talk will cover the following topics: - **Introduction:** What is package management, conda vs. pip, … - **Pixi overview:** features and benefits of pixi for the scientific use case - **Live demo:** Pixi in action by looking at big projects like `scipy` as an example. - **Practical takeaways:** fast workflows for data scientists, and reliable environments for DevOps. ### Audience This talk will be of interest to everyone who uses or builds software in the scientific or data-science community. 
It is especially relevant to those who run their environments on multiple machines, e.g. CI, HPC, Docker. Whether you are building libraries or applications, or using them to run notebooks, this talk can kickstart a more efficient workflow for you. [1]: Simple pixi vs conda cached installation benchmark on GitHub: https://github.com/ruben-arts/pixi_bench/actions/runs/13334032029 [2]: Blog post on faster repodata fetching: https://prefix.dev/blog/sharded_repodata PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/GJACX9/ Ballroom Ruben Arts Wolf Vollprecht PUBLISH KAPJYZ@@cfp.scipy.org

-KAPJYZ

Zamba: Computer vision for wildlife conservation en

20250710T155000 20250710T162000 0.03000

Zamba: Computer vision for wildlife conservation

#### Motivation The conservation sector has embraced camera traps as a core technology to enable measures of species occupancy, relative or absolute abundance, population trends over time, patterns in animal behavior, and more. While the downstream uses of camera traps vary, all camera trap users share a common pain point: camera traps cannot automatically label and filter the species they observe. It therefore takes the valuable time of teams of experts, or thousands of citizen scientists, to manually process this data and identify videos and images of interest. Automated, accurate, and accessible species detection unlocks improved monitoring of animal populations and evaluation of the impacts of conservation efforts on species abundance. Faster processing of camera trap data means that conservationists can do these assessments in months rather than waiting years for their results. This enables both faster conservation interventions as well as faster evaluation of conservation measures, enabling a course change if something is not proving effective. It also supports conservationists in collecting more images and from more locations. #### Methods [Zamba](https://github.com/drivendataorg/zamba) is an open source Python package that leverages machine learning and computer vision to automate time-intensive processing tasks for wildlife camera trap videos and images. It allows conservationists to direct their time toward more complex secondary analysis and making evidence-based conservation decisions. In this talk, we will cover: * Zamba’s origins in the winning approaches from the [Pri-matrix Factorization](https://www.drivendata.org/competitions/49/deep-learning-camera-trap-animals/) machine learning challenge, hosted by DrivenData, and how crowdsourcing top methodologies can kickstart the development of such tools. * An overview of Zamba's capabilities for processing camera trap images and videos and how machine learning supports conservation use cases. 
* Why videos are a much more difficult data modality to process than images and how Zamba approaches this with multi-stage processing and a student–teacher model. * The importance of custom model training functionality for handling the wide variety of habitats and species under study by conservation efforts around the world. * How designing for conservationists, who aren't programmers and shouldn't need to be, led to the development of [Zamba Cloud](https://www.zambacloud.com/), a web service (also developed with Python) that provides a code-free interface to Zamba's machine learning capabilities. #### Results Zamba has pretrained models for species classification, depth estimation, and blank removal. In addition, users can train custom models using their own labeled data to identify species in their particular habitats. Zamba’s base image model was trained on over 10 million camera trap images from [lila.science](https://lila.science/) capturing 178 species groupings. The video models were trained on over 250,000 expert-labeled videos from 14 countries in West, Central, and East Africa capturing 30 species (as well as blanks and humans). Using a [holdout set of 30,000 videos](https://github.com/drivendataorg/zamba/blob/master/docs/docs/models/td-full-metrics.md), we found a top-1 accuracy of 82% and a top-3 accuracy of 94% for the species classification models. Using the blank detection model, a true positive rate of 80% for blank detection can be achieved with a 10% false negative rate. To date, over 300 users from around the globe have used Zamba Cloud to process more than 1.1 million videos. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/KAPJYZ/ Ballroom Jay Qi Emily Dorne PUBLISH EMLLYF@@cfp.scipy.org

-EMLLYF

Accelerated DataFrames for all: Bringing GPU acceleration to pandas and Polars en

20250710T163000 20250710T170000 0.03000

Accelerated DataFrames for all: Bringing GPU acceleration to pandas and Polars

The job of a Python data scientist is challenging enough without worrying about finding fast enough tools for the task at hand. GPU acceleration provides an attractive solution, allowing users to continue using familiar APIs that implement transparent GPU execution rather than having to seek better performance from new, unfamiliar APIs. We built the cuDF GPU DataFrame package to satisfy this need, but ultimately we found that the gaps in cuDF’s coverage of the pandas API were enough of a barrier to prevent adoption for many users. In this talk, we discuss how we have removed that barrier by providing seamless GPU acceleration for both the pandas and Polars DataFrame libraries. Our solution for pandas was the cudf.pandas plugin, which allows users to GPU-accelerate their code by simply loading a module or using a Jupyter magic command. It accomplishes this task using an intricate proxying scheme that uses Python’s import machinery to replace pandas functions and classes with their equivalents in cuDF. This complex approach is taken out of necessity: the pandas eager execution model necessitates a tight coupling between its front-end and the execution layer, which in turn means that there is no low-level centralized entry point at which to inject GPU execution. We will discuss how, despite the complexity, the cudf.pandas approach works remarkably well at transparently accelerating as much of pandas as cuDF supports while seamlessly running everything else on the CPU. The Polars GPU engine offers many of the same core features, such as a high-level on-off switch and easy profiling of GPU utilization using the Polars lazy API. Under the hood, however, cudf-polars is quite different from cudf.pandas. Polars has a lazy API that allows for a clear separation between the front-end and the underlying execution engine, allowing a direct translation of the Polars expression IR into a set of concrete cuDF operations to run. 
As part of this talk we will do a deep dive comparing the implementation of cudf-polars to cudf.pandas. The actual execution in the Polars GPU engine is handled by pylibcudf, a new library of low-level data processing primitives that allows us to decouple core algorithms from the high-level semantics of different engines. pylibcudf now serves as the core engine for both cudf-polars and the classic cuDF package (and by extension, cudf.pandas), and in addition it can be used directly by power users seeking peak performance or library developers accelerating other tools. In this talk, you will see plenty of examples of usage so that you can learn to speed up your own data analyses. You will get a peek under the hood of how cudf.pandas and cudf-polars work in ways that will help you understand how best to accelerate your workflows using each library and what tools are available to you to debug slow performance. Finally, library developers (particularly of other DataFrame libraries) will get a sense for how they can bring GPU acceleration to their own packages. If you are interested in the guts of how we accelerate existing data processing libraries using GPUs, or if you are simply interested in learning how to make your workflows faster today, this talk is for you. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/EMLLYF/ Ballroom Vyas Ramasubramani PUBLISH XDFZXG@@cfp.scipy.org

-XDFZXG

Lightning Talks en

20250710T172000 20250710T182000 1.00000

Lightning Talks

PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/XDFZXG/ Ballroom PUBLISH UDTLL7@@cfp.scipy.org

-UDTLL7

Can Scientific Python Tools Unlock the Secrets of Materials? The Electrons That Machine-Learning Can't Handle en

20250710T104500 20250710T111500 0.03000

Can Scientific Python Tools Unlock the Secrets of Materials? The Electrons That Machine-Learning Can't Handle

Understanding how materials behave at the atomic level is crucial for designing new technologies, but it's incredibly challenging. Predicting chemical reactions requires solving complex quantum mechanical equations, which are computationally expensive, limiting simulations to a few hundred atoms at most. This makes it difficult to compare theoretical predictions with real-world experimental observations. Simpler methods exist, but they often lack the accuracy needed to understand the crucial details of bond formation and breakage. Scientific Python has revolutionized materials science. Packages like NumPy, SciPy, and Matplotlib, along with specialized tools like ASE (for manipulating atomic structures) and MDAnalysis (for analyzing simulations), empower researchers to tackle this problem across vastly different scales. We can now simulate everything from massive systems using simplified models to highly accurate, but computationally expensive, methods like Density Functional Theory (DFT). A promising new approach uses Machine Learning Interatomic Potentials (MLIPs). These learn from detailed DFT calculations to create fast simulations of large systems, but often lack the electronic structure information needed for detailed comparison with experiments. My research addresses this limitation by leveraging Density Functional Tight Binding (DFTB), a computationally efficient method that retains essential information about the electrons. In this talk, I will present my work on DFTB, demonstrating how its power to bridge the gap between theory and experiment was only possible thanks to scientific Python tools. To further promote its use, we've created a public GitHub repository (https://github.com/Voss-Lab/SK_repository) containing parameters for a wide range of materials, making it easier to compute electronic properties and compare them directly with experimental data such as XPS measurements. 
PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/UDTLL7/ Room 315 Filippo Balzaretti PUBLISH RXCE3F@@cfp.scipy.org

-RXCE3F

Advanced Machine Learning Techniques for Predicting Properties of Synthetic Aviation Fuels using Python en

20250710T112500 20250710T115500 0.03000

Advanced Machine Learning Techniques for Predicting Properties of Synthetic Aviation Fuels using Python

Synthetic aviation fuels (SAFs), derived from biological sources, represent a critical opportunity to enhance efficiency in the aviation industry. However, the high costs and volume requirements often delay the experimental testing of novel blends, increasing the risk of scaling up SAFs that may underperform. This presentation addresses these challenges by showcasing how advanced machine learning techniques can revolutionize the development process for SAFs. In recent years, machine learning has emerged as a powerful tool for developing property prediction models that accelerate SAF development by enabling early predictions of key fuel properties. However, many existing models face limitations, including reliance on complex analytical techniques, narrow focus on specific property ranges, and lack of interpretability. In 2020, we presented our approach at SciPy, which enabled the prediction of properties for over 10,000 molecules using molecular descriptors, later published in Fuel (https://doi.org/10.1016/j.fuel.2022.123836). In 2023, we introduced our preliminary method for predicting high-throughput aviation fuel properties using FTIR spectra, focusing on feature cleaning and transformation while evaluating dimensionality reduction techniques for spectra. This year, we present our finalized approach, which employs non-negative matrix factorization (NMF) to decompose FTIR spectra into interpretable features. By integrating these NMF features with property data in ensemble models, we achieve accurate predictions of fuel properties and uncover significant correlations with blend composition. This enhanced methodology not only improves prediction accuracy but also provides critical insights into the relationships between fuel composition and performance. Our presentation will detail the refined workflow for training property prediction models, using libraries such as NumPy, pandas, and scikit-learn. 
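For orientation, the NMF decomposition mentioned above is commonly written as follows. The symbols are assumptions introduced here, not taken from the abstract: $X$ holds one non-negative spectrum per row ($n$ samples by $m$ wavenumbers), $H$ holds $k$ non-negative component spectra, and $W$ holds the per-sample weights of those components. The Frobenius-norm objective shown is one common choice; the abstract does not state which loss the authors use.

$$
X \approx W H, \qquad W \in \mathbb{R}_{\ge 0}^{n \times k}, \quad H \in \mathbb{R}_{\ge 0}^{k \times m}, \qquad \min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2
$$

Because every entry of $W$ and $H$ is non-negative, each spectrum is reconstructed as a purely additive mix of components. This is what lets the rows of $W$ serve as interpretable features for a downstream model.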
Key aspects will include the mathematical intuition of NMF and its features, its practical implementation, and the challenges encountered with pipeline optimization tools like TPOT, which can yield suboptimal results compared to a more tailored approach. We will conclude by presenting our model results, emphasizing their interpretability and the insights gained regarding blend composition and predicted properties. By demonstrating how our open-source web tool (https://feedstock-to-function.lbl.gov/) can significantly reduce the time and costs associated with bioprocess optimization and the scale-up of SAFs, we highlight the potential for scientific Python to address pressing research challenges in synthetic aviation fuel development. This talk is positioned to engage the scientific computing community by illustrating how computational techniques can solve complex research problems, ultimately contributing to the advancement of new energy solutions. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/RXCE3F/ Room 315 Ana Comesana PUBLISH VWB7YP@@cfp.scipy.org

-VWB7YP

Probing the Hidden World of Battery Chemistry With X-rays en

20250710T142000 20250710T145000 0.03000

Probing the Hidden World of Battery Chemistry With X-rays

# Introduction and Background Batteries are ubiquitous in modern life, from enabling long-lasting mobile devices to stabilizing the electric grid. In many ways it is amazing they exist at all, since stored chemical energy will always tend towards equilibrium (i.e. a dead battery). Improving the charge-storage capabilities of modern batteries requires continued research to understand the chemistry driving the charge storage mechanisms. Batteries form a hierarchical system: the overall performance of a given cell depends on the behavior at smaller scales. For example, a typical Li-ion battery cathode contains countless ~10 micrometer-sized secondary particles whose behavior is driven by the individual ~100 nm primary particles of which they are composed. These primary particles, in turn, depend on the behavior of their individual atom constituents. As a result, the progress of battery technology depends on improving our understanding of the chemical processes that are dominant at each of these length-scales, as well as how they relate to one another[1]. # Methods X-rays are a versatile probe of materials chemistry that is well-suited to battery research. For example, X-rays can be used to form a three-dimensional representation of the inside of a battery, similar to a medical CT scan. Furthermore, X-ray spectroscopy measures the amount of light transmitted based on its wavelength, revealing element-specific states of charge. A battery can then be charged while repeatedly performing spectroscopy and/or imaging measurements to reveal the structural and chemical properties as a function of the state of charge of the battery, allowing us to see the chemistry taking place within. For one battery cell, this produces a data-set with four to five unique axes, some of which may even be complex-valued. To be useful, these data must be cleaned, aligned, analyzed, and visualized using tools such as scipy and scikit-image. 
The individual pixel-wise spectra must then be converted to a scientifically meaningful number, requiring more specific tools. This must be done in a way that is reliable and reproducible, to ensure the scientific conclusions are properly supported. The *xanespy* package combines these Python packages in ways suited for analyzing these types of data[2]. # Results The Python ecosystem is well-suited for these analysis tasks, producing maps showing the oxidation states of individual particles within a battery cathode over time[3]. These maps revealed that individual secondary particles did not follow the predictions of thermodynamics. Instead, each particle underwent an initial latent period followed by rapid oxidation to its fully charged state. Physics simulations showed that the best explanation for this behavior is a change in the kinetics of how lithium leaves the cathode and enters the electrolyte at the surface of the battery particles. Additional coherent imaging (ptychography) experiments probed down to the level of individual particles ~100nm in diameter[4]. After extracting spectral signals using a Bayesian optimization, we observed gradients in the state of charge for these particles, from the surface up to several tens of nanometers into the particle, with the outer atomic layers showing discharged battery material even when the battery was fully charged. Taken together, these behaviors translate into inefficient energy storage in the battery. # Conclusions The success of these experiments shows the versatility of the scientific Python ecosystem for handling these otherwise unwieldy data sets. Alternative tools are either designed with a specific data structure in mind and are therefore not adaptable to novel experiments, or else generic and so are not tailored to these kinds of analyses. Synchrotron radiation sources are continuing to improve, providing faster and more precise measurements, resulting in ever-growing data-sets. 
New tools are now being developed that will allow us to both execute these experiments effectively and access the resulting data in an efficient manner. # References [1] https://doi.org/10.1021/acs.chemmater.6b05114 [2] https://github.com/canismarko/xanespy [3] https://doi.org/10.1002/aenm.202300895 [4] https://doi.org/10.1021/acs.chemmater.0c01986 PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/VWB7YP/ Room 315 Mark Wolfman PUBLISH NEGTPD@@cfp.scipy.org

-NEGTPD

Noise-Resilient Quantum Computing with Python en

20250710T150000 20250710T153000 0.03000

Noise-Resilient Quantum Computing with Python

Quantum computing introduces a fundamentally different paradigm of computation, but just like classical computing, it is prone to errors. However, debugging quantum computations is far more challenging. In classical computing, when a program fails or returns incorrect results, we can inspect memory, log intermediate values, or rerun the program step by step to diagnose the issue. In contrast, quantum computations operate on delicate quantum states that collapse when measured, making it impossible to directly observe their evolution. This fundamental limitation means that traditional debugging techniques, like printing intermediate values or checking for errors after execution, do not translate directly to quantum systems. This talk will explore the nature of quantum noise, why debugging quantum computers is fundamentally different from classical systems, and how Python has enabled researchers to develop techniques for mitigating errors. We will begin by breaking down the key sources of quantum noise, such as decoherence, crosstalk, and gate imperfections, and discuss how they corrupt computations in ways that are difficult to detect. Unlike classical errors, quantum errors can occur probabilistically and affect computations in subtle ways that make identifying and correcting them uniquely difficult. To address these issues, researchers have developed various error mitigation techniques that allow us to infer and reduce the impact of noise without direct observation. These include zero-noise extrapolation, which runs circuits at different noise levels and estimates an idealized result, and probabilistic error cancellation, which models and statistically corrects for noise effects. Other techniques, such as dynamical decoupling and noise-aware compilation, take a more proactive approach by structuring computations in ways that naturally reduce error rates. 
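Zero-noise extrapolation, as described above, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the interface of any particular package. `noisy_expectation` is a hypothetical stand-in for running a circuit at an amplified noise level, its linear decay model is invented for the example, and the fit is an ordinary least-squares line.

```python
def noisy_expectation(scale, ideal=1.0, error_rate=0.05):
    """Toy noise model (an assumption): the signal decays linearly with noise scale."""
    return ideal * (1.0 - error_rate * scale)


def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def zero_noise_extrapolate(run, scale_factors):
    """Evaluate `run` at each noise scale and extrapolate the fit to scale 0."""
    values = [run(scale) for scale in scale_factors]
    _, intercept = linear_fit(scale_factors, values)
    return intercept


scales = [1.0, 2.0, 3.0]  # scale 1.0 is the unmodified (already noisy) run
raw = noisy_expectation(1.0)
mitigated = zero_noise_extrapolate(noisy_expectation, scales)
print(f"raw={raw:.3f} mitigated={mitigated:.3f}")
```

In this toy model the decay is exactly linear, so the extrapolated value recovers the ideal result. On real hardware the dependence on noise scale is not known in advance, so the choice of fit (linear, polynomial, exponential) is itself a modelling decision.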
While these techniques were first published in the literature, the quantum software community was quick to follow with implementations in Python. Enabling fast prototyping, simulation, and benchmarking, Python has become the de facto standard for error mitigation protocols, and even quantum computing at large. The flexibility and ease of use of the language have been crucial in a rapidly evolving field, where new error mitigation methods emerge frequently and require testing and validation. Attendees will gain insight into how debugging quantum computers differs from debugging classical systems, why noise is such a significant challenge, and how Python is enabling progress in error mitigation research. This talk is designed for researchers, engineers, and Python developers interested in quantum computing more broadly. No prior experience with quantum computing is required. By the end of the talk, attendees will have a deeper understanding of the complexities of quantum noise and practical approaches for dealing with it, as well as how they can use Python to contribute to this growing field. Nate has presented quantum computing to general science audiences at PyData (https://www.youtube.com/watch?v=8ZfyOUuBv3g) and the City University of Seattle (https://www.youtube.com/watch?v=9zB3aCDII7Q), and is the lead developer of the leading Python package for quantum error mitigation: mitiq (https://mitiq.readthedocs.io/, 210k+ downloads). PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/NEGTPD/ Room 315 nate stemen PUBLISH JBNR9A@@cfp.scipy.org

-JBNR9A

VirtualiZarr and Icechunk: How to build a cloud-optimised datacube of archival files in 3 lines of xarray en

20250710T155000 20250710T162000 0.03000

VirtualiZarr and Icechunk: How to build a cloud-optimised datacube of archival files in 3 lines of xarray

Many scientific datasets, including level-3 geoscience data products, are distributed as collections of thousands of individual files or granules which makes it difficult to address the data as a coherent datacube. Worse, the data is often stuck in pre-cloud archival file formats, precluding efficient access from cloud object storage. VirtualiZarr [1] is a python tool for creating “virtual” Zarr datacubes, enabling cloud-optimized access to a range of archival file formats (e.g. netCDF and TIFF) without copying the original data. Data is accessed via Icechunk, an open-source cloud-native transactional storage engine, which can store “virtual Zarr chunks” in the form of references to byte ranges in other objects. Virtualization provides a win-win-win for users, data engineers, and data providers: Users access fast-opening zarr-compliant stores that work performantly out of the box with libraries like Xarray and Dask, data engineers need only add a lightweight virtualization layer on top of existing data (even without the data provider’s involvement), and data providers don’t have to modify their legacy files to provide cloud-optimized access. VirtualiZarr works by creating a metadata-only representation of files in legacy formats, including references to byte ranges inside specific chunks of data on disk. VirtualiZarr is similar to the Kerchunk package (which inspired it), except that it uses an array-level representation of the underlying data, stored in “chunk manifests”. Metadata-only references to data are saved to disk either via the Kerchunk on-disk reference file format, or using the Icechunk transactional storage engine, which facilitates later cloud-optimized access using Zarr-Python v3 and Xarray. This approach has three advantages: 1. An array-level abstraction means users of VirtualiZarr do not need to learn a new interface, as they can use Xarray to manipulate virtual representations of their data to arrange the files comprising their datacube. 2. 
“Chunk manifests” enable writing the virtualized arrays out as valid Zarr stores directly (using Icechunk), meaning Zarr API implementations in any language can read the archival data directly. Zarr as a “universal reader” will allow data providers to serve all their archival multidimensional data via a common high-performance interface, regardless of the actual underlying file formats. 3. The integration with Icechunk allows “virtual” and “native” chunks to be treated interchangeably, so that an initial version of a datacube pointing at archival file formats can be gradually updated with new icechunk-native chunks with the safety of ACID transactions without the data users needing to make any distinction. This talk is useful to anyone who wants to make large scientific datasets publicly available via the cloud. You will learn how to use VirtualiZarr and Icechunk to create a virtual Zarr datacube, using the [C]Worthy Ocean Alkalinity Enhancement Efficiency Map [2] dataset as an example. This dataset consists of ~50TB of data spread across ~500,000 netCDF files, so virtualizing it requires complicated array manipulations, and using serverless frameworks to generate references to many files at scale. Nevertheless, VirtualiZarr’s xarray-compatible API makes this possible in essentially just 3 lines of code. [1] https://github.com/zarr-developers/VirtualiZarr [2] https://carbonplan.org/research/oae-efficiency-explainer PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/JBNR9A/ Room 315 Tom Nicholas PUBLISH Z7AL7K@@cfp.scipy.org

-Z7AL7K

The brave new world of slicing and dicing Xarray objects. en

20250710T163000 20250710T170000 0.03000

The brave new world of slicing and dicing Xarray objects.

Xarray is an open-source Python project that enables its users to use a dataset's metadata for easy, expressive, and readable analytics. For example, one can use dimension names and coordinate labels to select and subset data: `.sel(time="2024-01-01")`. In this way, Xarray users can express themselves quite naturally in the labeled coordinate system of the data, rather than the unlabeled coordinate system of bare arrays, e.g. in NumPy: `array[0, ...]`. The underlying functionality is enabled by an "index" (in this case, a Pandas Index). Until very recently, Xarray's indexing features were limited to indexing along one-dimensional coordinates, leaving users to construct ad-hoc solutions for more complicated grids. Funded by CZI and NASA grants, steady work over the past few years has let us relax this restriction. Xarray now allows a user to associate "custom indexes" across multiple coordinate variables and dimensions of an Xarray dataset, unlocking an incredible range of use cases for Xarray's geoscience users, from handling complex grids to entirely new data models! We will take a whirlwind tour of specific examples that demonstrate the power and flexibility now available: 1. handling time and space intervals, periodic boundaries, and units; 2. using tree structures for geographical lat-lon indexing from simpler curvilinear grids to more complex discrete global grid systems ([`xdggs`](https://xdggs.readthedocs.io/en/latest/)); 3. handling georeferenced coordinate spaces for raster data with affine transforms; 4. handling very large coordinates with lazy out-of-core indexes; 5. enabling sophisticated time-dimension indexing of weather Forecast Model Run Collections, such as selecting a "best estimate" time series or selecting all forecasts for a given future time instant, and [more](https://www.unidata.ucar.edu/presentations/caron/FmrcPoster.pdf);
6. enabling the "vector data cube" model that marries the raster and vector data worlds by allowing one to index a dimension using Shapely geometries ([`xvec`](https://xvec.readthedocs.io/en/stable/)). We expect this talk to spark discussion of possible new use cases and to stimulate collaborations and contributions during the sprints. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/Z7AL7K/ Room 315 Deepak Cherian Justus Magin Benoît Bovy PUBLISH AARA39@@cfp.scipy.org

-AARA39

Xarray across biology. Where are we and where are we going? en

20250710T104500 20250710T111500 0.03000

Xarray across biology. Where are we and where are we going?

**Background** Biological datasets come in a wide variety of shapes, sizes, and types. However, there are common challenges faced across biology when dealing with complex structured data, and keeping track of real-world coordinates. Xarray provides a powerful solution to these issues. Additionally, Xarray provides first class support for HDF and Zarr files, formats already in wide use in biology. Despite these advantages, Xarray has yet to see widespread adoption in the biological research community. Developers of many biology-focused Python packages want to see greater adoption of Xarray as a data model and tool kit for common challenges. This presentation will explore what the challenges for greater adoption have been, what has changed, and present a roadmap for future work. **Motivation** To motivate using Xarray for biologists, I will demonstrate its usage in biological workflows through common workflows: - Examples from the [sgkit](https://sgkit-dev.github.io/sgkit/latest/) project - Managing segmentation of timelapse microscopy experiment - Xarray as an interface to dask and zarr - Keeping track of metadata - Examples of processing neurophysiology data (similar to [this](https://xarray.dev/blog/xarray-for-neurophysiology)) **What Has Limited Adoption** Based on interviews with biological software thought leaders conducted in Spring 2025 I will discuss the major factors that have prevented wider adoption of Xarray in biological research. - Social - Lack of documentation and examples with biological data - Fragmentation of biology software between Python, Java, Matlab, etc. 
- Lack of clear reference implementations - Technical - Integration (or lack of) with existing biology software tools - Data model incompatibility - Hierarchical data - [compatibility](https://github.com/ome/ngff/issues/48) with [ome-zarr](https://ngff.openmicroscopy.org/latest/) - [Flexible coordinates](https://github.com/pydata/xarray/issues/1094) - important for large volumetric imaging **Recent Improvements** Recent improvements in Xarray and developments in downstream packages will enable wider adoption of Xarray. For each point, I will demonstrate with a brief real-world example use case. - Technical: - [`xarray.DataTree`](https://docs.xarray.dev/en/latest/user-guide/hierarchical-data.html) aligns the data model with the next-generation microscopy data format (OME-zarr) - Multiscale imaging - Flexible Transformations - With an [implementation merged](https://github.com/pydata/xarray/pull/9543) in February 2025, Xarray now supports functional definitions of coordinates. This enables using Xarray for use cases such as volumetric imaging - Increased usage in user-facing packages - AllenSDK - napari - Social: - This presentation is part of a multi-year effort to increase the visibility of Xarray to biologists - Addition of biology examples to Xarray documentation - Blog post series demonstrating the advantages of Xarray for common biological workflows. This will also be a call for interested writers **Roadmap** To conclude, I will discuss how the biology community and the Xarray community can work together to accelerate the adoption of Xarray in biology. This roadmap for the future will include: efforts to better integrate Xarray with existing projects, education efforts, and feature development in Xarray. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/AARA39/ Room 317 Ian Hunt-Isaak PUBLISH CWJ7XR@@cfp.scipy.org

-CWJ7XR

An Active Learning plugin in Napari to fine tune models for large-scale bioimage analysis en

20250710T112500 20250710T115500 0.03000

An Active Learning plugin in Napari to fine tune models for large-scale bioimage analysis

## Introduction This talk introduces the “napari-activelearning” plugin and briefly describes the implemented human-in-the-loop method from Active Learning. The talk is intended for people interested in using transfer learning to fine-tune models, such as Cellpose, for large-scale image analysis. The target audience are attendees with some understanding of machine learning in general, but not any advanced knowledge on the transfer learning field. Attendees will learn about the Active Learning framework implemented in this plugin and about the capabilities of this open-source project. The code of the “napari-activelearning” plugin can be found at https://github.com/TheJacksonLaboratory/activelearning or be installed via pip from its PyPI project website. A previous demo of this plugin was presented during the Virtual I2K 2024 (https://www.youtube.com/watch?v=mllzxHQuIY0&list=PLdA9Vgd1gxTbvxmtk9CASftUOl_XItjDN&index=12). ## Motivation Adoption of deep learning methods for image analysis has grown exponentially in recent years. Part of such success is thanks to transfer learning methods that enable using models trained with large volumes of data in tasks where annotated data is scarce. However, transferring learning from one task to another still requires human-labeled data of quality. This becomes a challenge when the target domain offers large volumes of data that could overwhelm the annotator, e.g. tissue labeling in high-resolution microscopy Whole Slide Images (WSI). The “napari-activelearning” plugin was developed with the purpose of easing the constraints of applying transfer learning methods to large volumes of large-scale image data. ## Methods Next Generation File Formats, such as Zarr, have been increasingly adopted by the bioimage analysis community. Zarr format stores large-scale image data as independent n-dimensional tiles, also called chunks, either on local disk or cloud storage. 
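To make the chunked-storage idea concrete, the following sketch works out which chunks of a tiled image overlap a requested region, which is what determines how much data has to be read. It is plain Python and does not use the Zarr API; `chunks_for_region` and the image, chunk, and region sizes are hypothetical values chosen for illustration.

```python
from itertools import product


def chunks_for_region(region, chunk_shape):
    """Return the grid index of every chunk that overlaps `region`.

    `region` holds one (start, stop) pair per axis, with `stop` exclusive.
    """
    per_axis = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(region, chunk_shape)
    ]
    return list(product(*per_axis))


image_shape = (100_000, 80_000)       # hypothetical whole-slide image, in pixels
chunk_shape = (1024, 1024)            # hypothetical chunk (tile) size
region = [(2000, 3500), (500, 1500)]  # region of interest: rows, then columns

needed = chunks_for_region(region, chunk_shape)

# Ceiling division per axis gives the total number of chunks in the image.
total = 1
for extent, size in zip(image_shape, chunk_shape):
    total *= -(-extent // size)

print(f"{len(needed)} of {total} chunks are needed for this region")
```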
By using chunks as units of storage, the amount of data that must be loaded into memory when accessing specific regions of the image is reduced. This is useful when applying a model for inference on larger-than-memory image data. On the other hand, to reduce the amount of data presented to the human annotator, concepts from the Active Learning framework are used. This field studies methods for human-in-the-loop learning workflows that prevent overwhelming the annotator. This is achieved through computation of Acquisition Functions that assist the selection of samples predicted with low confidence, which, once annotated by a human, could improve the model’s performance after fine-tuning. Napari is a user-friendly visualizer for n-dimensional data whose capabilities are extensible through plugins. This visualizer already offers tools for data annotation and is compatible with Next Generation File Formats such as Zarr. ## Results A brief overview of the “napari-activelearning” plugin’s graphical interface shows the tool integrated in Napari and general usage of the plugin controls. ## Conclusion While this plugin was developed with the goal of easing adoption of deep learning models in bioimage analysis projects, it is not restricted to these imaging modalities. Moreover, it can be applied as a transfer learning tool for methods that lack an existing user interface or that are not adapted to work with large-scale image data. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/CWJ7XR/ Room 317 Fernando Cervantes Sanchez PUBLISH Z8P8JH@@cfp.scipy.org

-Z8P8JH

Unlocking the Missing 78%: Inclusive Communities for the Future of Scientific Python en

20250710T142000 20250710T145000 0.03000

Unlocking the Missing 78%: Inclusive Communities for the Future of Scientific Python

Despite the immense success of the Python ecosystem, gender disparity remains a persistent challenge. Women make up just 15–22% of data science professionals globally (BCG, Deloitte) and an estimated 2–3% of contributors to Python GitHub projects. Addressing this gap is not only a moral imperative but a strategic one, as diversity drives innovation, collaboration, and growth. This talk will cover: Understanding the Gap Present key statistics on gender disparity in data science and Python communities. Explore systemic barriers, including lack of mentorship and inaccessible community structures. Practical Strategies for Inclusion Share lessons from successful programs, including Women in AI at IBM, TechWomen mentorship, and PyData Global initiatives. Provide actionable steps for creating inclusive spaces, such as targeted outreach, mentorship pipelines, and accessible events. Case Studies and Measurable Impact Highlight success stories from NumFOCUS and PyData Global, showcasing increased participation and leadership from underrepresented groups. Roadmap for the SciPy Community Equip attendees with a roadmap for fostering diversity in their own projects and communities, tailored specifically to scientific Python. By addressing the “missing 78%,” the SciPy community has an opportunity to unlock new talent, spark innovation, and create a thriving, diverse ecosystem. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/Z8P8JH/ Room 317 Noor Aftab PUBLISH BQMZFM@@cfp.scipy.org

-BQMZFM

Towards a more sustainable and reliable mybinder.org en

20250710T150000 20250710T153000 0.03000

Towards a more sustainable and reliable mybinder.org

mybinder.org allows users to get a custom interactive computing environment in their browser via simply clicking a link, allowing them to explore various materials without having to go through the laborious process of setting up a local environment exactly right. It is fully open to the public, allowing anyone to create and share links without requiring logins or payment. mybinder.org has now served millions of users in the scientific python community over the last 8 years. As an open source *infrastructure* project, it's also had several challenges in sustainability and reliability that have manifested in users receiving poorer service some of the time. This talk aims to go over some of those challenges, how they manifest for users, and new experiments (technical and social) for addressing these. Here's a rough outline of various questions you will have better answers to at the end of this talk: 1. What is mybinder.org? Why should I care? 2. What does it mean to run 'open infrastructure'? How is that different from 'open source'? 3. When I click a mybinder.org link, it just sometimes hangs. What are the complex social factors beyond my control that have led me to this frustrating moment in time? 4. So what happens when we can no longer rely on big huge corporations to give open source projects free cloud credits out of the goodness of their hearts? 5. As open infrastructure, mybinder.org made a bet on using open source cloud agnostic technology (kubernetes) very early on, putting effort into not being locked into any specific cloud provider. Has that helped mybinder.org survive or made things worse? 6. Ok that's all fine, what is happening now to improve the sustainability situation? 7. The new UI on mybinder.org looks nice! How did that happen? Are we getting more new features? 8. I can see why a reliable and sustainable mybinder.org serves an important purpose in the scientific python ecosystem. How can I help? 
PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/BQMZFM/ Room 317 Yuvi PUBLISH YKCR8S@@cfp.scipy.org

-YKCR8S

AI as a Detector: Lessons in Real Time Pulsar Discovery en

20250710T155000 20250710T162000 0.03000

AI as a Detector: Lessons in Real Time Pulsar Discovery

**Background** The radio astronomy and broader signal processing communities have long relied on classical signal processing techniques to identify needles of interest in the galactic haystack. Techniques to improve the signal to noise ratio, like dedispersion, and detect known emissions, like matched filtering, work well when the RF environment is relatively clean and one knows what she is looking for; but what do we do when we're hunting for new discoveries we haven't yet observed? Artificial Intelligence techniques have shown great promise towards new event detections in the radio astronomy field. Gerry Zhang et al. reported 72 previously undiscovered Fast Radio Bursts in recorded observations at the Green Bank Observatory through usage of a convolutional neural network applied to spectrograms in C-Band. But how do we apply these types of detectors to real data streams? Through SciPy tools like scipy.signal and GPU equivalents (cupyx.scipy.signal), it's become possible to deliver high performance signal processing with easy data movement between AI frameworks like PyTorch, ensuring that data doesn't migrate needlessly from CPU to GPU. Moreover, the emergence of AI sensor processing frameworks like NVIDIA's Holoscan allows one to easily connect GPU compute resources to physical sensors, like FPGAs. Together, these community driven tools can culminate in real time scientific discovery. **Methods** We trained our fast radio transient model with a modified ResNet architecture and simulated over 200,000 samples of fast radio bursts using the [InjectFRB](https://github.com/liamconnor/injectfrb) library. This augmented real collections of FRBs found previously at a variety of instruments. The model was trained in PyTorch using the ADAM optimizer and was later optimized with NVIDIA's [TensorRT](https://developer.nvidia.com/tensorrt) SDK. 
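For readers unfamiliar with the classical baseline mentioned in the background, the sketch below shows matched filtering on a synthetic time series: a known pulse shape is correlated against noisy data, and the strongest response marks the pulse location. It is not part of the pipeline described in this talk. It uses `np.correlate` to stay dependency-light (`scipy.signal.correlate` offers the same operation), and the pulse shape, noise level, and offset are made-up values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical pulse template: a Gaussian bump 33 samples long.
x = np.arange(-16, 17)
template = 3.0 * np.exp(-0.5 * (x / 2.0) ** 2)

# Synthetic time series: white noise plus one pulse injected at a known offset.
true_offset = 400
signal = rng.normal(scale=0.2, size=1024)
signal[true_offset:true_offset + template.size] += template

# Matched filter: slide the template along the data and correlate at each lag.
response = np.correlate(signal, template, mode="valid")
detected_offset = int(np.argmax(response))

print(f"injected at {true_offset}, strongest response at {detected_offset}")
```

This works only because the template is known in advance, which is exactly the limitation that motivates learned detectors for signals that have not yet been characterized.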
Further, we determined that the ResNet based approach to signal identification outperformed SPANDAK, the current fast radio transient search algorithm used in production at the Allen Telescope Array. Once the model had been trained and optimized, we leveraged NVIDIA's [Holoscan](https://github.com/nvidia-holoscan/holoscan-sdk) real time AI sensor processing SDK to connect incoming RF samples from the instrument to GPU computing running a real time signal processing and ML inferencing pipeline. **Results** In this experiment, we collected an aggregated 100Gbps of UDP Ethernet data from a total of 28 antenna feeds at the Allen Telescope Array. Each feed collected 96 MHz of data, and at 2 polarizations, the total experiment had an aggregate bandwidth of 5.4GHz. The GPU processor used was an NVIDIA IGX (Orin + ConnectX 7 NIC + RTX A6000 GPU). On the IGX, a Holoscan pipeline read UDP data straight from NIC to GPU and then performed GPU based beamforming and AI inferencing for radio transient detection. We detected a candidate pulse at 1.236 GHz when pointed at the Crab Nebula. Upon further processing, we confirmed this emission was originating from the Crab Pulsar (PSR B0531+21), thus being the first pulsar detected with an online AI pipeline on raw sensor data. **Conclusion** This talk is a comprehensive "how-to" guide to both building science based ML models and deploying them to a real time instrument in production. It covers critical pieces of a real time pipeline like ML model optimization, data movement from sensors to compute, real time visualization, accelerated computing, and culminates in the detection of a pulsar. For the non astronomers, we hope this talk will provide pointers on how to build real time AI workflows with liberal use of the SciPy library ecosystem. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/YKCR8S/ Room 317 Adam Thompson Luigi Cruz PUBLISH EPTPDH@@cfp.scipy.org

-EPTPDH

Enabling Innovative Analysis on Heterogeneous Clusters through HTCdaskgateway en

20250710T163000 20250710T170000 0.03000

Enabling Innovative Analysis on Heterogeneous Clusters through HTCdaskgateway

Introduction: High energy particle (HEP) physics research is going through fundamental changes as we move to collect larger amounts of data from the Large Hadron Collider (LHC). In HEP, we parse this collected data to perform statistical analysis but we are still relatively inefficient at this due to the sheer amount of data and the difficulties in developing an analysis. Given these challenges, there has been research into both how we do analysis and new software tools. One part of the solution is emphasizing distributed computing, specifically, High Throughput Computing (HTC), and another part is an Analysis Facility, which can be loosely thought of as a suite of tools aimed at making a coherent analysis ecosystem using software such as Jupyterhub and systems such as Kubernetes. Analysis facilities enable users to build more pythonic analyses that incorporate libraries from the PyHEP and Scikit-HEP ecosystem, such as matplotlib and hist, which has, until now, been less common in HEP research. For analysis facilities to provide distributed computing, they must be able to properly communicate with schedulers and workers. The htcdaskgateway: Htcdaskgateway is a Dask gateway extension that allows for communication between non-Dask workers, for example HTCondor workers, and a Dask system on an OKD kubernetes cluster where the jobs must be submitted by the user for authentication. OKD is a community driven kubernetes distribution for application management. Analysis facilities rely on software like htcdaskgateway to allow users to spawn workers compatible with both their analysis and the heterogeneous clusters they are working on. Like Dask gateway, it gives users the ability to choose images which allows for flexibility in the analysis pipeline. The htcdaskgateway specifically pre-configures an image utilizing Coffea, a new pythonic analysis framework being developed to make HEP analysis easier, that rests on several scientific python libraries. 
This is allowing physicists to utilize and engage with scientific python in ways they could not before, due in large part to analysis being done through traditional domain specific C++ tools. This means access to awkward-array, vector, SciPy, pyhf, numba, and so much more. A concrete example of the htcdaskgateway is Fermilab’s Elastic Analysis Facility that is supporting and pioneering the next generation of analysis, specifically the application of Coffea and python. Conclusion: HEP analysis is changing and htcdaskgateway is a large part of it; we hope to enable many more users to perform large scale analyses with the benefits of scientific python through Dask. While it needs more development to become a fully generalized tool, we are on our way to making a tool that could help connect other physicists, and possibly other scientists, who need a similar tool where authentication is tied to the user, to scientific python. Audience: This talk appeals to physicists and research software engineers interested in making scientific python libraries available to general researchers on a heterogeneous cluster for distributed computing. This is also interesting for students, professionals, and academics that want to know more about connecting Dask, Kubernetes, and HTCs/HPCs. The audience, regardless of level, will learn about Dask and OKD in the context of Dask gateway and htcdaskgateway. They will also learn about the use of scientific python in the HEP community. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/EPTPDH/ Room 317 Elise Chavez PUBLISH RHKQKD@@cfp.scipy.org

-RHKQKD

RydIQule: A Package for Modelling Quantum Sensors en

20250710T104500 20250710T111500 0.03000

RydIQule: A Package for Modelling Quantum Sensors

Quantum sensing is a fundamentally new technology with no classical analogue. This rapidly expanding research field enables unprecedented capabilities in navigation, time-keeping, magnetometry, and electrometry. A particular focus of our work is using Rydberg atomic sensors for radio-frequency field sensing. These sensors leverage quantum states that are highly sensitive to ambient electric fields, enabling omniband operation spanning from DC to a terahertz with many simultaneous frequencies in a single device. While they show great promise, designing these devices and understanding their behavior can pose many challenges. The physics behind quantum sensors, especially in the context of prototyping experiments, makes efficient simulation difficult. A quantum system of this type consists of a set of discrete energy levels with lasers or radio-frequency fields coupling them. The number of levels can easily exceed 100, and the couplings between them can be arbitrarily complicated. Coupled sets of ordinary differential equations describe the dynamics of the system, their size scaling with the square of the number of levels present in the system. When simulating experiments in which laser parameters are swept over a range of values, each combination of parameters yields a unique set of differential equations, leading to even further difficulties. As the dimension of the parameter space increases, the number of equation sets increases exponentially. Answering these needs requires a common tool for modeling experiments and designing novel classes of devices. To this end, we have developed the Rydberg Interactive Quantum Module (RydIQule), a tool leveraging principles of scientific python to enable efficient simulation. The first challenge we address is choosing a data structure to represent the quantum system that is both expressive of the physics and flexible enough to handle the wide variety of problems in Rydberg sensing and beyond. 
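The scaling described above can be made concrete with a small NumPy sketch: a three-level ladder system in which two detunings are swept, giving one Hamiltonian (and hence one set of equations) per parameter combination. This is not RydIQule's API. The Hamiltonian is a generic textbook ladder form, and the Rabi frequencies, detuning ranges, and variable names are assumptions chosen for illustration.

```python
import numpy as np

n_levels = 3
probe_detunings = np.linspace(-10.0, 10.0, 5)    # hypothetical sweep, arbitrary units
coupling_detunings = np.linspace(-5.0, 5.0, 3)   # hypothetical sweep, arbitrary units
omega_p, omega_c = 1.0, 2.0                      # hypothetical Rabi frequencies

# Give each swept parameter its own axis so that broadcasting forms every combination.
dp = probe_detunings[:, np.newaxis]       # shape (5, 1)
dc = coupling_detunings[np.newaxis, :]    # shape (1, 3)

# One n_levels x n_levels Hamiltonian per point in the parameter grid.
grid_shape = (probe_detunings.size, coupling_detunings.size)
hamiltonians = np.zeros(grid_shape + (n_levels, n_levels))
hamiltonians[..., 0, 1] = hamiltonians[..., 1, 0] = omega_p / 2
hamiltonians[..., 1, 2] = hamiltonians[..., 2, 1] = omega_c / 2
hamiltonians[..., 1, 1] = -dp             # broadcast along the coupling axis
hamiltonians[..., 2, 2] = -(dp + dc)      # broadcast along both axes

n_equation_sets = int(np.prod(grid_shape))
equations_per_set = n_levels ** 2         # one equation per density-matrix element
print(hamiltonians.shape, n_equation_sets, equations_per_set)
```

Adding a third swept parameter with ten values would multiply the number of equation sets by ten, which is the exponential growth in the dimension of the parameter space noted above.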
Our novel insight is that atomic states and the fields that couple them are naturally represented as the nodes and edges of a directed graph. The networkx package provides an easy-to-use implementation, allowing us to store all relevant information of a system within the graph. By reading parameters from the graph, RydIQule has enormous algorithmic flexibility in generating differential equations describing the dynamics of the system. RydIQule handles the exponential parameter space by leveraging numpy’s broadcasting operations in ways that suit experiment simulation well. Casting parameter arrays to the correct shape at the outset ensures the rest of the procedure can be scripted without any consideration of matrix shape. Furthermore, it allows extracting high-dimensional tensor diagonals from the parameter space. This technique simulates two parameters swept in parallel (common in optical physics experiments), and reduces the dimensionality of the parameter space by trimming irrelevant equations. RydIQule has already seen growing use because it is both intuitive to physicists and sound in software design principles. It is a demonstration of insights from fields like machine learning leading to transformative changes in researchers’ software tools. Atomic sensing is not unique in its need for such a tool, and we encourage more dialogue between science and software to rapidly accelerate research across fields. RydIQule’s source is available on [github](https://github.com/QTC-UMD/RydIQule) with documentation on [readthedocs](https://rydiqule.readthedocs.io/en/latest/). Our peer-reviewed article introducing it to the research community is available at [Computer Physics Communications](https://doi.org/10.1016/j.cpc.2023.108952). PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/RHKQKD/ Room 318 Ben Miller PUBLISH U9GYS3@@cfp.scipy.org

-U9GYS3

zfit: scalable pythonic likelihood fitting en

20250710T112500 20250710T115500 0.03000

zfit: scalable pythonic likelihood fitting

# scalable, pythonic likelihood fitting The library has a github repository: https://github.com/zfit/zfit and tutorials that can be run in the browser: https://zfit-tutorials.readthedocs.io/en/latest/ ## The problem Fitting distributions, such as Normal, Poisson and polynomials, is a common task in many scientific fields. The Python ecosystem contains multiple libraries, such as scipy, lmfit, statsmodels and more, to provide some basic tools for building models and fitting them. However, these tools have strongly limited features: they generally lack the ability to compose models such as sums, products, convolutions or build multidimensional distributions; they are restricted to analytically integrable functions; they do not offer extensive customization possibilities; and the performance is not competitive with libraries written in compiled languages. ## Talk summary We present zfit, a scalable, general purpose fitting library; it is built to significantly enhance the fitting capabilities in the Python ecosystem. The talk will cover the main topic of distribution fitting along with the main features of zfit, and is targeted at a wide scientific audience: anyone who has ever needed to fit a function. The focus of the talk will be the extensive model building part, using distributions including custom ones, the fitting, and the performance. The latter originates from the backend, TensorFlow, that will be discussed as a general way of increasing the performance of scientific computing. We will also discuss the integration of zfit with other libraries in the Python ecosystem for data loading, plotting and statistical inference. ## Description of zfit in detail ### Distributions zfit has an extensive model building part: it contains PDFs from simple analytical functions, such as Normal distributions, to complex multidimensional distributions, and allows the composition of these models. 
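To illustrate what composing models means in practice, the sketch below builds a sum of a Gaussian "signal" and a flat "background" and estimates the signal fraction with an unbinned negative log-likelihood. It is plain NumPy and deliberately does not use zfit's API; all function names, the fixed shape parameters, and the coarse grid scan (standing in for a real minimizer) are assumptions made for illustration.

```python
import numpy as np


def gauss_pdf(x, mu, sigma):
    """Normal density; its tails outside [-5, 5] are negligible for sigma = 1."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def uniform_pdf(x, low, high):
    """Flat density on [low, high]."""
    return np.where((x >= low) & (x <= high), 1.0 / (high - low), 0.0)


def sum_pdf(x, frac):
    """Composed model: frac * signal + (1 - frac) * background."""
    return frac * gauss_pdf(x, 0.0, 1.0) + (1.0 - frac) * uniform_pdf(x, -5.0, 5.0)


def nll(data, frac):
    """Unbinned negative log-likelihood of the composed model."""
    return -np.sum(np.log(sum_pdf(data, frac)))


# Toy dataset: 30% signal and 70% background.
rng = np.random.default_rng(seed=1)
data = np.concatenate([
    rng.normal(0.0, 1.0, size=600),
    rng.uniform(-5.0, 5.0, size=1400),
])

# A coarse scan over the signal fraction stands in for a proper minimizer.
fractions = np.linspace(0.01, 0.99, 99)
best_frac = fractions[np.argmin([nll(data, f) for f in fractions])]
print(f"estimated signal fraction: {best_frac:.2f}")
```

A fitting library removes the hand-written parts of this sketch: normalization over the fit range, parameter handling, minimization, and uncertainty estimation.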
To incorporate functions that are specific to a domain and not made up of basic shapes, convenient baseclasses allow the user to implement a function using a numpy-like syntax. These PDFs can be used directly, as zfit automatically takes care of the numerical normalization and sampling. Furthermore, PDFs can also be binned and mixed with unbinned data to accommodate large data samples. ### Fitting The fitting is primarily based on the minimization of a loss function, typically a likelihood. The loss can be customized to include constraints and penalties to incorporate simultaneous fits and arbitrary correlations between parameters. The minimization is performed using a variety of minimizers, including the popular ones from scipy, nlopt or iminuit. The result has methods for simple error estimation of the parameters. Due to the general API and workflow, other libraries have integrated zfit to perform further statistical inference, such as the statistical library hepstats. ### Performance zfit uses a numpy-like computing backend that is also used by TensorFlow. This allows for just-in-time compiled code that significantly improves performance, up to C++-like speeds, and can further be run in a distributed manner on CPUs and GPUs. Automatic gradients provided by the backend are used in the gradient-based minimizers, which also speed up the minimization process. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/U9GYS3/ Room 318 Jonas Eschle PUBLISH NYPUVH@@cfp.scipy.org

-NYPUVH

Jupyter Book 2.0 – A Next-Generation tool for sharing for Computational Content en

20250710T142000 20250710T145000 0.03000

Jupyter Book 2.0 – A Next-Generation tool for sharing for Computational Content

# Description Jupyter Book has been a core tool for sharing computational science, powering more than 14,000 open, online textbooks, knowledge bases, lectures, courses and community sites. From numerical methods primers to open science guides like *The Turing Way*, Jupyter Book has become a de facto standard for researchers and educators who want to publish computational narratives. Over the past two years, we have rebuilt Jupyter Book from the ground up, focusing on producing machine-readable, semantically structured content that can be flexibly deployed and that supports reuse and cross-referencing in unprecedented ways. This was achieved by adopting the MyST Markdown (https://mystmd.org) document engine, which is more flexible, more extensible and more deeply integrated with Jupyter for interactive computation. Jupyter Book 2 (JB2) is now an official Jupyter Subproject and represents a major leap forward in how we share and distribute computational content on the web. ## Key Features of Jupyter Book 2.0 Jupyter Book 2.0 provides several powerful new capabilities: * Executable Figures & Notebooks: Fully interactive, executable outputs powered by JupyterHub/Binder allow readers to run computations directly from the book (including using pyodide/JupyterLite). * Rich Hover Previews: Interactive link previews provide instant context when navigating content, whether inside a book or across federated Jupyter Book sites. * Typst PDF Export: First-class support for Typst for fast, modern and high-quality PDF generation. * Standards for Content Reuse & Machine-Addressability: Built on MyST Markdown, enabling content interoperability, reuse, and distribution across platforms. * Modern Web-Based Publication: A streamlined build process for fast, scalable, and reproducible publishing workflows. ## Real-World Adoption Jupyter Book 2 is already transforming how scientific and computational content is communicated. 
*The Turing Way*, a guide to reproducible, ethical, and collaborative data science, has successfully migrated to Jupyter Book 2, taking advantage of its interactive features and improved publishing pipeline. More projects are in the process of migrating, and we believe that JB2 has the potential to become a standard for the next generation of knowledge sharing and interactive computation. ## Live Demo: From Notebooks to a Published Site in 5 Minutes We will conclude with a live, hands-on demonstration, showing (in five minutes) how to: 1. Start with a folder of Jupyter notebooks and markdown files. 2. Use Jupyter Book 2 to build an interactive, shareable website. 3. Publish the site online. Attendees will leave with an understanding of the goals of Jupyter Book 2.0, its game-changing features, and how these could be applied in their projects/work. ## Target Audience This talk is ideal for: * Educators creating interactive textbooks or tutorials. * Researchers sharing computational narratives and reproducible workflows. * Documentation authors looking for standards-based, extensible publishing tools. * Anyone who uses Jupyter Notebooks and wants to publish content beyond the notebook. # Prerequisites Attendees who want to follow along with the demo should have: * A folder with Jupyter notebooks and markdown files. * A working virtual environment with dependencies installed (Python 3.11 or above). 
* Jupyter Book 2 installed ([https://next.jupyterbook.org/start/install](https://next.jupyterbook.org/start/install)) # Links & Resources * 📖 Jupyter Book 2 Documentation: [https://next.jupyterbook.org/](https://next.jupyterbook.org/) * ✍ MyST Markdown: [https://mystmd.org/](https://mystmd.org/) * 📘 The Turing Way (Real-World Example): [https://book.the-turing-way.org/](https://book.the-turing-way.org/) PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/NYPUVH/ Room 318 Steve Purves Franklin Koch Rowan Cockett Angus Hollands Chris Holdgraf PUBLISH S77TUP@@cfp.scipy.org

-S77TUP

Teaching Python with GPUs: Empowering educators to share knowledge that uses GPUs en

20250710T150000 20250710T153000 0.03000

Teaching Python with GPUs: Empowering educators to share knowledge that uses GPUs

GPUs are everywhere, but not necessarily immediately in front of you. How do you get one, and how do you maximize its potential? These are common questions when planning to teach content involving GPUs. With AI spreading through every field and data volumes continuously expanding, teaching GPU computing has never been more relevant. When it comes to teaching programming concepts, it's well known that the classical lecture style doesn't cut it anymore. Students learn more if they are in an interactive environment. Teaching computing interactively and with an active learning approach can be hard, even when the software you are teaching is Open Source. If you've ever taught a course, or given a tutorial, you may have stumbled upon some challenges like resources accessibility, differences in operating systems, flavors of library installations, level of depth of concepts, among others. Imagine that the thing you want to teach is using cutting edge features of the latest and greatest hardware. You're now motivated to find a way to get this hardware in the hands of everyone in the room. The CUDA Python ecosystem has matured over the last few years and leveraging GPUs in your Python code is now easier than ever. To unlock GPUs in your projects you have to have the right hardware and software setup in addition to the standard Python stack. During this talk we will go over some of the challenges around getting set up, and discuss practical strategies to promote GPU-computing active learning. Outline: - **Resource Accessibility:** How do you provide GPU access? What are the options available? Cloud-based solutions, Colab notebooks, and ready-to-go Jupyter + RAPIDS deployments. - **Managing knowledge Levels:** zero-code-change, low-code-change, CUDA Python, CUDA C++. When and How? - **Environment Management and Dependencies:** Explain complexity of managing GPU dependencies. 
Understanding errors and version incompatibilities, and introducing [RAPIDS Doctor](https://global2024.pydata.org/cfp/talk/B9GZWJ/). - **Packaging and deployment:** Discuss how GPU software libraries are built and distributed, and how to install and deploy them on various platforms. This talk is intended for educators, researchers, and developers who are interested in teaching or learning about GPU computing with Python. By the end of this talk, attendees will leave with the confidence that, if they want to teach something that requires GPU acceleration, they will be able to get their audience up and running quickly. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/S77TUP/ Room 318 Jacob Tomlinson Naty Clementi PUBLISH B3PXML@@cfp.scipy.org

-B3PXML

Keeping Python Fun: Using Robotics Competitions to Teach Data Analysis and Application Development en

20250710T155000 20250710T162000 0.03000

Keeping Python Fun: Using Robotics Competitions to Teach Data Analysis and Application Development

The Issaquah Robotics Society (IRS) has been teaching Python and data analysis to high school students since 2016. We use custom-built Python applications at our tournaments to analyze robot performance and choose effective competition strategies. Our presentation will focus on the best practices that we’ve discovered during the past nine years and is for anyone who wants to help others learn Python. We believe that instructional techniques that work for high schoolers with no prior programming experience can work for everyone. ## Background Relative to classroom instruction, teaching Python via an extracurricular activity requires different instructional techniques. Providing constructive feedback is more challenging without grades or exams. More importantly, students who get frustrated are free to switch to other activities or drop robotics altogether. Ensuring that students are having fun is crucial. Teaching Python by developing a data analysis application has advantages over traditional classroom instruction. * Direct feedback from users. * Provides experience with multiple aspects of application development, including version control, testing, and deployment. * Delivery dates cannot slide. Applications must be ready by day one of our first competition. Students are forced to prioritize when deciding what features to implement. * Demonstrates how data analysis can improve decision making. Students rarely have trouble understanding programming concepts. Even students with no prior programming experience easily understand concepts such as loops, conditional statements, functions, and composite data structures. Learning to use tools like IDEs, Git, or virtual environments is usually a bigger source of frustration than learning Python itself. For our curriculum to be successful, we must carefully plan how we’ll introduce students to programming tools. We must strike a balance between two competing objectives: 1. 
Avoid student frustration with tools when they are starting to learn Python. 2. Effectively use tools for team-driven, time-limited application development. ## Presentation Outline Our presentation covers the following topics: 1. A brief introduction to our problem domain, FIRST* Robotics Competitions, and how data analysis with Python makes us more competitive. 2. What’s different about teaching high school students (and what’s not). 3. Best practices related to teaching Python and data analysis * Making it fun * Learning the tools * Giving feedback without making learning Python feel like just another class ## About the Presenters The presentation will be provided by current IRS students and their analytics mentor (who moonlights as a data scientist at the Pacific Northwest National Laboratory). IRS students are experienced presenters who frequently provide presentations to robotics competition judges. The IRS has won numerous awards during its 22-year existence, including the 2024 Pacific Northwest FIRST Robotics championship. The IRS participates in FIRST* Robotics Competitions and is based in Issaquah, WA. FIRST’s mission is to get students excited about science and technology and give them skills and confidence that will help them pursue STEM careers. We want our students to understand that programming is not just for software developers - that Python and data analysis are highly valuable skills for anyone who participates in STEM. \* FIRST: For InspiRation of Science and Technology PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/B3PXML/ Room 318 Stacy Irwin PUBLISH 9QG8PU@@cfp.scipy.org

-9QG8PU

Getting all your snakes in a grid: collaborating and teaching with Python in Excel and the Anaconda Toolbox en

20250710T163000 20250710T170000 0.03000

Getting all your snakes in a grid: collaborating and teaching with Python in Excel and the Anaconda Toolbox

- Introduction - Importance of working with data in grids or spreadsheets for collaboration - Various tools available to view and edit files - Working from Python you can use packages like openpyxl and pandas - New Pythonista tools for spreadsheet work - Overview of Python in Excel feature - Introduction to the Anaconda Toolbox add-in - Advantages for Pythonistas - Easier access to and collaboration on code within spreadsheets - No environment setup - Demo: Running Python directly in cells in a spreadsheet with other people - Advanced features you can use to make your data polished: - Python = Plotting upgrades - custom data types - custom reprs - Case Studies - Teach Python to finance professionals - Collaborate with Python in industry - Conclusion - Recap of key points - Implications for data analysis and collaboration PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/9QG8PU/ Room 318 Sarah Kaiser PUBLISH DTJMTR@@cfp.scipy.org

-DTJMTR

Lessons Learned from Adding Backend Dispatching to NetworkX and scikit-image en

20250711T104500 20250711T111500 0.03000

Lessons Learned from Adding Backend Dispatching to NetworkX and scikit-image

Scientific Python libraries often reinvent functionality to support new hardware or data structures, leading to fragmentation. For example, a GPU-enabled library might closely mimic an existing library's API (numpy/cupy, pandas/cudf, networkx/cugraph, scikit-learn/cuml, scikit-image/cucim, etc.). Dispatching allows existing libraries to act as a common interface for multiple backends, reducing redundancy and empowering users to switch implementations without rewriting code. This greatly reduces the work for backend developers, because the original library provides documentation, tests, and a broader community of users. Similarly, the original library is enhanced by having external backends for little effort, because backends are separately developed, maintained, and tested. Dispatching and backend selection is a complex field with various possible implementations. Many projects implement multiple dispatching based on types, and other projects have experimented with explicit backend selection that goes beyond type dispatching and allows swapping in a different algorithm. NetworkX, the most popular library for graph analytics, has both. Dispatching in NetworkX has been developed over the last three years, and it has added features such as automatic conversion (and caching) of inputs, and incorporating backend information into its documentation. Dispatching in scikit-image is much newer, and it takes a minimal, but similar, approach. From Blaze to Ibis to uarray to array protocols to the Array API to Narwhals and many others, the SciPy ecosystem has seen many dispatching efforts over the years. Adding dispatching to an existing library such as NetworkX is less disruptive--and much less work--than trying to come up with a _new_ standard API. NetworkX is already the *de facto* standard for graph analytics, having been developed by many contributors over decades. 
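The two selection styles mentioned above, dispatching on the input's type and explicit backend selection, can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the NetworkX or scikit-image implementation: the registry `_BACKENDS`, the `__backend__` attribute, `register_backend`, `dispatchable` and the `edgelist` backend are all hypothetical names invented for this example.

```python
import functools

# Hypothetical registry: backend name -> conversion function and implementations.
_BACKENDS = {}


def register_backend(name, convert, implementations):
    """Register a backend with an input converter and its function overrides."""
    _BACKENDS[name] = {"convert": convert, "impls": dict(implementations)}


def dispatchable(func):
    """Route calls to a backend chosen explicitly or inferred from the input type."""

    @functools.wraps(func)
    def wrapper(graph, *args, backend=None, **kwargs):
        if backend is None:
            # Type-based dispatch: backend graph classes carry a marker attribute.
            backend = getattr(type(graph), "__backend__", None)
        if backend is None:
            return func(graph, *args, **kwargs)  # reference implementation
        entry = _BACKENDS[backend]
        impl = entry["impls"].get(func.__name__)
        if impl is None:
            raise NotImplementedError(f"{backend!r} does not implement {func.__name__}")
        # Convert the input to the backend's own data structure before calling it.
        return impl(entry["convert"](graph), *args, **kwargs)

    return wrapper


@dispatchable
def number_of_edges(graph):
    """Reference version: graph is a dict mapping node -> set of neighbours."""
    return sum(len(neighbours) for neighbours in graph.values()) // 2


class EdgeListGraph:
    """Toy backend data structure that stores undirected edges directly."""

    __backend__ = "edgelist"

    def __init__(self, edges):
        self.edges = {frozenset(edge) for edge in edges}


def to_edgelist(graph):
    """Convert an adjacency dict to the backend structure (no-op if already one)."""
    if isinstance(graph, EdgeListGraph):
        return graph
    return EdgeListGraph((u, v) for u, nbrs in graph.items() for v in nbrs)


register_backend(
    "edgelist",
    convert=to_edgelist,
    implementations={"number_of_edges": lambda g: len(g.edges)},
)

adjacency = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
triangle = EdgeListGraph([("a", "b"), ("b", "c"), ("c", "a")])
```

Calling `number_of_edges(adjacency)` runs the reference code, `number_of_edges(adjacency, backend="edgelist")` converts the input and runs the backend, and `number_of_edges(triangle)` is routed by type alone, so user code keeps calling the same function in all three cases.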
Its strengths are its API, documentation, tests, community, readability, and maintainability--all the difficult but vital aspects of open-source software! Its main shortcoming is its scalability because it is written in pure Python, which can be overcome by dispatching to an accelerated backend such as nx-cugraph. When successful, dispatching can be a "win-win-win" for library maintainers, backend developers, and users, and we expect it to become more and more needed as hardware and software become more diverse and specialized. Target audience: - Maintainers of libraries seeking to support multiple backends - Developers of accelerated implementations - Users frustrated by API fragmentation - Users interested in zero-code change acceleration on NVIDIA GPUs - Anybody interested in helping standardize dispatch patterns via [`scientific-python/spatch`](https://github.com/scientific-python/spatch) PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/DTJMTR/ Ballroom Erik Welch PUBLISH TQK3T9@@cfp.scipy.org

-TQK3T9

SciPy’s New Infrastructure for Probability Distributions and Random Variables en

20250711T112500 20250711T115500 0.03000

SciPy’s New Infrastructure for Probability Distributions and Random Variables

SciPy [1] provides objects that represent well over 100 univariate probability distributions. For example, `scipy.stats.norm` is one of the most popular, and GitHub code search finds over 100,000 uses [2]. Each object has methods to compute statistics and other functions of the distribution, such as the probability density function (PDF), cumulative distribution function (CDF), inverse CDF (also “percent point function” or PPF), differential entropy, and moments. In theory, all other functions of a distribution follow from one of the others; for instance, the CDF is the integral of the PDF from the left end of the distribution’s support. Although methods of some distributions are overridden for improved performance or accuracy, the infrastructure also provides generic implementations; for instance, the default `cdf` method integrates the PDF numerically using `scipy.integrate.quad`. Besides providing generic method implementations, the infrastructure is responsible for ensuring a consistent API, generating documentation, and executing a common test suite. As useful as the legacy infrastructure has been, users have reported many shortcomings over the past two decades [3]. For example, although the distributions permit vectorized use with array arguments and shape parameters, the generic method implementations work only on scalars, and they must loop in Python over each element of array inputs, eliminating the performance advantage of vectorized code. In some aspects, the API is oppressively self-consistent: all distributions inherit common location and scale (`loc` and `scale`) parameters that are not standard in the literature of many distributions, confusing users and leading to problems when fitting distributions to data. Distribution documentation generation involves use of `exec` and a nonstandard string replacement syntax, and the test suite is not comprehensive, so new bugs continue to be found. 
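To make the idea of a generic, scalar-only method concrete, the sketch below computes a CDF by numerically integrating a PDF and then loops over inputs in Python. It is an illustration only, not SciPy's code: it swaps `scipy.integrate.quad` for a composite Simpson rule to stay within the standard library, the names `generic_cdf`, `normal_pdf` and `normal_cdf_exact` are made up, and the finite lower bound of -10 is an assumed stand-in for the left end of the normal distribution's support.

```python
import math


def generic_cdf(pdf, x, lower, n=1000):
    """Approximate CDF(x) by integrating pdf from `lower` to x (Simpson, n even)."""
    if x <= lower:
        return 0.0
    h = (x - lower) / n
    total = pdf(lower) + pdf(x)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * pdf(lower + i * h)
    return total * h / 3.0


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf_exact(x):
    """Closed-form standard normal CDF, the kind of override a distribution can supply."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# The generic routine handles one scalar at a time, so "array" inputs
# require a Python-level loop with one full integration per element.
xs = [-1.0, 0.0, 1.5]
approx = [generic_cdf(normal_pdf, x, lower=-10.0) for x in xs]
exact = [normal_cdf_exact(x) for x in xs]
```

Each element costs a full numerical integration, which is why a scalar-only generic implementation gives up the performance advantage of vectorized code.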
Users have also requested new features, which have become increasingly difficult to work into the patchwork codebase of many separate contributions. It has been clear for several years that a fresh start was needed. The new infrastructure, released with SciPy 1.15.0, addresses these shortcomings. Generic implementations of methods are natively vectorized, leveraging SciPy’s new array API compatible functions for quadrature, series summation, root finding, and minimization [4]. Distributions support multiple parameterizations and do not force the inclusion of `loc` and `scale` parameters, so users can work with the parameterizations they are familiar with. Documentation is generated using more modern features of Python (such as f-strings), and the test suite is thorough. Many new features are implemented atop these solid foundations. • When a distribution-specific implementation of a method is not available, a decision tree chooses among several generic implementation strategies to ensure efficient computation of accurate results. For instance, central moments of a distribution can be computed by shifting raw moments, scaling standardized moments, numerically integrating the PDF, or numerically integrating the inverse-CDF; the best choice depends on which distribution-specific implementations are available. Using a `method` argument, the results of independent computation strategies can be compared against one another to assess accuracy – and indeed, this is the foundation of the extensive property-based test suite. • Instances of distribution classes behave like random variables and can be manipulated with composable transformations including: o elementary arithmetic operations (i.e. addition/subtraction for shifting location, multiplication/division for scaling) between random variables and arrays; o functions of random variables (e.g. reciprocal, square and other powers, absolute value, natural exponential and logarithm); and o truncation of the support. 
• Quasi-Monte Carlo samples can be drawn from arbitrary distributions. All distributions now have methods for computing the mode, the inverse of logarithmic distribution functions, and a `plot` method for convenient visualization. Random variables representing order statistics can be derived from any other random variable. • Multiple random variables can be combined in mixture models. The new infrastructure also paves the way toward even more advanced features, such as: • arithmetic operations between two random variables, • circular distributions, including wrapped versions of arbitrary distributions, • support for other Python Array API Compatible [5] backends beyond NumPy [6] (e.g. CuPy [7], PyTorch [8], JAX [9], and Dask [10]). The talk will introduce users to the new infrastructure and demonstrate its many advantages in terms of usability, flexibility, accuracy, and performance. References: [1] SciPy, https://scipy.org/ [2] Code search results for “from scipy.stats import norm” and "scipy.stats.norm", https://github.com/search?q=%22from+scipy.stats+import+norm%22&type=code [3] Matt Haberland. “RFC: stats: univariate distribution infrastructure”. GitHub scipy/scipy#15928. https://github.com/scipy/scipy/issues/15928 [4] Matt Haberland, Albert Steppi, and Pamphile Roy. “Vectorized Quadrature, Series Summation, Differentiation, Optimization, and Rootfinding in SciPy”. SciPy 2024 Conference. https://doi.org/10.25080/uyyk2727. [5] Array API Standard, https://data-apis.org/array-api/latest/ [6] NumPy, https://numpy.org/ [7] CuPy, https://cupy.dev/ [8] PyTorch, https://pytorch.org/ [9] JAX, https://jax.readthedocs.io/en/latest/ [10] Dask, https://www.dask.org/ PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/TQK3T9/ Ballroom Matt Haberland Albert Steppi PUBLISH LNWCSE@@cfp.scipy.org

-LNWCSE

From One Notebook to Many Reports: Automating with Quarto en

20250711T131500 20250711T134500 0.03000

From One Notebook to Many Reports: Automating with Quarto

This talk describes Quarto parameterized reports, a powerful tool for automating the creation of customized, publication-ready documents from Python notebooks. Data professionals can streamline their workflows and enhance scientific communication by generating multiple personalized reports from a single source, such as risk summaries for different zip codes or individualized soil health reports for farmers. The talk will walk through a practical example of taking a notebook that produces a single report to a workflow that generates many customized reports. Attendees will learn how to: * Add parameters to a notebook so Quarto recognizes them * Use parameters in their code cells (spoiler: just use them like any other variable) * Render to an output format with specific parameter values * Automate rendering many reports over a set of parameter values I’ll also touch on how to customize style to match organizational branding guidelines. This talk is for data professionals that need to communicate their results to multiple stakeholders. It will give them a tool to take the code they already have to produce values, tables, and plots, and turn it into a set of customized reports. I’ll assume no prior Quarto knowledge. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/LNWCSE/ Ballroom Charlotte Wickham PUBLISH CG8P37@@cfp.scipy.org

-CG8P37

marimo: an open-source reactive Python notebook en

20250711T135500 20250711T142500 0.03000

marimo: an open-source reactive Python notebook

This talk explores the following question: How might one design a Python notebook that blends the best parts of interactive computing with the reproducibility, maintainability, and reusability of traditional software? Python notebooks are a workhorse for scientific computation and research. And yet, while great for exploring data, traditional notebooks (in which scientists run cells one at a time, manually managing the state of their kernel in a tedious and error-prone fashion, while also manually managing their notebook's dependency environment) suffer from a well-documented reproducibility crisis. One study, by [Pimentel et al.](https://leomurta.github.io/papers/pimentel2019a.pdf), found that just a quarter of notebooks on GitHub could be run, and just four percent reproduced their outputs when re-executed. Additionally, because they are typically stored as JSON, notebooks fail to enjoy the usual benefits of code: maintainability, reusability, and interoperability with the Python ecosystem. This talk presents [marimo](https://github.com/marimo-team/marimo), an open-source reactive Python notebook that addresses these concerns by modeling notebooks as dataflow graphs and storing them as Python files. marimo notebooks are reproducible, with a reactive runtime that eliminates hidden state and opt-in package management; interactive, with UI elements that are automatically synchronized with Python (no callbacks); expressive, supporting markdown and SQL that can be parametrized by arbitrary Python values; stored as pure Python files, so they are Git-friendly; executable as scripts; and shareable as web apps or WASM-powered static HTML. To keep code and outputs consistent, marimo models notebooks as directed acyclic graphs on cells, based on the variables declared and referenced by each cell, and pairs this graph with a reactive runtime. 
Run or delete a cell and the runtime automatically marks affected cells as stale, optionally running them to eliminate hidden state and hidden bugs. We discuss our implementation, which is powered by static analysis and takes inspiration from reactive notebooks for other languages, particularly Pluto.jl. As a bonus, reactive execution makes it easy to work with interactive widgets, which are made available to the user upon importing marimo as a library into their notebook, and to reuse notebooks as scripts or apps. The marimo file format has the following properties: it is human readable, Git-friendly, usable as a regular Python module, executable as a script, and editable with a text editor. We show how the file format paves the way for virtual environment management as well as integration with other tools designed for code, like pytest. marimo's design decisions come with tradeoffs, which we discuss. Reactive execution hardens reproducibility and enables rapid experimentation with data, but it imposes constraints on the ways that variables can appear in the notebook. Storing notebooks as code simplifies package management (and versioning with Git), but means a separate solution must be developed for serializing and viewing notebook outputs. These tradeoffs may not be acceptable to all notebook users, especially those who use notebooks as extended REPLs or scratchpads. However, through a series of examples, we hope to demonstrate why these tradeoffs are worth it for researchers who use notebooks as a core part of their scientific process. This talk is directed at anyone with an interest in notebooks and interactive computation. By using marimo as a case study, we hope to answer a broader question. If notebooks were re-imagined as reactive and reusable Python programs, instead of REPL-like scratchpads, would that change the way you worked with code and data? 
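The dataflow model described above (a graph on cells derived from the variables each cell declares and references, plus marking downstream cells as stale) can be sketched with Python's `ast` module. This is a simplified illustration and not marimo's implementation: the functions `defs_and_refs`, `build_graph` and `stale_after` and the example cells are hypothetical, and the analysis treats every name as global, ignoring scoping rules that a real implementation would have to handle.

```python
import ast


def defs_and_refs(source):
    """Return (names a cell defines, names it reads but does not define)."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                refs.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defs.add(node.name)
        elif isinstance(node, ast.alias):  # names bound by import statements
            defs.add((node.asname or node.name).split(".")[0])
    return defs, refs - defs


def build_graph(cells):
    """Map each cell to the set of cells that read a variable it defines."""
    info = {name: defs_and_refs(src) for name, src in cells.items()}
    children = {name: set() for name in cells}
    for producer, (defs, _) in info.items():
        for consumer, (_, refs) in info.items():
            if producer != consumer and defs & refs:
                children[producer].add(consumer)
    return children


def stale_after(children, changed):
    """Return every cell downstream of `changed`, excluding `changed` itself."""
    stale, stack = set(), [changed]
    while stack:
        for child in children[stack.pop()]:
            if child not in stale:
                stale.add(child)
                stack.append(child)
    return stale


# Hypothetical notebook: cell name -> cell source code.
cells = {
    "load": "data = [1, 2, 3]",
    "scale": "factor = 10",
    "transform": "scaled = [factor * v for v in data]",
    "report": "total = sum(scaled)",
}
graph = build_graph(cells)
```

Here `stale_after(graph, "load")` contains both `transform` and `report`, while editing `report` invalidates nothing else, which is the information a reactive runtime needs to decide what to re-run.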
PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/CG8P37/ Ballroom Akshay Agrawal PUBLISH UEBEUP@@cfp.scipy.org

-UEBEUP

Dive into Flytekit's Internals: A Python SDK to Quickly Bring your Code Into Production en

20250711T143500 20250711T150500 0.03000

Dive into Flytekit's Internals: A Python SDK to Quickly Bring your Code Into Production

Flyte is an open source data and machine learning orchestrator built on top of Kubernetes. Although the backend is built with Golang, most developers use the Python SDK, flytekit, for interacting with Flyte. With the core logic written in Golang, we get the benefits of a compiled language while having the developer ergonomics of Python. In this talk, we will learn about the design constraints and implementation details of flytekit’s core features. - Flytekit enables us to run the same code locally and quickly bring that into a remote cluster by adding a `--remote` flag. The mechanism for sending our code to the cluster is called “fast registration”, which intelligently finds what our code needs and uploads that to a blob store. - Flytekit requires the developer to provide types for all functions. Flyte’s “type transformers” use these Python types to determine how to serialize data between tasks in a workflow. - Scientific Python code has many dependencies and building Docker images can be complex. Flytekit’s image builder abstraction enables us to build images without learning Docker and quickly get up and running. - Flytekit’s Python SDK communicates with the Golang backend using Protobuf. This communication mechanism allows Flyte to interoperate with other languages, such as JavaScript or Java. - Flyte’s plugin system allows us to extend flytekit with custom code. For example, we can define our own “type transformers”, run distributed GPU workflows, or use Dask and Ray on Flyte. By the end of this talk, you will have a deeper understanding of the design decisions and compromises made by flytekit for building a Python SDK used to orchestrate data and machine learning workflows. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/UEBEUP/ Ballroom Thomas J. Fan PUBLISH XXU9BP@@cfp.scipy.org

-XXU9BP

Lightning Talks en

20250711T153000 20250711T163000 1.00000

Lightning Talks

PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/XXU9BP/ Ballroom PUBLISH KE7C9X@@cfp.scipy.org

-KE7C9X

SciPy Proceedings: An Exemplar for Publishing Computational Open Science en

20250711T104500 20250711T111500 0.03000

SciPy Proceedings: An Exemplar for Publishing Computational Open Science

**Presentation Outcomes** By highlighting the new capabilities in the SciPy Proceedings and putting them in context of other ongoing open-science initiatives, we aim to: - Strengthen and celebrate the SciPy Proceedings as a leading example of modern computational publishing; - Showcase the possibilities of interactive and executable research articles with live examples; - Demonstrate open workflows, tools and publishing environments that make computational research publishing more accessible and reproducible. **Target Audience** This talk is ideal for: - People who are interested in scientific publishing and reproducibility - Researchers interested in sharing computational narratives and reproducible workflows. - Attendees who have published in the SciPy Proceedings and are interested in the underlying workflows. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/KE7C9X/ Room 315 Rowan Cockett PUBLISH Z78CFX@@cfp.scipy.org

-Z78CFX

From Legacy to Leading-Edge: Revamping NCEI Software for the Cloud Era en

20250711T112500 20250711T115500 0.03000

From Legacy to Leading-Edge: Revamping NCEI Software for the Cloud Era

Extreme weather events are increasing, posing risks to industries and economic stability. NOAA’s National Centers for Environmental Information (NCEI) supports informed decision-making through the Industry Proving Grounds (IPG), modernizing data delivery and accessibility. The IPG bridges industry needs with scientific data by engaging sectors like architecture, re/insurance, and retail to co-develop useful data products. This ensures data services are technically robust and practically applicable, helping industries navigate risks and sustain economic growth. Modernizing NOAA’s product development infrastructure is key to improving reliability, performance, cost-efficiency, maintainability, and ease of future improvements. Efforts focus on streamlining procedures, enhancing capabilities, simplifying the code base, and leveraging cloud-based solutions. This enables NCEI to provide timely insights while reducing maintenance costs and improving system resilience. This presentation explores the technical innovations behind IPG’s success. We will discuss standardizing coding language, integrating Polars, a fast Python DataFrame library optimized for large-scale data analysis. Polars' multi-threaded execution and lazy evaluation improve performance when processing NOAA’s vast datasets. Additionally, using AWS enhances data accessibility, scalability, and computational power for real-time processing and distribution. CI/CD pipelines are crucial to IPG’s efficiency, ensuring continuous integration and deployment of software updates. Automated testing and seamless deployment minimize downtime and enhance responsiveness to industry needs. These practices demonstrate how DevOps methodologies improve scientific computing workflows and data-driven decision-making. Core tools like Polars, AWS, and CI/CD enable streamlined data ingestion, processing, and dissemination, reducing latency and enhancing data quality. 
Information centers play a key role in open-source software development by providing real-world feedback, identifying bottlenecks, and improving performance. This talk highlights how scientific computing, cloud technology, and software engineering contribute to climate resilience. NCEI’s approach serves as a model for integrating open-source tools, cloud services, and CI/CD practices to maximize environmental data’s impact. Participants will gain insights into science, data engineering, and industry resilience, learning how software engineering best practices can enhance real-world decision-making and tackle global challenges. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/Z78CFX/ Room 315 Sarah Purpura PUBLISH SDPLQQ@@cfp.scipy.org

-SDPLQQ

Real-time ML: Accelerating Python for inference (< 10ms) at scale en

20250711T131500 20250711T134500 0.03000

Real-time ML: Accelerating Python for inference (< 10ms) at scale

## Relevance to SciPy audience and high-level overview

* Offers insights into parsing Python's abstract syntax tree using Python’s ast module
* Demonstrates how we parse type annotations to build a DAG that’s later leveraged to dynamically create query plans at inference time
  * Filters and projections are pushed down E2E
* Parse Python functions to generate static expressions that are executed either
  * in C++ post fetching data
  * or as SQL UDFs at query time to achieve orders of magnitude speed up
  * We also built our own SQL Driver in C++ that we interface with to replace SQLAlchemy
* Python functions that can’t be converted into static expressions are run against isolated processes (Ray cluster or custom sub-process) for parallelism
* Execute the query plan using Velox
  * Velox is an extensible C++ execution engine library for building data management systems
  * Open-source with 4k stars on GitHub
  * Contribute upstream to Velox, as it wasn’t built with ML use cases in mind

## Background & Motivation

* Real-time machine learning demands rapid response times
* Python’s speed is often a bottleneck
* Overview of the strategies we’ve employed to circumvent this
* Need for real-time is only growing
  * Demand for contextually aware systems that can process and react to information instantly is increasing
  * Demand also grows as the individual touchpoints between humans and AI systems multiply

**What are the real-world applications for real-time ML?** Inference, recommendations, reranking, etc.

* Predictive monitoring + health
  * Predictive monitoring with sensor data (Pathogens, pollution, IoT, etc.)
  * Acute disease detection
  * Operations e.g. ambulance routing
* Fraud: Is this transaction fraudulent? Is this website phishing (AI-generated content)?
  * Lots of features (which are only meaningful when meticulously combined)
* Ecommerce: Recommendations e.g. other shoppers also bought X
  * Collaborative filtering
* Marketplaces: Re-ranking matches between buyers and sellers
  * Event-driven similarity and vector search at a huge scale
  * single user + real-time preferences vs many sellers / products + real-time preferences

## Methods & Approach

* DAG construction using Python’s AST
* Automatic parallel execution of concurrent pipeline stages
* Vectorization of pipeline stages that are written using scalar syntax
* Low-latency key/value stores like Redis, Bigtable, and DynamoDB to minimize cached feature fetch time
* Statistics-informed join planning
* JIT transformation of Python code into native code

## Results & Effects

* Throughput of hundreds of millions of features per second.
* Sub-second latencies (E2E)
* Zero-ETL - Our ML model can now be fully decoupled from ETL unlocking similar affordances to infrastructure-as-code
* Single source of truth
  * Features and logic are shared and consistent across teams i.e. another team can easily use the (exact same) logic for computing creditworthiness
  * Version control > easily roll back to the previous iteration without needing to backfill data
* Branch deploys - Data scientists can experiment in their own branches
  * Easily simulate model runs with historical production data
  * Experiment, A/B test, and evaluate models
* Prevent drift between training/serving, staging/prod, etc.
  * It’s the same underlying data
* Integrate more data sources
  * Minimize dependency and schema requests to your DE teams since data is transformed post-fetch
  * Broadened contexts by pulling from Postgres, Kafka, AWS Glue, Snowflake, a microservice/3rd party API

PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/SDPLQQ/ Room 315 Elliot Marx PUBLISH 3KJRWT@@cfp.scipy.org
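The "DAG construction using Python’s AST" step from the Real-time ML abstract above can be illustrated with a small standard-library sketch. It is not the speakers' implementation: the feature functions, the convention that a parameter name refers to an upstream feature, and the helper names `build_dag` and `return_types` are all assumptions made for illustration.

```python
import ast
from graphlib import TopologicalSorter

# Hypothetical feature definitions, kept as source text so they can be
# parsed without being executed. Assumed convention: each parameter name
# refers to another feature that must be computed first.
SOURCE = '''
def txn_amount() -> float: ...
def user_avg_amount() -> float: ...
def amount_ratio(txn_amount: float, user_avg_amount: float) -> float:
    return txn_amount / user_avg_amount
def is_suspicious(amount_ratio: float) -> bool:
    return amount_ratio > 3.0
'''


def build_dag(source: str) -> dict[str, set[str]]:
    """Map each top-level function name to the features it depends on."""
    tree = ast.parse(source)
    return {
        node.name: {arg.arg for arg in node.args.args}
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


def return_types(source: str) -> dict[str, str]:
    """Read each function's return annotation back as source text."""
    tree = ast.parse(source)
    return {
        node.name: ast.unparse(node.returns)
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.returns is not None
    }


dag = build_dag(SOURCE)
types = return_types(SOURCE)

# TopologicalSorter expects {node: predecessors}, which matches `dag`,
# so static_order() yields every feature after its inputs.
order = list(TopologicalSorter(dag).static_order())
print(order)
print(types)
```

A real planner would go further (pushing down filters, choosing where each expression runs), but the dependency graph and a valid execution order are the starting point.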

-3KJRWT

From Model to Trust: Building upon tamper-proof ML metadata records en

20250711T135500 20250711T142500 0.03000

From Model to Trust: Building upon tamper-proof ML metadata records

The integrity and provenance of machine learning models are critical for building trustworthy AI systems. While cryptographic signing protects many digital assets, a standardized approach for verifying model origins and ensuring they haven't been tampered with is still missing. We are addressing this gap by building upon the OpenSSF Model Signing project – a PKI-agnostic method for creating verifiable claims on bundles of ML artifacts. We show how this project can expand beyond just model signing to also cover datasets and other associated files, recording all integrity information in a single manifest. In fact, this can be used as a foundation layer upon which we can build useful AI supply-chain solutions, both in terms of security and in terms of reducing development costs. Imagine querying "What datasets were used to train this model?" or determining which models and agents have been trained on a poisoned dataset, even before these are deployed in production systems. This is all possible by merging model signing, model cards, SLSA and AI-BOM information and analyzing all this metadata using tools such as GUAC. Our talk lays the groundwork for such capabilities. Benefits to the ecosystem: We have a vision of what a secure end-to-end ML system can look like, in a way that not only enhances security but also allows companies to keep costs down. This talk lays down the foundations of this vision, presents what is already here and what we are planning to work on under OpenSSF and CoSAI this year to achieve this vision. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/3KJRWT/ Room 315 Mihai Maruseac PUBLISH ZNLF8V@@cfp.scipy.org
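The idea of "recording all integrity information in a single manifest" from the abstract above can be sketched with the Python standard library. This is only an illustration and does not use the OpenSSF Model Signing API. It also leaves out the signing step entirely, so it detects changes but proves nothing about origin. The helper names `build_manifest` and `changed_files` and the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """One manifest for the whole bundle: relative path -> digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Paths that were added, removed, or modified since the manifest."""
    current = build_manifest(root)
    return sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )


# Demo on a throwaway bundle holding a model file and a dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "model.bin").write_bytes(b"weights-v1")
    (root / "data").mkdir()
    (root / "data" / "train.csv").write_text("x,y\n1,2\n")

    manifest = build_manifest(root)
    untouched = changed_files(root, manifest)

    # Simulate tampering with the training data after the manifest exists.
    (root / "data" / "train.csv").write_text("x,y\n1,999\n")
    tampered = changed_files(root, manifest)

print(untouched)
print(tampered)
```

In a signed workflow the manifest itself would be what gets signed, so that a verifier can trust the recorded digests before comparing them against the files.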

-ZNLF8V

Accelerating scientific data releases: Automated metadata generation with LLM agents en

20250711T143500 20250711T150500 0.03000

Accelerating scientific data releases: Automated metadata generation with LLM agents

We present a novel approach to automate metadata generation for scientific data using LLM agents. Building on the use case of USGS ScienceBase data repository, we demonstrate how pre-trained models can be fine-tuned to understand new scientific data sets and generate standard-compliant metadata. Our system orchestrates a modular pipeline that leverages multiple LLM agents to parse, analyze, and generate high-quality metadata for a variety of scientific datasets, including images, time series, and text data. We discuss the technical challenges and opportunities of using LLMs for metadata generation and outline strategies for community-driven enhancements. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/ZNLF8V/ Room 315 Chirag Tudor Garbulet PUBLISH PNSEX8@@cfp.scipy.org

-PNSEX8

SpikeInterface: Streamlining End-to-End Spike Sorting Workflows en

20250711T104500 20250711T111500 0.03000

SpikeInterface: Streamlining End-to-End Spike Sorting Workflows

## Background and Motivation

Spike sorting—the process of extracting individual neuron spiking activity from raw electrophysiological recordings—is a central technique in systems neuroscience, yet it remains fraught with challenges. Researchers routinely struggle with a fragmented landscape of file formats (both open and proprietary) and heterogeneous sorting algorithms, each with its own API and complex dependencies. These technical hurdles are further compounded by variability in data pre-processing pipelines, massive datasets generated by dense electrode arrays, and persistent difficulties in reproducing analyses across laboratories. SpikeInterface aims to solve this problem by providing an end-to-end, unified Python framework for spike sorting. It establishes a common interface to diverse recording formats (e.g., Open Ephys, Plexon, Neurodata Without Borders) and spike sorters (e.g., Kilosort2, SpyKING Circus, IronClust), liberating neuroscientists to focus on scientific questions rather than writing complex data conversion and management code.

## Technical Discussion and Broader Relevance

This talk will highlight several core technical features in SpikeInterface:

- **Unified I/O and Metadata Management.** By abstracting data I/O interactions behind a standardized Extractor object, SpikeInterface simplifies loading and saving data. This approach not only solves many cross-format incompatibilities, but also enables further workflows such as attaching metadata about probes, electrode geometries, sampling rates, and additional descriptors crucial for reproducibility.
- **Modular Pre-processing and Post-processing.** SpikeInterface supports a wide range of pre-processing (e.g., filtering, re-referencing) and post-processing (e.g., principal component analysis, waveform extraction, quality metrics) steps. Most are implemented in a lazy fashion to enable memory-efficient access.
- **Containerized Algorithms and Parallelization.** We provide containerized versions of the most common sorting algorithms through Docker and Singularity, plus a framework for adding future ones. This approach effectively resolves library conflicts between sorters implemented in MATLAB, C++, Python, or mixed languages, while maintaining a consistent API for end users regardless of the underlying implementation.
- **Modular Sorting Components.** Despite diverse designs, most spike sorting algorithms share common building blocks: peak detection, feature extraction, clustering, template matching, and drift correction. SpikeInterface exposes these steps modularly, allowing users to modify peak detection thresholds, compare feature extraction strategies, or experiment with different clustering approaches. This composable design enables mixing methods across sorters, reusing validated components, and benchmarking against various datasets (ground-truth, hybrid, and synthetic). Researchers can adapt core steps for different recording conditions while maintaining a unified API, reducing code redundancy and accelerating research and experimentation.

## Intended Audience

The talk is aimed at neuroscientists, data engineers, and scientific software developers. Attendees who might be interested in the following topics will benefit:

- Strategies for handling heterogeneous file formats and complex dependency trees
- Robust pre-processing, post-processing, and curation workflows for big electrophysiological data
- Reproducible benchmarking for spike sorters

## Project Resources and Previous Presentations

- **SpikeInterface documentation:** [Library documentation](https://spikeinterface.readthedocs.io/en/latest/)
- **Source code:** [GitHub repo](https://github.com/SpikeInterface/spikeinterface)
- **Previous presentations on SpikeInterface project:**
  - *Spike Sorting Workshop (2024):* overview of SpikeInterface, main features and capabilities, and history of the project
  - *Oxford Cortex Club slides (2025):* slides for a more recent presentation with similar content

[Watch the presentation](https://www.youtube.com/watch?v=tFa45h2m9o4&t=251s) [View the slides](https://docs.google.com/presentation/d/11XvYJXsbrwYEmSq0HpJKCkStx_9mLf2539gyCVi7m-c/edit?usp=sharing)

PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/PNSEX8/ Room 317 Heberto Mayorquin Alessio Buccino PUBLISH K98LXU@@cfp.scipy.org

-K98LXU

Processing Cloud-optimized data in Python (Dataplug) en

20250711T112500 20250711T115500 0.03000

Processing Cloud-optimized data in Python (Dataplug)

-7MURRS

Learning the art of fostering open-source communities en

20250711T131500 20250711T134500 0.03000

Learning the art of fostering open-source communities

The project’s maturity level defines the size of the community. Although a larger community takes more effort to manage, certain practices can be applied to every project, no matter the size. In order to maintain a community, the maintainers should focus on building and growing it. I’ll talk about ways in which you can identify your community and how to grow it. I’ll then dive into certain practices to preserve it and help it flourish. Here’s the outline:

First, I'll talk about:

### Introduction (5 mins.)

- But wait, *what* is the community?
- Identifying the community of your project, including users, maintainers and essential stakeholders
- How do I identify the maturity level of my project?
- How do I grow my community?
- Ensuring essential guides, licenses and documents are updated
- Having a comprehensive contributing guide
- Ensuring the CoC and Governance are complete and include essential elements, such as a reporting mechanism, project lead team, adding a new core member, etc.

After this:

### Best Practices (15 mins.)

- Various ways to effectively engage with the community
- How to make your community meetings enjoyable and *not* so boring!
- Office hours—a hidden venue to understand and help your community
- How to manage online spaces like chat rooms, forums, and mailing lists like a pro
- Conflicts—we hate it! Contributions—we love it!
  - Establish conflict resolution mechanisms
  - Creating processes and workflows to handle substantial contributions
- How do I grow my community?
- Building scaffolding to streamline, attract, and encourage contributions
- How do I identify important stakeholders or potential donors for my project?
- Leverage educational programs like GSoC and Outreachy

Then, I'll discuss:

### Pitfalls (5 mins.)

- What should you avoid when your project expands exponentially? (Note to self: Welcome bot for PR and issues stuff)
- How can you identify and remove the bottlenecks in various aspects of your community?
- How do we ensure that harmful elements do not hinder growth?

And finally, closing by:

### Conclusion (5 mins.)

- Key takeaways
- QnA

This talk aims to address humans who are maintaining open-source projects and looking for the best advice on managing the open-source community. I’d also like to invite anyone interested in the lessons I’ve learnt by maintaining the Zarr project throughout the years. The tone of the talk is set to be informative, story-telling and fun.

### After this talk, you’d:

- Understand how to identify your project’s community
- Best practices to manage and evolve it
- And what not to do when managing it

PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/7MURRS/ Room 317 Sanket Verma PUBLISH LND7LC@@cfp.scipy.org

-LND7LC

From the outside, in: How the napari community supports users and empowers transition to contribution en

20250711T135500 20250711T142500 0.03000

From the outside, in: How the napari community supports users and empowers transition to contribution

Open-source software thrives on community contributions, but for users without formal software development training, stepping into development can be intimidating. The perception that core contributors come from well-established software backgrounds and mentorship lineages can create an implicit barrier, leaving eager users without clear pathways to contribute. However, [napari](https://github.com/napari)—a fast, interactive, multi-dimensional scientific data viewer—breaks down these barriers and empowers contributors through a strong presence in the bioimage analysis community (e.g. [forums](https://forum.image.sc/), [video tutorials](https://youtube.com/playlist?list=PLbmHuawY0ITJugyk7KSqDRJEEjX_KWBp2&si=_3aoiRzyeOPBXCgb), and [popular learning materials](https://haesleinhuepf.github.io/BioImageAnalysisNotebooks/intro.html)), fantastic human-centered [documentation](https://napari.org/), engagement with the community through accessible channels (e.g. [Zulip](https://napari.zulipchat.com/), [GitHub](https://github.com/napari/napari/issues), and [community meetings](https://napari.org/dev/community/meeting_schedule.html)) and the opportunity for anyone to contribute to its rich [plugin ecosystem](https://www.napari-hub.org/). I stumbled into bioimage analysis eight years ago as a bench scientist automating the quantification of microscopy data, a manual task amongst my peers. At the time, [ImageJ/FIJI](https://imagej.net/software/fiji/), written in Java, was the most well-established open source tool for this undertaking. While its macro-recorder provided an accessible entry point into scripting and sharing my workflows, transitioning beyond it—especially towards GUI development—was incredibly challenging without programming experience. During my postdoc, I sought an ecosystem that integrated data processing, quantification, and analysis without constant tool-switching. 
The scientific Python stack stood out, and I was particularly drawn to the rapid development and excitement surrounding napari. Beyond its powerful core features, napari’s plugin ecosystem brings a diverse set of features, from advanced [visualization](https://github.com/brainglobe) and [image reading and writing](https://github.com/AllenCellModeling/napari-aicsimageio), to fully featured [interfaces to develop workflows](https://github.com/haesleinhuepf/napari-assistant), to making [machine](https://github.com/haesleinhuepf/apoc) and [deep learning](https://github.com/computational-cell-analytics/micro-sam) analyses accessible. Napari’s ecosystem is great both for general users and for bioimage analysts looking to share their work with the community. Thus, I began to work on developing [my own plugin](https://github.com/timmonko/napari-ndev), which was made accessible by great tutorials and understandable tools to make contributing easier, like the [napari-plugin-template](https://github.com/napari/napari-plugin-template) and [magicgui](https://github.com/pyapp-kit/magicgui) (a project that grew out of the napari community’s efforts). The plugin model allowed me to explore, try things out, and make mistakes while improving my own plugin, without having to work directly on core code. I will discuss the great direct and indirect mentorship that empowered me to achieve my goals, all the while gaining confidence and the satisfaction that I contributed meaningfully to the community. I gained the confidence to contribute directly to napari and have found the process inspiring and fulfilling. I will highlight ways in which napari maintainers and contributors help guide and encourage users through issues and first pull requests, including accessibility through discussion at frequent open community meetings. 
I will walk through the challenges that I and other contributors have faced and the ways in which the napari team works to resolve these gaps; I’ll particularly highlight how the napari team formats the documentation to provide clear tutorials on [functionality](https://napari.org/stable/usage.html), [architecture](https://napari.org/stable/developers/architecture/index.html), and [how-to develop and test](https://napari.org/stable/developers/contributing/index.html). Finally, I will share how napari’s emphasis on clear goals, transparent decision-making, and inviting culture has fostered a community where scientists can not only use, but also shape the tools they rely on. By sharing strategies that have worked in napari and the bioimage analysis community, this talk aims to provide inspiration and practical takeaways for building and sustaining welcoming communities, bridging the gap between users and contributors and between scientists and software developers. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/LND7LC/ Room 317 Tim Monko PUBLISH X7S9PA@@cfp.scipy.org

-X7S9PA

Remote development for students and indie researchers with Spyder en

20250711T143500 20250711T150500 0.03000

Remote development for students and indie researchers with Spyder

The last 10 years have witnessed the rise of the PyData ecosystem to become the lingua franca of scientific computing. Behind that incredible success, no doubt, there has been its open source nature, which allows students and researchers in any part of the world to use it at no cost. However, the fact that the ecosystem is a public good doesn't necessarily make it a democratic one. Although all PyData resources (i.e. its libraries and documentation) are free, the ecosystem still requires a good amount of technical knowledge to set it up and get the most out of it. This is especially true for remote development. It's not an easy task to move code developed locally to HPC clusters or the cloud in order to access a larger pool of resources (memory, CPUs or GPUs) to execute it. Perhaps the simplest workflow to do that involves logging in to the remote machine through SSH, creating a Python environment there, moving your code using the `scp` command and finally running it. That's certainly not straightforward for many scientists and engineers, but doable. However, what happens when your code doesn't work because the local and remote environments are slightly different? Or when you need to make some adjustments to get it working remotely? Then the local and remote versions of the code start to diverge and you're more or less forced to work in an SSH terminal for remote development, which is atrocious. A possible solution to that, using OSS tools, is to set up JupyterHub and JupyterLab on the remote machine and work from the latter. But without IT staff to rely on for help, that could be a daunting endeavor. That's even more difficult in third-world countries because that staff is either small or non-existent. Spyder 6.1 aims to democratize this process by drastically simplifying what is required from users to get started with remote development. 
As will be shown in the first part of this talk, all users have to do is enter their SSH credentials and decide what packages they'd like to use for their code. Then Spyder will establish a connection to the remote machine, install a server there (called Spyder Remote Services) to handle all facilities required for remote development, and create a Conda environment with the requested packages. Finally, by simply opening an IPython console for the machine, users will be ready to run their code on it and graphically upload/download data sets or other files to/from it. The second part will describe how the architecture behind these enhancements was carefully designed to make it easily extensible by third-party plugins. Spyder Remote Services is a Jupyter Server extension, so it can be customized by other extensions and also installed in an existing JupyterHub deployment. On the Spyder side, the Remote Client plugin offers a context manager to interact with the server API as needed. This is no small detail because the remote facilities of other IDEs (e.g. VSCode and PyCharm) don't offer any way to do something similar. PUBLIC CONFIRMED Talk https://cfp.scipy.org//scipy2025/talk/X7S9PA/ Room 317 Carlos Cordoba C.A.M. Gerlach