SciPy 2024

07:00
07:00
60min
Registration and Breakfast
Ballroom A
08:00
08:00
240min
A Practical Introduction to NumPy
Tim Diller

Do you have a basic understanding of Python and want to "level-up" your computational skills? Do you instinctively write for-loops to perform computations on your arrays? Have you ever heard someone complain "Python is slow" and want to prove them wrong? Do you want to know how to manipulate NumPy arrays like a master? If any or all of these are true, then this tutorial is for you!
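
A minimal illustration of the loop-versus-vectorized contrast the abstract alludes to (a hypothetical example, not taken from the tutorial materials):

```python
import numpy as np

def msd_loop(a, b):
    # The instinctive for-loop: every iteration runs in the Python interpreter.
    total = 0.0
    for i in range(len(a)):
        total += (a[i] - b[i]) ** 2
    return total / len(a)

def msd_vectorized(a, b):
    # Whole-array arithmetic dispatches to precompiled C loops.
    return float(np.mean((a - b) ** 2))

a = np.arange(5, dtype=float)
b = np.ones(5)
print(msd_loop(a, b), msd_vectorized(a, b))  # same answer, very different speed
```

On arrays with millions of elements the vectorized version is typically orders of magnitude faster, which is the "Python is slow" rebuttal in a nutshell.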

Tutorials
Room 315
08:00
240min
A hands-on forecasting guide: from theory to practice
Ian Spektor, Diego Kiedanski, Mathieu Guillame-Bert

Forecasting is central to decision-making in virtually all technical domains. For instance, predicting product sales in retail, forecasting energy demand, and anticipating customer churn all have tremendous value across different industries. However, the landscape of forecasting techniques is as diverse as it is useful, and different techniques and expertise are adapted to different types and sizes of data.
In this hands-on workshop, we give an overview of forecasting concepts, methods, and practical application on the M5 forecasting challenge. We’ll walk you through data exploration, data preparation, feature engineering, statistical (e.g., STL, ARIMA) and machine learning (e.g., decision forests) modeling, meta-modeling (e.g., hierarchical and relational modeling, ensembles, resource models), and how to safely evaluate such temporal models.

Tutorials
Room 316
08:00
240min
Data Visualization with Vega-Altair
Jon Mease, Christopher Davis

This tutorial is an introduction to data visualization using the popular Vega-Altair Python library. Vega-Altair provides a simple, friendly, and consistent API that supports rapid data exploration. Vega-Altair’s flexible support for interactivity enables the creation and sharing of beautiful interactive visualizations.

Participants will learn the foundational concepts that Vega-Altair is built on and will gain hands-on experience exploring a variety of datasets. Of particular interest to the scientific community, this tutorial will cover recent advancements in the Vega-Altair ecosystem that make it possible to scale visualizations to large datasets, and to easily export visualizations to static image formats for publication.

Tutorials
Ballroom A
08:00
240min
Determining Climate Risks with NASA Earthdata Cloud
Dhavide Aruliah, Karthik Venkataramani, Patricia A. Loto

This tutorial walks participants — Earth scientists with some prior Python experience — through analyses of two particular climate risk scenarios: floods & wildfires. The goal is to obtain hands-on experience with common reproducible Jupyter/Python workflows based on data products from the NASA Earthdata Cloud. The case studies highlight the interplay of distributed data with scalable numerical strategies — "data-proximate computing" — implemented using scientific Python libraries like Xarray & Dask. This tutorial — co-developed by 2i2c and MetaDocencia — constitutes part of NASA's Transform to Open Science (TOPS) initiative to reinforce principles of Open Science & reproducibility.

Tutorials
Ballroom C
08:00
240min
Intro to Ibis: blazing fast analytics with DuckDB, Polars, Snowflake, and more, from the comfort of your Python REPL.
Gil Forsyth, Phillip Cloud, Naty Clementi, Jim Crist-Harif

Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) are able to analyze this same tabular data, but perform orders-of-magnitude faster than pandas, all while using less memory. Many of these systems, though, only provide a SQL interface, which is far removed from pandas' dataframe interface and requires a rewrite of your analysis code.

This is where Ibis comes in. Ibis is a pure-Python open-source library that provides a dataframe interface to many popular databases and analytics tools (DuckDB, Polars, Snowflake, Spark, etc...). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL. No more pains rewriting pandas code to something else when you run into performance issues; write your code once using Ibis and run it on any supported backend.

https://ibis-project.org/
https://github.com/ibis-project/ibis

Tutorials
Room 317
08:00
240min
TorchGeo: Advancing Earth Observation Through Machine Learning
Isaac Corley, Adam Stewart

TorchGeo is an open-source PyTorch domain library designed to make it simple for ML experts to work with geospatial data and for remote sensing experts to explore ML solutions. It abstracts away the complexities of reading, reprojecting, and resampling data, allowing researchers to instead focus on the science. It provides over 75 built-in datasets for uncurated raster data (e.g., Sentinel, Landsat) and segmentation masks (e.g., NLCD, CDL), datasets built for self-supervised learning (e.g., SeCo, SSL4EO), and curated task-specific benchmark datasets (e.g., EuroSAT, BigEarthNet). It also comes with over 40 models pre-trained on your favorite satellites.

Tutorials
Ballroom D
08:00
240min
Unlocking Dynamic Reproducible Documents: A Quarto Tutorial for Scientific Communication
Mine Çetinkaya-Rundel

Quarto is an innovative, open-source scientific and technical publishing system compatible with Jupyter Notebooks and plain text markdown documents. Quarto provides data scientists with a seamless way to publish their work in a high-quality format that is reproducible, accessible, and shareable. With Quarto, researchers can turn their Jupyter Notebooks and literate plain text markdown documents into professional-looking publications in various formats. This workshop will demonstrate how Quarto enables data scientists to turn their work products into professional, high-quality documents, slides, websites, scientific manuscripts, and other shareable artifacts.
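
As a sketch of how little is needed to start (hypothetical filename and options, not from the workshop materials), a Quarto document is a notebook or plain text markdown file whose output formats are declared in a YAML header:

```yaml
# YAML header of a hypothetical report.qmd
title: "My Analysis"
author: "A. Researcher"
format:
  html: default
  pdf: default
jupyter: python3
```

Running `quarto render report.qmd` then builds both the HTML and PDF outputs from the same source file.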

Tutorials
Room 318
12:00
12:00
90min
Lunch Break
Ballroom A
13:30
13:30
240min
Enhancing Predictive Analytics with tsbootstrap and sktime
Sankalp Gilda, Franz Kiraly

Explore tsbootstrap and sktime in our 4-hour tutorial, focusing on enhancing time series forecasting and analysis. Discover how tsbootstrap's bootstrapping methods improve uncertainty quantification in time series data, integrating with sktime's forecasting models. Learn practical applications in various domains, boosting predictive accuracy and insights. This interactive session will provide hands-on experience with these tools, offering a deep dive into advanced techniques like probabilistic forecasting and model evaluation. Join us to expand your expertise in time series analysis, applying innovative methods to tackle real-world data challenges.

Tutorials
Room 316
13:30
240min
GitHub Actions for Scientific Data Workflows
Valentina Staneva, Quinn Brencher

In this tutorial we will introduce GitHub Actions to scientists as a tool for lightweight automation of scientific data workflows. We will demonstrate that GitHub Actions are not just a tool for software testing, but can be used in various ways to improve the reproducibility and impact of scientific analysis. Through a sequence of examples, we will demonstrate some of GitHub Actions' applications to scientific workflows, such as scheduled deployment of algorithms to sensor streams, updating visualizations based on new data, processing large datasets, model versioning, and performance benchmarking. GitHub Actions can particularly empower Python scientific programmers who do not want to build fully-fledged applications or set up complex computational infrastructure, but would like to increase the impact of their research. The goal is for participants to leave with their own ideas of how to integrate GitHub Actions into their own work.
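
As a sketch of the scheduled-workflow pattern described above (hypothetical file, job, and script names), a workflow is a YAML file committed under `.github/workflows/`:

```yaml
# .github/workflows/nightly.yml — a hypothetical scheduled data workflow
name: nightly-data-update
on:
  schedule:
    - cron: "0 6 * * *"     # run every day at 06:00 UTC
  workflow_dispatch:         # also allow manual runs from the UI
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python process_new_data.py   # hypothetical analysis script
```

The same skeleton adapts to the other use cases mentioned: swap the trigger or the final step to regenerate a visualization, benchmark a model, or process a new batch of data.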

Tutorials
Room 318
13:30
240min
Hobby Drones, Urban Forests: A Geospatial Journey to Greener Cities
Kevin Lacaille

Drone imagery is more widely available than ever before, allowing the public to capture ultra high-resolution Earth images with hobbyist drones. In this workshop, we will explore drone imagery with Python tools such as geopandas, OpenCV, rasterio, numpy, and shapely. Afterwards, we will assess urban green spaces, focusing on counting trees and estimating their role in capturing carbon to fight climate change. This practical exercise will not only enhance our understanding of urban ecology, but also highlight the importance of trees in urban planning and environmental sustainability.

Tutorials
Ballroom C
13:30
240min
Interactive data visualizations with Bokeh (in 2024)
Timo Metzger, Bryan Van de Ven, Pavithra Eswaramoorthy

Bokeh is a library for interactive data visualization. You can use it with Jupyter Notebooks or create standalone web applications, all using Python. This tutorial is a thorough guide to Bokeh and its most recent new features. We start with a basic line plot and, step-by-step, make our way to creating a dashboard web application with several interacting components. This tutorial will be helpful for scientists who are looking to level up their analysis and presentations, and tool developers interested in adding custom plotting functionality or dashboards.

Tutorials
Ballroom A
13:30
240min
Pretraining and Finetuning LLMs from the Ground Up
Sebastian Raschka

This tutorial is aimed at coders interested in understanding the building blocks of large language models (LLMs), how LLMs work, and how to code them from the ground up in PyTorch. We will kick off this tutorial with an introduction to LLMs, recent milestones, and their use cases. Then, we will code a small GPT-like LLM, including its data input pipeline, core architecture components, and pretraining code ourselves. After understanding how everything fits together and how to pretrain an LLM, we will learn how to load pretrained weights and finetune LLMs using open-source libraries.

Tutorials
Ballroom D
13:30
240min
Thinking In Arrays
Gordon Watts, Vangelis Kourlitis

Despite its reputation for being slow, Python is the leading language of scientific computing, which generally needs large-scale (fast) computations. This is because most scientific problems can be split into "metadata bookkeeping" and "number crunching," where the latter is performed by array-oriented (vectorized) calls into precompiled routines.

This tutorial is an introduction to array-oriented programming. We'll focus on techniques that are equally useful in NumPy, Pandas, xarray, CuPy, Awkward Array, and other libraries, and we'll work in groups on three class projects: Conway's Game of Life, evaluating decision trees, and computations on ragged arrays.
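
As a taste of the array-oriented style, here is one way the Game of Life project could be sketched in NumPy (a hypothetical solution assuming wrap-around edges, not the tutorial's reference code):

```python
import numpy as np

def life_step(grid):
    """One Game of Life update with array shifts instead of per-cell loops."""
    # Count the eight neighbors by rolling the whole grid (wrap-around edges).
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Survive with two or three neighbors; be born with exactly three.
    return (neighbors == 3) | (grid & (neighbors == 2))

glider = np.zeros((6, 6), dtype=bool)
glider[[0, 1, 2, 2, 2], [1, 2, 0, 1, 2]] = True
print(life_step(glider).astype(int))
```

The whole update rule is two array expressions; no explicit loop over cells ever appears, which is exactly the "thinking in arrays" the title refers to.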

Tutorials
Room 315
13:30
240min
Xarray: Friendly, Interactive, and Scalable Scientific Data Analysis
Scott Henderson, Don Setiawan, Tom Nicholas, Wietze Suijker, Jessica Scheick, Max Jones, Luis Lopez, Negin Sobhani

Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. This hands-on tutorial focuses on intermediate and advanced workflows using complex real-world data. We encourage participants in this workshop to bring your own dataset as we will dedicate ample time to apply tutorial concepts to datasets of interest!

Tutorials
Room 317
07:00
07:00
60min
Registration and Breakfast
Ballroom A
08:00
08:00
240min
3D Visualization with PyVista
Tetsuo Koyama, Alexander Kaszynski, Bill Little, Bane Sullivan

PyVista is a general-purpose 3D visualization library used by more than 2,000 open source projects to visualize everything from computer-aided engineering and geophysics to volcanoes and digital artwork.

PyVista exposes a Pythonic API to the Visualization Toolkit (VTK) to provide tooling that is immediately usable without any prior knowledge of VTK and is being built as the 3D equivalent of Matplotlib, with plugins to Jupyter to enable visualization of 3D data using both server- and client-side rendering.

Tutorials
Ballroom C
08:00
240min
All the SQL a Pythonista needs to know: an introduction to SQL and DataFrames with DuckDB
Guen Prawiroatmodjo, Alex Monahan, Mehdi Ouazza, Elena Felder

Structured Query Language (or SQL for short) is a programming language to manage data in a database system and an essential part of any data engineer's tool kit. In this tutorial, you will learn how to use SQL to create databases and tables, insert data into them, and extract, filter, and join data or perform calculations using queries. We will use DuckDB, a new open-source, embedded, in-process database system that combines cutting-edge database research with dataframe-inspired ease of use. DuckDB is only a pip install away (with zero dependencies) and runs right on your laptop. You will learn how to use DuckDB with your existing Python tools like Pandas, Polars, and Ibis to simplify and speed up your pipelines. Lastly, you will learn how to use SQL to create fast, interactive data visualizations, and how to teach your data how to fly and share it via the Cloud.
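
The core create/insert/aggregate pattern covered here can be sketched with the standard library's sqlite3 module (used below only so the example has zero dependencies; the same statements run essentially unchanged in DuckDB):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements (site TEXT, value REAL)")
con.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("A", 1.0), ("A", 3.0), ("B", 2.0)],
)
# Aggregate and group — the bread and butter of analytical SQL.
rows = con.execute(
    "SELECT site, AVG(value) FROM measurements GROUP BY site ORDER BY site"
).fetchall()
print(rows)  # [('A', 2.0), ('B', 2.0)]
```
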

Tutorials
Room 315
08:00
240min
Automate your research with automan
Prabhu Ramachandran, Pawan Negi

In research involving any kind of computer simulation, we often have to execute several simulations that might become part of the final manuscript. Automating these simulations and their post-processing brings significant personal benefit by improving research output and productivity. Automation makes it much easier to run large parameter sweeps and studies, and allows you to focus on the important questions to ask rather than managing hundreds or thousands of simulations manually. This takes the drudgery of data/file management out of your hands, systematizes your research, and makes it possible to incrementally improve and refine your work. The added benefit is that your research also becomes much easier to reproduce.

Tutorials
Room 317
08:00
240min
Bring your __repr__’s to life with anywidget
Trevor Manz, Nezar Abdennur, Fritz Lekschas

Jupyter Widgets connect Python objects with web-based visualizations and UIs, enabling both programmatic and interactive manipulation of data and code. For example, lasso some points in a scatterplot visualization and access that selection in Python as a DataFrame.

anywidget makes it simple and enjoyable to bring these capabilities to your own Python classes, and it ensures easy installation and usage by end users in various environments. In this tutorial, you will create your own custom widgets with anywidget and learn the skills to be effective in extending your own Python classes with web-based superpowers.

Tutorials
Room 318
08:00
240min
Cookiecutter: Project Templates and Much More
Reka Anna Horvath

Cookiecutter is mainly known as a tool for software project templates, but its use cases are far more versatile: plain text as well as code, small building blocks as well as whole projects.
You can get started and build powerful templates without any programming, just by using a CLI tool and editing text files. And if you're willing to throw some Python code and Jinja extensions into the mix, you can build sophisticated and flexible automations.
The main goal of this workshop is to provide inspiration: How do you detect candidates for automation in your workflow? Where can you improve speed and consistency, and free up mental energy for the actual content of your task?
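
As a sketch of how little is needed to start (hypothetical variable names), a template is a directory whose prompts are declared in a cookiecutter.json file:

```json
{
  "project_name": "My Analysis",
  "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '_') }}",
  "author_name": "Your Name",
  "license": ["MIT", "BSD-3-Clause", "None"]
}
```

Running `cookiecutter <template-dir>` prompts for each value; the list becomes a multiple-choice prompt, and the Jinja expression derives a default from an earlier answer.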

Tutorials
Room 316
08:00
240min
From RAGs to riches: Build an AI document inquiry web-app
Pavithra Eswaramoorthy, Dharhas Pothina, Andrew Huang

As we descend from the peak of the hype cycle around Large Language Models (LLMs), chat-based document inquiry systems have emerged as a high-value practical use case. Retrieval-Augmented Generation (RAG) is a technique to share relevant context and external information (retrieved from vector storage) with LLMs, thus making them more powerful and accurate.

In this hands-on tutorial, we’ll dive into RAG by creating a personal chat app that accurately answers questions about your selected documents. We’ll use a new OSS project called Ragna that provides a friendly Python and REST API, designed for this particular use case. We’ll test the effectiveness of different LLMs and vector databases, including an offline (i.e., local) LLM running on GPUs on the cloud machines provided to you. We'll then develop a web application that leverages the REST API, built with Panel, a powerful OSS Python application development framework.

Tutorials
Ballroom D
08:00
240min
hvPlot and Panel: Easy data visualization, data exploration, and data apps
James A. Bednar

This tutorial will show you how to use the Pandas, Dask, or Xarray APIs you already know to interactively explore and visualize your data, even if it is big, streaming, or multidimensional. Then just replace your expression arguments with widgets to get an instant web app that you can share as HTML+WASM or backed by a live Python server. These tools let you focus on your data rather than the API, and let you build linked, interactive drill-down exploratory apps without having to run a web-technology software development project, and then share them without becoming an operations specialist.

Tutorials
Ballroom A
12:00
12:00
90min
Lunch Break
Ballroom A
13:30
13:30
240min
Building Complex Web Apps with Jupyter Widgets
Nicole Brewer, Matt Craig, Juan Cabanela, Maarten Breddels

Interactive widgets were introduced to the Jupyter ecosystem over 10 years ago. A number of progressively more powerful interactive widget packages have been developed since then, supporting the construction of sophisticated dashboards and interactive applications. This tutorial will describe a number of approaches to developing and managing complex web apps that are compatible with Jupyter widgets and promote scalable application development.

Tutorials
Room 318
13:30
240min
Create Your First Python Package: Make Your Python Code Easier to Share and Use
Leah Wasser, isabel zimmerman, Jeremiah Paige

Creating code that can be shared and reused is the pinnacle of open science. But tools and skills to share your code can be tricky to learn. In this hands-on tutorial, you’ll learn how to turn your code into an installable Python module that can be shared with others. To get the most out of this tutorial, you should be familiar with writing Python code, Python environments and functions.

You will leave this tutorial understanding how to:

  • Create code that can be installed into different environments
  • Use Hatch as a workflow tool, making setup and installation of your code easier
  • Use Hatch to publish your package to (test) PyPI
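
As a sketch of the end state (hypothetical package name and metadata), a Hatch-backed package needs little more than a pyproject.toml:

```toml
# pyproject.toml for a hypothetical package built with Hatch
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my-package"
version = "0.1.0"
description = "A small example package"
requires-python = ">=3.9"
dependencies = ["numpy"]
```

With this in place, `hatch build` produces the sdist and wheel, and `hatch publish` can upload them to (test) PyPI.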
Tutorials
Room 316
13:30
240min
Data of an Unusual Size (2024 edition): A practical guide to analysis and interactive visualization of massive datasets
Pavithra Eswaramoorthy, Dharhas Pothina

While most scientists aren't at the scale of black-hole-imaging research teams that analyze petabytes of data every day, you can easily find yourself in a situation where your laptop doesn't have quite enough power to do the analytics you need.

In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on actual powerful machines on a public cloud provided by the presenters – starting from how the data is stored and read, to how it is processed and visualized.

Tutorials
Ballroom A
13:30
240min
Generative AI Copilot for Scientific Software – a RAG-Based Approach using OLMo
Vani Mandava, Cordero Core, Don Setiawan, Niki Burggraf, Anant Mittal, Anshul Tambay, Madhav Kashyap, Anuj Sinha, Ishika Khandelwal

Generative AI systems built upon large language models (LLMs) have shown great promise as tools that enable people to access information through natural conversation. Scientists can benefit from the breakthroughs these systems enable to create advanced tools that will help accelerate their research outcomes. This tutorial will cover: (1) the basics of language models, (2) setting up the environment for using open source LLMs without the expensive compute resources needed for training or fine-tuning, (3) learning a technique like Retrieval-Augmented Generation (RAG) to optimize LLM output, and (4) building a “production-ready” app to demonstrate how researchers can turn disparate knowledge bases into special-purpose AI-powered tools. The right audience for our tutorial is scientists and research engineers who want to use LLMs for their work.

Tutorials
Ballroom D
13:30
240min
Image analysis and visualization in Python with scikit-image, napari, and friends
Lars Grüter, Erick Martins Ratamero, Biola Adeyemi, Jordão Bragantini

Between telescopes and satellite cameras and MRI machines and microscopes, scientists are producing more images than they can realistically look at. They need specialized viewers for multi-dimensional images, and automated tools to help process those images into knowledge. In this tutorial, we will cover the fundamentals of algorithmic image analysis, starting with how to think of images as NumPy arrays, moving on to basic image filtering, and finishing with a complete workflow: segmenting a 3D image into regions and making measurements on those regions. At every step, we will visualize and understand our work using matplotlib and napari.
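
The "images are NumPy arrays" starting point can be sketched without any imaging library at all (a synthetic example, not from the tutorial materials):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(0.1, 0.02, size=(32, 32))  # dim background "noise"
image[8:16, 8:16] += 0.8                      # one bright square "object"

# An image is just an array, so segmentation can be a plain comparison:
mask = image > 0.5
print(mask.sum())  # number of pixels in the segmented region
```

From here, scikit-image supplies the real filtering and region-measurement tools, and napari provides the multi-dimensional viewer for inspecting arrays like these.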

Tutorials
Ballroom C
13:30
240min
Introduction to Property-Based Testing
Zac Hatfield-Dodds

Writing correct software is difficult, and even scientists don’t always get it right
[citation needed].

Hypothesis is a testing package that will search for counterexamples to your
assertions – so you can write tests that provide a high-level description of your
code or system, and let the computer attempt a Popperian falsification. If it
fails, your code is (probably) OK… and if it succeeds you have a minimal input
to debug.

Come along and learn the principles of property-based testing, how to use
Hypothesis, and how to use it to check scientific code – whether highly-polished
or quick-and-dirty!
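
Hypothesis does far more (smart input strategies, shrinking to minimal examples), but the core falsification loop it automates can be sketched with the standard library alone (a toy example with an intentionally buggy function; all names here are hypothetical):

```python
import random

def clipped_abs(x):
    # Intentionally buggy "absolute value": wrong for large negative inputs.
    if x >= 0:
        return x
    return -x if x > -10**6 else x

def find_counterexample(prop, trials=10_000, seed=0):
    """Randomly search for an input that falsifies the property."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-10**9, 10**9)
        if not prop(x):
            return x  # a falsifying input
    return None  # the property survived every trial (not a proof!)

# Property: an absolute value is never negative.
counterexample = find_counterexample(lambda x: clipped_abs(x) >= 0)
print(counterexample)
```

The tutorial's point is that you state the property ("never negative") and let the computer hunt for the input that breaks it, rather than hand-picking test cases.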

Tutorials
Room 317
13:30
240min
Working with U.S. Census Data in Python: Discovery, Analysis, and Visualization
Darren Vengroff

The United States Census Bureau publishes over 1,300 data sets via its APIs. These are useful across a myriad of fields in the social sciences. In this tutorial, attendees will learn how to use open-source Python tools to discover, download, analyze, and generate maps of U.S. Census data. The tutorial is full of practical examples and best practices to help participants avoid the tedium of data wrangling and concentrate on their research questions.

This tutorial will consider the full breadth and richness of data available from the U.S. Census. We will cover not only American Community Survey (ACS) and similarly well-known data sets, but also a number of data sets that are less well-known but nonetheless useful in a variety of research contexts.

The tutorial has no slides. Instead, it will be presented from a series of live Jupyter notebooks. These notebooks consist of a combination of live demonstrations, starting with the most basic queries and mapping and moving through more advanced topics related to geographies, variables, groups and trees of related variables, and data set exploration. Throughout the tutorial, participants will engage in short 5-minute exercises to modify and extend the code as it is demonstrated.

After covering the concepts, the group as a whole will go through a complete end-to-end research example. Finally, individuals and small groups will have a chance to complete a series of short interactive exercises extending what they have learned and share the results with their peers.

All Python tooling used in the workshop is available as open-source software. Final versions of the notebooks used in the tutorial will also be made available via open-source.

Tutorials
Room 315
17:30
17:30
120min
WELCOME RECEPTION

Hosted by Streamlit in the Main Foyer of the Convention Center. Catch up with old friends or meet fellow attendees! Food and drinks will be served.

Social Event
Tacoma Convention Center Foyer
07:30
07:30
90min
Registration and Breakfast
Plenary (Ballroom B-D)
09:00
09:00
15min
Opening notes
Plenary (Ballroom B-D)
09:15
09:15
45min
Keynote by Julia Silge

Data Scientist and Engineering Manager at Posit PBC

Keynote
Plenary (Ballroom B-D)
10:00
10:00
25min
SciPy Tools Plenary
Plenary (Ballroom B-D)
10:25
10:25
20min
Break
Plenary (Ballroom B-D)
10:45
10:45
30min
Atomistic uncertainty driven data generation in ANI neural network potentials
nterrel

Using machine learning to predict chemical properties and behavior is an important complement to traditional approaches to computation and simulation in chemistry. The ANAKIN-ME (ANI) methodology has been shown to produce generalized and transferable neural network potentials, trained on density functional theory (DFT) molecular energies, at a greatly reduced computational cost. The work presented here details an approach to generating new data in an active learning scheme in order to improve predictions in the regions of chemical space with high predictive uncertainty at the atom level.

Materials and Chemistry
Room 315
10:45
30min
Python for early-stage design of sustainable aviation fuels
Ali Martz, Kyle Niemeyer, Vi Rapp

Aviation accounts for 2% of global greenhouse gas emissions, and reliance on liquid petroleum-based fuels makes this sector challenging to decarbonize. We seek to accelerate the development of sustainable aviation fuels using an early-stage design tool with a data-driven approach. We developed our strategy using the Python-based optimization packages BoTorch and Ax, and also rely on Pandas. We will discuss how to down-select from many possible fuel components to a specified number of chemical species and identify which combinations are most promising for a novel sustainable aviation fuel. We will also present its integration in our open-source web tool supporting biofuel research.

General
Ballroom A
10:45
30min
Scaling your data science workflows with Modin
Doris Lee

pandas is one of the most commonly used data science libraries in Python, with a convenient set of APIs for data cleaning, preparation, analysis, and exploration. However, despite its widespread adoption, pandas suffers from severe memory and performance issues on even moderately sized datasets. Modin is an open-source project that serves as a fast, scalable drop-in replacement for pandas (https://github.com/modin-project/modin). By changing just a single line of code, Modin seamlessly speeds up pandas workflows on a laptop or in a cluster. Originally developed at UC Berkeley, Modin has been downloaded more than 17 million times and is used by leading data science teams across industries.
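
The "single line of code" refers to swapping the import; a hedged sketch (with a plain-pandas fallback so it runs even where Modin is not installed):

```python
# The one-line change: import Modin's pandas API instead of pandas itself.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd  # fallback so this sketch runs without Modin

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
# Everything downstream is unchanged pandas code.
totals = df.groupby("group")["value"].sum()
print(totals.to_dict())  # {'a': 3, 'b': 3}
```
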

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
10:45
30min
Starsim: A flexible framework for agent-based modeling of health and disease
Cliff Kerr

Agent-based models (ABMs) are powerful tools for understanding how people behave and interact. However, many ABMs are slow, cumbersome to use, or both. Here we introduce Starsim, an open-source, high-performance ABM specialized for modeling different diseases (such as HIV, STIs, and tuberculosis). Built on NumPy, Numba, and SciPy, Starsim's performance rivals ABMs implemented in Java or C++, but with the convenience provided by Python: specifically, the ability to quickly implement and refine new disease modules. Starsim can also be extended to other applications in which people interact on timescales from days to decades, including economics and social science.

Human Networks, Social Sciences, and Economics
Room 316
11:25
11:25
30min
Dask in Production
Matthew Rocklin

Dask and related Python technologies are used to build ETL pipelines that serve recurring, production systems at large scale including tasks like …

  • Data pre-processing
  • Large scale queries.
  • Model training

This talk covers common problems and questions faced in building these systems, including pragmatic questions like …

  • How do I wrangle 20 TiB datasets?
  • How do I wrangle them every day?
  • How do I deploy this in the cloud?
  • How do I push changes to my pipeline with a CI/CD pipeline?
  • How do I anticipate costs?
  • How do I access logs?

In doing so, we discuss both Dask, and common related tooling in the Python ecosystem. We’ll cover this both in general, pointing to various tools, and also concretely, going through a reference architecture.
The audience should learn something even if they’re not Dask users. We’ll use Dask as the technical core, but the broader lessons and context should be valuable to anyone trying to set up production pipelines.

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
11:25
30min
Monte Carlo/Dynamic Code: Performant and Portable High-Performance Computing at Scale via Python and Numba
Joanna Piper Morgan, Kyle Niemeyer

Monte Carlo / Dynamic Code (MC/DC) is a performant and scalable Monte Carlo radiation transport simulation package with GPU and CPU support. It is written entirely in Python and uses Numba to accelerate Python code to CPU and GPU targets. This allows MC/DC to be a portable, easily installable, single language source code ideal for rapid numerical methods exploration at scale. We will discuss the benefits and drawbacks of such a scheme and make comparisons to traditionally compiled codes as well as those written using other modern high-level languages (i.e., Julia).

General
Ballroom A
11:25
30min
Uncertainty quantification and propagation of the NaCl-KCl-MgCl2 pseudoternary system for molten salt application
Jorge Paz Soldan Palma

The CALculation of PHAse Diagram (CALPHAD) method coupled with uncertainty quantification and propagation (UQ & UP) calculations is a viable tool to predict thermodynamic properties in a multicomponent region at different temperatures and compositions with a confidence interval. These types of calculations provide upper and lower bounds on thermochemical property predictions when choosing the chemistry of candidate salt mixtures, and are therefore vital for molten salt reactor engineering applications. The present work studies the NaCl-KCl-MgCl2 salt mixture, which is of high interest for molten salt applications, with the aid of the ESPEI and PyCalphad open-source codes for UQ and UP calculations.

Materials and Chemistry
Room 315
11:25
30min
geosnap: The Geospatial Neighborhood Analysis Package
eli knaap

The representation, synthesis, modeling, and visualization of neighborhoods is a fundamental pursuit across a range of social sciences. In recent decades, recogni…

Human Networks, Social Sciences, and Economics
Room 316
12:00
12:00
75min
Lunch Break
Plenary (Ballroom B-D)
12:00
75min
Lunch Break
Ballroom A
12:00
75min
Lunch Break
Room 315
12:00
75min
Lunch Break
Room 316
13:15
13:15
30min
Development of AstroPhot: Fitting Everything Everywhere all at Once in Astronomical Images
Connor Stone

We present AstroPhot, a tool to accelerate the analysis of astronomical images. AstroPhot allows for simultaneously modelling images with galaxies and point sources in multi-band and time-domain data. In this talk I will discuss the benefits and challenges of how we used PyTorch (a differentiable and GPU-accelerated scientific Python library) to allow for fast development without sacrificing numerical performance. I will detail our development process as well as how we encourage users of all skill levels to engage with our documentation and tools.

General
Ballroom A
13:15
30min
How to foster an open source culture within your data science team
Eric Ma

In this talk, I will discuss how one can foster a culture of open source contributions at one's company. Based on my successes and failures as a data scientist working in the biotech space, I will describe two key ideas (fostering internal open source and articulating value to key senior leadership) as being on the critical path to generating buy-in within the organization.

Maintainers and Community
Room 316
13:15
30min
Teaching and Learning Scientific Computing in the age of ChatGPT
Ryan C Cooper

In this talk, I will discuss learning, ethical, and legal issues when using large language models to supplement learning and practicing scientific computing. Engineering disciplines are becoming increasingly dependent upon computational tools and resources. Writing scientific computing code has become increasingly easy with GitHub Copilot, ChatGPT, and other LLM tools. Problems can arise when practitioners begin to use code without reviewing and understanding why it was written that way. I will present my findings from incorporating ChatGPT into the wealth of learning resources, and how I discuss academic integrity in relation to U.S. copyright law and ethical responsibility. What constitutes "intellectual contribution," "independent work," and "plagiarism" when we reuse code from open source software and from LLMs?

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
13:15
30min
Towards MDAnalysis 3.0: a fast, interoperable, and extensible community-driven ecosystem for handling molecular simulation data
Irfan Alibay

MDAnalysis (https://www.mdanalysis.org) is one of the most widely used open-source Python libraries for molecular simulation analysis, with applications ranging from understanding the interaction of drugs with proteins to the design of novel materials. With over 200 contributors and 18 years of development, MDAnalysis has established a mature, stable API and a broad user community. Here we present the current status of the library’s capabilities as it approaches its next major release. We also detail ongoing work to address modern challenges in the ever-evolving landscape of molecular simulation, such as handling increasingly large simulation datasets and meeting the tenets of FAIR.

Materials and Chemistry
Room 315
13:55
13:55
30min
Building Multi-Agent Generative-AI Applications with AutoGen
Victor Dibia, Chi Wang

Discover the potential of multi-agent generative AI applications with AutoGen, a pioneering framework designed to tackle complex tasks requiring multi-step planning, reasoning, and action. In this talk, we will explore the fundamentals of multi-agent systems, learn how to build applications using AutoGen, and discuss the open challenges associated with this approach, such as control trade-offs, evaluation challenges, and privacy concerns.

With AutoGen's open-source platform and growing ecosystem, developers can harness the power of generative AI to create advanced AI assistants and interfaces for the digital world. This talk is ideal for those with a general understanding of generative AI and Python application development.

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
13:55
30min
Delivering state of the art imaging data science to aid research and development at Johnson Matthey
Aakash Varambhia

Johnson Matthey (JM) leads in sustainable technologies, employing advanced science to address global challenges in energy, chemicals, and automotive sectors. Our cutting-edge research and development (R&D) facilities include state-of-the-art characterization tools, handling diverse datasets like images, timeseries, 3D tomograms, spectra, and digital twins. With the rising demand for data-driven insights, Python has emerged as a vital tool in enhancing decision-making processes. We showcase our utilization of the open-source community to construct our data science research platform, marking a significant step forward in our innovation journey.

Materials and Chemistry
Room 315
13:55
30min
How the Scientific Python ecosystem helps answering fundamental questions of the Universe
Vangelis Kourlitis, Matthew Feickert, Gordon Watts, Giordon Stark

The ATLAS experiment at CERN explores vast amounts of physics data to answer the most fundamental questions of the Universe. The prevalence of Python in scientific computing motivated ATLAS to adopt it for its data analysis workflows while enhancing users' experience. This talk will describe to a broad audience how a large scientific collaboration leverages the power of the Scientific Python ecosystem to tackle domain-specific challenges and advance our understanding of the Cosmos. Through a simplified example of the renowned Higgs boson discovery, attendees will gain insights into the utilization of Python libraries to discriminate a signal in immersive noise, through tasks such as data cleaning, feature engineering, statistical interpretation and visualization at scale.

General
Ballroom A
13:55
30min
My NumPy year: From no CPython C API experience to shipping a new DType in NumPy 2.0
Nathan Goldbaum

Support for string data in NumPy has long been a sore spot for the community. At the beginning of 2023 I was given the task to solve that problem by writing a new UTF-8 variable-length string DType leveraging the new NumPy DType API. I will offer my personal narrative of how I accomplished that goal over the course of 2023 and offer my experience as a model for others to take on difficult projects in the scientific python ecosystem, offering tips for how to get help when needed and contribute productively to an established open source community.
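For context, the sore spot the talk refers to is easy to demonstrate: NumPy's classic fixed-width unicode dtype silently truncates strings longer than the declared width. The variable-length UTF-8 StringDType described in the talk, shipped in NumPy 2.0, removes this limitation:

```python
import numpy as np

# Fixed-width unicode dtype: every element gets exactly 3 code points,
# so longer strings are silently truncated -- the long-standing sore spot.
arr = np.array(["hi", "world"], dtype="U3")
# arr[1] comes back as "wor", not "world"
```

With NumPy 2.0, `np.dtypes.StringDType()` can be used instead to store each element at its natural length.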

Maintainers and Community
Room 316
14:35
14:35
30min
Coming Online: Enabling Real-Time and AI-Ready Scientific Discovery
Adam Thompson, Luigi Cruz

From radio telescopes to proton accelerators, scientific instruments produce tremendous amounts of data at equally high rates. To handle this data deluge and to ensure the fidelity of the instruments’ observations, architects have historically written measurements to disk, enabling downstream scientists and researchers to build applications with pre-recorded files. The future of scientific computing is interactive and streaming: how many Nobel Prizes are hidden on a dusty hard drive that a scientist didn’t have time or resources to analyze? In this talk, NVIDIA and the SETI Institute will present their joint work in building scalable, real-time, high-performance, and AI-ready sensor processing pipelines at the Allen Telescope Array. Our goal is to provide all scientific computing developers with the tools and tips to connect high-speed sensors to GPU compute and lower the time to scientific insights.

General
Ballroom A
14:35
30min
Introduction to Causal Inference using pgmpy
Ankur Ankan

In the domain of data science, a significant number of questions are aimed at understanding and quantifying the effects of interventions, such as assessing the efficacy of a vaccine or the impact of price adjustments on the sales volume of a product. Traditional association-based machine learning methods, predominantly used for predictive analytics, prove inadequate for answering these causal questions from observational data, necessitating the use of causal inference methodologies. This talk aims to introduce the audience to the Directed Acyclic Graph (DAG) framework for causal inference. The presentation has two main objectives: first, to provide insight into the types of questions where causal inference methods can be applied; and second, to demonstrate a walkthrough of causal analysis on a real dataset, highlighting the various steps of causal analysis and showcasing the use of the pgmpy package.
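The kind of question this framework answers can be illustrated without pgmpy itself. Below is a hand-rolled NumPy sketch (synthetic data, not the pgmpy API) of the backdoor adjustment that a DAG justifies: the naive difference in means is biased by a confounder, while adjusting for it recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.integers(0, 2, size=n)                        # confounder
t = (rng.random(n) < 0.2 + 0.6 * z).astype(int)       # treatment depends on z
y = 2.0 * t + 3.0 * z + rng.normal(0.0, 1.0, n)       # true effect of t is 2.0

# Naive difference in means is biased upward by the confounder...
naive = y[t == 1].mean() - y[t == 0].mean()

# ...while the backdoor adjustment averages stratum-specific contrasts
# weighted by P(Z = z), recovering the true effect.
ate = sum(
    (y[(t == 1) & (z == v)].mean() - y[(t == 0) & (z == v)].mean()) * (z == v).mean()
    for v in (0, 1)
)
```

Here `naive` lands around 3.8 while `ate` is close to the true value of 2.0; pgmpy formalizes when and how such adjustment sets can be read off a DAG.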

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
14:35
30min
Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars
Patrick Hoefler

Dask is a library for distributed computing with Python that integrates tightly with pandas. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). The re-implementation of the DataFrame API addresses the pain points that users ran into. We will look at how Dask is a lot faster now, how it performs on benchmarks that it struggled with in the past, and how it compares to other tools like Spark, DuckDB, and Polars.

Maintainers and Community
Room 316
14:35
30min
mrfmsim: a modular simulation platform for magnetic resonance force microscopy experiments
Peter Sun

We present mrfmsim, an open-source framework that facilitates the design, simulation, and signal validation of magnetic resonance force microscopy experiments. The mrfmsim framework uses directed acyclic graphs (DAGs) to model experiments and employs a plugin system that adds custom experiments and functionalities. Differing from common DAG-powered workflow packages, mrfmsim allows flexible customizations of experiments post-definition without rewriting the internal model, such as optimized looping. In the talk, we present the challenges in building simulation packages for experiments undergoing continuous development in a graduate research setting. We discuss the current one-off approach that led to error-prone code and how modularity, extendibility, and readability can speed up the development cycle.
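The DAG-driven execution model the abstract describes can be sketched with only the standard library's graphlib; the node names and formulas below are hypothetical stand-ins, not mrfmsim's API:

```python
from graphlib import TopologicalSorter

# node -> (compute function over a shared environment, upstream nodes)
model = {
    "field": (lambda env: 1.0 / env["r"] ** 3, ()),                    # toy tip field
    "spin": (lambda env: 0.5, ()),                                     # toy spin density
    "signal": (lambda env: env["field"] * env["spin"], ("field", "spin")),
}

def run(r):
    """Execute the experiment graph in dependency order."""
    env = {"r": r}
    deps = {name: set(node[1]) for name, node in model.items()}
    for name in TopologicalSorter(deps).static_order():
        env[name] = model[name][0](env)
    return env["signal"]
```

Representing an experiment this way is what makes post-definition customization possible: a modified node or an added loop changes the graph, not the hand-written internal model.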

Materials and Chemistry
Room 315
15:05
15:05
20min
Break
Plenary (Ballroom B-D)
15:05
20min
Break
Ballroom A
15:05
20min
Break
Room 315
15:05
20min
Break
Room 316
15:25
15:25
30min
Introduction to Causal Inference with Machine Learning
Hajime Takeda

Causal inference has traditionally been used in fields such as economics, health studies, and social sciences. In recent years, algorithms combining causal inference and machine learning have been a hot topic. Libraries like EconML and CausalML, for instance, are Python tools that facilitate causal analysis in areas like economics, human behavior, and marketing. In this talk, I will explain key concepts of causal inference with machine learning, show practical examples, and offer practical tips. Attendees will learn how to apply machine learning to causal analysis effectively, boosting their research and decision-making.
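One of the simplest patterns such libraries implement is the T-learner: fit a separate outcome model per treatment arm and take the difference of their predictions as the treatment effect. A NumPy-only sketch on synthetic data (EconML and CausalML wrap this pattern, among many others, behind estimator classes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)                        # observed covariate
t = (rng.random(n) < 0.5).astype(int)         # randomized treatment
y = 1.0 + 0.5 * x + (2.0 + x) * t + rng.normal(0.0, 1.0, n)  # true CATE = 2 + x

# T-learner: one outcome model per treatment arm,
# effect = difference of the two models' predictions.
b1 = np.polyfit(x[t == 1], y[t == 1], 1)      # treated-arm linear model
b0 = np.polyfit(x[t == 0], y[t == 0], 1)      # control-arm linear model

def cate_at(v):
    """Estimated treatment effect for a unit with covariate value v."""
    return np.polyval(b1, v) - np.polyval(b0, v)
```

The estimate recovers the heterogeneous effect 2 + x; in practice the linear fits would be replaced by stronger learners such as gradient-boosted trees.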

Human Networks, Social Sciences, and Economics
Room 315
15:25
30min
LPython: Novel, Fast, Retargetable Python Compiler
Ondřej Čertík

We are developing a modern open-source Python compiler called LPython
(https://lpython.org/) that can execute users' code interactively in Jupyter to
allow exploratory work (much like CPython, MATLAB, or Julia) as well as compile
to binaries, with the goal of running users' code on modern architectures such as
multi-core CPUs and GPUs, as well as unfamiliar new architectures like GSI's APU,
which features programmable compute-in-memory. We aim to provide the best
possible performance for numerical, array-oriented code. The compiler itself is
written in C++ for robustness and speed.

General
Ballroom A
15:25
30min
Sparse arrays in scipy.sparse
Dan Schult

The SciPy subpackage scipy.sparse is moving from its matrix API to an array API. This will allow sizes other than 2D and clean up the treatment of the multiplication operator *. This talk will start by describing the changes and their impacts. We will then discuss the process of revamping an API without breaking too much existing user code, the trade-offs between slow changes over many releases versus faster, potentially breaking changes, and whether to simply create a new package instead. The talk should be useful for users of scipy.sparse and also for packages considering a major API change.
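The central semantic change is what `*` means. A minimal sketch, assuming SciPy ≥ 1.8 (where `csr_array` is available): sparse arrays follow NumPy semantics, while the legacy matrix classes overload `*` as matrix multiplication:

```python
import numpy as np
from scipy import sparse

A = sparse.csr_array(np.array([[1, 2], [0, 3]]))
B = sparse.csr_array(np.array([[4, 0], [5, 6]]))

# With the sparse *array* API, `*` is elementwise (NumPy semantics)...
elementwise = (A * B).toarray()

# ...and `@` is matrix multiplication.
matmul = (A @ B).toarray()

# The legacy *matrix* API instead overloads `*` as matrix multiplication.
legacy = sparse.csr_matrix(A) * sparse.csr_matrix(B)
```

Here `elementwise` is [[4, 0], [0, 18]] while both `matmul` and `legacy` are [[14, 12], [15, 18]], which is exactly the kind of silent behavior change the migration has to manage.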

Maintainers and Community
Room 316
15:25
30min
Warp: Advancing Simulation AI with Differentiable GPU Computing in Python
Eric Heiden

In this talk we introduce NVIDIA Warp, an open-source Python framework designed for accelerated differentiable computing. Warp enhances Python functions with just-in-time (JIT) compilation, allowing for efficient execution on CPUs and GPUs. The talk’s focus is on Warp’s application in physics simulation, perception, robotics, and geometry processing, along with its capability to integrate with machine-learning frameworks like PyTorch and JAX. Participants will learn the basics of Warp, including its JIT compilation process and the runtime library that supports various spatial computing operations. These concepts will be illustrated with hands-on projects based on research from institutions like MIT and UCLA, providing practical experience in using Warp to address computational challenges. Targeted at academics, researchers, and professionals in computational fields, the course is designed to inspire attendees and equip them with the knowledge and skills to use Warp in their work, enhancing their projects with efficient spatial computing.

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
16:05
16:05
30min
An Introduction to Impact Charts
Darren Vengroff

Impact charts, as implemented in the impactchart package,
make it easy to take a data set and visualize the impact of one variable
on another in ways that techniques like scatter plots and linear regression can't,
especially when there are other variables involved.

In this talk, we will introduce impact charts, demonstrate how they find easter-egg impacts
we embed in synthetic data, show how they can find hidden impacts in a real-world use case,
show how you can create your first impact chart with just a few lines of code,
and finally talk a bit about the interpretable machine learning techniques they are built upon.

Impact charts are primarily visual, so this talk will be too.

Human Networks, Social Sciences, and Economics
Room 315
16:05
30min
No-Code-Change GPU Acceleration for Your Pandas and NetworkX Workflows
Rick Ratzel, Vyas Ramasubramani

With datasets growing in both complexity and volume, the demand for more efficient data processing has never been higher. Pandas and NetworkX, the go-to Python libraries for tabular and graph data processing, are very popular for their ease of use and flexibility. However, they often struggle to keep pace with the demands of large-scale data analysis.

This talk introduces new open-source GPU accelerators from the NVIDIA RAPIDS project for Pandas and NetworkX, and will demonstrate how you can enable them for your workflows to experience massive speedups – up to 150x in pandas and 600x in NetworkX – without code changes.

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
16:05
30min
PIXIE: Blending Just-in-time and Ahead-of-time compilation in scientific Python applications
Stanley Seibert, Stuart Archibald

Speeding up Python code traditionally involves the use of Just-In-Time (JIT) or Ahead-Of-Time (AOT) compilation. There are tradeoffs to both approaches, however. As part of the Numba project's aim to create a compiler toolkit, the PIXIE project is being developed. It offers an extensible toolchain that consumes multiple languages and produces AOT-compiled binary extension modules. These PIXIE-based extension modules contain CPU-specific function dispatch for AOT use and also support something similar to Link-Time Optimization (LTO) for use in situations such as JIT compilation and/or cross-module optimization. PIXIE modules are easy to load and call from Python, and can be inlined into Numba JIT compilation, giving Python developers access to the benefits of both AOT and JIT.

General
Ballroom A
16:05
30min
The power of community in solving scientific Python’s most challenging problems
Leah Wasser

Scientific software drives open research. However, developing and maintaining a Python package is a tricky endeavor. You need to navigate a thorny packaging ecosystem, often in an academic environment that doesn’t traditionally value software. pyOpenSci has learned that an inclusive community can be empowered to make Python packaging more accessible, and that constructive peer review supports maintainers in creating better software, while also providing academic credit. In this talk you’ll learn:

  1. How to build consensus around thorny topics like packaging.
  2. Where to find beginner-friendly packaging support.
  3. How constructive peer review can support better code.
  4. How to get involved with pyOpenSci.
Maintainers and Community
Room 316
16:35
16:35
25min
Break
Plenary (Ballroom B-D)
17:00
17:00
60min
Lightning Talks
Plenary (Ballroom B-D)
18:00
18:00
60min
Poster Session
Plenary (Ballroom B-D)
08:00
08:00
60min
Registration and Breakfast
Plenary (Ballroom B-D)
09:00
09:00
15min
Opening Notes
Plenary (Ballroom B-D)
09:15
09:15
45min
Keynote by Dr. Elizabeth (Libby) Barnes

Professor of Atmospheric Science at Colorado State University

Keynote
Plenary (Ballroom B-D)
10:00
10:00
25min
SciPy Tools Plenary
Plenary (Ballroom B-D)
10:25
10:25
20min
Break
Plenary (Ballroom B-D)
10:45
10:45
30min
Ibis + DuckDB geospatial: a match made on Earth
Naty Clementi

Geospatial data is becoming more present in data workflows today, and plenty of Python tools allow us to work with it. In the past year, a new contender emerged: DuckDB introduced an extension for analyzing geospatial data. Everyone in the data world has been buzzing about DuckDB (~15k stars on GitHub), and now this duck quacks geospatial data too. But wait a minute, isn’t DuckDB all SQL? Yes, but fear not, Ibis has you covered! Ibis is a Python library that provides a dataframe-like interface, enabling users to write Python code to construct SQL expressions that can be executed on multiple backends, like DuckDB. In this talk, you will learn how to leverage the benefits of DuckDB geospatial while remaining in the Python ecosystem (yes, we will do a live demo). This is an introductory talk; everyone is welcome, and no previous experience with spatial databases or geospatial workflows is needed.

Earth, Ocean, Geo, and Atmospheric Science
Ballroom A
10:45
30min
Making Research Data Flow with Python
Josh Borrow

Many tools exist for large-scale data transfer (tens of terabytes or more), but they often don't match the needs of scientific data flows. In this talk, I'll explain how we built the 'librarian' framework with FastAPI, postgres, and Globus to ease this challenge. Designed for the Simons Observatory's petabyte-scale data transfer, I'll cover building reliable web services, flexible development with dependency injection, effective testing with pytest, and deployment using NERSC's Spin. I hope to demystify web and database programming for a scientific audience.

General
Plenary (Ballroom B-D)
10:45
30min
ultrack: large-scale versatile cell tracking in Python under segmentation uncertainty
Jordão Bragantini

Accurate cell tracking is essential to various biological studies. In this talk, we present Ultrack, a novel Python package for cell tracking that considers a set of multiple segmentation hypotheses and picks the segments that are most consistent over time, making it less susceptible to mistakes when traditional segmentation fails.
The package supports various imaging modalities, from small 2D videos to terabyte-scale 3D time-lapses or multicolored datasets in any napari-compatible image format (e.g. tif, zarr, czi, etc.).
It is available at https://github.com/royerlab/ultrack

Data Visualization and Image Processing
Room 315
11:25
11:25
30min
Expanding the OME ecosystem for imaging data management
Erick Martins Ratamero

Image analysis is ubiquitous across many areas of biomedical research, resulting in terabytes of image data that must be hosted by both research institutions and data repositories for sharing and reproducibility. Common solutions for data hosting are required to improve interoperability and accessibility of bioimage data, while maintaining the flexibility to address each institution's unique requirements regarding sharing and infrastructure. OMERO is an open-source solution for image data management which can be customized and hosted by individual institutions. OMERO runs a server-based application with web browser and command line options for accessing and viewing image data, based on the widely used OME data model for microscopy data. Multiple OMERO deployments might be used to provide core delivery, facilitate internal research, or serve as a public data repository. The omero-cli-transfer package facilitates data transfer between these OMERO instances and provides new methods for importing datasets. Another open-source package, ezomero, improves the usability of OMERO in a research environment by providing easier access to OMERO's Python interface. Along with existing OMERO plugins built for other analysis and viewing software, this positions OMERO to be a hub for image storage, analysis, and sharing.

Data Visualization and Image Processing
Room 315
11:25
30min
Free, public, standardized Zarr stores of geospatial data in the cloud for all! Now in Beta.
Christine Smit

At the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC), we're doing the heavy lifting to make large geospatial datasets easily accessible from the cloud. No more downloading data. No more worrying about quirky metadata or missing dimensions. No more concatenating hundreds or thousands of files together. Just fire up your Jupyter notebook somewhere in Amazon Web Services (AWS)'s US-West-2 region, get some free temporary AWS credentials, open our Zarr stores, and start doing your science.

Earth, Ocean, Geo, and Atmospheric Science
Ballroom A
11:25
30min
From Spyder to DataLab: 15 years of scientific software crafting in Python
Pierre Raybaut

This talk provides an overview of the evolution of scientific software in Python, with a focus on the speaker's journey from creating Spyder, the Scientific Python IDE, to developing DataLab, a platform for signal and image processing. The speaker will share insights into the challenges and opportunities encountered in developing and maintaining these projects, and discuss how they have contributed to the scientific Python ecosystem. The talk will also explore the evolving needs of both the scientific and industrial communities during this period, and why desktop applications remain relevant in the era of web-based tools.

General
Plenary (Ballroom B-D)
12:15
12:15
45min
Diversity Keynote Luncheon by Dr. Anita Sarma

Dr. Anita Sarma is the Associate Head of Research for the School of Electrical Engineering and Computer Science at Oregon State University

Keynote
Plenary (Ballroom B-D)
13:15
13:15
55min
BoFs (Birds-of-a-feather)
Plenary (Ballroom B-D)
14:20
14:20
30min
Echostack: An open-source Python software toolbox that democratizes water column sonar data and processing
Wu-Jung Lee, Dingrui Lei, Brandyn Lucca, Caesar Tuguinay, Valentina Staneva, Don Setiawan, Soham Kishor Butala

Water column sonar data collected by echosounders are essential for marine ecosystem research, allowing the detection, classification, and quantification of fish and zooplankton from many different ocean observing platforms. However, broad usage of these data has been hindered by the lack of software tools that allow intuitive and transparent data access, processing, and interpretation. We address this gap by developing Echostack, a toolbox of open-source packages leveraging distributed computing and cloud-interfacing libraries in the scientific Python ecosystem. These tools can be used individually or orchestrated together, which we will demonstrate in an end-to-end workflow.

Earth, Ocean, Geo, and Atmospheric Science
Ballroom A
14:20
30min
Picking your battles: when is a compiled language like Rust beneficial for Data Scientists?
Akshay Gupta

Over the past year, there has been an increase in the number of libraries that leverage Rust and pyo3 to significantly increase performance. What's the catch? In this talk, we will discuss how the Data Science team at Capital One has been thinking about the power of Rust-backed Python and whether the benefits justify the complexity.

Playing Nice: Scientific Computing Across Programming Languages
Plenary (Ballroom B-D)
14:20
30min
SAMGeo: Automated Segmentation of Remote Sensing Imagery with the Segment Anything Model
Qiusheng Wu

Image segmentation plays a crucial role in extracting valuable insights from geospatial data. While traditional segmentation methods can be laborious, deep learning offers automation but often demands extensive training and resources. Meta AI's Segment Anything Model (SAM) presents a compelling solution, segmenting objects without additional training. Our open-source Python package, samgeo, streamlines the use of SAM for geospatial data, offering various segmentation methods. Experiments confirm SAM's accuracy and efficiency as a powerful tool for remote sensing analysis. The samgeo package simplifies the adoption of automated image segmentation, facilitating better geospatial insights and decision-making across multiple domains.

Data Visualization and Image Processing
Room 315
15:00
15:00
30min
Building Daft: Python + Rust = a better distributed query engine
Jay Chia

Python is a popular language for data engineering workloads. In data engineering, developers must use a "Query Engine" to efficiently retrieve data, run data processing and then send data back out to a destination storage system or application.

The Python API for Apache Spark (PySpark) is currently the most popular framework that most data engineers use for data engineering at large scale. However, PySpark has a heavy dependency on the JVM which causes high friction during the development process.

In this talk, we discuss our work with the Daft Python Dataframe (www.getdaft.io) which is a distributed Python query engine built with Rust. We will perform a deep-dive into Daft architecture, and talk about how the strong synergy between Python and Rust enables key advantages for Daft to succeed as a query engine.

Playing Nice: Scientific Computing Across Programming Languages
Plenary (Ballroom B-D)
15:00
30min
HyperSpy – Your Multidimensional Data Analysis Toolbox
Joshua Taillon

HyperSpy is a community-developed open-source library providing a
framework to facilitate interactive and reproducible analyses of
multidimensional datasets. Born out of the electron microscopy
scientific community and building on the extensive scientific Python
environment, HyperSpy provides tools to efficiently explore, manipulate,
and visualize complex datasets of arbitrary dimensionality, including
those larger than a system's memory. After 14 years of development,
HyperSpy recently celebrated its 2.0 version release. This presentation
will (re)introduce HyperSpy's features and community, with a focus on
recent efforts splitting the library into a domain-agnostic core and a
robust ecosystem of extensions providing specific scientific
functionality.

Data Visualization and Image Processing
Room 315
15:00
30min
Using Satellite Imagery to Identify Harmful Algal Blooms and Protect Public Health
Emily Dorne

This talk illustrates how machine learning models to detect harmful algal blooms from satellite imagery can help water quality managers make informed decisions around public health warnings for lakes and reservoirs. Rooted in the development of the open source package CyFi, this talk includes insights around identifying when your model is getting the right answer for the wrong reasons, the upsides of using decision tree models with satellite imagery, and how to help non-technical users build confidence in machine learning models. The intended audience is those interested in using satellite imagery to monitor and respond to the world around us.

Earth, Ocean, Geo, and Atmospheric Science
Ballroom A
15:30
15:30
20min
Break
Plenary (Ballroom B-D)
15:50
15:50
30min
Introducing nanoarrow: the world's tiniest Arrow Implementation
Dewey Dunnington

nanoarrow, a newly developed subproject of Apache Arrow, is squarely focused on unlocking connectivity among Python packages and the libraries they wrap using the features and rich type support of the Arrow columnar format. The vision of nanoarrow is that it should be trivial for a library to implement an Arrow-based interface: nanoarrow and its bindings provide tools to produce, consume, and transport tabular data between processes using the Arrow IPC format or between libraries using the Arrow C ABI. For Python maintainers this means less glue code that runs faster so that developers can focus on feature development.

Playing Nice: Scientific Computing Across Programming Languages
Plenary (Ballroom B-D)
15:50
30min
anywidget: custom Jupyter Widgets made easy
Trevor Manz

Visualization plays a critical role in analysis and decision-making with data, yet the manner in which state-of-the-art visualization approaches are disseminated limits their adoption into modern analytical workflows. Jupyter Widgets bridge this gap between Python and interactive web interfaces, allowing for both programmatic and interactive manipulation of data and code. However, their development has historically been tedious and error-prone.

In this talk, you will learn about anywidget, a Python library that simplifies widgets, making their development more accessible, reliable, and enjoyable. I will showcase new visualization libraries built with anywidget and explain how its design enables environments beyond Jupyter to add support.

Data Visualization and Image Processing
Room 315
15:50
30min
xCDAT (Xarray Climate Data Analysis Tools): A Python package for simple climate data analysis on structured grids
Tom Vo

xCDAT (Xarray Climate Data Analysis Tools) is an open-source Python package that extends Xarray for climate data analysis on structured grids. This talk will cover a brief history of xCDAT, the value this package presents to the climate science community, and a general overview of key features with technical examples. xCDAT’s scope focuses on routine climate research analysis operations such as loading, averaging, and regridding data on structured grids (e.g., rectilinear, curvilinear). Some key features include temporal averaging, geospatial averaging, horizontal regridding, vertical regridding, and robust interpretation and handling of metadata and bounds for coordinates.

Earth, Ocean, Geo, and Atmospheric Science
Ballroom A
16:30
16:30
30min
Dante’s Externo: Injecting Python Functions into a Template-Driven CUDA C++ Framework
Braxton Cuneo

Non-Python codebases that use metaprogramming present significant challenges to cross-language development. These challenges are further compounded with the inclusion of GPU processing. While common methods of Python/GPU interoperation are covered by popular Python frameworks, these frameworks do not trivialize this use case.

In this talk, we will discuss the process of integrating a Python code for Monte Carlo particle transport (MCDC) with a template-based CUDA C++ framework which applies inversion of control (Harmonize). We will discuss managing the complexity of cross-language dependency injection, relevant implementation strategies, and pitfalls to avoid.

Playing Nice: Scientific Computing Across Programming Languages
Plenary (Ballroom B-D)
16:30
30min
Great Tables for Everyone
Richard Iannone

Do you find yourself copying your data into Word, just to make a table? If this is you (and this was us), it’s both frustrating and prone to errors. And even though every aspect of a typical analysis can be scripted, it often turns out that the table-making part is elusive. We made Great Tables to enable complete publishing workflows. This Python package lets you easily generate publication-quality tables with the structure you want, many options for formatting values, and plenty of freedom for styling. Importantly, Great Tables closely integrates with Pandas and Polars DataFrames in order to handle a wide range of analyses.

Data Visualization and Image Processing
Room 315
16:30
30min
Simplifying analysis of hierarchical HDF5 and NetCDF4 files with xarray-datatree
Eniola Awowale, Lucas Sterzinger, Tom Nicholas, Nick Lenssen

Xarray-datatree [1] is a Python package that supports HDF (Hierarchical Data Format) files with hierarchical group structures by creating a tree-like data structure in xarray. When an HDF file is opened with Datatree, a DataTree object is created that contains all of the groups in the file, and the tree-like structure allows each group to be accessed directly. This eliminates the need for a user to traverse each group and subgroup to access observational data.

We will present our use case for Datatree in NASA’s Harmony Level 2 Subsetter (HL2SS). HL2SS provides variable and dimension subsetting for Earth observation data from different NASA data centers. To subset hierarchical datasets without Datatree, HL2SS flattens the entire data structure into a new file by copying all of the grouped and subgrouped variables into the root group. With this new file, a variable or dimension subset is conducted. However, the flattened and subsetted file has to have the same hierarchical structure as the original file, so it is unflattened, its attributes are copied, and the variables are grouped back to preserve the original group hierarchy. With the open_datatree() function, HL2SS can open datasets containing multiple groups at once and have all of their group hierarchies preserved. This functionality has significant benefits for optimizing the workflow in HL2SS, since it eliminates the need to flatten and unflatten grouped datasets.

[1] https://github.com/xarray-contrib/datatree
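
The flatten/unflatten round trip described above can be sketched with plain Python dictionaries (a conceptual illustration only; the function names here are ours, not part of datatree or HL2SS):

```python
def flatten(tree, prefix=""):
    """Copy every variable in a nested group hierarchy into a flat dict,
    encoding the group path in the key (the pre-subsetting step)."""
    flat = {}
    for name, value in tree.items():
        path = f"{prefix}/{name}"
        if isinstance(value, dict):  # a subgroup: recurse into it
            flat.update(flatten(value, path))
        else:                        # a variable: record it at its full path
            flat[path] = value
    return flat

def unflatten(flat):
    """Rebuild the group hierarchy from path-encoded keys (the reverse step)."""
    tree = {}
    for path, value in flat.items():
        *groups, name = path.strip("/").split("/")
        node = tree
        for g in groups:
            node = node.setdefault(g, {})
        node[name] = value
    return tree

dataset = {"geolocation": {"lat": [0, 1], "lon": [2, 3]},
           "science": {"temp": [280, 281]}}
assert unflatten(flatten(dataset)) == dataset
```

With open_datatree(), this round trip disappears entirely: the group hierarchy is preserved inside the DataTree object itself.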

Earth, Ocean, Geo, and Atmospheric Science
Ballroom A
17:00
17:00
20min
Break
Plenary (Ballroom B-D)
17:20
17:20
60min
Lightning Talks
Plenary (Ballroom B-D)
08:00
08:00
60min
Registration and Breakfast
Plenary (Ballroom B-D)
09:00
09:00
15min
Opening Notes
Plenary (Ballroom B-D)
09:15
09:15
45min
Keynote by Kyle Cranmer

Professor of Physics and the David R. Anderson Director of the Data Science Institute at the University of Wisconsin–Madison

Keynote
Plenary (Ballroom B-D)
10:00
10:00
25min
SciPy Tools Plenary
Plenary (Ballroom B-D)
10:20
10:20
20min
Break
Plenary (Ballroom B-D)
10:45
10:45
30min
Bridging the gap between Earth Engine and the Scientific Python Ecosystem
Qiusheng Wu, Justin Braaten, Samapriya Roy

Google Earth Engine's new data extraction interfaces seamlessly transfer geospatial data into familiar Python formats provided by NumPy, Pandas, GeoPandas, and Xarray. This integration empowers you to harness Earth Engine's vast data catalog and compute power directly within your preferred Python workflows. For example, the Xee library leverages Xarray's lazy evaluation and Dask to streamline the extraction and analysis of Earth Engine data, offering a more Pythonic alternative to traditional image exports. Earth Engine's new data extraction interfaces unlock fresh geospatial analysis potential by leveraging the unique strengths of both the scientific Python ecosystem and Earth Engine.

Earth, Ocean, Geo, and Atmospheric Science
Room 315
10:45
30min
Orchestrating Bioinformatics Workflows Across a Heterogeneous Toolset with Flyte
Pryce Turner

While Python excels at prototyping and iterating quickly, it’s not always performant enough for whole-genome scale data processing. Flyte, an open-source Python-based workflow orchestrator, presents an excellent way to tie together the myriad tools required to run bioinformatics workflows. Flyte is a Kubernetes-native (k8s) orchestrator, meaning all dependencies are captured and versioned in container images. It also allows you to define custom types in Python representing genomic datasets, enabling a powerful way to enforce compatibility across tools. Computational biologists, or any scientists processing data with a heterogeneous toolset, stand to benefit from a common orchestration layer that is opinionated yet flexible.

Playing Nice: Scientific Computing Across Programming Languages
Ballroom A
10:45
30min
Safe, fast, and easy time series preprocessing with Temporian
Mathieu Guillame-Bert, Ian Spektor

Temporal data is ubiquitous in data science and plays a vital role in machine learning pipelines and business decisions. Preprocessing temporal data using generic data tools can be tedious, lead to inefficient computation, and be prone to errors.
Temporian is an open-source library for safe, simple, and efficient preprocessing and feature engineering of temporal data. It supports common temporal data types, including non-uniform sampled, multi-variate, multi-index, and multi-source data. Temporian favors interactive development in notebooks and integration with other machine learning tools, and can run at scale using distributed computing.
This talk, aimed at data scientists and machine learning practitioners, will showcase Temporian’s key features along with its powerful API, and demonstrate its advantages over generic data preprocessing libraries for handling temporal data.
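
As a toy illustration of the kind of operation such a library provides (plain Python, not Temporian's API), here is a trailing-window moving sum over a non-uniformly sampled event series:

```python
def moving_sum(timestamps, values, window):
    """For each event, sum the values of all events within the trailing
    `window` seconds, including the event itself. Timestamps are sorted
    but need not be uniformly spaced."""
    out, start, total = [], 0, 0.0
    for i, t in enumerate(timestamps):
        total += values[i]
        # Drop events that have fallen out of the trailing window.
        while timestamps[start] <= t - window:
            total -= values[start]
            start += 1
        out.append(total)
    return out

# Sales events at irregular times (in seconds), with a 7-second window.
ts = [0, 1, 5, 9, 20]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(moving_sum(ts, vals, 7.0))  # [1.0, 3.0, 6.0, 7.0, 5.0]
```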

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
11:25
11:25
30min
Lonboard: Fast, interactive geospatial vector data visualization in Jupyter
Kyle Barron

Lonboard is a new Python library for geospatial vector data visualization that can be 50x faster than existing alternatives like ipyleaflet or pydeck. This talk will explain why this library is so fast, how it integrates into existing workflows, and planned future improvements.

Earth, Ocean, Geo, and Atmospheric Science
Room 315
11:25
30min
STUMPY: Modern Time Series Analysis with Matrix Profiles
Sean Law

Traditional time series analysis techniques have found success in a variety of data mining tasks. However, they often require years of experience to master, and straightforward, easy-to-use analysis tools have until recently been lacking. We address these needs with STUMPY, a scientific Python library that implements a novel yet intuitive approach for discovering patterns, anomalies, and other insights from any time series data. This presentation will cover the necessary background needed to follow the live interactive demo, requires no prior experience, and promises a simple, powerful, and scalable time series analysis package that will complement your current toolset.
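
The core idea of a matrix profile, the distance from each subsequence to its nearest non-overlapping neighbor, can be computed naively in plain Python (STUMPY computes the same quantity orders of magnitude faster; this sketch is not its API):

```python
import math

def matrix_profile(ts, m):
    """Naive matrix profile: for each length-m subsequence, the Euclidean
    distance to its nearest neighbor elsewhere in the series, excluding
    trivially overlapping matches. O(n^2 * m); STUMPY is far faster."""
    n = len(ts) - m + 1
    subs = [ts[i:i + m] for i in range(n)]
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < m:  # skip the trivial (overlapping) match zone
                continue
            best = min(best, math.dist(subs[i], subs[j]))
        profile.append(best)
    return profile

# A repeating motif [0, 1, 2] with one anomalous spike (9.0).
ts = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 9.0, 0.0, 1.0]
mp = matrix_profile(ts, 3)
# Exact repeats give distance 0; the spike's subsequence stands out.
print(mp.index(max(mp)))  # 4 (the subsequence covering the 9.0)
```

High profile values flag anomalies (discords); low values flag repeated patterns (motifs).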

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
11:25
30min
Scikit-build-core: A modern build backend for CPython C/C++/Fortran/Cython extensions
Jean-Christophe Fillion-Robin, Henry Schreiner

Discover how scikit-build-core revolutionizes Python extension building with its seamless integration of CMake and Python packaging standards. Learn about its enhanced features for cross-compilation, multi-platform support, and simplified configuration, which enable writing binary extensions with pybind11, Nanobind, Fortran, Cython, C++, and more. Dive into the transition from the classic scikit-build to the robust scikit-build-core and explore its potential to streamline package distribution across various environments.
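
A minimal sketch of what such a project's `pyproject.toml` might look like (the package name and dependency choices here are illustrative, and a `CMakeLists.txt` drives the actual compilation):

```toml
# pyproject.toml for a hypothetical pybind11 extension built with scikit-build-core
[build-system]
requires = ["scikit-build-core", "pybind11"]
build-backend = "scikit_build_core.build"

[project]
name = "example-extension"
version = "0.1.0"
```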

Playing Nice: Scientific Computing Across Programming Languages
Ballroom A
12:00
12:00
60min
Lunch Break
Plenary (Ballroom B-D)
13:15
13:15
30min
From Code to Clarity: Using Quarto for Python Documentation
isabel zimmerman

To effectively share scientific results, we must blend narrative text and code to create polished, interactive output. Quarto is an open-source scientific publishing system to help you communicate with others through code. Quartodoc is a Python package that generates function references within Quarto websites. Together, these tools create beautiful documentation that is reproducible, accessible, and easily editable.

This talk will include examples of Quarto in action, from simple blogs to expansive Python package documentation with WebAssembly-powered live examples. Listeners will walk away knowing how to build Quarto websites, when to use quartodoc, and how these tools create better documentation.
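
A minimal sketch of a `_quarto.yml` combining a website with a quartodoc-generated reference (package and function names are placeholders; consult the quartodoc documentation for the exact schema):

```yaml
# _quarto.yml (hypothetical package "mypackage")
project:
  type: website

website:
  title: "mypackage"

quartodoc:
  package: mypackage          # quartodoc builds the function reference pages
  sections:
    - title: Core functions
      contents:
        - load_data
        - plot_results
```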

Playing Nice: Scientific Computing Across Programming Languages
Ballroom A
13:15
30min
How to bootstrap a Data Warehouse with DuckDB
Guen Prawiroatmodjo, Nicholas Ursa, Alex Monahan

A Data Warehouse (DW) is a powerful tool to manage your scientific data, training data, logs, or any other type of relational data. Most Data Warehouses are cloud-based and built to scale to petabyte workflows, but might not be optimal for smaller workloads that need a fast iteration cycle. Likewise, a collection of CSV files and Python scripts can become painful to share and maintain. This is where DuckDB comes in! DuckDB is a fast, in-process database that you can run on your laptop, that supports a rich SQL dialect, and that you can push to the cloud with just a single line of code. In this talk, we’ll show you how to bootstrap a Data Warehouse on your laptop using open-source tools, including ETL (extract-transform-load) data pipelines, dashboard visualization, and sharing via the cloud.
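
Since DuckDB may not be available everywhere, the in-process ETL pattern can be illustrated with the standard library's sqlite3 module, which shares the no-server model; with DuckDB itself you would `import duckdb` and call `duckdb.connect(...)` instead:

```python
import csv
import io
import sqlite3

# Extract: parse raw CSV (in practice, files on disk or in the cloud).
raw = "city,temp\nAustin,35\nTacoma,18\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Load: an in-process database, so there is no server to run or administer.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE weather (city TEXT, temp REAL)")
con.executemany("INSERT INTO weather VALUES (?, ?)",
                [(r["city"], r["temp"]) for r in rows])

# Transform/query with plain SQL.
hot = con.execute("SELECT city FROM weather WHERE temp > 30").fetchall()
print(hot)  # [('Austin',)]
```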

General
Room 315
13:15
30min
Model Share AI: An Integrated Toolkit for Collaborative Machine Learning Model Development, Provenance Tracking, and Deployment in Python
Heinrich Peters

Model Share AI (AIMS) is an easy-to-use Python library designed to streamline collaborative ML model development, model provenance tracking, and model deployment, as well as a host of other functions aiming to maximize the real-world impact of ML research. AIMS features collaborative project spaces, allowing users to analyze and compare their models in a standardized fashion. Model performance and various model metadata are automatically captured to facilitate provenance tracking and allow users to learn from and build on previous submissions. Additionally, AIMS allows users to deploy ML models built in Scikit-Learn, TensorFlow Keras, PyTorch, and ONNX into live REST APIs and automatically generated web apps with minimal code. The ability to deploy models with minimal effort and to make them accessible to non-technical end-users through web apps has the potential to make ML research more applicable to real-world challenges.

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
13:55
13:55
30min
Ibis: because SQL is everywhere and so is Python
Gil Forsyth

Tabular data is ubiquitous, and pandas has been the de facto tool in Python for
analyzing it. However, as data size scales, analysis using pandas may become
untenable. Luckily, modern analytical databases (like DuckDB) are able to
analyze this same tabular data, but perform orders-of-magnitude faster than
pandas, all while using less memory. Many of these systems only provide a SQL
interface, though, which is far different from pandas’ dataframe interface and
requires a rewrite of your analysis code.

This talk will lay out the current database / data landscape as it relates to
the SciPy stack, and explore how Ibis (an open-source, pure Python, dataframe
interface library) can help decouple interfaces from engines, to improve both performance
and portability. We'll examine other solutions for interacting with SQL from Python and
discuss some of their strengths and weaknesses.
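
The decoupling idea, building an expression once and then running it on different engines, can be sketched in miniature (plain Python and sqlite3, not Ibis's actual API):

```python
import operator
import sqlite3

# A tiny engine-independent "expression": filter table t where x > 10.
expr = {"table": "t", "column": "x", "op": ">", "value": 10}
OPS = {">": operator.gt}

def to_sql(e):
    """Render the expression for a SQL engine."""
    return (f'SELECT {e["column"]} FROM {e["table"]} '
            f'WHERE {e["column"]} {e["op"]} {e["value"]}')

def run_locally(e, rows):
    """Evaluate the same expression on plain in-memory rows."""
    return [r[e["column"]] for r in rows
            if OPS[e["op"]](r[e["column"]], e["value"])]

rows = [{"x": 5}, {"x": 42}]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(r["x"],) for r in rows])

assert run_locally(expr, rows) == [42]
assert [r[0] for r in con.execute(to_sql(expr))] == [42]
```

The analysis logic lives in one place; only the execution engine changes.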

General
Room 315
13:55
30min
Supporting Greater Interactivity in the IPython Visualization Ecosystem
Nathan Martindale, Jacob Smith

Interactive visualizations are invaluable tools for building intuition and supporting rapid exploration of datasets and models. Numerous libraries in Python support interactivity, and workflows that combine Jupyter and IPyWidgets in particular make it straightforward to build data analysis tools on the fly. However, the field is missing the ability to arbitrarily overlay widgets and plots on top of others to support more flexible details-on-demand techniques. This work discusses some limitations of the base IPyWidgets library, explains the benefits of IPyVuetify and how it addresses these limitations, and finally presents a new open-source solution that builds on IPyVuetify to provide easily integrated widget overlays in Jupyter.

Playing Nice: Scientific Computing Across Programming Languages
Ballroom A
13:55
30min
Vector space embeddings and data maps for cyber defense
Benoit Hamelin

Vast amounts of the information of interest to cyber defense organizations come in the form of unstructured data: from host-based telemetry and malware binaries to phishing emails and network packet sequences. All of this data is extremely challenging to analyze. In recent years there have been huge advances in the methodology for converting unstructured media into vectors. However, leveraging such techniques for cyber defense data remains a challenge.

Imposing structure on unstructured data allows us to leverage powerful data science and machine learning tools. Structure can be imposed in multiple ways, but vector space representations, with a meaningful distance measure, have proven to be one of the most fruitful.

In this talk, we will demonstrate a number of techniques for embedding cyber defense data into vector spaces. We will then discuss how to leverage manifold learning techniques, clustering, and interactive data visualization to broaden our understanding of the data and enrich it with expert feedback.
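
As a toy example of imposing vector structure on unstructured records (plain Python, not the API of any of the TIMC libraries), here is a bag-of-tokens embedding with cosine similarity:

```python
import math
from collections import Counter

def embed(text, vocab):
    """Map a raw string to a fixed-length count vector over a token vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity: a meaningful distance measure on the vector space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["login", "failed", "password", "reset", "packet"]
a = embed("login failed failed password", vocab)
b = embed("login failed password reset", vocab)
c = embed("packet packet packet", vocab)

# The two authentication-related logs are far more similar to each other
# than either is to the network log.
assert cosine(a, b) > cosine(a, c)
```

Real pipelines replace the counting step with learned embeddings, then apply manifold learning (e.g., UMAP) and clustering (e.g., HDBSCAN) on the resulting vectors.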

At the Tutte Institute for Mathematics and Computing (TIMC), we believe in the importance of reproducibility and in making research techniques accessible to the broader cyber defense community. To that end, this talk will leverage several open source libraries and techniques that we have developed at TIMC: Vectorizers, UMAP, HDBSCAN, ThisNotThat and DataMapPlot.

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
14:35
14:35
30min
ITK-Wasm: Universal spatial analysis and visualization
Matt McCormick, Jean-Christophe Fillion-Robin

ITK-Wasm combines the Insight Toolkit (ITK) and WebAssembly to enable high-performance spatial analysis across programming languages and hardware architectures.

ITK-Wasm Python packages work in a web browser via Pyodide but also in system-level environments. We describe how ITK-Wasm bridges WebAssembly with Scientific Python through simple fundamental Python and NumPy-based data structures and Pythonic function interfaces. These interfaces can be accelerated through GPU implementations when available.

We discuss how ITK-Wasm's integration of the WebAssembly Component Model launches Scientific Python into a new world of interoperability, enabling the creation of accessible and sustainable multi-language projects that are easily distributed anywhere.

Playing Nice: Scientific Computing Across Programming Languages
Ballroom A
14:35
30min
LlamaBot: a Pythonic interface to Large Language Models
Eric Ma

In this talk, I will present LlamaBot, a Pythonic and modular set of components to build command line and backend tools that leverage large language models (LLMs). During this talk, I will showcase the core design philosophy, internal architecture and dependencies, and live demo command-line applications built using LlamaBot that use both open source and API-access-only LLMs. Finally, I will conclude with a roadmap for LlamaBot development, and an invitation to contribute and shape its development during the Sprints.

Data Science and AI/Machine Learning
Plenary (Ballroom B-D)
14:35
30min
Pooch: a friend to fetch your data files
Santiago Soler

Pooch is a Python library that can download and locally cache files from
the web without hassle. Novices can use it to simply download files
in one line of code and focus on the data.
Package maintainers can use it to provide sample datasets
to their users, in examples and tutorials, as libraries like SciPy,
scikit-image, napari and MetPy do.
During this talk, we'll show you how you can use the different features that
Pooch offers and also how you can extend its capabilities by writing your own
downloaders or post-processors.
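
The pattern Pooch automates (fetch once, verify a known hash, reuse the cached copy) can be sketched with the standard library; with Pooch itself this is roughly `pooch.retrieve(url, known_hash=...)`:

```python
import hashlib
import pathlib
import urllib.request

def fetch(url, known_hash, cache_dir="cache"):
    """Download `url` unless a cached copy already exists, then verify
    its SHA-256 digest against `known_hash` before returning the path."""
    path = pathlib.Path(cache_dir) / url.rsplit("/", 1)[-1]
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != known_hash:
        raise ValueError(f"hash mismatch for {path}: got {digest}")
    return path
```

Pooch adds the pieces this sketch omits: sensible per-platform cache locations, download progress, multiple hash algorithms, registries of many files, and pluggable downloaders and post-processors.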

General
Room 315
15:05
15:05
25min
Break
Plenary (Ballroom B-D)
15:30
15:30
60min
Lightning Talks
Plenary (Ballroom B-D)
16:30
16:30
120min
BoFs (Birds-of-a-feather)
Plenary (Ballroom B-D)