SciPy 2024
Do you have a basic understanding of Python and want to "level-up" your computational skills? Do you instinctively write for-loops to perform computations on your arrays? Have you ever heard someone complain "Python is slow" and want to prove them wrong? Do you want to know how to manipulate NumPy arrays like a master? If any or all of these is true, then this tutorial is for you!
Forecasting is central to decision-making in virtually all technical domains. For instance, predicting product sales in retail, forecasting energy demand, and anticipating customer churn all have tremendous value across different industries. However, the landscape of forecasting techniques is as diverse as it is useful, and different techniques and expertise are adapted to different types and sizes of data.
In this hands-on workshop, we give an overview of forecasting concepts, popular methods, and practical considerations. We’ll walk you through data exploration, data preparation, feature engineering, statistical forecasting (e.g., STL, ARIMA, ETS), forecasting with tabular machine learning models (e.g., decision forests), forecasting with deep learning methods (e.g., TimesFM, DeepAR), meta-modeling (e.g., hierarchical reconciliation and relational modeling, ensembles, resource models), and how to safely evaluate such temporal models.
This tutorial is an introduction to data visualization using the popular Vega-Altair Python library. Vega-Altair provides a simple, friendly, and consistent API that supports rapid data exploration. Vega-Altair’s flexible support for interactivity enables the creation and sharing of beautiful interactive visualizations.
Participants will learn the foundational concepts that Vega-Altair is built on and will gain hands-on experience exploring a variety of datasets. Of particular interest to the scientific community, this tutorial will cover recent advancements in the Vega-Altair ecosystem that make it possible to scale visualizations to large datasets, and to easily export visualizations to static image formats for publication.
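As a taste of the API style covered in the tutorial, here is a minimal Vega-Altair example; the cars dataset and column names come from the vega_datasets package and are used purely for illustration.

```python
import altair as alt
from vega_datasets import data

cars = data.cars()  # small example dataset from the vega_datasets package

# Declarative encoding: map columns to visual channels, then add interactivity
alt.Chart(cars).mark_point().encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    color="Origin",
    tooltip=["Name", "Year"],
).interactive()
```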
This tutorial walks participants — Earth scientists with some prior Python experience — through analyses of two particular climate risk scenarios: floods & wildfires. The goal is to obtain hands-on experience with common reproducible Jupyter/Python workflows based on data products from the NASA Earthdata Cloud. The case studies highlight the interplay of distributed data with scalable numerical strategies — "data-proximate computing" — implemented using scientific Python libraries like NumPy, Pandas, & Xarray. This tutorial — co-developed by 2i2c and MetaDocencia — constitutes part of NASA's Transform to Open Science (TOPS) initiative to reinforce principles of Open Science & reproducibility.
Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) are able to analyze this same tabular data, but perform orders-of-magnitude faster than pandas, all while using less memory. However, many of these systems only provide a SQL interface, which is far different from pandas’ dataframe interface and requires a rewrite of your analysis code.
This is where Ibis comes in. Ibis is a pure-Python open-source library that provides a dataframe interface to many popular databases and analytics tools (DuckDB, Polars, Snowflake, Spark, etc.). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL. No more pain rewriting pandas code when you run into performance issues; write your code once using Ibis and run it on any supported backend.
https://ibis-project.org/
https://github.com/ibis-project/ibis
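A minimal sketch of this workflow, assuming a hypothetical Parquet file with user_id and timestamp columns; the same expression runs unchanged on any supported backend.

```python
import ibis

con = ibis.duckdb.connect()              # in-process DuckDB backend
t = con.read_parquet("events.parquet")   # hypothetical file and columns

expr = (
    t.group_by("user_id")
     .aggregate(n_events=t.count(), last_seen=t.timestamp.max())
     .order_by(ibis.desc("n_events"))
)
df = expr.to_pandas()                    # Ibis generates and executes the SQL for you
```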
In place of the "TorchGeo: Advancing Earth Observation Through Machine Learning" tutorial, we are offering an open office hour session designed to provide support and assistance to attendees. Whether you're a first-time attendee looking for guidance, need help with Git, require technical support, or have questions about anything else, our SciPy tutorial chairs and other experts are here to help. This session is an excellent opportunity to get personalized assistance, resolve technical issues, and ask questions in an informal and supportive environment. Join us for this impromptu and interactive session to make the most of your conference experience!
Quarto is an innovative, open-source scientific and technical publishing system compatible with Jupyter Notebooks and plain text markdown documents. Quarto provides data scientists with a seamless way to publish their work in a high-quality format that is reproducible, accessible, and shareable. With Quarto, researchers can turn their Jupyter Notebooks and literate plain text markdown documents into professional-looking publications in various formats. This workshop will demonstrate how Quarto enables data scientists to turn their work products into professional, high-quality documents, slides, websites, scientific manuscripts, and other shareable artifacts.
Explore tsbootstrap and sktime in our 4-hour tutorial, focusing on enhancing time series forecasting and analysis. Discover how tsbootstrap's bootstrapping methods improve uncertainty quantification in time series data, integrating with sktime's forecasting models. Learn practical applications in various domains, boosting predictive accuracy and insights. This interactive session will provide hands-on experience with these tools, offering a deep dive into advanced techniques like probabilistic forecasting and model evaluation. Join us to expand your expertise in time series analysis, applying innovative methods to tackle real-world data challenges.
In this tutorial we will introduce GitHub Actions to scientists as a tool for lightweight automation of scientific data workflows. We will demonstrate that GitHub Actions are not just a tool for software testing, but can be used in various ways to improve the reproducibility and impact of scientific analysis. Through a sequence of examples, we will demonstrate some of GitHub Actions' applications to scientific workflows, such as scheduled deployment of algorithms to sensor streams, updating visualizations based on new data, processing large datasets, model versioning, and performance benchmarking. GitHub Actions can particularly empower Python scientific programmers who do not want to build fully fledged applications or set up complex computational infrastructure, but would like to increase the impact of their research. The goal is that participants will leave with their own ideas for how to integrate GitHub Actions into their own work.
Drone imagery is more widely available than ever before, allowing the public to capture ultra high-resolution Earth images with hobbyist drones. In this workshop, we will explore drone imagery with Python tools such as geopandas, OpenCV, rasterio, numpy, and shapely. Afterwards, we will assess urban green spaces, focusing on counting trees and estimating their role in capturing carbon to fight climate change. This practical exercise will not only enhance our understanding of urban ecology, but also highlight the importance of trees in urban planning and environmental sustainability.
Bokeh is a library for interactive data visualization. You can use it with Jupyter Notebooks or create standalone web applications, all using Python. This tutorial is a thorough guide to Bokeh and its most recent new features. We start with a basic line plot and, step-by-step, make our way to creating a dashboard web application with several interacting components. This tutorial will be helpful for scientists who are looking to level up their analysis and presentations, and tool developers interested in adding custom plotting functionality or dashboards.
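The basic line plot the tutorial starts from looks roughly like this (the numbers are placeholder data):

```python
from bokeh.plotting import figure, show

p = figure(title="A basic line plot", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)  # placeholder data
show(p)  # renders the interactive plot in a browser or notebook
```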
This tutorial is aimed at coders interested in understanding the building blocks of large language models (LLMs), how LLMs work, and how to code them from the ground up in PyTorch. We will kick off this tutorial with an introduction to LLMs, recent milestones, and their use cases. Then, we will code a small GPT-like LLM, including its data input pipeline, core architecture components, and pretraining code ourselves. After understanding how everything fits together and how to pretrain an LLM, we will learn how to load pretrained weights and finetune LLMs using open-source libraries.
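To give a flavor of the "core architecture components" step, here is a minimal single-head causal self-attention block in PyTorch; it is an illustrative sketch, not the exact code used in the tutorial.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Minimal single-head causal self-attention (illustrative only)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
        causal = torch.triu(torch.ones(x.shape[1], x.shape[1], dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))  # block attention to future tokens
        return self.out(torch.softmax(scores, dim=-1) @ v)

x = torch.randn(2, 8, 32)
print(CausalSelfAttention(32)(x).shape)                     # torch.Size([2, 8, 32])
```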
Despite its reputation for being slow, Python is the leading language of scientific computing, which generally needs large-scale (fast) computations. This is because most scientific problems can be split into "metadata bookkeeping" and "number crunching," where the latter is performed by array-oriented (vectorized) calls into precompiled routines.
This tutorial is an introduction to array-oriented programming. We'll focus on techniques that are equally useful in NumPy, Pandas, xarray, CuPy, Awkward Array, and other libraries, and we'll work in groups on three class projects: Conway's Game of Life, evaluating decision trees, and computations on ragged arrays.
GitHub repository: https://github.com/ekourlit/scipy2024-tutorial-thinking-in-arrays
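As a small example of the style the tutorial teaches, compare a Python for-loop with the equivalent single array-oriented call (illustrative only):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=1_000_000)

# Loop version: the interpreter handles every element, so this is slow
total = 0.0
for value in x:
    total += value * value

# Array-oriented version: one call into precompiled, vectorized code
total_fast = np.sum(x * x)
```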
Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. This hands-on tutorial focuses on intermediate and advanced workflows using complex real-world data. We encourage participants in this workshop to bring their own datasets, as we will dedicate ample time to applying tutorial concepts to data of interest!
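For participants who have not used Xarray before, a basic labeled-array workflow looks like this (using the small example dataset that ships with Xarray's tutorial module):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")   # small example dataset
monthly = ds["air"].resample(time="1MS").mean()    # label-aware temporal resampling

# Select the grid cell nearest a location by coordinate labels, not integer indices
monthly.sel(lat=40, lon=260, method="nearest").plot()
```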
PyVista is a general purpose 3D visualization library used by over 2,000 open source projects for the visualization of everything from computer aided engineering and geophysics to volcanoes and digital artwork.
PyVista exposes a Pythonic API to the Visualization Toolkit (VTK), providing tooling that is immediately usable without any prior knowledge of VTK. It is being built as the 3D equivalent of Matplotlib, with Jupyter plugins that enable visualization of 3D data using both server- and client-side rendering.
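A minimal PyVista example of this Matplotlib-like plotting style (the scalar field here is invented for illustration):

```python
import pyvista as pv

mesh = pv.Sphere()
mesh["elevation"] = mesh.points[:, 2]              # attach an illustrative scalar field
mesh.plot(scalars="elevation", cmap="viridis")     # interactive 3D rendering in one call
```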
Structured Query Language (or SQL for short) is a programming language to manage data in a database system and an essential part of any data engineer’s tool kit. In this tutorial, you will learn how to use SQL to create databases and tables, insert data into them, and use queries to extract, filter, join, and compute on data. We will use DuckDB, a new open source, embedded, in-process database system that combines cutting-edge database research with dataframe-inspired ease of use. DuckDB is only a pip install away (with zero dependencies) and runs right on your laptop. You will learn how to use DuckDB with your existing Python tools like Pandas, Polars, and Ibis to simplify and speed up your pipelines. Lastly, you will learn how to use SQL to create fast, interactive data visualizations, and how to teach your data to fly and share it via the cloud.
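For example, DuckDB can query a pandas DataFrame in place, with no loading step; the data below is invented for illustration:

```python
import duckdb
import pandas as pd

df = pd.DataFrame(
    {"species": ["duck", "goose", "duck"], "mass_g": [900, 3500, 1100]}
)

# DuckDB resolves `df` directly from the surrounding Python scope
result = duckdb.sql("""
    SELECT species, AVG(mass_g) AS avg_mass
    FROM df
    GROUP BY species
    ORDER BY avg_mass DESC
""").df()
```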
In research involving any kind of computer simulation, we often have to
execute many simulations that eventually become part of the final manuscript.
Automating these simulations and their post-processing brings significant
personal benefit by improving research output and productivity. Automation
makes it much easier to run large parameter sweeps and studies, and allows you
to focus on the important questions to ask rather than managing hundreds or
thousands of simulations manually. It takes the drudgery of data and file
management out of your hands, systematizes your research, and makes it possible
to incrementally improve and refine your work. An added benefit is that your
research also becomes much easier to reproduce.
Jupyter Widgets connect Python objects with web-based visualizations and UIs, enabling both programmatic and interactive manipulation of data and code. For example, lasso some points in a scatterplot visualization and access that selection in Python as a DataFrame.
anywidget makes it simple and enjoyable to bring these capabilities to your own Python classes, and it ensures easy installation and usage by end users in various environments. In this tutorial, you will create your own custom widgets with anywidget and learn the skills to be effective in extending your own Python classes with web-based superpowers.
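As a preview, a complete anywidget widget is a Python class with synced traitlets plus a small JavaScript module; this counter is modeled on the example in the anywidget documentation:

```python
import anywidget
import traitlets

class CounterWidget(anywidget.AnyWidget):
    # Front-end code: an ES module with a render function
    _esm = """
    function render({ model, el }) {
      const btn = document.createElement("button");
      btn.textContent = `count is ${model.get("count")}`;
      btn.addEventListener("click", () => {
        model.set("count", model.get("count") + 1);
        model.save_changes();                 // push the change back to Python
      });
      model.on("change:count", () => {
        btn.textContent = `count is ${model.get("count")}`;
      });
      el.appendChild(btn);
    }
    export default { render };
    """
    count = traitlets.Int(0).tag(sync=True)   # state shared between Python and the browser

CounterWidget()
```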
Cookiecutter is mainly known as a tool for software project templates. But its possible use cases are much more versatile: plain text and code, small building blocks and whole projects.
You can get started and build powerful templates without any programming - by using a CLI tool and editing text files. And if you're willing to throw some Python code and Jinja extensions into the mix, you can build pretty sophisticated and flexible automations.
The main goal of this workshop is to provide some inspiration: How do you detect candidates for automation in your workflow? Where can you improve speed and consistency, and free up some mental energy for the actual content of your task?
As we descend from the peak of the hype cycle around Large Language Models (LLMs), chat-based document inquiry systems have emerged as a high-value practical use case. Retrieval-Augmented Generation (RAG) is a technique to share relevant context and external information (retrieved from vector storage) to LLMs, thus making them more powerful and accurate.
In this hands-on tutorial, we’ll dive into RAG by creating a personal chat app that accurately answers questions about your selected documents. We’ll use a new OSS project called Ragna that provides a friendly Python and REST API, designed for this particular case. We’ll test the effectiveness of different LLMs and vector databases, including an offline LLM (i.e., local LLM) running on GPUs on the cloud machines provided to you. We'll then develop a web application that leverages the REST API, built with Panel, a powerful OSS Python application development framework.
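Ragna's own API is covered in the tutorial; as background, the retrieval step of RAG can be sketched in a library-agnostic way like this (the document chunks, model name, and final LLM call are placeholders):

```python
# Minimal retrieval sketch for RAG; the actual tutorial uses Ragna's APIs.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["Chunk one of a document...", "Chunk two...", "Chunk three..."]   # placeholders
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

query = "What method does the document describe?"
q_vec = model.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ q_vec                       # cosine similarity (vectors are normalized)
top_chunks = [docs[i] for i in np.argsort(scores)[::-1][:2]]

prompt = "Answer using only this context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
# `prompt` would then be sent to the LLM of your choice.
```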
This tutorial will show you how to use the Pandas, Dask, or Xarray APIs you already know to interactively explore and visualize your data, even if it is big, streaming, or multidimensional. Then just replace your expression arguments with widgets to get an instant web app that you can share as HTML+WASM or backed by a live Python server. These tools let you focus on your data rather than the API, and let you build linked, interactive, drill-down exploratory apps without running a web-technology software development project, and then share them without becoming an operations specialist.
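One way to realize this pattern with hvPlot and Panel looks like the sketch below; the file, column names, and widget are hypothetical and only meant to show the "replace an argument with a widget" idea.

```python
import pandas as pd
import hvplot.pandas  # noqa: F401  (adds the .hvplot accessor to pandas objects)
import panel as pn

df = pd.read_csv("measurements.csv", parse_dates=["date"])   # hypothetical data
window = pn.widgets.IntSlider(name="rolling window (days)", start=1, end=60, value=7)

def smoothed(window):
    return df.set_index("date")["value"].rolling(window).mean().hvplot.line()

# Binding the widget to the argument turns the plot into an instant web app
pn.Column(window, pn.bind(smoothed, window)).servable()
```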
Interactive widgets were introduced to the Jupyter ecosystem over 10 years ago. A number of progressively more powerful interactive widget packages have been developed since then, supporting the construction of sophisticated dashboards and interactives. This tutorial will describe a number of approaches to developing and managing complex web apps that are compatible with Jupyter widgets and promote scalable application development.
Creating code that can be shared and reused is the pinnacle of open science. But tools and skills to share your code can be tricky to learn. In this hands-on tutorial, you’ll learn how to turn your pure Python code into an installable Python module that can be shared with others. To get the most out of this tutorial, you should be familiar with writing Python code and functions, and with Python environments.
You will leave this tutorial understanding how to:
- Create code that can be installed into different environments
- Use Hatch as a workflow tool, making setup and installation of your code easier
- Use Hatch to publish your package to (test) PyPI
While most scientists aren't at the scale of black hole imaging research teams that analyze Petabytes of data every day, you can easily fall into a situation where your laptop doesn't have quite enough power to do the analytics you need.
In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on actual powerful machines on a public cloud provided by the presenters – starting from how the data is stored and read, to how it is processed and visualized.
Generative AI systems built upon large language models (LLMs) have shown great promise as tools that enable people to access information through natural conversation. Scientists can benefit from the breakthroughs these systems enable to create advanced tools that will help accelerate their research outcomes. This tutorial will cover: (1) the basics of language models, (2) setting up the environment for using open source LLMs without the expensive compute resources needed for training or fine-tuning, (3) learning a technique like Retrieval-Augmented Generation (RAG) to optimize the output of an LLM, and (4) building a “production-ready” app to demonstrate how researchers could turn disparate knowledge bases into special purpose AI-powered tools. The right audience for our tutorial is scientists and research engineers who want to use LLMs for their work.
Between telescopes and satellite cameras and MRI machines and microscopes, scientists are producing more images than they can realistically look at. They need specialized viewers for multi-dimensional images, and automated tools to help process those images into knowledge. In this tutorial, we will cover the fundamentals of algorithmic image analysis, starting with how to think of images as NumPy arrays, moving on to basic image filtering, and finishing with a complete workflow: segmenting a 3D image into regions and making measurements on those regions. At every step, we will visualize and understand our work using matplotlib and napari.
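The filtering-and-segmentation workflow can be previewed in a few lines of scikit-image (using a 2D sample image rather than the 3D data from the tutorial):

```python
import numpy as np
from skimage import data, filters, measure

image = data.coins()                                   # sample image bundled with scikit-image
smooth = filters.gaussian(image, sigma=2)              # basic image filtering
mask = smooth > filters.threshold_otsu(smooth)         # global threshold
labels = measure.label(mask)                           # segment into connected regions
props = measure.regionprops_table(labels, properties=("label", "area"))
print(len(np.unique(labels)) - 1, "regions found")
```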
Writing correct software is difficult, and even scientists don’t always get it right
[citation needed].
Hypothesis is a testing package that will search for counterexamples to your
assertions – so you can write tests that provide a high-level description of your
code or system, and let the computer attempt a Popperian falsification. If the
search fails, your code is (probably) OK… and if it succeeds, you have a minimal
failing input to debug.
Come along and learn the principles of property-based testing, how to use
Hypothesis, and how to use it to check scientific code – whether highly-polished
or quick-and-dirty!
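A property-based test with Hypothesis looks like this; instead of hand-picked inputs, Hypothesis generates and shrinks examples for you (the property here is just an illustration):

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    once = sorted(xs)
    assert sorted(once) == once        # sorting twice changes nothing
    assert len(once) == len(xs)        # no elements are lost or duplicated
```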
The United States Census Bureau publishes over 1,600 data sets via its APIs. These are useful across a myriad of fields in the social sciences. In this interactive tutorial, attendees will learn how to use open-source Python tools to discover, download, analyze, and generate maps of U.S. Census
data. The tutorial is full of practical examples and best practices to help participants avoid the tedium of data wrangling and concentrate on their research questions.
This hands-on tutorial will consider the full breadth and richness of data available from the U.S. Census. We will cover not only American Community Survey (ACS) and similarly well-known data sets, but also a number of data sets that are less well-known but nonetheless useful in a variety of research contexts.
The tutorial has no slides. Instead, it will be presented from a series of live Jupyter notebooks. After each lesson notebook is presented by the instructor, participants will be given a hands-on exercise to put what they just learned into practice. Essentially they will start with a research question and a blank notebook. Using what they just learned, they will then write the code to answer the question.
Lessons will start with the most basic queries and mapping and move through more advanced topics related to geographies, variables, groups and trees of related variables, and data set exploration.
After covering the concepts, the group as a whole will go through a complete end-to-end research example. Finally, individuals and small groups will have a chance to complete a series of short interactive exercises extending what they have learned and share the results with their peers.
All Python tooling used in the workshop is available as open-source software. Final versions of the notebooks used in the tutorial will also be made available via open-source.
Hosted by Streamlit in the Main Foyer of the Convention Center. Catch up with old friends or meet new fellow attendees! Food and drinks will be served.
There are many programming languages that we might choose for scientific computing, and we each bring a complex set of preferences and experiences to such a decision. There are significant barriers to learning about other programming languages outside our comfort zone, and seeing another person or community make a different choice can be baffling. In this talk, hear about the costs that arise from exploring or using multiple programming languages, what we can gain by being open to different languages, and how curiosity and interest in other programming languages supports sharing across communities. We’ll explore these three points with practical examples from software built for flexible storage and model deployment, as well as a brand new project for scientific computing.
Using machine learning to predict chemical properties and behavior is an important complement to traditional approaches to computation and simulation in chemistry. The ANAKIN-ME (ANI) methodology has been shown to produce generalized and transferable neural network potentials, trained on density functional theory (DFT) molecular energies, at a greatly reduced computational cost. The work presented here details an approach to generating new data in an active learning scheme in order to improve predictions in the regions of chemical space with high predictive uncertainty at the atom level.
Aviation accounts for 2% of global greenhouse gas emissions, and reliance on liquid petroleum-based fuels makes this sector challenging to decarbonize. We seek to accelerate the development of sustainable aviation fuels using an early-stage design tool with a data-driven approach. We developed our strategy using the Python-based optimization packages BoTorch and Ax, and also rely on Pandas. We will discuss how to down-select from many possible fuel components to a specified number of chemical species and identify which combinations are most promising for a novel sustainable aviation fuel. We will also present its integration in our open-source web tool supporting biofuel research.
pandas is one of the most commonly used data science libraries in Python, with a convenient set of APIs for data cleaning, preparation, analysis, and exploration. However, despite its widespread adoption, pandas suffers from severe memory and performance issues on even moderately sized datasets. Modin is an open-source project that serves as a fast, scalable drop-in replacement for pandas (https://github.com/modin-project/modin). By changing just a single line of code, Modin seamlessly speeds up pandas workflow on a laptop or in a cluster. Originally developed at UC Berkeley, Modin has been downloaded more than 17 million times and is used by leading data science teams across industries.
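The "single line of code" is the import; everything downstream stays ordinary pandas (file and column names below are placeholders):

```python
# import pandas as pd                 # before
import modin.pandas as pd             # after: the one-line change Modin asks for

df = pd.read_csv("big_file.csv")      # placeholder file
summary = df.groupby("category").agg({"value": "mean"})
```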
Agent-based models (ABMs) are powerful tools for understanding how people behave and interact. However, many ABMs are slow, cumbersome to use, or both. Here we introduce Starsim, an open-source, high-performance ABM specialized for modeling different diseases (such as HIV, STIs, and tuberculosis). Built on NumPy, Numba, and SciPy, Starsim's performance rivals ABMs implemented in Java or C++, but with the convenience provided by Python: specifically, the ability to quickly implement and refine new disease modules. Starsim can also be extended to other applications in which people interact on timescales from days to decades, including economics and social science.
Distributed systems are neat to demo, but hard to use in reality.
This talk goes through lessons learned running 100,000s of Dask clusters and 1,000,000,000s of Python functions for users in critical production settings across many companies and research groups.
We'll cover lessons learned like ...
- GIL Vigilance is Good
- Kubernetes is too heavyweight if all you want is lots of jobs
- ARM is underused
- Docker doesn't work well for data science folks
- Availability-Zones are key for spot/GPU availability
- Adaptive is underused (but hard)
- Most workloads are small
- Most workloads are fast
- Most users don't scale up properly
- Most people overestimate costs
These lessons will be motivated by tons of metadata collected and aggregated from real-world workloads.
Monte Carlo / Dynamic Code (MC/DC) is a performant and scalable Monte Carlo radiation transport simulation package with GPU and CPU support. It is written entirely in Python and uses Numba to accelerate Python code to CPU and GPU targets. This allows MC/DC to be a portable, easily installable, single-language source code ideal for rapid numerical methods exploration at scale. We will discuss the benefits and drawbacks of such a scheme and make comparisons to traditionally compiled codes as well as those written using other modern high-level languages (e.g., Julia).
The CALculation of PHAse Diagram (CALPHAD) method coupled with uncertainty quantification and propagation (UQ & UP) calculations is a viable tool to predict thermodynamic properties in a multicomponent region at different temperatures and compositions with a confidence interval. These types of calculations provide upper and lower bounds of thermochemical property predictions when choosing the chemistry of candidate salt mixtures and are therefore vital for molten salt reactor engineering applications. The present work studies the NaCl-KCl-MgCl2 salt mixture, which is of high interest for molten salt applications, with the aid of the ESPEI and PyCalphad open-source codes for UQ and UP calculations.
The representation, synthesis, modeling, and visualization of neighborhoods are a fundamental pursuit across a range of social sciences.
We present AstroPhot, a tool to accelerate the analysis of astronomical images. AstroPhot allows for simultaneously modelling images with galaxies and point sources in multi-band and time-domain data. In this talk I will discuss the benefits and challenges of using PyTorch (a differentiable and GPU-accelerated scientific Python library) to allow for fast development without sacrificing numerical performance. I will detail our development process as well as how we encourage users of all skill levels to engage with our documentation and tools.
In this talk, I will discuss how one can foster a culture of open source contributions at one's company. Based on my successes and failures as a data scientist working in the biotech space, I will describe two key ideas (fostering internal open source and articulating value to key senior leadership) as being on the critical path to generating buy-in within the organization.
In this talk, I will discuss learning, ethical, and legal issues when using large language models to supplement learning and practicing scientific computing. Engineering disciplines are becoming
increasingly dependent upon computational tools and resources. Writing scientific computing code has become increasingly easy with GitHub Copilot, ChatGPT, and other LLM tools. Problems can arise when
practitioners begin to use code without reviewing it and understanding why it was written that way. I will present my findings from incorporating ChatGPT into the wealth of learning resources, and how I discuss academic integrity in relation to U.S. copyright law and ethical responsibility. What constitutes "intellectual contribution," "independent work," and "plagiarism" when we reuse code from open source software and from LLMs?
MDAnalysis (https://www.mdanalysis.org) is one of the most widely used open-source Python libraries for molecular simulation analysis, with applications ranging from understanding the interaction of drugs with proteins to the design of novel materials. With over 200 contributors and 18 years of development, MDAnalysis has established a mature, stable API and a broad user community. Here we present the current status of the library’s capabilities as it approaches its next major release. We also detail ongoing work to address modern challenges in the ever-evolving landscape of molecular simulation, such as handling increasingly large simulation datasets and meeting the tenets of FAIR.
Discover the potential of multi-agent generative AI applications with AutoGen, a pioneering framework designed to tackle complex tasks requiring multi-step planning, reasoning, and action. In this talk, we will explore the fundamentals of multi-agent systems, learn how to build applications using AutoGen, and discuss the open challenges associated with this approach, such as control trade-offs, evaluation challenges, and privacy concerns.
With AutoGen's open-source platform and growing ecosystem, developers can harness the power of generative AI to create advanced AI assistants and interfaces for the digital world. This talk is ideal for those with a general understanding of generative AI and Python application development.
Johnson Matthey (JM) leads in sustainable technologies, employing advanced science to address global challenges in energy, chemicals, and automotive sectors. Our cutting-edge research and development (R&D) facilities include state-of-the-art characterization tools, handling diverse datasets like images, timeseries, 3D tomograms, spectra, and digital twins. With the rising demand for data-driven insights, Python has emerged as a vital tool in enhancing decision-making processes. We showcase our utilization of the open-source community to construct our data science research platform, marking a significant step forward in our innovation journey.
The ATLAS experiment at CERN explores vast amounts of physics data to answer the most fundamental questions of the Universe. The prevalence of Python in scientific computing motivated ATLAS to adopt it for its data analysis workflows while enhancing users' experience. This talk will describe to a broad audience how a large scientific collaboration leverages the power of the Scientific Python ecosystem to tackle domain-specific challenges and advance our understanding of the Cosmos. Through a simplified example of the renowned Higgs boson discovery, attendees will gain insights into the utilization of Python libraries to discriminate a signal immersed in noise, through tasks such as data cleaning, feature engineering, statistical interpretation, and visualization at scale.
GitHub repository for the talk: https://github.com/ekourlit/scipy2024-ATLAS-demo
Support for string data in NumPy has long been a sore spot for the community. At the beginning of 2023 I was given the task to solve that problem by writing a new UTF-8 variable-length string DType leveraging the new NumPy DType API. I will offer my personal narrative of how I accomplished that goal over the course of 2023 and offer my experience as a model for others to take on difficult projects in the scientific python ecosystem, offering tips for how to get help when needed and contribute productively to an established open source community.
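In NumPy 2.0 the resulting dtype is available as np.dtypes.StringDType; a quick illustration:

```python
import numpy as np

# Variable-length UTF-8 strings, stored efficiently rather than as fixed-width arrays
arr = np.array(["numpy", "variable-length", "strings"], dtype=np.dtypes.StringDType())
print(arr == "numpy")      # elementwise comparison works like any other dtype
print(np.sort(arr))        # sorting and other ufunc-style operations are supported
```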
From radio telescopes to proton accelerators, scientific instruments produce tremendous amounts of data at equally high rates. To handle this data deluge and to ensure the fidelity of the instruments’ observations, architects have historically written measurements to disk, enabling downstream scientists and researchers to build applications with pre-recorded files. The future of scientific computing is interactive and streaming; how many Nobel Prizes are hidden on a dusty hard drive that a scientist didn’t have time or resources to analyze? In this talk, NVIDIA and the SETI institute will present their joint work in building scalable, real time, high performance, and AI ready sensor processing pipelines at the Allen Telescope Array. Our goal is to provide all scientific computing developers with the tools and tips to connect high speed sensors to GPU compute and lower the time to scientific insights.
In the domain of data science, a significant number of questions are aimed at understanding and quantifying the effects of interventions, such as assessing the efficacy of a vaccine or the impact of price adjustments on the sales volume of a product. Traditional association-based machine learning methods, predominantly utilized for predictive analytics, prove inadequate for answering these causal questions from observational data, necessitating the use of causal inference methodologies. This talk aims to introduce the audience to the Directed Acyclic Graph (DAG) framework for causal inference. The presentation has two main objectives: firstly, to provide an insight into the types of questions where causal inference methods can be applied; and secondly, to demonstrate a walkthrough of causal analysis on a real dataset, highlighting the various steps of causal analysis and showcasing the use of the pgmpy package.
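As a flavor of the DAG workflow, pgmpy lets you encode causal assumptions as a graph and query adjustment sets; the variables here are a made-up pricing example, not from the talk.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.inference import CausalInference

# Hypothetical causal assumptions: season affects both price and sales; price affects sales
model = BayesianNetwork([("season", "price"), ("season", "sales"), ("price", "sales")])

ci = CausalInference(model)
# Which variables must be adjusted for to estimate the effect of price on sales?
print(ci.get_all_backdoor_adjustment_sets("price", "sales"))
```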
Dask is a library for distributed computing with Python that integrates tightly with pandas. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). The re-implementation of the DataFrame API addresses the pain points that users ran into. We will look at how Dask is a lot faster now, how it performs on benchmarks that it struggled with in the past, and how it compares to other tools like Spark, DuckDB, and Polars.
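For readers new to Dask, the DataFrame API mirrors pandas and defers execution until compute() is called (the path and columns below are hypothetical):

```python
import dask.dataframe as dd

df = dd.read_parquet("events/*.parquet")          # hypothetical partitioned dataset
out = df.groupby("user_id")["amount"].sum()       # lazily builds a task graph
result = out.compute()                            # executes in parallel, returns pandas
```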
We present mrfmsim, an open-source framework that facilitates the design, simulation, and signal validation of magnetic resonance force microscopy experiments. The mrfmsim framework uses directed acyclic graphs (DAGs) to model experiments and employs a plugin system that adds custom experiments and functionalities. Differing from common DAG-powered workflow packages, mrfmsim allows flexible customizations of experiments post-definition without rewriting the internal model, such as optimized looping. In the talk, we present the challenges in building simulation packages for experiments undergoing continuous development in a graduate research setting. We discuss the current one-off approach that led to error-prone code and how modularity, extendibility, and readability can speed up the development cycle.
Causal inference has traditionally been used in fields such as economics, health studies, and social sciences. In recent years, algorithms combining causal inference and machine learning have been a hot topic. Libraries like EconML and CausalML, for instance, are good Python tools that facilitate the easy execution of causal analysis in areas like economics, human behavior, and marketing. In this talk, I will explain key concepts of causal inference with machine learning, show practical examples, and offer some practical tips. Attendees will learn how to apply machine learning to causal analysis effectively, boosting their research and decision-making.
We are developing a modern open-source Python compiler called LPython
(https://lpython.org/) that can execute users' code interactively in Jupyter to
allow exploratory work (much like CPython, MATLAB, or Julia) as well as compile
to binaries, with the goal of running users' code on modern architectures such as
multi-core CPUs and GPUs, as well as unfamiliar, new architectures like GSI's APU,
which features programmable compute-in-memory. We aim to provide the best
possible performance for numerical, array-oriented code. The compiler itself is
written in C++ for robustness and speed.
The SciPy subpackage scipy.sparse is moving from its matrix API to an array API. This will allow shapes other than 2D and clean up the treatment of the multiplication operator *. This talk will start by describing the changes and their impacts. We will then discuss the process of revamping an API without breaking too much existing user code, the trade-offs between slow changes over many releases and faster, perhaps breaking changes, and whether to simply create a new package instead. The talk should be useful for users of scipy.sparse and also for packages considering a major API change.
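The practical difference between the two APIs is visible in a few lines (a sketch, assuming a recent SciPy where sparse arrays are available):

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.eye(3))   # legacy matrix API
a = sparse.csr_array(np.eye(3))    # new array API

m * m    # matrix API: * is matrix multiplication
a * a    # array API: * is elementwise multiplication
a @ a    # matrix multiplication is spelled @ in the array API
```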
In this talk we introduce NVIDIA Warp, an open-source Python framework designed for accelerated differentiable computing. Warp enhances Python functions with just-in-time (JIT) compilation, allowing for efficient execution on CPUs and GPUs. The talk’s focus is on Warp’s application in physics simulation, perception, robotics, and geometry processing, along with its capability to integrate with machine-learning frameworks like PyTorch and JAX. Participants will learn the basics of Warp, including its JIT compilation process and the runtime library that supports various spatial computing operations. These concepts will be illustrated with hands-on projects based on research from institutions like MIT and UCLA, providing practical experience in using Warp to address computational challenges. Targeted at academics, researchers, and professionals in computational fields, the course is designed to inspire attendees and equip them with the knowledge and skills to use Warp in their work, enhancing their projects with efficient spatial computing.
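A minimal Warp kernel, run here on the CPU, shows the JIT workflow the talk covers (the kernel itself is just an illustrative scaling operation):

```python
import warp as wp

wp.init()

@wp.kernel
def scale(x: wp.array(dtype=float), alpha: float):
    i = wp.tid()                  # one thread per array element
    x[i] = alpha * x[i]

x = wp.array([1.0, 2.0, 3.0], dtype=float, device="cpu")
wp.launch(scale, dim=x.shape[0], inputs=[x, 2.0], device="cpu")
print(x.numpy())                  # [2. 4. 6.]
```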
Impact charts, as implemented in the impactchart package,
make it easy to take a data set and visualize the impact of one variable
on another in ways that techniques like scatter plots and linear regression can't,
especially when there are other variables involved.
In this talk, we will introduce impact charts, demonstrate how they find easter-egg impacts
we embed in synthetic data, show how they can find hidden impacts in a real-world use case,
show how you can create your first impact chart with just a few lines of code,
and finally talk a bit about the interpretable machine learning techniques they are built upon.
Impact charts are primarily visual, so this talk will be too.
With datasets growing in both complexity and volume, the demand for more efficient data processing has never been higher. Pandas and NetworkX, the go-to Python libraries for tabular and graph data processing, are very popular for their ease of use and flexibility. However, they often struggle to keep pace with the demands of large-scale data analysis.
This talk introduces new open-source GPU accelerators from the NVIDIA RAPIDS project for Pandas and NetworkX, and will demonstrate how you can enable them for your workflows to experience massive speedups – up to 150x in pandas and 600x in NetworkX – without code changes.
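For the pandas accelerator, "without code changes" means loading an extension before your existing code runs; a sketch for a Jupyter session (file and column names are placeholders):

```python
%load_ext cudf.pandas        # enable the GPU accelerator before pandas is imported

import pandas as pd          # unchanged user code from here on

df = pd.read_parquet("transactions.parquet")          # placeholder file
top = df.groupby("merchant")["amount"].sum().nlargest(10)
```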
Speeding up Python code traditionally involves the use of Just-In-Time (JIT) or Ahead-Of-Time (AOT) compilation. There are tradeoffs to both approaches, however. As part of the Numba project's aim to create a compiler toolkit, the PIXIE project is being developed. It offers an extensible toolchain that consumes multiple languages and produces AOT-compiled binary extension modules. These PIXIE-based extension modules contain CPU-specific function dispatch for AOT use and also support something similar to Link-Time Optimization (LTO) for use in situations such as JIT compilation and/or cross-module optimization. PIXIE modules are easy to load and call from Python, and can be inlined into Numba JIT compilation, giving Python developers access to the benefits of both AOT and JIT.
Scientific software drives open research. However, developing and maintaining a Python package is a tricky endeavor. You need to navigate a thorny packaging ecosystem, often in an academic environment that doesn’t traditionally value software. pyOpenSci has learned that an inclusive community can be empowered to make Python packaging more accessible, and that constructive peer review supports maintainers in creating better software, while also providing academic credit. In this talk you’ll learn:
- How to build consensus around thorny topics like packaging.
- Where to find beginner-friendly packaging support.
- How constructive peer review can support better code.
- How to get involved with pyOpenSci.
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
The Poster session will be in the Ballroom from 6:00-7:00pm. Meet with the poster authors to ask questions and learn about the posters that will be on display throughout the main conference.
The Job Fair will be held concurrently in the Ballroom foyer with participating sponsors. Sponsor companies will be available to discuss current job opportunities.
Earth’s climate is chaotic and noisy. Finding usable signals amidst all of the noise can be challenging: be it predicting if it will rain, knowing which direction a hurricane will go, understanding the implications of melting Arctic ice, or detecting the impacts of humans on the earth’s surface. Here, I will demonstrate how explainable artificial intelligence (XAI) techniques can sift through vast amounts of climate data and push the bounds of scientific discovery: allowing scientists to ask “why?” but now with the power of machine learning.
Geospatial data is becoming more present in data workflows today, and plenty of Python tools allow us to work with it. In the past year, a new contender emerged: DuckDB introduced an extension for analyzing geospatial data. Everyone in the data world has been buzzing about DuckDB (~15k stars on GitHub), and now this duck quacks geospatial data too. But wait a minute, isn’t DuckDB all SQL? Yes, but fear not, Ibis has you covered! Ibis is a Python library that provides a dataframe-like interface, enabling users to write Python code to construct SQL expressions that can be executed on multiple backends, like DuckDB. In this talk, you will learn how to leverage the benefits of DuckDB geospatial while remaining in the Python ecosystem (yes, we will do a live demo). This is an introductory talk; everyone is welcome, and no previous experience with spatial databases or geospatial workflows is needed.
Many tools exist for large-scale data transfer (tens of terabytes or more), but they often don't match the needs of scientific data flows. In this talk, I'll explain how we built the 'librarian' framework with FastAPI, postgres, and Globus to ease this challenge. Designed for the Simons Observatory's petabyte-scale data transfer, I'll cover building reliable web services, flexible development with dependency injection, effective testing with pytest, and deployment using NERSC's Spin. I hope to demystify web and database programming for a scientific audience.
This talk discusses recent developments in open source computational economics, with a focus on the Econ-ARK project and dynamic stochastic optimization problems. Economics is often concerned with agents making choices across periods of time and interacting through a market. Historically, these problems have been solved using dynamic programming methods that are plagued by the curse of dimensionality. In practice, economic models were either dramatically simplified for tractability or solved to only rough approximation. Recent work has shown how deep learning can be used to solve these problems in a much more efficient way. Today, more models are computationally feasible, and we should expect general computing methods to continue to expand this horizon. Thus, what's needed is a portable way of representing economic models that is agnostic to solution methods. I'll present early-stage efforts to produce such a representation as a modeling language compatible with SymPy.
Accurate cell tracking is essential to various biological studies. In this talk, we present Ultrack, a novel Python package for cell tracking that considers a set of multiple segmentation hypotheses and picks the segments that are most consistent over time, making it less susceptible to mistakes when traditional segmentation fails.
The package supports various imaging modalities, from small 2D videos to terabyte-scale 3D time-lapses or multicolored datasets in any napari-compatible image format (e.g. tif, zarr, czi, etc.).
It is available at https://github.com/royerlab/ultrack
Image analysis is ubiquitous across many areas of biomedical research, resulting in terabytes of image data that must be hosted by both research institutions and data repositories for sharing and reproducibility. Common solutions for data hosting are required to improve interoperability and accessibility of bioimage data, while maintaining the flexibility to address each institution's unique requirements regarding sharing and infrastructure. OMERO is an open-source solution for image data management which can be customized and hosted by individual institutions. OMERO runs a server-based application with web browser and command line options for accessing and viewing image data, based on the widely used OME data model for microscopy data. Multiple OMERO deployments might be used to provide core delivery, facilitate internal research, or serve as a public data repository. The omero-cli-transfer package facilitates data transfer between these OMERO instances and provides new methods for importing datasets. Another open-source package, ezomero, improves the usability of OMERO in a research environment by providing easier access to OMERO's Python interface. Along with existing OMERO plugins built for other analysis and viewing software, this positions OMERO to be a hub for image storage, analysis, and sharing.
At the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC), we're doing the heavy lifting to make large geospatial datasets easily accessible from the cloud. No more downloading data. No more worrying about quirky metadata or missing dimensions. No more concatenating hundreds or thousands of files together. Just fire up your Jupyter notebook somewhere in Amazon Web Services (AWS)'s US-West-2 region, get some free temporary AWS credentials, open our Zarr stores, and start doing your science.
This talk provides an overview of the evolution of scientific software in Python, with a focus on the speaker's journey from creating Spyder, the Scientific Python IDE, to developing DataLab, a platform for signal and image processing. The speaker will share insights into the challenges and opportunities encountered in developing and maintaining these projects, and discuss how they have contributed to the scientific Python ecosystem. The talk will also explore the evolving needs of both the scientific and industrial communities during this period, and why desktop applications remain relevant in the era of web-based tools.
Talk is now available on YouTube
Join us for the Diversity Keynote Luncheon. Lunch will be provided in the foyer of the ballroom.
With the recent release of NumPy 2.0, the NumPy maintainers are looking for feedback on the release. Are there issues that are blocking your ability to migrate to using NumPy 2.0? Are there things you wish we had fixed or changed but didn't make it into the 2.0 release? Any changes you really don't like? Please come and let us know. Your feedback will directly influence how NumPy 2.1 looks and how we manage major releases in the future.
Generative AI has rapidly changed the landscape of computing and data education. Many learners are utilizing generative AI to assist in learning, so what should educators do to address the opportunities, risks, and potential of its use? The goal of this open discussion session is to bring together community members to unravel these pressing questions and improve learning outcomes in a variety of contexts: not only students learning in a classroom setting, but also ed-tech or generative AI designers developing new user experiences that aim to improve human capacities, and even scientists interested in learning best practices for communicating results to stakeholders or creating learning materials for colleagues. The open discussion will include ample opportunity for community members to network with each other and build connections after the conference.
Scientific environments and IDEs for Python have grown significantly in complexity in recent years, adding many features found in traditional IDEs such as debuggers, LSP support, plugin management, Git clients, testing integration, and more. This adds functionality that users are asking for, but also makes them more complicated for users without those needs. Is that additional complexity justified? Is it really serving users? That’s what we’d like to find out through this BoF.
Therefore, we’d like to invite the community to share their thoughts about the features they find most important for their research, both existing and to-be-developed, in open source scientific IDEs (such as JupyterLab and Spyder). Equally helpful would be feedback on what features haven’t been so helpful for users, and should be simplified, reworked or perhaps even removed. Finally, this would be an opportunity to ask questions of and interact directly with the IDE maintainers.
Water column sonar data collected by echosounders are essential for fisheries and marine ecosystem research, enabling the detection, classification, and quantification of fish and zooplankton from many different ocean observing platforms. However, the broad usage of these data has been hindered by the lack of modular software tools that allow flexible composition of data processing workflows that incorporate powerful analytical tools in the scientific Python ecosystem. We address this gap by developing Echostack, a suite of open-source Python software packages that leverage existing distributed computing and cloud-interfacing libraries to support intuitive and scalable data access, processing, and interpretation. These tools can be used individually or orchestrated together, which we demonstrate in example use cases for a fisheries acoustic-trawl survey.
Over the past year, there has been an increase in the number of libraries that leverage Rust and pyo3 to significantly increase performance. What's the catch? In this talk, we will discuss how the Data Science team at Capital One has been thinking about the power of Rust-backed Python and whether the benefits justify the complexity.
Image segmentation plays a crucial role in extracting valuable insights from geospatial data. While traditional segmentation methods can be laborious, deep learning offers automation but often demands extensive training and resources. Meta AI's Segment Anything Model (SAM) presents a compelling solution, segmenting objects without additional training. Our open-source Python package, samgeo, streamlines the use of SAM for geospatial data, offering various segmentation methods. Experiments confirm SAM's accuracy and efficiency as a powerful tool for remote sensing analysis. The samgeo package simplifies the adoption of automated image segmentation, facilitating better geospatial insights and decision-making across multiple domains.
Python is a popular language for data engineering workloads. In data engineering, developers must use a "Query Engine" to efficiently retrieve data, run data processing and then send data back out to a destination storage system or application.
The Python API for Apache Spark (PySpark) is currently the most popular framework that most data engineers use for data engineering at large scale. However, PySpark has a heavy dependency on the JVM which causes high friction during the development process.
In this talk, we discuss our work with the Daft Python Dataframe (www.getdaft.io) which is a distributed Python query engine built with Rust. We will perform a deep-dive into Daft architecture, and talk about how the strong synergy between Python and Rust enables key advantages for Daft to succeed as a query engine.
HyperSpy is a community-developed open-source library providing a
framework to facilitate interactive and reproducible analyses of
multidimensional datasets. Born out of the electron microscopy
scientific community and building on the extensive scientific Python
environment, HyperSpy provides tools to efficiently explore, manipulate,
and visualize complex datasets of arbitrary dimensionality, including
those larger than a system's memory. After 14 years of development,
HyperSpy recently celebrated its 2.0 version release. This presentation
will (re)introduce HyperSpy's features and community, with a focus on
recent efforts paring the library down to a domain-agnostic core and a
robust ecosystem of extensions providing specific scientific
functionality.
This talk illustrates how machine learning models to detect harmful algal blooms from satellite imagery can help water quality managers make informed decisions around public health warnings for lakes and reservoirs. Rooted in the development of the open source package CyFi, this talk includes insights around identifying when your model is getting the right answer for the wrong reasons, the upsides of using decision tree models with satellite imagery, and how to help non-technical users build confidence in machine learning models. The intended audience is those interested in using satellite imagery to monitor and respond to the world around us.
nanoarrow, a newly developed subproject of Apache Arrow, is squarely focused on unlocking connectivity among Python packages and the libraries they wrap using the features and rich type support of the Arrow columnar format. The vision of nanoarrow is that it should be trivial for a library to implement an Arrow-based interface: nanoarrow and its bindings provide tools to produce, consume, and transport tabular data between processes using the Arrow IPC format or between libraries using the Arrow C ABI. For Python maintainers this means less glue code that runs faster so that developers can focus on feature development.
Visualization plays a critical role in the analysis and decision making with data, yet the manner in which state-of-the-art visualization approaches are disseminated limit their adoption into modern analytical workflows. Jupyter Widgets bridge this gap between Python and interactive web interfaces, allowing for both programmatic and interactive manipulation of data and code. However, their development has historically been tedious and error-prone.
In this talk, you will learn about anywidget, a Python library that simplifies widgets, making their development more accessible, reliable, and enjoyable. I will showcase new visualization libraries built with anywidget and explain how its design enables environments beyond Jupyter to add support.
xCDAT (Xarray Climate Data Analysis Tools) is an open-source Python package that extends Xarray for climate data analysis on structured grids. This talk will cover a brief history of xCDAT, the value this package presents to the climate science community, and a general overview of key features with technical examples. xCDAT’s scope focuses on routine climate research analysis operations such as loading, averaging, and regridding data on structured grids (e.g., rectilinear, curvilinear). Some key features include temporal averaging, geospatial averaging, horizontal regridding, vertical regridding, and robust interpretation and handling of metadata and bounds for coordinates.
Non-Python codebases that use metaprogramming present significant challenges to cross-language development. These challenges are further compounded with the inclusion of GPU processing. While common methods of Python/GPU interoperation are covered by popular Python frameworks, these frameworks do not trivialize this use case.
In this talk, we will discuss the process of integrating a Python code for Monte Carlo particle transport (MCDC) with a template-based CUDA C++ framework which applies inversion of control (Harmonize). We will discuss managing the complexity of cross-language dependency injection, relevant implementation strategies, and pitfalls to avoid.
Do you find yourself copying your data into Word, just to make a table? If this is you (and this was us), it’s both frustrating and error-prone. And even though every aspect of a typical analysis can be scripted, it often turns out that the table-making part is elusive. We made Great Tables to enable complete publishing workflows. This Python package lets you easily generate publication-quality tables with the structure you want, many options for formatting values, and plenty of freedom for styling. Importantly, Great Tables closely integrates with Pandas and Polars DataFrames in order to handle a wide range of analyses.
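A small example of the API (the data frame here is invented for illustration):

```python
import polars as pl
from great_tables import GT

df = pl.DataFrame({
    "model": ["baseline", "tuned"],
    "accuracy": [0.912, 0.947],
    "runtime_s": [12.3, 48.1],
})

(
    GT(df)
    .tab_header(title="Model comparison")          # illustrative data and labels
    .fmt_percent(columns="accuracy", decimals=1)   # 91.2%, 94.7%
    .fmt_number(columns="runtime_s", decimals=1)
)
```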
Xarray-datatree [1] is a Python package that supports HDF (Hierarchical Data Format) files with hierarchical group structures by providing a tree-like data structure in xarray. When an HDF file is opened with Datatree, a DataTree object is created that contains all of the groups in the file. The tree-like structure allows each group to be accessed once the DataTree object is instantiated, eliminating the need for a user to step through each group and subgroup to reach observational data.
We will present our use case for Datatree in NASA’s Harmony Level 2 Subsetter (HL2SS). HL2SS provides variable and dimension subsetting for Earth observation data from different NASA data centers. To subset hierarchical datasets without Datatree, HL2SS flattens the entire data structure into a new file by copying all of the grouped and subgrouped variables into the root group, and the variable or dimension subset is performed on this new file. However, the flattened and subsetted file has to have the same hierarchical structure as the original file, so it is unflattened: its attributes are copied and the variables are grouped back to preserve the original group hierarchy. With the open_datatree() function, HL2SS can open datasets containing multiple groups at once with all of their group hierarchies preserved. This functionality has significant benefits for optimizing the HL2SS workflow, since it eliminates the need to flatten and unflatten grouped datasets.
[1] https://github.com/xarray-contrib/datatree
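A minimal sketch of the open_datatree() pattern described above; the file name and group path are hypothetical, and an xarray backend capable of reading the HDF file is assumed to be installed.

    from datatree import open_datatree

    # Open a hierarchical file; every group and subgroup is preserved in the tree
    dt = open_datatree("observations.h5")

    # Navigate to a nested group by path and work with its xarray Dataset directly
    group_ds = dt["instrument/geolocation"].ds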
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
Developed by the Ohio Supercomputer Center (OSC) and funded by the National Science Foundation, Open OnDemand (openondemand.org) is an open-source portal that enables web-based access to HPC services. Clients manage files and jobs, create and share apps, run GUI applications and connect via SSH, all from any device with a web browser.
Open OnDemand empowers students, researchers, and industry professionals with remote web access to supercomputers. From a client perspective, key features are that it requires zero installation (since it runs entirely in a browser), is easy to use (via a simple interface), and is compatible with any device (even a mobile phone or tablet). From a system administrator perspective, key features are that it provides a low barrier to entry for users of all skill levels, is open source with a large community behind it, and is configurable and flexible for users’ unique needs.
A BOF for people who would like to be involved in the proceedings, particularly as supplemental last-minute paper reviewers for this year, and for people interested in proceedings, MyST, Jupyter Book, and computational publishing.
With the increasing size of data sets and computational tasks, cloud-based resources such as databases, GPU computation, data processing pipelines, and hosted Jupyter tools have become critical for scientific development. Projects that do not start at massive scale, though, face the challenge of moving from local developer machines or clusters to fully scaled cloud-based resources. This can be difficult because it requires a skill set quite different from what is typically developed in scientific educational programs, so researchers must either learn it themselves or lobby for administrative support to help design and deploy cloud infrastructure. In this BoF, we hope to give folks who have not worked with cloud resources a chance to ask questions, and to give those who have incorporated cloud resources into their workflows a chance to share advice.
Scientific Python Ecosystem Coordination (SPEC) documents (https://scientific-python.org/specs/) provide operational guidelines for projects in the scientific Python ecosystem. SPECs are similar to project-specific guidelines (like PEPs, NEPs, SLEPs, and SKIPs), but are opt-in, have a broader scope, and target all (or most) projects in the scientific Python ecosystem. Come hear more about what we are working on and planning. Better yet, come share your ideas for improving the ecosystem!
I will tell the story of how the statistical challenges in the search for the Higgs boson and exotic new physics at the Large Hadron Collider led to new approaches to collaborative, open science. The story centers around computational and sociological challenges where software and cyberinfrastructure play a key role. I will highlight a few important changes in perspective that were critical for progress including embracing declarative specifications, pivoting from reproducibility to reuse, and the abstraction that led to the field of simulation-based inference.
Google Earth Engine's new data extraction interfaces seamlessly transfer geospatial data into familiar Python formats provided by NumPy, Pandas, GeoPandas, and Xarray. This integration empowers you to harness Earth Engine's vast data catalog and compute power directly within your preferred Python workflows. For example, the Xee library leverages Xarray's lazy evaluation and Dask to streamline the extraction and analysis of Earth Engine data, offering a more Pythonic alternative to traditional image exports. Earth Engine's new data extraction interfaces unlock fresh geospatial analysis potential by leveraging the unique strengths of both the scientific Python ecosystem and Earth Engine.
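For instance, the Xee backend plugs into xarray's open_dataset; the sketch below mirrors usage shown in the Xee README (authentication and project setup are assumed, and the parameters may need adjusting for your data).

    import ee
    import xarray as xr

    ee.Initialize()  # assumes prior Earth Engine authentication

    # Lazily open an Earth Engine collection as an xarray Dataset via the 'ee' engine
    ds = xr.open_dataset(
        "ee://ECMWF/ERA5_LAND/HOURLY",
        engine="ee",
        crs="EPSG:4326",
        scale=0.25,
    )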
While Python excels at prototyping and iterating quickly, it’s not always performant enough for whole-genome scale data processing. Flyte, an open-source Python-based workflow orchestrator, presents an excellent way to tie together the myriad tools required to run bioinformatics workflows. Flyte is a k8s native orchestrator, meaning all dependencies are captured and versioned in container images. It also allows you to define custom types in Python representing genomic datasets, enabling a powerful way to enforce compatibility across tools. Computational biologists, or any scientists processing data with a heterogeneous toolset, stand to benefit from a common orchestration layer that is opinionated yet flexible.
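A minimal sketch of what this looks like with flytekit's task and workflow decorators; the alignment step is a hypothetical placeholder for a real bioinformatics tool.

    from flytekit import task, workflow

    @task
    def align_reads(fastq_path: str) -> str:
        # Placeholder: in practice this task would run an aligner inside a
        # versioned container image and return the path to the resulting BAM file
        return fastq_path.replace(".fastq", ".bam")

    @workflow
    def genomics_pipeline(fastq_path: str) -> str:
        return align_reads(fastq_path=fastq_path)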
Temporal data is ubiquitous in data science and plays a vital role in machine learning pipelines and business decisions. Preprocessing temporal data using generic data tools can be tedious, lead to inefficient computation, and be prone to errors.
Temporian is an open-source library for safe, simple, and efficient preprocessing and feature engineering of temporal data. It supports common temporal data types, including non-uniformly sampled, multivariate, multi-index, and multi-source data. Temporian favors interactive development in notebooks and integration with other machine learning tools, and can run at scale using distributed computing.
This talk, aimed at data scientists and machine learning practitioners, will showcase Temporian’s key features along with its powerful API, and demonstrate its advantages over generic data preprocessing libraries for handling temporal data.
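To give a flavor of the API, here is a hedged sketch based on Temporian's documented event_set constructor and moving-window operators; the data is made up.

    import temporian as tp

    # A small, non-uniformly sampled series of sales events
    sales = tp.event_set(
        timestamps=["2024-01-01", "2024-01-02", "2024-01-05"],
        features={"amount": [100.0, 250.0, 80.0]},
    )

    # Moving 7-day sum of the "amount" feature
    weekly_sales = sales.moving_sum(tp.duration.days(7))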
Lonboard is a new Python library for geospatial vector data visualization that can be 50x faster than existing alternatives like ipyleaflet or pydeck. This talk will explain why this library is so fast, how it integrates into existing workflows, and planned future improvements.
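As a hedged sketch of the intended workflow, lonboard's high-level viz helper renders a GeoDataFrame on an interactive map (the file name is hypothetical):

    import geopandas as gpd
    from lonboard import viz

    # Load a vector dataset and display it as a GPU-rendered map in the notebook
    gdf = gpd.read_file("buildings.geojson")
    viz(gdf)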
Traditional time series analysis techniques have found success in a variety of data mining tasks. However, they often require years of experience to master, and straightforward, easy-to-use analysis tools have been lacking. We address these needs with STUMPY, a scientific Python library that implements a novel yet intuitive approach for discovering patterns, anomalies, and other insights from any time series data. This presentation will cover the background needed to follow the live interactive demo, requires no prior experience, and introduces a simple, powerful, and scalable time series analysis package that will complement your current toolset.
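As a small illustration of the core workflow (synthetic data, arbitrary window size):

    import numpy as np
    import stumpy

    ts = np.random.rand(10_000)        # synthetic time series
    mp = stumpy.stump(ts, m=50)        # matrix profile with a 50-point window

    # The smallest profile value marks the location of the best-conserved motif
    motif_idx = int(np.argmin(mp[:, 0]))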
Discover how scikit-build-core revolutionizes Python extension building with its seamless integration of CMake and Python packaging standards. Learn about its enhanced features for cross-compilation, multi-platform support, and simplified configuration, which enable writing binary extensions with pybind11, Nanobind, Fortran, Cython, C++, and more. Dive into the transition from the classic scikit-build to the robust scikit-build-core and explore its potential to streamline package distribution across various environments.
To effectively share scientific results, we must blend narrative text and code to create polished, interactive output. Quarto is an open-source scientific publishing system to help you communicate with others through code. Quartodoc is a Python package that generates function references within Quarto websites. Together, these tools create beautiful documentation that is reproducible, accessible, and easily editable.
This talk will include examples of Quarto in action, from simple blogs to expansive Python package documentation, including WebAssembly-powered live examples. Listeners will walk away knowing how to build Quarto websites, when to use quartodoc, and how these tools create better documentation.
A Data Warehouse (DW) is a powerful tool to manage your scientific data, training data, logs, or any other type of relational data. Most Data Warehouses are cloud-based and built to scale to petabyte workflows, but might not be optimal for smaller workloads that need a fast iteration cycle. Likewise, a collection of CSV files and Python scripts can become painful to share and maintain. This is where DuckDB comes in! DuckDB is a fast, in-process database that you can run on your laptop, supports a rich SQL dialect, and can be pushed to the cloud with just a single line of code. In this talk, we’ll show you how to bootstrap a Data Warehouse on your laptop using open source, including ETL (extract-transform-load) data pipelines, dashboard visualization, and sharing via the cloud.
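A minimal sketch of the local bootstrap step (the file and table names are hypothetical):

    import duckdb

    # A persistent, single-file database on your laptop
    con = duckdb.connect("warehouse.duckdb")

    # Extract/load: ingest a CSV directly into a table
    con.sql("CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales.csv')")

    # Transform/query with plain SQL
    con.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()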
Model Share AI (AIMS) is an easy-to-use Python library designed to streamline collaborative ML model development, model provenance tracking, and model deployment, as well as a host of other functions aiming to maximize the real-world impact of ML research. AIMS features collaborative project spaces, allowing users to analyze and compare their models in a standardized fashion. Model performance and various model metadata are automatically captured to facilitate provenance tracking and allow users to learn from and build on previous submissions. Additionally, AIMS allows users to deploy ML models built in Scikit-Learn, TensorFlow Keras, PyTorch, and ONNX into live REST APIs and automatically generated web apps with minimal code. The ability to deploy models with minimal effort and to make them accessible to non-technical end-users through web apps has the potential to make ML research more applicable to real-world challenges.
This talk will lay out the current database / data landscape as it relates to the SciPy stack, and explore how Ibis (an open-source, pure-Python dataframe interface library) can help decouple interfaces from engines, to improve both performance and portability. We'll examine other solutions for interacting with SQL from Python and discuss some of their strengths and weaknesses.
Interactive visualizations are invaluable tools for building intuition and supporting rapid exploration of datasets and models. Numerous libraries in Python support interactivity, and workflows that combine Jupyter and IPyWidgets in particular make it straightforward to build data analysis tools on the fly. However, the field is missing the ability to arbitrarily overlay widgets and plots on top of others to support more flexible details-on-demand techniques. This work discusses some limitations of the base IPyWidgets library, explains the benefits of IPyVuetify and how it addresses these limitations, and finally presents a new open-source solution that builds on IPyVuetify to provide easily integrated widget overlays in Jupyter.
Vast amounts of information of interest to cyber defense organizations come in the form of unstructured data: from host-based telemetry and malware binaries to phishing emails and network packet sequences. All of this data is extremely challenging to analyze. In recent years there have been huge advances in the methodology for converting unstructured media into vectors. However, leveraging such techniques for cyber defense data remains a challenge.
Imposing structure on unstructured data allows us to leverage powerful data science and machine learning tools. Structure can be imposed in multiple ways, but vector space representations, with a meaningful distance measure, have proven to be one of the most fruitful.
In this talk, we will demonstrate a number of techniques for embedding cyber defense data into vector spaces. We will then discuss how to leverage manifold learning techniques, clustering, and interactive data visualization to broaden our understanding of the data and enrich it with expert feedback.
At the Tutte Institute for Mathematics and Computing (TIMC), we believe in the importance of reproducibility and in making research techniques accessible to the broader cyber defense community. To that end, this talk will leverage several open source libraries and techniques that we have developed at TIMC: Vectorizers, UMAP, HDBSCAN, ThisNotThat and DataMapPlot.
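As a hedged sketch of the general embed-then-cluster pattern (the input vectors below are random stand-ins for real embeddings of, say, phishing emails or malware features):

    import numpy as np
    import umap
    import hdbscan

    X = np.random.rand(1000, 256)  # stand-in vector representations

    # Reduce to 2D with manifold learning, then cluster densely populated regions
    embedding = umap.UMAP(n_components=2, metric="cosine").fit_transform(X)
    labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(embedding)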
ITK-Wasm combines the Insight Toolkit (ITK) and WebAssembly to enable high-performance spatial analysis across programming languages and hardware architectures.
ITK-Wasm Python packages work in a web browser via Pyodide but also in system-level environments. We describe how ITK-Wasm bridges WebAssembly with Scientific Python through simple fundamental Python and NumPy-based data structures and Pythonic function interfaces. These interfaces can be accelerated through GPU implementations when available.
We discuss how ITK-Wasm's integration of the WebAssembly Component Model launches Scientific Python into a new world of interoperability, enabling the creation of accessible and sustainable multi-language projects that are easily distributed anywhere.
In this talk, I will present LlamaBot, a Pythonic and modular set of components to build command line and backend tools that leverage large language models (LLMs). During this talk, I will showcase the core design philosophy, internal architecture and dependencies, and live demo command-line applications built using LlamaBot that use both open source and API-access-only LLMs. Finally, I will conclude with a roadmap for LlamaBot development, and an invitation to contribute and shape its development during the Sprints.
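To set expectations for the demo, here is a hedged sketch assuming LlamaBot's SimpleBot interface; the prompts are illustrative.

    from llamabot import SimpleBot

    # A bot primed with a system prompt; calling it sends a user message to the LLM
    bot = SimpleBot("You are a helpful assistant for scientific Python questions.")
    response = bot("How do I compute a rolling mean in pandas?")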
Pooch is a Python library that can download and locally cache files from the web without hassle. Novices can use it to simply download files in one line of code and focus on the data. Package maintainers can use it to provide sample datasets to their users, in examples and tutorials, as libraries like SciPy, scikit-image, napari and MetPy do. During this talk, we'll show you how you can use the different features that Pooch offers and also how you can extend its capabilities by writing your own downloaders or post-processors.
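The one-liner looks roughly like this (the URL is a placeholder):

    import pooch

    # Download on first use, cache locally, and reuse the cached copy afterwards
    fname = pooch.retrieve(
        url="https://example.org/data/sample.csv",  # hypothetical URL
        known_hash=None,  # set to the file's SHA256 digest to enable verification
    )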
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
Come join the BoF to do a practice run on contributing to a GitHub project. We will walk through how to open a Pull Request for a bugfix, using the workflow that most libraries participating in the weekend sprints use (hosted by the sprint chairs).
This BOF will serve as a forum for maintainers and users of scientific Python packages to discuss support for thread-based parallelism in the free-threaded build of CPython. Replacing multiprocessing-based approaches to parallelism in many APIs with Python threads is very attractive, but adding thread safety to libraries after the fact may be challenging. Users can share workflows they want to tackle with the free-threaded interpreter, and maintainers can seek advice and share their experience so far adding support.
In the open-source community, the security of software packages is a critical concern since it constitutes a significant portion of the global digital infrastructure. This BoF session will focus on the supply chain security of open-source software in scientific computing. We aim to bring together maintainers and contributors of scientific Python packages to discuss current security practices, identify common vulnerabilities, and explore tools and strategies to enhance the security of the ecosystem. Join us to share your experiences, challenges, and ideas on fortifying our open-source projects against potential threats and ensuring the integrity of scientific software.
If you have interest in NumPy, SciPy, Signal Processing, Simulation, DataFrames, or Graph Analysis, we'd love to hear what performance you're seeing and how you're measuring it. We've been working to accelerate your favorite packages on GPUs.
This BoF session will explore the essential data required to build a robust and comprehensive map of the open source science landscape. The discussion will center around the types of data needed, the challenges in collecting and curating this data, and the potential insights and benefits that such a map can provide. Participants will engage in a discussion on how to effectively gather and use data to illuminate the dynamics, challenges, and opportunities within open source and open science ecosystems.
Key Discussion Points:
- Identifying necessary data types, for example: for people, pull requests, hours committed, and number of upvotes on issues; for projects, number of direct dependencies, number of citations, etc.
- Challenges in data collection and curation.
- Methodologies for ensuring data accuracy and comprehensiveness.
- Insights gained from mapping large datasets.
- The impact of a comprehensive map on the open source and open science communities.
- Future directions and potential improvements for data collection and mapping.
The context of this BoF will be set by a brief demonstration of The Map of Open Source Science (https://opensource.science).
Come share your ideas for next year's SciPy. Participants will have an opportunity to sign up to be on next year's organizing committee.