SciPy 2024
Do you have a basic understanding of Python and want to "level-up" your computational skills? Do you instinctively write for-loops to perform computations on your arrays? Have you ever heard someone complain "Python is slow" and want to prove them wrong? Do you want to know how to manipulate NumPy arrays like a master? If any or all of these is true, then this tutorial is for you!
Forecasting is central to decision-making in virtually all technical domains. For instance, predicting product sales in retail, forecasting energy demand, and anticipating customer churn all have tremendous value across different industries. However, the landscape of forecasting techniques is as diverse as it is useful, and different techniques and expertise are adapted to different types and sizes of data.
In this hands-on workshop, we give an overview of forecasting concepts, popular methods, and practical considerations. We’ll walk you through data exploration, data preparation, feature engineering, statistical forecasting (e.g., STL, ARIMA, ETS), forecasting with tabular machine learning models (e.g., decision forests), forecasting with deep learning methods (e.g., TimesFM, DeepAR), meta-modeling (e.g., hierarchical reconciliation and relational modeling, ensembles, resource models), and how to safely evaluate such temporal models.
This tutorial is an introduction to data visualization using the popular Vega-Altair Python library. Vega-Altair provides a simple, friendly, and consistent API that supports rapid data exploration. Vega-Altair’s flexible support for interactivity enables the creation and sharing of beautiful interactive visualizations.
Participants will learn the foundational concepts that Vega-Altair is built on and will gain hands-on experience exploring a variety of datasets. Of particular interest to the scientific community, this tutorial will cover recent advancements in the Vega-Altair ecosystem that make it possible to scale visualizations to large datasets, and to easily export visualizations to static image formats for publication.
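As a taste of the API style covered in the tutorial, here is a minimal Vega-Altair example; the cars dataset and column names come from the vega_datasets package and are used purely for illustration.

```python
import altair as alt
from vega_datasets import data

cars = data.cars()  # small example dataset from the vega_datasets package

# Declarative encoding: map columns to visual channels, then add interactivity
alt.Chart(cars).mark_point().encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    color="Origin",
    tooltip=["Name", "Year"],
).interactive()
```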
This tutorial walks participants — Earth scientists with some prior Python experience — through analyses of two particular climate risk scenarios: floods & wildfires. The goal is to obtain hands-on experience with common reproducible Jupyter/Python workflows based on data products from the NASA Earthdata Cloud. The case studies highlight the interplay of distributed data with scalable numerical strategies — "data-proximate computing" — implemented using scientific Python libraries like NumPy, Pandas, & Xarray. This tutorial — co-developed by 2i2c and MetaDocencia — constitutes part of NASA's Transform to Open Science (TOPS) initiative to reinforce principles of Open Science & reproducibility.
Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) are able to analyze this same tabular data, but perform orders-of-magnitude faster than pandas, all while using less memory. However, many of these systems only provide a SQL interface, which is far different from pandas’ dataframe interface and requires a rewrite of your analysis code.
This is where Ibis comes in. Ibis is a pure-Python open-source library that provides a dataframe interface to many popular databases and analytics tools (DuckDB, Polars, Snowflake, Spark, etc.). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL. No more pain rewriting pandas code when you run into performance issues; write your code once using Ibis and run it on any supported backend.
https://ibis-project.org/
https://github.com/ibis-project/ibis
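A minimal sketch of this workflow, assuming a hypothetical Parquet file with user_id and timestamp columns; the same expression runs unchanged on any supported backend.

```python
import ibis

con = ibis.duckdb.connect()              # in-process DuckDB backend
t = con.read_parquet("events.parquet")   # hypothetical file and columns

expr = (
    t.group_by("user_id")
     .aggregate(n_events=t.count(), last_seen=t.timestamp.max())
     .order_by(ibis.desc("n_events"))
)
df = expr.to_pandas()                    # Ibis generates and executes the SQL for you
```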
In place of the "TorchGeo: Advancing Earth Observation Through Machine Learning" tutorial, we are offering an open office hour session designed to provide support and assistance to attendees. Whether you're a first-time attendee looking for guidance, need help with Git, require technical support, or have questions about anything else, our SciPy tutorial chairs and other experts are here to help. This session is an excellent opportunity to get personalized assistance, resolve technical issues, and ask questions in an informal and supportive environment. Join us for this impromptu and interactive session to make the most of your conference experience!
Quarto is an innovative, open-source scientific and technical publishing system compatible with Jupyter Notebooks and plain text markdown documents. Quarto provides data scientists with a seamless way to publish their work in a high-quality format that is reproducible, accessible, and shareable. With Quarto, researchers can turn their Jupyter Notebooks and literate plain text markdown documents into professional-looking publications in various formats. This workshop will demonstrate how Quarto enables data scientists to turn their work products into professional, high-quality documents, slides, websites, scientific manuscripts, and other shareable artifacts.
Explore tsbootstrap and sktime in our 4-hour tutorial, focusing on enhancing time series forecasting and analysis. Discover how tsbootstrap's bootstrapping methods improve uncertainty quantification in time series data, integrating with sktime's forecasting models. Learn practical applications in various domains, boosting predictive accuracy and insights. This interactive session will provide hands-on experience with these tools, offering a deep dive into advanced techniques like probabilistic forecasting and model evaluation. Join us to expand your expertise in time series analysis, applying innovative methods to tackle real-world data challenges.
In this tutorial we will introduce GitHub Actions to scientists as a tool for lightweight automation of scientific data workflows. We will demonstrate that GitHub Actions are not just a tool for software testing, but can be used in various ways to improve the reproducibility and impact of scientific analysis. Through a sequence of examples, we will demonstrate some of GitHub Actions' applications to scientific workflows, such as scheduled deployment of algorithms to sensor streams, updating visualizations based on new data, processing large datasets, model versioning, and performance benchmarking. GitHub Actions can particularly empower Python scientific programmers who do not want to build fully fledged applications or set up complex computational infrastructure, but would like to increase the impact of their research. The goal is that participants will leave with their own ideas for how to integrate GitHub Actions into their own work.
Drone imagery is more widely available than ever before, allowing the public to capture ultra high-resolution Earth images with hobbyist drones. In this workshop, we will explore drone imagery with Python tools such as geopandas, OpenCV, rasterio, numpy, and shapely. Afterwards, we will assess urban green spaces, focusing on counting trees and estimating their role in capturing carbon to fight climate change. This practical exercise will not only enhance our understanding of urban ecology, but also highlight the importance of trees in urban planning and environmental sustainability.
Bokeh is a library for interactive data visualization. You can use it with Jupyter Notebooks or create standalone web applications, all using Python. This tutorial is a thorough guide to Bokeh and its most recent new features. We start with a basic line plot and, step-by-step, make our way to creating a dashboard web application with several interacting components. This tutorial will be helpful for scientists who are looking to level up their analysis and presentations, and tool developers interested in adding custom plotting functionality or dashboards.
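The basic line plot the tutorial starts from looks roughly like this (the numbers are placeholder data):

```python
from bokeh.plotting import figure, show

p = figure(title="A basic line plot", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)  # placeholder data
show(p)  # renders the interactive plot in a browser or notebook
```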
This tutorial is aimed at coders interested in understanding the building blocks of large language models (LLMs), how LLMs work, and how to code them from the ground up in PyTorch. We will kick off this tutorial with an introduction to LLMs, recent milestones, and their use cases. Then, we will code a small GPT-like LLM, including its data input pipeline, core architecture components, and pretraining code ourselves. After understanding how everything fits together and how to pretrain an LLM, we will learn how to load pretrained weights and finetune LLMs using open-source libraries.
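To give a flavor of the "core architecture components" step, here is a minimal single-head causal self-attention block in PyTorch; it is an illustrative sketch, not the exact code used in the tutorial.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Minimal single-head causal self-attention (illustrative only)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
        causal = torch.triu(torch.ones(x.shape[1], x.shape[1], dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))  # block attention to future tokens
        return self.out(torch.softmax(scores, dim=-1) @ v)

x = torch.randn(2, 8, 32)
print(CausalSelfAttention(32)(x).shape)                     # torch.Size([2, 8, 32])
```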
Despite its reputation for being slow, Python is the leading language of scientific computing, which generally needs large-scale (fast) computations. This is because most scientific problems can be split into "metadata bookkeeping" and "number crunching," where the latter is performed by array-oriented (vectorized) calls into precompiled routines.
This tutorial is an introduction to array-oriented programming. We'll focus on techniques that are equally useful in NumPy, Pandas, xarray, CuPy, Awkward Array, and other libraries, and we'll work in groups on three class projects: Conway's Game of Life, evaluating decision trees, and computations on ragged arrays.
GitHub repository: https://github.com/ekourlit/scipy2024-tutorial-thinking-in-arrays
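As a small example of the style the tutorial teaches, compare a Python for-loop with the equivalent single array-oriented call (illustrative only):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=1_000_000)

# Loop version: the interpreter handles every element, so this is slow
total = 0.0
for value in x:
    total += value * value

# Array-oriented version: one call into precompiled, vectorized code
total_fast = np.sum(x * x)
```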
Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. This hands-on tutorial focuses on intermediate and advanced workflows using complex real-world data. We encourage participants in this workshop to bring their own datasets, as we will dedicate ample time to applying tutorial concepts to data of interest!
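For participants who have not used Xarray before, a basic labeled-array workflow looks like this (using the small example dataset that ships with Xarray's tutorial module):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")   # small example dataset
monthly = ds["air"].resample(time="1MS").mean()    # label-aware temporal resampling

# Select the grid cell nearest a location by coordinate labels, not integer indices
monthly.sel(lat=40, lon=260, method="nearest").plot()
```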
PyVista is a general purpose 3D visualization library used by over 2,000 open source projects for the visualization of everything from computer aided engineering and geophysics to volcanoes and digital artwork.
PyVista exposes a Pythonic API to the Visualization Toolkit (VTK), providing tooling that is immediately usable without any prior knowledge of VTK. It is being built as the 3D equivalent of Matplotlib, with Jupyter plugins that enable visualization of 3D data using both server- and client-side rendering.
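A minimal PyVista example of this Matplotlib-like plotting style (the scalar field here is invented for illustration):

```python
import pyvista as pv

mesh = pv.Sphere()
mesh["elevation"] = mesh.points[:, 2]              # attach an illustrative scalar field
mesh.plot(scalars="elevation", cmap="viridis")     # interactive 3D rendering in one call
```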
Structured Query Language (or SQL for short) is a programming language to manage data in a database system and an essential part of any data engineer’s tool kit. In this tutorial, you will learn how to use SQL to create databases and tables, insert data into them, and use queries to extract, filter, join, and compute on data. We will use DuckDB, a new open source, embedded, in-process database system that combines cutting-edge database research with dataframe-inspired ease of use. DuckDB is only a pip install away (with zero dependencies) and runs right on your laptop. You will learn how to use DuckDB with your existing Python tools like Pandas, Polars, and Ibis to simplify and speed up your pipelines. Lastly, you will learn how to use SQL to create fast, interactive data visualizations, and how to teach your data to fly and share it via the cloud.
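For example, DuckDB can query a pandas DataFrame in place, with no loading step; the data below is invented for illustration:

```python
import duckdb
import pandas as pd

df = pd.DataFrame(
    {"species": ["duck", "goose", "duck"], "mass_g": [900, 3500, 1100]}
)

# DuckDB resolves `df` directly from the surrounding Python scope
result = duckdb.sql("""
    SELECT species, AVG(mass_g) AS avg_mass
    FROM df
    GROUP BY species
    ORDER BY avg_mass DESC
""").df()
```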
In research involving any kind of computer simulation, we often have to
execute many simulations that eventually become part of the final manuscript.
Automating these simulations and their post-processing brings significant
personal benefit by improving research output and productivity. Automation
makes it much easier to run large parameter sweeps and studies, and allows you
to focus on the important questions to ask rather than managing hundreds or
thousands of simulations manually. It takes the drudgery of data and file
management out of your hands, systematizes your research, and makes it possible
to incrementally improve and refine your work. An added benefit is that your
research also becomes much easier to reproduce.
Jupyter Widgets connect Python objects with web-based visualizations and UIs, enabling both programmatic and interactive manipulation of data and code. For example, lasso some points in a scatterplot visualization and access that selection in Python as a DataFrame.
anywidget makes it simple and enjoyable to bring these capabilities to your own Python classes, and it ensures easy installation and usage by end users in various environments. In this tutorial, you will create your own custom widgets with anywidget and learn the skills to be effective in extending your own Python classes with web-based superpowers.
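As a preview, a complete anywidget widget is a Python class with synced traitlets plus a small JavaScript module; this counter is modeled on the example in the anywidget documentation:

```python
import anywidget
import traitlets

class CounterWidget(anywidget.AnyWidget):
    # Front-end code: an ES module with a render function
    _esm = """
    function render({ model, el }) {
      const btn = document.createElement("button");
      btn.textContent = `count is ${model.get("count")}`;
      btn.addEventListener("click", () => {
        model.set("count", model.get("count") + 1);
        model.save_changes();                 // push the change back to Python
      });
      model.on("change:count", () => {
        btn.textContent = `count is ${model.get("count")}`;
      });
      el.appendChild(btn);
    }
    export default { render };
    """
    count = traitlets.Int(0).tag(sync=True)   # state shared between Python and the browser

CounterWidget()
```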
Cookiecutter is mainly known as a tool for software project templates. But its possible use cases are much more versatile: plain text and code, small building blocks and whole projects.
You can get started and build powerful templates without any programming - by using a CLI tool and editing text files. And if you're willing to throw some Python code and Jinja extensions into the mix, you can build pretty sophisticated and flexible automations.
The main goal of this workshop is to provide some inspiration: How do you detect candidates for automation in your workflow? Where can you improve speed and consistency, and free up some mental energy for the actual content of your task?
As we descend from the peak of the hype cycle around Large Language Models (LLMs), chat-based document inquiry systems have emerged as a high-value practical use case. Retrieval-Augmented Generation (RAG) is a technique to share relevant context and external information (retrieved from vector storage) to LLMs, thus making them more powerful and accurate.
In this hands-on tutorial, we’ll dive into RAG by creating a personal chat app that accurately answers questions about your selected documents. We’ll use a new OSS project called Ragna that provides a friendly Python and REST API, designed for this particular case. We’ll test the effectiveness of different LLMs and vector databases, including an offline LLM (i.e., local LLM) running on GPUs on the cloud machines provided to you. We'll then develop a web application that leverages the REST API, built with Panel, a powerful OSS Python application development framework.
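Ragna's own API is covered in the tutorial; as background, the retrieval step of RAG can be sketched in a library-agnostic way like this (the document chunks, model name, and final LLM call are placeholders):

```python
# Minimal retrieval sketch for RAG; the actual tutorial uses Ragna's APIs.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["Chunk one of a document...", "Chunk two...", "Chunk three..."]   # placeholders
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

query = "What method does the document describe?"
q_vec = model.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ q_vec                       # cosine similarity (vectors are normalized)
top_chunks = [docs[i] for i in np.argsort(scores)[::-1][:2]]

prompt = "Answer using only this context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
# `prompt` would then be sent to the LLM of your choice.
```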
This tutorial will show you how to use the Pandas, Dask, or Xarray APIs you already know to interactively explore and visualize your data, even if it is big, streaming, or multidimensional. Then just replace your expression arguments with widgets to get an instant web app that you can share as HTML+WASM or backed by a live Python server. These tools let you focus on your data rather than the API, and let you build linked, interactive, drill-down exploratory apps without running a web-technology software development project, and then share them without becoming an operations specialist.
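One way to realize this pattern with hvPlot and Panel looks like the sketch below; the file, column names, and widget are hypothetical and only meant to show the "replace an argument with a widget" idea.

```python
import pandas as pd
import hvplot.pandas  # noqa: F401  (adds the .hvplot accessor to pandas objects)
import panel as pn

df = pd.read_csv("measurements.csv", parse_dates=["date"])   # hypothetical data
window = pn.widgets.IntSlider(name="rolling window (days)", start=1, end=60, value=7)

def smoothed(window):
    return df.set_index("date")["value"].rolling(window).mean().hvplot.line()

# Binding the widget to the argument turns the plot into an instant web app
pn.Column(window, pn.bind(smoothed, window)).servable()
```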
Interactive widgets were introduced to the Jupyter ecosystem over 10 years ago. A number of progressively more powerful interactive widget packages have been developed since then, supporting the construction of sophisticated dashboards and interactives. This tutorial will describe a number of approaches to developing and managing complex web apps that are compatible with Jupyter widgets and promote scalable application development.
Creating code that can be shared and reused is the pinnacle of open science. But tools and skills to share your code can be tricky to learn. In this hands-on tutorial, you’ll learn how to turn your pure Python code into an installable Python module that can be shared with others. To get the most out of this tutorial, you should be familiar with writing Python code and functions, and with Python environments.
You will leave this tutorial understanding how to:
- Create code that can be installed into different environments
- Use Hatch as a workflow tool, making setup and installation of your code easier
- Use Hatch to publish your package to (test) PyPI
While most scientists aren't at the scale of black hole imaging research teams that analyze Petabytes of data every day, you can easily fall into a situation where your laptop doesn't have quite enough power to do the analytics you need.
In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on actual powerful machines on a public cloud provided by the presenters – starting from how the data is stored and read, to how it is processed and visualized.
Generative AI systems built upon large language models (LLMs) have shown great promise as tools that enable people to access information through natural conversation. Scientists can benefit from the breakthroughs these systems enable to create advanced tools that will help accelerate their research outcomes. This tutorial will cover: (1) the basics of language models, (2) setting up the environment for using open source LLMs without the expensive compute resources needed for training or fine-tuning, (3) learning a technique like Retrieval-Augmented Generation (RAG) to optimize the output of an LLM, and (4) building a “production-ready” app to demonstrate how researchers could turn disparate knowledge bases into special purpose AI-powered tools. The right audience for our tutorial is scientists and research engineers who want to use LLMs for their work.
Between telescopes and satellite cameras and MRI machines and microscopes, scientists are producing more images than they can realistically look at. They need specialized viewers for multi-dimensional images, and automated tools to help process those images into knowledge. In this tutorial, we will cover the fundamentals of algorithmic image analysis, starting with how to think of images as NumPy arrays, moving on to basic image filtering, and finishing with a complete workflow: segmenting a 3D image into regions and making measurements on those regions. At every step, we will visualize and understand our work using matplotlib and napari.
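The filtering-and-segmentation workflow can be previewed in a few lines of scikit-image (using a 2D sample image rather than the 3D data from the tutorial):

```python
import numpy as np
from skimage import data, filters, measure

image = data.coins()                                   # sample image bundled with scikit-image
smooth = filters.gaussian(image, sigma=2)              # basic image filtering
mask = smooth > filters.threshold_otsu(smooth)         # global threshold
labels = measure.label(mask)                           # segment into connected regions
props = measure.regionprops_table(labels, properties=("label", "area"))
print(len(np.unique(labels)) - 1, "regions found")
```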
Writing correct software is difficult, and even scientists don’t always get it right
[citation needed].
Hypothesis is a testing package that will search for counterexamples to your
assertions – so you can write tests that provide a high-level description of your
code or system, and let the computer attempt a Popperian falsification. If the
search fails, your code is (probably) OK… and if it succeeds, you have a minimal
failing input to debug.
Come along and learn the principles of property-based testing, how to use
Hypothesis, and how to use it to check scientific code – whether highly-polished
or quick-and-dirty!
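A property-based test with Hypothesis looks like this; instead of hand-picked inputs, Hypothesis generates and shrinks examples for you (the property here is just an illustration):

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    once = sorted(xs)
    assert sorted(once) == once        # sorting twice changes nothing
    assert len(once) == len(xs)        # no elements are lost or duplicated
```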
The United States Census Bureau publishes over 1,600 data sets via its APIs. These are useful across a myriad of fields in the social sciences. In this interactive tutorial, attendees will learn how to use open-source Python tools to discover, download, analyze, and generate maps of U.S. Census
data. The tutorial is full of practical examples and best practices to help participants avoid the tedium of data wrangling and concentrate on their research questions.
This hands-on tutorial will consider the full breadth and richness of data available from the U.S. Census. We will cover not only American Community Survey (ACS) and similarly well-known data sets, but also a number of data sets that are less well-known but nonetheless useful in a variety of research contexts.
The tutorial has no slides. Instead, it will be presented from a series of live Jupyter notebooks. After each lesson notebook is presented by the instructor, participants will be given a hands-on exercise to put what they just learned into practice. Essentially they will start with a research question and a blank notebook. Using what they just learned, they will then write the code to answer the question.
Lessons will start with the most basic queries and mapping and move through more advanced topics related to geographies, variables, groups and trees of related variables, and data set exploration.
After covering the concepts, the group as a whole will go through a complete end-to-end research example. Finally, individuals and small groups will have a chance to complete a series of short interactive exercises extending what they have learned and share the results with their peers.
All Python tooling used in the workshop is available as open-source software. Final versions of the notebooks used in the tutorial will also be made available via open-source.
Hosted by Streamlit in the Main Foyer of the Convention Center. Catch up with old friends or meet new fellow attendees! Food and drinks will be served.
There are many programming languages that we might choose for scientific computing, and we each bring a complex set of preferences and experiences to such a decision. There are significant barriers to learning about other programming languages outside our comfort zone, and seeing another person or community make a different choice can be baffling. In this talk, hear about the costs that arise from exploring or using multiple programming languages, what we can gain by being open to different languages, and how curiosity and interest in other programming languages supports sharing across communities. We’ll explore these three points with practical examples from software built for flexible storage and model deployment, as well as a brand new project for scientific computing.
Using machine learning to predict chemical properties and behavior is an important complement to traditional approaches to computation and simulation in chemistry. The ANAKIN-ME (ANI) methodology has been shown to produce generalized and transferable neural network potentials, trained on density functional theory (DFT) molecular energies, at a greatly reduced computational cost. The work presented here details an approach to generating new data in an active learning scheme in order to improve predictions in the regions of chemical space with high predictive uncertainty at the atom level.
Aviation accounts for 2% of global greenhouse gas emissions, and reliance on liquid petroleum-based fuels makes this sector challenging to decarbonize. We seek to accelerate the development of sustainable aviation fuels using an early-stage design tool with a data-driven approach. We developed our strategy using the Python-based optimization packages BoTorch and Ax, and also rely on Pandas. We will discuss how to down-select from many possible fuel components to a specified number of chemical species and identify which combinations are most promising for a novel sustainable aviation fuel. We will also present its integration in our open-source web tool supporting biofuel research.
pandas is one of the most commonly used data science libraries in Python, with a convenient set of APIs for data cleaning, preparation, analysis, and exploration. However, despite its widespread adoption, pandas suffers from severe memory and performance issues on even moderately sized datasets. Modin is an open-source project that serves as a fast, scalable drop-in replacement for pandas (https://github.com/modin-project/modin). By changing just a single line of code, Modin seamlessly speeds up pandas workflow on a laptop or in a cluster. Originally developed at UC Berkeley, Modin has been downloaded more than 17 million times and is used by leading data science teams across industries.
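The "single line of code" is the import; everything downstream stays ordinary pandas (file and column names below are placeholders):

```python
# import pandas as pd                 # before
import modin.pandas as pd             # after: the one-line change Modin asks for

df = pd.read_csv("big_file.csv")      # placeholder file
summary = df.groupby("category").agg({"value": "mean"})
```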
Agent-based models (ABMs) are powerful tools for understanding how people behave and interact. However, many ABMs are slow, cumbersome to use, or both. Here we introduce Starsim, an open-source, high-performance ABM specialized for modeling different diseases (such as HIV, STIs, and tuberculosis). Built on NumPy, Numba, and SciPy, Starsim's performance rivals ABMs implemented in Java or C++, but with the convenience provided by Python: specifically, the ability to quickly implement and refine new disease modules. Starsim can also be extended to other applications in which people interact on timescales from days to decades, including economics and social science.
Distributed systems are neat to demo, but hard to use in reality.
This talk goes through lessons learned running 100,000s of Dask clusters and 1,000,000,000s of Python functions for users in critical production settings across many companies and research groups.
We'll cover lessons learned like ...
- GIL Vigilance is Good
- Kubernetes is too heavyweight if all you want is lots of jobs
- ARM is underused
- Docker doesn't work well for data science folks
- Availability-Zones are key for spot/GPU availability
- Adaptive is underused (but hard)
- Most workloads are small
- Most workloads are fast
- Most users don't scale up properly
- Most people overestimate costs
These lessons will be motivated by tons of metadata collected and aggregated from real-world workloads.
Monte Carlo / Dynamic Code (MC/DC) is a performant and scalable Monte Carlo radiation transport simulation package with GPU and CPU support. It is written entirely in Python and uses Numba to accelerate Python code to CPU and GPU targets. This allows MC/DC to be a portable, easily installable, single-language source code ideal for rapid numerical methods exploration at scale. We will discuss the benefits and drawbacks of such a scheme and make comparisons to traditionally compiled codes as well as those written using other modern high-level languages (e.g., Julia).
The CALculation of PHAse Diagram (CALPHAD) method coupled with uncertainty quantification and propagation (UQ & UP) calculations is a viable tool to predict thermodynamic properties in a multicomponent region at different temperatures and compositions with a confidence interval. These types of calculations provide upper and lower bounds of thermochemical property predictions when choosing the chemistry of candidate salt mixtures and are therefore vital for molten salt reactor engineering applications. The present work studies the NaCl-KCl-MgCl2 salt mixture, which is of high interest for molten salt applications, with the aid of the ESPEI and PyCalphad open-source codes for UQ and UP calculations.
The representation, synthesis, modeling, and visualization of neighborhoods are a fundamental pursuit across a range of social sciences.
We present AstroPhot, a tool to accelerate the analysis of astronomical images. AstroPhot allows for simultaneously modelling images with galaxies and point sources in multi-band and time-domain data. In this talk I will discuss the benefits and challenges of using PyTorch (a differentiable and GPU-accelerated scientific Python library) to allow for fast development without sacrificing numerical performance. I will detail our development process as well as how we encourage users of all skill levels to engage with our documentation and tools.
In this talk, I will discuss how one can foster a culture of open source contributions at one's company. Based on my successes and failures as a data scientist working in the biotech space, I will describe two key ideas (fostering internal open source and articulating value to key senior leadership) as being on the critical path to generating buy-in within the organization.
In this talk, I will discuss learning, ethical, and legal issues when using large language models to supplement learning and practicing scientific computing. Engineering disciplines are becoming
increasingly dependent upon computational tools and resources. Writing scientific computing code has become increasingly easy with GitHub Copilot, ChatGPT, and other LLM tools. Problems can arise when
practitioners begin to use code without reviewing it and understanding why it was written that way. I will present my findings from incorporating ChatGPT into the wealth of learning resources, and how I discuss academic integrity in relation to U.S. copyright law and ethical responsibility. What constitutes "intellectual contribution," "independent work," and "plagiarism" when we reuse code from open source software and from LLMs?
MDAnalysis (https://www.mdanalysis.org) is one of the most widely used open-source Python libraries for molecular simulation analysis, with applications ranging from understanding the interaction of drugs with proteins to the design of novel materials. With over 200 contributors and 18 years of development, MDAnalysis has established a mature, stable API and a broad user community. Here we present the current status of the library’s capabilities as it approaches its next major release. We also detail ongoing work to address modern challenges in the ever-evolving landscape of molecular simulation, such as handling increasingly large simulation datasets and meeting the tenets of FAIR.
Discover the potential of multi-agent generative AI applications with AutoGen, a pioneering framework designed to tackle complex tasks requiring multi-step planning, reasoning, and action. In this talk, we will explore the fundamentals of multi-agent systems, learn how to build applications using AutoGen, and discuss the open challenges associated with this approach, such as control trade-offs, evaluation challenges, and privacy concerns.
With AutoGen's open-source platform and growing ecosystem, developers can harness the power of generative AI to create advanced AI assistants and interfaces for the digital world. This talk is ideal for those with a general understanding of generative AI and Python application development.
Johnson Matthey (JM) leads in sustainable technologies, employing advanced science to address global challenges in energy, chemicals, and automotive sectors. Our cutting-edge research and development (R&D) facilities include state-of-the-art characterization tools, handling diverse datasets like images, timeseries, 3D tomograms, spectra, and digital twins. With the rising demand for data-driven insights, Python has emerged as a vital tool in enhancing decision-making processes. We showcase our utilization of the open-source community to construct our data science research platform, marking a significant step forward in our innovation journey.
The ATLAS experiment at CERN explores vast amounts of physics data to answer the most fundamental questions of the Universe. The prevalence of Python in scientific computing motivated ATLAS to adopt it for its data analysis workflows while enhancing users' experience. This talk will describe to a broad audience how a large scientific collaboration leverages the power of the Scientific Python ecosystem to tackle domain-specific challenges and advance our understanding of the Cosmos. Through a simplified example of the renowned Higgs boson discovery, attendees will gain insights into the utilization of Python libraries to discriminate a signal immersed in noise, through tasks such as data cleaning, feature engineering, statistical interpretation, and visualization at scale.
GitHub repository for the talk: https://github.com/ekourlit/scipy2024-ATLAS-demo
Support for string data in NumPy has long been a sore spot for the community. At the beginning of 2023 I was given the task to solve that problem by writing a new UTF-8 variable-length string DType leveraging the new NumPy DType API. I will offer my personal narrative of how I accomplished that goal over the course of 2023 and offer my experience as a model for others to take on difficult projects in the scientific python ecosystem, offering tips for how to get help when needed and contribute productively to an established open source community.
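In NumPy 2.0 the resulting dtype is available as np.dtypes.StringDType; a quick illustration:

```python
import numpy as np

# Variable-length UTF-8 strings, stored efficiently rather than as fixed-width arrays
arr = np.array(["numpy", "variable-length", "strings"], dtype=np.dtypes.StringDType())
print(arr == "numpy")      # elementwise comparison works like any other dtype
print(np.sort(arr))        # sorting and other ufunc-style operations are supported
```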
From radio telescopes to proton accelerators, scientific instruments produce tremendous amounts of data at equally high rates. To handle this data deluge and to ensure the fidelity of the instruments’ observations, architects have historically written measurements to disk, enabling downstream scientists and researchers to build applications with pre-recorded files. The future of scientific computing is interactive and streaming; how many Nobel Prizes are hidden on a dusty hard drive that a scientist didn’t have time or resources to analyze? In this talk, NVIDIA and the SETI institute will present their joint work in building scalable, real time, high performance, and AI ready sensor processing pipelines at the Allen Telescope Array. Our goal is to provide all scientific computing developers with the tools and tips to connect high speed sensors to GPU compute and lower the time to scientific insights.
In the domain of data science, a significant number of questions are aimed at understanding and quantifying the effects of interventions, such as assessing the efficacy of a vaccine or the impact of price adjustments on the sales volume of a product. Traditional association-based machine learning methods, predominantly utilized for predictive analytics, prove inadequate for answering these causal questions from observational data, necessitating the use of causal inference methodologies. This talk aims to introduce the audience to the Directed Acyclic Graph (DAG) framework for causal inference. The presentation has two main objectives: firstly, to provide an insight into the types of questions where causal inference methods can be applied; and secondly, to demonstrate a walkthrough of causal analysis on a real dataset, highlighting the various steps of causal analysis and showcasing the use of the pgmpy package.
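As a flavor of the DAG workflow, pgmpy lets you encode causal assumptions as a graph and query adjustment sets; the variables here are a made-up pricing example, not from the talk.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.inference import CausalInference

# Hypothetical causal assumptions: season affects both price and sales; price affects sales
model = BayesianNetwork([("season", "price"), ("season", "sales"), ("price", "sales")])

ci = CausalInference(model)
# Which variables must be adjusted for to estimate the effect of price on sales?
print(ci.get_all_backdoor_adjustment_sets("price", "sales"))
```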
Dask is a library for distributed computing with Python that integrates tightly with pandas. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). The re-implementation of the DataFrame API addresses the pain points that users ran into. We will look at how Dask is a lot faster now, how it performs on benchmarks that it struggled with in the past, and how it compares to other tools like Spark, DuckDB, and Polars.
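For readers new to Dask, the DataFrame API mirrors pandas and defers execution until compute() is called (the path and columns below are hypothetical):

```python
import dask.dataframe as dd

df = dd.read_parquet("events/*.parquet")          # hypothetical partitioned dataset
out = df.groupby("user_id")["amount"].sum()       # lazily builds a task graph
result = out.compute()                            # executes in parallel, returns pandas
```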
We present mrfmsim, an open-source framework that facilitates the design, simulation, and signal validation of magnetic resonance force microscopy experiments. The mrfmsim framework uses directed acyclic graphs (DAGs) to model experiments and employs a plugin system that adds custom experiments and functionalities. Differing from common DAG-powered workflow packages, mrfmsim allows flexible customizations of experiments post-definition without rewriting the internal model, such as optimized looping. In the talk, we present the challenges in building simulation packages for experiments undergoing continuous development in a graduate research setting. We discuss the current one-off approach that led to error-prone code and how modularity, extendibility, and readability can speed up the development cycle.
Causal inference has traditionally been used in fields such as economics, health studies, and social sciences. In recent years, algorithms combining causal inference and machine learning have been a hot topic. Libraries like EconML and CausalML, for instance, are good Python tools that facilitate the easy execution of causal analysis in areas like economics, human behavior, and marketing. In this talk, I will explain key concepts of causal inference with machine learning, show practical examples, and offer some practical tips. Attendees will learn how to apply machine learning to causal analysis effectively, boosting their research and decision-making.
We are developing a modern open-source Python compiler called LPython
(https://lpython.org/) that can execute users' code interactively in Jupyter to
allow exploratory work (much like CPython, MATLAB, or Julia) as well as compile
to binaries, with the goal of running users' code on modern architectures such as
multi-core CPUs and GPUs, as well as unfamiliar, new architectures like GSI's APU,
which features programmable compute-in-memory. We aim to provide the best
possible performance for numerical, array-oriented code. The compiler itself is
written in C++ for robustness and speed.
The SciPy subpackage scipy.sparse is moving from its matrix API to an array API. This will allow shapes other than 2D and clean up the treatment of the multiplication operator *. This talk will start by describing the changes and their impacts. We will then discuss the process of revamping an API without breaking too much existing user code, the trade-offs between slow changes over many releases and faster, perhaps breaking changes, and whether to simply create a new package instead. The talk should be useful for users of scipy.sparse and also for packages considering a major API change.
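The practical difference between the two APIs is visible in a few lines (a sketch, assuming a recent SciPy where sparse arrays are available):

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.eye(3))   # legacy matrix API
a = sparse.csr_array(np.eye(3))    # new array API

m * m    # matrix API: * is matrix multiplication
a * a    # array API: * is elementwise multiplication
a @ a    # matrix multiplication is spelled @ in the array API
```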
In this talk we introduce NVIDIA Warp, an open-source Python framework designed for accelerated differentiable computing. Warp enhances Python functions with just-in-time (JIT) compilation, allowing for efficient execution on CPUs and GPUs. The talk’s focus is on Warp’s application in physics simulation, perception, robotics, and geometry processing, along with its capability to integrate with machine-learning frameworks like PyTorch and JAX. Participants will learn the basics of Warp, including its JIT compilation process and the runtime library that supports various spatial computing operations. These concepts will be illustrated with hands-on projects based on research from institutions like MIT and UCLA, providing practical experience in using Warp to address computational challenges. Targeted at academics, researchers, and professionals in computational fields, the course is designed to inspire attendees and equip them with the knowledge and skills to use Warp in their work, enhancing their projects with efficient spatial computing.
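A minimal Warp kernel, run here on the CPU, shows the JIT workflow the talk covers (the kernel itself is just an illustrative scaling operation):

```python
import warp as wp

wp.init()

@wp.kernel
def scale(x: wp.array(dtype=float), alpha: float):
    i = wp.tid()                  # one thread per array element
    x[i] = alpha * x[i]

x = wp.array([1.0, 2.0, 3.0], dtype=float, device="cpu")
wp.launch(scale, dim=x.shape[0], inputs=[x, 2.0], device="cpu")
print(x.numpy())                  # [2. 4. 6.]
```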
Impact charts, as implemented in the impactchart package,
make it easy to take a data set and visualize the impact of one variable
on another in ways that techniques like scatter plots and linear regression can't,
especially when there are other variables involved.
In this talk, we will introduce impact charts, demonstrate how they find easter-egg impacts
we embed in synthetic data, show how they can find hidden impacts in a real-world use case,
show how you can create your first impact chart with just a few lines of code,
and finally talk a bit about the interpretable machine learning techniques they are built upon.
Impact charts are primarily visual, so this talk will be too.
With datasets growing in both complexity and volume, the demand for more efficient data processing has never been higher. Pandas and NetworkX, the go-to Python libraries for tabular and graph data processing, are very popular for their ease of use and flexibility. However, they often struggle to keep pace with the demands of large-scale data analysis.
This talk introduces new open-source GPU accelerators from the NVIDIA RAPIDS project for Pandas and NetworkX, and will demonstrate how you can enable them for your workflows to experience massive speedups – up to 150x in pandas and 600x in NetworkX – without code changes.
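For the pandas accelerator, "without code changes" means loading an extension before your existing code runs; a sketch for a Jupyter session (file and column names are placeholders):

```python
%load_ext cudf.pandas        # enable the GPU accelerator before pandas is imported

import pandas as pd          # unchanged user code from here on

df = pd.read_parquet("transactions.parquet")          # placeholder file
top = df.groupby("merchant")["amount"].sum().nlargest(10)
```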
Speeding up Python code traditionally involves the use of Just-In-Time (JIT) or Ahead-Of-Time (AOT) compilation. There are tradeoffs to both approaches, however. As part of the Numba project's aim to create a compiler toolkit, the PIXIE project is being developed. It offers an extensible toolchain that consumes multiple languages and produces AOT-compiled binary extension modules. These PIXIE-based extension modules contain CPU-specific function dispatch for AOT use and also support something similar to Link-Time Optimization (LTO) for use in situations such as JIT compilation and/or cross-module optimization. PIXIE modules are easy to load and call from Python, and can be inlined into Numba JIT compilation, giving Python developers access to the benefits of both AOT and JIT.
Scientific software drives open research. However, developing and maintaining a Python package is a tricky endeavor. You need to navigate a thorny packaging ecosystem, often in an academic environment that doesn’t traditionally value software. pyOpenSci has learned that an inclusive community can be empowered to make Python packaging more accessible, and that constructive peer review supports maintainers in creating better software, while also providing academic credit. In this talk you’ll learn:
- How to build consensus around thorny topics like packaging.
- Where to find beginner-friendly packaging support.
- How constructive peer review can support better code.
- How to get involved with pyOpenSci.
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
The Poster session will be in the Ballroom from 6:00-7:00pm. Meet with the poster authors to ask questions and learn about the posters that will be on display throughout the main conference.
The Job Fair will be held concurrently in the Ballroom foyer with participating sponsors. Sponsor companies will be available to discuss current job opportunities.
Earth’s climate is chaotic and noisy. Finding usable signals amidst all of the noise can be challenging: be it predicting if it will rain, knowing which direction a hurricane will go, understanding the implications of melting Arctic ice, or detecting the impacts of humans on the earth’s surface. Here, I will demonstrate how explainable artificial intelligence (XAI) techniques can sift through vast amounts of climate data and push the bounds of scientific discovery: allowing scientists to ask “why?” but now with the power of machine learning.
Geospatial data is becoming more present in data workflows today, and plenty of Python tools allow us to work with it. In the past year, a new contender emerged: DuckDB introduced an extension for analyzing geospatial data. Everyone in the data world has been buzzing about DuckDB (~15k stars on GitHub), and now this duck quacks geospatial data too. But wait a minute, isn’t DuckDB all SQL? Yes, but fear not, Ibis has you covered! Ibis is a Python library that provides a dataframe-like interface, enabling users to write Python code to construct SQL expressions that can be executed on multiple backends, like DuckDB. In this talk, you will learn how to leverage the benefits of DuckDB geospatial while remaining in the Python ecosystem (yes, we will do a live demo). This is an introductory talk; everyone is welcome, and no previous experience with spatial databases or geospatial workflows is needed.
Many tools exist for large-scale data transfer (tens of terabytes or more), but they often don't match the needs of scientific data flows. In this talk, I'll explain how we built the 'librarian' framework with FastAPI, postgres, and Globus to ease this challenge. Designed for the Simons Observatory's petabyte-scale data transfer, I'll cover building reliable web services, flexible development with dependency injection, effective testing with pytest, and deployment using NERSC's Spin. I hope to demystify web and database programming for a scientific audience.
This talk discusses recent developments in open source computational economics, with a focus on the Econ-ARK project and dynamic stochastic optimization problems. Economics is often concerned with agents making choices across periods of time and interacting through a market. Historically, these problems have been solved using dynamic programming methods that are plagued by the curse of dimensionality. In practice, economic models were either dramatically simplified for tractability or solved to only rough approximation. Recent work has shown how deep learning can be used to solve these problems in a much more efficient way. Today, more models are computationally feasible, and we should expect general computing methods to continue to expand this horizon. Thus, what's needed is a portable way of representing economic models that is agnostic to solution methods. I'll present early-stage efforts to produce such a representation as a modeling language compatible with SymPy.
Accurate cell tracking is essential to various biological studies. In this talk, we present Ultrack, a novel Python package for cell tracking that considers a set of multiple segmentation hypotheses and picks the segments that are most consistent over time, making it less susceptible to mistakes when traditional segmentation fails.
The package supports various imaging modalities, from small 2D videos to terabyte-scale 3D time-lapses or multicolored datasets in any napari-compatible image format (e.g. tif, zarr, czi, etc.).
It is available at https://github.com/royerlab/ultrack
Image analysis is ubiquitous across many areas of biomedical research, resulting in terabytes of image data that must be hosted by both research institutions and data repositories for sharing and reproducibility. Common solutions for data hosting are required to improve interoperability and accessibility of bioimage data, while maintaining the flexibility to address each institution's unique requirements regarding sharing and infrastructure. OMERO is an open-source solution for image data management which can be customized and hosted by individual institutions. OMERO runs a server-based application with web browser and command line options for accessing and viewing image data, based on the widely used OME data model for microscopy data. Multiple OMERO deployments might be used to provide core delivery, facilitate internal research, or serve as a public data repository. The omero-cli-transfer package facilitates data transfer between these OMERO instances and provides new methods for importing datasets. Another open-source package, ezomero, improves the usability of OMERO in a research environment by providing easier access to OMERO's Python interface. Along with existing OMERO plugins built for other analysis and viewing software, this positions OMERO to be a hub for image storage, analysis, and sharing.
At the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC), we're doing the heavy lifting to make large geospatial datasets easily accessible from the cloud. No more downloading data. No more worrying about quirky metadata or missing dimensions. No more concatenating hundreds or thousands of files together. Just fire up your Jupyter notebook somewhere in Amazon Web Services (AWS)'s US-West-2 region, get some free temporary AWS credentials, open our Zarr stores, and start doing your science.
This talk provides an overview of the evolution of scientific software in Python, with a focus on the speaker's journey from creating Spyder, the Scientific Python IDE, to developing DataLab, a platform for signal and image processing. The speaker will share insights into the challenges and opportunities encountered in developing and maintaining these projects, and discuss how they have contributed to the scientific Python ecosystem. The talk will also explore the evolving needs of both the scientific and industrial communities during this period, and why desktop applications remain relevant in the era of web-based tools.
Talk is now available on YouTube
Join us for the Diversity Keynote Luncheon. Lunch will be provided in the foyer of the ballroom.
With the recent release of NumPy 2.0, the NumPy maintainers are looking for feedback on the release. Are there issues that are blocking your ability to migrate to using NumPy 2.0? Are there things you wish we had fixed or changed but didn't make it into the 2.0 release? Any changes you really don't like? Please come and let us know. Your feedback will directly influence how NumPy 2.1 looks and how we manage major releases in the future.
Generative AI has rapidly changed the landscape of computing and data education. Many learners are utilizing generative AI to assist in learning, so what should educators do to address the opportunities, risks, and potential of its use? The goal of this open discussion session is to bring together community members to unravel these pressing questions and improve learning outcomes in a variety of contexts: not only students learning in a classroom setting, but also ed-tech or generative AI designers developing new user experiences that aim to improve human capacities, and even scientists interested in learning best practices for communicating results to stakeholders or creating learning materials for colleagues. The open discussion will include ample opportunity for community members to network with each other and build connections after the conference.
Scientific environments and IDEs for Python have grown significantly in complexity in recent years, adding many features found in traditional IDEs such as debuggers, LSP support, plugin management, Git clients, testing integration, and more. This adds functionality that users are asking for, but also makes them more complicated for users without those needs. Is that additional complexity justified? Is it really serving users? That’s what we’d like to find out through this BoF.
Therefore, we’d like to invite the community to share their thoughts about the features they find most important for their research, both existing and to-be-developed, in open source scientific IDEs (such as JupyterLab and Spyder). Equally helpful would be feedback on what features haven’t been so helpful for users, and should be simplified, reworked or perhaps even removed. Finally, this would be an opportunity to ask questions of and interact directly with the IDE maintainers.
Water column sonar data collected by echosounders are essential for fisheries and marine ecosystem research, enabling the detection, classification, and quantification of fish and zooplankton from many different ocean observing platforms. However, the broad usage of these data has been hindered by the lack of modular software tools that allow flexible composition of data processing workflows that incorporate powerful analytical tools in the scientific Python ecosystem. We address this gap by developing Echostack, a suite of open-source Python software packages that leverage existing distributed computing and cloud-interfacing libraries to support intuitive and scalable data access, processing, and interpretation. These tools can be used individually or orchestrated together, which we demonstrate in example use cases for a fisheries acoustic-trawl survey.
Over the past year, there has been an increase in the number of libraries that leverage Rust and pyo3 to significantly increase performance. What's the catch? In this talk, we will discuss how the Data Science team at Capital One has been thinking about the power of Rust-backed Python and whether the benefits justify the complexity.
Image segmentation plays a crucial role in extracting valuable insights from geospatial data. While traditional segmentation methods can be laborious, deep learning offers automation but often demands extensive training and resources. Meta AI's Segment Anything Model (SAM) presents a compelling solution, segmenting objects without additional training. Our open-source Python package, samgeo, streamlines the use of SAM for geospatial data, offering various segmentation methods. Experiments confirm SAM's accuracy and efficiency as a powerful tool for remote sensing analysis. The samgeo package simplifies the adoption of automated image segmentation, facilitating better geospatial insights and decision-making across multiple domains.
Python is a popular language for data engineering workloads. In data engineering, developers must use a "Query Engine" to efficiently retrieve data, run data processing and then send data back out to a destination storage system or application.
The Python API for Apache Spark (PySpark) is currently the most popular framework that most data engineers use for data engineering at large scale. However, PySpark has a heavy dependency on the JVM which causes high friction during the development process.
In this talk, we discuss our work with the Daft Python Dataframe (www.getdaft.io) which is a distributed Python query engine built with Rust. We will perform a deep-dive into Daft architecture, and talk about how the strong synergy between Python and Rust enables key advantages for Daft to succeed as a query engine.
HyperSpy is a community-developed open-source library providing a
framework to facilitate interactive and reproducible analyses of
multidimensional datasets. Born out of the electron microscopy
scientific community and building on the extensive scientific Python
environment, HyperSpy provides tools to efficiently explore, manipulate,
and visualize complex datasets of arbitrary dimensionality, including
those larger than a system's memory. After 14 years of development,
HyperSpy recently celebrated its 2.0 version release. This presentation
will (re)introduce HyperSpy's features and community, with a focus on
recent efforts paring the library down to a domain-agnostic core and a
robust ecosystem of extensions providing specific scientific
functionality.
This talk illustrates how machine learning models to detect harmful algal blooms from satellite imagery can help water quality managers make informed decisions around public health warnings for lakes and reservoirs. Rooted in the development of the open source package CyFi, this talk includes insights around identifying when your model is getting the right answer for the wrong reasons, the upsides of using decision tree models with satellite imagery, and how to help non-technical users build confidence in machine learning models. The intended audience is those interested in using satellite imagery to monitor and respond to the world around us.
nanoarrow, a newly developed subproject of Apache Arrow, is squarely focused on unlocking connectivity among Python packages and the libraries they wrap using the features and rich type support of the Arrow columnar format. The vision of nanoarrow is that it should be trivial for a library to implement an Arrow-based interface: nanoarrow and its bindings provide tools to produce, consume, and transport tabular data between processes using the Arrow IPC format or between libraries using the Arrow C ABI. For Python maintainers this means less glue code that runs faster so that developers can focus on feature development.
Visualization plays a critical role in the analysis and decision making with data, yet the manner in which state-of-the-art visualization approaches are disseminated limit their adoption into modern analytical workflows. Jupyter Widgets bridge this gap between Python and interactive web interfaces, allowing for both programmatic and interactive manipulation of data and code. However, their development has historically been tedious and error-prone.
In this talk, you will learn about anywidget, a Python library that simplifies widgets, making their development more accessible, reliable, and enjoyable. I will showcase new visualization libraries built with anywidget and explain how its design enables environments beyond Jupyter to add support.
xCDAT (Xarray Climate Data Analysis Tools) is an open-source Python package that extends Xarray for climate data analysis on structured grids. This talk will cover a brief history of xCDAT, the value this package presents to the climate science community, and a general overview of key features with technical examples. xCDAT’s scope focuses on routine climate research analysis operations such as loading, averaging, and regridding data on structured grids (e.g., rectilinear, curvilinear). Some key features include temporal averaging, geospatial averaging, horizontal regridding, vertical regridding, and robust interpretation and handling of metadata and bounds for coordinates.
Non-Python codebases that use metaprogramming present significant challenges to cross-language development. These challenges are further compounded with the inclusion of GPU processing. While common methods of Python/GPU interoperation are covered by popular Python frameworks, these frameworks do not trivialize this use case.
In this talk, we will discuss the process of integrating a Python code for Monte Carlo particle transport (MCDC) with a template-based CUDA C++ framework which applies inversion of control (Harmonize). We will discuss managing the complexity of cross-language dependency injection, relevant implementation strategies, and pitfalls to avoid.
Do you find yourself copying your data into Word, just to make a table? If this is you (and this was us), it’s both frustrating and error-prone. And even though every aspect of a typical analysis can be scripted, it often turns out that the table-making part is elusive. We made Great Tables to enable complete publishing workflows. This Python package lets you easily generate publication-quality tables with the structure you want, many options for formatting values, and plenty of freedom for styling. Importantly, Great Tables closely integrates with Pandas and Polars DataFrames in order to handle a wide range of analyses.
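A small example of the API (the data frame here is invented for illustration):

```python
import polars as pl
from great_tables import GT

df = pl.DataFrame({
    "model": ["baseline", "tuned"],
    "accuracy": [0.912, 0.947],
    "runtime_s": [12.3, 48.1],
})

(
    GT(df)
    .tab_header(title="Model comparison")          # illustrative data and labels
    .fmt_percent(columns="accuracy", decimals=1)   # 91.2%, 94.7%
    .fmt_number(columns="runtime_s", decimals=1)
)
```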
Xarray-datatree [1] is a Python package that supports HDF (Hierarchical Data Format) files with hierarchical group structures by providing a tree-like data structure in xarray. When an HDF file is opened with Datatree, a DataTree object is created that contains all of the groups in the file. The tree-like structure allows each group to be accessed once the DataTree object is instantiated, eliminating the need for a user to step through each group and subgroup to reach observational data.
We will present our use case for Datatree in NASA’s Harmony Level 2 Subsetter (HL2SS). HL2SS provides variable and dimension subsetting for Earth observation data from different NASA data centers. To subset hierarchical datasets without Datatree, HL2SS flattens the entire data structure into a new file by copying all of the grouped and subgrouped variables into the root group, and the variable or dimension subset is performed on this new file. However, the flattened and subsetted file has to have the same hierarchical structure as the original file, so it is unflattened: its attributes are copied and the variables are grouped back to preserve the original group hierarchy. With the open_datatree() function, HL2SS can open datasets containing multiple groups at once with all of their group hierarchies preserved. This functionality has significant benefits for optimizing the HL2SS workflow, since it eliminates the need to flatten and unflatten grouped datasets.
[1] https://github.com/xarray-contrib/datatree
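A minimal sketch of the open_datatree() pattern described above; the file name and group path are hypothetical, and an xarray backend capable of reading the HDF file is assumed to be installed.

    from datatree import open_datatree

    # Open a hierarchical file; every group and subgroup is preserved in the tree
    dt = open_datatree("observations.h5")

    # Navigate to a nested group by path and work with its xarray Dataset directly
    group_ds = dt["instrument/geolocation"].ds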
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
Developed by the Ohio Supercomputer Center (OSC) and funded by the National Science Foundation, Open OnDemand (openondemand.org) is an open-source portal that enables web-based access to HPC services. Clients manage files and jobs, create and share apps, run GUI applications and connect via SSH, all from any device with a web browser.
Open OnDemand empowers students, researchers, and industry professionals with remote web access to supercomputers. From a client perspective, key features are that it requires zero installation (since it runs entirely in a browser), is easy to use (via a simple interface), and is compatible with any device (even a mobile phone or tablet). From a system administrator perspective, key features are that it provides a low barrier to entry for users of all skill levels, is open source with a large community behind it, and is configurable and flexible for users’ unique needs.
A BOF for people who would like to be involved in the proceedings, particularly as supplemental last-minute paper reviewers for this year, and for people interested in proceedings, MyST, Jupyter Book, and computational publishing.
With the increasing size of data sets and computational tasks, cloud-based resources such as databases, GPU computation, data processing pipelines, and hosted Jupyter tools have become critical for scientific development. Projects that do not start at massive scale, though, face the challenge of moving from local developer machines or clusters to fully scaled cloud-based resources. This can be difficult because it requires a skill set quite different from what is typically developed in scientific educational programs, so researchers must either learn it themselves or lobby for administrative support to help design and deploy cloud infrastructure. In this BoF, we hope to give folks who have not worked with cloud resources a chance to ask questions, and to give those who have incorporated cloud resources into their workflows a chance to share advice.
Scientific Python Ecosystem Coordination (SPEC) documents (https://scientific-python.org/specs/) provide operational guidelines for projects in the scientific Python ecosystem. SPECs are similar to project-specific guidelines (like PEPs, NEPs, SLEPs, and SKIPs), but are opt-in, have a broader scope, and target all (or most) projects in the scientific Python ecosystem. Come hear more about what we are working on and planning. Better yet, come share your ideas for improving the ecosystem!
I will tell the story of how the statistical challenges in the search for the Higgs boson and exotic new physics at the Large Hadron Collider led to new approaches to collaborative, open science. The story centers around computational and sociological challenges where software and cyberinfrastructure play a key role. I will highlight a few important changes in perspective that were critical for progress including embracing declarative specifications, pivoting from reproducibility to reuse, and the abstraction that led to the field of simulation-based inference.
Google Earth Engine's new data extraction interfaces seamlessly transfer geospatial data into familiar Python formats provided by NumPy, Pandas, GeoPandas, and Xarray. This integration empowers you to harness Earth Engine's vast data catalog and compute power directly within your preferred Python workflows. For example, the Xee library leverages Xarray's lazy evaluation and Dask to streamline the extraction and analysis of Earth Engine data, offering a more Pythonic alternative to traditional image exports. Earth Engine's new data extraction interfaces unlock fresh geospatial analysis potential by leveraging the unique strengths of both the scientific Python ecosystem and Earth Engine.
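For instance, the Xee backend plugs into xarray's open_dataset; the sketch below mirrors usage shown in the Xee README (authentication and project setup are assumed, and the parameters may need adjusting for your data).

    import ee
    import xarray as xr

    ee.Initialize()  # assumes prior Earth Engine authentication

    # Lazily open an Earth Engine collection as an xarray Dataset via the 'ee' engine
    ds = xr.open_dataset(
        "ee://ECMWF/ERA5_LAND/HOURLY",
        engine="ee",
        crs="EPSG:4326",
        scale=0.25,
    )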
While Python excels at prototyping and iterating quickly, it’s not always performant enough for whole-genome scale data processing. Flyte, an open-source Python-based workflow orchestrator, presents an excellent way to tie together the myriad tools required to run bioinformatics workflows. Flyte is a k8s native orchestrator, meaning all dependencies are captured and versioned in container images. It also allows you to define custom types in Python representing genomic datasets, enabling a powerful way to enforce compatibility across tools. Computational biologists, or any scientists processing data with a heterogeneous toolset, stand to benefit from a common orchestration layer that is opinionated yet flexible.
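A minimal sketch of what this looks like with flytekit's task and workflow decorators; the alignment step is a hypothetical placeholder for a real bioinformatics tool.

    from flytekit import task, workflow

    @task
    def align_reads(fastq_path: str) -> str:
        # Placeholder: in practice this task would run an aligner inside a
        # versioned container image and return the path to the resulting BAM file
        return fastq_path.replace(".fastq", ".bam")

    @workflow
    def genomics_pipeline(fastq_path: str) -> str:
        return align_reads(fastq_path=fastq_path)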
Temporal data is ubiquitous in data science and plays a vital role in machine learning pipelines and business decisions. Preprocessing temporal data using generic data tools can be tedious, lead to inefficient computation, and be prone to errors.
Temporian is an open-source library for safe, simple, and efficient preprocessing and feature engineering of temporal data. It supports common temporal data types, including non-uniformly sampled, multivariate, multi-index, and multi-source data. Temporian favors interactive development in notebooks and integration with other machine learning tools, and can run at scale using distributed computing.
This talk, aimed at data scientists and machine learning practitioners, will showcase Temporian’s key features along with its powerful API, and demonstrate its advantages over generic data preprocessing libraries for handling temporal data.
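To give a flavor of the API, here is a hedged sketch based on Temporian's documented event_set constructor and moving-window operators; the data is made up.

    import temporian as tp

    # A small, non-uniformly sampled series of sales events
    sales = tp.event_set(
        timestamps=["2024-01-01", "2024-01-02", "2024-01-05"],
        features={"amount": [100.0, 250.0, 80.0]},
    )

    # Moving 7-day sum of the "amount" feature
    weekly_sales = sales.moving_sum(tp.duration.days(7))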
Lonboard is a new Python library for geospatial vector data visualization that can be 50x faster than existing alternatives like ipyleaflet or pydeck. This talk will explain why this library is so fast, how it integrates into existing workflows, and planned future improvements.
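As a hedged sketch of the intended workflow, lonboard's high-level viz helper renders a GeoDataFrame on an interactive map (the file name is hypothetical):

    import geopandas as gpd
    from lonboard import viz

    # Load a vector dataset and display it as a GPU-rendered map in the notebook
    gdf = gpd.read_file("buildings.geojson")
    viz(gdf)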
Traditional time series analysis techniques have found success in a variety of data mining tasks. However, they often require years of experience to master, and straightforward, easy-to-use analysis tools have been lacking. We address these needs with STUMPY, a scientific Python library that implements a novel yet intuitive approach for discovering patterns, anomalies, and other insights from any time series data. This presentation will cover the background needed to follow the live interactive demo, requires no prior experience, and introduces a simple, powerful, and scalable time series analysis package that will complement your current toolset.
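As a small illustration of the core workflow (synthetic data, arbitrary window size):

    import numpy as np
    import stumpy

    ts = np.random.rand(10_000)        # synthetic time series
    mp = stumpy.stump(ts, m=50)        # matrix profile with a 50-point window

    # The smallest profile value marks the location of the best-conserved motif
    motif_idx = int(np.argmin(mp[:, 0]))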
Discover how scikit-build-core revolutionizes Python extension building with its seamless integration of CMake and Python packaging standards. Learn about its enhanced features for cross-compilation, multi-platform support, and simplified configuration, which enable writing binary extensions with pybind11, Nanobind, Fortran, Cython, C++, and more. Dive into the transition from the classic scikit-build to the robust scikit-build-core and explore its potential to streamline package distribution across various environments.
To effectively share scientific results, we must blend narrative text and code to create polished, interactive output. Quarto is an open-source scientific publishing system to help you communicate with others through code. Quartodoc is a Python package that generates function references within Quarto websites. Together, these tools create beautiful documentation that is reproducible, accessible, and easily editable.
This talk will include examples of Quarto in action, from simple blogs to expansive Python package documentation, including WebAssembly-powered live examples. Listeners will walk away knowing how to build Quarto websites, when to use quartodoc, and how these tools create better documentation.
A Data Warehouse (DW) is a powerful tool to manage your scientific data, training data, logs, or any other type of relational data. Most Data Warehouses are cloud-based and built to scale to petabyte workflows, but might not be optimal for smaller workloads that need a fast iteration cycle. Likewise, a collection of CSV files and Python scripts can become painful to share and maintain. This is where DuckDB comes in! DuckDB is a fast, in-process database that you can run on your laptop, supports a rich SQL dialect, and can be pushed to the cloud with just a single line of code. In this talk, we’ll show you how to bootstrap a Data Warehouse on your laptop using open source, including ETL (extract-transform-load) data pipelines, dashboard visualization, and sharing via the cloud.
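A minimal sketch of the local bootstrap step (the file and table names are hypothetical):

    import duckdb

    # A persistent, single-file database on your laptop
    con = duckdb.connect("warehouse.duckdb")

    # Extract/load: ingest a CSV directly into a table
    con.sql("CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales.csv')")

    # Transform/query with plain SQL
    con.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()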
Model Share AI (AIMS) is an easy-to-use Python library designed to streamline collaborative ML model development, model provenance tracking, and model deployment, as well as a host of other functions aiming to maximize the real-world impact of ML research. AIMS features collaborative project spaces, allowing users to analyze and compare their models in a standardized fashion. Model performance and various model metadata are automatically captured to facilitate provenance tracking and allow users to learn from and build on previous submissions. Additionally, AIMS allows users to deploy ML models built in Scikit-Learn, TensorFlow Keras, PyTorch, and ONNX into live REST APIs and automatically generated web apps with minimal code. The ability to deploy models with minimal effort and to make them accessible to non-technical end-users through web apps has the potential to make ML research more applicable to real-world challenges.
This talk will lay out the current database / data landscape as it relates to the SciPy stack, and explore how Ibis (an open-source, pure-Python dataframe interface library) can help decouple interfaces from engines, to improve both performance and portability. We'll examine other solutions for interacting with SQL from Python and discuss some of their strengths and weaknesses.
Interactive visualizations are invaluable tools for building intuition and supporting rapid exploration of datasets and models. Numerous libraries in Python support interactivity, and workflows that combine Jupyter and IPyWidgets in particular make it straightforward to build data analysis tools on the fly. However, the field is missing the ability to arbitrarily overlay widgets and plots on top of others to support more flexible details-on-demand techniques. This work discusses some limitations of the base IPyWidgets library, explains the benefits of IPyVuetify and how it addresses these limitations, and finally presents a new open-source solution that builds on IPyVuetify to provide easily integrated widget overlays in Jupyter.
Vast amounts of information of interest to cyber defense organizations come in the form of unstructured data: from host-based telemetry and malware binaries to phishing emails and network packet sequences. All of this data is extremely challenging to analyze. In recent years there have been huge advances in the methodology for converting unstructured media into vectors. However, leveraging such techniques for cyber defense data remains a challenge.
Imposing structure on unstructured data allows us to leverage powerful data science and machine learning tools. Structure can be imposed in multiple ways, but vector space representations, with a meaningful distance measure, have proven to be one of the most fruitful.
In this talk, we will demonstrate a number of techniques for embedding cyber defense data into vector spaces. We will then discuss how to leverage manifold learning techniques, clustering, and interactive data visualization to broaden our understanding of the data and enrich it with expert feedback.
At the Tutte Institute for Mathematics and Computing (TIMC), we believe in the importance of reproducibility and in making research techniques accessible to the broader cyber defense community. To that end, this talk will leverage several open source libraries and techniques that we have developed at TIMC: Vectorizers, UMAP, HDBSCAN, ThisNotThat and DataMapPlot.
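As a hedged sketch of the general embed-then-cluster pattern (the input vectors below are random stand-ins for real embeddings of, say, phishing emails or malware features):

    import numpy as np
    import umap
    import hdbscan

    X = np.random.rand(1000, 256)  # stand-in vector representations

    # Reduce to 2D with manifold learning, then cluster densely populated regions
    embedding = umap.UMAP(n_components=2, metric="cosine").fit_transform(X)
    labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(embedding)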
ITK-Wasm combines the Insight Toolkit (ITK) and WebAssembly to enable high-performance spatial analysis across programming languages and hardware architectures.
ITK-Wasm Python packages work in a web browser via Pyodide but also in system-level environments. We describe how ITK-Wasm bridges WebAssembly with Scientific Python through simple fundamental Python and NumPy-based data structures and Pythonic function interfaces. These interfaces can be accelerated through GPU implementations when available.
We discuss how ITK-Wasm's integration of the WebAssembly Component Model launches Scientific Python into a new world of interoperability, enabling the creation of accessible and sustainable multi-language projects that are easily distributed anywhere.
In this talk, I will present LlamaBot, a Pythonic and modular set of components to build command line and backend tools that leverage large language models (LLMs). During this talk, I will showcase the core design philosophy, internal architecture and dependencies, and live demo command-line applications built using LlamaBot that use both open source and API-access-only LLMs. Finally, I will conclude with a roadmap for LlamaBot development, and an invitation to contribute and shape its development during the Sprints.
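To set expectations for the demo, here is a hedged sketch assuming LlamaBot's SimpleBot interface; the prompts are illustrative.

    from llamabot import SimpleBot

    # A bot primed with a system prompt; calling it sends a user message to the LLM
    bot = SimpleBot("You are a helpful assistant for scientific Python questions.")
    response = bot("How do I compute a rolling mean in pandas?")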
Pooch is a Python library that can download and locally cache files from the web without hassle. Novices can use it to simply download files in one line of code and focus on the data. Package maintainers can use it to provide sample datasets to their users, in examples and tutorials, as libraries like SciPy, scikit-image, napari and MetPy do. During this talk, we'll show you how you can use the different features that Pooch offers and also how you can extend its capabilities by writing your own downloaders or post-processors.
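The one-liner looks roughly like this (the URL is a placeholder):

    import pooch

    # Download on first use, cache locally, and reuse the cached copy afterwards
    fname = pooch.retrieve(
        url="https://example.org/data/sample.csv",  # hypothetical URL
        known_hash=None,  # set to the file's SHA256 digest to enable verification
    )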
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
Come join the BoF to do a practice run on contributing to a GitHub project. We will walk through how to open a Pull Request for a bugfix, using the workflow that most libraries participating in the weekend sprints use (hosted by the sprint chairs).
This BOF will serve as a forum for maintainers and users of scientific Python packages to discuss support for thread-based parallelism in the free-threaded build of CPython. Replacing multiprocessing-based approaches to parallelism in many APIs with Python threads is very attractive, but adding thread safety to libraries after the fact may be challenging. Users can share workflows they want to tackle with the free-threaded interpreter, and maintainers can seek advice and share their experience so far adding support.
In the open-source community, the security of software packages is a critical concern since it constitutes a significant portion of the global digital infrastructure. This BoF session will focus on the supply chain security of open-source software in scientific computing. We aim to bring together maintainers and contributors of scientific Python packages to discuss current security practices, identify common vulnerabilities, and explore tools and strategies to enhance the security of the ecosystem. Join us to share your experiences, challenges, and ideas on fortifying our open-source projects against potential threats and ensuring the integrity of scientific software.
If you have interest in NumPy, SciPy, Signal Processing, Simulation, DataFrames, or Graph Analysis, we'd love to hear what performance you're seeing and how you're measuring it. We've been working to accelerate your favorite packages on GPUs.
This BoF session will explore the essential data required to build a robust and comprehensive map of the open source science landscape. The discussion will center around the types of data needed, the challenges in collecting and curating this data, and the potential insights and benefits that such a map can provide. Participants will engage in a discussion on how to effectively gather and use data to illuminate the dynamics, challenges, and opportunities within open source and open science ecosystems.
Key Discussion Points:
- Identifying necessary data types, for example: for people, pull requests, hours committed, and number of upvotes on issues; for projects, number of direct dependencies, number of citations, etc.
- Challenges in data collection and curation.
- Methodologies for ensuring data accuracy and comprehensiveness.
- Insights gained from mapping large datasets.
- The impact of a comprehensive map on the open source and open science communities.
- Future directions and potential improvements for data collection and mapping.
The context of this BoF will be set by a brief demonstration of The Map of Open Source Science (https://opensource.science).
Come share your ideas for next year's SciPy. Participants will have an opportunity to sign up to be on next year's organizing committee.