SciPy 2023
This tutorial is an introduction to Pydantic, a library for data validation and settings management using Python type annotations. Using a semi-realistic ML and/or scientific software pipeline scenario, we demonstrate how Pydantic can be used to support type validations for scientific data structures, APIs, and configuration systems. We show how the use of Pydantic in scientific and ML software leads to a more pleasant user experience as well as more robust and easier-to-maintain code. Basic knowledge of Python type annotations, class definitions, and data structures will be helpful for beginners but is not required.
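A minimal sketch of the kind of validation Pydantic enables (the model and field names here are illustrative, not taken from the tutorial materials):

```python
from pydantic import BaseModel, ValidationError

class Measurement(BaseModel):
    sensor_id: str
    temperature_c: float
    tags: list[str] = []

m = Measurement(sensor_id="A1", temperature_c="21.5")  # "21.5" is coerced to float
print(m.temperature_c)  # 21.5

try:
    Measurement(sensor_id="A2", temperature_c="warm")   # cannot be coerced
except ValidationError as exc:
    print(exc)
```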
In this tutorial, attendees will learn hands-on how to optimize the trajectory of a self-landing rocket in a real-time simulated setting using CVXPY, a Python-embedded modeling language for convex optimization. We integrate the optimization with the Kerbal Space Program, to showcase a complete landing mission without human intervention, ideally in one piece. CVXPY allows solving complex problems declaratively, letting convex optimization find an optimal way of meeting target conditions with respect to an objective function. After solving the initial problem, attendees will use a selection of advanced CVXPY features while making the example gradually more realistic.
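To give a flavor of the declarative style, here is a hedged toy CVXPY problem (a small constrained least-squares fit, not the tutorial's rocket-landing formulation):

```python
import cvxpy as cp
import numpy as np

# Toy problem: nonnegative least squares with a budget constraint
A = np.random.randn(20, 5)
b = np.random.randn(20)
x = cp.Variable(5)

objective = cp.Minimize(cp.sum_squares(A @ x - b))
constraints = [x >= 0, cp.sum(x) <= 1]
problem = cp.Problem(objective, constraints)
problem.solve()

print("status:", problem.status)
print("optimal x:", x.value)
```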
One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how do you move machine learning projects from prototype and experiment to production as a repeatable process. In this workshop, we present an introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows.
Enjoy a gentle introduction to Python for folks who are completely new to it and may not have much experience programming. Learn how to write Python while practicing loops, if’s, functions, and usage of Python’s built-in features in a series of fun, interactive exercises inside Jupyter Notebooks. By the end you’ll be ready to write your own basic Python -- but most importantly, I want you to learn the form and vocabulary of Python so that you can understand Python documentation, interpret code written by others, and get the most out of other SciPy tutorials.
Communicating scientific data often relies on making comparisons between multiple datasets.
Join the Matplotlib team to learn about creating multi-axis figures to display such data side-by-side.
This intermediate level tutorial will cover a variety of tools for making multi-axis figures.
Of particular focus will be the subplot_mosaic and the layout engines: tight, constrained, and compressed.
This tutorial will emphasize the use of Matplotlib's Object Oriented (OO) API and why that is generally recommended over the pyplot (plt) API.
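For example, a minimal subplot_mosaic figure using the constrained layout engine might look like this (the data and panel names are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
rng = np.random.default_rng(0)

# Named axes via subplot_mosaic, laid out by the constrained layout engine
fig, axs = plt.subplot_mosaic(
    [["signal", "signal"],
     ["hist", "scatter"]],
    layout="constrained",
)
axs["signal"].plot(x, np.sin(x))
axs["hist"].hist(rng.normal(size=500), bins=30)
axs["scatter"].scatter(np.cos(x), np.sin(2 * x), s=5)
fig.suptitle("Side-by-side comparison with subplot_mosaic")
plt.show()
```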
Privacy guarantees are the most crucial requirement when analysing sensitive data. However, data anonymisation techniques alone do not always provide complete privacy protection; moreover, machine learning models can also be exploited to leak sensitive data when attacked and no counter-measures are applied. Privacy-preserving machine learning (PPML) methods hold the promise to overcome these issues, allowing machine learning models to be trained with full privacy guarantees. In this tutorial we will explore several methods for privacy-preserving data analysis, and how these techniques can be used to safely train ML models without actually seeing the data.
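As one illustrative building block from this space (not necessarily one the tutorial covers), a simple epsilon-differentially-private mean can be computed with the Laplace mechanism using only NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=10, sigma=0.5, size=1_000)  # stand-in for sensitive values

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism (epsilon-DP)."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)  # max change from altering one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

print("true mean:   ", incomes.mean())
print("private mean:", dp_mean(incomes, lower=0, upper=200_000, epsilon=1.0))
```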
Between telescopes and satellite cameras and MRI machines and microscopes, scientists are producing more images than they can realistically look at. They need specialized viewers for multi-dimensional images, and automated tools to help process those images into knowledge. In this tutorial, we will cover the fundamentals of algorithmic image analysis, starting with how to think of images as NumPy arrays, moving on to basic image filtering, and finishing with a complete workflow: segmenting a 3D image into regions and making measurements on those regions. At every step, we will visualize and understand our work using matplotlib and napari.
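A compressed sketch of that workflow, using scikit-image's bundled sample data (the filter and threshold choices are illustrative):

```python
from skimage import data, filters, measure

image = data.coins()                                  # a grayscale image is a NumPy array
smoothed = filters.gaussian(image, sigma=2)           # basic image filtering
mask = smoothed > filters.threshold_otsu(smoothed)    # threshold into foreground/background
labels = measure.label(mask)                          # segment connected regions

# Make measurements on each segmented region
for region in measure.regionprops(labels):
    if region.area > 100:
        print(region.label, region.area, region.centroid)
```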
PyVista is a general purpose 3D visualization library used by over 1,400 open source projects to visualize everything from computer aided engineering and geophysics to volcanoes and digital artwork.
PyVista exposes a Pythonic API to the Visualization Toolkit (VTK) to provide tooling that is immediately usable without any prior knowledge of VTK and is being built as the 3D equivalent of Matplotlib, with plugins to Jupyter to enable visualization of 3D data using both server- and client-side rendering.
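A minimal example of that Pythonic API, plotting a built-in mesh colored by an illustrative scalar field:

```python
import pyvista as pv

# Create a simple mesh and plot it -- no VTK knowledge required
mesh = pv.Sphere()
mesh["elevation"] = mesh.points[:, 2]      # attach a scalar field to color by

plotter = pv.Plotter()
plotter.add_mesh(mesh, scalars="elevation", show_edges=True)
plotter.show()
```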
Already familiar with ipywidgets, but ready to take your skills to the next level? In this tutorial we walk through what it takes to transform an exploratory Jupyter Notebook into a mature web application. Web apps can be a valuable product of collaboration between researchers and software developers, and the packages used in this tutorial were selected to support this relationship, starting with using JupyterLab as an integrated development environment. Attendees will learn how to design and document a scientific web application that accommodates increasing complexity, but is also inheritable by the researchers who maintain them in the long run.
This tutorial session is intended to give attendees a gentle introduction to applying causal thinking and causal inference to data using python. Causal data analysis is very common in many academic domains (e.g. in social psychology, epidemiology, macroeconomics, etc) as well as in industry (all of the largest Silicon Valley tech companies employ teams of scientists who answer business questions purely with causal inference methods). The tutorial will involve a combination of presentations with open Q&A and hands-on exercises contained in Google Colab notebooks.
NumPy provides Python with a powerful array processing library and an elegant syntax that is well suited to expressing computational algorithms clearly and efficiently. We'll introduce basic array syntax and array indexing, review some of the available mathematical functions in NumPy, and discuss how to write your own routines.
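A few of the basics the tutorial covers, sketched briefly:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)      # 3x4 array of 0..11

# Basic indexing and slicing
print(a[0, :])        # first row
print(a[:, 1])        # second column
print(a[a > 5])       # boolean indexing

# Vectorized math instead of explicit loops
x = np.linspace(0.0, 1.0, 5)
print(np.sqrt(x) + 2 * x)

# A simple routine of your own, written with array operations
def rms(values):
    return np.sqrt(np.mean(np.square(values)))

print(rms(a))
```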
Visual Studio Code (VS Code) is a free code editor that runs on Windows, Linux, macOS and in your browser. This tutorial aims at Python programmers of all levels who are already using VS Code or are interested in doing so, and will take them from zero (installing VS Code) to a production setup for Python development. We will cover starter topics, such as customizing the UI and extensions, using code autocomplete, code navigation, debugging, and Jupyter Notebooks. We will also go into advanced use cases, such as remote development, pair programming via Live Share, Dev containers, GitHub Codespaces & more.
We will kick off this tutorial with an introduction to deep learning and highlight its primary strengths and use cases compared to traditional machine learning. In recent years, PyTorch has emerged as the most widely used deep learning library for research. However, a lot has changed regarding how we train neural networks these days. After getting a firm grasp of the PyTorch API, you will learn how to train deep neural networks using various multi-GPU training paradigms. We will also fine-tune large language models (transformers) and deploy them to the cloud.
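For orientation, a minimal single-device PyTorch training loop is sketched below (synthetic data and an illustrative model; the tutorial builds from this kind of loop toward multi-GPU paradigms):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic classification data wrapped in a DataLoader
X = torch.randn(256, 20)
y = torch.randint(0, 2, (256,))
loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(X, y), batch_size=32)

for epoch in range(3):
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```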
Machine learning (ML) pipelines involve a variety of computationally intensive stages. As state-of-the-art models and systems demand more compute, there is an urgent need for adaptable tools to scale ML workloads. This idea drove the creation of Ray—an open source, distributed ML compute framework that not only powers systems like ChatGPT but also pushes theoretical computing benchmarks. Ray AIR is especially useful for parallelizing ML workloads such as pre-processing images, model training and finetuning, and batch inference. In this tutorial, participants will learn about AIR’s composable APIs through hands-on coding exercises.
This tutorial is an introduction to cloud-based geospatial analysis with Earth Engine and the geemap Python package. We will cover the basics of Earth Engine data types and how to visualize, analyze, and export Earth Engine data in a Jupyter environment using geemap. We will also demonstrate how to develop and deploy interactive Earth Engine web apps. Throughout the session, practical examples and hands-on exercises will be provided to enhance learning. The attendees should have a basic understanding of Python and Jupyter Notebooks. Familiarity with Earth science and geospatial datasets is not required, but will be useful.
While most scientists aren't at the scale of black hole imaging research teams that analyze Petabytes of data every day, you can easily fall into a situation where your laptop doesn't have quite enough power to do the analytics you need.
In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on actual powerful machines on a public cloud provided by the presenters – starting from how the data is stored and read, to how it is processed and visualized.
This tutorial introduces Keras, a powerful deep learning library and demonstrates how to enable generative models using Keras. The first part delves into the Keras training pipeline and extended modules. The second part explores image generative models using stable diffusion, with live coding examples to generate novel images and teach the model new concepts. Finally, you'll explore language generative models, including GPT and BART, with a live coding example that demonstrates how to enable these models. By the end of this tutorial, you'll have a solid understanding of how to harness Keras to create powerful AI applications.
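As a warm-up for the training-pipeline portion, a minimal Keras fit on synthetic data (not one of the generative models discussed above):

```python
import numpy as np
from tensorflow import keras

# Synthetic binary classification data
x = np.random.rand(1000, 16).astype("float32")
y = (x.sum(axis=1) > 8).astype("int32")

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(16,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=3, batch_size=64, validation_split=0.2)
```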
Pandas can be tricky, and there is a lot of bad advice floating around. This tutorial will cut through some of the biggest issues I've seen with Pandas code after working with the library for a while and writing three books on it.
We will discuss:
- Proper types
- Chaining
- Aggregation
- Debugging
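A small sketch of what those four themes can look like together in a method chain (the column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Austin", "Austin", "Denver", "Denver"],
    "temp_f": ["101", "99", "88", "90"],
    "date": ["2023-07-10", "2023-07-11", "2023-07-10", "2023-07-11"],
})

result = (
    df
    .assign(
        temp_f=lambda d: d["temp_f"].astype("float64"),          # proper types
        date=lambda d: pd.to_datetime(d["date"]),
        temp_c=lambda d: (d["temp_f"] - 32) * 5 / 9,              # chaining
    )
    .groupby("city")                                              # aggregation
    .agg(mean_temp_c=("temp_c", "mean"), n_days=("date", "count"))
    # .pipe(lambda d: d.head())  # drop a line like this into the chain when debugging
)
print(result)
```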
In this workshop, we will introduce Numba, a JIT compiler designed to speed up numerical calculations. To many people Numba feels like a mystery: it sounds like magic, but how does it work? Under what conditions does it work? Because of this, new users find it hard to get started and face a steep learning curve. This workshop will provide all the knowledge you need to make Numba work for you.
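For a taste of what Numba does, a hedged sketch: decorating a loop-heavy numerical function with njit compiles it to machine code on the first call.

```python
import numpy as np
from numba import njit

@njit  # compiled on first call; works on numeric NumPy code, not arbitrary Python
def pairwise_dist(points):
    n = points.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            d = 0.0
            for k in range(points.shape[1]):
                diff = points[i, k] - points[j, k]
                d += diff * diff
            out[i, j] = np.sqrt(d)
    return out

pts = np.random.rand(500, 3)
pairwise_dist(pts)   # first call includes compilation time
pairwise_dist(pts)   # subsequent calls run at compiled speed
```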
Despite its reputation for being slow, Python is the leading language of scientific computing, which generally needs large-scale (fast) computations. This is because most scientific problems can be split into "metadata bookkeeping" and "number crunching," where the latter is performed by array-oriented (vectorized) calls into precompiled routines.
This tutorial is an introduction to array-oriented programming. We'll focus on techniques that are equally useful in NumPy, Pandas, xarray, CuPy, Awkward Array, and other libraries, and we'll work in groups on three class projects: Conway's Game of Life, evaluating decision trees, and computations on ragged arrays.
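As a flavor of the array-oriented style, here is one possible fully vectorized Game of Life step written with NumPy (one of many ways to do it, not necessarily the class-project solution):

```python
import numpy as np

def life_step(grid):
    """One Game of Life step using array operations only (no Python loops)."""
    # Count the 8 neighbors by summing shifted copies of the grid (wrapping edges)
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    return (neighbors == 3) | (grid & (neighbors == 2))

rng = np.random.default_rng(0)
grid = rng.integers(0, 2, size=(32, 32)).astype(bool)
for _ in range(10):
    grid = life_step(grid)
```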
This tutorial will show you how to use the Pandas or Xarray APIs you already know to interactively explore and visualize your data even if it is big, streaming, or multidimensional. Then just replace your expression arguments with widgets to get a web app that you can share as HTML+WASM or backed by a live Python server. These tools let you focus on your data rather than the API, and let you build linked, interactive drill-down exploratory apps without having to run a web-technology software development project, which you can then share without becoming an operations specialist.
One of the biggest challenges for data scientists and machine learning engineers alike is the friction caused by the iteration cycle between prototyping and production. It’s not enough to deploy a working model to a serving app. The iterative process itself needs to be a tight feedback loop between experimentation, data and model refinement, deploying to production, and dealing with data drift. In this tutorial, attendees will learn how to unify the common tools in the Python Data/ML scientific stack into a single orchestration plane using Flyte so that you can reduce the friction between prototyping and production.
Dask is a Python library for scaling and parallelizing Python code. It provides familiar, high-level interfaces to extend the SciPy ecosystem to larger-than-memory or distributed environments, as well as lower-level interfaces for parallelizing custom algorithms. In this tutorial, we’ll cover advanced features of Dask like applying custom operations to Dask DataFrames and arrays, debugging computations, diagnosing performance issues, and more. Attendees should walk away with a deeper understanding of Dask’s internals, an introduction to more advanced features, and ideas of how they can apply these features effectively to their own workloads.
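One of those advanced features, applying a custom per-partition operation, might be sketched like this (synthetic data; real workloads would load from disk or cloud storage):

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Build a small Dask DataFrame from pandas for illustration
pdf = pd.DataFrame({
    "x": np.random.rand(100_000),
    "group": np.random.randint(0, 10, 100_000),
})
ddf = dd.from_pandas(pdf, npartitions=8)

# Apply a custom operation to each partition
def zscore(part):
    return part.assign(x_z=(part["x"] - part["x"].mean()) / part["x"].std())

result = ddf.map_partitions(zscore)
print(result.groupby("group")["x_z"].mean().compute())
```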
Bokeh is a library for interactive data visualization. You can use it with Jupyter Notebooks or create standalone web applications, all using Python. This tutorial is a complete guide to Bokeh, where we start with a basic line plot and step-by-step make our way to creating a dashboard with several interacting components. This tutorial will be helpful for scientists who are looking to level-up their analysis and presentations, and tool developers interested in adding custom plotting functionally or dashboards.
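The starting point of that progression, a basic interactive line plot, looks roughly like this:

```python
import numpy as np
from bokeh.plotting import figure, show

x = np.linspace(0, 4 * np.pi, 200)

# A basic plot with interactive tools; the tutorial builds this up into a dashboard
p = figure(title="A basic Bokeh line plot", x_axis_label="x", y_axis_label="sin(x)",
           tools="pan,wheel_zoom,box_zoom,reset,hover")
p.line(x, np.sin(x), line_width=2, legend_label="sin(x)")
p.circle(x[::10], np.sin(x[::10]), size=6, legend_label="samples")

show(p)  # opens in a browser; use output_notebook() inside Jupyter
```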
We love Python, but maybe not enough to commit to an entire programming language. What if we could understand the fundamentals and begin working with real-time data in a single session? Actionable Python scripts and an understanding of the frameworks might be enough of a springboard for larger exploration projects.
Resampling and Monte Carlo statistical techniques are surprisingly intuitive, and they are often more flexible and accurate than their better-known analytical counterparts. In this tutorial, participants will develop their intuitive understanding of frequentist statistics and apply it using three functions in scipy.stats - monte_carlo_test, permutation_test, and bootstrap - to dramatically expand the statistical analyses they can perform with the SciPy library.
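A brief hedged sketch of two of those functions on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=30)
y = rng.normal(loc=0.5, scale=1.0, size=30)

# Bootstrap confidence interval for the mean of x
boot = stats.bootstrap((x,), np.mean, confidence_level=0.95)
print(boot.confidence_interval)

# Permutation test for a difference in means between x and y
def mean_diff(a, b, axis):
    return np.mean(a, axis=axis) - np.mean(b, axis=axis)

perm = stats.permutation_test((x, y), mean_diff, vectorized=True, n_resamples=9999)
print(perm.pvalue)
```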
SymPy is a Python library for symbolic mathematics. This tutorial will introduce SymPy to a beginner audience. It will cover an introduction to symbolic computing, basic operations, simplification, calculus, matrices, advanced expression manipulation, code generation, and selected advanced topics. The tutorial does not have any prerequisites beyond knowledge of Python and basic freshman level mathematics. It will be presented with Jupyter notebooks with regular exercises for the attendees. After attending this tutorial, attendees will be able to start using SymPy to solve their own problems.
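A few of the basic operations the tutorial introduces:

```python
import sympy as sp

x = sp.symbols("x")

expr = sp.sin(x) ** 2 + sp.cos(x) ** 2
print(sp.simplify(expr))                      # 1

f = sp.exp(-x**2)
print(sp.diff(f, x))                          # -2*x*exp(-x**2)
print(sp.integrate(f, (x, -sp.oo, sp.oo)))    # sqrt(pi)

M = sp.Matrix([[1, 2], [3, 4]])
print(M.inv())
print(sp.solve(sp.Eq(x**2 - 4, 0), x))        # [-2, 2]
```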
Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. This tutorial will introduce data scientists already familiar with Xarray to more intermediate and advanced topics, such as applying functions in SciPy/NumPy with no Xarray equivalent, advanced indexing concepts, and wrapping other array types in the scientific Python ecosystem.
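As one example of the "functions with no Xarray equivalent" theme, xarray.apply_ufunc can wrap a SciPy reduction so that it respects labeled dimensions (the dataset below is synthetic):

```python
import numpy as np
import xarray as xr
from scipy import stats

# Toy dataset: a field varying over time and space
da = xr.DataArray(
    np.random.rand(100, 4, 5),
    dims=("time", "lat", "lon"),
    coords={"time": np.arange(100)},
    name="temperature",
)

# Reduce along "time" at every (lat, lon) point with a SciPy function
skewness = xr.apply_ufunc(
    stats.skew,
    da,
    input_core_dims=[["time"]],   # core dim is moved to the last axis
    kwargs={"axis": -1},
)
print(skewness)
```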
SciPy Welcome Reception hosted by Enthought. Tuesday, July 11, 6:30-8:30 at Enthought HQ, 200 W Cesar Chavez, Austin. Meet fellow attendees! Food and drinks served!
Walk, get a ride, or take the bus with CapMetro!
Michael Droettboom is a Principal Software Engineering Manager at Microsoft where he leads the CPython Performance Engineering Team. That team contributes directly to the upstream CPython project, and recently helped make Python 3.11 up to 60% faster than 3.10.
Michael has been contributing to open source for over 25 years: he is the former lead maintainer of matplotlib, a major contributor to astropy, and he is the original author of Pyodide and airspeed velocity. His work has supported such diverse applications as the Hubble and James Webb Space Telescopes, the Firefox web browser, infrared retinal imaging, and optical sheet music recognition.
N-dimensional datasets are common in many scientific fields, and quickly accessing subsets of these datasets is critical for an efficient exploration experience. Blosc2 is a compression and format library that recently added support for multidimensional datasets. Compression is crucial in effectively dealing with sparse datasets as the zeroed parts can be almost entirely suppressed, while the non-zero parts can still be stored in smaller sizes than their uncompressed counterparts. Moreover, the new double data partition in Blosc2 reduces the need for decompressing unnecessary data, which allows for top-class slicing speed.
While the NumPy C API lets developers write C that builds or evaluates arrays, just writing C is often not enough to outperform NumPy. NumPy's usage of Single Instruction Multiple Data routines, as well as multi-source compiling, provide optimizations that are impossible to beat with simple C. This presentation offers principles to help determine if an array-processing routine, implemented as a C-extension, might outperform NumPy called from Python. A C-extension implementing a narrow use case of the np.nonzero() routine will be studied as an example.
Research on animal acoustic communication is being revolutionized by deep learning. In this talk we present vak, a framework that allows researchers in this area to easily benchmark deep neural network models and apply them to their own data. We'll demonstrate how research groups are using vak through examples with TweetyNet, a model that automates annotation of birdsong by segmenting spectrograms. Then we'll show how adopting Lightning as a backend in version 1.0 has allowed us to incorporate more models and features, building on the foundation we put in place with help from the scientific Python stack.
The NASA Atmosphere SIPS, located at the University of Wisconsin, is responsible for producing operational cloud and aerosol scientific products from satellite observations. With decades of satellite observations, new scientific algorithms are employing Machine Learning (ML) methods to improve processing efficiencies and scientific analyses. In preparation for future developments, we are working with NASA Atmospheric Science Teams to understand ML requirements and assist in developing new tools that will benefit both the Science Teams and the broader Open-Source Science community. This talk will step through a ML methodology being used to identify cloud types and severe aerosols.
Numerical Python libraries can run computations on many CPU cores with various parallel interfaces. When we simultaneously use multiple levels of parallelism, it may result in oversubscription and degraded performance. This talk explores the programming interfaces used to control parallelism exposed by libraries such as NumPy, SciPy, and scikit-learn. We will learn about parallel primitives used in these libraries, such as OpenMP and Python's multiprocessing module. We will see how to control parallelism in these libraries to avoid oversubscription. Finally, we will look at the overall landscape for configuring parallelism and highlight paths for improving the user experience.
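One concrete knob of this kind can be sketched with the threadpoolctl package (assuming NumPy is linked against a BLAS implementation that exposes a thread pool):

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# Inspect the native thread pools (BLAS, OpenMP) loaded in this process
for pool in threadpool_info():
    print(pool["user_api"], pool.get("internal_api"), pool["num_threads"])

a = np.random.rand(2000, 2000)

# Temporarily cap BLAS threads, e.g. when an outer loop is already parallelized,
# to avoid oversubscribing the CPU cores
with threadpool_limits(limits=1, user_api="blas"):
    a @ a
```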
An important problem in genomics is identifying the proteins that bind to DNA. Although many methods attempt to learn DNA motifs underlying protein binding as position-weight matrices (PWMs), these PWMs cannot faithfully represent real biology. For instance, a static PWM cannot describe a zinc-finger protein whose fingers can optionally include one-nucleotide spacing. TF-MoDISco is a framework for extracting motifs using attribution scores from a machine-learning model. The learned motifs and syntax overcome many of the limitations of PWMs. I will describe the TF-MoDISco algorithm and showcase its efficient re-implementation, tfmodisco-lite.
In this contribution we will present the first stable version v1.0 of Gammapy, an openly developed Python package for gamma-ray astronomy. Gammapy provides methods for the analysis of astronomical gamma-ray data, such as measurement of spectra, images and light curves. By relying on standardized data formats and a joint likelihood framework, it allows astronomers to combine data from multiple instruments and constrain underlying astrophysical emission processes across large parts of the electromagnetic spectrum. Finally we will share lessons learned during the journey towards version v1.0 for an openly developed scientific Python package.
Data quality remains a core concern for practitioners of machine learning, data science, and data engineering, and in recent years specialized packages have emerged to validate and monitor data and models. However, as the open source community iterates on data frameworks – notably, highly performant entrants such as Polars – data quality libraries need to catch up to support them. In this talk, you will learn about Pandera and its journey from being a pandas-only validator to a generic tool for testing arbitrary data containers so that it can provide a standardized way of creating data validation tools.
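A minimal pandas-based Pandera schema, to give a sense of the validation style (the column names and checks are illustrative):

```python
import pandas as pd
import pandera as pa

# Declare expectations about a DataFrame and validate it at runtime
schema = pa.DataFrameSchema({
    "sample_id": pa.Column(str, unique=True),
    "concentration": pa.Column(float, pa.Check.ge(0)),
    "batch": pa.Column(int, pa.Check.isin([1, 2, 3])),
})

df = pd.DataFrame({
    "sample_id": ["s1", "s2"],
    "concentration": [0.3, 1.2],
    "batch": [1, 2],
})
validated = schema.validate(df)   # raises a SchemaError on violations
print(validated)
```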
The Scientific Python project aims to better coordinate the ecosystem and grow the community. Come hear about our recent progress and our plans for the coming year!
Qiskit is an open-source SDK for quantum computers, enabling developers to work with these powerful machines using a familiar Python interface. First released in 2017, Qiskit has become the most popular package for quantum computing (Unitary Fund, 2022), with a thriving open-source community. As Qiskit has grown and changed, so has our approach to nurturing our community. This talk will share important lessons we’ve learnt over the years, including practical tips you can apply to your own projects. Whether you’re just starting in open source or already manage an established community, this talk is for you!
Our recent work implements a domain-specific language called Disciplined Saddle Programming (DSP) in Python. It is available at https://github.com/cvxgrp/dsp. DSP allows specifying convex-concave saddle, or minimax problems, a class of convex optimization problems commonly used in game theory, machine learning, and finance. One application for DSP is to naturally describe and solve robust optimization problems. We show numerous examples of these problems, including robust regressions and economic applications. However, this only represents a fraction of problems solvable with DSP, and we want to engage with the SciPy community to hear about further potential applications.
In the era of exascale computing, storage and analysis of large scale data have become more important and difficult. We present libyt, an open source C++ library that allows researchers to analyze and visualize data using yt or other Python packages in parallel during simulation runtime. We describe the methods for reading adaptive mesh refinement data structures, handling data transfer between Python and the simulation with minimal memory overhead, and conducting analysis with no additional time penalty using the Python C API and NumPy C API. We demonstrate how it solves the problem in astrophysical simulations and increases disk usage efficiency.
Emukit is an open-source package for uncertainty quantification in Python. It provides various Bayesian methods, such as optimization, experimental design and quadrature, in a flexible unified way that leverages their commonalities. In the talk we will explain how and why Emukit was built, what are its strengths and weaknesses, how it is used today and in what scenarios one might find it useful.
Over the last decade, the SunPy ecosystem, a Python solar data analysis environment, has evolved organically to serve the needs of scientists analyzing solar physics data, mostly on desktop and laptop computers. However, modern solar observatories are producing data volumes in the tens of petabytes, necessitating parallelized and out-of-core computation. HelioCloud is a cloud computing environment tailored for heliophysics research and colocated with many terabytes of solar physics data. In this talk, we will show how the SunPy ecosystem, combined with Dask on HelioCloud, can be used to efficiently process high-resolution solar data.
Open source researchers are increasingly challenged while navigating the data which open source communities inherently create when working in the open. While mining software repositories for insights into open source practices isn't new, moving beyond code analysis into ecosystems-level research does not have a clear path. This talk will outline the current ethical, legal, and policy challenges community leaders, as well as researchers in academia and industry face and the ambiguous areas decision makers should be aware of.
DuckDB is a novel analytical data management system. DuckDB supports complex queries, has no external dependencies, and is deeply integrated into the Python ecosystem. Because DuckDB runs in the same process, no serialization or socket communication has to occur, making data transfer virtually instantaneous. For example, DuckDB can directly query Pandas data frames faster than Pandas itself. In our talk, we will describe the user values of DuckDB, and how it can be used to improve their day-to-day lives through automatic parallelization, efficient operators and out-of-core operations.
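For example, querying an in-process pandas DataFrame directly (the DataFrame and query are illustrative):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.3, 5.1, 5.9],
})

# DuckDB can query the DataFrame in place -- no copies or serialization needed
result = duckdb.sql("""
    SELECT species, avg(petal_length) AS mean_petal
    FROM df
    GROUP BY species
    ORDER BY mean_petal DESC
""").df()
print(result)
```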
The Open Force Field (OpenFF) initiative was formed to build a new generation of force fields for molecular dynamics (MD) simulations using modern data-driven techniques. Openness is one of our fundamental founding principles, and everything we produce is released openly and accessibly so that the community can validate, modify, or extend our work. Here we introduce some flagship packages in our ecosystem and the advances they have enabled in force field science and MD workflows. These include fitting custom functional forms, exploring the addition of off-site charges, and using neural networks to assign charges to protein-ligand systems.
TensorFlow Probability is a powerful library for statistical analysis in Python. Using TensorFlow Probability’s implementation of Bayesian methods, modelers can incorporate prior information and obtain parameter estimates and a quantified degree of belief in the results. Resampling methods like Markov Chain Monte Carlo can also be used to perform Bayesian analysis. As an alternative, we show how to use numerical optimization to estimate model parameters, and then show how numerical differentiation can be used to get a quantified degree of belief. How to perform simulation in Python to corroborate our results is also demonstrated.
The NIST Interatomic Potentials Repository project has developed Python APIs to support user interactions with the repository data hosted at https://potentials.nist.gov. The associated code is layered, starting with generic methods for JSON/XML-based data and databases, and building up to user-friendly interfaces specific to the repository. This design allows for basic users to easily explore the data and expert users to perform more complicated operations or create custom APIs for other databases. The repository APIs help users find and compare interatomic models, set up simulations, perform high throughput calculations, and access the high throughput results.
GraphBLAS solves graph problems using sparse linear algebra. We are using it to build graphblas-algorithms, a fast backend to NetworkX. python-graphblas is faster and more capable than scipy.sparse for both graph algorithms and sparse operations. If you have sparse data or graph workloads that you want to scale and make faster, then this is for you. Come learn what makes GraphBLAS special--and fast!--and how to use it effectively.
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
The Poster session will be in the Zlotnik Ballroom from 6:00-7:00pm.
The Job Fair will be held concurrently in the Zlotnik foyer with participating sponsors. Sponsor companies will be available to discuss current job opportunities.
At Scholz Garten, 1607 San Jacinto Blvd. Join your fellow community members from 7:00-9:00. Walking distance from AT&T Center. Venue, food, and drinks sponsored by OSSci.
Angela Pisco is the head of computational biology at insitro. She is passionate about extracting meaningful information from biomedical datasets and using it to improve disease understanding and drug development. She studied Biomedical Engineering (BSc and MSc) and holds a PhD in Systems Biology. Her PhD work became the foundation of a new direction of thinking on why cancer develops resistance to chemotherapy, which is the major reason for treatment failure. In her postdoctoral work, she investigated the mechanisms of cellular differentiation in the skin. She developed a 3D computational model that recapitulated the observed changes in the mouse skin connective tissue and dermis during development. The combination of the mathematical analysis with experimental data led to a new understanding of how distinct fibroblast subpopulations become activated, proliferate, and deposit matrix proteins during wound healing. Before moving to insitro, she led the Data Science platform at CZ Biohub. There she made significant contributions to whole-organism cell atlas projects, including the first whole mouse cell atlas, the first aging cell atlas, and Tabula Sapiens, one of the first Human Cell Atlas drafts (The Tabula Sapiens Consortium, Science 2022). She is also a founder and core member of Open Problems in Single Cell (openproblems.bio), a community effort to improve multimodal data analysis by both generating gold standard datasets and benchmarking metrics and infrastructure.
Google Earth Engine is a cloud-computing platform with a multi-petabyte catalog of satellite imagery and geospatial datasets. Built upon the Earth Engine Python API and open-source mapping libraries, geemap enables Earth Engine users to interactively manipulate, analyze, and visualize geospatial big data in a Jupyter environment. This presentation introduces Earth Engine and highlights the key features of geemap for interactive mapping and geospatial analysis with Earth Engine. Attendees can utilize geemap to create satellite timelapse animations for any location on Earth within 60 seconds. Additional resources will be provided to the attendees to learn more about geemap.
Your users have entrusted their data to you. But what happens when a government law enforcement agency demands you share that data with them? We will demystify the process of receiving and responding to law enforcement’s demands for data. We demonstrate how designing around privacy can limit what needs to be shared. To make subpoenas less scary, we break them down as a technical process and share the protections we implemented at Mozilla. If you want to understand the real-world impact of your approaches to privacy, this talk is for you.
This talk will discuss how Numba was used to accelerate MCViNE, a software environment for building and running digital twins of neutron experiments via Monte Carlo ray tracing. Numba is an open-source JIT compiler for Python using LLVM to generate efficient machine code for CPUs and GPUs with NVIDIA CUDA. Python and Numba were used to create a GPU accelerated version of MCViNE utilizing an extensible object-oriented design that has achieved a speedup of up to 1000x over the CPU. The performance gain with Numba enables more sophisticated data analysis and impacts neutron scattering science and instrument design.
Recharging ground aquifers is an urgent task for improving groundwater sustainability in California. Geophysical data can provide a capability to image the subsurface where the major data gap lies. However, neither data nor analytic tools required to derive subsurface information is readily accessible. We present an interactive web application that utilizes a public database, GIS capabilities and directly integrates Jupyter Notebooks and Python packages from researchers to guide recharge site location. Our demonstration showcases how this technology can contribute to improving groundwater recharge in California and how integrating the research knowledge directly into a web application can increase the impact.
Jupyter-scatter is a scalable, interactive, and interlinked scatter plot widget for exploring datasets with up to several million data points. It focuses on data-driven visual encodings and offers two-way pan+zoom and lasso interactions. Beyond a single instance, jupyter-scatter can compose multiple scatter plots and synchronize their views and selections. Moreover, points can be connected by spline-interpolated lines. Thanks to the underlying WebGL rendering engine, spatial and color changes are smoothly transitioned. Finally, the API integrates seamlessly with Pandas DataFrames and offers functional methods that group properties by type to ease accessibility and readability.
Diversity, equity and inclusion initiatives often start with measurement - what do our communities look like today and how can we track progress against our goals? However, data collected through APIs, web scraping, surveys, interviews, inference etc. have the potential to expose more details about an individual than they were expecting, especially when aggregated across platforms and shared in public forums. This talk will discuss tactics, opportunities and challenges when collecting sensitive data in and around open source communities, while aligning with policies and regulations, respecting the right to anonymity and ensuring the safety of all members of the community.
DataFrame libraries in general, pandas and Dask specifically, are moving towards a better integration with PyArrow. This has many benefits, like improved performance and a reduced memory footprint. We want to connect with users to discuss how PyArrow can improve DataFrame libraries and what they expect out of PyArrow support. This can include things like improved performance, more consistent behavior or better interoperability with other libraries.
Imaging communities across different fields (microscopy, remote sensing, medical imaging, materials science) are currently all moving to develop cloud- and chunking friendly imaging formats based around Zarr. This includes OME-NGFF and GeoZarr. Although pretty much everyone has agreed on Zarr as the container for the image data, there is ongoing discussion about how best to store metadata about the images. In this BoF we'll discuss ways to encode where each pixel in the image is located in space (and time!) (and frequency!), and whether it's possible to harmonize this encoding across the different formats and standards. A relevant issue is https://github.com/ome/ngff/issues/174.
Scientific Python Ecosystem Coordination (SPEC) documents (https://scientific-python.org/specs/) provide operational guidelines for projects in the scientific Python ecosystem. SPECs are similar to project-specific guidelines (like PEPs, NEPs, SLEPs, and SKIPs), but are opt-in, have a broader scope, and target all (or most) projects in the scientific Python ecosystem. Come hear more about what we are working on and planning. Better yet, come share your ideas for improving the ecosystem!
So you’ve written the perfect notebook, but do you know who can read it? As a notebook author you have great stories, code, and visualizations filling your work, but how often do you consider accessibility? Jupyter notebooks seem like they are for everyone, but how a notebook gets written can greatly impact how usable it is for people with disabilities. We’ve curated authoring-focused best practices for notebook content to help your notebooks be more inclusive and reach a wider audience.
UXarray aims to provide xarray-styled functionality for unstructured grid datasets. UXarray offers support for loading and representing unstructured grids by utilizing existing Xarray functionality paired with new routines written specifically for operating on unstructured grids. In this talk, we will present the current capabilities of the library: reading and writing unstructured grids, reading datasets and performing basic grid operations, and integration operations, along with details on the need to speed up computations and the speedups obtained by using Numba and Python indexing. We will also demonstrate the use of this library for visualization of unstructured grids.
Aviation comprises 2-3% of global CO2 emissions. Transitioning to cleaner, more sustainable aviation fuels can reduce its environmental impacts. To help accelerate sustainable aviation fuel development, we trained machine learning models to predict fundamental properties of biofuel blends using Fourier transform infrared (FTIR) spectra. We leveraged TPOT and standard libraries like NumPy, pandas, and scikit-learn to develop the models. This presentation will discuss how we overcame challenges with decomposing FTIR spectra data and using machine learning on small datasets (<100 samples). We will also discuss integration of the models into our open-source webtool to support biofuel research.
Behind every successful open source project is a strong contributor community. What makes these communities strong? What can you do in your OSS project to nurture a thriving contributor community? In this presentation, we will share insights from the work of the Contributor Experience Lead team (NumPy, SciPy, Matplotlib, and pandas) and discuss why designing and providing positive contributor experience is vital to sustainability of each individual project and the SciPy ecosystem overall.
yt_xarray is a new package in the scientific python ecosystem for linking yt and xarray. yt, primarily used in computational astrophysics, has gradually broadened support for scientific domains, including geoscience disciplines. Most geoscience data, however, still requires manual steps to load into yt. yt_xarray, a new xarray extension, aims to streamline communication of data from xarray to yt, providing a potentially useful tool to the many geoscience researchers already using xarray while allowing yt to leverage the distributed backends already supported by xarray. In this presentation, we will provide an overview of the usage and design of yt_xarray.
In research and data science, effective communication requires weaving together narrative text and code to produce elegantly formatted output. By embedding executable Python code blocks inside markdown, the open-source publishing platform, Quarto, works with Jupyter and VS Code to enable you to create these fully reproducible documents and reports with the format and styling you need. In this talk I’ll share how to get started and a few of my favorite things in Quarto including creating a manuscript, presentation and website in HTML, PDF and Word from a single source file, and creating lessons, reports, and Confluence documents.
Long-tailed distributions are common in natural and engineered systems; as a result, we encounter extreme values more often than we would expect from a short-tailed distribution. If we are not prepared for these "black swans", they can be disastrous.
But we have statistical tools for identifying long-tailed distributions, estimating their parameters, and making better predictions about rare events.
In this talk, I present evidence of long-tailed distributions in a variety of datasets -- including earthquakes, asteroids, and stock market crashes -- discuss statistical methods for dealing with them, and show implementations using scientific Python libraries.
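One hedged sketch of such a comparison, on synthetic Pareto-distributed data using scipy.stats:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = stats.pareto.rvs(b=1.5, size=5000, random_state=rng)   # synthetic long-tailed data

# Fit a long-tailed model and a short-tailed one, then compare their tail predictions
b, loc, scale = stats.pareto.fit(data, floc=0)
mu, sigma = np.mean(data), np.std(data)

threshold = np.quantile(data, 0.999)
print("empirical P(X > t):", np.mean(data > threshold))
print("Pareto model:      ", stats.pareto.sf(threshold, b, loc=loc, scale=scale))
print("normal model:      ", stats.norm.sf(threshold, loc=mu, scale=sigma))
```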
The open-source project, Xarray, combines labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. Xarray has strong user bases in the physical sciences and geospatial community. However, new users commonly struggle to fit their dataset into the Xarray model and with conceptualizing and constructing an Xarray object that makes subsequent analysis steps easy (“dataset wrangling”). We take inspiration from the “tidy data” concept for dataframes — “datasets structured to facilitate analysis” (Wickham, 2014) — and attempt a definition of tidy data for labeled array objects provided by Xarray.
A key feature of the Python data ecosystem is the reliance on simple but efficient primitives that follow well-defined interfaces to make tools work seamlessly together (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature, namely the scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, Javascript, Julia, and Python.
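A minimal sketch of creating and slicing a chunked, persistent Zarr array:

```python
import numpy as np
import zarr

# Create a chunked, compressed on-disk array larger than a typical memory budget
z = zarr.open("example.zarr", mode="w", shape=(100_000, 10_000),
              chunks=(1_000, 1_000), dtype="float32")

# Write and read arbitrary slices; only the touched chunks hit the disk
z[:1_000, :1_000] = np.random.rand(1_000, 1_000).astype("float32")
print(z[500, :5])
print(z.info)
```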
MetPy is an open-source Python package for meteorological and atmospheric science applications, leveraging many other pieces of the scientific Python stack (e.g. numpy, matplotlib, scipy, etc.). With a focus on sustainability, MetPy extensively leverages GitHub Actions to automate as much of the software development process as possible. Sustainability also extends to the growth of the community of developers, and we have been working to make that sustainable as well. Here we talk about our experiences and share our successes and lessons learned in trying to build a sustainable project.
This project introduces an extensible workflow used to evaluate climate model output using collections of Jupyter notebooks. The workflow supports parametrizing and batch-executing notebooks using Papermill, in conjunction with developing notebooks interactively. Additional features include integration with Dask and caching intermediate data products generated by notebooks. The final product of the workflow can automatically be built into a Jupyter book for easy presentation and shareability. While it was initially developed for climate modeling, the flexible and extensible nature of this framework makes it adaptable to any kind of data analysis work, and the presentation will highlight this capability.
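The Papermill piece of such a workflow might be sketched like this (the notebook names and parameters are illustrative, not the project's actual configuration):

```python
import papermill as pm

# Batch-execute a parametrized notebook once per model run
# ("analysis.ipynb" needs a cell tagged "parameters")
for experiment in ["run_a", "run_b"]:
    pm.execute_notebook(
        "analysis.ipynb",
        f"analysis_{experiment}.ipynb",
        parameters={"experiment": experiment, "variable": "surface_temperature"},
    )
```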
napari is an n-dimensional image viewer for Python. If you’ve ever tried plt.imshow(arr) and made Matplotlib unhappy because arr has more than two dimensions, then napari might be for you! napari will gladly display higher-dimensional arrays by providing sliders to explore additional dimensions. But napari can also: overlay derived data, such as points, segmentations, polygons, surfaces, and more; and annotate and edit these data, using standard data structures like NumPy or Zarr arrays, allowing you to seamlessly weave exploration, computation, and annotation in image analysis.
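A minimal sketch of viewing a higher-dimensional array (synthetic data):

```python
import numpy as np
import napari

# A 4D array (time, z, y, x): napari adds sliders for the extra dimensions
volume = np.random.rand(10, 32, 256, 256)

viewer = napari.view_image(volume, name="timelapse volume")
# Overlay derived data, e.g. points, on top of the image layer
viewer.add_points(np.array([[0, 16, 100, 100], [3, 10, 50, 200]]), size=10)

napari.run()   # start the event loop when running as a script
```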
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
Each new SciPy brings even more tools for data visualization and for building data-rich scientific applications and dashboards. This BoF brings together maintainers of Python tools for data visualization and building apps to help make sense of this complex landscape for users and to highlight new developments, trends, and opportunities. Join us and stay ahead of the curve!
Scientific open source software has often advanced by volunteer efforts with little financial support. In recent years, there has been an increase in different groups funding open source software. How has this changed the open source community? Where would future funding have the largest impact in the open source landscape? What new thing would you build that would make the lives of developers, researchers, and users easier? How much support is needed and what are the best ways to provide that support? What large scale project doesn’t exist that needs to exist? How do you balance funded and volunteer efforts? Join this lively discussion to help identify key focus areas for open source funding and resources.
"Python packaging is a rapidly changing landscape, plagued by many hurdles and challenges for users. The scientific Python community faces some of the greatest difficulties of anyone here, given the high reliance on external binaries and compiled code, the diversity of packaging ecosystems (PyPI, Conda, others), and the fact that many if not most users are not professional software engineers, like in other ecosystems. This is made all the more critical by the importance of reproducible research, and its sensitivity to even small dependency changes.
We'd like to build on the recent momentum behind evolving the packaging landscape to better serve these needs and building bridges between key players in the core Python and scientific spaces, with an intense, engaging and open discussion. This will bring together the key community stakeholders and everyday package authors to sync up on best practices, strengthen collaboration, and help come to consensus that would take months or even years if not for in-person discussion, as well as provide a jumping-off point for followup conversations and future action items."
Dr. Rumman Chowdhury is a trailblazer in the field of applied algorithmic ethics, creating cutting-edge socio-technical solutions for ethical, explainable and transparent AI. She currently runs Parity Consulting, Parity Responsible Innovation Fund, and is a Responsible AI Fellow at the Berkman Klein Center for Internet & Society at Harvard University. She is also a Research Affiliate at the Minderoo Center for Democracy and Technology at Cambridge University and a visiting researcher at the NYU Tandon School of Engineering. Previously, she was the director of the ML Ethics, Transparency, and Accountability team at Twitter identifying and mitigating algorithmic harms on the platform. Before that she was CEO and founder of Parity, an enterprise algorithmic audit platform company. She formerly served as Global Lead for Responsible AI at Accenture Applied Intelligence. In her work as Accenture’s Responsible AI lead, she led the design of the Fairness Tool, a first-in-industry algorithmic tool to identify and mitigate bias in AI systems. Dr. Chowdhury has been featured in international media, including the Wall Street Journal, Financial Times, Harvard Business Review, NPR, MIT Sloan Magazine among others. She was named one of BBC’s 100 Women, recognized as one of the Bay Area’s top 40 under 40, and honored to be inducted to the British Royal Society of the Arts (RSA).
In this talk, we will examine the new CUDA package layout for Conda (as included in conda-forge), show how CUDA components have been broken out, share how this affects development and package building, walk through changes in the conda-forge infrastructure made to incorporate these new packages, and examine recipes using the new packages and what was needed to update them. Additionally, we will provide guidance on how to use these new packages in recipes or in library development.
In this talk we will share a Python library to obtain and analyze policing data, that was developed in conjunction with community activists, data scientists, social scientists and the Small Town Police Accountability (SToPA) Research Lab. We will showcase components of the SToPA library which use Python tools such as web drivers, optical character recognition, geospatial mapping, machine learning and statistical sampling to better understand the policing landscape. The goal of this work is to present an easily replicable framework for analyzing police and community interactions with accessible on-ramps for activists, developers and researchers.
Existing production machine learning systems often suffer from various problems that make them hard to use. For example, data scientists and ML practitioners often spend most of their time stitching and managing bespoke distributed systems to build end-to-end ML applications and push models to production.
To address this, the Ray community has built Ray AI Runtime (Ray AIR), an open-source toolkit for building large-scale end-to-end ML applications.
The array API standard (https://data-apis.org/array-api/) is a common specification for Python array libraries, such as NumPy, PyTorch, CuPy, Dask, and JAX.
This standard will make it straightforward for array-consuming libraries, like scikit-learn and SciPy, to write code that uniformly supports all of these libraries. This will allow, for instance, running the same code on the CPU and GPU.
This talk will cover the scope of the array API standard, supporting tooling which includes a library-independent test suite and compatibility layer, what work has been completed so far, and the plans going forward.
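A hedged sketch of array-agnostic code, written here against the array-api-compat compatibility layer mentioned above (the function and data are illustrative):

```python
import numpy as np
from array_api_compat import array_namespace

def softmax(x):
    # Retrieve the array library's namespace from the input array itself,
    # so the same code works for NumPy, CuPy, or PyTorch inputs
    xp = array_namespace(x)
    shifted = x - xp.max(x, axis=-1, keepdims=True)
    e = xp.exp(shifted)
    return e / xp.sum(e, axis=-1, keepdims=True)

print(softmax(np.array([[1.0, 2.0, 3.0]])))
```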
Geolocated data from smartphone apps are well-established resources for research. While most of that data come as points (e.g., geotagged photos), there are a growing number of apps that collect linear data from users' activities (e.g., running, hiking, off-road driving). Using established ecological methods, shallow machine-learning packages, and multiprocessing, we demonstrate a novel approach using mobile app data to estimate back-country recreation popularity at multiple scales. The topics covered include normalizing and thinning coordinate data, merging linear data from multiple sources, and accounting for spatial bias while preserving the integrity of the original data.
Allegro and FLARE are two very different packages for constructing machine learning potentials that are fast, accurate, and suitable for extreme-scale molecular dynamics simulations. Allegro uses PyTorch for efficient equivariant potentials with state-of-the-art accuracy, while FLARE is a sparse Gaussian process potential with an optimized C++ training backend leveraging Kokkos, OpenMP, and MPI for state-of-the-art performance, and a user-friendly Python frontend. We will compare and contrast the two methods, discuss lessons learned, and show spectacular scientific applications.
Fast interactive visualization remains a considerable barrier in analysis pipelines for large neuronal datasets. Here, we present fastplotlib, a scientific plotting library featuring an expressive API for very fast visualization of scientific data. Fastplotlib is built upon pygfx, which utilizes the GPU via WGPU, allowing it to interface with modern graphics APIs such as Vulkan for fast rendering of objects. Fastplotlib is non-blocking, allowing for interactivity with data after plot generation. Ultimately, fastplotlib is a general purpose scientific plotting library that is useful for fast, live visualization and analysis of complex datasets.
Once a maintainer decides to step down from a project, the community needs to adapt quickly to this decision. This situation can be devastating for small projects and lead to their extinction. This talk demonstrates, based on the case of poliastro, that the community is a key factor in a project's survival, no matter who is leading it.
Metal-Organic Frameworks (MOFs) have vast potential for gas adsorption, but their practical use hinges on their ability to dissipate thermal energy generated during adsorption. Here, we performed the first high-throughput screening of thermal conductivity in over 10,000 MOFs using molecular dynamics simulations. Next, we developed a graph neural network (GNN) based model to swiftly predict the diagonal components of the thermal conductivity tensor for accelerated materials discovery. Attendees will gain insights into how GNNs can be trained to predict material tensor properties, benefiting both the materials science and machine learning communities.
As scientists continue to embrace the Jupyter ecosystem for constructing computational narratives of their science through code, data, and rich text, they may encounter technical and community barriers to maintaining and sharing their science with new and existing audiences. We demonstrate the value of open-source science community building and getting there through reliance on the open-source Jupyter ecosystem, pre-packaged GitHub and BinderHub-based infrastructure, and documentation for creating, sharing, testing, and maintaining Pythia Cookbooks for their computational narratives.
Relational databases manage structured data and facilitate queries in collaborative repositories, but using SQL from a scientific programming language is awkward. DataJoint is an open-source framework for managing scientific data supporting data definition, diagramming, and queries. DataJoint makes computation a native part of its data model, bridging the gap between databases and numerical analysis in automated workflows. We will showcase the elegance of the relational data model and its versatility through neuroscience research examples. We will also introduce the DataJoint SciViz library, enabling scientists to build web apps for data visualization and unlocking further potential for data-driven discovery.
CZ CELLxGENE Discover has released all of its human and mouse single-cell data through a new API that allows for efficient and low-latency querying. The data is fully standardized, hosted publicly, and composed of a count matrix of 50 million cells (observations) by more than 60,000 genes (features), accompanied by cell and gene metadata. While these data are built from more than 700 datasets, the API enables convenient cell- and gene-based filtering to obtain any slice of interest in a matter of seconds. All data can be quickly transformed to numpy, pandas, anndata, or Seurat objects.
Communities are at the heart of open source software and are fundamental to our projects’ long-term success. The Python ecosystem has several mature projects, that have spent years working on community initiatives. Newer projects can learn from their experiences and build stronger foundations to foster healthy communities.
In this talk, we share a set of practices for community-first projects, including repository management, contributor pathways, and governance principles. We’ll also share real examples from our own journey transitioning a company-backed OSS project, Nebari (https://nebari.dev/), to be more community-oriented.
Force fields (FFs), the (parametrized) mapping from geometry to energy, are a crucial component of molecular dynamics (MD) simulations, whose associated Boltzmann-like target probability densities are sampled to estimate ensemble observables and harvest quantitative insights about the system. State-of-the-art force fields are either fast (molecular mechanics, MM-based) or accurate (quantum mechanics, QM-based), but seldom both. Here, leveraging graph-based machine learning and incorporating inductive biases crucial to chemical modeling, we approach the balance between accuracy and speed from two angles: making MM more accurate and making machine learning force fields faster.
Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.
Come join the BoF to do a practice run on contributing to a GitHub project. We will walk through how to open a Pull Request for a bugfix, using the workflow most libraries participating in the weekend sprints use (hosted by the sprint chairs).
The aim of this panel is to shed light on the role of code assistants like Copilot and tools like ChatGPT, and how they are revolutionizing coding careers, as well as to provide insights that help young and budding programmers prepare themselves for future careers. The panel will also try to find answers to some hypothetical questions: Can AI replace human programmers? Can it add or suggest new features to the language itself? And what problems might people face while developing enterprise-grade applications with AI?
NumFOCUS will facilitate a discussion around open source projects managing a robust Code of Conduct as well as ongoing DEI support
Discuss the effects of recent and potential performance improvements on the scientific Python packages. The goal is to discuss the cost/benefit tradeoffs of adapting existing libraries to take advantage of potential improvements, especially per-interpreter GIL and nogil, but also type specializations in the interpreter.
"Notebooks can be a powerful tool for the purposes for which they were designed—learning, experimenting, and sharing results. However, users face many challenges when trying to achieve true reproducbility with notebooks alone, including lack of dependency management, pitfalls of non-linear interactive execution, and requiring bespoke tooling to open and execute. Furthermore, there is a growing need to go beyond reprodubility of individual results—siloed into an opaque format possessing limited interoperability with the rest of the Python ecosystem—toward reusuability of research methods, that can be shared, built upon, and deployed by users across the world.
Therefore, we invite the community to share their tools and workflows to go beyond reproducibility and towards true reusable science, built on the shoulders of giants. Furthermore, we hope to explore how we can encourage users and the community to move beyond the notebooks monoculture and toward a holistic, open, modular and interoperable approaches to conducting research and developing scientific code."
Feedback on SciPy 2023 and ideas for SciPy 2024