<?xml version='1.0' encoding='utf-8' ?>
<!-- Made with love by pretalx v2026.1.0.dev0. -->
<schedule>
    <generator name="pretalx" version="2026.1.0.dev0" />
    <version>0.42</version>
    <conference>
        <title>SciPy 2023</title>
        <acronym>2023</acronym>
        <start>2023-07-10</start>
        <end>2023-07-16</end>
        <days>7</days>
        <timeslot_duration>00:05</timeslot_duration>
        <base_url>https://cfp.scipy.org</base_url>
        
        <time_zone_name>America/Chicago</time_zone_name>
        
        
        <track name="Tutorials" slug="6017-tutorials"  color="#000001" />
        
        <track name="Social Science and the Digital Humanities" slug="6013-social-science-and-the-digital-humanities"  color="#800080" />
        
        <track name="Materials and Chemistry" slug="6014-materials-and-chemistry"  color="#b22222" />
        
        <track name="Bioinformatics, Computational Biology &amp; Neuroscience" slug="6015-bioinformatics-computational-biology-neuroscience"  color="#f9b404" />
        
        <track name="Astronomy and Physics" slug="6016-astronomy-and-physics"  color="#120273" />
        
        <track name="Machine Learning, Data Science, and Ethics in AI" slug="6012-machine-learning-data-science-and-ethics-in-ai"  color="#f20c1a" />
        
        <track name="General Track" slug="6027-general-track"  color="#0000ff" />
        
        <track name="Tending Your Open Source Garden: Maintenance and Community" slug="6028-tending-your-open-source-garden-maintenance-and-community"  color="#cf6e09" />
        
        <track name="Earth, Ocean, Geo, and Atmospheric" slug="6029-earth-ocean-geo-and-atmospheric"  color="#00c649" />
        
        <track name="Keynote" slug="6030-keynote"  color="#47b5cf" />
        
        <track name="Birds of a Feather (BoF)" slug="6031-birds-of-a-feather-bof"  color="#cf1dc2" />
        
        <track name="Lightning Talks" slug="6032-lightning-talks"  color="#297ceb" />
        
    </conference>
    <day index='1' date='2023-07-10' start='2023-07-10T04:00:00-05:00' end='2023-07-11T03:59:00-05:00'>
        <room name='Classroom 106' guid='4c797b32-36af-5982-a9c4-18fed0f86568'>
            <event guid='213c6c9f-b496-56fe-ae08-a3c17fdb214a' id='76072' code='DDJTZL'>
                <room>Classroom 106</room>
                <title>Full-stack Machine Learning for Data Scientists</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how to move machine learning projects from prototype and experiment to production as a repeatable process. In this workshop, we present an introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows.</abstract>
                <slug>2023-76072-full-stack-machine-learning-for-data-scientists</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/DDJTZL/scipy-full-stack-ML_BYOf5z5_UawVlpU.png</logo>
                <persons>
                    <person id='76956'>Savin Goyal</person><person id='76955'>Hugo Bowne-Anderson</person>
                </persons>
                <language>en</language>
                <description>One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how to move machine learning projects from prototype and experiment to production as a repeatable process. In this workshop, we present an introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows.

We&#8217;ll present a high-level overview of the 8 layers of the ML stack: data, compute, versioning, orchestration, software architecture, model operations, feature engineering, and model development. We&#8217;ll offer a schematic of which layers data scientists need to be thinking about and working with, and then introduce attendees to the tooling and workflow landscape. In doing so, we&#8217;ll describe a widely applicable stack that provides the best possible user experience for data scientists, allowing them to focus on the parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/DDJTZL/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/DDJTZL/feedback/</feedback_url>
            </event>
            <event guid='4009de05-356c-5f81-8141-b8997baab5e4' id='76235' code='8BZN3E'>
                <room>Classroom 106</room>
                <title>Modern Deep Learning with PyTorch</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>We will kick off this tutorial with an introduction to deep learning and highlight its primary strengths and use cases compared to traditional machine learning. In recent years, PyTorch has emerged as the most widely used deep learning library for research. However, a lot has changed regarding how we train neural networks these days. After getting a firm grasp of the PyTorch API, you will learn how to train deep neural networks using various multi-GPU training paradigms. We will also fine-tune large language models (transformers) and deploy them to the cloud.</abstract>
                <slug>2023-76235-modern-deep-learning-with-pytorch</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='76881'>Sebastian Raschka</person>
                </persons>
                <language>en</language>
                <description>This tutorial will be aimed at Python programmers new to PyTorch and deep learning. However, even more experienced deep learning practitioners and PyTorch users may be exposed to new concepts and ideas when exploring other open source libraries to extend PyTorch.

Throughout this 4-hour tutorial session, attendees will learn how to use PyTorch to train neural networks for image and text classification. We will discuss the individual strengths and weaknesses of deep learning and contrast it with traditional machine learning via libraries such as scikit-learn. 

We will discuss the PyTorch library in detail, exploring it as a tensor library, automatic differentiation library, and library for implementing deep neural networks.
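
As a taste of what these facets look like in practice, here is a minimal sketch of PyTorch as an automatic differentiation library (the values are illustrative, not tutorial material):

```python
import torch

# A scalar tensor that tracks gradients
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x   # y = x^2 + 2x
y.backward()         # compute dy/dx via autograd
print(x.grad)        # tensor(8.) since dy/dx = 2x + 2 = 8 at x = 3
```

The same mechanism scales from this one-liner up to the gradients of deep neural networks.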

After getting a firm grasp of the PyTorch API, we will introduce additional open source libraries to familiarize attendees with the modern open source stack for deep learning. For instance, we will organize our model training loops using the Lightning Trainer, which will help us reduce boilerplate code and get additional benefits such as model checkpointing, logging, and convenient mixed precision training.

Then, we will explore multi-GPU training strategies from the DeepSpeed library to accelerate model training when multiple GPUs are available. Note that all model code in this tutorial can be run on a laptop computer, but attendees will also be introduced to free GPU options via Google Colab and Lightning so they can get the full benefit of this multi-GPU training section.

Transformer-based large language models have largely replaced recurrent neural networks for text classification and generation. So, in this tutorial, attendees will learn how to adopt and fine-tune large language models from the Hugging Face transformers library.

Lastly, as a bonus, we will also build a deep learning model demo using Gradio and deploy it to the cloud using Lightning.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/8BZN3E/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/8BZN3E/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 101' guid='b97eb3be-1b75-5ca2-a13e-34b3fb214e4c'>
            <event guid='584990c4-13de-5b71-8430-b8876c804cda' id='76175' code='CJUYJM'>
                <room>Classroom 101</room>
                <title>Mosaic Magic with Matplotlib</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>Communicating scientific data often relies on making comparisons between multiple datasets.
Join the Matplotlib team to learn about creating multi-axis figures to display such data side-by-side.
This intermediate-level tutorial will cover a variety of tools for making multi-axis figures.
Particular focus will be given to [subplot_mosaic](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/mosaic.html) and the layout engines: tight, constrained, and compressed.
This tutorial will emphasize the use of Matplotlib&apos;s Object Oriented (OO) API and why that is generally recommended over the pyplot (plt) API.</abstract>
                <slug>2023-76175-mosaic-magic-with-matplotlib</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/CJUYJM/tutorial_image_yUX7wUY_cwvHqvN.png</logo>
                <persons>
                    <person id='77084'>Kyle Sunden</person>
                </persons>
                <language>en</language>
                <description>This tutorial is designed for users of Matplotlib who want to learn more about how to lay out complicated figures.
Bring a figure you like that you want to replicate the layout of or one that you&apos;d like to improve.


- Introduction (10 mins)
- Parts of a figure: what makes up a figure (20 mins)
  - (Build up to: https://matplotlib.org/stable/gallery/showcase/anatomy.html)
- Creating a figure with a single axes (10 mins)
- Object oriented model of interacting with axes (20 mins)
    - e.g. Prefer `ax.plot` over `plt.plot`
- Multi axes figures (~1.5 hr):
    - `subplots` (10 mins)
    - `subplot_mosaic` (30 mins)
    - `grid_spec` (20 mins)
    - `subplot2grid` (5 mins)
    - `add_axes` (5 mins)
    - `add_subplot` (5 mins)
    - Inset and zoomed axes (5 mins)
- Layout engines (30 mins)
    - Introduction (10 mins)
    - Constrained Layout (10 mins)
    - Compressed Layout (5 mins)
    - Tight Layout (5 mins)
- Labeling figures (20 mins)
    - Axis/figure labels (10 mins)
    - Legends (5 mins)
    - Colorbars (5 mins)
- Subfigures (10 mins)
- Conclusions/questions (20 mins)
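
A minimal sketch of the `subplot_mosaic` portion of the outline above (the axis names and layout are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch
import matplotlib.pyplot as plt

# Name each axes and let one span two rows
fig, axs = plt.subplot_mosaic(
    [['map', 'hist'],
     ['map', 'scatter']],
    layout='constrained',
)
axs['map'].plot([0, 1], [0, 1])
axs['hist'].hist([1, 2, 2, 3])
axs['scatter'].scatter([1, 2, 3], [3, 1, 2])
print(sorted(axs))  # ['hist', 'map', 'scatter']
```

Note the object-oriented style: each named axes is used directly, rather than relying on pyplot state.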


Detailed setup instructions will be provided prior to the event.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/CJUYJM/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/CJUYJM/feedback/</feedback_url>
            </event>
            <event guid='9277b919-21eb-50cc-9914-27a6da89c7ac' id='76269' code='LZPDBD'>
                <room>Classroom 101</room>
                <title>3D Visualization with PyVista</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>[PyVista](https://github.com/pyvista/pyvista) is a general purpose 3D visualization library used for over 1400+ open source projects for the visualization of everything from [computer aided engineering and geophysics to volcanoes and digital artwork](https://dev.pyvista.org/getting-started/external_examples.html).

PyVista exposes a Pythonic API to the [Visualization Toolkit (VTK)](http://www.vtk.org) to provide tooling that is immediately usable without any prior knowledge of VTK and is being built as the 3D equivalent of Matplotlib, with plugins to Jupyter to enable visualization of 3D data using both server- and client-side rendering.</abstract>
                <slug>2023-76269-3d-visualization-with-pyvista</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/LZPDBD/pyvista_banner_HZ7nr2j_sj5kTk2.png</logo>
                <persons>
                    <person id='77210'>Alexander Kaszynski</person><person id='76930'>Tetsuo Koyama</person><person id='77241'>Bane Sullivan</person>
                </persons>
                <language>en</language>
                <description>Our tutorial will demonstrate PyVista&apos;s latest capabilities and bring a wide range of users to the forefront of 3D visualization in Python.

- Use PyVista to create 3D visualizations from a variety of datasets in common formats.
- Get an overview of the classes and data structures of PyVista through real-world examples.
- Become familiar with the various filters and features of PyVista.
- Know which Python libraries are used and can be used by PyVista (meshio, trimesh, etc.).

This tutorial caters to anyone who wants to visualize data in any domain, from basic Python users to advanced power users.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/LZPDBD/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/LZPDBD/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 202' guid='53c488a7-1e0d-5041-aabf-cfaac5de28b1'>
            <event guid='6890b473-96cd-5fbf-9bd0-495d81cc1426' id='75986' code='NEUUKG'>
                <room>Classroom 202</room>
                <title>Image analysis and visualization in Python with scikit-image, napari, and friends</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>Between telescopes and satellite cameras and MRI machines and microscopes, scientists are producing more images than they can realistically look at. They need specialized viewers for multi-dimensional images, and automated tools to help process those images into knowledge. In this tutorial, we will cover the fundamentals of algorithmic image analysis, starting with how to think of images as NumPy arrays, moving on to basic image filtering, and finishing with a complete workflow: segmenting a 3D image into regions and making measurements on those regions. At every step, we will visualize and understand our work using matplotlib and napari.</abstract>
                <slug>2023-75986-image-analysis-and-visualization-in-python-with-scikit-image-napari-and-friends</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/NEUUKG/napari-cells_3afAN5z_xRHeSA9.png</logo>
                <persons>
                    <person id='77048'>Juan Nunez-Iglesias</person><person id='77214'>Lars Gr&#252;ter</person><person id='77215'>Kira Evans</person>
                </persons>
                <language>en</language>
                <description>Between telescopes and satellite cameras and MRI machines and microscopes, scientists are producing more images than they can realistically look at. They need specialized viewers for multi-dimensional images, and automated tools to help process those images into knowledge.

This tutorial is aimed at folks who have some experience in scientific computing with Python, but are new to image analysis. To get the most out of it, they should have done some work with NumPy arrays &#8212; no need to be an expert! &#8212; but they don&apos;t need to know [an image from a pipe](https://en.wikipedia.org/wiki/The_Treachery_of_Images). We will cover the fundamentals of working with images in scientific Python. The tutorial will be split into four parts, of about 45 minutes each, plus breaks:

- **Images are just NumPy arrays.** In this section we will cover the basics: how to think of images not as things we can see but numbers we can analyze.
- **Changing the structure of images with image filtering.** In this section we will define *filtering*, a fundamental operation on signals (1D), images (2D), and higher-dimensional images (3D+). We will use filtering to find various structures in images, such as *blobs* and *edges*. Putting NumPy, SciPy, scikit-image, and scikit-learn together, we&apos;ll show how these fundamental filters are related to modern convolutional neural networks.
- **Finding regions in images and measuring their properties.** In this section we will define image segmentation &#8212; splitting up images into regions. We will show how segmentation is commonly represented in the scientific Python ecosystem, some basic and advanced methods to do it, and use it to take measurements of segmented objects in our images. We will use scikit-image for some basics, and to make object measurements, but we&apos;ll also demonstrate how to use a modern, neural-network-based library to find our imaged objects quickly and get on with our science: measuring the things we&apos;ve imaged.
- **Q&amp;A/Quick tour of advanced features.** This section will be more freestyle and will depend on the audience. We may do a guided tour of other advanced image analysis topics, answer lingering questions about the previous sections, or walk around the room and help people apply what they&apos;ve learned to their own data of interest.
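
The flavor of the whole pipeline can be sketched in a few lines with NumPy and SciPy (the synthetic image and threshold are illustrative):

```python
import numpy as np
from scipy import ndimage as ndi

# A tiny synthetic 'image': two bright blobs on a dark background
image = np.zeros((8, 8))
image[1:3, 1:3] = 1.0
image[5:7, 4:7] = 1.0

smoothed = ndi.gaussian_filter(image, sigma=0.5)  # filtering
labels, n = ndi.label(smoothed > 0.3)             # segmentation into regions
print(n)  # 2 regions found
```

Real images are bigger and noisier, but the array-in, array-out pattern is the same.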

Attendees will leave understanding how to work with images in Python, knowing some of the main libraries that can help them do that, and knowing where to get more help if they need it.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/NEUUKG/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/NEUUKG/feedback/</feedback_url>
            </event>
            <event guid='db541e88-3f2b-5d57-bec8-c9227eefde75' id='76093' code='UJBWPQ'>
                <room>Classroom 202</room>
                <title>Introduction to Numerical Computing With NumPy</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>NumPy provides Python with a powerful array processing library and an elegant syntax that is well suited to expressing computational algorithms clearly and efficiently. We&apos;ll introduce basic array syntax and array indexing, review some of the available mathematical functions in NumPy, and discuss how to write your own routines.</abstract>
                <slug>2023-76093-introduction-to-numerical-computing-with-numpy</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/UJBWPQ/numpy_logo_p3rGRbD_dtOG98P.svg</logo>
                <persons>
                    <person id='76855'>Sandhya Govindaraju</person>
                </persons>
                <language>en</language>
                <description>NumPy provides Python with a powerful array processing library and an elegant syntax that is well suited to expressing computational algorithms clearly and efficiently. We&apos;ll introduce basic array syntax and array indexing, review some of the available mathematical functions in NumPy, and discuss how to write your own routines.
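
For example, the style of array syntax and broadcasting the tutorial introduces (a sketch, not tutorial material):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)  # a 3x4 array of the integers 0..11
col_means = a.mean(axis=0)       # reduce over rows: one mean per column
centered = a - col_means         # broadcasting subtracts from every row
print(centered.sum())            # 0.0
```

One vectorized expression replaces the explicit loops you would write in plain Python.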

The tutorial is intended for people new to the scientific Python ecosystem. Previous experience in Python or another programming language is useful but not required.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/UJBWPQ/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/UJBWPQ/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 203' guid='6cdfbdd8-9285-5a8b-83a6-3db1aa0b6606'>
            <event guid='5dc55895-1a54-54ea-b21e-fd44158dc7b1' id='76295' code='B9CHA7'>
                <room>Classroom 203</room>
                <title>PPML: Machine Learning on data you cannot see</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>Privacy guarantees are **the** most crucial requirement when it comes to analysing sensitive data. However, data anonymisation techniques alone do not always provide complete privacy protection; moreover, machine learning models can be exploited to _leak_ sensitive data when _attacked_ if no counter-measures are applied. *Privacy-preserving machine learning* (PPML) methods hold the promise to overcome all of these issues, making it possible to train machine learning models with full privacy guarantees. In this tutorial we will explore several methods for privacy-preserving data analysis, and how these techniques can be used to safely train ML models _without_ actually seeing the data.</abstract>
                <slug>2023-76295-ppml-machine-learning-on-data-you-cannot-see</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='77083'>Valerio Maggio</person>
                </persons>
                <language>en</language>
                <description>Privacy guarantees are **the** most crucial requirement when it comes to analysing sensitive data. These requirements can be very stringent, to the point of becoming a real barrier to the entire pipeline. The reasons for this are manifold, and involve the fact that data often cannot be _shared_ or moved from the silos where it resides, let alone analysed in _raw_ form. As a result, _data anonymisation techniques_ are sometimes used to generate a sanitised version of the original data. However, these techniques alone are not enough to guarantee that privacy will be completely preserved. Moreover, the _memoisation_ effect of deep learning models can be maliciously exploited to _attack_ the models and _reconstruct_ sensitive information about samples used in training, even if that information was not originally provided.

*Privacy-preserving machine learning* (PPML) methods hold the promise to overcome all of these issues, making it possible to train machine learning models with full privacy guarantees.
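
As a preview of the first part, the Laplace mechanism at the core of differential privacy can be sketched in a few lines (the data, bounds, and `epsilon` here are illustrative):

```python
import numpy as np

def dp_mean(values, lo, hi, epsilon, rng):
    # Laplace mechanism: the sensitivity of a bounded mean is (hi - lo) / n
    clipped = np.clip(values, lo, hi)
    scale = (hi - lo) / (len(values) * epsilon)
    return clipped.mean() + rng.laplace(0.0, scale)

rng = np.random.default_rng(42)
ages = rng.integers(18, 90, size=10_000)
noisy = dp_mean(ages, 18, 90, epsilon=1.0, rng=rng)
```

The released mean is close to the true mean for large datasets, yet no single record can be inferred from it.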

This workshop is organised in **three** main parts. In the first part, we will introduce the main concepts of **differential privacy**: what it is, and how it differs from more classical _anonymisation_ techniques (e.g. `k-anonymity`). In the second part, we will focus on machine learning experiments. We will start by demonstrating how DL models can be exploited (i.e. via _inference attacks_) to reconstruct original data solely by analysing model predictions; we will then explore how **differential privacy** can help us protect the privacy of our model, with _minimum disruption_ to the original pipeline. Finally, we will conclude the tutorial by considering more complex ML scenarios that train deep learning networks on encrypted data, with specialised distributed _federated learning_ strategies.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/B9CHA7/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/B9CHA7/feedback/</feedback_url>
            </event>
            <event guid='f8334632-2027-5a79-bdc6-b35b0f632ee6' id='76248' code='CQRYUC'>
                <room>Classroom 203</room>
                <title>Introduction to Causal Inference</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>This tutorial session is intended to give attendees a gentle introduction to applying causal thinking and causal inference to data using Python. Causal data analysis is very common in many academic domains (e.g. in social psychology, epidemiology, macroeconomics, etc.) as well as in industry (all of the largest Silicon Valley tech companies employ teams of scientists who answer business questions purely with causal inference methods). The tutorial will involve a combination of presentations with open Q&amp;A and hands-on exercises contained in Google Colab notebooks.</abstract>
                <slug>2023-76248-introduction-to-causal-inference</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='76829'>Roni Kobrosly</person>
                </persons>
                <language>en</language>
                <description>The tutorial will involve a combination of presentations with open Q&amp;A and hands-on exercises contained in Google Colab notebooks. This session will cover the difference between correlation and causation, the pitfalls of conducting an analysis using observational data, how causal inference can help get around these pitfalls, and examples of common, modern modeling approaches used to conduct causal inference (propensity score matching, estimating causal curves, g-computation, and double ML). After the tutorial, the attendees should have a good foundational understanding of causality and the ability to confidently explore the topic on their own. Causal inference can be a very theory-heavy topic, making it impenetrable to novices. In this tutorial, we&apos;ll aim to take a more practical perspective on causal inference, while still occasionally touching on the theory.
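
To illustrate why adjustment matters, here is a toy simulation in the spirit of the session (the data-generating process is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.binomial(1, 0.5, n)                   # confounder
t = rng.binomial(1, 0.2 + 0.6 * z)            # treatment depends on z
y = 1.0 * t + 2.0 * z + rng.normal(0, 1, n)   # true effect of t is 1.0

# Naive comparison of treated vs. untreated is biased upward by z
naive = y[t == 1].mean() - y[t == 0].mean()

# Stratifying on z (a simple form of g-computation) recovers the effect
adjusted = np.mean([
    y[np.logical_and(t == 1, z == v)].mean()
    - y[np.logical_and(t == 0, z == v)].mean()
    for v in (0, 1)
])
```

The naive estimate lands near 2.2 while the stratified one recovers the true effect of 1.0.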

Tutorial participants are not expected to be familiar with causal inference before attending, but we hope they have an earnest curiosity to learn about it! To get the most out of the session, the participants ought to have experience working with the common Python data stack: matplotlib, numpy, pandas, and scikit-learn. Attendees should have some experience conducting classic machine learning modeling using the scikit-learn API, although having advanced machine learning expertise is absolutely not a prerequisite. A very basic understanding of statistics would be helpful (e.g. understanding what a mean is, what confidence intervals represent).</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/CQRYUC/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/CQRYUC/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 103' guid='32d9d0e7-9b5f-5eb2-a2ba-0f6853d9777b'>
            <event guid='c88d80d8-7e33-502a-9dec-00e6653b8e6e' id='76339' code='CDRJYE'>
                <room>Classroom 103</room>
                <title>Introduction to Python and Programming</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>Enjoy a gentle introduction to Python for folks who are completely new to it and may not have much experience programming. Learn how to write Python while practicing loops, if&#8217;s, functions, and usage of Python&#8217;s built-in features in a series of fun, interactive exercises inside Jupyter Notebooks. By the end you&#8217;ll be ready to write your own basic Python -- but most importantly, I want you to learn the form and vocabulary of Python so that you can understand Python documentation, interpret code written by others, and get the most out of other SciPy tutorials.</abstract>
                <slug>2023-76339-introduction-to-python-and-programming</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='76958'>Matt Davis</person>
                </persons>
                <language>en</language>
                <description>To make the most of SciPy it helps to have some basic familiarity with the Python language itself. This beginner level tutorial is designed for folks who are brand-new to Python and may not even have much programming experience. I&#8217;ll help you get a working Python installation in which you can launch Jupyter Notebooks, a common tool used in scientific research with Python and in SciPy tutorials.

Attendees will learn to work with Python variables, the object interface, loops, conditional statements, function definitions, and the use of basic Python data structures through hands-on exercises inside of Jupyter. Students will use the ipythonblocks library to manipulate an image-like grid of colors for immediate, interactive feedback that makes it easy to tell whether code had the intended effect.
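
For instance, an exercise in the spirit of the tutorial (hypothetical, not actual tutorial material) might count how many cells of a grid are "bright" using a loop, an if, and a function:

```python
grid = [[0.2, 0.9, 0.4],
        [0.8, 0.1, 0.7]]

def count_bright(rows, threshold=0.5):
    """Count values above threshold using a loop and an if statement."""
    bright = 0
    for row in rows:
        for value in row:
            if value > threshold:
                bright += 1
    return bright

print(count_bright(grid))  # 3
```

In the tutorial you will run snippets like this inside Jupyter and see the results immediately.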

My goal is for you to leave the tutorial with a basic familiarity with Python (and a working Python installation) that helps you focus on the scientific libraries you&#8217;ll learn about in the other tutorials and throughout SciPy. Familiarity with the usage and features of Jupyter will also help you dive headfirst into other tutorials.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/CDRJYE/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/CDRJYE/feedback/</feedback_url>
            </event>
            <event guid='c94f9bba-1d8d-5624-902a-bb6b01dd7258' id='76205' code='7BRY3J'>
                <room>Classroom 103</room>
                <title>Scalable machine learning workloads with Ray AI Runtime</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>Machine learning (ML) pipelines involve a variety of computationally intensive stages. As state-of-the-art models and systems demand more compute, there is an urgent need for adaptable tools to scale ML workloads. This idea drove the creation of Ray&#8212;an open source, distributed ML compute framework that not only powers systems like ChatGPT but also pushes theoretical computing benchmarks. Ray AIR is especially useful for parallelizing ML workloads such as pre-processing images, model training and finetuning, and batch inference. In this tutorial, participants will learn about AIR&#8217;s composable APIs through hands-on coding exercises.</abstract>
                <slug>2023-76205-scalable-machine-learning-workloads-with-ray-ai-runtime</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/7BRY3J/e2e_air_2YbrESe_SsJ9I4L.png</logo>
                <persons>
                    <person id='77064'>Emmy Li</person><person id='77251'>Adam Breindel</person>
                </persons>
                <language>en</language>
                <description>State-of-the-art machine learning (ML) models require an exponentially increasing amount of compute, making it necessary to utilize the full capacity of your laptop or workstation and, beyond that, cloud clusters. However, scaling introduces challenges with orchestration, integration, and maintenance. What&apos;s more, ML systems change quickly. If you rely on piecemeal solutions to parallelize the individual stages of pre-processing, training, inference, and tuning, then stitching these evolving systems together requires a lot of overhead.

This context drove the development of [Ray](https://github.com/ray-project/ray): a solution that enables researchers and developers to scale Python code to the full capacity of their laptops or clusters without worrying about implementing complex distributed computing logic.

This hands-on tutorial introduces Ray AI Runtime (AIR), an open source, Python-based set of libraries that equip researchers and developers with a toolkit for parallelizing ML workloads. We will use a popular computer vision (CV) use case, image segmentation, to guide participants through common ML workloads, including data pre-processing, model training and fine-tuning, and parallel batch inference.
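
AIR&#8217;s own APIs are what the hands-on labs cover; purely to illustrate the underlying pattern they parallelize (map a model function over shards of data across workers), here is a rough stdlib stand-in, where `fake_model` is a hypothetical placeholder for a real predictor:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a trained model: "inference" is just squaring.
def fake_model(batch):
    return [x * x for x in batch]

data = list(range(100))
# Split the dataset into 4 shards of 25 records each.
shards = [data[i:i + 25] for i in range(0, len(data), 25)]

# Map the model over shards in parallel -- the shape of batch inference
# that Ray Data and BatchPredictor generalize across a whole cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    predictions = [y for shard_out in pool.map(fake_model, shards)
                   for y in shard_out]
```

Ray replaces the thread pool with a cluster of workers and adds data ingest, fault tolerance, and resource scheduling on top of this same map-over-shards idea.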

#### Resources

-   GitHub repository with relevant resources, including notebooks, setup instructions, reference implementations for the coding exercises, and a README providing an overview.

-   Participants will be able to use a pre-configured compute cluster for the duration of the tutorial.

#### Audience

-   Intermediate-level Python and ML researchers and developers.

-   Those interested in scaling ML workloads from a laptop up to a cluster.

#### Prerequisites

-   Familiarity with basic ML concepts and workflows.

-   No prior experience with Ray or distributed computing.

-   (Optional) [Overview of Ray](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb) notebook as background material.

#### Key Takeaways

-   Understand common challenges and trade-offs when scaling CV pipelines from laptop to cluster.

-   Hands-on skill in using Ray AIR to scale CV workloads, including model training, fine-tuning, and inference.

#### Outline

Challenges with scaling ML systems (10 min)

-   Why are distributed systems so important to ML in general and CV pipelines in particular? How does Ray provide a common ML compute layer that scales from laptop to cluster?

Hands-on lab 1: Composing CV pipelines (60 min)

-   Examples introducing Ray Data, Train and Tune libraries. Participants will practice composing components to scale an end-to-end ML workload.

-   Ray Data - Ingest, shard and preprocess the data.

-   Ray Train - Train a model on the preprocessed training set.

-   Ray Tune - Run a hyperparameter tuning experiment.

-   BatchPredictor - Perform batch inference on the test set.

(10 minute break)

Hands-on lab 2: Model training and fine-tuning (60 min)

-   Learn about approaches to scaling model training.

-   Code: Implement transformer model fine-tuning with Ray Train and evaluate performance.

(10 minute break)

Hands-on lab 3: Batch inference (60 min)

-   Learn about and evaluate several distributed batch inference design patterns.

-   Implement distributed batch inference through hands-on coding exercises.

-   Code: Run batch inference using vision transformer and evaluate performance.

Next steps (10 min)

-   How to get involved with Ray and access further resources.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/7BRY3J/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/7BRY3J/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 104' guid='9a119320-e589-5728-ba57-c80aa608ad57'>
            <event guid='33d8e010-1319-5ed1-b1bc-90aa3162d7f3' id='75993' code='ZCUDYT'>
                <room>Classroom 104</room>
                <title>Controlling Self-Landing Rockets Using CVXPY</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>In this tutorial, attendees will learn hands-on how to optimize the trajectory of a self-landing rocket in a real-time simulated setting using CVXPY, a Python-embedded modeling language for convex optimization. We integrate the optimization with the Kerbal Space Program, to showcase a complete landing mission without human intervention, ideally in one piece. CVXPY allows solving complex problems declaratively, letting convex optimization find an optimal way of meeting target conditions with respect to an objective function. After solving the initial problem, attendees will use a selection of advanced CVXPY features while making the example gradually more realistic.</abstract>
                <slug>2023-75993-controlling-self-landing-rockets-using-cvxpy</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/ZCUDYT/rocket_EeNrpQX_RLOpdPm.png</logo>
                <persons>
                    <person id='77033'>Philipp Schiele</person><person id='77072'>Steven Diamond</person><person id='77082'>Eric Sager Luxenberg</person>
                </persons>
                <language>en</language>
                <description>After giving an introduction to CVXPY at SciPy 2022, we want to follow up and provide an in-depth, worked example of one of the most frequently inquired-about applications of convex optimization: controlling a self-landing rocket. Indeed, this is also one of the most complex problems to solve, and practical usefulness has only recently been achieved.
Nevertheless, CVXPY makes it possible to elegantly solve a simplified, yet at its core realistic, version of the problem. The application serves as a common thread that attendees can follow while being introduced to convex optimization and to CVXPY in particular, as well as to some of the more advanced features of the library.

The tutorial will start by introducing the problem of controlling a self-landing rocket and why it is important. We will then provide an overview of convex optimization and how it can be used to solve this problem. Next, we will dive into the details of CVXPY, starting from a simple hello-world example and gradually moving towards expressing the full problem. Stating the problem should look familiar to anyone who has worked with NumPy before, and only requires high-school level physics knowledge to understand. 
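
As a rough sketch of the kind of problem statement involved, a simplified fuel-optimal landing problem with linearized discrete-time dynamics can be written as a convex program (the symbols here are illustrative, not the tutorial&#8217;s exact formulation):

```latex
\min \sum_{t=0}^{T-1} \|u_t\|_2
\quad \text{s.t.} \quad
x_{t+1} = A x_t + B(u_t + g), \ t = 0,\dots,T-1,
\qquad x_0 = x_{\mathrm{init}}, \ x_T = 0,
\qquad \|u_t\|_2 \le F_{\max}, \ t = 0,\dots,T-1.
```

Here x_t stacks the rocket&#8217;s position and velocity, u_t is the thrust vector, g accounts for gravity, and F_max bounds the available thrust; every constraint is linear or a norm bound, so the problem is convex and CVXPY can state it almost verbatim.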

We have integrated our problem with the Kerbal Space Program, which fits the theme of our tutorial nicely. It allows us to make our problem gradually more realistic by incorporating conditions such as drag, fuel usage, and wind. We will run the scripts written by the attendees to see if they manage to land the rocket safely.

As we solve the problem, we will showcase some of the more advanced features of CVXPY, including DPP and CVXPYgen, which can give a significant speedup in practice. 

By the end of the tutorial, attendees will have a thorough understanding of how to use CVXPY to solve complex optimization problems, and how to apply it to real-world problems such as controlling a self-landing rocket. No prior knowledge of convex optimization is assumed, making this tutorial accessible to beginners in the field.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/ZCUDYT/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/ZCUDYT/feedback/</feedback_url>
            </event>
            <event guid='89db3852-08ea-56ad-8b34-7a7dc2fea1d0' id='76344' code='NFWZXD'>
                <room>Classroom 104</room>
                <title>How the Little Jupyter Notebook Became a Web App: Managing Increasing Complexity with nbdev</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>Already familiar with ipywidgets, but ready to take your skills to the next level?  In this tutorial we walk through what it takes to transform an exploratory Jupyter Notebook into a mature web application. Web apps can be a valuable product of collaboration between researchers and software developers, and the packages used in this tutorial were selected to support this relationship, starting with using JupyterLab as an integrated development environment. Attendees will learn how to design and document a scientific web application that accommodates increasing complexity, but is also inheritable by the researchers who maintain them in the long run.</abstract>
                <slug>2023-76344-how-the-little-jupyter-notebook-became-a-web-app-managing-increasing-complexity-with-nbdev</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='77085'>Nicole Brewer</person><person id='77087'>Ludovico Bianchi</person>
                </persons>
                <language>en</language>
                <description>Our tutorial should appeal to scientists and software developers alike. We hope to convince you that web applications are excellent tools for improving the accessibility of scientific data and software and provide you with the know-how to develop one that accommodates growth and collaboration. The structure of the tutorial is based on a true story about a little Jupyter Notebook&#8230;

One day, a research scientist created the Notebook to do some exploratory development. After a while, that Notebook grew into a reusable workflow for creating a helpful visualization the scientist often ran for different parameters. The scientist eventually recognized that their workflow might be worth sharing, so they started working with a software developer to help the little Notebook grow into a web application. At first, the developer replaced hardcoded inputs with interactive ipywidgets, and used Voilà to hide the code cells from users. That was the day the little Notebook became a dashboard, but its journey didn&#8217;t stop there. Over time, the researcher had new ideas about features they wanted to add, so the developer transformed the dashboard into a tab-based web application that could accommodate more steps with rich instructions.

But there was a problem. The Notebook started experiencing growing pains. It contained more code than was comfortable. The developer made the Notebook feel better by offloading some of the code into Python modules. This worked well for logic, but as the application grew more complex, it was important to develop nested widget components in the visual Notebook environment. At first, the developer coded views in extra notebooks, and then copy-pasted the code into the Python module, but this became laborious and confusing. One day, the developer started to write a tool that would export code cells from the notebooks into Python modules. That way, the developer could code entirely in notebooks, and they could leave in all the markdown and code cells that documented what they were thinking as they designed the tool. That was the day that the little Notebook became a literate Notebook family.
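
The export mechanism the developer reinvented is, in nbdev, driven by comment directives at the top of notebook cells; a cell like the following (the `greet` function is a made-up example) is copied verbatim into the generated Python module, while undirected cells stay in the notebook as documentation:

```python
#| default_exp core
#| export
def greet(name):
    """A toy function; nbdev's `#| export` directive marks this cell
    for export into the generated module (here, `core.py`)."""
    return f"Hello, {name}!"
```

Running nbdev&#8217;s export step then writes every `#| export` cell into the module named by `default_exp`, so notebooks remain the single source of truth.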

Not long after, the developer was listening to the Talk Python to Me podcast, and heard someone mention a tool called nbdev. The tool was just like the one the developer had made, except it had many more useful features, like notebook-friendly git commits and merges. Eureka! Finally, the developer could accommodate increasing complexity with simple tools. When the developer gave the Notebook family back, the researchers were able to maintain it themselves, without having to download scary IDEs, extensions, or environments. And they all lived happily ever after.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/NFWZXD/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/NFWZXD/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 105' guid='5c6f7634-c90c-5400-be39-1771ee7e728f'>
            <event guid='a12b13c1-3064-5401-b752-3f2807db0832' id='76080' code='VBZ9PN'>
                <room>Classroom 105</room>
                <title>Building better data structures, APIs and configuration systems for scientific software using Pydantic</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>This tutorial is an introduction to Pydantic, a library for data validation and settings management using Python type annotations. Using a semi-realistic ML and/or scientific software pipeline scenario, we demonstrate how Pydantic can be used to support type validation for scientific data structures, APIs, and configuration systems. We show how the use of Pydantic in scientific and ML software leads to a more pleasant user experience as well as more robust and easier-to-maintain code. Minimal knowledge of Python type annotations, class definitions, and data structures will be helpful
for beginners but is not required.</abstract>
                <slug>2023-76080-building-better-data-structures-apis-and-configuration-systems-for-scientific-software-using-pydantic</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='77287'>Nick Langellier</person><person id='77022'>Axel Donath</person>
                </persons>
                <language>en</language>
                <description>One of the most controversial design choices of Python is its use of dynamic typing. Dynamically typed variables can often confuse beginners, and even for experts they are a common source of hard-to-find bugs. For this reason, type annotations were introduced into the language later, to allow for static code analysis and more detailed source code documentation. Pydantic is a Python library that makes use of these type annotations to parse and validate types for class-based data structures. In recent years Pydantic has gained tremendous popularity among web developers and is now the most widely used data validation library for Python. In this tutorial we show how the use of Pydantic can help build better data structures, APIs, and configuration systems for scientific Python packages as well. In many cases the validated types lead to a more pleasant user experience as well as more robust and easier-to-maintain code.

In the first block we introduce the basics of the library, such as the concept of Pydantic models, type annotations, and atomic types such as int, float, and str. We show how types are parsed and how models can be configured to forbid extra attributes. At the end of the block, participants will implement their first Pydantic model and explore basic configuration settings.
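
A minimal sketch of such a first model (the field names are invented for illustration; the behavior shown is Pydantic&#8217;s default lax parsing):

```python
from pydantic import BaseModel, ValidationError

class Measurement(BaseModel):
    sensor_id: int
    value: float
    unit: str

# Well-typed input is parsed, and compatible strings are coerced:
m = Measurement(sensor_id="7", value="3.14", unit="K")

# Incompatible input raises a ValidationError with a precise report:
try:
    Measurement(sensor_id="not-a-number", value=1.0, unit="K")
    raised = False
except ValidationError:
    raised = True
```

The class body is nothing but type annotations, yet instantiation validates and coerces every field, which is the core idea the rest of the tutorial builds on.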

We then proceed with the introduction of more complex types, such as typed dicts, Enums, and datetime objects. We will also cover custom types, which can be used to build nested models. Then we introduce the basics of type validation for multiple scenarios, such as pre- and post-init validation and root validation. At the end of this block we will cover the topic of dynamic model creation. In the following hands-on session, participants will implement a more complex Pydantic model representing the response from a weather data API, at multiple levels of difficulty.

The subsequent block will be dedicated to serialization and deserialization of Pydantic models. We will first motivate the need and then introduce the JSON and YAML data formats. We will show how to support custom types for JSON serialization and give an overview of configuration options related to serialization. We will conclude with remarks on performance when serializing a large number of models. In the corresponding hands-on exercise, participants will use the weather data structure to build a small configurable data processing pipeline which visually compares weather forecast data from different models.

Finally we will give a summary and key takeaways of the tutorial and recommend additional resources for learning Pydantic.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/VBZ9PN/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/VBZ9PN/feedback/</feedback_url>
            </event>
            <event guid='48d3eabf-2c09-592c-bd85-e69ece0495b1' id='76310' code='RKV3PZ'>
                <room>Classroom 105</room>
                <title>Meet your coding best friend: VS Code&#128150; - A hands-on tutorial on how to get the most out of the world&#8217;s most popular Python editor</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-10T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>Visual Studio Code (VS Code) is a free code editor that runs on Windows, Linux, macOS and in your browser. This tutorial aims at Python programmers of all levels who are already using VS Code or are interested in doing so, and will take them from zero (installing VS Code) to a production setup for Python development. We will cover starter topics, such as customizing the UI and extensions, using code autocomplete, code navigation, debugging, and Jupyter Notebooks. We will also go into advanced use cases, such as remote development, pair programming via Live Share, Dev containers, GitHub Codespaces &amp; more.</abstract>
                <slug>2023-76310-meet-your-coding-best-friend-vs-code-a-hands-on-tutorial-on-how-to-get-the-most-out-of-the-world-s-most-popular-python-editor</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/RKV3PZ/Screenshot_2023-03-01_104725_3q_QVfJV1n.png</logo>
                <persons>
                    <person id='77044'>Guen Prawiroatmodjo</person><person id='77269'>Sarah Kaiser</person><person id='77100'>Leopold Talirz</person>
                </persons>
                <language>en</language>
                <description>After this tutorial you will walk away with a fully equipped VS Code editor, ready to work on your next project or contribute to your favorite Scientific Python library. We will also cover tips and tricks for data science and visualization and some advanced features you may not have heard of yet.


We will cover the following topics:

**The basics: VS Code editor and Python extension overview**. We will show you how to set up your editor, where to find the most useful menus and settings, and how to set up your workspace to start developing. We&#8217;ll explain how to find and install our favorite extensions for Python, and how to use VS Code with Git.

**Scientific Python development tips and tricks**. In the second hour, we will cover how to navigate and test your Python code like a pro. We will also cover some data science tools that will help you run your favorite data analysis projects directly in VS Code, as well as some GitHub features to test and document your code.

**Advanced development Part I: Work where you want to**. The third hour of the tutorial explains how to use the remote development extensions pack to hook up VS Code to a remote resource, like a powerful VM in the cloud, a local Linux instance, a Docker instance or GitHub Codespaces.

**Advanced development Part II: Tools that make you look like you know magic**. It&#8217;s time to have some fun and try out cool features for remote collaboration and code generation that will make you feel like the future is here.

**Wrap-up &amp; Epilogue**. We&#8217;ll recap what we&#8217;ve learned in this tutorial and share reading and learning materials to help you on your VS Code journey.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/RKV3PZ/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/RKV3PZ/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='2' date='2023-07-11' start='2023-07-11T04:00:00-05:00' end='2023-07-12T03:59:00-05:00'>
        <room name='Classroom 106' guid='4c797b32-36af-5982-a9c4-18fed0f86568'>
            <event guid='966b8bcb-e6d1-5a21-9294-d57ac184321e' id='76236' code='8TAA7K'>
                <room>Classroom 106</room>
                <title>Idiomatic Pandas</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>Pandas can be tricky, and there is a lot of bad advice floating around. This tutorial will cut through some of the biggest issues I&apos;ve seen with Pandas code after working with the library for a while and writing three books on it.

We will discuss:

* Proper types
* Chaining
* Aggregation
* Debugging</abstract>
                <slug>2023-76236-idiomatic-pandas</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/8TAA7K/ms-idiomaticpd-course_Ln0MMD9_AVVvThF.png</logo>
                <persons>
                    <person id='76832'>Matt Harrison</person>
                </persons>
                <language>en</language>
                <description>Are you confused or frustrated with Pandas? Or maybe, when you come back to your own Pandas code later, you find it confusing or difficult to work with.

I&apos;ve taught Pandas to thousands in corporate settings, at universities, and virtually. I&apos;ve also seen the bad code that my students write, and I have strong opinions on how to correct it.
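
As a taste of one of those corrections, method chaining, applied to a tiny invented frame (the column names and data are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Austin", "Austin", "Dallas"],
                   "temp_f": ["101", "99", "97"]})

# One chained pipeline instead of repeated reassignment and temp variables.
result = (df
    .assign(temp_f=lambda d: d.temp_f.astype("int16"),   # proper types first
            temp_c=lambda d: (d.temp_f - 32) * 5 / 9)    # sees the cast column
    .query("temp_c > 37")                                # filter rows
    .groupby("city")
    .temp_c
    .mean()
)
```

Each step returns a new frame, so the pipeline reads top to bottom, is easy to debug by commenting out a line, and never mutates `df` in place.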

This workshop assumes you know some Pandas and want to apply idiomatic constructs to existing code. There will be some lecture and then breakout time to apply the constructs on your own:

We will cover

* Types
* Chaining
* Mutation
* Aggregation
* Debugging</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/8TAA7K/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/8TAA7K/feedback/</feedback_url>
            </event>
            <event guid='a85b679e-9832-5316-b247-a3a58281b1df' id='76231' code='MQQJKG'>
                <room>Classroom 106</room>
                <title>Advanced Dask Tutorial</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>Dask is a Python library for scaling and parallelizing Python code. It provides familiar, high-level interfaces to extend the SciPy ecosystem to larger-than-memory or distributed environments, as well as lower-level interfaces for parallelizing custom algorithms. In this tutorial, we&#8217;ll cover advanced features of Dask like applying custom operations to Dask DataFrames and arrays, debugging computations, diagnosing performance issues, and more. Attendees should walk away with a deeper understanding of Dask&#8217;s internals, an introduction to more advanced features, and ideas of how they can apply these features effectively to their own workloads.</abstract>
                <slug>2023-76231-advanced-dask-tutorial</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='77014'>Naty Clementi</person><person id='77101'>James Bourbeau</person><person id='77114'>Julia Signell</person><person id='76929'>Charles Blackmon-Luca</person>
                </persons>
                <language>en</language>
                <description>Dask is a popular Python library for scaling and parallelizing Python code on a single machine or across a cluster. It provides familiar, high-level interfaces to extend the SciPy ecosystem (e.g. NumPy, pandas, scikit-learn) to larger-than-memory or distributed environments, as well as lower-level interfaces for parallelizing custom algorithms and workflows. In this tutorial, we&#8217;ll cover advanced features of Dask like applying custom operations to Dask DataFrames and arrays, inspecting the internal state of clusters, debugging distributed computations, diagnosing performance issues, and more. Attendees should walk away with a deeper understanding of Dask&#8217;s internals, an introduction to more advanced features, and ideas of how they can apply these features effectively to their own data-intensive workloads. Basic Dask experience is required, though knowledge of Dask&#8217;s internals is not. This hands-on tutorial is intended for existing or aspiring Dask users looking to gain a deeper understanding of more intermediate and advanced topics.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/MQQJKG/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/MQQJKG/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 101' guid='b97eb3be-1b75-5ca2-a13e-34b3fb214e4c'>
            <event guid='075f71f3-7f12-5c89-bf50-0bb3c739d3cd' id='76117' code='XBUC8S'>
                <room>Classroom 101</room>
                <title>Thinking in arrays</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>Despite its reputation for being slow, Python is the leading language of scientific computing, which generally needs large-scale (fast) computations. This is because most scientific problems can be split into &quot;metadata bookkeeping&quot; and &quot;number crunching,&quot; where the latter is performed by array-oriented (vectorized) calls into precompiled routines.

This tutorial is an introduction to array-oriented programming. We&apos;ll focus on techniques that are equally useful in NumPy, Pandas, xarray, CuPy, Awkward Array, and other libraries, and we&apos;ll work in groups on three class projects: Conway&apos;s Game of Life, evaluating decision trees, and computations on ragged arrays.</abstract>
                <slug>2023-76117-thinking-in-arrays</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='76894'>Jim Pivarski</person>
                </persons>
                <language>en</language>
                <description>Array-oriented programming is a paradigm in its own right, challenging us to think about problems in a different way. From APL in 1966 to NumPy today, most users of array-oriented programming are scientists, analyzing or simulating data. This tutorial focuses on the thought process: all of the problems are to be solved in an imperative way (for loops) and an array-oriented way. Matplotlib will be used for plotting, but all plotting commands will be given (they are not prerequisites).

We&apos;ll alternate between short lectures and small group projects (3&#8210;4 people each), in which tutors will be available for help, followed by a guided tour through solutions, alternatives, and trade-offs.
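
As a flavor of the imperative-versus-array contrast, here is Game of Life neighbor counting on a wrap-around board done both ways (a sketch, not the project&#8217;s reference solution):

```python
import numpy as np

rng = np.random.default_rng(0)
board = rng.integers(0, 2, size=(6, 6))

# Imperative: visit every cell and its eight neighbors with Python loops.
def neighbors_loop(b):
    n = np.zeros_like(b)
    rows, cols = b.shape
    for i in range(rows):
        for j in range(cols):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di or dj:
                        n[i, j] += b[(i + di) % rows, (j + dj) % cols]
    return n

# Array-oriented: shift the whole board eight ways and sum "all at once".
def neighbors_array(b):
    return sum(np.roll(np.roll(b, di, 0), dj, 1)
               for di in (-1, 0, 1) for dj in (-1, 0, 1)
               if di or dj)

assert (neighbors_loop(board) == neighbors_array(board)).all()
```

Both give identical answers; the array version replaces four nested loops with eight whole-board shifts, which is the mental shift the tutorial practices.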

Here is a general outline:

**0:00&#8210;0:20 (20 min):** Array-oriented programming as a paradigm: APL, SPEAKEASY, IDL, MATLAB, S, R, NumPy. Overview of basic and advanced slicing, broadcasting, and dimensional reduction. Powerful concept: element indexing is function application and advanced slicing is function composition.

**0:20&#8210;0:40 (20 min):** Project 1: Conway&apos;s Game of Life. Calculating the number of neighbors and updating the board &quot;all at once.&quot;

**0:40&#8210;0:55 (15 min):** Break

**0:55&#8210;1:15 (20 min):** Guided discussion of solutions to Project 1.

**1:15&#8210;1:35 (20 min):** Array-oriented programming and the &quot;iteration until converged&quot; problem. How to update arrays in which some elements have converged and others haven&apos;t.

**1:35&#8210;1:55 (20 min):** Project 2: evaluating a decision tree, by walking over each node individually (as in a computer science class) and by million-ball Plinko! (how Scikit-Learn actually does it).

**1:55&#8210;2:10 (15 min):** Break

**2:10&#8210;2:30 (20 min):** Solutions to Project 2.

**2:30&#8210;2:45 (15 min):** Demo: Mandelbrot (fractal) picture, computed 11 different ways: Python, NumPy, C++ (pybind11), Cython, Numba imperative, Numba vectorized, CuPy, CuPy with custom CUDA, Numba-CUDA, JAX-CPU, and JAX-GPU. Discussion of performance and trade-offs.

**2:45&#8210;3:05 (20 min):** Non-rectilinear (ragged) arrays and arrays of arbitrary data structures: Apache Arrow and Awkward Array.

**3:05&#8210;3:25 (20 min):** Project 3: a big, ragged dataset: computing lengths of taxi trips from polylines with varying numbers of edges. Since this is a big dataset, we&apos;ll also look at ways to scale it up with Dask.

**3:25&#8210;3:40 (15 min):** Break

**3:40&#8210;4:00 (20 min):** Solutions to Project 3.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/XBUC8S/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/XBUC8S/feedback/</feedback_url>
            </event>
            <event guid='1c6ca144-6706-5d1c-ac99-2d857aa1d615' id='76196' code='LJQPVT'>
                <room>Classroom 101</room>
                <title>SymPy Introductory Tutorial</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>SymPy is a Python library for symbolic mathematics. This tutorial will introduce SymPy to a beginner audience. It will cover an introduction to symbolic computing, basic operations, simplification, calculus, matrices, advanced expression manipulation, code generation, and selected advanced topics. The tutorial does not have any prerequisites beyond knowledge of Python and basic freshman level mathematics. It will be presented with Jupyter notebooks with regular exercises for the attendees. After attending this tutorial, attendees will be able to start using SymPy to solve their own problems.</abstract>
                <slug>2023-76196-sympy-introductory-tutorial</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='77212'>Sangyub Lee</person><person id='76839'>Aaron Meurer</person><person id='77200'>Anutosh Bhat</person>
                </persons>
                <language>en</language>
                <description>SymPy is a pure Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.

SymPy can be used in a wide array of applications. This includes basic usage as an interactive calculator, symbolically modeling problems in physics and engineering, generating fast numeric code, and use in a Python library representing custom symbolic objects. Anyone interested in learning how to get started using SymPy for any such applications should attend this tutorial.

This tutorial is a beginner level tutorial and only requires knowledge of how to use Python. Knowledge of mathematics up to basic calculus is recommended. More advanced mathematical topics will be explained as part of the tutorial. Knowledge of other Python libraries such as NumPy is NOT required. There will be a short section near the end on how to interface SymPy with other libraries such as NumPy, but the majority of the tutorial does not make use of any additional libraries.

This tutorial will cover the basics of how to use SymPy, and will also touch on some advanced topics. We will start by discussing the basics of how to build mathematical expressions with SymPy and manipulate them. We will look at how to avoid some of the more common pitfalls and gotchas when using SymPy. We will then move on to the most common functions in SymPy, such as simplification functions, solvers, functions for doing operations from calculus such as differentiation and integration, and matrices. Finally, as time permits, we will look into more advanced topics, such as code generation, extending SymPy, interfacing with other libraries such as NumPy, and additional SymPy submodules.
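
As a small taste of that workflow (a minimal sketch, not the tutorial notebooks): building an expression, simplifying it, doing calculus, and solving an equation.

```python
import sympy as sp

# Build a symbolic expression, then simplify, differentiate,
# integrate, and solve with it.
x = sp.symbols('x')
expr = sp.sin(x)**2 + sp.cos(x)**2
print(sp.simplify(expr))                        # 1
print(sp.diff(x**3, x))                         # 3*x**2
print(sp.integrate(sp.exp(-x), (x, 0, sp.oo)))  # 1
print(sp.solve(x**2 - 2, x))                    # [-sqrt(2), sqrt(2)]
```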

After attending this tutorial, attendees will be able to start using SymPy to solve their own problems. They will also be armed with the knowledge of how to discover additional, more specific functionality in SymPy that may be required for their particular use case.

We will expect tutorial attendees to have the tutorial materials installed on their computers prior to the tutorial. This way we will not waste time in the beginning getting things installed. The tutorial will also be available online using either Binder or JupyterLite for those that do not wish to install things locally.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/LJQPVT/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/LJQPVT/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 202' guid='53c488a7-1e0d-5041-aabf-cfaac5de28b1'>
            <event guid='2ace4026-9559-55e2-972c-887b59590dda' id='76159' code='7NLG3F'>
                <room>Classroom 202</room>
                <title>Explore generative models in AI with Keras</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>This tutorial introduces Keras, a powerful deep learning library, and demonstrates how to enable generative models using Keras. The first part delves into the Keras training pipeline and extended modules. The second part explores image generative models using Stable Diffusion, with live coding examples to generate novel images and teach the model new concepts. Finally, you&apos;ll explore language generative models, including GPT and BART, with a live coding example that demonstrates how to enable these models. By the end of this tutorial, you&apos;ll have a solid understanding of how to harness Keras to create powerful AI applications.</abstract>
                <slug>2023-76159-explore-generative-models-in-ai-with-keras</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='77018'>Divyashree Shivakumar Sreepathihalli</person><person id='77031'>Chen Qian</person>
                </persons>
                <language>en</language>
                <description>In this tutorial, we will explore the powerful Keras library and the world of generative models in AI. We will begin with a brief introduction to Keras, its history, and its value in creating neural networks. We will then dive into the Keras training pipeline, exploring sequential, functional, and custom models, optimizers, loss and metrics, and the training API. We will also cover Keras extended modules for NLP, CV, and GNN, and walk through an end-to-end example to create and optimize a model.

In the second part of the tutorial, we will focus specifically on the Stable Diffusion architecture for image generation. We will explain Stable Diffusion, demonstrate a latent-space walkthrough, and generate images using a Colab example. Additionally, we will cover image inpainting and teaching Stable Diffusion new concepts, a technique called textual inversion.

Finally, we will explore how generative models work in NLP, focusing on the GPT architecture, GPT-2, BART, and the mobile playbook. We will demonstrate XLA compilation and show how general support for text generation can be achieved with one API. By the end of this tutorial, attendees will have a solid understanding of Keras and generative models and how they can be used to create powerful AI applications.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/7NLG3F/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/7NLG3F/feedback/</feedback_url>
            </event>
            <event guid='2d8c13cf-f96f-59ed-b7ad-61c17fc578fd' id='76054' code='F3HAUQ'>
                <room>Classroom 202</room>
                <title>Resampling and Monte Carlo Methods in SciPy.stats</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>Resampling and Monte Carlo statistical techniques are surprisingly intuitive, and they are often more flexible and accurate than their better-known analytical counterparts. In this tutorial, participants will develop their intuitive understanding of frequentist statistics and apply it using three functions in `scipy.stats` - `monte_carlo_test`, `permutation_test`, and `bootstrap` - to dramatically expand the statistical analyses they can perform with the SciPy Library.</abstract>
                <slug>2023-76054-resampling-and-monte-carlo-methods-in-scipy-stats</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/F3HAUQ/exact_vs_approximate_cdjfyvk_tUfM4gV.png</logo>
                <persons>
                    <person id='76876'>Matt Haberland</person><person id='77270'>Albert Steppi</person>
                </persons>
                <language>en</language>
                <description>Scientists and engineers often seek to answer questions of the following forms.

1. Is my sample drawn from this hypothesized distribution?
2. Are my samples drawn from the *same* distribution?
3. Based on these samples, what can I infer about the populations from which they were drawn?

Common statistical procedures used to answer questions of these forms include:

1. the one-sample t-test (&quot;Is my sample drawn from a distribution with population mean `m`?&quot;),
2. the two-sample t-test (&quot;Are my two samples drawn from distributions with the same population mean?&quot;), and
3. the confidence interval of the mean (&quot;Given my sample, what can I say about the true value of the population mean?&quot;).

Such procedures are developed under technical assumptions (e.g., the samples were drawn from normally-distributed populations) that make the mathematics tractable, yet in practice, these assumptions can never be met exactly. Fortunately for science, the conclusions drawn from the procedures above are relatively insensitive to deviations from these assumptions&#8230; except when they&#8217;re not!

One solution is to abandon frequentist statistics in favor of another paradigm (Bayesian), but the approach suggested by this tutorial is to remove the assumptions, reduce reliance on the analytical approximations, and instead use computers to approximate (or even exactly calculate) responses to the original questions. This idea will lead us to three techniques: 

1. Monte Carlo tests (`scipy.stats.monte_carlo_test`)
2. Permutation tests (`scipy.stats.permutation_test`)
3. The Bootstrap (`scipy.stats.bootstrap`)

For many of the same reasons that arithmetic (sums and differences) seems simpler than calculus (integrals and derivatives), these techniques are relatively easy to grasp. Likewise, just as computational methods for integration, equation solving, and optimization can solve a wider variety of problems than analytical approaches, these computational statistical techniques are comparatively flexible and easy to apply.
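
To make that intuition concrete, here is a hand-rolled two-sample permutation test in plain NumPy, an illustrative sketch (with made-up data) of the idea that `scipy.stats.permutation_test` automates:

```python
import numpy as np

# Hand-rolled sketch of a two-sample permutation test: under the null
# hypothesis, group labels are exchangeable, so shuffle them and see how
# extreme the observed difference in means really is. Data are made up.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.5, 1.0, size=30)

observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])
n_resamples = 9999
count = 0
for _ in range(n_resamples):
    perm = rng.permutation(pooled)
    diff = perm[:30].mean() - perm[30:].mean()
    if abs(diff) >= abs(observed):
        count += 1
# Add-one correction keeps the p-value away from exactly zero.
p_value = (count + 1) / (n_resamples + 1)
print(p_value)
```

The SciPy functions add conveniences this sketch omits: vectorized batches of resamples, exact enumeration for small samples, and one-sided alternatives.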

During this tutorial, participants will write their own code to execute fundamental resampling and Monte Carlo algorithms and compare the results of their code against the equivalent functions in SciPy. They will apply their new understanding of SciPy&apos;s `monte_carlo_test`, `permutation_test`, and `bootstrap` functions to reproduce and extend the capabilities of SciPy&apos;s other statistics functions (e.g. to small samples, to discrete distributions). Through this tutorial, participants will improve their ability to apply existing statistical procedures to a given situation and gain the ability to *create* customized statistical procedures for demanding applications.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/F3HAUQ/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/F3HAUQ/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 203' guid='6cdfbdd8-9285-5a8b-83a6-3db1aa0b6606'>
            <event guid='2c2d7d23-927a-5cb5-bfee-632b0c2a0ac3' id='76143' code='ALSYBR'>
                <room>Classroom 203</room>
                <title>Data of an Unusual Size: A practical guide to analysis and interactive visualization of massive datasets</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>While most scientists aren&apos;t at the scale of black hole imaging research teams that analyze Petabytes of data every day, you can easily fall into a situation where your laptop doesn&apos;t have quite enough power to do the analytics you need.

In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on actual powerful machines on a public cloud provided by the presenters &#8211; starting from how the data is stored and read, to how it is processed and visualized.</abstract>
                <slug>2023-76143-data-of-an-unusual-size-a-practical-guide-to-analysis-and-interactive-visualization-of-massive-datasets</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/ALSYBR/tutorial_banner_nvqATRn_Yk33Ch3.png</logo>
                <persons>
                    <person id='76972'>Pavithra Eswaramoorthy</person><person id='76871'>Dharhas Pothina</person><person id='77257'>Christopher Ostrouchov</person>
                </persons>
                <language>en</language>
                <description>&quot;Big data&quot; refers to any data that is too large to handle comfortably with your current tools and infrastructure. As the leading language for data science, Python has many mature options that allow you to work with datasets that are orders of magnitude larger than what can fit into a typical laptop&apos;s memory.

This tutorial will help you understand how large-scale analysis differs from local workflows, the unique challenges associated with scale, and some best practices to work productively with your data.

By the end, you will be able to answer:

- What makes some data formats more efficient at scale?
- Why, how, and when (and when not) to leverage parallel and distributed computation (primarily with Dask) for your work?
- How to manage cloud storage, resources, and costs effectively?
- How interactive visualization can make large and complex data more understandable (primarily with hvPlot)?
- How to comfortably collaborate on data science projects with your entire team?
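
As a toy illustration of the chunked-computation idea that Dask generalizes (plain NumPy and an in-memory list stand in for real out-of-core chunked readers):

```python
import numpy as np

# Toy sketch: compute the mean of a dataset too big to load at once by
# streaming it chunk by chunk and combining partial results. This is the
# core idea that Dask generalizes to parallel and distributed settings.
def chunked_mean(chunks):
    total, count = 0.0, 0
    for chunk in chunks:               # each chunk fits in memory
        total += chunk.sum()
        count += chunk.size
    return total / count

# Ten chunks of 1,000 values standing in for data read lazily from disk.
chunks = [np.arange(i * 1000, (i + 1) * 1000, dtype=float) for i in range(10)]
result = chunked_mean(chunks)
print(result)  # 4999.5, the mean of 0..9999
```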

The tutorial focuses on the reasoning, intuition, and best practices around big data workflows, while covering the practical details of Python libraries like Dask and hvPlot that are great at handling large data. It includes plenty of exercises to help you build a foundational understanding within three hours.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/ALSYBR/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/ALSYBR/feedback/</feedback_url>
            </event>
            <event guid='08fb107d-03d8-54ca-95c6-a06fcb2e5c84' id='76094' code='QXAYRM'>
                <room>Classroom 203</room>
                <title>Xarray: Friendly, Interactive, and Scalable Scientific Data Analysis</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. This tutorial will introduce data scientists already familiar with Xarray to more intermediate and advanced topics, such as applying functions in SciPy/NumPy with no Xarray equivalent, advanced indexing concepts, and wrapping other array types in the scientific Python ecosystem.</abstract>
                <slug>2023-76094-xarray-friendly-interactive-and-scalable-scientific-data-analysis</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/QXAYRM/dataset-diagram-logo_PHL7b4e_VWFOnLS.png</logo>
                <persons>
                    <person id='76938'>Deepak Cherian</person><person id='77238'>Thomas Nicholas</person><person id='77228'>Anderson Banihirwe</person><person id='77250'>Jessica Scheick</person><person id='77245'>Don Setiawan</person><person id='77318'>Scott Henderson</person><person id='77019'>Negin Sobhani</person>
                </persons>
                <language>en</language>
                <description>Xarray is an open-source Python project that makes working with complex, multi-dimensional arrays elegant, intuitive, and efficient. Real-world datasets are often a collection of many related variables on a common grid rather than raw numbers. Such datasets are common in the disciplines of earth science, astronomy, biology, and finance. These datasets are more than just arrays of values: they have labels which describe how array values map to locations in dimensions such as space and time and metadata that describes how the data was collected and processed.

Xarray embraces this complexity and enables users to use dataset metadata such as dimension names and coordinate labels to easily analyze, manipulate, and visualize their datasets. For example, the Pandas-inspired Xarray label-based syntax `temperature.sel(place=&quot;Boston&quot;)` is more intuitive and less error-prone than the positional NumPy syntax `temperature[0]`.
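
A minimal, made-up example of that label-based selection (the data and coordinate values here are invented for illustration):

```python
import numpy as np
import xarray as xr

# A small made-up DataArray: 2 places x 2 time steps of temperatures.
temperature = xr.DataArray(
    np.array([[20.5, 21.0], [25.1, 24.8]]),
    dims=("place", "time"),
    coords={"place": ["Boston", "Austin"], "time": [0, 1]},
)
print(temperature.sel(place="Boston").values)  # row selected by label
print(temperature[0].values)                   # same row, but only by position
```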

This hands-on tutorial will introduce data scientists already familiar with Xarray to more advanced concepts. All material will be presented via Jupyter Notebooks, with participants actively coding and performing exercises to solidify understanding of key concepts. The tutorial intersperses teaching intermediate to advanced Xarray concepts with increasingly complex real-world data analysis tasks.

The participant learning goals for the tutorial are to:

1. Effectively use Xarray&#8217;s powerful multidimensional indexing operations
2. Become familiar with important parts of Xarray&#8217;s computational API
3. Understand how to extend Xarray&#8217;s built-in capabilities with custom computation functions
4. Understand how Xarray fits in with other array types in the scientific Python ecosystem

The structure of our tutorial is based on our extensive experience teaching Xarray over the past few years, including numerous similar tutorials at international conferences like SciPy, as well as in formal classes taught at the National Center for Atmospheric Research and the University of Washington. 

The tutorial will be presented using [Nebari](https://scipy.quansight.dev), which will facilitate interactive computation and a consistent computational environment without requiring participants to install any software. Tutorial material will be available online ([link](https://tutorial.xarray.dev/workshops/scipy2023/README.html)) and we will ensure that proper environment files are available for participants who prefer running the tutorial locally. Participants are expected to have some familiarity with Jupyter notebooks, NumPy, Pandas, and Xarray. No specific domain knowledge (e.g. geoscience) is required to effectively participate in this tutorial.

If you are new to Xarray then please go through last year&#8217;s tutorial ([link](https://tutorial.xarray.dev/workshops/scipy2022/README.html#scipy-2022)) prior to attending, as our tutorial will assume attendees have a working understanding of these basic concepts.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/QXAYRM/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/QXAYRM/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Enthought - 200 W Cesar Chavez St' guid='3eea5324-23c3-5338-991d-bd3973d90a5d'>
            <event guid='db0735a1-62ad-5319-b929-950526a4d8c3' id='76044' code='Y8PPAA'>
                <room>Enthought - 200 W Cesar Chavez St</room>
                <title>SciPy Welcome Reception</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T18:30:00-05:00</date>
                <start>18:30</start>
                <duration>02:00</duration>
                <abstract>SciPy Welcome Reception hosted by Enthought. Tuesday, July 11, 6:30-8:30 at Enthought HQ, 200 W Cesar Chavez, Austin. Meet fellow attendees! Food and drinks served! 

[Walk](https://www.google.com/maps/dir/AT%26T+Hotel+and+Conference+Center,+University+Avenue,+Austin,+TX/Enthought,+200+W+Cesar+Chavez+St+Suite+202,+Austin,+TX+78701/@30.272726,-97.7524166,15z/data=!3m2!4b1!5s0x8644b508a6554d83:0x7edf0a3a6fece735!4m18!4m17!1m5!1m1!1s0x8644b59de7f3c8cf:0x7ef52b1ad3321879!2m2!1d-97.7404423!2d30.2816145!1m5!1m1!1s0x8644b509cdd787e9:0x108b9372002d7f55!2m2!1d-97.7463985!2d30.2642596!2m3!6e1!7e2!8j1689100200!3e2?entry=ttu), get a ride, or [take the bus](https://www.google.com/maps/dir/AT%26T+Hotel+and+Conference+Center,+University+Avenue,+Austin,+TX/Enthought,+200+W+Cesar+Chavez+St+Suite+202,+Austin,+TX+78701/@30.2737123,-97.7521933,15z/data=!3m1!5s0x8644b508a6554d83:0x7edf0a3a6fece735!4m19!4m18!1m5!1m1!1s0x8644b59de7f3c8cf:0x7ef52b1ad3321879!2m2!1d-97.7404423!2d30.2816145!1m5!1m1!1s0x8644b509cdd787e9:0x108b9372002d7f55!2m2!1d-97.7463985!2d30.2642596!2m3!6e1!7e2!8j1689100200!3e3!5i3?entry=ttu&amp;utm_medium=s2email&amp;shorturl=1) with [CapMetro](https://www.capmetro.org/app)!</abstract>
                <slug>2023-76044-scipy-welcome-reception</slug>
                <track></track>
                
                <persons>
                    <person id='77302'>200 W Cesar Chavez</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/Y8PPAA/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/Y8PPAA/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 103' guid='32d9d0e7-9b5f-5eb2-a2ba-0f6853d9777b'>
            <event guid='446d8312-a4c6-551e-a6af-57605c59cf04' id='76047' code='NDYWUR'>
                <room>Classroom 103</room>
                <title>Power up your work with compiling and profiling</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>In this workshop, we will introduce Numba, a JIT compiler designed to speed up numerical calculations. To most people it seems like a mystery: it sounds like magic, but how does it work? Under what conditions does it work? Because of this, new users find it hard to get started, and getting the hang of it involves a steep learning curve. This workshop will provide all the knowledge you need to make Numba work for you.</abstract>
                <slug>2023-76047-power-up-your-work-with-compiling-and-profiling</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='77237'>Cheuk Ting Ho</person>
                </persons>
                <language>en</language>
                <description>Have you ever heard of Numba? It is (mainly) a JIT (Just-In-Time) compiler that makes your math-heavy Python code run faster under certain conditions. To most people it seems like a mystery: it sounds like magic, but how does it work? Under what conditions does it work? Because of this, new users find it hard to get started, and getting the hang of it involves a steep learning curve.

This workshop requires no prior experience. However, it will be most beneficial to those who work with numerical data, such as data scientists and researchers. We do not expect participants to know how compilers work, or to have much understanding of how CPython works. Through exercises, we will explore in which situations Numba works, when it does not, and why. We will also look at some cases where we can make Numba work by changing a few things in your code. Hopefully, by the end of the workshop, you will have a better understanding of how Numba works before you even start using it. This knowledge can save you time on trial and error, making your experience of using it better.

**What Attendees will Learn**

By the end of the workshop, you will have some understanding of what Numba is and how it speeds up your Python code. You will also have a better idea of the limitations of Numba and when it does not help. You may also learn how to change your code so that it benefits from Numba&apos;s speedups. You will also learn some troubleshooting skills and where to look for help if you get stuck in the future.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/NDYWUR/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/NDYWUR/feedback/</feedback_url>
            </event>
            <event guid='c52e5a21-032b-5438-a4b9-f7e7b6bb6fff' id='76060' code='PTB7DU'>
                <room>Classroom 103</room>
                <title>Python for answering geospatial questions: exploring social inequity in our communities</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>We love Python, but maybe not enough to commit to an entire coding language. What if we could understand the fundamentals and begin working with real-time data in a single session? Actionable Python scripts and an understanding of the underlying frameworks might be enough of a springboard for larger exploration projects.</abstract>
                <slug>2023-76060-python-for-answering-geospatial-questions-exploring-social-inequity-in-our-communities</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/PTB7DU/Screenshot_2023-01-27_at_3.17.4_VGnopoG.png</logo>
                <persons>
                    <person id='77310'>bonny p mcclain</person>
                </persons>
                <language>en</language>
                <description>Recent advances in geospatial analysis and the availability of digital maps have revealed the importance of urban form and built infrastructure as fundamental to understanding the vulnerabilities and vitality of global and local cities.

By learning to write Python scripts (assuming little or no prior experience), we will discover morphometrics and what they reveal about the urban form of cities.</description>
                <recording>
                    <license></license>
                    <optout>true</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/PTB7DU/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/PTB7DU/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 104' guid='9a119320-e589-5728-ba57-c80aa608ad57'>
            <event guid='68c40c0d-0b77-50ad-bbab-233955bb0ffc' id='76278' code='GQ7PG3'>
                <room>Classroom 104</room>
                <title>An Introduction to Cloud-Based Geospatial Analysis with Earth Engine and Geemap</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>This tutorial is an introduction to cloud-based geospatial analysis with Earth Engine and the geemap Python package. We will cover the basics of Earth Engine data types and how to visualize, analyze, and export Earth Engine data in a Jupyter environment using geemap. We will also demonstrate how to develop and deploy interactive Earth Engine web apps. Throughout the session, practical examples and hands-on exercises will be provided to enhance learning. The attendees should have a basic understanding of Python and Jupyter Notebooks. Familiarity with Earth science and geospatial datasets is not required, but will be useful.</abstract>
                <slug>2023-76278-an-introduction-to-cloud-based-geospatial-analysis-with-earth-engine-and-geemap</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='76935'>Steve Greenberg</person><person id='77292'>Qiusheng Wu</person>
                </persons>
                <language>en</language>
                <description>The Earth is constantly changing, which creates significant challenges for the environment and human society. To tackle these challenges on a global scale, the Earth science community relies heavily on geospatial datasets that are collected through various means, such as satellite, aerial, and mobile sensors. However, the explosive growth of geospatial datasets over the past few decades has overwhelmed the Earth science community&apos;s capacity for storage, analysis, and visualization. Fortunately, the advent of cloud-computing platforms (e.g., Google Earth Engine) has made it possible to access, manipulate, and analyze large volumes of geospatial data on-the-fly. In recent years, Earth Engine has become increasingly popular in the geospatial community and has enabled numerous Earth science applications at local, regional, and global scales.

The geemap Python package is built upon the Earth Engine Python API and open-source mapping libraries. It allows Earth Engine users to interactively manipulate, analyze, and visualize geospatial big data in a Jupyter environment. Since its creation in April 2020, geemap has received over [2,500 GitHub stars](https://github.com/giswqs/geemap/stargazers) and is being used by over [800 projects](https://github.com/giswqs/geemap/network/dependents) on GitHub. More than [130 Jupyter notebook examples](https://geemap.org/tutorials/)  and an [open-access book](https://book.geemap.org/) are available for learning geemap. 

This tutorial consists of seven 30-minute sessions and three 10-minute breaks. During each hands-on session, the attendees will walk through Jupyter notebook examples on Google Colab with the instructors. At the end of each session, they will complete a hands-on exercise to apply the knowledge they have learned. The topics that will be covered in this tutorial include: (1) Introduction to Earth Engine and geemap; (2) Using Earth Engine data; (3) Visualizing Earth Engine data; (4) Analyzing Earth Engine data; (5) Exporting Earth Engine data; (6) Creating satellite timelapse animations; and (7) Developing and deploying interactive Earth Engine web apps. 

This tutorial is intended for scientific programmers, data scientists, geospatial analysts, and concerned citizens of Earth. Attendees should have a basic understanding of Python and the Jupyter ecosystem. Familiarity with Earth science and geospatial datasets is not necessary, but it will be helpful. For more information about Earth Engine and geemap, visit https://earthengine.google.com and https://geemap.org.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/GQ7PG3/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/GQ7PG3/feedback/</feedback_url>
            </event>
            <event guid='38d45603-035d-592c-a03d-263fb0f692fc' id='76107' code='YHEYVY'>
                <room>Classroom 104</room>
                <title>A Hands-on Introduction to Production-grade Data Science Orchestration with Flyte</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>One of the biggest challenges for data scientists and machine learning engineers alike is the friction caused by the iteration cycle between prototyping and production. It&#8217;s not enough to deploy a working model to a serving app. The iterative process itself needs to be a tight feedback loop between experimentation, data and model refinement, deploying to production, and dealing with data drift. In this tutorial, attendees will learn how to unify the common tools in the Python Data/ML scientific stack into a single orchestration plane using Flyte so that they can reduce the friction between prototyping and production.</abstract>
                <slug>2023-76107-a-hands-on-introduction-to-production-grade-data-science-orchestration-with-flyte</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='76883'>Niels Bantilan</person>
                </persons>
                <language>en</language>
                <description># Background
This tutorial interleaves lecture-style content and coding exercises to give data scientists, machine learning engineers, and data engineers hands-on experience with Flyte. Flyte is an open source workflow orchestrator that has a Python SDK for writing and scheduling execution graphs in a type-safe, reproducible manner. The topics and concepts covered in this tutorial are transferable to other similar orchestration tools, or would be useful for anyone who wants to build their own orchestrator. We will anchor the tutorial to five challenges of model development and deployment: scalability, data quality, reproducibility, recoverability, and auditability. Using Flyte, we&#8217;ll see how to address these challenges and abstract them out to give you a broader understanding of how to overcome them.

# Main Content
First I&#8217;ll define and describe what these five challenges mean in the context of model development. Then I&#8217;ll dive into the ways in which Flyte provides solutions to them, taking you through the reasoning behind Flyte&#8217;s data-centric and ML-aware design. We&apos;ll cover:

- **Flyte tasks and workflows**: the building blocks for expressing execution graphs.
- **Dynamic workflows**: for defining execution graphs at runtime.
- **Map tasks**: for scaling embarrassingly parallel workflows.
- **Plugins**: for extending Flyte&apos;s core functionality.
- **Type System**: the benefits of static type safety.
- **DataFrame Types**: for validating dataframe-like objects at runtime.
- **Reproducibility**: containerizing and hardening your execution graph.
- **Caching**: not wasting precious compute resources re-running nodes.
- **Recovering Executions**: building fault-tolerant pipelines.
- **Checkpointing**: saving progress within a node.
- **Flyte Decks**: creating rich static reports associated with your tasks.

Attendees will learn how Flyte distributes and scales computation, enforces static and runtime type safety, leverages Docker to provide strong reproducibility guarantees, implements caching and checkpointing to recover from failed model training runs, and ships with built-in data lineage tracking for full data pipeline auditability.

# Wrap-up
The tutorial will close with a summary of the main learnings, pointers to resources for learning more, and a discussion where attendees can ask questions.

# Resources
- [Flyte Repo](https://github.com/flyteorg/flyte)
- [Flyte Docs](https://docs.flyte.org/en/latest/)
- [SciPy 2022 Flyte talk](https://www.youtube.com/watch?v=EykWaiHHDNg)
- [SciPy 2020 Pandera talk](https://www.youtube.com/watch?v=PxTLD-ueNd4)</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/YHEYVY/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/YHEYVY/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 105' guid='5c6f7634-c90c-5400-be39-1771ee7e728f'>
            <event guid='568182a3-f27c-5573-86e5-cf71f03b3a62' id='76210' code='VKXXNH'>
                <room>Classroom 105</room>
                <title>hvPlot and Panel: Visualize all your data easily, from notebooks to dashboards</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T08:00:00-05:00</date>
                <start>08:00</start>
                <duration>04:00</duration>
                <abstract>This tutorial will show you how to use the Pandas or Xarray APIs you already know to interactively explore and visualize your data even if it is big, streaming, or multidimensional. Then just replace your expression arguments with widgets to get a web app that you can share as HTML+WASM or backed by a live Python server. These tools let you focus on your data rather than the API, build linked, interactive drill-down exploratory apps without running a web-technology software development project, and share the results without becoming an operations specialist.</abstract>
                <slug>2023-76210-hvplot-and-panel-visualize-all-your-data-easily-from-notebooks-to-dashboards</slug>
                <track>Tutorials</track>
                
                <persons>
                    <person id='77090'>Sophia Yang</person><person id='77092'>James A. Bednar</person>
                </persons>
                <language>en</language>
                <description>Python offers many powerful visualization tools, each with its own strengths and advantages, but few people have the time and interest to learn all the different APIs required to use these different tools. Luckily, a de facto standard API for data plotting has emerged in the Pandas .plot() API, which is now supported by many different plotting packages.

In this tutorial, you will learn how to use hvPlot, a high-level interactive plotting library that exposes the power of Bokeh, Matplotlib, Plotly, Datashader, and Cartopy using the same .plot API you may already know from using Pandas or Xarray&apos;s plotting interface. We&apos;ll also show you how to turn nearly any expression you can write with that API into a web app with plots and tables by simply substituting widgets for any parameters you want users to be able to select. Thanks to the HoloViz tools on which hvPlot is built, the resulting apps can easily handle big data (up to billions of rows on an ordinary laptop), remote data (either in Jupyter or in standalone apps), streaming data (using streaming dataframe libraries), geographical data (building on the geoscience software stack), and multidimensional data (using Xarray).

hvPlot&apos;s high-level interface should be sufficient for nearly all of the common data-exploration and data-analysis tasks you want to do with Pandas or Xarray, but in keeping with the HoloViz philosophy of &quot;shortcuts rather than dead ends&quot;, we&apos;ll also show you how and when to drop down to lower-level APIs when you need to, such as when building more complex apps using Panel, doing complex graphical data calculations using Datashader, or integrating plotting and interactivity into your own libraries using Param and HoloViews. 

With the techniques you learn in the hands-on exercises in this tutorial, you&apos;ll get the tools and know-how to effectively explore, analyze and visualize simple or complex, small or large, and static or dynamic data easily, concisely, and reproducibly. The resulting visualizations and apps can be shared as static images, simple HTML documents with limited interactivity, HTML+WASM documents with full Python-backed interactivity, or as Python apps deployed on a remote server. We expect participants to have previously used some sort of plotting tool and to be comfortable with Python and at least one array-based library (Numpy, Pandas, Xarray, CuPy, cuDF, Dask, etc.).</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/VKXXNH/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/VKXXNH/feedback/</feedback_url>
            </event>
            <event guid='03c3a049-2418-5d0f-a28e-38b470f8e151' id='76304' code='C9QZXU'>
                <room>Classroom 105</room>
                <title>Interactive data visualization with Bokeh</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2023-07-11T13:30:00-05:00</date>
                <start>13:30</start>
                <duration>04:00</duration>
                <abstract>Bokeh is a library for interactive data visualization. You can use it with Jupyter Notebooks or create standalone web applications, all using Python. This tutorial is a complete guide to Bokeh, where we start with a basic line plot and step-by-step make our way to creating a dashboard with several interacting components. This tutorial will be helpful for scientists who are looking to level-up their analysis and presentations, and tool developers interested in adding custom plotting functionality or dashboards.</abstract>
                <slug>2023-76304-interactive-data-visualization-with-bokeh</slug>
                <track>Tutorials</track>
                <logo>/media/2023/submissions/C9QZXU/bokeh-tutorial-session-image_7z_jvNo4ah.png</logo>
                <persons>
                    <person id='76972'>Pavithra Eswaramoorthy</person><person id='76957'>Ian Thomas</person><person id='77004'>Bryan Van de Ven</person><person id='76846'>Timo Metzger</person><person id='77211'>Victoria Adesoba</person>
                </persons>
                <language>en</language>
                <description>Bokeh is a Python library for creating interactive data visualizations. Bokeh allows you to create plots that can be displayed in a web browser, without needing to write HTML and JavaScript. In development for over 10 years, Bokeh has become a core tool for Python data science workflows, used for both exploratory analysis and in presentations. It is actively used in scientific domains including bioscience, geoscience, and astrophysics. Moreover, other useful libraries in the PyData ecosystem, like Dask, ArviZ, and the HoloViz tools, build custom applications and workflows with Bokeh.

In this tutorial, you&#8217;ll learn everything you need to know to create beautiful and powerful interactive plots from scratch. We&#8217;ll start by introducing core Bokeh concepts, creating simple static plots like line and bar charts, and customizing them. We&#8217;ll then gradually introduce layers of interactivity, create specialized plots like geographic maps, and discuss new features like contour plots. By the end, you will be able to create a complete interactive dashboard using Bokeh.
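
As a flavor of the starting point (a minimal sketch, not the tutorial material itself; the data are made up):

```python
from bokeh.plotting import figure, show

# Core Bokeh concepts: a figure, glyphs, and interactive tools
p = figure(title="Monthly measurements",
           x_axis_label="month", y_axis_label="value",
           tools="pan,wheel_zoom,hover,reset")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2,
       legend_label="series A")
# show(p)  # opens the plot in a browser, or inline in a notebook
```

From here, the tutorial layers on widgets, layouts, and server callbacks to build up a full dashboard.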

This tutorial is presented by Bokeh core team members and is fully hands-on with several examples and exercises in every section. We hope to enable more people, especially scientists and tool developers, to create pretty yet powerful visualizations.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/C9QZXU/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/C9QZXU/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='3' date='2023-07-12' start='2023-07-12T04:00:00-05:00' end='2023-07-13T03:59:00-05:00'>
        <room name='Amphitheater 204' guid='a7020de2-2717-51a7-bcd4-7e9831c5ab8f'>
            <event guid='352f9d4e-83c9-5d15-8ea2-e0749ef0b5d0' id='76382' code='DF8PVV'>
                <room>Amphitheater 204</room>
                <title>Out-Performing NumPy is Hard: When and How to Try with Your Own C-Extensions</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T10:45:00-05:00</date>
                <start>10:45</start>
                <duration>00:30</duration>
                <abstract>While the NumPy C API lets developers write C that builds or evaluates arrays, just writing C is often not enough to outperform NumPy. NumPy&apos;s use of Single Instruction, Multiple Data (SIMD) routines, as well as multi-source compiling, provide optimizations that are impossible to beat with simple C. This presentation offers principles to help determine if an array-processing routine, implemented as a C-extension, might outperform NumPy called from Python. A C-extension implementing a narrow use case of the ``np.nonzero()`` routine will be studied as an example.</abstract>
                <slug>2023-76382-out-performing-numpy-is-hard-when-and-how-to-try-with-your-own-c-extensions</slug>
                <track>General Track</track>
                
                <persons>
                    <person id='76850'>Christopher Ariza</person>
                </persons>
                <language>en</language>
                <description>While it is well known that C-extensions can improve the performance of Python programs, writing C-extensions that improve the performance of NumPy array operations is different. Many NumPy functions employ highly optimized C routines, some of which take advantage of low-level processor optimizations. In most cases, just writing Python that calls NumPy is faster than a custom C extension. However, for routines that are sufficiently narrow in scope, there are opportunities for optimization.

This presentation offers principles to help determine if a routine, implemented as a C-extension, might outperform related NumPy routines called from Python. Along the way, Python project setup and the basics of the NumPy C API will be introduced.
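
A pure-NumPy prototype of the narrow routine examined in this talk might look as follows (a sketch; returning -1 for an all-False input is a convention of this sketch, and the version discussed in the talk is written in C):

```python
import numpy as np

def first_true_1d(arr: np.ndarray) -> int:
    """Index of the first True in a 1-D Boolean array, or -1 if none.

    argmax() returns the position of the first maximum, so on Boolean
    input it finds the first True. It returns 0 for an all-False array,
    which must be disambiguated with an extra check.
    """
    assert arr.ndim == 1 and arr.dtype == np.bool_
    idx = int(arr.argmax())
    return idx if arr[idx] else -1

print(first_true_1d(np.array([False, False, True, True])))  # 2
print(first_true_1d(np.zeros(4, dtype=bool)))               # -1
```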

A narrow use-case of the ``np.nonzero()`` function will be implemented in C as an example: rather than returning all indices of all non-zero values for all dtypes and dimensionalities (as ``np.nonzero()`` does), this new function, ``first_true_1d()``, will return only the index of the first-encountered non-zero value for one-dimensional Boolean arrays. The performance of this far simpler routine, and why it sometimes cannot out-perform ``np.nonzero()``, will be examined.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://cfp.scipy.org/media/2023/submissions/DF8PVV/resources/first_true_uoGq4yD_9T_shoAolt.png">Sample performance comparison panel.</attachment>
                </attachments>

                <url>https://cfp.scipy.org/2023/talk/DF8PVV/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/DF8PVV/feedback/</feedback_url>
            </event>
            <event guid='d6a4ddac-ce7b-53f1-8387-a86c35456c53' id='76376' code='VUFGS8'>
                <room>Amphitheater 204</room>
                <title>Can There Be Too Much Parallelism?</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T11:25:00-05:00</date>
                <start>11:25</start>
                <duration>00:30</duration>
                <abstract>Numerical Python libraries can run computations on many CPU cores with various parallel interfaces. When we simultaneously use multiple levels of parallelism, it may result in oversubscription and degraded performance. This talk explores the programming interfaces used to control parallelism exposed by libraries such as NumPy, SciPy, and scikit-learn. We will learn about parallel primitives used in these libraries, such as OpenMP and Python&apos;s multiprocessing module. We will see how to control parallelism in these libraries to avoid oversubscription. Finally, we will look at the overall landscape for configuring parallelism and highlight paths for improving the user experience.</abstract>
                <slug>2023-76376-can-there-be-too-much-parallelism</slug>
                <track>General Track</track>
                
                <persons>
                    <person id='76854'>Thomas J. Fan</person>
                </persons>
                <language>en</language>
                <description>Numerical Python libraries such as NumPy, SciPy, and PyTorch can run computations on multiple CPU cores. These libraries expose a wide range of programming interfaces to control parallelism, including environment variables, library-specific APIs, and context managers such as threadpoolctl.

While reviewing the interfaces for controlling parallelism, we will learn about the many parallel primitives used in these libraries. We will cover lower-level primitives such as pthreads and OpenMP, and higher-level primitives such as Python&apos;s multithreading and multiprocessing modules. Libraries that require lower-level parallel primitives need to go through a compilation step with languages and tools such as Numba, Cython, C++, or Rust.

When we use multiple forms of parallelism, controlling how many cores your program uses is essential to prevent oversubscription. We will learn how libraries such as Dask, Ray, and scikit-learn mix their parallelism with user-provided parallel routines. Finally, we will zoom out to see the overall landscape for controlling parallelism and highlight possible paths to improve the user and developer experience.

This is an intermediate talk for software and machine learning engineers who want to understand and configure parallelism in the PyData stack.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/VUFGS8/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/VUFGS8/feedback/</feedback_url>
            </event>
            <event guid='45f9ff5a-c091-5b0d-a717-00043687b060' id='75930' code='3NBFHV'>
                <room>Amphitheater 204</room>
                <title>Scientific Python: from `__init__` to `__call__`</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T13:15:00-05:00</date>
                <start>13:15</start>
                <duration>00:30</duration>
                <abstract>The Scientific Python project aims to better coordinate the ecosystem and grow the community. Come hear about our recent progress and our plans for the coming year!</abstract>
                <slug>2023-75930-scientific-python-from-init-to-call</slug>
                <track>Tending Your Open Source Garden: Maintenance and Community</track>
                <logo>/media/2023/submissions/3NBFHV/Scientific-Python-min_NTmaUXt_f94D3pw.png</logo>
                <persons>
                    <person id='77144'>Juanita Gomez</person>
                </persons>
                <language>en</language>
                <description>The Scientific Python project&apos;s vision is to help pave the way toward a vibrant, unified, and collaborative scientific Python community.
It focuses its efforts along two primary axes: _(i)_ to create a joint community around scientific Python projects
and _(ii)_ to support maintainers by building cross-cutting technical infrastructure and tools.

Last year we launched the project with new websites, a Hugo web theme, a social media campaign, and SPECs, a collaborative coordination process similar to PEPs.
This year, we are fortunate to have received [funding from CZI](https://scientific-python.org/grants/community_and_communications_infrastructure/) for the continued development, maintenance, and support of web and documentation themes, as well as other community infrastructure, in collaboration with Quansight.
With the community and communication infrastructure having support for the next few years, we are able to focus more on technical topics and the SPECs.

As a first project, we are [funded to work on improving sparse *array*](https://scientific-python.org/grants/sparse_arrays) (vs matrix) semantics in SciPy with the goal of removing sparse *matrices* and, eventually, also NumPy *matrices* from several ecosystem libraries. In line with our philosophy of continually working with the community and incorporating their feedback, we hosted the first of several [Sparse Summits](https://scientific-python.org/summits/sparse/)&#8212;virtual meetings to identify sparse array needs in ecosystem libraries.
This project spans multiple core projects, including numpy, scipy, scikit-image, networkx, scikit-learn, and many of the packages built on top of these libraries.
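
The semantic difference motivating this work can be seen with a minimal sketch using existing SciPy types: the legacy sparse matrix classes overload `*` as matrix multiplication, while the newer sparse array classes follow NumPy elementwise semantics.

```python
import numpy as np
from scipy import sparse

dense = np.array([[1, 2], [3, 4]])
m = sparse.csr_matrix(dense)  # legacy sparse matrix
a = sparse.csr_array(dense)   # newer sparse array

# For matrices, `*` means matrix multiplication
print((m * m).toarray())  # [[7, 10], [15, 22]]

# For arrays, `*` is elementwise (NumPy semantics); use `@` for matmul
print((a * a).toarray())  # [[1, 4], [9, 16]]
print((a @ a).toarray())  # [[7, 10], [15, 22]]
```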

In addition to the sparse summit, we have hosted a [domain stack summit](https://scientific-python.org/summits/domain-stacks/) to discuss domain-specific umbrella projects that host several others, as well as the first [annual developer summit](https://scientific-python.org/summits/developer/).
This in-person workshop brought together over 30 community members for a week-long, collaborative sprint, and tackled topics including build &amp; testing systems, continuous integration infrastructure, release management tools, and community management.

Finally, we will update the community on our progress on the [decadal plan](https://scientific-python.org/grants/planning_next_decade/).

Our efforts thus far have already culminated in joint efforts to develop tools and shared infrastructure that will positively impact the whole ecosystem.
And, while there is still a long road ahead, we are excited to continue preparing the ecosystem for the next decade of scientific computing in Python.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/3NBFHV/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/3NBFHV/feedback/</feedback_url>
            </event>
            <event guid='3bcae804-4357-5e20-9618-4b1925c9608d' id='76356' code='KXWZJY'>
                <room>Amphitheater 204</room>
                <title>Beyond Bits &amp; Qubits: Effective Open Source Community Management in Quantum Computing</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T13:55:00-05:00</date>
                <start>13:55</start>
                <duration>00:30</duration>
                <abstract>Qiskit is an open-source SDK for quantum computers, enabling developers to work with these powerful machines using a familiar Python interface. First released in 2017, Qiskit has become the most popular package for quantum computing (Unitary Fund, 2022), with a thriving open-source community. As Qiskit has grown and changed, so has our approach to nurturing our community. This talk will share important lessons we&#8217;ve learnt over the years, including practical tips you can apply to your own projects. Whether you&#8217;re just starting in open-source or already manage an established community, this talk is for you!</abstract>
                <slug>2023-76356-beyond-bits-qubits-effective-open-source-community-management-in-quantum-computing</slug>
                <track>Tending Your Open Source Garden: Maintenance and Community</track>
                
                <persons>
                    <person id='76901'>Abby Mitchell</person>
                </persons>
                <language>en</language>
                <description>Basic outline of the proposed talk:

### 1. Context
This section will provide a brief introduction to Qiskit (https://qiskit.org) as an open-source package and some of the challenges we&#8217;ve faced in maintaining and growing our community.  

### 2. The Academic Element
One of the unique aspects of maintaining an open-source project in a scientific field is the closer relationship to academia compared to other open-source software. This can pose unique challenges, as researchers often have different goals, mindsets and working culture when it comes to publishing code, which doesn&#8217;t always work well with traditional open-source ways of working. We continually face these conflicts in Qiskit, so in this section we will talk through some of the effective ways we&#8217;ve found to address these differences through education and the development of the Qiskit Ecosystem (https://qiskit.org/ecosystem).  

### 3. Clearly Defined Spaces
Defining the mechanisms for *how* different members of the community interact is a subtle yet crucial aspect of community management that requires careful planning. Whether it&#8217;s clearly defined issue templates, organised discussion forums, or actual events, having clearly defined spaces can help contributors and maintainers work together more effectively. So this section will demonstrate specific strategies we&#8217;ve used in Qiskit and the underlying principles that make them effective.

### 4. Be a Kind Human
This section will focus on the incredibly important aspect of fostering a welcoming culture within your open-source community. We will touch on the importance of a code of conduct, contributing guidelines, issue tagging, using empathetic and accessible language, and other general tips for making the whole contribution experience inclusive.  

### 5. Metrics and Automation
This section will focus on how to use automations to streamline your contributor experience and collect valuable data along the way. From bots to actions to built-in GitHub features, there are a ton of options to choose from, so we&#8217;ll highlight the ones we&#8217;ve found the most useful and the important insights we&#8217;ve gained as a result.  

### 6. Development meets DevRel
Effective community management requires significant time investment, which can take a toll on project maintainers. This section will make the case for working closely with Developer Relations experts (perhaps even hiring one if you haven&#8217;t already!) to offload some of that burden. Developer Advocates are highly specialised in communication for a developer audience, and can become valuable assets when brought into an open-source team.  

### 7. The Community Management Graveyard
To wrap things up, this section will cover ideas we tried during our community management journey in Qiskit that didn&#8217;t work out: things that started with the best intentions, and what we learned from the process. This section will show that experimenting is an important part of finding a community management setup that works for you, and that trying and failing in public is what open source is all about.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/KXWZJY/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/KXWZJY/feedback/</feedback_url>
            </event>
            <event guid='4a3be880-0723-5c08-9115-ae91c4057441' id='75998' code='QUNAY9'>
                <room>Amphitheater 204</room>
                <title>Thar Be Dragons - Ethical, Legal, and Policy Challenges when Measuring Open Source</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T14:35:00-05:00</date>
                <start>14:35</start>
                <duration>00:30</duration>
                <abstract>Open source researchers are increasingly challenged while navigating the data which open source communities inherently create when working in the open. While mining software repositories for insights into open source practices isn&apos;t new, moving beyond code analysis into ecosystem-level research does not have a clear path. This talk will outline the current ethical, legal, and policy challenges community leaders, as well as researchers in academia and industry face and the ambiguous areas decision makers should be aware of.</abstract>
                <slug>2023-75998-thar-be-dragons-ethical-legal-and-policy-challenges-when-measuring-open-source</slug>
                <track>Tending Your Open Source Garden: Maintenance and Community</track>
                
                <persons>
                    <person id='77023'>amanda casari</person>
                </persons>
                <language>en</language>
                <description>Challenges to outline can include:

__Ethical__
- Academia - quantitative + qualitative open source data is not (usually) subject to IRB
- Does anti-aliasing across datasets potentially create opportunities for harm for members of open source communities?

__Legal__
- When does information become a dataset?
- Can I use this data? Which license for what?

__Policy__
- Can umbrella foundations &quot;opt-in&quot; communities and projects into ecosystem scale research?
- How can communities and projects create clear boundaries about how and where they want the &quot;data exhaust&quot; they release to be used?</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/QUNAY9/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/QUNAY9/feedback/</feedback_url>
            </event>
            <event guid='ccac3b14-a3b0-5550-8fb4-d7e5216918c2' id='76141' code='S8NSHT'>
                <room>Amphitheater 204</room>
                <title>In-Process Analytical Data Management with DuckDB</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T15:25:00-05:00</date>
                <start>15:25</start>
                <duration>00:30</duration>
                <abstract>DuckDB is a novel analytical data management system. DuckDB supports complex queries, has no external dependencies, and is deeply integrated into the Python ecosystem. Because DuckDB runs in the same process, no serialization or socket communication has to occur, making data transfer virtually instantaneous. For example, DuckDB can directly query Pandas data frames faster than Pandas itself. In our talk, we will describe the value DuckDB offers users and how it can improve their day-to-day work through automatic parallelization, efficient operators, and out-of-core operation.</abstract>
                <slug>2023-76141-in-process-analytical-data-management-with-duckdb</slug>
                <track>General Track</track>
                
                <persons>
                    <person id='77253'>Alex Monahan</person><person id='77037'>Hannes M&#252;hleisen</person><person id='77043'>Mark Raasveldt</person>
                </persons>
                <language>en</language>
                <description>Data management systems and data analysts have a troubled relationship: Common systems such as Postgres or Spark are unwieldy, hard to set up and maintain, hard to transfer data in and out, and hard to integrate into complex end-to-end workflows. As a response, analysts have developed their own ecosystem of data wrangling tools such as Pandas or Polars. These tools are much more natural for analysts to use, but are limited in the amount of data they can process or the amount of automatic optimization that is supported. 

DuckDB is a new analytical data management system that is built for an in-process use case. DuckDB speaks SQL, has no external dependencies, and is deeply integrated into the Python ecosystem. DuckDB is Free and Open Source software under the MIT license. DuckDB uses state-of-the-art query processing techniques with vectorized execution, lightweight compression, and morsel-driven automatic parallelism. DuckDB is out-of-core capable, meaning that it can process datasets that are bigger than main memory. This allows for analysis of far larger datasets and in many cases removes the need to run separate infrastructure. 

The &#8220;duckdb&#8221; Python package is not a client to the DuckDB system; it provides the entire database engine. DuckDB runs without any external server, directly inside the Python process. Once there, DuckDB can run complex SQL queries on data frames in Pandas, Polars or PyArrow formats out of the box. DuckDB can also directly ingest files in Parquet, CSV or JSON formats. Because DuckDB runs in the same process, data transfers are virtually instantaneous. Conversely, DuckDB&#8217;s query results can be transferred back into data frames very cheaply, allowing direct integration with complex downstream libraries such as PyTorch or TensorFlow. 

DuckDB enjoys fast-growing popularity; the Python package alone is currently downloaded around one million times a month. DuckDB has recently become the default backend of the Ibis project, which offers a consistent interface in Python over a variety of data backends. 

This talk is aimed at two main groups: data analysts and data engineers. For the analysts, we will explain the value DuckDB offers and how it can improve their day-to-day work. For data engineers, we will describe DuckDB&#8217;s capabilities to become part of large automated data pipelines. The presenters of the proposed talk, Hannes M&#252;hleisen and Mark Raasveldt, are the original creators of DuckDB; they still lead the project and are deeply familiar with its Python integration.

- DuckDB Python API Overview: https://duckdb.org/docs/api/python/overview
- DuckDB PyPI Download Statistics: https://pypistats.org/packages/duckdb
- DuckDB Ibis Backend: https://ibis-project.org/backends/DuckDB/
- Peer-reviewed paper about the concept behind DuckDB by the presenters: https://www.cidrdb.org/cidr2020/papers/p23-raasveldt-cidr20.pdf
- Talk about DuckDB at FOSDEM 2020 by Hannes: https://archive.fosdem.org/2020/schedule/event/duckdb/
- Talk about DuckDB at CMU by Mark: https://www.youtube.com/watch?v=PFUZlNQIndo</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/S8NSHT/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/S8NSHT/feedback/</feedback_url>
            </event>
            <event guid='e1f09b9c-dba5-531f-8393-42a685e1c10b' id='76183' code='YESNZB'>
                <room>Amphitheater 204</room>
                <title>GraphBLAS for Sparse Data and Graphs</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T16:05:00-05:00</date>
                <start>16:05</start>
                <duration>00:30</duration>
                <abstract>GraphBLAS solves graph problems using sparse linear algebra. We are using it to build [`graphblas-algorithms`](https://github.com/python-graphblas/graphblas-algorithms), a fast backend to NetworkX. [`python-graphblas`](https://github.com/python-graphblas/python-graphblas/) is faster and more capable than `scipy.sparse` for both graph algorithms and sparse operations. If you have sparse data or graph workloads that you want to scale and make faster, then this is for you. Come learn what makes GraphBLAS special--and fast!--and how to use it effectively.</abstract>
                <slug>2023-76183-graphblas-for-sparse-data-and-graphs</slug>
                <track>General Track</track>
                
                <persons>
                    <person id='77145'>Jim Kitchen</person><person id='76969'>Erik Welch</person>
                </persons>
                <language>en</language>
                <description>Sparse data and graph problems appear in virtually all science and engineering disciplines. Nevertheless, adoption of sparse and graph techniques has been slow (so opportunities to exploit sparsity are plentiful)--perhaps because it&apos;s not always obvious when to apply them, or because existing libraries are too slow or difficult to use. GraphBLAS can help. By expressing graph algorithms in the language of linear algebra, it can handle larger data in parallel and is versatile enough to express custom analyses and integrate into larger workflows.

In this talk, we will cover:

- How to recognize sparse or graph problems and when to use GraphBLAS
- Representing a graph as a sparse matrix
- The equivalence between graph problems and sparse problems
- How GraphBLAS extends linear algebra with masking and arbitrary semirings to be more capable and work-efficient than `scipy.sparse`
- The underlying sparse data structures and how to efficiently convert to and from them
- Examples of graph algorithms written in [`python-graphblas`](https://github.com/python-graphblas/python-graphblas/)
- Using GraphBLAS as a backend to NetworkX via dispatching to [`graphblas-algorithms`](https://github.com/python-graphblas/graphblas-algorithms)
- Benchmarks comparing GraphBLAS, NetworkX, and `scipy.sparse`
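
To make the graph-matrix equivalence in the list above concrete, here is a minimal sketch using `scipy.sparse` (deliberately not the `python-graphblas` API itself): one step of graph traversal is just a sparse matrix-vector product.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Adjacency matrix of a small directed graph with edges 0-1, 0-2, 1-2, 2-3.
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 2, 3])
A = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

# One traversal step from a frontier is a matrix-vector product; over a
# boolean semiring this computes exactly "which nodes are one hop away".
frontier = np.zeros(4)
frontier[0] = 1.0                  # start at node 0
reached = A.T.dot(frontier) > 0    # nodes reachable in one hop
print(np.flatnonzero(reached))     # nodes 1 and 2
```

GraphBLAS generalizes this pattern with masks and arbitrary semirings, which is where its extra expressiveness and work-efficiency come from.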

Come learn what makes GraphBLAS special--and fast!--and how to use it effectively.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/YESNZB/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/YESNZB/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Grand Salon C' guid='528485f3-5505-5c92-85b8-0f828e66a79d'>
            <event guid='48098627-9a4c-5724-9d6c-747e26293eaf' id='76116' code='333PY7'>
                <room>Grand Salon C</room>
                <title>vak: a neural network framework for researchers studying animal acoustic communication</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T10:45:00-05:00</date>
                <start>10:45</start>
                <duration>00:30</duration>
                <abstract>Research on animal acoustic communication is being revolutionized by deep learning. In this talk we present vak, a framework that allows researchers in this area to easily benchmark deep neural network models and apply them to their own data. We&apos;ll demonstrate how research groups are using vak through examples with TweetyNet, a model that automates annotation of birdsong by segmenting spectrograms. Then we&apos;ll show how adopting Lightning as a backend in version 1.0 has allowed us to incorporate more models and features, building on the foundation we put in place with help from the scientific Python stack.</abstract>
                <slug>2023-76116-vak-a-neural-network-framework-for-researchers-studying-animal-acoustic-communication</slug>
                <track>Bioinformatics, Computational Biology &amp; Neuroscience</track>
                <logo>/media/2023/submissions/333PY7/vak-logo-primary_ewQYbmJ_oJAZ5Xl.png</logo>
                <persons>
                    <person id='77206'>Yarden Cohen</person><person id='77017'>David Nicholson</person>
                </persons>
                <language>en</language>
                <description>Are humans unique among animals? We speak languages, but is speech somehow like other animal behaviors, such as birdsong? Questions like these are answered by studying how animals communicate with sound. This research requires cutting-edge computational methods and big team science across a wide range of disciplines, including ecology, ethology, bioacoustics, psychology, neuroscience, linguistics, and genomics. As in many other domains, this research is being revolutionized by deep learning algorithms. Deep neural network models enable answering questions that were previously impossible to address, in part because these models automate analysis of very large datasets.

Within the study of animal acoustic communication, multiple models have been proposed for similar tasks, often implemented as research code with different libraries, such as Keras and PyTorch. This situation has created a real need for a framework that allows researchers to easily benchmark models and apply trained models to their own data. To address this need, we developed vak [1], a neural network framework designed for this research community, built with core libraries of the scientific Python stack such as numpy, scipy, pandas and dask.

In this talk, we will show how vak makes it easy for researchers to work with neural network models through a simple command-line interface and TOML configuration files. As an example, we will demonstrate how we used vak to benchmark a neural network model, TweetyNet [2], that automates annotation of birdsong by segmenting spectrograms. Using vak allowed us to tune hyperparameters and determine the minimal amount of expensive human-annotated data we needed for accurate model performance. We will show how TweetyNet and vak made it possible to relate the complex syntax of canary song to the hidden states of neural activity in the canary brain, and how these tools are being used by other researchers in neuroscience and bioacoustics. 
Then we will demonstrate how in version 1.0 of vak we have significantly extended its generality, in large part by adopting the Lightning library as a backend. We will show how we are using version 1.0 of vak to reduce the segment error rate of TweetyNet, minimizing the need to clean up predictions with post-processing. In addition, we&apos;ll walk through how we&apos;re using vak to compare the performance of TweetyNet with other neural network architectures proposed for similar tasks. Finally, we will show work in progress on incorporating other families of neural network models into vak: generative and unsupervised learning algorithms for dimensionality reduction and similarity measurement. Both authors are experienced public speakers [3], and the combination of cutting-edge neural network models in Python with studies of birds, their song, and the vocalizations of other charismatic animals is sure to make for an entertaining and informative talk.

[1] https://github.com/vocalpy/vak
[2] https://elifesciences.org/articles/63853, https://github.com/yardencsGitHub/tweetynet 
[3] https://nicholdav.info/talks/, https://yardencsgithub.github.io/talks/</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/333PY7/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/333PY7/feedback/</feedback_url>
            </event>
            <event guid='cc79131d-10d9-5db3-80a1-b9a492d7200a' id='76212' code='AQ3Z3U'>
                <room>Grand Salon C</room>
                <title>tfmodisco-lite: an attribution-based motif discovery algorithm</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T11:25:00-05:00</date>
                <start>11:25</start>
                <duration>00:30</duration>
                <abstract>An important problem in genomics is identifying the proteins that bind to DNA. Although many methods attempt to learn DNA motifs underlying protein binding as position-weight matrices (PWMs), these PWMs cannot faithfully represent real biology. For instance, a static PWM cannot describe a zinc-finger protein whose fingers can optionally include one-nucleotide spacing. TF-MoDISco is a framework for extracting motifs using attribution scores from a machine-learning model. The learned motifs and syntax overcome many of the limitations presented by PWMs. I will describe the TF-MoDISco algorithm and showcase its efficient re-implementation, tfmodisco-lite.</abstract>
                <slug>2023-76212-tfmodisco-lite-an-attribution-based-motif-discovery-algorithm</slug>
                <track>Bioinformatics, Computational Biology &amp; Neuroscience</track>
                
                <persons>
                    <person id='76880'>Jacob Schreiber</person>
                </persons>
                <language>en</language>
                <description>Understanding the binding of proteins to the genome is crucial for deciphering gene expression programs across cell types. Yet, identifying where and when these proteins bind along the genome is complicated. Most proteins bind to a specific sequence of nucleotides, known as a &quot;motif.&quot; But not all proteins are this simple: zinc-finger proteins are composed of many &quot;fingers&quot; that each bind to short 3-4 basepair motifs. While these short motifs are always found in the same order, variable spacing can occur between them, and not all are always necessary for binding. Other proteins require the presence of a co-factor to bind to their motifs. Faithfully describing the sequence determinants of protein binding, sometimes called the cis-regulatory logic, for all proteins is a challenging task.

Increasingly, people have been using machine learning to understand biology by training neural networks to take in nucleotide sequence and predict a readout of interest, e.g. ATAC-seq, ChIP-seq, CAGE, etc. One can then run a feature attribution algorithm, such as ISM or DeepLIFT, to highlight the nucleotides that drive the predicted readouts. However, summarizing these attributions into repeated patterns has thus far been a missing component of the analysis pipeline. 

TF-MoDISco is a framework for using attribution-weighted sequence to discover motifs. The approach differs from classic motif finding algorithms in both the input and the output. Rather than operating solely on nucleotide sequence, TF-MoDISco also takes in the attributions from a machine learning model using any attribution algorithm. These attributions highlight the nucleotides involved in accurate predictions and so distinguish between driver motifs and passenger motifs. At the end of the procedure, TF-MoDISco returns clusters of &quot;seqlets,&quot; or found motif hits. These patterns, aligned to each other to account for spacing, represent the true heterogeneity of protein binding to the genome. By returning clusters of seqlets, TF-MoDISco overcomes many of the problems of position-weight matrices (PWMs), such as the inability to account for variable spacing and the assumption that nucleotide positions contribute independently.
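
As an illustration only (this toy sketch is not the TF-MoDISco algorithm or its API), thresholding a per-base attribution track is enough to see what a &quot;seqlet&quot; is: a short window whose attributions stand out from the background.

```python
import numpy as np

# Toy attribution track: background noise plus two planted high-attribution motifs.
rng = np.random.default_rng(0)
attr = rng.normal(0.0, 0.1, size=200)
attr[50:60] += 1.0
attr[120:130] += 1.0

# Score every 10-base window by its summed attribution, keep windows whose
# score clears a threshold, and collapse overlapping hits to one start each.
window = 10
scores = np.convolve(attr, np.ones(window), mode="valid")
hits = np.flatnonzero(scores > 5.0)
starts = hits[np.insert(np.diff(hits) > window, 0, True)]
print(starts)  # two seqlet starts, near the planted positions 50 and 120
```

TF-MoDISco goes far beyond this sketch by aligning and clustering the extracted seqlets into recurring patterns.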

This talk will describe the TF-MoDISco procedure at a high level (first 15 minutes) and give a tutorial on how to use the code for discovery in practice (second 15 minutes). Examples will come from models used to predict chromatin accessibility via ATAC-seq as well as protein binding via ChIP-seq readouts. Specifically, the tutorial will cover tfmodisco-lite, a rewrite of the original algorithm that scales significantly better, runs faster, and requires less code. By the end of the talk, one should feel comfortable applying the method to their own data and interpreting the reports that are generated.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/AQ3Z3U/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/AQ3Z3U/feedback/</feedback_url>
            </event>
            <event guid='082b3f58-acc3-5197-a3dd-8957e560d421' id='76167' code='RSE89M'>
                <room>Grand Salon C</room>
                <title>Gammapy: a Python Package for Gamma-Ray Astronomy Version v1.0</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T13:15:00-05:00</date>
                <start>13:15</start>
                <duration>00:30</duration>
                <abstract>In this contribution we will present the first stable version v1.0 of Gammapy, an openly developed Python package for gamma-ray astronomy. Gammapy provides methods for the analysis of astronomical gamma-ray data, such as measurement of spectra, images and light curves. By relying on standardized data formats and a joint likelihood framework, it allows astronomers to combine data from multiple instruments and constrain underlying astrophysical emission processes across large parts of the electromagnetic spectrum. Finally we will share lessons learned during the journey towards version v1.0 for an openly developed scientific Python package.</abstract>
                <slug>2023-76167-gammapy-a-python-package-for-gamma-ray-astronomy-version-v1-0</slug>
                <track>Astronomy and Physics</track>
                
                <persons>
                    <person id='77022'>Axel Donath</person>
                </persons>
                <language>en</language>
                <description>By observing the very high energy (VHE) range of the electromagnetic spectrum we can gain valuable insight into the extreme universe, including remnants of supernova explosions and the surroundings of black holes. In the past, VHE gamma-ray astronomy was typically conducted by small, closed collaborations as a subfield of particle physics. However, with hundreds of sources now identified, VHE gamma-ray astronomy has emerged as a new branch of astronomy. This field provides the high-energy context for understanding the physical processes occurring throughout the universe, across the entire electromagnetic spectrum.

The next generation of ground-based gamma-ray instruments, particularly the Cherenkov Telescope Array Observatory (CTAO), is set to revolutionize gamma-ray astronomy. With an anticipated sensitivity ten times greater than that of current telescopes, it has the potential to attract a community of thousands of gamma-ray astronomers. Furthermore, it will operate as an open observatory, making both data and analysis tools readily available to the public.

In this contribution we introduce the inaugural stable version (v1.0) of Gammapy, a Python package for gamma-ray astronomy and the primary library for the future CTAO science tools. Leveraging the scientific Python ecosystem, including Numpy, Scipy, and Astropy, Gammapy offers a comprehensive set of standard data analysis tools, making it an indispensable resource for gamma-ray astronomers and beyond. By utilizing common open data formats, Gammapy also enables existing instruments such as VERITAS, H.E.S.S., or HAWC to export and archive their data, preserving it for future analysis using improved methods. Additionally, it facilitates the combination of data from multiple instruments, resulting in more sensitive analyses with greater statistics and a larger energy range.

Gammapy tackles the varied structure of gamma-ray data and science analysis cases by implementing a uniform API for N-dimensional sky maps. This API is independent of the underlying pixelization scheme and supports local WCS, all-sky HEALPix, and region-based projections. These data structures prove useful for a broad range of applications, as well as for astronomers observing at other wavelengths.

Building on these core data structures, Gammapy features a maximum likelihood fitting framework that enables simultaneous modeling of gamma-ray emission in four dimensions: two spatial coordinates, energy, and time. By providing a general likelihood interface, Gammapy enables science users to integrate gamma-ray data with astronomical data from other wavelengths, as well as with neutrino data. Thanks to its straightforward Python API, Gammapy can be paired with other Python-based broadband emission modeling packages, allowing for direct measurement of parameters pertaining to common underlying astrophysical processes. This feature is crucial to realizing the full potential of future multi-wavelength and multi-messenger astronomy.

Lastly, we will discuss the valuable lessons learned during our journey to achieve v1.0 quality for an openly developed package. This will involve addressing concerns regarding maintainability, selection of dependencies, handling of high-dimensional data structures, and API design. We believe that sharing our experience will be helpful to other scientific Python projects in the future.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/RSE89M/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/RSE89M/feedback/</feedback_url>
            </event>
            <event guid='5528f450-ff61-5eb2-b332-7b775e17befd' id='76147' code='JTXC9W'>
                <room>Grand Salon C</room>
                <title>libyt: a Tool for Parallel In Situ Analysis with yt</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T13:55:00-05:00</date>
                <start>13:55</start>
                <duration>00:30</duration>
                <abstract>In the era of exascale computing, storage and analysis of large-scale data have become more important and more difficult. We present libyt, an open source C++ library that allows researchers to analyze and visualize data using yt or other Python packages in parallel during simulation runtime. We describe the methods for reading adaptive mesh refinement data structures, handling data transfer between Python and the simulation with minimal memory overhead, and conducting analysis with no additional time penalty using the Python C API and NumPy C API. We demonstrate how it solves this problem in astrophysical simulations and increases disk usage efficiency.</abstract>
                <slug>2023-76147-libyt-a-tool-for-parallel-in-situ-analysis-with-yt</slug>
                <track>Astronomy and Physics</track>
                
                <persons>
                    <person id='77711'>Shin-Rong Tsai</person>
                </persons>
                <language>en</language>
                <description>## Motivation and Aims
In the era of exascale computing, storage and analysis of large-scale data have become more important and more difficult.
We present libyt, an open source C++ library that allows researchers to analyze and visualize data using yt or any other Python package in parallel during simulation runtime.

## Methods
### Connecting Python and Simulation
We use the Python C API and NumPy C API to connect variables and arrays in the simulation to Python. This includes creating a NumPy array by wrapping an existing C array without additional memory, allocating new arrays and assigning values, and building Python objects and modules that contain simulation information. We also create Python C-extension methods that let Python request data from the simulation.
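
In libyt this zero-copy wrapping is done through the NumPy C API; the same idea can be sketched from pure Python with `ctypes` (the buffer below is a stand-in for simulation-owned memory):

```python
import ctypes
import numpy as np

# Stand-in for an existing C array owned by the simulation.
c_buffer = (ctypes.c_double * 5)(0.0, 1.0, 2.0, 3.0, 4.0)

# Wrap it as a NumPy array without copying: the ndarray is a view onto the
# same memory, analogous to PyArray_SimpleNewFromData in the NumPy C API.
arr = np.ctypeslib.as_array(c_buffer)

arr[0] = 42.0          # writes through to the underlying C buffer
print(c_buffer[0])     # 42.0 -- both names see the same memory
```

Because the array is only a view, analysis code in Python reads the simulation data in place, with no extra memory cost.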

### Executing Python Codes and Handling Errors
libyt runs in situ analysis using the Python interpreter. This is like running a Python prompt inside the ongoing simulation, with data already loaded.
libyt checks input Python syntax by compiling it to a code object. If an error occurs, it parses the error to determine whether it is caused by incomplete input or by a real error. 

### In Situ Analysis Under Parallel Computing
Each MPI process contains one simulation instance and one Python interpreter. All Python instances work together to conduct in situ analysis in parallel using mpi4py (Python bindings for MPI).
yt (a Python package for analyzing and visualizing volumetric data) supports MPI parallelism. libyt borrows this feature and handles data transfer between different MPI processes and between the simulation and Python. Since the data is distributed across processes, and we cannot predict how Python will decompose the jobs and request data, we use one-sided MPI communication to exchange data between nodes.

## Applications

### Analyzing Fuzzy Dark Matter Vortices Simulation using GAMER + libyt
We use GAMER, an astrophysics simulation code, to simulate the evolution of vortices that form from density voids in a Fuzzy Dark Matter halo.
Each snapshot takes 116 GB, and a total of 321 snapshots are required to capture them (37 TB of disk space). We solve this by using yt through libyt to extract our region of interest, which consumes only 8 GB per step. The data size is about 15 times smaller.

- Animation: https://youtu.be/tUjJYGbWgUc

### Analyzing Core-Collapse Supernova Simulation using GAMER + libyt
We use GAMER to simulate core-collapse supernova explosions, and we use libyt to call yt and draw slice plots of the entropy distribution.
Since entropy is not one of the variables in the simulation&apos;s iterative process, the entropy data is generated by the simulation only when yt requests it. libyt tries to minimize memory usage.

- Animation: https://youtu.be/6iwHzN-FsHw

## Discussion and Conclusion
- libyt provides a promising solution that binds a simulation to Python with minimal memory overhead and no additional time penalty. It makes analyzing large-scale simulations feasible.
- libyt focuses on using yt as its core analysis method, even though it can call arbitrary Python modules. We will extend it to more data structures in the future.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/JTXC9W/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/JTXC9W/feedback/</feedback_url>
            </event>
            <event guid='39320e73-b93d-58b4-b99b-469196117877' id='76370' code='DZBF7K'>
                <room>Grand Salon C</room>
                <title>Seeing the Sun through the Clouds: Accelerating the SunPy Data Analysis Ecosystem with Dask</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T14:35:00-05:00</date>
                <start>14:35</start>
                <duration>00:30</duration>
                <abstract>Over the last decade, the SunPy ecosystem, a Python solar data analysis environment, has evolved organically to serve the needs of scientists analyzing solar physics data, mostly on desktop and laptop computers. However, modern solar observatories are producing data volumes in the tens of petabytes, necessitating parallelized and out-of-core computation. HelioCloud is a cloud computing environment tailored for heliophysics research and colocated with many terabytes of solar physics data. In this talk, we will show how the SunPy ecosystem, combined with Dask on HelioCloud, can be used to efficiently process high-resolution solar data.</abstract>
                <slug>2023-76370-seeing-the-sun-through-the-clouds-accelerating-the-sunpy-data-analysis-ecosystem-with-dask</slug>
                <track>Astronomy and Physics</track>
                
                <persons>
                    <person id='76937'>Nabil Freij</person><person id='76936'>Will Barnes</person><person id='76941'>Jack Ireland</person><person id='76928'>Stuart Mumford</person>
                </persons>
                <language>en</language>
                <description>The SunPy ecosystem is a set of community-developed, free and open-source Python packages for solar data analysis. The ecosystem consists of the core sunpy package, which provides general capabilities such as data download, data structures, and coordinate transformations, as well as a growing set of affiliated packages which provide more application-specific functionality such as image processing techniques. The entire SunPy ecosystem depends heavily on the broader scientific Python ecosystem, including numpy, scipy, and scikit-image and especially the astropy package, a community Python package for astronomy.

Over the last decade, the SunPy ecosystem has evolved organically to serve the needs of scientists analyzing solar physics data. Analysis of observational solar data has traditionally been carried out on desktop or laptop computers or small compute clusters (see Bobra et al., 2020). This limitation is partly due to the longstanding historical reliance on the proprietary Interactive Data Language (IDL) by the solar physics community, which has limited scalability due in part to licensing restrictions. However, modern space- and ground-based solar observatories are producing data volumes in the tens of petabytes, necessitating parallelized and out-of-core computation. The surge in popularity of Python within the broader astronomy community as well as the growing availability of computing resources has led to many solar researchers using Python in cloud environments. All of these factors have propelled the development of HelioCloud. Inspired by similar science platforms for other disciplines like Pangeo, HelioCloud is a NASA-funded, AWS-backed cloud computing environment tailored for heliophysics research. HelioCloud provides both a dashboard for creating custom virtual machines and a JupyterLab interface. Using the latter allows for interactive, scalable computation enabled by Dask across many compute nodes. Most importantly, HelioCloud is collocated with nearly 1 petabyte of solar physics data such that researchers can perform their analysis without the added latency of needing to download the data.
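
The scale-out pattern this enables can be sketched with `dask.delayed` (a generic sketch; the per-observation analysis function and observation count below are placeholders, not the actual HelioCloud workflow):

```python
import dask
import numpy as np

def analyze(seed):
    """Stand-in for loading one observation and computing a summary statistic."""
    rng = np.random.default_rng(seed)
    image = rng.random((256, 256))     # placeholder for one solar image
    return float(image.mean())

# Build a lazy task graph over many observations, then compute in parallel;
# on a cluster the same code scales out by pointing Dask at more workers.
tasks = [dask.delayed(analyze)(i) for i in range(8)]
means = dask.compute(*tasks)
print(len(means))  # 8
```

Swapping the local scheduler for a distributed one changes where the tasks run, not how the analysis is written.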

In this talk, we will demonstrate how the SunPy ecosystem, combined with Dask on HelioCloud, can be used to efficiently process high-resolution solar data. First, we will provide a brief description of the SunPy project with particular emphasis on the ndcube and sunkit-image affiliated packages. Next, we will briefly describe the JupyterLab interface of the HelioCloud platform. Finally, we will demonstrate a typical scientific workflow on HelioCloud by efficiently analyzing many hours&#8217; worth of solar active region evolution using sunpy, ndcube, sunkit-image, and Dask to scale out our computation over many workers. Additionally, we will discuss existing incompatibilities between Dask and the astropy ecosystem and how collaboration with the broader scientific Python community could resolve such frictions.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/DZBF7K/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/DZBF7K/feedback/</feedback_url>
            </event>
            <event guid='d583aa85-18ed-5d88-b580-7cc3e52880c1' id='76250' code='8MMNUD'>
                <room>Grand Salon C</room>
                <title>Open Force Field: next-generation force fields with open data, open software, and open science</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T15:25:00-05:00</date>
                <start>15:25</start>
                <duration>00:30</duration>
                <abstract>The Open Force Field (OpenFF) initiative was formed to build a new generation of force fields for molecular dynamics (MD) simulations using modern data-driven techniques. Openness is one of our fundamental founding principles, and everything we produce is released openly and accessibly so that the community can validate, modify, or extend our work. Here we introduce some flagship packages in our ecosystem and the advances they have enabled in force field science and MD workflows. These include fitting custom functional forms, exploring the addition of off-site charges, and using neural networks to assign charges to protein-ligand systems.</abstract>
                <slug>2023-76250-open-force-field-next-generation-force-fields-with-open-data-open-software-and-open-science</slug>
                <track>Materials and Chemistry</track>
                
                <persons>
                    <person id='76926'>Jeff Wagner</person>
                </persons>
                <language>en</language>
                <description>**Background**

Molecular dynamics (MD) simulations are now critical components in pharmaceutical and biomolecular research. A potential energy function called a &#8216;force field&#8217; is used to solve the differential equations that describe the particle motion. A vast number of different force fields have now been released, each fit to experimental or quantum chemistry data to reproduce specific properties in a limited region of chemical space. However, the core of most of these dates from work published decades ago, and new force field development has primarily taken the form of incremental improvements guided by human chemical intuition rather than systematic, reproducible methods.

**Outline**

The [Open Force Field (OpenFF) initiative](https://openforcefield.org/) was formed to produce open and extensible infrastructure to build a new generation of MD force fields. We have now developed many software packages for constructing, applying, and benchmarking force fields. We have also generated several high-quality quantum chemistry datasets. Everything is available freely on [GitHub](https://github.com/openforcefield/), [Zenodo](https://zenodo.org/communities/openforcefield/), and the [MolSSI QCArchive server](https://qcarchive.molssi.org/). This work has been successfully used to investigate potential improvements to force fields, as well as simplify many previously difficult aspects of preparing MD systems.

Here we will introduce the [OpenFF-Toolkit](https://github.com/openforcefield/openff-toolkit) and [OpenFF-Interchange](https://github.com/openforcefield/openff-interchange) packages. We can use them to quickly assign force field parameters to arbitrary systems of small molecules, and then write these systems out in common MD formats for simulation. We also introduce the [OpenFF-Bespokefit](https://github.com/openforcefield/openff-bespokefit) package for fitting custom torsion parameters, as well as the [OpenFF-QCSubmit](https://github.com/openforcefield/openff-qcsubmit) package for interacting with QCArchive. We show how to use the datasets we have released on QCArchive.

We will finally show some of the advancements enabled by our work. The [OpenFF-Evaluator](https://github.com/openforcefield/openff-evaluator) package was instrumental in investigating the effect of using a custom potential for van der Waals&#8217; parameters. We used [OpenFF-Recharge](https://github.com/openforcefield/openff-recharge) to explore adding off-site charges with virtual sites. Finally, we describe the development of a neural network for quickly assigning conformer-independent partial charges &#8211; this also employed OpenFF-Recharge, as well as [OpenFF-NAGL](https://github.com/openforcefield/openff-nagl).

We hope these examples give a brief overview of how OpenFF can help both common everyday MD tasks as well as larger scientific investigations.

**Previous talks**

I&apos;ve previously given [keynote talks](https://www.youtube.com/watch?v=Jw1iVjHkRPM) at the Open Force Field annual meetings and presented at open science meetings convened by the [NIH](https://datascience.nih.gov/news/nih-odss-to-host-sessions-at-ismb-annual-conference), the NSF, and groups in the [scientific computing](https://www.youtube.com/watch?v=hS87inZupdQ) and molecular simulation communities.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/8MMNUD/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/8MMNUD/feedback/</feedback_url>
            </event>
            <event guid='1b60e375-930e-5948-a2d0-51bf94568b4f' id='76002' code='JH9JMV'>
                <room>Grand Salon C</room>
                <title>Designing user-friendly APIs for the NIST Interatomic Potentials Repository</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T16:05:00-05:00</date>
                <start>16:05</start>
                <duration>00:30</duration>
                <abstract>The NIST Interatomic Potentials Repository project has developed Python APIs to support user interactions with the repository data hosted at https://potentials.nist.gov. The associated code is layered, starting with generic methods for JSON/XML-based data and databases, and building up to user-friendly interfaces specific to the repository. This design allows for basic users to easily explore the data and expert users to perform more complicated operations or create custom APIs for other databases. The repository APIs help users find and compare interatomic models, set up simulations, perform high throughput calculations, and access the high throughput results.</abstract>
                <slug>2023-76002-designing-user-friendly-apis-for-the-nist-interatomic-potentials-repository</slug>
                <track>Materials and Chemistry</track>
                
                <persons>
                    <person id='77005'>Lucas Hale</person>
                </persons>
                <language>en</language>
<description>This presentation outlines the Python APIs developed for the public database of the NIST Interatomic Potentials Repository. The entire framework consists of six different Python packages designed for data interaction and generation: DataModelDict, cdcs, yabadaba, potentials, atomman, and iprPy.  These packages have an import hierarchy, with each subsequent package incorporating or inheriting from the previous one.
All project data is represented with JSON/XML equivalent data models. Having data that can be equivalently represented in JSON and XML takes advantage of the benefits of both formats while placing only minor limits on schema designs.  The &#8220;DataModelDict&#8221; Python class extends the basic dict to allow for easy transformations between Python, JSON and XML, and includes additional methods for exploring and manipulating individual records.
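
As a stdlib-only sketch of the idea (a generic illustration, not the DataModelDict API itself), the same nested record can be serialized to either format without loss:

```python
import json
import xml.etree.ElementTree as ET

def dict_to_xml(tag, d):
    # Recursively mirror a nested dict as XML elements.
    elem = ET.Element(tag)
    for key, value in d.items():
        if isinstance(value, dict):
            elem.append(dict_to_xml(key, value))
        else:
            ET.SubElement(elem, key).text = str(value)
    return elem

# A hypothetical record, for illustration only.
record = {"potential": {"id": "demo-1", "element": "Cu"}}

as_json = json.dumps(record)                                           # JSON view
as_xml = ET.tostring(dict_to_xml("root", record), encoding="unicode")  # XML view
```

DataModelDict layers dict-like access plus the exploration and manipulation methods on top of exactly this kind of round-trip.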
All public potentials data are hosted in a CDCS database accessible at https://potentials.nist.gov.  CDCS databases store XML-formatted records, support multiple schemas, and provide both a web-based interface and a REST API for interacting with the data. The &#8220;cdcs&#8221; package defines Python methods for common database interactions that wrap around the REST API calls.  It also provides options to build custom REST calls to the database for features not yet directly supported.
The JSON/XML-equivalent data models mean that all records can also be stored in JSON-based Mongo databases or as local collections of JSON or XML files.  The &#8220;yabadaba&#8221; Python package provides an intermediate abstraction layer allowing users to interact with data stored in all three database infrastructures using common methods. It also provides a framework for interpreting and building data records associated with different schemas.  These features make it possible for end users to explore and generate data while remaining agnostic to the infrastructure used to store the data.
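
The backend-agnostic pattern can be sketched with an abstract base class (a generic stdlib illustration; yabadaba's real interface differs):

```python
from abc import ABC, abstractmethod

class Database(ABC):
    # Common interface: callers never see which backend holds the records.
    @abstractmethod
    def get_record(self, name): ...

    @abstractmethod
    def add_record(self, name, content): ...

class InMemoryDatabase(Database):
    # Stand-in for a local-file, Mongo, or CDCS backend.
    def __init__(self):
        self._records = {}

    def get_record(self, name):
        return self._records[name]

    def add_record(self, name, content):
        self._records[name] = content
```

Code written against the common interface works unchanged whichever concrete backend is plugged in.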
While the &#8220;cdcs&#8221; and &#8220;yabadaba&#8221; packages provide APIs for interacting with an arbitrary CDCS database, the &#8220;potentials&#8221; package provides APIs specifically focused on interatomic potentials content in potentials.nist.gov.  Utilizing the yabadaba features, any user can create their own copy of all interatomic potentials listings and then search and explore from either location. Searches can be performed both using simple Python methods or using Jupyter widget-based GUIs. The potentials package also forms the basis for adding new listings to the repository and for generating the traditional static repository website at https://www.ctcms.nist.gov/potentials/.
The &#8220;atomman&#8221; package focuses on setting up and analyzing atomic configurations and LAMMPS simulations.  On the data side, it extends the &#8220;potentials&#8221; package functionality to interpreting schemas of atomic configurations. Finally, the &#8220;iprPy&#8221; package is centered around providing a collection of standard atomistic property calculation methods for characterizing interatomic potentials. The iprPy calculations can be performed individually or in high throughput, and can be executed from the command line, from within Python, or using transparent-box demonstration Jupyter Notebooks.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/JH9JMV/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/JH9JMV/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Zlotnik Ballroom' guid='28373152-6164-559b-b8fd-5a1300de94ce'>
            <event guid='7c4fc33e-197c-514d-883a-16993f4847d2' id='76058' code='X7YH7A'>
                <room>Zlotnik Ballroom</room>
                <title>Keynote - Open Source Contributors in Space and Time</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T09:15:00-05:00</date>
                <start>09:15</start>
                <duration>00:45</duration>
                <abstract>Michael Droettboom is a Principal Software Engineering Manager at Microsoft where he leads the CPython Performance Engineering Team. That team contributes directly to the upstream CPython project, and recently helped make Python 3.11 up to 60% faster than 3.10.

Michael has been contributing to open source for over 25 years: he is the former lead maintainer of matplotlib, a major contributor to astropy, and he is the original author of Pyodide and airspeed velocity. His work has supported such diverse applications as the Hubble and James Webb Space Telescopes, the Firefox web browser, infrared retinal imaging, and optical sheet music recognition.</abstract>
                <slug>2023-76058-keynote-open-source-contributors-in-space-and-time</slug>
                <track>Keynote</track>
                
                <persons>
                    <person id='76864'>Michael Droettboom</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/X7YH7A/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/X7YH7A/feedback/</feedback_url>
            </event>
            <event guid='7e11362c-0265-5120-8011-d39ad0229af3' id='76204' code='G9P3AG'>
                <room>Zlotnik Ballroom</room>
                <title>Fast Exploration of the Milky Way (or any other n-dimensional dataset)</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T10:45:00-05:00</date>
                <start>10:45</start>
                <duration>00:30</duration>
                <abstract>N-dimensional datasets are common in many scientific fields, and quickly accessing subsets of these datasets is critical for an efficient exploration experience. Blosc2 is a compression and format library that recently added support for multidimensional datasets. Compression is crucial in effectively dealing with sparse datasets as the zeroed parts can be almost entirely suppressed, while the non-zero parts can still be stored in smaller sizes than their uncompressed counterparts. Moreover, the new double data partition in Blosc2 reduces the need for decompressing unnecessary data, which allows for top-class slicing speed.</abstract>
                <slug>2023-76204-fast-exploration-of-the-milky-way-or-any-other-n-dimensional-dataset</slug>
                <track>Machine Learning, Data Science, and Ethics in AI</track>
                <logo>/media/2023/submissions/G9P3AG/b2nd-2_hLm7ZVA_neb0K8D.png</logo>
                <persons>
                    <person id='77061'>Francesc Alted</person>
                </persons>
                <language>en</language>
<description>Blosc is a high-performance compressor optimized for binary data, such as floating-point numbers, integers, and booleans, although it can also handle string data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed direct memory fetch approach based on a plain memcpy() call. Blosc is widely used in popular storage libraries like HDF5 (via h5py or PyTables) or Zarr, and is probably producing many petabytes of compressed data every day around the world.

C-Blosc2 (https://github.com/Blosc/c-blosc2) is the latest major version of C-Blosc. It comes with Python-Blosc2 (https://github.com/Blosc/python-blosc2), a lightweight Python wrapper that exposes many of its new features. Some of the most interesting features are:

- 64-bit containers: There is no practical limit in dataset sizes.
- Frames: Data can be serialized either on-disk or in-memory.
- Meta-layers: Meta-data can be added in different layers inside frames.
- Blosc2 NDim: N-dimensional datasets can be created, read, and sliced efficiently.
- Double partitioning: Data can be split into fine-grained cubes for faster reads of n-dimensional slices.
- Parallel reads: When several blocks of a chunk need to be read, this is done in parallel.
- Support for special values: Large sequences of repeated values can be represented efficiently.

By leveraging these features, Blosc2 provides a powerful yet flexible tool for data handling.  For example, when Blosc2 cooperates with libraries like PyTables/HDF5, it allows querying [100 trillion-row tables in human time frames](https://www.blosc.org/posts/100-trillion-baby/).

Furthermore, being able to compress multidimensional data is of great help in handling large multidimensional datasets because 1) it reduces the amount of storage resources and 2) it reduces the bandwidth necessary to bring data from storage (disk, memory) to the CPU, allowing data to be processed more effectively in general.  Additionally, compression can represent a wide variety of sparse data without requiring a specific format. Instead, compression squeezes out the runs of zeros, keeping storage requirements to a minimum.
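
Blosc2's codecs are considerably more sophisticated, but the effect on sparse data is easy to demonstrate with nothing more than the standard library (a toy sketch, not Blosc2's API):

```python
import zlib

# A mostly-zero ("sparse") megabyte with 100 scattered non-zero bytes.
n = 1_000_000
buf = bytearray(n)
buf[::10_000] = b"\x01" * len(range(0, n, 10_000))

compressed = zlib.compress(bytes(buf))
ratio = n / len(compressed)  # the runs of zeros all but vanish
```

Even this general-purpose codec shrinks the buffer by orders of magnitude; Blosc2 adds fast SIMD-aware filters and the chunk/block partitioning on top.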

We will address common misconceptions about compressing data, such as: 1) decompressing data takes CPU time, which may slow down computations, and 2) when retrieving a subset of data, all affected partitions must be decompressed, adding overhead. To debunk these myths, we offer the following facts: 1) decompressing data within CPU caches often saves transmission cycles, and 2) Blosc2 features a novel double partitioning schema that minimizes decompression overhead.

We will leverage Python-Blosc2 to:

- Describe the main features of Blosc2
- Provide useful advice on the best codecs and filters for different types of datasets
- Explain how to partition multidimensional datasets for efficient slicing
- Compare efficiency and resource savings with other packages, such as h5py, PyTables, and Zarr

Finally, we will demonstrate an example of exploring the Milky Way&apos;s 3D dataset effectively, using data from the Gaia mission.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/G9P3AG/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/G9P3AG/feedback/</feedback_url>
            </event>
            <event guid='3b1bac70-cf96-500f-b610-5eb6741d34aa' id='76260' code='LAKM79'>
                <room>Zlotnik Ballroom</room>
                <title>A Computer Vision (ML) Approach to Classifying Clouds and Aerosols from Satellite Observations</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T11:25:00-05:00</date>
                <start>11:25</start>
                <duration>00:30</duration>
                <abstract>The NASA Atmosphere SIPS, located at the University of Wisconsin, is responsible for producing operational cloud and aerosol scientific products from satellite observations. With decades of satellite observations, new scientific algorithms are employing Machine Learning (ML) methods to improve processing efficiencies and scientific analyses. In preparation for future developments, we are working with NASA Atmospheric Science Teams to understand ML requirements and assist in developing new tools that will benefit both the Science Teams and the broader Open-Source Science community. This talk will step through a ML methodology being used to identify cloud types and severe aerosols.</abstract>
                <slug>2023-76260-a-computer-vision-ml-approach-to-classifying-clouds-and-aerosols-from-satellite-observations</slug>
                <track>Machine Learning, Data Science, and Ethics in AI</track>
                
                <persons>
                    <person id='77133'>Steve Dutcher</person>
                </persons>
                <language>en</language>
<description>The purpose of this talk is to share how to make the most efficient use of existing machine learning (ML) software, such as TensorFlow, to implement scientific ML methods. We will first describe the science objectives we are trying to achieve, elaborate on lessons learned, and finally introduce future challenges.

Our primary science objective is to identify different cloud types and aerosols from satellite imagery, where the cloud types are indicative of different meteorological conditions. The science objective will be presented for the broader scientific community, assuming little to no background in atmospheric science or ML. Throughout the talk, we will present visualizations of satellite imagery to relate the data to the audience.

Subsequently, we will introduce the ML techniques we have been using. We employ a pretrained VGG16, a convolutional neural network (CNN), which we fine-tune to identify cloud types and aerosols from satellite imagery. Accompanying animations will illustrate this process and how inference culminates in the softmax layer that provides the result.

A specific lesson learned in using ML software is to consider which parts of the code execute in CPU or GPU space. Initially, we noticed GPU usage was not consistently 100% during inference. To demonstrate the potential, we dumped the data to a 200GB file and streamed it directly to the GPU. This test proved what was possible and allowed us to rewrite our generator using the keras Sequence class, whose init and getitem methods call tf.device to pin the data I/O and preprocessing to the CPU, leaving the GPU solely for inference. This approach yielded a 2x performance increase.

Since our goal is to add value to existing NASA algorithm methodologies via ML, we need labeled data. We experimented with existing labeling packages but in the end decided to incorporate the labeling task into existing software the community uses. Thankfully, as part of NASA&#8217;s participation in Open-Source Science, one of the primary tools used by the Atmospheric Science community, NASA Worldview, is already open source. This allowed us to install their Docker images and extend the tool, bringing the labeling task directly to the scientists.

Additionally, I will talk about the importance of visualizing the data through the entire process of training a CNN. For example, I have a video flipping through thousands of images from our training set, and I will use it to emphasize the importance of looking at data throughout the process and of being able to share information. Open-Source Science is great, but being able to convey how ML works is just as important.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/LAKM79/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/LAKM79/feedback/</feedback_url>
            </event>
            <event guid='deb91577-2fea-5bc0-b7f8-ee9a03b57055' id='76266' code='T3XSHX'>
                <room>Zlotnik Ballroom</room>
                <title>Pandera: Beyond Pandas Data Validation</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T13:15:00-05:00</date>
                <start>13:15</start>
                <duration>00:30</duration>
                <abstract>Data quality remains a core concern for practitioners of machine learning, data science, and data engineering, and in recent years specialized packages have emerged to validate and monitor data and models. However, as the open source community iterates on data frameworks &#8211; notably, highly performant entrants such as Polars &#8211; data quality libraries need to catch up to support them. In this talk, you will learn about Pandera and its journey from being a pandas-only validator to a generic tool for testing arbitrary data containers so that it can provide a standardized way of creating data validation tools.</abstract>
                <slug>2023-76266-pandera-beyond-pandas-data-validation</slug>
                <track>Machine Learning, Data Science, and Ethics in AI</track>
                
                <persons>
                    <person id='76883'>Niels Bantilan</person>
                </persons>
                <language>en</language>
                <description># Motivation
Data quality remains a core concern for practitioners in machine learning, data science, and data engineering, and many specialized packages have emerged to fulfill the need of validating and monitoring data and models. However, as the open source community creates new data manipulation frameworks &#8211; notably, new highly performant entrants such as [Polars](https://www.pola.rs/) &#8211; existing data quality frameworks need to catch up to support them, and in some cases, the community creates new data validation libraries for these new data frameworks.

# Origins
[Pandera](https://github.com/unionai-oss/pandera) started as a small project in 2018 with the goal of providing a lightweight, flexible, and expressive API to validate Pandas DataFrames. This part of the talk provides a short primer on data validation and property-based testing with Pandera, providing insights into how its design facilitates code-first schema authoring and maintenance, which in turn gives rise to safer and more robust data pipelines.

This primer will contain content similar to the &quot;Introduction to Pandera&quot; notebook in the pandera documentation: https://pandera.readthedocs.io/en/stable/try_pandera.html
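
The run-time validation idea at Pandera's core can be illustrated in plain Python (a hypothetical, framework-free sketch; Pandera's actual API is schema- and DataFrame-based):

```python
def validate(schema):
    # Decorator: check every column against its predicate before running fn.
    def wrap(fn):
        def inner(data):
            for col, check in schema.items():
                if col not in data:
                    raise ValueError(f"missing column: {col}")
                bad = [v for v in data[col] if not check(v)]
                if bad:
                    raise ValueError(f"column {col!r} failed checks: {bad}")
            return fn(data)
        return inner
    return wrap

@validate({"age": lambda v: v >= 0})
def mean_age(data):
    return sum(data["age"]) / len(data["age"])
```

Calling mean_age({"age": [-1]}) raises immediately instead of silently propagating bad data downstream, which is the essence of run-time data validation.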

# Evolution
After gaining traction over the years, the author and community of contributors began to expand Pandera&#8217;s scope to support pandas-compliant data frameworks such as GeoPandas, Dask, Modin, and Pyspark Pandas (formerly Koalas). As requests for other libraries increased in frequency, it became clear that Pandera in its existing state was not well-suited for extension beyond Pandas objects. This part of the talk focuses on some of the key design failures that made it difficult to extend to other data frameworks.

Rewrites are Fun! (not): Imagine doing a complete internal rewrite of a library while bug reports, feature requests, and pull requests are coming in from contributors: does it sound fun? In the author&#8217;s experience, it&#8217;s like juggling three balls while playing drums with your feet as someone throws water balloons in your face. This part of the talk outlines the challenges, lessons learned, and things the author would have done differently to anticipate issues related to the separation of concerns, modularity, and extensibility.

# Conclusion
This talk is about how Pandera has evolved to provide a standard schema interface for easily extending and supporting validation backends for arbitrary statistical data containers. Attendees will learn not only about data testing principles such as run-time validation and property-based testing, they will also learn about the challenges of maintaining and evolving an open source project that many people rely on as a critical piece of their data infrastructure. The high-level goal for this talk is to highlight lessons learned from Pandera&#8217;s particular journey from supporting only Pandas as a backend to supporting a whole suite of data objects.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/T3XSHX/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/T3XSHX/feedback/</feedback_url>
            </event>
            <event guid='361fedf3-ad5f-5590-aa50-836ad8352fd5' id='76265' code='NPG3NS'>
                <room>Zlotnik Ballroom</room>
                <title>Disciplined Saddle Programming</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T13:55:00-05:00</date>
                <start>13:55</start>
                <duration>00:30</duration>
                <abstract>Our recent work implements a domain-specific language called Disciplined Saddle Programming (DSP) in Python. It is available at https://github.com/cvxgrp/dsp. DSP allows specifying convex-concave saddle, or minimax problems, a class of convex optimization problems commonly used in game theory, machine learning, and finance. One application for DSP is to naturally describe and solve robust optimization problems. We show numerous examples of these problems, including robust regressions and economic applications. However, this only represents a fraction of problems solvable with DSP, and we want to engage with the SciPy community to hear about further potential applications.</abstract>
                <slug>2023-76265-disciplined-saddle-programming</slug>
                <track>Machine Learning, Data Science, and Ethics in AI</track>
                <logo>/media/2023/submissions/NPG3NS/dsp_GKs6McB_hikBlOH.png</logo>
                <persons>
                    <person id='77033'>Philipp Schiele</person><person id='77082'>Eric Sager Luxenberg</person>
                </persons>
                <language>en</language>
<description>Convex-concave saddle problems are a class of optimization problems that generalize convex optimization and have a wide range of practical applications, including game theory, finance, and machine learning. A technical trick called dualization converts a saddle problem into a single convex problem. However, carrying out this conversion by hand can be tedious and error-prone.
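
In standard notation (a textbook statement of the problem class, not taken from the DSP paper), a convex-concave saddle problem is

```latex
\min_{x \in \mathcal{X}} \; \max_{y \in \mathcal{Y}} \; f(x, y),
```

where f(x, y) is convex in x for every fixed y and concave in y for every fixed x. Robust optimization fits this template by letting y range over an uncertainty set, so the inner maximum picks out the worst-case realization of the uncertain inputs.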

In this context, we introduced Disciplined Saddle Programming (DSP) in a recent paper, and the accompanying Python package is implemented as an extension of CVXPY. It is available at https://github.com/cvxgrp/dsp. DSP is a domain-specific language (DSL) for specifying saddle problems for which the dualizing trick can be automated. DSP is based on the conic-representable saddle programs developed by Juditsky and Nemirovski, who showed how to carry out the required dualization automatically using conic duality. The DSP language and methods extend Nesterov and Nemirovski&apos;s earlier development of conic representable convex problems and can be seen as extending disciplined convex programming (DCP) to saddle problems.

There are numerous benefits to using DSP. The language makes it easier for practitioners to specify and solve saddle problems, and it can handle a wide range of optimization problems, including many robust optimization problems, which have recently gained wider attention. Indeed, some argue that most optimization problems should be solved as robust problems instead, as inputs are rarely obtained with absolute certainty. Further, we hope to hear about even more applications from the SciPy community, which will help make the package easier for practitioners to integrate.

Just as DCP, and by extension CVXPY, made it easy for users to formulate and solve complex convex problems, DSP allows users to easily formulate and solve saddle problems. The method is implemented in an open-source Python package, also called DSP. This package provides a way to automate the dualization of saddle problems and provides a simple interface for users to formulate and solve complex problems in a structured and disciplined way.

In summary, disciplined saddle programming (DSP) is a new approach that can simplify solving saddle problems in convex optimization. It automates the dualization of saddle problems and provides a simple interface for users to specify and solve complex problems in a structured and disciplined way. DSP is designed to be easy to learn and use, and is compatible with the existing CVXPY framework. DSP has the potential to make saddle problems much easier to solve, which could have a significant impact on a wide range of fields that rely on optimization.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/NPG3NS/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/NPG3NS/feedback/</feedback_url>
            </event>
            <event guid='34e3fafd-ce03-501a-b24d-6362486a5ce4' id='76134' code='RJPMGC'>
                <room>Zlotnik Ballroom</room>
                <title>Emukit: Python toolkit for uncertainty quantification</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T14:35:00-05:00</date>
                <start>14:35</start>
                <duration>00:30</duration>
                <abstract>Emukit is an open-source package for uncertainty quantification in Python. It provides various Bayesian methods, such as optimization, experimental design and quadrature, in a flexible unified way that leverages their commonalities. In the talk we will explain how and why Emukit was built, what are its strengths and weaknesses, how it is used today and in what scenarios one might find it useful.</abstract>
                <slug>2023-76134-emukit-python-toolkit-for-uncertainty-quantification</slug>
                <track>Machine Learning, Data Science, and Ethics in AI</track>
                
                <persons>
                    <person id='76919'>Andrei Paleyes</person>
                </persons>
                <language>en</language>
                <description>## Description
Emukit is a highly adaptable Python toolkit for enriching decision making under uncertainty. This is particularly pertinent to model complex systems where data is scarce or difficult to acquire. In these scenarios, propagating well-calibrated uncertainty estimates within a design loop or computational pipeline ensures that constrained resources are used effectively.

The main features currently available in Emukit are:
* Bayesian optimisation: optimise physical experiments and tune parameters of machine learning algorithms;
* Experimental design/Active learning: design the most informative experiments and perform active learning with machine learning models;
* Sensitivity analysis: analyse the influence of inputs on the outputs of a given system;
* Bayesian quadrature: efficiently compute the integrals of functions that are expensive to evaluate;
* Multi-fidelity emulation: build surrogate models when data is obtained from multiple information sources that have different fidelity and/or cost.
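
As a rough, library-free illustration of what the first two items share (a probabilistic surrogate model plus an acquisition function that proposes the next evaluation), here is a minimal Gaussian-process surrogate with one expected-improvement step in plain NumPy/SciPy. This is a sketch of the underlying idea only, not Emukit's API.

```python
import numpy as np
from scipy.stats import norm

def rbf(X1, X2, ls=0.5):
    # Squared-exponential kernel on 1-D inputs
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Standard GP regression posterior mean and variance at query points Xs
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks))
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    # Acquisition for minimization: expected value of max(best - f, 0)
    s = np.sqrt(var)
    z = (best - mu) / s
    return (best - mu) * norm.cdf(z) + s * norm.pdf(z)

f = lambda x: np.sin(3 * x) + x ** 2     # toy objective to minimize
X = np.array([-1.0, 0.0, 1.5])           # evaluations so far
y = f(X)
grid = np.linspace(-2, 2, 401)
mu, var = gp_posterior(X, y, grid)
ei = expected_improvement(mu, var, y.min())
x_next = grid[np.argmax(ei)]             # next point the loop would evaluate
```

Emukit generalizes this pattern into pluggable model, acquisition, and optimizer components, which is what lets the same loop serve optimization, experimental design, and quadrature.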

The package was released in 2019, and has since gained popularity among the research communities of Bayesian optimization, Bayesian quadrature, and multi-fidelity modelling. The aim of this talk is to present Emukit to a wider audience of Python developers. It may be of interest to machine learning practitioners in need of hyper-parameter optimization methods, scientists running complex simulations and looking for emulation and UQ techniques, and everyone interested in approaches for decision making under uncertainty. Hearing about our development experience and lessons learned may also be useful to those looking to develop scientific packages in Python.

The first part of the talk will focus on technical details of the package. We will start with a brief introduction to Emukit and the methods it provides. Emukit was developed as a replacement for GPyOpt, and we will discuss the reasons that prompted its development. We will go over the key software design principles of Emukit, and see how they lead to a flexible and adaptable toolkit, but also how they may hinder computational efficiency. Other popular frameworks for Bayesian optimization, Trieste and BoTorch, will be used to highlight strengths and weaknesses of Emukit.

The second part will focus on usage and adoption. We will talk about the target audience of the toolkit, existing uses for teaching and research, and discuss why anyone who is not an expert in Bayesian active learning methods would want to use Emukit.

## Additional materials
Emukit is available on Github: https://github.com/EmuKit/emukit. There is also a website about the package: https://emukit.github.io/.

Emukit was first presented at NeurIPS workshop on ML and the Physical Sciences, 2019. Corresponding paper on arXiv: https://arxiv.org/abs/2110.13293.

Emukit is used for teaching ML and the Physical World course at the University of Cambridge. The course website can be found at https://mlatcl.github.io/mlphysical/.

Emukit was also adopted for the Gaussian Process Summer School 2022: https://gpss.cc/gpss22/.

Some of the previous talks given by the speaker can be found on his website: https://paleyes.info/#talks.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/RJPMGC/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/RJPMGC/feedback/</feedback_url>
            </event>
            <event guid='d678be05-ddde-516d-9758-564b89d9b5fb' id='76139' code='TVRHYS'>
                <room>Zlotnik Ballroom</room>
                <title>Bayesian Statistics with Python, No Resampling Necessary</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T16:05:00-05:00</date>
                <start>16:05</start>
                <duration>00:30</duration>
                <abstract>TensorFlow Probability is a powerful library for statistical analysis in Python.  Using TensorFlow Probability&#8217;s implementation of Bayesian methods, modelers can incorporate prior information and obtain parameter estimates and a quantified degree of belief in the results. Resampling methods like Markov Chain Monte Carlo can also be used to perform Bayesian analysis. As an alternative, we show how to use numerical optimization to estimate model parameters, and then show how numerical differentiation can be used to get a quantified degree of belief. How to perform simulation in Python to corroborate our results is also demonstrated.</abstract>
                <slug>2023-76139-bayesian-statistics-with-python-no-resampling-necessary</slug>
                <track>Machine Learning, Data Science, and Ethics in AI</track>
                
                <persons>
                    <person id='77016'>Charles D Lindsey</person>
                </persons>
                <language>en</language>
                <description>This talk is a concise update of a talk delivered previously for PyStan, the Python Interface for STAN, which is software for Bayesian inference.  Now we will focus on the TensorFlow Probability library.
Here are links for the previous talk:
https://github.com/c22hatal/bayes_confidence/tree/main/meetup11aug21/meetup11aug21
https://www.youtube.com/watch?v=-7l5QTq5Hz0&amp;list=PLhbPZ4oC18muuVdH3pjpjGmHkJqxCldYR&amp;index=11&amp;t=1073s

We first briefly review the Bayesian concepts of prior and posterior and elaborate on how the posterior distribution of the parameters can be approximated by a normal distribution with large sample sizes.  This is the key theoretical point of the talk and is discussed in section 4.1 of Bayesian Data Analysis [1].  Throughout the talk, we will corroborate this result using resampling methods.  We show that the normal approximation and resampling methods are equivalent with large data using TensorFlow Probability.  After the talk, users can confidently use TensorFlow Probability and SciPy/NumPy to perform Bayesian analysis without resampling if their samples are sufficiently large.

After the theoretical discussion, we get into how the posterior distribution can be modeled using TensorFlow Probability&#8217;s distribution classes.  I will show how you can sample from the distributions and calculate the posterior log probability density.  
We will focus on a linear regression setting where the Normal and chi-squared distributions will be used as priors for the slope and intercept parameters.
https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/Chi2
https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/Normal
https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/JointDistributionNamed

Then I show how the posterior modes can be estimated using TensorFlow or SciPy optimization.  The Broyden&#8211;Fletcher&#8211;Goldfarb&#8211;Shanno algorithm (BFGS) will be used.  This method avoids calculating the full Hessian (the matrix of second and cross derivatives of the log posterior function).
https://www.tensorflow.org/probability/api_docs/python/tfp/optimizer/lbfgs_minimize
https://docs.scipy.org/doc/scipy/reference/optimize.minimize-lbfgsb.html

The inverse Hessian gives us the posterior variance under our approximation, so I show how you can take numeric derivatives in NumPy to obtain it.  Derivatives are taken according to the method in Numerical Recipes [2].  Vectorized computations will be used where possible.
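
As a sketch of this pipeline (using SciPy/NumPy in place of TensorFlow Probability, with made-up data, a known noise scale, and weak Normal priors for brevity): find the posterior mode with L-BFGS-B, then recover an approximate posterior covariance from a finite-difference Hessian.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true intercept 1, slope 2

def neg_log_posterior(theta):
    a, b = theta
    loglik = norm.logpdf(y, loc=a + b * x, scale=0.5).sum()
    logprior = norm.logpdf(a, scale=10.0) + norm.logpdf(b, scale=10.0)
    return -(loglik + logprior)

res = minimize(neg_log_posterior, x0=np.zeros(2), method="L-BFGS-B")
mode = res.x

def numeric_hessian(f, x0, eps=1e-4):
    """Central finite differences for the matrix of second derivatives."""
    k = len(x0)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.zeros(k), np.zeros(k)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                       - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4.0 * eps ** 2)
    return H

# Laplace approximation: posterior covariance is the inverse Hessian at the mode.
cov = np.linalg.inv(numeric_hessian(neg_log_posterior, mode))
```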

Finally, I use resampling, specifically Markov chain Monte Carlo sampling, to show how well the approximation to the posterior distribution works.  This is accomplished using TensorFlow Probability functions.  I also provide a framework for simulation in Python that is used to demonstrate these results.
https://www.tensorflow.org/probability/api_docs/python/tfp/mcmc

References

1. Bayesian Data Analysis (3rd. ed.). A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari and D. B. Rubin, 2013 Boca Raton, Chapman and Hall&#8211;CRC
2. Numerical Recipes: The Art of Scientific Computing (3rd. ed.). W. H. Press, S. A. Teukolsky, W T. Vetterling, and Brian P. Flannery. 2007. Cambridge University Press, USA.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/TVRHYS/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/TVRHYS/feedback/</feedback_url>
            </event>
            <event guid='caddde95-fbeb-5411-9af2-76838aaebffd' id='76355' code='NUT798'>
                <room>Zlotnik Ballroom</room>
                <title>Lightning Talks</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T17:00:00-05:00</date>
                <start>17:00</start>
                <duration>01:00</duration>
                <abstract>Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can&#8217;t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.</abstract>
                <slug>2023-76355-lightning-talks</slug>
                <track>Lightning Talks</track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/NUT798/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/NUT798/feedback/</feedback_url>
            </event>
            <event guid='7c658cec-8c61-5b22-bc57-bb4c563155ff' id='76219' code='VUD8HM'>
                <room>Zlotnik Ballroom</room>
                <title>Poster Session and Job Fair</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T18:00:00-05:00</date>
                <start>18:00</start>
                <duration>01:00</duration>
                <abstract>The Poster session will be in the Zlotnik Ballroom from 6:00-7:00pm. 

The Job Fair will be held concurrently in the Zlotnik foyer with participating sponsors. Sponsor companies will be available to discuss current job opportunities.</abstract>
                <slug>2023-76219-poster-session-and-job-fair</slug>
                <track></track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>POSTERS: Title - Authors (Track)

1. RECOIL - Ronchi Evaluator and Classifier of Imperfect Lenses (RECOIL) - Allen S. Harvey Jr., Clare Egan (Astronomy and Physics)
2. Planetary Defense Using Python: Measuring Deflection of the Didymos Binary Asteroid System by the NASA DART Mission - Arushi Nath (Astronomy and Physics)
3. pyro: a python hydrodynamics code for teaching and prototyping - Michael Zingale (Astronomy and Physics)
4. Accessing astronomical data with Python - Brigitta Sip&#337;cz (Astronomy and Physics)
5. Spatial and Single-Cell Analysis of MERFISH Data using the Python Library Cormerant - Nicolas Fernandez (Bioinformatics, Computational Biology &amp; Neuroscience)
6. Cross-language Data Grammar for Single-cell Feature Engineering - Dave Bunten (Bioinformatics, Computational Biology &amp; Neuroscience)
7. Biomolecular crystallographic computing with Jupyter - Blaine Mooers (Bioinformatics, Computational Biology &amp; Neuroscience)
8. MDAKits: A Framework for FAIR-Compliant Molecular Simulation Analysis - Ian Kenney (Bioinformatics, Computational Biology &amp; Neuroscience)
9. Obtain quantitative insights through image registration in python - Matt McCormick, Konstantinos Ntatsis (Bioinformatics, Computational Biology &amp; Neuroscience)
10. EEG-to-fMRI: Neuroimaging Cross Modal Synthesis in Python - David Calhas (Bioinformatics, Computational Biology &amp; Neuroscience)
11. Matchmaker: A Toolkit for Collocating and Combining Satellite-Based Earth Observations - Greg Quinn (Earth, Ocean, Geo, and Atmospheric)
12. Building geospatial workflows for Impact using Leafmap, SageMaker Studio Lab, and Open Data on AWS - Qiusheng Wu, Mike Jeffe (Earth, Ocean, Geo, and Atmospheric)
13. Operational Open Science and Software for the Planet&apos;s Largest Climate Observatory - Zachary Sherman (Earth, Ocean, Geo, and Atmospheric)
14. Moving the Earth with thermodynamics and python - Cian Wilson (Earth, Ocean, Geo, and Atmospheric)
15. Bringing automated data analysis and machine learning pipelines directly to end users using Unidata tools - Thomas Martin, Hailey Johnson, Drew Camron (Earth, Ocean, Geo, and Atmospheric)
16. Yori: A New, Highly Customizable Tool for Level-3 Data Production - Paolo Veglio (Earth, Ocean, Geo, and Atmospheric)
17. Intuitive Statistics in SciPy - Matt Haberland, Albert Steppi (General Track)
18. Using MyST Markdown in JupyterLab - Rowan Cockett (General Track)
19. PyVista: A Python Library for Interactive 3D Data Visualization and Analysis - Tetsuo Koyama (General Track)
20. SOSA: The Scalable Open-Source Analysis Stack - James A. Bednar, Martin Durant (General Track)
21. Sensitivity Analysis in Python: `scipy.stats.sobol_indices` - Pamphile Roy (General Track)
22. Improving the SciPy-CuPy compatibility for interpolation and signal processing - Edgar Andr&#233;s Margffoy Tuay (General Track)
23. aPhyloGeo-Covid: A Web Interface for Phylogeographic Analysis of SARS-CoV-2 Variation using Neo4j and Snakemake - Nadia Tahiri, Wanlin Li (Machine Learning, Data Science, and Ethics in AI)
24. Quantifying Uncertainty in Time Series Forecasting with Conformal Prediction - Fede Garza Ramirez (Machine Learning, Data Science, and Ethics in AI)
25. Anti-Patterns: How not to do things in Python - Gajendra Deshpande (Machine Learning, Data Science, and Ethics in AI)
26. pomegranate v1.0.0: now with PyTorch - Jacob Schreiber (Machine Learning, Data Science, and Ethics in AI)
27. Data engineering and analytics for photolithography manufacturing process at DuPont &#8211; a practical approach from lab to fab - Avishek Panigrahi, Sumanth S, Abhishek Shrivastava, stefan caporale (Machine Learning, Data Science, and Ethics in AI)
28. Stochastic Unitary Constraints - Victoria Schneider, Sara Logsdon, Delaney Ott (Machine Learning, Data Science, and Ethics in AI)
29. Hamilton: Scalable, Portable, and Self-Documenting Dataflows in Python - Elijah ben izzy, Stefan Krawczyk (Machine Learning, Data Science, and Ethics in AI)
30. Teaching machine learning in professional education - Nadia Udler (Machine Learning, Data Science, and Ethics in AI)
31. Magic Data Abstractions (for  Magic&#8482; data) - Valerio Maggio (Machine Learning, Data Science, and Ethics in AI)
32. Self-Supervised Cilia Segmentation - Meekail Zain, Shannon Quinn (Machine Learning, Data Science, and Ethics in AI)
33. Data-centric ML pipeline for resolving data drift and optimizing data preprocessing - Hongsup Shin (Machine Learning, Data Science, and Ethics in AI)
34. &quot;Clockwork&quot; detection in categorical telemetry data - Benoit Hamelin (Machine Learning, Data Science, and Ethics in AI)
35. Intro to Quantum Computing for Drug Design - Maurice Benson (Machine Learning, Data Science, and Ethics in AI)
36. PyQtGraph - High Performance Visualization for All Platforms - Nathan Jessurun (Machine Learning, Data Science, and Ethics in AI)
37. Accelerating Drug Discovery on the Cloud with Open Source Python - Nathan Knapp (Materials and Chemistry)
38. Modeling Multiphase Multicomponent Precipitate Growth with Phase-Field and Python - Trevor Keller (they/them) (Materials and Chemistry)
39. Materials Project: building an open-source, data-driven platform for materials science - Ruoxi Yang (Materials and Chemistry)
40. Rozha: Supporting and Simplifying Multilingual Natural Language Processing - Ian Goodale (Social Science and the Digital Humanities)
41. Spatial Microsimulation &amp; Activity Allocation in Python: An Update on the Likeness Toolkit - James Gaboardi, Joe Tuccillo (Social Science and the Digital Humanities)
42. Python meta packages - Jorge Martinez, Roberto Pastor (Tending Your Open Source Garden: Maintenance and Community)
43. quartodoc: a tool for quick and easy package documentation - Michael Chow (Tending Your Open Source Garden: Maintenance and Community)
44. TUG-RSE: Pulling Students into Research Software Engineering - Aman Goel (Tending Your Open Source Garden: Maintenance and Community)
45. CI/CD pipelines for scientists - Jorge Martinez (Tending Your Open Source Garden: Maintenance and Community)
46. First steps toward supercharging remote development with Spyder - Carlos Cordoba (Tending Your Open Source Garden: Maintenance and Community)
47. Chalk&apos;it : dataflow and drag-and-drop Python dashboarding - Mongi Ben Gaid (Tending Your Open Source Garden: Maintenance and Community)
48. Accessible documentation for everyone - Jorge Martinez, Revathy Venugopal (Tending Your Open Source Garden: Maintenance and Community)
49. Patterns and Anti-Patterns when Measuring Diversity in Open Source - Amanda Casari (Tending Your Open Source Garden: Maintenance and Community)</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/VUD8HM/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/VUD8HM/feedback/</feedback_url>
            </event>
            <event guid='745e39bf-e1ab-5053-9dce-aa1333baddca' id='76052' code='3NH3L8'>
                <room>Zlotnik Ballroom</room>
                <title>SciPy Attendee Social Event hosted by Open Source Science (OSSci)</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-12T19:00:00-05:00</date>
                <start>19:00</start>
                <duration>02:00</duration>
                <abstract>At Scholz Garten, 1607 San Jacinto Blvd. Join your fellow community members from 7:00-9:00. Walking distance from AT&amp;T Center. Venue, food, and drinks sponsored by OSSci.</abstract>
                <slug>2023-76052-scipy-attendee-social-event-hosted-by-open-source-science-ossci</slug>
                <track></track>
                
                <persons>
                    <person id='77301'>Scholz Garten, 1607 San Jacinto Blvd</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/3NH3L8/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/3NH3L8/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='4' date='2023-07-13' start='2023-07-13T04:00:00-05:00' end='2023-07-14T03:59:00-05:00'>
        <room name='Amphitheater 204' guid='a7020de2-2717-51a7-bcd4-7e9831c5ab8f'>
            <event guid='0ae97057-34a9-56dd-b03f-86dbde7bcf52' id='76275' code='VVVQRU'>
                <room>Amphitheater 204</room>
                <title>Using Numba for GPU acceleration of Neutron Beamline Digital Twins</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T10:45:00-05:00</date>
                <start>10:45</start>
                <duration>00:30</duration>
                <abstract>This talk will discuss how Numba was used to accelerate MCViNE, a software environment for building and running digital twins of neutron experiments via Monte Carlo ray tracing. Numba is an open-source JIT compiler for Python using LLVM to generate efficient machine code for CPUs and GPUs with NVIDIA CUDA. Python and Numba were used to create a GPU accelerated version of MCViNE utilizing an extensible object-oriented design that has achieved a speedup of up to 1000x over the CPU. The performance gain with Numba enables more sophisticated data analysis and impacts neutron scattering science and instrument design.</abstract>
                <slug>2023-76275-using-numba-for-gpu-acceleration-of-neutron-beamline-digital-twins</slug>
                <track>Materials and Chemistry</track>
                
                <persons>
                    <person id='77057'>Coleman Kendrick</person>
                </persons>
                <language>en</language>
                <description>Motivation

MCViNE is a software package for creating digital twins of neutron scattering experiments using a Monte Carlo ray-tracing approach. These simulations are useful in performing advanced neutron data analysis as well as in designing novel neutron instruments and sample environments. Specifically, it has been used in the initial designs for instruments in the Second Target Station at the Spallation Neutron Source at Oak Ridge National Laboratory. Currently, MCViNE runs only on CPUs, which is a bottleneck in large simulations with tens of billions of neutrons and in modelling complex multiple scattering, with some simulations taking months to complete. Due to the massively parallel nature of Monte Carlo methods, bringing GPU acceleration to these simulations would offer superior performance and scalability. MCViNE was originally implemented in C++ and parallelized using MPI, with Python bindings for user interaction; however, extending it can be very difficult for users.

Methods

To improve performance and to ease user contributions, Python and Numba were chosen to create a new package providing GPU acceleration of MCViNE. Numba is an open-source JIT (just-in-time) compiler for Python using LLVM to generate efficient machine code and supports GPUs using NVIDIA CUDA. Numba is designed for scientific computing and can support NumPy arrays and functions. Currently, we are only using Numba for its GPU capabilities.

Using Python and Numba for this application allowed several advantages such as utilizing an extensible object-oriented approach and polymorphism. Each MCViNE instrument is composed of several components, such as a neutron source, a guide, and a monitor. During the simulation, neutrons can travel through each component in the instrument. Each component has a method (&#8220;propagate&#8221;) defined for propagating the neutron through it. Additionally, sample environments are created using constructive solid geometry (CSG) with each primitive shape defined as a CUDA kernel. To each constructed shape, many CUDA kernels are available, each modeling a different type of scattering physics. Due to different component/scattering-kernel types and geometric shapes, using an object-oriented design was beneficial. Furthermore, this structure allowed for custom on-the-fly CUDA kernel generation for complex instrument/sample geometries and physics.
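
The component/propagate design described above can be sketched in a few lines of plain NumPy. This is a schematic only: the behavior of each component is invented for illustration and is not MCViNE's implementation.

```python
import numpy as np

class Guide:
    """Illustrative beamline section: drift neutrons a fixed distance along z."""
    def __init__(self, length):
        self.length = length

    def propagate(self, neutrons):
        # neutrons is an (N, 6) array of x, y, z, vx, vy, vz
        t = self.length / neutrons[:, 5]          # time of flight to the exit plane
        neutrons[:, 0:3] += t[:, None] * neutrons[:, 3:6]
        return neutrons

class Monitor:
    """Illustrative detector: histogram arrival positions in x."""
    def __init__(self, bins):
        self.bins = bins
        self.counts = np.zeros(len(bins) - 1)

    def propagate(self, neutrons):
        self.counts += np.histogram(neutrons[:, 0], self.bins)[0]
        return neutrons

rng = np.random.default_rng(1)
neutrons = np.zeros((10_000, 6))
neutrons[:, 3] = rng.normal(scale=0.01, size=10_000)  # transverse velocity spread
neutrons[:, 5] = 1.0                                  # unit forward velocity
monitor = Monitor(bins=np.linspace(-0.5, 0.5, 11))
for component in [Guide(length=5.0), monitor]:        # each component propagates in turn
    neutrons = component.propagate(neutrons)
```

In the GPU version described in this talk, the per-component propagate step is what becomes a CUDA kernel compiled by Numba.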

Results and Conclusions

Python and Numba were used to create a GPU-accelerated version of MCViNE, which has so far achieved speedups of up to 1000x over the original CPU implementation. This performance gain enables more sophisticated data analysis for neutron scattering and impacts neutron scattering science and instrument design.
Using Python has helped increase the usability, extensibility, and maintainability of the codebase. Additionally, coupling Python with Numba allowed complex combinations of CUDA kernels to be generated at runtime, which would have been significantly harder to implement in other languages. The techniques used in this project could also be applied to other scientific computing applications.

Resources:

https://github.com/mcvine/acc
https://mcvine.ornl.gov/ 
https://github.com/mcvine/mcvine</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/VVVQRU/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/VVVQRU/feedback/</feedback_url>
            </event>
            <event guid='0bd4cf90-3019-5577-9e53-c7452c751c71' id='76081' code='AXSVZ3'>
                <room>Amphitheater 204</room>
                <title>Interactive Exploration of Large-Scale Datasets with Jupyter-Scatter</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T11:25:00-05:00</date>
                <start>11:25</start>
                <duration>00:30</duration>
                <abstract>Jupyter-scatter is a scalable, interactive, and interlinked scatter plot widget for exploring datasets with up to several million data points. It focuses on data-driven visual encodings and offers two-way pan+zoom and lasso interactions. Beyond a single instance, jupyter-scatter can compose multiple scatter plots and synchronize their views and selections. Moreover, points can be connected by spline-interpolated lines. Thanks to the underlying WebGL rendering engine, spatial and color changes are smoothly transitioned. Finally, the API integrates seamlessly with Pandas DataFrames and offers functional methods that group properties by type to ease accessibility and readability.</abstract>
                <slug>2023-76081-interactive-exploration-of-large-scale-datasets-with-jupyter-scatter</slug>
                <track>General Track</track>
                <logo>/media/2023/submissions/AXSVZ3/Teaser-Final_mdukBeW_isKMkGW.png</logo>
                <persons>
                    <person id='76866'>Fritz Lekschas</person>
                </persons>
                <language>en</language>
                <description>Visualizing datasets as a 2D scatter plot is one of the most popular data visualization methods for understanding the distributions, identifying trends, and discovering correlations. The method is used in any scientific domain. For instance, in biology, machine learning, or digital humanities, high-dimensional datasets are often summarized with dimensionality-reduction methods like PCA, t-SNE, or UMAP, and the results are typically visualized as 2D scatter plots to discover clusters.

Unfortunately, many visualization tools are unable to scale, or compromise the user experience, as datasets grow in size, dimensionality, and quantity. For instance, while datashader can render datasets of almost any size, it offers limited interactions. On the other hand, Plotly provides interactivity but does not scale nearly as well to millions of points. Ideally, we want to be able to render and interactively explore one or more datasets with millions of data points.

Jupyter-scatter (https://github.com/flekschas/jupyter-scatter) is a purpose-built widget for Jupyter Notebook, Lab, and Google Colab that supports interactive, interlinked, and scalable exploration of multiple large-scale datasets as scatter plots. It focuses on data-driven visual encodings, offers pan+zoom interactions, and two-way lasso selection. Beyond a single instance, jupyter-scatter can compose multiple scatter plots and synchronize their views and selections. Moreover, points can be connected by spline-interpolated lines. Thanks to the underlying WebGL rendering engine (https://github.com/flekschas/regl-scatterplot), changes in the spatial or color encoding of the points are smoothly transitioned. Finally, the widget API is inspired by seaborn and integrates seamlessly with Pandas DataFrames. As the number of arguments can get overwhelming when many properties are customized, jupyter-scatter provides a functional API that groups properties by type and exposes them via meaningfully-named methods. This functional API additionally allows users to programmatically modify active widgets from Python. To further ease the usability, jupyter-scatter infers sensible default color encodings from the data and dynamically adjusts the point opacity based on the point density in the current field of view.

Using examples from single-cell biology and machine learning we demonstrate how jupyter-scatter works, how it enables more efficient exploration of large-scale datasets, and how it can be integrated with other ipywidgets to build bespoke applications.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/AXSVZ3/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/AXSVZ3/feedback/</feedback_url>
            </event>
            <event guid='091e5369-3448-5212-b6b0-d30790881650' id='76222' code='VGAUQN'>
                <room>Amphitheater 204</room>
                <title>Accessibility best practices for authoring Jupyter notebooks</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T14:20:00-05:00</date>
                <start>14:20</start>
                <duration>00:30</duration>
                <abstract>So you&#8217;ve written the perfect notebook, but do you know who can read it? As a notebook author you have great stories, code, and visualizations filling your work, but how often do you consider accessibility? Jupyter notebooks seem like they are for everyone, but how a notebook gets written can greatly impact how usable it is for people with disabilities. We&#8217;ve curated authoring-focused best practices for notebook content to help your notebooks be more inclusive and reach a wider audience.</abstract>
                <slug>2023-76222-accessibility-best-practices-for-authoring-jupyter-notebooks</slug>
                <track>General Track</track>
                
                <persons>
                    <person id='76973'>Stephannie Jimenez Gacha</person><person id='76982'>Isabela Presedo-Floyd</person>
                </persons>
                <language>en</language>
                <description>Accessibility practices are for everyone, but this may be especially important to notebook authors in academic and public settings where it is often legally required. Using [staple accessibility frameworks](https://www.w3.org/WAI/WCAG21/Understanding/intro#understanding-the-four-principles-of-accessibility), this talk will dive into what it means to make your notebook&#8217;s content accessible and provide actionable guidance on how you as an author can improve your notebooks. These skills can be applied regardless of preferred notebook interface, author skill set, or prior accessibility knowledge.

This talk is best for an audience that is familiar with Jupyter notebooks. Prior accessibility knowledge or any other Jupyter knowledge is not necessary. The content is likely to be most engaging for an audience who regularly authors notebooks.

The structure of the talk will be as follows:
1. Background and introduction to accessibility (7 minutes)
    1.1 Why this talk? (Hint: community members have requested it)
    1.2 Defining accessibility and scoping: what we will and won&#8217;t cover in the talk
    1.3 Common terms (disability, WCAG, assistive technology)
2. Breaking down the notebook with WCAG (13 minutes)
    2.1 Perceivable elements (Labels, colors, alternative forms of content)
    2.2 Operable elements (Labeling for interactive areas, keyboard controls)
    2.3 Understandable writing and structure (Markdown headings, summaries, plain language)
3. Adopting a notebook accessibility checklist (2 minutes)
4. What you can do next (2 minutes)
5. Questions (6 minutes)
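
As a hypothetical illustration (not an example taken from the talk itself), ordered headings and alternative text are two of the simplest authoring habits this kind of guidance covers:

```markdown
## Loading the data

Headings are nested in order (a single H1, then H2s) so screen-reader
users can navigate the notebook by structure. Images carry descriptive
alternative text rather than being left empty:

![Bar chart of survey responses per age group, peaking at ages 25 to 34](responses.png)
```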

At the end of this talk, attendees will
* Have an awareness of foundational accessibility principles and how they can appear in Jupyter notebooks.
* Be able to identify common accessibility pitfalls (e.g. misused Markdown, incomplete visualizations) in Jupyter notebooks and what to do instead.
* Have a checklist for easy reference of accessibility best practices when writing their own notebooks or editing existing ones.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/VGAUQN/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/VGAUQN/feedback/</feedback_url>
            </event>
            <event guid='11aed7f2-c127-5499-9c3b-ba4869d259c8' id='76342' code='7ZGCQM'>
                <room>Amphitheater 204</room>
                <title>Scientific and technical publishing with Python and Quarto</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T15:00:00-05:00</date>
                <start>15:00</start>
                <duration>00:30</duration>
                <abstract>In research and data science, effective communication requires weaving together narrative text and code to produce elegantly formatted output. By embedding executable Python code blocks inside markdown, the open-source publishing platform, Quarto, works with Jupyter and VS Code to enable you to create these fully reproducible documents and reports with the format and styling you need. In this talk I&#8217;ll share how to get started and a few of my favorite things in Quarto including creating a manuscript, presentation and website in HTML, PDF and Word from a single source file, and creating lessons, reports, and Confluence documents.</abstract>
                <slug>2023-76342-scientific-and-technical-publishing-with-python-and-quarto</slug>
                <track>General Track</track>
                
                <persons>
                    <person id='77011'>Tracy Teal</person>
                </persons>
                <language>en</language>
                <description>Research and data science isn&#8217;t just experiments and code; it&#8217;s also communicating about our results, creating reports, sharing analyses, and teaching. To communicate effectively, we need to weave together narrative text and code to produce elegantly formatted, interactive output. Not only does it need to look great, but it needs to be reproducible, accessible, easily editable, diffable, version controlled and output in a variety of formats, such as PDF, HTML and MS Word. Jupyter has already made so much of this possible! The open-source publishing platform Quarto works with Jupyter, and is also available as a VS Code extension, so that we can easily use the output format and the styling that&#8217;s needed for any situation. You can author documents as plain text markdown or Jupyter notebooks with scientific markdown, including equations (LaTeX support!), citations, cross references, figure panels, callouts, advanced layouts, and more.

Quarto (https://quarto.org/) is a markdown format that adds executable Python code blocks and builds on top of Pandoc to produce a variety of output documents. This allows you to create fully reproducible documents and reports&#8212;the Python code required to produce your output is part of the document itself, and is automatically re-run whenever the document is rendered. This means you can create documents as plain text markdown or Jupyter notebooks that can be easily rendered into presentations, websites and manuscripts in a variety of journal formats. You can also engage readers by adding interactive data exploration to your documents using Jupyter Widgets, htmlwidgets for R, Observable JS, and Shiny.
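
As a minimal sketch (the file contents here are illustrative, not from the talk), a Quarto document is plain markdown with a YAML header and executable code cells, rendered with `quarto render`:

````markdown
---
title: "Example report"
format: html
jupyter: python3
---

The figure below is regenerated every time the document renders:

```{python}
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [1, 4, 9])
plt.show()
```
````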

In this talk, I&#8217;ll discuss how to author these dynamic, computational documents with Quarto and Python, showing how to get started and highlighting a few of my favorite things. I&#8217;ll walk through how to use a single source document to target multiple formats - transforming a simple document into a presentation, a scientific manuscript, a website, a blog, and a book in a variety of formats including HTML, PDF and MS Word.  I&#8217;ll share workflows for creating and automating reports, an approach to creating online lessons, and finally how to publish Jupyter notebooks within existing content management systems like Hugo, Docusaurus, and Confluence, so that you can get started creating whatever content you need.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/7ZGCQM/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/7ZGCQM/feedback/</feedback_url>
            </event>
            <event guid='674a137c-1d1d-59ac-acd3-57981a93684c' id='76106' code='RVLFPB'>
                <room>Amphitheater 204</room>
                <title>Taming Black Swans: Long-tailed distributions in the natural and engineered world</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T15:50:00-05:00</date>
                <start>15:50</start>
                <duration>00:30</duration>
                <abstract>Long-tailed distributions are common in natural and engineered systems; as a result, we encounter extreme values more often than we would expect from a short-tailed distribution. If we are not prepared for these &quot;black swans&quot;, they can be disastrous.

But we have statistical tools for identifying long-tailed distributions, estimating their parameters, and making better predictions about rare events.

In this talk, I present evidence of long-tailed distributions in a variety of datasets -- including earthquakes, asteroids, and stock market crashes -- discuss statistical methods for dealing with them, and show implementations using scientific Python libraries.</abstract>
                <slug>2023-76106-taming-black-swans-long-tailed-distributions-in-the-natural-and-engineered-world</slug>
                <track>General Track</track>
                <logo>/media/2023/submissions/RVLFPB/longtail_38_0_I4vvxzb_SO93LqN.png</logo>
                <persons>
                    <person id='76884'>Allen  Downey</person>
                </persons>
                <language>en</language>
                <description>You would think we&apos;d be better prepared for disaster. But events like
Hurricane Katrina in 2005, which caused catastrophic flooding in New
Orleans, and Hurricane Maria in 2017, which caused damage in Puerto Rico
that has still not been repaired, show that large-scale disaster
response is often inadequate. Even wealthy countries -- with large
government agencies that respond to emergencies and well-funded
organizations that provide disaster relief -- have been caught
unprepared time and again.

There are many reasons for these failures, but one of them is that rare,
large events are fundamentally hard to comprehend. Because they are
rare, it is hard to get the data we need to estimate their likelihood precisely.
And because they are large, they challenge our ability to imagine
quantities that are orders of magnitude bigger than what we experience
in ordinary life.

In terms introduced by Nassim Taleb, a &quot;black swan&quot; is a large, impactful event that was
considered extremely unlikely before it happened, based on a model of
prior events. If the distribution of event sizes is actually long-tailed
and the model is Gaussian, black swans will happen with some regularity.
However, black swans can be &quot;tamed&quot; by using appropriate models, including lognormal, Student t, and Pareto distributions.
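
The contrast between models can be sketched with SciPy; this is an illustrative example with simulated data, not the analysis from the talk:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated event sizes drawn from a long-tailed (lognormal) distribution
data = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)

# Fit a lognormal model, fixing loc=0 as is common for sizes
shape, loc, scale = stats.lognorm.fit(data, floc=0)

# Probability of an extreme ("five sigma") event under each model
threshold = data.mean() + 5 * data.std()
p_lognormal = stats.lognorm.sf(threshold, shape, loc=loc, scale=scale)
p_gaussian = stats.norm.sf(threshold, loc=data.mean(), scale=data.std())

print(f"tail probability, lognormal model: {p_lognormal:.2e}")
print(f"tail probability, Gaussian model:  {p_gaussian:.2e}")
```

Under the Gaussian model such an event is a black swan, with probability on the order of one in a million; under the fitted lognormal model it is merely uncommon.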

In this talk, I introduce these distributions and show how they can be used to model measurements from natural and engineered systems -- including earthquakes, craters on the moon, solar flares, file sizes, and stock market crashes. We will use distributions and optimization tools from SciPy to estimate parameters and generate predictions, and Matplotlib to visualize the results.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/RVLFPB/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/RVLFPB/feedback/</feedback_url>
            </event>
            <event guid='7c3c1623-ce8a-5471-848e-704af1e0ec0f' id='76374' code='X8KZ3E'>
                <room>Amphitheater 204</room>
                <title>View, annotate, and analyze multi-dimensional images in Python with napari</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T16:30:00-05:00</date>
                <start>16:30</start>
                <duration>00:30</duration>
                <abstract>napari is an n-dimensional image viewer for Python. If you&#8217;ve ever tried `plt.imshow(arr)` and made Matplotlib unhappy because `arr` has more than two dimensions, then napari might be for you! napari will gladly *display higher-dimensional arrays* by providing sliders to explore additional dimensions. But napari can also: *overlay* derived data, such as points, segmentations, polygons, surfaces, and more; and *annotate* and *edit* these data, using standard data structures like NumPy or Zarr arrays, allowing you to *seamlessly weave* exploration, computation, and annotation in image analysis.</abstract>
                <slug>2023-76374-view-annotate-and-analyze-multi-dimensional-images-in-python-with-napari</slug>
                <track>General Track</track>
                <logo>/media/2023/submissions/X8KZ3E/napari-window_5oz2dQ9_qmN8o3K.png</logo>
                <persons>
                    <person id='77048'>Juan Nunez-Iglesias</person>
                </persons>
                <language>en</language>
                <description>napari is an n-dimensional image viewer for Python. If you&#8217;ve ever tried `plt.imshow(arr)` and made Matplotlib unhappy because `arr` has more than two dimensions, then napari might be for you!

The napari canvas can be 2D or 3D. When you give napari an array with more dimensions than the canvas, it will automatically create sliders for those additional dimensions, allowing you to rapidly explore your full data, rather than a few sampled slices.

Image analysis and visualization involves more than images though: feature detection algorithms result in *points*, segmentation results in *label images*, annotation results in *shapes* such as rectangles or polygons, and more. Napari provides *layers* that can be displayed on top of each other or side by side, allowing users of Scientific Python to gain a rapid understanding of the algorithms they&#8217;re using &#8212; where they work well and where they might go wrong.

Sometimes, image analysis algorithms get you *this* far, but not quite far enough. In such cases, it&#8217;s useful to manually curate their output, then continue with downstream steps of an analysis. Napari provides editing tools for its layer types, allowing one, for example, to add missing points to the output of a peak detection algorithm, remove incorrect ones, paint over incorrect parts of a segmentation, or draw polygons around missed objects of interest. The resulting data points are saved in standard Scientific Python data structures, such as NumPy or Zarr arrays.

This design makes it easy to seamlessly weave together image exploration; image computation, processing, and analysis; and data annotation, curation, and editing.

Napari provides a *plugin interface*, allowing developers to extend napari&#8217;s capabilities, providing users with novel ways to interact with their data. Because napari provides both a library accessible within Python, IPython, and Jupyter, *and* a standalone executable script, we have even found that napari plugins can be an effective way to help collaborators run Python image analysis workflows without needing to launch Python.

In this talk, I&#8217;ll introduce napari&#8217;s history, demonstrate all the features described above, and discuss current limitations and where we&#8217;re going.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/X8KZ3E/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/X8KZ3E/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Grand Salon C' guid='528485f3-5505-5c92-85b8-0f828e66a79d'>
            <event guid='f4d3aee3-d117-5c48-98be-c30be7d50b12' id='76177' code='MFQQRJ'>
                <room>Grand Salon C</room>
                <title>Interactive Analysis of Satellite Imagery with Earth Engine and Geemap</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T10:45:00-05:00</date>
                <start>10:45</start>
                <duration>00:30</duration>
                <abstract>Google Earth Engine is a cloud-computing platform with a multi-petabyte catalog of satellite imagery and geospatial datasets. Built upon the Earth Engine Python API and open-source mapping libraries, geemap enables Earth Engine users to interactively manipulate, analyze, and visualize geospatial big data in a Jupyter environment. This presentation introduces Earth Engine and highlights the key features of geemap for interactive mapping and geospatial analysis with Earth Engine. Attendees can utilize geemap to create satellite timelapse animations for any location on Earth within 60 seconds. Additional resources will be provided to the attendees to learn more about geemap.</abstract>
                <slug>2023-76177-interactive-analysis-of-satellite-imagery-with-earth-engine-and-geemap</slug>
                <track>Earth, Ocean, Geo, and Atmospheric</track>
                
                <persons>
                    <person id='76935'>Steve Greenberg</person><person id='77292'>Qiusheng Wu</person>
                </persons>
                <language>en</language>
                <description>The Earth is constantly changing, which creates significant challenges for the environment and human society. To tackle these challenges on a global scale, the Earth science community relies heavily on geospatial datasets that are collected through various means, such as satellite, aerial, and mobile sensors. However, the explosive growth of geospatial datasets over the past few decades has overwhelmed the Earth science community&apos;s capacity for storage, analysis, and visualization. Fortunately, the advent of cloud-computing platforms, such as Google Earth Engine, has made it possible to access, manipulate, and analyze large volumes of geospatial data on-the-fly. In recent years, Earth Engine has become increasingly popular in the geospatial community and has enabled numerous Earth science applications at local, regional, and global scales.

The geemap Python package is built upon the Earth Engine Python API and open-source mapping libraries. It allows Earth Engine users to interactively manipulate, analyze, and visualize geospatial big data in a Jupyter environment. Since its creation in April 2020, geemap has received over [2,500 GitHub stars](https://github.com/giswqs/geemap/stargazers) and is being used by over [800 projects](https://github.com/giswqs/geemap/network/dependents) on GitHub. More than [130 Jupyter notebook examples](https://geemap.org/tutorials/)  and an [open-access book](https://book.geemap.org/) are available for learning geemap. 

This presentation introduces Earth Engine and highlights the key features of geemap for interactive mapping and geospatial analysis with Earth Engine, such as
- Searching and loading datasets from the Earth Engine Data Catalog
- Visualizing raster and vector datasets interactively
- Using Cloud Optimized GeoTIFFs (COG) and SpatioTemporal Asset Catalogs (STAC)
- Visualizing the Dynamic World global land cover datasets
- Creating satellite timelapse animations

This presentation is intended for scientific programmers, data scientists, geospatial analysts, and concerned citizens of Earth. Attendees should have a basic understanding of Python and Jupyter Notebook. Familiarity with Earth science and geospatial datasets is not necessary, but it will be helpful. For more information about Earth Engine and geemap, visit https://earthengine.google.com and https://geemap.org.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/MFQQRJ/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/MFQQRJ/feedback/</feedback_url>
            </event>
            <event guid='eb3a5af6-4cba-5bb8-b9c1-ea6b4920d22b' id='76277' code='BAD9ZQ'>
                <room>Grand Salon C</room>
                <title>Accelerating the Use of Public Geophysical Data for Recharging California&#8217;s Groundwater</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T11:25:00-05:00</date>
                <start>11:25</start>
                <duration>00:30</duration>
                <abstract>Recharging ground aquifers is an urgent task for improving groundwater sustainability in California. Geophysical data can provide a capability to image the subsurface where the major data gap lies. However, neither the data nor the analytic tools required to derive subsurface information are readily accessible. We present an interactive web application that utilizes a public database and GIS capabilities, and directly integrates Jupyter Notebooks and Python packages from researchers to guide recharge site location. Our demonstration showcases how this technology can contribute to improving groundwater recharge in California and how integrating the research knowledge directly into a web application can increase the impact.</abstract>
                <slug>2023-76277-accelerating-the-use-of-public-geophysical-data-for-recharging-california-s-groundwater</slug>
                <track>Earth, Ocean, Geo, and Atmospheric</track>
                <logo>/media/2023/submissions/BAD9ZQ/scipy_2023_thumbnail_nZPIWB6_udsAJvZ.png</logo>
                <persons>
                    <person id='77058'>SEOGI KANG</person>
                </persons>
                <language>en</language>
                <description>California&apos;s Central Valley is one of the world&apos;s most productive farmlands, but the region faces a serious threat to groundwater sustainability due to population growth and climate change. Recharging ground aquifers is essential to address this challenge; however, a major data gap exists in the subsurface. Geophysical data can provide crucial information about the subsurface, but neither the data nor the analytic tools required to derive subsurface information are readily accessible to those working on the recharge problem.
In this talk, we will present our development of a web application and companion public database for accelerating groundwater recharge in California, which is a part of the Sustainability Accelerator Project funded by the Stanford Doerr School of Sustainability. Our application uses electrical resistivity data obtained from electromagnetic geophysical surveys, as well as ancillary data from driller&apos;s logs (containing information about sediment/rock) and water level/quality measurements, to create 2D maps of recharge metrics. These maps guide the location of recharge sites, and the public resistivity and ancillary data are compiled into an online database using Redivis and displayed in a custom web application. The application provides project partners the ability to utilize research codes without requiring knowledge of Python, and is flexible enough to allow researchers to make rapid updates in response to feedback from partners, meeting their specific needs for recharge site location.
The development of the web application was a collaborative effort between academic researchers and software engineers at Curvenote. The application enables direct use of research code by front-facing practitioners tackling the recharge problem in California. We utilized open-source Python packages to create Jupyter Notebooks that can execute each stage of the workflow.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/BAD9ZQ/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/BAD9ZQ/feedback/</feedback_url>
            </event>
            <event guid='92af6874-178e-535f-8413-c2394a9fdf33' id='76290' code='XMBALS'>
                <room>Grand Salon C</room>
                <title>UXarray, a python library for unstructured climate and weather data</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T14:20:00-05:00</date>
                <start>14:20</start>
                <duration>00:30</duration>
                <abstract>UXarray aims to provide xarray-styled functionality for unstructured grid datasets. UXarray offers support for loading and representing unstructured grids by utilizing existing Xarray functionality paired with new routines that are specifically written for operating on unstructured grids. In this talk, we will present the current capabilities of the library: reading and writing of unstructured grids, reading of datasets along with basic grid operations and the need to speed up computations, integration operations along with details on speedups obtained by using Numba and python indexing. We will also demonstrate the use of this library for visualization of unstructured grids.</abstract>
                <slug>2023-76290-uxarray-a-python-library-for-unstructured-climate-and-weather-data</slug>
                <track>Earth, Ocean, Geo, and Atmospheric</track>
                
                <persons>
                    <person id='77325'>Rajeev Jain</person>
                </persons>
                <language>en</language>
                <description>After less than a year of development, UXarray has already become a popular Python project with active community engagement, boasting more than 10 forks and 77 stars on GitHub.

The UXarray project aims to bridge the gap between traditional operations on structured grids and modern standards for unstructured grids, such as the UGRID specification. Global climate models have traditionally used rectangular latitude-longitude grids for their data layout, but these grids lead to computational challenges at high resolutions due to the convergence of lines of longitude at the poles. Therefore, modeling centers worldwide have adopted unstructured grids that allow for quasi-uniform distribution of data over the sphere. However, analyzing data on these grids is far more difficult than on latitude-longitude grids, often requiring groups to apply lossy regridding to their data so that traditional tools can be applied. To partly address this problem, groups worldwide have moved towards the adoption of standards for unstructured grid data, such as the UGRID specification developed under the Climate-Forecast (CF) conventions.

Most climate models output data in the NetCDF format, and the CF conventions are an important standard for organizing the metadata of these files and include details on how to describe a rectangular latitude-longitude grid. The UGRID specification describes how a NetCDF file can represent an unstructured grid, but it has potential issues. Currently, the UGRID specification is under consideration to be included in the NetCDF CF conventions.

Our new Python library, UXarray, supports operations directly on unstructured grid data, reducing the need for creating regular-grid copies of unstructured grid output and simplifying the workflow. Unstructured grids can be provided in files following various conventions, such as UGRID, SCRIP, EXODUS, etc. These conventions have different definitions and representations of the attributes and variables used to describe the unstructured grid topology. Moreover, the UGRID convention does not enforce standard variable namings for most of the attributes and variables, except for a few required ones. UXarray unifies all of these conventions at the data loading step by representing grids internally in the UGRID convention, regardless of the original grid file type. Furthermore, it uses a set of standardized names for topology attributes and variables, while still providing the user with the original attribute names and variables from the grid definition file. All of these features lay the foundation for the development of quick and efficient algorithms for climate scientists around the world.

Our design for UXarray aims to maintain Xarray interoperability, which allows us to utilize various Xarray-compatible packages. UXarray uses Numba for loop optimizations and faster computation. Additionally, we provide examples and performance metrics showcasing interoperable read/write operations, grid and corresponding data reading, efficiency and optimization built into UXarray, and visualization.

Overall, UXarray aims to simplify the workflow for climate and weather scientists working with unstructured grids and allow them to efficiently analyze and visualize their data.</description>
                <recording>
                    <license></license>
                    <optout>true</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/XMBALS/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/XMBALS/feedback/</feedback_url>
            </event>
            <event guid='6b8db17f-053a-5dfb-813c-a23d897899cd' id='76160' code='QYHD3G'>
                <room>Grand Salon C</room>
                <title>Introducing yt_xarray</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T15:00:00-05:00</date>
                <start>15:00</start>
                <duration>00:30</duration>
                <abstract>*yt_xarray* is a new package in the scientific python ecosystem for linking *yt* and *xarray*. *yt*, primarily used in computational astrophysics, has gradually broadened support for scientific domains, including geoscience disciplines. Most geoscience data, however, still requires manual steps to load into *yt*. *yt_xarray*, a new *xarray* extension, aims to streamline communication of data from *xarray* to *yt*, providing a potentially useful tool to the many geoscience researchers already using *xarray* while allowing *yt* to leverage the distributed backends already supported by *xarray*. In this presentation, we will provide an overview of the usage and design of *yt_xarray*.</abstract>
                <slug>2023-76160-introducing-ytxarray</slug>
                <track>Earth, Ocean, Geo, and Atmospheric</track>
                
                <persons>
                    <person id='77089'>Chris Havlin</person>
                </persons>
                <language>en</language>
                <description>A number of recent efforts within the [*yt*](https://yt-project.org/) community have broadened the scope of scientific domains supported by *yt*. Some of these efforts included improving generic functionality while others focused on adding functionality required for specific domains outside the astrophysics scientific community. For geoscience data in particular, the addition of a geographic coordinate handler and an interface to [*cartopy*](https://scitools.org.uk/cartopy/docs/latest/) for producing maps within the *yt* plotting framework enabled analysis of geographic datasets. Getting the data into *yt*, however, was not as streamlined as it could be; with the exception of some new custom data ingestors (termed &quot;frontends&quot; in yt) for specific geoscience data products, most geoscience data still required manual loading of arrays with generic *yt* loaders. In addition to extra steps for the user, this limitation also required that the data fit entirely within memory. [*yt_xarray*](https://yt-xarray.readthedocs.io/en/latest/) fills this gap in data regularization required for loading geodata in *yt* by leveraging [*xarray*](https://docs.xarray.dev/en/stable/) for reading of data on demand as *yt* needs it. 

Rather than a traditional *yt* frontend, *yt_xarray* v0.1 introduced an *xarray* `accessor` object that streamlines the creation of *yt* datasets from subsets of fields, simplifying the process of using *yt* with most regularly gridded datasets that *xarray* can load. While the initial release focuses on simply returning a *yt* dataset object for use with any *yt* function, future releases will further simplify access to *yt* functions from *xarray* by providing *yt* function wrappers from within *yt_xarray*.

While *yt* and *xarray* have some similarity in that they both load and manipulate coordinate-referenced arrays, *yt* is inherently designed primarily for volumetric data while *xarray* supports sets of labeled arrays more generally. This difference informed a number of important design choices in *yt_xarray*, in particular with regard to how chunked arrays are handled. For gridded datasets in *yt*, a physical domain can be subdivided into multiple grid objects so that a single *yt* &quot;chunk&quot; maps to a subdomain of the whole grid. During processing, subdomains are processed sequentially so that data is loaded as needed. In *xarray*, chunks are defined as contiguous index ranges within arrays, with the actual data potentially residing in on-disk files or existing as delayed computations. *yt_xarray* merges these two chunking systems by building *yt* grids that map spatial subdomains to index ranges of *xarray* fields. This allows a 1:1 mapping of *Dask*-*xarray* chunks to yt grid objects but also allows multiple *Dask*-*xarray* chunks to be contained within a yt grid object.
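
The index-range-to-subdomain mapping can be illustrated with a small, self-contained sketch (plain NumPy along a single axis; this is illustrative, not actual *yt_xarray* code):

```python
import numpy as np

# One coordinate axis of a regular grid, chunked Dask-style into
# contiguous index ranges.
n = 100
coords = np.linspace(0.0, 1.0, n)
chunk_sizes = (30, 30, 40)

# Index range covered by each chunk along this axis
edges = np.concatenate(([0], np.cumsum(chunk_sizes)))
index_ranges = [(int(edges[i]), int(edges[i + 1])) for i in range(len(chunk_sizes))]

# Spatial subdomain spanned by each index range: this is the kind of
# mapping that lets a grid object load only the chunks it overlaps.
subdomains = [(coords[i0], coords[i1 - 1]) for i0, i1 in index_ranges]

for (i0, i1), (lo, hi) in zip(index_ranges, subdomains):
    print(f"indices [{i0}, {i1}) span spatial extent [{lo:.3f}, {hi:.3f}]")
```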

In this presentation, we will provide an overview of using *yt_xarray* to load and analyze regularly gridded 2D and 3D *xarray* datasets. In addition to general usage and development plans, we will describe the design of *yt_xarray* with a focus on leveraging the performance benefits of distributed arrays loaded via *xarray*.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/QYHD3G/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/QYHD3G/feedback/</feedback_url>
            </event>
            <event guid='7a9743e2-e382-5f3a-8130-f835665657de' id='76268' code='LCWBBP'>
                <room>Grand Salon C</room>
                <title>Tidy Geospatial Cubes</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T15:50:00-05:00</date>
                <start>15:50</start>
                <duration>00:30</duration>
                <abstract>The open-source project, Xarray, combines labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. Xarray has strong user bases in the physical sciences and geospatial community. However, new users commonly struggle to fit their dataset into the Xarray model and with conceptualizing and constructing an Xarray object that makes subsequent analysis steps easy (&#8220;dataset wrangling&#8221;). We take inspiration from the &#8220;tidy data&#8221; concept for dataframes &#8212; &#8220;datasets structured to facilitate analysis&#8221; (Wickham, 2014) &#8212; and attempt a definition of tidy data for labeled array objects provided by Xarray.</abstract>
                <slug>2023-76268-tidy-geospatial-cubes</slug>
                <track>Earth, Ocean, Geo, and Atmospheric</track>
                
                <persons>
                    <person id='76938'>Deepak Cherian</person><person id='77260'>Emma Marshall</person><person id='77318'>Scott Henderson</person>
                </persons>
                <language>en</language>
                <description>The open-source project, Xarray, combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays (&quot;cubes&quot;) to provide an intuitive and scalable interface for scientific analysis. Xarray is now widely used across many areas of scientific research, with a particularly strong user base in the physical sciences. New users commonly struggle to fit their dataset into the Xarray data model and, in particular, struggle with conceptualizing and constructing an Xarray object that makes subsequent analysis steps easy (&#8220;dataset wrangling&#8221;). We take inspiration from the &#8220;tidy data&#8221; concept for dataframes &#8212; &#8220;datasets structured to facilitate analysis&#8221; (Wickham, 2014) &#8212; and attempt a definition of tidy data for labeled array objects provided by Xarray.

A &#8216;tidy dataset&#8217; framework will help streamline processing workflows across the physical sciences and provide a set of norms and principles to guide the use and construction of large and complex datasets encountered in these fields. The utility of this exercise is twofold: helping dataset producers construct more useful Analysis-Ready datasets; and developing a set of guidelines that can help users wrangle their datasets into a form that enables convenient analysis with Xarray. In addition, a commonly-defined concept for &#8216;tidy&#8217; geospatial array data might enable development of &#8216;tidy&#8217; tools that consume and produce tidy datasets (Wickham, 2014).

This presentation will examine three datasets and the processes of &#8216;tidying&#8217; them. We will demonstrate various ways that a dataset may be &#8216;untidy&#8217; &#8212; not conducive to analysis &#8212; and present a useful set of rules to define &#8216;tidy geospatial cubes.&#8217; The examples we will discuss are: 1) Harmonized Landsat Sentinel-2 (HLS), a dataset of multispectral reflectance measurements, 2) Aquarius, a dataset of remotely sensed sea surface salinity measurements; and 3) ITS_LIVE, a multi-sensor dataset of ice velocity measurements for glaciers and ice sheets based on satellite image pairs. Our presentation will walk through common analytical workflows with these remote sensing datasets and highlight the organizational choices a user must make along the way (related to metadata, variables, coordinates, and dimensions) to efficiently arrive at a computational result with Xarray.

Defining a common framework for labeled array objects will ease the learning curve for new users and minimize the time spent on data-wrangling steps. At present, the examples are satellite remote sensing datasets, and we recognize that there might be elements of the &#8216;tidy Xarray&#8217; definition that are specific to this subdomain. We hope to spark a discussion that will help generalize the presented principles.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/LCWBBP/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/LCWBBP/feedback/</feedback_url>
            </event>
            <event guid='6733a7b3-26ba-5b33-b607-ea16a14a5881' id='76267' code='EPPR7R'>
                <room>Grand Salon C</room>
                <title>Climate Model Evaluation Workflow Built on Jupyter Notebooks</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T16:30:00-05:00</date>
                <start>16:30</start>
                <duration>00:30</duration>
                <abstract>This project introduces an extensible workflow used to evaluate climate model output using collections of Jupyter notebooks. The workflow supports parametrizing and batch-executing notebooks using Papermill, in conjunction with developing notebooks interactively. Additional features include integration with Dask and caching intermediate data products generated by notebooks. The final product of the workflow can automatically be built into a Jupyter book for easy presentation and shareability. While it was initially developed for climate modeling, the flexible and extensible nature of this framework makes it adaptable to any kind of data analysis work, and the presentation will highlight this capability.</abstract>
                <slug>2023-76267-climate-model-evaluation-workflow-built-on-jupyter-notebooks</slug>
                <track>Earth, Ocean, Geo, and Atmospheric</track>
                
                <persons>
                    <person id='77109'>Elena Romashkova</person>
                </persons>
                <language>en</language>
                <description>Motivation

Within the field of climate modeling, there is a need to run collections of scripts generating plots of common diagnostic metrics of climate model output, for example as models are run with different configurations during development. These scripts often involve manual configuration, and the output is not necessarily well-organized for interpreting and sharing. Jupyter notebooks help address this problem, creating more readable workflows that can be annotated and edited interactively, then easily presented to others as a Jupyter book. However, Jupyter notebooks are not by default parameterizable or runnable in batches. This project addresses this gap by utilizing Papermill to create a package that can run collections of Jupyter notebooks with configurable parameters, cache generated data products, and publish results as a Jupyter book, while continuing to support the interactive development work that Jupyter notebooks enable. This framework is not limited to use within climate modeling; the infrastructure is useful to any data science project that would benefit from a batch-executable, parameterizable, and shareable Jupyter notebook-based workflow. 
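
The parameter-keyed caching idea can be sketched in plain Python (a hypothetical illustration of the concept, not the package&#8217;s actual caching implementation):

```python
import hashlib
import json
from pathlib import Path

def cache_key(notebook, parameters):
    """Derive a stable cache location for a notebook run from its
    parameter set, so repeated runs with identical parameters can
    reuse previously generated data products."""
    # Serialize deterministically so equal parameters hash equally.
    payload = json.dumps({"nb": notebook, "params": parameters},
                         sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return Path("cache") / (Path(notebook).stem + "-" + digest)

p = cache_key("diagnostics.ipynb", {"case": "control", "year": 2000})
# identical notebook + parameters always map to the same cache path
```
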

Methods

This project uses a number of existing open-source Python tools, building on the Jupyter ecosystem using Papermill as well as Jinja templating, supporting Dask functionality, and publishing a Jupyter book. It brings these tools together to create a powerful workflow that combines their functionality. The project infrastructure will be published as a Python package and on GitHub, and examples showcasing its functionality will be made available.

Results

Currently (as of 3/1/23), the project is in the development stage, with several working demos. By the time of the conference, a more complete version will be public on GitHub with documentation and installable as a Python package, along with examples that can be downloaded and built on.

Conclusion

We have developed a framework for data analysis using collections of parameterizable Jupyter notebooks, along with infrastructure to support Dask, caching of data products, building a Jupyter book, and other features. This is a powerful application of the Jupyter ecosystem and can be applied to a wide range of fields beyond the climate model evaluation use case it was initially developed for.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/EPPR7R/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/EPPR7R/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Zlotnik Ballroom' guid='28373152-6164-559b-b8fd-5a1300de94ce'>
            <event guid='caf9e46c-b276-5e27-9b43-3ab6c864ebad' id='76099' code='DQQBWR'>
                <room>Zlotnik Ballroom</room>
                <title>Keynote - How Open Source Tools Power the Efforts of Biological Data Analysis and Drug Discovery</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T09:15:00-05:00</date>
                <start>09:15</start>
                <duration>00:45</duration>
                <abstract>Angela Pisco is the head of computational biology at insitro. She is passionate about extracting meaningful information from biomedical datasets and using it to improve disease understanding and drug development. She studied Biomedical Engineering for her BSc and MSc and holds a PhD in Systems Biology. Her PhD work became the foundation of a new direction of thinking on why cancer develops resistance to chemotherapy, the major reason for treatment failure. In her postdoctoral work, she investigated the mechanisms of cellular differentiation in the skin. She developed a 3D computational model that recapitulated the observed changes in the mouse skin connective tissue and dermis during development. The combination of mathematical analysis with experimental data led to a new understanding of how distinct fibroblast subpopulations become activated, proliferate, and deposit matrix proteins during wound healing. Before moving to insitro, she led the Data Science platform at CZ Biohub. There she made significant contributions to whole-organism cell atlas projects, including the first whole mouse cell atlas, the first aging cell atlas, and Tabula Sapiens, one of the first Human Cell Atlas drafts (The Tabula Sapiens Consortium, Science 2022). She is also a founder and core member of Open Problems in Single Cell (openproblems.bio), a community effort to improve multimodal data analysis by generating gold-standard datasets and benchmarking metrics and infrastructure.</abstract>
                <slug>2023-76099-keynote-how-open-source-tools-power-the-efforts-of-biological-data-analysis-and-drug-discovery</slug>
                <track>Keynote</track>
                
                <persons>
                    <person id='76890'>Angela Pisco</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/DQQBWR/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/DQQBWR/feedback/</feedback_url>
            </event>
            <event guid='986c9db2-011e-59a2-b691-32aa7e02a52a' id='76170' code='TKQFWU'>
                <room>Zlotnik Ballroom</room>
                <title>Subpoenas Less Scary</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T10:45:00-05:00</date>
                <start>10:45</start>
                <duration>00:30</duration>
                <abstract>Your users have entrusted their data to you. But what happens when a government law enforcement agency demands you share that data with them? We will demystify the process of receiving and responding to law enforcement&#8217;s demands for data. We demonstrate how designing around privacy can limit what needs to be shared. To make subpoenas less scary, we break them down as a technical process, and share the protections we implemented at Mozilla. If you want to understand the real-world impact of your approaches to privacy, this talk is for you.</abstract>
                <slug>2023-76170-subpoenas-less-scary</slug>
                <track>Machine Learning, Data Science, and Ethics in AI</track>
                
                <persons>
                    <person id='77213'>Rebecca BurWei</person><person id='77254'>David Zeber</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/TKQFWU/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/TKQFWU/feedback/</feedback_url>
            </event>
            <event guid='dbd0ac81-1048-5e16-baed-01933f7252eb' id='76209' code='H9FDBV'>
                <room>Zlotnik Ballroom</room>
                <title>Diversity Luncheon Keynote: How can we protect vulnerable groups while measuring representation in our communities?</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T12:15:00-05:00</date>
                <start>12:15</start>
                <duration>00:45</duration>
                <abstract>Diversity, equity and inclusion initiatives often start with measurement - what do our communities look like today and how can we track progress against our goals? However, data collected through APIs, web scraping, surveys, interviews, inference etc. have the potential to expose more details about an individual than they were expecting, especially when aggregated across platforms and shared in public forums. This talk will discuss tactics, opportunities and challenges when collecting sensitive data in and around open source communities, while aligning with policies and regulations, respecting the right to anonymity and ensuring the safety of all members of the community.</abstract>
                <slug>2023-76209-diversity-luncheon-keynote-how-can-we-protect-vulnerable-groups-while-measuring-representation-in-our-communities</slug>
                <track>Keynote</track>
                
                <persons>
                    <person id='77262'>Sophia Vargas</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/H9FDBV/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/H9FDBV/feedback/</feedback_url>
            </event>
            <event guid='a2783e20-df15-5493-91bc-d6cd3ae1cff2' id='76128' code='XSQKSA'>
                <room>Zlotnik Ballroom</room>
                <title>Using Python to accelerate sustainable aviation fuel research and development</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T14:20:00-05:00</date>
                <start>14:20</start>
                <duration>00:30</duration>
                <abstract>Aviation comprises 2-3% of global CO2 emissions. Transitioning to cleaner, more sustainable aviation fuels can reduce its environmental impacts. To help accelerate sustainable aviation fuel development, we trained machine learning models to predict fundamental properties of biofuel blends using Fourier transform infrared (FTIR) spectra. We leveraged TPOT and standard libraries like NumPy, pandas, and scikit-learn to develop the models. This presentation will discuss how we overcame challenges with decomposing FTIR spectra data and using machine learning on small datasets (&lt;100 samples). We will also discuss integration of the models into our open-source webtool to support biofuel research.</abstract>
                <slug>2023-76128-using-python-to-accelerate-sustainable-aviation-fuel-research-and-development</slug>
                <track>Machine Learning, Data Science, and Ethics in AI</track>
                
                <persons>
                    <person id='77113'>Ana Comesana</person>
                </persons>
                <language>en</language>
                <description>Aviation comprises 2-3% of global carbon dioxide emissions and 9-12% of U.S. transportation greenhouse gas emissions. Sustainable aviation fuels have the potential to reduce emissions and environmental impacts; however, due to high costs and high-volume requirements, experimental property testing of bio-based jet fuels is usually conducted years after initial bench-scale experiments are completed. Neglecting to conduct property testing early in the development cycle can lead to wasted investment in the production of biofuels that do not meet performance expectations.

Machine learning has already proven to be a valuable tool for predicting sustainable aviation fuel properties and accelerating research. In 2020, we presented our approach at SciPy (https://www.youtube.com/watch?v=ENOf0IZDla8) to predict high-throughput aviation fuel properties of over 10,000 molecules with molecular descriptors. The correlation analysis and tree-based methods for feature ranking were later published in Fuel (https://doi.org/10.1016/j.fuel.2022.123836). Using the property prediction models, we created the first Python-based, comprehensive, open-source webtool that enables scientists and companies to explore viable bio-based molecules without spending time and money testing in the lab (https://feedstock-to-function.lbl.gov).

Because aviation fuels are made of blends of molecules and compounds, our current research focuses on expanding the webtool to predict properties of fuel blends using Fourier transform infrared (FTIR) spectra and experimental property data. Specifically, we use binning and smoothing techniques to reduce experimental noise in more than 6700 FTIR spectra features and use non-negative matrix factorization (NMF) for feature selection to develop models that predict fundamental properties of biofuel blends (e.g., boiling point, flash point, melting point, specific gravity, and kinematic viscosity). The predictive models are also integrated into the webtool to help sustainable aviation fuel research. 

Our workflow includes using libraries such as NumPy, pandas, and scikit-learn to reduce FTIR spectra data into interpretable components for property prediction, and the Tree-based Pipeline Optimization Tool (TPOT) to develop property prediction models with reduced FTIR spectra as features. Specifically, we will discuss methods for coalescing experimental spectra data from different sources and present methods for reducing the influence of experimental noise on model performance. We will also discuss using NMF as a dimensionality reduction technique that correctly groups FTIR spectra wavelengths together and yields meaningful features. Additionally, we will address common pitfalls such as defining an applicability domain and recognizing and limiting the possibility of overfitting.
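
A minimal sketch of the binning step (illustrative only; the study&#8217;s actual preprocessing pipeline may differ):

```python
def bin_spectrum(intensities, bin_size):
    """Reduce a spectrum's feature count by averaging adjacent
    bins -- one simple way to suppress experimental noise before
    dimensionality reduction such as NMF."""
    binned = []
    for i in range(0, len(intensities), bin_size):
        window = intensities[i:i + bin_size]
        binned.append(sum(window) / len(window))
    return binned

coarse = bin_spectrum([1, 3, 2, 4, 9, 11], 2)
# 6 features -> 3 binned features: [2.0, 3.0, 10.0]
```

Applied to the full dataset, the same idea shrinks the 6700+ raw FTIR features to a count small enough for modeling on fewer than 100 samples.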

By sharing our experience and lessons learned, we aim to help the community overcome similar challenges when developing models for advancing science, while also demonstrating how a Python-based, open-source webtool can facilitate faster, less expensive bioprocess optimization and scale-up of sustainable aviation fuels.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/XSQKSA/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/XSQKSA/feedback/</feedback_url>
            </event>
            <event guid='120a193d-d83c-5b3e-846d-7c07a88a5e9c' id='76140' code='MEGK33'>
                <room>Zlotnik Ballroom</room>
                <title>Contributor experience - Why it matters</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T15:00:00-05:00</date>
                <start>15:00</start>
                <duration>00:30</duration>
                <abstract>Behind every successful open source project is a strong contributor community. What makes these communities strong? What can you do in your OSS project to nurture a thriving contributor community? In this presentation, we will share insights from the work of the Contributor Experience Lead team (NumPy, SciPy, Matplotlib, and pandas) and discuss why designing and providing positive contributor experience is vital to sustainability of each individual project and the SciPy ecosystem overall.</abstract>
                <slug>2023-76140-contributor-experience-why-it-matters</slug>
                <track>Tending Your Open Source Garden: Maintenance and Community</track>
                
                <persons>
                    <person id='77172'>Noa Tamir</person><person id='77038'>Melissa Weber Mendon&#231;a</person><person id='77046'>Inessa Pawson</person>
                </persons>
                <language>en</language>
                <description>Behind every successful open source project is a strong contributor community. Engaging and supporting contributors requires specialized knowledge, experience, and time commitment from project leaders. However, a chronic lack of resources and time often prevents them from focusing on this work. Recognizing these challenges, in late 2021 we created a team of Contributor Experience Leads to support contributors to the four foundational libraries in the Scientific Python ecosystem: NumPy, SciPy, Matplotlib, and pandas.
In this presentation, we will share insights from the work of our team and discuss why it is vital for project maintenance and sustainability. We will examine what we have identified as primary goals and priorities for a Contributor Experience team in each project, taking into account project size, structure, and governance model. We will also discuss how this work could be applied to other projects in the SciPy ecosystem.
Finally, we will talk about the Contributor Experience Project (https://contributor-experience.org), a community of practice and an open-source community-led project dedicated to developing best practices for onboarding and supporting contributors to open source.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/MEGK33/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/MEGK33/feedback/</feedback_url>
            </event>
            <event guid='4871c29f-8a18-5170-924c-1b9c9b9a028a' id='76005' code='T3NSL8'>
                <room>Zlotnik Ballroom</room>
                <title>Zarr: Community specification of large, cloud-optimised, N-dimensional, typed array storage</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T15:50:00-05:00</date>
                <start>15:50</start>
                <duration>00:30</duration>
                <abstract>A key feature of the Python data ecosystem is the reliance on simple but efficient primitives that follow well-defined interfaces to make tools work seamlessly together (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing piece: scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, Javascript, Julia, and Python.</abstract>
                <slug>2023-76005-zarr-community-specification-of-large-cloud-optimised-n-dimensional-typed-array-storage</slug>
                <track>Tending Your Open Source Garden: Maintenance and Community</track>
                
                <persons>
                    <person id='77066'>Sanket Verma</person><person id='77255'>Josh Moore</person><person id='77120'>John Kirkham</person>
                </persons>
                <language>en</language>
                <description>Zarr is a data format for storing chunked, compressed N-dimensional arrays and is sponsored by [NumFOCUS](https://numfocus.org/project/zarr) under their umbrella.
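
As a simplified illustration of how a chunked N-dimensional array decomposes into per-chunk store keys (a sketch in the spirit of the Zarr v2 &quot;i.j.k&quot; key layout, not a complete implementation):

```python
from itertools import product
from math import ceil

def chunk_keys(shape, chunks):
    """Enumerate the chunk-grid keys for an N-dimensional array,
    joining per-dimension chunk indices with "." as in Zarr v2's
    store layout (metadata and compression omitted here)."""
    # Number of chunks along each dimension, rounding up at edges.
    grid = [ceil(n / c) for n, c in zip(shape, chunks)]
    return [".".join(map(str, idx))
            for idx in product(*(range(g) for g in grid))]

keys = chunk_keys((1000, 1000), (500, 500))
# 2 x 2 chunk grid -> keys "0.0", "0.1", "1.0", "1.1"
```

Because each chunk lives under its own key, any key-value store (a local directory, an object store in the cloud) can serve chunks independently and in parallel.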

In this presentation, we will discuss the evolution of Zarr, first introduced at [SciPy 2019](https://youtu.be/qyJXBlrdzBs); the development of the [Zarr Enhancement Process (ZEP)](https://zarr.dev/zeps/) and its use to define the next major version of the [Zarr Specification (V3)](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html); as well as uptake of the format across the research landscape.

### Outline:

First, we&#8217;ll be talking about:

### Introduction and Working of Zarr (10 mins.)

- What is Zarr, and how does it work?
    - The inner workings of Zarr using illustrated graphics
    - When and why should you use Zarr?
    - Extensive pluggable compressors (via [numcodecs](https://github.com/zarr-developers/numcodecs/)) and file-storage systems
- What is the [Zarr Specification](https://zarr.readthedocs.io/en/stable/spec/v2.html)?
    - A summary of the technical specification of Zarr
    - Adoption of the Zarr specification in various programming languages like Python, C, C++, Java, and Javascript, and how all of us form a wonderful community together
- Development of Zarr since it was first presented at SciPy 2019 by Alistair Miles
    - Highlighting some important technical and community milestones since 2019
    - Securing grants from [CZI](https://chanzuckerberg.com/eoss/proposals/zarr-a-common-backbone-for-the-scalable-storage-of-annotated-tensor-data/) and getting sponsored by NumFOCUS

After this:

### Usage of Zarr across several domains (5 mins.)

- Interoperability with Dask, Xarray and Numpy
- Adoption of Zarr by various communities like Geospatial, Bio-imaging, Genomics, Data Science/Engineering etc.
- Development of convention processes like [GeoZarr](https://github.com/zarr-developers/geozarr-spec) and [OME-Zarr](https://github.com/ome/ome-zarr-py)

Then we&#8217;ll discuss:

### [ZEP Process](https://zarr.dev/zeps/) (10 mins.)

- Need and origin of a community feedback process for the evolution of Zarr specification
- How does it work?
- Transformation from a steering-council-governed to a community-owned specification
- Learnings when migrating from [Spec V2](https://zarr.readthedocs.io/en/stable/spec/v2.html) &#8594; [Spec V3](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html)

And finally:

### Conclusion (5 mins.)
    
- Key takeaways
- How can you get involved?
- QnA

This talk is aimed at an audience that works with large amounts of data and is looking for a format that is transparent, open-source, reliable, cloud-optimised, and friendly to the environment. We also invite anyone interested in the lessons we learnt maintaining the project throughout the years.

The tone of the talk is set to be informative, story-telling and fun.

### After this talk, you&#8217;d:

- understand the basics of Zarr and its specification,
- know why you should have a process for your project,
- have essential takeaways regarding when an OSS project transitions from a young to a mature stage,
- understand the pros and cons of a steering-council-governed vs a community-owned open-source project</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/T3NSL8/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/T3NSL8/feedback/</feedback_url>
            </event>
            <event guid='1c2fbe6b-58da-5c1f-a4fc-b8738e62e75e' id='76258' code='UT3CUZ'>
                <room>Zlotnik Ballroom</room>
                <title>Building MetPy for the Long Term: Working to Keep an Open Source Project Sustainable</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T16:30:00-05:00</date>
                <start>16:30</start>
                <duration>00:30</duration>
                <abstract>MetPy is an open-source Python package for meteorological and atmospheric science applications, leveraging many other pieces of the scientific Python stack (e.g. numpy, matplotlib, scipy, etc.). With a focus on sustainability, MetPy extensively leverages GitHub Actions to automate as much of the software development process as possible. Sustainability also extends to the growth of the community of developers, which we have been working to make sustainable as well. Here we talk about our experiences and share our successes and lessons learned in trying to build a sustainable project.</abstract>
                <slug>2023-76258-building-metpy-for-the-long-term-working-to-keep-an-open-source-project-sustainable</slug>
                <track>Tending Your Open Source Garden: Maintenance and Community</track>
                
                <persons>
                    <person id='77134'>Ryan May</person>
                </persons>
                <language>en</language>
                <description>MetPy is an open-source Python package for meteorological and atmospheric science applications, leveraging many other pieces of the scientific Python stack (e.g. numpy, matplotlib, scipy, etc.). Its goal is to provide tested, reusable components suitable for a wide array of tasks, including scripted data visualization and analysis. The guiding principle is to make MetPy easy to use with any dataset that can be read into Python. MetPy&#8217;s general functionality breaks down into: reading data, meteorological calculations, interpolation, and meteorology-specific plotting. MetPy also has significant integration with Xarray, as well as extended support for interpreting netCDF Climate and Forecast (CF) convention metadata.

As a scientific software project that has actively solicited users across the research and education spaces, MetPy has placed a heavy emphasis on sustainability. Too often, core scientific libraries fall into disarray, with a heavy toll on the reproducibility of scientific results. Even given our strong institutional support, our goal with the MetPy project is to build it with an eye to these potential problems and keep it as sustainable as possible.

One axis of sustainability for us lies on the side of technology and project infrastructure, which has been highly automated. This starts with our unit tests and test coverage, run automatically on GitHub using its Actions service. These tests run across a variety of OS, Python version, and package manager combinations, as well as covering a wide array of dependency sets, giving us great coverage of potential breakages. The automation also extends to documentation builds and publication, link checking, code quality checks, and, most importantly, making releases. This combination of processes, built heavily on the GitHub Actions service, minimizes the need for humans in the loop of standard software development steps, allowing us to maximize the use of development time elsewhere.
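
Such a multi-dimensional test matrix can be sketched roughly as follows (an illustrative GitHub Actions fragment, not MetPy's actual workflow; the job layout and dependency set names are hypothetical):

```yaml
# Hypothetical workflow fragment illustrating a CI test matrix
jobs:
  tests:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ['3.10', '3.11', '3.12']
        dependencies: [minimum, latest]   # pinned-oldest vs. newest deps
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - run: python -m pip install -e .[test]
      - run: python -m pytest --cov
```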

Technological automation is important for sustainability, but it&#8217;s only one part of the equation; to have a truly sustainable open source project, you must also address the people side. MetPy follows open development practices to drive community participation as much as possible. We use GitHub issues, pull requests, discussions, and projects extensively to allow input from any interested user. We also hold regular, open developer calls to keep the project moving forward, and we have started holding community calls to give the community more of a voice in the direction of the project and to encourage more of its members to become involved directly with MetPy development.

This talk will share our lessons learned, both with technology and people, to help other projects that want to improve their overall sustainability.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/UT3CUZ/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/UT3CUZ/feedback/</feedback_url>
            </event>
            <event guid='e7665fc9-16b8-537c-9b61-403346d72202' id='76361' code='HWW7S7'>
                <room>Zlotnik Ballroom</room>
                <title>Lightning Talks</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T17:20:00-05:00</date>
                <start>17:20</start>
                <duration>01:00</duration>
                <abstract>Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can&#8217;t guarantee spots. Sign-ups are at the NumFOCUS booth during the conference.</abstract>
                <slug>2023-76361-lightning-talks</slug>
                <track>Lightning Talks</track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/HWW7S7/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/HWW7S7/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 103' guid='32d9d0e7-9b5f-5eb2-a2ba-0f6853d9777b'>
            <event guid='276ddad3-258c-526b-96e4-3045abf7fc0e' id='76010' code='SZP3LA'>
                <room>Classroom 103</room>
                <title>[BoF Room 103] PyArrow in pandas and Dask</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T13:15:00-05:00</date>
                <start>13:15</start>
                <duration>00:55</duration>
                <abstract>DataFrame libraries in general, and pandas and Dask specifically, are moving towards better integration with PyArrow. This has many benefits, like improved performance and a reduced memory footprint. We want to connect with users to discuss how PyArrow can improve DataFrame libraries and what they expect out of PyArrow support. This can include things like improved performance, more consistent behavior, or better interoperability with other libraries.</abstract>
                <slug>2023-76010-bof-room-103-pyarrow-in-pandas-and-dask</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='76832'>Matt Harrison</person><person id='77279'>James Bourbeau</person><person id='77271'>Patrick Hoefler</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/SZP3LA/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/SZP3LA/feedback/</feedback_url>
            </event>
            <event guid='8aa02c52-8c53-59c3-b9e6-c5956642b956' id='76270' code='GEDFS7'>
                <room>Classroom 103</room>
                <title>[BoF Room 103] Python Visualization and App Tools</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T18:30:00-05:00</date>
                <start>18:30</start>
                <duration>00:55</duration>
                <abstract>Each new SciPy brings even more tools for data visualization and for building data-rich scientific applications and dashboards. This BoF brings together maintainers of Python tools for data visualization and building apps to help make sense of this complex landscape for users and to highlight new developments, trends, and opportunities. Join us and stay ahead of the curve!</abstract>
                <slug>2023-76270-bof-room-103-python-visualization-and-app-tools</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77090'>Sophia Yang</person><person id='77137'>Kushal Kolar</person><person id='77241'>Bane Sullivan</person><person id='77272'>Juan Nunez-Iglesias</person><person id='77284'>Elliott Sales de Andrade</person><person id='77319'>Jon Mease</person><person id='77323'>Nathan Jessurun</person><person id='77329'>Hadley Wickham</person><person id='77092'>James A. Bednar</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/GEDFS7/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/GEDFS7/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 104' guid='9a119320-e589-5728-ba57-c80aa608ad57'>
            <event guid='e5cbed7d-d2c7-50c0-987b-637f0b25588a' id='76257' code='A9EGX9'>
                <room>Classroom 104</room>
                <title>[BoF Room 104] Where on Earth is my Pixel?</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T13:15:00-05:00</date>
                <start>13:15</start>
                <duration>00:55</duration>
                <abstract>Imaging communities across different fields (microscopy, remote sensing, medical imaging, materials science) are currently all moving to develop cloud- and chunking-friendly imaging formats based around Zarr. This includes OME-NGFF and GeoZarr. Although pretty much everyone has agreed on Zarr as the container for the image data, there is ongoing discussion about how best to store metadata about the images. In this BoF we&apos;ll discuss ways to encode *where* each pixel in the image is located in space (and time! and frequency!), and whether it&apos;s possible to harmonize this encoding across the different formats and standards. A relevant issue is https://github.com/ome/ngff/issues/174.</abstract>
                <slug>2023-76257-bof-room-104-where-on-earth-is-my-pixel</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77272'>Juan Nunez-Iglesias</person><person id='77280'>Josh Moore</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/A9EGX9/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/A9EGX9/feedback/</feedback_url>
            </event>
            <event guid='9e596d18-2e6b-5fc4-b1bd-3983184b6121' id='76273' code='7DDDWU'>
                <room>Classroom 104</room>
                <title>[BoF Room 104] Funding Open Source Software</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T18:30:00-05:00</date>
                <start>18:30</start>
                <duration>00:55</duration>
                <abstract>Scientific open source software has often advanced by volunteer efforts with little financial support. In recent years, there has been an increase in different groups funding open source software. How has this changed the open source community? Where would future funding have the largest impact in the open source landscape? What new thing would you build that would make the lives of developers, researchers, and users easier? How much support is needed and what are the best ways to provide that support? What large scale project doesn&#8217;t exist that *needs* to exist? How do you balance funded and volunteer efforts? Join this lively discussion to help identify key focus areas for open source funding and resources.</abstract>
                <slug>2023-76273-bof-room-104-funding-open-source-software</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77273'>Demitri Muna</person><person id='77286'>Paige Martin</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/7DDDWU/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/7DDDWU/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 105' guid='5c6f7634-c90c-5400-be39-1771ee7e728f'>
            <event guid='2acd839d-de43-58af-af5c-e060409c880f' id='76262' code='3HXLZV'>
                <room>Classroom 105</room>
                <title>[BoF Room 105] Scientific Python Ecosystem Coordination</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T13:15:00-05:00</date>
                <start>13:15</start>
                <duration>00:55</duration>
                <abstract>Scientific Python Ecosystem Coordination (SPEC) documents (https://scientific-python.org/specs/) provide operational guidelines for projects in the scientific Python ecosystem. SPECs are similar to project-specific guidelines (like PEPs, NEPs, SLEPs, and SKIPs), but are opt-in, have a broader scope, and target all (or most) projects in the scientific Python ecosystem. Come hear more about what we are working on and planning. Better yet, come share your ideas for improving the ecosystem!</abstract>
                <slug>2023-76262-bof-room-105-scientific-python-ecosystem-coordination</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77281'>Jarrod Millman</person><person id='77282'>St&#233;fan van der Walt</person><person id='77144'>Juanita Gomez</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/3HXLZV/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/3HXLZV/feedback/</feedback_url>
            </event>
            <event guid='bb5425b7-71a1-55a3-ba7d-38a243427ea2' id='76297' code='H3GNAT'>
                <room>Classroom 105</room>
                <title>[BoF Room 105] Scientific Python Packaging Summit</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-13T18:30:00-05:00</date>
                <start>18:30</start>
                <duration>00:55</duration>
                <abstract>Python packaging is a rapidly changing landscape, plagued by many hurdles and challenges for users. The scientific Python community faces some of the greatest difficulties here, given its heavy reliance on external binaries and compiled code, the diversity of packaging ecosystems (PyPI, Conda, others), and the fact that many if not most of its users are not professional software engineers, unlike in other ecosystems. This is made all the more critical by the importance of reproducible research and its sensitivity to even small dependency changes.

We&apos;d like to build on the recent momentum behind evolving the packaging landscape to better serve these needs and to build bridges between key players in the core Python and scientific spaces, with an intense, engaging, and open discussion. This will bring together key community stakeholders and everyday package authors to sync up on best practices, strengthen collaboration, and help reach consensus that would take months or even years without in-person discussion, as well as provide a jumping-off point for follow-up conversations and future action items.</abstract>
                <slug>2023-76297-bof-room-105-scientific-python-packaging-summit</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77274'>C.A.M. Gerlach</person><person id='77283'>Henry Schreiner III</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/H3GNAT/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/H3GNAT/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='5' date='2023-07-14' start='2023-07-14T04:00:00-05:00' end='2023-07-15T03:59:00-05:00'>
        <room name='Amphitheater 204' guid='a7020de2-2717-51a7-bcd4-7e9831c5ab8f'>
            <event guid='896106a6-ae14-5d2e-b88f-4f26cccd7be8' id='76385' code='DQR9NU'>
                <room>Amphitheater 204</room>
                <title>New CUDA Toolkit packages for Conda</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T10:45:00-05:00</date>
                <start>10:45</start>
                <duration>00:30</duration>
                <abstract>In this talk, we will examine the new CUDA package layout for Conda (as included in conda-forge). We will show how CUDA components have been broken out, share how this affects development and package building, walk through changes made to the conda-forge infrastructure to incorporate these new packages, and examine recipes using the new packages and what was needed to update them. Additionally, we will provide guidance on how to use these new packages in recipes or in library development.</abstract>
                <slug>2023-76385-new-cuda-toolkit-packages-for-conda</slug>
                <track>General Track</track>
                
                <persons>
                    <person id='77308'>Rick Ratzel</person><person id='77120'>John Kirkham</person><person id='77291'>Thomson Comer</person>
                </persons>
                <language>en</language>
                <description>Based on feedback from package maintainers and end users, we&#8217;ve extended and restructured the CUDA Toolkit packages in conda-forge. We&#8217;ve added new packages for CUDA components that were requested, and we&#8217;ve split out the CUDA Toolkit packages more finely by component to provide package maintainers and end users a lightweight, precise method for including and stating CUDA dependencies.

In addition to the CUDA redistributable libraries already available, we have included compilers, debuggers, profilers, etc., providing users of the conda-forge channel a full development suite that they can use in their own projects. These additions also greatly simplify the build infrastructure in conda-forge. Finally, more libraries are included, which will allow package maintainers to enable additional features in recipe builds.

Similarly, packages have become more granular: each component of the CUDA Toolkit is separated out, and components are further split into packages used at build time and at run time. Package maintainers can now select which components they depend on for a build and depend only on the needed shared library at runtime. In terms of the package ecosystem, this makes CUDA component usage legible in downstream recipes and packages, which can make updates more targeted and easier to manage. For end users, all of this means quicker downloads, more compact installs, and a smoother upgrade path.
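
The build-time/run-time split described above might look roughly like this in a recipe (an illustrative meta.yaml sketch, not from the talk; exact package names should be checked against conda-forge):

```yaml
# Hypothetical meta.yaml fragment: depending on individual CUDA
# components rather than the whole toolkit
requirements:
  build:
    - {{ compiler('cuda') }}
  host:
    - cuda-cudart-dev    # headers and link libraries, needed at build time
  run:
    - cuda-cudart        # only the shared runtime library
```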

To aid package maintainers and users in leveraging this new functionality, we will share the overall package structure and how it is integrated into conda-forge. We will also share examples from recipes showing how these CUDA packages can be used, and we will show how these packages can be integrated into development workflows.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/DQR9NU/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/DQR9NU/feedback/</feedback_url>
            </event>
            <event guid='58f97ea8-22da-58fd-85b2-db0eb627b43e' id='76386' code='T7DTX8'>
                <room>Amphitheater 204</room>
                <title>Python Array API Standard: Toward Array Interoperability in the Scientific Python Ecosystem</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T11:25:00-05:00</date>
                <start>11:25</start>
                <duration>00:30</duration>
                <abstract>The array API standard (https://data-apis.org/array-api/) is a common specification for Python array libraries, such as NumPy, PyTorch, CuPy, Dask, and JAX. 

This standard will make it straightforward for array-consuming libraries, like scikit-learn and SciPy, to write code that uniformly supports all of these libraries. This will allow, for instance, running the same code on the CPU and GPU.
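
The pattern can be sketched as follows (a minimal illustration, not code from the talk; `standardize` is a hypothetical helper, and in practice the namespace `xp` would typically be obtained with a utility such as array-api-compat's `array_namespace`):

```python
import numpy as np

def standardize(x, xp):
    """Scale x to zero mean and unit variance using only
    functions defined by the array API standard, so the same
    code runs whether xp is numpy, cupy, or torch."""
    return (x - xp.mean(x)) / xp.std(x)

# CPU run with NumPy; passing a CuPy array with xp=cupy would
# run the same code on the GPU.
z = standardize(np.arange(5.0), np)
```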

This talk will cover the scope of the array API standard, supporting tooling which includes a library-independent test suite and compatibility layer, what work has been completed so far, and the plans going forward.</abstract>
                <slug>2023-76386-python-array-api-standard-toward-array-interoperability-in-the-scientific-python-ecosystem</slug>
                <track>General Track</track>
                <logo>/media/2023/submissions/T7DTX8/Slides.001_BaSD3Sp_wJKTDmg.png</logo>
                <persons>
                    <person id='77223'>Ralf Gommers</person><person id='76973'>Stephannie Jimenez Gacha</person><person id='77218'>Leo Fang</person><person id='77234'>Saul Shanabrook</person><person id='77247'>Travis Oliphant</person><person id='77220'>Matthew Barber</person><person id='76839'>Aaron Meurer</person><person id='76854'>Thomas J. Fan</person><person id='77120'>John Kirkham</person><person id='77217'>Stephan Hoyer</person><person id='77216'>Tyler Reddy</person><person id='77232'>Andreas Mueller</person><person id='77225'>Athan Reines</person><person id='77226'>Mario</person><person id='77239'>Alexandre Passos</person>
                </persons>
                <language>en</language>
                <description>This talk will have the following outline:

* A motivating example, adding array API standard usage to a real-world scientific data analysis script so it runs with CuPy and PyTorch in addition to NumPy.
* History of the Data APIs Consortium and array API specification.
* The scope and general design principles of the specification.
* Current status of implementations:
    * Two versions of the standard have been released, 2021.12 and 2022.12.
    * The standard includes all important core array functionality and extensions for linear algebra and Fast Fourier Transforms.
    * NumPy and CuPy have complete reference implementations in submodules (numpy.array_api). 
    * NumPy, CuPy, and PyTorch have near-full compliance and have plans to approach full compliance.
    * array-api-compat is a wrapper library designed to be vendored by consuming libraries like scikit-learn that makes NumPy, CuPy, and PyTorch use a uniform API.
    * The array-api-tests package is a rigorous and complete test suite for testing against the array API and can be used to determine where an array API library follows the specification and where it doesn&#8217;t.
* Future work
    * Add full compliance to NumPy, as part of NumPy 2.0.
    * Focus on improving adoption by consuming libraries, such as SciPy and scikit-learn.
    * Reporting website that lists array API compliance by library. 
    * Work is being done to create a similar standard for dataframe libraries. This work has already produced a common dataframe interchange API.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/T7DTX8/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/T7DTX8/feedback/</feedback_url>
            </event>
            <event guid='dc70ab00-70d1-5a00-8adf-911251f78b3a' id='76331' code='A7EZZV'>
                <room>Amphitheater 204</room>
                <title>What happens when the main maintainer of a project takes a step down?</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T13:15:00-05:00</date>
                <start>13:15</start>
                <duration>00:30</duration>
                <abstract>Once the maintainer of a project decides to step down, the community needs to adapt quickly to this decision. This situation can be devastating for small projects and lead to their extinction. This talk demonstrates, based on the case of poliastro, that the community is a key factor in a project&#8217;s survival, no matter who is leading it.</abstract>
                <slug>2023-76331-what-happens-when-the-main-maintainer-of-a-project-takes-a-step-down</slug>
                <track>Tending Your Open Source Garden: Maintenance and Community</track>
                
                <persons>
                    <person id='77035'>Jorge Mart&#237;nez</person>
                </persons>
                <language>en</language>
                <description>Free and open source software is made by the community, for the community, of the community. The community is made up of amazing people, who are human beings. The Python community would not be what it is without people.

Some of these people are maintainers of projects. They devote a significant amount of their time to guaranteeing the health of a project, reviewing new contributions, answering questions... The community recognizes their effort and usually evolves around them.

However, what happens when a maintainer steps down from a project? How does the community react to this situation? What about tiny projects?

This talk presents some key concepts for building a healthy community around a project to guarantee its survival over time. These key concepts include not only good coding practices like documentation, but also the creation of community meetings open to everyone, promotion of the software, financial support, and tons of passion, among others.

As an example, the case of &quot;poliastro&quot; is used.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/A7EZZV/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/A7EZZV/feedback/</feedback_url>
            </event>
            <event guid='96051632-4cdf-5fbf-82db-3a07319f3238' id='75976' code='EDZ9YB'>
                <room>Amphitheater 204</room>
                <title>Better (Open Source) Homes and Gardens with Project Pythia</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T13:55:00-05:00</date>
                <start>13:55</start>
                <duration>00:30</duration>
                <abstract>As scientists continue to embrace the Jupyter ecosystem for constructing computational narratives of their science through code, data, and rich text, they may encounter technical and community barriers to maintaining and sharing their science with new and existing audiences. We demonstrate the value of open-source science community building, achieved through reliance on the open-source Jupyter ecosystem, pre-packaged GitHub- and BinderHub-based infrastructure, and documentation for creating, sharing, testing, and maintaining Pythia Cookbooks of computational narratives.</abstract>
                <slug>2023-75976-better-open-source-homes-and-gardens-with-project-pythia</slug>
                <track>Tending Your Open Source Garden: Maintenance and Community</track>
                
                <persons>
                    <person id='77047'>Kevin Tyle</person><person id='77095'>Drew Camron</person>
                </persons>
                <language>en</language>
                <description>A &#8220;community garden&#8221; metaphor is particularly apt for a free- and open-source software project and community. Enthusiasm, creativity, and openness work both for the SciPy conference and for Albany, NY&#8217;s Tulip Festival. But a &#8220;garden&#8221;, be it botanical or cyber, requires nurturing. In free- and open-source software, there are bounteous examples of such nurturing: pull requests (PRs) are sown and merged; issues are resolved, and bugs are removed. Yet we also see signs of formerly fruitful repositories that have been left to languish. Issues proliferate like weeds, bugs roam freely, and eventually the repos&#8217; stars fade away. It is incumbent on the SciPy community to ensure that the projects we are invested in take the more fruitful path.

One such open source &#8220;greenspace&#8221; is Project Pythia (hereafter Pythia). Now in its third year, Pythia extends Pangeo by providing an educational and training hub for the geoscientific Python community. It has three key components:
1. Foundations: The core geoscientific Python stack (JupyterBook)
2. Cookbooks: Advanced and domain-specific workflows (JupyterBooks)
3. Resource Gallery of externally-hosted geoscientific Python resources

Here we discuss Pythia&#8217;s infrastructure, which sustains the above components in a year-round &#8220;community garden&#8221;.

Pythia&#8217;s content is built upon an open stack of infrastructure for reproducibility and collaboration that provides for the care and nurturing of the community it serves. We have built a cloud-based publishing system upon Jupyter Book that automates notebook execution in a reproducible, curated environment. Users can interact with notebooks via Binder links, launching directly into an identical environment. The platform provides automated code- and link-checking, ensuring a rapid healing cycle. Collaboration is achieved through PRs that trigger the same execution infrastructure and a rich preview.

Our infrastructure relies on GitHub, which encourages open development via PRs. Pythia uses this process extensively for building and maintaining its &#8220;garden&#8221;, for the core team and community contributions. GitHub&#8217;s focus on collaboration provides users a sense of ownership of whatever &#8220;garden&#8221; they choose to visit, and provides a path for others to visit and contribute.

GitHub Actions powers Pythia&#8217;s automation of key steps in the notebook execution and publishing process. We periodically re-run the publication workflow as a health check for ongoing maintenance of the materials, as well as for new &#8220;plantings&#8221; via PRs. Pythia&#8217;s web portal displays the updated content, which users can download to try out and build on in their own &#8220;backyard gardens&#8221;: their computing environments.

A garden may need more powerful tools. While GitHub Actions may often suffice, real-world scientific workflows have compute and data requirements that exceed GitHub&#8217;s free resources. Pythia&#8217;s notebooks can also be executed on our dedicated cloud using BinderHub, which provides a way to execute notebooks within custom environments. Pythia&#8217;s workflows are able to validate and deploy results directly from execution on its BinderHub. The same BinderHub instance powers interactive user sessions, guaranteeing that users execute code in the same environment in which the rendered web pages were built.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/EDZ9YB/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/EDZ9YB/feedback/</feedback_url>
            </event>
            <event guid='a1e58ff3-6529-54b2-8f34-154876805014' id='76261' code='9JTLCF'>
                <room>Amphitheater 204</room>
                <title>Community-first open source: An action plan!</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T14:35:00-05:00</date>
                <start>14:35</start>
                <duration>00:30</duration>
                <abstract>Communities are at the heart of open source software and are fundamental to our projects&#8217; long-term success. The Python ecosystem has several mature projects, that have spent years working on community initiatives. Newer projects can learn from their experiences and build stronger foundations to foster healthy communities.

In this talk, we share a set of practices for community-first projects, including repository management, contributor pathways, and governance principles. We&#8217;ll also share real examples from our own journey transitioning a company-backed OSS project, Nebari (https://nebari.dev/), to be more community-oriented.</abstract>
                <slug>2023-76261-community-first-open-source-an-action-plan</slug>
                <track>Tending Your Open Source Garden: Maintenance and Community</track>
                <logo>/media/2023/submissions/9JTLCF/community-talk-banner_Wf0ag8L_Oxk2RrR.png</logo>
                <persons>
                    <person id='76972'>Pavithra Eswaramoorthy</person><person id='76871'>Dharhas Pothina</person>
                </persons>
                <language>en</language>
                <description>Open source communities come in a lot of different flavors and have many different ways of operating. However, there is a common thread of promoting kindness in communication, improving the contributor and user experience, and working to make the project more inclusive, accessible, and sustainable.

We, the presenters, recently worked to transition a company-backed open source project, Nebari (https://nebari.dev/), to be more community-oriented in its development, maintenance, and governance. We focused on creating a community-first foundation that builds on years of learnings from other leading communities, including Jupyter, NumPy, Gatsby JS, and more. In this talk, we want to share our journey and the things we learned along the way.

We aim to provide a step-by-step guide for open source projects looking to adopt more community-driven practices. We will discuss everything from repository management and contributor and maintainer pathways to documentation and governance principles. This talk will be most helpful for projects in their formative stages and projects transitioning from company-backed models; however, we feel everyone can learn something new to implement in their communities.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/9JTLCF/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/9JTLCF/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Grand Salon C' guid='528485f3-5505-5c92-85b8-0f828e66a79d'>
            <event guid='0b08b22d-b7f9-5b77-8393-9ef1be3c1324' id='76251' code='AXPZZG'>
                <room>Grand Salon C</room>
                <title>Small Town Police Accountability: A Data Science Toolkit</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T10:45:00-05:00</date>
                <start>10:45</start>
                <duration>00:30</duration>
                <abstract>In this talk we will share a Python library to obtain and analyze policing data, that was developed in conjunction with community activists, data scientists, social scientists and the Small Town Police Accountability (SToPA) Research Lab.  We will showcase components of the SToPA library which use Python tools such as web drivers, optical character recognition, geospatial mapping, machine learning and statistical sampling to better understand the policing landscape.  The goal of this work is to present an easily replicable framework for analyzing police and community interactions with accessible on-ramps for activists, developers and researchers.</abstract>
                <slug>2023-76251-small-town-police-accountability-a-data-science-toolkit</slug>
                <track>Social Science and the Digital Humanities</track>
                
                <persons>
                    <person id='77030'>Anna Haensch</person><person id='77025'>Ariana Mendible</person>
                </persons>
                <language>en</language>
                <description>Recent years have highlighted the urgent need for transparency and accountability within police departments across the United States. Typically, large cities have access to policing data and the resources to analyze and interrogate such data to hold authority accountable. Small towns face the same injustices at the hands of police, but these issues receive comparatively little attention, in part due to a lack of resources and tools to investigate the data. Additional challenges arise in the clarity and consistency of the data that may be available. Consequently, the public are generally unable to take data-informed action toward social justice in these regions. The overarching goal of the The Small Town Police Accountability (SToPA) Research Lab is to create an adaptable tool that enables small-town residents to analyze police actions to increase transparency and accountability. 

This talk will introduce the interdisciplinary work of the research group to (1) obtain data through digital portals and records requests, (2) create a flexible, scaffolded software toolkit for organizing and analyzing police data for users with various levels of technical expertise and (3) use data-driven modeling tools to uncover potential patterns and anomalies in select small town data, serving as a template for investigations elsewhere. The SToPA toolkit consists of a range of components including instructions for data gathering; adaptable tools for reading, cleaning, and organizing data; and machine learning applications to analyze and understand patterns in policing. 

Using case studies of a handful of small towns, the SToPA toolkit provides a broadly applicable methodology for reading and parsing police data.  Where data is available online in a somewhat structured format, the SToPA library offers tools for web crawling and scraping. In other cases, data is only available as a printed physical copy, necessitating digitization, text identification using tools such as PyTesseract, word-level data cleaning, and testing for accuracy.  This pipeline includes the use of user-defined, non-standard language dictionaries (such as a list of town-specific locations), geometric methods for word location detection, regular expressions, and fuzzy string matching.     

After data is collected, cleaned, and structured, a second thrust of the SToPA lab is to analyze police interactions with machine learning and statistical tools. A diverse set of policing data, including dates, locations, names, and free text narratives, yields rich opportunity for exploratory analysis and modeling. Explorable maps were created with various mapping and plotting libraries, revealing location-based patterns. Town-specific data from the US Census allows for demographic comparisons between how citizens are distributed vs. how they are policed.  This analysis is further refined using statistical sampling and inference tools such as scikit-learn and PyEI. Narrative text data, unstructured language across thousands of reports, was also analyzed with natural language processing techniques such as topic modeling.

 This talk aims to be accessible to a diverse audience and to empower and inspire others to contribute to the growing SToPA repository:  https://qsideinstitute.github.io/SToPA/</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/AXPZZG/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/AXPZZG/feedback/</feedback_url>
            </event>
            <event guid='814db0e8-b00b-5350-84d4-19c5cefb1b4b' id='76045' code='LMMPRP'>
                <room>Grand Salon C</room>
                <title>Using Linear Tracking Data to Estimate Backcountry Recreation Popularity</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T11:25:00-05:00</date>
                <start>11:25</start>
                <duration>00:30</duration>
                <abstract>Geolocated data from smartphone apps are well-established resources for research. While most of that data come as points (e.g., geotagged photos), there are a growing number of apps that collect linear data from users activities (e.g., running, hiking, off-road driving). Using established ecological methods, shallow-machine learning packages, and multiprocessing we demonstrate a novel approach using mobile app data to estimate back-country recreation popularity at multiple scales. The topics covered include normalizing and thinning coordinate data, merging linear data from multiple sources, and accounting for spatial bias while preserving the integrity of the original data.</abstract>
                <slug>2023-76045-using-linear-tracking-data-to-estimate-backcountry-recreation-popularity</slug>
                <track>Social Science and the Digital Humanities</track>
                
                <persons>
                    <person id='77081'>Vincent Sutherland</person><person id='77086'>David C. Folch</person>
                </persons>
                <language>en</language>
                <description>Official sources (typically governments) provide the cleanest and most trustworthy data. Decades of established standards and years of archived records provide a framework for reliable data collection making it a strong foundations for research. The drawback to official centralized sources is that they often focus on the macro level, and because of this, the data tends to be lower resolution, leaving broad areas of obscurity at a micro level.

Needing to geospatially estimate and represent backcountry recreation habits at the statewide level down to a square-mile grid, our team required high-resolution datasets.

Social media data is high-resolution, dense and valuable, which often leads companies to limit access to their data. App downloads and active user counts fluctuate with the market and long-term utility of an app&apos;s data is not guaranteed to last. Despite these limitations, social media offers significantly higher resolution data than official sources. We will discuss the methods we developed to overcome the unique challenges of processing and standardizing social media data so it complements and informs official datasets.

We will cover acquiring data from multiple apps using modular methods that can be applied to new apps as older ones become obsolete. It is rare for apps to offer identical metrics, so we developed a flexible approach that can translate different metrics into a standardized form. It is also important that a model addresses the inherent unknowns that lie beyond the app&apos;s userbase.

Our specific use case uses linear geolocation data gathered from mobile tracking apps. Data comes in the form of GeoJSON coordinates, Google Earth polylines, and shapefiles. We will discuss the specific packages used to read each and store them in a common format. 

Linear data brings with it unique challenges relative to point and polygon data. We will describe how we used Python to redistribute points along line segments while maintaining a minimum distance between segment vertices as a first step towards standardizing the linear data. We then needed to &quot;thin&quot; the data to minimize spatial bias caused by overlapping line segments and circuitous routes; this will include our reasoning for not averaging or interpolating new data points between each dataset, and how instead we used a method that preserved the integrity of original geolocation data.

We will explain the ways multiprocessing and nonlinear data structures were used to process large numbers of vertices when we aggregated all datasets together and ran the thinning algorithm on the combined points; the goal of which was to create an overall presence dataset that represents recreation across the state with minimal spatial bias.

We will also review the resulting data structure: coordinate pairs, aggregated metrics and IDs that point back to rows from the original datasets gathered from each app.

The presentation will finish with a summary and conclusion on how the resulting presence data was processed using the MaxEnt ecological model to inform and supplement official data sources and provide the state of Arizona with a clearer picture of recreation.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/LMMPRP/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/LMMPRP/feedback/</feedback_url>
            </event>
            <event guid='2c168858-1f47-5887-8ca2-910dfa0b4420' id='76137' code='BDV3EE'>
                <room>Grand Salon C</room>
                <title>Allegro and FLARE: Fast and accurate machine learning potentials for extreme-scale simulations</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T13:15:00-05:00</date>
                <start>13:15</start>
                <duration>00:30</duration>
                <abstract>Allegro and FLARE are two very different packages for constructing machine learning potentials that are fast, accurate, and suitable for extreme-scale molecular dynamics simulations. Allegro uses PyTorch for efficient equivariant potentials with state-of-the-art accuracy, while FLARE is a sparse Gaussian process potential with an optimized C++ training backend leveraging Kokkos, OpenMP, and MPI for state-of-the-art performance, and a user-friendly Python frontend. We will compare and contrast the two methods, discuss lessons learned, and show spectacular scientific applications.</abstract>
                <slug>2023-76137-allegro-and-flare-fast-and-accurate-machine-learning-potentials-for-extreme-scale-simulations</slug>
                <track>Materials and Chemistry</track>
                <logo>/media/2023/submissions/BDV3EE/pthviz_p1gFqaY_IugSStq.png</logo>
                <persons>
                    <person id='77042'>Anders  Johansson</person>
                </persons>
                <language>en</language>
                <description>Molecular dynamics is a common method for studying molecules and materials at the atomistic level, in which the dynamics of atoms are simulated directly using Newton&#8217;s equations of motion. This requires a model for the forces between the atoms, often referred to as a potential. Traditionally, there have been two approaches to computing the interatomic forces. First, there are empirical potentials, which are based on simple, physically motivated functional forms with a few parameters that are fit to match experimental measurements of material properties. These models are fast, but they have limited accuracy and are hard to transfer between applications. The alternative is quantum mechanical methods, which are highly accurate. In return, they are computationally expensive and have limited scalability.

In recent years, machine learning potentials (MLPs) have emerged as a compromise in terms of accuracy and computational efficiency. The idea is to generate a small amount of training data with a quantum mechanical method. The MLP learns to reproduce its forces and energies and can be used for large and long-timescale molecular dynamics simulations with an accuracy approaching that of the quantum mechanical method.

Allegro and FLARE are two drastically different MLPs. FLARE approximates the energy of an atom as a sparse Gaussian process (SGP) as a function of the atom&#8217;s local environment. The environment is encoded in a rotationally invariant vector with high descriptive power. By using an invariant descriptor, FLARE correctly respects the symmetry of the problem. Allegro, on the other hand, exploits the symmetry of the problem by using an equivariant neural network, i.e., a neural network where tensor product layers force the features to systematically transform with the input. While more computationally demanding, the added symmetry information allows Allegro and other equivariant models to be significantly more accurate and data-efficient than traditional models.

For extreme-scale simulations, scalability and performance are of utmost importance. By avoiding message passing in its model design, Allegro is the only scalable equivariant neural network potential, with excellent performance demonstrated up to 100 million atoms. FLARE, being a simpler model, takes this to the extreme and has achieved record scalability and performance, simulating 0.5 trillion atoms on 27,336 NVIDIA V100 GPUs.

On the implementation side, Allegro and FLARE are also very different. Allegro is implemented in Python with PyTorch, which allows for a high-level implementation with excellent GPU performance through the JIT compiler. FLARE has a low-level training backend written in C++ with OpenMP, MPI, and Kokkos. The C++ code is conveniently wrapped for Python use with pybind11.

In this talk, we will compare and contrast these two methods, discuss lessons learned, and show spectacular scientific applications.

Links:
Allegro repository: https://github.com/mir-group/allegro
Allegro paper: https://www.nature.com/articles/s41467-023-36329-y
FLARE repository: https://github.com/mir-group/flare
FLARE LAMMPS active learning tutorial: https://bit.ly/flarelmpotf
Preprint on FLARE scalability: https://arxiv.org/abs/2204.12573</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/BDV3EE/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/BDV3EE/feedback/</feedback_url>
            </event>
            <event guid='3855fd8c-898e-53db-8c81-f12b39c7f854' id='75971' code='EYHNUV'>
                <room>Grand Salon C</room>
                <title>A Graph-Neural Network-Based model for rapid prediction of Thermal Transport in Metal-Organic Frameworks</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T13:55:00-05:00</date>
                <start>13:55</start>
                <duration>00:30</duration>
                <abstract>Metal-Organic Frameworks (MOFs) have vast potential for gas adsorption, but their practical use hinges on their ability to dissipate thermal energy generated during adsorption. Here, we performed the first high-throughput screening of thermal conductivity in over 10,000 MOFs using molecular dynamics simulations. Next, we developed a graph neural network (GNN) based model to swiftly predict the diagonal components of the thermal conductivity tensor for accelerated materials discovery. Attendees will gain insights into how GNNs can be trained to predict material tensor properties, benefiting both the materials science and machine learning communities.</abstract>
                <slug>2023-75971-a-graph-neural-network-based-model-for-rapid-prediction-of-thermal-transport-in-metal-organic-frameworks</slug>
                <track>Materials and Chemistry</track>
                
                <persons>
                    <person id='77067'>Meiirbek Islamov</person>
                </persons>
                <language>en</language>
                <description>Metal-organic frameworks (MOFs) are a promising class of porous materials that have potential applications in various areas, including gas storage and separations. However, effective thermal energy management in MOFs is critical to enhancing their performance in these applications. Unfortunately, there is still a lack of understanding regarding the structure-property relationships that govern thermal transport in MOFs.

In order to provide a data-driven perspective on these relationships, a large-scale computational screening study was conducted to investigate the thermal conductivity of MOFs. This study utilized classical molecular dynamics simulations to calculate the thermal conductivities of 10,194 hypothetical MOFs generated using the Topology-Based Crystal Constructor (ToBaCCo) code developed in Python. These MOFs comprised 1,015 different topologies, along with 40 types of organic edge building blocks and 38 inorganic and organic nodular building blocks.

The study discovered that high thermal conductivity in MOFs is favored by high densities, small pores (&lt;10 &#197;), and four-connected metal nodes. Moreover, it identified 36 MOFs with ultra-low thermal conductivity (&lt;0.02 W/mK) primarily due to their extremely large pores (~65 &#197;). Additionally, the study uncovered six hypothetical MOFs with exceptionally high thermal conductivity (&gt;10 W/mK).

To handle the large number of MOFs screened, an algorithm was developed to adaptively determine the appropriate plateaued interval of the thermal conductivity vs. correlation time curve based on a set of criteria. The search strategy for finding the optimal plateaued interval involved iteratively performing linear fits to data segments 2 ps in length at 1 ps increments for data between 0 and 10 ps, and to segments 10 ps in length at 5 ps increments for data beyond 10 ps. The normalized slopes and normalized average oscillation amplitudes were then calculated with respect to the average thermal conductivity for each of those data segments.

Using the 10,194 MOF-thermal conductivity data, a range of state-of-the-art graph neural network-based models, including CGCNN, iCGCNN, MEGNet, DimeNet++, ALIGNN, and others, were trained for the rapid prediction of thermal conductivity in MOFs. Finally, the model that demonstrated the best performance on the test data was applied to screen the Computation-Ready, Experimental (CoRE) MOF database, resulting in the identification of experimentally viable MOF structures with potentially exceptional thermal transport properties.

This talk will discuss the ToBaCCo hypothetical MOF crystal generation algorithm, various state-of-the-art GNN architectures, and their implementation in PyTorch. This presentation will be of interest to the wider material science community, particularly those with a passion for deep learning models. The findings of this study have the potential to enhance our understanding of thermal transport in MOFs, paving the way for the development of more efficient MOFs for gas storage and separation applications.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/EYHNUV/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/EYHNUV/feedback/</feedback_url>
            </event>
            <event guid='a34b1dfa-9025-5c91-80a3-262a43411424' id='76046' code='F9P3F3'>
                <room>Grand Salon C</room>
                <title>From Espaloma to SAKE: To brew, distill, and mix force fields with balanced briskness, smoothness, and intricacy.</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T14:35:00-05:00</date>
                <start>14:35</start>
                <duration>00:30</duration>
                <abstract>Force fields (FF)&#8212;the (parametrized) mapping from geometry to energy, are a crucial component of molecular dynamics (MD) simulations, whose associated Boltzmann-like target probability densities are sampled to estimate ensemble observables, to harvest quantitative insights of the system. State-of-the-art force fields are either fast (molecular mechanics, MM-based) or accurate (quantum mechanics, QM-based), but seldom both. Here, leveraging graph-based machine learning and incorporating inductive biases crucial to chemical modeling, we approach the balance between accuracy and speed from two angles---to make MM more accurate and to make machine learning force fields faster.</abstract>
                <slug>2023-76046-from-espaloma-to-sake-to-brew-distill-and-mix-force-fields-with-balanced-briskness-smoothness-and-intricacy</slug>
                <track>Materials and Chemistry</track>
                
                <persons>
                    <person id='76896'>Yuanqing Wang</person>
                </persons>
                <language>en</language>
                <description>A force field as accurate as quantum mechanics (QM) and as fast as molecular mechanics (MM), with which one can simulate a biomolecular system efficiently enough and meaningfully enough to get quantitative insights, is among the most ardent dreams of biophysicists. Machine learning force forces have been designed to bring us one step closer to this dream, by fitting simpler functional forms to QM data and extrapolating to chemically and geometrically diverse regions. Nonetheless, current state-of-the-art architectures, though approaching or surpassing the quantum chemical accuracy, are by magnitudes slower than MM and manifest various pathologies when it comes to interpretability, generalizability, and stability.

In this talk, we introduce our efforts to approach the lotusland from two angles: by making MM force fields more accurate (using a GNN to replace the atom typing schemes, Espaloma) and making state-of-the-art machine learning force fields faster (maintaining local universal approximative power without employing spherical harmonics, SAKE). Along the way, we show a plethora of useful gadgets, including the first unified force field for joint protein--ligand parametrization, an AM1-BCC surrogate charge model thousands-fold faster with error smaller than discrepancies among backends, and a way to forecast the fate of dynamic systems before the simulation even starts.

With these, we identify the opportunities and challenges of machine learning force field design: What interpretable, stable, simple yet expressive functional forms should we use? How do we bake domain knowledge in, e.g., that forces vanish when particles are far apart and explode when they are close? Can we detach sophisticated neural networks during inference? Can force fields be uncertainty-aware? And finally, how do we stir these ingredients well to achieve the delicious balance between stability, speed, and accuracy?</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/F9P3F3/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/F9P3F3/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Zlotnik Ballroom' guid='28373152-6164-559b-b8fd-5a1300de94ce'>
            <event guid='72a43cf9-619f-563c-ae68-4c159f7be744' id='76067' code='HPFVLT'>
                <room>Zlotnik Ballroom</room>
                <title>Keynote - Responsible AI in Practice: How far we&apos;ve come and where we&apos;re going</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T09:15:00-05:00</date>
                <start>09:15</start>
                <duration>00:45</duration>
                <abstract>Dr. Rumman Chowdhury is a trailblazer in the field of applied algorithmic ethics, creating cutting-edge socio-technical solutions for ethical, explainable and transparent AI. She currently runs Parity Consulting, Parity Responsible Innovation Fund, and is a Responsible AI Fellow at the Berkman Klein Center for Internet &amp; Society at Harvard University. She is also a Research Affiliate at the Minderoo Center for Democracy and Technology at Cambridge University and a visiting researcher at the NYU Tandon School of Engineering. Previously, she was the director of the ML Ethics, Transparency, and Accountability team at Twitter identifying and mitigating algorithmic harms on the platform. Before that she was CEO and founder of Parity, an enterprise algorithmic audit platform company. She formerly served as Global Lead for Responsible AI at Accenture Applied Intelligence. In her work as Accenture&#8217;s Responsible AI lead, she led the design of the Fairness Tool, a first-in-industry algorithmic tool to identify and mitigate bias in AI systems. Dr. Chowdhury has been featured in international media, including the Wall Street Journal, Financial Times, Harvard Business Review, NPR, MIT Sloan Magazine among others. She was named one of BBC&#8217;s 100 Women, recognized as one of the Bay Area&#8217;s top 40 under 40, and honored to be inducted to the British Royal Society of the Arts (RSA).</abstract>
                <slug>2023-76067-keynote-responsible-ai-in-practice-how-far-we-ve-come-and-where-we-re-going</slug>
                <track>Keynote</track>
                
                <persons>
                    <person id='77163'>Dr. Rumman Chowdhury</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/HPFVLT/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/HPFVLT/feedback/</feedback_url>
            </event>
            <event guid='e21306e2-892e-5b17-8a48-f5c1cbcfa3b2' id='76097' code='ALEQSL'>
                <room>Zlotnik Ballroom</room>
                <title>Modern compute stack for scaling large AI/ML workloads</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T11:25:00-05:00</date>
                <start>11:25</start>
                <duration>00:30</duration>
                <abstract>Existing production machine learning systems often suffer from various problems that make them hard to use. For example, data scientists and ML practitioners often spend most of their time stitching and managing bespoke distributed systems to build end-to-end ML applications and push models to production.

To address this, the Ray community has built Ray AI Runtime (Ray AIR), an open-source toolkit for building large-scale end-to-end ML applications.</abstract>
                <slug>2023-76097-modern-compute-stack-for-scaling-large-ai-ml-workloads</slug>
                <track>Machine Learning, Data Science, and Ethics in AI</track>
                
                <persons>
                    <person id='77258'>Jules S. Damji</person><person id='77219'>Amog Kamsetty</person>
                </persons>
                <language>en</language>
                <description>Existing production machine learning systems often suffer from various problems that make them hard to use. For example, data scientists and ML practitioners often spend most of their time stitching and managing bespoke distributed systems to build end-to-end ML applications and push models to production.

To address this, the Ray community has built Ray AI Runtime (Ray AIR), an open-source toolkit for building large-scale end-to-end ML applications.

Ray is a distributed compute framework, powering large scale machine learning models such as OpenAI&apos;s ChatGPT. By leveraging Ray&#8217;s distributed compute strata and library ecosystem, the Ray AI Runtime brings scalability and programmability to ML platforms. 

The Ray AI Runtime focuses on providing the compute layer for Python-based AI/ML workloads and is designed to interoperate with popular ML frameworks and other systems for storage and metadata needs.
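
The core pattern AIR scales, fanning a Python function out over many workers and then gathering the results, can be sketched with the standard library alone. This is only a stand-in for illustration: in Ray the function would be decorated with @ray.remote and the results gathered with ray.get.

```python
# Stdlib stand-in for the fan-out/gather pattern that Ray scales across
# a cluster; ThreadPoolExecutor plays the role of Ray's worker pool.
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns futures immediately, like invoking a @ray.remote task
    futures = [pool.submit(square, v) for v in [1, 2, 3, 4]]
    # collect results once the workers finish, analogous to ray.get(futures)
    results = [f.result() for f in futures]
```

Ray generalizes this same pattern from threads in one process to processes spread across the machines of a cluster.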

In this session, we&#8217;ll explore and discuss the following:

* What Ray is and why it matters
* How AIR, built atop Ray, lets you program and scale your machine learning workloads easily
* AIR&#8217;s interoperability and easy integration points with other systems for storage and metadata needs
* AIR&#8217;s cutting-edge features for accelerating the machine learning lifecycle, such as data preprocessing, last-mile data ingestion, tuning and training, and serving at scale

Key takeaways for attendees are:

* Ray as a general-purpose framework for distributed computing
* How the Ray AI Runtime can be used to implement scalable, programmable machine learning workflows
* How to pass and share data across distributed trainers and Ray native libraries: Tune, Serve, Train, RLlib, etc.
* How to scale Python-based workloads across supported public clouds</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/ALEQSL/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/ALEQSL/feedback/</feedback_url>
            </event>
            <event guid='24b321cb-033d-5369-b347-4d5be2f81552' id='76138' code='Q9KTXS'>
                <room>Zlotnik Ballroom</room>
                <title>Ultra fast visualization of large datasets using modern graphics APIs in jupyter notebooks</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T13:15:00-05:00</date>
                <start>13:15</start>
                <duration>00:30</duration>
                <abstract>Fast interactive visualization remains a considerable barrier in analysis pipelines for large neuronal datasets. Here, we present *fastplotlib*, a scientific plotting library featuring an expressive API for very fast visualization of scientific data. *Fastplotlib* is built upon *pygfx* which utilizes the GPU via WGPU, allowing it to interface with modern graphics APIs such as *Vulkan* for fast rendering of objects. *Fastplotlib* is non-blocking, allowing for interactivity with data after plot generation. Ultimately, *fastplotlib* is a general purpose scientific plotting library that is useful for the fast and live visualization and analysis of complex datasets.</abstract>
                <slug>2023-76138-ultra-fast-visualization-of-large-datasets-using-modern-graphics-apis-in-jupyter-notebooks</slug>
                <track>Bioinformatics, Computational Biology &amp; Neuroscience</track>
                <logo>/media/2023/submissions/Q9KTXS/Screen_Shot_2023-03-01_at_5.56._freV5HT.png</logo>
                <persons>
                    <person id='77137'>Kushal Kolar</person><person id='77103'>Caitlin Lewis</person>
                </persons>
                <language>en</language>
                <description>Over the past decade, advanced analysis pipelines have been developed for large neuronal datasets [1][2]. However, fast visualization and live interactivity during data collection is largely unsupported. While current tools within the Python plotting ecosystem (e.g. *pyqtgraph*, *VisPy*, *napari*) allow for interactive data visualization, they either fail to leverage modern GPUs efficiently, lack intuitive APIs for rapid prototyping, or require users to write their own shaders. Additionally, other popular plotting libraries, such as *bokeh* and *matplotlib*, are not geared towards fast interactive visualization with millions of objects. Given these challenges with current visualization tools, there is a clear need for a modern GPU-driven interactive plotting library. In this presentation, we will go through the technical details, as well as a brief demo of how *fastplotlib* makes fast interactive visualization of complex neuronal datasets possible. We will also demonstrate the broader applicability of *fastplotlib* as a fast, general-purpose plotting library.

*Fastplotlib* is built on top of *pygfx*, a cutting-edge Python rendering engine that utilizes *Vulkan* and can efficiently leverage modern GPU and CPU hardware. *Vulkan*, released in 2016, is the successor to *OpenGL* and features low per-draw, per-object overhead, maintaining speed even when rendering millions of objects. *Pygfx* is also non-blocking, which allows for interactivity and modification of already drawn objects. *Fastplotlib* utilizes the *pygfx* rendering library for fast visualization with an expressive API for scientific visualization. *Fastplotlib* reduces boilerplate code, allowing users to focus on their data without having to manage the underlying rendering process. Additionally, *fastplotlib* allows for animations as well as high-level interactivity among plots, which can be combined with lazy loading of very large neuronal imaging movies that are hundreds of gigabytes or terabytes in size. Furthermore, *fastplotlib* can be used in Jupyter notebooks, allowing it to be used on cloud computing and other remote infrastructures. In total, these unique features and the underlying architecture create a plotting library that is fast, easy to use, and multifaceted.
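
The lazy-loading idea above can be sketched with numpy's memmap, one common way to index into movies far larger than RAM. This is an illustrative stand-in, not fastplotlib's internals; the file name and shapes below are made up for the example.

```python
import os
import tempfile
import numpy as np

# Write a small synthetic "movie" to disk: 100 frames of 64x64 float32,
# where every pixel of frame i holds the value i.
path = os.path.join(tempfile.mkdtemp(), "movie.dat")
movie = np.memmap(path, dtype="float32", mode="w+", shape=(100, 64, 64))
movie[:] = np.arange(100, dtype="float32")[:, None, None]
movie.flush()

# Reopen lazily: nothing is read until a frame is indexed, so files of
# hundreds of gigabytes can be browsed frame by frame.
lazy = np.memmap(path, dtype="float32", mode="r", shape=(100, 64, 64))
frame = np.asarray(lazy[42])  # pages in only this one frame
```

A plotting layer can then fetch and render only the frame the user is currently viewing, rather than loading the whole recording up front.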

Initially, *fastplotlib* was developed for use in the neuroscience community to aid in the analysis of large neuronal datasets. However, the long-term goal of this project is to provide open-source software that serves as a general-purpose scientific plotting library. As we are currently in the early stages of development, we are looking for community involvement and to connect with other developers to further progress our software package.

https://github.com/kushalkolar/fastplotlib</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/Q9KTXS/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/Q9KTXS/feedback/</feedback_url>
            </event>
            <event guid='fbfa8be3-133b-5290-9012-e21a04561b16' id='76133' code='SXJFBQ'>
                <room>Zlotnik Ballroom</room>
                <title>DataJoint: Bringing databases back into data science</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T13:55:00-05:00</date>
                <start>13:55</start>
                <duration>00:30</duration>
                <abstract>Relational databases manage structured data and facilitate queries in collaborative repositories, but using SQL from a scientific programming language is awkward. DataJoint is an open-source framework for managing scientific data supporting data definition, diagramming, and queries. DataJoint makes computation a native part of its data model, bridging the gap between databases and numerical analysis in automated workflows. We will showcase the elegance of the relational data model and its versatility through neuroscience research examples. We will also introduce the DataJoint SciViz library, enabling scientists to build web apps for data visualization and unlocking further potential for data-driven discovery.</abstract>
                <slug>2023-76133-datajoint-bringing-databases-back-into-data-science</slug>
                <track>Bioinformatics, Computational Biology &amp; Neuroscience</track>
                <logo>/media/2023/submissions/SXJFBQ/datajoint_talk_gq7K6nn_ir2IyER.png</logo>
                <persons>
                    <person id='77242'>Dimitri Yatsenko</person>
                </persons>
                <language>en</language>
                <description>Research teams work on complex scientific data with many contributors. They execute quickly evolving and complex computational pipelines around such data. This requires a systematic approach to structuring data with clarity and transparency, linking it with distributed computation. Relational databases solve many of these problems; they support data integrity and facilitate queries in large, collaborative repositories. However, working with relational databases through SQL from Python can be awkward. As a result, many data scientists have dismissed relational databases and missed out on their great capabilities. Enter DataJoint, an open-source framework designed explicitly for managing scientific data.

DataJoint uses a relational database system as its backend but utilizes Python programming constructs to define and query the database, similar to object-relational mappers commonly used in web development. It is specifically designed from the ground up for supporting complex data and distributed computations, making it an ideal tool for data scientists.

One of the most significant advantages of DataJoint is that it allows you to design complex databases directly from a Jupyter notebook. It provides its own sublanguage for defining database schemas to capture relationships between data elements, including beautiful diagrams for convenient navigation. DataJoint also provides a convenient query language that reduces the complexity of SQL select statements into an algebra of five operators. Data operations are well integrated with other data science tools such as numpy and pandas.
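
To make the contrast concrete, here is a toy relational schema and query written in plain SQL through the stdlib's sqlite3 module. The table names and columns are invented for the example; DataJoint would declare these as table classes and express the query through its operator algebra rather than a SELECT statement.

```python
import sqlite3

# Toy schema: experimental sessions and the neurons recorded in them.
# DataJoint would define these as classes and query them with its algebra
# of five operators (restriction, join, projection, aggregation, union);
# plain SQL stands in here so the example is self-contained.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE session(session_id INTEGER PRIMARY KEY, subject TEXT);
    CREATE TABLE neuron(
        neuron_id INTEGER PRIMARY KEY,
        session_id INTEGER REFERENCES session(session_id),
        snr REAL);
    INSERT INTO session VALUES (1, 'mouse_a'), (2, 'mouse_b');
    INSERT INTO neuron VALUES (10, 1, 3.5), (11, 1, 1.2), (12, 2, 2.8);
""")

# Join sessions to their neurons and restrict to high-SNR cells: two
# operators in DataJoint's algebra, one SELECT statement here.
rows = con.execute("""
    SELECT subject, neuron_id, snr
    FROM session JOIN neuron USING(session_id)
    WHERE snr > 2
""").fetchall()
```

The point of the algebra is that such joins and restrictions compose as ordinary Python expressions, so query results flow directly into numpy and pandas.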

Most importantly, DataJoint makes computations a first-class citizen in its data model. Computational dependencies are encoded as part of the database design, so the database schema serves to specify the computational data pipeline and workflow.

DataJoint has been in continuous development and use for about 14 years and is currently used in approximately a hundred research labs. A rich collection of standardized workflows, DataJoint Elements, has been in development by the research community.

In this talk, we will introduce the basic principles of scientific databases, including how to create a database, how to visualize its structure, how to enter and delete data, and how to define and execute computational dependencies. We will also showcase examples from past and current neuroscience projects. For large-scale computations, DataJoint can be combined with job orchestration tools for scalable computing.

Furthermore, we will introduce the new DataJoint SciViz library that provides a low-code approach for creating websites for data visualization to show off your work. DataJoint has become a part of the data science tool stack for working with scientific databases, providing the full rigor of relational databases for maintaining data integrity and consistency, especially in dynamic collaborative projects.

Finally, we will share some glimpses of our future developments and invite diverse teams to contribute and collaborate, making DataJoint an even more powerful tool for managing scientific data. With DataJoint, scientists can bring relational databases into the modern era of data science and streamline their data management and computational workflows.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/SXJFBQ/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/SXJFBQ/feedback/</feedback_url>
            </event>
            <event guid='f99fdc05-0599-513b-b54f-1ebd8ac17b47' id='76168' code='CQNJ9Z'>
                <room>Zlotnik Ballroom</room>
                <title>An API for efficient and low-latency access to the largest standardized single-cell data repository by CZ CELLxGENE Discover.</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T14:35:00-05:00</date>
                <start>14:35</start>
                <duration>00:30</duration>
                <abstract>CZ CELLxGENE Discover has released all of its human and mouse single-cell data through a new API that allows for efficient and low-latency querying. The data is fully standardized and hosted publicly, and comprises a count matrix of 50 million cells (observations) by &gt;60,000 genes (features), accompanied by cell and gene metadata. While these data are built from more than 700 datasets, the API enables convenient cell- and gene-based filtering to obtain any slice of interest in a matter of seconds. All data can be quickly transformed to numpy, pandas, anndata or Seurat objects.</abstract>
                <slug>2023-76168-an-api-for-efficient-and-low-latency-access-to-the-largest-standardized-single-cell-data-repository-by-cz-cellxgene-discover</slug>
                <track>Bioinformatics, Computational Biology &amp; Neuroscience</track>
                
                <persons>
                    <person id='77124'>Pablo Garcia-Nieto</person>
                </persons>
                <language>en</language>
                <description>As a part of the CZ CELLxGENE Discover suite (cellxgene.cziscience.com) we have deployed Python and R APIs to query the largest aggregation of single-cell data, covering 50 million cells and &gt;60 thousand genes from the major human and mouse tissues.

The data comprises more than 700 individual datasets represented as a single gene expression count matrix along with metadata data frames, where all cells have harmonized annotations across 11 variables (e.g. cell type, tissue, sequencing technology, donor id, etc.) and all gene IDs and labels have been standardized on GENCODE references (https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md). The APIs are able to perform efficient cell-based queries across all cells regardless of the dataset of origin.

The concatenated data presents a unique opportunity to apply machine learning on single-cell gene expression at an unprecedented scale for biological discoveries. More importantly, the data and APIs are built around a recently developed technology, TileDB-SOMA, which allows for cloud-optimized storage and access, low-latency access for larger-than-memory slices of data, querying and filtering under lazy evaluation, and transformers to pandas, pyarrow, anndata and Seurat. 

The APIs are free to use (https://pypi.org/project/cell-census/) and the data is hosted publicly online, which allows users to fetch slices of data with fewer than 10 lines of code and in under 2 minutes. Our main objective is to accelerate biological discoveries by providing ready-to-use standardized gene expression data from 50 million human and mouse cells in an interoperable manner. We are eager to provide the support necessary to enable researchers to effectively use the data and APIs.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/CQNJ9Z/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/CQNJ9Z/feedback/</feedback_url>
            </event>
            <event guid='2a9a7ad7-644c-5f7e-a42d-ebde10c9dc8c' id='76368' code='RTA7JG'>
                <room>Zlotnik Ballroom</room>
                <title>Lightning Talks</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T15:30:00-05:00</date>
                <start>15:30</start>
                <duration>01:00</duration>
                <abstract>Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can&#8217;t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.</abstract>
                <slug>2023-76368-lightning-talks</slug>
                <track>Lightning Talks</track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/RTA7JG/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/RTA7JG/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 103' guid='32d9d0e7-9b5f-5eb2-a2ba-0f6853d9777b'>
            <event guid='8202b880-b762-5801-8b04-00dfbc5e5bd4' id='75935' code='JXWQPG'>
                <room>Classroom 103</room>
                <title>[BoF Room 103] SciPy 2023 Sprint Prep BoF</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T16:40:00-05:00</date>
                <start>16:40</start>
                <duration>00:55</duration>
                <abstract>Come join the BoF to do a practice run of contributing to a GitHub project. We will walk through how to open a Pull Request for a bugfix, using the workflow most libraries participating in the weekend sprints use (hosted by the sprint chairs).</abstract>
                <slug>2023-75935-bof-room-103-scipy-2023-sprint-prep-bof</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='76920'>Gil Forsyth</person><person id='77148'>Brigitta Sip&#337;cz</person><person id='77199'>Madicken</person><person id='76958'>Matt Davis</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/JXWQPG/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/JXWQPG/feedback/</feedback_url>
            </event>
            <event guid='2ed4d939-1832-5c3e-9b28-94741ba6015b' id='75992' code='XNVLQA'>
                <room>Classroom 103</room>
                <title>[BoF Room 103] CPython performance</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T17:45:00-05:00</date>
                <start>17:45</start>
                <duration>00:55</duration>
                <abstract>Discuss the effects of recent and potential performance improvements on the scientific Python packages. The goal is to discuss the cost/benefit tradeoffs of adapting existing libraries to take advantage of potential improvements, especially per-interpreter GIL and nogil, but also type specializations in the interpreter.</abstract>
                <slug>2023-75992-bof-room-103-cpython-performance</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77277'>Michael Droettboom</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/XNVLQA/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/XNVLQA/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 104' guid='9a119320-e589-5728-ba57-c80aa608ad57'>
            <event guid='de1a8eac-38ea-5f31-9b16-2a63c329418c' id='76311' code='VA7ENC'>
                <room>Classroom 104</room>
                <title>[BoF Room 104] Future of Python Programming Language in the Artificial Intelligence Era</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T16:40:00-05:00</date>
                <start>16:40</start>
                <duration>00:55</duration>
                <abstract>The aim of this panel is to shed light on the role of code assistants like Copilot and tools like ChatGPT, and how they are revolutionizing coding careers. It will also offer insights to help young and budding programmers prepare for future careers, and explore hypothetical questions: Can AI replace human programmers? Can it add or suggest new features to the language itself? And what problems might developers face when building enterprise-grade applications with AI?</abstract>
                <slug>2023-76311-bof-room-104-future-of-python-programming-language-in-the-artificial-intelligence-era</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77142'>Gajendra Deshpande</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/VA7ENC/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/VA7ENC/feedback/</feedback_url>
            </event>
            <event guid='26d38831-fc71-583b-b571-ad91eb99e2d0' id='76343' code='LGZUNG'>
                <room>Classroom 104</room>
                <title>[BoF Room 104] Beyond Notebooks: From reproducible to reusable research</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T17:45:00-05:00</date>
                <start>17:45</start>
                <duration>00:55</duration>
                <abstract>Notebooks can be a powerful tool for the purposes for which they were designed&#8212;learning, experimenting, and sharing results. However, users face many challenges when trying to achieve true reproducibility with notebooks alone, including lack of dependency management, pitfalls of non-linear interactive execution, and the need for bespoke tooling to open and execute them. Furthermore, there is a growing need to go beyond reproducibility of individual results&#8212;siloed into an opaque format with limited interoperability with the rest of the Python ecosystem&#8212;toward reusability of research methods that can be shared, built upon, and deployed by users across the world.

Therefore, we invite the community to share their tools and workflows to go beyond reproducibility and toward truly reusable science, built on the shoulders of giants. Furthermore, we hope to explore how we can encourage users and the community to move beyond the notebook monoculture and toward holistic, open, modular, and interoperable approaches to conducting research and developing scientific code.</abstract>
                <slug>2023-76343-bof-room-104-beyond-notebooks-from-reproducible-to-reusable-research</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77274'>C.A.M. Gerlach</person><person id='77144'>Juanita Gomez</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/LGZUNG/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/LGZUNG/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Classroom 105' guid='5c6f7634-c90c-5400-be39-1771ee7e728f'>
            <event guid='c78e682a-afdc-5c98-9c21-9fc4e8cde459' id='75937' code='LTDRGY'>
                <room>Classroom 105</room>
                <title>[BoF Room 105] Open Source Project Code of Conduct Management and DEI Support</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T16:40:00-05:00</date>
                <start>16:40</start>
                <duration>00:55</duration>
                <abstract>NumFOCUS will facilitate a discussion around open source projects managing a robust Code of Conduct as well as ongoing DEI support</abstract>
                <slug>2023-75937-bof-room-105-open-source-project-code-of-conduct-management-and-dei-support</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77172'>Noa Tamir</person><person id='77307'>Leah Silen</person><person id='77046'>Inessa Pawson</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/LTDRGY/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/LTDRGY/feedback/</feedback_url>
            </event>
            <event guid='b2f3de17-5205-5bcc-b1f4-5ef639a48fb1' id='76337' code='9H9KFM'>
                <room>Classroom 105</room>
                <title>[BoF Room 105] SciPy 2024</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-14T17:45:00-05:00</date>
                <start>17:45</start>
                <duration>00:55</duration>
                <abstract>Feedback on SciPy 2023 and ideas for SciPy 2024</abstract>
                <slug>2023-76337-bof-room-105-scipy-2024</slug>
                <track>Birds of a Feather (BoF)</track>
                
                <persons>
                    <person id='77278'>SciPy 2023 Committee</person>
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/9H9KFM/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/9H9KFM/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='6' date='2023-07-15' start='2023-07-15T04:00:00-05:00' end='2023-07-16T03:59:00-05:00'>
        <room name='Amphitheater 204' guid='a7020de2-2717-51a7-bcd4-7e9831c5ab8f'>
            <event guid='9afb9fc5-cad7-5956-a2f5-8c7bffa7c33e' id='76169' code='NMNJYF'>
                <room>Amphitheater 204</room>
                <title>Open Source Sprints [Kickoff in Room 204]</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-15T09:00:00-05:00</date>
                <start>09:00</start>
                <duration>01:00</duration>
                <abstract>Everyone will meet in Room 204 and organize before breaking out for the remainder of the day. 

Every year, our community dedicates the last 2 days of the SciPy conference to Sprints, where we work together on open-source projects to push our ecosystem forward.

Sprints are an informal part of the conference, where all are welcome to exchange ideas, hack on exciting projects, and create lasting connections.  All programming levels are welcome at the sprints.

Join us for the preparatory Sprint BoF as well on Friday at 4:40 in Room 103 - https://cfp.scipy.org/2023/talk/JXWQPG/

Interested in leading a sprint at SciPy 2023? Sign up at https://www.scipy2023.scipy.org/sprints</abstract>
                <slug>2023-76169-open-source-sprints-kickoff-in-room-204</slug>
                <track></track>
                
                <persons>
                    <person id='77275'>Dr. Tania Allard</person><person id='77148'>Brigitta Sip&#337;cz</person><person id='77276'>Alan Braz</person>
                </persons>
                <language>en</language>
                <description>Sprints FAQs
What will you do as an attendee?

There are a variety of ways to contribute during the sprints session, including testing code, fixing bugs, adding new features, and improving documentation. You could also contribute to an entirely brand-new project that our ecosystem is missing. One of the best parts about the sprints is that you might have the opportunity to work with authors and core contributors of your favorite open source packages, as well as the opportunity to work alongside other developers who are just as excited as you are to make the SciPy community even better.

What are the benefits of attending a sprint?

Make open source Python better! Code alongside package authors/contributors, while learning from them. Become a power user of a core package by gaining a deeper understanding of its inner workings. Improve your GitHub profile. Get to know other SciPy community members at the Sprints dinner.

Can I participate?

Yes! Sprints are free and open to everyone, no matter your level of programming experience. Sprints are a great way to contribute to your favorite Python libraries and packages. Thanks to the generosity of our sponsors, sprints are free of charge for all participants, including the Sprints dinner on Saturday evening.

If you aren&apos;t sure about how you can contribute to a project, it&apos;s not a problem. We&apos;ll get you up to speed at the How to Contribute to Open Source BoF on Friday and we have helpers at the beginner friendly sprints.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/NMNJYF/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/NMNJYF/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='7' date='2023-07-16' start='2023-07-16T04:00:00-05:00' end='2023-07-17T03:59:00-05:00'>
        <room name='Amphitheater 204' guid='a7020de2-2717-51a7-bcd4-7e9831c5ab8f'>
            <event guid='f39068a8-cbaa-52e2-9a0a-01d837e8fa00' id='76206' code='WTNHTR'>
                <room>Amphitheater 204</room>
                <title>Open Source Sprints [Kickoff in Room 204]</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-07-16T09:00:00-05:00</date>
                <start>09:00</start>
                <duration>01:00</duration>
                <abstract>Everyone will meet in Room 204 and organize before breaking out for the remainder of the day. 

Every year, our community dedicates the last 2 days of the SciPy conference to Sprints, where we work together on open-source projects to push our ecosystem forward.

Sprints are an informal part of the conference, where all are welcome to exchange ideas, hack on exciting projects, and create lasting connections. All programming levels are welcome at the sprints.

Join us for the preparatory Sprint BoF on Friday at 4:40 in Room 103 as well - https://cfp.scipy.org/2023/talk/JXWQPG/

Interested in leading a sprint at SciPy 2023? Sign up at https://www.scipy2023.scipy.org/sprints</abstract>
                <slug>2023-76206-open-source-sprints-kickoff-in-room-204</slug>
                <track></track>
                
                <persons>
                    <person id='77275'>Dr. Tania Allard</person><person id='77148'>Brigitta Sip&#337;cz</person><person id='77276'>Alan Braz</person>
                </persons>
                <language>en</language>
                <description>Sprints FAQs
What will you do as an attendee?

There are a variety of ways to contribute during the sprints session, including testing code, fixing bugs, adding new features, and improving documentation. You could also contribute to an entirely new project that our ecosystem is missing. One of the best parts about the sprints is that you might have the opportunity to work with authors and core contributors of your favorite open source packages, as well as alongside other developers who are just as excited as you are to make the SciPy community even better.

What are the benefits of attending a sprint?

Make open source Python better! Code alongside package authors and contributors while learning from them. Become a power user of a core package by gaining a deeper understanding of its inner workings. Improve your GitHub profile. Get to know other SciPy community members at the Sprints dinner.

Can I participate?

Yes! Sprints are open to everyone, regardless of your level of programming experience. Sprints are a great way to contribute to your favorite Python libraries and packages. Thanks to the generosity of our sponsors, sprints are free of charge for all participants, including the Sprints dinner on Saturday evening.

If you aren&apos;t sure how you can contribute to a project, that&apos;s not a problem. We&apos;ll get you up to speed at the How to Contribute to Open Source BoF on Friday, and we have helpers at the beginner-friendly sprints.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.scipy.org/2023/talk/WTNHTR/</url>
                <feedback_url>https://cfp.scipy.org/2023/talk/WTNHTR/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='8' date='2023-07-17' start='2023-07-17T04:00:00-05:00' end='2023-07-18T03:59:00-05:00'>
        
    </day>
    <day index='9' date='2023-07-18' start='2023-07-18T04:00:00-05:00' end='2023-07-19T03:59:00-05:00'>
        
    </day>
    <day index='10' date='2023-07-19' start='2023-07-19T04:00:00-05:00' end='2023-07-20T03:59:00-05:00'>
        
    </day>
    <day index='11' date='2023-07-20' start='2023-07-20T04:00:00-05:00' end='2023-07-21T03:59:00-05:00'>
        
    </day>
    <day index='12' date='2023-07-21' start='2023-07-21T04:00:00-05:00' end='2023-07-22T03:59:00-05:00'>
        
    </day>
    <day index='13' date='2023-07-22' start='2023-07-22T04:00:00-05:00' end='2023-07-23T03:59:00-05:00'>
        
    </day>
    <day index='14' date='2023-07-23' start='2023-07-23T04:00:00-05:00' end='2023-07-24T03:59:00-05:00'>
        
    </day>
    <day index='15' date='2023-07-24' start='2023-07-24T04:00:00-05:00' end='2023-07-25T03:59:00-05:00'>
        
    </day>
    <day index='16' date='2023-07-25' start='2023-07-25T04:00:00-05:00' end='2023-07-26T03:59:00-05:00'>
        
    </day>
    <day index='17' date='2023-07-26' start='2023-07-26T04:00:00-05:00' end='2023-07-27T03:59:00-05:00'>
        
    </day>
    <day index='18' date='2023-07-27' start='2023-07-27T04:00:00-05:00' end='2023-07-28T03:59:00-05:00'>
        
    </day>
    <day index='19' date='2023-07-28' start='2023-07-28T04:00:00-05:00' end='2023-07-29T03:59:00-05:00'>
        
    </day>
    <day index='20' date='2023-07-29' start='2023-07-29T04:00:00-05:00' end='2023-07-30T03:59:00-05:00'>
        
    </day>
    <day index='21' date='2023-07-30' start='2023-07-30T04:00:00-05:00' end='2023-07-31T03:59:00-05:00'>
        
    </day>
    <day index='22' date='2023-07-31' start='2023-07-31T04:00:00-05:00' end='2023-08-01T03:59:00-05:00'>
        
    </day>
    <day index='23' date='2023-08-01' start='2023-08-01T04:00:00-05:00' end='2023-08-02T03:59:00-05:00'>
        
    </day>
    <day index='24' date='2023-08-02' start='2023-08-02T04:00:00-05:00' end='2023-08-03T03:59:00-05:00'>
        
    </day>
    <day index='25' date='2023-08-03' start='2023-08-03T04:00:00-05:00' end='2023-08-04T03:59:00-05:00'>
        
    </day>
    <day index='26' date='2023-08-04' start='2023-08-04T04:00:00-05:00' end='2023-08-05T03:59:00-05:00'>
        
    </day>
    <day index='27' date='2023-08-05' start='2023-08-05T04:00:00-05:00' end='2023-08-06T03:59:00-05:00'>
        
    </day>
    <day index='28' date='2023-08-06' start='2023-08-06T04:00:00-05:00' end='2023-08-07T03:59:00-05:00'>
        
    </day>
    <day index='29' date='2023-08-07' start='2023-08-07T04:00:00-05:00' end='2023-08-08T03:59:00-05:00'>
        
    </day>
    <day index='30' date='2023-08-08' start='2023-08-08T04:00:00-05:00' end='2023-08-09T03:59:00-05:00'>
        
    </day>
    <day index='31' date='2023-08-09' start='2023-08-09T04:00:00-05:00' end='2023-08-10T03:59:00-05:00'>
        
    </day>
    <day index='32' date='2023-08-10' start='2023-08-10T04:00:00-05:00' end='2023-08-11T03:59:00-05:00'>
        
    </day>
    <day index='33' date='2023-08-11' start='2023-08-11T04:00:00-05:00' end='2023-08-12T03:59:00-05:00'>
        
    </day>
    <day index='34' date='2023-08-12' start='2023-08-12T04:00:00-05:00' end='2023-08-13T03:59:00-05:00'>
        
    </day>
    <day index='35' date='2023-08-13' start='2023-08-13T04:00:00-05:00' end='2023-08-14T03:59:00-05:00'>
        
    </day>
    <day index='36' date='2023-08-14' start='2023-08-14T04:00:00-05:00' end='2023-08-15T03:59:00-05:00'>
        
    </day>
    <day index='37' date='2023-08-15' start='2023-08-15T04:00:00-05:00' end='2023-08-16T03:59:00-05:00'>
        
    </day>
    <day index='38' date='2023-08-16' start='2023-08-16T04:00:00-05:00' end='2023-08-17T03:59:00-05:00'>
        
    </day>
    <day index='39' date='2023-08-17' start='2023-08-17T04:00:00-05:00' end='2023-08-18T03:59:00-05:00'>
        
    </day>
    <day index='40' date='2023-08-18' start='2023-08-18T04:00:00-05:00' end='2023-08-19T03:59:00-05:00'>
        
    </day>
    <day index='41' date='2023-08-19' start='2023-08-19T04:00:00-05:00' end='2023-08-20T03:59:00-05:00'>
        
    </day>
    <day index='42' date='2023-08-20' start='2023-08-20T04:00:00-05:00' end='2023-08-21T03:59:00-05:00'>
        
    </day>
    <day index='43' date='2023-08-21' start='2023-08-21T04:00:00-05:00' end='2023-08-22T03:59:00-05:00'>
        
    </day>
    <day index='44' date='2023-08-22' start='2023-08-22T04:00:00-05:00' end='2023-08-23T03:59:00-05:00'>
        
    </day>
    <day index='45' date='2023-08-23' start='2023-08-23T04:00:00-05:00' end='2023-08-24T03:59:00-05:00'>
        
    </day>
    <day index='46' date='2023-08-24' start='2023-08-24T04:00:00-05:00' end='2023-08-25T03:59:00-05:00'>
        
    </day>
    <day index='47' date='2023-08-25' start='2023-08-25T04:00:00-05:00' end='2023-08-26T03:59:00-05:00'>
        
    </day>
    <day index='48' date='2023-08-26' start='2023-08-26T04:00:00-05:00' end='2023-08-27T03:59:00-05:00'>
        
    </day>
    <day index='49' date='2023-08-27' start='2023-08-27T04:00:00-05:00' end='2023-08-28T03:59:00-05:00'>
        
    </day>
    <day index='50' date='2023-08-28' start='2023-08-28T04:00:00-05:00' end='2023-08-29T03:59:00-05:00'>
        
    </day>
    <day index='51' date='2023-08-29' start='2023-08-29T04:00:00-05:00' end='2023-08-30T03:59:00-05:00'>
        
    </day>
    <day index='52' date='2023-08-30' start='2023-08-30T04:00:00-05:00' end='2023-08-31T03:59:00-05:00'>
        
    </day>
    <day index='53' date='2023-08-31' start='2023-08-31T04:00:00-05:00' end='2023-09-01T03:59:00-05:00'>
        
    </day>
    <day index='54' date='2023-09-01' start='2023-09-01T04:00:00-05:00' end='2023-09-02T03:59:00-05:00'>
        
    </day>
    <day index='55' date='2023-09-02' start='2023-09-02T04:00:00-05:00' end='2023-09-03T03:59:00-05:00'>
        
    </day>
    <day index='56' date='2023-09-03' start='2023-09-03T04:00:00-05:00' end='2023-09-04T03:59:00-05:00'>
        
    </day>
    <day index='57' date='2023-09-04' start='2023-09-04T04:00:00-05:00' end='2023-09-05T03:59:00-05:00'>
        
    </day>
    <day index='58' date='2023-09-05' start='2023-09-05T04:00:00-05:00' end='2023-09-06T03:59:00-05:00'>
        
    </day>
    <day index='59' date='2023-09-06' start='2023-09-06T04:00:00-05:00' end='2023-09-07T03:59:00-05:00'>
        
    </day>
    <day index='60' date='2023-09-07' start='2023-09-07T04:00:00-05:00' end='2023-09-08T03:59:00-05:00'>
        
    </day>
    <day index='61' date='2023-09-08' start='2023-09-08T04:00:00-05:00' end='2023-09-09T03:59:00-05:00'>
        
    </day>
    <day index='62' date='2023-09-09' start='2023-09-09T04:00:00-05:00' end='2023-09-10T03:59:00-05:00'>
        
    </day>
    <day index='63' date='2023-09-10' start='2023-09-10T04:00:00-05:00' end='2023-09-11T03:59:00-05:00'>
        
    </day>
    <day index='64' date='2023-09-11' start='2023-09-11T04:00:00-05:00' end='2023-09-12T03:59:00-05:00'>
        
    </day>
    <day index='65' date='2023-09-12' start='2023-09-12T04:00:00-05:00' end='2023-09-13T03:59:00-05:00'>
        
    </day>
    <day index='66' date='2023-09-13' start='2023-09-13T04:00:00-05:00' end='2023-09-14T03:59:00-05:00'>
        
    </day>
    <day index='67' date='2023-09-14' start='2023-09-14T04:00:00-05:00' end='2023-09-15T03:59:00-05:00'>
        
    </day>
    <day index='68' date='2023-09-15' start='2023-09-15T04:00:00-05:00' end='2023-09-16T03:59:00-05:00'>
        
    </day>
    <day index='69' date='2023-09-16' start='2023-09-16T04:00:00-05:00' end='2023-09-17T03:59:00-05:00'>
        
    </day>
    <day index='70' date='2023-09-17' start='2023-09-17T04:00:00-05:00' end='2023-09-18T03:59:00-05:00'>
        
    </day>
    <day index='71' date='2023-09-18' start='2023-09-18T04:00:00-05:00' end='2023-09-19T03:59:00-05:00'>
        
    </day>
    <day index='72' date='2023-09-19' start='2023-09-19T04:00:00-05:00' end='2023-09-20T03:59:00-05:00'>
        
    </day>
    <day index='73' date='2023-09-20' start='2023-09-20T04:00:00-05:00' end='2023-09-21T03:59:00-05:00'>
        
    </day>
    <day index='74' date='2023-09-21' start='2023-09-21T04:00:00-05:00' end='2023-09-22T03:59:00-05:00'>
        
    </day>
    <day index='75' date='2023-09-22' start='2023-09-22T04:00:00-05:00' end='2023-09-23T03:59:00-05:00'>
        
    </day>
    <day index='76' date='2023-09-23' start='2023-09-23T04:00:00-05:00' end='2023-09-24T03:59:00-05:00'>
        
    </day>
    <day index='77' date='2023-09-24' start='2023-09-24T04:00:00-05:00' end='2023-09-25T03:59:00-05:00'>
        
    </day>
    <day index='78' date='2023-09-25' start='2023-09-25T04:00:00-05:00' end='2023-09-26T03:59:00-05:00'>
        
    </day>
    <day index='79' date='2023-09-26' start='2023-09-26T04:00:00-05:00' end='2023-09-27T03:59:00-05:00'>
        
    </day>
    <day index='80' date='2023-09-27' start='2023-09-27T04:00:00-05:00' end='2023-09-28T03:59:00-05:00'>
        
    </day>
    <day index='81' date='2023-09-28' start='2023-09-28T04:00:00-05:00' end='2023-09-29T03:59:00-05:00'>
        
    </day>
    <day index='82' date='2023-09-29' start='2023-09-29T04:00:00-05:00' end='2023-09-30T03:59:00-05:00'>
        
    </day>
    <day index='83' date='2023-09-30' start='2023-09-30T04:00:00-05:00' end='2023-10-01T03:59:00-05:00'>
        
    </day>
    <day index='84' date='2023-10-01' start='2023-10-01T04:00:00-05:00' end='2023-10-02T03:59:00-05:00'>
        
    </day>
    <day index='85' date='2023-10-02' start='2023-10-02T04:00:00-05:00' end='2023-10-03T03:59:00-05:00'>
        
    </day>
    <day index='86' date='2023-10-03' start='2023-10-03T04:00:00-05:00' end='2023-10-04T03:59:00-05:00'>
        
    </day>
    <day index='87' date='2023-10-04' start='2023-10-04T04:00:00-05:00' end='2023-10-05T03:59:00-05:00'>
        
    </day>
    <day index='88' date='2023-10-05' start='2023-10-05T04:00:00-05:00' end='2023-10-06T03:59:00-05:00'>
        
    </day>
    <day index='89' date='2023-10-06' start='2023-10-06T04:00:00-05:00' end='2023-10-07T03:59:00-05:00'>
        
    </day>
    <day index='90' date='2023-10-07' start='2023-10-07T04:00:00-05:00' end='2023-10-08T03:59:00-05:00'>
        
    </day>
    <day index='91' date='2023-10-08' start='2023-10-08T04:00:00-05:00' end='2023-10-09T03:59:00-05:00'>
        
    </day>
    <day index='92' date='2023-10-09' start='2023-10-09T04:00:00-05:00' end='2023-10-10T03:59:00-05:00'>
        
    </day>
    <day index='93' date='2023-10-10' start='2023-10-10T04:00:00-05:00' end='2023-10-11T03:59:00-05:00'>
        
    </day>
    <day index='94' date='2023-10-11' start='2023-10-11T04:00:00-05:00' end='2023-10-12T03:59:00-05:00'>
        
    </day>
    <day index='95' date='2023-10-12' start='2023-10-12T04:00:00-05:00' end='2023-10-13T03:59:00-05:00'>
        
    </day>
    <day index='96' date='2023-10-13' start='2023-10-13T04:00:00-05:00' end='2023-10-14T03:59:00-05:00'>
        
    </day>
    <day index='97' date='2023-10-14' start='2023-10-14T04:00:00-05:00' end='2023-10-15T03:59:00-05:00'>
        
    </day>
    <day index='98' date='2023-10-15' start='2023-10-15T04:00:00-05:00' end='2023-10-16T03:59:00-05:00'>
        
    </day>
    <day index='99' date='2023-10-16' start='2023-10-16T04:00:00-05:00' end='2023-10-17T03:59:00-05:00'>
        
    </day>
    <day index='100' date='2023-10-17' start='2023-10-17T04:00:00-05:00' end='2023-10-18T03:59:00-05:00'>
        
    </day>
    <day index='101' date='2023-10-18' start='2023-10-18T04:00:00-05:00' end='2023-10-19T03:59:00-05:00'>
        
    </day>
    <day index='102' date='2023-10-19' start='2023-10-19T04:00:00-05:00' end='2023-10-20T03:59:00-05:00'>
        
    </day>
    <day index='103' date='2023-10-20' start='2023-10-20T04:00:00-05:00' end='2023-10-21T03:59:00-05:00'>
        
    </day>
    <day index='104' date='2023-10-21' start='2023-10-21T04:00:00-05:00' end='2023-10-22T03:59:00-05:00'>
        
    </day>
    <day index='105' date='2023-10-22' start='2023-10-22T04:00:00-05:00' end='2023-10-23T03:59:00-05:00'>
        
    </day>
    <day index='106' date='2023-10-23' start='2023-10-23T04:00:00-05:00' end='2023-10-24T03:59:00-05:00'>
        
    </day>
    <day index='107' date='2023-10-24' start='2023-10-24T04:00:00-05:00' end='2023-10-25T03:59:00-05:00'>
        
    </day>
    <day index='108' date='2023-10-25' start='2023-10-25T04:00:00-05:00' end='2023-10-26T03:59:00-05:00'>
        
    </day>
    <day index='109' date='2023-10-26' start='2023-10-26T04:00:00-05:00' end='2023-10-27T03:59:00-05:00'>
        
    </day>
    <day index='110' date='2023-10-27' start='2023-10-27T04:00:00-05:00' end='2023-10-28T03:59:00-05:00'>
        
    </day>
    <day index='111' date='2023-10-28' start='2023-10-28T04:00:00-05:00' end='2023-10-29T03:59:00-05:00'>
        
    </day>
    <day index='112' date='2023-10-29' start='2023-10-29T04:00:00-05:00' end='2023-10-30T03:59:00-05:00'>
        
    </day>
    <day index='113' date='2023-10-30' start='2023-10-30T04:00:00-05:00' end='2023-10-31T03:59:00-05:00'>
        
    </day>
    <day index='114' date='2023-10-31' start='2023-10-31T04:00:00-05:00' end='2023-11-01T03:59:00-05:00'>
        
    </day>
    <day index='115' date='2023-11-01' start='2023-11-01T04:00:00-05:00' end='2023-11-02T03:59:00-05:00'>
        
    </day>
    <day index='116' date='2023-11-02' start='2023-11-02T04:00:00-05:00' end='2023-11-03T03:59:00-05:00'>
        
    </day>
    <day index='117' date='2023-11-03' start='2023-11-03T04:00:00-05:00' end='2023-11-04T03:59:00-05:00'>
        
    </day>
    <day index='118' date='2023-11-04' start='2023-11-04T04:00:00-05:00' end='2023-11-05T03:59:00-06:00'>
        
    </day>
    <day index='119' date='2023-11-05' start='2023-11-05T04:00:00-06:00' end='2023-11-06T03:59:00-06:00'>
        
    </day>
    <day index='120' date='2023-11-06' start='2023-11-06T04:00:00-06:00' end='2023-11-07T03:59:00-06:00'>
        
    </day>
    <day index='121' date='2023-11-07' start='2023-11-07T04:00:00-06:00' end='2023-11-08T03:59:00-06:00'>
        
    </day>
    <day index='122' date='2023-11-08' start='2023-11-08T04:00:00-06:00' end='2023-11-09T03:59:00-06:00'>
        
    </day>
    <day index='123' date='2023-11-09' start='2023-11-09T04:00:00-06:00' end='2023-11-10T03:59:00-06:00'>
        
    </day>
    <day index='124' date='2023-11-10' start='2023-11-10T04:00:00-06:00' end='2023-11-11T03:59:00-06:00'>
        
    </day>
    
</schedule>
