SciPy 2023

Building better data structures, APIs and configuration systems for scientific software using Pydantic
07-10, 08:00–12:00 (America/Chicago), Classroom 105

This tutorial is an introduction to Pydantic, a library for data validation and settings management using Python type annotations. Using a semi-realistic ML and/or scientific software pipeline scenario, we demonstrate how Pydantic can be used to support type validation for scientific data structures, APIs and configuration systems. We show how the use of Pydantic in scientific and ML software leads to a more pleasant user experience as well as more robust and easier-to-maintain code. A minimum knowledge of Python type annotations, class definitions and data structures will be helpful for beginners but is not required.


One of the most controversial design choices of Python is the use of dynamic types. Dynamically typed variables often confuse beginners, and even for experts they are a common source of hard-to-find bugs. For this reason, type annotations were later added to the language to allow for static code analysis and more detailed source code documentation. Pydantic is a Python library that uses these type annotations to parse and validate types for class-based data structures. In recent years Pydantic has gained tremendous popularity among web developers and is now the most widely used data validation library for Python. In this tutorial we show how Pydantic can help to build better data structures, APIs and configuration systems for scientific Python packages as well. In many cases the validated types lead to a more pleasant user experience as well as more robust and easier-to-maintain code.

In the first block we introduce the basics of the library, such as the concept of Pydantic models, type annotations and atomic types such as int, float and str. We show how types are parsed and how models can be configured to forbid extra attributes. At the end of the block participants will implement their first Pydantic model and explore basic configuration settings.
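
As a rough illustration of the first block, here is a minimal sketch of a model that parses atomic types and forbids extra attributes. It assumes Pydantic v2 (in v1 the same option is set through an inner `class Config`). The `Observation` model and its fields are hypothetical and are not taken from the tutorial material.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class Observation(BaseModel):
    """Hypothetical example model with atomic field types."""

    # Reject any attribute that is not declared below.
    model_config = ConfigDict(extra="forbid")

    source: str
    flux: float
    n_counts: int = 0


# Strings holding numbers are parsed into the annotated types.
obs = Observation(source="Crab", flux="1.5", n_counts="42")
print(type(obs.flux).__name__, type(obs.n_counts).__name__)

# An undeclared attribute ("exposure") is rejected because extra="forbid".
try:
    Observation(source="Crab", flux=1.5, exposure=10)
    extra_rejected = False
except ValidationError:
    extra_rejected = True
```

Here `obs.flux` ends up as a float and `obs.n_counts` as an int, even though both were passed as strings.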

We then proceed with the introduction of more complex types, such as typed dicts, Enums and datetime objects. We will also cover custom types, which can be used to build nested models. Then we introduce the basics of type validation for multiple scenarios, such as pre- and post-init validation and root validation. At the end of this block we will cover the topic of dynamic model creation. In the following hands-on session participants will implement a more complex Pydantic model representing the response from a weather data API at multiple levels of difficulty.
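
The ideas in this block could be sketched as follows, assuming Pydantic v2, where `field_validator` with `mode="before"` and `model_validator` with `mode="after"` take the place of the v1 pre and root validators. The weather response layout, the field names and the `Station` model are all made up for illustration and do not come from a real weather API or from the tutorial exercises.

```python
from datetime import datetime
from enum import Enum
from typing import List

from pydantic import (
    BaseModel,
    ValidationError,
    create_model,
    field_validator,
    model_validator,
)


class Unit(str, Enum):
    celsius = "celsius"
    fahrenheit = "fahrenheit"


class Forecast(BaseModel):
    time: datetime
    temperature_min: float
    temperature_max: float

    @field_validator("temperature_min", "temperature_max", mode="before")
    @classmethod
    def strip_unit_suffix(cls, value):
        # Runs before type parsing: turn a string such as "24.0 C" into "24.0".
        if isinstance(value, str):
            return value.strip().split(" ")[0]
        return value

    @model_validator(mode="after")
    def check_range(self):
        # Runs after all fields are parsed and can compare several fields.
        if self.temperature_min > self.temperature_max:
            raise ValueError("temperature_min exceeds temperature_max")
        return self


class WeatherResponse(BaseModel):
    # Nested model: each list entry is validated as a Forecast.
    location: str
    unit: Unit = Unit.celsius
    forecasts: List[Forecast]


response = WeatherResponse(
    location="Austin",
    unit="celsius",
    forecasts=[
        {"time": "2023-07-10T08:00:00", "temperature_min": "24.0 C", "temperature_max": 36}
    ],
)

# An inconsistent temperature range fails the model-level check.
try:
    Forecast(time="2023-07-10T08:00:00", temperature_min=40, temperature_max=30)
    range_rejected = False
except ValidationError:
    range_rejected = True

# Dynamic model creation: fields are given as (type, default) tuples,
# with ... marking a required field.
Station = create_model("Station", name=(str, ...), elevation=(float, 0.0))
station = Station(name="KAUS")
```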

The subsequent block will be dedicated to serialization and deserialization of Pydantic models. We will first motivate the need and then introduce the JSON and YAML data formats. We will show how to support custom types for JSON serialization and give an overview of configuration options related to serialization. We will conclude with remarks on performance when serializing a large number of models. In the corresponding hands-on exercise participants will use the weather data structure and build a small configurable data processing pipeline which visually compares the weather forecast data from different models.
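
A JSON round trip could look roughly like the sketch below. It assumes Pydantic v2 (`model_dump_json`, `model_validate_json`, `field_serializer` and `TypeAdapter`). The `Forecast` model, the provider names and the rounding rule are hypothetical. YAML is left out because it needs an additional third-party package, and the `TypeAdapter` part is only one possible way to handle many models in a single call, not a benchmarked recommendation.

```python
from datetime import datetime
from typing import List

from pydantic import BaseModel, TypeAdapter, field_serializer


class Forecast(BaseModel):
    provider: str
    time: datetime
    temperature: float

    @field_serializer("temperature")
    def round_temperature(self, value: float) -> float:
        # Custom serialization rule: keep one decimal place in the output.
        return round(value, 1)


forecast = Forecast(provider="gfs", time=datetime(2023, 7, 10, 8, 0), temperature=31.4159)

# Serialize to a JSON string and parse it back into a model.
payload = forecast.model_dump_json()
restored = Forecast.model_validate_json(payload)

# For many models, a TypeAdapter serializes and validates the whole list at once.
adapter = TypeAdapter(List[Forecast])
forecasts = [
    Forecast(provider=name, time=datetime(2023, 7, 10, 8, 0), temperature=30.0 + i)
    for i, name in enumerate(["gfs", "ecmwf", "icon"])
]
payload_many = adapter.dump_json(forecasts)  # bytes
restored_many = adapter.validate_json(payload_many)
```

Note that the rounding happens only on output: `forecast.temperature` keeps its full precision, while `restored.temperature` holds the rounded value read back from JSON.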

Finally we will give a summary and key takeaways of the tutorial and recommend additional resources for learning Pydantic.


Installation Instructions

https://github.com/adonath/scipy-2023-pydantic-tutorial/tree/main#setup-instructions

Prerequisites

A minimum knowledge of Python type annotations, class definitions and data structures will be helpful for beginners but is not required. Some minimal knowledge of Pandas and NumPy is expected for the hands-on exercises.

I'm a postdoctoral researcher at the Center for Astrophysics. My research interests include the Galactic X-ray and gamma-ray source populations as well as statistical methods for the analysis of low-count data in general. I'm also interested in methods to combine data from multiple instruments. I'm the lead developer of the open source software package Gammapy, a sub-package maintainer of Astropy and a member of the CHASC astro-statistics collaboration. I'm also an editor for the Astronomy and Astrophysics track of the Journal of Open Source Software (JOSS).

I am a senior machine learning engineer at VideaHealth, Inc., where I am currently developing AI models for automatic detection of dental diseases. My background is in astrophysics and I have 4 years of experience as a teaching assistant at the University of Illinois at Urbana-Champaign and Harvard University. The coursework ranged from introductory to mid-level physics in both theoretical and laboratory settings. I have contributed to several conferences, most notably with an invited talk at an exoplanet conference in Göttingen, Germany. There I presented my Ph.D. work on improving exoplanet analysis pipelines through the use of machine learning.