SciPy 2023

Pandera: Beyond Pandas Data Validation
07-12, 13:15–13:45 (America/Chicago), Zlotnik Ballroom

Data quality remains a core concern for practitioners of machine learning, data science, and data engineering, and in recent years specialized packages have emerged to validate and monitor data and models. However, as the open source community iterates on data frameworks – notably, highly performant entrants such as Polars – data quality libraries need to catch up to support them. In this talk, you will learn about Pandera and its journey from a pandas-only validator to a generic tool for testing arbitrary data containers, providing a standardized way to build data validation tools.


Data quality remains a core concern for practitioners in machine learning, data science, and data engineering, and many specialized packages have emerged to validate and monitor data and models. However, as the open source community creates new data manipulation frameworks – notably, highly performant entrants such as Polars – existing data quality frameworks need to catch up to support them; in some cases, the community writes entirely new validation libraries for each new data framework.


Pandera started as a small project in 2018 with the goal of providing a lightweight, flexible, and expressive API to validate Pandas DataFrames. This part of the talk provides a short primer on data validation and property-based testing with Pandera, providing insights into how its design facilitates code-first schema authoring and maintenance, which in turn gives rise to safer and more robust data pipelines.

This primer will cover content similar to the "Introduction to Pandera" notebook in the pandera documentation.


After gaining traction over the years, the author and community of contributors began to expand Pandera’s scope to support pandas-compliant data frameworks such as GeoPandas, Dask, Modin, and PySpark Pandas (formerly Koalas). As requests for other libraries increased in frequency, it became clear that Pandera in its existing state was not well-suited for extension beyond pandas objects. This part of the talk focuses on some of the key design failures that made it difficult to extend to other data frameworks.

Rewrites are Fun! (not): Imagine doing a complete internal rewrite of a library while bug reports, feature requests, and pull requests are coming in from contributors: does it sound fun? In the author’s experience, it’s like juggling three balls while playing drums with your feet as someone throws water balloons in your face. This part of the talk outlines the challenges, lessons learned, and things the author would have done differently to anticipate issues related to the separation of concerns, modularity, and extensibility.


This talk is about how Pandera has evolved to provide a standard schema interface for easily extending and supporting validation backends for arbitrary statistical data containers. Attendees will learn not only about data testing principles such as run-time validation and property-based testing, but also about the challenges of maintaining and evolving an open source project that many people rely on as a critical piece of their data infrastructure. The high-level goal for this talk is to highlight lessons learned from Pandera’s particular journey from supporting only pandas as a backend to supporting a whole suite of data objects.

Niels is the Chief Machine Learning Engineer at, and core maintainer of, Flyte, an open source workflow orchestration tool; the author of UnionML, an MLOps framework for machine learning microservices; and the creator of Pandera, a statistical typing and data testing tool for scientific data containers. His mission is to help data science and machine learning practitioners be more productive.

He has a Master's in Public Health with a specialization in sociomedical science and public health informatics, and prior to that a background in developmental biology and immunology. His research interests include reinforcement learning, AutoML, creative machine learning, and fairness, accountability, and transparency in automated systems.
