SciPy 2024

My NumPy year: From no CPython C API experience to shipping a new DType in NumPy 2.0
07-10, 13:55–14:25 (US/Pacific), Room 315

Support for string data in NumPy has long been a sore spot for the community. At the beginning of 2023 I was given the task of solving that problem by writing a new UTF-8 variable-length string DType leveraging the new NumPy DType API. I will offer my personal narrative of how I accomplished that goal over the course of 2023 and present my experience as a model for others taking on difficult projects in the scientific Python ecosystem, with tips for how to get help when needed and contribute productively to an established open source community.


I have three goals for presenting my work to the community at SciPy. My primary goal is to explain how I went from zero experience with the CPython C API to becoming a NumPy maintainer and shipping a big new feature in a little more than a year. I want to share my experience as an example of how to pull off these kinds of efforts given an opportunity and sufficient motivation. Additionally, I want to explain and demonstrate a big new feature for NumPy to the community - support for NumPy arrays containing variable-width UTF-8 encoded strings. I also want to make the community aware that writing non-trivial DTypes outside of NumPy is now possible using the new DType API shipping in NumPy 2.0.

I will first give an overview of the state of support for string data in NumPy before the 2.0 release, highlighting the performance, memory usage, and usability problems inherent in only offering support for fixed-width ASCII or 32-bit Unicode character strings. I'll describe how the community has worked around this limitation in packages like pandas and scikit-learn, and how, after waiting years for someone to take an interest and fix this independently, community stakeholders came together and proposed that NASA fund, among other things, adding a new variable-width string DType to NumPy.
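To make the fixed-width limitation concrete, here is a minimal sketch that assumes only a standard NumPy install. The sample strings are made up for illustration. It shows that every element of a fixed-width Unicode array is padded to the length of the longest string at 4 bytes per character, and that assigning a longer string later is silently truncated.

```python
import numpy as np

# Hypothetical sample data: two short strings and one long one.
words = ["a", "bb", "a much longer string than the others"]
longest = max(len(w) for w in words)

# NumPy infers a fixed-width Unicode dtype sized for the longest string.
arr = np.array(words)

# Each element takes 4 bytes per character of the longest string,
# no matter how short the element itself is.
bytes_per_item = arr.dtype.itemsize
total_bytes = arr.nbytes

# Assigning a string longer than the fixed width truncates it silently.
arr[0] = "x" * 100
truncated_len = len(arr[0])

print(arr.dtype, bytes_per_item, total_bytes, truncated_len)
```

The byte-string dtype (`dtype="S"`) stores one byte per character instead of four, but it is fixed-width in the same way.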

I will explain how I overcame my initial trepidation and pre-existing fear of understanding the C core of NumPy by starting with small goals, meeting them, and starting again with something more ambitious. I will explain how I got involved in the NumPy community, found how to ask questions in a way that would get me the answers I needed, and eventually grew comfortable working with the guts of NumPy. My goal will be to model to the audience that they too can do this and that it is possible to take on an enormous, complicated codebase like NumPy and become comfortable with it.

Next, I will introduce the new NumPy DType API at a high level and explain how it can be used to define a NumPy data type outside of NumPy. I will highlight that the API is now generally available and there's a lot of low-hanging fruit for community-defined data types to be implemented and possibly upstreamed to NumPy once a need has been demonstrated.

I will then explain the design of StringDType at a high level and outline its structure and how I incrementally implemented it. I will present benchmarks demonstrating that it is faster and more memory-efficient than both the fixed-width Unicode string DType and object arrays of Python strings, and show usage examples that demonstrate the features of the DType, including native support for missing data.

The final part of the talk will discuss efforts to aid community adoption of the DType, including support for the new DType in Pandas.

I am a software engineer at Quansight Labs where I help maintain NumPy and contribute to open source software on behalf of consulting clients. My background is in astrophysics: I completed my PhD in 2015 at UC Santa Cruz and worked as a postdoc and research scientist at the University of Illinois. During my academic career I became increasingly involved in community open source projects, contributing to projects across the scientific Python ecosystem and serving as a maintainer of the yt project. Since then I've transitioned from academia to industry, but I still believe strongly in open science and the importance of community research software to building reproducible scientific workflows.
