SciPy 2024

Making Research Data Flow with Python
07-11, 10:45–11:15 (US/Pacific), Ballroom

Many tools exist for large-scale data transfer (tens of terabytes or more), but they often don't match the needs of scientific data flows. In this talk, I'll explain how we built the 'librarian' framework with FastAPI, postgres, and Globus to ease this challenge. The librarian was designed for the Simons Observatory's petabyte-scale data transfer, and I'll use it to cover building reliable web services, flexible development with dependency injection, effective testing with pytest, and deployment using NERSC's Spin. I hope to demystify web and database programming for a scientific audience.


Moving any significant quantity of data (tens of TB or more, or even just thousands of files or more) is surprisingly difficult. There are many tools available (rucio, iRODS, datalad, ...) but they are frequently over-provisioned or feature-incomplete, or have data orchestration paths that are incompatible with the required data flow. This leads to a need for homegrown solutions, which is at odds with the serious aversion that many scientific programmers have to web and database development.

In this talk, I will attempt to demystify these topics by talking about building the librarian data transfer orchestration framework with FastAPI, postgres, and Globus. The librarian is designed to manage the transfer of raw data from the Simons Observatory to various sites across the world for analysis, scaling up to ~1 petabyte a year once full science operations commence. I will describe the process of building a persistent web service, how development practices such as dependency injection allow for flexibility, testing using pytest (which is nontrivial given the many moving parts and multiple independent processes!), and deploying at NERSC using their Spin facility.
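As a rough illustration of the dependency injection idea mentioned above, here is a minimal sketch using only the Python standard library. It is not the librarian's actual code and does not use FastAPI's own dependency system; the names TransferBackend, FakeBackend, and queue_transfers, along with the file and site names, are hypothetical. The point is only that the transfer logic receives its backend as an argument, so a test can hand it a fake that never touches the network.

```python
from dataclasses import dataclass, field
from typing import Protocol


class TransferBackend(Protocol):
    """Hypothetical interface: anything that can move one file."""

    def send(self, source: str, destination: str) -> bool: ...


@dataclass
class FakeBackend:
    """Test double that records requests instead of moving data."""

    sent: list[tuple[str, str]] = field(default_factory=list)

    def send(self, source: str, destination: str) -> bool:
        self.sent.append((source, destination))
        return True


def queue_transfers(
    files: list[str], destination: str, backend: TransferBackend
) -> list[str]:
    """Send each file with the injected backend; return the ones that failed."""
    return [name for name in files if not backend.send(name, destination)]


# In a test, inject the fake; in production, a real backend would be passed in.
backend = FakeBackend()
failed = queue_transfers(["obs_001.h5", "obs_002.h5"], "remote-site", backend)
```

Because queue_transfers never constructs its own backend, swapping the transfer mechanism (or replacing it with a fake under pytest) requires no change to the function itself.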

The intended audience for this talk is anyone with scientific programming experience, but without any web programming experience. It is all too common that we have folks who are expert numpy users, but have never touched a database (despite it being an extremely useful data structure)!

Attendees will come away with an understanding of what goes into building a basic web service. This talk is not intended as a full tutorial, but rather an overview of how one goes from "I need these computers to talk to each other" to a reliable production service. Detailed introductions to the underlying libraries and topics will be left to attendees to investigate themselves, though the talk will contain significant signposting.

In addition, some attendees may leave the talk finding that the open source librarian (or one of the other open source tools I will cover in a short overview at the start) is already appropriate for their needs.

I am a Research Software Engineer at the University of Pennsylvania, USA, working on data management and visualisation for the Simons Observatory. I also have interests in numerical galaxy formation simulations.

I was previously a postdoctoral researcher in astrophysics at the MIT Kavli Institute in Massachusetts, USA. I did my PhD at the Institute for Computational Cosmology at the University of Durham in the UK.