Tom Nicholas
Tom Nicholas is a core developer of Xarray and the original author of xarray.DataTree. He has made numerous contributions throughout the Pangeo stack, including to VirtualiZarr, Cubed, xGCM, and pint-xarray. He currently works on the open-source Pangeo stack full-time at Earthmover. Prior to that he worked at a non-profit on open-source tools for monitoring carbon dioxide removal, and as a Research Software Engineer in Ryan Abernathey's Climate Data Science Lab at Columbia University. He first started using the open-source scientific Python stack during his PhD, while studying plasma turbulence in nuclear fusion reactors. He has delivered many Xarray tutorials, including at SciPy 2022, 2023, and 2024.

Sessions
Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets. Many real-world datasets have hierarchical or heterogeneous structure, and are best organized as groups of related data arrays. Through xarray.DataTree, the xarray data model now supports opening datasets with a hierarchical structure of groups, such as HDF5 files and Zarr stores. This expanded data model is general enough to manage data across different scientific disciplines, including geosciences and biosciences. This hands-on tutorial focuses on intermediate and advanced workflows using xarray to analyze real-world hierarchical data.
Cubed is a framework for distributed processing of large arrays without a cluster. Designed to respect memory constraints at all times, Cubed can express any NumPy-like array operation as a series of embarrassingly-parallel, bounded-memory steps. By using Zarr as persistent storage between steps, Cubed can run in a serverless fashion on both a local machine and on a range of Cloud platforms. After explaining Cubed’s model, we will show how Cubed has been integrated with Xarray and demonstrate its performance on various large array geoscience workloads.
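The idea of bounded-memory steps with persistent storage in between can be sketched without Cubed itself. The example below is not Cubed's API: it is a conceptual, standard-library-only illustration in which plain binary files stand in for Zarr, and the chunk size, file names, and helper functions are all hypothetical.

```python
import os
import tempfile
from array import array

CHUNK = 1000  # hypothetical memory budget: elements held in memory at once


def write_input(path, n):
    """Write the float64 values 0, 1, ..., n-1 to disk one chunk at a time."""
    with open(path, "wb") as f:
        for start in range(0, n, CHUNK):
            stop = min(start + CHUNK, n)
            array("d", (float(i) for i in range(start, stop))).tofile(f)


def chunk_sum(path, index):
    """Read a single chunk and return its partial sum.

    Each call touches only its own chunk, so the calls are independent
    and could run in parallel, in any order, on separate workers.
    """
    itemsize = array("d").itemsize
    with open(path, "rb") as f:
        f.seek(index * CHUNK * itemsize)
        raw = f.read(CHUNK * itemsize)
    values = array("d")
    values.frombytes(raw)
    return sum(values)


def bounded_sum(n):
    """Sum n values in two steps, persisting intermediates between them."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "input.bin")
        partials_path = os.path.join(tmp, "partials.bin")
        write_input(src, n)
        n_chunks = -(-n // CHUNK)  # ceiling division

        # Step 1: one independent task per chunk, results written to storage.
        with open(partials_path, "wb") as f:
            for i in range(n_chunks):
                array("d", [chunk_sum(src, i)]).tofile(f)

        # Step 2: combine the stored partial results. They are assumed to
        # fit in memory here; a larger job would reduce them in rounds.
        partials = array("d")
        with open(partials_path, "rb") as f:
            partials.frombytes(f.read())
        return sum(partials)


total = bounded_sum(2500)
print(total)
```

Because no step ever holds more than one chunk of input, peak memory depends on the chunk size rather than on the size of the array.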
The best way to distribute large scientific datasets is via the Cloud, in Cloud-Optimized formats. But often this data is stuck in archival pre-Cloud file formats such as netCDF.
VirtualiZarr makes it easy to create "Virtual" Zarr datacubes, allowing performant access to huge archival datasets as if they were in the Cloud-Optimized Zarr format, without duplicating any of the original data.
We will demonstrate using VirtualiZarr to generate references to archival files, combine them into one array datacube using xarray-like syntax, commit them to Icechunk, and read the data back with zarr-python v3.
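The notion of a reference to archival data can be sketched as a manifest of byte ranges. The example below is a conceptual, standard-library-only illustration and does not use the VirtualiZarr, Icechunk, or zarr-python APIs. The file layout, header, manifest fields, and helper names are all hypothetical, and writing the files here stands in for scanning archival files that already exist.

```python
import os
import struct
import tempfile

HEADER = b"HYPOTHETICAL-HEADER"  # stands in for format-specific metadata


def write_archive(path, chunks):
    """Write a header followed by packed float64 chunks.

    Returns a manifest mapping each chunk key to the byte range holding it.
    A real tool would build this by scanning an existing file instead.
    """
    manifest = {}
    with open(path, "wb") as f:
        f.write(HEADER)
        for key, values in chunks.items():
            payload = struct.pack(f"<{len(values)}d", *values)
            manifest[key] = {
                "path": path,
                "offset": f.tell(),
                "length": len(payload),
            }
            f.write(payload)
    return manifest


def read_chunk(manifest, key):
    """Fetch one chunk by reading only its byte range from the original file."""
    ref = manifest[key]
    with open(ref["path"], "rb") as f:
        f.seek(ref["offset"])
        raw = f.read(ref["length"])
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


with tempfile.TemporaryDirectory() as tmp:
    manifest_a = write_archive(os.path.join(tmp, "a.bin"), {"0": [1.0, 2.0]})
    manifest_b = write_archive(os.path.join(tmp, "b.bin"), {"0": [3.0, 4.0]})

    # Combining files into one logical array only renumbers chunk keys
    # along the concatenation axis; no array data is copied.
    combined = {"0": manifest_a["0"], "1": manifest_b["0"]}

    data = read_chunk(combined, "0") + read_chunk(combined, "1")
    first_offset = combined["0"]["offset"]

print(data, first_offset)
```

The combined manifest is small metadata that points into the original files, which is why a virtual datacube of this kind does not require duplicating the underlying data.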