SciPy 2025

Hierarchical Data Analysis with Xarray DataTree & Zarr
2025-07-08, Ballroom C

Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets. Many real-world datasets have hierarchical or heterogeneous structure and are best organized as groups of related data arrays. Through xarray.DataTree, the xarray data model now supports opening datasets with a hierarchical structure of groups, such as HDF5 files and Zarr stores. This expanded data model is general enough to manage data across different scientific disciplines, including geosciences and biosciences. This hands-on tutorial focuses on intermediate and advanced workflows using xarray to analyze real-world hierarchical data.


Xarray is an open-source Python project that makes working with complex, multi-dimensional arrays elegant, intuitive, and efficient. Real-world datasets are often a collection of many related variables on a common grid rather than raw numbers. Such datasets are common in the disciplines of earth science, astronomy, biology, healthcare, infrastructure, and finance. These datasets are more than just arrays of values: they have labels describing how array values map to locations in dimensions such as space and time, and metadata that describes how the data was collected and processed. For example, the Pandas-inspired label-based syntax temperature.sel(place="Boston") is more intuitive and less error-prone than the positional NumPy syntax temperature[0].
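The contrast between label-based and positional selection can be sketched as follows. This is a minimal illustration, not material from the tutorial notebooks: the array values, the "day" dimension, and the place names other than Boston are made up for the example.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: temperatures for three places over four days.
temperature = xr.DataArray(
    np.array(
        [
            [10.0, 11.0, 12.0, 13.0],
            [20.0, 21.0, 22.0, 23.0],
            [30.0, 31.0, 32.0, 33.0],
        ]
    ),
    dims=("place", "day"),
    coords={"place": ["Boston", "Seattle", "Austin"], "day": [0, 1, 2, 3]},
    name="temperature",
)

# Label-based selection states what is wanted, independent of storage order.
by_label = temperature.sel(place="Boston")

# Positional indexing is only correct while "Boston" happens to be row 0.
by_position = temperature[0]

# Both routes reach the same data here; only the first documents its intent.
print(by_label.equals(by_position))

# Labels also compose across dimensions.
seattle_day_2 = float(temperature.sel(place="Seattle", day=2))
print(seattle_day_2)
```

If the rows were ever reordered, the `.sel` call would still return the Boston data, while `temperature[0]` would silently return a different place.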

Xarray recently gained first-class support for hierarchical data through the release of xarray.DataTree, which can be used to analyze data with hierarchical or heterogeneous structure. The DataTree model maps to an entire HDF5 file containing many groups, a structure familiar to scientists across many different domains. The model similarly maps onto a multi-group Zarr store, which enables data-proximate computation on massive cloud-native data repositories.

In this hands-on tutorial, users will work with example data from multiple fields of science (including biology and geosciences) to achieve these learning objectives:
- Understand xarray’s core data structures
  - Named arrays and coordinates (Variable)
  - Groups of arrays with coordinates (DataArray and Dataset)
  - Hierarchical trees of related groups (DataTree)
- Understand how to map typical xarray computations and workflows over hierarchical data,
- Understand which common storage formats correspond to the DataTree model, focusing on HDF5 and Zarr,
- Open a public Zarr store in the cloud and manipulate the contents,
- Use Dask to parallelize the analysis of large hierarchical datasets.


Prerequisites:

This hands-on tutorial assumes participants have some familiarity with Jupyter Notebooks, NumPy, Pandas, and Xarray, and focuses on intermediate workflows using hierarchical real-world datasets. All material will be presented in curated Jupyter Notebooks with exercises to solidify understanding of key concepts. Tutorial material is available online with instructions for running examples on free hosted infrastructure or on a local computer. No specific scientific domain expertise is required to participate effectively in this tutorial. Example datasets will either be small enough to download locally or available as Zarr stores in public cloud buckets.

We encourage participants to review last year’s tutorial before attending and to bring questions and enthusiasm to make our 4-hour session as interactive as possible!

Installation Instructions:

This tutorial will be presented on the cloud platform Nebari. Instructions for Nebari are here -- you should have received a list of coupon codes by email.

Deepak Cherian is an Xarray maintainer and Forward Engineer at Earthmover. Previously he was an oceanographer at the National Center for Atmospheric Research. He helps build and maintain many parts of the scientific Python ecosystem, including Xarray, Dask, Zarr, and related projects.

I recently completed my PhD, during which I built software to manage the acquisition of combined epifluorescence and single-cell Raman spectroscopy time-lapse data. I extensively used Xarray and Zarr in both the data acquisition and analysis of this project. I have also presented multiple workshops on using Python for scientific data analysis and on using the SciPy stack (including Xarray) for microscopy data.

During graduate school, I discovered a passion for contributing to open-source scientific projects, which led me to my current role as an Xarray Community Developer at Earthmover. In this role, I am focused on improving Xarray for use cases in biological research.

Eni is a scientific software developer at NASA Goddard’s Earth Science and Information Services Center (GESDISC) and an Xarray core developer. At GESDISC she uses open-source tools to create services in support of NASA’s vast earth science catalog and contributes to several enterprise NASA ESDIS tools. She is interested in using computational science to improve our understanding of the natural world.

Tom Nicholas is a core developer of Xarray and the original author of xarray.DataTree. He has made numerous contributions throughout the Pangeo stack, including to VirtualiZarr, Cubed, xGCM, and pint-xarray. He currently works on the open-source Pangeo stack full-time at Earthmover. Prior to that he worked at a non-profit on open-source tools for monitoring carbon dioxide removal, and as a Research Software Engineer in Ryan Abernathey's Climate Data Science Lab at Columbia University. He first started using the open-source scientific Python stack during his PhD, when he was studying plasma turbulence in nuclear fusion reactors. He has delivered many Xarray tutorials, including at SciPy 2022, 2023, and 2024.

Scott Henderson is a research scientist in the University of Washington (UW) Department of Earth and Space Sciences and a data science fellow at the eScience Institute. His research involves applications of satellite measurements for characterizing cryospheric processes and geohazards, and he enjoys contributing to open-source software like Xarray that accelerates scientific research!

Justus Magin is a research engineer working at the Laboratoire de l’Oceanographie Physique et Spatiale (LOPS) in Brest, France, where he assists scientists in making computations scalable. He is also an Xarray maintainer and contributes to many projects in the Pangeo ecosystem, most notably to pint-xarray and xdggs.

Joe Hamman is a climate scientist, engineer, and the co-founder and CTO of Earthmover, where he leads the development of Arraylake, a cloud platform for scientific data teams. Previously, he was a founder and Technology Director at CarbonPlan and a scientist at the Climate and Global Dynamics Laboratory at the National Center for Atmospheric Research. He holds a Ph.D. in Civil and Environmental Engineering from the University of Washington, and is a licensed Professional Engineer in Washington State. He co-founded the Pangeo Project and is a core developer of both the Xarray and Zarr-Python projects.

Negin Sobhani is a High Performance Computing consultant and computational atmospheric scientist at the National Center for Atmospheric Research (NCAR). She has extensive experience developing and supporting open-source tools and infrastructure that improve the performance and accessibility of Earth System models, bridging the gap between data science, atmospheric science, and software engineering. Her broader work encompasses the development of large-scale distributed training, optimization of resource utilization, and data pipelines across advanced computing environments for geoscience applications.
