07-08, 13:30–17:30 (US/Pacific), Room 315
Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets. Many real-world datasets have hierarchical or heterogeneous structure and are best organized as groups of related data arrays. Through xarray.DataTree, the xarray data model now supports opening datasets with a hierarchical structure of groups, such as HDF5 files and Zarr stores. This expanded data model is general enough to manage data across different scientific disciplines, including geosciences and biosciences. This hands-on tutorial focuses on intermediate and advanced workflows using xarray to analyze real-world hierarchical data.
Xarray is an open-source Python project that makes working with complex, multi-dimensional arrays elegant, intuitive, and efficient. Real-world datasets are often a collection of many related variables on a common grid rather than raw numbers. Such datasets are common in the disciplines of earth science, astronomy, biology, healthcare, infrastructure, and finance. These datasets are more than just arrays of values: they have labels describing how array values map to locations in dimensions such as space and time, and metadata that describes how the data was collected and processed. Xarray uses these labels directly: for example, the Pandas-inspired label-based syntax temperature.sel(place="Boston") is more intuitive and less error-prone than the positional NumPy syntax temperature[0].
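The contrast between label-based and positional selection can be sketched as follows. This is a minimal illustration, not tutorial material: the place names other than Boston and all temperature values are made up for the example.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: three made-up temperature readings, one per place.
temperature = xr.DataArray(
    np.array([12.5, 18.0, 9.5]),
    dims="place",
    coords={"place": ["Boston", "Austin", "Seattle"]},
    name="temperature",
)

# Label-based selection: the intent is explicit in the code.
by_label = temperature.sel(place="Boston")

# Positional selection: correct only if Boston happens to be stored first.
by_position = temperature[0]

print(by_label.item(), by_position.item())
```

Both expressions return the same value here, but only the label-based form keeps working if the order of places changes.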
Xarray recently gained first-class support for hierarchical data through the release of xarray.DataTree, which can be used to analyze data with hierarchical or heterogeneous structure. The DataTree model maps to an entire HDF5 file containing many groups, a structure familiar to scientists across many different domains. The same model maps onto a multi-group Zarr store, which enables data-proximate computation on massive cloud-native data repositories.
In this hands-on tutorial, users will work with example data from multiple fields of science (including biology and geosciences) to achieve these learning objectives:
- Understand xarray’s core data structures:
  - Named arrays and coordinates (Variable)
  - Groups of arrays with coordinates (DataArray and Dataset)
  - Hierarchical trees of related groups (DataTree)
- Understand how to map typical xarray computations and workflows over hierarchical data,
- Understand which common storage formats correspond to the DataTree model, focusing on HDF5 and Zarr,
- Open a public Zarr store in the cloud and manipulate the contents,
- Use Dask to parallelize the analysis of large hierarchical datasets.
This hands-on tutorial assumes participants have some familiarity with Jupyter Notebooks, NumPy, Pandas, and Xarray, and focuses on intermediate workflows using hierarchical real-world datasets. All material will be presented in curated Jupyter Notebooks with exercises to solidify understanding of key concepts. Tutorial material is available online with instructions for running examples on free hosted infrastructure or on a local computer. No specific scientific domain expertise is required to participate effectively in this tutorial. Example datasets will either be small enough to download locally or available as Zarr stores in public cloud buckets.
We encourage participants to review last year’s tutorial before attending and to bring their questions and enthusiasm to make our 4-hour session as interactive as possible!
Tom Nicholas is a core developer of Xarray and the original author of xarray.DataTree. He has made numerous contributions throughout the Pangeo stack, including to VirtualiZarr, Cubed, xGCM, and pint-xarray. He currently works on the open-source Pangeo stack full-time at Earthmover. Prior to that he worked at a non-profit on open-source tools for monitoring carbon dioxide removal, and as a Research Software Engineer in Ryan Abernathey's Climate Data Science Lab at Columbia University. He first started using the open-source scientific Python stack during his PhD, when he was studying plasma turbulence in nuclear fusion reactors. He has delivered many Xarray tutorials, including at SciPy 2022, 2023, and 2024.