SciPy 2024

Simplifying analysis of hierarchical HDF5 and NetCDF4 files with xarray-datatree
07-11, 16:30–17:00 (US/Pacific), Room 315

Xarray-datatree [1] is a Python package that supports HDF (Hierarchical Data Format) files with hierarchical group structures by providing a tree-like data structure in xarray. When an HDF file is opened with Datatree, a DataTree object is created that contains all of the groups in the file. Once the DataTree object is instantiated, every group can be accessed through the tree, which eliminates the need for a user to open each group and subgroup separately to reach the observational data.

We will present our use case for Datatree in NASA’s Harmony Level 2 Subsetter (HL2SS). HL2SS provides variable and dimension subsetting for Earth observation data from different NASA data centers. To subset hierarchical datasets without Datatree, HL2SS flattens the entire data structure into a new file by copying all of the grouped and subgrouped variables into the root group. The variable or dimension subset is then performed on this new file. However, the subsetted output must have the same hierarchical structure as the original file, so the flattened file is unflattened, its attributes are copied, and the variables are regrouped to restore the original group hierarchy. With the open_datatree() function, HL2SS can open datasets containing multiple groups at once and have all of their group hierarchies preserved. This would significantly streamline the HL2SS workflow, since it would eliminate the need to flatten and unflatten grouped datasets.
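The flatten, subset, and unflatten round trip described above can be sketched in a few lines of plain Python. This is only an illustration and not HL2SS code: nested dicts stand in for HDF5/NetCDF4 groups, and the `__` separator, the function names, and the sample granule are all hypothetical.

```python
# Illustrative sketch only: nested dicts stand in for HDF5/NetCDF4 groups.
# The separator, function names, and sample data are hypothetical.
SEP = "__"  # assumed not to occur inside any real group or variable name


def flatten(group, prefix=""):
    """Copy every variable from every nested group into one flat mapping."""
    flat = {}
    for name, value in group.items():
        key = f"{prefix}{SEP}{name}" if prefix else name
        if isinstance(value, dict):
            # A subgroup: recurse and carry its path along in the key.
            flat.update(flatten(value, key))
        else:
            # A variable: store it at the root under its path-derived name.
            flat[key] = value
    return flat


def subset(flat, wanted):
    """Keep only the requested flattened variable names."""
    return {key: value for key, value in flat.items() if key in wanted}


def unflatten(flat):
    """Rebuild the nested group hierarchy from the path-derived names."""
    tree = {}
    for key, value in flat.items():
        *groups, name = key.split(SEP)
        node = tree
        for group_name in groups:
            node = node.setdefault(group_name, {})
        node[name] = value
    return tree


granule = {
    "title": "example granule",
    "science": {
        "temperature": [280.1, 281.4],
        "geolocation": {"lat": [10.0, 10.5], "lon": [-120.0, -119.5]},
    },
}

flat = flatten(granule)
kept = subset(flat, {"science__temperature", "science__geolocation__lat"})
result = unflatten(kept)
print(result)
```

Even in this toy form, the group hierarchy survives only because it is encoded in the flattened names and decoded again afterwards. A tree-aware structure that keeps the groups in place makes both extra passes unnecessary.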

[1] https://github.com/xarray-contrib/datatree


NASA’s Earth Observing System Data and Information System (EOSDIS) contains thousands of Earth science datasets from satellites, models, and field campaigns. Datasets from EOSDIS are typically stored in HDF-based formats such as HDF5 and NetCDF4 (Network Common Data Form). HDF and its derivatives are well supported in the Earth science community. The HDF specification allows for directory-like structures, known as groups. In NASA EOSDIS data, individual groups often contain the observational data, its associated metadata, and further nested groups.

Working with datasets that have a hierarchical group structure can be difficult because the groups are nested. Popular tools like xarray can open grouped datasets, but require additional configuration to access data that is not in the root group. Datasets containing multiple groups often store only metadata in the root group. To work with such data in xarray, users must open each individual group and subgroup as a separate dataset.

I am an earth scientist and software developer at NASA. I create tools to help solve ecological and environmental issues.

I am an atmospheric scientist with a background in cloud microphysics and numerical modeling. I previously worked on the Kerchunk project and am interested in enabling better cloud access to existing archival datasets. I recently finished my PhD at the University of California, Davis and am now working at NASA's Goddard Earth Sciences Data and Information Services Center (GES DISC) where I help develop data subsetting and service code alongside maintaining data production and archival systems.

Tom currently works at [C]Worthy, a non-profit building the computation tools needed to ensure safe, effective ocean-based carbon dioxide removal.

Before that he was a Research Software Engineer working in Ryan Abernathey's Climate Data Science Lab at Lamont Doherty Earth Observatory, Columbia University.

He first started using the open-source scientific Python stack during his PhD, when he was studying plasma turbulence in nuclear fusion reactors.

He is a member of the xarray core development team, and also works on Cubed, xGCM, pint-xarray, and xarray-datatree. He is heavily involved with the Pangeo community for Big Data Geoscience.
