07-10, 15:50–16:20 (US/Pacific), Room 315
The best way to distribute large scientific datasets is via the cloud, in cloud-optimized formats. But often this data is stuck in archival pre-cloud file formats such as netCDF.
VirtualiZarr makes it easy to create “virtual” Zarr datacubes, allowing performant access to huge archival datasets as if they were in the cloud-optimized Zarr format, without duplicating any of the original data.
We will demonstrate using VirtualiZarr to generate references to archival files, combine them into one array datacube using xarray-like syntax, commit them to Icechunk, and read the data back with zarr-python v3.
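A minimal sketch of that workflow, assuming VirtualiZarr v1.x-style accessors and the current Icechunk Python API (the file names are placeholders, and exact function names and signatures may differ between releases):

    import icechunk
    import xarray as xr
    from virtualizarr import open_virtual_dataset

    # 1. Generate metadata-only references to two archival netCDF files
    vds1 = open_virtual_dataset("day1.nc")
    vds2 = open_virtual_dataset("day2.nc")

    # 2. Combine the virtual datasets along "time" with ordinary xarray syntax
    #    (coords/compat options avoid comparing chunk data that is never loaded)
    combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")

    # 3. Commit the virtual datacube to an Icechunk repository
    storage = icechunk.local_filesystem_storage("./datacube")
    repo = icechunk.Repository.create(storage)
    session = repo.writable_session("main")
    combined.virtualize.to_icechunk(session.store)
    session.commit("Add virtual references to days 1 and 2")

    # 4. Read the data back through the Zarr interface (zarr-python v3);
    #    reading virtual chunks may also require configuring Icechunk's
    #    virtual chunk containers/credentials for the referenced storage.
    ds = xr.open_zarr(repo.readonly_session("main").store, consolidated=False)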
Many scientific datasets, including level-3 geoscience data products, are distributed as collections of thousands of individual files or granules, which makes it difficult to address the data as a coherent datacube. Worse, the data is often stuck in pre-cloud archival file formats, precluding efficient access from cloud object storage.
VirtualiZarr [1] is a Python tool for creating “virtual” Zarr datacubes, enabling cloud-optimized access to a range of archival file formats (e.g. netCDF and TIFF) without copying the original data. Data is accessed via Icechunk, an open-source cloud-native transactional storage engine, which can store “virtual Zarr chunks” in the form of references to byte ranges in other objects.
Virtualization provides a win-win-win for users, data engineers, and data providers: users get fast-opening, Zarr-compliant stores that work performantly out of the box with libraries like Xarray and Dask; data engineers need only add a lightweight virtualization layer on top of existing data (even without the data provider’s involvement); and data providers don’t have to modify their legacy files to offer cloud-optimized access.
VirtualiZarr works by creating a metadata-only representation of files in legacy formats, including references to the byte ranges occupied by specific chunks of data on disk. VirtualiZarr is similar to the Kerchunk package (which inspired it), except that it uses an array-level representation of the underlying data, stored in “chunk manifests”. These metadata-only references are saved to disk either via the Kerchunk on-disk reference file format or via the Icechunk transactional storage engine, which enables later cloud-optimized access using zarr-python v3 and Xarray.
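Conceptually, a chunk manifest is just a mapping from chunk indices to byte ranges within the original files; the paths, offsets, and lengths below are made up purely for illustration:

    # Conceptual contents of a chunk manifest for an array split into two
    # chunks along its first axis; each entry locates one chunk's bytes
    # inside an archival file.
    manifest = {
        "0.0": {"path": "s3://bucket/day1.nc", "offset": 6144, "length": 48000},
        "1.0": {"path": "s3://bucket/day2.nc", "offset": 6144, "length": 48000},
    }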
This approach has three advantages:
- An array-level abstraction means users of VirtualiZarr do not need to learn a new interface: they can use Xarray to manipulate virtual representations of their data and so arrange the files comprising their datacube.
- “Chunk manifests” enable writing the virtualized arrays out directly as valid Zarr stores (using Icechunk), meaning Zarr implementations in any language can read the archival data. Zarr as a “universal reader” will allow data providers to serve all their archival multidimensional data via a common high-performance interface, regardless of the actual underlying file formats.
- The integration with Icechunk allows “virtual” and “native” chunks to be treated interchangeably, so an initial version of a datacube pointing at archival files can be gradually updated with new Icechunk-native chunks, with the safety of ACID transactions, and without data users needing to make any distinction (see the sketch below).
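As a sketch of that last point, appending newly produced native chunks to a virtual datacube is just another transactional commit (this reuses the repo object from the earlier sketch; exact write options depend on the Icechunk and xarray releases):

    import xarray as xr

    # New data arrives as an ordinary in-memory xarray Dataset
    new_ds = xr.open_dataset("day3.nc").load()

    # Append it as native Zarr chunks alongside the existing virtual ones,
    # inside a single ACID transaction
    session = repo.writable_session("main")
    new_ds.to_zarr(session.store, append_dim="time", consolidated=False)
    session.commit("Append day 3 as native chunks")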
This talk is useful to anyone who wants to make large scientific datasets publicly available via the cloud. You will learn how to use VirtualiZarr and Icechunk to create a virtual Zarr datacube, using the [C]Worthy Ocean Alkalinity Enhancement Efficiency Map [2] dataset as an example. This dataset consists of ~50TB of data spread across ~500,000 netCDF files, so virtualizing it requires complex array manipulations and serverless frameworks to generate references to that many files at scale. Nevertheless, VirtualiZarr’s xarray-compatible API makes this possible in essentially just three lines of code, as sketched below.
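A hedged illustration of those three essential steps (the glob pattern is a stand-in for listing the real granules, session is an Icechunk writable session as in the first sketch, and the combine step assumes coordinate variables were opened as loadable values so xarray can align on them):

    from concurrent.futures import ProcessPoolExecutor
    from glob import glob

    import xarray as xr
    from virtualizarr import open_virtual_dataset

    files = sorted(glob("granules/*.nc"))  # hypothetical listing of netCDF granules

    # At real scale the map below would be fanned out over a serverless
    # framework rather than a local process pool.
    with ProcessPoolExecutor() as pool:
        virtual_datasets = list(pool.map(open_virtual_dataset, files))  # 1. reference
    combined = xr.combine_by_coords(virtual_datasets)                   # 2. combine
    combined.virtualize.to_icechunk(session.store)                      # 3. commit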
[1] https://github.com/zarr-developers/VirtualiZarr
[2] https://carbonplan.org/research/oae-efficiency-explainer
Tom Nicholas is a core developer of Xarray and the original author of xarray.DataTree. He has made numerous contributions throughout the Pangeo stack, including to VirtualiZarr, Cubed, xGCM, and pint-xarray. He currently works on the open-source Pangeo stack full-time at Earthmover. Prior to that he worked at a non-profit on open-source tools for monitoring carbon dioxide removal, and as a Research Software Engineer in Ryan Abernathey's Climate Data Science Lab at Columbia University. He first started using the open-source scientific Python stack during his PhD, when he was studying plasma turbulence in nuclear fusion reactors. He has delivered many Xarray tutorials, including at SciPy 2022, 2023, and 2024.