SciPy 2023

Tidy Geospatial Cubes
07-13, 15:50–16:20 (America/Chicago), Grand Salon C

The open-source project Xarray combines labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. Xarray has strong user bases in the physical sciences and the geospatial community. However, new users commonly struggle to fit their dataset into the Xarray model, and to conceptualize and construct an Xarray object that makes subsequent analysis steps easy (“dataset wrangling”). We take inspiration from the “tidy data” concept for dataframes, defined as “datasets structured to facilitate analysis” (Wickham, 2014), and attempt a definition of tidy data for the labeled array objects provided by Xarray.


The open-source project Xarray combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays ("cubes") to provide an intuitive and scalable interface for scientific analysis. Xarray is now widely used across many areas of scientific research, with a particularly strong user base in the physical sciences. New users commonly struggle to fit their dataset into the Xarray data model and, in particular, to conceptualize and construct an Xarray object that makes subsequent analysis steps easy (“dataset wrangling”). We take inspiration from the “tidy data” concept for dataframes, defined as “datasets structured to facilitate analysis” (Wickham, 2014), and attempt a definition of tidy data for the labeled array objects provided by Xarray.

A ‘tidy dataset’ framework will help streamline processing workflows across the physical sciences and provide a set of norms and principles to guide the use and construction of the large and complex datasets encountered in these fields. The utility of this exercise is twofold: it helps dataset producers construct more useful analysis-ready datasets, and it yields a set of guidelines that can help users wrangle their datasets into a form that enables convenient analysis with Xarray. In addition, a commonly defined concept for ‘tidy’ geospatial array data might enable development of ‘tidy’ tools that consume and produce tidy datasets (Wickham, 2014).

This presentation will examine three datasets and the processes of ‘tidying’ them. We will demonstrate various ways that a dataset may be ‘untidy’ (not conducive to analysis) and present a useful set of rules to define ‘tidy geospatial cubes.’ The examples we will discuss are: 1) Harmonized Landsat Sentinel-2 (HLS), a dataset of multispectral reflectance measurements; 2) Aquarius, a dataset of remotely sensed sea surface salinity measurements; and 3) ITS_LIVE, a multi-sensor dataset of ice velocity measurements for glaciers and ice sheets based on satellite image pairs. Our presentation will walk through common analytical workflows with these remote sensing datasets and highlight the organizational choices a user must make along the way (related to metadata, variables, coordinates, and dimensions) to efficiently arrive at a computational result with Xarray.
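As a rough illustration of one such organizational choice, the sketch below is not taken from the talk and does not use any of the three datasets. It assumes a small synthetic dataset in which the acquisition date is hidden in each variable name (the `red_YYYYMMDD` naming and the `tidy_by_time` helper are hypothetical) and shows one possible way to promote that date to a `time` dimension so that time-based selection and reduction become straightforward.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical "untidy" input: one variable per acquisition date, with the
# date stored in the variable name instead of in a coordinate.
untidy = xr.Dataset(
    {
        "red_20200101": (("y", "x"), np.full((3, 4), 0.1)),
        "red_20200117": (("y", "x"), np.full((3, 4), 0.2)),
        "red_20200202": (("y", "x"), np.full((3, 4), 0.3)),
    },
    coords={"y": np.arange(3), "x": np.arange(4)},
)


def tidy_by_time(ds, prefix="red_"):
    """Combine per-date variables into one variable with a time dimension."""
    names = sorted(n for n in ds.data_vars if str(n).startswith(prefix))
    # Parse the date stamps out of the variable names.
    stamps = [str(n)[len(prefix):] for n in names]
    time = pd.to_datetime(stamps, format="%Y%m%d").rename("time")
    # Give every array the same name, then concatenate along the new dimension.
    arrays = [ds[n].rename(prefix.rstrip("_")) for n in names]
    return xr.concat(arrays, dim=time)


tidy = tidy_by_time(untidy)

# With time as a labeled dimension, selection and reduction are one-liners.
january_mean = tidy.sel(time=slice("2020-01-01", "2020-01-31")).mean("time")
```

Here the measured quantity stays a single variable, while the date becomes a coordinate that Xarray can index, which is one reading of what a tidier layout could look like for this kind of data.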

Defining a common framework for labeled array objects will ease the learning curve for new users and minimize the time spent on data-wrangling steps. At present, the examples are satellite remote sensing datasets, and we recognize that there might be elements of the ‘tidy Xarray’ definition that are specific to this subdomain. We hope to spark a discussion that will help generalize the presented principles.

I am a graduate student at the University of Utah in the Geography Department. My research uses remote sensing data and other tools to study recent variability of alpine glaciers in High Mountain Asia. I am excited to return for my second SciPy after attending for the first time in 2022!

Scott is a research scientist in the University of Washington (UW) Department of Earth and Space Sciences and a data science fellow at the eScience Institute. He works on numerous NASA-funded efforts to develop open cloud computing solutions for data-intensive research.
