07-11, 11:25–11:55 (US/Pacific), Room 315
At the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC), we're doing the heavy lifting to make large geospatial datasets easily accessible from the cloud. No more downloading data. No more worrying about quirky metadata or missing dimensions. No more concatenating hundreds or thousands of files together. Just fire up a Jupyter notebook in the Amazon Web Services (AWS) US-West-2 region, get some free temporary AWS credentials, open our Zarr stores, and start doing your science.
We didn't actually set out to make public Zarr stores. We set out to migrate an aging on-premises web tool, Giovanni, into the cloud. Giovanni (https://giovanni.gsfc.nasa.gov/giovanni/) allows users to visualize and analyze about 2,000 geospatial variables for free in a browser without downloading any data. Moving that analytic capability to the cloud meant figuring out how to move all that data to the cloud. After first trying Parquet (https://www.youtube.com/watch?v=CbH9SVrPSUA), we landed on Zarr as a more appropriate file format for geospatial data.
Once we started moving data into the cloud, we realized that the same things we like about our Zarr stores (they're re-chunked for speed, they've been normalized to simplify use, and they have consistent metadata) might make them appealing to users as well. So we've started down the road of making our Zarr stores publicly accessible.
In this presentation, we will spend a little time on why we ended up moving from Parquet to Zarr. We'll explain what Zarr is and why you should know about it for cloud-based data analysis. We'll describe how we've restructured the data to make analysis easier, using some deliciously starchy technical terms (pancakes and churros and scones, oh my!), and how we've standardized metadata, incorporating the nascent GeoZarr specification, https://github.com/zarr-developers/geozarr-spec.
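To give a feel for why re-chunking matters, here is a minimal sketch that counts how many chunks a read has to touch under two layouts. It assumes the common informal usage in which a "pancake" chunk spans the full map for a single time step and a "churro" chunk spans the full time axis for a small patch of the map; the talk may define these terms differently. The grid shape, the chunk shapes, and the `chunks_touched` helper are all hypothetical and do not describe the actual GES DISC stores.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapped by a rectangular read.

    selection: one (start, stop) index pair per dimension, stop exclusive.
    chunk_shape: chunk length along each dimension.
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk touched
        last = (stop - 1) // size    # index of the last chunk touched
        count *= last - first + 1
    return count


# Hypothetical (time, lat, lon) grid: ten years of daily data at 0.1 degree.
grid_shape = (3650, 1800, 3600)

# Assumed chunk layouts, for illustration only.
pancake = (1, 1800, 3600)   # whole map, one time step per chunk
churro = (3650, 10, 10)     # whole time axis, small spatial patch per chunk

# Two typical reads: a full time series at one grid cell, and one daily map.
point_series = ((0, 3650), (900, 901), (1800, 1801))
one_day_map = ((100, 101), (0, 1800), (0, 3600))

for name, layout in (("pancake", pancake), ("churro", churro)):
    print(name,
          "time series:", chunks_touched(point_series, layout),
          "map:", chunks_touched(one_day_map, layout))
```

Under these assumptions, the time series read touches 3,650 pancake chunks but a single churro chunk, while the one-day map touches a single pancake chunk but 64,800 churro chunks. Neither layout is best for every access pattern, which is why the chunking has to match the analysis you want to run.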
The first set of data variables that we are making public as part of our beta release consists of 24 hydrology-focused variables. We'll walk you through an example hydrology study to whet your appetite, comparing the simplicity of using Zarr stores with the traditional search-download-analyze workflow.
While we are in this beta phase, we'd love to get feedback on the usability of the data we've made public. Can you find what you are looking for? Can you run the analysis you want to run? Is there something missing from the metadata? Come talk to us at SciPy or shoot us a message later with your feedback.
Why use our Zarr stores rather than other stores you've found online? We are the archive of record. This means we have staff dedicated to professionally managing and curating the data, metadata, and documentation. When you come to the source, you get up-to-date, tested, and supported datasets.
Our ultimate goal is to make all the data in Giovanni dual use: a performant data cache for our own algorithms and public access for yours. Help us with your feedback to make that public access as simple as possible.
I'm a principal software engineer at the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC). Our prime directive is to archive Earth science data and make that data available to the public for free. Since joining the GES DISC, I've mainly focused on the services end of public data access, working on tools that allow users to do some initial data exploration and visualization without having to download, understand, and open raw data files. I'm happy to wax poetic about metadata, interoperability, and well-designed colorbars.