SciPy 2024

Tom Nicholas

Tom currently works at [C]Worthy, a non-profit building the computational tools needed to ensure safe, effective ocean-based carbon dioxide removal.

Before that, he was a Research Software Engineer in Ryan Abernathey's Climate Data Science Lab at Lamont-Doherty Earth Observatory, Columbia University.

He started using the open-source scientific Python stack during his PhD, when he was studying plasma turbulence in nuclear fusion reactors.

He is a member of the xarray core development team, and also works on Cubed, xGCM, pint-xarray, and xarray-datatree. He is heavily involved with the Pangeo community for Big Data Geoscience.


Sessions

07-08
13:30
240min
Xarray: Friendly, Interactive, and Scalable Scientific Data Analysis
Negin Sobhani, Max Jones, Jessica Scheick, Don Setiawan, Tom Nicholas, Luis Lopez, Scott Henderson, Wietze Suijker

Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. This hands-on tutorial focuses on intermediate and advanced workflows using complex real-world data. We encourage participants to bring their own datasets, as we will dedicate ample time to applying tutorial concepts to datasets of interest!
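As a rough illustration of what "labeled" means here, the sketch below builds a small DataArray and then selects and reduces by name instead of by integer position or axis number. The temperature values, dimension names, and site labels are invented for this example and are not taken from the tutorial material.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: a temperature reading at 3 sites over 2 time steps.
temps = xr.DataArray(
    np.array([[10.0, 12.0, 14.0], [11.0, 13.0, 15.0]]),
    dims=("time", "site"),
    coords={"time": [0, 1], "site": ["a", "b", "c"]},
    name="temperature",
)

# Select by coordinate label rather than by integer position.
site_b = temps.sel(site="b")

# Reduce over a named dimension instead of an axis number.
time_mean = temps.mean(dim="time")
```

Here `site_b` keeps only the `time` dimension and `time_mean` keeps only the `site` dimension, so the result of each step still carries its labels.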

Tutorials
Ballroom B/C
07-11
16:30
30min
Simplifying analysis of hierarchical HDF5 and NetCDF4 files with xarray-datatree
Eniola Awowale, Lucas Sterzinger, Tom Nicholas, Nick Lenssen

Xarray-datatree [1] is a Python package that supports HDF (Hierarchical Data Format) files with hierarchical group structures by providing a tree-like data structure in xarray. When an HDF file is opened with Datatree, a DataTree object is created that contains all of the groups in the file. Once the DataTree object is instantiated, every group can be accessed through the tree, so a user no longer has to open each group and subgroup individually to reach the observational data.

We will present our use case for Datatree in NASA’s Harmony Level 2 Subsetter (HL2SS). HL2SS provides variable and dimension subsetting for Earth observation data from different NASA data centers. To subset hierarchical datasets without Datatree, HL2SS flattens the entire data structure into a new file by copying all of the grouped and subgrouped variables into the root group. The variable or dimension subset is then taken from this new file. However, the subsetted output must have the same hierarchical structure as the original file, so the flattened file is unflattened, its attributes are copied, and the variables are regrouped to restore the original group hierarchy. With the open_datatree() function, HL2SS can open datasets containing multiple groups at once and have all of their group hierarchies preserved. This would significantly streamline the HL2SS workflow, since it would eliminate the need to flatten and unflatten grouped datasets.
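The difference between the two workflows can be sketched with plain nested dictionaries standing in for file groups. This is a toy model only: it is neither HL2SS code nor the xarray-datatree API, and the function names, the path-style variable names used in the flat form, and the sample granule are all hypothetical.

```python
def flatten(tree, prefix=""):
    """Copy every variable in a nested group tree into one flat mapping keyed by path."""
    flat = {}
    for name, value in tree.items():
        path = f"{prefix}/{name}"
        if isinstance(value, dict):  # a subgroup: recurse into it
            flat.update(flatten(value, path))
        else:  # a variable: store it under its full path
            flat[path] = value
    return flat


def unflatten(flat):
    """Rebuild the group hierarchy from path-keyed variables."""
    tree = {}
    for path, value in flat.items():
        *groups, name = path.strip("/").split("/")
        node = tree
        for group in groups:
            node = node.setdefault(group, {})
        node[name] = value
    return tree


def subset_tree(tree, keep):
    """Subset directly on the tree, so the hierarchy never has to be rebuilt."""
    out = {}
    for name, value in tree.items():
        if isinstance(value, dict):
            child = subset_tree(value, keep)
            if child:  # drop groups left with no variables
                out[name] = child
        elif name in keep:
            out[name] = value
    return out


# Hypothetical granule with a root variable, a group, and a subgroup.
granule = {
    "lat": [10, 20],
    "science": {"ozone": [1, 2], "quality": {"flag": [0, 1]}},
}
wanted = {"lat", "ozone"}

# Workflow without a tree structure: flatten, subset, then unflatten.
flat = flatten(granule)
subset_flat = {p: v for p, v in flat.items() if p.rsplit("/", 1)[-1] in wanted}
round_trip = unflatten(subset_flat)

# Workflow with a tree structure: subset in place.
direct = subset_tree(granule, wanted)
```

Both routes end with the same nested result, but the second never leaves the hierarchical representation, which is the simplification the abstract attributes to opening the file as a tree.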

[1] https://github.com/xarray-contrib/datatree

Earth, Ocean, Geo, and Atmospheric Science
Room 315
0min
Cubed: Bounded-Memory Serverless Array Processing in Xarray
Tom Nicholas

Cubed is a framework for distributed processing of large arrays without a cluster. Designed to respect memory constraints at all times, Cubed can express any NumPy-like array operation as a series of embarrassingly parallel, bounded-memory steps. By using Zarr as persistent storage between steps, and Lithops as an abstraction layer, Cubed can run in a serverless fashion on a range of cloud platforms. After explaining Cubed’s model, we will show how Cubed has been integrated with Xarray and demonstrate its performance on various large array workloads.
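The execution model described above can be imitated in a few lines of standard-library Python. This is a conceptual sketch only and does not use Cubed's API: a dictionary stands in for Zarr storage, a thread pool stands in for a serverless executor, and the chunk size, function names, and input data are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # elements per chunk; bounds how much data any single task holds


def write_chunked(store, name, values):
    """Split a sequence into fixed-size chunks and write each under its own key."""
    n_chunks = 0
    for start in range(0, len(values), CHUNK):
        store[(name, n_chunks)] = list(values[start:start + CHUNK])
        n_chunks += 1
    return n_chunks


def add_task(store, i):
    # Reads one chunk of each input and writes one chunk of output.
    a, b = store[("a", i)], store[("b", i)]
    store[("a_plus_b", i)] = [x + y for x, y in zip(a, b)]


def partial_sum_task(store, i):
    # Reduces one intermediate chunk to a single number.
    store[("partial", i)] = sum(store[("a_plus_b", i)])


def run_step(task, store, n_chunks):
    # Tasks within a step are independent, so they may run in any order or in parallel.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda i: task(store, i), range(n_chunks)))


store = {}  # stand-in for persistent storage between steps
n = write_chunked(store, "a", range(10))
write_chunked(store, "b", range(10, 20))

run_step(add_task, store, n)          # step 1: elementwise add, chunk by chunk
run_step(partial_sum_task, store, n)  # step 2: per-chunk partial sums
total = sum(store[("partial", i)] for i in range(n))  # final small combine
```

Each task touches at most one chunk per array, and steps communicate only through the store, so peak memory per task depends on the chunk size and not on the total array size.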

General