07-13, 15:00–15:30 (America/Chicago), Grand Salon C
yt_xarray is a new package in the scientific Python ecosystem for linking yt and xarray. yt, primarily used in computational astrophysics, has gradually broadened its support for other scientific domains, including geoscience disciplines. Most geoscience data, however, still requires manual steps to load into yt. yt_xarray, a new xarray extension, aims to streamline the transfer of data from xarray to yt, providing a potentially useful tool to the many geoscience researchers already using xarray while allowing yt to leverage the distributed backends already supported by xarray. In this presentation, we will provide an overview of the usage and design of yt_xarray.
A number of recent efforts within the yt community have broadened the scope of scientific domains supported by yt. Some of these efforts improved generic functionality while others added functionality required for specific domains outside the astrophysics community. For geoscience data in particular, the addition of a geographic coordinate handler and an interface to cartopy for producing maps within the yt plotting framework enabled analysis of geographic datasets. Getting the data into yt, however, was not as streamlined as it could be: with the exception of some new custom data ingestors (termed "frontends" in yt) for specific geoscience data products, most geoscience data still required manual loading of arrays with generic yt loaders. In addition to imposing extra steps on the user, this approach required that the data fit entirely within memory. yt_xarray fills this gap by handling the data regularization required for loading geodata in yt and by leveraging xarray to read data on demand as yt needs it.
Rather than a traditional yt frontend, yt_xarray v0.1 introduced an xarray accessor object that streamlines the creation of yt datasets from subsets of fields, simplifying the process of using yt with most regularly gridded datasets that xarray can load. While the initial release focuses on simply returning a yt dataset object for use with any yt function, future releases will further simplify access to yt functions from xarray by providing yt function wrappers from within yt_xarray.
While yt and xarray are similar in that they both load and manipulate coordinate-referenced arrays, yt is designed primarily for volumetric data while xarray supports sets of labeled arrays more generally. This difference informed a number of important design choices in yt_xarray, in particular with regard to how chunked arrays are handled. For gridded datasets in yt, a physical domain can be subdivided into multiple grid objects so that a single yt "chunk" maps to a subdomain of the whole grid. During processing, subdomains are processed sequentially so that data is loaded as needed. In xarray, chunks are defined as contiguous index ranges within arrays, with the actual data potentially residing in on-disk files or existing as delayed computations. yt_xarray merges these two chunking systems by building yt grids that map spatial subdomains to index ranges of xarray fields. This allows a 1:1 mapping of Dask-xarray chunks to yt grid objects but also allows multiple Dask-xarray chunks to be contained within a yt grid object.
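The mapping between index-range chunks and spatial subdomains can be illustrated with a minimal sketch. This is not yt_xarray's implementation: the function names `chunk_index_ranges` and `grids_from_chunks` are hypothetical, the chunk sizes follow the per-axis tuple convention used by Dask, and the grid is assumed to be described by monotonic cell-edge coordinates along each axis.

```python
from itertools import accumulate, product


def chunk_index_ranges(chunk_sizes):
    """Turn chunk lengths along one axis, e.g. (4, 4, 2), into (start, stop) index ranges."""
    stops = list(accumulate(chunk_sizes))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


def grids_from_chunks(cell_edges, chunks):
    """Build one grid description per chunk (hypothetical helper, for illustration only).

    cell_edges: one list of cell-edge coordinates per axis (n_cells + 1 values each)
    chunks:     one tuple of chunk lengths per axis
    """
    per_axis_ranges = [chunk_index_ranges(sizes) for sizes in chunks]
    grids = []
    # Every combination of per-axis index ranges is one chunk, hence one grid.
    for combo in product(*per_axis_ranges):
        left = tuple(edges[start] for edges, (start, _) in zip(cell_edges, combo))
        right = tuple(edges[stop] for edges, (_, stop) in zip(cell_edges, combo))
        grids.append(
            {"index_ranges": combo, "left_edge": left, "right_edge": right}
        )
    return grids


# A 10 x 4 cell grid: unit spacing in x, 0.5 spacing in y.
x_edges = [float(i) for i in range(11)]
y_edges = [0.5 * j for j in range(5)]

# One grid per chunk (the 1:1 case).
grids = grids_from_chunks([x_edges, y_edges], chunks=[(4, 4, 2), (2, 2)])

# Coarser grids that each span more than one of the chunks above.
merged = grids_from_chunks([x_edges, y_edges], chunks=[(8, 2), (4,)])
```

Each entry pairs a spatial bounding box, which is what a yt grid object needs, with the index ranges used to slice the underlying array when that grid is read. Passing coarser sizes, as in `merged`, corresponds to the case where one grid contains several array chunks.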
In this presentation, we will provide an overview of using yt_xarray to load and analyze regularly gridded 2D and 3D xarray datasets. In addition to the general usage and development plans, we will describe the design of yt_xarray with a focus on leveraging the performance benefits of distributed arrays loaded via xarray.
Chris Havlin is a Research Scientist in the School of Information Sciences at the University of Illinois. His work focuses on open source scientific software development and computational geodynamics.