SciPy 2024

Xarray: Friendly, Interactive, and Scalable Scientific Data Analysis
07-08, 13:30–17:30 (US/Pacific), Ballroom B/C

Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. This hands-on tutorial focuses on intermediate and advanced workflows using complex real-world data. We encourage participants to bring their own datasets, as we will dedicate ample time to applying tutorial concepts to datasets of interest!


Xarray is an open-source Python project that makes working with complex, multi-dimensional arrays elegant, intuitive, and efficient. Real-world datasets are often a collection of many related variables on a common grid rather than raw numbers. Such datasets are common in the disciplines of earth science, astronomy, biology, healthcare, infrastructure, and finance. These datasets are more than just arrays of values: they have labels describing how array values map to locations in dimensions such as space and time, and metadata that describes how the data was collected and processed.
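As a minimal sketch of what "many related variables on a common grid" can look like in practice, the snippet below builds a small xarray.Dataset. The variable names, coordinate values, and attributes are synthetic and chosen only for illustration; they are not taken from the tutorial material.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: two related variables on a shared (time, place) grid.
times = pd.date_range("2024-01-01", periods=3, freq="D")
places = ["Boston", "Seattle"]

ds = xr.Dataset(
    data_vars={
        "temperature": (
            ("time", "place"),
            np.array([[1.0, 8.0], [2.0, 9.0], [3.0, 10.0]]),
        ),
        "precipitation": (
            ("time", "place"),
            np.array([[0.0, 5.0], [1.0, 4.0], [0.0, 6.0]]),
        ),
    },
    # Coordinate labels map array positions to locations in time and space.
    coords={"time": times, "place": places},
    # Dataset-level metadata (illustrative only).
    attrs={"source": "synthetic values for illustration only"},
)

# Per-variable metadata, such as units, lives alongside the values.
ds["temperature"].attrs["units"] = "degC"

print(ds)
```

Both variables share the same dimension names and coordinate labels, so a selection on the Dataset applies consistently to every variable it holds.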

Xarray embraces the complexity of scientific data and lets users rely on metadata such as dimension names and coordinate labels to analyze, manipulate, and visualize their datasets. For example, the Pandas-inspired label-based syntax temperature.sel(place="Boston") is more intuitive and less error-prone than the positional NumPy syntax temperature[0].
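The contrast between label-based and positional selection can be sketched with a small synthetic array. The values and the "day" dimension here are assumptions made for illustration; only the temperature and place names come from the text above.

```python
import numpy as np
import xarray as xr

# Synthetic temperatures for two places over three days.
temperature = xr.DataArray(
    np.array([[1.0, 2.0, 3.0], [8.0, 9.0, 10.0]]),
    dims=("place", "day"),
    coords={"place": ["Boston", "Seattle"], "day": [1, 2, 3]},
    name="temperature",
)

# Label-based selection states the intent directly.
by_label = temperature.sel(place="Boston")

# Positional selection only works if you remember Boston is at index 0.
by_position = temperature[0]

# If the data is reordered, the positional index silently points elsewhere,
# while the label-based selection still returns Boston.
reordered = temperature.isel(place=[1, 0])
still_boston = reordered.sel(place="Boston")
now_seattle = reordered[0]
```

In this sketch, reordered[0] returns the Seattle row while reordered.sel(place="Boston") still returns the Boston row, which is the kind of error that label-based selection helps avoid.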

This hands-on tutorial assumes participants have at least some experience using Xarray, and therefore focuses on intermediate and advanced workflows using complex real-world datasets. We highly encourage participants to bring their own data to work with during this tutorial! All material will be presented in curated Jupyter Notebooks with exercises to solidify understanding of key concepts. Participants will apply these concepts to their own datasets during open sessions, directly supported by the tutorial team.

The participant learning goals for the tutorial are to:

  • Learn how to leverage Xarray’s powerful backend and extension capabilities to customize workflows and open a variety of scientific datasets
  • Understand how Xarray can wrap other array types in the scientific Python ecosystem
  • Effectively use Xarray’s multidimensional indexing operations
  • Understand how to apply custom functions to Xarray data structures
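To give a flavor of the last two goals, here is a minimal sketch of vectorized (pointwise) indexing and of applying a custom NumPy function with xr.apply_ufunc. The array, its coordinates, and the value_range helper are hypothetical and are not drawn from the tutorial notebooks.

```python
import numpy as np
import xarray as xr

# Synthetic 2D field with labeled y and x coordinates.
da = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("y", "x"),
    coords={"y": [10, 20, 30], "x": [100, 200, 300, 400]},
)

# Vectorized indexing: indexers that share a new "points" dimension
# select individual (y, x) pairs instead of a rectangular block.
ys = xr.DataArray([10, 30], dims="points")
xs = xr.DataArray([200, 400], dims="points")
points = da.sel(y=ys, x=xs)


def value_range(arr, axis=-1):
    """Hypothetical custom function: max minus min along one axis."""
    return arr.max(axis=axis) - arr.min(axis=axis)


# apply_ufunc moves the core dimension "x" to the last axis, calls the
# function on the underlying NumPy array, and rewraps the result with
# the remaining dimension and its coordinate labels.
ranges = xr.apply_ufunc(
    value_range,
    da,
    input_core_dims=[["x"]],
    kwargs={"axis": -1},
)
```

Here points has a single "points" dimension holding the two selected values, and ranges keeps the "y" dimension with its labels while the "x" dimension is reduced away.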

Tutorial material is available online with instructions for running examples on free hosted infrastructure or on a local computer.


Prerequisites

Participants are expected to have some familiarity with Jupyter Notebooks, NumPy, Pandas, and Xarray. No specific scientific domain expertise is required to participate effectively in this tutorial. We encourage participants to review last year’s tutorial before attending and to bring their questions and enthusiasm to make our 4-hour session as interactive as possible!

Installation Instructions

https://tutorial.xarray.dev/overview/get-started.html

Don Setiawan is a Senior Research Software Engineer at the University of Washington, eScience Institute, Scientific Software Engineering Center (SSEC). He has expertise in Python programming, web development, geospatial data analytics, and cloud-based data engineering. He is interested in building scalable, open software to facilitate scientific discovery across fields and enforce software best practices. He has been involved with various open-source software projects with the Ocean Observatories Initiative (OOI), the U.S. Integrated Ocean Observing System (IOOS), the National Oceanic and Atmospheric Administration (NOAA), and the National Aeronautics and Space Administration (NASA).

Tom currently works at [C]Worthy, a non-profit building the computation tools needed to ensure safe, effective ocean-based carbon dioxide removal.

Before that, he was a Research Software Engineer working in Ryan Abernathey's Climate Data Science Lab at Lamont-Doherty Earth Observatory, Columbia University.

He first started using the open-source scientific Python stack during his PhD, when he was studying plasma turbulence in nuclear fusion reactors.

He is a member of the xarray core development team, and also works on Cubed, xGCM, pint-xarray, and xarray-datatree. He is heavily involved with the Pangeo community for Big Data Geoscience.

Luis A. López is a Research Software Engineer at the National Snow and Ice Data Center (NSIDC) in Boulder, Colorado. He is a passionate advocate of open science and open source, and a collaborator in projects like NASA Openscapes and ITS_LIVE. He’s always happy to help scientists find ways to make their workflows simpler and more efficient.

Scott Henderson is a senior research scientist in the University of Washington (UW) Department of Earth and Space Sciences and a data science fellow at the eScience Institute. He has worked on numerous NASA-funded efforts to develop open cloud computing solutions for data-intensive research. He is a lead organizer for ‘Hackweeks’ hosted by the UW eScience Institute, which are designed as participant-driven events to promote collaboration and open science.

Wietze works at Space Intelligence as a Product Architect, providing data on forest coverage and carbon storage to achieve zero deforestation and mass restoration.
He is a data engineer and the scrum master of the team that maintains Space Intelligence’s data platform.
The data platform builds on the Pangeo stack and uses machine learning workflows to provide satellite data products at scale.
Previously, he worked at a Water and IT company as the technical lead for a range of satellite-data-driven projects.