SciPy 2025

Cubed: Scalable array processing with bounded memory in Python
07-09, 13:15–13:45 (US/Pacific), Room 318

Cubed is a framework for distributed processing of large arrays without a cluster. Designed to respect memory constraints at all times, Cubed can express any NumPy-like array operation as a series of embarrassingly-parallel, bounded-memory steps. By using Zarr as persistent storage between steps, Cubed can run in a serverless fashion both on a local machine and on a range of cloud platforms. After explaining Cubed’s model, we will show how Cubed has been integrated with Xarray and demonstrate its performance on various large-array geoscience workloads.


Motivation:

Serverless computing presents an ambitious vision for computation at scale in which users are relieved of complex cluster management, yet enjoy elastic scaling across arbitrarily large workloads (https://arxiv.org/abs/1702.04024). Scientific users especially want ease of use and predictable performance on unforeseen problems, so their ideal parallel processing system would also be robust to failures, resumable, and predictable in its runtime memory usage.

Design:

We present Cubed (https://cubed-dev.github.io/cubed/), an open-source, pure-Python parallel computing framework explicitly designed to provide these features for N-dimensional array analytics.

Cubed is a generalization of Pangeo’s Rechunker package (https://github.com/pangeo-data/rechunker), which parallelizes an all-to-all rechunk (shuffle) operation at arbitrary scale whilst respecting a preset memory limit by writing to persistent storage via Zarr. Cubed extends this to support the entire Python Array API Standard (https://data-apis.org/array-api/2023.12/), by expressing all chunked array operations as a series of bounded-memory “blockwise” or rechunk operations.
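
Concretely, the user declares a memory budget up front and Cubed checks every operation in the plan against it. The following is a minimal sketch modelled on the getting-started example in the Cubed documentation; exact keyword names may differ between versions:

    import cubed
    import cubed.array_api as xp

    # A Spec fixes where intermediate Zarr data is written and sets a hard
    # per-task memory budget that every operation must stay within.
    spec = cubed.Spec(work_dir="tmp", allowed_mem="100kB")

    a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2), spec=spec)
    b = xp.asarray([[1, 1, 1], [1, 1, 1], [1, 1, 1]], chunks=(2, 2), spec=spec)
    c = xp.add(a, b)

    # Nothing has executed yet: Cubed has built a plan of blockwise and
    # rechunk steps, each with a projected memory usage that is checked
    # against allowed_mem before any task runs.
    result = c.compute()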

This approach neatly sidesteps one of the hardest problems in designing and running distributed systems: managing memory. Successful use of general-purpose systems such as Dask often requires understanding and tuning the memory configuration before a shuffle will complete.

With every operation instead expressed as a series of embarrassingly-parallel steps, each chunk can be processed by an independent task, with no need for a complex scheduler. This is an ideal fit for a “serverless” computing model, meaning there is no cluster for the user to manage.

Cubed can run via various execution backends, both on a local machine and in the cloud. For the cloud it uses Lithops (https://github.com/lithops-cloud/lithops) as an abstraction layer over various cloud providers’ serverless services (e.g. AWS Lambda). Cubed’s Plan objects can also be converted to Dask graphs or Apache Beam pipelines and then run via a dedicated service. Executors for Ray, Spark, and HPC are also in development.
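
For example, reusing the array c from the sketch above, switching to a serverless cloud backend is a one-line change at compute time. This assumes Lithops has already been configured for a cloud provider; the executor class name follows the Cubed documentation but may vary between versions:

    from cubed.runtime.executors.lithops import LithopsExecutor

    # Each task in the plan now runs as an independent serverless function
    # invocation (e.g. on AWS Lambda) via Lithops, with no cluster involved.
    result = c.compute(executor=LithopsExecutor())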

Integration with Xarray allows scientific users to try out Cubed without altering their analysis code, and opens the door to an array ecosystem in which users can seamlessly test different approaches to scaling up computation, as if they were running SQL queries against different query engines.
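
For instance, with the cubed-xarray integration package installed, choosing Cubed over Dask comes down to a couple of keyword arguments when opening a dataset. The keywords below follow the cubed-xarray README; the store path and variable name are hypothetical:

    import cubed
    import xarray as xr

    spec = cubed.Spec(allowed_mem="2GB")

    # chunked_array_type tells Xarray to back its lazy arrays with Cubed
    # rather than Dask; the analysis code that follows is unchanged.
    ds = xr.open_dataset(
        "s3://example-bucket/dataset.zarr",  # hypothetical Zarr store
        engine="zarr",
        chunks={},
        chunked_array_type="cubed",
        from_array_kwargs={"spec": spec},
    )
    mean = ds["temperature"].mean(dim="time")  # hypothetical variable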

Results:

We compare Cubed running in the cloud against other array frameworks such as Dask on a range of common array analytics workloads at terabyte scales. Using recent graph optimizations within Cubed, we show that performance comparable to cluster-based frameworks can be achieved at reasonable cost, whilst requiring fewer choices from the user and increasing reliability by respecting memory constraints.

Conclusion:

Cubed provides a new paradigm for array analytics at scale, using serverless services to parallelize large computations in the cloud, whilst allowing users to focus on scientific questions rather than configuring clusters.

Links:

Previous presentations by the authors:

Tom White:
* Genomics | Life Science Lightning Talk | Tom White | Dask Summit 2021 (https://www.youtube.com/watch?v=qt6YsHoPpZs)

Tom Nicholas:
* Cubed: Bounded-Memory Serverless Array Processing in Xarray (https://youtu.be/kYc6hIddjwA)
* Enabling Petabyte-scale Ocean Data Analytics - Thomas Nicholas, Julius Busecke | SciPy 2022 (https://www.youtube.com/watch?v=ftlgOESINvo)

Ryan Abernathey:
* Pangeo Forge: Crowdsourcing Open Data in the Cloud - Ryan Abernathey | SciPy 2022 (https://youtu.be/sY20UpYCAEE)

Tom Nicholas is a core developer of Xarray, and the original author of xarray.DataTree. He has made numerous contributions throughout the Pangeo stack, including to VirtualiZarr, Cubed, xGCM, and pint-xarray. He currently works on the open-source Pangeo stack full-time at Earthmover. Prior to that he worked at a non-profit on open-source tools for monitoring carbon dioxide removal, and as a Research Software Engineer in Ryan Abernathey's Climate Data Science Lab at Columbia University. He first started using the open-source scientific Python stack during his PhD, when he was studying plasma turbulence in nuclear fusion reactors. He has delivered many Xarray tutorials, including at SciPy 2022, 2023, and 2024.


Tom White is an independent software engineer. His long-term professional interest centres on large-scale distributed storage and processing. Over the last few years he has focused on big data infrastructure for scientists, including GATK, Scanpy, sgkit, and most recently Cubed. In a previous life Tom wrote “Hadoop: The Definitive Guide”, published by O’Reilly. He lives in the Brecon Beacons in Wales with his family.