SciPy 2023

Fast Exploration of the Milky Way (or any other n-dimensional dataset)
07-12, 10:45–11:15 (America/Chicago), Zlotnik Ballroom

N-dimensional datasets are common in many scientific fields, and quickly accessing subsets of these datasets is critical for an efficient exploration experience. Blosc2 is a compression and format library that recently added support for multidimensional datasets. Compression is crucial in effectively dealing with sparse datasets as the zeroed parts can be almost entirely suppressed, while the non-zero parts can still be stored in smaller sizes than their uncompressed counterparts. Moreover, the new double data partition in Blosc2 reduces the need for decompressing unnecessary data, which allows for top-class slicing speed.


Blosc is a high-performance compressor optimized for binary data, such as floating-point numbers, integers, and booleans, although it can also handle string data. It is designed to transmit data to the processor cache faster than the traditional, non-compressed memory fetch performed by a plain memcpy() call. Blosc is widely used in popular storage libraries like HDF5 (via h5py or PyTables) or Zarr, and likely produces many petabytes of compressed data around the world every day.

C-Blosc2 (https://github.com/Blosc/c-blosc2) is the latest major version of C-Blosc. It comes with Python-Blosc2 (https://github.com/Blosc/python-blosc2), a lightweight Python wrapper that exposes many of its new features. Some of the most interesting features are:

  • 64-bit containers: There is no practical limit on dataset size.
  • Frames: Data can be serialized either on-disk or in-memory.
  • Meta-layers: Metadata can be added in different layers inside frames.
  • Blosc2 NDim: N-dimensional datasets can be created, read, and sliced efficiently.
  • Double partitioning: Data can be split into fine-grained cubes for faster reads of n-dimensional slices.
  • Parallel reads: When several blocks of a chunk need to be read, this is done in parallel.
  • Support for special values: Large sequences of repeated values can be represented efficiently.

By leveraging these features, Blosc2 provides a powerful yet flexible tool for data handling. For example, when Blosc2 cooperates with libraries like PyTables/HDF5, it makes it possible to query tables with 100 trillion rows within human time frames.

Furthermore, being able to compress multidimensional data is of great help in handling large multidimensional datasets because 1) it reduces the amount of storage resources and 2) it reduces the bandwidth necessary to bring data from storage (disk, memory) to the CPU, allowing data to be processed more effectively in general. Additionally, compression can represent a wide variety of sparse data without requiring a specific sparse format: the zeroed regions are squeezed out by the compressor, keeping storage requirements to a minimum.

We will address common misconceptions about compressing data, such as: 1) decompressing data takes CPU time, which may slow down computations, and 2) when retrieving a subset of data, all affected partitions must be decompressed, adding overhead. To debunk these myths, we offer the following facts: 1) decompressing data within CPU caches often saves transmission cycles, and 2) Blosc2 features a novel double partitioning schema that minimizes decompression overhead.

We will leverage Python-Blosc2 to:

  • Describe the main features of Blosc2
  • Provide useful advice on the best codecs and filters for different types of datasets
  • Explain how to partition multidimensional datasets for efficient slicing
  • Compare efficiency and resource savings with other packages, such as h5py, PyTables, and Zarr

Finally, we will demonstrate how to explore a 3D dataset of the Milky Way effectively, using data from the Gaia mission.