SciPy 2023

Zarr: Community specification of large, cloud-optimised, N-dimensional, typed array storage
07-13, 15:50–16:20 (America/Chicago), Zlotnik Ballroom

A key feature of the Python data ecosystem is its reliance on simple but efficient primitives that follow well-defined interfaces, so that tools work seamlessly together (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature, namely scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, JavaScript, Julia, and Python.


Zarr is a data format for storing chunked, compressed N-dimensional arrays and is a NumFOCUS sponsored project.
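To make "chunked, compressed" concrete, here is a minimal toy sketch using only the Python standard library. It is not the Zarr API and not the Zarr on-disk layout: the function names, the `meta.json` key, and the metadata fields are assumptions made for illustration. It only shows the general idea of splitting an array into fixed-size chunks, compressing each chunk independently, and reading back just the chunks that a slice touches.

```python
import json
import struct
import zlib


def write_chunked(store, values, chunk_len, level=1):
    """Split a 1-D list of floats into chunks, compress each, and store by key."""
    # Hypothetical metadata document describing the array.
    store["meta.json"] = json.dumps(
        {"shape": [len(values)], "chunks": [chunk_len], "dtype": "<f8"}
    ).encode()
    for start in range(0, len(values), chunk_len):
        chunk = values[start:start + chunk_len]
        # Pack the chunk as little-endian float64, then compress it on its own.
        raw = struct.pack(f"<{len(chunk)}d", *chunk)
        store[str(start // chunk_len)] = zlib.compress(raw, level)


def read_slice(store, start, stop):
    """Return values[start:stop], decompressing only the chunks that overlap."""
    meta = json.loads(store["meta.json"])
    chunk_len = meta["chunks"][0]
    out = []
    for idx in range(start // chunk_len, (stop - 1) // chunk_len + 1):
        raw = zlib.decompress(store[str(idx)])
        chunk = struct.unpack(f"<{len(raw) // 8}d", raw)
        # Translate the global slice bounds into positions inside this chunk.
        lo = max(start - idx * chunk_len, 0)
        hi = min(stop - idx * chunk_len, len(chunk))
        out.extend(chunk[lo:hi])
    return out


# Any mutable mapping can act as the store: a dict here, but the same
# key/value shape could be backed by a directory or an object store.
store = {}
write_chunked(store, [float(i) for i in range(10)], chunk_len=4)
subset = read_slice(store, 3, 6)  # touches chunks "0" and "1" only
```

Because every chunk is compressed and stored under its own key, a reader can fetch a small region without loading the whole array, which is what makes this style of layout a good fit for cloud object storage.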

In this presentation, we will discuss the evolution of Zarr, first introduced at SciPy 2019; the development of the Zarr Enhancement Process (ZEP) and its use to define the next major version of the Zarr Specification (V3); as well as uptake of the format across the research landscape.

Outline:

First, we’ll be talking about:

Introduction and Working of Zarr (10 mins.)

  • What is Zarr, and how does it work?
    • The inner workings of Zarr using illustrated graphics
    • When and why should you use Zarr?
    • Extensive pluggable compressors (via numcodecs) and file-storage systems
  • What is the Zarr Specification?
    • A summary of the technical specification of Zarr
    • Adoption of the Zarr specification in various programming languages like Python, C, C++, Java, and JavaScript and how all of us form a wonderful community together
  • Development of Zarr since it was first presented at SciPy 2019 by Alistair Miles
    • Highlighting some important technical and community milestones since 2019
    • Securing grants from CZI and getting sponsored by NumFOCUS

After this:

Usage of Zarr across several domains (5 mins.)

  • Interoperability with Dask, Xarray and NumPy
  • Adoption of Zarr by various communities like Geospatial, Bio-imaging, Genomics, Data Science/Engineering, etc.
  • Development of convention processes like GeoZarr and OME-Zarr

Then we’ll discuss:

ZEP Process (10 mins.)

  • Need for and origin of a community feedback process for the evolution of the Zarr specification
  • How does it work?
  • Transformation from a steering-council-governed to a community-owned specification
  • Learnings when migrating from Spec V2 to Spec V3

And finally:

Conclusion (5 mins.)

  • Key takeaways
  • How can you get involved?
  • Q&A

This talk aims to address an audience who works with large amounts of data and is looking for a format which is transparent, open-source, reliable, cloud-optimised, and friendly to the environment. Also, we’d like to invite anyone interested in the lessons we learnt by maintaining the project throughout the years.

The tone of the talk is set to be informative, story-telling and fun.

After this talk, you will:

  • understand the basics of Zarr and its specification,
  • know why you should have a process for your project,
  • have essential takeaways about an OSS project's transition from a young to a mature stage,
  • understand the pros and cons of a steering-council-governed vs a community-owned open-source project.

Sanket is a data scientist based out of New Delhi, India. He likes to build data science tools and products and has worked with startups, government and organisations. He loves building community and bringing everyone together and is Chair of PyData Delhi and PyData Global. Currently, he's taking care of the community and OSS at Zarr as their Community Manager.
When he’s not working, he likes to play the violin and computer games and sometimes thinks of saving the world!

Got my B.S. & M.S. in Physics. After graduating went to work at Howard Hughes Medical Institute for 5 years working on image processing problems particularly in neuroscience. Got more involved in open source during that work with particular interest in packaging, storage, and distributed array processing. Then joined the NVIDIA RAPIDS team where there has been good overlap with these past interests as well as new ones.

Josh is a research software engineer focusing on the standardization and storage of bioimaging data. Typically, that means finding ways of storing large binary data with well-defined metadata in order to make it shareable. To that end, he is a maintainer of the Open Microscopy Environment (OME) as well as Zarr projects.

You can find out more at https://joshmoore.github.io