SciPy 2023

An API for efficient and low-latency access to the largest standardized single-cell data repository by CZ CELLxGENE Discover.
07-14, 14:35–15:05 (America/Chicago), Zlotnik Ballroom

CZ CELLxGENE Discover has released all of its human and mouse single-cell data through a new API that allows for efficient, low-latency querying. The data is fully standardized and publicly hosted, and consists of a count matrix of 50 million cells (observations) by more than 60,000 genes (features), accompanied by cell and gene metadata. Although it is built from more than 700 datasets, the API enables convenient cell- and gene-based filtering to obtain any slice of interest in a matter of seconds. All data can be quickly converted to numpy, pandas, anndata or Seurat objects.


As part of the CZ CELLxGENE Discover suite (cellxgene.cziscience.com), we have deployed Python and R APIs to query the largest aggregation of single-cell data: 50 million cells across more than 60,000 genes from the major human and mouse tissues.

The data comprises more than 700 individual datasets represented as a single gene expression count matrix along with metadata data frames, where all cells have harmonized annotations across 11 variables (e.g. cell type, tissue, sequencing technology, donor id) and all gene IDs and labels have been standardized on GENCODE references (https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md). The APIs can perform efficient cell-based queries across all cells regardless of the dataset of origin.
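To make the layout concrete, here is a toy sketch in plain Python of the structure described above: one table of cell metadata, one table of gene metadata, and a single sparse count matrix shared by all datasets. Every name and value in it (the `obs`, `var` and `counts` variables, the `query` helper, the `GENE_n` IDs) is invented for illustration; it is not the API itself, only the query pattern it exposes.

```python
# Toy stand-in for the concatenated data: all values are made up.
# Cell metadata (one row per cell), harmonized across datasets.
obs = [
    {"cell": 0, "dataset": "A", "cell_type": "B cell", "tissue": "lung"},
    {"cell": 1, "dataset": "A", "cell_type": "T cell", "tissue": "lung"},
    {"cell": 2, "dataset": "B", "cell_type": "B cell", "tissue": "blood"},
    {"cell": 3, "dataset": "B", "cell_type": "B cell", "tissue": "lung"},
]

# Gene metadata (one row per gene), with hypothetical standardized IDs.
var = [
    {"gene": 0, "feature_id": "GENE_1"},
    {"gene": 1, "feature_id": "GENE_2"},
    {"gene": 2, "feature_id": "GENE_3"},
]

# Sparse count matrix: only non-zero (cell, gene) entries are stored.
counts = {(0, 0): 5, (0, 2): 1, (1, 1): 7, (2, 0): 2, (3, 0): 4, (3, 1): 3}


def query(obs_filter, var_filter):
    """Return a dense slice for the cells and genes that pass the filters."""
    cells = [row["cell"] for row in obs if obs_filter(row)]
    genes = [row["gene"] for row in var if var_filter(row)]
    dense = [[counts.get((c, g), 0) for g in genes] for c in cells]
    return dense, cells, genes


# Lung B cells, regardless of which dataset they came from.
dense, cells, genes = query(
    lambda r: r["cell_type"] == "B cell" and r["tissue"] == "lung",
    lambda r: r["feature_id"] in {"GENE_1", "GENE_2"},
)
print(cells, genes, dense)
```

The filter runs on the metadata first and only then touches the counts, so the selected cells can span datasets A and B without the caller ever handling them separately.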

The concatenated data presents a unique opportunity to apply machine learning to single-cell gene expression at an unprecedented scale for biological discoveries. More importantly, the data and APIs are built around a recently developed technology, TileDB-SOMA, which provides cloud-optimized storage, low-latency access to larger-than-memory slices of data, querying and filtering under lazy evaluation, and export to pandas, pyarrow, anndata and Seurat.

The APIs are free to use (https://pypi.org/project/cell-census/) and the data is hosted publicly online, which allows users to fetch slices of data with fewer than 10 lines of code and in under 2 minutes. Our main objective is to accelerate biological discoveries by providing ready-to-use standardized gene expression data from 50 million human and mouse cells in an interoperable manner. We are eager to provide the support necessary to enable researchers to use the data and APIs effectively.
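As a rough sketch of what such a short query might look like in Python, the example below assumes that the `cell_census` package exposes `open_soma()` and `get_anndata()` with the `organism`, `obs_value_filter` and `column_names` arguments shown in its quick-start documentation; package and function names may differ between releases. The `build_obs_filter` and `fetch_slice` helpers, and the choice of `cell_type`, `tissue` and `donor_id` as column names, are illustrative and not part of the package.

```python
def build_obs_filter(cell_type, tissue):
    """Build a value-filter expression over cell (obs) metadata columns."""
    return f"cell_type == '{cell_type}' and tissue == '{tissue}'"


def fetch_slice(cell_type, tissue, organism="Homo sapiens"):
    """Fetch the matching cells as an AnnData object (needs network access)."""
    # Imported lazily so the helper above works without the package installed.
    import cell_census

    # Assumption: open_soma() and get_anndata() behave as described in the
    # package's quick-start documentation.
    with cell_census.open_soma() as census:
        return cell_census.get_anndata(
            census,
            organism=organism,
            obs_value_filter=build_obs_filter(cell_type, tissue),
            column_names={"obs": ["cell_type", "tissue", "donor_id"]},
        )


# Only the filter string is built here; fetch_slice() is not called because
# it downloads data.
print(build_obs_filter("B cell", "lung"))
```

Calling `fetch_slice("B cell", "lung")` would then return only the requested cells and metadata columns rather than the full matrix.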

Computational biologist at CZI focusing on providing access to all single-cell data hosted on CZ CELLxGENE (https://cellxgene.cziscience.com/). Ph.D. in cellular and molecular biology from Stanford University, and BSc in genomics from the National Autonomous University of Mexico.