07-11, 11:25–11:55 (US/Pacific), Room 317
The elasticity of the Cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories due to the lack of efficient parallel data access. Partitioning these data into even-sized chunks to enable parallel processing requires completely rewriting them to storage, which becomes prohibitively expensive at high volumes. In this article we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, which allows defining dynamically sized partitions using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage, enabling distributed workers to fetch data partitions on the fly directly from large data blobs, efficiently leveraging the high-bandwidth capability of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (between 65.5% and 71.31% less) without imposing significant overheads.
Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud object storage without needing to download the entire dataset.
These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data.
They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. For example, Dask can efficiently read data in parallel from Object Storage in CO formats like ZARR.
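To make the benefit concrete, here is a minimal stdlib-only sketch of the access pattern that cloud-optimized layouts enable. A plain dict stands in for an object store whose keys are chunk names (as in Zarr); a reader that needs one region fetches only the chunks covering it, never the whole dataset. The store layout, chunk size, and function names are illustrative, not any library's actual API.

```python
# "Object store": a 1-D array of 100 ints stored as 10 chunks of 10 elements.
store = {f"chunk/{i}": list(range(i * 10, (i + 1) * 10)) for i in range(10)}

def read_region(start, stop):
    """Fetch only the chunks overlapping [start, stop)."""
    fetched = []   # chunk keys actually requested (simulated GETs)
    out = []
    for i in range(start // 10, (stop - 1) // 10 + 1):
        fetched.append(f"chunk/{i}")
        out.extend(store[f"chunk/{i}"])
    first = start - (start // 10) * 10   # offset within the first chunk
    return out[first:first + (stop - start)], fetched

values, requests = read_region(42, 58)
# Only chunk/4 and chunk/5 are fetched; the other eight chunks are never read.
```

With a non-chunked (non-cloud-optimized) layout, the same read would force downloading the entire object before slicing it locally.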
Cloud-optimized formats are now widely used in geospatial settings, with entire datasets available in the Registry of Open Data on AWS, such as the Sentinel-2 Cloud-Optimized GeoTIFFs. Along the same lines, COPC (Cloud-Optimized Point Cloud) was developed to overcome the limitations of traditional LiDAR formats for cloud access, and Cloud-Optimized GeoTIFF (COG) to facilitate cloud processing of GeoTIFF files.
Nevertheless, there are no cloud-optimized versions of widely used formats in genomics (FASTA, FASTQ, VCF, FASTQGZip) and metabolomics (imzML). Furthermore, costly preprocessing from legacy formats is required (from GeoTIFF to COG, from LiDAR to COPC). In this talk, we will present Dataplug, a novel data processing library that enables cloud-optimized access to legacy formats without costly preprocessing, while also avoiding huge data movements. Dataplug covers legacy formats like LiDAR as well as major data formats found in bioinformatics (genomics, metabolomics) that lack appropriate cloud-optimized alternatives.
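The core idea behind this approach can be sketched with the standard library alone: build a small read-only index over an unstructured blob once, then derive dynamically sized byte-range partitions from the index, so workers can fetch slices directly without ever rewriting the data. The blob, index format, and helper names below are illustrative stand-ins, not Dataplug's actual API.

```python
# An in-memory blob of 100 newline-delimited records stands in for an
# object in cloud storage (e.g. a FASTQ file before gzip compression).
blob = b"".join(b"record-%03d\n" % i for i in range(100))

def build_index(data: bytes):
    """One sequential, read-only pass recording where each record starts."""
    offsets, pos = [], 0
    while pos < len(data):
        offsets.append(pos)
        pos = data.index(b"\n", pos) + 1
    return offsets

def partitions(offsets, blob_len, n):
    """Derive n byte-range partitions from the index; no data is rewritten."""
    per = -(-len(offsets) // n)  # ceil: records per partition
    for i in range(0, len(offsets), per):
        start = offsets[i]
        end = offsets[i + per] if i + per < len(offsets) else blob_len
        yield (start, end)

index = build_index(blob)
ranges = list(partitions(index, len(blob), 4))
# Each worker would issue an HTTP range GET, e.g. bytes=start-(end-1):
part0 = blob[ranges[0][0]:ranges[0][1]]
```

The index is tiny compared to the blob and is built once; repartitioning for a different worker count only recomputes the byte ranges, never the data.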
In this talk, you will learn how to process scientific data formats in Python using the Dataplug library from any Python data analytics platform like Dask or Ray. We will show different data processing pipelines in the Cloud that demonstrate the benefits of cloud-optimized data management.
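As a hedged sketch of the pipeline pattern just described, the snippet below hands byte-range partitions of a blob to a pool of workers, each fetching and processing its slice independently, the way Dask or Ray tasks would in a real cluster. Only the standard library is used; the blob, ranges, and `process` function are illustrative placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# 16 KiB stand-in for an object in cloud storage.
blob = bytes(range(256)) * 64
# Four even byte-range partitions covering the whole blob.
ranges = [(i * 4096, (i + 1) * 4096) for i in range(4)]

def process(byte_range):
    start, end = byte_range
    chunk = blob[start:end]   # in a real pipeline: a ranged GET from S3
    return sum(chunk)         # a trivial per-partition computation

# Each "worker" fetches and processes its partition independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, ranges))

total = sum(partials)         # combine per-partition results
```

The same map-then-combine shape carries over directly to Dask delayed tasks or Ray remote functions, with the thread pool replaced by cluster workers.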
Objectives
By the end of this talk, you will be able to:
- Understand Cloud-Optimized data formats and their benefits for data processing in the Cloud
- Learn how to process Cloud Optimized Data from Object Storage in Python using Dask
- Use the Dataplug library to enable on-the-fly partitioning of Cloud-Optimized data (COG, ZARR, COPC)
- Use the Dataplug library to enable on-the-fly partitioning of non-Cloud-Optimized formats (LIDAR, FASTQGZip, FASTA, FASTQ, VCF, imzML)
Outline
Introduction (10 minutes)
- About Us
- Understanding Cloud-Optimized data formats and Cloud Object storage
- Processing Cloud-Optimized data in Dask
Processing Cloud-optimized data in the Cloud with Python (15 minutes)
- Processing COG (Cloud-Optimized GeoTIFFs) in Python in the NDVI pipeline
- On-the-fly processing of compressed genomic data (FASTQGZIP) with Dataplug
- On-the-fly processing of metabolomics data (imzML) with Dataplug
- Comparing LIDAR and COPC processing with the Dataplug library in Dask (code)
Conclusions (2 minutes)
Audience
The talk is aimed at Python developers interested in processing data in the Cloud. In particular, it may be of interest in the following domains:
geospatial data (COG, COPC, LIDAR, ZARR, Kerchunk), genomics data (FASTA, FASTQ, VCF, FASTQGZIP) and metabolomics data (imzML).
This talk requires a basic understanding of Cloud Object Storage.
Pedro Garcia Lopez is a professor in the Computer Engineering and Mathematics Department at Universitat Rovira i Virgili (Spain).
He leads the "Cloud and Distributed Systems Lab" research group and coordinates large European research projects.
In particular, he leads CloudStars (2023-2027), NearData (2023-2025), and CloudSkin (2023-2025),
and he participates as a partner in EXTRACT (2023-2025). He also coordinated FP7 CloudSpaces (2013-2015), H2020 IOStack (2015-2017),
and H2020 CloudButton (2019-2022). Pedro is one of the main architects and leaders of the Lithops project, created
in collaboration with IBM in the CloudButton.eu project. He is the main author of the "Serverless End Game" and "Dataplug" papers and co-author
of the paper "Transparent serverless execution of Python multiprocessing applications".
Enrique Molina Giménez is a PhD candidate at Universitat Rovira i Virgili (URV), currently contributing to the EU HORIZON EXTRACT and UNICO I+D CLOUDLESS research projects. His work focuses on distributed systems and cloud computing, particularly smart provisioning strategies and infrastructure cost reduction. With a background spanning both industry and academia, Enrique brings hands-on experience ranging from software development to systems engineering.