07-08, 08:00–12:00 (US/Pacific), Room 318
Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset.
These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data.
They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations.
In this sense, cloud-optimized data is a natural fit for data-parallel serverless jobs.
FaaS provides a data-driven, scalable, and cost-efficient experience with practically no management burden.
Each serverless function reads and processes a small portion of the cloud-optimized dataset directly from object storage in parallel, yielding significant speedups.
In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit.
Lithops is a serverless data processing toolkit specifically designed to process data from Cloud Object Storage using serverless functions.
We will also demonstrate the Dataplug library, which enables cloud-optimized data management in scientific settings such as genomics, metabolomics, and geospatial data. We will show different data processing pipelines in the Cloud that demonstrate the benefits of cloud-optimized data management.
Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud object storage without needing to download the entire dataset.
These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data.
They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. For example, Dask can efficiently read data in parallel from Object Storage in CO formats like Zarr, as the sketch below illustrates.
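A minimal sketch of this pattern, assuming a Zarr store on S3 (the bucket, array component, and anonymous access here are illustrative placeholders, and zarr plus s3fs are assumed installed):

```python
# Sketch: reading a Zarr array in parallel with Dask directly from
# object storage. Bucket and component names are placeholders.
import dask.array as da

# Dask maps each Zarr chunk to a task; workers fetch only the byte
# ranges they need via HTTP range requests, never the whole dataset.
temps = da.from_zarr("s3://my-bucket/climate.zarr", component="temperature",
                     storage_options={"anon": True})

# Reductions run chunk-by-chunk in parallel.
print(temps.mean(axis=0).compute())
```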
Cloud-optimized formats are now widely used in geospatial settings, with entire datasets available in the Registry of Open Data on AWS, such as Sentinel-2 Cloud-Optimized GeoTIFFs. Along the same lines, COPC (Cloud-Optimized Point Cloud) was developed to overcome the limitations of LIDAR data in the cloud, and Cloud-Optimized GeoTIFF (COG) was developed to facilitate cloud processing of GeoTIFF files.
Nevertheless, there are no cloud-optimized versions of widely used formats in genomics (FASTA, FASTQ, VCF, FASTQGZIP) and metabolomics (imzML). Furthermore, costly preprocessing from legacy formats is required (from GeoTIFF to COG, from LIDAR to COPC). In this talk, we will present a novel data processing library called Dataplug that enables cloud-optimized access to legacy formats without costly preprocessing, while also avoiding huge data movements. Dataplug covers legacy formats like LIDAR, but also major data formats found in bioinformatics (genomics, metabolomics) that lack appropriate cloud-optimized alternatives.
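As a rough sketch of how this looks in code, based on Dataplug's documented CloudObject API (the format class, partitioning strategy, bucket, and parameter names here are illustrative and may differ across Dataplug versions):

```python
# Illustrative Dataplug usage: on-the-fly partitioning of a gzipped
# FASTQ file kept in its original format on S3, with no conversion.
from dataplug import CloudObject
from dataplug.formats.genomics.fastq import FASTQGZip, partition_reads_batches

co = CloudObject.from_s3(FASTQGZip, "s3://genomics-data/sample.fastq.gz")
co.preprocess()  # one-time lightweight indexing; the file stays as-is

# Lazily define evenly sized partitions; no data is moved yet.
data_slices = co.partition(partition_reads_batches, num_batches=32)
```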
In this talk, you will learn how to process scientific data formats in Python using the Dataplug library from any Python data analytics platform, such as Dask or Ray. We will show different data processing pipelines in the Cloud that demonstrate the benefits of cloud-optimized data management.
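Those lazy slices can then be fanned out to any such platform. Continuing the hypothetical snippet above with a local Dask cluster (the helper function is illustrative):

```python
# Each Dask worker materializes only its own slice's byte range.
from dask.distributed import Client

def slice_size(data_slice):
    # Fetch just this partition from S3 (API per the Dataplug docs).
    return len(data_slice.get())

client = Client()  # local cluster for the sketch; any Dask cluster works
print(sum(client.gather(client.map(slice_size, data_slices))), "bytes read")
```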
Furthermore, we will demonstrate how cloud-optimized on-the-fly data partitioning is especially well suited for serverless data processing toolkits that launch thousands of functions in parallel. Serverless computing introduces a novel paradigm of resource disaggregation and elasticity, enabling ephemeral, burstable distributed services that execute tasks on demand with fine-grained billing. This model is particularly advantageous for data-parallel applications that rely on Object Storage for direct data processing.
Lithops stands out as a mature serverless data analytics platform with an active GitHub community (https://github.com/lithops-cloud/lithops) and an extensible architecture. It supports compute and storage backends across major public cloud providers, including Amazon, Google, Azure, IBM, and Oracle. Lithops abstracts infrastructure management entirely: for instance, using lithops.map(func, bucket), users can transparently execute parallel tasks without worrying about resource management. The platform partitions data adaptively and allocates compute resources based on application-specific needs, launching a function for each data chunk. This allows Lithops to scale resources on demand for every map call, in stark contrast to static clusters, which rely on resources preprovisioned at the experiment's outset.
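A minimal sketch of that pattern with Lithops' FunctionExecutor (the bucket prefix and chunk size are placeholders; the `obj` parameter and `obj_chunk_size` argument follow the Lithops data-processing API):

```python
# Sketch: Lithops launches one function per 64 MiB chunk of every
# object under the prefix; 'obj' exposes a stream over that chunk.
import lithops

def count_lines(obj):
    return obj.data_stream.read().count(b"\n")

fexec = lithops.FunctionExecutor()  # backend/storage from ~/.lithops/config
fexec.map(count_lines, "s3://my-bucket/logs/", obj_chunk_size=64 * 1024**2)
print(sum(fexec.get_result()), "lines in total")
```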
The adaptability of Lithops underscores the exceptional suitability of serverless platforms for embarrassingly parallel and stateless tasks. These include data staging, Extract-Transform-Load (ETL) processes, and large-scale data preparation.
We will demonstrate how Lithops and Dataplug together excel at elastic data processing, outperforming cluster computing technologies like Dask. We will also show how to run the same Python code with Lithops on different Cloud providers, thus overcoming vendor lock-in.
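Portability here mostly reduces to swapping the executor's backend; a minimal sketch (backend and storage names follow the Lithops docs; credentials are assumed to live in the Lithops config):

```python
import lithops

def square(x):
    return x * x

# Identical user code; only the compute/storage backend pair changes.
for backend, storage in [("aws_lambda", "aws_s3"),
                         ("gcp_functions", "gcp_storage")]:
    fexec = lithops.FunctionExecutor(backend=backend, storage=storage)
    fexec.map(square, range(100))
    print(backend, sum(fexec.get_result()))
```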
Audience
The talk is aimed at Python developers interested in processing data in the Cloud. In particular, it may be of interest in the following domains:
geospatial data (COG, COPC, LIDAR, Zarr, Kerchunk), genomics data (FASTA, FASTQ, VCF, FASTQGZIP), and metabolomics data (imzML).
This talk requires a basic understanding of Cloud Object Storage and Serverless Functions.
Objectives
By the end of this tutorial, you will be able to:
- Understand Cloud-Optimized data formats and their benefits for data processing in the Cloud
- Learn how to process Cloud Optimized Data from Object Storage in Python using Dask
- Use the Dataplug library to enable on-the-fly partitioning of Cloud-Optimized data (COG, Zarr, COPC)
- Use the Dataplug library to enable on-the-fly partitioning of non-Cloud-Optimized formats (LIDAR, FASTQGZIP, FASTA, FASTQ, VCF, imzML)
- Understand Lithops Serverless Data Analytics platform and parallel map APIs
- Configure Lithops to use AWS Lambda and Amazon S3
- Create and run a simple Python parallel code that can run in the cloud with hundreds of processes
- Run the same Python parallel Map code in different Cloud providers (AWS, Google, IBM, Azure)
- Process massive data in parallel in the Cloud with Python and Dataplug
Outline
Block 1: Cloud Optimized Data and Dataplug
Introduction (10 minutes)
- Understanding Cloud-Optimized data formats and Cloud Object storage
- Processing Cloud-Optimized data in Dask
Processing Cloud-optimized data in the Cloud with Python (90 minutes)
- Processing COG (Cloud-Optimized GeoTIFF) files in Python in an NDVI pipeline
- On-the-fly processing of compressed genomic data (FASTQGZIP) with Dataplug
- On-the-fly processing of metabolomics data (imzML) with Dataplug
- Comparing LIDAR and COPC processing with the Dataplug library in Dask (code)
Exercises (20 minutes)
TBD
Block 2: Serverless Data Processing with Lithops
Introduction (10 minutes)
- Overview of Lithops
- Understand Lithops compute and storage APIs
- Understand Lithops backends and runtimes
Serverless Data processing in the Cloud with Lithops (90 minutes)
- Configure Lithops to use AWS Lambda and S3
- Run a simple Word Count example in the Cloud with Lithops
- Run Word Count in different Cloud providers (AWS, Google, IBM)
- Run Word Count in K8s in your own cluster
- Run Word Count in K8s in Managed K8s services (AWS Fargate, IBM Code Engine)
- Compare performance in different Clouds (AWS, Google, Azure, IBM) using Lithops compute and storage benchmarks
- Show speedups with parallel Python code in the Cloud (Pi estimation) running with hundreds of processes (see the sketch after this outline block)
- Learn how to combine Lithops and Dataplug for parallel data processing
- Execute complex pipelines in Lithops (metabolomics, genomics, astronomy)
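As a taste of the Pi-estimation exercise referenced in the outline above, a minimal Monte Carlo sketch with Lithops (the task and sample counts are illustrative):

```python
import random
import lithops

def sample(n):
    # Count random points that fall inside the unit quarter-circle.
    inside = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

N_TASKS, SAMPLES = 200, 1_000_000   # illustrative scale
fexec = lithops.FunctionExecutor()  # hundreds of functions on demand
fexec.map(sample, [SAMPLES] * N_TASKS)
print(4 * sum(fexec.get_result()) / (N_TASKS * SAMPLES))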
Exercises (20 minutes)
TBD
Lithops can be installed locally on Linux (https://lithops-cloud.github.io/docs/source/configuration.html), or you can use pyrun.cloud for free to run the exercises by providing your AWS credentials.
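For reference, Lithops can also be configured in code by passing a dictionary to the executor; a sketch for AWS Lambda plus S3 (keys follow the Lithops configuration docs, but the role ARN and bucket are placeholders and exact keys may vary by Lithops version):

```python
import lithops

# Placeholder values; AWS credentials are picked up from the standard
# environment variables or ~/.aws/credentials.
config = {
    "lithops": {"backend": "aws_lambda", "storage": "aws_s3"},
    "aws": {"region": "us-east-1"},
    "aws_lambda": {"execution_role": "arn:aws:iam::123456789012:role/lithops"},
    "aws_s3": {"storage_bucket": "my-lithops-bucket"},
}

fexec = lithops.FunctionExecutor(config=config)
```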
Prerequisites
This talk requires a basic understanding of Cloud Object Storage and Serverless Functions.
Lithops is designed for Linux or UNIX environments.
The exercises require at least an AWS Account with access to AWS Lambda and S3.
Pedro Garcia Lopez is a professor in the Computer Engineering and Mathematics Department at the University Rovira i Virgili (Spain).
He leads the "Cloud and Distributed Systems Lab" research group and coordinates large European research projects.
In particular, he leads CloudStars (2023-2027), NearData (2023-2025), and CloudSkin (2023-2025),
and he participates as a partner in EXTRACT (2023-2025). He also coordinated FP7 CloudSpaces (2013-2015), H2020 IOStack (2015-2017),
and H2020 CloudButton (2019-2022). Pedro Garcia Lopez is one of the main architects and leaders of the Lithops project, which was created
in collaboration with IBM in the Cloudbutton.eu project. Pedro is the main author of the "Serverless End Game" and "Dataplug" papers and a co-author
of the paper on "Transparent serverless execution of Python multiprocessing applications".
Enrique Molina Giménez is a PhD candidate at Universitat Rovira i Virgili (URV), currently contributing to the EU HORIZON EXTRACT and UNICO I+D CLOUDLESS research projects. His work focuses on distributed systems and cloud computing, particularly smart provisioning strategies and infrastructure cost reduction. With a background spanning both industry and academia, Enrique brings hands-on experience ranging from software development to systems engineering.