Enrique Molina Giménez
Enrique Molina Giménez is a PhD candidate at Universitat Rovira i Virgili (URV), currently contributing to the EU HORIZON EXTRACT and UNICO I+D CLOUDLESS research projects. His work focuses on distributed systems and cloud computing, particularly smart provisioning strategies and infrastructure cost reduction. With a background spanning both industry and academia, Enrique brings hands-on experience ranging from software development to systems engineering.
Sessions
Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset.
These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data.
They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations.
In this sense, cloud-optimized data is a natural fit for data-parallel jobs on serverless platforms.
Function-as-a-Service (FaaS) provides a data-driven, scalable, and cost-efficient execution model with practically no management burden.
Each serverless function reads and processes a small portion of the cloud-optimized dataset, fetched in parallel directly from object storage, which significantly increases speedup.
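The core of this pattern is that each function only fetches the byte range it needs. A minimal sketch of even byte-range partitioning (illustrative helper names, not part of any library mentioned here):

```python
# Sketch: split a remote object of known size into even byte ranges.
# Each serverless worker would fetch only its own range, e.g. via an
# HTTP "Range: bytes=start-(end-1)" request against object storage.

def byte_ranges(total_size: int, num_partitions: int):
    """Split [0, total_size) into contiguous, roughly even ranges."""
    chunk = total_size // num_partitions
    ranges = []
    for i in range(num_partitions):
        start = i * chunk
        # The last partition absorbs any remainder bytes.
        end = total_size if i == num_partitions - 1 else (i + 1) * chunk
        ranges.append((start, end))
    return ranges

ranges = byte_ranges(total_size=1000, num_partitions=3)
# → [(0, 333), (333, 666), (666, 1000)]
```

Because the ranges are computed from metadata alone, no data is moved or rewritten before the parallel job starts.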
In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit.
Lithops is a serverless data processing toolkit specifically designed to process data from cloud object storage using serverless functions.
We will also demonstrate the Dataplug library, which enables cloud-optimized data management in scientific settings such as genomics, metabolomics, or geospatial data. We will show different data processing pipelines in the Cloud that demonstrate the benefits of cloud-optimized data management.
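These pipelines follow a map pattern: the toolkit invokes one function per data partition and gathers the results. The sketch below mimics that pattern locally, with a thread pool standing in for serverless functions; with Lithops installed, `lithops.FunctionExecutor().map(...)` and `get_result()` play the same role against real cloud functions (the partition contents here are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(byte_range):
    """Stand-in for a serverless function: a real Lithops task would
    fetch this byte range from object storage and parse its records."""
    start, end = byte_range
    return end - start  # e.g. number of bytes handled by this worker

partitions = [(0, 333), (333, 666), (666, 1000)]

# Local stand-in for: lithops.FunctionExecutor().map(process_partition, partitions)
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process_partition, partitions))

print(sum(results))  # → 1000, the whole object covered exactly once
```

The same function body runs unchanged whether the executor is a local pool or a fleet of cloud functions, which is what keeps the management burden low.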
The elasticity of the Cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories due to the lack of efficient parallel data access. Even-sized partitioning of these data to enable their parallel processing requires a complete re-write to storage, which becomes prohibitively expensive at high volumes. In this article we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, which allows defining dynamically sized partitions using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage, enabling distributed workers to fetch data partitions on the fly directly from large data blobs, efficiently leveraging the high bandwidth of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (by 65.5% to 71.31%) without imposing significant overheads.
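The key idea, read-only format-aware indexing, can be illustrated with a toy format of newline-delimited records (this is a conceptual sketch, not Dataplug's actual API): a one-time scan records where each record starts, and partitions of any size are then derived on the fly, aligned to record boundaries, without ever rewriting the blob.

```python
# Sketch of read-only, format-aware indexing for newline-delimited
# records. The blob itself is never modified or re-written.

def build_index(blob: bytes) -> list[int]:
    """Byte offset of each record start (records end with b'\n')."""
    offsets, pos = [0], 0
    while (pos := blob.find(b"\n", pos)) != -1:
        pos += 1
        if pos < len(blob):
            offsets.append(pos)
    return offsets

def partitions(index: list[int], blob_size: int, n: int):
    """Derive n dynamically sized byte ranges aligned to records."""
    per = max(1, len(index) // n)
    bounds = index[::per] + [blob_size]
    return list(zip(bounds[:-1], bounds[1:]))

blob = b"rec1\nrecord-two\nr3\nlast record\n"
idx = build_index(blob)            # [0, 5, 16, 19]
parts = partitions(idx, len(blob), 2)  # [(0, 16), (16, 31)]
```

Changing `n` re-partitions the dataset instantly from the index alone, which is why this approach avoids the pre-processing re-write cost that the abstract quantifies.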