SciPy 2025

Pedro Garcia Lopez

Pedro Garcia Lopez is a professor in the Computer Engineering and Mathematics Department at the University Rovira i Virgili (Spain).
He leads the “Cloud and Distributed Systems Lab” research group and coordinates large European research projects.
In particular, he leads CloudStars (2023-2027), NearData (2023-2025), and CloudSkin (2023-2025),
and participates as a partner in EXTRACT (2023-2025). He also coordinated FP7 CloudSpaces (2013-2015), H2020 IOStack (2015-2017),
and H2020 CloudButton (2019-2022). Pedro Garcia Lopez is one of the main architects and leaders of the Lithops project, which was created
in collaboration with IBM in the CloudButton.eu project. Pedro is the main author of the "Serverless End Game" and "Dataplug" papers and co-author
of the paper "Transparent serverless execution of Python multiprocessing applications".


Sessions

07-08
08:00
240min
Processing Cloud-optimized data in Python with Serverless Functions (Lithops, Dataplug)
Pedro Garcia Lopez, Enrique Molina-Giménez

Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset.
These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data.
They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations.
Cloud-optimized data is therefore a natural fit for data-parallel jobs using serverless computing.
FaaS provides a data-driven, scalable, and cost-efficient experience with practically no management burden.
Each serverless function reads and processes a small portion of the cloud-optimized dataset, fetched in parallel directly from object storage, which significantly increases the speedup.
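The on-the-fly partitioning described above boils down to assigning each function a disjoint byte range of the object, which it then fetches with an HTTP Range request against object storage. A minimal sketch of that range computation (the function name is illustrative, not part of any library API):

```python
# Sketch: compute even byte ranges so that N workers can each fetch
# a disjoint chunk of a cloud object in parallel (e.g. via HTTP
# Range requests against object storage). Illustrative code only.

def byte_ranges(total_size: int, n_workers: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into n_workers contiguous (start, end) ranges."""
    base, extra = divmod(total_size, n_workers)
    ranges, start = [], 0
    for i in range(n_workers):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

# Each worker would then issue, e.g.:  GET object  with header
# Range: bytes=start-(end-1)
print(byte_ranges(100, 3))  # → [(0, 34), (34, 67), (67, 100)]
```

Real formats need format-aware boundaries (records must not be split mid-way), which is exactly what cloud-optimized layouts and tools like Dataplug provide on top of raw byte ranges.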

In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit.
Lithops is a serverless data processing toolkit that is specially designed to process data from Cloud Object Storage using Serverless functions.
We will also demonstrate the Dataplug library, which enables cloud-optimized data management in scientific settings such as genomics, metabolomics, and geospatial data. We will show different data processing
pipelines in the Cloud that demonstrate the benefits of cloud-optimized data management.
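Lithops' programming model is essentially a futures-based map over data partitions. A local sketch of that shape, using the standard library's `ThreadPoolExecutor` as a stand-in for cloud functions (the chunks and the per-chunk task are illustrative; with Lithops the equivalent calls would be `FunctionExecutor.map()` followed by `get_result()`, with partitions read from object storage):

```python
from concurrent.futures import ThreadPoolExecutor

# Local stand-in for the serverless map pattern: each "function
# invocation" processes one partition independently.

def count_records(chunk: str) -> int:
    """Toy per-partition task: count newline-separated records."""
    return len(chunk.splitlines())

# Stand-ins for object-storage partitions of a larger dataset.
chunks = ["a\nb\nc", "d\ne", "f\ng\nh\ni"]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(count_records, chunks))

print(results)       # per-partition results
print(sum(results))  # a final reduce step on the client
```

The point of the serverless version is that the pool is not a local thread pool but hundreds of cloud functions, provisioned on demand per invocation.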

Tutorials
Room 318
07-11
11:25
30min
Processing Cloud-optimized data in Python (Dataplug)
Pedro Garcia Lopez, Enrique Molina-Giménez

The elasticity of the Cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories due to the lack of efficient parallel data access. Even-sized partitioning of these data to enable parallel processing requires a complete re-write to storage, which becomes prohibitively expensive at high volumes. In this talk we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, which allows defining dynamically-sized partitions using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage, enabling distributed workers to fetch data partitions on-the-fly directly from large data blobs, efficiently leveraging the high bandwidth of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (between 65.5% and 71.31% less) without imposing significant overheads.
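The core idea — a one-time, read-only index over record boundaries, from which partitions of any size can be derived on the fly without rewriting the blob — can be sketched on a toy newline-delimited format (the format, index layout, and function names below are illustrative, not Dataplug's actual API):

```python
# Sketch of the read-only indexing idea on a toy format: scan the
# blob once to record where each record starts, then derive byte-range
# partitions on demand. The blob itself is never rewritten to storage.

def build_index(blob: bytes, delim: bytes = b"\n") -> list[int]:
    """One-time scan: byte offsets where each record starts."""
    offsets, pos = [0], 0
    while (pos := blob.find(delim, pos)) != -1:
        pos += 1
        if pos < len(blob):
            offsets.append(pos)
    return offsets

def partitions(index: list[int], blob_size: int, n: int) -> list[tuple[int, int]]:
    """Group records into at most n byte ranges aligned to record boundaries."""
    per_part = -(-len(index) // n)  # ceiling division
    parts = []
    for i in range(0, len(index), per_part):
        start = index[i]
        end = index[i + per_part] if i + per_part < len(index) else blob_size
        parts.append((start, end))
    return parts

blob = b"rec1\nrec2\nrec3\nrec4\nrec5\n"
idx = build_index(blob)                 # built once, stored alongside the blob
ranges = partitions(idx, len(blob), 2)  # re-partition on the fly for 2 workers
# Each worker fetches blob[start:end]: whole records, no re-write to storage.
```

The same index supports any worker count: asking for 5 partitions instead of 2 is just another call to `partitions`, with no new pass over the data.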

Bioinformatics, Computational Biology, and Neuroscience
Room 317
0min
Serverless Data Processing with Lithops
Pedro Garcia Lopez, Enrique Molina-Giménez

Unexpectedly, the rise of serverless computing has also collaterally started the “democratization” of massive-scale data parallelism. This trend, heralded by PyWren, aims to enable untrained users to execute single-machine code in the cloud at massive scale through platforms like AWS Lambda. Driven by this vision, this talk presents Lithops, which carries forward the pioneering work of PyWren to better exploit the innate parallelism of MapReduce-style tasks atop several Function-as-a-Service platforms such as AWS Lambda, IBM Cloud Functions, Google Cloud Functions, and Knative. Instead of waiting for a cluster to be up and running in the cloud, Lithops makes it easy to spawn hundreds or thousands of cloud functions to execute a large job within seconds of starting. With Lithops, for instance, users can painlessly perform exploratory data analysis from within a Jupyter notebook, while the Lithops engine takes care of launching the parallel cloud functions, loading dependencies, automatically partitioning the data, and so on. We describe the design and innovative features of Lithops and evaluate it using several representative applications, including MapReduce WordCount and Monte Carlo simulations. These applications demonstrate Lithops' ability to scale single-machine code to thousands of cores, and, importantly, without the need to boot a cold cluster or keep a warm cluster running for occasional tasks.
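The WordCount example mentioned above has exactly the map/reduce shape that Lithops parallelizes. A local sketch of that shape (the chunks are illustrative; with Lithops the map function would run as one cloud function per object-storage partition, e.g. via `FunctionExecutor.map_reduce()`):

```python
from collections import Counter
from functools import reduce

# WordCount in map/reduce form: the map function runs once per data
# partition (one cloud function per chunk under Lithops), and the
# reduce step merges the partial counts.

def map_count(chunk: str) -> Counter:
    """Map phase: word counts for one partition."""
    return Counter(chunk.split())

def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce phase: merge two partial counts."""
    return a + b

chunks = ["to be or not", "to be that is", "the question"]
partials = [map_count(c) for c in chunks]  # executed in parallel under Lithops
totals = reduce(reduce_counts, partials)
print(totals.most_common(2))
```

The same single-machine functions scale to thousands of cores precisely because each map invocation touches only its own partition and shares no state with the others.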

Machine Learning, Data Science, and Explainable AI