SciPy 2024

Orchestrating Bioinformatics Workflows Across a Heterogeneous Toolset with Flyte
07-12, 10:45–11:15 (US/Pacific), Room 315

While Python excels at prototyping and iterating quickly, it’s not always performant enough for whole-genome scale data processing. Flyte, an open-source Python-based workflow orchestrator, presents an excellent way to tie together the myriad tools required to run bioinformatics workflows. Flyte is a Kubernetes-native orchestrator, meaning all dependencies are captured and versioned in container images. It also allows you to define custom types in Python representing genomic datasets, a powerful way to enforce compatibility across tools. Computational biologists, or any scientists processing data with a heterogeneous toolset, stand to benefit from a common orchestration layer that is opinionated yet flexible.


Motivation
Since the sequencing of the human genome, and as other wet lab processes have scaled over the last couple of decades, computational approaches to understanding the living world have exploded. The firehose of data generated by these experiments led to algorithms and heuristics being developed in low-level, high-performance languages such as C and C++. Later, industry-standard collections of tools like GATK were written in Java. A number of less performance-intensive offerings such as MultiQC are written in Python, and R is used extensively where it excels: visualization and statistical modeling. Finally, newer AI models and Rust-based components are entering the fray.
Different languages also come with different dependencies and approaches to dependency management; interpreted and compiled languages, for example, handle this very differently. The tools themselves also need to be installed correctly and available in the user’s PATH for execution. Moreover, compatibility between tools in bioinformatics often falls back on standard file types expected in specific locations on a traditional filesystem. In practice, this means searching a particular directory for data files or indices and expecting a specific naming convention or file type.
In short, bioinformatics suffers from the same reproducibility crisis as the broader scientific landscape. Standardizing the interfaces, orchestration and encapsulation of these different tools in a flexible and future-proof way is of paramount importance on this unrelenting march towards larger and larger datasets.

Approach
Solving these problems using Flyte is accomplished by wrapping tools in Flyte tasks, defining custom types and enforcing them at the task boundary, abstracting away the filesystem using an object store, and capturing dependencies flexibly with dynamically generated container images.
While Flyte tasks are written in Python, there are a couple of ways to wrap arbitrary tools. ShellTasks are one such way, allowing you to define scripts as multi-line strings in Python. For added flexibility around packing and unpacking data types before and after execution, Flyte also ships with a subproc_execute function which can be used in vanilla Python tasks.
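The wrapping pattern described above can be sketched in plain Python. This is a minimal illustration using only the standard library, not Flyte's own API: `subprocess.run` stands in for `subproc_execute`, and the names `ToolResult`, `run_tool`, and `count_reads` are hypothetical. A child Python process plays the role of the compiled tool so the sketch runs anywhere; in a Flyte project the outer function would be a task.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Captured outcome of one command-line invocation (hypothetical type)."""
    returncode: int
    stdout: str
    stderr: str


def run_tool(cmd: list[str]) -> ToolResult:
    """Run an external tool and fail loudly on a non-zero exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with {proc.returncode}: {proc.stderr.strip()}"
        )
    return ToolResult(proc.returncode, proc.stdout, proc.stderr)


def count_reads(fastq_text: str) -> int:
    """Unpack the input, call the tool, pack its output into a Python int."""
    # Stand-in for a compiled tool: a child process that counts FASTQ
    # records (four lines per record) in its first argument.
    script = "import sys; print(len(sys.argv[1].splitlines()) // 4)"
    result = run_tool([sys.executable, "-c", script, fastq_text])
    return int(result.stdout.strip())
```

Keeping the packing and unpacking in Python, and only the heavy lifting in the external process, is what lets typed inputs and outputs flow between otherwise unrelated tools.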
Having rich data types to enforce compatibility at the task boundary is essential to these wrapped tools working together. Flyte supports custom data types through Python’s dataclasses module. Data types representing raw reads and alignment files allow us to reason about these files and their metadata more easily across tasks, as well as enforce naming conventions. Importantly, Flyte abstracts the object store, allowing you to load these assets into pods wherever is most convenient for your tool. This not only makes these files easier to work with, but also safer, as you’re working with ephemeral storage during execution instead of a full production filesystem.
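As a rough sketch of what such types might look like, the dataclasses below model paired-end reads and an alignment file. The class names, fields, and the `<sample>_<mate>.fastq.gz` naming convention are assumptions chosen for illustration; they are not taken from the talk or from Flyte itself.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Reads:
    """Paired-end raw reads for one sample (hypothetical type)."""
    sample: str
    read1: Path
    read2: Path

    def __post_init__(self) -> None:
        # Enforce the assumed naming convention when the object is built,
        # so a mismatch surfaces before any tool is invoked.
        for mate, path in ((1, self.read1), (2, self.read2)):
            expected = f"{self.sample}_{mate}.fastq.gz"
            if path.name != expected:
                raise ValueError(f"expected {expected}, got {path.name}")


@dataclass(frozen=True)
class Alignment:
    """An alignment file plus the metadata downstream tools care about."""
    sample: str
    bam: Path
    is_sorted: bool = False

    @classmethod
    def for_reads(cls, reads: Reads, out_dir: Path) -> "Alignment":
        # Derive the output name from the sample so files stay traceable.
        return cls(reads.sample, out_dir / f"{reads.sample}.bam")
```

A function annotated as taking `Reads` and returning `Alignment` then documents, and can check, exactly what an aligner wrapper consumes and produces.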
Since Flyte is a Kubernetes-native orchestrator, all tasks run in their own pods. Capturing dependencies in container images has been a gold standard for some time now, but ImageSpec takes this a step further. When a new Flyte project is initialized, a default Dockerfile is created with Flyte dependencies built in. It’s easy to extend this with whatever dependencies you might need. ImageSpec builds on this, allowing you to specify dependencies on top of an existing image right inline with your task code. This image will be built and uploaded when your tasks and workflows are registered to a Flyte cluster.
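An inline image definition might look roughly like the following. It assumes flytekit is installed and that a container registry is available; the image name, registry, package choices, and the `samtools_version` task are placeholders for illustration rather than details from the talk.

```python
import subprocess

from flytekit import ImageSpec, task

# Hypothetical image definition: name, registry, and packages are
# placeholders, not values from the talk.
samtools_image = ImageSpec(
    name="samtools-example",
    registry="ghcr.io/your-org",  # placeholder registry
    apt_packages=["samtools"],    # system-level tool
    packages=["multiqc"],         # Python dependency from PyPI
)


@task(container_image=samtools_image)
def samtools_version() -> str:
    """Runs inside the image above, so samtools is expected on PATH."""
    proc = subprocess.run(
        ["samtools", "--version"], capture_output=True, text=True, check=True
    )
    return proc.stdout.splitlines()[0]
```

Declaring the image next to the task keeps a tool's dependencies versioned alongside the code that calls it.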

Conclusion
Different steps in a bioinformatics pipeline often require tools with significantly different characteristics. As such, different languages are employed where their strengths are best leveraged. Being able to wrap these executables and the data they operate on into a common orchestration layer presents an enormous benefit to the developer experience and, consequently, to the reproducibility and extensibility of the research project as a whole.

Bioinformatics Solutions Architect at Union AI