Breaking the silo: composable bioinformatics through cross-disciplinary open standards SciPy 2025

2025-07-09 10:45–11:15, Room 317

The practice of data science in genomics and computational biology is fraught with friction. This is largely due to a tight coupling of bioinformatic tools to file input/output. While omic data is specialized and the storage formats for high-throughput sequencing and related data are often standardized, the adoption of emerging open standards not tied to bioinformatics can help better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems. Here, we present three libraries as short vignettes for composable bioinformatics. First, we present Oxbow, a Rust and Python-based adapter library that unifies access to common genomic data formats by efficiently transforming queries into Apache Arrow, a standard in-memory columnar representation for tabular data analytics. Second, we present Bioframe, a Python library that performs genomic range operations using standard Pandas dataframes. Last, we present Anywidget, an architecture based on modern web standards for sharing interactive visualizations across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, VSCode, and more. Together, we demonstrate the composition of these libraries to build a custom connected genomic analysis and visualization environment. We propose that components such as these, which leverage scientific domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows as well as systems for exploratory data analysis and visualization.

The practice of data science in genomics and computational biology is fraught with friction. This is in large part because bioinformatic tools tend to be tightly coupled to file input/output. As a result, bioinformatic workflows shuffle data through meandering, labor-intensive, and time-consuming transformations in order to accommodate each tool’s requirements. Similarly, genomics visualization tools need to handle various complex file types or require further data conversion by end users. We argue that the adoption of emerging open standards not tied to bioinformatics can help alleviate this coupling, freeing authors to focus on problem-specific concerns, and enabling bioinformatic tools to integrate better into the wider data science, visualization, and AI/ML ecosystems.

In this talk, we will present three libraries as short vignettes to illustrate the potential of composable bioinformatics.

First, we present oxbow (https://github.com/abdenlab/oxbow), an adapter library that unifies access to common genomic data formats. Despite their varied on-disk representations, many specialized bioinformatic formats share a fundamentally tabular structure. Oxbow efficiently transforms queries to such files into a common in-memory representation, Apache Arrow. Arrow is a standard, columnar, and self-describing layout for tabular data, designed for both efficient in-memory analytics and binary transport. It is now widely supported by open-source data technologies, including popular data frame libraries. Oxbow's core is written in Rust, which provides memory safety, performance, and ease of binding to high-level languages including Python and R. For file connectivity, Oxbow uses the noodles[1] Rust implementation of GA4GH formats (SAM/BAM, VCF/BCF, etc.) as well as the bigtools[2] Rust library for the UCSC Big Binary Indexed formats (bigWig and bigBed). The Python API provides a simple interface to lazy and distributed data libraries, including dask, polars, and duckdb.

Second, we present bioframe[3] (https://github.com/open2c/bioframe), a Python library from the Open Chromosome Collective (Open2C) for operations on genomic intervals in Pandas data frames, operations that are fundamental to bioinformatic analyses. Bioframe’s design principles emphasize reuse of existing general-purpose data structures. Namely, (1) bioframe does not introduce new custom objects: interval sets are standard Pandas data frames and (2) join operations are performed by reusing NumPy-based primitives rather than implementing interval tree structures, with similar performance. Bioframe facilitates smooth integration with the Python stack, removing the need to convert or serialize data between operations.
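The idea of joining intervals with array primitives instead of an interval tree can be sketched as follows. This is an illustrative sketch using sorting and `numpy.searchsorted`, not bioframe's actual implementation; the function names (`_expand`, `overlap_1d`, `overlap`), the column names, and the sample data are assumptions, and intervals are assumed to be half-open and non-empty.

```python
import numpy as np
import pandas as pd


def _expand(lo, hi):
    """Expand per-row index ranges [lo, hi) into flat (row, index) pairs."""
    counts = np.maximum(hi - lo, 0)
    rows = np.repeat(np.arange(len(lo)), counts)
    # Position of each output element within its own range.
    offsets = np.arange(counts.sum()) - np.repeat(np.cumsum(counts) - counts, counts)
    return rows, np.repeat(lo, counts) + offsets


def overlap_1d(starts1, ends1, starts2, ends2):
    """Positions (i, j) where [starts1[i], ends1[i]) overlaps [starts2[j], ends2[j])."""
    order1 = np.argsort(starts1, kind="stable")
    order2 = np.argsort(starts2, kind="stable")
    sorted1, sorted2 = starts1[order1], starts2[order2]

    # Case A: interval j starts inside interval i (start1 <= start2 < end1).
    i_a, k = _expand(
        np.searchsorted(sorted2, starts1, side="left"),
        np.searchsorted(sorted2, ends1, side="left"),
    )
    j_a = order2[k]

    # Case B: interval i starts strictly inside interval j (start2 < start1 < end2).
    j_b, k = _expand(
        np.searchsorted(sorted1, starts2, side="right"),
        np.searchsorted(sorted1, ends2, side="left"),
    )
    i_b = order1[k]

    return np.concatenate([i_a, i_b]), np.concatenate([j_a, j_b])


def overlap(df1, df2):
    """Inner-join two interval data frames on overlap, one chromosome at a time."""
    pieces = []
    for chrom in sorted(set(df1["chrom"]) & set(df2["chrom"])):
        g1 = df1[df1["chrom"] == chrom]
        g2 = df2[df2["chrom"] == chrom]
        i, j = overlap_1d(
            g1["start"].to_numpy(), g1["end"].to_numpy(),
            g2["start"].to_numpy(), g2["end"].to_numpy(),
        )
        left = g1.iloc[i].reset_index(drop=True)
        right = g2.iloc[j].reset_index(drop=True).add_suffix("_2")
        pieces.append(pd.concat([left, right], axis=1))
    if not pieces:
        return pd.DataFrame()
    return pd.concat(pieces, ignore_index=True)


genes = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 300, 50],
    "end": [200, 400, 150],
    "name": ["a", "b", "c"],
})
peaks = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2", "chr3"],
    "start": [150, 400, 0, 10],
    "end": [320, 500, 60, 20],
})
result = overlap(genes, peaks)
print(result)
```

Inputs and output are plain data frames throughout, so the result can go straight into any other Pandas operation. The two cases are disjoint and together cover every overlapping pair, which is why no pair is reported twice.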

Finally, we give a high-level overview of anywidget [4,5] (https://anywidget.dev). Anywidget is a standard and toolkit for authoring web-based interactive widgets in computational notebooks. Trevor Manz presented anywidget at SciPy last year, where he and I also gave a very positively received tutorial on it. In this talk, we will show how anywidget can be leveraged in a bioinformatics context.

In practice, third-party Jupyter widgets are cumbersome to author, distribute, and install. Widgets must be built, bundled, and installed as individual frontend extensions, and the frontends of different Jupyter-compatible platforms (JCPs), including JupyterLab, Google Colab, and VSCode, install and load extensions in disparate ways. Another consequence of this architecture is that the kernel (Python) and frontend (JavaScript) modules for a widget must be distributed separately. To address these difficulties, anywidget (1) supplies a single universal extension plugin for all JCP runtimes and (2) defines a narrow frontend widget API based on web-standard ECMAScript modules, which work natively across all modern browsers without transformation. Consequently, Jupyter widgets authored using anywidget (anywidgets) do not require installation or the use of build toolchains for development. This enables modern conveniences to support rapid development cycles. Furthermore, anywidgets can be pasted and executed live in a code cell or distributed as single Python packages.

We will demonstrate how to combine these tools to create a custom connected genomic analysis and visualization environment. We propose that components such as these, which leverage domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data processing workflows, computational analyses, and systems for exploratory data analysis and visualization.

[1] Macias, M. Noodles. https://github.com/zaeleus/noodles (last accessed March 2025).
[2] Huey, J. D., & Abdennur, N. (2024). Bigtools: a high-performance BigWig and BigBed library in Rust. Bioinformatics.
[3] Open2C, Abdennur, N., Fudenberg, G., Flyamer, I. M., Galitsyna, A. A., Goloborodko, A., Imakaev, M., & Venev, S. (2024). Bioframe: Operations on Genomic Intervals in Pandas Dataframes. Bioinformatics.
[4] Manz, T., Gehlenborg, N., & Abdennur, N. (2024). Any notebook served: authoring and sharing reusable interactive widgets. Proceedings of the 23rd Python in Science Conference.
[5] Manz, T., Abdennur, N., & Gehlenborg, N. (2024). anywidget: reusable widgets for interactive analysis and visualization in computational notebooks. JOSS.

Trevor Manz

Nezar Abdennur

I am an Assistant Professor in the Department of Genomics and Computational Biology and the Department of Systems Biology at UMass Chan Medical School.

I lead a computational research group (https://abdenlab.org) with a dual mandate. My group's biological research focuses on the 3D organization of the genome (3C/Hi-C technologies), its relationship to the epigenome, and the resulting manifold influences on cellular fate, differentiation, aging, and disease. My group's open-source interests are in supporting foundational software infrastructure to improve genomic and multi-omic data science, especially in the scientific Python ecosystem.
