SciPy 2023

tfmodisco-lite: an attribution-based motif discovery algorithm
07-12, 11:25–11:55 (America/Chicago), Grand Salon C

An important problem in genomics is identifying the proteins that bind to DNA. Although many methods attempt to learn the DNA motifs underlying protein binding as position-weight matrices (PWMs), PWMs cannot always represent real biology faithfully. For instance, a static PWM cannot describe a zinc-finger protein whose fingers can optionally include one-nucleotide spacing. TF-MoDISco is a framework for extracting motifs using attribution scores from a machine-learning model. The learned motifs and syntax overcome many of the limitations of PWMs. I will describe the TF-MoDISco algorithm and showcase its efficient re-implementation, tfmodisco-lite.


Understanding the binding of proteins to the genome is crucial for deciphering gene expression programs across cell types. Yet, identifying where and when these proteins bind along the genome is complicated. Most proteins bind to a specific sequence of nucleotides, known as a "motif." But not all proteins are this simple: zinc-finger proteins are composed of many "fingers" that each bind to a short motif of 3-4 base pairs. While these short motifs are always found in the same order, the spacing between them can vary, and not all of them are always necessary for binding. Other proteins require the presence of a co-factor to bind to their motifs. Faithfully describing the sequence determinants of protein binding, sometimes called the cis-regulatory logic, for all proteins is a challenging task.

Increasingly, researchers use machine learning to understand biology by training neural networks to take in nucleotide sequence and predict a readout of interest, e.g. ATAC-seq, ChIP-seq, or CAGE. One can then run a feature attribution algorithm, such as in silico mutagenesis (ISM) or DeepLIFT, to highlight the nucleotides that drive the predicted readouts. However, summarizing these attributions into repeated patterns has thus far been a missing component of the analysis pipeline.
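
To make the attribution step concrete, here is a minimal sketch of in silico mutagenesis in plain Python. It is an illustration only: `toy_model` is a hypothetical stand-in for a trained network (it simply counts occurrences of the string "GATA"), and `ism_scores` is a name chosen for this example, not part of any library. Each position is scored by the mean drop in the prediction across its three possible substitutions.

```python
from typing import Callable, List

ALPHABET = "ACGT"


def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a trained network: counts 'GATA' occurrences."""
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))


def ism_scores(seq: str, model: Callable[[str], float]) -> List[float]:
    """Score each position by the mean drop in prediction over its substitutions."""
    reference = model(seq)
    scores = []
    for i, ref_base in enumerate(seq):
        drops = []
        for alt in ALPHABET:
            if alt == ref_base:
                continue  # only the three alternative nucleotides are tested
            mutated = seq[:i] + alt + seq[i + 1:]
            drops.append(reference - model(mutated))
        scores.append(sum(drops) / len(drops))
    return scores


sequence = "CCGATACC"
scores = ism_scores(sequence, toy_model)
for base, score in zip(sequence, scores):
    print(base, score)
```

In this toy case only the four positions inside the "GATA" match receive a non-zero score, which is the kind of per-nucleotide signal that a motif discovery step would then need to summarize.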

TF-MoDISco is a framework for using attribution-weighted sequence to discover motifs. The approach differs from classic motif finding algorithms in both the input and the output. Rather than operating solely on nucleotide sequence, TF-MoDISco also takes in attributions computed for a machine-learning model with any attribution algorithm. These attributions highlight the nucleotides that drive the model's predictions and so distinguish between driver motifs and passenger motifs. At the end of the procedure, TF-MoDISco returns clusters of "seqlets," or found motif hits. These patterns, aligned to each other to account for spacing, represent the true heterogeneity of protein binding to the genome. By returning clusters of seqlets, TF-MoDISco overcomes many of the problems of position-weight matrices (PWMs), such as the inability to account for variable spacing and the assumption that each position contributes independently.
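
As a rough illustration of what a "seqlet" is, the sketch below picks out short, high-attribution windows from a single attribution track. It is a deliberately simplified assumption, not the TF-MoDISco procedure itself: the function name `find_seqlets`, the fixed window size, and the hand-picked threshold are all hypothetical, and the clustering and alignment of seqlets is not shown.

```python
from typing import List, Tuple


def find_seqlets(
    attributions: List[float], window: int, threshold: float
) -> List[Tuple[int, int]]:
    """Greedily pick non-overlapping windows with high total |attribution|."""
    n = len(attributions)
    window_sums = [
        sum(abs(a) for a in attributions[start:start + window])
        for start in range(n - window + 1)
    ]
    taken = [False] * n
    seqlets = []
    # Visit candidate windows from highest to lowest total attribution.
    for start in sorted(range(len(window_sums)), key=lambda s: -window_sums[s]):
        if window_sums[start] < threshold:
            break  # every remaining window scores lower still
        if any(taken[start:start + window]):
            continue  # overlaps a seqlet that was already selected
        seqlets.append((start, start + window))
        for i in range(start, start + window):
            taken[i] = True
    return sorted(seqlets)


# Hypothetical per-nucleotide attribution scores for one 14-bp sequence.
attributions = [0.0, 0.1, 0.9, 1.0, 0.8, 0.9, 0.0, 0.1, 0.0, 0.7, 0.9, 0.8, 0.6, 0.0]
seqlets = find_seqlets(attributions, window=4, threshold=2.5)
print(seqlets)
```

Each returned pair is a half-open `(start, end)` interval. Because seqlets are kept as individual instances rather than collapsed into a single matrix, instances with different spacing can later be grouped into separate patterns.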

This talk will describe the TF-MoDISco procedure at a high level (first 15 minutes) and give a tutorial on how to use the code for motif discovery in practice (second 15 minutes). Examples will come from models used to predict chromatin accessibility via ATAC-seq as well as protein binding via ChIP-seq readouts. Specifically, the tutorial will cover tfmodisco-lite, a rewrite of the original algorithm that scales significantly better, runs faster, and requires less code. By the end of the talk, attendees should feel comfortable applying the method to their own data and interpreting the reports that are generated.

Jacob Schreiber is a post-doctoral researcher at Stanford University, where he studies human genomics using modern machine-learning tools. In his "free time," he contributes to the Python data science ecosystem in the form of pomegranate, a package for probabilistic modeling, and apricot, a package for submodular optimization for summarizing large data. In the past, he was a core developer for scikit-learn.