SciPy 2023

Interactive Exploration of Large-Scale Datasets with Jupyter-Scatter
07-13, 11:25–11:55 (America/Chicago), Amphitheater 204

Jupyter-scatter is a scalable, interactive, and interlinked scatter plot widget for exploring datasets with up to several million data points. It focuses on data-driven visual encodings and offers two-way pan+zoom and lasso interactions. Beyond a single instance, jupyter-scatter can compose multiple scatter plots and synchronize their views and selections. Moreover, points can be connected by spline-interpolated lines. Thanks to the underlying WebGL rendering engine, spatial and color changes are smoothly transitioned. Finally, the API integrates seamlessly with Pandas DataFrames and offers functional methods that group properties by type to ease accessibility and readability.


Visualizing a dataset as a 2D scatter plot is one of the most popular ways to understand distributions, identify trends, and discover correlations. The method is used in virtually every scientific domain. For instance, in biology, machine learning, or digital humanities, high-dimensional datasets are often summarized with dimensionality-reduction methods like PCA, t-SNE, or UMAP, and the results are typically visualized as 2D scatter plots to discover clusters.

Unfortunately, many visualization tools either fail to scale or compromise the user experience as datasets grow in size, dimensionality, and number. For instance, while Datashader can render datasets of almost any size, it offers limited interactions. Plotly, on the other hand, provides interactivity but does not scale nearly as well to millions of points. Ideally, we want to be able to render and interactively explore one or more datasets with millions of data points.

Jupyter-scatter (https://github.com/flekschas/jupyter-scatter) is a purpose-built widget for Jupyter Notebook, Lab, and Google Colab that supports interactive, interlinked, and scalable exploration of multiple large-scale datasets as scatter plots. It focuses on data-driven visual encodings and offers pan+zoom interactions and two-way lasso selection. Beyond a single instance, jupyter-scatter can compose multiple scatter plots and synchronize their views and selections. Moreover, points can be connected by spline-interpolated lines. Thanks to the underlying WebGL rendering engine (https://github.com/flekschas/regl-scatterplot), changes in the spatial or color encoding of the points are smoothly transitioned. Finally, the widget API is inspired by seaborn and integrates seamlessly with Pandas DataFrames. Because the number of arguments can become overwhelming when many properties are customized, jupyter-scatter provides a functional API that groups properties by type and exposes them via meaningfully named methods. This functional API additionally allows users to programmatically modify active widgets from Python. To further improve usability, jupyter-scatter infers sensible default color encodings from the data and dynamically adjusts the point opacity based on the point density in the current field of view.
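As a rough illustration of the DataFrame integration and the functional API described above, the following sketch builds a small synthetic DataFrame and configures a scatter plot through grouped methods. The column names, the helper functions `make_demo_frame` and `build_scatter`, and the synthetic data are assumptions made for this example only. The `jscatter` calls follow the project README as we understand it and may differ between versions, so treat them as a starting point rather than a reference.

```python
import numpy as np
import pandas as pd


def make_demo_frame(n=1000, seed=0):
    """Create a synthetic 2D dataset (hypothetical stand-in for real data)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "x": rng.normal(size=n),
        "y": rng.normal(size=n),
        # Categorical column to drive a data-driven color encoding
        "cluster": pd.Categorical(rng.integers(0, 3, size=n).astype(str)),
    })


def build_scatter(df):
    """Configure a scatter widget; requires `pip install jupyter-scatter`."""
    import jscatter  # imported lazily so the data helper works without it

    scatter = jscatter.Scatter(data=df, x="x", y="y")
    scatter.color(by="cluster")    # color-related properties grouped in one method
    scatter.opacity(by="density")  # opacity adapts to the visible point density
    return scatter


df = make_demo_frame()

# In a notebook cell (assumed usage):
#   scatter = build_scatter(df)
#   scatter.show()
# Linked views and selections across several plots (assumed usage):
#   jscatter.link([build_scatter(df), build_scatter(df)])
```

Because the same methods can be called again after the widget is displayed, a later call such as `scatter.color(by="x")` would be one way to change the encoding of an active plot from Python.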

Using examples from single-cell biology and machine learning, we demonstrate how jupyter-scatter works, how it enables more efficient exploration of large-scale datasets, and how it can be integrated with other ipywidgets to build bespoke applications.

Fritz Lekschas is a computer scientist researching scalable visual exploration of biomedical data. As the Head of Visualization Research at Ozette Technologies, he leads the development of web-based data visualization and exploration tools for analyzing high-dimensional single-cell data. Fritz earned his PhD in computer science from Harvard University, where he was advised by Hanspeter Pfister and Nils Gehlenborg. He has published more than twenty peer-reviewed papers, and his work has been recognized with several awards.

In his free time, Fritz likes to work on open-source tools for visual data exploration.