SciPy 2025

DataMapPlot: Rich Tools for UMAP Visualizations
07-09, 10:45–11:15 (US/Pacific), Room 315

Many data scientists use UMAP to quickly visualize and explore complex datasets, whether that means exploring large unstructured datasets via neural embeddings or working on LLM explainability by mapping out Sparse Autoencoder features. Making those visualizations good enough, and compelling enough, to present to end users is much harder. Done right, however, a UMAP plot can be a powerful communication tool or a rich interactive experience that draws users in. Attendees will come away with a sense of what is possible and an introduction to open source tools that can make it easy.


This talk focuses on the visual presentation of data, and the best way to teach that is by example. It will show examples of great visualizations, along with clear visual examples of the many pitfalls encountered when trying to reproduce them. The aim is to explain how to make UMAP plots compelling to users unfamiliar with the technical details, guiding them with annotation labels, clusters, and spatial palettes. It will also introduce the DataMapPlot library, an open source project that specifically aims to make the difficult parts of UMAP plots easy, letting users focus on the overall aesthetics.

Outline:
* UMAP and Data Maps (6 min)
  - High dimensional embedding vectors are everywhere
    + Introduction to data maps
  - Low dimensional embeddings to explore high dimensional data
  - Examples of impactful data map visualizations
* Plotting Challenges (8 min)
  - Common pitfalls of data map visualization
    + Overplotting
    + Color maps
    + Placing text labels
    + Interactivity
  - Importance of clear and effective visualizations
  - Real-world examples: comparing "plain" vs. well-designed plots
* DataMapPlot (8 min)
  - Simplifying data map visualization creation with DataMapPlot
  - Generating static visualizations for publication
  - Generating interactive visualizations for exploration
* Examples (6 min)
  - Showcasing some examples created with DataMapPlot (including code to reproduce)
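
To give a flavour of the "placing text labels" pitfall listed in the outline, here is a minimal sketch of the most naive approach: anchoring each cluster label at the centroid of its points. This is not DataMapPlot's algorithm and is not taken from the talk; the function name `centroid_label_anchors`, the toy data, and the `"Unlabelled"` noise marker are all assumptions made for illustration.

```python
from collections import defaultdict


def centroid_label_anchors(points, labels, noise_label="Unlabelled"):
    """Return {label: (x, y)}, anchoring each label at its cluster centroid.

    Points whose label equals `noise_label` (a hypothetical marker for
    unclustered points) are ignored.
    """
    # Per-label running totals: [sum of x, sum of y, point count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in zip(points, labels):
        if label == noise_label:
            continue
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


# Toy 2D "data map": two small clusters plus one unclustered point.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.5),
          (10.0, 10.0), (12.0, 10.0), (5.0, 5.0)]
labels = ["cats", "cats", "cats", "dogs", "dogs", "Unlabelled"]

anchors = centroid_label_anchors(points, labels)
for label, (x, y) in anchors.items():
    print(f"{label}: ({x:.2f}, {y:.2f})")
```

Centroid anchors put the text directly on top of the densest part of each cluster, and nothing stops labels of neighbouring clusters from colliding with each other, which is one reason label placement appears among the pitfalls above.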

Leland McInnes is a researcher at the Tutte Institute for Mathematics and Computing. He is the author of a number of open source tools for data analysis and machine learning, including UMAP, HDBSCAN, PyNNDescent, EVōC, DataMapPlot and Toponymy.