SciPy 2024

Vector space embeddings and data maps for cyber defense
07-12, 13:55–14:25 (US/Pacific), Ballroom

Vast amounts of information of interest to cyber defense organizations come in the form of unstructured data, from host-based telemetry and malware binaries to phishing emails and network packet sequences. All of this data is extremely challenging to analyze. In recent years there have been huge advances in methods for converting unstructured media into vectors. However, leveraging such techniques for cyber defense data remains a challenge.

Imposing structure on unstructured data allows us to leverage powerful data science and machine learning tools. Structure can be imposed in multiple ways, but vector space representations, with a meaningful distance measure, have proven to be one of the most fruitful.

In this talk, we will demonstrate a number of techniques for embedding cyber defense data into vector spaces. We will then discuss how to leverage manifold learning techniques, clustering, and interactive data visualization to broaden our understanding of the data and enrich it with expert feedback.

At the Tutte Institute for Mathematics and Computing (TIMC), we believe in the importance of reproducibility and in making research techniques accessible to the broader cyber defense community. To that end, this talk will leverage several open source libraries and techniques that we have developed at TIMC: Vectorizers, UMAP, HDBSCAN, ThisNotThat and DataMapPlot.

Over the years, the Tutte Institute for Mathematics and Computing has been tasked with multiple difficult analysis problems that share a common thread: deriving analytic insight from large semi-structured or unstructured datasets. In this talk, we will first present several examples of these problems and suggest how they can all be addressed through a common methodological plan:

  1. Embed data records or aggregates in a high-dimensional vector space.
  2. Reduce the dimension to 2D through manifold learning that preserves a natural or asserted notion of similarity between the data vectors.
  3. When no other class information is known for the data, use density-based clustering to discover natural trends within a low-dimensional representation of the data.
  4. Examine and annotate the 2D representation of these vectors to validate analytic hypotheses, track trends and anomalies expressed through the similarity structure, and explore surprising observations.
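
To make step 1 concrete, here is a minimal sketch using only the Python standard library. It is not the Vectorizers library's method: it simply assumes that each record is a sequence of event tokens, embeds it as a sparse vector of bigram counts, and uses cosine distance as the asserted notion of similarity. The host names, the event tokens, and the helper functions `ngram_counts` and `cosine_distance` are hypothetical and exist only for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_counts(tokens, n=2):
    """Count the contiguous n-grams in one token sequence (a sparse vector)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors stored as Counters."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an empty record as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical event sequences, one per host (illustrative data only).
records = {
    "host_a": ["open", "read", "read", "close", "open", "read", "close"],
    "host_b": ["open", "read", "close", "open", "write", "close"],
    "host_c": ["connect", "send", "recv", "send", "recv", "close"],
}

# Step 1: embed every record in the (sparse) space of bigram counts.
vectors = {name: ngram_counts(tokens) for name, tokens in records.items()}

# The distance measure then ranks records by behavioural similarity.
d_ab = cosine_distance(vectors["host_a"], vectors["host_b"])
d_ac = cosine_distance(vectors["host_a"], vectors["host_c"])
print(f"host_a vs host_b: {d_ab:.3f}")
print(f"host_a vs host_c: {d_ac:.3f}")
```

In this toy data, the two file-oriented hosts share several bigrams and end up closer to each other than either is to the network-oriented host, which shares none. That is the kind of similarity structure that steps 2 to 4 then preserve, cluster, and display.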

We have developed a number of open source libraries to address these steps.

  1. While common open models should be preferred when available for embedding data such as images and text into vector spaces, our Vectorizers library provides multiple tools for embedding other data types.
  2. PyNNDescent builds an efficient nearest neighbor index to query and capture the local similarity structure of the dataset.
  3. UMAP has become the de facto industry standard for dimension reduction by manifold learning over a large number of similarity metrics.
  4. HDBSCAN, now maintained through Scikit-Learn Contrib, constitutes a robust implementation of density-based clustering.
  5. ThisNotThat offers Jupyter-integrated interactive visualization and labeling of data for scientists and experts, and DataMapPlot enables the exchange of data map representations for publication and further exploration.

Our talk will show how these tools compose into a complete approach for analyzing datasets through novel perspectives and communicating findings to peers and stakeholders alike.