SciPy 2024

Benoit Hamelin


Sessions

07-12
13:55
30min
Vector space embeddings and data maps for cyber defense
Benoit Hamelin

Vast amounts of information of interest to cyber defense organizations come in the form of unstructured data: host-based telemetry, malware binaries, phishing emails, and network packet sequences. All of this data is extremely challenging to analyze. In recent years there have been huge advances in methods for converting unstructured media into vectors. However, applying such techniques to cyber defense data remains a challenge.

Imposing structure on unstructured data allows us to leverage powerful data science and machine learning tools. Structure can be imposed in multiple ways, but vector space representations, with a meaningful distance measure, have proven to be one of the most fruitful.
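
As a rough illustration of a vector space representation paired with a distance measure, the sketch below turns command-line strings into sparse character trigram count vectors and compares them with cosine distance. This is not necessarily the method presented in the talk and it does not use the Vectorizers library; the helper names (ngram_counts, cosine_distance) and the sample command lines are hypothetical, chosen only to stand in for host-based telemetry.

```python
import math
from collections import Counter


def ngram_counts(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a string (a sparse vector)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Cosine distance (1 - cosine similarity) between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty vector has no direction; treat it as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical command lines standing in for host-based telemetry.
commands = [
    "powershell -nop -w hidden -enc SQBFAFgA",
    "powershell -nop -w hidden -enc aQBlAHgA",
    "ls -la /var/log",
]
vectors = [ngram_counts(c) for c in commands]

# The two encoded PowerShell invocations share most of their trigrams,
# so they should sit closer together than either does to the ls command.
d_similar = cosine_distance(vectors[0], vectors[1])
d_different = cosine_distance(vectors[0], vectors[2])
print(f"similar pair: {d_similar:.3f}, different pair: {d_different:.3f}")
```

Once records live in a space like this, distance-based tools such as nearest-neighbor search, dimension reduction, and clustering can be applied to them directly.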

In this talk, we will demonstrate a number of techniques for embedding cyber defense data into vector spaces. We will then discuss how to leverage manifold learning techniques, clustering, and interactive data visualization to broaden our understanding of the data and enrich it with expert feedback.

At the Tutte Institute for Mathematics and Computing (TIMC), we believe in the importance of reproducibility and in making research techniques accessible to the broader cyber defense community. To that end, this talk will leverage several open source libraries and techniques that we have developed at TIMC: Vectorizers, UMAP, HDBSCAN, ThisNotThat and DataMapPlot.

Data Science and AI/Machine Learning
Ballroom