SciPy 2025

Scaling Clustering for Big Data: Leveraging RAPIDS cuML
07-07, 08:00–12:00 (US/Pacific), Room 317

This tutorial will explore GPU-accelerated clustering techniques using RAPIDS cuML, optimizing algorithms like K-Means, DBSCAN, and HDBSCAN for large datasets. Traditional clustering methods struggle with scalability, but GPU acceleration significantly enhances performance and efficiency.

Participants will learn to leverage dimensionality reduction techniques (PCA, t-SNE, UMAP) for better data visualization and apply hyperparameter tuning with Optuna and cuML. The session also includes real-world applications like topic modeling in NLP and customer segmentation. By the end, attendees will be equipped to implement, optimize, and scale clustering algorithms effectively, unlocking faster and more powerful insights in machine learning workflows.


Clustering is a fundamental machine learning technique widely used across industries for applications such as customer segmentation, topic modeling, and anomaly detection. However, traditional clustering algorithms such as K-Means and HDBSCAN struggle with large datasets due to their computational complexity. This tutorial provides a comprehensive overview of several clustering algorithms, including K-Means, DBSCAN, and HDBSCAN, and demonstrates how to leverage NVIDIA cuML to accelerate them, achieving higher performance with minimal code changes.
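
To give a sense of the API involved, here is a minimal sketch of cuML's scikit-learn-style clustering estimators. It assumes a working RAPIDS cuML installation on an NVIDIA GPU; dataset sizes and parameters are illustrative only, not the tutorial's exact material.

    # Minimal sketch: cuML's clustering estimators mirror the scikit-learn API.
    from cuml.cluster import KMeans, DBSCAN, HDBSCAN
    from cuml.datasets import make_blobs

    # Generate synthetic data directly on the GPU (sizes are illustrative).
    X, _ = make_blobs(n_samples=500_000, n_features=16,
                      centers=8, random_state=42)

    # K-Means: the familiar fit/predict interface, executed on the GPU.
    kmeans_labels = KMeans(n_clusters=8, random_state=42).fit_predict(X)

    # DBSCAN and HDBSCAN follow the same pattern.
    dbscan_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
    hdbscan_labels = HDBSCAN(min_cluster_size=50).fit_predict(X)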

Participants will gain insights into the strengths and use cases of each clustering algorithm, enabling data scientists and developers to select the most appropriate method for their specific needs. By harnessing the power of GPUs, we will showcase how common clustering operations can be dramatically accelerated compared to traditional CPU-only systems, significantly reducing computation time and enhancing scalability.

NVIDIA cuML offers an intuitive transition for those familiar with popular clustering algorithms in Python, requiring minimal to no code modifications. We will also cover common debugging and profiling techniques to optimize the performance of your clustering applications, ensuring you can fully exploit the capabilities of GPU acceleration.
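
As an illustration of this zero-code-change path, the sketch below assumes a cuML release that ships the cuml.accel accelerator mode (25.02 or newer) and is meant to run in a Jupyter notebook; the scikit-learn code itself is unmodified.

    # Load cuML's accelerator so that supported scikit-learn estimators
    # transparently dispatch to the GPU (assumes cuml.accel is available).
    %load_ext cuml.accel

    # Unmodified scikit-learn code from here on.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=100_000, n_features=16, random_state=42)
    labels = KMeans(n_clusters=8, random_state=42).fit_predict(X)

    # From the command line, an equivalent is: python -m cuml.accel your_script.py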

Beyond clustering algorithms, the workshop will explore dimensionality reduction techniques such as PCA, t-SNE, and UMAP for visualizing clusters, making complex data more interpretable. Additionally, we will explain how to perform hyperparameter tuning with Optuna and cuML for optimal clustering results. Lastly, we will delve into a few real-world use cases, such as topic modeling for Natural Language Processing (NLP), which helps identify hidden patterns and insights within textual data, and customer segmentation. By the end of the workshop, attendees will be equipped with the knowledge and tools to implement and optimize clustering algorithms effectively, leveraging GPU acceleration to achieve superior performance.
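
To make these two steps concrete, the following sketch combines a cuML UMAP projection for visualization with an Optuna search over HDBSCAN's min_cluster_size. It assumes cuML exposes silhouette_score under cuml.metrics.cluster, and the objective shown is illustrative rather than the tutorial's exact recipe.

    import cupy as cp
    import optuna
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP
    from cuml.metrics.cluster import silhouette_score
    from cuml.datasets import make_blobs

    # Synthetic data standing in for a large real-world dataset.
    X, _ = make_blobs(n_samples=20_000, n_features=32, centers=10, random_state=0)

    # Project to 2D on the GPU for plotting the resulting clusters.
    embedding = UMAP(n_components=2, random_state=0).fit_transform(X)

    def objective(trial):
        # Search over HDBSCAN's main knob; score only non-noise points.
        min_cluster_size = trial.suggest_int("min_cluster_size", 10, 200)
        labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(X)
        mask = labels != -1
        if int(mask.sum()) < 2 or cp.unique(labels[mask]).size < 2:
            return -1.0  # degenerate clustering, penalize
        return float(silhouette_score(X[mask], labels[mask]))

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
    print(study.best_params)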

A similar topic was presented at other events:
Accelerate Clustering Algorithms to Achieve the Highest Performance


Installation Instructions

Pull the RAPIDS container; access to GPUs will be provided.

Prerequisites
  1. Familiarity with Python and basic data science libraries (NumPy, Pandas, Scikit-Learn)
  2. Basic understanding of machine learning concepts
  3. Experience with Jupyter Notebooks or Python scripting

Allison is a Developer Advocate for GPU-accelerated AI APIs, libraries, and tools at NVIDIA, with a specialization in advanced data science techniques and large language models (LLMs). She brings over eight years of hands-on experience as a data scientist, focusing on managing and delivering end-to-end data science solutions. Her academic background includes a strong emphasis on data science, natural language processing (NLP), and generative AI. Allison holds a master’s degree in Applied Statistics from Cornell University and a master’s degree in Computer Science from San Francisco Bay University.