SciPy 2025

Scaling Clustering for Big Data: Leveraging RAPIDS cuML
07-07, 08:00–12:00 (US/Pacific), Room 317

This tutorial will explore GPU-accelerated clustering techniques using RAPIDS cuML, optimizing algorithms like K-Means, DBSCAN, and HDBSCAN for large datasets. Traditional clustering methods struggle with scalability, but GPU acceleration significantly enhances performance and efficiency.

Participants will learn to leverage dimensionality reduction techniques (PCA, t-SNE, UMAP) for better data visualization and apply hyperparameter tuning with Optuna and cuML. The session also includes real-world applications like topic modeling in NLP and customer segmentation. By the end, attendees will be equipped to implement, optimize, and scale clustering algorithms effectively, unlocking faster and more powerful insights in machine learning workflows.


Clustering is a fundamental machine learning technique widely used across industries for applications such as customer segmentation, topic modeling, and anomaly detection. However, traditional clustering algorithms such as K-Means and HDBSCAN struggle with large datasets due to their computational complexity. This tutorial provides a comprehensive overview of several clustering algorithms, including K-Means, DBSCAN, and HDBSCAN, and demonstrates how to leverage NVIDIA cuML to accelerate them, achieving higher performance with minimal code changes.
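
To give a sense of the API involved, here is a minimal sketch of cuML's scikit-learn-style clustering estimators. It assumes a working RAPIDS cuML installation on an NVIDIA GPU; dataset sizes and parameters are illustrative only, not the tutorial's exact material.

    # Minimal sketch: cuML's clustering estimators mirror the scikit-learn API.
    from cuml.cluster import KMeans, DBSCAN, HDBSCAN
    from cuml.datasets import make_blobs

    # Generate synthetic data directly on the GPU (sizes are illustrative).
    X, _ = make_blobs(n_samples=500_000, n_features=16,
                      centers=8, random_state=42)

    # K-Means: the familiar fit/predict interface, executed on the GPU.
    kmeans_labels = KMeans(n_clusters=8, random_state=42).fit_predict(X)

    # DBSCAN and HDBSCAN follow the same pattern.
    dbscan_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
    hdbscan_labels = HDBSCAN(min_cluster_size=50).fit_predict(X)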

Participants will gain insights into the strengths and use cases of each clustering algorithm, enabling data scientists and developers to select the most appropriate method for their specific needs. By harnessing the power of GPUs, we will showcase how common clustering operations can be dramatically accelerated compared to traditional CPU-only systems, significantly reducing computation time and enhancing scalability.

NVIDIA cuML offers an intuitive transition for those familiar with popular clustering algorithms in Python, requiring minimal to no code modifications. We will also cover common debugging and profiling techniques to optimize the performance of your clustering applications, ensuring you can fully exploit the capabilities of GPU acceleration.
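
As an illustration of this zero-code-change path, the sketch below assumes a cuML release that ships the cuml.accel accelerator mode (25.02 or newer) and is meant to run in a Jupyter notebook; the scikit-learn code itself is unmodified.

    # Load cuML's accelerator so that supported scikit-learn estimators
    # transparently dispatch to the GPU (assumes cuml.accel is available).
    %load_ext cuml.accel

    # Unmodified scikit-learn code from here on.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=100_000, n_features=16, random_state=42)
    labels = KMeans(n_clusters=8, random_state=42).fit_predict(X)

    # From the command line, an equivalent is: python -m cuml.accel your_script.py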

Beyond clustering algorithms, the workshop will explore dimensionality reduction techniques such as PCA, t-SNE, and UMAP for visualizing clusters, making complex data more interpretable. Additionally, we will explain how to perform hyperparameter tuning with Optuna and cuML for optimal clustering results. Lastly, we will delve into a few real-world use cases, such as topic modeling for Natural Language Processing (NLP), which helps identify hidden patterns and insights within textual data, and customer segmentation. By the end of the workshop, attendees will be equipped with the knowledge and tools to implement and optimize clustering algorithms effectively, leveraging GPU acceleration to achieve superior performance.
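
To make these two steps concrete, the following sketch combines a cuML UMAP projection for visualization with an Optuna search over HDBSCAN's min_cluster_size. It assumes cuML exposes silhouette_score under cuml.metrics.cluster, and the objective shown is illustrative rather than the tutorial's exact recipe.

    import cupy as cp
    import optuna
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP
    from cuml.metrics.cluster import silhouette_score
    from cuml.datasets import make_blobs

    # Synthetic data standing in for a large real-world dataset.
    X, _ = make_blobs(n_samples=20_000, n_features=32, centers=10, random_state=0)

    # Project to 2D on the GPU for plotting the resulting clusters.
    embedding = UMAP(n_components=2, random_state=0).fit_transform(X)

    def objective(trial):
        # Search over HDBSCAN's main knob; score only non-noise points.
        min_cluster_size = trial.suggest_int("min_cluster_size", 10, 200)
        labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(X)
        mask = labels != -1
        if int(mask.sum()) < 2 or cp.unique(labels[mask]).size < 2:
            return -1.0  # degenerate clustering, penalize
        return float(silhouette_score(X[mask], labels[mask]))

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
    print(study.best_params)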

A similar topic was presented at other events:
Accelerate Clustering Algorithms to Achieve the Highest Performance


Installation Instructions

Pull the RAPIDS container; access to GPUs will be provided.

Prerequisites
  1. Familiarity with Python and basic data science libraries (NumPy, Pandas, Scikit-Learn)
  2. Basic understanding of machine learning concepts
  3. Experience with Jupyter Notebooks or Python scripting

Allison is a Developer Advocate for GPU-accelerated AI APIs, libraries, and tools at NVIDIA, with a specialization in advanced data science techniques and large language models (LLMs). She brings over eight years of hands-on experience as a data scientist, focusing on managing and delivering end-to-end data science solutions. Her academic background includes a strong emphasis on data science, natural language processing (NLP), and generative AI. Allison holds a master’s degree in Applied Statistics from Cornell University and a master’s degree in Computer Science from San Francisco Bay University.