07-09, 13:15–13:45 (US/Pacific), Room 315
Training Large Language Models (LLMs) requires processing datasets at massive scale. Traditional CPU-based data pipelines struggle to keep pace with the exponential growth of data, creating bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable, Python-based framework for efficiently curating high-quality LLM training datasets. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction, improving both data quality and training efficiency.
We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, yielding up to a 7% improvement on LLM downstream tasks. Attendees will gain insight into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.
The development and performance of Large Language Models (LLMs) are increasingly constrained by the availability of high-quality, diverse, and representative datasets. Traditional data collection and curation methods face challenges of cost, scalability, bias, and ethics, which often limit model performance. Ensuring that training data is clean, deduplicated, and well structured is critical for achieving superior accuracy and efficiency in LLMs. This session introduces NeMo Curator, an open-source, GPU-accelerated data curation framework designed to scale data processing across multi-node, multi-GPU systems, enabling efficient preparation of terabyte-scale datasets for AI training.
One of the key innovations of NeMo Curator is its modular, scalable pipeline architecture, which provides an end-to-end workflow for data cleaning, filtering, and deduplication. By integrating semantic deduplication, heuristic filtering, classification, and personally identifiable information (PII) redaction, NeMo Curator helps reduce noise and redundancy in training data, ultimately improving LLM performance by up to 7% on downstream tasks. Unlike conventional CPU-based preprocessing workflows, which can be slow and computationally expensive, NeMo Curator leverages NVIDIA RAPIDS and distributed computing to accelerate dataset processing, significantly reducing training bottlenecks.
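For illustration, here is a minimal sketch of such a modular pipeline in Python. It assumes the nemo_curator package with the DocumentDataset, Sequential, ScoreFilter, and WordCountFilter names used in recent releases (exact module paths may differ by version, and paths such as raw_data/ are placeholders):

    # Minimal sketch of a modular NeMo Curator filtering pipeline.
    # Assumes recent nemo_curator releases; names may vary by version.
    from nemo_curator import ScoreFilter, Sequential
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.filters import WordCountFilter

    # Load raw documents (JSONL files with a "text" field) from a directory.
    dataset = DocumentDataset.read_json("raw_data/")

    # Chain modular steps; each step consumes and returns a dataset.
    pipeline = Sequential([
        # Drop documents too short to be useful for training.
        ScoreFilter(WordCountFilter(min_words=80)),
    ])

    curated = pipeline(dataset)
    curated.to_json("curated_data/")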
Beyond exact duplicate removal, semantic deduplication prevents models from overfitting to semantically identical but lexically different text, a common issue in web-scale datasets. Additionally, NeMo Curator supports automated text classification, allowing users to filter out low-quality data and balance dataset distributions for fairer, more robust model training. The PII redaction module supports compliance with data privacy regulations, making NeMo Curator a valuable tool for healthcare, finance, and other enterprise AI applications.
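To make the semantic deduplication idea concrete, the following is a from-scratch sketch of embedding-based near-duplicate detection. It illustrates the technique rather than NeMo Curator's internal implementation, and the sentence-transformers model name and 0.9 threshold are illustrative choices:

    # Illustrative sketch: flag documents whose embeddings are nearly
    # identical even when their wording differs. Not NeMo Curator's
    # internal implementation; model and threshold are examples.
    from sentence_transformers import SentenceTransformer

    docs = [
        "The cat sat on the mat.",
        "A cat was sitting on the mat.",   # lexically different, same meaning
        "Stock markets rallied on Friday.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(docs, normalize_embeddings=True)  # unit vectors

    # With unit vectors, cosine similarity reduces to a dot product.
    sim = emb @ emb.T
    threshold = 0.9  # tune per corpus

    kept = []
    for i in range(len(docs)):
        # Keep a document only if it is not too similar to any kept one.
        if all(sim[i, j] < threshold for j in kept):
            kept.append(i)

    print([docs[i] for i in kept])  # the near-duplicate is dropped

At web scale, pairwise comparison is infeasible, so production systems typically cluster embeddings first and compare documents only within clusters.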
This talk will provide a hands-on walkthrough of NeMo Curator’s data processing pipelines, demonstrating how to scale LLM dataset curation across multi-node environments. Attendees will learn how to configure pipelines for text deduplication, classification, and data quality improvement—all implemented using Python-based APIs in Jupyter Notebooks. Through real-world case studies, we will highlight how NeMo Curator enables organizations to preprocess large-scale datasets more efficiently, making LLM training more scalable, cost-effective, and accurate.
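Because NeMo Curator builds on Dask for distributed execution, scaling from a notebook to multiple GPUs largely comes down to attaching a distributed client before running a pipeline. A minimal single-node sketch, assuming the dask_cuda package, stands in for the multi-node deployments shown in the talk:

    # Start a GPU-backed Dask cluster and attach a client before running
    # curation pipelines. Single-node stand-in for multi-node deployments,
    # where workers would instead join a shared scheduler across nodes.
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    cluster = LocalCUDACluster()  # one Dask worker per local GPU
    client = Client(cluster)

    print(client)  # shows the workers/GPUs available to the pipeline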
By the end of this session, attendees will understand the challenges of LLM data processing and why scalable solutions are necessary, learn how NeMo Curator accelerates dataset curation using GPU-based optimizations, explore semantic deduplication, filtering, and classification techniques for improving dataset quality, and gain hands-on experience configuring, optimizing, and deploying NeMo Curator pipelines for real-world AI applications.
Detailed Outline:
1. Challenges in LLM Data Processing (5 min)
2. Introducing NeMo Curator (10 min)
3. Core Functionalities & Workflow (5 min)
4. Hands-on Demonstration (5 min)
5. Real-World Applications & Q&A (5 min)
Similar Topic Presented at Other Events:
Scaling Data Processing for Training Large Language Models
Allison is a Developer Advocate for GPU-accelerated AI APIs, libraries, and tools at NVIDIA, specializing in advanced data science techniques and large language models (LLMs). She brings over eight years of hands-on experience as a data scientist, focused on managing and delivering end-to-end data science solutions. Her academic background emphasizes data science, natural language processing (NLP), and generative AI. Allison holds a master’s degree in Applied Statistics from Cornell University and a master’s degree in Computer Science from San Francisco Bay University.