07-09, 15:25–15:55 (US/Pacific), Room 315
Scaling artificial intelligence (AI) and machine learning (ML) workflows on high-performance computing (HPC) systems presents unique challenges, particularly as models become more complex and data-intensive. This study explores strategies to optimize AI/ML workflows for enhanced performance and resource utilization on HPC platforms.
We investigate advanced parallelization techniques, such as Data Parallelism (DP), Distributed Data Parallel (DDP), and Fully Sharded Data Parallel (FSDP). Implementing memory-efficient strategies, including mixed precision training and activation checkpointing, significantly reduces memory consumption without compromising model accuracy. Additionally, we examine various communication backends (NCCL, MPI, and Gloo) to enhance inter-GPU and inter-node communication efficiency. Special attention is given to the complexities of deploying these backends in HPC environments, providing solutions for proper configuration and execution.
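As a minimal sketch of how these pieces combine in PyTorch, the snippet below wraps a toy model (the `Net` class and its sizes are illustrative, not from the talk) in DDP, runs the forward pass under an autocast region for mixed precision, and applies activation checkpointing to one block. It uses a single-process Gloo group on CPU so it runs anywhere; a real HPC job would launch one process per GPU (e.g. via `torchrun` or `srun`) and typically use the NCCL backend.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class Net(torch.nn.Module):
    """Toy model used only to illustrate the techniques."""

    def __init__(self) -> None:
        super().__init__()
        self.block = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU())
        self.head = torch.nn.Linear(32, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activation checkpointing: `block`'s activations are recomputed
        # during backward instead of stored, trading compute for memory.
        h = checkpoint(self.block, x, use_reentrant=False)
        return self.head(h)


def train_step(rank: int = 0, world_size: int = 1) -> float:
    # Single-process Gloo group for illustration; on a cluster the launcher
    # sets rank/world_size per process and NCCL handles GPU tensors.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500",
        rank=rank, world_size=world_size,
    )
    model = DDP(Net())
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x, y = torch.randn(4, 8), torch.randn(4, 1)

    # Mixed precision: forward pass runs in bfloat16 on CPU (float16 plus a
    # GradScaler would be typical on GPU); the loss is cast back to float32.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x).float(), y)

    loss.backward()  # DDP all-reduces gradients across ranks here
    opt.step()
    dist.destroy_process_group()
    return loss.item()
```

Swapping `DDP` for `torch.distributed.fsdp.FullyShardedDataParallel` additionally shards parameters, gradients, and optimizer state across ranks, which is what makes FSDP attractive for models that exceed a single GPU's memory.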
Our findings demonstrate that these optimizations enable stable and scalable AI/ML model training and inference, achieving substantial improvements in training times and resource efficiency. This presentation will detail the technical challenges encountered and the solutions developed, offering insights into effectively scaling AI/ML workflows on HPC systems.
Scaling AI and ML workloads on HPC platforms demands a specialized approach to ensure efficiency and accuracy. This presentation focuses on methods for optimizing AI/ML workflows as they grow increasingly complex and data-intensive for geoscientific applications. We explore parallelization strategies, ranging from standard Data Parallelism (DP) to advanced techniques like Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP), and pair them with mixed-precision training and activation checkpointing to significantly reduce GPU memory footprints.
Moreover, we address the challenges of configuring diverse communication backends (NCCL, MPI, and Gloo) within HPC environments, outlining practical solutions for seamless data exchange across both single-node multi-GPU and multi-node multi-GPU setups. Our presentation demonstrates how these combined optimizations can deliver stable and scalable model training and inference while significantly reducing training time and resource usage. Participants will gain actionable insights into the core technical obstacles and proven strategies for streamlining large-scale AI/ML workflows on HPC infrastructures.
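One recurring configuration question the paragraph above alludes to is which backend to initialize for a given setup. A hedged sketch of that decision, assuming the environment variables that launchers such as `torchrun` export per process (the fallback values here are placeholders for a single-process run): NCCL for GPU tensors, MPI when PyTorch was built against the system MPI (common on HPC sites), and Gloo as the portable CPU fallback.

```python
import os
import torch
import torch.distributed as dist


def pick_backend() -> str:
    # NCCL is preferred for GPU communication; MPI requires a PyTorch build
    # compiled against the system MPI; Gloo works everywhere on CPU.
    if torch.cuda.is_available() and dist.is_nccl_available():
        return "nccl"
    if dist.is_mpi_available():
        return "mpi"
    return "gloo"


def init_from_env() -> str:
    # Launchers like torchrun (or an srun wrapper) export these per process;
    # the defaults below only make a single-process demo self-contained.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    backend = pick_backend()
    dist.init_process_group(backend, init_method="env://")
    chosen = backend if dist.is_initialized() else "uninitialized"
    dist.destroy_process_group()
    return chosen
```

The same code path then works unchanged for single-node multi-GPU and multi-node multi-GPU jobs, since the launcher, not the script, decides rank placement.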
Negin Sobhani is a High Performance Computing consultant and computational atmospheric scientist at the National Center for Atmospheric Research (NCAR). She has extensive experience developing and supporting open-source tools and infrastructure that improve the performance and accessibility of Earth System models, bridging the gap between data science, atmospheric science, and software engineering. Her broader work encompasses the development of large-scale distributed training, optimization of resource utilization, and data pipelines across advanced computing environments for geoscience applications.