Negin Sobhani
Negin Sobhani is a High Performance Computing consultant and computational atmospheric scientist at the National Center for Atmospheric Research (NCAR). She has extensive experience developing and supporting open-source tools and infrastructure that improve the performance and accessibility of Earth System models, bridging the gap between data science, atmospheric science, and software engineering. Her broader work encompasses the development of large-scale distributed training, optimization of resource utilization, and data pipelines across advanced computing environments for geoscience applications.

Sessions
Scaling artificial intelligence (AI) and machine learning (ML) workflows on high-performance computing (HPC) systems presents unique challenges, particularly as models become more complex and data-intensive. This study explores strategies to optimize AI/ML workflows for enhanced performance and resource utilization on HPC platforms.
We investigate advanced parallelization techniques, such as Data Parallelism (DP), Distributed Data Parallel (DDP), and Fully Sharded Data Parallel (FSDP). Implementing memory-efficient strategies, including mixed precision training and activation checkpointing, significantly reduces memory consumption without compromising model accuracy. Additionally, we examine various communication backends (NCCL, MPI, and Gloo) to enhance inter-GPU and inter-node communication efficiency. Special attention is given to the complexities of implementing these backends in HPC environments, providing solutions for proper configuration and execution.
Our findings demonstrate that these optimizations enable stable and scalable AI/ML model training and inference, achieving substantial improvements in training times and resource efficiency. This presentation will detail the technical challenges encountered and the solutions developed, offering insights into effectively scaling AI/ML workflows on HPC systems.
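As a rough illustration of the backend configuration problem mentioned in the abstract, the sketch below shows one way a job launched by an HPC scheduler might translate launcher-specific environment variables into the RANK, WORLD_SIZE, and LOCAL_RANK variables that PyTorch's default environment-variable initialization reads. It is not taken from the presentation: the helper names (`resolve_dist_env`, `choose_backend`, `init_distributed`) are hypothetical, only Slurm's `srun` and Open MPI's `mpirun` are covered, and the MASTER_ADDR fallback is an assumption that only makes sense for a single-node run.

```python
import os
from typing import Dict, Mapping

# Launcher-specific variables, listed as (global rank, world size, local rank).
# Assumption: only Slurm (srun) and Open MPI (mpirun) are handled here.
_LAUNCHER_VARS = (
    ("SLURM_PROCID", "SLURM_NTASKS", "SLURM_LOCALID"),
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE", "OMPI_COMM_WORLD_LOCAL_RANK"),
)


def resolve_dist_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Map launcher variables to the names read by env:// initialization."""
    # A launcher such as torchrun may already have set the standard names.
    if "RANK" in env and "WORLD_SIZE" in env:
        return {
            "RANK": env["RANK"],
            "WORLD_SIZE": env["WORLD_SIZE"],
            "LOCAL_RANK": env.get("LOCAL_RANK", "0"),
        }
    for rank_var, size_var, local_var in _LAUNCHER_VARS:
        if rank_var in env and size_var in env:
            return {
                "RANK": env[rank_var],
                "WORLD_SIZE": env[size_var],
                "LOCAL_RANK": env.get(local_var, "0"),
            }
    # No launcher detected: fall back to a single-process run.
    return {"RANK": "0", "WORLD_SIZE": "1", "LOCAL_RANK": "0"}


def choose_backend(cuda_available: bool) -> str:
    """Pick NCCL for GPU runs and Gloo for CPU-only runs."""
    return "nccl" if cuda_available else "gloo"


def init_distributed() -> None:
    """Initialize torch.distributed; requires PyTorch to be installed."""
    import torch
    import torch.distributed as dist

    os.environ.update(resolve_dist_env(os.environ))
    # Assumption: single-node defaults; a multi-node job must export the
    # hostname of rank 0 as MASTER_ADDR in the job script instead.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    backend = choose_backend(torch.cuda.is_available())
    if backend == "nccl":
        # Bind this process to its node-local GPU before creating the group.
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend=backend)
```

In this sketch each process started by the scheduler would call `init_distributed()` once before wrapping its model for distributed training. Keeping `resolve_dist_env` free of any PyTorch dependency makes the variable mapping easy to check on a login node without GPUs.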