Negin Sobhani SciPy 2025

Negin Sobhani

Negin Sobhani is a High Performance Computing consultant and computational atmospheric scientist at the National Center for Atmospheric Research (NCAR). She has extensive experience developing and supporting open-source tools and infrastructure that improve the performance and accessibility of Earth System models, bridging the gap between data science, atmospheric science, and software engineering. Her broader work encompasses the development of large-scale distributed training, optimization of resource utilization, and data pipelines across advanced computing environments for geoscience applications.

Sessions

07-08

13:30

240min

Hierarchical Data Analysis with Xarray DataTree & Zarr

Deepak Cherian, Ian Hunt-Isaak, Eniola Awowale, Tom Nicholas, Scott Henderson, Justus Magin, Joe Hamman, Negin Sobhani

Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets. Many real-world datasets have a hierarchical or heterogeneous structure and are best organized as groups of related data arrays. Through xarray.DataTree, the xarray data model now supports opening datasets with a hierarchical structure of groups, such as HDF5 files and Zarr stores. This expanded data model is general enough to manage data across scientific disciplines, including the geosciences and biosciences. This hands-on tutorial focuses on intermediate and advanced workflows that use xarray to analyze real-world hierarchical data.

Scaling AI/ML Workflows on HPC for Geoscientific Applications.

Negin Sobhani

Scaling artificial intelligence (AI) and machine learning (ML) workflows on high-performance computing (HPC) systems presents unique challenges, particularly as models become more complex and data-intensive. This study explores strategies to optimize AI/ML workflows for enhanced performance and resource utilization on HPC platforms.

We investigate advanced parallelization techniques, such as Data Parallelism (DP), Distributed Data Parallel (DDP), and Fully Sharded Data Parallel (FSDP). Implementing memory-efficient strategies, including mixed precision training and activation checkpointing, significantly reduces memory consumption without compromising model accuracy. Additionally, we examine various communication backends (i.e., NCCL, MPI, and Gloo) to enhance inter-GPU and inter-node communication efficiency. Special attention is given to the complexities of implementing these backends in HPC environments, providing solutions for proper configuration and execution.
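To give a feel for why sharding matters, the sketch below estimates per-GPU memory for model states when every GPU keeps a full replica (as in DDP) versus when the states are split evenly across GPUs (as in fully sharded training). It is back-of-the-envelope arithmetic, not the method used in the talk. It assumes mixed-precision training with an Adam-style optimizer at roughly 16 bytes per parameter, and it ignores activations, buffers, and temporary communication memory. The function name and the 1-billion-parameter model are hypothetical.

```python
# Assumed accounting for mixed-precision Adam, in bytes per parameter:
#   2 (fp16 weights) + 2 (fp16 gradients)
#   + 12 (fp32 master weights, momentum, and variance)
BYTES_PER_PARAM = 2 + 2 + 12


def per_gpu_state_bytes(num_params: int, num_gpus: int, sharded: bool) -> float:
    """Approximate model-state bytes held by each GPU.

    sharded=False models full replication on every GPU (DDP-like).
    sharded=True models an even split across all GPUs (FSDP-like).
    """
    total = num_params * BYTES_PER_PARAM
    return total / num_gpus if sharded else float(total)


# Hypothetical model: 1 billion parameters trained on 8 GPUs.
replicated = per_gpu_state_bytes(1_000_000_000, 8, sharded=False)
sharded = per_gpu_state_bytes(1_000_000_000, 8, sharded=True)

print(f"replicated: {replicated / 1e9:.1f} GB per GPU")
print(f"sharded:    {sharded / 1e9:.1f} GB per GPU")
```

Under these assumptions the replicated case needs about 16 GB per GPU for model states alone, while the sharded case needs about 2 GB. Real measurements will differ once activations and communication buffers are included.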

Our findings demonstrate that these optimizations enable stable and scalable AI/ML model training and inference, achieving substantial improvements in training times and resource efficiency. This presentation will detail the technical challenges encountered and the solutions developed, offering insights into effectively scaling AI/ML workflows on HPC systems.

Machine Learning, Data Science, and Explainable AI

Room 315
