Bryce Adelstein Lelbach
Bryce Adelstein Lelbach has spent over a decade developing programming languages, compilers, and software libraries. He is a Principal Architect at NVIDIA, where he leads programming language efforts and drives the technical roadmap for NVIDIA's compute compilers and libraries. Bryce is one of the leaders of the C++ community. He has served as chair of INCITS/PL22, the US standards committee for programming languages and the Standard C++ Library Evolution group. Bryce served as the program chair for the C++Now and CppCon conferences for many years. On the C++ Committee, he has personally worked on concurrency primitives, parallel algorithms, executors, and multidimensional arrays. He is one of the founding developers of the HPX parallel runtime system. Outside of work, Bryce is passionate about airplanes and watches.

Sessions
Block-based programming divides inputs into local arrays that are processed concurrently by groups of threads. Users write sequential array-centric code, and the framework handles parallelization, synchronization, and data movement behind the scenes. This approach aligns well with SciPy's array-centric ethos and has roots in older HPC libraries, such as NWChem’s TCE, BLIS, and ATLAS.
In recent years, many block-based Python programming models for GPUs have emerged, like Triton, JAX/Pallas, and Warp, aiming to make parallelism more accessible for scientists and increase portability.
In this talk, we'll present cuTile and Tile IR, a new Pythonic tile-based programming model and compiler recently announced by NVIDIA. We'll explore cuTile examples from a variety of domains, including a new LLAMA3-based reference app and a port of miniWeather. You'll learn the best practices for writing and debugging block-based Python GPU code, gain insight into how such code performs, and learn how it differs from traditional SIMT programming.
By the end of the session, you'll understand how block-based GPU programming enables more intuitive, portable, and efficient development of high-performance, data-parallel Python applications for HPC, data science, and machine learning.