SciPy 2025

cuTile, the New/Old Kid on the Block: Python Programming Models for GPUs
07-09, 14:35–15:05 (US/Pacific), Ballroom

Block-based programming divides inputs into local arrays that are processed concurrently by groups of threads. Users write sequential array-centric code, and the framework handles parallelization, synchronization, and data movement behind the scenes. This approach aligns well with SciPy's array-centric ethos and has roots in older HPC libraries, such as NWChem’s TCE, BLIS, and ATLAS.

In recent years, many block-based Python programming models for GPUs have emerged, like Triton, JAX/Pallas, and Warp, aiming to make parallelism more accessible for scientists and increase portability.

In this talk, we'll present cuTile and Tile IR, a new Pythonic tile-based programming model and compiler recently announced by NVIDIA. We'll explore cuTile examples from a variety of domains, including a new LLAMA3-based reference app and a port of miniWeather. You'll learn the best practices for writing and debugging block-based Python GPU code, gain insight into how such code performs, and learn how it differs from traditional SIMT programming.

By the end of the session, you'll understand how block-based GPU programming enables more intuitive, portable, and efficient development of high-performance, data-parallel Python applications for HPC, data science, and machine learning.


Context

In block-based programming models, you write seemingly sequential functions that operate on small, local arrays that subdivide your inputs. These functions are then invoked concurrently on multiple instances. Each instance is assigned a group of threads, and array operations are parallelized across those threads. Concurrency and data movement within a group of threads are implicit and abstracted away. This contrasts with models like SIMT, where users must explicitly synchronize and coordinate threads and tensor cores, pipeline data loads, account for memory coalescing, and so on.
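To make the model concrete, here is a minimal sketch in plain NumPy (no GPU and no cuTile API; the Python loop merely stands in for the concurrent launch that a real framework would perform). The kernel body reads as ordinary sequential array code, and tile size, names, and structure are illustrative assumptions:

```python
import numpy as np

TILE = 4  # illustrative tile size; real frameworks tune this per GPU


def tile_kernel(x_tile, y_tile):
    # Written as ordinary sequential array code. In a block-based
    # framework, these element-wise ops would be parallelized
    # implicitly across the group of threads backing this instance.
    return 2.0 * x_tile + y_tile


def launch(x, y):
    out = np.empty_like(x)
    # Each iteration stands in for one kernel instance; on a GPU,
    # all instances would run concurrently.
    for start in range(0, x.size, TILE):
        stop = start + TILE
        out[start:stop] = tile_kernel(x[start:stop], y[start:stop])
    return out


x = np.arange(8, dtype=np.float64)
y = np.ones(8)
print(launch(x, y))  # [ 1.  3.  5.  7.  9. 11. 13. 15.]
```

The point of the model is that `tile_kernel` contains no thread indices, synchronization, or explicit data movement; the framework owns all of that.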

Block-based programming has been a staple in numerical and scientific computing for decades. Examples include NWChem's Tensor Contraction Engine, BLIS, and ATLAS. Block-based programming is a form of array programming, and draws inspiration from languages and frameworks such as APL, MATLAB, and NumPy.

Recently, there has been an explosion of interest in block-based Python programming models targeting GPUs, driven by the machine learning community. Many new Python frameworks have been developed, such as Triton, JAX/Pallas, and Warp. In March 2025, NVIDIA announced a new block-based dialect (cuTile) and compiler stack for CUDA (Tile IR).

Motivation

This trend towards Pythonic block-based models for GPU programming is due to a variety of factors:

  • More and more scientists are programming GPUs, including those who are not experts in concurrency and hardware performance.
  • Block-based code is simpler to design, write, and debug for data-parallel GPU applications.
  • Compilers can reason about block-based programs without the complex and brittle analyses that thread-level code requires.
  • Array-centric paradigms are more intuitive for Python developers familiar with NumPy.
  • Block-based GPU frameworks offer better portability, even as GPU architectures diverge more and more between generations.
  • Block-based models significantly simplify programming machine learning acceleration hardware such as tensor cores.

Simply put, more scientists need to use GPUs, and GPU technology is evolving rapidly, creating a need for higher-level, more portable paradigms.

Results

We'll present the recently announced cuTile and Tile IR. cuTile is a new block-based programming model for NVIDIA's CUDA platform. It is implemented with Numba and a novel compiler stack and intermediate representation called Tile IR. We will reveal further details about these new technologies for the first time during this SciPy session.

We'll explore the use of block-based models across a variety of domains through examples from HPC, data science, and machine learning. We'll show a new reference large-language-model GPU application based on LLAMA3, implemented with a variety of block-based GPU programming frameworks, including cuTile, as well as in traditional SIMT. We'll also present a port of the popular miniWeather HPC mini-app to Python and cuTile. We'll analyze performance results and discuss the tradeoffs between programming models.
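Stencil codes like miniWeather are a natural fit for tiling, with one wrinkle: each tile must also load a small "halo" of neighboring points. The sketch below is an illustrative plain-NumPy analogue of that pattern, not the actual miniWeather port or cuTile code; the loop again stands in for a concurrent launch, and the tile size is an arbitrary assumption:

```python
import numpy as np

TILE = 8  # illustrative tile size


def smooth(x):
    """3-point moving average of the interior, computed tile by tile."""
    out = x.copy()
    n = x.size
    # Each iteration stands in for one concurrent tile instance.
    for start in range(1, n - 1, TILE):
        stop = min(start + TILE, n - 1)
        # Load the tile plus a one-point halo on each side...
        halo = x[start - 1:stop + 1]
        # ...then apply the stencil as ordinary sequential array code.
        out[start:stop] = (halo[:-2] + halo[1:-1] + halo[2:]) / 3.0
    return out
```

In a real block-based framework, the halo load and the stencil arithmetic would each be parallelized across the instance's threads, while the kernel itself stays free of thread indices and explicit synchronization.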

By attending this talk, you will:

  • Learn the best practices for writing block-based Python applications for GPUs.
  • Gain insight into the performance of block-based Python GPU code and how it is actually executed.
  • Discover how to reason about and debug block-based Python GPU applications.
  • Understand the differences between block-based and SIMT programming and when each paradigm should be used.
  • Dive into real examples of block-based Python GPU code.
  • Explore NVIDIA's new cuTile and Tile IR for the first time.

Bryce Adelstein Lelbach has spent over a decade developing programming languages, compilers, and software libraries. He is a Principal Architect at NVIDIA, where he leads programming language efforts and drives the technical roadmap for NVIDIA's compute compilers and libraries. Bryce is one of the leaders of the C++ community. He has served as chair of INCITS/PL22, the US standards committee for programming languages, and of the Standard C++ Library Evolution group. Bryce served as the program chair for the C++Now and CppCon conferences for many years. On the C++ Committee, he has personally worked on concurrency primitives, parallel algorithms, executors, and multidimensional arrays. He is one of the founding developers of the HPX parallel runtime system. Outside of work, Bryce is passionate about airplanes and watches.