SciPy 2024

Dante’s Externo: Injecting Python Functions into a Template-Driven CUDA C++ Framework
07-11, 16:30–17:00 (US/Pacific), Ballroom

Non-Python codebases that use metaprogramming present significant challenges to cross-language development. These challenges are further compounded with the inclusion of GPU processing. While common methods of Python/GPU interoperation are covered by popular Python frameworks, these frameworks do not trivialize this use case.

In this talk, we will discuss the process of integrating a Python code for Monte Carlo particle transport (MCDC) with a template-based CUDA C++ framework which applies inversion of control (Harmonize). We will discuss managing the complexity of cross-language dependency injection, relevant implementation strategies, and pitfalls to avoid.


Background

With the increasing scale of computation in scientific computing, GPU processing has become a central concern to scientific software development. This technological shift, plus the ubiquity of Python in the space, drove the development of GPU processing capabilities in Python frameworks such as CuPy and Numba. To support the use of pre-existing codebases and allow developers to delegate to systems languages, these frameworks provide interoperability with non-Python CPU/GPU code. This includes the ability to compile non-Python code into Python-friendly abstractions, to link to device code binaries, and to compile Python code into linkable binaries.

Motivation

While powerful, these capabilities do not trivialize all scenarios in Python/GPU interop. For example, if a CUDA C++ codebase relies upon templates to provide functionality, the required specializations of those templates must be generated and compiled before linking. In addition, if this CUDA C++ code is meant to provide functionality through inversion of control, that code must be provided with linkable definitions of the Python functions it would be calling. Further complicating matters, in order to use this inversion of control, a Python program must link into the binaries generated after injection and specialization.

This set of difficulties was faced by the developers of MCDC (the Monte Carlo Dynamic Code). While MCDC relies upon Numba to handle JIT compilation on the CPU, the development roadmap for GPU processing includes the integration of Harmonize, which uses CUDA C++ templates to map systems of functions into asynchronous programs that dynamically schedule function calls on-GPU.

Methods

The overarching problem with such an integration is that at least two layers of FFI are required, because compiled CUDA C++ is sandwiched between two layers of Python code. Since all layers are compiled, both interfaces must be managed and linked across. For MCDC and Harmonize, this was handled through the following phases of compilation/linking:

  • Relevant Python functions from MCDC are compiled into PTX and saved as files.
  • Extern CUDA C++ function declarations matching the signatures of the compiled Python functions are generated.
  • Specializations of Harmonize’s program/function specification templates are generated, each of which wraps calls to the external Python functions.
  • The generated CUDA C++ code is compiled into PTX using nvcc.
  • A wrapper kernel in MCDC links to the resulting PTX.
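
The declaration and wrapping phases above can be sketched as a small code generator. This is a minimal illustration, not the MCDC or Harmonize implementation: the `DeviceFn` record, the `mcdc_step` symbol, the `_wrapper` suffix, and the `eval` method name are all hypothetical, and the sketch assumes the compiled Python function is exposed under a plain C calling convention, which may not match the ABI a given Python compiler emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceFn:
    """Hypothetical description of one compiled Python device function."""
    name: str      # symbol name exported by the compiled PTX (assumed)
    ret: str       # C return type
    params: tuple  # C parameter types, in order


def _param_list(fn: DeviceFn) -> str:
    # Name parameters a0, a1, ... so declaration and call site agree.
    return ", ".join(f"{t} a{i}" for i, t in enumerate(fn.params))


def extern_decl(fn: DeviceFn) -> str:
    """Emit an extern declaration matching the compiled function's signature."""
    return f'extern "C" __device__ {fn.ret} {fn.name}({_param_list(fn)});'


def wrapper_struct(fn: DeviceFn) -> str:
    """Emit a struct whose static method forwards to the extern function.

    A type like this could be handed to a template as an argument, so the
    specialized framework code calls back into the Python-defined function.
    """
    call_args = ", ".join(f"a{i}" for i in range(len(fn.params)))
    ret_kw = "" if fn.ret == "void" else "return "
    return (
        f"struct {fn.name}_wrapper {{\n"
        f"    __device__ static {fn.ret} eval({_param_list(fn)}) {{\n"
        f"        {ret_kw}{fn.name}({call_args});\n"
        f"    }}\n"
        f"}};"
    )


# Hypothetical function: takes an opaque state pointer and an index.
step = DeviceFn("mcdc_step", "int", ("void*", "unsigned int"))
source = extern_decl(step) + "\n" + wrapper_struct(step)
print(source)
```

The generated text would then be written to a file, combined with the framework's headers, and passed to nvcc in the compilation phase.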

This strategy applies generally when a Python codebase can both compile functions into linkable IR and link into that IR, and when the executing system has a compiler that can ingest the corresponding non-Python language and produce and link that IR.

While this strategy holds up theoretically, it comes with various pitfalls that can still trip up cautious and technically proficient developers:

  • To declare Python functions as extern functions, their signatures must be derived.
  • To guarantee parameters are handled correctly by the non-Python code, types matching the size and alignment of these parameters must be defined.
  • To link to functions produced through specialization without accounting for mangling, wrapper extern functions calling into said methods must be generated.
  • To ensure all layers of the resulting program handle data structures the same way, the size, alignment, and layout of types must be regulated.
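
One way to approach the size, alignment, and layout pitfalls is to treat a single Python-side description as the source of truth and generate the C++ side from it. The sketch below uses the standard `ctypes` module for this; it is an illustration under stated assumptions, not the approach taken in MCDC. The `Particle` record, its fields, and both helper functions are hypothetical.

```python
import ctypes


class Particle(ctypes.Structure):
    """Hypothetical record passed between the Python and CUDA C++ layers."""
    _fields_ = [
        ("x", ctypes.c_double),
        ("alive", ctypes.c_uint8),
        ("group", ctypes.c_int32),
    ]


def opaque_type_decl(struct_type, c_name):
    """Emit a byte-blob C++ type with the same size and alignment.

    Useful when the C++ layer only needs to pass the value through.
    """
    size = ctypes.sizeof(struct_type)
    align = ctypes.alignment(struct_type)
    return f"struct alignas({align}) {c_name} {{ unsigned char bytes[{size}]; }};"


def c_layout_asserts(struct_type, c_name):
    """Emit static_asserts for a C++ struct that declares the same fields.

    The generated lines assume <cstddef> is included for offsetof.
    """
    lines = [
        f'static_assert(sizeof({c_name}) == {ctypes.sizeof(struct_type)}, "size");',
        f'static_assert(alignof({c_name}) == {ctypes.alignment(struct_type)}, "alignment");',
    ]
    for field_name, _ in struct_type._fields_:
        offset = getattr(struct_type, field_name).offset
        lines.append(
            f'static_assert(offsetof({c_name}, {field_name}) == {offset}, '
            f'"offset of {field_name}");'
        )
    return lines


blob = opaque_type_decl(Particle, "ParticleBlob")
asserts = c_layout_asserts(Particle, "Particle")
print(blob)
print("\n".join(asserts))
```

Placing the generated `static_assert` lines in the CUDA C++ translation unit turns a silent layout mismatch into a compile-time error. Note that `ctypes` reports the host compiler's layout rules, so this check assumes host and device agree on the layout of the types involved.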

Braxton Cuneo is an Assistant Professor at Seattle University's CS department and a member of the PSAAP Center for Exascale Monte Carlo Neutron Transport (CEMeNT). Braxton specializes in GPU and parallel processing, with a focus upon making resource/execution management more efficient and ergonomic.