SciPy 2025

Enabling Innovative Analysis on Heterogeneous Clusters through HTCdaskgateway
07-10, 16:30–17:00 (US/Pacific), Room 317

High energy physics (HEP) research is going through fundamental changes as we move to collect larger amounts of data from the Large Hadron Collider (LHC). Analysis facilities and distributed computing on High Throughput Computing (HTC) clusters have come together to support the next, more pythonic generation of analysis. A key piece is htcdaskgateway, a Dask Gateway extension that lets users spawn workers compatible with both their analysis and heterogeneous clusters while meeting authentication requirements. This enables physicists to engage with scientific Python in ways they could not before, because analysis has traditionally relied on domain-specific C++ tools. One example of htcdaskgateway in use is Fermilab’s Elastic Analysis Facility.


Introduction:

High energy physics (HEP) research is going through fundamental changes as we move to collect larger amounts of data from the Large Hadron Collider (LHC). In HEP, we parse this collected data to perform statistical analysis, but the process remains relatively inefficient because of the sheer amount of data and the difficulty of developing an analysis. Given these challenges, there has been research into both how we do analysis and new software tools. One part of the solution is an emphasis on distributed computing, specifically High Throughput Computing (HTC). Another part is the Analysis Facility, which can be loosely thought of as a suite of tools aimed at making a coherent analysis ecosystem, using software such as JupyterHub and systems such as Kubernetes. Analysis facilities enable users to build more pythonic analyses that incorporate libraries used across the PyHEP and Scikit-HEP ecosystem, such as matplotlib and hist, which have, until now, been less common in HEP research. For analysis facilities to provide distributed computing, they must be able to communicate properly with schedulers and workers.

The htcdaskgateway:

Htcdaskgateway is a Dask Gateway extension that allows communication between non-Dask workers, for example HTCondor workers, and a Dask system on an OKD Kubernetes cluster where, for authentication, the jobs must be submitted by the user. OKD is a community-driven Kubernetes distribution for application management. Analysis facilities rely on software like htcdaskgateway to allow users to spawn workers compatible with both their analysis and the heterogeneous clusters they are working on. Like Dask Gateway, it lets users choose images, which gives flexibility in the analysis pipeline. Specifically, htcdaskgateway pre-configures an image with Coffea, a new pythonic analysis framework that is being developed to make HEP analysis easier and that rests on several scientific Python libraries. This allows physicists to engage with scientific Python in ways they could not before, in large part because analysis was done through traditional domain-specific C++ tools. It means access to awkward-array, vector, SciPy, pyhf, numba, and much more. A concrete example of htcdaskgateway in use is Fermilab’s Elastic Analysis Facility, which is supporting and pioneering the next generation of analysis, specifically the application of Coffea and Python.
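To make the user-facing workflow concrete, here is a minimal sketch of the usual Dask Gateway client pattern: request a cluster, scale workers, submit tasks, and shut down. The helper name run_on_gateway and its parameters are hypothetical and introduced only for illustration. It assumes the gateway object passed in follows the Dask Gateway client interface (new_cluster, scale, get_client, shutdown); that htcdaskgateway exposes a compatible object is an assumption based on its description as a Dask Gateway extension, not something confirmed here.

```python
from typing import Any, Callable, Iterable, List


def run_on_gateway(
    gateway: Any,
    task: Callable[[Any], Any],
    inputs: Iterable[Any],
    n_workers: int = 4,
) -> List[Any]:
    """Run `task` on each input using workers requested through `gateway`.

    `gateway` is assumed to behave like a Dask Gateway client object.
    This helper is a hypothetical illustration, not part of any library.
    """
    # Ask the gateway to start a new cluster (the Dask scheduler side).
    cluster = gateway.new_cluster()
    try:
        # Request workers. On an HTC-backed facility these would presumably
        # be launched as batch jobs submitted under the user's identity.
        cluster.scale(n_workers)

        # Obtain a client connected to the new cluster.
        client = cluster.get_client()

        # Submit one task per input, then gather the results in order.
        futures = [client.submit(task, item) for item in inputs]
        return [future.result() for future in futures]
    finally:
        # Always release the cluster, even if a task raised an error.
        cluster.shutdown()
```

With a reachable gateway server, the gateway argument would typically be an instance of dask_gateway.Gateway; on an analysis facility the facility-provided gateway object would take its place, under the compatibility assumption above. Passing the gateway in as an argument keeps the analysis code independent of which backend provides the workers.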

Conclusion:

HEP analysis is changing, and htcdaskgateway is a large part of that change. We hope to enable many more users to perform large-scale analyses with the benefits of scientific Python through Dask. While it needs more development to become a fully generalized tool, we are on our way to a tool that could connect other physicists, and possibly other scientists who need authentication tied to the user, to scientific Python.

Audience:

This talk will appeal to physicists and research software engineers interested in making scientific Python libraries available to general researchers on a heterogeneous cluster for distributed computing. It will also interest students, professionals, and academics who want to know more about connecting Dask, Kubernetes, and HTC/HPC systems.

The audience, regardless of level, will learn about Dask and OKD in the context of Dask Gateway and htcdaskgateway. They will also learn about the use of scientific Python in the HEP community.

I am a graduate student at the University of Wisconsin-Madison working on a PhD in High Energy Experimental Particle Physics with the CMS experiment. I am primarily interested in software development for science and I hope to go into research software engineering so I can support scientific development through robust and sustainable software. I also work with Fermi National Accelerator Laboratory (Fermilab).