SciPy 2024

How the Scientific Python ecosystem helps answering fundamental questions of the Universe
07-10, 13:55–14:25 (US/Pacific), Room 317

The ATLAS experiment at CERN explores vast amounts of physics data to answer the most fundamental questions of the Universe. The prevalence of Python in scientific computing motivated ATLAS to adopt it for its data analysis workflows while enhancing the user experience. This talk will describe to a broad audience how a large scientific collaboration leverages the power of the Scientific Python ecosystem to tackle domain-specific challenges and advance our understanding of the Cosmos. Through a simplified example of the renowned Higgs boson discovery, attendees will gain insight into how Python libraries are used to discriminate a signal from overwhelming noise, through tasks such as data cleaning, feature engineering, statistical interpretation and visualization at scale.

GitHub repository for the talk: https://github.com/ekourlit/scipy2024-ATLAS-demo


The ATLAS experiment at CERN, in Geneva, Switzerland, studies subatomic particles to seek answers to the most fundamental questions of the Universe. Because interesting subatomic phenomena are rare, the experiment collects vast amounts of data. Those data are further reduced, but physicists still have to analyze hundreds of terabytes for their studies. This analysis has traditionally been carried out with experiment-specific custom C++ frameworks.
In recent years, there has been a broad community-driven shift towards the Scientific Python ecosystem within particle physics. The use of dataframes and NumPy-like idioms for data analysis has enhanced the user experience while providing efficient computations without the need to write optimized low-level routines. The education system follows this trend too, as students learn Python as their primary programming language.
Therefore, the ATLAS collaboration committed to extending and improving its data analysis paradigm by leveraging the Scientific Python ecosystem and its benefits. Open-source libraries have been used to build domain-specific data analytics and visualization workflows. Those libraries belong either to the general Scientific Python ecosystem or to the particle physics-specific Scikit-HEP ecosystem. Furthermore, next-generation C++/Python binding tools have been used to expose pieces of the legacy C++ code, which encapsulates decades of domain-specific development, to the Python environment.
The talk will briefly introduce the big data science conducted at CERN, along with the commensurate amount of data produced and waiting for analysis. The main objective of the talk is to showcase how libraries from the Scientific Python ecosystem have been used to solve a common particle physics problem: the discrimination of a signal of interest from overwhelming noise. By successfully tackling such a problem, the physicists of the ATLAS collaboration discovered the Higgs boson in 2012. As an example, we will use a simplified version of the data analysis workflow used for the Higgs boson discovery. This example was selected because it contains all the usual steps a particle physics data analyst has to undertake to discriminate such a signal: data cleaning and selection, feature engineering, visualization and basic statistical interpretation.
In a Jupyter notebook, we will first show how to load particle physics data in Python using the library uproot. Using the library awkward, the data will be cleaned to remove defects, and only relevant data will be selected for further analysis. Afterwards, some required corrections will be computed and applied to the data. Those corrections are computed by specialized custom C++ libraries, which have been exposed to Python using nanobind. The main quantity through which the Higgs boson manifests itself in the data, the so-called invariant mass, will be computed and filled into a histogram using numpy. Matplotlib will be used to visualize the histogram and numpy to extract some basic statistical measures. The whole workflow will be horizontally scaled using dask, in particular the native dask collection for awkward arrays, dask-awkward. We will demonstrate how dask scaling greatly reduces the time to insight.
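As a rough illustration of the invariant-mass and histogram steps, here is a minimal NumPy-only sketch. It is not the notebook from the talk. It assumes pairs of massless particles described by transverse momentum, pseudorapidity and azimuthal angle, for which m² = 2·pT1·pT2·(cosh Δη − cos Δφ). The function name `invariant_mass`, the hand-made events and the toy spectrum (an exponential background with a Gaussian peak at 125 GeV) are all hypothetical stand-ins for real ATLAS data.

```python
import numpy as np


def invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of pairs of massless particles, in the units of pt."""
    m_squared = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    # Guard against tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(m_squared, 0.0))


# Two hand-made events; the first is a back-to-back pair with pT = 62.5 GeV each.
pt1 = np.array([62.5, 48.0])
eta1 = np.array([0.0, 0.7])
phi1 = np.array([0.0, 1.2])
pt2 = np.array([62.5, 35.0])
eta2 = np.array([0.0, -0.4])
phi2 = np.array([np.pi, -2.0])
pair_mass = invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2)

# Toy mass spectrum (GeV): smoothly falling background plus a narrow peak.
# These numbers are invented for illustration, not measured data.
rng = np.random.default_rng(seed=0)
background = 100.0 + rng.exponential(scale=30.0, size=20_000)
signal = rng.normal(loc=125.0, scale=2.0, size=500)
masses = np.concatenate([background, signal])

# 60 bins of 1 GeV; entries outside the range are ignored by np.histogram.
counts, edges = np.histogram(masses, bins=60, range=(100.0, 160.0))
centers = 0.5 * (edges[:-1] + edges[1:])
```

The first hand-made event evaluates to 125 GeV, since the pair is back to back at equal pseudorapidity. The `counts` and `edges` arrays can be passed directly to Matplotlib's `stairs` for plotting. In a real analysis the kinematic arrays would come from files read with uproot rather than from hand-written values.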
With this contribution, the audience will have the chance to hear the story of how the open-source Scientific Python ecosystem is helping one of the largest international scientific collaborations in history to address its domain-specific problems and advance our understanding of the Universe. The talk targets a wide spectrum of scientists who analyze large amounts of domain-specific data, and it requires no prior knowledge of particle physics. It serves as a demonstration that solutions based on the Scientific Python ecosystem can tackle problems that have so far been addressed by specialized software, without compromising on performance. Furthermore, library developers will have the chance to see how their products are being used for real scientific applications at scale.

Vangelis is a postdoctoral researcher at the Technical University of Munich and a member of the ATLAS Collaboration at CERN. He currently directs the data analytics group of ATLAS, providing technical leadership on the development of the data analysis software and formats that produce the results of hundreds of physics publications per year. His research is focused on enabling efficient analysis of terabytes of experimental data through array-oriented programming methods.

Matthew is a postdoctoral researcher in experimental high energy physics and data science at the Data Science Institute at the University of Wisconsin-Madison (a “data physicist”). He works as a member of the ATLAS collaboration on searches for physics beyond the standard model with experiments performed at CERN's Large Hadron Collider (LHC) in Geneva, Switzerland. He also serves on the executive board of the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP) where he is a researcher and the Analysis Systems Area lead. Matthew has been involved with the SciPy conference since 2019 and serves as a member of the SciPy 2024 Organizing Committee.

Gordon Watts is a professor of physics at the University of Washington, Seattle, a member of the ATLAS experiment at the Large Hadron Collider at CERN, and deputy director of the National Science Foundation's IRIS-HEP Software Institute. He has extensive lecture and tutorial teaching experience in classrooms, labs, and informal tutorial settings. One of his main ATLAS responsibilities is helping to bring Python-based analysis techniques to the ~3000 physicists who are part of the ATLAS experiment.
