SciPy’s New Infrastructure for Probability Distributions and Random Variables SciPy 2025

SciPy’s New Infrastructure for Probability Distributions and Random Variables
.ical

2025-07-11 11:25–11:55, Ballroom

The SciPy library provides objects representing well over 100 univariate probability distributions. These have served the scientific Python ecosystem for decades, but they are built upon an infrastructure that has not kept up with the demands of today’s users. To address its shortcomings, SciPy 1.15 includes a new infrastructure for working with probability distributions. This talk will introduce users to the new infrastructure and demonstrate its many advantages in terms of usability, flexibility, accuracy, and performance.

SciPy [1] provides objects that represent well over 100 univariate probability distributions. For example, scipy.stats.norm is one of the most popular, and GitHub code search finds over 100,000 uses [2]. Each object has methods to compute statistics and other functions of the distribution, such as the probability density function (PDF), cumulative distribution function (CDF), inverse CDF (also “percent point function” or PPF), differential entropy, and moments. In theory, all other functions of a distribution follow from one of the others; for instance, the CDF is the integral of the PDF from the left end of the distribution’s support. Although methods of some distributions are overridden for improved performance or accuracy, the infrastructure also provides generic implementations; for instance, the default cdf method integrates the PDF numerically using scipy.integrate.quad. Besides providing generic method implementations, the infrastructure is responsible for ensuring a consistent API, generating documentation, and executing a common test suite.
As useful as the legacy infrastructure has been, users have reported many shortcomings over the past two decades [3]. For example, although the distributions permit vectorized use with array arguments and shape parameters, the generic method implementations work only on scalars, and they must loop in Python over each element of array inputs, eliminating the performance advantage of vectorized code. In some aspects, the API is oppressively self-consistent: all distributions inherit common location and scale (loc and scale) parameters that are not standard in the literature of many distributions, confusing users and leading to problems when fitting distributions to data. Distribution documentation generation involves use of exec and a nonstandard string replacement syntax, and the test suite is not comprehensive, so new bugs continue to be found. Users have also requested new features, which have become increasingly difficult to work into the patchwork codebase of many separate contributions. It has been clear for several years that a fresh start was needed.
The new infrastructure, released with SciPy 1.15.0, addresses these shortcomings. Generic implementations of methods are natively vectorized, leveraging SciPy’s new array API compatible functions for quadrature, series summation, root finding, and minimization [4]. Distributions support multiple parameterizations and do not force the inclusion of loc and scale parameters, so users can work with the parameterizations they are familiar with. Documentation is generated using more modern features of Python (such as f-strings), and the test suite is thorough.
Many new features are implemented atop these solid foundations.
• When a distribution-specific implementation of a method is not available, a decision tree chooses among several generic implementation strategies to ensure efficient computation of accurate results. For instance, central moments of a distribution can be computed by shifting raw moments, scaling standardized moments, numerically integrating the PDF, or numerically integrating the inverse-CDF; the best choice depends on which distribution-specific implementations are available. Using a method argument, the results of independent computation strategies can be compared against one other to assess accuracy – and indeed, this is the foundation of the extensive property-based test suite.
• Instances of distribution classes behave like random variables and can be manipulated with composable transformations including:
o elementary arithmetic operations (i.e. addition/subtraction for shifting location, multiplication/division for scaling) between random variables and arrays;
o functions of random variables (e.g. reciprocal, square and other powers, absolute value, natural exponential and logarithm); and
o truncation of the support.
• Quasi-Monte Carlo samples can be drawn from arbitrary distributions. All distributions now have methods for computing the mode, the inverse of logarithmic distribution functions, and a plot method for convenient visualization. Random variables representing order statistics can be derived from any other random variable.
• Multiple random variables can be combined in mixture models.
The new infrastructure also paves the way toward even more advanced features, such as:
• arithmetic operations between two random variables,
• circular distributions, including wrapped versions of arbitrary distributions,
• support for other Python Array API Compatible [5] backends beyond NumPy [6] (e.g. CuPy [7], PyTorch [8], JAX [9], and Dask [10]).
The talk will introduce users to the new infrastructure and demonstrate its many advantages in terms of usability, flexibility, accuracy, and performance.

References:
[1] SciPy, https://scipy.org/
[2] Code search results for “from scipy.stats import norm” and "scipy.stats.norm", https://github.com/search?q=%22from+scipy.stats+import+norm%22&type=code
[3] Matt Haberland. “RFC: stats: univariate distribution infrastructure”. GitHub scipy/scipy#15928. https://github.com/scipy/scipy/issues/15928
[4] Matt Haberland, Albert Steppi, and Pamphile Roy. “Vectorized Quadrature, Series Summation, Differentiation, Optimization, and Rootfinding in SciPy”. SciPy 2024 Conference. https://doi.org/10.25080/uyyk2727.
[5] Array API Standard, https://data-apis.org/array-api/latest/
[6] NumPy, https://numpy.org/
[7] CuPy, https://cupy.dev/
[8] PyTorch, https://pytorch.org/
[9] JAX, https://jax.readthedocs.io/en/latest/
[10] Dask, https://www.dask.org/

Matt Haberland

Matt Haberland is an Associate Professor at Cal Poly, San Luis Obispo, and a maintainer of SciPy and NumPy.

Albert Steppi

Albert Steppi (@steppi) is a Senior Software Engineer at Quansight Labs and a maintainer of the SciPy library. Previously he worked as a Machine Learning Scientist at Lendbuzz and before that as a Scientific Software Developer in the Labroratory of Systems Pharmacology at Harvard Medical School. He earned a PhD in Statistics from Florida State University in 2018 with research focusing on network aware bioinformatics analysis and biomedical text mining. He is broadly interested in numerical mathematics and scientific and statistical computing.

SciPy’s New Infrastructure for Probability Distributions and Random Variables .ical 2025-07-11 11:25–11:55, Ballroom

SciPy’s New Infrastructure for Probability Distributions and Random Variables
.ical

2025-07-11 11:25–11:55, Ballroom