SciPy 2023

Taming Black Swans: Long-tailed distributions in the natural and engineered world
07-13, 15:50–16:20 (America/Chicago), Amphitheater 204

Long-tailed distributions are common in natural and engineered systems; as a result, we encounter extreme values more often than we would expect from a short-tailed distribution. If we are not prepared for these "black swans", they can be disastrous.

But we have statistical tools for identifying long-tailed distributions, estimating their parameters, and making better predictions about rare events.

In this talk, I present evidence of long-tailed distributions in a variety of datasets -- including earthquakes, asteroids, and stock market crashes -- discuss statistical methods for dealing with them, and show implementations using scientific Python libraries.


You would think we'd be better prepared for disaster. But events like
Hurricane Katrina in 2005, which caused catastrophic flooding in New
Orleans, and Hurricane Maria in 2017, which caused damage in Puerto Rico
that has still not been repaired, show that large-scale disaster
response is often inadequate. Even wealthy countries -- with large
government agencies that respond to emergencies and well-funded
organizations that provide disaster relief -- have been caught
unprepared time and again.

There are many reasons for these failures, but one of them is that rare,
large events are fundamentally hard to comprehend. Because they are
rare, it is hard to get the data we need to estimate their likelihood precisely.
And because they are large, they challenge our ability to imagine
quantities that are orders of magnitude bigger than what we experience
in ordinary life.

In terms introduced by Nassim Taleb, a "black swan" is a large, impactful event that was
considered extremely unlikely before it happened, based on a model of
prior events. If the distribution of event sizes is actually long-tailed
and the model is Gaussian, black swans will happen with some regularity.
However, black swans can be "tamed" by using appropriate models, including lognormal, Student t, and Pareto distributions.
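
To make the gap between a Gaussian model and a long-tailed one concrete, here is a minimal sketch that compares the probability of exceeding the same threshold under a standard normal distribution and under a Student t distribution. The threshold of 6 and the choice of 3 degrees of freedom are illustrative assumptions, not values from the talk.

```python
from scipy.stats import norm, t

# Hypothetical threshold: an event 6 scale units above the center.
threshold = 6.0

# sf(x) is the survival function, P(X > x), i.e. the upper-tail probability.
p_gauss = norm.sf(threshold)

# df=3 is an assumed, illustrative value that gives a heavy tail.
p_t = t.sf(threshold, df=3)

# How many times more likely the event is under the long-tailed model.
ratio = p_t / p_gauss

print(f"Gaussian tail probability:  {p_gauss:.3g}")
print(f"Student t tail probability: {p_t:.3g}")
print(f"Ratio (t / Gaussian):       {ratio:.3g}")
```

Under these assumptions, an event that the Gaussian model treats as practically impossible has a small but non-negligible probability under the Student t model, which is the sense in which a Gaussian model produces black swans.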

In this talk, I introduce these distributions and show how they can be used to model measurements from natural and engineered systems -- including earthquakes, craters on the moon, solar flares, file sizes, and stock market crashes. We will use distributions and optimization tools from SciPy to estimate parameters and generate predictions, and Matplotlib to visualize the results.
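
As a rough sketch of what estimating parameters with SciPy can look like, the example below fits the shape parameter of a Pareto distribution by maximum likelihood with `scipy.optimize.minimize_scalar`, then uses the fitted model to predict the probability of a very large event. The data are synthetic, and the sample size, true shape, search bounds, and threshold are all assumptions chosen for illustration; this is not the analysis presented in the talk.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import pareto

# Synthetic data (an assumption for illustration): Pareto samples with
# minimum value 1 and a known shape parameter, so the fit can be checked.
rng = np.random.default_rng(42)
alpha_true = 1.5
sample = pareto.rvs(alpha_true, size=5000, random_state=rng)

def neg_log_likelihood(alpha):
    """Negative log-likelihood of the sample for a given shape alpha."""
    return -np.sum(pareto.logpdf(sample, alpha))

# Search for the shape that minimizes the negative log-likelihood.
result = minimize_scalar(neg_log_likelihood, bounds=(0.1, 10), method="bounded")
alpha_hat = result.x

# For a Pareto distribution with known minimum 1, the maximum likelihood
# estimate also has a closed form, which serves as a check on the optimizer.
alpha_closed_form = len(sample) / np.sum(np.log(sample))

# Predicted probability of an event at least 100 times the minimum size.
p_large = pareto.sf(100, alpha_hat)

print(f"Estimated shape (optimizer):   {alpha_hat:.3f}")
print(f"Estimated shape (closed form): {alpha_closed_form:.3f}")
print(f"P(size > 100): {p_large:.3g}")
```

With real measurements, the minimum size and the location of the distribution would usually also need to be estimated or chosen, and a plot of the empirical and fitted tail distributions on a log-log scale is a useful check on the fit.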

Allen Downey is a curriculum designer at the online learning company Brilliant and professor emeritus at Olin College. He is the author of several books related to computer science and data science, including Think Python, Think Stats, and Think Bayes. His blog, Probably Overthinking It, features articles about Bayesian statistics. He received his Ph.D. in Computer Science from U.C. Berkeley, and M.S. and B.S. degrees from MIT.