SciPy 2023

Emukit: Python toolkit for uncertainty quantification
07-12, 14:35–15:05 (America/Chicago), Zlotnik Ballroom

Emukit is an open-source package for uncertainty quantification in Python. It provides various Bayesian methods, such as optimization, experimental design and quadrature, in a flexible, unified way that leverages their commonalities. In the talk we will explain how and why Emukit was built, what its strengths and weaknesses are, how it is used today, and in which scenarios it might be useful.


Description

Emukit is a highly adaptable Python toolkit for enriching decision making under uncertainty. This is particularly pertinent when modelling complex systems where data is scarce or difficult to acquire. In these scenarios, propagating well-calibrated uncertainty estimates within a design loop or computational pipeline ensures that constrained resources are used effectively.

The main features currently available in Emukit are:
* Bayesian optimisation: optimise physical experiments and tune parameters of machine learning algorithms;
* Experimental design/Active learning: design the most informative experiments and perform active learning with machine learning models;
* Sensitivity analysis: analyse the influence of inputs on the outputs of a given system;
* Bayesian quadrature: efficiently compute the integrals of functions that are expensive to evaluate;
* Multi-fidelity emulation: build surrogate models when data is obtained from multiple information sources that have different fidelity and/or cost.
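
As a rough illustration of the sensitivity analysis item in the list above, the sketch below estimates first-order Sobol indices with a plain Monte Carlo "pick-freeze" estimator, one common variance-based way to measure how much each input drives the output. It is not Emukit code and does not use Emukit's API: the names `first_order_sobol` and `linear_model` are hypothetical, the inputs are assumed to be independent and uniform on [0, 1], and the toy function is chosen only because its indices are known analytically.

```python
import random
from statistics import fmean, pvariance


def first_order_sobol(f, dim, n_samples=50_000, seed=0):
    """Monte Carlo estimate of first-order Sobol indices of f on [0, 1]^dim.

    Hypothetical helper for illustration only; not part of Emukit.
    """
    rng = random.Random(seed)
    # Two independent sample matrices A and B with uniform(0, 1) entries.
    a = [[rng.random() for _ in range(dim)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(dim)] for _ in range(n_samples)]
    f_a = [f(row) for row in a]
    f_b = [f(row) for row in b]

    # Output mean and variance, estimated from all base evaluations.
    mean = fmean(f_a + f_b)
    variance = pvariance(f_a + f_b, mu=mean)

    indices = []
    for i in range(dim):
        # Evaluate f on A with column i replaced by column i of B.
        f_ab = [f(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        # Partial variance due to input i (f(B) is centred to reduce noise).
        partial = fmean(
            (yb - mean) * (yab - ya) for yb, yab, ya in zip(f_b, f_ab, f_a)
        )
        indices.append(partial / variance)
    return indices


def linear_model(x):
    """Toy system: the third input has no effect on the output."""
    return x[0] + 2.0 * x[1]


# Analytic first-order indices for this toy model are 0.2, 0.8 and 0.
sobol = first_order_sobol(linear_model, dim=3)
```

In practice the function being analysed is often expensive to evaluate, so an estimator like this would typically be run on a cheap surrogate model rather than on the system itself.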

The package was released in 2019 and has since gained popularity among the research communities of Bayesian optimization, Bayesian quadrature, and multi-fidelity modelling. The aim of this talk is to present Emukit to a wider audience of Python developers. It may be of interest to machine learning practitioners in need of hyper-parameter optimization methods, scientists running complex simulations and looking for emulation and uncertainty quantification (UQ) techniques, and anyone interested in approaches to decision making under uncertainty. Hearing about our development experience and lessons learned may also be useful to those looking to develop scientific packages in Python.

The first part of the talk will focus on the technical details of the package. We will start with a brief introduction to Emukit and the methods it provides. Emukit is a replacement for GPyOpt, and we will discuss the reasons that prompted its development. We will go over the key software design principles of Emukit and see how they lead to a flexible and adaptable toolkit, but also how they may hinder computational efficiency. Two other popular frameworks for Bayesian optimization, Trieste and BoTorch, will be used to highlight the strengths and weaknesses of Emukit.

The second part will focus on usage and adoption. We will talk about the target audience of the toolkit and its existing uses in teaching and research, and discuss why anyone who is not an expert in Bayesian active learning methods would want to use Emukit.

Additional materials

Emukit is available on GitHub: https://github.com/EmuKit/emukit. There is also a website about the package: https://emukit.github.io/.

Emukit was first presented at the NeurIPS workshop on ML and the Physical Sciences, 2019. The corresponding paper is on arXiv: https://arxiv.org/abs/2110.13293.

Emukit is used for teaching the ML and the Physical World course at the University of Cambridge. The course website can be found at https://mlatcl.github.io/mlphysical/.

Emukit was also adopted for the Gaussian Process Summer School 2022: https://gpss.cc/gpss22/.

Some of the previous talks given by the speaker can be found on his website: https://paleyes.info/#talks.

Andrei is currently pursuing a PhD at the University of Cambridge. His research interests are somewhere between machine learning and software systems, leaning towards the latter. He also has a keen interest in Bayesian optimization and actively participates in several open source projects. Before jumping into the world of academia he spent more than a decade as a software engineer, developing everything from small webapps to data center network software.