SciPy 2024

A hands-on forecasting guide: from theory to practice
07-08, 08:00–12:00 (US/Pacific), Ballroom D

Forecasting is central to decision-making in virtually all technical domains. For instance, predicting product sales in retail, forecasting energy demand, and anticipating customer churn all have tremendous value across different industries. However, the landscape of forecasting techniques is as diverse as it is useful, and different techniques, and the expertise they require, suit different types and sizes of data.
In this hands-on workshop, we give an overview of forecasting concepts, popular methods, and practical considerations. We’ll walk you through data exploration, data preparation, feature engineering, statistical forecasting (e.g., STL, ARIMA, ETS), forecasting with tabular machine learning models (e.g., decision forests), forecasting with deep learning methods (e.g., TimesFM, DeepAR), meta-modeling (e.g., hierarchical reconciliation and relational modeling, ensembles, resource models), and how to safely evaluate such temporal models.


Motivation
Forecasting, the prediction of future events from past data, is central to decision-making in virtually all technical domains. For instance, predicting product sales in retail, forecasting energy demand, and anticipating customer churn all have tremendous value across different industries. However, the landscape of forecasting techniques is as diverse as it is useful, and different techniques, and the expertise they require, suit different types and sizes of data.
Content
In this hands-on tutorial led by experts from Tryolabs and Google, we give an overview of forecasting concepts, methods, and practical applications on the M5 forecasting challenge. In the context of forecasting, we’ll walk you through data exploration, data preparation, feature engineering, statistical and machine learning modeling, meta-modeling, and how to safely evaluate such temporal models.
Theory
We’ll kick off the session with a theoretical introduction to forecasting. We’ll properly define concepts like univariate and multivariate time series, time sequences, step and multi-step forecasts, aggregated forecasts, and conditional forecasts.
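To make these definitions concrete, here is an illustrative sketch (with synthetic numbers, not tutorial material) of how the concepts map to array shapes:

```python
# Illustrative shapes only: a univariate series is a 1-D sequence of
# values over time; a multivariate series stacks several aligned
# sequences; a multi-step forecast predicts a vector of future values.
import numpy as np

univariate = np.array([12.0, 15.0, 14.0, 16.0, 18.0])  # one series, 5 time steps
multivariate = np.stack([univariate, univariate * 2])  # two series, shape (2, 5)

one_step = np.array([19.0])              # single-step forecast: t+1 only
multi_step = np.array([19.0, 20.0, 21.0])  # multi-step forecast: t+1 .. t+3

print(multivariate.shape)  # (2, 5)
print(len(multi_step))     # 3
```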
Data
We’ll use the data from the M5 Forecasting competition. The objective of this competition was to forecast the next 28 days of daily sales for about 3,000 individual items across 10 Walmart stores.
Data exploration
We’ll perform an insightful exploration of the dataset, using visualizations to understand the data at hand and posing the questions data scientists should ask themselves when taking on a problem like this. We’ll analyze the trend and seasonality in the data and decompose the series into those components using STL (Seasonal and Trend decomposition using Loess).
Evaluation
Before we start modeling, we’ll define and explain some common forecasting metrics and introduce the one we will use throughout the session.
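For reference, a few common point-forecast metrics can be written in a handful of lines of numpy (the tutorial’s exact metric choice isn’t named here; the official M5 Accuracy competition metric was WRMSSE):

```python
# Minimal implementations of common point-forecast error metrics.
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def smape(y_true, y_pred):
    """Symmetric MAPE; safe when the true series contains zeros."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(np.where(denom == 0, 0.0, np.abs(y_true - y_pred) / denom))

y_true = np.array([3.0, 5.0, 2.0, 8.0])
y_pred = np.array([2.5, 5.0, 3.0, 7.0])
print(mae(y_true, y_pred))   # 0.625
print(rmse(y_true, y_pred))  # 0.75
```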
Feature engineering
Most models can consume exogenous data, also known as independent or explanatory variables.
In our use case, data such as an item's selling price, general macroeconomic data, closeness in time to special calendar events, and aggregated information about its store or category's past behavior can all play a role in explaining current sales values.
We compute features that, based on our domain knowledge, will allow the model to learn the underlying patterns in the data.
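The sketch below illustrates these kinds of features on a toy per-item daily sales frame built with pandas (column names are illustrative, not the M5 schema):

```python
# Hypothetical feature-engineering sketch: lags, rolling aggregates of
# past behavior, and calendar features on a synthetic sales frame.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=60, freq="D"),
    "sales": np.random.default_rng(1).poisson(5, 60).astype(float),
    "sell_price": 2.5,
})

# Lag features: sales from 7 and 28 days earlier.
df["lag_7"] = df["sales"].shift(7)
df["lag_28"] = df["sales"].shift(28)

# Rolling mean of past sales; shift(1) excludes the current day to
# avoid leaking the target into its own features.
df["rolling_mean_7"] = df["sales"].shift(1).rolling(7).mean()

# Calendar features, e.g. day of week and weekend flag.
df["day_of_week"] = df["date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

print(df.dropna().head())
```

Note the deliberate one-day shift before the rolling aggregate: features must only use information available before the value being predicted.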
Modeling
We iterate over several modeling methods, in ascending complexity:
A naive baseline that forecasts the moving average of past sales.
A classical statistical model, SARIMA.
A Gradient Boosted Decision Trees model.
A state-of-the-art deep learning model.
We will introduce each model and explain how it works under the hood and what its strengths and weaknesses are.
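To give a flavor of the simplest rung of that ladder, here is a minimal sketch of a moving-average baseline (window and horizon sizes are assumptions for illustration, not the tutorial’s exact settings):

```python
# Naive baseline: forecast every future step as the mean of the most
# recent `window` observations of the series.
import numpy as np

def naive_forecast(history: np.ndarray, horizon: int = 28, window: int = 28) -> np.ndarray:
    """Repeat the moving average of the last `window` points for `horizon` steps."""
    return np.full(horizon, history[-window:].mean())

history = np.array([4.0, 6.0, 5.0, 7.0, 8.0, 6.0])
print(naive_forecast(history, horizon=3, window=4))  # [6.5 6.5 6.5]
```

Simple baselines like this are surprisingly hard to beat on noisy retail data, which is why they anchor the comparison against the more complex models.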
Conclusion
We conclude the session with a discussion and a list of reading materials to learn more about forecasting. Join us for a fun and interactive tutorial hosted by Tryolabs and Google!
Libraries
We use the following libraries throughout the tutorial:
temporian: temporal data preprocessing and feature engineering
statsmodels: statistical modeling
tensorflow_decision_forests: tree-based modeling
tensorflow or pytorch: deep learning modeling
seaborn: custom plots and visualization
numpy, pandas: auxiliary data preprocessing


Prerequisites

Some experience using Python for data analysis, data science, and/or machine learning should be sufficient to complete the tutorial end-to-end. We will do a thorough job of introducing all the concepts we touch on and the libraries and models we use, both theoretically (what they do and why they’re useful) and practically (their API/how to use them). A basic understanding of machine learning in general is desirable.

Installation Instructions

Open the four notebook links listed in: https://docs.google.com/presentation/d/1R4GYum358jIh0TJukz0oZAS9QQ3Ps6FIDi6vUqHR3l0/edit?usp=sharing

Lead Machine Learning Engineer @ Tryolabs | Founding Engineer @ Puppeteer AI | CTO @ Buen Provecho

Currently building
- Temporian, an open-source Python library for preprocessing and feature engineering of temporal data
- Puppeteer, an actually useful AI platform for the healthcare industry
- Buen Provecho, a startup fighting food waste in Latin America


I am a research engineer at Google Zurich. My work centers around moving ML research into production. Notably, I lead efforts to research decision forest technologies and bring them to production, making them accessible, scalable, and performant.

Before joining Google, I completed a postdoctoral fellowship at Carnegie Mellon University's Auto Lab. In 2012, I earned my doctorate in France at the INRIA Research Lab as part of the PRIMA team. I graduated from Imperial College London and "French Grande Ecole" ENSIMAG in 2009.

In my spare time, I delve into various hobbies such as tinkering with electronics, woodworking, 3D printing, and creating video games.
