SciPy 2024

Safe, fast, and easy time series preprocessing with Temporian
07-12, 10:45–11:15 (US/Pacific), Ballroom

Temporal data is ubiquitous in data science and plays a vital role in machine learning pipelines and business decisions. Preprocessing temporal data using generic data tools can be tedious, lead to inefficient computation, and be prone to errors.
Temporian is an open-source library for safe, simple, and efficient preprocessing and feature engineering of temporal data. It supports common temporal data types, including non-uniformly sampled, multivariate, multi-index, and multi-source data. Temporian favors interactive development in notebooks and integration with other machine learning tools, and can run at scale using distributed computing.
This talk, aimed at data scientists and machine learning practitioners, will showcase Temporian’s key features along with its powerful API, and demonstrate its advantages over generic data preprocessing libraries for handling temporal data.


Motivation
To deploy machine learning (ML) pipelines on temporal data, practitioners must be able to convert raw temporal data into a format that ML models can consume. In business, this conversion process is often costly to develop and maintain, and has a critical impact on the overall quality of the ML pipeline.
Temporal data is commonly processed manually using generic data processing tools, such as pandas, NumPy, and SQL. Such approaches can be expensive, tedious, and error-prone, especially concerning future leakage. They also require engineers to have expertise in temporal data processing and efficient computation, and even then may still result in less powerful and less effective ML pipelines.

Temporian

Temporian is a library for safe, simple, and efficient preprocessing and feature engineering of temporal data. It supports common temporal data types, including non-uniformly sampled, multivariate, multi-index, and multi-source time sequences.

Developed in collaboration between Google and Tryolabs, it addresses common challenges in building temporal data processing components for machine learning pipelines. It is versatile enough to serve both novice users with simple pipelines and small datasets and professionals working on complex production pipelines with large datasets and intricate preprocessing requirements.

Informally, Temporian is to temporal data what pandas is to tabular data.

This library addresses most of the concerns listed above:

- It supports most types of temporal data: uniformly and non-uniformly sampled data, univariate and multivariate data, flat and multi-index data, and mono-source and multi-source non-synchronized events.
- It is optimized for temporal data: its core computation is implemented in C++. We’ve reported speedups of over 1,000x compared to off-the-shelf data processing libraries when operating on temporal data.
- It is easy to integrate into an existing ML ecosystem: Temporian does not perform any ML model training; instead, it integrates seamlessly with any ML library.
- It prevents unwanted future leakage: unless explicitly specified, feature computation cannot depend on future data, thereby preventing unwanted, hard-to-debug, and potentially costly future leakage.
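
To make the future-leakage point concrete, here is a minimal sketch in plain Python. It is not Temporian's API: the function names trailing_mean and centered_mean and the sample data are hypothetical, chosen only for illustration. It contrasts a causal moving average over non-uniformly sampled timestamps, which reads only past and present values, with a centered one that silently reads future values.

```python
from bisect import bisect_left, bisect_right


def trailing_mean(timestamps, values, window):
    """Causal moving average (hypothetical helper, not part of Temporian).

    For each timestamp t, averages the values whose timestamps fall in
    (t - window, t]. Timestamps must be sorted in increasing order.
    """
    out = []
    for t in timestamps:
        lo = bisect_right(timestamps, t - window)  # first index with ts > t - window
        hi = bisect_right(timestamps, t)           # one past the last index with ts <= t
        window_vals = values[lo:hi]                # never empty: includes the value at t
        out.append(sum(window_vals) / len(window_vals))
    return out


def centered_mean(timestamps, values, window):
    """Leaky variant: averages over [t - window/2, t + window/2].

    The right half of the window lies in the future of t.
    """
    half = window / 2
    out = []
    for t in timestamps:
        lo = bisect_left(timestamps, t - half)
        hi = bisect_right(timestamps, t + half)
        window_vals = values[lo:hi]
        out.append(sum(window_vals) / len(window_vals))
    return out


# Non-uniformly sampled series (made-up numbers).
timestamps = [0.0, 1.0, 1.5, 4.0, 9.0]
values = [10.0, 20.0, 30.0, 40.0, 50.0]

causal = trailing_mean(timestamps, values, window=2.0)
leaky = centered_mean(timestamps, values, window=2.0)

# A causal feature does not change when later events are appended,
# so recomputing it on a prefix of the data gives the same results.
causal_on_prefix = trailing_mean(timestamps[:3], values[:3], window=2.0)
leaky_on_prefix = centered_mean(timestamps[:3], values[:3], window=2.0)
```

In this sketch, causal_on_prefix matches the first three entries of causal, while the centered variant already uses the value at time 1.0 to compute the feature at time 0.0. That is the kind of dependence on future data that the bullet above describes as disallowed unless explicitly requested.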

GitHub: https://github.com/google/temporian
Documentation: https://temporian.readthedocs.io/en/stable/
Introduction to Temporian (45m webinar): https://www.youtube.com/watch?v=krVFwQPrGuM

Content
We showcase Temporian’s main features, its powerful API, and its main advantages over generic data preprocessing libraries, using short self-contained code examples.

I am a research engineer at Google Zurich. My work centers on moving ML research into production. Notably, I lead efforts to research decision forest technologies and bring them to production, making them accessible, scalable, and performant.

Before joining Google, I completed a postdoctoral fellowship at Carnegie Mellon University's Auto Lab. In 2012, I earned my doctorate in France at the INRIA Research Lab as part of the PRIMA team. I graduated from Imperial College London and the French grande école ENSIMAG in 2009.

In my spare time, I delve into various hobbies such as tinkering with electronics, woodworking, 3D printing, and creating video games.

Lead Machine Learning Engineer @ Tryolabs | Founding Engineer @ Puppeteer AI | CTO @ Buen Provecho

Currently building
- Temporian, an open-source Python library for preprocessing and feature engineering of temporal data
- Puppeteer, an actually useful AI platform for the healthcare industry
- Buen Provecho, a startup fighting food waste in Latin America
