SciPy 2023

PPML: Machine Learning on data you cannot see
07-10, 08:00–12:00 (America/Chicago), Classroom 203

Privacy guarantees are the most crucial requirement when analysing sensitive data. However, data anonymisation techniques alone do not always provide complete privacy protection; moreover, machine learning models can be exploited to leak sensitive data when they are attacked and no countermeasure is applied. Privacy-preserving machine learning (PPML) methods hold the promise of overcoming all these issues, making it possible to train machine learning models with full privacy guarantees. In this tutorial we will explore several methods for privacy-preserving data analysis, and how these techniques can be used to safely train ML models without actually seeing the data.


Privacy guarantees are the most crucial requirement when analysing sensitive data. These requirements can sometimes be so stringent that they become a real barrier for the entire pipeline. The reasons for this are manifold, and include the fact that data often cannot be shared or moved from the silos where they reside, let alone analysed in their raw form. As a result, data anonymisation techniques are sometimes used to generate a sanitised version of the original data. However, these techniques alone are not enough to guarantee that privacy will be completely preserved. Moreover, the memorisation effect of deep learning models can be maliciously exploited to attack the models and reconstruct sensitive information about the samples used in training, even if this information was not originally provided.

Privacy-preserving machine learning (PPML) methods hold the promise of overcoming all those issues, making it possible to train machine learning models with full privacy guarantees.

This workshop will be organised in three main parts. In the first part, we will introduce the main concepts of differential privacy: what it is, and how this method differs from more classical anonymisation techniques (e.g. k-anonymity). In the second part, we will focus on machine learning experiments. We will start by demonstrating how DL models can be exploited (i.e. through inference attacks) to reconstruct the original data solely by analysing the models' predictions; we will then explore how differential privacy can help us protect the privacy of our model, with minimum disruption to the original pipeline. Finally, we will conclude the tutorial by considering more complex ML scenarios in which deep learning networks are trained on encrypted data, using specialised distributed federated learning strategies.
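As a rough illustration of the first part, the sketch below shows one standard building block of differential privacy, the Laplace mechanism, applied to a counting query. It is not taken from the tutorial materials: the function names, the toy `ages` list and the choice of `epsilon` are all hypothetical, and only the Python standard library is used.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return an epsilon-differentially-private count (Laplace mechanism).

    A counting query has sensitivity 1: adding or removing one record
    changes the true count by at most 1, so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: ages in a sensitive dataset (illustration only).
ages = [23, 35, 41, 52, 67, 29, 58, 73, 44, 61]

rng = random.Random(0)  # seeded only to make the example repeatable
noisy = private_count(ages, lambda age: age >= 50, epsilon=0.5, rng=rng)
print(f"true count = 5, noisy count = {noisy:.2f}")
```

Unlike k-anonymity, which modifies the released records themselves, this approach perturbs the answer to a query: a smaller `epsilon` adds more noise and gives a stronger privacy guarantee at the cost of accuracy.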


Prerequisites

This workshop will assume familiarity with the PyTorch deep learning framework, as well as with the basics of machine/deep learning experiments. No prior specialised knowledge of anonymisation or security will be required. Lecture notes will be delivered via interactive Jupyter Notebooks, so the audience should be familiar with the Jupyter environment. All the instructions on how to set up the environment will be shared with delegates in due course, prior to the workshop.

Installation Instructions

https://github.com/leriomaggio/ppml-tutorial

Valerio Maggio is a Researcher, a Data Scientist Advocate at Anaconda, and a casual "Magic: The Gathering" wizard. He is well versed in open science and research software, supporting the adoption of best software development practices (e.g. Code Review) in Data Science. Valerio is also an open-source contributor, and an active member of the Python community. Over the last twelve years he has contributed to and volunteered in the organisation of many international conferences and community meetups like PyCon Italy, PyData, EuroPython, and EuroSciPy. All his talks, workshop materials and random ramblings are publicly available on his Speaker Deck and GitHub profiles.