SciPy 2024

Introduction to Causal Inference using pgmpy
07-10, 14:35–15:05 (US/Pacific), Ballroom

In data science, many questions aim to understand and quantify the effects of interventions, such as assessing the efficacy of a vaccine or the impact of a price adjustment on a product's sales volume. Traditional association-based machine learning methods, predominantly used for predictive analytics, are inadequate for answering these causal questions from observational data, which calls for causal inference methodologies. This talk introduces the audience to the Directed Acyclic Graph (DAG) framework for causal inference. The presentation has two main objectives: first, to give insight into the types of questions where causal inference methods can be applied; and second, to walk through a causal analysis on a real dataset, highlighting the individual steps of the analysis and showcasing the use of the pgmpy package.


In both research and industry, we often seek to determine causal relationships, such as whether increasing the minimum wage leads to lower unemployment rates, or how customer service quality affects brand loyalty and repeat purchases. While traditional association-based machine learning methods excel at identifying correlations, they fall short in establishing causal effects. The gold standard for uncovering these causal effects is to conduct randomized controlled trials (RCTs) or A/B tests. However, RCTs and A/B tests can sometimes be unethical or expensive.

Causal inference methods offer an alternative, providing ways to estimate causal effects from observational data sets. This talk will focus on a Directed Acyclic Graph (DAG)-based mathematical framework for causal inference, structured into two main parts:

  1. Motivating Examples: Through simple examples, including Simpson's paradox, the talk will highlight scenarios that require causal inference methodologies. Key concepts such as confounding, identification, intervention, and counterfactuals will be introduced.

  2. A Causal Inference Analysis Walkthrough: Analyzing causal relationships in observational data often involves several steps, such as constructing a DAG, learning the parameters of the DAG, identifying the causal effect of interest, and estimating this effect. Each of these steps can require critical decisions. Using a real dataset (such as the protein signaling network or the adult income dataset), the talk will give a walkthrough of a complete causal analysis pipeline using the pgmpy package.
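
To make the ideas in both parts concrete, here is a minimal sketch of Simpson's paradox and of estimation by backdoor adjustment. It is not taken from the talk and deliberately uses only plain Python rather than pgmpy. The counts are those of the widely cited kidney stone treatment example often used to illustrate the paradox. The sketch assumes a three-node DAG in which stone size affects both the choice of treatment and recovery, so that adjusting for stone size identifies the effect of treatment. The function names `naive_rate` and `adjusted_rate` are made up for this illustration.

```python
# Illustrative counts: (treatment, stone_size) -> (recovered, total).
# Assumed DAG: size -> treatment, size -> recovery, treatment -> recovery.
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}


def naive_rate(treatment):
    """Associational estimate P(recovered | treatment), ignoring stone size."""
    recovered = sum(r for (t, _), (r, _) in counts.items() if t == treatment)
    total = sum(n for (t, _), (_, n) in counts.items() if t == treatment)
    return recovered / total


def adjusted_rate(treatment):
    """Backdoor-adjusted estimate of P(recovered | do(treatment)).

    Computes sum_z P(recovered | treatment, size=z) * P(size=z), which is
    valid only under the assumed DAG (stone size is the sole confounder).
    """
    grand_total = sum(n for _, n in counts.values())
    sizes = {size for _, size in counts}
    rate = 0.0
    for z in sizes:
        # Marginal probability of this stone size over all patients.
        p_z = sum(n for (_, s), (_, n) in counts.items() if s == z) / grand_total
        recovered, total = counts[(treatment, z)]
        rate += (recovered / total) * p_z
    return rate


for t in ("A", "B"):
    print(t, "naive:", round(naive_rate(t), 3), "adjusted:", round(adjusted_rate(t), 3))
```

In this sketch the naive comparison favors treatment B, while the adjusted comparison favors treatment A, which matches the comparison within each stone-size group. Which of the two answers is the causal one depends entirely on the assumed DAG, which is why the walkthrough treats DAG construction and identification as separate steps before estimation.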

Outline of the Talk:
1. (5 mins) Introduction to causal inference with examples to illustrate the general idea, focusing on the issue of unobserved confounders.
2. (5 mins) An overview of Pearl's DAG framework for visualizing and reasoning about causality, introducing basic causal inference concepts such as intervention and identification.
3. (15 mins) A step-by-step walkthrough of a causal inference analysis using a real dataset, covering:
   a. DAG construction, structure learning, and model testing.
   b. Identification of the causal parameter of interest.
   c. Estimation of the causal parameter.
4. (5 mins) Discussion on various challenges and pitfalls, including model selection and scalability issues.

Intended Audience: This talk is aimed at a foundational level and is accessible without any prerequisites.

Ankur Ankan is a postdoctoral researcher at Radboud University in the Netherlands, where his research focuses on causal inference. His main interest lies in developing practical methods for causal inference, along with developing software tools for it. He also started and maintains the Python package pgmpy, which offers tools for probabilistic and causal inference in graphical models.