Introduction to Causal Inference

Causal Inference

Author

Neba Nfonsang

Published

June 12, 2026

This article provides a foundational introduction to causal inference and treatment effect estimation, covering the concepts, frameworks, and assumptions that enable researchers and scientists to draw causal inferences or conclusions.

Why Causal Inference Matters

Causal inference allows us to go beyond finding associations between variables and answer causal questions such as:

Did the discount (or coupon) program increase sales?
Did the intervention work?
Did the AI tutoring system improve math scores?
Does the treatment (x) improve health outcomes (y)?
What is the effect of training program x on the unemployment rate?

Association and Causation

We often hear the statement, “correlation is not causation” or “commonality is not causality.” When two variables move together, they are said to be correlated: they tend to increase or decrease together (positive correlation), or one tends to increase as the other decreases (negative correlation). However, this relationship does not necessarily mean that one variable causes the other.

For example, crime rates and ice cream sales may both increase at the same time. This does not mean that committing crimes causes people to buy more ice cream. Instead, both may be influenced by another factor, such as warmer weather, which increases outdoor activity.
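The confounding story above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the coefficients, noise levels, and variable names are hypothetical, and temperature is made the sole driver of both variables by construction.

```python
import math
import random
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(0)

# Hypothetical data-generating process: temperature drives both variables,
# and neither variable has any effect on the other.
n = 20_000
temperature = [random.gauss(20, 8) for _ in range(n)]
ice_cream = [2.0 * t + random.gauss(0, 10) for t in temperature]
crime = [1.5 * t + random.gauss(0, 10) for t in temperature]

# Across all days, the two variables are clearly correlated.
raw_corr = pearson(ice_cream, crime)

# Hold temperature roughly fixed by keeping only days in a narrow band.
band = [(i, c) for t, i, c in zip(temperature, ice_cream, crime)
        if 19.5 <= t <= 20.5]
band_corr = pearson([i for i, _ in band], [c for _, c in band])

print(f"correlation across all days:      {raw_corr:.2f}")
print(f"correlation at ~fixed temperature: {band_corr:.2f}")
```

In this setup the correlation largely disappears once temperature is held approximately constant, which is what we would expect if warm weather is the common cause.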

Researchers have developed frameworks and assumptions that make it possible to estimate causal effects beyond simple associations. For instance, comparing unemployment rates between states that have increased minimum wage and those that have not reflects only an associative difference. This difference cannot be interpreted as a causal effect unless certain assumptions are satisfied, ensuring that other factors are not responsible for the observed outcome.

What is Inference? 🤔

Several statistical terms are not intuitive to most non-statisticians, so I’ll start by defining concepts that will facilitate the understanding of causal inference.

Inference

Inference in statistics refers to drawing conclusions about a population using information from a sample. A sample is a subset of the population.

Therefore, generalizing results from a sample to a population is inference. In other words, estimating population parameters using sample statistics is inference.

For example, say you wanted to estimate what percentage of customers like your product. If your business serves only the United States, then the population of customers is all of your customers in the United States. Instead of collecting data from all of your customers in the United States (which could be a tedious and costly process, or perhaps not feasible), you could instead select a representative random sample of customers from the population and study the proportion of customers in the sample who like your product.

If 75% of the customers in the sample like your product, you can then use that number, 75%, as an estimate of the percentage of customers who like your product in the entire United States. You have used the sample results to make an inference, or draw conclusions, about the population.
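As a rough illustration of this sample-to-population step, the sketch below draws a simple random sample from a made-up customer population and uses the sample proportion as the estimate. The population size, the true 75% rate, and the sample size are all assumptions chosen for the example, and the interval uses the usual normal approximation.

```python
import math
import random

random.seed(1)

# Hypothetical population: 200,000 customers, 75% of whom like the product.
# 1 = likes the product, 0 = does not.
population = [1] * 150_000 + [0] * 50_000

# Simple random sample of 1,000 customers, drawn without replacement.
sample = random.sample(population, 1_000)

# The sample proportion is the point estimate of the population proportion.
p_hat = sum(sample) / len(sample)

# Approximate 95% confidence interval (normal approximation).
se = math.sqrt(p_hat * (1 - p_hat) / len(sample))
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"estimate: {p_hat:.3f}, 95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```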

What is Causal Inference?

Causal Inference

Causal inference is a type of statistical inference that uses sample data to estimate the extent to which changes in one variable \(X\) cause changes in another variable \(Y\).

Causal inference enables researchers to quantify the effect of \(X\) on \(Y\) and to draw conclusions about the presence and strength of a causal relationship between the variables \(X\) and \(Y\).

Causal inference can be traced back to the days of the Greek philosopher Aristotle. In early philosophical thinking, causal reasoning was focused on identifying possible causes of an outcome. However, the goal of causal inference has shifted toward estimating the effect of a specific cause.

What is a Causal Effect?

Causal Effect

A causal effect is the change in an outcome variable \(Y\) caused by a change in another variable \(X\).

That is, a causal effect measures how much \(Y\) would change if we actively changed \(X\) while keeping everything else constant or controlling for other factors. When \(X\) is a treatment, the effect of the treatment on the outcome is called the treatment effect.

Potential-outcomes Framework

The potential-outcomes framework is one of the most commonly used frameworks for estimating causal or treatment effects. It is also called the Rubin Causal Model (RCM), the Neyman-Rubin Counterfactual Framework of Causality, or simply the Counterfactual Framework.

The potential outcomes approach to causal inference originated with Neyman and was formalized by Rubin (1974). In the Rubin Causal Model, a causal effect is defined as the difference between potential outcomes under different treatment conditions.

Causal Effect for an Individual

For an individual, the causal effect is given by: \[ \tau_i = Y_{i}(1) - Y_{i}(0) \] where:

\(Y_{i}(1)\): outcome if treated
\(Y_{i}(0)\): outcome if not treated

The Fundamental Problem

Though Rubin’s formulation of causal effects is mathematically appealing, the difficulty is that only one potential outcome can be observed for each unit, leading to the fundamental problem of causal inference. For any given unit, we can never observe both potential outcomes at the same time. This is because, at any particular time, a unit is assigned to either the treatment group or the control group, not both.
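A toy table can make the problem concrete. The four units and their potential outcomes below are invented for illustration; in real data the full table is never available, because only the outcome under the treatment actually received is recorded.

```python
# Hypothetical potential outcomes for four units. In practice nobody observes
# both y0 and y1 for the same unit; they are written out here only to
# illustrate the idea.
units = [
    {"id": 1, "y0": 10, "y1": 14, "t": 1},
    {"id": 2, "y0": 8,  "y1": 9,  "t": 0},
    {"id": 3, "y0": 12, "y1": 12, "t": 1},
    {"id": 4, "y0": 7,  "y1": 11, "t": 0},
]

for u in units:
    # Individual causal effect: tau_i = Y_i(1) - Y_i(0) (unobservable).
    u["tau"] = u["y1"] - u["y0"]
    # Observed outcome: the potential outcome under the treatment received.
    u["y_obs"] = u["y1"] if u["t"] == 1 else u["y0"]
    # The other potential outcome is the missing counterfactual.
    u["missing"] = "Y(0)" if u["t"] == 1 else "Y(1)"

for u in units:
    print(f"unit {u['id']}: T={u['t']}, observed Y={u['y_obs']}, "
          f"missing {u['missing']}, true effect={u['tau']}")
```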

The Average Treatment Effect (ATE)

The population ATE (\(\tau\)) of treatment \(T\) on outcome \(Y\) is defined in terms of potential outcomes as the:

average causal effect of a treatment in the population
average of individual causal effects in the population
mean or expected difference between the potential outcome under treatment and the potential outcome under no treatment.

\[ \tau = \frac{1}{N} \sum_{i=1}^{N} \left( Y_i(1) - Y_i(0) \right) \]

\[ \begin{aligned} \tau &= \mathbb{E}[Y(1) - Y(0)] \\ &= \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] \end{aligned} \]

The treatment \(T\) has a nonzero average effect on the outcome \(Y\) if:

\[ \mathbb{E}[Y(1)] \neq \mathbb{E}[Y(0)] \]
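The two ways of writing the ATE (the average of individual effects, and the difference between average potential outcomes) give the same number. A minimal check, using four made-up units whose potential outcomes are both assumed known purely for illustration:

```python
from statistics import fmean

# Hypothetical potential outcomes (y0, y1) for four units.
potential_outcomes = [(10, 14), (8, 9), (12, 12), (7, 11)]

# Form 1: average of the individual effects Y_i(1) - Y_i(0).
ate_from_effects = fmean(y1 - y0 for y0, y1 in potential_outcomes)

# Form 2: mean of Y(1) minus mean of Y(0).
mean_y1 = fmean(y1 for _, y1 in potential_outcomes)
mean_y0 = fmean(y0 for y0, _ in potential_outcomes)
ate_from_means = mean_y1 - mean_y0

print(ate_from_effects, ate_from_means)
```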

Note

The average treatment effect is defined in terms of potential outcomes. Although counterfactual outcomes for each unit are unobserved, the potential outcomes framework provides tools and assumptions to address this fundamental problem.

Interpreting Average Treatment Effect (ATE)

For a continuous outcome \(Y\), the ATE is interpreted as the average change in the level of \(Y\) in the population due to the treatment. For example, the ATE may measure the change in sales revenue resulting from a discount program.
For a binary outcome \(Y\), the ATE is the change in the proportion (or probability) of units experiencing the positive outcome in the population due to the treatment. For example, the ATE may measure the change in the proportion of individuals who are cured in the population as a result of the treatment.
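As a worked example for the binary case, suppose (hypothetically) that 70% of the population would be cured under treatment and 55% would be cured without it. Because the expectation of a binary variable is the probability that it equals 1, the ATE is:

\[ \tau = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = P(Y(1) = 1) - P(Y(0) = 1) = 0.70 - 0.55 = 0.15 \]

That is, under these assumed numbers the treatment raises the cure rate by 15 percentage points.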

Average Treatment Effect on the Treated (ATT)

Conceptually, the ATE averages the treatment effect over everyone in the population, while the ATT averages it only over a subpopulation: those who receive the treatment.

ATE answers the question: What is the average effect of the treatment if it were applied to the entire population? For example, What would be the average increase in income if everyone received job training?
ATT answers the question: What was the effect of the treatment on the people who actually received it? For example, What was the average increase in income among individuals who participated in job training?

ATT is theoretically formulated under the potential outcome framework as:

\[ \tau_{\text{ATT}} = \mathbb{E}[Y(1) - Y(0) \mid T = 1] \]

Average Treatment Effect on the Control (ATC)

Similar to ATT, ATC measures the average effect of the treatment for those who did not receive the treatment.

ATC answers the question: What would have been the effect of the treatment for those who were not treated, if they had received the treatment?

ATC is theoretically formulated under the potential outcome framework as:

\[ \tau_{\text{ATC}} = \mathbb{E}[Y(1) - Y(0) \mid T = 0] \]
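To see how ATE, ATT, and ATC can differ, the sketch below simulates a population in which units with larger individual gains are more likely to take the treatment. The selection rule and all distribution parameters are assumptions made up for this example, and both potential outcomes are generated so that the three quantities can be computed directly.

```python
import random
from statistics import fmean

random.seed(2)

rows = []  # each row: (y0, y1, t)
for _ in range(10_000):
    gain = random.gauss(2.0, 1.0)   # individual effect Y(1) - Y(0)
    y0 = random.gauss(10.0, 2.0)    # outcome without treatment
    # Hypothetical selection rule: larger gains make treatment more likely.
    t = 1 if gain + random.gauss(0.0, 1.0) > 2.0 else 0
    rows.append((y0, y0 + gain, t))

ate = fmean(y1 - y0 for y0, y1, _ in rows)            # everyone
att = fmean(y1 - y0 for y0, y1, t in rows if t == 1)  # treated only
atc = fmean(y1 - y0 for y0, y1, t in rows if t == 0)  # untreated only
p_treated = fmean(t for _, _, t in rows)

print(f"ATE={ate:.2f}  ATT={att:.2f}  ATC={atc:.2f}")
# The ATE is the weighted average of ATT and ATC.
print(f"p*ATT + (1-p)*ATC = {p_treated * att + (1 - p_treated) * atc:.2f}")
```

Because selection here depends on the gains themselves, the treated benefit more than the untreated on average, so ATT exceeds ATE, which in turn exceeds ATC.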

When is ATE = ATT = ATC?

If units are assigned to treatment through a perfectly randomized process, or if the data are perfectly balanced, then the treated and control groups have identical distributions of characteristics. This implies that every treated unit has a comparable counterpart in the control group, and every control unit has a comparable counterpart in the treatment group.

As a result, the distribution of individual treatment effects is the same across groups. Therefore, \(\text{ATE} = \text{ATT} = \text{ATC}\), since averaging individual treatment effects over the entire population, the treated population, or the untreated population yields the same result.

In practice, it is difficult to achieve perfect randomization or perfect balance. As a result, ATE, ATT, and ATC may be only approximately equal when treatment assignment is approximately random or when the data are sufficiently well balanced.
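The equality can also be written out formally. Assuming treatment assignment is independent of the potential outcomes, \((Y(1), Y(0)) \perp T\), which is what ideal randomization delivers, conditioning on \(T\) does not change the expected effect:

\[ \begin{aligned} \text{ATT} &= \mathbb{E}[Y(1) - Y(0) \mid T = 1] = \mathbb{E}[Y(1) - Y(0)] = \text{ATE}, \\ \text{ATC} &= \mathbb{E}[Y(1) - Y(0) \mid T = 0] = \mathbb{E}[Y(1) - Y(0)] = \text{ATE}. \end{aligned} \]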

Addressing the Fundamental Problem

Since counterfactual outcomes are unobserved for each unit, we approximate these counterfactuals using experimental designs, assumptions, and statistical methods to estimate causal effects.

To address the fundamental problem of causal inference, we require an experimental (randomized or non-randomized) setting with two groups: treated and untreated. When these groups are comparable, each can serve as the counterfactual for the other on average.

The causal framework quantifies what would happen if units receive treatment and if they do not. However, in practice, each unit either receives treatment or not, meaning one potential outcome is always missing.

Under the assumption that the treated and untreated groups are similar, outcomes observed in the untreated group can be used to approximate the missing counterfactual outcomes for the treated group, and vice versa.

In an experimental setting with treated and untreated groups, the ATE defined by the Rubin causal model
\[\tau = \mathbb{E}[Y(1) - Y(0)]\] can be identified from observed data as \[ \tau = \mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0] \] assuming the treated (\(T=1\)) and untreated (\(T=0\)) groups are similar, that is, there is no systematic difference or selection bias between them. If the groups differ systematically, the estimated average treatment effect will be biased. Therefore, similarity between groups is what allows an unbiased estimate of the average treatment effect.
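A standard decomposition shows where the bias comes from. Assuming the observed outcome equals the potential outcome under the treatment received, adding and subtracting \(\mathbb{E}[Y(0) \mid T = 1]\) gives:

\[ \begin{aligned} \mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0] &= \mathbb{E}[Y(1) \mid T = 1] - \mathbb{E}[Y(0) \mid T = 0] \\ &= \underbrace{\mathbb{E}[Y(1) - Y(0) \mid T = 1]}_{\text{ATT}} + \underbrace{\mathbb{E}[Y(0) \mid T = 1] - \mathbb{E}[Y(0) \mid T = 0]}_{\text{selection bias}} \end{aligned} \]

The selection-bias term is zero when the treated and untreated groups would have had the same average outcome in the absence of treatment.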

Groups are considered similar if they have comparable distributions of observed covariates. Covariates are variables that describe the characteristics of units, and may influence both the treatment assignment and the outcome.

In practice, sample data is used to estimate ATE as follows (assuming the treated and untreated groups are comparable):

\[ \hat{\tau} = \bar{Y}_1 - \bar{Y}_0 \]

where \[ \bar{Y}_1 = \frac{1}{N_1} \sum_{i: T_i = 1} Y_i \]

\[ \bar{Y}_0 = \frac{1}{N_0} \sum_{i: T_i = 0} Y_i. \]

\(\bar{Y}_1\): average outcome for treated units
\(\bar{Y}_0\): average outcome for control (untreated) units
\(N_1\): number of treated units
\(N_0\): number of control units
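The estimator above is straightforward to compute. Below is a minimal sketch; the function name `difference_in_means` and the simulated randomized experiment (with an assumed true effect of 2.0) are hypothetical and only for illustration.

```python
import random
from statistics import fmean

def difference_in_means(outcomes, treatments):
    """Estimate the ATE as mean(Y | T=1) - mean(Y | T=0)."""
    treated = [y for y, t in zip(outcomes, treatments) if t == 1]
    control = [y for y, t in zip(outcomes, treatments) if t == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control unit")
    return fmean(treated) - fmean(control)

random.seed(3)

# Hypothetical randomized experiment: coin-flip assignment, true effect 2.0.
true_effect = 2.0
treatments = [random.randint(0, 1) for _ in range(4_000)]
outcomes = [10.0 + true_effect * t + random.gauss(0.0, 3.0) for t in treatments]

tau_hat = difference_in_means(outcomes, treatments)
print(f"estimated ATE: {tau_hat:.2f} (true effect in this simulation: {true_effect})")
```

Because assignment is random in this simulation, the two groups are comparable and the estimate should land close to the true effect.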

Randomized vs Non-Randomized Experiments

In randomized experimental settings, random assignment creates treated and control groups that are comparable on average, thereby removing selection bias and balancing confounding variables in expectation. Because assignment is independent of the potential outcomes, the ignorability (or unconfoundedness) assumption is satisfied by design. Randomized experiments therefore balance both observed and unobserved covariates.

In non-randomized settings, comparable groups are achieved using statistical techniques such as propensity score methods (e.g., matching and weighting). However, only observed covariates can be controlled for using these methods. Therefore, it is important to measure covariates that are known to influence both treatment assignment and outcomes when working with observational data.
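As a simplified stand-in for propensity score weighting, the sketch below uses a single observed binary covariate that affects both treatment assignment and the outcome. The data-generating process, the coefficients, and the way the propensity score is estimated (the share treated within each covariate value) are all assumptions for illustration; real applications typically fit a model such as logistic regression over many covariates.

```python
import random
from statistics import fmean

random.seed(4)

true_effect = 2.0
data = []  # each row: (x, t, y)
for _ in range(20_000):
    x = random.randint(0, 1)               # observed binary covariate
    p_treat = 0.8 if x == 1 else 0.2       # x influences treatment assignment
    t = 1 if random.random() < p_treat else 0
    y = 10.0 + 5.0 * x + true_effect * t + random.gauss(0.0, 1.0)  # x influences y
    data.append((x, t, y))

# Naive comparison ignores x and is confounded.
naive = (fmean(y for _, t, y in data if t == 1)
         - fmean(y for _, t, y in data if t == 0))

# Estimated propensity score: share of treated units at each value of x.
e_hat = {x: fmean(t for xi, t, _ in data if xi == x) for x in (0, 1)}

# Inverse probability weighting: weight treated units by 1/e(x) and
# untreated units by 1/(1 - e(x)).
n = len(data)
ipw = (sum(t * y / e_hat[x] for x, t, y in data) / n
       - sum((1 - t) * y / (1 - e_hat[x]) for x, t, y in data) / n)

print(f"naive difference: {naive:.2f}")
print(f"IPW estimate:     {ipw:.2f} (true effect in this simulation: {true_effect})")
```

The weighting recovers the effect here only because the confounder is observed; an unmeasured confounder would leave the estimate biased.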

Potential Outcome Framework Assumptions

To estimate treatment effects such as ATE using the potential-outcome framework:

\[ \text{ATE} = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0], \] these main assumptions should be satisfied:

Stable Unit Treatment Value Assumption (SUTVA)
Unconfoundedness assumption
Overlap assumption

SUTVA

SUTVA has two components: consistency and no interference.

Consistency: Consistency means that the treatment is well-defined and applied uniformly. For any given treatment condition, there should not be multiple versions of the treatment. For example, under the treatment condition \(T=1\), units should not receive different levels or variations of the treatment, such as differing dosages of a medical intervention. In addition, consistency requires that the observed outcome corresponds to the potential outcome under the treatment received. That is, \(Y = Y(T)\): if \(T=1\), then \(Y=Y(1)\), and if \(T=0\), then \(Y=Y(0)\).
No interference: No interference means there are no spillover effects: a unit’s outcome depends only on its own treatment, and not on the treatment assigned to any other unit.
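For a binary treatment, the consistency condition is often written compactly as a single "switching" equation, where the unit index \(i\) is added here for clarity:

\[ Y_i = T_i \, Y_i(1) + (1 - T_i) \, Y_i(0) \]

When \(T_i = 1\) the second term vanishes and \(Y_i = Y_i(1)\); when \(T_i = 0\) the first term vanishes and \(Y_i = Y_i(0)\). No other unit's treatment appears on the right-hand side, which reflects the no-interference condition.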

Unconfoundedness Assumption

The unconfoundedness assumption, also referred to as the ignorability assumption, states that the potential outcomes are independent of the treatment assignment conditional on observed covariates. That is,

\[ (Y(1), Y(0)) \perp T \mid X \]

After controlling for covariates, the potential outcomes are independent of the treatment assignment, meaning that treated and untreated units are comparable. This implies that we can ignore the treatment assignment mechanism as a source of bias, because after conditioning on \(X\), treatment assignment is as good as random and no longer related to the potential outcomes.

When recruiting people for job training in a study of the effect of job training on income, if mostly motivated individuals sign up, then the treatment group will contain a disproportionately high share of motivated people. If motivation is also correlated with income, then treatment assignment is not independent of the potential outcomes. As a result, the ignorability (or unconfoundedness) assumption is violated due to the assignment process being related to the potential outcomes, and the estimated treatment effect may be biased due to selection.
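The job training example can be mimicked in a small simulation. Everything in it is hypothetical: the income equation, the strength of motivation, the sign-up rule, and the assumed true effect of 1.0 (in arbitrary income units). The same population is simulated twice, once with self-selection on motivation and once with coin-flip assignment.

```python
import random
from statistics import fmean

random.seed(5)

TRUE_EFFECT = 1.0  # hypothetical effect of training on income

def simulate(self_selected, n=20_000):
    """Return the treated-minus-control difference in mean income."""
    treated, control = [], []
    for _ in range(n):
        motivation = random.gauss(0.0, 1.0)  # unobserved by the analyst
        if self_selected:
            # Motivated people are more likely to sign up.
            signs_up = motivation + random.gauss(0.0, 1.0) > 0
        else:
            # Randomized assignment ignores motivation.
            signs_up = random.random() < 0.5
        income = (30.0 + 3.0 * motivation
                  + (TRUE_EFFECT if signs_up else 0.0)
                  + random.gauss(0.0, 2.0))
        (treated if signs_up else control).append(income)
    return fmean(treated) - fmean(control)

biased = simulate(self_selected=True)
randomized = simulate(self_selected=False)

print(f"self-selected sign-up: {biased:.2f}")
print(f"randomized assignment: {randomized:.2f} (true effect: {TRUE_EFFECT})")
```

Under self-selection the difference in means mixes the training effect with the income advantage that motivated people would have had anyway, so it overstates the effect in this setup.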

Overlap Assumption

The overlap assumption, also known as the positivity or common support assumption, requires that, for every combination of covariates, there is a positive probability of observing both treated and untreated units.

That means, within each combination of covariates, there are both treated units (\(T=1\)) and untreated units (\(T=0\)).

The overlap assumption ensures:

treated and untreated units are comparable within each subgroup
we have data support to estimate causal effects

Without overlap:

we cannot estimate counterfactuals
causal effects are not identifiable for those regions
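Formally, overlap is usually stated as \(0 < P(T = 1 \mid X = x) < 1\) for every covariate value \(x\). With discrete covariates, a basic empirical check is to count treated and untreated units in each covariate combination. The sketch below does this for a few made-up records; the function name `overlap_report` and the covariates (age group, employment status) are hypothetical.

```python
from collections import defaultdict

def overlap_report(records):
    """Count treated and untreated units within each covariate combination.

    Each record is a tuple whose last element is the treatment indicator (0/1)
    and whose preceding elements are covariate values.
    """
    counts = defaultdict(lambda: {"treated": 0, "untreated": 0})
    for *covariates, treated in records:
        group = "treated" if treated == 1 else "untreated"
        counts[tuple(covariates)][group] += 1
    return {
        key: {**c, "overlap": c["treated"] > 0 and c["untreated"] > 0}
        for key, c in counts.items()
    }

# Hypothetical records: (age_group, employed, treated)
records = [
    ("young", 1, 1), ("young", 1, 0), ("young", 0, 1), ("young", 0, 0),
    ("old", 1, 1), ("old", 1, 0), ("old", 0, 0), ("old", 0, 0),
]

report = overlap_report(records)
violations = [key for key, r in report.items() if not r["overlap"]]

for key, r in report.items():
    print(key, r)
print("covariate combinations without overlap:", violations)
```

In this toy data the combination ("old", 0) contains no treated units, so there is nothing in the data to stand in for its treated counterfactual.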