In Chapter 1, we defined causal effects using counterfactual outcomes. In Chapter 2, we showed how randomized experiments allow us to estimate causal effects because randomization ensures exchangeability. But most research questions cannot be answered with randomized experiments, because randomizing treatment would be unethical, impractical, or impossible. This chapter discusses observational studies, which do not involve randomization of treatment.
The key question is: under what conditions can we validly estimate causal effects from observational data? This chapter introduces the fundamental identifiability conditions necessary for causal inference in observational studies.
In an observational study, treatment is not randomly assigned by the investigator. Instead, individuals receive treatment based on their characteristics, preferences, physician recommendations, or other factors. As a result, treated and untreated individuals may differ systematically in ways that affect the outcome.
Can we still estimate causal effects from observational data? The answer is yes—but only under certain conditions called identifiability conditions.
Definition 1 (Identifiability) A causal quantity (such as the average treatment effect) is identifiable if it can be computed from the observed data distribution under a given set of assumptions.
Three key identifiability conditions are required for causal inference from observational data: exchangeability, positivity, and consistency.
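Without these conditions, association need not equal causation. Specifically, when exchangeability fails, a comparison of the treated and the untreated is biased by confounding; when positivity fails, the counterfactual outcomes of some individuals cannot be learned from the data; and when consistency fails, the counterfactual outcomes themselves are not well-defined.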
The most critical identifiability condition is exchangeability. Informally, exchangeability means that the treated and untreated are comparable with respect to their potential outcomes.
Definition 2 (Conditional Exchangeability) The treated and untreated are exchangeable conditional on covariates \(L\) when:
\[Y^a \perp\!\!\!\perp A \mid L \quad \text{for all } a\]
This means the potential outcome \(Y^a\) is independent of treatment \(A\) within levels of \(L\).
Under conditional exchangeability:
\[Pr[Y^a = 1 | A = 1, L] = Pr[Y^a = 1 | A = 0, L] = Pr[Y^a = 1 | L]\]
for all values of \(a\) and \(L\).
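The equality above can be checked in a small simulation, which is possible only because a simulation lets us see both potential outcomes for every individual. The sketch below is illustrative: the binary covariate, the probabilities, and the helper names `simulate` and `risk_y1` are all assumptions made for this example, and treatment is generated to depend on \(L\) alone so that conditional exchangeability holds by construction.

```python
import random

def simulate(n=200_000, seed=0):
    """Simulate (L, A, Y^{a=0}, Y^{a=1}) for n individuals.

    Treatment depends on L only, so Y^a is independent of A given L
    by construction. All probabilities are hypothetical.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = int(rng.random() < 0.4)                    # covariate L
        a = int(rng.random() < (0.7 if l else 0.2))    # treatment A depends on L only
        y0 = int(rng.random() < (0.3 if l else 0.1))   # potential outcome Y^{a=0}
        y1 = int(rng.random() < (0.5 if l else 0.2))   # potential outcome Y^{a=1}
        rows.append((l, a, y0, y1))
    return rows

def risk_y1(rows, a, l=None):
    """Pr[Y^{a=1} = 1 | A = a], or Pr[Y^{a=1} = 1 | A = a, L = l] if l is given."""
    selected = [y1 for (li, ai, _, y1) in rows
                if ai == a and (l is None or li == l)]
    return sum(selected) / len(selected)

rows = simulate()
for l in (0, 1):
    print(f"L={l}: treated {risk_y1(rows, 1, l):.3f}, untreated {risk_y1(rows, 0, l):.3f}")
print(f"marginal: treated {risk_y1(rows, 1):.3f}, untreated {risk_y1(rows, 0):.3f}")
```

Within each level of \(L\), the risk of \(Y^{a=1}\) should be nearly the same in the treated and the untreated, while the marginal risks differ because the treated are more likely to have \(L = 1\). In real data this check is not available, since only one potential outcome per individual is ever observed.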
Confounding is the absence of exchangeability. When exchangeability fails, comparing treated and untreated groups yields a biased estimate of the causal effect.
Example 1 (Confounding Example) Suppose we want to estimate the causal effect of smoking on lung cancer using observational data. If smokers differ from non-smokers in other ways that affect lung cancer risk (e.g., occupational exposures, genetic factors), then:
\[Pr[Y^{a=1} = 1 | A = 1] \neq Pr[Y^{a=1} = 1 | A = 0]\]
The potential outcome under smoking is not independent of actual smoking status. The treated (smokers) are not exchangeable with the untreated (non-smokers).
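To see how a lack of exchangeability turns into bias, the following sketch compares the associational risk difference with the causal risk difference in simulated data. The numbers are invented for illustration and are not estimates from any study: `l` stands for a hypothetical occupational exposure that makes smoking more likely and independently raises cancer risk.

```python
import random

rng = random.Random(1)               # fixed seed so the run is reproducible
n = 200_000
sum_y1 = sum_y0 = 0                  # totals of the two potential outcomes
obs = {0: [], 1: []}                 # observed outcomes, by treatment received

for _ in range(n):
    l = rng.random() < 0.3                           # hypothetical occupational exposure
    a = int(rng.random() < (0.6 if l else 0.2))      # exposed workers smoke more often
    y1 = int(rng.random() < (0.25 if l else 0.10))   # Y^{a=1}: outcome if smoker
    y0 = int(rng.random() < (0.15 if l else 0.05))   # Y^{a=0}: outcome if non-smoker
    sum_y1 += y1
    sum_y0 += y0
    obs[a].append(y1 if a else y0)   # only the outcome under the received treatment is seen

# Causal contrast: uses both potential outcomes of everyone (simulation only).
causal_rd = (sum_y1 - sum_y0) / n
# Associational contrast: uses only what an observational study records.
assoc_rd = sum(obs[1]) / len(obs[1]) - sum(obs[0]) / len(obs[0])

print(f"causal risk difference:        {causal_rd:.3f}")
print(f"associational risk difference: {assoc_rd:.3f}")
```

Under these assumed parameters the causal risk difference is 0.065, while the associational risk difference is roughly 0.117, so the simulated estimates should land near those values. The gap is the confounding bias produced by `l`.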
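In observational studies, we try to achieve conditional exchangeability by measuring, and then adjusting for, the covariates \(L\) that differ between the treated and the untreated and that affect the outcome.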
The second identifiability condition is positivity, also called the experimental treatment assignment assumption.
Definition 3 (Positivity) Positivity requires that, for every value \(l\) of \(L\) with \(Pr[L = l] > 0\):
\[Pr[A = a | L = l] > 0 \quad \text{for all } a\]
In words: every individual has a non-zero probability of receiving every level of treatment, conditional on their measured covariates.
Without positivity, we cannot estimate causal effects for all individuals. If certain individuals (with specific values of \(L\)) have zero probability of receiving treatment, we cannot learn about their counterfactual outcome under treatment from the data.
Example 2 (Positivity Violation Example) Suppose we want to estimate the effect of a treatment on 90-year-old individuals, but in our data, no 90-year-old person received the treatment. Then:
\[Pr[A = 1 | \text{Age} = 90] = 0\]
We cannot estimate \(E[Y^{a=1} | \text{Age} = 90]\) from the data because we have no treated 90-year-olds to observe.
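A simple empirical check for this kind of violation is to tabulate which treatment levels appear in each covariate stratum. The sketch below assumes the data arrive as `(stratum, treatment)` pairs; the function name `positivity_violations` and the toy records are hypothetical.

```python
from collections import defaultdict

def positivity_violations(records, treatment_levels=(0, 1)):
    """Map each observed stratum to the treatment levels never seen in it.

    `records` is an iterable of (stratum, treatment) pairs. Strata in which
    every treatment level is observed are omitted from the result.
    """
    seen = defaultdict(set)
    for stratum, a in records:
        seen[stratum].add(a)
    return {
        stratum: [a for a in treatment_levels if a not in levels]
        for stratum, levels in seen.items()
        if any(a not in levels for a in treatment_levels)
    }

# Toy data: (age, treatment). No 90-year-old received treatment.
records = [(70, 1), (70, 0), (80, 1), (80, 0), (90, 0), (90, 0)]
print(positivity_violations(records))  # {90: [1]}
```

An empty stratum in a finite sample does not by itself prove that the true probability is zero, since it may reflect chance rather than a structural impossibility. Either way, the data contain no information about that counterfactual outcome in that stratum.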
The third identifiability condition is consistency. Unlike exchangeability and positivity, which concern how treatment is distributed across individuals, consistency concerns the definition of the counterfactual outcome itself and its link to the observed outcome.
Definition 4 (Consistency) The consistency assumption states that:
\[Y = Y^A\]
In words: the observed outcome \(Y\) for an individual equals their counterfactual outcome \(Y^a\) under the treatment level \(a\) that they actually received.
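For a binary treatment, consistency can be written as \(Y = A\,Y^{a=1} + (1 - A)\,Y^{a=0}\). The minimal sketch below encodes that rule; the `Individual` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Individual:
    a: int    # treatment actually received (0 or 1)
    y0: int   # potential outcome Y^{a=0}
    y1: int   # potential outcome Y^{a=1}

    @property
    def y(self) -> int:
        """Observed outcome under consistency: Y = Y^A."""
        return self.y1 if self.a == 1 else self.y0

# Two individuals with identical potential outcomes but different treatments.
people = [Individual(a=1, y0=0, y1=1), Individual(a=0, y0=0, y1=1)]
for person in people:
    print(f"A={person.a}, observed Y={person.y}")
```

The two individuals share the same potential outcomes, yet their observed outcomes differ because each reveals only the potential outcome under the treatment actually received. The other potential outcome remains unobserved.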
For consistency to hold, the treatment must be well-defined. This means we must be able to precisely specify what it means to receive treatment level \(a\).
Example 3 (Ill-Defined Treatment Example) Consider “exercise” as a treatment. What does \(A = 1\) (receives exercise) mean?
If different individuals in the \(A = 1\) group received different forms of exercise, then \(Y^{a=1}\) is not well-defined. The potential outcome under “exercise” depends on which specific form of exercise.
The consistency assumption also requires that there is no interference between individuals.
Interference occurs when one individual’s treatment affects another individual’s outcome. For consistency to hold, we need:
\[Y_i = Y_i^{A_i}\]
The outcome for individual \(i\) depends only on individual \(i\)’s treatment, not on other individuals’ treatments.
Example 4 (Interference Example) Consider a vaccine study. If vaccinating person A reduces person B’s risk of disease (through herd immunity), then:
\[Y_B \neq Y_B^{A_B}\]
Person B’s outcome depends not just on \(A_B\) (whether B was vaccinated), but also on \(A_A\) and the vaccination status of others in the population.
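The vaccine example can be made concrete with a toy risk function in which an individual's risk depends on both their own vaccination status and the coverage among everyone else. The functional form and every number in the sketch below are hypothetical, chosen only to make the dependence visible.

```python
def infection_risk(own_vaccinated, others_vaccinated):
    """Toy infection risk for one person under interference.

    own_vaccinated: bool, this person's own treatment.
    others_vaccinated: list of bools, treatments of everyone else.
    The 0.40 baseline, 75% direct protection, and 50% maximal indirect
    protection are made-up illustrative values.
    """
    coverage = sum(others_vaccinated) / len(others_vaccinated)
    baseline = 0.40
    own_factor = 0.25 if own_vaccinated else 1.0
    herd_factor = 1.0 - 0.5 * coverage
    return baseline * own_factor * herd_factor

# Person B is unvaccinated in both scenarios; only the others' treatments change.
risk_if_nobody_else = infection_risk(False, [False, False, False, False])
risk_if_everybody_else = infection_risk(False, [True, True, True, True])
print(risk_if_nobody_else, risk_if_everybody_else)
```

Because the two risks differ although B's own treatment is identical, a single counterfactual outcome indexed only by B's own treatment is not well-defined here. One would need counterfactuals indexed by the whole vector of treatments in the population.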
Consistency also requires that the timing of treatment and outcome measurement is well-defined.
A useful framework for thinking about identifiability conditions in observational studies is the target trial.
Definition 5 (Target Trial) The target trial is the (hypothetical) randomized experiment we would conduct if we could. By specifying the target trial, we clarify:
When conducting an observational study, we should ask: “How closely can we emulate the target trial with the available data?”
The identifiability conditions can be understood as requirements for successfully emulating a randomized experiment:
Example 5 (Target Trial Example) Research question: Does taking statins reduce the risk of cardiovascular disease (CVD)?
Target trial specification:
Observational study emulation:
This chapter introduced the fundamental identifiability conditions for causal inference in observational studies: exchangeability, positivity, and consistency.
Under these three conditions, we can identify causal effects from observational data by adjusting for measured confounders \(L\).
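One common way to carry out this adjustment is standardization. For a discrete \(L\), the three conditions together give

\[Pr[Y^a = 1] = \sum_l Pr[Y = 1 | A = a, L = l] \, Pr[L = l]\]

where conditional exchangeability and consistency justify replacing \(Pr[Y^a = 1 | L = l]\) with \(Pr[Y = 1 | A = a, L = l]\), and positivity guarantees that each conditional probability is defined. The sketch below applies this formula to simulated data. The data-generating probabilities and the helper names `simulate`, `crude_risk`, and `standardized_risk` are assumptions for illustration, and standardization is only one of several adjustment methods.

```python
import random
from collections import Counter

def simulate(n=200_000, seed=2):
    """Simulate observed (L, A, Y) with L a confounder (hypothetical numbers).

    Y is drawn from the risk under the treatment actually received, so
    consistency and conditional exchangeability given L hold by construction.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        l = int(rng.random() < 0.3)
        a = int(rng.random() < (0.6 if l else 0.2))
        p = (0.25 if l else 0.10) if a else (0.15 if l else 0.05)
        data.append((l, a, int(rng.random() < p)))
    return data

def crude_risk(data, a):
    """Unadjusted Pr[Y = 1 | A = a]."""
    ys = [y for (_, ai, y) in data if ai == a]
    return sum(ys) / len(ys)

def standardized_risk(data, a):
    """Estimate Pr[Y^a = 1] as sum over l of Pr[Y=1 | A=a, L=l] * Pr[L=l]."""
    n = len(data)
    l_counts = Counter(l for (l, _, _) in data)
    total = 0.0
    for l, n_l in l_counts.items():
        ys = [y for (li, ai, y) in data if li == l and ai == a]
        if not ys:  # empirical positivity violation in this stratum
            raise ValueError(f"no individuals with A={a} in stratum L={l}")
        total += (sum(ys) / len(ys)) * (n_l / n)
    return total

data = simulate()
crude_rd = crude_risk(data, 1) - crude_risk(data, 0)
std_rd = standardized_risk(data, 1) - standardized_risk(data, 0)
print(f"crude risk difference:        {crude_rd:.3f}")
print(f"standardized risk difference: {std_rd:.3f}")
```

With these assumed parameters the true causal risk difference is 0.065, so the standardized estimate should fall close to it while the crude estimate overstates the effect. The function raises an error rather than silently skipping a stratum with no individuals at treatment level \(a\), because in that case the formula has no empirical counterpart.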
The target trial framework provides a useful way to think about observational studies: we should try to emulate the randomized experiment we would have conducted if we could. The identifiability conditions tell us what is required for this emulation to succeed.