This chapter addresses a critical question in causal inference: Which variables should we adjust for? Not all variables that predict the outcome should be included in causal models. Some variables, if adjusted for, can introduce bias rather than remove it. We provide guidance on variable selection using causal diagrams.
Traditional variable selection methods are designed for prediction, not causal inference.
Prediction goal: Minimize prediction error for \(Y\) given covariates
Causal inference goal: Estimate \(E[Y^a]\) or \(E[Y^{a=1}] - E[Y^{a=0}]\)
Traditional approach: Stepwise regression (forward, backward, or both)
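Problem for causal inference: Stepwise procedures keep or drop variables according to how well they predict \(Y\), not according to their position in the causal structure. They can discard a confounder that predicts the outcome only weakly and retain a mediator or collider that predicts it well.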
Recommendation: Do not use stepwise selection for causal inference.
What exactly is a confounder, and when should we adjust for it?
Definition 1 (Confounder (Formal Definition)) A variable \(L\) is a confounder for the effect of \(A\) on \(Y\) if:
Causal criterion: \(L\) is on a backdoor path from \(A\) to \(Y\).
Backdoor path: A path from \(A\) to \(Y\) that starts with an arrow into \(A\)
\[A \leftarrow L \to Y\]
Such paths create non-causal association between \(A\) and \(Y\).
Goal: Block all backdoor paths to eliminate confounding.
Definition 2 (Sufficient Adjustment Set) A set of variables \(L\) is sufficient for confounding adjustment if conditioning on \(L\) blocks all backdoor paths from \(A\) to \(Y\).
Equivalently: \((Y^a \perp\!\!\!\perp A \mid L)\) for all \(a\) (conditional exchangeability).
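Conditional exchangeability is what lets the counterfactual mean be written in terms of observed data. A sketch of that step for a discrete \(L\), assuming positivity and consistency as well (neither is restated in this section):

\[
\begin{aligned}
E[Y^a] &= \sum_l E[Y^a \mid L = l]\,\Pr(L = l) \\
       &= \sum_l E[Y^a \mid A = a, L = l]\,\Pr(L = l) && \text{(conditional exchangeability)} \\
       &= \sum_l E[Y \mid A = a, L = l]\,\Pr(L = l) && \text{(consistency)}
\end{aligned}
\]

Positivity, \(\Pr(A = a \mid L = l) > 0\) wherever \(\Pr(L = l) > 0\), keeps each conditional mean well defined.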
Multiple sufficient sets: There may be many sufficient adjustment sets. We want to choose one that:
When we adjust for a sufficient set, we remove confounding. But be careful about adjusting for too much.
Confounders: Variables on backdoor paths
Example DAG: \[A \leftarrow L \to Y\]
Adjust for \(L\) to block the backdoor path.
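A small simulation can make this concrete. The sketch below is illustrative only: the data-generating model, the coefficients (a true effect of 1.0 and a confounder effect of 2.0), and the function names are assumptions chosen for the example, not taken from the chapter. It compares the crude difference in means with a difference standardized over the strata of \(L\).

```python
import random

def simulate(n=100_000, seed=0):
    """Simulate L -> A and L -> Y, with an assumed true effect of A on Y of 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = 1 if rng.random() < 0.5 else 0
        # L raises the probability of treatment (0.8 versus 0.2).
        a = 1 if rng.random() < (0.8 if l else 0.2) else 0
        y = 1.0 * a + 2.0 * l + rng.gauss(0.0, 1.0)
        rows.append((l, a, y))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def crude_difference(rows):
    """Difference in mean Y between treated and untreated, ignoring L."""
    return mean(y for l, a, y in rows if a == 1) - mean(y for l, a, y in rows if a == 0)

def standardized_difference(rows):
    """Stratum-specific differences, weighted by the marginal distribution of L."""
    total = 0.0
    for level in (0, 1):
        stratum = [(a, y) for l, a, y in rows if l == level]
        diff = mean(y for a, y in stratum if a == 1) - mean(y for a, y in stratum if a == 0)
        total += diff * len(stratum) / len(rows)
    return total

rows = simulate()
crude = crude_difference(rows)
adjusted = standardized_difference(rows)
print(f"crude: {crude:.2f}  adjusted for L: {adjusted:.2f}")
```

Under this model the crude contrast is about 2.2 in expectation, because it mixes the effect of \(A\) with the backdoor association through \(L\), while the standardized contrast is close to the assumed effect of 1.0.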
Mediators: Variables on the causal path from \(A\) to \(Y\)
Example DAG: \[A \to M \to Y\]
If we adjust for \(M\), we block the causal path through \(M\), so the estimate no longer captures the total effect of \(A\) on \(Y\).
Descendants of treatment: Variables affected by \(A\)
Definition 3 (Collider) A collider on a path is a variable with two arrows pointing into it.
Example: \[A \to C \leftarrow U \to Y\]
\(C\) is a collider on the path \(A \to C \leftarrow U \to Y\).
Property: This path is blocked by default (without conditioning on \(C\)).
Danger: If we condition on \(C\) (or its descendants), we open the path, creating collider bias.
Rule: Do NOT adjust for colliders (unless necessary to block other paths).
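Collider bias can be seen in a toy simulation. The sketch below assumes a deliberately simple binary model (all probabilities and the rule "\(C\) occurs if \(A\) or \(U\) occurs" are choices made for illustration) in which \(A\) has no effect on \(Y\) at all. It compares the risk difference in the full data with the risk difference among units with \(C = 1\).

```python
import random

rng = random.Random(1)
n = 100_000
rows = []
for _ in range(n):
    a = rng.random() < 0.5   # treatment, independent of U
    u = rng.random() < 0.5   # unmeasured cause of both C and Y
    c = a or u               # collider: A -> C <- U
    y = 1 if u else 0        # Y depends on U only; A has no effect on Y
    rows.append((a, c, y))

def risk_difference(data):
    """Mean of Y among treated minus mean of Y among untreated."""
    treated = [y for a, y in data if a]
    untreated = [y for a, y in data if not a]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

marginal = risk_difference([(a, y) for a, c, y in rows])
within_c1 = risk_difference([(a, y) for a, c, y in rows if c])
print(f"unconditional: {marginal:.2f}  within C = 1: {within_c1:.2f}")
```

In this model the unconditional risk difference is near zero, as it should be, while restricting to \(C = 1\) produces a risk difference near \(-0.5\): among untreated units with \(C = 1\), \(U\) must have occurred, so \(Y = 1\) for all of them.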
Some variables should not be adjusted for even if they’re associated with both treatment and outcome.
DAG structure:
U1 → L ← U2
↓         ↓
A         Y
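Properties: \(L\) is a collider on the path \(A \leftarrow U_1 \to L \leftarrow U_2 \to Y\), so this path is blocked as long as \(L\) is left unadjusted. \(L\) is associated with \(A\) through \(U_1\) and with \(Y\) through \(U_2\), yet it causes neither.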
Result: Adjusting for \(L\) introduces bias even though \(L\) is associated with both \(A\) and \(Y\).
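The size of this bias can be explored with a linear simulation. The following is a sketch under assumed conditions: every variable is standard normal noise plus its parents, the true effect of \(A\) on \(Y\) is set to 1.0, and the helper names `slope` and `residuals` are hypothetical. The adjusted coefficient is obtained by residualizing both \(A\) and \(Y\) on \(L\) (the Frisch-Waugh-Lovell result), which equals the coefficient of \(A\) in a regression of \(Y\) on \(A\) and \(L\).

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (model with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def residuals(y, x):
    """Residuals from the least-squares regression of y on x."""
    b = slope(x, y)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return [yi - mean_y - b * (xi - mean_x) for xi, yi in zip(x, y)]

rng = random.Random(2)
n = 50_000
u1 = [rng.gauss(0, 1) for _ in range(n)]
u2 = [rng.gauss(0, 1) for _ in range(n)]
l = [p + q + rng.gauss(0, 1) for p, q in zip(u1, u2)]        # U1 -> L <- U2
a = [p + rng.gauss(0, 1) for p in u1]                        # U1 -> A
y = [1.0 * t + q + rng.gauss(0, 1) for t, q in zip(a, u2)]   # A -> Y <- U2

unadjusted = slope(a, y)
# Coefficient of A in the regression of Y on A and L.
adjusted = slope(residuals(a, l), residuals(y, l))
print(f"unadjusted: {unadjusted:.2f}  adjusted for L: {adjusted:.2f}")
```

With these assumed coefficients the unadjusted slope is close to 1.0 and the slope adjusted for \(L\) is close to 0.8, so the adjustment itself creates the bias. The magnitude depends entirely on the chosen coefficients.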
An instrumental variable \(Z\) satisfies:
Z → A → Y
with no backdoor paths from \(Z\) to \(Y\).
Should we adjust for \(Z\)?
Time-varying treatments create new challenges for variable selection.
Setting: Treatment varies over time (\(A_0, A_1, \ldots\)), as do confounders (\(L_0, L_1, \ldots\))
Time-varying confounder: \(L_1\) is a confounder for the effect of \(A_1\) on \(Y\)
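Problem: If \(A_0\) affects \(L_1\), then \(L_1\) is both a confounder for the effect of \(A_1\) and a mediator of the effect of \(A_0\) on \(Y\).
Definition 4 (Intermediate Confounder (Time-Dependent Confounder Affected by Prior Treatment)) A variable \(L_1\) is an intermediate confounder if it is affected by the earlier treatment \(A_0\) and is itself a confounder for the effect of the later treatment \(A_1\) on \(Y\).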
DAG:
A_0 → L_1 → Y
↓ ↓
A_1 → Y
Standard regression fails: Adjusting for \(L_1\) blocks the part of the effect of \(A_0\) that runs through \(L_1\), and leaving \(L_1\) out leaves the effect of \(A_1\) confounded.
Solutions:
After ensuring confounding is addressed, can we include additional variables to improve precision?
Definition (precision variable): A variable associated with the outcome but not with treatment (after accounting for confounders).
Example DAG:
A → Y ← V
\(V\) is associated with \(Y\) but not with \(A\) (no arrow from \(V\) to \(A\) or shared causes).
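Effect of adjustment: Because \(V\) lies on no backdoor path, adjusting for it neither removes nor introduces confounding. It does explain part of the variation in \(Y\), which typically makes the effect estimate more precise.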
Recommendation: Include precision variables to improve efficiency.
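The efficiency gain can be illustrated by repeating a small simulated study many times. This sketch assumes a randomized binary treatment, a true effect of 1.0, and a precision variable \(V\) with coefficient 2.0; these numbers and the helper names are hypothetical. It compares the spread of the unadjusted and \(V\)-adjusted estimates across replications.

```python
import random
import statistics

def slope(x, y):
    """Least-squares slope of y on x (model with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def residuals(y, x):
    """Residuals from the least-squares regression of y on x."""
    b = slope(x, y)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return [yi - mean_y - b * (xi - mean_x) for xi, yi in zip(x, y)]

def one_study(rng, n=200):
    """Return (unadjusted, V-adjusted) effect estimates from one simulated study."""
    a = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]  # randomized treatment
    v = [rng.gauss(0, 1) for _ in range(n)]                     # affects Y only
    y = [1.0 * ai + 2.0 * vi + rng.gauss(0, 1) for ai, vi in zip(a, v)]
    unadjusted = slope(a, y)
    adjusted = slope(residuals(a, v), residuals(y, v))  # coefficient of A given V
    return unadjusted, adjusted

rng = random.Random(3)
estimates = [one_study(rng) for _ in range(300)]
unadjusted_estimates = [u for u, _ in estimates]
adjusted_estimates = [adj for _, adj in estimates]
sd_unadjusted = statistics.stdev(unadjusted_estimates)
sd_adjusted = statistics.stdev(adjusted_estimates)
print(f"SD unadjusted: {sd_unadjusted:.3f}  SD adjusted for V: {sd_adjusted:.3f}")
```

Both sets of estimates centre on the assumed effect of 1.0, but the adjusted estimates vary less from study to study. How much is gained depends on how strongly \(V\) predicts \(Y\).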
Question: Should we include instrumental variables in outcome models?
Answer: Generally NO.
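One reason for this answer is that, when some confounding by an unmeasured \(U\) remains, adjusting for an instrument can amplify the resulting bias. The sketch below illustrates this under an assumed linear model with unit coefficients and a true effect of 1.0; the model and the helper names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (model with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def residuals(y, x):
    """Residuals from the least-squares regression of y on x."""
    b = slope(x, y)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return [yi - mean_y - b * (xi - mean_x) for xi, yi in zip(x, y)]

rng = random.Random(4)
n = 50_000
z = [rng.gauss(0, 1) for _ in range(n)]                      # instrument
u = [rng.gauss(0, 1) for _ in range(n)]                      # unmeasured confounder
a = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]      # Z -> A <- U
y = [1.0 * ai + ui + rng.gauss(0, 1) for ai, ui in zip(a, u)]  # A -> Y <- U

unadjusted = slope(a, y)
# Coefficient of A in the regression of Y on A and Z.
adjusted_for_z = slope(residuals(a, z), residuals(y, z))
print(f"unadjusted: {unadjusted:.2f}  adjusted for Z: {adjusted_for_z:.2f}")
```

In this model the unadjusted slope is about 1.33 and the slope adjusted for \(Z\) is about 1.5. Conditioning on \(Z\) removes variation in \(A\) that is unrelated to \(U\), so the confounded share of the remaining variation grows.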
Causal DAGs (directed acyclic graphs) are invaluable tools for variable selection.
Draw the DAG:
Identify backdoor paths:
Find sufficient adjustment sets:
Use software (e.g., the dagitty R package) if the DAG is complex
Choose an adjustment set:
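For a small DAG, the backdoor-path and adjustment-set checks can be sketched in plain Python. This is a hand-rolled illustration, not the dagitty API: the function names are hypothetical, the example graph is made up, and the brute-force path enumeration is only practical for small graphs. A path counts as blocked if it contains a non-collider in the adjustment set, or a collider that is not in the set and has no descendant in the set.

```python
def descendants(edges, node):
    """All nodes reachable from `node` by following arrows."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    found, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def simple_paths(edges, start, end):
    """All simple paths from start to end, ignoring arrow direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    paths = []

    def extend(path):
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                extend(path + [nxt])

    extend([start])
    return paths

def backdoor_paths(edges, treatment, outcome):
    """Paths whose first edge points into the treatment."""
    return [p for p in simple_paths(edges, treatment, outcome)
            if (p[1], treatment) in edges]

def is_blocked(edges, path, adjust):
    """Apply the blocking rules to each interior node of the path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            if node not in adjust and not (descendants(edges, node) & adjust):
                return True
        elif node in adjust:
            return True
    return False

def satisfies_backdoor(edges, treatment, outcome, adjust):
    """True if `adjust` has no descendant of treatment and blocks every backdoor path."""
    adjust = set(adjust)
    if adjust & descendants(edges, treatment):
        return False
    return all(is_blocked(edges, p, adjust)
               for p in backdoor_paths(edges, treatment, outcome))

# Made-up example: a confounder L plus a collider C between U1 and U2.
dag = {("L", "A"), ("L", "Y"), ("A", "Y"),
       ("U1", "A"), ("U1", "C"), ("U2", "C"), ("U2", "Y")}

for candidate in [set(), {"L"}, {"L", "C"}, {"L", "C", "U1"}]:
    print(sorted(candidate), satisfies_backdoor(dag, "A", "Y", candidate))
```

In this example \(\{L\}\) is sufficient, while \(\{L, C\}\) is not, because conditioning on the collider \(C\) opens the path through \(U_1\) and \(U_2\). Adding \(U_1\) blocks that path again.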
R package dagitty:
Example:
Key principles for variable selection in causal inference:
Variables to include:
Variables to exclude:
Special cases:
Tools:
dagitty, ggdag (R), DAGitty (web interface)
Common mistakes:
Practical workflow: