Chapter 15: Outcome Regression and Propensity Scores

This chapter explores outcome regression and propensity scores in greater depth, clarifying their roles in causal inference. We examine when simple regression adjustment is sufficient, when it fails, and how propensity scores can be used for confounding adjustment through matching, stratification, or weighting.

15.1 Outcome Regression (pp. 207-210)

Outcome regression estimates causal effects by modeling the outcome as a function of treatment and confounders.

The Outcome Regression Approach

Definition 1 (Outcome Regression) Outcome regression for causal inference:

  1. Fit a model for \(E[Y \mid A, L]\)
  2. Use the model to compute standardized means (g-formula)
  3. Estimate causal effects as contrasts of standardized means

For simple cases, the treatment coefficient may approximate the causal effect, but this requires strong assumptions.
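The three steps above can be sketched in a few lines of numpy. Everything here is illustrative (simulated data, a made-up effect size of 2.0), not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data: one confounder L, binary treatment A, continuous outcome Y.
L = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-L)))       # treatment depends on L
Y = 2.0 * A + 1.5 * L + rng.normal(size=n)      # true causal effect = 2.0

# Step 1: fit a linear model for E[Y | A, L] by least squares.
X = np.column_stack([np.ones(n), A, L])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]

# Step 2: g-formula -- predict for everyone with A set to 1 and then to 0,
# and average over the observed distribution of L.
mean_y1 = (np.column_stack([np.ones(n), np.ones(n), L]) @ beta).mean()
mean_y0 = (np.column_stack([np.ones(n), np.zeros(n), L]) @ beta).mean()

# Step 3: the causal effect is a contrast of the standardized means.
ate = mean_y1 - mean_y0
```

Because this model has no \(A \times L\) interaction, `ate` coincides with the treatment coefficient `beta[1]`; the g-formula machinery matters once interactions are added.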

When Does the Treatment Coefficient Equal the Causal Effect?

Model: \(E[Y \mid A, L] = \beta_0 + \beta_1 A + \beta_2^{\top} L\)

Question: When does \(\beta_1 = E[Y^{a=1}] - E[Y^{a=0}]\)?

Answer: Only under restrictive conditions:

  1. Conditional exchangeability: \(Y^a \perp\!\!\!\perp A \mid L\) (no unmeasured confounding given the measured \(L\))
  2. No effect modification: The causal effect doesn’t vary with \(L\)
  3. Correct model specification: Linear model is correct

If effect modification exists, \(\beta_1\) is a weighted average of conditional effects, not generally equal to the marginal causal effect.

Example: NHEFS Study

Simple model: \[E[\text{Weight Change} \mid A, L] = \beta_0 + \beta_1 \text{Quit} + \beta_2 \text{Age} + \beta_3 \text{Sex} + \ldots\]

Issues:

  • Assumes effect of quitting is the same for all individuals
  • If the effect varies by age, sex, or other factors, \(\beta_1\) doesn’t equal the marginal causal effect
  • Need to add interactions or use g-formula

Better approach: \[E[Y \mid A, L] = \beta_0 + \beta_1 A + \beta_2^{\top} L + \beta_3^{\top} (A \times L)\]

Then use g-formula to compute marginal effect.
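A sketch of the better approach on simulated data, where the effect of treatment genuinely varies with \(L\) (all numbers illustrative). Note how the treatment coefficient and the marginal effect come apart:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
L = rng.normal(loc=1.0, size=n)                    # confounder with mean 1
A = rng.binomial(1, 1 / (1 + np.exp(-L)))
Y = (1.0 + 1.0 * L) * A + L + rng.normal(size=n)   # effect of A is 1 + L

# Outcome model with an A x L interaction term.
X = np.column_stack([np.ones(n), A, L, A * L])
b = np.linalg.lstsq(X, Y, rcond=None)[0]

# g-formula: standardize the predicted contrast over the observed L distribution.
marginal_ate = (b[1] + b[3] * L).mean()   # ~ E[1 + L] = 2.0
effect_at_L0 = b[1]                       # conditional effect at L = 0 (~ 1.0)
```

Here `b[1]` alone recovers only the effect at \(L = 0\); the marginal effect requires averaging over \(L\).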

15.2 Propensity Scores (pp. 210-213)

The propensity score is the probability of receiving treatment given confounders. It plays a central role in observational studies.

Definition 2 (Propensity Score) The propensity score is:

\[e(L) = \Pr[A = 1 \mid L]\]

For individual \(i\) with covariates \(L_i\), the propensity score is \(e(L_i) = \Pr[A = 1 \mid L = L_i]\).

Balancing Property

Balancing property: \(A \perp\!\!\!\perp L \mid e(L)\); among individuals with the same propensity score, the distribution of \(L\) is the same in treated and untreated.

Key theorem: As a consequence, if \(Y^a \perp\!\!\!\perp A \mid L\), then:

\[Y^a \perp\!\!\!\perp A \mid e(L)\]

Interpretation: Conditional on the propensity score, treatment assignment is independent of potential outcomes.

Implication: We can adjust for confounding by adjusting for the propensity score alone, rather than all components of \(L\).

Estimating Propensity Scores

Common approach: Logistic regression

\[\text{logit} \Pr[A = 1 \mid L] = \alpha_0 + \alpha_1^{\top} L\]

Estimation:

  1. Fit logistic regression with treatment \(A\) as outcome, confounders \(L\) as predictors
  2. Predict \(\hat{e}(L_i) = \hat{\Pr}[A = 1 \mid L_i]\) for each individual
  3. Use \(\hat{e}(L_i)\) for matching, stratification, or weighting

Model selection:

  • Include all confounders
  • Consider interactions and nonlinear terms
  • Assess balance after adjustment (see Section 15.3)
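The estimation steps can be sketched with a hand-rolled Newton-Raphson logistic fit (in practice one would use a standard library; the data-generating coefficients below are made up):

```python
import numpy as np

def fit_logistic(X, a, n_iter=25):
    """Newton-Raphson (IRLS) fit of logit Pr[A=1|L] = X @ beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (a - p)                        # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X   # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(2)
n = 10000
L = rng.normal(size=(n, 2))                         # two confounders
X = np.column_stack([np.ones(n), L])
alpha_true = np.array([-0.5, 1.0, -0.8])
A = rng.binomial(1, 1 / (1 + np.exp(-X @ alpha_true)))

# Step 1: fit the treatment model; step 2: predict e_hat for each individual.
alpha_hat = fit_logistic(X, A)
e_hat = 1 / (1 + np.exp(-X @ alpha_hat))
```

The vector `e_hat` is then the input to matching, stratification, or weighting (step 3).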

15.3 Propensity Stratification and Standardization (pp. 213-216)

Propensity scores can be used to stratify the population and then standardize.

Propensity Score Stratification

Procedure:

  1. Estimate propensity score \(\hat{e}(L_i)\) for all individuals
  2. Create strata (e.g., quintiles) of the propensity score
  3. Within each stratum, compute \(\hat{E}[Y \mid A = a, \text{stratum } s]\)
  4. Standardize across strata:

\[\hat{E}[Y^a] = \sum_{s=1}^S \hat{E}[Y \mid A = a, \text{stratum } s] \times \Pr[\text{stratum } s]\]

Checking Balance

After stratification, check whether confounders are balanced within strata:

Balance: Within stratum \(s\), the distribution of \(L\) should be similar for treated and untreated.

Diagnostics:

  • Compare means/proportions of \(L\) across treatment groups within strata
  • Standardized differences: \(\frac{\bar{L}_{A=1,s} - \bar{L}_{A=0,s}}{SD_{\text{pooled}}}\)
  • Target: Standardized differences < 0.1 (rule of thumb)

If balance is poor, refine the propensity score model (add interactions, polynomials, etc.).

Example: Quintile Stratification

Steps:

  1. Fit logistic regression for \(\Pr[A = 1 \mid L]\)
  2. Divide individuals into 5 groups (quintiles) based on \(\hat{e}(L)\)
  3. Within each quintile, compare treated vs untreated outcomes
  4. Standardize across quintiles using quintile proportions as weights

Common finding: Most of the confounding is removed by stratifying on propensity score quintiles, though finer stratification may improve balance.
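A minimal simulation of quintile stratification (the true propensity score is used for clarity; in practice it would be estimated as in Section 15.2, and all numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
L = rng.normal(size=n)
e = 1 / (1 + np.exp(-L))                 # propensity score
A = rng.binomial(1, e)
Y = 1.0 * A + L + rng.normal(size=n)     # true effect 1.0

# Steps 2-3: quintiles of e, then within-quintile treated-vs-untreated contrasts.
cuts = np.quantile(e, [0.2, 0.4, 0.6, 0.8])
stratum = np.searchsorted(cuts, e)       # stratum labels 0..4

# Step 4: standardize using stratum proportions as weights.
est = 0.0
for s in range(5):
    in_s = stratum == s
    contrast = Y[in_s & (A == 1)].mean() - Y[in_s & (A == 0)].mean()
    est += contrast * in_s.mean()

naive = Y[A == 1].mean() - Y[A == 0].mean()   # confounded: biased upward
```

Consistent with the "common finding" above, `est` removes most but not all of the gap between `naive` and the true effect; residual within-quintile confounding remains.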

15.4 Propensity Matching (pp. 216-219)

Propensity score matching creates pairs (or sets) of treated and untreated individuals with similar propensity scores.

Matching Algorithms

1-to-1 nearest neighbor matching:

  1. For each treated individual, find the untreated individual with the closest propensity score
  2. Form matched pairs
  3. Compute the effect as the average within-pair difference

Matching with replacement:

  • Each untreated individual can be matched to multiple treated individuals
  • Reduces bias but complicates variance estimation

Caliper matching:

  • Only match if propensity scores are within a specified distance (caliper)
  • Individuals without a close match are excluded
  • Improves balance but may reduce sample size

Assessing Match Quality

After matching, assess balance:

  1. Standardized differences: Compare means of \(L\) in matched treated vs untreated
  2. Love plots: Graphical display of standardized differences before and after matching
  3. Distribution plots: Compare distributions of confounders in matched samples

Target: Standardized differences < 0.1 for all confounders
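Nearest-neighbor matching with replacement, plus the standardized-difference balance check, can be sketched as follows (simulated data; a brute-force distance matrix is fine at this scale, though real implementations use smarter search):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
L = rng.normal(size=n)
e = 1 / (1 + np.exp(-1.5 * L))               # propensity score
A = rng.binomial(1, e)
Y = 1.0 * A + 2.0 * L + rng.normal(size=n)   # true effect 1.0

treated = np.where(A == 1)[0]
control = np.where(A == 0)[0]

# 1-to-1 nearest-neighbor matching on the propensity score, with replacement.
dist = np.abs(e[treated][:, None] - e[control][None, :])
matched = control[dist.argmin(axis=1)]

# Effect in the treated: average within-pair difference.
att = (Y[treated] - Y[matched]).mean()

def smd(x1, x0):
    """Standardized mean difference with pooled SD."""
    sd_pooled = np.sqrt((x1.var() + x0.var()) / 2)
    return (x1.mean() - x0.mean()) / sd_pooled

smd_before = smd(L[treated], L[control])     # large: L is confounded with A
smd_after = smd(L[treated], L[matched])      # target: |SMD| < 0.1
```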

Example: NHEFS Matching

Procedure:

  1. Estimate propensity score for quitting smoking
  2. Match each quitter to a non-quitter with similar propensity score
  3. In matched sample, compare weight change between quitters and non-quitters
  4. Estimate causal effect as mean difference in matched pairs

Advantages: Intuitive, allows checking balance on all confounders

Disadvantages: Discards some individuals, may not achieve perfect balance

15.5 Propensity Models, Treatment Models, and Marginal Structural Models (pp. 219-222)

Clarifying terminology: propensity scores, treatment models, and MSMs.

Definitions

Propensity score: \(e(L) = \Pr[A = 1 \mid L]\)

Treatment model: Any model for \(\Pr[A \mid L]\) (or \(f(A \mid L)\) for non-binary \(A\))

Marginal structural model (MSM): Model for \(E[Y^a]\) or \(E[Y^a \mid V]\)

Relationship

IP weighting uses the treatment model to create weights:

\[W^A = \frac{1}{f(A \mid L)}\]

For binary \(A\), this uses the propensity score:

\[W^A = \frac{1}{e(L)} \text{ if } A = 1, \quad W^A = \frac{1}{1 - e(L)} \text{ if } A = 0\]

The MSM is then fit by weighted regression in the pseudo-population created by the IP weights:

\[E[Y^a] = \beta_0 + \beta_1 a\]
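The weighting-then-fitting pipeline can be sketched as a weighted least-squares fit of the MSM (simulated data; the true propensity score is used for clarity):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000
L = rng.normal(size=n)
e = 1 / (1 + np.exp(-L))                     # propensity score
A = rng.binomial(1, e)
Y = 1.5 * A + L + rng.normal(size=n)         # true marginal effect 1.5

# IP weights from the treatment model: 1/e(L) if treated, 1/(1-e(L)) if not.
w = np.where(A == 1, 1 / e, 1 / (1 - e))

# Fit the MSM E[Y^a] = beta0 + beta1 * a by weighted least squares:
# rescale rows by sqrt(w) and solve ordinary least squares.
X = np.column_stack([np.ones(n), A])
sw = np.sqrt(w)
beta = np.linalg.lstsq(X * sw[:, None], Y * sw, rcond=None)[0]
```

`beta[1]` estimates \(E[Y^{a=1}] - E[Y^{a=0}]\) in the weighted pseudo-population.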

15.6 Propensity Scores and Outcome Regression (pp. 222-224)

Can we combine propensity scores with outcome regression?

Doubly Robust Estimation

Idea: Use both a treatment model and an outcome model.

Estimator: Fit outcome model within propensity score strata (or matched sets), then standardize.

Double robustness: The estimator is consistent if EITHER:

  1. The propensity score model is correct, OR
  2. The outcome model is correct

(It is not necessary that both be correctly specified.)

Augmented IP Weighting (AIPW)

Advanced approach: Combine IP weighting with outcome modeling:

\[\hat{E}[Y^a] = \frac{1}{n}\sum_{i=1}^n \left[\frac{I(A_i = a) Y_i}{f(a \mid L_i)} - \frac{I(A_i = a) - f(a \mid L_i)}{f(a \mid L_i)} m(a, L_i)\right]\]

where \(m(a, L) = \hat{E}[Y \mid A = a, L]\) is the outcome model.

Properties:

  • Doubly robust: Consistent if either model is correct
  • More efficient than IP weighting alone when outcome model is correct
  • Locally efficient (optimal variance) when both models are correct
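A direct transcription of the AIPW formula above, with both models fit on simulated data (here both happen to be correctly specified; all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000
L = rng.normal(size=n)
e = 1 / (1 + np.exp(-L))                     # f(1 | L); f(0 | L) = 1 - e
A = rng.binomial(1, e)
Y = 1.0 * A + L + rng.normal(size=n)         # true effect 1.0

# Outcome model m(a, L): linear regression of Y on A and L.
X = np.column_stack([np.ones(n), A, L])
b = np.linalg.lstsq(X, Y, rcond=None)[0]
m1 = b[0] + b[1] + b[2] * L                  # m(1, L)
m0 = b[0] + b[2] * L                         # m(0, L)

def aipw_mean(a, f_a, m_a):
    """AIPW estimate of E[Y^a]: IPW term minus the augmentation term."""
    ind = (A == a).astype(float)
    return np.mean(ind * Y / f_a - (ind - f_a) / f_a * m_a)

ate = aipw_mean(1, e, m1) - aipw_mean(0, 1 - e, m0)
```

Deliberately misspecifying one of the two models (e.g., dropping \(L\) from the outcome model) leaves `ate` approximately consistent, which is the double-robustness property in action.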

15.7 Propensity Scores for Continuous Treatments (pp. 224-226)

Propensity scores extend to continuous treatments, though with additional complexity.

Generalized Propensity Score

For continuous treatment \(A\), the generalized propensity score is the conditional density:

\[f(A \mid L)\]

Balancing property: Under conditional exchangeability,

\[Y^a \perp\!\!\!\perp A \mid f(A \mid L)\]

Estimation

Common approach: Model the conditional distribution of \(A\) given \(L\).

Example: Normal model

\[A \mid L \sim \text{Normal}(\mu(L), \sigma^2)\]

where \(\mu(L) = \alpha_0 + \alpha_1^{\top} L\)

GPS: \(f(A_i \mid L_i) = \frac{1}{\sigma}\,\phi\left(\frac{A_i - \mu(L_i)}{\sigma}\right)\) where \(\phi\) is the standard normal density.

Using the GPS

IP weighting: Create weights

\[W_i = \frac{f(A_i)}{f(A_i \mid L_i)}\]

where \(f(A_i)\) is the marginal density of \(A\) (unconditional).

Stratification: Stratify on the GPS and standardize within strata.
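The IP-weighting route with a normal treatment model can be sketched as follows. The stabilized weight \(f(A)/f(A \mid L)\) is computed from two fitted normal densities, and the MSM is fit by weighted least squares (simulated data, illustrative coefficients; in practice weights for continuous treatments can be heavy-tailed and are often truncated):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
L = rng.normal(size=n)
A = 0.4 * L + rng.normal(size=n)             # continuous treatment
Y = 1.0 * A + L + rng.normal(size=n)         # true dose-response slope 1.0

# Treatment model A | L ~ Normal(mu(L), sigma^2), fit by least squares.
XL = np.column_stack([np.ones(n), L])
alpha = np.linalg.lstsq(XL, A, rcond=None)[0]
resid = A - XL @ alpha
sigma = resid.std()
gps = np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Marginal density of A (also modeled as normal) for stabilized weights.
f_marg = np.exp(-0.5 * ((A - A.mean()) / A.std()) ** 2) / (A.std() * np.sqrt(2 * np.pi))
w = f_marg / gps

# Weighted least squares for the MSM E[Y^a] = beta0 + beta1 * a.
Xa = np.column_stack([np.ones(n), A])
sw = np.sqrt(w)
beta = np.linalg.lstsq(Xa * sw[:, None], Y * sw, rcond=None)[0]

naive = np.linalg.lstsq(Xa, Y, rcond=None)[0]  # unweighted: slope confounded upward
```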

Summary

Key concepts:

  1. Outcome regression: Models \(E[Y \mid A, L]\) to estimate causal effects via g-formula
  2. Propensity score: \(e(L) = \Pr[A = 1 \mid L]\), reduces confounding adjustment to a single dimension
  3. Balancing property: Conditioning on propensity score achieves conditional exchangeability
  4. Propensity stratification: Create strata by propensity score and standardize
  5. Propensity matching: Match treated and untreated with similar propensity scores
  6. Double robustness: Combining treatment and outcome models for robustness
  7. Continuous treatments: Generalized propensity score extends to non-binary treatments

Methods comparison:

| Method | Uses | Advantages | Disadvantages |
|---|---|---|---|
| Outcome regression | \(E[Y \mid A, L]\) | Natural; efficient when correct | Requires correct outcome model |
| IP weighting | \(\Pr[A \mid L]\) | Natural for MSMs; handles time-varying treatments | Can be unstable; needs correct treatment model |
| Propensity matching | \(e(L)\) | Intuitive; easy to check balance | Discards data; complex inference |
| Propensity stratification | \(e(L)\) | Reduces dimensionality | Requires choosing number of strata |
| Doubly robust | Both models | Robust to misspecification of one model | More complex; needs both models |

Practical recommendations:

  1. Always check balance: After propensity score adjustment, assess whether confounders are balanced
  2. Model carefully: Propensity scores are only as good as the treatment model
  3. Check positivity: Extreme propensity scores (near 0 or 1) indicate violations
  4. Use multiple methods: Try outcome regression, IP weighting, and propensity methods as sensitivity analyses
  5. Consider double robustness: When feasible, doubly robust methods provide insurance against misspecification
Hernán, Miguel A, and James M Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. https://miguelhernan.org/whatifbook.