Chapter 10: Random Variability

Part I focused on causal inference in settings where we conceptualized study populations as effectively infinite, allowing us to ignore random variability and focus solely on systematic bias from confounding, selection, and measurement. Part II now introduces random variability and the use of statistical models for causal inference. This chapter bridges identification (Part I) and estimation (Part II), explaining why we need models and how to quantify uncertainty.

1 10.1 Identification Versus Estimation (pp. 131-134)

Up to now, we have focused on identification: determining whether causal effects can be computed from observed data under certain assumptions. Now we turn to estimation: using finite data to approximate those causal effects.

Key Concepts in Statistical Estimation

Estimand: The population parameter of interest (e.g., \(Pr[Y = 1|A = a]\) in the super-population).

Estimator: A rule for computing an estimate of the estimand from sample data.

Estimate: The numerical value obtained by applying the estimator to a particular sample (a point estimate).

Example 1 (Sample Proportion as an Estimator) Estimand: Super-population risk \(Pr[Y = 1|A = 1]\)

Estimator: Sample proportion \(\widehat{Pr}[Y = 1|A = 1]\)

Estimate: From our 20-person study, \(\widehat{Pr}[Y = 1|A = 1] = 7/13 \approx 0.54\)

Consistency of Estimators

An estimator is consistent if, as the sample size grows, the probability that the estimate deviates from the true parameter by more than any fixed amount approaches zero (convergence in probability):

\[Pr\left[|\hat{\theta}_n - \theta| > \epsilon\right] \rightarrow 0 \text{ as } n \rightarrow \infty, \text{ for all } \epsilon > 0\]

The sample proportion \(\widehat{Pr}[Y = 1|A = a]\) is a consistent estimator of \(Pr[Y = 1|A = a]\).
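As a quick illustration (not from the text), a simulation can show the sample proportion converging to an assumed true risk \(p\) as \(n\) grows; the value \(p = 0.3\) and the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # assumed true super-population risk Pr[Y = 1 | A = a] (illustrative)

# Draw increasingly large samples and record how far the sample
# proportion lands from the truth.
errors = {}
for n in [100, 10_000, 1_000_000]:
    y = rng.binomial(1, p, size=n)   # n Bernoulli(p) outcomes
    errors[n] = abs(y.mean() - p)    # |p_hat - p|
```

With a million observations the error is essentially zero, consistent with convergence in probability.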

Confidence Intervals

A 95% confidence interval quantifies uncertainty due to random sampling: it is computed by a procedure that, over hypothetical repetitions of the study, covers the true parameter 95% of the time.

Construction (Wald interval):

  1. Compute point estimate \(\hat{p}\)
  2. Estimate standard error: \(\widehat{SE} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
  3. Compute interval: \(\hat{p} \pm 1.96 \times \widehat{SE}\)

Example: For \(\hat{p} = 7/13 \approx 0.54\) with \(n = 13\):

  • \(\widehat{SE} = \sqrt{(7/13)(6/13)/13} \approx 0.138\)
  • 95% CI: \(0.54 \pm 1.96(0.138) \approx (0.27, 0.81)\)
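The arithmetic above can be reproduced in a few lines; the numbers come directly from the 20-person example (7 events among 13 treated).

```python
import math

events, n = 7, 13
p_hat = events / n                                   # point estimate, about 0.54
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)          # estimated standard error
ci = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat)  # Wald 95% CI
```

This reproduces \(\widehat{SE} \approx 0.138\) and the interval \((0.27, 0.81)\).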

2 10.2 Estimation of Causal Effects (pp. 134-136)

In randomized experiments with random sampling, standard statistical methods can be used to estimate causal effects and compute confidence intervals.

Setting

Suppose:

  1. Study population is a random sample from a super-population
  2. Treatment is randomly assigned in the super-population (or in the sample)
  3. All individuals adhere to assigned treatment
  4. Exchangeability holds (\(Y^a \perp\!\!\!\perp A\)), which implies \(Pr[Y^a = 1] = Pr[Y = 1|A = a]\)

Causal Inference

Because of exchangeability, the causal risk difference equals the associational risk difference in the super-population:

\[Pr[Y^{a=1} = 1] - Pr[Y^{a=0} = 1] = Pr[Y = 1|A = 1] - Pr[Y = 1|A = 0]\]

Estimators:

  • Causal risk difference: \(\widehat{Pr}[Y = 1|A = 1] - \widehat{Pr}[Y = 1|A = 0]\)
  • Causal risk ratio: \(\widehat{Pr}[Y = 1|A = 1] / \widehat{Pr}[Y = 1|A = 0]\)

Standard statistical methods provide confidence intervals for these causal effects.
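A minimal sketch of these estimators, taking the treated arm from the example above and hypothetical counts for the untreated arm (3 events among 7, not given in this section):

```python
import math

e1, n1 = 7, 13   # treated arm (from the example above)
e0, n0 = 3, 7    # untreated arm (hypothetical counts for illustration)

p1, p0 = e1 / n1, e0 / n0
rd = p1 - p0     # estimated causal risk difference
rr = p1 / p0     # estimated causal risk ratio

# Wald 95% CI for the risk difference: the variances of the two
# independent sample proportions add.
se_rd = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
ci_rd = (rd - 1.96 * se_rd, rd + 1.96 * se_rd)
```

With so few subjects the interval is wide and includes zero, which is why small trials rarely yield precise effect estimates.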

Observational Studies

In observational studies, similar methods apply after adjusting for confounding:

  • Compute standardized or IP weighted estimates
  • Calculate confidence intervals using appropriate standard errors
  • Account for adjustment when computing standard errors

3 10.3 The Myth of the Super-Population (pp. 136-139)

The concept of a “super-population” is a useful fiction that allows us to apply statistical methods, but it raises important questions about the sources of randomness.

Two Sources of Randomness

  1. Sampling variability: Random selection of individuals from a super-population
  2. Nondeterministic counterfactuals: Intrinsic randomness in outcomes even with fixed treatment and covariates

When Are Binomial Confidence Intervals Valid?

The standard binomial confidence interval for \(p = Pr[Y = 1|A = a]\) is valid in two scenarios:

Scenario 1: Random sampling from a super-population

  • Study participants are randomly sampled from a large super-population
  • Each individual \(i\) has a fixed but unknown \(Y_i^a\) (deterministic counterfactuals)
  • Randomness comes solely from which individuals are sampled

Scenario 2: Nondeterministic counterfactuals

  • Study participants are the entire population of interest (no sampling)
  • Each individual has a probability \(p_i\) of outcome, not a fixed \(Y_i^a\)
  • Randomness comes from the probabilistic nature of outcomes
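A small simulation (illustrative values throughout) shows why the same binomial interval works in both scenarios: sampling \(n\) individuals from a large super-population in which a fraction \(p\) has \(Y^a = 1\) (a hypergeometric draw) yields essentially the same distribution of event counts as \(n\) independent Bernoulli(\(p\)) outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 100, 0.3, 50_000

# Scenario 1: deterministic counterfactuals; randomness comes from
# sampling n individuals without replacement from a super-population
# of N individuals, exactly p*N of whom have Y^a = 1.
N = 1_000_000
counts1 = rng.hypergeometric(int(p * N), N - int(p * N), n, size=reps)

# Scenario 2: nondeterministic counterfactuals; the n individuals are
# the whole population, each with outcome probability p.
counts2 = rng.binomial(n, p, size=reps)
```

Both empirical distributions match the Binomial(\(n, p\)) mean \(np = 30\) and variance \(np(1-p) = 21\) (up to a negligible finite-population correction in Scenario 1), so the same standard error estimate applies.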

Practical Implications

Most applied researchers use confidence intervals computed under the super-population framework, even when:

  • No random sampling was performed
  • The super-population is not well-defined
  • Generalization to a specific population is unclear

This practice is justified by the convenience and familiarity of standard methods, though it requires careful interpretation.

4 10.4 The Conditionality Principle (pp. 139-140)

When should we condition on variables when computing causal effects and their standard errors?

The Principle

Conditionality principle: inference about a parameter should be carried out conditional on ancillary statistics, i.e., statistics whose distribution does not depend on the parameter of interest. In causal analyses, baseline covariates \(L\) that are independent of treatment assignment are typically ancillary, so conditioning on \(L\) when computing estimates and standard errors does not affect validity.

Applications

Example 1: Stratified randomization

If treatment is randomized within strata of sex \(L\):

  • Can estimate effects unconditional on \(L\) (marginal effects)
  • Can estimate effects conditional on \(L\) (stratum-specific effects)
  • Both are valid; choice depends on scientific question

Example 2: Baseline covariates in randomized trials

Even when baseline covariates \(L\) are balanced across treatment groups:

  • Conditioning on \(L\) may improve precision (narrower confidence intervals)
  • Does not introduce bias if \(L\) is a pre-treatment variable
  • Called “covariate adjustment” or “regression adjustment”
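The precision gain can be seen in a simulation (all parameters illustrative): in a randomized trial with a strongly prognostic baseline covariate \(L\), the adjusted OLS estimator of the treatment effect has visibly smaller sampling variability than the unadjusted difference in means, and both remain unbiased.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2_000
est_unadj, est_adj = [], []

for _ in range(reps):
    A = rng.integers(0, 2, n)                   # randomized treatment
    L = rng.normal(size=n)                      # prognostic baseline covariate
    Y = 1.0 * A + 2.0 * L + rng.normal(size=n)  # true treatment effect = 1.0

    # Unadjusted estimator: difference in sample means.
    est_unadj.append(Y[A == 1].mean() - Y[A == 0].mean())

    # Adjusted estimator: OLS of Y on intercept, A, and L.
    X = np.column_stack([np.ones(n), A, L])
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]
    est_adj.append(beta[1])
```

Across replications both estimators center on the true effect, but the adjusted estimator's standard deviation is less than half the unadjusted one's, because \(L\) explains most of the outcome variance.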

5 10.5 The Curse of Dimensionality (pp. 140-142)

As the number of confounders or effect modifiers increases, nonparametric estimation becomes increasingly difficult. This is the curse of dimensionality.

The Problem

Suppose we need to adjust for 10 binary confounders:

  • Number of possible covariate patterns: \(2^{10} = 1024\)
  • Need enough observations in each pattern to estimate effects
  • With limited data, many cells will be sparse or empty

Consequences:

  1. Positivity violations: Some covariate patterns have no treated or untreated individuals
  2. Unstable estimates: Small cell counts lead to large standard errors
  3. Impractical stratification: Cannot stratify on so many variables simultaneously
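A quick count (illustrative sample size) makes the sparsity concrete: with 10 binary confounders and 1,000 subjects, many of the 1,024 covariate patterns are empty, and far fewer contain both a treated and an untreated subject, as positivity would require for nonparametric stratification.

```python
import numpy as np

rng = np.random.default_rng(3)
k, n = 10, 1_000
L = rng.integers(0, 2, size=(n, k))   # 10 binary confounders
A = rng.integers(0, 2, size=n)        # treatment indicator

# Encode each row of L as an integer pattern in [0, 2**k).
pattern = L @ (1 << np.arange(k))

n_patterns = 2 ** k                    # 1024 possible patterns
occupied = np.unique(pattern).size     # patterns with at least one subject
both = np.intersect1d(pattern[A == 1], pattern[A == 0]).size  # treated AND untreated
```

In a typical run only roughly 60% of patterns are occupied at all, and only a minority of those contain subjects from both treatment arms.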

The Solution: Parametric Models

Parametric models make assumptions about the functional form relating variables:

  • Logistic regression: \(\text{logit} Pr[Y = 1|A, L] = \beta_0 + \beta_1 A + \beta_2 L\)
  • Linear regression: \(E[Y|A, L] = \beta_0 + \beta_1 A + \beta_2 L\)

Advantages:

  • Can “borrow strength” across covariate patterns
  • Requires fewer parameters than nonparametric estimation
  • Provides smooth estimates even with sparse data

Disadvantages:

  • Model misspecification: If functional form is wrong, estimates are biased
  • Trade-off between bias (from model assumptions) and variance (from limited data)
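As a sketch of how a parametric model pools information, the logistic model above can be fit by Newton-Raphson (equivalently, iteratively reweighted least squares) in plain NumPy; the data and coefficient values below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
A = rng.integers(0, 2, n)
L = rng.integers(0, 2, n)

# Simulate from logit Pr[Y = 1 | A, L] = -1.0 + 0.5*A + 1.0*L (chosen values).
logits = -1.0 + 0.5 * A + 1.0 * L
Y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# Newton-Raphson for the maximum likelihood estimate of (b0, b1, b2).
X = np.column_stack([np.ones(n), A, L])
beta = np.zeros(3)
for _ in range(15):
    p = 1 / (1 + np.exp(-X @ beta))            # fitted probabilities
    grad = X.T @ (Y - p)                       # score vector
    hess = (X * (p * (1 - p))[:, None]).T @ X  # observed information
    beta = beta + np.linalg.solve(hess, grad)
```

With three parameters the model covers all four \((A, L)\) patterns; the same machinery handles 10 confounders with only 11 main-effect parameters instead of 1,024 strata, which is the "borrowing strength" at work.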

Why This Matters for Part II

The remainder of Part II describes methods that use parametric and semiparametric models to:

  1. Adjust for many confounders efficiently
  2. Handle high-dimensional covariate spaces
  3. Estimate causal effects with adequate precision

Understanding the curse of dimensionality motivates the need for these modeling approaches.

6 Summary

This chapter introduced random variability and bridged Part I (identification) and Part II (estimation).

Key concepts:

  1. Identification vs. estimation:
    • Identification: Can we express causal effects in terms of observables?
    • Estimation: How do we estimate from finite data?
  2. Statistical concepts:
    • Estimands, estimators, estimates
    • Consistency: Estimates approach truth as \(n \to \infty\)
    • Confidence intervals: Quantify random sampling uncertainty
  3. The super-population:
    • Convenient fiction for statistical inference
    • Two sources of randomness: sampling and nondeterministic counterfactuals
    • Justifies use of standard confidence intervals
  4. The conditionality principle:
    • Can condition on variables independent of treatment/outcome
    • Conditioning on baseline variables can improve precision
  5. The curse of dimensionality:
    • High-dimensional covariates require parametric/semiparametric models
    • Trade-off between bias (model misspecification) and variance (limited data)
    • Motivates modeling approaches in Part II

7 References

Hernán, Miguel A., and James M. Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. https://miguelhernan.org/whatifbook.