Chapter 11: Why Model?

Part I of this book was mostly conceptual, with calculations kept to a minimum. In contrast, Part II requires the use of computers to fit regression models. This chapter describes the differences between the nonparametric estimators used in Part I and the parametric (model-based) estimators used in Part II. It reviews the concept of smoothing and the bias-variance trade-off in modeling decisions, motivating the need for models in data analysis regardless of whether the goal is causal inference or prediction.

11.1 Data Cannot Speak for Themselves (pp. 147-150)

Even the simple task of estimating a population mean requires modeling assumptions when data become sparse.

Example: HIV Treatment and CD4 Count

Consider a study of 16 HIV-positive individuals randomly sampled from a super-population. Each receives treatment \(A\) (antiretroviral therapy), and we measure outcome \(Y\) (CD4 cell count, cells/mm³).

Goal: Estimate the population mean \(E[Y|A = a]\) for each treatment level \(a\).

Scenario 1: Binary Treatment

Treatment \(A \in \{0, 1\}\) with 8 individuals in each group.

Estimator: Sample average within each group

  • Estimate for \(A = 0\): \(\bar{Y}_{A=0} = 67.50\)
  • Estimate for \(A = 1\): \(\bar{Y}_{A=1} = 146.25\)

This nonparametric estimator (sample mean) is consistent and unbiased.

Scenario 2: Four Treatment Levels

Treatment \(A \in \{1, 2, 3, 4\}\) (none, low-dose, medium-dose, high-dose) with 4 individuals per group.

Estimates: 70.0, 80.0, 117.5, 195.0 for \(A = 1, 2, 3, 4\) respectively.

Issue: With only 4 individuals per category:

  • Sample averages are still unbiased
  • But each average is based on fewer observations, so estimates are less precise (wider confidence intervals)
  • Estimates vary more from sample to sample

Scenario 3: Continuous Treatment

Treatment \(A\) is dose in mg/day, taking integer values from 0 to 100 mg.

Problem: With 16 individuals and 101 possible treatment values:

  • Many treatment levels have zero observations
  • Cannot compute sample average for unobserved treatment levels
  • The nonparametric estimator is undefined for \(A\) values with no data

Question: How do we estimate \(E[Y|A = 90]\) when no one received dose 90?
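The breakdown of the nonparametric estimator can be sketched in a few lines of Python. The 16 doses and CD4 counts below are hypothetical (the notes do not give individual-level data), chosen so that no one receives dose 90:

```python
import numpy as np

# Hypothetical data: 16 individuals, dose in mg/day and CD4 count (cells/mm^3)
doses = np.array([0, 5, 10, 20, 30, 35, 40, 50,
                  55, 60, 65, 70, 75, 80, 85, 100])
cd4 = np.array([70, 75, 80, 95, 105, 110, 115, 130,
                135, 145, 150, 155, 165, 170, 175, 195], dtype=float)

def sample_mean_at(a, doses, outcomes):
    """Nonparametric estimate of E[Y | A = a]: the sample average among
    individuals with A == a; undefined when no one received dose a."""
    mask = doses == a
    if not mask.any():
        return None  # estimator undefined at this treatment level
    return outcomes[mask].mean()

print(sample_mean_at(50, doses, cd4))  # defined: at least one observation
print(sample_mean_at(90, doses, cd4))  # None: dose 90 was never observed
```

The estimator works wherever data exist and simply has no value to return at unobserved treatment levels, which is the sparsity problem the section describes.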

11.2 Parametric Estimators of the Conditional Mean (pp. 150-152)

Parametric models make assumptions about the functional form relating treatment to outcome, allowing estimation even with sparse data.

Linear Regression Model

Assume the conditional mean follows a linear function:

\[E[Y|A = a] = \beta_0 + \beta_1 a\]

Parameters: \((\beta_0, \beta_1)\) define the line.

Estimation: Fit the model using least squares to estimate \((\hat{\beta}_0, \hat{\beta}_1)\).

Prediction: For any value \(a\), estimate \(\hat{E}[Y|A = a] = \hat{\beta}_0 + \hat{\beta}_1 a\).

Example 1 (Linear Model for Continuous Treatment) With the HIV data and continuous treatment dose:

  • Fit: \(E[Y|A] = \beta_0 + \beta_1 A\)
  • Obtain estimates: \(\hat{\beta}_0 = 70\), \(\hat{\beta}_1 = 1.25\)
  • Predict for \(A = 90\): \(\hat{E}[Y|A = 90] = 70 + 1.25(90) = 182.5\)

Even though no one received dose 90, the model provides an estimate by interpolation from observed doses.
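A minimal least-squares fit can reproduce the numbers above. The doses are hypothetical, and the outcomes are placed exactly on the line \(70 + 1.25a\) so that the fit recovers the section's estimates:

```python
import numpy as np

# Hypothetical doses; outcomes constructed to lie exactly on 70 + 1.25*a
doses = np.array([0, 5, 10, 20, 30, 35, 40, 50,
                  55, 60, 65, 70, 75, 80, 85, 100], dtype=float)
cd4 = 70 + 1.25 * doses

# np.polyfit returns coefficients from highest degree down: [beta1, beta0]
beta1, beta0 = np.polyfit(doses, cd4, deg=1)

# Model-based prediction at a dose no one received
pred_90 = beta0 + beta1 * 90
print(round(beta0, 2), round(beta1, 2), round(pred_90, 2))
```

The model borrows strength from all 16 observations to produce an estimate at \(A = 90\), at the price of assuming the dose-response curve really is a line.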

Other Parametric Models

Quadratic model: \[E[Y|A = a] = \beta_0 + \beta_1 a + \beta_2 a^2\]

Logarithmic model (defined only for \(a > 0\)): \[E[Y|A = a] = \beta_0 + \beta_1 \log(a)\]

Piecewise linear (splines): Different linear relationships in different ranges of \(A\).

Each model makes different assumptions about the shape of the dose-response curve.
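Any of these forms can be fit by ordinary least squares on an expanded design matrix. A sketch for the quadratic model, using data generated from an assumed quadratic truth (the coefficients 70, 1.25, 0.02 are illustrative, not from the book):

```python
import numpy as np

a = np.linspace(0, 100, 16)
y = 70 + 1.25 * a + 0.02 * a**2  # assumed quadratic dose-response curve

# Design matrix with columns [1, a, a^2]; least-squares solve for the betas
X = np.column_stack([np.ones_like(a), a, a**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 3))  # recovers the coefficients of the generating curve
```

Swapping the columns of `X` (e.g. `[1, log(a)]`, or spline basis functions) fits the other models listed above with the same machinery.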

11.3 Smoothing (pp. 152-153)

Smoothing refers to techniques that estimate the conditional mean as a smooth function, balancing between nonparametric flexibility and parametric smoothness.

The Smoothing Spectrum

Nonparametric (no smoothing):

  • Sample means within groups
  • No assumptions about functional form
  • High variance when data are sparse

Parametric (maximum smoothing):

  • Linear, quadratic, etc. models
  • Strong assumptions about functional form
  • Low variance but potential bias

Semiparametric (intermediate smoothing):

  • Methods that smooth but make weaker assumptions
  • Examples: kernel smoothing, local regression, splines
  • Balance bias and variance

Kernel Smoothing

Idea: Estimate \(E[Y|A = a]\) using a weighted average of nearby observations.

\[\hat{E}[Y|A = a] = \frac{\sum_i K\left(\frac{A_i - a}{h}\right) Y_i}{\sum_i K\left(\frac{A_i - a}{h}\right)}\]

where:

  • \(K(\cdot)\) is a kernel function (e.g., Gaussian)
  • \(h\) is the bandwidth (controls amount of smoothing)

Bandwidth selection:

  • Large \(h\): More smoothing, use distant observations
  • Small \(h\): Less smoothing, use only nearby observations
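A direct implementation of the weighted-average formula above (a sketch: the function name, the Gaussian kernel choice, and the linear test data are mine, not the book's):

```python
import numpy as np

def kernel_smooth(a_grid, A, Y, h):
    """Nadaraya-Watson estimate of E[Y | A = a] at each point of a_grid,
    using a Gaussian kernel K with bandwidth h."""
    # w[i, j] = K((A_j - a_i) / h) for grid point i and observation j
    w = np.exp(-0.5 * ((A[None, :] - a_grid[:, None]) / h) ** 2)
    return (w @ Y) / w.sum(axis=1)

# Hypothetical data on a linear dose-response curve
A = np.linspace(0, 100, 16)
Y = 70 + 1.25 * A

# Large h: heavy smoothing, distant observations pull the estimate toward
# the overall mean; small h: the estimate tracks nearby observations
for h in (50.0, 5.0):
    print(h, kernel_smooth(np.array([90.0]), A, Y, h)[0])
```

With the linear data above, the small-bandwidth estimate at \(A = 90\) lands much closer to the true value 182.5 than the heavily smoothed one, illustrating how \(h\) trades local fidelity against smoothness.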

11.4 The Bias-Variance Trade-Off (pp. 153-155)

Every statistical estimator involves a trade-off between bias and variance.

Definitions

Bias: Systematic error, the difference between the expected value of the estimator and the true parameter.

\[\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta\]

Variance: Random error, the variability of the estimator across repeated samples.

\[\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]\]

Mean squared error (MSE): Combines both sources of error.

\[\text{MSE}(\hat{\theta}) = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})\]
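The decomposition holds exactly for empirical moments too, which makes it easy to check numerically. A sketch with a deliberately biased estimator (the bias of 0.5 and noise scale of 1.0 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 5.0                                     # true parameter
draws = rng.normal(theta + 0.5, 1.0, 100_000)   # estimator with bias 0.5

bias = draws.mean() - theta
var = draws.var()                               # population-style variance (ddof=0)
mse = np.mean((draws - theta) ** 2)

# mean((x - theta)^2) = (mean(x) - theta)^2 + var(x) holds as an identity
print(bias, var, mse)
```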

The Trade-Off

Nonparametric estimators (e.g., sample means):

  • Bias: Low (if sufficient data in each cell)
  • Variance: High (when data are sparse)
  • Works well with dense data, fails with sparse data

Parametric estimators (e.g., linear regression):

  • Bias: Potentially high (if model is misspecified)
  • Variance: Low (uses all data efficiently)
  • Works with sparse data, but relies on correct specification

11.5 The Bias-Variance Trade-Off in Action (pp. 155-156)

Simulation studies can illustrate the bias-variance trade-off across different sample sizes.

Simulation Setup

  1. Specify a true data-generating process (e.g., \(E[Y|A] = \beta_0 + \beta_1 A + \beta_2 A^2\))
  2. Generate many datasets of size \(n\) from this process
  3. Fit different models to each dataset
  4. Compute bias, variance, and MSE of each estimator

Typical Results

Small sample size (\(n = 16\)):

  • Nonparametric: High variance, low bias (where defined)
  • Simple parametric (linear): Low variance, moderate bias
  • Flexible parametric (quadratic): Moderate variance, low bias
  • Winner: Simple or moderately flexible model (minimizes MSE)

Large sample size (\(n = 1000\)):

  • Nonparametric: Low variance, low bias
  • Simple parametric (linear): Low variance, high bias (if misspecified)
  • Flexible parametric: Low variance, low bias
  • Winner: Flexible models or nonparametric (both work well)

Summary

This chapter motivated the need for statistical models in data analysis.

Key concepts:

  1. Data sparsity problem:
    • Nonparametric estimation fails when data are sparse
    • Need to borrow strength across observations
  2. Parametric models:
    • Make assumptions about functional form
    • Allow estimation even with sparse data
    • Examples: Linear, quadratic, logarithmic models
  3. Smoothing:
    • Spectrum from nonparametric to parametric
    • Semiparametric methods balance flexibility and smoothness
    • Bandwidth/complexity controls amount of smoothing
  4. Bias-variance trade-off:
    • Bias: Systematic error from modeling assumptions
    • Variance: Random error from finite samples
    • MSE = Bias² + Variance
    • Optimal estimator minimizes MSE
  5. Sample size matters:
    • Small samples: Need strong assumptions (parametric models)
    • Large samples: Can use flexible/nonparametric approaches

References

Hernán, Miguel A., and James M. Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. https://miguelhernan.org/whatifbook.