Commit 7d25baf

Merge pull request #278 from igerber/revise-survey-theory-doc
Revise survey theory doc for accuracy and precision
2 parents f11e169 + aace57b commit 7d25baf

2 files changed

Lines changed: 105 additions & 54 deletions

docs/methodology/survey-theory.md

Lines changed: 104 additions & 53 deletions
Original file line number | Diff line number | Diff line change
@@ -15,23 +15,27 @@
1515

1616
## 1. Motivation
1717

18-
### 1.1. The problem: survey data violates the iid assumption
18+
### 1.1. The problem: naive standard errors under complex survey designs
1919

2020
Policy evaluations frequently rely on nationally representative surveys:
2121
NHANES (health outcomes), ACS (demographics and housing), BRFSS (behavioral
22-
risk factors), CPS (labor force), and MEPS (medical expenditure). These surveys
23-
employ stratified multi-stage cluster sampling to achieve national coverage at
24-
manageable cost. The resulting data carry two features that invalidate naive
25-
standard errors: (i) observations within the same primary sampling unit (PSU)
26-
are correlated, and (ii) stratification constrains the sampling variability.
22+
risk factors), CPS (labor force), and MEPS (medical expenditure). Most of these
23+
are repeated cross-sectional surveys (with the partial exception of CPS's
24+
rotating panel); the sampling frame --- strata, PSUs --- may shift across waves,
25+
adding a layer of complexity to design-based variance estimation that does not
26+
arise with a fixed panel. These surveys employ stratified multi-stage cluster
27+
sampling to achieve national coverage at manageable cost. The resulting data
28+
carry two features that invalidate naive standard errors:
29+
(i) observations within the same primary sampling unit (PSU) are correlated,
30+
and (ii) stratification constrains the sampling variability.
2731

2832
Naive standard errors --- whether heteroskedasticity-robust (HC1) or clustered
2933
at the individual level --- treat the sample as if it were drawn by simple
3034
random sampling. Under complex survey designs this ignores intra-cluster
3135
correlation within PSUs, which typically inflates variance relative to SRS, and
32-
stratification, which typically deflates it. The net effect is design-specific,
33-
but in practice the clustering effect dominates and naive SEs understate true
34-
sampling variance. The ratio of design-based to naive variance is the *design
36+
stratification, which typically deflates it. The net effect is design-specific;
37+
naive SEs are generally incorrect --- and often too small --- under complex
38+
survey designs. The ratio of design-based to naive variance is the *design
3539
effect* (DEFF); values of 2--5 are common in health and social surveys.
3640

3741
This matters especially for difference-in-differences (DiD) estimation because:
@@ -49,13 +53,21 @@ This matters especially for difference-in-differences (DiD) estimation because:
4953

5054
The modern DiD literature derives estimators and their asymptotic properties
5155
under sampling assumptions that are incompatible with complex survey designs.
52-
Every foundational paper in this literature either assumes iid sampling
53-
explicitly, or adopts a framework that sidesteps sampling design entirely:
56+
The foundational papers in this literature either assume iid sampling
57+
explicitly, or adopt frameworks that do not incorporate complex survey design
58+
features (strata, PSU clustering, FPC):
59+
60+
*Note on terminology.* The recent DiD literature uses "design-based" to refer
61+
to treatment-assignment design (Athey & Imbens 2022), where uncertainty arises
62+
from which units receive treatment; throughout this document, "design-based"
63+
refers to survey sampling design (Binder 1983), where uncertainty arises from
64+
which units are sampled. Same term, different referent.
5465

5566
- **Callaway & Sant'Anna (2021)** state iid as a numbered assumption
5667
(Assumption 2) and derive the multiplier bootstrap under it. The paper
57-
acknowledges design-based inference as an alternative --- citing Athey &
58-
Imbens (2018) --- but does not pursue it.
68+
acknowledges design-based inference in the treatment-assignment sense ---
69+
citing Athey & Imbens (2018; published 2022) --- but does not pursue
70+
survey-design-based inference.
5971
- **Sant'Anna & Zhao (2020)** assume iid (Assumption 1) and derive the doubly
6072
robust influence function and semiparametric efficiency bounds under it.
6173
- **Borusyak, Jaravel & Spiess (2024)** adopt a conditional/fixed-design
@@ -72,31 +84,38 @@ explicitly, or adopts a framework that sidesteps sampling design entirely:
7284

7385
The most comprehensive recent review of the DiD literature --- Roth, Sant'Anna,
7486
Bilinski & Poe (2023), "What's Trending in Difference-in-Differences?" ---
75-
contains no discussion of survey weights, complex survey designs, or
76-
design-based variance estimation.
87+
discusses design-based inference in the treatment-assignment sense (Section 5.2),
88+
where randomness comes from treatment assignment rather than sampling, but does
89+
not address survey sampling design, survey weights, or strata/PSU/FPC-based
90+
variance estimation.
7791

7892
### 1.3. The gap in software
7993

8094
Existing software implementations reflect this theoretical gap. R's `did`
8195
package (Callaway & Sant'Anna) accepts a `weightsname` parameter for point
82-
estimation, but its multiplier bootstrap draws iid unit-level weights without
83-
accounting for strata, PSU, or FPC. Stata's `csdid` (Rios-Avila, Sant'Anna &
84-
Callaway) accepts `pweight` for point estimation but does not support the
85-
`svy:` prefix --- variance estimation ignores the survey design structure.
86-
Neither `did_multiplegt_dyn` (de Chaisemartin & D'Haultfoeuille) nor
87-
`eventstudyinteract` (Sun & Abraham) nor `didimputation` (Borusyak, Jaravel
88-
& Spiess) provide design-based variance.
89-
90-
In all these implementations, sampling weights enter the point estimate but the
91-
variance estimator treats data as if it were iid (or clustered at the panel
92-
unit, not the survey PSU).
96+
estimation and supports cluster-level multiplier bootstrap via `clustervars`
97+
(drawing Rademacher weights at the cluster level rather than per unit), but
98+
does not account for stratification or finite population corrections. Stata's
99+
`csdid` (Rios-Avila, Sant'Anna & Callaway) accepts `pweight` for point
100+
estimation and supports clustered wild bootstrap, but does not support the
101+
`svy:` prefix --- there is no mechanism for strata or FPC.
102+
`did_multiplegt_dyn` (de Chaisemartin & D'Haultfoeuille) clusters at the group
103+
level by default but likewise lacks strata and FPC support.
104+
`eventstudyinteract` (Sun & Abraham) does not accept probability weights.
105+
`didimputation` (Borusyak, Jaravel & Spiess) accepts estimation weights via
106+
`wname` but does not provide survey-design variance.
107+
108+
These implementations support weights for point estimation and allow
109+
cluster-robust inference, but none provides full survey-design variance
110+
estimation that jointly accounts for strata, PSU clustering, and finite
111+
population corrections.
93112

94113
### 1.4. Adjacent work: survey inference for causal effects
95114

96115
The survey statistics literature has developed design-based variance theory for
97116
smooth functionals (Binder 1983; Demnati & Rao 2004; Lumley 2004), and recent
98-
work has extended this to causal inference --- but only for cross-sectional
99-
estimators, not panel DiD:
117+
work has extended this to causal inference --- but primarily for cross-sectional
118+
estimators or simple two-period designs, not for modern staggered DiD:
100119

101120
- **DuGoff, Schuler & Stuart (2014)** provide practical guidance on combining
102121
propensity score methods with complex surveys using Stata's `svy:` framework,
@@ -105,19 +124,27 @@ estimators, not panel DiD:
105124
propensity score estimators using influence functions --- the closest work to
106125
the bridge we describe --- but for cross-sectional IPW/augmented weighting,
107126
not staggered DiD.
127+
- **Ye, Bilinski & Lee (2025)** study DiD with repeated cross-sectional survey
128+
data, combining propensity scores with survey weights. However, their
129+
estimator is limited to two periods and two groups, uses bootstrap-only
130+
variance (no analytical design-based derivation), and does not address the
131+
modern heterogeneity-robust estimators considered here.
108132

109-
No published work formally derives design-based variance for the influence
110-
functions of modern heterogeneity-robust DiD estimators.
133+
No published work formally derives design-based variance --- in the survey-
134+
statistics sense of strata/PSU/FPC-based Taylor series linearization --- for
135+
the influence functions of modern heterogeneity-robust DiD estimators
136+
(Callaway--Sant'Anna, Sun--Abraham, imputation DiD, etc.).
111137

112138
### 1.5. What this document provides
113139

114-
This document bridges the two literatures. The core argument (Section 4) is
115-
that modern DiD estimators are smooth functionals of the empirical distribution,
116-
and Binder's (1983) theorem therefore guarantees that applying the
140+
This document bridges the two literatures. The core argument (Section 4) is a
141+
careful application of existing survey linearization theory (Binder 1983) to
142+
modern DiD estimators: because these estimators are smooth functionals of the
143+
empirical distribution, Binder's theorem guarantees that applying the
117144
stratified-cluster variance formula to their influence function values produces
118-
a design-consistent variance estimator. The argument is a straightforward
119-
application of existing theory, but it has not previously been stated for the
120-
DiD case.
145+
a design-consistent variance estimator. The argument applies existing theory to
146+
a new setting --- it has not previously been stated for the modern
147+
heterogeneity-robust DiD case.
121148

122149
diff-diff implements this connection: it is the only package --- across R,
123150
Stata, and Python --- that provides design-based variance estimation
@@ -146,10 +173,12 @@ stratified multi-stage design used by most federal statistical agencies.
146173

147174
Each sampled observation i carries a sampling weight w_i = 1 / pi_i, where
148175
pi_i is the inclusion probability. Under probability-weight (`pweight`)
149-
semantics, w_i represents how many population units observation i represents.
150-
diff-diff normalizes probability weights to mean 1 (sum = n) to avoid scale
151-
dependence in regression coefficients while preserving the relative
152-
representativeness of each observation.
176+
semantics, the raw weight w_i = 1/pi_i represents how many population units
177+
observation i represents. diff-diff normalizes probability weights to mean 1
178+
(sum = n) to avoid scale dependence in regression coefficients. After
179+
normalization, weights preserve relative representativeness --- w_i = 2 means
180+
observation i represents twice as many population units as the average --- but
181+
no longer indicate absolute population counts.
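
The normalization described above can be sketched in a few lines. This is an illustrative sketch only, not diff-diff's implementation: `normalize_pweights` and `hajek_mean` are hypothetical helper names, and the weights and outcomes are made-up numbers. It shows that rescaling to mean 1 changes the weight totals but leaves a self-normalized weighted mean, and the ratios between weights, unchanged.

```python
import numpy as np


def normalize_pweights(raw_weights):
    """Rescale probability weights to mean 1, so that they sum to n.

    Hypothetical helper for illustration; not part of any package API.
    """
    w = np.asarray(raw_weights, dtype=float)
    return w * w.size / w.sum()


def hajek_mean(y, w):
    """Self-normalized weighted mean: sum(w * y) / sum(w)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * y) / np.sum(w))


# Made-up raw weights: each value is "population units represented".
raw = np.array([100.0, 200.0, 300.0, 400.0])
y = np.array([1.0, 2.0, 3.0, 4.0])

w = normalize_pweights(raw)

print(w.sum())             # equals n = 4 after normalization
print(hajek_mean(y, raw))  # weighted mean under the raw weights
print(hajek_mean(y, w))    # same value under the normalized weights
```

Because the weighted mean divides by the sum of the weights, any common rescaling cancels; what is lost after normalization is only the interpretation of the weight total as an estimate of the population size.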
153182

154183
### Finite population correction
155184

@@ -187,15 +216,19 @@ literatures reason about functionals, just from different perspectives.
187216

188217
## 3. Survey-Weighted Estimation
189218

190-
### Horvitz-Thompson consistency
219+
### Design consistency
191220

192221
Under the survey design, the survey-weighted empirical distribution is:
193222

194223
```
195224
F_hat_w = sum_i w_i * delta_{x_i} / sum_i w_i
196225
```
197226

198-
where the sum is over sampled observations and delta_{x_i} is the point mass
227+
This is the Hájek (self-normalized) form of the design-weighted estimator,
228+
preferred when the population size N is unknown. It is design-consistent for
229+
the same target as the Horvitz-Thompson estimator.
230+
231+
The sum is over sampled observations and delta_{x_i} is the point mass
199232
at x_i. When T is a smooth functional, the plug-in estimator theta_hat =
200233
T(F_hat_w) is design-consistent for theta = T(F): as the sample size grows
201234
within the finite-population asymptotic framework, theta_hat converges in
@@ -346,7 +379,10 @@ T(F_hat_w) - T(F) = sum_i d_i * psi_i + o_p(n^{-1/2})
346379
```
347380

348381
where d_i = 1 if unit i is sampled (0 otherwise), and psi_i = w_i * IF(x_i;
349-
T, F) / N is the scaled influence function value. The key observation: this
382+
T, F) / N is the scaled influence function value. (In practice, the population
383+
size N is typically unknown and is estimated by N_hat = sum_i w_i. After
384+
diff-diff normalizes pweights to mean 1, sum_i w_i = n; the scaling is
385+
variance-equivalent because only relative weights affect the sandwich meat.) The key observation: this
350386
linearized form is a weighted sum over the sampled observations, and its
351387
variance is determined by the sampling design --- not by T. The IF transforms
352388
the problem of estimating Var(theta_hat) into the simpler problem of estimating
@@ -369,7 +405,7 @@ values, and psi_h_bar is the within-stratum mean of PSU totals.
369405

370406
This works because theta_hat is asymptotically equivalent to a linear function
371407
of survey-weighted totals. Once linearized via the IF, the variance of
372-
theta_hat inherits the same structure as the variance of a Horvitz-Thompson
408+
theta_hat inherits the same structure as the variance of a design-weighted
373409
total, which the survey statistics literature has established formulas for.
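
As a rough illustration of that structure, the sketch below computes a stratified-cluster variance from per-observation scaled influence-function values `psi`. It assumes the common textbook form V = sum_h (1 - f_h) * n_h / (n_h - 1) * sum_j (z_hj - z_bar_h)^2, where z_hj is the total of `psi` within PSU j of stratum h; the exact formula used in the document is not visible in this diff, so treat this as an assumption. The function name, its arguments, and the toy data are hypothetical and not diff-diff's API.

```python
import numpy as np


def stratified_cluster_variance(psi, strata, psu, fpc=None):
    """Textbook stratified-cluster variance of sum(psi) (illustrative sketch).

    psi    : scaled influence-function values, one per sampled observation
    strata : stratum label per observation
    psu    : PSU label per observation (labels may repeat across strata)
    fpc    : optional dict mapping stratum label -> sampling fraction f_h
    """
    psi = np.asarray(psi, dtype=float)
    strata = np.asarray(strata)
    psu = np.asarray(psu)

    variance = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        psus_h = np.unique(psu[in_h])
        n_h = psus_h.size
        if n_h < 2:
            # Singleton strata need a separate rule (e.g. a lonely-PSU option).
            raise ValueError(f"stratum {h!r} has fewer than 2 PSUs")

        # PSU totals of psi within this stratum.
        totals = np.array([psi[in_h & (psu == j)].sum() for j in psus_h])

        f_h = 0.0 if fpc is None else fpc[h]
        deviations = totals - totals.mean()
        variance += (1.0 - f_h) * n_h / (n_h - 1) * np.sum(deviations**2)
    return float(variance)


# Toy data: 2 strata, 2 PSUs each, 2 observations per PSU.
strata = np.array([1, 1, 1, 1, 2, 2, 2, 2])
psu = np.array([1, 1, 2, 2, 3, 3, 4, 4])
psi = np.array([1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 5.0, 5.0])

v_no_fpc = stratified_cluster_variance(psi, strata, psu)
v_fpc = stratified_cluster_variance(psi, strata, psu, fpc={1: 0.5, 2: 0.0})
print(v_no_fpc, v_fpc)
```

Only between-PSU variation within each stratum enters the sum, which is why within-PSU correlation is absorbed automatically and why stratification reduces the variance relative to pooling all PSUs.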
374410

375411
### 4.5. Combining the pieces
@@ -487,7 +523,8 @@ strategies via the `lonely_psu` parameter:
487523
- **"certainty"**: Treat singleton PSUs as sampled with certainty (f_h = 1),
488524
contributing zero to the variance.
489525
- **"adjust"**: Center the singleton stratum's PSU total at the grand mean of
490-
all PSU totals instead of the (undefined) within-stratum mean.
526+
all PSU totals instead of the (undefined) within-stratum mean (matching
527+
Stata's `singleunit(centered)` behavior).
491528

492529
---
493530

@@ -589,20 +626,28 @@ surveys --- the t-distribution approximation with df = n_PSU - n_strata may
589626
be anti-conservative. diff-diff reports the survey degrees of freedom so users
590627
can assess this directly.
591628

592-
**Informative sampling.** Binder's theorem assumes non-informative sampling:
593-
selection into the sample depends only on design variables (strata, PSU), not
594-
on potential outcomes conditional on those variables. If treatment effects vary
595-
with selection probability in ways not captured by the stratification, IF
596-
values may be biased even after weighting.
629+
**Estimand dependence on weights.** The design-based framework treats population
630+
values as fixed and relies on probability weighting to target finite-population
631+
parameters. Binder's variance formula is consistent for the variance of
632+
whatever the weighted estimator targets. However, if treatment effects vary
633+
with inclusion probability in ways not captured by the stratification, the
634+
survey-weighted estimator may target a different population quantity than the
635+
intended ATT. In such cases, the variance estimate is correct for the estimand
636+
actually being estimated, but that estimand may not correspond to the causal
637+
parameter of interest.
597638

598639
**SUTVA.** Survey weighting does not address interference between units. If
599640
treatment of one unit affects outcomes of another (spillovers), the ATT
600641
estimand is not well-defined regardless of the variance estimator.
601642

602643
**Weight variability.** Highly variable weights reduce effective sample size.
603-
The design effect DEFF = n * sum(w_i^2) / (sum(w_i))^2 measures this: when
604-
DEFF >> 1, estimates are less precise than the nominal sample size suggests.
605-
diff-diff reports DEFF in `SurveyMetadata` to help users assess this.
644+
The Kish design effect due to unequal weighting,
645+
deff_w = n * sum(w_i^2) / (sum(w_i))^2, measures this: when deff_w >> 1,
646+
estimates are less precise than the nominal sample size suggests. (This
647+
captures only the weighting component of the full design effect discussed in
648+
Section 1.1, which also incorporates clustering and stratification effects.)
649+
diff-diff reports this quantity as `SurveyMetadata.design_effect` (the Kish
650+
deff_w) to help users assess weight variability.
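
A minimal sketch of the deff_w formula above, with hypothetical function names and made-up weights (this is not the code behind `SurveyMetadata.design_effect`). It also reports the implied Kish effective sample size n / deff_w and checks that the quantity does not depend on the scale of the weights.

```python
import numpy as np


def kish_deff_w(weights):
    """Kish design effect due to unequal weighting: n * sum(w^2) / (sum(w))^2."""
    w = np.asarray(weights, dtype=float)
    return float(w.size * np.sum(w**2) / np.sum(w) ** 2)


def kish_effective_n(weights):
    """Effective sample size implied by deff_w: n / deff_w."""
    w = np.asarray(weights, dtype=float)
    return float(w.size / kish_deff_w(w))


equal = np.ones(4)
unequal = np.array([1.0, 1.0, 1.0, 5.0])

print(kish_deff_w(equal))         # 1.0: equal weights carry no penalty
print(kish_deff_w(unequal))       # 1.75
print(kish_effective_n(unequal))  # about 2.29 out of a nominal n = 4
print(kish_deff_w(10 * unequal))  # 1.75: invariant to rescaling the weights
```

The scale invariance means the value is the same whether it is computed from raw weights or from weights normalized to mean 1.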
606651

607652
**Model misspecification.** For doubly-robust and IPW estimators
608653
(CallawaySantAnna with `estimation_method='dr'` or `'ipw'`), the IF
@@ -700,6 +745,9 @@ Two bootstrap strategies interact with survey designs:
700745

701746
### Modern DiD
702747

748+
- Athey, S. & Imbens, G.W. (2022). "Design-Based Analysis in
749+
Difference-in-Differences Settings with Staggered Adoption." *Journal of
750+
Econometrics* 226(1), 62--79.
703751
- Borusyak, K., Jaravel, X. & Spiess, J. (2024). "Revisiting Event-Study
704752
Designs: Robust and Efficient Estimation." *Review of Economic Studies*
705753
91(6), 3253--3285.
@@ -728,6 +776,9 @@ Two bootstrap strategies interact with survey designs:
728776
Surveys." *Health Services Research* 49(1), 284--303.
729777
- Solon, G., Haider, S.J. & Wooldridge, J.M. (2015). "What Are We Weighting
730778
For?" *Journal of Human Resources* 50(2), 301--316.
779+
- Ye, K., Bilinski, A. & Lee, Y. (2025). "Difference-in-differences analysis
780+
with repeated cross-sectional survey data." *Health Services & Outcomes
781+
Research Methodology*. DOI: 10.1007/s10742-025-00364-7.
731782
- Zeng, S., Li, F. & Tong, X. (2025). "Moving toward Best Practice when
732783
Using Propensity Score Weighting in Survey Observational Studies."
733784
arXiv:2501.16156.

docs/survey-roadmap.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -120,7 +120,7 @@ design-based variance estimation with modern DiD influence functions:
120120
1. Modern heterogeneity-robust DiD estimators (CS, SA, BJS) are smooth
121121
functionals of the weighted empirical distribution
122122
2. Survey-weighted empirical distribution is design-consistent for the
123-
superpopulation quantity (Horvitz-Thompson)
123+
finite-population quantity (Hájek/design-weighted estimator)
124124
3. The influence function is a property of the functional, not the
125125
sampling design — IFs remain valid under survey weighting
126126
4. TSL (stratified cluster sandwich) and replicate-weight methods are
