1515
1616## 1. Motivation
1717
18- ### 1.1. The problem: survey data violates the iid assumption
18+ ### 1.1. The problem: naive standard errors under complex survey designs
1919
2020Policy evaluations frequently rely on nationally representative surveys:
2121NHANES (health outcomes), ACS (demographics and housing), BRFSS (behavioral
22- risk factors), CPS (labor force), and MEPS (medical expenditure). These surveys
23- employ stratified multi-stage cluster sampling to achieve national coverage at
24- manageable cost. The resulting data carry two features that invalidate naive
25- standard errors: (i) observations within the same primary sampling unit (PSU)
26- are correlated, and (ii) stratification constrains the sampling variability.
22+ risk factors), CPS (labor force), and MEPS (medical expenditure). Most of these
23+ are repeated cross-sectional surveys (with the partial exception of CPS's
24+ rotating panel); the sampling frame --- strata, PSUs --- may shift across waves,
25+ adding a layer of complexity to design-based variance estimation that does not
26+ arise with a fixed panel. These surveys employ stratified multi-stage cluster
27+ sampling to achieve national coverage at manageable cost. The resulting data
28+ carry two features that invalidate naive standard errors:
29+ (i) observations within the same primary sampling unit (PSU) are correlated,
30+ and (ii) stratification constrains the sampling variability.
2731
2832Naive standard errors --- whether heteroskedasticity-robust (HC1) or clustered
2933at the individual level --- treat the sample as if it were drawn by simple
3034random sampling. Under complex survey designs this ignores intra-cluster
3135correlation within PSUs, which typically inflates variance relative to SRS, and
32- stratification, which typically deflates it. The net effect is design-specific,
33- but in practice the clustering effect dominates and naive SEs understate true
34- sampling variance. The ratio of design-based to naive variance is the *design
36+ stratification, which typically deflates it. The net effect is design-specific;
37+ naive SEs are generally incorrect --- and often too small --- under complex
38+ survey designs. The ratio of design-based to naive variance is the *design
3539effect* (DEFF); values of 2--5 are common in health and social surveys.
3640
3741This matters especially for difference-in-differences (DiD) estimation because:
@@ -49,13 +53,21 @@ This matters especially for difference-in-differences (DiD) estimation because:
4953
5054The modern DiD literature derives estimators and their asymptotic properties
5155under sampling assumptions that are incompatible with complex survey designs.
52- Every foundational paper in this literature either assumes iid sampling
53- explicitly, or adopts a framework that sidesteps sampling design entirely:
56+ The foundational papers in this literature either assume iid sampling
57+ explicitly, or adopt frameworks that do not incorporate complex survey design
58+ features (strata, PSU clustering, FPC):
59+
60+ *Note on terminology.* The recent DiD literature uses "design-based" to refer
61+ to treatment-assignment design (Athey & Imbens 2022), where uncertainty arises
62+ from which units receive treatment; throughout this document, "design-based"
63+ refers to survey sampling design (Binder 1983), where uncertainty arises from
64+ which units are sampled. Same term, different referent.
5465
5566- **Callaway & Sant'Anna (2021)** state iid as a numbered assumption
5667 (Assumption 2) and derive the multiplier bootstrap under it. The paper
57- acknowledges design-based inference as an alternative --- citing Athey &
58- Imbens (2018) --- but does not pursue it.
68+ acknowledges design-based inference in the treatment-assignment sense ---
69+ citing Athey & Imbens (2018; published 2022) --- but does not pursue
70+ survey-design-based inference.
5971- **Sant'Anna & Zhao (2020)** assume iid (Assumption 1) and derive the doubly
6072 robust influence function and semiparametric efficiency bounds under it.
6173- **Borusyak, Jaravel & Spiess (2024)** adopt a conditional/fixed-design
@@ -72,31 +84,38 @@ explicitly, or adopts a framework that sidesteps sampling design entirely:
7284
7385The most comprehensive recent review of the DiD literature --- Roth, Sant'Anna,
7486Bilinski & Poe (2023), "What's Trending in Difference-in-Differences?" ---
75- contains no discussion of survey weights, complex survey designs, or
76- design-based variance estimation.
87+ discusses design-based inference in the treatment-assignment sense (Section 5.2),
88+ where randomness comes from treatment assignment rather than sampling, but does
89+ not address survey sampling design, survey weights, or strata/PSU/FPC-based
90+ variance estimation.
7791
7892### 1.3. The gap in software
7993
8094Existing software implementations reflect this theoretical gap. R's `did`
8195package (Callaway & Sant'Anna) accepts a `weightsname` parameter for point
82- estimation, but its multiplier bootstrap draws iid unit-level weights without
83- accounting for strata, PSU, or FPC. Stata's `csdid` (Rios-Avila, Sant'Anna &
84- Callaway) accepts `pweight` for point estimation but does not support the
85- `svy:` prefix --- variance estimation ignores the survey design structure.
86- Neither `did_multiplegt_dyn` (de Chaisemartin & D'Haultfoeuille) nor
87- `eventstudyinteract` (Sun & Abraham) nor `didimputation` (Borusyak, Jaravel
88- & Spiess) provide design-based variance.
89-
90- In all these implementations, sampling weights enter the point estimate but the
91- variance estimator treats data as if it were iid (or clustered at the panel
92- unit, not the survey PSU).
96+ estimation and supports cluster-level multiplier bootstrap via `clustervars`
97+ (drawing Rademacher weights at the cluster level rather than per unit), but
98+ does not account for stratification or finite population corrections. Stata's
99+ `csdid` (Rios-Avila, Sant'Anna & Callaway) accepts `pweight` for point
100+ estimation and supports clustered wild bootstrap, but does not support the
101+ `svy:` prefix --- there is no mechanism for strata or FPC.
102+ `did_multiplegt_dyn` (de Chaisemartin & D'Haultfoeuille) clusters at the group
103+ level by default but likewise lacks strata and FPC support.
104+ `eventstudyinteract` (Sun & Abraham) does not accept probability weights.
105+ `didimputation` (Borusyak, Jaravel & Spiess) accepts estimation weights via
106+ `wname` but does not provide survey-design variance.
107+
108+ These implementations support weights for point estimation and allow
109+ cluster-robust inference, but none provides full survey-design variance
110+ estimation that jointly accounts for strata, PSU clustering, and finite
111+ population corrections.
93112
94113### 1.4. Adjacent work: survey inference for causal effects
95114
96115The survey statistics literature has developed design-based variance theory for
97116smooth functionals (Binder 1983; Demnati & Rao 2004; Lumley 2004), and recent
98- work has extended this to causal inference --- but only for cross-sectional
99- estimators, not panel DiD:
117+ work has extended this to causal inference --- but primarily for cross-sectional
118+ estimators or simple two-period designs, not for modern staggered DiD:
100119
101120- **DuGoff, Schuler & Stuart (2014)** provide practical guidance on combining
102121 propensity score methods with complex surveys using Stata's `svy:` framework,
@@ -105,19 +124,27 @@ estimators, not panel DiD:
105124 propensity score estimators using influence functions --- the closest work to
106125 the bridge we describe --- but for cross-sectional IPW/augmented weighting,
107126 not staggered DiD.
127+ - **Ye, Bilinski & Lee (2025)** study DiD with repeated cross-sectional survey
128+ data, combining propensity scores with survey weights. However, their
129+ estimator is limited to two periods and two groups, uses bootstrap-only
130+ variance (no analytical design-based derivation), and does not address the
131+ modern heterogeneity-robust estimators considered here.
108132
109- No published work formally derives design-based variance for the influence
110- functions of modern heterogeneity-robust DiD estimators.
133+ No published work formally derives design-based variance --- in the survey-
134+ statistics sense of strata/PSU/FPC-based Taylor series linearization --- for
135+ the influence functions of modern heterogeneity-robust DiD estimators
136+ (Callaway--Sant'Anna, Sun--Abraham, imputation DiD, etc.).
111137
112138### 1.5. What this document provides
113139
114- This document bridges the two literatures. The core argument (Section 4) is
115- that modern DiD estimators are smooth functionals of the empirical distribution,
116- and Binder's (1983) theorem therefore guarantees that applying the
140+ This document bridges the two literatures. The core argument (Section 4) is a
141+ careful application of existing survey linearization theory (Binder 1983) to
142+ modern DiD estimators: because these estimators are smooth functionals of the
143+ empirical distribution, Binder's theorem guarantees that applying the
117144stratified-cluster variance formula to their influence function values produces
118- a design-consistent variance estimator. The argument is a straightforward
119- application of existing theory, but it has not previously been stated for the
120- DiD case.
145+ a design-consistent variance estimator. The argument applies existing theory to
146+ a new setting --- it has not previously been stated for the modern
147+ heterogeneity-robust DiD case.
121148
122149diff-diff implements this connection: it is the only package --- across R,
123150Stata, and Python --- that provides design-based variance estimation
@@ -146,10 +173,12 @@ stratified multi-stage design used by most federal statistical agencies.
146173
147174Each sampled observation i carries a sampling weight w_i = 1 / pi_i, where
148175pi_i is the inclusion probability. Under probability-weight (`pweight`)
149- semantics, w_i represents how many population units observation i represents.
150- diff-diff normalizes probability weights to mean 1 (sum = n) to avoid scale
151- dependence in regression coefficients while preserving the relative
152- representativeness of each observation.
176+ semantics, the raw weight w_i = 1/pi_i represents how many population units
177+ observation i represents. diff-diff normalizes probability weights to mean 1
178+ (sum = n) to avoid scale dependence in regression coefficients. After
179+ normalization, weights preserve relative representativeness --- w_i = 2 means
180+ observation i represents twice as many population units as the average --- but
181+ no longer indicate absolute population counts.
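
The normalization described above can be sketched in a few lines. This is an illustrative sketch only, not diff-diff's actual code: the function name `normalize_pweights` and the toy weights are assumptions made for the example. It also checks that a self-normalized weighted mean is unchanged by the rescaling, which is why only relative weights matter for the point estimate.

```python
import numpy as np


def normalize_pweights(raw_weights):
    """Rescale probability weights so they average 1 (equivalently, sum to n).

    Hypothetical helper for illustration; not part of any package API.
    """
    w = np.asarray(raw_weights, dtype=float)
    return w * (w.size / w.sum())


# Hypothetical raw weights w_i = 1 / pi_i and outcomes.
raw = np.array([1500.0, 3000.0, 500.0, 1000.0])
y = np.array([2.0, 4.0, 1.0, 3.0])

w = normalize_pweights(raw)

# Self-normalized weighted means before and after rescaling.
mean_raw = np.sum(raw * y) / np.sum(raw)
mean_norm = np.sum(w * y) / np.sum(w)
```

Here the second observation ends up with a normalized weight of 2, meaning it represents twice as many population units as the average observation, while the raw count of 3000 units is no longer recoverable from `w` alone.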
153182
154183### Finite population correction
155184
@@ -187,15 +216,19 @@ literatures reason about functionals, just from different perspectives.
187216
188217## 3. Survey-Weighted Estimation
189218
190- ### Horvitz-Thompson consistency
219+ ### Design consistency
191220
192221Under the survey design, the survey-weighted empirical distribution is:
193222
194223```
195224F_hat_w = sum_i w_i * delta_{x_i} / sum_i w_i
196225```
197226
198- where the sum is over sampled observations and delta_{x_i} is the point mass
227+ This is the Hájek (self-normalized) form of the design-weighted estimator,
228+ preferred when the population size N is unknown. It is design-consistent for
229+ the same target as the Horvitz-Thompson estimator.
230+
231+ The sum is over sampled observations and delta_{x_i} is the point mass
199232at x_i. When T is a smooth functional, the plug-in estimator theta_hat =
200233T(F_hat_w) is design-consistent for theta = T(F): as the sample size grows
201234within the finite-population asymptotic framework, theta_hat converges in
@@ -346,7 +379,10 @@ T(F_hat_w) - T(F) = sum_i d_i * psi_i + o_p(n^{-1/2})
346379```
347380
348381where d_i = 1 if unit i is sampled (0 otherwise), and psi_i = w_i * IF(x_i;
349- T, F) / N is the scaled influence function value. The key observation: this
382+ T, F) / N is the scaled influence function value. (In practice, the population
383+ size N is typically unknown and is estimated by N_hat = sum_i w_i. After
384+ diff-diff normalizes pweights to mean 1, sum_i w_i = n; the scaling is
385+ variance-equivalent because only relative weights affect the sandwich meat.) The key observation: this
350386linearized form is a weighted sum over the sampled observations, and its
351387variance is determined by the sampling design --- not by T. The IF transforms
352388the problem of estimating Var(theta_hat) into the simpler problem of estimating
@@ -369,7 +405,7 @@ values, and psi_h_bar is the within-stratum mean of PSU totals.
369405
370406This works because theta_hat is asymptotically equivalent to a linear function
371407of survey-weighted totals. Once linearized via the IF, the variance of
372- theta_hat inherits the same structure as the variance of a Horvitz-Thompson
408+ theta_hat inherits the same structure as the variance of a design-weighted
373409total, which the survey statistics literature has established formulas for.
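
To make the "variance of a design-weighted total" concrete, the sketch below computes the textbook stratified-cluster linearization variance from per-observation linearized values. The exact formula used in this document is not shown in this excerpt, so the form here (PSU totals centered at the within-stratum mean, scaled by n_h / (n_h - 1) and an optional 1 - f_h) is an assumption, as are the function name and argument layout. Strata with a single PSU are rejected rather than handled.

```python
import numpy as np


def stratified_cluster_variance(psi, strata, psu, fpc=None):
    """Textbook stratified-cluster variance of sum(psi).

    psi    : per-observation linearized (scaled influence function) values
    strata : stratum label for each observation
    psu    : PSU label for each observation (nested within stratum)
    fpc    : optional dict mapping stratum label -> sampling fraction f_h
    Hypothetical helper for illustration; not diff-diff's implementation.
    """
    psi = np.asarray(psi, dtype=float)
    strata = np.asarray(strata)
    psu = np.asarray(psu)

    variance = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        psu_ids = np.unique(psu[in_h])
        n_h = psu_ids.size
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has a single PSU")
        # PSU totals of psi within stratum h.
        totals = np.array([psi[in_h & (psu == j)].sum() for j in psu_ids])
        f_h = 0.0 if fpc is None else fpc[h]
        # Between-PSU sum of squares around the within-stratum mean.
        between = np.sum((totals - totals.mean()) ** 2)
        variance += (1.0 - f_h) * n_h / (n_h - 1) * between
    return variance


# Toy design: two strata, two PSUs each, two observations per PSU.
psi = [1.0, 2.0, 3.0, 4.0, 0.0, 1.0, 1.0, 2.0]
strata = [0, 0, 0, 0, 1, 1, 1, 1]
psu = [0, 0, 1, 1, 0, 0, 1, 1]

v_no_fpc = stratified_cluster_variance(psi, strata, psu)
v_fpc = stratified_cluster_variance(psi, strata, psu, fpc={0: 0.5, 1: 0.5})
```

In the toy design the PSU totals are (3, 7) and (1, 3), giving contributions of 16 and 4 without a finite population correction; a sampling fraction of 0.5 in both strata halves the result.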
374410
375411### 4.5. Combining the pieces
@@ -487,7 +523,8 @@ strategies via the `lonely_psu` parameter:
487523- **"certainty"**: Treat singleton PSUs as sampled with certainty (f_h = 1),
488524 contributing zero to the variance.
489525- **"adjust"**: Center the singleton stratum's PSU total at the grand mean of
490- all PSU totals instead of the (undefined) within-stratum mean.
526+ all PSU totals instead of the (undefined) within-stratum mean (matching
527+ Stata's `singleunit(centered)` behavior).
491528
492529---
493530
@@ -589,20 +626,28 @@ surveys --- the t-distribution approximation with df = n_PSU - n_strata may
589626be anti-conservative. diff-diff reports the survey degrees of freedom so users
590627can assess this directly.
591628
592- **Informative sampling.** Binder's theorem assumes non-informative sampling:
593- selection into the sample depends only on design variables (strata, PSU), not
594- on potential outcomes conditional on those variables. If treatment effects vary
595- with selection probability in ways not captured by the stratification, IF
596- values may be biased even after weighting.
629+ **Estimand dependence on weights.** The design-based framework treats population
630+ values as fixed and relies on probability weighting to target finite-population
631+ parameters. Binder's variance formula is consistent for the variance of
632+ whatever the weighted estimator targets. However, if treatment effects vary
633+ with inclusion probability in ways not captured by the stratification, the
634+ survey-weighted estimator may target a different population quantity than the
635+ intended ATT. In such cases, the variance estimate is correct for the estimand
636+ actually being estimated, but that estimand may not correspond to the causal
637+ parameter of interest.
597638
598639**SUTVA.** Survey weighting does not address interference between units. If
599640treatment of one unit affects outcomes of another (spillovers), the ATT
600641estimand is not well-defined regardless of the variance estimator.
601642
602643**Weight variability.** Highly variable weights reduce effective sample size.
603- The design effect DEFF = n * sum(w_i^2) / (sum(w_i))^2 measures this: when
604- DEFF >> 1, estimates are less precise than the nominal sample size suggests.
605- diff-diff reports DEFF in `SurveyMetadata` to help users assess this.
644+ The Kish design effect due to unequal weighting,
645+ deff_w = n * sum(w_i^2) / (sum(w_i))^2, measures this: when deff_w >> 1,
646+ estimates are less precise than the nominal sample size suggests. (This
647+ captures only the weighting component of the full design effect discussed in
648+ Section 1.1, which also incorporates clustering and stratification effects.)
649+ diff-diff reports this quantity as `SurveyMetadata.design_effect` (the Kish
650+ deff_w) to help users assess weight variability.
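
As a quick numerical check of the Kish formula, the sketch below computes deff_w and the implied effective sample size n / deff_w for a small set of weights. The function name `kish_deff` and the weights are assumptions for illustration; this is not how `SurveyMetadata.design_effect` is necessarily computed internally.

```python
import numpy as np


def kish_deff(weights):
    """Kish design effect from unequal weighting: n * sum(w^2) / (sum(w))^2."""
    w = np.asarray(weights, dtype=float)
    return w.size * np.sum(w ** 2) / np.sum(w) ** 2


equal = np.ones(4)
unequal = np.array([1.0, 1.0, 1.0, 5.0])   # hypothetical weights

deff_equal = kish_deff(equal)               # equal weights give exactly 1
deff_unequal = kish_deff(unequal)           # 4 * 28 / 64 = 1.75
n_eff = unequal.size / deff_unequal         # effective sample size, about 2.29

# The measure depends only on relative weights, so rescaling leaves it unchanged.
deff_rescaled = kish_deff(1000.0 * unequal)
```

One heavily weighted observation out of four is enough to cut the effective sample size from 4 to roughly 2.3, and normalizing the weights to mean 1 does not change deff_w.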
606651
607652**Model misspecification.** For doubly-robust and IPW estimators
608653(CallawaySantAnna with `estimation_method='dr'` or `'ipw'`), the IF
@@ -700,6 +745,9 @@ Two bootstrap strategies interact with survey designs:
700745
701746### Modern DiD
702747
748+ - Athey, S. & Imbens, G.W. (2022). "Design-Based Analysis in
749+ Difference-in-Differences Settings with Staggered Adoption." *Journal of
750+ Econometrics* 226(1), 62--79.
703751- Borusyak, K., Jaravel, X. & Spiess, J. (2024). "Revisiting Event-Study
704752 Designs: Robust and Efficient Estimation." *Review of Economic Studies*
705753 91(6), 3253--3285.
@@ -728,6 +776,9 @@ Two bootstrap strategies interact with survey designs:
728776 Surveys." *Health Services Research* 49(1), 284--303.
729777- Solon, G., Haider, S.J. & Wooldridge, J.M. (2015). "What Are We Weighting
730778 For?" *Journal of Human Resources* 50(2), 301--316.
779+ - Ye, K., Bilinski, A. & Lee, Y. (2025). "Difference-in-differences analysis
780+ with repeated cross-sectional survey data." *Health Services & Outcomes
781+ Research Methodology*. DOI: 10.1007/s10742-025-00364-7.
731782- Zeng, S., Li, F. & Tong, X. (2025). "Moving toward Best Practice when
732783 Using Propensity Score Weighting in Survey Observational Studies."
733784 arXiv:2501.16156.