Reject strata/PSU/FPC for CallawaySantAnna, accept weights-only survey

igerber · claude · igerber · commit 00476e2c2461 · 2026-03-23T10:07:20.000-04:00
CallawaySantAnna per-cell and aggregation SEs use influence-function (IF)
based variance, which does not incorporate the full survey design structure
(strata/PSU/FPC). Rather than silently accepting a full design that does not
change SE magnitudes, reject it at runtime with NotImplementedError.

- Guard: strata/PSU/FPC in SurveyDesign raises NotImplementedError
- REGISTRY.md: updated to reflect weights-only support, removed overstated
  claim about design effects entering via WIF
- Roadmap: added CS strata/PSU/FPC, covariates+IPW/DR, and efficient DRDID
  IF to Phase 5 deferred work
- TODO.md: updated CS entry to reflect weights-only constraint
- Tests: updated CS tests to use weights-only, added strata/PSU/FPC rejection test
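
A minimal, self-contained sketch of the guard's behavior, for illustration
only. `SurveyDesignSketch` and `reject_full_design` are hypothetical stand-ins
and are not the library's classes; the real check lives inside
`CallawaySantAnna.fit()` and operates on the resolved SurveyDesign:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurveyDesignSketch:
    """Hypothetical stand-in for SurveyDesign; each field is a column name."""

    weights: Optional[str] = None
    strata: Optional[str] = None
    psu: Optional[str] = None
    fpc: Optional[str] = None


def reject_full_design(design: SurveyDesignSketch) -> None:
    """Raise if any design element beyond weights is specified."""
    if (
        design.strata is not None
        or design.psu is not None
        or design.fpc is not None
    ):
        raise NotImplementedError(
            "CallawaySantAnna does not yet support strata/PSU/FPC in "
            "SurveyDesign. Use SurveyDesign(weights=...) only."
        )


# A weights-only design passes through without raising.
reject_full_design(SurveyDesignSketch(weights="weight"))

# Specifying any of strata/PSU/FPC triggers the rejection.
try:
    reject_full_design(
        SurveyDesignSketch(weights="weight", strata="stratum", psu="psu")
    )
    rejected = False
except NotImplementedError as exc:
    rejected = "strata/PSU/FPC" in str(exc)
```

The check fires on any single non-None field, so an FPC-only design is
rejected the same way as a full strata+PSU design.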

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/TODO.md b/TODO.md
@@ -52,7 +52,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails; fixing requires sparse least-squares alternatives) |
 | EfficientDiD: API docs / tutorial page for new public estimator | `docs/` | #192 | Medium |
 | Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium |
-| CallawaySantAnna per-cell ATT(g,t) SEs under survey use influence-function variance, not full design-based TSL with strata/PSU/FPC. Design effects enter at aggregation via WIF and survey df. Full per-cell TSL would require constructing unit-level influence functions on the global index and passing through `compute_survey_vcov()`. | `staggered.py` | #233 | Medium |
+| CallawaySantAnna survey: strata/PSU/FPC rejected at runtime. Full design-based SEs require routing the combined IF/WIF through `compute_survey_vcov()`. Currently weights-only. | `staggered.py` | #233 | Medium |
 | CallawaySantAnna survey + covariates + IPW/DR: DRDID panel nuisance-estimation IF corrections not implemented. Currently gated with NotImplementedError. Regression method with covariates works (has WLS nuisance IF correction). | `staggered.py` | #233 | Medium |
 | EfficientDiD hausman_pretest() clustered covariance uses stale `n_cl` after filtering non-finite EIF rows — should recompute effective cluster count and remap indices after `row_finite` filtering | `efficient_did.py` | #230 | Medium |
 | EfficientDiD `control_group="last_cohort"` trims at `last_g - anticipation` but REGISTRY says `t >= last_g`. With `anticipation=0` (default) these are identical. With `anticipation>0`, code is arguably more conservative (excludes anticipation-contaminated periods). Either align REGISTRY with code or change code to `t < last_g` — needs design decision. | `efficient_did.py` | #230 | Low |
diff --git a/diff_diff/staggered.py b/diff_diff/staggered.py
@@ -1197,6 +1197,18 @@ def fit(
                     f"got '{resolved_survey.weight_type}'. The survey variance math "
                     f"assumes probability weights (pweight)."
                 )
+            if (
+                resolved_survey.strata is not None
+                or resolved_survey.psu is not None
+                or resolved_survey.fpc is not None
+            ):
+                raise NotImplementedError(
+                    "CallawaySantAnna does not yet support strata/PSU/FPC in "
+                    "SurveyDesign. Per-cell and aggregation SEs use IF-based "
+                    "variance which does not incorporate the full survey design "
+                    "structure. Use SurveyDesign(weights=...) only. Full "
+                    "design-based SEs via compute_survey_vcov() are planned."
+                )
 
         # Guard bootstrap + survey
         if self.n_bootstrap > 0 and resolved_survey is not None:
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -416,9 +416,9 @@ The multiplier bootstrap uses random weights w_i with E[w]=0 and Var(w)=1:
     a base period later than `t` (matching R's `did::att_gt()`)
   - Does not require never-treated units: when all units are eventually treated,
     not-yet-treated cohorts serve as controls for each other (requires ≥2 cohorts)
-- **Note:** CallawaySantAnna survey weights: regression method supports covariates; IPW/DR support no-covariate only (covariates+IPW/DR+survey raises NotImplementedError — DRDID nuisance IF not yet implemented). Survey weights compose with IPW weights multiplicatively. WIF in aggregation matches R's did::wif() formula. Bootstrap + survey deferred.
+- **Note:** CallawaySantAnna survey support: weights-only (strata/PSU/FPC raise NotImplementedError — full design-based SEs via compute_survey_vcov not yet implemented). Regression method supports covariates; IPW/DR support no-covariate only (covariates+IPW/DR raises NotImplementedError — DRDID nuisance IF not yet implemented). Survey weights compose with IPW weights multiplicatively. WIF in aggregation matches R's did::wif() formula. Bootstrap + survey deferred.
 - **Note (deviation from R):** CallawaySantAnna survey reg+covariates per-cell SE uses a conservative plug-in IF based on WLS residuals. The treated IF is `inf_treated_i = (sw_i/sum(sw_treated)) * (resid_i - ATT)` (normalized by treated weight sum, matching unweighted `(resid-ATT)/n_t`). The control IF is `inf_control_i = -(sw_i/sum(sw_control)) * wls_resid_i` (normalized by control weight sum, matching unweighted `-resid/n_c`). SE is computed as `sqrt(sum(sw_t_norm * (resid_t - ATT)^2) + sum(sw_c_norm * resid_c^2))`, the weighted analogue of the unweighted `sqrt(var_t/n_t + var_c/n_c)`. This omits the semiparametrically efficient nuisance correction from DRDID's `reg_did_panel` — WLS residuals are orthogonal to the weighted design matrix by construction, so the first-order IF term is asymptotically valid but may be conservative. SEs pass weight-scale-invariance tests. The efficient DRDID correction is deferred to future work.
-- **Note (deviation from R):** Per-cell ATT(g,t) SEs under survey weights use influence-function-based variance (matching R's `did::att_gt` analytical SE path) rather than full Taylor-series linearization with strata/PSU/FPC structure. Consequently, specifying strata/PSU/FPC in SurveyDesign does NOT change per-cell SE magnitudes — it only affects aggregation-level SEs (via the WIF), survey degrees of freedom (for t-distribution p-values/CIs), and reported metadata. This is consistent with R's approach where per-cell SEs are influence-function-based and design effects enter at the aggregation stage.
+- **Note (deviation from R):** Per-cell ATT(g,t) SEs under survey weights use influence-function-based variance (matching R's `did::att_gt` analytical SE path) rather than full Taylor-series linearization. Strata/PSU/FPC are rejected at runtime until full design-based SEs via `compute_survey_vcov()` on the combined IF/WIF are implemented.
 
 **Reference implementation(s):**
 - R: `did::att_gt()` (Callaway & Sant'Anna's official package)
diff --git a/docs/survey-roadmap.md b/docs/survey-roadmap.md
@@ -46,7 +46,7 @@ message pointing to the planned phase or describing the limitation.
 |-----------|------|----------------|-------|
 | ImputationDiD | `imputation.py` | Analytical | Weighted iterative FE, weighted ATT aggregation, weighted conservative variance (Theorem 3); bootstrap+survey deferred |
 | TwoStageDiD | `two_stage.py` | Analytical | Weighted iterative FE, weighted Stage 2 OLS, weighted GMM sandwich variance; bootstrap+survey deferred |
-| CallawaySantAnna | `staggered.py` | Analytical | Survey-weighted regression (all cases), IPW and DR (no-covariate only); survey-weighted WIF in aggregation; covariates+IPW/DR deferred (needs DRDID nuisance IF); bootstrap+survey deferred |
+| CallawaySantAnna | `staggered.py` | Weights-only | Weights-only SurveyDesign (strata/PSU/FPC rejected); reg supports covariates, IPW/DR no-covariate only; survey-weighted WIF in aggregation; full design SEs, covariates+IPW/DR, and bootstrap+survey deferred |
 
 **Infrastructure**: Weighted `solve_logit()` added to `linalg.py` — survey weights
 enter the IRLS working weights as `w_survey * mu * (1 - mu)`. This also unblocked
@@ -59,6 +59,9 @@ TripleDifference IPW/DR from Phase 3 deferred work.
 | ImputationDiD | Bootstrap + survey | Phase 5: bootstrap+survey interaction |
 | TwoStageDiD | Bootstrap + survey | Phase 5: bootstrap+survey interaction |
 | CallawaySantAnna | Bootstrap + survey | Phase 5: bootstrap+survey interaction |
+| CallawaySantAnna | Strata/PSU/FPC in SurveyDesign | Phase 5: route combined IF/WIF through `compute_survey_vcov()` for design-based aggregation SEs |
+| CallawaySantAnna | Covariates + IPW/DR + survey | Phase 5: DRDID panel nuisance IF corrections |
+| CallawaySantAnna | Efficient DRDID nuisance IF for reg+covariates | Phase 5: replace conservative plug-in IF with semiparametrically efficient IF |
 
 ### Remaining for Phase 5
 
diff --git a/tests/test_survey_phase4.py b/tests/test_survey_phase4.py
@@ -717,54 +717,34 @@ def test_uniform_weights_match_unweighted(self, staggered_survey_data):
             )
             assert abs(r_unw.overall_att - r_w.overall_att) < 1e-8, f"method={method}: ATT mismatch"
 
-    def test_survey_metadata_fields(self, staggered_survey_data, survey_design_full):
-        """survey_metadata has correct fields with full design."""
+    def test_survey_metadata_fields(self, staggered_survey_data, survey_design_weights_only):
+        """survey_metadata has correct fields with weights-only design."""
         result = CallawaySantAnna(estimation_method="reg").fit(
             staggered_survey_data,
             "outcome",
             "unit",
             "period",
             "first_treat",
-            survey_design=survey_design_full,
+            survey_design=survey_design_weights_only,
         )
         sm = result.survey_metadata
         assert sm is not None
         assert sm.weight_type == "pweight"
         assert sm.effective_n > 0
         assert sm.design_effect > 0
-        assert sm.n_strata is not None
-        assert sm.n_psu is not None
 
-    def test_se_differs_with_design(self, staggered_survey_data):
-        """Weights-only vs full design: same ATT, different inference via survey df."""
-        sd_w = SurveyDesign(weights="weight")
+    def test_strata_psu_fpc_raises(self, staggered_survey_data):
+        """Strata/PSU/FPC should raise NotImplementedError."""
         sd_full = SurveyDesign(weights="weight", strata="stratum", psu="psu")
-
-        r_w = CallawaySantAnna(estimation_method="reg").fit(
-            staggered_survey_data,
-            "outcome",
-            "unit",
-            "period",
-            "first_treat",
-            survey_design=sd_w,
-        )
-        r_full = CallawaySantAnna(estimation_method="reg").fit(
-            staggered_survey_data,
-            "outcome",
-            "unit",
-            "period",
-            "first_treat",
-            survey_design=sd_full,
-        )
-        # ATTs should be the same (same weights)
-        assert abs(r_w.overall_att - r_full.overall_att) < 1e-10
-        # Full design should carry survey df (strata/PSU structure)
-        assert r_full.survey_metadata is not None
-        assert r_full.survey_metadata.n_strata is not None
-        assert r_full.survey_metadata.n_psu is not None
-        # P-values should differ due to t-distribution with survey df
-        if np.isfinite(r_w.overall_p_value) and np.isfinite(r_full.overall_p_value):
-            assert r_w.overall_p_value != r_full.overall_p_value
+        with pytest.raises(NotImplementedError, match="strata/PSU/FPC"):
+            CallawaySantAnna(estimation_method="reg").fit(
+                staggered_survey_data,
+                "outcome",
+                "unit",
+                "period",
+                "first_treat",
+                survey_design=sd_full,
+            )
 
     def test_bootstrap_survey_raises(self, staggered_survey_data, survey_design_weights_only):
         """Bootstrap + survey should raise NotImplementedError."""
@@ -1197,34 +1177,19 @@ def test_survey_weights_change_per_cell_att(self, staggered_survey_data):
                 effects_no, effects_sv, atol=1e-6
             ), f"{method}: survey weights should change per-cell ATT"
 
-    def test_survey_df_affects_pvalues(self, staggered_survey_data):
-        """Survey df (from strata/PSU) should affect p-values via t-distribution."""
-        data = staggered_survey_data
-        sd_weights = SurveyDesign(weights="weight")
+    def test_strata_psu_fpc_raises_inference(self, staggered_survey_data):
+        """Strata/PSU/FPC raises NotImplementedError in inference context."""
         sd_full = SurveyDesign(weights="weight", strata="stratum", psu="psu")
-        est = CallawaySantAnna(estimation_method="reg")
-        r_w = est.fit(
-            data,
-            outcome="outcome",
-            unit="unit",
-            time="period",
-            first_treat="first_treat",
-            aggregate="simple",
-            survey_design=sd_weights,
-        )
-        r_f = est.fit(
-            data,
-            outcome="outcome",
-            unit="unit",
-            time="period",
-            first_treat="first_treat",
-            aggregate="simple",
-            survey_design=sd_full,
-        )
-        # ATT should be same (same weights), but p-values differ (different df)
-        assert np.isclose(r_w.overall_att, r_f.overall_att, atol=1e-8)
-        # Survey df from strata/PSU should change inference
-        assert r_f.survey_metadata.df_survey is not None
+        with pytest.raises(NotImplementedError, match="strata/PSU/FPC"):
+            CallawaySantAnna(estimation_method="reg").fit(
+                staggered_survey_data,
+                "outcome",
+                "unit",
+                "period",
+                "first_treat",
+                aggregate="simple",
+                survey_design=sd_full,
+            )
 
 
 # =============================================================================