Add survey design support to StaggeredTripleDifference by igerber · Pull Request #247 · igerber/diff-diff

igerber · 2026-03-31T18:37:09Z

Summary

- Add full survey design support (pweight, strata/PSU/FPC, replicate weights) to the StaggeredTripleDifference estimator
- Extract collapse_survey_to_unit_level() from CallawaySantAnna into survey.py for reuse across panel IF-based estimators
- Thread survey weights through all three pairwise DiD comparisons: propensity score estimation (weighted IRLS), outcome regression (WLS), and Riesz representer computation (weighted Hajek normalization)
- Compute design-based variance at the aggregation level via the CallawaySantAnnaAggregationMixin infrastructure (TSL or replicate IF variance)
- Guard against the unsupported replicate-weight + bootstrap combination
- Propagate the effective df from the aggregation mixin for correct replicate-weight inference
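
As a rough picture of what collapsing a panel survey design to unit level involves, the sketch below keeps one survey record per unit and rejects designs that vary within a unit. It is a hypothetical stand-in, not the signature or behavior of collapse_survey_to_unit_level() in survey.py; the function name, the dict keys (unit, weight, stratum, psu), and the constant-within-unit rule are all assumptions made for illustration.

```python
def collapse_to_unit_level(rows):
    """Collapse long panel rows to one survey record per unit.

    Hypothetical helper: `rows` is an iterable of dicts with the assumed
    keys "unit", "weight", "stratum", and "psu".
    """
    units = {}
    for row in rows:
        design = (row["weight"], row["stratum"], row["psu"])
        # Keep the first design seen for a unit; later rows must agree with it.
        seen = units.setdefault(row["unit"], design)
        if seen != design:
            raise ValueError(f"survey design varies within unit {row['unit']!r}")
    return [
        {"unit": u, "weight": w, "stratum": s, "psu": p}
        for u, (w, s, p) in units.items()
    ]


panel = [
    {"unit": 1, "period": 0, "weight": 2.0, "stratum": "a", "psu": 10},
    {"unit": 1, "period": 1, "weight": 2.0, "stratum": "a", "psu": 10},
    {"unit": 2, "period": 0, "weight": 0.5, "stratum": "b", "psu": 20},
    {"unit": 2, "period": 1, "weight": 0.5, "stratum": "b", "psu": 20},
]
unit_level = collapse_to_unit_level(panel)
```

Influence functions for panel estimators are indexed by unit, so the design variables have to live at the same level before any design-based variance is computed.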

Methodology references (required if estimator / math changes)

Method name(s): Staggered Triple Difference (DDD), Survey-weighted DiD
Paper / source link(s): Ortiz-Villavicencio & Sant'Anna (2025) "Better Understanding Triple Differences Estimators" (arXiv:2505.09942); Lumley (2004) for TSL variance
Any intentional deviations from the source (and why): Pre-existing aggregation weight deviation from R's triplediff::agg_ddd() documented in REGISTRY.md (uses CallawaySantAnna mixin cohort-size weights instead of group-probability weights). R triplediff package does not support survey weights — this implementation is unique to diff-diff.
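
For orientation, the stratified between-PSU formula usually meant by TSL variance of a total is V = sum over strata h of (1 - f_h) * n_h / (n_h - 1) * sum over PSUs i of (z_hi - zbar_h)^2, where z_hi is the PSU total of the per-unit scores. The sketch below applies it to per-unit weighted influence-function values. The function name, the argument layout, and the treatment of fpc as a sampling fraction f_h are assumptions for illustration; this is not the diff-diff implementation.

```python
from collections import defaultdict


def tsl_variance(scores, strata, psus, fpc=None):
    """Stratified between-PSU variance of the total of `scores`.

    scores : per-unit weighted influence-function values (assumed input)
    strata, psus : per-unit stratum and PSU labels
    fpc : optional dict mapping stratum -> sampling fraction f_h (assumption)
    """
    # Sum the scores within each PSU, grouped by stratum.
    totals = defaultdict(lambda: defaultdict(float))
    for z, h, p in zip(scores, strata, psus):
        totals[h][p] += z

    variance = 0.0
    for h, psu_totals in totals.items():
        n_h = len(psu_totals)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has a single PSU")
        mean_h = sum(psu_totals.values()) / n_h
        ss = sum((z - mean_h) ** 2 for z in psu_totals.values())
        f_h = 0.0 if fpc is None else fpc[h]
        variance += (1.0 - f_h) * n_h / (n_h - 1) * ss
    return variance


# One stratum, two PSUs with totals 1.0 and 3.0.
scores = [0.5, 0.5, 1.0, 2.0]
strata = ["a", "a", "a", "a"]
psus = [1, 1, 2, 2]
v_no_fpc = tsl_variance(scores, strata, psus)
v_fpc = tsl_variance(scores, strata, psus, fpc={"a": 0.5})
```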

Validation

Tests added/updated: tests/test_survey_staggered_ddd.py (25 tests across 12 classes)
- Smoke tests for reg/ipw/dr with all aggregation modes
- Uniform weight equivalence (coef + SE match unweighted)
- Scale invariance (constant weight multiplier)
- Nontrivial weights change SE
- Full design (strata/PSU/FPC) with metadata validation
- Design-based aggregation SEs differ from pweight-only
- Replicate weights (BRR) produce finite results
- Replicate + bootstrap rejection (NotImplementedError)
- Survey-weighted aggregation point estimates differ from unweighted
- pweight-only validation (fweight/aweight rejected)
- Bootstrap + survey interaction
- Control group and base period variants with survey
Existing tests pass: test_methodology_staggered_triple_diff.py (40 tests), test_survey_phase4.py CallawaySantAnna tests (23 tests)
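
The uniform-weight equivalence and scale-invariance checks listed above follow a common pattern, sketched here on a toy weighted difference-in-means. weighted_did() is a hypothetical stand-in for illustration only and is unrelated to the StaggeredTripleDifference API.

```python
import math


def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_did(delta_y, treated, weights):
    """Toy estimator: weighted mean outcome change, treated minus control."""
    t = [(dy, w) for dy, d, w in zip(delta_y, treated, weights) if d]
    c = [(dy, w) for dy, d, w in zip(delta_y, treated, weights) if not d]
    return weighted_mean(*zip(*t)) - weighted_mean(*zip(*c))


delta_y = [1.0, 3.0, 0.5, 1.5]
treated = [1, 1, 0, 0]
weights = [1.0, 3.0, 2.0, 2.0]

unweighted = weighted_did(delta_y, treated, [1.0] * 4)
uniform = weighted_did(delta_y, treated, [5.0] * 4)
weighted = weighted_did(delta_y, treated, weights)
rescaled = weighted_did(delta_y, treated, [7.0 * w for w in weights])

# Uniform weights reproduce the unweighted estimate.
assert math.isclose(uniform, unweighted)
# Multiplying every weight by a constant leaves the estimate unchanged.
assert math.isclose(rescaled, weighted)
# Nontrivial weights move the estimate.
assert not math.isclose(weighted, unweighted)
```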

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Thread survey weights through all three pairwise DiD comparisons (propensity scores, outcome regression, Riesz representers) with design-based variance at aggregation via CallawaySantAnna mixin infrastructure. Extract collapse_survey_to_unit_level to survey.py for reuse. Full test coverage across estimation methods, survey designs, and aggregation modes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…erage Block unsupported replicate-weight + n_bootstrap>0 combination matching CallawaySantAnna guard. Propagate _effective_df from _aggregate_simple() to df_survey for correct replicate-weight inference. Add tests for replicate+bootstrap rejection and survey-weighted aggregation point estimates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-31T18:48:21Z

Overall Assessment

⚠️ Needs changes

Executive Summary

The pairwise survey-weighted RA/IPW/DR implementation itself looks consistent with the paper and companion triplediff code; the problem is in the new aggregation/replicate normalization layer, not in the core three-DiD decomposition. (ideas.repec.org)
P1: replicate-weight survey support is not invariant to arbitrary common rescaling of the full-sample and combined replicate weights, so equivalent replicate designs can yield different aggregated SEs/inference and, in single-comparison cells, different group-time SEs.
P2: the new BRR tests do not exercise the documented combined_weights contract and would not catch the scale bug above.
No separate security, performance, or maintainability findings.

Methodology

Severity: P1. Impact: Affected method is StaggeredTripleDifference with replicate-weight survey support. SurveyDesign.resolve() intentionally leaves full-sample weights unnormalized when replicate_weights are present, so w_r / w_full stays on the raw scale, but the new estimator then reuses those raw unit weights for size_gt normalization and for the inherited cohort-mass/WIF aggregation path. See survey.py#L189, survey.py#L230, staggered_triple_diff.py#L404, staggered_triple_diff.py#L430, staggered_triple_diff.py#L445, staggered_triple_diff.py#L829, staggered_aggregation.py#L65, staggered_aggregation.py#L288, and staggered_aggregation.py#L399. The source-material check matters because the companion triplediff aggregation code constructs pg as a probability term and plugs it into get_weight_influence(), while the pairwise compute_did() core uses i_weights inside the nuisance/Riesz pieces only. Feeding arbitrary raw replicate full-sample weights into the DDD aggregation layer breaks that normalization contract, and REGISTRY does not document such a deviation. (rdrr.io) Concrete fix: keep raw resolved_survey.weights/replicate_weights only for ratio-based replicate variance, but build a normalized unit-level full-sample-weight view for cohort masses, size_gt, stored aggregation IFs, and WIF construction.
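
The scale issue can be seen on a toy example: a raw cell mass changes when every weight is multiplied by a constant, a mass computed from weights normalized to sum to n does not, and the replicate ratio w_r / w_full is unaffected either way. This is a minimal sketch under assumed names (normalize_to_n, cell_mass), not the code in staggered_triple_diff.py.

```python
import math


def normalize_to_n(weights):
    """Rescale weights so that they sum to the number of units."""
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]


def cell_mass(weights, in_cell):
    return sum(w for w, keep in zip(weights, in_cell) if keep)


w_full = [1.0, 2.0, 3.0, 4.0]
w_rep = [2.0, 0.0, 6.0, 0.0]      # one combined replicate column
in_cell = [True, True, False, False]
c = 1000.0                         # arbitrary common rescaling

w_full_c = [c * w for w in w_full]
w_rep_c = [c * w for w in w_rep]

raw_mass = cell_mass(w_full, in_cell)
raw_mass_c = cell_mass(w_full_c, in_cell)
norm_mass = cell_mass(normalize_to_n(w_full), in_cell)
norm_mass_c = cell_mass(normalize_to_n(w_full_c), in_cell)

ratios = [r / f for r, f in zip(w_rep, w_full)]
ratios_c = [r / f for r, f in zip(w_rep_c, w_full_c)]
```

Here raw_mass and raw_mass_c differ by the factor c, while norm_mass and norm_mass_c agree, which is why the normalized view suits cell sizes and aggregation weights and the raw weights can be kept for the ratio-based replicate variance.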

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The P1 above is not mitigated by TODO.md or a REGISTRY note.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new replicate tests generate rep_* columns as standalone perturbations and then rely on the default combined_weights=True, even though that mode means each replicate column already includes the full-sample weight. They also never test common-scale invariance for the replicate path. See test_survey_staggered_ddd.py#L383, test_survey_staggered_ddd.py#L390, test_survey_staggered_ddd.py#L425, and survey.py#L214. As written, these tests would not catch the P1 issue above. Concrete fix: either build true combined-weight replicate columns (rep_r = weight * factor_r) or set combined_weights=False, and add a regression test that rescales both weight and every rep_r by the same constant and asserts unchanged overall/event-study/group SEs.
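
A fixture that honors combined-weight semantics might look like the sketch below: BRR factors (0 or 2) come from a 4x4 Hadamard matrix for three strata with two PSUs each, the combined columns are rep_r = weight * factor_r, and a common rescaling of all weights leaves the BRR variance of a weighted mean unchanged. The helper names and the toy statistic are assumptions; the real tests run the estimator rather than a weighted mean.

```python
import math

HADAMARD_4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def brr_factors(strata, psu_in_stratum):
    """Per-unit BRR factors (0.0 or 2.0) for each of the 4 replicates.

    strata: stratum index 0..2 per unit; psu_in_stratum: 0 or 1 per unit.
    """
    reps = []
    for row in HADAMARD_4:
        factors = []
        for h, p in zip(strata, psu_in_stratum):
            keep_first = row[h + 1] == 1   # skip the all-ones first column
            factors.append(2.0 if (p == 0) == keep_first else 0.0)
        reps.append(factors)
    return reps


def weighted_mean(y, w):
    return sum(wi * yi for yi, wi in zip(y, w)) / sum(w)


def brr_variance(y, w_full, rep_weights):
    """Standard BRR variance: mean squared deviation of replicate estimates."""
    theta = weighted_mean(y, w_full)
    return sum((weighted_mean(y, w_r) - theta) ** 2 for w_r in rep_weights) / len(rep_weights)


strata = [0, 0, 1, 1, 2, 2]
psu = [0, 1, 0, 1, 0, 1]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
weight = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

factors = brr_factors(strata, psu)
# Combined replicate columns already include the full-sample weight.
combined = [[w * f for w, f in zip(weight, fr)] for fr in factors]

v = brr_variance(y, weight, combined)
c = 250.0
v_scaled = brr_variance(y, [c * w for w in weight], [[c * w for w in col] for col in combined])
```

Under the alternative convention (combined_weights=False in the PR's terms), the fixture would pass the factors themselves as the replicate columns instead of the products.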

Path to Approval

- Normalize a separate unit-level full-sample-weight view for the DDD aggregation/WIF path, while preserving raw weights only for w_r / w_full.
- Add replicate-weight scale-invariance tests for at least aggregate="simple", aggregate="event_study", and aggregate="group".
- Fix the BRR fixtures to honor combined_weights semantics, and cover both combined_weights=True and combined_weights=False explicitly.

Normalize survey_weights_arr to sum=n in _precompute_structures() so size_gt/size_gt_ctrl denominators are scale-invariant for replicate designs. Fix BRR test fixtures to build combined replicate weights (rep_r = weight * factor) honoring combined_weights=True semantics. Add replicate scale-invariance tests for simple/event_study/group aggregation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-31T19:38:54Z

/ai-review

github-actions · 2026-03-31T19:50:48Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b552cb887097da277fb6e231fa0d6ae0ef5d81c2

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior replicate-weight scale-invariance blocker looks addressed: the PR now builds a normalized unit-level weight view before size_gt/aggregation use, and the new replicate fixtures correctly follow combined_weights=True semantics in the test data (diff_diff/staggered_triple_diff.py:L399-L445, diff_diff/staggered_triple_diff.py:L829-L848, tests/test_survey_staggered_ddd.py:L376-L388, tests/test_survey_staggered_ddd.py:L459-L531).
Cross-checking against Ortiz-Villavicencio & Sant'Anna and the companion triplediff source, the core staggered DDD structure still matches the intended RA/IPW/DR + multiple-comparison-group + WIF aggregation design; the remaining blockers are in the new survey-specific edge paths, not the baseline DDD decomposition. (arxiv.org)
P1: when survey-weighted propensity estimation fails, _compute_pscore() falls back to an unweighted treated share, so weighted IPW/DR can silently revert to the wrong target (diff_diff/staggered_triple_diff.py:L1276-L1300, diff_diff/linalg.py:L1137-L1165).
P1: zero-weight subgroups after domain/subpopulation weighting are not treated as empty cells; affected (g, g_c, t) comparisons can become NaN and then be dropped without the empty-subgroup warning/handling already used elsewhere in the library (diff_diff/staggered_triple_diff.py:L385-L397, diff_diff/staggered_triple_diff.py:L917-L944, diff_diff/staggered_triple_diff.py:L1194-L1198).
P2: the new test file still does not exercise the covariate-adjusted survey nuisance paths, estimator-level combined_weights=False, or direct event-study/group replicate invariance assertions, so key new branches remain unverified (diff_diff/staggered_triple_diff.py:L1106-L1123, diff_diff/staggered_triple_diff.py:L1253-L1364, tests/test_survey_staggered_ddd.py:L1-L718).

Methodology

Cross-check against source material is mostly clean: the documented survey-specific deviations now recorded in the registry are not defects, including the survey-weighted cell-size extension and the already-documented aggregation-weight differences (docs/methodology/REGISTRY.md:L1349-L1357, docs/methodology/REGISTRY.md:L1393-L1404).

Severity: P1. Impact: _compute_pscore() correctly calls weighted IRLS, but if that fit fails it falls back to np.mean(PA4) instead of a survey-weighted treated share. On non-uniform weights, IPW/DR estimates in the fallback branch target the wrong propensity model, despite the registry and solve_logit() docstring explicitly stating that survey weights enter propensity-score estimation. Concrete fix: in the fallback branch, compute the intercept-only propensity as a weighted mean over the positive-weight effective sample, e.g. np.average(PA4[pos], weights=survey_weights[pos]), and add a regression test that forces the fallback under non-uniform weights (diff_diff/staggered_triple_diff.py:L1276-L1300, diff_diff/linalg.py:L1137-L1165, docs/methodology/REGISTRY.md:L1393-L1396).
Severity: P1. Impact: the new SDDD survey path only checks raw subgroup counts. After subpopulation() or any zero-weight domain, a subgroup can have rows but zero effective survey mass; _compute_did_panel() then returns NaN, and the outer loop quietly drops the comparison/cell. That is a missing positive-weight assumption check in a new survey code path, and it can silently change which comparison groups contribute to a (g,t) estimate. Concrete fix: before any pairwise DiD run, compute positive-weight mass for each of the four DDD subgroups and treat zero-mass subgroups as empty with the same warning/skip behavior already used in the survey-aware staggered paths (diff_diff/staggered_triple_diff.py:L385-L397, diff_diff/staggered_triple_diff.py:L917-L944, diff_diff/staggered_triple_diff.py:L1194-L1198, diff_diff/staggered.py:L553-L562, diff_diff/staggered.py:L2557-L2568, docs/methodology/REGISTRY.md:L2293-L2298).
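
The first finding above comes down to the difference between an unweighted and a survey-weighted treated share. The plain-Python sketch below mirrors the suggested np.average(PA4[pos], weights=survey_weights[pos]) on the positive-weight sample. The function name fallback_pscore and the clip bound of 1e-6 are assumptions for illustration, not values taken from _compute_pscore().

```python
def fallback_pscore(treated, weights=None, clip=1e-6):
    """Intercept-only propensity: the (weighted) treated share, clipped.

    `clip` is an assumed trimming bound, not the library's value.
    """
    if weights is None:
        share = sum(treated) / len(treated)
    else:
        # Restrict to the positive-weight effective sample.
        pos = [(d, w) for d, w in zip(treated, weights) if w > 0]
        if not pos:
            raise ValueError("no positive-weight observations")
        share = sum(d * w for d, w in pos) / sum(w for _, w in pos)
    return min(max(share, clip), 1.0 - clip)


treated = [1, 1, 0, 0]
weights = [4.0, 4.0, 1.0, 1.0]

p_unweighted = fallback_pscore(treated)           # treated share ignoring weights
p_weighted = fallback_pscore(treated, weights)    # survey-weighted treated share
```

With these skewed weights the two shares are 0.5 and 0.8, so an unweighted fallback would feed IPW/DR a visibly different constant propensity.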

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new findings. The existing SDDD parity limitations remain tracked already and are not blockers (TODO.md:L68-L70).

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new tests mostly exercise no-covariate fits, so the code paths this PR actually adds for survey-weighted nuisance estimation (solve_logit(weights=...), solve_ols(weights=...), PS/OR IF corrections) are not covered. The replicate tests also stay on estimator-level combined_weights=True and only check overall_att/overall_se, not horizon/group outputs. Concrete fix: add at least one covariate-adjusted survey test for each of reg, ipw, and dr; add an estimator-level combined_weights=False replicate case; and assert scale invariance on event_study_effects and group_effects themselves, not only overall_se (diff_diff/staggered_triple_diff.py:L1106-L1123, diff_diff/staggered_triple_diff.py:L1253-L1364, tests/test_survey_staggered_ddd.py:L376-L531).

Execution note: I could not run the test suite in this environment because pytest and even numpy are unavailable.

Path to Approval

- Preserve survey weighting in the unconditional propensity fallback in _compute_pscore() and add a regression test that forces that fallback under non-uniform survey weights.
- Add positive-weight subgroup checks to the SDDD survey path so zero-mass domain/subpopulation cells are warned-and-skipped as empty rather than silently dropped after a NaN ATT.
- Extend tests/test_survey_staggered_ddd.py with covariate-adjusted survey cases and estimator-level replicate coverage for combined_weights=False plus direct event-study/group invariance assertions.

Fix survey-weighted propensity fallback to use np.average(PA4, weights=sw) instead of unweighted np.mean(PA4). Add zero-weight subgroup detection in _compute_ddd_gt_gc for subpopulation/domain designs. Fix existing test expecting NotImplementedError for invalid survey_design type (now TypeError from _resolve_survey_for_fit). Add covariate-adjusted survey tests for reg/ipw/dr and combined_weights=False replicate test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
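
The zero-weight subgroup detection mentioned in this commit can be pictured as a mass check that runs before any pairwise comparison. The sketch below is a hypothetical illustration: the subgroup labels 1 to 4, the function names, and the warning text are assumptions, not the logic in _compute_ddd_gt_gc.

```python
import warnings


def zero_mass_subgroups(subgroup, weights, labels=(1, 2, 3, 4)):
    """Return the subgroup labels whose total positive survey weight is zero."""
    mass = {label: 0.0 for label in labels}
    for s, w in zip(subgroup, weights):
        if w > 0:
            mass[s] += w
    return [label for label in labels if mass[label] == 0.0]


def should_skip_comparison(subgroup, weights):
    """Warn and report True when any of the four subgroups has no mass."""
    empty = zero_mass_subgroups(subgroup, weights)
    if empty:
        warnings.warn(
            f"subgroups {empty} have zero survey mass; skipping comparison",
            UserWarning,
        )
        return True
    return False


subgroup = [1, 2, 3, 4, 4]
weights = [1.0, 2.0, 0.0, 0.5, 0.5]   # subgroup 3 has a row but no mass

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    skipped = should_skip_comparison(subgroup, weights)
```

Counting rows alone would treat subgroup 3 as populated; checking mass is what distinguishes a subpopulation/domain design from a plain filter.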

igerber · 2026-03-31T20:19:55Z

/ai-review

github-actions · 2026-03-31T20:27:46Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 227ae508692d5684fa66a56c2b130ec6af2ee82e

Overall Assessment

✅ Looks good

Executive Summary

The prior weighted-propensity fallback blocker is fixed: _compute_pscore() now falls back to a survey-weighted treated share instead of an unweighted mean when weighted IRLS fails. diff_diff/staggered_triple_diff.py:L1264
The prior zero-mass subgroup blocker is fixed: _compute_ddd_gt_gc() now treats zero survey-mass subgroups as empty and skips them with the same warning path used for unidentified comparisons. diff_diff/staggered_triple_diff.py:L889
Against the updated methodology registry and the in-code references, the affected method is StaggeredTripleDifference’s panel DDD path plus its aggregation/inference path; the implementation now consistently threads survey weights through weighted IRLS, WLS, weighted Riesz/Hajek terms, survey-weighted DDD cell sizes, and design-based aggregated variance. diff_diff/staggered_triple_diff.py:L1117 diff_diff/staggered_triple_diff.py:L1153 diff_diff/staggered_triple_diff.py:L1336 diff_diff/staggered_aggregation.py:L503 docs/methodology/REGISTRY.md:L1393
I did not find any remaining unmitigated P0/P1 methodology defects in the changed paths.
One P2 test gap remains: the new suite still does not directly pin the two newly fixed branches or directly assert invariance on returned event_study_effects / group_effects. tests/test_survey_staggered_ddd.py:L186 tests/test_survey_staggered_ddd.py:L477 tests/test_survey_staggered_ddd.py:L733 tests/test_survey_staggered_ddd.py:L760

Methodology

Severity: P3. Impact: No unmitigated methodology defect found. The affected method is the survey-weighted StaggeredTripleDifference ATT(g,t) / aggregation path. The changed code matches the updated registry note: survey weights enter propensity-score estimation, outcome regression, Riesz/Hajek normalization, DDD IF-combination cell sizes, and aggregated SE computation. Concrete fix: none. diff_diff/staggered_triple_diff.py:L214 diff_diff/staggered_triple_diff.py:L1026 diff_diff/staggered_triple_diff.py:L1153 diff_diff/staggered_triple_diff.py:L1264 diff_diff/staggered_triple_diff.py:L1336 docs/methodology/REGISTRY.md:L1393
Severity: P3. Impact: The visible deviation from R in this area remains the already documented aggregation-weight choice, so it is informational rather than a defect. Concrete fix: none. docs/methodology/REGISTRY.md:L1400

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new findings. The pre-existing staggered-DDD parity limitations remain tracked in TODO.md, so they are not blockers for this re-review. TODO.md:L68

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new survey test file materially improves coverage, but it still does not directly regression-test the two branches that resolved the prior P1s: the weighted unconditional propensity fallback and the zero-survey-mass subgroup skip. The replicate scale-invariance test also parameterizes aggregate="event_study" and aggregate="group" without asserting the returned event_study_effects or group_effects, so those outputs remain unpinned. Concrete fix: add one non-uniform-weight test that forces _compute_pscore() into fallback and checks the weighted-share behavior; add one domain/subpopulation test with a zero-mass subgroup and expected warning/skip; and extend the replicate invariance test to compare event_study_effects and group_effects values directly. diff_diff/staggered_triple_diff.py:L927 diff_diff/staggered_triple_diff.py:L1308 tests/test_survey_staggered_ddd.py:L186 tests/test_survey_staggered_ddd.py:L477 tests/test_survey_staggered_ddd.py:L733 tests/test_survey_staggered_ddd.py:L760

Execution note: I could not run pytest here because pytest is not installed, and I could not do a runtime smoke test because numpy is unavailable in this environment. I did confirm the changed Python files parse successfully with compile().

…b-aggregation invariance Pin the two newly fixed branches: weighted pscore fallback (forced via collinear covariate) and zero-mass subgroup warning/skip (via zeroed weights). Extend replicate scale-invariance test to assert event_study and group_effects values directly, not just overall_att/overall_se.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-31T20:39:15Z

/ai-review

github-actions · 2026-03-31T20:49:07Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c07887f1b84c63fbb9914bc8f61069aac26644e5

Overall Assessment

✅ Looks good

Executive Summary

Re-review outcome: I did not find any unmitigated P0/P1 issues in the changed survey-enabled StaggeredTripleDifference path.
The prior zero-mass subgroup blocker is fixed in diff_diff/staggered_triple_diff.py:938 and now directly regression-tested in tests/test_survey_staggered_ddd.py:869.
The prior weighted unconditional-propensity bug appears fixed in diff_diff/staggered_triple_diff.py:1308, but the new fallback regression test at tests/test_survey_staggered_ddd.py:833 still only checks finiteness, not that the weighted-share fallback actually executed.
The replicate-weight scale-invariance gap noted in the last review is addressed: the new test now asserts event_study_effects and group_effects directly in tests/test_survey_staggered_ddd.py:477.
I could not run pytest here because pytest is unavailable, and I could not do runtime smoke tests because numpy is unavailable; this is a static review plus compile-only validation.

Methodology

Severity: P3. Impact: The affected method is the survey-weighted staggered DDD ATT(g,t) / aggregation path in diff_diff/staggered_triple_diff.py:889 and diff_diff/staggered_triple_diff.py:1153. The changed code preserves the paper’s staggered-DDD structure and multiple-comparison-group setup, and the survey-variance extension is now explicitly documented in docs/methodology/REGISTRY.md:1393; the only visible deviation in this area remains the already-documented aggregation-weight choice in docs/methodology/REGISTRY.md:1400. Concrete fix: None.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: fit() now populates survey_metadata on the results object in diff_diff/staggered_triple_diff.py:652, but diff_diff/staggered_triple_diff_results.py:308 still omits survey fields from to_dict(). That is not a correctness problem, but it leaves the new survey output incompletely propagated to an existing serialization surface. Concrete fix: add the standard survey keys (weight_type, effective_n, design_effect, sum_weights, n_strata, n_psu, df_survey, and replicate metadata when present) to to_dict(), and add a survey serialization assertion to the new test file.
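
One way the suggested serialization could look is sketched below with stand-in dataclasses. The class names SurveyMetadata and Results and the field layout are hypothetical; only the key names are taken from the list in the finding above, and replicate metadata is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional

SURVEY_KEYS = (
    "weight_type", "effective_n", "design_effect", "sum_weights",
    "n_strata", "n_psu", "df_survey",
)


@dataclass
class SurveyMetadata:
    """Hypothetical container for survey summary fields."""
    weight_type: str
    effective_n: float
    design_effect: float
    sum_weights: float
    n_strata: Optional[int] = None
    n_psu: Optional[int] = None
    df_survey: Optional[int] = None


@dataclass
class Results:
    """Hypothetical results object, not the diff-diff results class."""
    overall_att: float
    overall_se: float
    survey_metadata: Optional[SurveyMetadata] = None

    def to_dict(self):
        out = {"overall_att": self.overall_att, "overall_se": self.overall_se}
        # Only emit survey keys when a survey design was actually used.
        if self.survey_metadata is not None:
            for key in SURVEY_KEYS:
                out[key] = getattr(self.survey_metadata, key)
        return out


meta = SurveyMetadata("pweight", 80.0, 1.25, 100.0, n_strata=4, n_psu=20, df_survey=16)
with_survey = Results(0.3, 0.1, meta).to_dict()
without_survey = Results(0.3, 0.1).to_dict()
```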

Tech Debt

No new blocker. Existing staggered-DDD parity / benchmarking limitations remain tracked in TODO.md:68, TODO.md:69, and TODO.md:70.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new fallback regression test in tests/test_survey_staggered_ddd.py:833 does not actually prove that _compute_pscore() in diff_diff/staggered_triple_diff.py:1264 entered the fallback branch or used the survey-weighted treated share from diff_diff/staggered_triple_diff.py:1308; it only asserts that a full fit() returns a finite ATT. A future regression back to the unweighted fallback could therefore still pass this file. Concrete fix: force solve_logit() to raise (or call _compute_pscore() directly) with non-uniform weights and assert that the returned constant propensity score equals the weighted treated share after clipping, and differs from the unweighted mean in the skewed-weight case.
Validation note: I could not execute the suite here because pytest is not installed and numpy is unavailable; I only confirmed the touched Python files compile.
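
A regression test of the kind requested above could force the solver to raise and then pin the fallback value itself. The sketch uses unittest.mock against a toy class. ToyEstimator, its method names, the caught exception type, and the clip bound are all assumptions for illustration; the real test would patch solve_logit() where diff-diff imports it.

```python
from unittest import mock


class ToyEstimator:
    """Hypothetical stand-in with the same try/fallback shape."""

    def solve_logit(self, d, weights):
        # Placeholder for a weighted IRLS fit; always patched in the test.
        raise NotImplementedError("stand-in for a real solver")

    def compute_pscore(self, d, weights):
        try:
            return self.solve_logit(d, weights)
        except ValueError:
            # Fallback: constant survey-weighted treated share, clipped.
            share = sum(di * wi for di, wi in zip(d, weights)) / sum(weights)
            share = min(max(share, 1e-6), 1.0 - 1e-6)
            return [share] * len(d)


def test_fallback_uses_weighted_share():
    d = [1, 1, 0, 0]
    w = [4.0, 4.0, 1.0, 1.0]
    unweighted_share = sum(d) / len(d)

    with mock.patch.object(
        ToyEstimator, "solve_logit", side_effect=ValueError("singular")
    ) as fake:
        pscore = ToyEstimator().compute_pscore(d, w)

    assert fake.called                                   # fallback branch was reached
    assert all(abs(p - 0.8) < 1e-12 for p in pscore)     # weighted treated share
    assert abs(pscore[0] - unweighted_share) > 0.1       # differs from unweighted mean


test_fallback_uses_weighted_share()
```

Asserting on the returned constant, rather than on a finite ATT from a full fit, is what would catch a regression back to the unweighted mean.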

igerber and others added 2 commits March 31, 2026 13:52

igerber merged commit 55217c4 into main Mar 31, 2026
14 checks passed

igerber deleted the staggered-ddd-survey branch March 31, 2026 23:49

igerber merged 5 commits into main from staggered-ddd-survey