Skip to content

Add survey design support to StaggeredTripleDifference#247

Merged
igerber merged 5 commits into
mainfrom
staggered-ddd-survey
Mar 31, 2026
Merged

Add survey design support to StaggeredTripleDifference#247
igerber merged 5 commits into
mainfrom
staggered-ddd-survey

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Mar 31, 2026

Summary

  • Add full survey design support (pweight, strata/PSU/FPC, replicate weights) to StaggeredTripleDifference estimator
  • Extract collapse_survey_to_unit_level() from CallawaySantAnna to survey.py for reuse across panel IF-based estimators
  • Thread survey weights through all three pairwise DiD comparisons: propensity score estimation (weighted IRLS), outcome regression (WLS), and Riesz representer computation (weighted Hajek normalization)
  • Design-based variance at aggregation level via CallawaySantAnnaAggregationMixin infrastructure (TSL or replicate IF variance)
  • Guard against unsupported replicate-weight + bootstrap combination
  • Propagate effective df from aggregation mixin for correct replicate-weight inference

Methodology references (required if estimator / math changes)

  • Method name(s): Staggered Triple Difference (DDD), Survey-weighted DiD
  • Paper / source link(s): Ortiz-Villavicencio & Sant'Anna (2025) "Better Understanding Triple Differences Estimators" (arXiv:2505.09942); Lumley (2004) for TSL variance
  • Any intentional deviations from the source (and why): Pre-existing aggregation weight deviation from R's triplediff::agg_ddd() documented in REGISTRY.md (uses CallawaySantAnna mixin cohort-size weights instead of group-probability weights). R triplediff package does not support survey weights — this implementation is unique to diff-diff.

Validation

  • Tests added/updated: tests/test_survey_staggered_ddd.py (25 tests across 12 classes)
    • Smoke tests for reg/ipw/dr with all aggregation modes
    • Uniform weight equivalence (coef + SE match unweighted)
    • Scale invariance (constant weight multiplier)
    • Nontrivial weights change SE
    • Full design (strata/PSU/FPC) with metadata validation
    • Design-based aggregation SEs differ from pweight-only
    • Replicate weights (BRR) produce finite results
    • Replicate + bootstrap rejection (NotImplementedError)
    • Survey-weighted aggregation point estimates differ from unweighted
    • pweight-only validation (fweight/aweight rejected)
    • Bootstrap + survey interaction
    • Control group and base period variants with survey
  • Existing tests pass: test_methodology_staggered_triple_diff.py (40 tests), test_survey_phase4.py CallawaySantAnna tests (23 tests)

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 2 commits March 31, 2026 13:52
Thread survey weights through all three pairwise DiD comparisons
(propensity scores, outcome regression, Riesz representers) with
design-based variance at aggregation via CallawaySantAnna mixin
infrastructure. Extract collapse_survey_to_unit_level to survey.py
for reuse. Full test coverage across estimation methods, survey
designs, and aggregation modes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…erage

Block unsupported replicate-weight + n_bootstrap>0 combination matching
CallawaySantAnna guard. Propagate _effective_df from _aggregate_simple()
to df_survey for correct replicate-weight inference. Add tests for
replicate+bootstrap rejection and survey-weighted aggregation point
estimates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

⚠️ Needs changes

Executive Summary

  • The pairwise survey-weighted RA/IPW/DR implementation itself looks consistent with the paper and companion triplediff code; the problem is in the new aggregation/replicate normalization layer, not in the core three-DiD decomposition. (ideas.repec.org)
  • P1: replicate-weight survey support is not invariant to arbitrary common rescaling of the full-sample and combined replicate weights, so equivalent replicate designs can yield different aggregated SEs/inference and, in single-comparison cells, different group-time SEs.
  • P2: the new BRR tests do not exercise the documented combined_weights contract and would not catch the scale bug above.
  • No separate security, performance, or maintainability findings.

Methodology

  • Severity: P1. Impact: Affected method is StaggeredTripleDifference with replicate-weight survey support. SurveyDesign.resolve() intentionally leaves full-sample weights unnormalized when replicate_weights are present, so w_r / w_full stays on the raw scale, but the new estimator then reuses those raw unit weights for size_gt normalization and for the inherited cohort-mass/WIF aggregation path. See survey.py#L189, survey.py#L230, staggered_triple_diff.py#L404, staggered_triple_diff.py#L430, staggered_triple_diff.py#L445, staggered_triple_diff.py#L829, staggered_aggregation.py#L65, staggered_aggregation.py#L288, and staggered_aggregation.py#L399. The source-material check matters because the companion triplediff aggregation code constructs pg as a probability term and plugs it into get_weight_influence(), while the pairwise compute_did() core uses i_weights inside the nuisance/Riesz pieces only. Feeding arbitrary raw replicate full-sample weights into the DDD aggregation layer breaks that normalization contract, and REGISTRY does not document such a deviation. (rdrr.io) Concrete fix: keep raw resolved_survey.weights/replicate_weights only for ratio-based replicate variance, but build a normalized unit-level full-sample-weight view for cohort masses, size_gt, stored aggregation IFs, and WIF construction.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The P1 above is not mitigated by TODO.md or a REGISTRY note.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new replicate tests generate rep_* columns as standalone perturbations and then rely on the default combined_weights=True, even though that mode means each replicate column already includes the full-sample weight. They also never test common-scale invariance for the replicate path. See test_survey_staggered_ddd.py#L383, test_survey_staggered_ddd.py#L390, test_survey_staggered_ddd.py#L425, and survey.py#L214. As written, these tests would not catch the P1 issue above. Concrete fix: either build true combined-weight replicate columns (rep_r = weight * factor_r) or set combined_weights=False, and add a regression test that rescales both weight and every rep_r by the same constant and asserts unchanged overall/event-study/group SEs.

Path to Approval

  1. Normalize a separate unit-level full-sample-weight view for the DDD aggregation/WIF path, while preserving raw weights only for w_r / w_full.
  2. Add replicate-weight scale-invariance tests for at least aggregate="simple", aggregate="event_study", and aggregate="group".
  3. Fix the BRR fixtures to honor combined_weights semantics, and cover both combined_weights=True and combined_weights=False explicitly.

Normalize survey_weights_arr to sum=n in _precompute_structures() so
size_gt/size_gt_ctrl denominators are scale-invariant for replicate
designs. Fix BRR test fixtures to build combined replicate weights
(rep_r = weight * factor) honoring combined_weights=True semantics.
Add replicate scale-invariance tests for simple/event_study/group
aggregation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 31, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: b552cb887097da277fb6e231fa0d6ae0ef5d81c2


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Cross-check against source material is mostly clean: the documented survey-specific deviations now recorded in the registry are not defects, including the survey-weighted cell-size extension and the already-documented aggregation-weight differences (docs/methodology/REGISTRY.md:L1349-L1357, docs/methodology/REGISTRY.md:L1393-L1404).

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new findings. The existing SDDD parity limitations remain tracked already and are not blockers (TODO.md:L68-L70).

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests mostly exercise no-covariate fits, so the code paths this PR actually adds for survey-weighted nuisance estimation (solve_logit(weights=...), solve_ols(weights=...), PS/OR IF corrections) are not covered. The replicate tests also stay on estimator-level combined_weights=True and only check overall_att/overall_se, not horizon/group outputs. Concrete fix: add at least one covariate-adjusted survey test for each of reg, ipw, and dr; add an estimator-level combined_weights=False replicate case; and assert scale invariance on event_study_effects and group_effects themselves, not only overall_se (diff_diff/staggered_triple_diff.py:L1106-L1123, diff_diff/staggered_triple_diff.py:L1253-L1364, tests/test_survey_staggered_ddd.py:L376-L531).

Execution note: I could not run the test suite in this environment because pytest and even numpy are unavailable.

Path to Approval

  1. Preserve survey weighting in the unconditional propensity fallback in _compute_pscore() and add a regression test that forces that fallback under non-uniform survey weights.
  2. Add positive-weight subgroup checks to the SDDD survey path so zero-mass domain/subpopulation cells are warned-and-skipped as empty rather than silently dropped after a NaN ATT.
  3. Extend tests/test_survey_staggered_ddd.py with covariate-adjusted survey cases and estimator-level replicate coverage for combined_weights=False plus direct event-study/group invariance assertions.

Fix survey-weighted propensity fallback to use np.average(PA4, weights=sw)
instead of unweighted np.mean(PA4). Add zero-weight subgroup detection in
_compute_ddd_gt_gc for subpopulation/domain designs. Fix existing test
expecting NotImplementedError for invalid survey_design type (now TypeError
from _resolve_survey_for_fit). Add covariate-adjusted survey tests for
reg/ipw/dr and combined_weights=False replicate test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 31, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 227ae508692d5684fa66a56c2b130ec6af2ee82e


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new findings. The pre-existing staggered-DDD parity limitations remain tracked in TODO.md, so they are not blockers for this re-review. TODO.md:L68

Security

No findings.

Documentation/Tests

Execution note: I could not run pytest here because pytest is not installed, and I could not do a runtime smoke test because numpy is unavailable in this environment. I did confirm the changed Python files parse successfully with compile().

…b-aggregation invariance

Pin the two newly fixed branches: weighted pscore fallback (forced via
collinear covariate) and zero-mass subgroup warning/skip (via zeroed
weights). Extend replicate scale-invariance test to assert event_study
and group_effects values directly, not just overall_att/overall_se.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 31, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c07887f1b84c63fbb9914bc8f61069aac26644e5


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: fit() now populates survey_metadata on the results object in diff_diff/staggered_triple_diff.py:652, but diff_diff/staggered_triple_diff_results.py:308 still omits survey fields from to_dict(). That is not a correctness problem, but it leaves the new survey output incompletely propagated to an existing serialization surface. Concrete fix: add the standard survey keys (weight_type, effective_n, design_effect, sum_weights, n_strata, n_psu, df_survey, and replicate metadata when present) to to_dict(), and add a survey serialization assertion to the new test file.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new fallback regression test in tests/test_survey_staggered_ddd.py:833 does not actually prove that _compute_pscore() in diff_diff/staggered_triple_diff.py:1264 entered the fallback branch or used the survey-weighted treated share from diff_diff/staggered_triple_diff.py:1308; it only asserts that a full fit() returns a finite ATT. A future regression back to the unweighted fallback could therefore still pass this file. Concrete fix: force solve_logit() to raise (or call _compute_pscore() directly) with non-uniform weights and assert that the returned constant propensity score equals the weighted treated share after clipping, and differs from the unweighted mean in the skewed-weight case.
  • Validation note: I could not execute the suite here because pytest is not installed and numpy is unavailable; I only confirmed the touched Python files compile.

@igerber igerber merged commit 55217c4 into main Mar 31, 2026
14 checks passed
@igerber igerber deleted the staggered-ddd-survey branch March 31, 2026 23:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant