Add survey R cross-validation: TSL variance vs R survey::svyglm by igerber · Pull Request #250 · igerber/diff-diff

igerber · 2026-04-01T12:10:08Z

Summary

Cross-validates diff-diff's survey variance estimates against R's survey::svyglm() (Lumley 2004) across three tiers
Tier 1: DifferenceInDifferences vs svyglm under 4 design variants (strata+PSU+FPC, strata+PSU, weights-only, strata-only), with covariates, across 3 seeds — 29 tests
Tier 2: CallawaySantAnna vs did::att_gt with survey weights, with and without covariates — 8 tests
Tier 3: BRR replicate weights via LinearRegression vs svrepdesign — 3 tests
All 43 tests pass with tight tolerances (ATT rtol=1e-4, SE rtol=1%). No deviations from R found.

Methodology references (required if estimator / math changes)

Method name(s): N/A — no methodology changes, validation only
Paper / source link(s): Lumley (2004) "Analysis of Complex Survey Samples" JSS 9(8); Binder (1983) "On the Variances of Asymptotically Normal Estimators from Complex Surveys" ISR 51(3)
Any intentional deviations from the source (and why): None — this PR validates existing implementation against R

Validation

Tests added/updated: tests/test_survey_r_crossvalidation.py (43 tests, 3 tiers)
R benchmark script: benchmarks/R/benchmark_survey_crossvalidation.R
JSON fixture: benchmarks/data/synthetic/survey_crossvalidation_r_results.json

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…yglm Cross-validates diff-diff's survey variance estimates against R's authoritative survey package (Lumley 2004) across three tiers: Tier 1: DifferenceInDifferences vs svyglm under 4 design variants (strata+PSU+FPC, strata+PSU, weights-only, strata-only), with covariates, and across 3 seeds for robustness — 29 tests Tier 2: CallawaySantAnna vs did::att_gt with survey weights, with and without covariates — 8 tests Tier 3: BRR replicate weights via LinearRegression vs svrepdesign — 3 tests All 43 tests pass with tight tolerances (ATT rtol=1e-4, SE rtol=1%). No deviations from R found. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-01T12:18:17Z

Overall Assessment

✅ Looks good

Executive Summary

No estimator implementation files changed; the diff adds an R benchmark generator, a committed golden JSON fixture, and Python cross-validation tests.
I cross-checked the affected methodology contracts against the registry for DifferenceInDifferences survey TSL, CallawaySantAnna weighted reg with base_period="varying"/never-treated controls, and BRR replicate-weight variance. I did not find an undocumented estimator deviation in the exercised implementation paths.
The main weakness is Tier 2: the fixture stores R SEs, but the tests only assert ATT parity, so the new suite does not fully lock down CallawaySantAnna survey variance in the non-covariate case.
The covariate-adjusted survey reg SE divergence from R is already documented in the methodology registry, so I treated that as informational rather than a defect.
Static review only: I could not execute the new module in this environment because pytest and the Python data stack are unavailable.

Methodology

Cross-check summary: the benchmark setup is consistent with the library’s documented contracts for never-treated encoding (first_treat=0 or Inf) and replicate-weight survey d.f. in docs/methodology/REGISTRY.md:L287-L290, docs/methodology/REGISTRY.md:L390-L424, docs/methodology/REGISTRY.md:L2191-L2245, diff_diff/staggered.py:L1214-L1216, diff_diff/staggered.py:L1323-L1325, and diff_diff/survey.py:L582-L610.

P2 Impact: Tier 2 exports overall_se and gt_se from R, but the Python side only checks ATT parity plus a weak smoke test that SE is finite and differs from the unweighted run. That means this PR does not actually cross-validate CallawaySantAnna survey variance for the non-covariate reg case, despite having the golden values available. Concrete fix: add an explicit result.overall_se vs r["overall_se"] assertion for cs_weighted_nocov; if cs_weighted_cov remains ATT-only, make that exception explicit and tie it to the documented registry note. Locations: tests/test_survey_r_crossvalidation.py:L293-L350, benchmarks/R/benchmark_survey_crossvalidation.R:L400-L417.
P3 Impact: The survey reg+covariates SE difference from R is already documented, so I did not treat the lack of SE parity for cs_weighted_cov as a defect. Concrete fix: add a short inline comment in the new test pointing to the documented deviation so future reviewers do not read the omission as accidental. Locations: docs/methodology/REGISTRY.md:L423-L424, tests/test_survey_r_crossvalidation.py:L242-L350.

Code Quality

P3 Impact: _load_scenario_data() silently maps any non-numeric first_treat token to 0, not just R’s "Inf". A malformed fixture would therefore be reinterpreted as never-treated instead of failing loudly, which could change control-group composition inside the benchmark. Concrete fix: explicitly map "Inf"/"Infinity" to 0 and raise on any other non-numeric value. Location: tests/test_survey_r_crossvalidation.py:L70-L75.

Performance

No findings.

Maintainability

P3 Impact: Tier 2 compares R group-time effects positionally after sorting only the Python side, while the generator exports gt_results in whatever order did::att_gt() returns. If fixture generation is rerun under a did version with different output ordering, the comparison becomes fragile for no methodological reason. Concrete fix: sort gt_results by (group, time) before export and/or compare keyed (group, time) -> effect mappings on both sides. Locations: benchmarks/R/benchmark_survey_crossvalidation.R:L381-L387, tests/test_survey_r_crossvalidation.py:L279-L291.

Tech Debt

No new untracked blocker found. Related survey items are already marked resolved in TODO.md:L57-L59.

Security

No findings. The added fixture appears synthetic, and I did not see secret-like material.

Documentation/Tests

P3 Impact: The benchmark script prints agg$overall.se as SE(naive), which is misleading when manually regenerating benchmarks because that value is the reported aggregation SE, not a naive unweighted reference SE. Concrete fix: relabel the message to SE or overall SE. Location: benchmarks/R/benchmark_survey_crossvalidation.R:L420-L422.

Verification note: I could not run tests/test_survey_r_crossvalidation.py here because this environment does not have pytest or pandas.

- P2: Add docstring explaining why Tier 2 compares ATT only (R's did::att_gt gives naive SEs, not design-based) - P3: Fail loudly on unexpected non-numeric first_treat values instead of silently mapping to 0 - P3: Sort R gt_results by (group, time) and use keyed comparison in Python for stable cross-version matching - P3: Relabel misleading "SE(naive)" to "SE" in R console output Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T17:32:35Z

/ai-review

github-actions · 2026-04-01T17:38:13Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2ba616a0ddd046bc673a537328be8f11800945aa

Overall Assessment

✅ Looks good

Executive Summary

No estimator implementation files changed; this PR adds an R benchmark generator, a committed JSON fixture, and Python cross-validation tests for survey/Taylor-linearized and BRR paths.
Re-review: the prior issues around permissive first_treat parsing, unstable (group, time) ordering, and the misleading benchmark print label are addressed in tests/test_survey_r_crossvalidation.py:L61-L94 and benchmarks/R/benchmark_survey_crossvalidation.R:L381-L423.
I cross-checked the exercised methodology against the registry for survey TSL, Callaway-Sant’Anna base_period="varying" / never-treated controls, and BRR replicate-weight df. I did not find any unmitigated P0/P1 mismatch in the changed code paths.
One P3 remains: Tier 2 now correctly documents ATT-only coverage, but cs_weighted_nocov still exports an R overall_se without asserting it, so that specific weighted-reg SE path is not locked down.
Static review only: I could not execute the tests in this environment because pandas is unavailable.

Methodology

Affected methods reviewed: DifferenceInDifferences survey TSL against survey::svyglm; CallawaySantAnna(estimation_method="reg", control_group="never_treated", base_period="varying", n_bootstrap=0) with survey weights against did::att_gt; LinearRegression BRR replicate-weight variance against svrepdesign; TwoWayFixedEffects survey plumbing on the balanced 2x2 benchmark. References: docs/methodology/REGISTRY.md:L390-L424, docs/methodology/REGISTRY.md:L2065-L2250, diff_diff/staggered.py:L1359-L1574, diff_diff/staggered_aggregation.py:L503-L545, diff_diff/survey.py:L581-L619, diff_diff/survey.py:L1365-L1703.
No unmitigated P0/P1 findings.
Severity: P3. Impact: Tier 2’s ATT-only rationale is now explicit, but the exact cs_weighted_nocov scenario still stores R overall_se, and the registry’s explicit deviation note is for the reg+covariates survey path rather than the no-covariate weighted path. A regression in that no-covariate overall SE could therefore slip through this new suite. Concrete fix: add an overall_se parity assertion for cs_weighted_nocov, and keep cs_weighted_cov ATT-only with the existing registry note. References: tests/test_survey_r_crossvalidation.py:L253-L374, benchmarks/R/benchmark_survey_crossvalidation.R:L378-L418, docs/methodology/REGISTRY.md:L420-L424, diff_diff/staggered_aggregation.py:L528-L545.

Code Quality

No findings. The prior first_treat parsing issue is fixed: unexpected non-numeric values now raise instead of being silently coerced to never-treated in tests/test_survey_r_crossvalidation.py:L61-L94.

Performance

No findings.

Maintainability

No findings. The prior ordering fragility is fixed by sorting exported (group, time) cells in benchmarks/R/benchmark_survey_crossvalidation.R:L381-L389 and comparing keyed mappings in tests/test_survey_r_crossvalidation.py:L297-L315.

Tech Debt

No new untracked blocker. The related survey items from earlier reviews are already recorded as resolved in TODO.md:L57-L59.

Security

No findings. The committed fixture appears synthetic, and I did not see secret-like material in the changed files.

Documentation/Tests

No additional findings beyond the Tier 2 SE-parity gap above.
Runtime verification was not possible here because the environment lacks pandas, so this review is based on source inspection rather than local test execution.

igerber merged commit a6a3325 into main Apr 1, 2026
13 checks passed

igerber deleted the survey-r-crossvalidation branch April 1, 2026 18:20

igerber mentioned this pull request Apr 1, 2026

Bump version to 2.8.1 #254

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add survey R cross-validation: TSL variance vs R survey::svyglm#250

Add survey R cross-validation: TSL variance vs R survey::svyglm#250
igerber merged 2 commits into
mainfrom
survey-r-crossvalidation

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Apr 1, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant