Skip to content

Add survey R cross-validation: TSL variance vs R survey::svyglm#250

Merged
igerber merged 2 commits into
mainfrom
survey-r-crossvalidation
Apr 1, 2026
Merged

Add survey R cross-validation: TSL variance vs R survey::svyglm#250
igerber merged 2 commits into
mainfrom
survey-r-crossvalidation

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 1, 2026

Summary

  • Cross-validates diff-diff's survey variance estimates against R's survey::svyglm() (Lumley 2004) across three tiers
  • Tier 1: DifferenceInDifferences vs svyglm under 4 design variants (strata+PSU+FPC, strata+PSU, weights-only, strata-only), with covariates, across 3 seeds — 29 tests
  • Tier 2: CallawaySantAnna vs did::att_gt with survey weights, with and without covariates — 8 tests
  • Tier 3: BRR replicate weights via LinearRegression vs svrepdesign — 3 tests
  • All 43 tests pass with tight tolerances (ATT rtol=1e-4, SE rtol=1%). No deviations from R found.

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no methodology changes, validation only
  • Paper / source link(s): Lumley (2004) "Analysis of Complex Survey Samples" JSS 9(8); Binder (1983) "On the Variances of Asymptotically Normal Estimators from Complex Surveys" ISR 51(3)
  • Any intentional deviations from the source (and why): None — this PR validates existing implementation against R

Validation

  • Tests added/updated: tests/test_survey_r_crossvalidation.py (43 tests, 3 tiers)
  • R benchmark script: benchmarks/R/benchmark_survey_crossvalidation.R
  • JSON fixture: benchmarks/data/synthetic/survey_crossvalidation_r_results.json

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…yglm

Cross-validates diff-diff's survey variance estimates against R's
authoritative survey package (Lumley 2004) across three tiers:

Tier 1: DifferenceInDifferences vs svyglm under 4 design variants
  (strata+PSU+FPC, strata+PSU, weights-only, strata-only),
  with covariates, and across 3 seeds for robustness — 29 tests
Tier 2: CallawaySantAnna vs did::att_gt with survey weights,
  with and without covariates — 8 tests
Tier 3: BRR replicate weights via LinearRegression vs svrepdesign — 3 tests

All 43 tests pass with tight tolerances (ATT rtol=1e-4, SE rtol=1%).
No deviations from R found.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

Overall Assessment

✅ Looks good

Executive Summary

  • No estimator implementation files changed; the diff adds an R benchmark generator, a committed golden JSON fixture, and Python cross-validation tests.
  • I cross-checked the affected methodology contracts against the registry for DifferenceInDifferences survey TSL, CallawaySantAnna weighted reg with base_period="varying"/never-treated controls, and BRR replicate-weight variance. I did not find an undocumented estimator deviation in the exercised implementation paths.
  • The main weakness is Tier 2: the fixture stores R SEs, but the tests only assert ATT parity, so the new suite does not fully lock down CallawaySantAnna survey variance in the non-covariate case.
  • The covariate-adjusted survey reg SE divergence from R is already documented in the methodology registry, so I treated that as informational rather than a defect.
  • Static review only: I could not execute the new module in this environment because pytest and the Python data stack are unavailable.

Methodology

Cross-check summary: the benchmark setup is consistent with the library’s documented contracts for never-treated encoding (first_treat=0 or Inf) and replicate-weight survey d.f. in docs/methodology/REGISTRY.md:L287-L290, docs/methodology/REGISTRY.md:L390-L424, docs/methodology/REGISTRY.md:L2191-L2245, diff_diff/staggered.py:L1214-L1216, diff_diff/staggered.py:L1323-L1325, and diff_diff/survey.py:L582-L610.

  • P2 Impact: Tier 2 exports overall_se and gt_se from R, but the Python side only checks ATT parity plus a weak smoke test that SE is finite and differs from the unweighted run. That means this PR does not actually cross-validate CallawaySantAnna survey variance for the non-covariate reg case, despite having the golden values available. Concrete fix: add an explicit result.overall_se vs r["overall_se"] assertion for cs_weighted_nocov; if cs_weighted_cov remains ATT-only, make that exception explicit and tie it to the documented registry note. Locations: tests/test_survey_r_crossvalidation.py:L293-L350, benchmarks/R/benchmark_survey_crossvalidation.R:L400-L417.
  • P3 Impact: The survey reg+covariates SE difference from R is already documented, so I did not treat the lack of SE parity for cs_weighted_cov as a defect. Concrete fix: add a short inline comment in the new test pointing to the documented deviation so future reviewers do not read the omission as accidental. Locations: docs/methodology/REGISTRY.md:L423-L424, tests/test_survey_r_crossvalidation.py:L242-L350.

Code Quality

  • P3 Impact: _load_scenario_data() silently maps any non-numeric first_treat token to 0, not just R’s "Inf". A malformed fixture would therefore be reinterpreted as never-treated instead of failing loudly, which could change control-group composition inside the benchmark. Concrete fix: explicitly map "Inf"/"Infinity" to 0 and raise on any other non-numeric value. Location: tests/test_survey_r_crossvalidation.py:L70-L75.

Performance

No findings.

Maintainability

  • P3 Impact: Tier 2 compares R group-time effects positionally after sorting only the Python side, while the generator exports gt_results in whatever order did::att_gt() returns. If fixture generation is rerun under a did version with different output ordering, the comparison becomes fragile for no methodological reason. Concrete fix: sort gt_results by (group, time) before export and/or compare keyed (group, time) -> effect mappings on both sides. Locations: benchmarks/R/benchmark_survey_crossvalidation.R:L381-L387, tests/test_survey_r_crossvalidation.py:L279-L291.

Tech Debt

No new untracked blocker found. Related survey items are already marked resolved in TODO.md:L57-L59.

Security

No findings. The added fixture appears synthetic, and I did not see secret-like material.

Documentation/Tests

  • P3 Impact: The benchmark script prints agg$overall.se as SE(naive), which is misleading when manually regenerating benchmarks because that value is the reported aggregation SE, not a naive unweighted reference SE. Concrete fix: relabel the message to SE or overall SE. Location: benchmarks/R/benchmark_survey_crossvalidation.R:L420-L422.

Verification note: I could not run tests/test_survey_r_crossvalidation.py here because this environment does not have pytest or pandas.

- P2: Add docstring explaining why Tier 2 compares ATT only (R's
  did::att_gt gives naive SEs, not design-based)
- P3: Fail loudly on unexpected non-numeric first_treat values
  instead of silently mapping to 0
- P3: Sort R gt_results by (group, time) and use keyed comparison
  in Python for stable cross-version matching
- P3: Relabel misleading "SE(naive)" to "SE" in R console output

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 2ba616a0ddd046bc673a537328be8f11800945aa


Overall Assessment

✅ Looks good

Executive Summary

  • No estimator implementation files changed; this PR adds an R benchmark generator, a committed JSON fixture, and Python cross-validation tests for survey/Taylor-linearized and BRR paths.
  • Re-review: the prior issues around permissive first_treat parsing, unstable (group, time) ordering, and the misleading benchmark print label are addressed in tests/test_survey_r_crossvalidation.py:L61-L94 and benchmarks/R/benchmark_survey_crossvalidation.R:L381-L423.
  • I cross-checked the exercised methodology against the registry for survey TSL, Callaway-Sant’Anna base_period="varying" / never-treated controls, and BRR replicate-weight df. I did not find any unmitigated P0/P1 mismatch in the changed code paths.
  • One P3 remains: Tier 2 now correctly documents ATT-only coverage, but cs_weighted_nocov still exports an R overall_se without asserting it, so that specific weighted-reg SE path is not locked down.
  • Static review only: I could not execute the tests in this environment because pandas is unavailable.

Methodology

Code Quality

Performance

  • No findings.

Maintainability

Tech Debt

  • No new untracked blocker. The related survey items from earlier reviews are already recorded as resolved in TODO.md:L57-L59.

Security

  • No findings. The committed fixture appears synthetic, and I did not see secret-like material in the changed files.

Documentation/Tests

  • No additional findings beyond the Tier 2 SE-parity gap above.
  • Runtime verification was not possible here because the environment lacks pandas, so this review is based on source inspection rather than local test execution.

@igerber igerber merged commit a6a3325 into main Apr 1, 2026
13 checks passed
@igerber igerber deleted the survey-r-crossvalidation branch April 1, 2026 18:20
@igerber igerber mentioned this pull request Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant