
Phase 3: Survey design support for OLS-based standalone estimators#226

Merged
igerber merged 24 commits into main from survey-enhancements
Mar 22, 2026

Conversation

@igerber
Owner

igerber commented Mar 21, 2026

Summary

  • Add survey_design parameter to fit() for 6 OLS-based standalone estimators: StackedDiD, SunAbraham, BaconDecomposition, TripleDifference (reg method), ContinuousDiD, and EfficientDiD
  • Implement weighted estimation with Taylor Series Linearization (TSL) variance using strata, PSU, and FPC for design-based inference
  • Add survey_metadata field to all 6 results classes with summary() and to_dict() support
  • Defer TripleDifference IPW/DR and bootstrap+survey combinations with NotImplementedError guards (pending Phase 5: weighted solve_logit())
  • Update survey roadmap with Phase 3 completion status and explicit deferred-work table
  • Add REGISTRY.md survey documentation for all 6 estimators
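For context on the TSL machinery these bullets reference: the design-based variance is the stratum-by-stratum, between-PSU variability of weighted score totals, with an optional finite population correction. The sketch below is purely illustrative — `tsl_variance` and its signature are invented here and are not the library's `compute_survey_vcov` API:

```python
import numpy as np

def tsl_variance(scores, weights, strata, psu, fpc=None):
    """Taylor Series Linearization variance of a weighted total of `scores`.

    scores  : per-observation linearization variables (influence scores)
    weights : survey weights
    strata  : stratum label per observation
    psu     : primary sampling unit label per observation
    fpc     : optional finite population correction factor per stratum,
              passed here as a dict {stratum: fpc}; omitted -> no FPC
    """
    z = np.asarray(scores, float) * np.asarray(weights, float)
    strata = np.asarray(strata)
    psu = np.asarray(psu)
    var = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        # PSU totals of the weighted scores within this stratum
        psus = np.unique(psu[in_h])
        totals = np.array([z[in_h & (psu == p)].sum() for p in psus])
        n_h = len(psus)
        if n_h < 2:
            continue  # single-PSU stratum: no between-PSU variance estimable
        f_h = fpc.get(h, 0.0) if fpc else 0.0
        var += (1.0 - f_h) * n_h / (n_h - 1) * ((totals - totals.mean()) ** 2).sum()
    return var
```

With one stratum and each observation its own PSU, this reduces to the familiar n/(n−1) scaled sum of squared deviations of the weighted scores.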

Estimator-specific details

| Estimator | Survey Support | Key Implementation |
| --- | --- | --- |
| StackedDiD | Full | Q-weight × survey-weight multiplicative composition; TSL vcov on composed weights |
| SunAbraham | Full | Survey weights in LinearRegression + weighted within-transform |
| BaconDecomposition | Diagnostic | Weighted cell means, within-transform, group shares (no inference) |
| TripleDifference | Reg only | Weighted per-cell OLS + TSL on combined influence functions |
| ContinuousDiD | Analytical | Weighted B-spline OLS + TSL on influence functions |
| EfficientDiD | Analytical | Weighted means/covariances in Omega* + TSL on EIF scores |
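The StackedDiD row's "Q-weight × survey-weight multiplicative composition" amounts to forming per-row composed weights and running WLS on them. A minimal sketch under assumed names (`q_w` and `svy_w` are illustrative arrays, not the estimator's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)

q_w = rng.uniform(0.5, 1.0, size=n)    # stack/event weights (illustrative)
svy_w = rng.uniform(1.0, 3.0, size=n)  # survey design weights (illustrative)

w = q_w * svy_w                        # multiplicative composition
# WLS via sqrt-weight scaling; the TSL vcov would then run on these
# same composed weights.
sw = np.sqrt(w)
beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```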

Methodology references (required if estimator / math changes)

  • Method name(s): Taylor Series Linearization (TSL) variance for survey-weighted DiD estimators
  • Paper / source link(s): Implemented per Phases 1-2 infrastructure from PR Add survey data support (Phases 1-2) #218
  • Any intentional deviations from the source (and why):
    • Note: Bacon decomposition with survey weights is diagnostic; exact-sum guarantee is approximate (Goodman-Bacon 2021 theorem derived for unweighted TWFE)
    • Note: TripleDifference IPW/DR with survey weights deferred until weighted solve_logit() (Phase 5)
    • Note: Bootstrap + survey interaction deferred for SunAbraham, ContinuousDiD, EfficientDiD (Phase 5)

Validation

  • Tests added/updated: tests/test_survey_phase3.py (31 new tests covering smoke, uniform-weight invariance, scale invariance, metadata, summary output, NotImplementedError guards)
  • All 2015 existing tests pass (full suite, no regressions)
  • All 453 affected estimator + survey tests pass

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⚠️ Needs changes

I cross-checked the changed survey paths against the Methodology Registry and the estimator implementations. The documented Phase 5 deferrals and documented Bacon diagnostic deviation are fine; the blockers below are all undocumented survey-path mismatches or missing assumption checks.

Executive Summary

  • Affected methods: ContinuousDiD, EfficientDiD, TripleDifference, and StackedDiD.
  • The documented deferrals in docs/methodology/REGISTRY.md:L497-L498, docs/methodology/REGISTRY.md:L673-L674, docs/methodology/REGISTRY.md:L1234-L1235, and docs/methodology/REGISTRY.md:L1016-L1017 are properly documented/guarded and are not blockers.
  • EfficientDiD’s survey SE helper misapplies TSL to unit-level EIFs by expanding them back to panel rows, which breaks the documented weights-only / stratified-no-PSU survey paths.
  • ContinuousDiD’s survey support is partial: aggregate="eventstudy" still uses sample-count aggregation and non-survey analytical SEs, and the unit-level survey collapse silently assumes within-unit survey fields are constant.
  • TripleDifference computes survey SEs but still uses ordinary sample df for p-values/CIs, and StackedDiD crashes on valid no-weight SurveyDesign inputs.

Methodology

  • Severity: P1. Impact: EfficientDiD’s survey SE helper repeats each unit-level EIF across every period and then calls compute_survey_vcov on the panel-level resolved design. That is only harmless when PSUs are explicit and repeated across time; for the documented weights-only / stratified-no-PSU paths it treats each period copy as a separate implicit PSU, so the returned survey SEs are too small. Concrete fix: collapse the survey design to one row per unit before TSL, as ContinuousDiD already does, and use that unit-level design for both ATT(g,t) and aggregated SEs. Locations: diff_diff/efficient_did.py:L476-L484, diff_diff/efficient_did.py:L644-L675, docs/methodology/REGISTRY.md:L673-L674, diff_diff/continuous_did.py:L1047-L1099.

  • Severity: P1. Impact: ContinuousDiD accepts aggregate="eventstudy" with survey_design, but that path still aggregates by raw n_treated counts and computes analytical event-study SEs with sqrt(sum(IF^2)) instead of TSL. So survey-weighted event-study estimates/inference are not the documented estimator. Concrete fix: either make event-study aggregation use survey-weighted treated mass plus unit-level compute_survey_vcov, or explicitly reject and document this combination instead of silently returning sample-weighted event-study results. Locations: diff_diff/continuous_did.py:L385-L388, diff_diff/continuous_did.py:L933-L962, diff_diff/continuous_did.py:L519-L564, docs/methodology/REGISTRY.md:L497-L498.

  • Severity: P1. Impact: ContinuousDiD and EfficientDiD both collapse observation-level survey information to the first row per unit without validating the assumption that those fields are constant within unit. If panel weights (or other design columns) vary by wave, the code silently computes arbitrary weighted means/covariances/SEs. Concrete fix: validate within-unit constancy for every supplied survey field (weights, strata, psu, fpc) before collapsing to unit level, and raise otherwise. Locations: diff_diff/continuous_did.py:L669-L682, diff_diff/efficient_did.py:L320-L336.

  • Severity: P1. Impact: TripleDifference computes the survey SE via TSL, then immediately falls back to df = n_obs - 8 - p for p-values/CIs. The survey registry says survey t-inference should use df = n_PSU - n_strata, so the current code returns wrong survey p-values and confidence intervals. Concrete fix: when survey_design is active, use resolved_survey.df_survey / survey_metadata.df_survey with the same nonpositive-df fallback logic used in LinearRegression. Locations: diff_diff/triple_diff.py:L579-L585, docs/methodology/REGISTRY.md:L1912-L1918.
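The collapse-to-unit-level fix suggested for EfficientDiD above can be sketched with pandas; the frame and column names here are hypothetical, not project code:

```python
import pandas as pd

panel = pd.DataFrame({
    "unit":    [1, 1, 2, 2, 3, 3],
    "period":  [0, 1, 0, 1, 0, 1],
    "weight":  [2.0, 2.0, 1.5, 1.5, 1.0, 1.0],
    "stratum": ["a", "a", "a", "a", "b", "b"],
    "psu":     [10, 10, 11, 11, 20, 20],
})

# One survey row per unit, in sorted-unit order, so the design arrays
# align with unit-level influence functions rather than panel rows.
unit_design = (panel.sort_values("unit")
                    .groupby("unit", as_index=False)
                    .first())
```

Feeding `unit_design` (rather than the panel-row design) into the TSL step avoids treating each period copy of a unit as a separate implicit PSU.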
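The within-unit constancy validation suggested for ContinuousDiD/EfficientDiD could look roughly like this (`check_unit_constant` is a made-up helper, not the library's API):

```python
import pandas as pd

def check_unit_constant(df, unit_col, survey_cols):
    """Raise if any survey design column varies within a unit."""
    for col in survey_cols:
        n_distinct = df.groupby(unit_col)[col].nunique(dropna=False)
        bad = n_distinct[n_distinct > 1]
        if not bad.empty:
            raise ValueError(
                f"survey column {col!r} varies within units {list(bad.index)}; "
                "unit-level collapse would be arbitrary"
            )
```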
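The survey degrees-of-freedom rule cited for TripleDifference (df = n_PSU − n_strata, with a model-based fallback when that is nonpositive) is simple to state in code; the function name and fallback floor are illustrative:

```python
def survey_df(n_psu, n_strata, n_obs, n_params):
    """Design-based degrees of freedom with the usual fallback.

    Survey t-inference uses df = n_PSU - n_strata; when that is not
    positive (degenerate designs), fall back to the model-based
    df = n_obs - n_params, floored at 1.
    """
    df = n_psu - n_strata
    if df <= 0:
        df = max(n_obs - n_params, 1)
    return df
```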

Code Quality

  • Severity: P1. Impact: StackedDiD assumes every SurveyDesign has a weights column. Valid strata/PSU/FPC-only designs will hit stacked_df[None] both when composing weights and when recomputing metadata, so this new public parameter errors on a supported design shape. Concrete fix: if survey_design.weights is None, use all-ones stacked weights/raw weights instead of indexing a missing column, and add a no-weight survey smoke test. Locations: diff_diff/stacked_did.py:L369-L375, diff_diff/stacked_did.py:L433-L435.
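The no-weight guard suggested for StackedDiD amounts to substituting unit weights when the design carries no weight column. A hedged sketch (`resolve_weights` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def resolve_weights(df, weight_col):
    """Survey weights for a design; weight-less designs
    (strata/PSU/FPC-only) get uniform ones instead of an
    errant df[None] lookup."""
    if weight_col is None:
        return np.ones(len(df))
    return df[weight_col].to_numpy(dtype=float)
```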

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings in the reviewed diff.

Tech Debt

  • No existing TODO.md entry covers the P1 items above; the PR’s documented Phase 5 deferrals are the only survey limitations I would treat as mitigated. See TODO.md:L44-L56.

Security

  • No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • Static review only: I did not rerun the suite here because the shell Python in this environment cannot import the project dependencies (numpy was unavailable).

  • Severity: P3. Impact: tests/test_survey_phase3.py only covers weight-bearing survey designs for StackedDiD/EfficientDiD and only smoke-tests ContinuousDiD; it never exercises ContinuousDiD(..., aggregate="eventstudy", survey_design=...), SurveyDesign without weights, or rejection of time-varying within-unit survey fields. That gap is why the regressions above can slip through. Concrete fix: add targeted regression tests for those combinations. Locations: tests/test_survey_phase3.py:L248-L329, tests/test_survey_phase3.py:L503-L553, tests/test_survey_phase3.py:L561-L625, tests/test_survey_phase3.py:L84-L108.

Path to Approval

  1. Replace EfficientDiD’s panel-level EIF expansion with a unit-level ResolvedSurveyDesign TSL path, and add weights-only + stratified-no-PSU regression tests.
  2. Make ContinuousDiD’s survey event-study aggregation/inference fully survey-aware, or reject/document that combination instead of silently using sample-weighted event-study logic.
  3. Add explicit within-unit validation for survey weights/strata/psu/fpc in the unit-level survey estimators (at least ContinuousDiD and EfficientDiD).
  4. Use survey degrees of freedom in TripleDifference survey inference.
  5. Make StackedDiD handle SurveyDesign objects without a weight column and add a dedicated no-weight survey test.

igerber added a commit that referenced this pull request Mar 21, 2026
- EfficientDiD: collapse survey to unit-level before TSL (was expanding
  EIF to panel rows, making SEs too small for weights-only path)
- ContinuousDiD: survey-weighted event-study aggregation and TSL SEs
- ContinuousDiD/EfficientDiD: validate within-unit constancy for survey
  columns before collapsing to unit level
- TripleDifference: use survey df (n_PSU - n_strata) for p-values/CIs
  instead of sample df (n_obs - 8)
- StackedDiD: handle SurveyDesign without weights column (use ones)
- Pass survey df to safe_inference() in all survey-aware estimators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 9f48bc3bb96c8d6e4a00641f019fc88771086dda


Overall Assessment
⚠️ Needs changes

Executive Summary

  • Re-review status: the prior blockers on unit-constant survey validation, EfficientDiD’s panel-row TSL collapse, TripleDifference’s switch to survey df, and StackedDiD’s no-weight survey crash appear addressed.
  • The documented Phase 5 deferrals and the Bacon diagnostic note in the Methodology Registry are properly recorded and are not blockers.
  • ContinuousDiD’s new survey analytical path is not mathematically consistent with compute_survey_vcov(): it feeds already 1/n-scaled IF objects into the TSL sandwich and keeps panel-row survey df after collapsing to unit level.
  • EfficientDiD’s collapsed survey-row index is not aligned to all_units, so unsorted input can attach the wrong weights/strata/PSUs/FPCs to unit-level EIFs.
  • [Newly identified] SunAbraham still aggregates survey-weighted cohort effects with raw observation counts, so overall/event-study effects do not reflect survey-weighted cohort composition.
  • [Newly identified] StackedDiD computes survey TSL SEs but still uses normal-reference inference instead of df = n_PSU - n_strata.

Methodology
Re-review status: the earlier P1s called out in the last review are largely resolved, but the survey paths below are still not methodologically correct.

  • Severity: P1. Impact: ContinuousDiD’s survey SE path uses IF arrays that are already 1/n_t/1/n_c scaled for the library’s unweighted sqrt(sum(IF_i^2)) convention, then passes them into compute_survey_vcov(), which applies another bread term. The code’s own comment says those IFs already contain the scaling, so the survey SEs/CIs/p-values for overall ATT/ACRT, ATT(d), ACRT(d), and event-study bins are not for the documented estimator; on top of that, the survey df still comes from the original panel-level resolve rather than the collapsed unit-level design actually used for TSL. Concrete fix: rebuild the survey linearization variables on the unscaled score/residual scale expected by compute_survey_vcov(), construct the unit-level ResolvedSurveyDesign once, and use that unit-level object’s df_survey for all survey inference. Refs: diff_diff/continuous_did.py:L513, diff_diff/continuous_did.py:L1097, diff_diff/continuous_did.py:L1129, diff_diff/continuous_did.py:L1183, diff_diff/survey.py:L531.

  • Severity: P1. Impact: EfficientDiD’s unit-level survey collapse is still inconsistent. all_units is sorted, but _unit_first_panel_row is built in first-appearance order, and those positions are later used to subset weights/strata/PSUs/FPCs for the unit-level TSL helper. If the input rows are not already sorted by unit, the survey design arrays are permuted relative to the EIF vector, yielding incorrect survey SEs. The same path also stores self._survey_df from the original panel-level resolve, so weights-only/stratified-no-PSU designs use panel-row df rather than unit-level df. Concrete fix: build the first-row mapping keyed by all_units, create one unit-level ResolvedSurveyDesign aligned to the EIF order, and take df_survey from that collapsed design. Refs: diff_diff/efficient_did.py:L209, diff_diff/efficient_did.py:L303, diff_diff/efficient_did.py:L669.

  • Severity: P1. Impact: [Newly identified] SunAbraham’s new survey_design path stops at the saturated regression. _compute_iw_effects() and _compute_overall_att() still aggregate with raw observation counts, so survey-weighted cohort effects are collapsed with unweighted cohort composition. That changes the estimand whenever weights vary across cohorts or event times. Concrete fix: thread survey weights into _compute_iw_effects() and _compute_overall_att() and replace .size() / len(...) with treated weighted mass (sum(w)) for w_{g,e} and post-period aggregation weights. Refs: diff_diff/sun_abraham.py:L892, diff_diff/sun_abraham.py:L970.

  • Severity: P1. Impact: [Newly identified] StackedDiD now computes survey TSL SEs, but it still calls safe_inference() without a survey df for event-study coefficients and the overall ATT. The survey registry says survey t-inference should use df = n_PSU - n_strata; current p-values/CIs will therefore be too liberal in small-PSU designs. Concrete fix: when survey_design is active, pass survey_metadata.df_survey into both safe_inference() calls, with the same nonpositive-df fallback used in LinearRegression. Refs: diff_diff/stacked_did.py:L462, diff_diff/stacked_did.py:L492, docs/methodology/REGISTRY.md:L1912.
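The count-versus-weighted-mass distinction flagged for SunAbraham can be shown on a toy cohort table: aggregating cell effects by raw observation counts versus by survey-weighted mass gives different overall effects whenever average weights vary across cells. All values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical cohort/event-time cells with survey weights; the
# aggregation shares w_{g,e} should use treated weighted mass,
# not raw counts.
cells = pd.DataFrame({
    "cohort": [2000, 2000, 2001],
    "effect": [1.0, 2.0, 4.0],
    "n":      [10, 10, 10],          # raw observation counts
    "w_mass": [30.0, 10.0, 10.0],    # sum of survey weights in the cell
})

count_avg = np.average(cells["effect"], weights=cells["n"])       # unweighted composition
survey_avg = np.average(cells["effect"], weights=cells["w_mass"]) # survey-weighted composition
```

With equal counts but unequal weighted mass, the two aggregates diverge (7/3 versus 1.8 here), which is exactly the estimand change the review describes.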

Code Quality
No additional material findings beyond the methodology blockers above.

Performance
No material findings in the reviewed diff.

Maintainability
No material findings in the reviewed diff.

Tech Debt
No relevant TODO.md entry mitigates the P1 items above, and these would not become mergeable via TODO tracking anyway because they are correctness/inference bugs.

Security
No security or secret-handling issues found in the reviewed diff.

Documentation/Tests
Static review only: I could not run the test suite here because pytest is not installed in this environment, and the shell Python also lacks numpy.

Path to Approval

  1. Rework ContinuousDiD’s survey analytical path so the objects passed to compute_survey_vcov() are unscaled linearization variables, then source survey df from the collapsed unit-level design used for TSL.
  2. Rebuild EfficientDiD’s unit_first_panel_row mapping in all_units order, construct a single aligned unit-level ResolvedSurveyDesign, and use that same object for both SE computation and survey df.
  3. Thread survey weights through SunAbraham’s _compute_iw_effects() and _compute_overall_att() so dynamic and overall effects aggregate by weighted treated mass rather than raw counts.
  4. Pass survey_metadata.df_survey into StackedDiD’s event-study and overall safe_inference() calls when survey inference is active, with the project-standard fallback for nonpositive df.
  5. Add regression tests for: a shuffled-unit EfficientDiD input, a SunAbraham dataset where survey weights materially change cohort composition, a collapsed-panel survey-df check for ContinuousDiD/EfficientDiD, and a small-PSU StackedDiD case where survey-df inference differs from normal-reference inference.

igerber added a commit that referenced this pull request Mar 21, 2026
- ContinuousDiD: rescale IFs by n_units before compute_survey_vcov to
  avoid double-counting 1/n bread; use unit-level df_survey
- EfficientDiD: align unit_first_panel_row to sorted all_units order;
  build unit-level ResolvedSurveyDesign once in fit(); use unit-level df
- SunAbraham: thread survey weights into _compute_iw_effects and
  _compute_overall_att for survey-weighted cohort aggregation
- StackedDiD: pass survey df to safe_inference for event-study and
  overall ATT p-values/CIs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bb3b64a3e884975c5e19af686ee39764e6276304


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the earlier blockers on EfficientDiD unit-order alignment, StackedDiD survey-df inference, and SunAbraham survey-weighted aggregation appear addressed in diff_diff/efficient_did.py:L307, diff_diff/stacked_did.py:L462, and diff_diff/sun_abraham.py:L907.
  • The documented Phase 5 deferrals and documented methodology notes are not blockers here, including the survey deferred-work table and Bacon’s diagnostic-only survey note in docs/survey-roadmap.md:L28, docs/methodology/REGISTRY.md:L1235, and docs/methodology/REGISTRY.md:L1572.
  • ContinuousDiD’s new survey analytical path still linearizes the old count-based formulas rather than the new weighted estimator, so its reported survey SEs, CIs, and p-values are not the documented TSL for the weighted B-spline estimator.
  • TripleDifference’s survey RA path still re-expands the three pairwise influence functions with unweighted n/n_j factors instead of weighted subset mass, so the combined survey TSL is not for the weighted DDD estimand when subgroup average weights differ.
  • The added tests are mostly smoke/guard coverage; they do not pin either remaining weighted-linearization bug, and I could not execute the suite here because pytest, numpy, and pandas are unavailable.

Methodology

Code Quality
No material findings beyond the methodology blockers above.

Performance
No material findings in the reviewed diff.

Maintainability
No material findings in the reviewed diff.

Tech Debt
The remaining P1 issues are not tracked under TODO.md:L44, and they would not be mitigated by TODO tracking even if they were because undocumented methodology mismatches and incorrect survey SE paths remain approval blockers.

Security
No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • Severity: P3. Impact: tests/test_survey_phase3.py exercises smoke cases, guardrails, and finite outputs for the new survey paths, but it does not contain an oracle-style regression that would fail on the two remaining weighted-linearization bugs. That means this PR can still pass its new coverage while shipping incorrect survey inference in ContinuousDiD and TripleDifference. Concrete fix: add one ContinuousDiD fixture with unequal unit weights and a direct weighted-linearization oracle for analytical survey SEs, plus one TripleDifference fixture where mean weights differ across {3,4}, {2,4}, and {1,4} so count-scaling and weighted-mass scaling produce different inference. Refs: tests/test_survey_phase3.py:L396, tests/test_survey_phase3.py:L503, tests/test_survey_phase3.py:L707.
  • Static review only: I could not run the test suite here because pytest, numpy, and pandas are not installed in this environment.

Path to Approval

  1. Rework ContinuousDiD’s survey analytical path so the objects passed to compute_survey_vcov() are score-scale linearization variables for the weighted control mean and weighted WLS estimator, using weighted treated/control masses and weighted treated derivative averages throughout overall, dose-response, and event-study inference.
  2. Rework TripleDifference’s survey RA combination step so each padded pairwise IF is placed on the full-sample weighted scale before TSL, replacing the current n/n_j count factors with weighted-mass scaling.
  3. Add targeted regression tests for both fixes: one weighted-ContinuousDiD analytical oracle and one weighted-TripleDifference case where the current count-based subset scaling measurably disagrees with weighted-mass scaling.

igerber added a commit that referenced this pull request Mar 21, 2026
…nce from PR #226 review (round 3)

- ContinuousDiD: store survey-weighted treated/control masses and
  weighted dpsi_bar in bootstrap_info; use weighted masses for p_1, p_0,
  n_total in IF construction so TSL linearizes the weighted estimator
- TripleDifference: use survey-weighted subgroup mass (sum(w_sub))
  instead of raw counts (n_sub) for pairwise IF combination weights
  w3, w2, w1 when survey design is active

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 11660240c427f80dc4af5c7ad172d4018361fdc1


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the earlier blockers on EfficientDiD unit-order alignment, StackedDiD survey-d.f. inference, and TripleDifference’s survey subset scaling appear addressed in diff_diff/efficient_did.py:L307-L343, diff_diff/stacked_did.py:L462-L505, and diff_diff/triple_diff.py:L1027-L1053.
  • ContinuousDiD still has an unmitigated P1 in the new survey analytical path: the weighted B-spline score is normalized by weighted treated/control mass downstream, but the stored WLS bread is still normalized by the raw treated count. That leaves the survey linearization for ACRT_glob, ATT(d), and ACRT(d) mis-scaled whenever average survey weights differ across treated/control subsets.
  • The documented Phase 5 deferrals and documented methodology notes in the registry/roadmap are not blockers here, including the survey support notes for StackedDiD, TripleDifference, Bacon, EfficientDiD, and ContinuousDiD.
  • ContinuousDiD and EfficientDiD now expose survey_metadata, but for these panel estimators it is still computed from panel-level arrays even though inference is run on unit-collapsed survey designs, so the reported effective_n, n_psu, and df_survey can be misleading.
  • tests/test_survey_phase3.py adds useful smoke and guard coverage, but it still does not pin the remaining ContinuousDiD linearization bug or the panel-metadata mismatch. Static review only: pytest, numpy, and pandas are unavailable in this environment.

Methodology

  • Severity: P1. Impact: ContinuousDiD’s survey analytical path still mixes two different normalizations for the weighted spline regression. The stored bread is built as (Psi' W Psi / n_treated_raw)^-1, but the downstream beta perturbations divide by weighted treated/control mass after n_t/n_c are replaced with w_treated/w_control. Because the bread and score pieces are no longer on the same scale, the survey IF for the weighted WLS step is mis-scaled whenever subgroup-average weights are not 1. This affects survey SEs, CIs, and p-values for ACRT_glob, ATT(d), and ACRT(d); ATT_glob/event-study are largely fixed by the new weighted-mass updates. Concrete fix: use a single normalization convention end-to-end for the weighted spline score. Either keep both bread and per-unit score terms on the raw-count normalization, or switch both to weighted-mass normalization; do not mix them. Refs: diff_diff/continuous_did.py:L949-L955, diff_diff/continuous_did.py:L1097-L1103, diff_diff/continuous_did.py:L1130-L1147, docs/methodology/REGISTRY.md:L492-L503.
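The mixed-normalization problem is easy to demonstrate in miniature: in a WLS influence-function construction, any normalization constant cancels only when the bread and the per-unit scores share it. The check below uses a generic WLS, not the estimator's spline machinery, and all names are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0.5, 3.0, size=n)
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
u = y - X @ beta

def influence(c):
    """Per-unit WLS influence functions with normalization constant c
    applied to BOTH the bread and the per-unit scores, so c cancels."""
    bread = np.linalg.inv(X.T @ (w[:, None] * X) / c)
    scores = (w * u)[:, None] * X / c
    return scores @ bread.T

if_n = influence(n)           # raw-count normalization
if_mass = influence(w.sum())  # weighted-mass normalization
# Consistent normalization: the two IF matrices are identical. Using
# raw-count bread with weighted-mass scores would scale every IF by
# n / w.sum(), which is the mis-scaling described above.
```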

Code Quality

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings beyond the metadata inconsistency above.

Tech Debt

  • The P1 methodology issue above is not tracked under TODO.md:L34-L45, so it is not mitigated by the project’s deferred-work policy.

Security

  • No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • Severity: P3. Impact: the new survey Phase 3 suite mostly covers smoke paths, invariance, summaries, and guardrails. It does not include an unequal-weight analytical oracle that would fail on the remaining ContinuousDiD beta-linearization mismatch, and it does not assert that panel-estimator survey_metadata is unit-level rather than panel-level. Concrete fix: add one unequal-weight ContinuousDiD fixture that checks analytical ACRT_glob/ATT(d)/ACRT(d) SEs against directly assembled weighted linearization scores, and add one panel-metadata assertion for unit-level effective_n/df_survey. Refs: tests/test_survey_phase3.py:L503-L553, tests/test_survey_phase3.py:L707-L783.
  • Static review only: I could not run the tests here because pytest, numpy, and pandas are not installed in this environment.

Path to Approval

  1. In ContinuousDiD, make the weighted spline linearization internally consistent by using the same normalization in both the stored bread and the downstream per-unit beta perturbations. The current raw-count bread plus weighted-mass score denominators need to be unified.
  2. Add a targeted unequal-weight regression test for ContinuousDiD analytical survey inference that would fail under the current mixed normalization, covering at least ACRT_glob, ATT(d), and ACRT(d).

igerber added a commit that referenced this pull request Mar 21, 2026
… from PR #226 review (round 4)

- ContinuousDiD: normalize WLS bread by weighted treated mass (not raw
  count) for consistency with downstream IF score denominators; fixes
  ACRT_glob/ATT(d)/ACRT(d) survey SEs when subgroup-average weights differ
- ContinuousDiD/EfficientDiD: recompute survey_metadata from unit-level
  ResolvedSurveyDesign so reported effective_n/n_psu/df_survey match
  the inference actually run (not the panel-level overcount)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 67a87e3366df27f11c9ab93c1f1975919cb98dd6


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The main blockers from the prior review appear resolved: ContinuousDiD now uses a consistent weighted-mass normalization in its survey analytical path, EfficientDiD now aligns unit order and reports unit-level survey metadata, StackedDiD now applies survey d.f. in inference, and TripleDifference now rescales the combined IF using survey-weighted subgroup mass. Refs: diff_diff/continuous_did.py:L959, diff_diff/continuous_did.py:L1111, diff_diff/efficient_did.py:L307, diff_diff/efficient_did.py:L679, diff_diff/stacked_did.py:L462, diff_diff/triple_diff.py:L1027
  • Remaining blocker 1: EfficientDiD’s survey path rebuilds unit_level_weights from the raw column and renormalizes it unconditionally, while the TSL path uses resolved_survey.weights. For SurveyDesign(weight_type="fweight"), estimation and inference now use different weight semantics, which conflicts with the project’s documented survey-weight handling.
  • Remaining blocker 2: the new exact-weight Bacon path indexes a NumPy weight array with DataFrame labels (w[df_22.index]), so non-RangeIndex inputs can error or silently misalign weights. This regression affects weights="exact" even without survey weights.
  • The Phase 5 deferrals and the Bacon diagnostic note are properly documented in the registry/roadmap and are not blockers.
  • Static review only: I could not run pytest here because the sandbox Python environment is missing pandas/numpy.

Methodology

  • Severity: P1. Impact: EfficientDiD’s new survey implementation is internally inconsistent for supported SurveyDesign(weight_type="fweight") inputs. The estimation path uses a hand-built, always-renormalized unit_level_weights, but the TSL path uses resolved_survey.weights[row_idx], which preserves raw fweight semantics. That means Omega*, generated outcomes, cohort fractions, and EIF construction can be based on different weights than the variance calculation, producing incorrect efficient weights, ATTs, and SEs for frequency-weighted survey designs. This is not covered by any documented registry deviation. Concrete fix: derive the unit-level estimation weights directly from resolved_survey.weights[row_idx], or at minimum branch on survey_weight_type and skip renormalization for fweight; then add a regression test with unequal integer fweights. Refs: diff_diff/efficient_did.py:L323, diff_diff/efficient_did.py:L374, diff_diff/survey.py:L117, docs/methodology/REGISTRY.md:L1846
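The fweight concern can be illustrated with a weighted mean: frequency weights mean "this row stands for w copies", so point estimates survive renormalization but any variance or df formula keyed on sum(w) does not. Toy numbers, not project code:

```python
import numpy as np

y = np.array([1.0, 2.0, 5.0])
fw = np.array([3, 1, 2])  # frequency weights: each row stands for fw copies

expanded = np.repeat(y, fw)

# Point estimate: the fweighted mean equals the mean of the expanded data.
wmean = np.average(y, weights=fw)

# Under fweight semantics the implied sample size is n = sum(fw), and the
# variance of the mean uses it directly...
n_f = fw.sum()
var_f = ((fw * (y - wmean) ** 2).sum() / (n_f - 1)) / n_f

# ...renormalizing to mean 1 keeps wmean but changes sum(w), so any
# variance/df formula keyed on sum(w) silently shifts.
fw_norm = fw / fw.mean()
```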

Code Quality

  • Severity: P1. Impact: the new weighted exact-Bacon path uses w[df_22.index] inside _recompute_exact_weights. Because w is a position-based NumPy array and df_22.index is label-based, filtered or user-supplied non-RangeIndex DataFrames can raise out-of-bounds errors or pull the wrong weights. This is a production edge-case regression in a changed code path. Concrete fix: either reset the working index once in fit() or align weights with a Series keyed on df.index before subsetting. Add a regression test using a non-default index and weights="exact". Refs: diff_diff/bacon.py:L744
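The label-versus-position hazard in the w[df_22.index] pattern is reproducible in a few lines; the frame and weights below are invented for illustration:

```python
import numpy as np
import pandas as pd

w = np.array([10.0, 20.0, 30.0, 40.0])                      # one weight per row, by position
df = pd.DataFrame({"g": [0, 1, 0, 1]}, index=[4, 5, 6, 7])  # e.g. a filtered frame whose
                                                            # index was never reset
sub = df[df["g"] == 1]  # labels 5 and 7

# Bug pattern: numpy indexes by POSITION, so labels 5 and 7 are out of
# bounds here (and silently misaligned when they happen to be in bounds).
try:
    bad = w[sub.index]
except IndexError:
    bad = None

# Safe pattern: key the weights on the frame's labels first.
w_s = pd.Series(w, index=df.index)
good = w_s.loc[sub.index].to_numpy()
```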

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings beyond the blockers above.

Tech Debt

Security

  • No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • Severity: P3. Impact: the new suite adds useful smoke, scale-invariance, metadata, and guard coverage, but it does not pin either blocker above: there is no EfficientDiD test exercising SurveyDesign(weight_type="fweight"), and the Bacon tests do not cover weights="exact" on a non-default DataFrame index. Those gaps make both regressions easy to ship unnoticed. Concrete fix: add one EfficientDiD regression with unequal integer frequency weights and one Bacon exact-weight regression using a filtered/non-reset index. Refs: tests/test_survey_phase3.py:L337, tests/test_survey_phase3.py:L644
  • Static review only: I could not execute the test suite in this environment because the sandbox Python runtime does not have pandas/numpy installed.

Path to Approval

  1. In EfficientDiD, make the unit-level survey weight used for Omega*/generated outcomes/EIF construction come from the resolved survey object, not a separately renormalized raw column path. Add a regression test covering SurveyDesign(weight_type="fweight") with unequal integer weights.
  2. In BaconDecomposition, remove label-based indexing from _recompute_exact_weights so exact weights work on non-default DataFrame indexes. Add a regression test using weights="exact" on a filtered DataFrame whose index has not been reset.

igerber added a commit that referenced this pull request Mar 21, 2026
…ety from PR #226 review (round 5)

- EfficientDiD: use resolved survey weights directly for unit-level
  estimation (Omega*, EIF, cohort fractions) instead of separately
  renormalized raw column, ensuring fweight/aweight consistency with TSL
- BaconDecomposition: store survey weights as DataFrame column for safe
  label-based subsetting in _recompute_exact_weights, preventing
  out-of-bounds errors on non-default DataFrame indexes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d26d498718eb53504e68677465d1d9562215c116


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The two prior blockers called out in the last review appear fixed in code: EfficientDiD now derives unit-level survey weights from the resolved survey object, and Bacon exact weights now subset via a DataFrame-aligned weight column instead of NumPy label indexing. Locations: diff_diff/efficient_did.py:L323, diff_diff/bacon.py:L694
  • P1: ContinuousDiD’s new survey analytical SE path is mis-scaled for supported SurveyDesign(weight_type="fweight") inputs. The IF construction is normalized by weighted mass, but the TSL conversion hard-codes n_units, so overall, dose-response, and event-study SEs are wrong for frequency-weighted designs.
  • P1: BaconDecomposition’s new treated-vs-never weighted mean path lacks the empty-cell guard already present in the timing-comparison path. On some unbalanced/filtered panels, the PR now turns a warning-only path into a hard failure.
  • The Phase 5 survey deferrals and Bacon’s diagnostic exact-sum caveat are documented in the registry/roadmap, so they are non-blocking. Locations: docs/methodology/REGISTRY.md:L497, docs/methodology/REGISTRY.md:L673, docs/methodology/REGISTRY.md:L1234, docs/methodology/REGISTRY.md:L1571, docs/survey-roadmap.md:L35
  • Static review only: I could not run pytest here because the local environment is missing numpy, pandas, and pytest.

Methodology

No unmitigated registry/source-material mismatch stood out in EfficientDiD, SunAbraham, StackedDiD, or TripleDifference; the remaining methodology blocker is in ContinuousDiD.

Code Quality

  • P1 Impact: BaconDecomposition still advertises unbalanced panels as warning-only, but _compute_treated_vs_never() now always uses np.average(..., weights=...) without checking whether the treated/never pre/post slices are empty. That means a filtered or more severely unbalanced panel can now raise instead of skipping the invalid comparison, which is a production edge-case regression introduced by the weighted rewrite. Concrete fix: add the same four-cell emptiness guard already used in _compute_timing_comparison() and return None before averaging if any cell is empty. Locations: diff_diff/bacon.py:L492, diff_diff/bacon.py:L820, diff_diff/bacon.py:L934, tests/test_bacon.py:L580
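The suggested guard can be sketched as follows; `guarded_weighted_mean` is a hypothetical helper, not the library's API, but it mirrors the pattern of returning None for an empty cell instead of letting `np.average` raise:

```python
import numpy as np

def guarded_weighted_mean(values, weights):
    """Weighted mean, or None for an empty/zero-mass cell.

    np.average(..., weights=...) raises ZeroDivisionError when the
    weights sum to zero (including the empty-slice case), so the guard
    must run before averaging.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if values.size == 0 or weights.sum() == 0:
        return None
    return float(np.average(values, weights=weights))

# An empty treated/never pre- or post-period cell is skipped, not fatal.
empty_cell = guarded_weighted_mean([], [])
normal_cell = guarded_weighted_mean([1.0, 3.0], [1.0, 1.0])
```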

Performance

No material findings in the reviewed diff.

Maintainability

No material findings beyond the blockers above.

Tech Debt

Security

No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • P3 Impact: The new survey suite still exercises only happy-path pweight estimator flows for ContinuousDiD/EfficientDiD and only happy-path Bacon survey cases, so neither blocker above is pinned by regression coverage. Concrete fix: add one ContinuousDiD analytical survey test with unequal integer fweights, and one Bacon regression that filters an unbalanced panel until a treated-vs-never pre/post cell is empty and asserts the comparison is skipped rather than raising. Locations: tests/test_survey_phase3.py:L337, tests/test_survey_phase3.py:L561, tests/test_survey_phase3.py:L763, tests/test_bacon.py:L580
  • Static review only: I could not execute the test suite in this environment because numpy, pandas, and pytest are not installed.

Path to Approval

  1. In ContinuousDiD, replace the hard-coded n_units TSL rescaling with the unit-level total survey mass actually used by the weighted IF path, and apply the same fix to the event-study survey SE branch. Add an estimator-level regression test with unequal integer fweights.
  2. In BaconDecomposition, add a treated-vs-never empty-cell guard before the new weighted averages so unbalanced/filtered panels skip invalid comparisons instead of raising. Add a regression test that creates exactly that unbalanced-cell case.
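A small numpy illustration of why item 1's divisor must be the survey mass rather than the unit count (toy data; the actual TSL rescaling is not reproduced here):

```python
import numpy as np

# Frequency weights encode w_i copies of each row, so the relevant "n"
# for any 1/n normalization is the total weight mass, not the number of
# stored unit rows.
w = np.array([1, 3, 2], dtype=float)
y = np.array([2.0, 5.0, 4.0])

expanded = np.repeat(y, w.astype(int))   # the sample the weights encode
n_units = len(y)                         # 3
mass = float(w.sum())                    # 6.0 -- mass != n_units

weighted_mean = float(np.average(y, weights=w))
assert np.isclose(weighted_mean, expanded.mean())

# Dividing the weighted sum by n_units instead of mass mis-scales the
# result by mass / n_units (here 2x) -- the class of error the round-6
# fix removes from the TSL conversion.
wrong_mean = float((w * y).sum() / n_units)
```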

igerber added a commit that referenced this pull request Mar 21, 2026
…ll guard from PR #226 review (round 6)

- ContinuousDiD: rescale IFs by unit-level total survey mass
  (unit_resolved.weights.sum()) instead of hard-coded n_units, so TSL
  SEs are correct for fweight designs where mass != n_units
- BaconDecomposition: add empty-cell guard in _compute_treated_vs_never
  before np.average() to prevent crashes on unbalanced/filtered panels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3798bd4cd44fcaefe5327810d8676ec362a62d36


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • P1 Impact: SunAbraham’s survey implementation uses survey-weighted within-transformation and TSL vcov, but it does not carry the required survey degrees of freedom into the aggregate inference layer. The registry explicitly says survey inference should use df = n_PSU - n_strata; this code instead calls safe_inference without df for both overall ATT and interaction-weighted event-study aggregates, so overall_p_value, overall_conf_int, and each event_study_effects[e] are too optimistic whenever survey df is materially smaller than the normal limit. Concrete fix: thread survey_metadata.df_survey (or the regression object’s survey df) into both aggregate safe_inference calls, and preserve that df when storing overall/event-study inference. Locations: diff_diff/sun_abraham.py:L651-L663, diff_diff/sun_abraham.py:L947-L956, docs/methodology/REGISTRY.md:L745-L745, docs/methodology/REGISTRY.md:L1908-L1921
  • P2 [Newly identified] Impact: the new survey path in ContinuousDiD now computes df-aware analytical inference at fit time, but the unchanged DoseResponseCurve.to_dataframe() helper recomputes analytic t-stats/p-values with safe_inference(..., df=None). Direct calls to result.dose_response_att.to_dataframe() or result.dose_response_acrt.to_dataframe() therefore disagree with the estimator’s own survey-aware inference. Concrete fix: store analytic t-stats/p-values (or at least survey_df) on DoseResponseCurve during fit(), and stop recomputing them with a normal reference distribution in to_dataframe(). Locations: diff_diff/continuous_did.py:L512-L542, diff_diff/continuous_did_results.py:L49-L75
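The practical stakes of the df mismatch are easy to see directly. This sketch does not reproduce `safe_inference`; it just contrasts a normal reference (the `df=None` behavior) with a t reference at a small survey df:

```python
from scipy import stats

# The same t-statistic yields materially different p-values under a
# normal reference vs a t reference with few PSUs.
t_stat = 2.2
df_survey = 5                    # e.g. n_PSU - n_strata in a small design

p_normal = 2 * stats.norm.sf(abs(t_stat))         # df=None behavior
p_t = 2 * stats.t.sf(abs(t_stat), df=df_survey)   # df-aware behavior
# The normal reference is strictly anti-conservative here: p_normal < p_t.
```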

Code Quality

No material findings beyond the inference issues above.

Performance

No material findings in the reviewed diff.

Maintainability

No material findings in the reviewed diff.

Tech Debt

Security

No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • P3 Impact: the new survey suite exercises smoke/invariance/metadata paths, but it does not pin the two inference regressions above. There is no SunAbraham test asserting that stored overall/event-study p-values use df_survey, and no ContinuousDiD test covering direct DoseResponseCurve.to_dataframe() on a survey fit. Concrete fix: add one low-PSU SunAbraham regression that compares stored p-values/CIs against safe_inference(..., df=result.survey_metadata.df_survey), and one ContinuousDiD regression that calls dose_response_att.to_dataframe() on a survey result and checks its t/p output against the same df-aware calculation. Locations: tests/test_survey_phase3.py:L119-L230, tests/test_survey_phase3.py:L503-L545
  • Static review only: I could not execute the survey tests in this environment because numpy and pytest are unavailable.

Path to Approval

  1. In SunAbraham, pass survey df into aggregate inference so both overall_att and event-study outputs use safe_inference(..., df=survey_metadata.df_survey) when survey inference is active, and add a regression test with small survey df that would fail under the current normal-approximation behavior.

igerber added a commit that referenced this pull request Mar 21, 2026
…(round 7)

Thread survey_metadata.df_survey to safe_inference() for both overall
ATT and event-study effects, so p-values/CIs use t-distribution with
n_PSU - n_strata df when survey design is active.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 918b876a6f95c1d778545c7f197c8b516bec3011


Overall Assessment

Looks good

Executive Summary

Methodology

  • P2 Impact: direct result.dose_response_att.to_dataframe() / result.dose_response_acrt.to_dataframe() still recompute p_value with safe_inference(..., df=None), so survey-fit curves report normal-reference p-values even though fit-time analytical inference is built with df_survey. The stored CI arrays remain the df-aware ones; the inconsistency is in the helper’s recomputed p_value column. Concrete fix: store survey_df or precomputed analytical p_value on DoseResponseCurve and use it in to_dataframe() instead of recomputing with the normal reference. Locations: diff_diff/continuous_did_results.py:L49-L75, diff_diff/continuous_did.py:L524-L543, diff_diff/continuous_did.py:L658-L678

Code Quality

  • No material findings in the reviewed diff.

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings in the reviewed diff.

Tech Debt

Security

  • No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • P3 Impact: the new survey suite exercises smoke/invariance/guardrails, but it still does not pin the remaining DoseResponseCurve.to_dataframe() survey p-value mismatch, and it also lacks a low-survey-df SunAbraham regression that would catch a future reversion of the now-fixed aggregate inference path. Concrete fix: add one survey-fit DoseResponseCurve.to_dataframe() assertion against safe_inference(..., df=result.survey_metadata.df_survey), and one small-PSU SunAbraham regression that checks stored overall/event-study inference against survey-df calculations. Locations: tests/test_survey_phase3.py:L116-L240, tests/test_survey_phase3.py:L503-L553, tests/test_survey_phase3.py:L726-L783

igerber added a commit that referenced this pull request Mar 21, 2026
…-values from PR #226 review (round 8)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7bf9c8fc216682f6c046535530dc2b3c2505fe6f


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • P1 Impact: In the Bacon exact-weight survey path, sample_share and the within-2x2 treated share are survey-weighted, but unit_share falls back to raw unit counts: unit_share = (n_k + n_l) / n_total_units. That is an undocumented mismatch against the registry/roadmap promise of “weighted group shares,” and it will silently distort the exact comparison weights whenever survey weights are not proportional to cohort sizes. Concrete fix: make the subsample-share term weight-consistent in _recompute_exact_weights() as well, or explicitly document a deliberate raw-count deviation in the methodology registry. See diff_diff/bacon.py:L747-L776, docs/methodology/REGISTRY.md:L1566-L1572, and docs/survey-roadmap.md:L19-L24.
  • P2 Impact: ContinuousDiD analytical survey fits use df_survey when constructing fit-time inference, but DoseResponseCurve.to_dataframe() still recomputes p_value with safe_inference(..., df=None). The helper therefore returns internally inconsistent inference columns for the same survey fit. Concrete fix: store survey_df or precomputed analytical t_stat/p_value on DoseResponseCurve and reuse them in to_dataframe(). See diff_diff/continuous_did.py:L512-L543, diff_diff/continuous_did.py:L658-L678, and diff_diff/continuous_did_results.py:L49-L75.
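The unit_share mismatch in the first bullet reproduces in miniature whenever survey weights are not proportional to cohort sizes (column names here are illustrative, not the library's):

```python
import numpy as np
import pandas as pd

# Raw-count vs survey-weighted cohort shares diverge as soon as weights
# are not proportional to cohort sizes.
units = pd.DataFrame({
    "cohort": ["k", "k", "l", "l", "l", "never"],
    "w":      [5.0, 5.0, 1.0, 1.0, 1.0, 1.0],
})
in_pair = units["cohort"].isin(["k", "l"])

raw_share = float(in_pair.sum()) / len(units)            # 5/6
weighted_share = float(
    units.loc[in_pair, "w"].sum() / units["w"].sum()     # 13/14
)
# Using raw_share inside otherwise survey-weighted exact weights
# silently distorts the decomposition.
assert not np.isclose(raw_share, weighted_share)
```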

Code Quality

  • No material findings in the reviewed diff.

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings in the reviewed diff.

Tech Debt

  • P3 Impact: The deferred survey combinations are properly tracked and guarded: SunAbraham bootstrap+survey, TripleDifference IPW/DR+survey, ContinuousDiD bootstrap+survey, EfficientDiD bootstrap+survey, plus Bacon’s documented diagnostic caveat. Under project policy these are mitigated and non-blocking. Concrete fix: none required in this PR. See docs/survey-roadmap.md:L28-L40 and docs/methodology/REGISTRY.md:L1571-L1572.

Security

  • No security or secret-handling findings in the reviewed diff.

Documentation/Tests

  • P3 Impact: The new survey suite does not cover BaconDecomposition(weights="exact", survey_design=...), so the exact-weight mismatch above can slip through unchanged. Concrete fix: add a survey-weighted exact-Bacon regression test with unequal cohort weights and assert the comparison weights against a direct weighted-share calculation. See tests/test_survey_phase3.py:L337-L388.
  • P3 Impact: test_continuous_did_dose_response_survey_pvalue is named like a df-awareness regression test, but it only checks that helper p-values are finite; it would still pass with the current wrong normal-reference helper output. Concrete fix: assert one or more DoseResponseCurve.to_dataframe()["p_value"] values against safe_inference(effect, se, df=result.survey_metadata.df_survey). See tests/test_survey_phase3.py:L839-L887.

Path to Approval

  1. Make the Bacon exact survey path fully weight-consistent by replacing the raw-count unit_share factor in _recompute_exact_weights() with a survey-weighted subsample-share term, or document a deliberate deviation in the registry if raw counts are intentional. See diff_diff/bacon.py:L747-L776.
  2. Add a regression test that exercises BaconDecomposition(weights="exact", survey_design=...) on data where weighted cohort mass differs materially from raw unit counts. See tests/test_survey_phase3.py:L337-L388.

igerber added a commit that referenced this pull request Mar 22, 2026
- EfficientDiD: collapse survey to unit-level before TSL (was expanding
  EIF to panel rows, making SEs too small for weights-only path)
- ContinuousDiD: survey-weighted event-study aggregation and TSL SEs
- ContinuousDiD/EfficientDiD: validate within-unit constancy for survey
  columns before collapsing to unit level
- TripleDifference: use survey df (n_PSU - n_strata) for p-values/CIs
  instead of sample df (n_obs - 8)
- StackedDiD: handle SurveyDesign without weights column (use ones)
- Pass survey df to safe_inference() in all survey-aware estimators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Mar 22, 2026
- ContinuousDiD: rescale IFs by n_units before compute_survey_vcov to
  avoid double-counting 1/n bread; use unit-level df_survey
- EfficientDiD: align unit_first_panel_row to sorted all_units order;
  build unit-level ResolvedSurveyDesign once in fit(); use unit-level df
- SunAbraham: thread survey weights into _compute_iw_effects and
  _compute_overall_att for survey-weighted cohort aggregation
- StackedDiD: pass survey df to safe_inference for event-study and
  overall ATT p-values/CIs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Mar 22, 2026
…nce from PR #226 review (round 3)

- ContinuousDiD: store survey-weighted treated/control masses and
  weighted dpsi_bar in bootstrap_info; use weighted masses for p_1, p_0,
  n_total in IF construction so TSL linearizes the weighted estimator
- TripleDifference: use survey-weighted subgroup mass (sum(w_sub))
  instead of raw counts (n_sub) for pairwise IF combination weights
  w3, w2, w1 when survey design is active

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Mar 22, 2026
… from PR #226 review (round 4)

- ContinuousDiD: normalize WLS bread by weighted treated mass (not raw
  count) for consistency with downstream IF score denominators; fixes
  ACRT_glob/ATT(d)/ACRT(d) survey SEs when subgroup-average weights differ
- ContinuousDiD/EfficientDiD: recompute survey_metadata from unit-level
  ResolvedSurveyDesign so reported effective_n/n_psu/df_survey match
  the inference actually run (not the panel-level overcount)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Mar 22, 2026
…ety from PR #226 review (round 5)

- EfficientDiD: use resolved survey weights directly for unit-level
  estimation (Omega*, EIF, cohort fractions) instead of separately
  renormalized raw column, ensuring fweight/aweight consistency with TSL
- BaconDecomposition: store survey weights as DataFrame column for safe
  label-based subsetting in _recompute_exact_weights, preventing
  out-of-bounds errors on non-default DataFrame indexes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3c30f8e765984cad41da7bf3dd13260569547fb8


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review P1 coverage gaps are addressed: there is now a TripleDifference survey+covariates test at tests/test_survey_phase3.py:535, and EfficientDiD now has survey aggregation tests for event_study, group, and all at tests/test_survey_phase3.py:717, tests/test_survey_phase3.py:737, and tests/test_survey_phase3.py:756.
  • I did not find a new silent numerical correctness bug in the reviewed survey implementations.
  • One unmitigated P1 remains: EfficientDiD now rejects survey_design + covariates in code, but the Methodology Registry still presents survey support without documenting that deferral. The limitation is only documented in the survey roadmap, not the registry.
  • The other survey deferrals in this PR are guarded and documented closely enough to treat as informational.
  • Static review only: I could not execute the tests here because pytest is unavailable and the local Python environment is missing numerical packages.

Methodology

  • P1 Impact: EfficientDiD.fit(..., covariates=[...], survey_design=...) now raises NotImplementedError in diff_diff/efficient_did.py:278, but the registry entry at docs/methodology/REGISTRY.md:675 only documents the bootstrap deferral and otherwise reads as if survey support is implemented for the estimator generally. The same limitation is documented in the roadmap at docs/survey-roadmap.md:39, so the code and roadmap agree, but the registry is now out of sync with actual supported behavior. Concrete fix: add an explicit **Note:** in the EfficientDiD registry entry stating that covariates / DR nuisance estimation with survey_design is deferred or unsupported, or fully thread survey weights through the DR nuisance path and remove the guard.
  • P3 Impact: The other changed survey limitations are documented and guarded rather than silently returning wrong output, including TripleDifference IPW/DR + survey at docs/methodology/REGISTRY.md:1240, ContinuousDiD bootstrap + survey at docs/methodology/REGISTRY.md:497, and Bacon’s diagnostic weighted decomposition note at docs/methodology/REGISTRY.md:1577. Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P3 Impact: The previous re-review blockers on interaction coverage are now addressed by the new tests in tests/test_survey_phase3.py:535 and tests/test_survey_phase3.py:717. Concrete fix: none.
  • P3 Impact: I could not run the added tests locally because the environment does not have pytest or the numerical Python stack installed. Concrete fix: verify execution in CI or a provisioned dev environment.

Path to Approval

  1. Update the EfficientDiD registry entry at docs/methodology/REGISTRY.md:675 to explicitly note that covariates + survey_design is deferred / unsupported, matching the code guard at diff_diff/efficient_did.py:278 and the roadmap entry at docs/survey-roadmap.md:39.

#226 review (round 13)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 08ab80f5958f1666fa53fc6ebb2014903b4e6d3d


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review blocker is resolved: EfficientDiD now documents both bootstrap + survey and covariates + survey deferrals in the Methodology Registry, matching the new guards in code.
  • I did not find a new NaN/partial-inference bug in the changed survey inference paths; the modified estimators route scalar inference through safe_inference().
  • One unmitigated P1 remains in the new Bacon survey support: the weights="exact" path silently collapses each unit to its first observed survey weight but never checks that survey weights are constant within unit, so exact comparison weights can become row-order-dependent.
  • The other survey deferrals in this PR are guarded and documented rather than silently returning output.
  • Static review only: I could not run the test suite locally because this environment is missing numpy, pandas, scipy, and pytest.

Methodology

  • Severity: P1. Impact: BaconDecomposition(weights="exact", survey_design=...) accepts panel survey weights that vary over time within a unit, but the exact-weight recomputation later collapses each unit to a single weight via .groupby(unit).first(). That makes the reported “exact” Bacon weights depend on row order for otherwise valid inputs, and the registry does not document any unit-level collapse rule beyond “weighted cell means / weighted group shares.” See diff_diff/bacon.py:L474-L479, diff_diff/bacon.py:L776-L790, and docs/methodology/REGISTRY.md:L1578-L1579. Concrete fix: when self.weights == "exact" and survey_design.weights is set, either reject within-unit-varying survey weights (same pattern as _validate_unit_constant_survey() in the unit-collapsing estimators) or define and document a deterministic unit-level weighting rule and use it consistently instead of .first().
  • Severity: P3. Impact: The previous EfficientDiD documentation mismatch is fixed; the registry now explicitly documents both bootstrap + survey and covariates + survey deferrals, matching the NotImplementedError guards. See docs/methodology/REGISTRY.md:L675-L681 and diff_diff/efficient_did.py:L270-L286. Concrete fix: none.
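The validation pattern suggested in the P1 can be sketched as below; `validate_unit_constant` is a hypothetical stand-in, and the real `_validate_unit_constant_survey()` helper may differ in signature and message:

```python
import pandas as pd

def validate_unit_constant(df, unit_col, weight_col):
    """Reject survey weights that vary within a unit.

    A .groupby(unit).first() collapse on varying weights would be
    row-order-dependent, so the guard fails fast instead.
    """
    nuniq = df.groupby(unit_col)[weight_col].nunique()
    bad = nuniq[nuniq > 1].index.tolist()
    if bad:
        raise ValueError(
            f"survey weights vary within units {bad}; a first()-based "
            f"unit collapse would depend on row order"
        )

panel = pd.DataFrame({"unit": [1, 1, 2, 2], "w": [2.0, 3.0, 1.0, 1.0]})
try:
    validate_unit_constant(panel, "unit", "w")
    rejected = False
except ValueError:
    rejected = True   # unit 1 carries weights {2.0, 3.0}
```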

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • Severity: P3. Impact: I could not execute the new survey Phase 3 tests locally because the Python scientific stack is unavailable in this environment. Concrete fix: confirm the added coverage in CI or rerun the relevant survey tests in a provisioned dev environment before merge.

Path to Approval

  1. Add a guard for BaconDecomposition(weights="exact", survey_design=...) that rejects within-unit-varying survey weights, or replace the .groupby(...).first() collapse with a documented deterministic unit-level weighting rule.
  2. Add a regression test covering a panel with time-varying survey weights under weights="exact" and assert either a clear ValueError or stable exact comparison weights under two different row orders.

igerber and others added 4 commits March 22, 2026 07:18
…n test from PR #226 review (round 14)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m PR #226 local review

Pre-existing behavior (not introduced by survey PR): event-study
aggregation uses all (g,t) cells without anticipation filtering.
Affects both survey and non-survey paths equally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ocal review

Maintainability P2: panel-to-unit survey collapse, post-filter
re-resolution, and metadata recomputation patterns differ across
ContinuousDiD, EfficientDiD, and StackedDiD. Candidate for shared
helpers in a follow-up tech debt PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l review

P3: near-identical survey metadata rendering in 7 results classes
will grow worse as Phases 4-5 add more estimators. Extract a shared
helper in a follow-up tech debt PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4604badc913f29ef71fba509c70dc2f0d086a25c


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Code Quality

  • Severity: P1. Impact: The new survey_design support is not threaded through two exported helper entry points, so bacon_decompose(..., survey_design=...) and triple_difference(..., survey_design=...) will fail despite the underlying estimators supporting survey designs. That leaves the public API internally inconsistent and the feature partially inaccessible (diff_diff/bacon.py:L1039, diff_diff/triple_diff.py:L1693, diff_diff/__init__.py:L18, diff_diff/__init__.py:L129). Concrete fix: add survey_design=None to both convenience-function signatures/docstrings and forward it to .fit(...).
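The forwarding fix is mechanical; a self-contained sketch with a stub estimator (the real constructor and fit() signatures are not reproduced here):

```python
class _StubEstimator:
    """Stand-in for BaconDecomposition; records what fit() received."""

    def fit(self, data, *, survey_design=None):
        return {"survey_design": survey_design}

def bacon_decompose_sketch(data, *, survey_design=None):
    # The fix the review asks for: accept survey_design in the wrapper
    # signature and forward it to the estimator's fit() instead of
    # silently dropping it.
    return _StubEstimator().fit(data, survey_design=survey_design)

result = bacon_decompose_sketch([], survey_design="design")
```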

Performance

No findings.

Maintainability

  • Severity: P3. Impact: The follow-up refactors around shared survey helper extraction and duplicated survey summary formatting are properly tracked in TODO.md, so they are not blockers (TODO.md:L58, TODO.md:L59). Concrete fix: none; tracked.

Tech Debt

  • Severity: P3. Impact: The pre-existing ContinuousDiD event-study anticipation limitation is now explicitly tracked, so I am not treating it as a blocker in this re-review (TODO.md:L57). Concrete fix: none; tracked.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: The new Phase 3 tests add wrapper coverage for stacked_did(), but not for bacon_decompose() or triple_difference(), which likely let the missed public-API propagation ship (tests/test_survey_phase3.py:L301, diff_diff/bacon.py:L1039, diff_diff/triple_diff.py:L1693). Concrete fix: add convenience-wrapper tests for bacon_decompose(..., survey_design=...) and triple_difference(..., survey_design=...), including the reg-success and IPW/DR-guard cases for TripleDifference.
  • Severity: P3. Impact: I could not execute the added tests locally because this environment lacks pytest, numpy, pandas, and scipy. Concrete fix: confirm tests/test_survey_phase3.py and the affected estimator subsets in CI or a provisioned dev environment.

Path to Approval

  1. Add survey_design passthrough support to bacon_decompose() and update its docstring/examples.
  2. Add survey_design passthrough support to triple_difference() and update its docstring to describe reg-only survey support plus the IPW/DR guard.
  3. Add regression tests covering both public convenience wrappers with survey inputs.

…) convenience functions from PR #226 review (round 15)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cb2d8859475cd0790e3a200f7103bef68201079f


Overall Assessment

⛔ Blocker

Executive Summary

Methodology

Code Quality

No findings. The prior public-wrapper passthrough issue appears fixed in diff_diff/bacon.py:L1123 and diff_diff/triple_diff.py:L1780.

Performance

No findings.

Maintainability

  • Severity: P3. Impact: the inconsistent panel-survey helper patterns and duplicated survey summary formatting are now explicitly tracked in TODO.md, so they are non-blocking deferred refactors. Concrete fix: none in this PR; tracked in TODO.md:L58 and TODO.md:L59.

Tech Debt

  • Severity: P3. Impact: the pre-existing ContinuousDiD event-study anticipation limitation is now tracked and should not block this PR. Concrete fix: none in this PR; tracked in TODO.md:L57.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new SunAbraham scale-invariance coverage only multiplies weights upward by 10, which does not exercise the below-1-weight case that triggers the blocker above. That gap is why this slipped through. Concrete fix: add a regression test that rescales the same survey weights downward (for example weight * 0.1) and asserts overall_att and overall_se are unchanged. See tests/test_survey_phase3.py:L818.
  • Severity: P3. Impact: bacon_decompose() and triple_difference() now expose survey_design, but their wrapper docstrings still do not document that parameter, and triple_difference() still does not describe the reg-only survey limitation. Concrete fix: update those wrapper docstrings/examples. See diff_diff/bacon.py:L1048 and diff_diff/triple_diff.py:L1707.
  • Severity: P3. Impact: I could not execute the new tests in this environment because the required Python packages are missing. Concrete fix: verify tests/test_survey_phase3.py in CI or a provisioned dev environment.

Path to Approval

  1. In SunAbraham._compute_overall_att(), remove the max(n_at_e, 1) floor from the survey-weighted branch and use the actual positive survey-weighted mass for post-period aggregation.
  2. Add a regression test that rescales the same survey weights below 1 and verifies SunAbraham’s overall_att and overall_se are invariant to that positive rescaling.
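A toy aggregation shows why the max(n_at_e, 1) floor breaks positive-rescaling invariance once survey masses fall below 1 (the function here is illustrative, not SunAbraham's actual code):

```python
import numpy as np

def overall_att(effects, masses, floor=False):
    """Toy post-period aggregation; floor=True mimics max(mass, 1)."""
    m = np.maximum(masses, 1.0) if floor else masses
    return float(np.sum(effects * m) / np.sum(m))

effects = np.array([1.0, 3.0])
masses = np.array([4.0, 2.0])              # survey-weighted masses

base = overall_att(effects, masses)
scaled = overall_att(effects, masses * 0.1)   # masses 0.4, 0.2 < 1

# Without the floor, the weighted average is invariant to any positive
# rescaling of the weights; with the floor, sub-1 masses get clamped to
# 1 and the aggregate shifts.
floored = overall_att(effects, masses * 0.1, floor=True)
```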

…226 review (round 16)

Remove max(n_at_e, 1) floor from survey-weighted branch — valid
survey masses can be < 1 and flooring breaks scale invariance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: efc09f966dc5e9c7b116b28c8b38eca216a193a6


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity P3. Impact: the new TODO entries correctly track the remaining shared-helper and duplicated-summary refactors, so those issues are non-blocking in this PR under the project’s deferred-work policy at TODO.md:L58-L59. Concrete fix: none in this PR.

Tech Debt

  • Severity P3. Impact: the pre-existing ContinuousDiD event-study anticipation limitation is now explicitly tracked in TODO.md:L57-L59, so it should not block this merge. Concrete fix: none in this PR.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: static review only; I could not execute the new survey tests locally because this container does not have numpy, pandas, pytest, or scipy installed. Concrete fix: verify the new survey slice in CI or a provisioned development environment before merge.

Path to Approval

  1. In diff_diff/stacked_did.py:L369-L443, add an explicit guard rejecting SurveyDesign(..., weight_type="fweight") for StackedDiD survey runs, because _Q_weight is ratio-valued and cannot preserve frequency-weight semantics after composition.
  2. Add a regression test in tests/test_survey_phase3.py that uses integer survey weights with weight_type="fweight" and asserts StackedDiD.fit(..., survey_design=...) raises a clear ValueError or NotImplementedError instead of fitting silently.
  3. Update the StackedDiD survey note in docs/methodology/REGISTRY.md:L1023-L1024 to document the supported weight-type restriction once the code enforces it.
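The semantics problem behind item 1 can be demonstrated in a few lines; `check_fweight_composition` is a guard sketch, not the library's implementation:

```python
import numpy as np

def check_fweight_composition(w):
    """Guard sketch: composed weights must stay integer-valued to keep
    frequency-weight (row replication) semantics."""
    if not np.allclose(w, np.round(w)):
        raise ValueError("composed weights are non-integer; "
                         "fweight semantics cannot be preserved")

fweights = np.array([2.0, 3.0, 1.0])     # integer replication counts
q_weights = np.array([0.5, 1.25, 0.8])   # ratio-valued stack Q-weights
composed = fweights * q_weights          # [1.0, 3.75, 0.8]

try:
    check_fweight_composition(composed)
    guard_fired = False
except ValueError:
    guard_fired = True    # non-integer composed weights are rejected
```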

…226 review (round 17)

Q-weight composition produces non-integer composed weights, breaking
frequency-weight semantics. Raise ValueError for weight_type='fweight'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a8e2a20273bd3000b2288f23c34fe6d1f521dfbb


Overall Assessment
⚠️ Needs changes

Static review only; I did not execute the test suite in this environment.

Executive Summary

Methodology

  • Severity P1. Impact: TripleDifference now documents survey support as “weighted OLS + TSL on combined influence functions” in docs/methodology/REGISTRY.md:L1237-L1242. The final survey SE is then computed from the combined IF in diff_diff/triple_diff.py:L1042-L1053. But the weighted RA nuisance linearization in diff_diff/triple_diff.py:L1333-L1360 only weights the bread (XpX_pre/post) and leaves the score pieces wols_eX_pre/post unweighted. That makes the survey IF inconsistent with the weighted OLS fit, so the reported SE / p-value / CI can be wrong whenever survey weights vary. Concrete fix: weight wols_eX_pre/post by the same survey weights used in XpX_pre/post, with consistent normalization, before passing the combined IF to compute_survey_vcov().
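The consistency requirement behind this finding — weights must appear in both the bread and the per-observation scores — can be illustrated on a generic weighted OLS fit. This is a standalone numpy sketch, not the repository's influence-function code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
w = rng.uniform(0.5, 3.0, size=n)  # non-uniform survey weights

# Weighted OLS: beta solves (X'WX) beta = X'W y
XtWX = X.T @ (w[:, None] * X)
beta = np.linalg.solve(XtWX, X.T @ (w * y))
e = y - X @ beta

# Influence functions consistent with the weighted fit: the weights
# appear in BOTH the bread (X'WX) and the per-obs scores (w_i x_i e_i).
bread_inv = np.linalg.inv(XtWX / n)
IF = (w[:, None] * X * e[:, None]) @ bread_inv.T

# At the WLS solution sum_i w_i x_i e_i = 0 exactly, so the consistent
# IFs are mean-zero; unweighted scores x_i e_i generally are not.
print(np.abs(IF.mean(axis=0)).max() < 1e-8)               # True
print(np.abs((X * e[:, None]).mean(axis=0)).max() > 1e-8)  # True
```

Leaving the weights out of the score (the bug described above) produces the second, non-centered quantity, and a variance built from it is inconsistent with the weighted fit.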

Code Quality
No additional findings.

Performance
No findings.

Maintainability

  • Severity P3. Impact: the remaining panel-survey helper duplication and repeated survey-summary formatting are now explicitly tracked, so they are non-blocking in this PR under the project’s deferred-work policy in TODO.md:L58-L59. Concrete fix: none in this PR.

Tech Debt

  • Severity P3. Impact: ContinuousDiD event-study aggregation still does not filter by anticipation, but the PR adds a TODO.md entry that properly tracks it as pre-existing debt, so it should not block merge in TODO.md:L57. Concrete fix: none in this PR.

Security
No findings.

Documentation/Tests

  • Severity P3. Impact: the prior StackedDiD fweight blocker is fixed in code and test in diff_diff/stacked_did.py:L246-L257 and tests/test_survey_phase3.py:L331-L344, but the public docs still describe StackedDiD survey support as full / multiplicative without noting that fweight is rejected in docs/methodology/REGISTRY.md:L1023-L1024 and docs/survey-roadmap.md:L19-L24. Concrete fix: add a short note that StackedDiD survey support excludes weight_type="fweight" because Q-weight composition breaks frequency-weight semantics.
  • Severity P3. Impact: the new TripleDifference survey tests in tests/test_survey_phase3.py:L468-L602 only smoke-test finite output, metadata, and guard behavior; they do not benchmark the weighted RA SE against a known-good target, which is why the P1 above is currently unexercised. Concrete fix: add a regression test comparing the survey-reg se to a hand-worked or frequency-expanded equivalent on a small non-uniformly weighted 2×2×2 example.
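The "frequency-expanded equivalent" check suggested above rests on a generic identity: with integer weights, weighted OLS on the compact data equals unweighted OLS on row-replicated data. A minimal standalone sketch of that identity (not the package's actual test):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=8)
X = np.column_stack([np.ones(8), rng.normal(size=8)])
w = np.array([1, 2, 3, 1, 2, 1, 4, 2])  # integer weights

# Weighted OLS on the compact data ...
beta_w = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# ... equals unweighted OLS on the frequency-expanded data.
Xe = np.repeat(X, w, axis=0)
ye = np.repeat(y, w)
beta_e = np.linalg.solve(Xe.T @ Xe, Xe.T @ ye)

print(np.allclose(beta_w, beta_e))  # True
```

A regression test of this shape on a small weighted 3-way design would pin the survey-reg point estimate; the SE comparison needs the matching variance rule and is left to the tracked follow-up.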

Path to Approval

  1. Fix the weighted RA nuisance linearization in diff_diff/triple_diff.py:L1333-L1360 so the pre/post OLS score terms include survey weights consistently with the weighted OLS fit.
  2. Add a regression test in tests/test_survey_phase3.py:L468-L602 that asserts the TripleDifference survey-reg SE matches a known target on a small weighted example.

…hts in score terms from PR #226 review (round 18)

The pre/post OLS score pieces (wols_eX_pre/post) and design matrices
(wols_x_pre/post) now include survey weights when present, consistent
with the weighted OLS bread. Previously only the bread was weighted,
making the survey IF inconsistent with the weighted OLS fit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5bdbccefae0e33dd80cea53441b69be2457b4a96


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in the changed PR paths. Static review only; I could not run the test suite in this sandbox because pytest and the Python scientific dependencies are unavailable here.

Executive Summary

Methodology

No unmitigated P0/P1 findings. The previously reviewed TripleDifference survey-RA mismatch is addressed in diff_diff/triple_diff.py:L1027-L1053 and diff_diff/triple_diff.py:L1331-L1364.

  • Severity P2. Impact: BaconDecomposition.fit() now applies _validate_unit_constant_survey() to every survey run in diff_diff/bacon.py:L474-L485, so the default approximate survey path also rejects time-varying survey columns even though the changed approximate implementation itself only uses observation-level weighted means / weighted within-transformation / weighted group shares in diff_diff/bacon.py:L521-L591. That is stricter than the documented Bacon survey support in docs/methodology/REGISTRY.md:L1573-L1579, and it is also stricter than the exact-only scope implied by the name of the new regression test in tests/test_survey_phase3.py:L1076-L1096. Concrete fix: move the unit-constant survey validation behind if self.weights == "exact" or add an explicit Bacon registry/roadmap note that survey support requires within-unit-constant design columns even for weights="approximate".
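The within-unit-constancy check at issue can be sketched as a standalone helper (hypothetical name mirroring _validate_unit_constant_survey; the actual signature in diff_diff/bacon.py may differ):

```python
import numpy as np

def is_unit_constant(unit_ids, col):
    """True if `col` takes a single value within every unit.

    The exact Bacon decomposition collapses data to the unit level, so
    strata / PSU / weight columns must not vary within a unit. The
    approximate path works on observation-level weighted means and can
    skip this check.
    """
    unit_ids = np.asarray(unit_ids)
    col = np.asarray(col)
    return all(
        np.unique(col[unit_ids == u]).size == 1 for u in np.unique(unit_ids)
    )

print(is_unit_constant([1, 1, 2, 2], [5.0, 5.0, 7.0, 7.0]))  # True
print(is_unit_constant([1, 1, 2, 2], [5.0, 6.0, 7.0, 7.0]))  # False
```

The suggested fix is then to run this validation only when `self.weights == "exact"`.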

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the non-blocking items already tracked in TODO.md.

Tech Debt

  • Severity P3. Impact: the PR properly tracks the pre-existing ContinuousDiD event-study anticipation limitation and the survey-helper / summary-formatting cleanup as deferred work, so they are non-blocking under the project policy in TODO.md:L57-L59. Concrete fix: none in this PR.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: StackedDiD now correctly rejects weight_type="fweight" and has a regression test for it in diff_diff/stacked_did.py:L246-L257 and tests/test_survey_phase3.py:L331-L344, but the methodology registry and survey roadmap still describe the survey support as full / multiplicative without noting that exclusion in docs/methodology/REGISTRY.md:L1023-L1024 and docs/survey-roadmap.md:L21-L22. Concrete fix: add a short note that StackedDiD survey support excludes fweight because Q-weight composition breaks frequency-weight semantics.
  • Severity P3. Impact: the new TripleDifference survey tests cover smoke behavior, guards, metadata, and scale invariance, but they still do not pin the survey-regression se to a known-good weighted target in tests/test_survey_phase3.py:L467-L602. Concrete fix: add a small non-uniform 2×2×2 regression test whose survey-reg se is checked against a hand-worked or frequency-expanded target.

…usions from PR #226 review (round 19)

- BaconDecomposition: move _validate_unit_constant_survey behind
  weights="exact" — approximate path uses obs-level means and doesn't
  need within-unit-constant survey columns
- REGISTRY.md: document StackedDiD fweight rejection and Bacon
  exact-only unit-constancy requirement
- survey-roadmap.md: update StackedDiD support to "Full (pweight/aweight)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ae7f687b9290aef0536a968da2475848baf47c3c


Overall Assessment

⚠️ Needs changes

Static re-review only. I could not run the test suite in this sandbox because numpy is unavailable.

Executive Summary

Methodology

Affected method: StackedDiD survey-weighted WLS/TSL extension.

Resolved from the prior review: the Bacon approximate survey path no longer rejects time-varying survey columns, and that exact-only constraint is now documented in diff_diff/bacon.py:L481-L488 and docs/methodology/REGISTRY.md:L1578-L1579.

  • Severity P1 [Newly identified]. Impact: StackedDiD now points users to weight_type="aweight" as a supported survey option in diff_diff/stacked_did.py:L246-L256. But in the survey path it multiplies Wing et al.’s corrective Q weights by the survey weights, copies the resolved survey design without changing weight_type, and sends that product into compute_survey_vcov() in diff_diff/stacked_did.py:L382-L448. The survey core’s aweight branch omits weights from the meat in diff_diff/survey.py:L553-L558, matching the registry’s inverse-variance/precision interpretation of aweight in docs/methodology/REGISTRY.md:L1891-L1894. Wing et al.’s Q_sa is a corrective sample weight for the stacked regression, whereas aweight is a precision-weight concept; treating Q × aweight as a pure aweight is therefore an undocumented variance rule, not a documented source-backed extension, and it can silently distort StackedDiD SEs, p-values, and CIs for a supported branch. Concrete fix: either reject weight_type="aweight" in StackedDiD.fit() until a valid composed-weight variance derivation exists, or implement and document a dedicated Q × aweight variance path instead of reusing the generic aweight meat. (nber.org)
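Why the weight-type distinction matters can be seen on a toy weighted mean: a pweight-style robust meat (weights squared inside the scores) and an aweight-style classical meat (weights entering once, as precision scaling) give different variances for the same composed weights. The exact formulas in diff_diff/survey.py may differ; this sketch only illustrates the two conventions, using assumed textbook forms:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
y = rng.normal(size=n)
q = rng.uniform(0.2, 1.0, size=n)  # ratio-valued corrective Q-weights
a = rng.uniform(0.5, 2.0, size=n)  # precision-style (aweight) weights
w = q * a                          # composed weight
wn = w * n / w.sum()               # normalize to mean 1 (aweight convention)

mu = np.sum(wn * y) / np.sum(wn)   # weighted mean (simplest "regression")
e = y - mu

# pweight-style robust meat: weights enter the squared scores.
var_p = np.sum((wn * e) ** 2) / np.sum(wn) ** 2

# aweight-style classical meat: weights enter once, as precision scaling.
sigma2 = np.sum(wn * e ** 2) / (n - 1)
var_a = sigma2 / np.sum(wn)

# The two rules disagree whenever the composed weights vary, which is
# the review's point: Q x aweight cannot be treated as a pure aweight.
print(not np.isclose(var_p, var_a, rtol=1e-2))  # True
```

Routing the composed weight through the wrong branch thus changes the reported variance without any derivation backing the result.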

Code Quality

No findings.

Performance

No findings.

Maintainability

No new findings beyond the already tracked survey-helper cleanup in TODO.md:L58-L59.

Tech Debt

  • Severity P3. Impact: the pre-existing ContinuousDiD anticipation limitation and the cross-estimator survey-helper cleanup are now explicitly tracked in TODO.md:L57-L59, so they are mitigated under the project’s deferred-work policy. Concrete fix: none required for approval in this PR.

Security

No findings.

Documentation/Tests

Path to Approval

  1. In diff_diff/stacked_did.py:L246-L256 and diff_diff/stacked_did.py:L382-L448, stop accepting SurveyDesign(weight_type="aweight") unless you add a separately derived composed-weight variance formula. The minimal safe fix is to reject aweight, like fweight.
  2. Update docs/methodology/REGISTRY.md:L1023-L1024, docs/survey-roadmap.md:L21, and the StackedDiD help/error text so the documented support matrix matches the implementation.
  3. Add a StackedDiD survey test for aweight: either a rejection test or a validated numeric SE target for the chosen composed-weight method.

…om PR #226 review (round 20)

Q-weight composition changes inverse-variance weight semantics just
as it breaks frequency-weight semantics. Update guard, docs, and tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e33ed3b1fe14b7867b9766522d745869650a72f1


Overall Assessment

✅ Looks good

Static re-review only. I could not execute the test suite in this sandbox because numpy and pytest are unavailable.

Executive Summary

  • The prior StackedDiD blocker appears resolved: the PR now rejects weight_type="aweight" and weight_type="fweight" in code, mirrors that restriction in the methodology docs/roadmap, and adds explicit rejection tests in diff_diff/stacked_did.py:L246, docs/methodology/REGISTRY.md:L1023, docs/survey-roadmap.md:L21, tests/test_survey_phase3.py:L332, and tests/test_survey_phase3.py:L346.
  • I did not find any unmitigated P0/P1 issues in the changed survey-support paths.
  • The affected estimator surface matches the intended source-method families: corrective-weight stacked DiD, continuous-treatment DiD, interaction-weighted Sun-Abraham event studies, EIF-based EfficientDiD, and tailored TripleDifference decomposition. (nber.org)
  • The remaining issues are non-blocking: one SunAbraham registry wording drift, plus pre-existing ContinuousDiD/helper cleanup items that are now explicitly tracked in TODO.md.
  • This assessment is based on static review of the diff and surrounding code only.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3 (tracked in TODO.md). Impact: the new survey support still leaves panel-to-unit survey collapsing and survey-metadata formatting split across multiple estimators/results classes, but the PR explicitly tracks both follow-ups in TODO.md:L58 and TODO.md:L59. Concrete fix: none required for approval; handle in the tracked helper-extraction cleanup.

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: ContinuousDiD event-study aggregation still does not filter by anticipation, but the PR now records that as pre-existing deferred work in TODO.md:L57. Concrete fix: none required for approval in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the SunAbraham registry text still says aggregated event-study and overall ATT p-values use the normal distribution, but the new survey path now routes both through survey-df safe_inference() in diff_diff/sun_abraham.py:L637, diff_diff/sun_abraham.py:L645, diff_diff/sun_abraham.py:L671, and diff_diff/sun_abraham.py:L966. That makes the survey-mode behavior sensible, but the registry is now ambiguous/stale at docs/methodology/REGISTRY.md:L733. Concrete fix: add a short note that non-survey aggregate inference remains normal-approximation, while survey-mode aggregate inference uses df = n_PSU - n_strata.
  • No additional test findings from the diff itself. The new survey tests materially improve coverage for the prior StackedDiD issue and for the new NotImplementedError guards.
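The survey-mode degrees of freedom mentioned above (df = n_PSU - n_strata) can be sketched as a standalone helper (hypothetical name; the repository's actual implementation lives in the survey core):

```python
import numpy as np

def survey_design_df(strata, psu):
    """Design degrees of freedom: (number of PSUs) - (number of strata).

    Matches the convention the review attributes to the survey path,
    used for t-based inference instead of the normal approximation.
    """
    strata = np.asarray(strata)
    psu = np.asarray(psu)
    n_strata = len(np.unique(strata))
    # PSU labels may repeat across strata, so count unique pairs.
    n_psu = len({(s, p) for s, p in zip(strata, psu)})
    return n_psu - n_strata

# 2 strata with 3 and 2 PSUs -> df = 5 - 2 = 3
strata = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
psu    = [1, 1, 2, 2, 3, 3, 1, 1, 2, 2]
print(survey_design_df(strata, psu))  # 3
```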

@igerber igerber merged commit 15985c3 into main Mar 22, 2026
14 checks passed
@igerber igerber deleted the survey-enhancements branch March 22, 2026 14:58