
Phase 3: Survey design support for OLS-based standalone estimators#226

Merged
igerber merged 24 commits into main from survey-enhancements
Mar 22, 2026

Conversation

@igerber
Owner

igerber commented Mar 21, 2026

Summary

  • Add survey_design parameter to fit() for 6 OLS-based standalone estimators: StackedDiD, SunAbraham, BaconDecomposition, TripleDifference (reg method), ContinuousDiD, and EfficientDiD
  • Implement weighted estimation with Taylor Series Linearization (TSL) variance using strata, PSU, and FPC for design-based inference
  • Add survey_metadata field to all 6 results classes with summary() and to_dict() support
  • Defer TripleDifference IPW/DR and bootstrap+survey combinations with NotImplementedError guards (pending Phase 5: weighted solve_logit())
  • Update survey roadmap with Phase 3 completion status and explicit deferred-work table
  • Add REGISTRY.md survey documentation for all 6 estimators
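For context on the TSL machinery these bullets reference: the design-based variance is the stratum-by-stratum, between-PSU variability of weighted score totals, with an optional finite population correction. The sketch below is purely illustrative — `tsl_variance` and its signature are invented here and are not the library's `compute_survey_vcov` API:

```python
import numpy as np

def tsl_variance(scores, weights, strata, psu, fpc=None):
    """Taylor Series Linearization variance of a weighted total of `scores`.

    scores  : per-observation linearization variables (influence scores)
    weights : survey weights
    strata  : stratum label per observation
    psu     : primary sampling unit label per observation
    fpc     : optional finite population correction factor per stratum,
              passed here as a dict {stratum: fpc}; omitted -> no FPC
    """
    z = np.asarray(scores, float) * np.asarray(weights, float)
    strata = np.asarray(strata)
    psu = np.asarray(psu)
    var = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        # PSU totals of the weighted scores within this stratum
        psus = np.unique(psu[in_h])
        totals = np.array([z[in_h & (psu == p)].sum() for p in psus])
        n_h = len(psus)
        if n_h < 2:
            continue  # single-PSU stratum: no between-PSU variance estimable
        f_h = fpc.get(h, 0.0) if fpc else 0.0
        var += (1.0 - f_h) * n_h / (n_h - 1) * ((totals - totals.mean()) ** 2).sum()
    return var
```

With one stratum and each observation its own PSU, this reduces to the familiar n/(n−1) scaled sum of squared deviations of the weighted scores.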

Estimator-specific details

| Estimator | Survey Support | Key Implementation |
| --- | --- | --- |
| StackedDiD | Full | Q-weight × survey-weight multiplicative composition; TSL vcov on composed weights |
| SunAbraham | Full | Survey weights in LinearRegression + weighted within-transform |
| BaconDecomposition | Diagnostic | Weighted cell means, within-transform, group shares (no inference) |
| TripleDifference | Reg only | Weighted per-cell OLS + TSL on combined influence functions |
| ContinuousDiD | Analytical | Weighted B-spline OLS + TSL on influence functions |
| EfficientDiD | Analytical | Weighted means/covariances in Omega* + TSL on EIF scores |
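The StackedDiD row's "Q-weight × survey-weight multiplicative composition" amounts to forming per-row composed weights and running WLS on them. A minimal sketch under assumed names (`q_w` and `svy_w` are illustrative arrays, not the estimator's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)

q_w = rng.uniform(0.5, 1.0, size=n)    # stack/event weights (illustrative)
svy_w = rng.uniform(1.0, 3.0, size=n)  # survey design weights (illustrative)

w = q_w * svy_w                        # multiplicative composition
# WLS via sqrt-weight scaling; the TSL vcov would then run on these
# same composed weights.
sw = np.sqrt(w)
beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```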

Methodology references (required if estimator / math changes)

  • Method name(s): Taylor Series Linearization (TSL) variance for survey-weighted DiD estimators
  • Paper / source link(s): Implemented per Phases 1-2 infrastructure from PR Add survey data support (Phases 1-2) #218
  • Any intentional deviations from the source (and why):
    • Note: Bacon decomposition with survey weights is diagnostic; exact-sum guarantee is approximate (Goodman-Bacon 2021 theorem derived for unweighted TWFE)
    • Note: TripleDifference IPW/DR with survey weights deferred until weighted solve_logit() (Phase 5)
    • Note: Bootstrap + survey interaction deferred for SunAbraham, ContinuousDiD, EfficientDiD (Phase 5)

Validation

  • Tests added/updated: tests/test_survey_phase3.py (31 new tests covering smoke, uniform-weight invariance, scale invariance, metadata, summary output, NotImplementedError guards)
  • All 2015 existing tests pass (full suite, no regressions)
  • All 453 affected estimator + survey tests pass

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⚠️ Needs changes

I cross-checked the changed survey paths against the Methodology Registry and the estimator implementations. The documented Phase 5 deferrals and documented Bacon diagnostic deviation are fine; the blockers below are all undocumented survey-path mismatches or missing assumption checks.

Executive Summary

  • Affected methods: ContinuousDiD, EfficientDiD, TripleDifference, and StackedDiD.
  • The documented deferrals in docs/methodology/REGISTRY.md:L497-L498, docs/methodology/REGISTRY.md:L673-L674, docs/methodology/REGISTRY.md:L1234-L1235, and docs/methodology/REGISTRY.md:L1016-L1017 are properly documented/guarded and are not blockers.
  • EfficientDiD’s survey SE helper misapplies TSL to unit-level EIFs by expanding them back to panel rows, which breaks the documented weights-only / stratified-no-PSU survey paths.
  • ContinuousDiD’s survey support is partial: aggregate="eventstudy" still uses sample-count aggregation and non-survey analytical SEs, and the unit-level survey collapse silently assumes within-unit survey fields are constant.
  • TripleDifference computes survey SEs but still uses ordinary sample df for p-values/CIs, and StackedDiD crashes on valid no-weight SurveyDesign inputs.

Methodology

  • Severity: P1. Impact: EfficientDiD’s survey SE helper repeats each unit-level EIF across every period and then calls compute_survey_vcov on the panel-level resolved design. That is only harmless when PSUs are explicit and repeated across time; for the documented weights-only / stratified-no-PSU paths it treats each period copy as a separate implicit PSU, so the returned survey SEs are too small. Concrete fix: collapse the survey design to one row per unit before TSL, as ContinuousDiD already does, and use that unit-level design for both ATT(g,t) and aggregated SEs. Locations: diff_diff/efficient_did.py:L476-L484, diff_diff/efficient_did.py:L644-L675, docs/methodology/REGISTRY.md:L673-L674, diff_diff/continuous_did.py:L1047-L1099.

  • Severity: P1. Impact: ContinuousDiD accepts aggregate="eventstudy" with survey_design, but that path still aggregates by raw n_treated counts and computes analytical event-study SEs with sqrt(sum(IF^2)) instead of TSL. So survey-weighted event-study estimates/inference are not the documented estimator. Concrete fix: either make event-study aggregation use survey-weighted treated mass plus unit-level compute_survey_vcov, or explicitly reject and document this combination instead of silently returning sample-weighted event-study results. Locations: diff_diff/continuous_did.py:L385-L388, diff_diff/continuous_did.py:L933-L962, diff_diff/continuous_did.py:L519-L564, docs/methodology/REGISTRY.md:L497-L498.

  • Severity: P1. Impact: ContinuousDiD and EfficientDiD both collapse observation-level survey information to the first row per unit without validating the assumption that those fields are constant within unit. If panel weights (or other design columns) vary by wave, the code silently computes arbitrary weighted means/covariances/SEs. Concrete fix: validate within-unit constancy for every supplied survey field (weights, strata, psu, fpc) before collapsing to unit level, and raise otherwise. Locations: diff_diff/continuous_did.py:L669-L682, diff_diff/efficient_did.py:L320-L336.

  • Severity: P1. Impact: TripleDifference computes the survey SE via TSL, then immediately falls back to df = n_obs - 8 - p for p-values/CIs. The survey registry says survey t-inference should use df = n_PSU - n_strata, so the current code returns wrong survey p-values and confidence intervals. Concrete fix: when survey_design is active, use resolved_survey.df_survey / survey_metadata.df_survey with the same nonpositive-df fallback logic used in LinearRegression. Locations: diff_diff/triple_diff.py:L579-L585, docs/methodology/REGISTRY.md:L1912-L1918.
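The collapse-to-unit-level fix suggested for EfficientDiD above can be sketched with pandas; the frame and column names here are hypothetical, not project code:

```python
import pandas as pd

panel = pd.DataFrame({
    "unit":    [1, 1, 2, 2, 3, 3],
    "period":  [0, 1, 0, 1, 0, 1],
    "weight":  [2.0, 2.0, 1.5, 1.5, 1.0, 1.0],
    "stratum": ["a", "a", "a", "a", "b", "b"],
    "psu":     [10, 10, 11, 11, 20, 20],
})

# One survey row per unit, in sorted-unit order, so the design arrays
# align with unit-level influence functions rather than panel rows.
unit_design = (panel.sort_values("unit")
                    .groupby("unit", as_index=False)
                    .first())
```

Feeding `unit_design` (rather than the panel-row design) into the TSL step avoids treating each period copy of a unit as a separate implicit PSU.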
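The within-unit constancy validation suggested for ContinuousDiD/EfficientDiD could look roughly like this (`check_unit_constant` is a made-up helper, not the library's API):

```python
import pandas as pd

def check_unit_constant(df, unit_col, survey_cols):
    """Raise if any survey design column varies within a unit."""
    for col in survey_cols:
        n_distinct = df.groupby(unit_col)[col].nunique(dropna=False)
        bad = n_distinct[n_distinct > 1]
        if not bad.empty:
            raise ValueError(
                f"survey column {col!r} varies within units {list(bad.index)}; "
                "unit-level collapse would be arbitrary"
            )
```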
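The survey degrees-of-freedom rule cited for TripleDifference (df = n_PSU − n_strata, with a model-based fallback when that is nonpositive) is simple to state in code; the function name and fallback floor are illustrative:

```python
def survey_df(n_psu, n_strata, n_obs, n_params):
    """Design-based degrees of freedom with the usual fallback.

    Survey t-inference uses df = n_PSU - n_strata; when that is not
    positive (degenerate designs), fall back to the model-based
    df = n_obs - n_params, floored at 1.
    """
    df = n_psu - n_strata
    if df <= 0:
        df = max(n_obs - n_params, 1)
    return df
```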

Code Quality

  • Severity: P1. Impact: StackedDiD assumes every SurveyDesign has a weights column. Valid strata/PSU/FPC-only designs will hit stacked_df[None] both when composing weights and when recomputing metadata, so this new public parameter errors on a supported design shape. Concrete fix: if survey_design.weights is None, use all-ones stacked weights/raw weights instead of indexing a missing column, and add a no-weight survey smoke test. Locations: diff_diff/stacked_did.py:L369-L375, diff_diff/stacked_did.py:L433-L435.
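The no-weight guard suggested for StackedDiD amounts to substituting unit weights when the design carries no weight column. A hedged sketch (`resolve_weights` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def resolve_weights(df, weight_col):
    """Survey weights for a design; weight-less designs
    (strata/PSU/FPC-only) get uniform ones instead of an
    errant df[None] lookup."""
    if weight_col is None:
        return np.ones(len(df))
    return df[weight_col].to_numpy(dtype=float)
```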

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings in the reviewed diff.

Tech Debt

  • No existing TODO.md entry covers the P1 items above; the PR’s documented Phase 5 deferrals are the only survey limitations I would treat as mitigated. See TODO.md:L44-L56.

Security

  • No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • Static review only: I did not rerun the suite here because the shell Python in this environment cannot import the project dependencies (numpy was unavailable).

  • Severity: P3. Impact: tests/test_survey_phase3.py only covers weight-bearing survey designs for StackedDiD/EfficientDiD and only smoke-tests ContinuousDiD; it never exercises ContinuousDiD(..., aggregate="eventstudy", survey_design=...), SurveyDesign without weights, or rejection of time-varying within-unit survey fields. That gap is why the regressions above can slip through. Concrete fix: add targeted regression tests for those combinations. Locations: tests/test_survey_phase3.py:L248-L329, tests/test_survey_phase3.py:L503-L553, tests/test_survey_phase3.py:L561-L625, tests/test_survey_phase3.py:L84-L108.

Path to Approval

  1. Replace EfficientDiD’s panel-level EIF expansion with a unit-level ResolvedSurveyDesign TSL path, and add weights-only + stratified-no-PSU regression tests.
  2. Make ContinuousDiD’s survey event-study aggregation/inference fully survey-aware, or reject/document that combination instead of silently using sample-weighted event-study logic.
  3. Add explicit within-unit validation for survey weights/strata/psu/fpc in the unit-level survey estimators (at least ContinuousDiD and EfficientDiD).
  4. Use survey degrees of freedom in TripleDifference survey inference.
  5. Make StackedDiD handle SurveyDesign objects without a weight column and add a dedicated no-weight survey test.

igerber added a commit that referenced this pull request Mar 21, 2026
- EfficientDiD: collapse survey to unit-level before TSL (was expanding
  EIF to panel rows, making SEs too small for weights-only path)
- ContinuousDiD: survey-weighted event-study aggregation and TSL SEs
- ContinuousDiD/EfficientDiD: validate within-unit constancy for survey
  columns before collapsing to unit level
- TripleDifference: use survey df (n_PSU - n_strata) for p-values/CIs
  instead of sample df (n_obs - 8)
- StackedDiD: handle SurveyDesign without weights column (use ones)
- Pass survey df to safe_inference() in all survey-aware estimators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 9f48bc3bb96c8d6e4a00641f019fc88771086dda


Overall Assessment
⚠️ Needs changes

Executive Summary

  • Re-review status: the prior blockers on unit-constant survey validation, EfficientDiD’s panel-row TSL collapse, TripleDifference’s switch to survey df, and StackedDiD’s no-weight survey crash appear addressed.
  • The documented Phase 5 deferrals and the Bacon diagnostic note in the Methodology Registry are properly recorded and are not blockers.
  • ContinuousDiD’s new survey analytical path is not mathematically consistent with compute_survey_vcov(): it feeds already 1/n-scaled IF objects into the TSL sandwich and keeps panel-row survey df after collapsing to unit level.
  • EfficientDiD’s collapsed survey-row index is not aligned to all_units, so unsorted input can attach the wrong weights/strata/PSUs/FPCs to unit-level EIFs.
  • [Newly identified] SunAbraham still aggregates survey-weighted cohort effects with raw observation counts, so overall/event-study effects do not reflect survey-weighted cohort composition.
  • [Newly identified] StackedDiD computes survey TSL SEs but still uses normal-reference inference instead of df = n_PSU - n_strata.

Methodology
Re-review status: the earlier P1s called out in the last review are largely resolved, but the survey paths below are still not methodologically correct.

  • Severity: P1. Impact: ContinuousDiD’s survey SE path uses IF arrays that are already 1/n_t/1/n_c scaled for the library’s unweighted sqrt(sum(IF_i^2)) convention, then passes them into compute_survey_vcov(), which applies another bread term. The code’s own comment says those IFs already contain the scaling, so the survey SEs/CIs/p-values for overall ATT/ACRT, ATT(d), ACRT(d), and event-study bins are not for the documented estimator; on top of that, the survey df still comes from the original panel-level resolve rather than the collapsed unit-level design actually used for TSL. Concrete fix: rebuild the survey linearization variables on the unscaled score/residual scale expected by compute_survey_vcov(), construct the unit-level ResolvedSurveyDesign once, and use that unit-level object’s df_survey for all survey inference. Refs: diff_diff/continuous_did.py:L513, diff_diff/continuous_did.py:L1097, diff_diff/continuous_did.py:L1129, diff_diff/continuous_did.py:L1183, diff_diff/survey.py:L531.

  • Severity: P1. Impact: EfficientDiD’s unit-level survey collapse is still inconsistent. all_units is sorted, but _unit_first_panel_row is built in first-appearance order, and those positions are later used to subset weights/strata/PSUs/FPCs for the unit-level TSL helper. If the input rows are not already sorted by unit, the survey design arrays are permuted relative to the EIF vector, yielding incorrect survey SEs. The same path also stores self._survey_df from the original panel-level resolve, so weights-only/stratified-no-PSU designs use panel-row df rather than unit-level df. Concrete fix: build the first-row mapping keyed by all_units, create one unit-level ResolvedSurveyDesign aligned to the EIF order, and take df_survey from that collapsed design. Refs: diff_diff/efficient_did.py:L209, diff_diff/efficient_did.py:L303, diff_diff/efficient_did.py:L669.

  • Severity: P1. Impact: [Newly identified] SunAbraham’s new survey_design path stops at the saturated regression. _compute_iw_effects() and _compute_overall_att() still aggregate with raw observation counts, so survey-weighted cohort effects are collapsed with unweighted cohort composition. That changes the estimand whenever weights vary across cohorts or event times. Concrete fix: thread survey weights into _compute_iw_effects() and _compute_overall_att() and replace .size() / len(...) with treated weighted mass (sum(w)) for w_{g,e} and post-period aggregation weights. Refs: diff_diff/sun_abraham.py:L892, diff_diff/sun_abraham.py:L970.

  • Severity: P1. Impact: [Newly identified] StackedDiD now computes survey TSL SEs, but it still calls safe_inference() without a survey df for event-study coefficients and the overall ATT. The survey registry says survey t-inference should use df = n_PSU - n_strata; current p-values/CIs will therefore be too liberal in small-PSU designs. Concrete fix: when survey_design is active, pass survey_metadata.df_survey into both safe_inference() calls, with the same nonpositive-df fallback used in LinearRegression. Refs: diff_diff/stacked_did.py:L462, diff_diff/stacked_did.py:L492, docs/methodology/REGISTRY.md:L1912.
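The count-versus-weighted-mass distinction flagged for SunAbraham can be shown on a toy cohort table: aggregating cell effects by raw observation counts versus by survey-weighted mass gives different overall effects whenever average weights vary across cells. All values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical cohort/event-time cells with survey weights; the
# aggregation shares w_{g,e} should use treated weighted mass,
# not raw counts.
cells = pd.DataFrame({
    "cohort": [2000, 2000, 2001],
    "effect": [1.0, 2.0, 4.0],
    "n":      [10, 10, 10],          # raw observation counts
    "w_mass": [30.0, 10.0, 10.0],    # sum of survey weights in the cell
})

count_avg = np.average(cells["effect"], weights=cells["n"])       # unweighted composition
survey_avg = np.average(cells["effect"], weights=cells["w_mass"]) # survey-weighted composition
```

With equal counts but unequal weighted mass, the two aggregates diverge (7/3 versus 1.8 here), which is exactly the estimand change the review describes.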

Code Quality
No additional material findings beyond the methodology blockers above.

Performance
No material findings in the reviewed diff.

Maintainability
No material findings in the reviewed diff.

Tech Debt
No relevant TODO.md entry mitigates the P1 items above, and these would not become mergeable via TODO tracking anyway because they are correctness/inference bugs.

Security
No security or secret-handling issues found in the reviewed diff.

Documentation/Tests
Static review only: I could not run the test suite here because pytest is not installed in this environment, and the shell Python also lacks numpy.

Path to Approval

  1. Rework ContinuousDiD’s survey analytical path so the objects passed to compute_survey_vcov() are unscaled linearization variables, then source survey df from the collapsed unit-level design used for TSL.
  2. Rebuild EfficientDiD’s unit_first_panel_row mapping in all_units order, construct a single aligned unit-level ResolvedSurveyDesign, and use that same object for both SE computation and survey df.
  3. Thread survey weights through SunAbraham’s _compute_iw_effects() and _compute_overall_att() so dynamic and overall effects aggregate by weighted treated mass rather than raw counts.
  4. Pass survey_metadata.df_survey into StackedDiD’s event-study and overall safe_inference() calls when survey inference is active, with the project-standard fallback for nonpositive df.
  5. Add regression tests for: a shuffled-unit EfficientDiD input, a SunAbraham dataset where survey weights materially change cohort composition, a collapsed-panel survey-df check for ContinuousDiD/EfficientDiD, and a small-PSU StackedDiD case where survey-df inference differs from normal-reference inference.

igerber added a commit that referenced this pull request Mar 21, 2026
- ContinuousDiD: rescale IFs by n_units before compute_survey_vcov to
  avoid double-counting 1/n bread; use unit-level df_survey
- EfficientDiD: align unit_first_panel_row to sorted all_units order;
  build unit-level ResolvedSurveyDesign once in fit(); use unit-level df
- SunAbraham: thread survey weights into _compute_iw_effects and
  _compute_overall_att for survey-weighted cohort aggregation
- StackedDiD: pass survey df to safe_inference for event-study and
  overall ATT p-values/CIs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bb3b64a3e884975c5e19af686ee39764e6276304


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the earlier blockers on EfficientDiD unit-order alignment, StackedDiD survey-df inference, and SunAbraham survey-weighted aggregation appear addressed in diff_diff/efficient_did.py:L307, diff_diff/stacked_did.py:L462, and diff_diff/sun_abraham.py:L907.
  • The documented Phase 5 deferrals and documented methodology notes are not blockers here, including the survey deferred-work table and Bacon’s diagnostic-only survey note in docs/survey-roadmap.md:L28, docs/methodology/REGISTRY.md:L1235, and docs/methodology/REGISTRY.md:L1572.
  • ContinuousDiD’s new survey analytical path still linearizes the old count-based formulas rather than the new weighted estimator, so its reported survey SEs, CIs, and p-values are not the documented TSL for the weighted B-spline estimator.
  • TripleDifference’s survey RA path still re-expands the three pairwise influence functions with unweighted n/n_j factors instead of weighted subset mass, so the combined survey TSL is not for the weighted DDD estimand when subgroup average weights differ.
  • The added tests are mostly smoke/guard coverage; they do not pin either remaining weighted-linearization bug, and I could not execute the suite here because pytest, numpy, and pandas are unavailable.

Methodology

Code Quality
No material findings beyond the methodology blockers above.

Performance
No material findings in the reviewed diff.

Maintainability
No material findings in the reviewed diff.

Tech Debt
The remaining P1 issues are not tracked under TODO.md:L44, and they would not be mitigated by TODO tracking even if they were because undocumented methodology mismatches and incorrect survey SE paths remain approval blockers.

Security
No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • Severity: P3. Impact: tests/test_survey_phase3.py exercises smoke cases, guardrails, and finite outputs for the new survey paths, but it does not contain an oracle-style regression that would fail on the two remaining weighted-linearization bugs. That means this PR can still pass its new coverage while shipping incorrect survey inference in ContinuousDiD and TripleDifference. Concrete fix: add one ContinuousDiD fixture with unequal unit weights and a direct weighted-linearization oracle for analytical survey SEs, plus one TripleDifference fixture where mean weights differ across {3,4}, {2,4}, and {1,4} so count-scaling and weighted-mass scaling produce different inference. Refs: tests/test_survey_phase3.py:L396, tests/test_survey_phase3.py:L503, tests/test_survey_phase3.py:L707.
  • Static review only: I could not run the test suite here because pytest, numpy, and pandas are not installed in this environment.

Path to Approval

  1. Rework ContinuousDiD’s survey analytical path so the objects passed to compute_survey_vcov() are score-scale linearization variables for the weighted control mean and weighted WLS estimator, using weighted treated/control masses and weighted treated derivative averages throughout overall, dose-response, and event-study inference.
  2. Rework TripleDifference’s survey RA combination step so each padded pairwise IF is placed on the full-sample weighted scale before TSL, replacing the current n/n_j count factors with weighted-mass scaling.
  3. Add targeted regression tests for both fixes: one weighted-ContinuousDiD analytical oracle and one weighted-TripleDifference case where the current count-based subset scaling measurably disagrees with weighted-mass scaling.

igerber added a commit that referenced this pull request Mar 21, 2026
…nce from PR #226 review (round 3)

- ContinuousDiD: store survey-weighted treated/control masses and
  weighted dpsi_bar in bootstrap_info; use weighted masses for p_1, p_0,
  n_total in IF construction so TSL linearizes the weighted estimator
- TripleDifference: use survey-weighted subgroup mass (sum(w_sub))
  instead of raw counts (n_sub) for pairwise IF combination weights
  w3, w2, w1 when survey design is active

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 11660240c427f80dc4af5c7ad172d4018361fdc1


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the earlier blockers on EfficientDiD unit-order alignment, StackedDiD survey-d.f. inference, and TripleDifference’s survey subset scaling appear addressed in diff_diff/efficient_did.py:L307-L343, diff_diff/stacked_did.py:L462-L505, and diff_diff/triple_diff.py:L1027-L1053.
  • ContinuousDiD still has an unmitigated P1 in the new survey analytical path: the weighted B-spline score is normalized by weighted treated/control mass downstream, but the stored WLS bread is still normalized by the raw treated count. That leaves the survey linearization for ACRT_glob, ATT(d), and ACRT(d) mis-scaled whenever average survey weights differ across treated/control subsets.
  • The documented Phase 5 deferrals and documented methodology notes in the registry/roadmap are not blockers here, including the survey support notes for StackedDiD, TripleDifference, Bacon, EfficientDiD, and ContinuousDiD.
  • ContinuousDiD and EfficientDiD now expose survey_metadata, but for these panel estimators it is still computed from panel-level arrays even though inference is run on unit-collapsed survey designs, so the reported effective_n, n_psu, and df_survey can be misleading.
  • tests/test_survey_phase3.py adds useful smoke and guard coverage, but it still does not pin the remaining ContinuousDiD linearization bug or the panel-metadata mismatch. Static review only: pytest, numpy, and pandas are unavailable in this environment.

Methodology

  • Severity: P1. Impact: ContinuousDiD’s survey analytical path still mixes two different normalizations for the weighted spline regression. The stored bread is built as (Psi' W Psi / n_treated_raw)^-1, but the downstream beta perturbations divide by weighted treated/control mass after n_t/n_c are replaced with w_treated/w_control. Because the bread and score pieces are no longer on the same scale, the survey IF for the weighted WLS step is mis-scaled whenever subgroup-average weights are not 1. This affects survey SEs, CIs, and p-values for ACRT_glob, ATT(d), and ACRT(d); ATT_glob/event-study are largely fixed by the new weighted-mass updates. Concrete fix: use a single normalization convention end-to-end for the weighted spline score. Either keep both bread and per-unit score terms on the raw-count normalization, or switch both to weighted-mass normalization; do not mix them. Refs: diff_diff/continuous_did.py:L949-L955, diff_diff/continuous_did.py:L1097-L1103, diff_diff/continuous_did.py:L1130-L1147, docs/methodology/REGISTRY.md:L492-L503.
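The mixed-normalization problem is easy to demonstrate in miniature: in a WLS influence-function construction, any normalization constant cancels only when the bread and the per-unit scores share it. The check below uses a generic WLS, not the estimator's spline machinery, and all names are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0.5, 3.0, size=n)
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
u = y - X @ beta

def influence(c):
    """Per-unit WLS influence functions with normalization constant c
    applied to BOTH the bread and the per-unit scores, so c cancels."""
    bread = np.linalg.inv(X.T @ (w[:, None] * X) / c)
    scores = (w * u)[:, None] * X / c
    return scores @ bread.T

if_n = influence(n)           # raw-count normalization
if_mass = influence(w.sum())  # weighted-mass normalization
# Consistent normalization: the two IF matrices are identical. Using
# raw-count bread with weighted-mass scores would scale every IF by
# n / w.sum(), which is the mis-scaling described above.
```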

Code Quality

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings beyond the metadata inconsistency above.

Tech Debt

  • The P1 methodology issue above is not tracked under TODO.md:L34-L45, so it is not mitigated by the project’s deferred-work policy.

Security

  • No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • Severity: P3. Impact: the new survey Phase 3 suite mostly covers smoke paths, invariance, summaries, and guardrails. It does not include an unequal-weight analytical oracle that would fail on the remaining ContinuousDiD beta-linearization mismatch, and it does not assert that panel-estimator survey_metadata is unit-level rather than panel-level. Concrete fix: add one unequal-weight ContinuousDiD fixture that checks analytical ACRT_glob/ATT(d)/ACRT(d) SEs against directly assembled weighted linearization scores, and add one panel-metadata assertion for unit-level effective_n/df_survey. Refs: tests/test_survey_phase3.py:L503-L553, tests/test_survey_phase3.py:L707-L783.
  • Static review only: I could not run the tests here because pytest, numpy, and pandas are not installed in this environment.

Path to Approval

  1. In ContinuousDiD, make the weighted spline linearization internally consistent by using the same normalization in both the stored bread and the downstream per-unit beta perturbations. The current raw-count bread plus weighted-mass score denominators need to be unified.
  2. Add a targeted unequal-weight regression test for ContinuousDiD analytical survey inference that would fail under the current mixed normalization, covering at least ACRT_glob, ATT(d), and ACRT(d).

igerber added a commit that referenced this pull request Mar 21, 2026
… from PR #226 review (round 4)

- ContinuousDiD: normalize WLS bread by weighted treated mass (not raw
  count) for consistency with downstream IF score denominators; fixes
  ACRT_glob/ATT(d)/ACRT(d) survey SEs when subgroup-average weights differ
- ContinuousDiD/EfficientDiD: recompute survey_metadata from unit-level
  ResolvedSurveyDesign so reported effective_n/n_psu/df_survey match
  the inference actually run (not the panel-level overcount)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 67a87e3366df27f11c9ab93c1f1975919cb98dd6


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The main blockers from the prior review appear resolved: ContinuousDiD now uses a consistent weighted-mass normalization in its survey analytical path, EfficientDiD now aligns unit order and reports unit-level survey metadata, StackedDiD now applies survey d.f. in inference, and TripleDifference now rescales the combined IF using survey-weighted subgroup mass. Refs: diff_diff/continuous_did.py:L959, diff_diff/continuous_did.py:L1111, diff_diff/efficient_did.py:L307, diff_diff/efficient_did.py:L679, diff_diff/stacked_did.py:L462, diff_diff/triple_diff.py:L1027
  • Remaining blocker 1: EfficientDiD’s survey path rebuilds unit_level_weights from the raw column and renormalizes it unconditionally, while the TSL path uses resolved_survey.weights. For SurveyDesign(weight_type="fweight"), estimation and inference now use different weight semantics, which conflicts with the project’s documented survey-weight handling.
  • Remaining blocker 2: the new exact-weight Bacon path indexes a NumPy weight array with DataFrame labels (w[df_22.index]), so non-RangeIndex inputs can error or silently misalign weights. This regression affects weights="exact" even without survey weights.
  • The Phase 5 deferrals and the Bacon diagnostic note are properly documented in the registry/roadmap and are not blockers.
  • Static review only: I could not run pytest here because the sandbox Python environment is missing pandas/numpy.

Methodology

  • Severity: P1. Impact: EfficientDiD’s new survey implementation is internally inconsistent for supported SurveyDesign(weight_type="fweight") inputs. The estimation path uses a hand-built, always-renormalized unit_level_weights, but the TSL path uses resolved_survey.weights[row_idx], which preserves raw fweight semantics. That means Omega*, generated outcomes, cohort fractions, and EIF construction can be based on different weights than the variance calculation, producing incorrect efficient weights, ATTs, and SEs for frequency-weighted survey designs. This is not covered by any documented registry deviation. Concrete fix: derive the unit-level estimation weights directly from resolved_survey.weights[row_idx], or at minimum branch on survey_weight_type and skip renormalization for fweight; then add a regression test with unequal integer fweights. Refs: diff_diff/efficient_did.py:L323, diff_diff/efficient_did.py:L374, diff_diff/survey.py:L117, docs/methodology/REGISTRY.md:L1846
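The fweight concern can be illustrated with a weighted mean: frequency weights mean "this row stands for w copies", so point estimates survive renormalization but any variance or df formula keyed on sum(w) does not. Toy numbers, not project code:

```python
import numpy as np

y = np.array([1.0, 2.0, 5.0])
fw = np.array([3, 1, 2])  # frequency weights: each row stands for fw copies

expanded = np.repeat(y, fw)

# Point estimate: the fweighted mean equals the mean of the expanded data.
wmean = np.average(y, weights=fw)

# Under fweight semantics the implied sample size is n = sum(fw), and the
# variance of the mean uses it directly...
n_f = fw.sum()
var_f = ((fw * (y - wmean) ** 2).sum() / (n_f - 1)) / n_f

# ...renormalizing to mean 1 keeps wmean but changes sum(w), so any
# variance/df formula keyed on sum(w) silently shifts.
fw_norm = fw / fw.mean()
```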

Code Quality

  • Severity: P1. Impact: the new weighted exact-Bacon path uses w[df_22.index] inside _recompute_exact_weights. Because w is a position-based NumPy array and df_22.index is label-based, filtered or user-supplied non-RangeIndex DataFrames can raise out-of-bounds errors or pull the wrong weights. This is a production edge-case regression in a changed code path. Concrete fix: either reset the working index once in fit() or align weights with a Series keyed on df.index before subsetting. Add a regression test using a non-default index and weights="exact". Refs: diff_diff/bacon.py:L744
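The label-versus-position hazard in the w[df_22.index] pattern is reproducible in a few lines; the frame and weights below are invented for illustration:

```python
import numpy as np
import pandas as pd

w = np.array([10.0, 20.0, 30.0, 40.0])                      # one weight per row, by position
df = pd.DataFrame({"g": [0, 1, 0, 1]}, index=[4, 5, 6, 7])  # e.g. a filtered frame whose
                                                            # index was never reset
sub = df[df["g"] == 1]  # labels 5 and 7

# Bug pattern: numpy indexes by POSITION, so labels 5 and 7 are out of
# bounds here (and silently misaligned when they happen to be in bounds).
try:
    bad = w[sub.index]
except IndexError:
    bad = None

# Safe pattern: key the weights on the frame's labels first.
w_s = pd.Series(w, index=df.index)
good = w_s.loc[sub.index].to_numpy()
```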

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings beyond the blockers above.

Tech Debt

Security

  • No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • Severity: P3. Impact: the new suite adds useful smoke, scale-invariance, metadata, and guard coverage, but it does not pin either blocker above: there is no EfficientDiD test exercising SurveyDesign(weight_type="fweight"), and the Bacon tests do not cover weights="exact" on a non-default DataFrame index. Those gaps make both regressions easy to ship unnoticed. Concrete fix: add one EfficientDiD regression with unequal integer frequency weights and one Bacon exact-weight regression using a filtered/non-reset index. Refs: tests/test_survey_phase3.py:L337, tests/test_survey_phase3.py:L644
  • Static review only: I could not execute the test suite in this environment because the sandbox Python runtime does not have pandas/numpy installed.

Path to Approval

  1. In EfficientDiD, make the unit-level survey weight used for Omega*/generated outcomes/EIF construction come from the resolved survey object, not a separately renormalized raw column path. Add a regression test covering SurveyDesign(weight_type="fweight") with unequal integer weights.
  2. In BaconDecomposition, remove label-based indexing from _recompute_exact_weights so exact weights work on non-default DataFrame indexes. Add a regression test using weights="exact" on a filtered DataFrame whose index has not been reset.

igerber added a commit that referenced this pull request Mar 21, 2026
…ety from PR #226 review (round 5)

- EfficientDiD: use resolved survey weights directly for unit-level
  estimation (Omega*, EIF, cohort fractions) instead of separately
  renormalized raw column, ensuring fweight/aweight consistency with TSL
- BaconDecomposition: store survey weights as DataFrame column for safe
  label-based subsetting in _recompute_exact_weights, preventing
  out-of-bounds errors on non-default DataFrame indexes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d26d498718eb53504e68677465d1d9562215c116


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The two prior blockers called out in the last review appear fixed in code: EfficientDiD now derives unit-level survey weights from the resolved survey object, and Bacon exact weights now subset via a DataFrame-aligned weight column instead of NumPy label indexing. Locations: diff_diff/efficient_did.py:L323, diff_diff/bacon.py:L694
  • P1: ContinuousDiD’s new survey analytical SE path is mis-scaled for supported SurveyDesign(weight_type="fweight") inputs. The IF construction is normalized by weighted mass, but the TSL conversion hard-codes n_units, so overall, dose-response, and event-study SEs are wrong for frequency-weighted designs.
  • P1: BaconDecomposition’s new treated-vs-never weighted mean path lacks the empty-cell guard already present in the timing-comparison path. On some unbalanced/filtered panels, the PR now turns a warning-only path into a hard failure.
  • The Phase 5 survey deferrals and Bacon’s diagnostic exact-sum caveat are documented in the registry/roadmap, so they are non-blocking. Locations: docs/methodology/REGISTRY.md:L497, docs/methodology/REGISTRY.md:L673, docs/methodology/REGISTRY.md:L1234, docs/methodology/REGISTRY.md:L1571, docs/survey-roadmap.md:L35
  • Static review only: I could not run pytest here because the local environment is missing numpy, pandas, and pytest.

Methodology

No unmitigated registry/source-material mismatch stood out in EfficientDiD, SunAbraham, StackedDiD, or TripleDifference; the remaining methodology blocker is in ContinuousDiD.

Code Quality

  • P1 Impact: BaconDecomposition still advertises unbalanced panels as warning-only, but _compute_treated_vs_never() now always uses np.average(..., weights=...) without checking whether the treated/never pre/post slices are empty. That means a filtered or more severely unbalanced panel can now raise instead of skipping the invalid comparison, which is a production edge-case regression introduced by the weighted rewrite. Concrete fix: add the same four-cell emptiness guard already used in _compute_timing_comparison() and return None before averaging if any cell is empty. Locations: diff_diff/bacon.py:L492, diff_diff/bacon.py:L820, diff_diff/bacon.py:L934, tests/test_bacon.py:L580
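The suggested guard can be sketched as follows; `guarded_weighted_mean` is a hypothetical helper, not the library's API, but it mirrors the pattern of returning None for an empty cell instead of letting `np.average` raise:

```python
import numpy as np

def guarded_weighted_mean(values, weights):
    """Weighted mean, or None for an empty/zero-mass cell.

    np.average(..., weights=...) raises ZeroDivisionError when the
    weights sum to zero (including the empty-slice case), so the guard
    must run before averaging.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if values.size == 0 or weights.sum() == 0:
        return None
    return float(np.average(values, weights=weights))

# An empty treated/never pre- or post-period cell is skipped, not fatal.
empty_cell = guarded_weighted_mean([], [])
normal_cell = guarded_weighted_mean([1.0, 3.0], [1.0, 1.0])
```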

Performance

No material findings in the reviewed diff.

Maintainability

No material findings beyond the blockers above.

Tech Debt

Security

No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • P3 Impact: The new survey suite still exercises only happy-path pweight estimator flows for ContinuousDiD/EfficientDiD and only happy-path Bacon survey cases, so neither blocker above is pinned by regression coverage. Concrete fix: add one ContinuousDiD analytical survey test with unequal integer fweights, and one Bacon regression that filters an unbalanced panel until a treated-vs-never pre/post cell is empty and asserts the comparison is skipped rather than raising. Locations: tests/test_survey_phase3.py:L337, tests/test_survey_phase3.py:L561, tests/test_survey_phase3.py:L763, tests/test_bacon.py:L580
  • Static review only: I could not execute the test suite in this environment because numpy, pandas, and pytest are not installed.

Path to Approval

  1. In ContinuousDiD, replace the hard-coded n_units TSL rescaling with the unit-level total survey mass actually used by the weighted IF path, and apply the same fix to the event-study survey SE branch. Add an estimator-level regression test with unequal integer fweights.
  2. In BaconDecomposition, add a treated-vs-never empty-cell guard before the new weighted averages so unbalanced/filtered panels skip invalid comparisons instead of raising. Add a regression test that creates exactly that unbalanced-cell case.
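A small numpy illustration of why item 1's divisor must be the survey mass rather than the unit count (toy data; the actual TSL rescaling is not reproduced here):

```python
import numpy as np

# Frequency weights encode w_i copies of each row, so the relevant "n"
# for any 1/n normalization is the total weight mass, not the number of
# stored unit rows.
w = np.array([1, 3, 2], dtype=float)
y = np.array([2.0, 5.0, 4.0])

expanded = np.repeat(y, w.astype(int))   # the sample the weights encode
n_units = len(y)                         # 3
mass = float(w.sum())                    # 6.0 -- mass != n_units

weighted_mean = float(np.average(y, weights=w))
assert np.isclose(weighted_mean, expanded.mean())

# Dividing the weighted sum by n_units instead of mass mis-scales the
# result by mass / n_units (here 2x) -- the class of error the round-6
# fix removes from the TSL conversion.
wrong_mean = float((w * y).sum() / n_units)
```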

igerber added a commit that referenced this pull request Mar 21, 2026
…ll guard from PR #226 review (round 6)

- ContinuousDiD: rescale IFs by unit-level total survey mass
  (unit_resolved.weights.sum()) instead of hard-coded n_units, so TSL
  SEs are correct for fweight designs where mass != n_units
- BaconDecomposition: add empty-cell guard in _compute_treated_vs_never
  before np.average() to prevent crashes on unbalanced/filtered panels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3798bd4cd44fcaefe5327810d8676ec362a62d36


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • P1 Impact: SunAbraham’s survey implementation uses survey-weighted within-transformation and TSL vcov, but it does not carry the required survey degrees of freedom into the aggregate inference layer. The registry explicitly says survey inference should use df = n_PSU - n_strata; this code instead calls safe_inference without df for both overall ATT and interaction-weighted event-study aggregates, so overall_p_value, overall_conf_int, and each event_study_effects[e] are too optimistic whenever survey df is materially smaller than the normal limit. Concrete fix: thread survey_metadata.df_survey (or the regression object’s survey df) into both aggregate safe_inference calls, and preserve that df when storing overall/event-study inference. Locations: diff_diff/sun_abraham.py:L651-L663, diff_diff/sun_abraham.py:L947-L956, docs/methodology/REGISTRY.md:L745-L745, docs/methodology/REGISTRY.md:L1908-L1921
  • P2 [Newly identified] Impact: the new survey path in ContinuousDiD now computes df-aware analytical inference at fit time, but the unchanged DoseResponseCurve.to_dataframe() helper recomputes analytic t-stats/p-values with safe_inference(..., df=None). Direct calls to result.dose_response_att.to_dataframe() or result.dose_response_acrt.to_dataframe() therefore disagree with the estimator’s own survey-aware inference. Concrete fix: store analytic t-stats/p-values (or at least survey_df) on DoseResponseCurve during fit(), and stop recomputing them with a normal reference distribution in to_dataframe(). Locations: diff_diff/continuous_did.py:L512-L542, diff_diff/continuous_did_results.py:L49-L75
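The practical stakes of the df mismatch are easy to see directly. This sketch does not reproduce `safe_inference`; it just contrasts a normal reference (the `df=None` behavior) with a t reference at a small survey df:

```python
from scipy import stats

# The same t-statistic yields materially different p-values under a
# normal reference vs a t reference with few PSUs.
t_stat = 2.2
df_survey = 5                    # e.g. n_PSU - n_strata in a small design

p_normal = 2 * stats.norm.sf(abs(t_stat))         # df=None behavior
p_t = 2 * stats.t.sf(abs(t_stat), df=df_survey)   # df-aware behavior
# The normal reference is strictly anti-conservative here: p_normal < p_t.
```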

Code Quality

No material findings beyond the inference issues above.

Performance

No material findings in the reviewed diff.

Maintainability

No material findings in the reviewed diff.

Tech Debt

Security

No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • P3 Impact: the new survey suite exercises smoke/invariance/metadata paths, but it does not pin the two inference regressions above. There is no SunAbraham test asserting that stored overall/event-study p-values use df_survey, and no ContinuousDiD test covering direct DoseResponseCurve.to_dataframe() on a survey fit. Concrete fix: add one low-PSU SunAbraham regression that compares stored p-values/CIs against safe_inference(..., df=result.survey_metadata.df_survey), and one ContinuousDiD regression that calls dose_response_att.to_dataframe() on a survey result and checks its t/p output against the same df-aware calculation. Locations: tests/test_survey_phase3.py:L119-L230, tests/test_survey_phase3.py:L503-L545
  • Static review only: I could not execute the survey tests in this environment because numpy and pytest are unavailable.

Path to Approval

  1. In SunAbraham, pass survey df into aggregate inference so both overall_att and event-study outputs use safe_inference(..., df=survey_metadata.df_survey) when survey inference is active, and add a regression test with small survey df that would fail under the current normal-approximation behavior.

igerber added a commit that referenced this pull request Mar 21, 2026
…(round 7)

Thread survey_metadata.df_survey to safe_inference() for both overall
ATT and event-study effects, so p-values/CIs use t-distribution with
n_PSU - n_strata df when survey design is active.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 918b876a6f95c1d778545c7f197c8b516bec3011


Overall Assessment

Looks good

Executive Summary

Methodology

  • P2 Impact: direct result.dose_response_att.to_dataframe() / result.dose_response_acrt.to_dataframe() still recompute p_value with safe_inference(..., df=None), so survey-fit curves report normal-reference p-values even though fit-time analytical inference is built with df_survey. The stored CI arrays remain the df-aware ones; the inconsistency is in the helper’s recomputed p_value column. Concrete fix: store survey_df or precomputed analytical p_value on DoseResponseCurve and use it in to_dataframe() instead of recomputing with the normal reference. Locations: diff_diff/continuous_did_results.py:L49-L75, diff_diff/continuous_did.py:L524-L543, diff_diff/continuous_did.py:L658-L678

Code Quality

  • No material findings in the reviewed diff.

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings in the reviewed diff.

Tech Debt

Security

  • No security or secret-handling issues found in the reviewed diff.

Documentation/Tests

  • P3 Impact: the new survey suite exercises smoke/invariance/guardrails, but it still does not pin the remaining DoseResponseCurve.to_dataframe() survey p-value mismatch, and it also lacks a low-survey-df SunAbraham regression that would catch a future reversion of the now-fixed aggregate inference path. Concrete fix: add one survey-fit DoseResponseCurve.to_dataframe() assertion against safe_inference(..., df=result.survey_metadata.df_survey), and one small-PSU SunAbraham regression that checks stored overall/event-study inference against survey-df calculations. Locations: tests/test_survey_phase3.py:L116-L240, tests/test_survey_phase3.py:L503-L553, tests/test_survey_phase3.py:L726-L783

igerber added a commit that referenced this pull request Mar 21, 2026
…-values from PR #226 review (round 8)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7bf9c8fc216682f6c046535530dc2b3c2505fe6f


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • P1 Impact: In the Bacon exact-weight survey path, sample_share and the within-2x2 treated share are survey-weighted, but unit_share falls back to raw unit counts: unit_share = (n_k + n_l) / n_total_units. That is an undocumented mismatch against the registry/roadmap promise of “weighted group shares,” and it will silently distort the exact comparison weights whenever survey weights are not proportional to cohort sizes. Concrete fix: make the subsample-share term weight-consistent in _recompute_exact_weights() as well, or explicitly document a deliberate raw-count deviation in the methodology registry. See diff_diff/bacon.py:L747-L776, docs/methodology/REGISTRY.md:L1566-L1572, and docs/survey-roadmap.md:L19-L24.
  • P2 Impact: ContinuousDiD analytical survey fits use df_survey when constructing fit-time inference, but DoseResponseCurve.to_dataframe() still recomputes p_value with safe_inference(..., df=None). The helper therefore returns internally inconsistent inference columns for the same survey fit. Concrete fix: store survey_df or precomputed analytical t_stat/p_value on DoseResponseCurve and reuse them in to_dataframe(). See diff_diff/continuous_did.py:L512-L543, diff_diff/continuous_did.py:L658-L678, and diff_diff/continuous_did_results.py:L49-L75.
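The unit_share mismatch in the first bullet reproduces in miniature whenever survey weights are not proportional to cohort sizes (column names here are illustrative, not the library's):

```python
import numpy as np
import pandas as pd

# Raw-count vs survey-weighted cohort shares diverge as soon as weights
# are not proportional to cohort sizes.
units = pd.DataFrame({
    "cohort": ["k", "k", "l", "l", "l", "never"],
    "w":      [5.0, 5.0, 1.0, 1.0, 1.0, 1.0],
})
in_pair = units["cohort"].isin(["k", "l"])

raw_share = float(in_pair.sum()) / len(units)            # 5/6
weighted_share = float(
    units.loc[in_pair, "w"].sum() / units["w"].sum()     # 13/14
)
# Using raw_share inside otherwise survey-weighted exact weights
# silently distorts the decomposition.
assert not np.isclose(raw_share, weighted_share)
```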

Code Quality

  • No material findings in the reviewed diff.

Performance

  • No material findings in the reviewed diff.

Maintainability

  • No material findings in the reviewed diff.

Tech Debt

  • P3 Impact: The deferred survey combinations are properly tracked and guarded: SunAbraham bootstrap+survey, TripleDifference IPW/DR+survey, ContinuousDiD bootstrap+survey, EfficientDiD bootstrap+survey, plus Bacon’s documented diagnostic caveat. Under project policy these are mitigated and non-blocking. Concrete fix: none required in this PR. See docs/survey-roadmap.md:L28-L40 and docs/methodology/REGISTRY.md:L1571-L1572.

Security

  • No security or secret-handling findings in the reviewed diff.

Documentation/Tests

  • P3 Impact: The new survey suite does not cover BaconDecomposition(weights="exact", survey_design=...), so the exact-weight mismatch above can slip through unchanged. Concrete fix: add a survey-weighted exact-Bacon regression test with unequal cohort weights and assert the comparison weights against a direct weighted-share calculation. See tests/test_survey_phase3.py:L337-L388.
  • P3 Impact: test_continuous_did_dose_response_survey_pvalue is named like a df-awareness regression test, but it only checks that helper p-values are finite; it would still pass with the current wrong normal-reference helper output. Concrete fix: assert one or more DoseResponseCurve.to_dataframe()["p_value"] values against safe_inference(effect, se, df=result.survey_metadata.df_survey). See tests/test_survey_phase3.py:L839-L887.

Path to Approval

  1. Make the Bacon exact survey path fully weight-consistent by replacing the raw-count unit_share factor in _recompute_exact_weights() with a survey-weighted subsample-share term, or document a deliberate deviation in the registry if raw counts are intentional. See diff_diff/bacon.py:L747-L776.
  2. Add a regression test that exercises BaconDecomposition(weights="exact", survey_design=...) on data where weighted cohort mass differs materially from raw unit counts. See tests/test_survey_phase3.py:L337-L388.

igerber added a commit that referenced this pull request Mar 22, 2026
- EfficientDiD: collapse survey to unit-level before TSL (was expanding
  EIF to panel rows, making SEs too small for weights-only path)
- ContinuousDiD: survey-weighted event-study aggregation and TSL SEs
- ContinuousDiD/EfficientDiD: validate within-unit constancy for survey
  columns before collapsing to unit level
- TripleDifference: use survey df (n_PSU - n_strata) for p-values/CIs
  instead of sample df (n_obs - 8)
- StackedDiD: handle SurveyDesign without weights column (use ones)
- Pass survey df to safe_inference() in all survey-aware estimators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Mar 22, 2026
- ContinuousDiD: rescale IFs by n_units before compute_survey_vcov to
  avoid double-counting 1/n bread; use unit-level df_survey
- EfficientDiD: align unit_first_panel_row to sorted all_units order;
  build unit-level ResolvedSurveyDesign once in fit(); use unit-level df
- SunAbraham: thread survey weights into _compute_iw_effects and
  _compute_overall_att for survey-weighted cohort aggregation
- StackedDiD: pass survey df to safe_inference for event-study and
  overall ATT p-values/CIs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Mar 22, 2026
…nce from PR #226 review (round 3)

- ContinuousDiD: store survey-weighted treated/control masses and
  weighted dpsi_bar in bootstrap_info; use weighted masses for p_1, p_0,
  n_total in IF construction so TSL linearizes the weighted estimator
- TripleDifference: use survey-weighted subgroup mass (sum(w_sub))
  instead of raw counts (n_sub) for pairwise IF combination weights
  w3, w2, w1 when survey design is active

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Mar 22, 2026
… from PR #226 review (round 4)

- ContinuousDiD: normalize WLS bread by weighted treated mass (not raw
  count) for consistency with downstream IF score denominators; fixes
  ACRT_glob/ATT(d)/ACRT(d) survey SEs when subgroup-average weights differ
- ContinuousDiD/EfficientDiD: recompute survey_metadata from unit-level
  ResolvedSurveyDesign so reported effective_n/n_psu/df_survey match
  the inference actually run (not the panel-level overcount)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Mar 22, 2026
…ety from PR #226 review (round 5)

- EfficientDiD: use resolved survey weights directly for unit-level
  estimation (Omega*, EIF, cohort fractions) instead of separately
  renormalized raw column, ensuring fweight/aweight consistency with TSL
- BaconDecomposition: store survey weights as DataFrame column for safe
  label-based subsetting in _recompute_exact_weights, preventing
  out-of-bounds errors on non-default DataFrame indexes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3c30f8e765984cad41da7bf3dd13260569547fb8


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review P1 coverage gaps are addressed: there is now a TripleDifference survey+covariates test at tests/test_survey_phase3.py:535, and EfficientDiD now has survey aggregation tests for event_study, group, and all at tests/test_survey_phase3.py:717, tests/test_survey_phase3.py:737, and tests/test_survey_phase3.py:756.
  • I did not find a new silent numerical correctness bug in the reviewed survey implementations.
  • One unmitigated P1 remains: EfficientDiD now rejects survey_design + covariates in code, but the Methodology Registry still presents survey support without documenting that deferral. The limitation is only documented in the survey roadmap, not the registry.
  • The other survey deferrals in this PR are guarded and documented closely enough to treat as informational.
  • Static review only: I could not execute the tests here because pytest is unavailable and the local Python environment is missing numerical packages.

Methodology

  • P1 Impact: EfficientDiD.fit(..., covariates=[...], survey_design=...) now raises NotImplementedError in diff_diff/efficient_did.py:278, but the registry entry at docs/methodology/REGISTRY.md:675 only documents the bootstrap deferral and otherwise reads as if survey support is implemented for the estimator generally. The same limitation is documented in the roadmap at docs/survey-roadmap.md:39, so the code and roadmap agree, but the registry is now out of sync with actual supported behavior. Concrete fix: add an explicit **Note:** in the EfficientDiD registry entry stating that covariates / DR nuisance estimation with survey_design is deferred or unsupported, or fully thread survey weights through the DR nuisance path and remove the guard.
  • P3 Impact: The other changed survey limitations are documented and guarded rather than silently returning wrong output, including TripleDifference IPW/DR + survey at docs/methodology/REGISTRY.md:1240, ContinuousDiD bootstrap + survey at docs/methodology/REGISTRY.md:497, and Bacon’s diagnostic weighted decomposition note at docs/methodology/REGISTRY.md:1577. Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P3 Impact: The previous re-review blockers on interaction coverage are now addressed by the new tests in tests/test_survey_phase3.py:535 and tests/test_survey_phase3.py:717. Concrete fix: none.
  • P3 Impact: I could not run the added tests locally because the environment does not have pytest or the numerical Python stack installed. Concrete fix: verify execution in CI or a provisioned dev environment.

Path to Approval

  1. Update the EfficientDiD registry entry at docs/methodology/REGISTRY.md:675 to explicitly note that covariates + survey_design is deferred / unsupported, matching the code guard at diff_diff/efficient_did.py:278 and the roadmap entry at docs/survey-roadmap.md:39.

#226 review (round 13)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 08ab80f5958f1666fa53fc6ebb2014903b4e6d3d


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review blocker is resolved: EfficientDiD now documents both bootstrap + survey and covariates + survey deferrals in the Methodology Registry, matching the new guards in code.
  • I did not find a new NaN/partial-inference bug in the changed survey inference paths; the modified estimators route scalar inference through safe_inference().
  • One unmitigated P1 remains in the new Bacon survey support: the weights="exact" path silently collapses each unit to its first observed survey weight but never checks that survey weights are constant within unit, so exact comparison weights can become row-order-dependent.
  • The other survey deferrals in this PR are guarded and documented rather than silently returning output.
  • Static review only: I could not run the test suite locally because this environment is missing numpy, pandas, scipy, and pytest.

Methodology

  • Severity: P1. Impact: BaconDecomposition(weights="exact", survey_design=...) accepts panel survey weights that vary over time within a unit, but the exact-weight recomputation later collapses each unit to a single weight via .groupby(unit).first(). That makes the reported “exact” Bacon weights depend on row order for otherwise valid inputs, and the registry does not document any unit-level collapse rule beyond “weighted cell means / weighted group shares.” See diff_diff/bacon.py:L474-L479, diff_diff/bacon.py:L776-L790, and docs/methodology/REGISTRY.md:L1578-L1579. Concrete fix: when self.weights == "exact" and survey_design.weights is set, either reject within-unit-varying survey weights (same pattern as _validate_unit_constant_survey() in the unit-collapsing estimators) or define and document a deterministic unit-level weighting rule and use it consistently instead of .first().
  • Severity: P3. Impact: The previous EfficientDiD documentation mismatch is fixed; the registry now explicitly documents both bootstrap + survey and covariates + survey deferrals, matching the NotImplementedError guards. See docs/methodology/REGISTRY.md:L675-L681 and diff_diff/efficient_did.py:L270-L286. Concrete fix: none.
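The validation pattern suggested in the P1 can be sketched as below; `validate_unit_constant` is a hypothetical stand-in, and the real `_validate_unit_constant_survey()` helper may differ in signature and message:

```python
import pandas as pd

def validate_unit_constant(df, unit_col, weight_col):
    """Reject survey weights that vary within a unit.

    A .groupby(unit).first() collapse on varying weights would be
    row-order-dependent, so the guard fails fast instead.
    """
    nuniq = df.groupby(unit_col)[weight_col].nunique()
    bad = nuniq[nuniq > 1].index.tolist()
    if bad:
        raise ValueError(
            f"survey weights vary within units {bad}; a first()-based "
            f"unit collapse would depend on row order"
        )

panel = pd.DataFrame({"unit": [1, 1, 2, 2], "w": [2.0, 3.0, 1.0, 1.0]})
try:
    validate_unit_constant(panel, "unit", "w")
    rejected = False
except ValueError:
    rejected = True   # unit 1 carries weights {2.0, 3.0}
```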

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • Severity: P3. Impact: I could not execute the new survey Phase 3 tests locally because the Python scientific stack is unavailable in this environment. Concrete fix: confirm the added coverage in CI or rerun the relevant survey tests in a provisioned dev environment before merge.

Path to Approval

  1. Add a guard for BaconDecomposition(weights="exact", survey_design=...) that rejects within-unit-varying survey weights, or replace the .groupby(...).first() collapse with a documented deterministic unit-level weighting rule.
  2. Add a regression test covering a panel with time-varying survey weights under weights="exact" and assert either a clear ValueError or stable exact comparison weights under two different row orders.

igerber and others added 4 commits March 22, 2026 07:18
…n test from PR #226 review (round 14)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m PR #226 local review

Pre-existing behavior (not introduced by survey PR): event-study
aggregation uses all (g,t) cells without anticipation filtering.
Affects both survey and non-survey paths equally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ocal review

Maintainability P2: panel-to-unit survey collapse, post-filter
re-resolution, and metadata recomputation patterns differ across
ContinuousDiD, EfficientDiD, and StackedDiD. Candidate for shared
helpers in a follow-up tech debt PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l review

P3: near-identical survey metadata rendering in 7 results classes
will grow worse as Phases 4-5 add more estimators. Extract a shared
helper in a follow-up tech debt PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4604badc913f29ef71fba509c70dc2f0d086a25c


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Code Quality

  • Severity: P1. Impact: The new survey_design support is not threaded through two exported helper entry points, so bacon_decompose(..., survey_design=...) and triple_difference(..., survey_design=...) will fail despite the underlying estimators supporting survey designs. That leaves the public API internally inconsistent and the feature partially inaccessible (diff_diff/bacon.py:L1039, diff_diff/triple_diff.py:L1693, diff_diff/__init__.py:L18, diff_diff/__init__.py:L129). Concrete fix: add survey_design=None to both convenience-function signatures/docstrings and forward it to .fit(...).
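The forwarding fix is mechanical; a self-contained sketch with a stub estimator (the real constructor and fit() signatures are not reproduced here):

```python
class _StubEstimator:
    """Stand-in for BaconDecomposition; records what fit() received."""

    def fit(self, data, *, survey_design=None):
        return {"survey_design": survey_design}

def bacon_decompose_sketch(data, *, survey_design=None):
    # The fix the review asks for: accept survey_design in the wrapper
    # signature and forward it to the estimator's fit() instead of
    # silently dropping it.
    return _StubEstimator().fit(data, survey_design=survey_design)

result = bacon_decompose_sketch([], survey_design="design")
```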

Performance

No findings.

Maintainability

  • Severity: P3. Impact: The follow-up refactors around shared survey helper extraction and duplicated survey summary formatting are properly tracked in TODO.md, so they are not blockers (TODO.md:L58, TODO.md:L59). Concrete fix: none; tracked.

Tech Debt

  • Severity: P3. Impact: The pre-existing ContinuousDiD event-study anticipation limitation is now explicitly tracked, so I am not treating it as a blocker in this re-review (TODO.md:L57). Concrete fix: none; tracked.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: The new Phase 3 tests add wrapper coverage for stacked_did(), but not for bacon_decompose() or triple_difference(), which likely let the missed public-API propagation ship (tests/test_survey_phase3.py:L301, diff_diff/bacon.py:L1039, diff_diff/triple_diff.py:L1693). Concrete fix: add convenience-wrapper tests for bacon_decompose(..., survey_design=...) and triple_difference(..., survey_design=...), including the reg-success and IPW/DR-guard cases for TripleDifference.
  • Severity: P3. Impact: I could not execute the added tests locally because this environment lacks pytest, numpy, pandas, and scipy. Concrete fix: confirm tests/test_survey_phase3.py and the affected estimator subsets in CI or a provisioned dev environment.

Path to Approval

  1. Add survey_design passthrough support to bacon_decompose() and update its docstring/examples.
  2. Add survey_design passthrough support to triple_difference() and update its docstring to describe reg-only survey support plus the IPW/DR guard.
  3. Add regression tests covering both public convenience wrappers with survey inputs.

…) convenience functions from PR #226 review (round 15)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cb2d8859475cd0790e3a200f7103bef68201079f


Overall Assessment

⛔ Blocker

Executive Summary

Methodology

Code Quality

No findings. The prior public-wrapper passthrough issue appears fixed in diff_diff/bacon.py:L1123 and diff_diff/triple_diff.py:L1780.

Performance

No findings.

Maintainability

  • Severity: P3. Impact: the inconsistent panel-survey helper patterns and duplicated survey summary formatting are now explicitly tracked in TODO.md, so they are non-blocking deferred refactors. Concrete fix: none in this PR; tracked in TODO.md:L58 and TODO.md:L59.

Tech Debt

  • Severity: P3. Impact: the pre-existing ContinuousDiD event-study anticipation limitation is now tracked and should not block this PR. Concrete fix: none in this PR; tracked in TODO.md:L57.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new SunAbraham scale-invariance coverage only multiplies weights upward by 10, which does not exercise the below-1-weight case that triggers the blocker above. That gap is why this slipped through. Concrete fix: add a regression test that rescales the same survey weights downward (for example weight * 0.1) and asserts overall_att and overall_se are unchanged. See tests/test_survey_phase3.py:L818.
  • Severity: P3. Impact: bacon_decompose() and triple_difference() now expose survey_design, but their wrapper docstrings still do not document that parameter, and triple_difference() still does not describe the reg-only survey limitation. Concrete fix: update those wrapper docstrings/examples. See diff_diff/bacon.py:L1048 and diff_diff/triple_diff.py:L1707.
  • Severity: P3. Impact: I could not execute the new tests in this environment because the required Python packages are missing. Concrete fix: verify tests/test_survey_phase3.py in CI or a provisioned dev environment.

Path to Approval

  1. In SunAbraham._compute_overall_att(), remove the max(n_at_e, 1) floor from the survey-weighted branch and use the actual positive survey-weighted mass for post-period aggregation.
  2. Add a regression test that rescales the same survey weights below 1 and verifies SunAbraham’s overall_att and overall_se are invariant to that positive rescaling.
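A toy aggregation shows why the max(n_at_e, 1) floor breaks positive-rescaling invariance once survey masses fall below 1 (the function here is illustrative, not SunAbraham's actual code):

```python
import numpy as np

def overall_att(effects, masses, floor=False):
    """Toy post-period aggregation; floor=True mimics max(mass, 1)."""
    m = np.maximum(masses, 1.0) if floor else masses
    return float(np.sum(effects * m) / np.sum(m))

effects = np.array([1.0, 3.0])
masses = np.array([4.0, 2.0])              # survey-weighted masses

base = overall_att(effects, masses)
scaled = overall_att(effects, masses * 0.1)   # masses 0.4, 0.2 < 1

# Without the floor, the weighted average is invariant to any positive
# rescaling of the weights; with the floor, sub-1 masses get clamped to
# 1 and the aggregate shifts.
floored = overall_att(effects, masses * 0.1, floor=True)
```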

…226 review (round 16)

Remove max(n_at_e, 1) floor from survey-weighted branch — valid
survey masses can be < 1 and flooring breaks scale invariance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: efc09f966dc5e9c7b116b28c8b38eca216a193a6


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity P3. Impact: the new TODO entries correctly track the remaining shared-helper and duplicated-summary refactors, so those issues are non-blocking in this PR under the project’s deferred-work policy at TODO.md:L58-L59. Concrete fix: none in this PR.

Tech Debt

  • Severity P3. Impact: the pre-existing ContinuousDiD event-study anticipation limitation is now explicitly tracked in TODO.md:L57-L59, so it should not block this merge. Concrete fix: none in this PR.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: static review only; I could not execute the new survey tests locally because this container does not have numpy, pandas, pytest, or scipy installed. Concrete fix: verify the new survey slice in CI or a provisioned development environment before merge.

Path to Approval

  1. In diff_diff/stacked_did.py:L369-L443, add an explicit guard rejecting SurveyDesign(..., weight_type="fweight") for StackedDiD survey runs, because _Q_weight is ratio-valued and cannot preserve frequency-weight semantics after composition.
  2. Add a regression test in tests/test_survey_phase3.py that uses integer survey weights with weight_type="fweight" and asserts StackedDiD.fit(..., survey_design=...) raises a clear ValueError or NotImplementedError instead of fitting silently.
  3. Update the StackedDiD survey note in docs/methodology/REGISTRY.md:L1023-L1024 to document the supported weight-type restriction once the code enforces it.
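The semantics problem behind item 1 can be demonstrated in a few lines; `check_fweight_composition` is a guard sketch, not the library's implementation:

```python
import numpy as np

def check_fweight_composition(w):
    """Guard sketch: composed weights must stay integer-valued to keep
    frequency-weight (row replication) semantics."""
    if not np.allclose(w, np.round(w)):
        raise ValueError("composed weights are non-integer; "
                         "fweight semantics cannot be preserved")

fweights = np.array([2.0, 3.0, 1.0])     # integer replication counts
q_weights = np.array([0.5, 1.25, 0.8])   # ratio-valued stack Q-weights
composed = fweights * q_weights          # [1.0, 3.75, 0.8]

try:
    check_fweight_composition(composed)
    guard_fired = False
except ValueError:
    guard_fired = True    # non-integer composed weights are rejected
```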

…226 review (round 17)

Q-weight composition produces non-integer composed weights, breaking
frequency-weight semantics. Raise ValueError for weight_type='fweight'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a8e2a20273bd3000b2288f23c34fe6d1f521dfbb


Overall Assessment
⚠️ Needs changes

Static review only; I did not execute the test suite in this environment.

Executive Summary

Methodology

  • Severity P1. Impact: TripleDifference now documents survey support as “weighted OLS + TSL on combined influence functions” in docs/methodology/REGISTRY.md:L1237-L1242. The final survey SE is then computed from the combined IF in diff_diff/triple_diff.py:L1042-L1053. But the weighted RA nuisance linearization in diff_diff/triple_diff.py:L1333-L1360 only weights the bread (XpX_pre/post) and leaves the score pieces wols_eX_pre/post unweighted. That makes the survey IF inconsistent with the weighted OLS fit, so the reported SE / p-value / CI can be wrong whenever survey weights vary. Concrete fix: weight wols_eX_pre/post by the same survey weights used in XpX_pre/post, with consistent normalization, before passing the combined IF to compute_survey_vcov().
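The consistency requirement behind this finding — weights must appear in both the bread and the per-observation scores — can be illustrated on a generic weighted OLS fit. This is a standalone numpy sketch, not the repository's influence-function code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
w = rng.uniform(0.5, 3.0, size=n)  # non-uniform survey weights

# Weighted OLS: beta solves (X'WX) beta = X'W y
XtWX = X.T @ (w[:, None] * X)
beta = np.linalg.solve(XtWX, X.T @ (w * y))
e = y - X @ beta

# Influence functions consistent with the weighted fit: the weights
# appear in BOTH the bread (X'WX) and the per-obs scores (w_i x_i e_i).
bread_inv = np.linalg.inv(XtWX / n)
IF = (w[:, None] * X * e[:, None]) @ bread_inv.T

# At the WLS solution sum_i w_i x_i e_i = 0 exactly, so the consistent
# IFs are mean-zero; unweighted scores x_i e_i generally are not.
print(np.abs(IF.mean(axis=0)).max() < 1e-8)               # True
print(np.abs((X * e[:, None]).mean(axis=0)).max() > 1e-8)  # True
```

Leaving the weights out of the score (the bug described above) produces the second, non-centered quantity, and a variance built from it is inconsistent with the weighted fit.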

Code Quality
No additional findings.

Performance
No findings.

Maintainability

  • Severity P3. Impact: the remaining panel-survey helper duplication and repeated survey-summary formatting are now explicitly tracked, so they are non-blocking in this PR under the project’s deferred-work policy in TODO.md:L58-L59. Concrete fix: none in this PR.

Tech Debt

  • Severity P3. Impact: ContinuousDiD event-study aggregation still does not filter by anticipation, but the PR adds a TODO.md entry that properly tracks it as pre-existing debt, so it should not block merge in TODO.md:L57. Concrete fix: none in this PR.

Security
No findings.

Documentation/Tests

  • Severity P3. Impact: the prior StackedDiD fweight blocker is fixed in code and test in diff_diff/stacked_did.py:L246-L257 and tests/test_survey_phase3.py:L331-L344, but the public docs still describe StackedDiD survey support as full / multiplicative without noting that fweight is rejected in docs/methodology/REGISTRY.md:L1023-L1024 and docs/survey-roadmap.md:L19-L24. Concrete fix: add a short note that StackedDiD survey support excludes weight_type="fweight" because Q-weight composition breaks frequency-weight semantics.
  • Severity P3. Impact: the new TripleDifference survey tests in tests/test_survey_phase3.py:L468-L602 only smoke-test finite output, metadata, and guard behavior; they do not benchmark the weighted RA SE against a known-good target, which is why the P1 above is currently unexercised. Concrete fix: add a regression test comparing the survey-reg se to a hand-worked or frequency-expanded equivalent on a small non-uniformly weighted 2×2×2 example.
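The "frequency-expanded equivalent" check suggested above rests on a generic identity: with integer weights, weighted OLS on the compact data equals unweighted OLS on row-replicated data. A minimal standalone sketch of that identity (not the package's actual test):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=8)
X = np.column_stack([np.ones(8), rng.normal(size=8)])
w = np.array([1, 2, 3, 1, 2, 1, 4, 2])  # integer weights

# Weighted OLS on the compact data ...
beta_w = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# ... equals unweighted OLS on the frequency-expanded data.
Xe = np.repeat(X, w, axis=0)
ye = np.repeat(y, w)
beta_e = np.linalg.solve(Xe.T @ Xe, Xe.T @ ye)

print(np.allclose(beta_w, beta_e))  # True
```

A regression test of this shape on a small weighted 3-way design would pin the survey-reg point estimate; the SE comparison needs the matching variance rule and is left to the tracked follow-up.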

Path to Approval

  1. Fix the weighted RA nuisance linearization in diff_diff/triple_diff.py:L1333-L1360 so the pre/post OLS score terms include survey weights consistently with the weighted OLS fit.
  2. Add a regression test in tests/test_survey_phase3.py:L468-L602 that asserts the TripleDifference survey-reg SE matches a known target on a small weighted example.

…hts in score terms from PR #226 review (round 18)

The pre/post OLS score pieces (wols_eX_pre/post) and design matrices
(wols_x_pre/post) now include survey weights when present, consistent
with the weighted OLS bread. Previously only the bread was weighted,
making the survey IF inconsistent with the weighted OLS fit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5bdbccefae0e33dd80cea53441b69be2457b4a96


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in the changed PR paths. Static review only; I could not run the test suite in this sandbox because pytest and the Python scientific dependencies are unavailable here.

Executive Summary

Methodology

No unmitigated P0/P1 findings. The previously reviewed TripleDifference survey-RA mismatch is addressed in diff_diff/triple_diff.py:L1027-L1053 and diff_diff/triple_diff.py:L1331-L1364.

  • Severity P2. Impact: BaconDecomposition.fit() now applies _validate_unit_constant_survey() to every survey run in diff_diff/bacon.py:L474-L485, so the default approximate survey path also rejects time-varying survey columns even though the changed approximate implementation itself only uses observation-level weighted means / weighted within-transformation / weighted group shares in diff_diff/bacon.py:L521-L591. That is stricter than the documented Bacon survey support in docs/methodology/REGISTRY.md:L1573-L1579, and it is also stricter than the exact-only scope implied by the name of the new regression test in tests/test_survey_phase3.py:L1076-L1096. Concrete fix: move the unit-constant survey validation behind if self.weights == "exact" or add an explicit Bacon registry/roadmap note that survey support requires within-unit-constant design columns even for weights="approximate".
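The within-unit-constancy check at issue can be sketched as a standalone helper (hypothetical name mirroring _validate_unit_constant_survey; the actual signature in diff_diff/bacon.py may differ):

```python
import numpy as np

def is_unit_constant(unit_ids, col):
    """True if `col` takes a single value within every unit.

    The exact Bacon decomposition collapses data to the unit level, so
    strata / PSU / weight columns must not vary within a unit. The
    approximate path works on observation-level weighted means and can
    skip this check.
    """
    unit_ids = np.asarray(unit_ids)
    col = np.asarray(col)
    return all(
        np.unique(col[unit_ids == u]).size == 1 for u in np.unique(unit_ids)
    )

print(is_unit_constant([1, 1, 2, 2], [5.0, 5.0, 7.0, 7.0]))  # True
print(is_unit_constant([1, 1, 2, 2], [5.0, 6.0, 7.0, 7.0]))  # False
```

The suggested fix is then to run this validation only when `self.weights == "exact"`.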

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the non-blocking items already tracked in TODO.md.

Tech Debt

  • Severity P3. Impact: the PR properly tracks the pre-existing ContinuousDiD event-study anticipation limitation and the survey-helper / summary-formatting cleanup as deferred work, so they are non-blocking under the project policy in TODO.md:L57-L59. Concrete fix: none in this PR.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: StackedDiD now correctly rejects weight_type="fweight" and has a regression test for it in diff_diff/stacked_did.py:L246-L257 and tests/test_survey_phase3.py:L331-L344, but the methodology registry and survey roadmap still describe the survey support as full / multiplicative without noting that exclusion in docs/methodology/REGISTRY.md:L1023-L1024 and docs/survey-roadmap.md:L21-L22. Concrete fix: add a short note that StackedDiD survey support excludes fweight because Q-weight composition breaks frequency-weight semantics.
  • Severity P3. Impact: the new TripleDifference survey tests cover smoke behavior, guards, metadata, and scale invariance, but they still do not pin the survey-regression se to a known-good weighted target in tests/test_survey_phase3.py:L467-L602. Concrete fix: add a small non-uniform 2×2×2 regression test whose survey-reg se is checked against a hand-worked or frequency-expanded target.

…usions from PR #226 review (round 19)

- BaconDecomposition: move _validate_unit_constant_survey behind
  weights="exact" — approximate path uses obs-level means and doesn't
  need within-unit-constant survey columns
- REGISTRY.md: document StackedDiD fweight rejection and Bacon
  exact-only unit-constancy requirement
- survey-roadmap.md: update StackedDiD support to "Full (pweight/aweight)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ae7f687b9290aef0536a968da2475848baf47c3c


Overall Assessment

⚠️ Needs changes

Static re-review only. I could not run the test suite in this sandbox because numpy is unavailable.

Executive Summary

Methodology

Affected method: StackedDiD survey-weighted WLS/TSL extension.

Resolved from the prior review: the Bacon approximate survey path no longer rejects time-varying survey columns, and that exact-only constraint is now documented in diff_diff/bacon.py:L481-L488 and docs/methodology/REGISTRY.md:L1578-L1579.

  • Severity P1 [Newly identified]. Impact: StackedDiD now points users to weight_type="aweight" as a supported survey option in diff_diff/stacked_did.py:L246-L256. But in the survey path it multiplies Wing et al.’s corrective Q weights by the survey weights, copies the resolved survey design without changing weight_type, and sends that product into compute_survey_vcov() in diff_diff/stacked_did.py:L382-L448. The survey core’s aweight branch omits weights from the meat in diff_diff/survey.py:L553-L558, matching the registry’s inverse-variance/precision interpretation of aweight in docs/methodology/REGISTRY.md:L1891-L1894. Wing et al.’s Q_sa is a corrective sample weight for the stacked regression, whereas aweight is a precision-weight concept; treating Q × aweight as a pure aweight is therefore an undocumented variance rule, not a documented source-backed extension, and it can silently distort StackedDiD SEs, p-values, and CIs for a supported branch. Concrete fix: either reject weight_type="aweight" in StackedDiD.fit() until a valid composed-weight variance derivation exists, or implement and document a dedicated Q × aweight variance path instead of reusing the generic aweight meat. (nber.org)
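Why the weight-type distinction matters can be seen on a toy weighted mean: a pweight-style robust meat (weights squared inside the scores) and an aweight-style classical meat (weights entering once, as precision scaling) give different variances for the same composed weights. The exact formulas in diff_diff/survey.py may differ; this sketch only illustrates the two conventions, using assumed textbook forms:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
y = rng.normal(size=n)
q = rng.uniform(0.2, 1.0, size=n)  # ratio-valued corrective Q-weights
a = rng.uniform(0.5, 2.0, size=n)  # precision-style (aweight) weights
w = q * a                          # composed weight
wn = w * n / w.sum()               # normalize to mean 1 (aweight convention)

mu = np.sum(wn * y) / np.sum(wn)   # weighted mean (simplest "regression")
e = y - mu

# pweight-style robust meat: weights enter the squared scores.
var_p = np.sum((wn * e) ** 2) / np.sum(wn) ** 2

# aweight-style classical meat: weights enter once, as precision scaling.
sigma2 = np.sum(wn * e ** 2) / (n - 1)
var_a = sigma2 / np.sum(wn)

# The two rules disagree whenever the composed weights vary, which is
# the review's point: Q x aweight cannot be treated as a pure aweight.
print(not np.isclose(var_p, var_a, rtol=1e-2))  # True
```

Routing the composed weight through the wrong branch thus changes the reported variance without any derivation backing the result.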

Code Quality

No findings.

Performance

No findings.

Maintainability

No new findings beyond the already tracked survey-helper cleanup in TODO.md:L58-L59.

Tech Debt

  • Severity P3. Impact: the pre-existing ContinuousDiD anticipation limitation and the cross-estimator survey-helper cleanup are now explicitly tracked in TODO.md:L57-L59, so they are mitigated under the project’s deferred-work policy. Concrete fix: none required for approval in this PR.

Security

No findings.

Documentation/Tests

Path to Approval

  1. In diff_diff/stacked_did.py:L246-L256 and diff_diff/stacked_did.py:L382-L448, stop accepting SurveyDesign(weight_type="aweight") unless you add a separately derived composed-weight variance formula. The minimal safe fix is to reject aweight, like fweight.
  2. Update docs/methodology/REGISTRY.md:L1023-L1024, docs/survey-roadmap.md:L21, and the StackedDiD help/error text so the documented support matrix matches the implementation.
  3. Add a StackedDiD survey test for aweight: either a rejection test or a validated numeric SE target for the chosen composed-weight method.

…om PR #226 review (round 20)

Q-weight composition changes inverse-variance weight semantics just
as it breaks frequency-weight semantics. Update guard, docs, and tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e33ed3b1fe14b7867b9766522d745869650a72f1


Overall Assessment

✅ Looks good

Static re-review only. I could not execute the test suite in this sandbox because numpy and pytest are unavailable.

Executive Summary

  • The prior StackedDiD blocker appears resolved: the PR now rejects weight_type="aweight" and weight_type="fweight" in code, mirrors that restriction in the methodology docs/roadmap, and adds explicit rejection tests in diff_diff/stacked_did.py:L246, docs/methodology/REGISTRY.md:L1023, docs/survey-roadmap.md:L21, tests/test_survey_phase3.py:L332, and tests/test_survey_phase3.py:L346.
  • I did not find any unmitigated P0/P1 issues in the changed survey-support paths.
  • The affected estimator surface matches the intended source-method families: corrective-weight stacked DiD, continuous-treatment DiD, interaction-weighted Sun-Abraham event studies, EIF-based EfficientDiD, and tailored TripleDifference decomposition. (nber.org)
  • The remaining issues are non-blocking: one SunAbraham registry wording drift, plus pre-existing ContinuousDiD/helper cleanup items that are now explicitly tracked in TODO.md.
  • This assessment is based on static review of the diff and surrounding code only.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3 (tracked in TODO.md). Impact: the new survey support still leaves panel-to-unit survey collapsing and survey-metadata formatting split across multiple estimators/results classes, but the PR explicitly tracks both follow-ups in TODO.md:L58 and TODO.md:L59. Concrete fix: none required for approval; handle in the tracked helper-extraction cleanup.

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: ContinuousDiD event-study aggregation still does not filter by anticipation, but the PR now records that as pre-existing deferred work in TODO.md:L57. Concrete fix: none required for approval in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the SunAbraham registry text still says aggregated event-study and overall ATT p-values use the normal distribution, but the new survey path now routes both through survey-df safe_inference() in diff_diff/sun_abraham.py:L637, diff_diff/sun_abraham.py:L645, diff_diff/sun_abraham.py:L671, and diff_diff/sun_abraham.py:L966. That makes the survey-mode behavior sensible, but the registry is now ambiguous/stale at docs/methodology/REGISTRY.md:L733. Concrete fix: add a short note that non-survey aggregate inference remains normal-approximation, while survey-mode aggregate inference uses df = n_PSU - n_strata.
  • No additional test findings from the diff itself. The new survey tests materially improve coverage for the prior StackedDiD issue and for the new NotImplementedError guards.
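The survey-mode degrees of freedom mentioned above (df = n_PSU - n_strata) can be sketched as a standalone helper (hypothetical name; the repository's actual implementation lives in the survey core):

```python
import numpy as np

def survey_design_df(strata, psu):
    """Design degrees of freedom: (number of PSUs) - (number of strata).

    Matches the convention the review attributes to the survey path,
    used for t-based inference instead of the normal approximation.
    """
    strata = np.asarray(strata)
    psu = np.asarray(psu)
    n_strata = len(np.unique(strata))
    # PSU labels may repeat across strata, so count unique pairs.
    n_psu = len({(s, p) for s, p in zip(strata, psu)})
    return n_psu - n_strata

# 2 strata with 3 and 2 PSUs -> df = 5 - 2 = 3
strata = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
psu    = [1, 1, 2, 2, 3, 3, 1, 1, 2, 2]
print(survey_design_df(strata, psu))  # 3
```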

@igerber igerber merged commit 15985c3 into main Mar 22, 2026
14 checks passed
@igerber igerber deleted the survey-enhancements branch March 22, 2026 14:58