Skip to content

Survey Phase 7: CS IPW/DR covariates, repeated cross-sections, HonestDiD survey variance#240

Merged
igerber merged 34 commits into
mainfrom
survey-phase-seven
Mar 31, 2026
Merged

Survey Phase 7: CS IPW/DR covariates, repeated cross-sections, HonestDiD survey variance#240
igerber merged 34 commits into
mainfrom
survey-phase-seven

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Mar 28, 2026

Summary

  • Phase 7a: Remove NotImplementedError gate for CallawaySantAnna IPW/DR + covariates + survey. Implement DRDID panel nuisance IF corrections (propensity score + outcome regression) for both survey-weighted and non-survey DR paths (Sant'Anna & Zhao 2020, Theorem 3.1). Extract _safe_inv() helper for matrix inversions.
  • Phase 7d: Thread survey degrees of freedom through HonestDiD for t-distribution critical values. Compute full event-study variance-covariance matrix from influence function vectors in CallawaySantAnna aggregation. Add event_study_vcov field to CallawaySantAnnaResults and survey_metadata/df_survey to HonestDiDResults.
  • Phase 7b: Add panel=False for repeated cross-section support in CallawaySantAnna. New _precompute_structures_rc(), _compute_att_gt_rc(), and three RC estimation methods (_outcome_regression_rc, _ipw_estimation_rc, _doubly_robust_rc) with covariates and survey weights. Canonical index abstraction in aggregation/bootstrap mixins. RCS data generator via generate_staggered_data(panel=False).

Methodology references

  • Method name(s): Callaway-Sant'Anna (2021), Sant'Anna & Zhao (2020) DRDID panel/cross-section, Rambachan & Roth (2023) HonestDiD
  • Paper / source link(s):
    • Sant'Anna, P.H.C. & Zhao, J. (2020). "Doubly Robust Difference-in-Differences Estimators." J. Econometrics 219(1). Theorem 3.1 (panel IF corrections), Section 4 (cross-sectional DRDID).
    • Callaway, B. & Sant'Anna, P.H.C. (2021). "Difference-in-Differences with Multiple Time Periods." J. Econometrics 225(2). Section 4.1 (repeated cross-sections).
    • Rambachan, A. & Roth, J. (2023). "A More Credible Approach to Parallel Trends." Rev. Econ. Studies 90(5).
  • Intentional deviations: DR nuisance IF corrections use the same survey-weighted Hessian/score pattern as the existing IPW path. Non-survey DR path also receives IF corrections (was plug-in only). Per-cell SEs remain IF-based (not full TSL) — documented in REGISTRY.md. Event-study VCV under replicate weights falls back to diagonal (multivariate replicate VCV deferred).

Validation

  • Tests added/updated:
    • tests/test_survey_phase7a.py (22 tests): smoke, scale invariance, uniform-weight equivalence, IF correction, aggregation, bootstrap, edge cases
    • tests/test_staggered_rc.py (23 tests): all methods, covariates, survey, aggregation, bootstrap, control groups, base periods, data generator, edge cases
    • tests/test_honest_did.py (+4 tests): survey df extraction, VCV computation, bounds widening, no-survey baseline
    • tests/test_survey_phase4.py: 2 negative tests converted to positive assertions
  • Full test suite: 365 tests pass across all affected files (0 failures)

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…DiD survey variance

Phase 7a: Remove NotImplementedError gate for IPW/DR + covariates + survey.
Add DRDID panel nuisance IF corrections (PS + OR) for both survey and
non-survey DR paths. Extract _safe_inv helper for matrix inversions.

Phase 7d: Thread survey df through HonestDiD for t-distribution critical
values. Compute full event-study VCV from influence function vectors.
Add event_study_vcov to CallawaySantAnnaResults.

Phase 7b: Add panel=False for repeated cross-section support in
CallawaySantAnna. New _precompute_structures_rc, _compute_att_gt_rc,
and three RC estimation methods (reg, ipw, dr) with covariates and
survey weights. Canonical index abstraction in aggregation/bootstrap.
RCS data generator in generate_staggered_data(panel=False).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

PR Review

Overall assessment

⚠️ Needs changes. The highest unmitigated severity is P1: the new repeated-cross-section CallawaySantAnna paths do not fully match the documented/source-method inference contract, and the new HonestDiD survey covariance plumbing diverges from the registry for replicate-weight designs.

Executive summary

  • Affected methods: Callaway-Sant'Anna (panel=False, IPW/DR, aggregation) and HonestDiD (survey-aware event-study covariance).
  • The Phase 7a panel DR survey changes look broadly aligned with the updated registry; I did not find a blocker in the new panel nuisance-correction code itself.
  • The new repeated-cross-section IPW/DR analytic SEs are still plug-in only and do not implement the cross-sectional nuisance-estimation IF corrections the registry says are supported.
  • For panel=False, unweighted simple/event-study aggregation uses time-specific treated-cell counts while the WIF code uses full-sample cohort shares, so the weighting contract is internally inconsistent.
  • The new HonestDiD covariance path for replicate-weight surveys does not implement the documented diagonal fallback; it builds a full Psi'Psi covariance and passes that to HonestDiD.
  • The new public panel parameter is not propagated into CallawaySantAnnaResults, and result summaries still label repeated-cross-section observation counts as “units”.

Methodology

Cross-check basis: docs/methodology/REGISTRY.md:L291-L319, docs/methodology/REGISTRY.md:L419-L424, and docs/methodology/REGISTRY.md:L1633-L1637.

  1. Severity: P1. diff_diff/staggered.py:L2872-L2978, diff_diff/staggered.py:L2980-L3127, docs/methodology/REGISTRY.md:L423-L424.
    Impact: the new repeated-cross-section IPW and DR analytic inference does not include nuisance-estimation IF corrections. _ipw_estimation_rc() computes SE from the plug-in IF only, and _doubly_robust_rc() likewise stops at the plug-in IF. That is a mismatch with the registry’s claim that panel=False uses Section 4 cross-sectional DRDID with per-observation influence functions, and it is notably weaker than the panel IPW/DR code in the same file, which now adds explicit PS/OR correction terms. The result is understated or otherwise incorrect SEs/CIs/p-values for covariate-adjusted RCS IPW/DR.
    Concrete fix: implement the Section 4 cross-sectional nuisance-estimation IF corrections for panel=False IPW/DR, or explicitly document the deviation in REGISTRY.md and disable analytic inference for those branches until the correct IF is in place.

  2. Severity: P1. diff_diff/staggered_aggregation.py:L37-L152, diff_diff/staggered_aggregation.py:L289-L314, diff_diff/staggered_aggregation.py:L574-L645, diff_diff/staggered_bootstrap.py:L223-L267, diff_diff/staggered_bootstrap.py:L560-L657.
    Impact: the new unweighted panel=False aggregation uses data["n_treated"] from each (g,t) cell as the aggregation weight, but the WIF path for the same estimator computes pg from full-sample cohort counts. In panel data those coincide because cohort size is constant across t; in repeated cross-sections they generally do not. That means the point estimate, WIF correction, and bootstrap aggregation are no longer using the same weight definition. This changes the estimand/finite-sample weighting and makes the SE formula internally inconsistent with the aggregated estimator.
    Concrete fix: precompute fixed cohort masses for panel=False once from the full repeated-cross-section sample, then use those same cohort masses everywhere simple/event-study/bootstrap weights are formed.

  3. Severity: P1. diff_diff/staggered_aggregation.py:L710-L739, diff_diff/honest_did.py:L664-L669, docs/methodology/REGISTRY.md:L1637-L1637.
    Impact: the registry explicitly says replicate-weight event-study covariance should fall back to a diagonal matrix until multivariate replicate VCV is implemented, but _aggregate_event_study() currently builds a full Psi.T @ Psi matrix for all non-TSL cases, which includes replicate-weight designs. HonestDiD then consumes that full matrix whenever event_study_vcov is present. That is an undocumented methodology mismatch and can change HonestDiD bounds under replicate designs without warning.
    Concrete fix: when uses_replicate_variance is true, do not populate a full off-diagonal event_study_vcov; set it to None or an explicit diagonal-from-SEs fallback until a proper multivariate replicate covariance estimator is implemented and validated.

Code Quality

No additional findings beyond the methodology issues above.

Performance

No findings.

Maintainability

  1. Severity: P1. diff_diff/staggered.py:L258-L336, diff_diff/staggered.py:L1374-L1386, diff_diff/staggered.py:L1774-L1785, diff_diff/staggered_results.py:L63-L123, diff_diff/staggered_results.py:L157-L163.
    Impact: the new public panel parameter is not stored on CallawaySantAnnaResults, even though it changes how counts and aggregation should be interpreted. For panel=False, the fit path stores observation counts in n_treated_units / n_control_units, but the results API and summary still present them as “units”. That makes downstream use ambiguous and violates the expected propagation pattern for new public parameters.
    Concrete fix: add panel: bool (or equivalent index-space metadata) to CallawaySantAnnaResults, populate it in fit(), and switch summary labels/output semantics to observation-based wording when panel=False.

Tech Debt

No separate findings. The new TODO.md changes only resolve prior items; they do not track the P1 issues above, so those remain unmitigated.

Security

No findings.

Documentation/Tests

  1. Severity: P2. diff_diff/staggered.py:L119-L184, docs/api/staggered.rst:L15-L32, tests/test_staggered_rc.py:L48-L351, tests/test_honest_did.py:L1125-L1244.
    Impact: the public CallawaySantAnna class docstring still does not describe panel, so the generated API docs do not expose the new parameter contract. The new RCS tests mostly check “finite result” behavior, not methodology alignment, aggregation weights, or SE correctness. The HonestDiD additions test Taylor-style survey inputs, but there is no test for the new replicate-weight covariance branch, which is exactly where the implementation currently diverges from the registry.
    Concrete fix: document panel in the estimator docstring/API docs, add a reference-validation test against did::att_gt(panel=FALSE) for at least one aggregation case and one IPW/DR covariate case, and add a replicate-weight HonestDiD test that asserts diagonal fallback until multivariate replicate VCV is implemented.

Path to Approval

  1. Implement cross-sectional nuisance-estimation IF corrections for panel=False IPW/DR analytic inference, or explicitly document and temporarily disable those analytic SEs.
  2. Replace per-cell n_treated weighting with fixed cohort-mass weighting throughout panel=False simple/event-study/bootstrap aggregation.
  3. Make replicate-weight HonestDiD use the documented diagonal event-study covariance fallback instead of a full Psi.T @ Psi matrix.
  4. Propagate panel into CallawaySantAnnaResults and correct result-summary labels/semantics for repeated cross-sections.
  5. Add reference tests for panel=False methodology and a replicate-weight HonestDiD regression test.

Static review only; I was not able to do runtime spot-checks in this environment because the Python runtime here is missing required dependencies such as numpy.

… VCV, panel on results

Fix 5 findings from PR #240 CI review:
- Add cross-sectional nuisance IF corrections (PS + OR) to _ipw_estimation_rc
  and _doubly_robust_rc, matching panel path methodology
- Use fixed full-sample cohort masses for unweighted RCS aggregation weights
  (consistency with WIF group-share denominator)
- Guard replicate-weight designs from full event-study VCV (diagonal fallback)
- Add panel field to CallawaySantAnnaResults, fix summary labels for RCS
- Add panel to class docstring, replicate VCV test, RCS IF correction test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 4bf566d1f94ebca6500b5ffe806c87302aa9bb9a


PR Review

Overall assessment

⚠️ Needs changes. The highest unmitigated severity is P1. The prior HonestDiD replicate-weight fallback and panel propagation issues look resolved, but two repeated-cross-section inference problems remain.

Executive Summary

  • The remaining blockers are both in the new Callaway-Sant'Anna panel=False path.
  • Repeated-cross-section DR analytical inference is still internally inconsistent: the nuisance IF corrections reuse the post-period treated denominator for both the post and base-period components.
  • The earlier RCS weighting-contract fix is only partial: balance_e event-study aggregation and the unweighted bootstrap overall/event-study helpers still revert to per-cell n_treated counts instead of fixed cohort masses.
  • HonestDiD’s survey-variance changes now appear aligned with the updated registry, including the replicate-weight diagonal fallback.
  • The new RC tests are mostly smoke/finite-result checks and would not catch either remaining RC inference defect.

Methodology

Affected methods: Callaway-Sant'Anna repeated cross sections (panel=False, Section 4-style DRDID and aggregation/bootstrap). The HonestDiD survey-variance path looks consistent with the updated registry note.

  • Severity: P1. Impact: In panel=False DR, the point estimator correctly uses separate treated normalizers for the post and base-period pieces (sw_gt_sum vs sw_gs_sum, or n_gt vs n_gs), but the nuisance IF corrections collapse both periods onto a single normalizer = sum(sw_gt) or n_gt. That mis-scales both the PS correction and the base-period OR correction whenever the cohort-g sample size or treated weight sum differs across periods, which is the ordinary repeated-cross-section case. The resulting analytical SE/CIs/p-values are inconsistent with the estimator and with the Section 4 repeated-cross-section decomposition promised in the registry. Concrete fix: use separate normalizer_t and normalizer_s throughout M2_dr, M1_t, and M1_s, matching the denominators used in att_t_aug and att_s_aug; add a regression test with n_gt != n_gs and unequal treated weight sums. References: diff_diff/staggered.py:L3118-L3159 diff_diff/staggered.py:L3184-L3228 docs/methodology/REGISTRY.md:L423-L424

  • Severity: P1. Impact: The earlier panel=False weighting-contract bug is only partially fixed. Analytical simple/event-study aggregation now uses fixed cohort masses, but unweighted event-study with balance_e and the unweighted bootstrap helpers still fall back to cell-specific n_treated. In repeated cross-sections those cell counts vary by period, so the bootstrap SEs/CIs and balance_e event-study weights no longer correspond to the estimator/WIF denominator used elsewhere in the same results object. Concrete fix: compute one unweighted cohort-mass map from precomputed["unit_cohorts"] and use it everywhere panel=False aggregation weights are formed, including the balance_e branch in _aggregate_event_study(), overall bootstrap weights, and _prepare_event_study_aggregation(). References: diff_diff/staggered_aggregation.py:L76-L100 diff_diff/staggered_aggregation.py:L590-L610 diff_diff/staggered_aggregation.py:L621-L648 diff_diff/staggered_bootstrap.py:L229-L243 diff_diff/staggered_bootstrap.py:L561-L575 diff_diff/staggered_bootstrap.py:L614-L631

Code Quality

No findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. Neither remaining P1 issue is tracked in TODO.md:L48-L60, so TODO.md does not mitigate them.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: The new RC tests are largely finiteness/smoke checks. They do not exercise the two cases that break the current implementation: unequal treated cohort sizes across the two RC periods, and panel=False aggregation/bootstrap weight formation under balance_e or n_bootstrap>0. As written, the suite would pass with both P1 bugs still present. Concrete fix: add deterministic panel=False fixtures where cohort-g counts differ between t and s, assert the corrected DR SE against a hand-built IF/reference calculation, and add bootstrap/event-study tests that verify fixed cohort-mass weights under balance_e and n_bootstrap>0. References: tests/test_staggered_rc.py:L135-L152 tests/test_staggered_rc.py:L258-L300 tests/test_staggered_rc.py:L362-L381

Path to Approval

  1. Fix the repeated-cross-section DR nuisance IF normalization so the post-period and base-period PS/OR corrections use their own treated denominators (sw_gt_sum/sw_gs_sum or n_gt/n_gs).
  2. Finish propagating fixed cohort-mass weighting through all panel=False aggregation paths, especially balance_e event-study aggregation and the unweighted overall/event-study bootstrap helpers.
  3. Add regression tests covering unequal RC cohort counts across periods and panel=False bootstrap/event-study weight consistency.

Static review only: I could not execute the test suite here because this environment lacks pytest and scientific Python dependencies such as numpy.

…equal-count tests

- Use separate normalizer_t/normalizer_s in _doubly_robust_rc() IF corrections
  (post vs base period treated denominators must match the DR estimator)
- Precompute rcs_cohort_masses in _precompute_structures_rc() and return cohort
  mass as n_treated from _compute_att_gt_rc() — fixes all downstream consumers
  (aggregation, bootstrap, balance_e) at the source instead of per-consumer patches
- Remove now-unnecessary rcs_cohort_counts blocks from aggregation
- Add unequal cohort count test fixture and regression tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6080f927c1c76a760b7825b59d5f46032da64b1c


PR Review

Overall assessment
⚠️ Needs changes. The highest unmitigated severity is P1. The prior RC normalizer and fixed-cohort-weighting blockers appear resolved, but the new panel=False estimator formulas still do not line up with the repeated-cross-section methods this PR claims to implement.

Executive summary

  • The earlier re-review blockers look fixed: RC DR no longer reuses one denominator for both periods, and fixed cohort-mass weighting now propagates through analytical aggregation and bootstrap.
  • The remaining blocker is methodological: the new repeated-cross-section reg and dr paths do not match the DRDID / did::att_gt(panel=FALSE) estimators they are supposed to mirror.
  • _outcome_regression_rc() uses separate pre/post treated residual averages, but the reference reg_did_rc estimator pools treated weights when averaging the predicted change.
  • _doubly_robust_rc() is further off: it uses only control-group ORs and normalizes the control augmentation terms by treated-period masses, which does not match either drdid_rc or the simpler AIPW repeated-cross-section formula.
  • The added RC tests are almost entirely smoke/finite-result checks, so this kind of formula mismatch passes undetected.
  • The HonestDiD survey-df / event-study-vcov changes look consistent with the new registry note.

Methodology

  • Severity: P1. Impact: The new repeated-cross-section reg path in _outcome_regression_rc at diff_diff/staggered.py:L2795 computes ATT = mean_t(Y - m_t(X)) - mean_s(Y - m_s(X)) using separate treated averages for the post and base periods (diff_diff/staggered.py:L2843, diff_diff/staggered.py:L2859, diff_diff/staggered.py:L2869). did::att_gt(panel=FALSE, est_method="reg") dispatches to DRDID::reg_did_rc, and that estimator averages the predicted change over the treated group with pooled treated weights rather than separate pre/post treated residual means. That is a different finite-sample estimator whenever treated-sample composition differs across the two cross-sections. The registry note at docs/methodology/REGISTRY.md:L423 documents panel=False support, but not this estimator change. Concrete fix: Rework _outcome_regression_rc() and its IF to match reg_did_rc exactly, or explicitly document and rename a different RC regression estimator if that deviation is intentional. citeturn3view0turn5view1
  • Severity: P1. Impact: The new repeated-cross-section dr path in _doubly_robust_rc at diff_diff/staggered.py:L3031 does not match the cited DRDID repeated-cross-section estimators. The point estimator uses only control-group ORs (diff_diff/staggered.py:L3059) and divides the control augmentation terms by treated-period masses (diff_diff/staggered.py:L3131, diff_diff/staggered.py:L3153), with the same normalization baked into the IF corrections (diff_diff/staggered.py:L3190, diff_diff/staggered.py:L3211). But did::att_gt(panel=FALSE, est_method="dr") dispatches to DRDID::drdid_rc; that locally efficient estimator includes treated- and control-group outcome-regression pieces in both periods, and even the simpler AIPW repeated-cross-section formula normalizes each treated/control pre/post component by its own weight sum rather than the treated totals. So the current code changes both point estimates and IF-based SEs whenever reweighted control mass differs from treated mass. Concrete fix: Pick one specific Section 4 DR estimator (drdid_rc, aipw_did_rc1, or another named variant), implement its point estimator and IF end-to-end, and update the registry to name that exact estimator. citeturn3view1turn5view2turn1search4

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P2. Impact: In the RC path, _compute_att_gt_rc() stores the full cohort mass in n_treated so downstream aggregation can reuse existing weight plumbing (diff_diff/staggered.py:L2710), but the public results contract still documents n_treated as the number of treated observations for that group-time cell (diff_diff/staggered_results.py:L21). That silently turns a reporting field into an aggregation-weight field. Concrete fix: Keep n_treated as the actual cell count and add a separate cohort_mass / agg_weight field for RC aggregation.

Tech Debt

  • No separate findings. The P1 methodology issue above is not tracked in TODO.md:L48, so it remains unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new RC test file is almost entirely smoke coverage: it checks that fits are finite, methods run, and bootstrap returns something, but never compares panel=False reg/dr against DRDID formulas, did outputs, or even hand-built 2×2 RC reference values (tests/test_staggered_rc.py:L48, tests/test_staggered_rc.py:L160, tests/test_staggered_rc.py:L413). That means the formula mismatch above passes cleanly. Concrete fix: Add deterministic repeated-cross-section fixtures where treated covariate composition differs between pre and post, then assert reg/dr ATT and SE outputs against checked-in DRDID / R reference values or hand-verified formulas.

Path to Approval

  1. Re-implement _outcome_regression_rc() to match the repeated-cross-section reg_did_rc formula and influence function used by did::att_gt(panel=FALSE, est_method="reg").
  2. Re-implement _doubly_robust_rc() to match one named DRDID repeated-cross-section estimator exactly, including the correct normalization and nuisance-IF terms.
  3. Add RC reference-value tests that fail under the current formulas, not just finiteness/smoke checks.

…ulas

_outcome_regression_rc: Pool all treated obs for OR correction term
(was: separate per-period averages). Period-specific treated means for Y.
Matches Sant'Anna & Zhao (2020) Eq 2.2 / R reg_did_rc exactly.

_doubly_robust_rc: Fit 4 OLS models (control+treated, pre+post) for
locally efficient DR estimator (was: 2 control-only). Implements tau_1
(AIPW) + tau_2 (local efficiency adjustment) with full 11-component IF.
Matches Sant'Anna & Zhao (2020) Eq 3.3+3.4 / R drdid_rc exactly.

Add agg_weight field to group_time_effects for RCS aggregation weight
(cohort mass), separate from n_treated (per-cell display count).
Aggregation uses data.get("agg_weight", data["n_treated"]) for
backward compatibility with panel data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: b623deeed20d79ab05bdf255819745c93140371a


Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Executive Summary

  • The prior repeated-cross-section point-estimator blockers appear addressed: the new reg path now pools the treated OR correction, and the new dr path includes treated-side OR terms plus nuisance IF corrections.
  • [Newly identified] The non-survey panel=False bootstrap path still reaggregates with realized cell counts n_{g,t} instead of the fixed cohort mass agg_weight/N_g, so bootstrap overall and event-study inference target a different estimator than the analytical path.
  • That mismatch is not covered by the new methodology note for repeated cross-sections in docs/methodology/REGISTRY.md:L423, and it is not tracked in TODO.md:L48-L60 or TODO.md:L167-L170.
  • The HonestDiD survey-df / event-study-vcov changes are consistent with the new registry note; no separate methodology defect stood out there.
  • The new RC bootstrap tests are smoke-only, so this weighting regression currently slips through.

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity: P3 Impact: event_study_vcov is cached on the estimator instance and later copied into results via getattr(self, "_event_study_vcov", None). If the same CallawaySantAnna object is reused, a fit that does not compute event-study aggregation can inherit stale covariance metadata from a previous fit. References: diff_diff/staggered_aggregation.py:L714-L755, diff_diff/staggered.py:L1772-L1796. Concrete fix: reset self._event_study_vcov = None at the start of fit() and only attach it when the current fit actually computed event_study_effects.

Tech Debt

No separate findings. The P1 bootstrap-weight issue above is not currently tracked in TODO.md:L48-L60 or TODO.md:L167-L170, so it remains unmitigated.

Security

No findings.

Documentation/Tests

  • Severity: P2 Impact: The new RC bootstrap coverage only checks that bootstrap runs, including on the unequal-cohort fixture, but never asserts that bootstrap aggregation uses the fixed cohort masses introduced for panel=False. References: tests/test_staggered_rc.py:L289-L300, tests/test_staggered_rc.py:L413-L482. Concrete fix: add a deterministic unequal-cohort panel=False test that verifies bootstrap overall/event-study prep uses the same fixed cohort weights as the analytical aggregation path.

Path to Approval

  1. Update CallawaySantAnnaBootstrapMixin._run_multiplier_bootstrap() so non-survey panel=False overall ATT bootstrap weights use fixed cohort mass (agg_weight / precomputed cohort mass), not per-cell n_treated.
  2. Update _prepare_event_study_aggregation() so non-survey panel=False event-study bootstrap weights use the same fixed cohort masses.
  3. Add a regression test on the unequal-cohort RC fixture that fails if bootstrap reaggregation falls back to per-cell n_treated instead of fixed cohort mass.

- Bootstrap overall/event-study reaggregation now uses agg_weight (fixed
  cohort mass) for panel=False, matching the analytical aggregation path
- Reset self._event_study_vcov = None at start of fit() to prevent stale
  VCV from prior fit leaking into reused estimator objects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 3b405b7f9009afb4d8d90397d931b327a5c65825


Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Static review only: I could not execute the changed tests in this environment because the default Python interpreter here does not have the project dependencies available.

Executive Summary

Methodology

  • Severity: P1 [Newly identified]. Impact: CallawaySantAnna.fit() computes and stores event_study_vcov from analytical IF vectors during event-study aggregation,diff_diff/staggered_aggregation.py:L714-L755 then, when n_bootstrap>0, overwrites event_study_effects[*]["se"], CIs, and p-values with bootstrap results while leaving that covariance matrix unchanged on the results object.diff_diff/staggered.py:L1709-L1733 diff_diff/staggered.py:L1773-L1799 HonestDiD now always prefers event_study_vcov when present,diff_diff/honest_did.py:L664-L670 so bootstrap-fit CS results silently feed analytical covariance into sensitivity analysis. That contradicts the Phase 7d intent that HonestDiD respect the same variance structure as the underlying event study.docs/methodology/REGISTRY.md:L1637-L1637 Concrete fix: when bootstrap inference is used for event-study results, either compute and store a bootstrap event-study covariance matrix from the bootstrap draws, or clear/ignore event_study_vcov so HonestDiD falls back to the bootstrap variance path instead of mixing analytical and bootstrap inference.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new HonestDiD tests validate analytical event_study_vcov creation and the replicate-weight diagonal fallback, but they do not cover CallawaySantAnna(..., n_bootstrap>0, aggregate="event_study"). That leaves the new covariance-source mismatch untested.tests/test_honest_did.py:L1158-L1281 Concrete fix: add a regression test that fits a bootstrap event study, runs HonestDiD.fit(), and asserts that the covariance source matches the bootstrap path (or that event_study_vcov is absent/recomputed from bootstrap draws on bootstrapped fits).

Path to Approval

  1. Make bootstrap-fit CallawaySantAnna event-study results expose a variance path that matches the reported bootstrap SEs: either store a bootstrap event-study covariance matrix, or suppress/ignore analytical event_study_vcov on bootstrapped fits.
  2. Add a targeted regression test for CallawaySantAnna(n_bootstrap>0, aggregate="event_study") -> HonestDiD.fit(...) that fails if HonestDiD consumes the analytical IF covariance instead of the bootstrap path.
  3. If the analytical-on-bootstrap behavior is intentional, document that explicitly in docs/methodology/REGISTRY.md; otherwise the current behavior remains an undocumented methodology deviation.

…dy SEs

Prevents HonestDiD from mixing analytical IF-based VCV with bootstrap SEs
on bootstrap-fit CallawaySantAnna results. When n_bootstrap>0, the
event_study_vcov is set to None so HonestDiD falls back to diagonal from
the bootstrap SEs (consistent variance path).

Add regression test: bootstrap CS → HonestDiD asserts vcov is None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 53cfd5d5b06dd44f71f5272fe16a5e537543a327


Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Static review only: I could not execute the added tests here because pytest is not installed in this environment.

Executive Summary

Methodology

  • Severity: P3. Impact: The previous HonestDiD/bootstrap covariance-source finding looks resolved. Bootstrapped CS fits now discard analytical event_study_vcov before results are stored, so HonestDiD falls back to the bootstrap-compatible diagonal path, and that exact regression is now tested. Concrete fix: None. diff_diff/staggered.py:1323 diff_diff/staggered.py:1777 tests/test_honest_did.py:1283
  • Severity: P1. Impact: _outcome_regression_rc says it matches DRDID::reg_did_rc, but its control-side OLS estimation-effect term is divided by the treated-mass denominator twice. M1 is already normalized by sum_w_D, then inf_ct / inf_cs divide inf_cont_2_* by sum_w_D again. That shrinks the nuisance-estimation piece of the influence function, so covariate-adjusted repeated-cross-section reg fits understate per-cell analytical SEs and any bootstrap path built from those IFs. Concrete fix: Keep M1 normalized as written and remove the extra / sum_w_D on inf_cont_2_ct and inf_cont_2_cs, or re-port the reg_did_rc IF algebra directly from the reference implementation. diff_diff/staggered.py:2824 diff_diff/staggered.py:2920 diff_diff/staggered.py:2938 diff_diff/staggered_bootstrap.py:357 docs/methodology/REGISTRY.md:423. citeturn0view0turn1view0turn1view3
  • Severity: P1. Impact: The new RCS ipw and dr nuisance corrections are also mis-scaled. _ipw_estimation_rc normalizes w_ct / w_cs to w_*_norm and then forms M2_rc with np.mean(...), adding an extra 1/n_ct or 1/n_cs; _doubly_robust_rc likewise divides its PS moment by n_all after already normalizing by sum_w_ipw_*. In the standardized RC IPW and locally efficient RC DR references, those PS moments are ratio-of-means terms, not extra sample-size-scaled means. That under-scales the PS correction and therefore understates inference for covariate-adjusted repeated-cross-section ipw and dr, including survey-weighted fits. Concrete fix: Rewrite the PS moment construction to match std_ipw_did_rc / drdid_rc exactly: summed normalized-weight moments for standardized IPW, and ratio-of-means PS moments without the extra / n_all for DR. diff_diff/staggered.py:3022 diff_diff/staggered.py:3076 diff_diff/staggered.py:3111 diff_diff/staggered.py:3371 diff_diff/staggered_bootstrap.py:357 docs/methodology/REGISTRY.md:423. citeturn2view0turn4view0turn1view3

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. These methodology defects are not mitigated by tracking: the nearby CS item in TODO.md is already marked resolved. TODO.md:48 TODO.md:57

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests do not verify the magnitude of the RCS nuisance corrections. They check finiteness, positivity, or that different methods return different SEs, which would still pass with the scaling errors above. Concrete fix: Add fixed-data regression tests for panel=False reg / ipw / dr with covariates that compare at least one ATT(g,t) SE or influence-function vector against DRDID reference output, and assert that bootstrap SEs move with the corrected IF magnitude. tests/test_staggered_rc.py:163 tests/test_staggered_rc.py:362 tests/test_survey_phase7a.py:60 tests/test_survey_phase7a.py:199

Path to Approval

  1. Fix _outcome_regression_rc so its OLS estimation-effect correction matches reg_did_rc, specifically by removing the extra treated-mass division after M1 has already been normalized.
  2. Fix _ipw_estimation_rc and _doubly_robust_rc so their PS nuisance moments use the same ratio-of-means scaling as the DRDID repeated-cross-section references.
  3. Add fixed-dataset regression tests for panel=False covariate-adjusted reg, ipw, and dr that check both analytical SEs and bootstrap SEs against reference values.
  4. Keep the HonestDiD bootstrap covariance fix and its regression test; that prior blocker looks resolved.

…s sum not mean

- _outcome_regression_rc: M1 denominator changed from sum_w_D to n_all
  (matching R colMeans convention); inf_cont_2 / sum_w_D then gives correct
  single normalization by mean_w_D * n_all = sum_w_D
- _ipw_estimation_rc: PS M2 uses np.sum/n_all instead of np.mean (which
  divided by n_ct/n_cs instead of n_all, under-scaling the correction)
- _doubly_robust_rc: PS M2 already correct (np.sum/n_all), no change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 9ff21a2859e7713ca00a4d0359f855dc272d0b7f


Overall Assessment

Needs changes — highest unmitigated severity: P1.

Static review plus source cross-check only. I could not run pytest here because pytest is not installed in this environment.

Executive Summary

  • Re-review outcome: the prior HonestDiD/bootstrap covariance finding appears fixed, but both prior repeated-cross-section inference findings remain unresolved.
  • The new repeated-cross-section reg path still under-scales the control-side OLS nuisance correction, so panel=False + covariates + estimation_method="reg" understates inference.
  • The new repeated-cross-section ipw and dr paths still under-scale the propensity-score nuisance corrections, so their analytical SEs and multiplier bootstrap remain too small.
  • REGISTRY.md and TODO.md now mark Phase 7a/7b as resolved, but these remaining variance/IF mismatches are not documented deviations and are not mitigated by tracking.
  • The new tests mostly check finiteness or coarse behavior; they still do not pin RCS covariate-adjusted IF/SE magnitudes to DRDID reference output.

Methodology

  • Severity: P3. Impact: the prior HonestDiD/bootstrap covariance blocker looks resolved. fit() now resets stale event_study_vcov state and clears the analytical event-study covariance on bootstrap fits before storing results, with regression coverage in the HonestDiD tests. Concrete fix: none. diff_diff/staggered.py:L1323, diff_diff/staggered.py:L1777, tests/test_honest_did.py:L1283

  • Severity: P1. Impact: the prior RCS reg scaling finding remains unresolved. In _outcome_regression_rc, the local port still computes M1 = sum(w_D * X) / n_all and then divides the assembled estimation-effect term by sum_w_D, which leaves an extra 1 / n_all shrinkage in the OLS nuisance correction. Relative to DRDID::reg_did_rc, that understates per-cell analytical SEs for panel=False + covariates + estimation_method="reg", and the multiplier bootstrap inherits the same understatement because it perturbs the stored IF vectors directly. Concrete fix: port reg_did_rc literally in the local phi = psi / n convention: either use M1 = sum(w_D * X) with the current / sum_w_D, or keep colMeans(...) and divide by mean_w_D instead of sum_w_D. diff_diff/staggered.py:L2806, diff_diff/staggered.py:L2921, diff_diff/staggered.py:L2939, diff_diff/staggered_bootstrap.py:L372, docs/methodology/REGISTRY.md:L423, TODO.md:L57. citeturn0view2

  • Severity: P1. Impact: the prior RCS ipw / dr scaling finding also remains unresolved. _ipw_estimation_rc and _doubly_robust_rc normalize the control weights first (w_ct_norm, w_ipw_* / sum_w_ipw_*) and then divide the propensity-score moments by n_all again before multiplying by asy_lin_rep_ps. In DRDID::std_ipw_did_rc and DRDID::drdid_rc, those moments are already normalized by the control-weight means; after translating to the local phi = psi / n convention, there should not be an additional / n_all. The result is understated analytical SEs and understated multiplier-bootstrap dispersion for panel=False covariate-adjusted ipw and dr, including survey-weighted fits. Concrete fix: remove the extra / n_all from the RCS PS moments and port the std_ipw_did_rc / drdid_rc nuisance terms term-for-term. diff_diff/staggered.py:L3071, diff_diff/staggered.py:L3086, diff_diff/staggered.py:L3382, diff_diff/staggered.py:L3397, diff_diff/staggered_bootstrap.py:L372, docs/methodology/REGISTRY.md:L423, TODO.md:L57. citeturn0view1turn0view0

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. The unresolved SE/IF scaling issues above are not mitigated by the new tracking text because incorrect variance output is not deferrable under this review rubric. TODO.md:L57, docs/methodology/REGISTRY.md:L423

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Fix _outcome_regression_rc so the control-side OLS nuisance correction matches DRDID::reg_did_rc exactly; the current port still carries an extra / n_all.
  2. Fix _ipw_estimation_rc and _doubly_robust_rc so the PS nuisance moments match DRDID::std_ipw_did_rc and DRDID::drdid_rc exactly; remove the extra / n_all after control-weight normalization.
  3. Add fixed-data regression tests for RCS reg / ipw / dr with covariates that check at least one analytical ATT(g,t) SE and one multiplier-bootstrap SE against DRDID reference output, including an unequal cohort-count case.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

…convergence test

REGISTRY.md: Document that RCS IFs use phi=psi/n convention (SE = sqrt(sum(phi^2))),
algebraically equivalent to R's sd(psi)/sqrt(n). The 1/n_all denominator in gradient
terms is the colMeans -> phi conversion, not extra shrinkage.

Add test proving correctness: analytical SE within 20% of bootstrap SE (499 iters)
for RCS reg with covariates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the survey-phase-seven branch from cfcc441 to c2f8fdc Compare March 29, 2026 13:26
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: cfcc441d510052036e3e4adfccacf9717c1c17b6


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not run the test suite here because this environment does not have numpy, pandas, scipy, or pytest installed.

Executive Summary

  • Re-review outcome: the prior bootstrap/HonestDiD covariance-leak finding appears fixed.
  • The prior repeated-cross-section reg inference finding remains unresolved: the control-side OLS nuisance correction is still under-scaled in panel=False + covariates.
  • The prior repeated-cross-section ipw / dr inference finding also remains unresolved: the propensity-score nuisance corrections are still under-scaled in the new RCS paths.
  • [Newly identified] The new event_study_vcov path in HonestDiD does not subset the covariance matrix to the same finite-SE event times used for beta_hat, so dropped interior event times can silently misalign bounds.
  • The new tests are still too weak to catch the RCS SE issues: most are smoke/shape checks, and the “analytical vs bootstrap” test is not independent because bootstrap perturbs the same stored IF vectors.

Methodology

  • Severity: P1. Impact: the prior repeated-cross-section reg SE bug is still present. In _outcome_regression_rc(), the new code forms the OLS asymptotic linear representation with bread_t = (X'WX)^{-1} / bread_s = (X'WX)^{-1} at diff_diff/staggered.py#L2914, but it also keeps M1 = sum(w_D X) / n_all at diff_diff/staggered.py#L2921 and then divides again by sum_w_D at diff_diff/staggered.py#L2939. In the official DRDID::reg_did_rc implementation, the matching nuisance term is built on the unscaled R IF side with solve(X'WX / n) and M1 = colMeans(w.cont * X) before the final / mean(w.cont) step. Under this library’s pre-scaled phi convention, the / n lives in the ALR; keeping / n_all inside M1 shrinks the correction one extra time. That understates analytical SEs for panel=False + covariates + estimation_method="reg", and the multiplier bootstrap inherits the same understatement because it reuses the stored IF vectors at diff_diff/staggered_bootstrap.py#L372. Concrete fix: either rescale the OLS ALR back to the DRDID convention (inv(X'WX / n_all)) or, with the current bread, drop the / n_all from M1. (sciencedirect.com)

  • Severity: P1. Impact: the prior repeated-cross-section ipw / dr SE bug also remains present. _ipw_estimation_rc() builds asy_lin_rep_ps with H_ps_inv = (X'WX)^{-1} at diff_diff/staggered.py#L3063 and then divides the already-normalized M2_rc moments by n_all at diff_diff/staggered.py#L3081. _doubly_robust_rc() repeats the same pattern at diff_diff/staggered.py#L3366 and diff_diff/staggered.py#L3384. In the official std_ipw_did_rc and drdid_rc implementations, the corresponding M2.post / M2.pre terms are colMeans(...) / mean(...) objects with no extra /n; the local phi conversion is already handled by the Hessian scaling. As written, the PS nuisance corrections are too small, so covariate-adjusted RCS ipw and dr fits understate analytical SEs, and multiplier bootstrap dispersion is understated for the same reason. Concrete fix: remove the extra / n_all from the RCS PS moment vectors, or equivalently rescale the PS ALR to the R reference scale before multiplying by M2. (sciencedirect.com)

  • Severity: P1. Impact: [Newly identified] the new HonestDiD covariance path can silently misalign the covariance matrix when some event-study periods have n_groups > 0 but non-finite SEs. _extract_event_study_params() filters CS event-study periods down to finite-SE rel_times at diff_diff/honest_did.py#L645, but if event_study_vcov exists it returns the full matrix unchanged at diff_diff/honest_did.py#L666. _aggregate_event_study() builds that matrix over all event times with stored IF vectors at diff_diff/staggered_aggregation.py#L629 and diff_diff/staggered_aggregation.py#L716. If an interior event time is dropped, fit() ends up pairing a filtered beta_hat with an unfiltered covariance matrix and silently truncating the top-left block, which can shift original_se and the sensitivity bounds. Concrete fix: subset event_study_vcov to the surviving rel_times before returning sigma, and add a regression test with an interior dropped event time.

  • Severity: P3. Impact: the prior bootstrap/HonestDiD covariance-leak issue appears fixed. fit() now resets stale _event_study_vcov state at diff_diff/staggered.py#L1324 and clears analytical event-study covariance on bootstrap fits at diff_diff/staggered.py#L1777, with coverage in tests/test_honest_did.py#L1283. Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: TODO.md now marks Phase 7a as resolved at TODO.md#L57, and the registry note at docs/methodology/REGISTRY.md#L423 / docs/methodology/REGISTRY.md#L424 claims the RCS 1/n_all factors are just the local phi conversion. Those notes do not mitigate the two P1 SE defects above; incorrect variance output is not deferrable under this review rubric. Concrete fix: reopen or replace the resolved tracking entry once the RCS nuisance-scaling actually matches the implemented IF convention.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new Phase 7 tests still do not anchor the affected SEs to an external reference. tests/test_survey_phase7a.py is mostly smoke/scale-invariance coverage starting at tests/test_survey_phase7a.py#L56, and tests/test_staggered_rc.py calls the analytical-vs-bootstrap check “proof” at tests/test_staggered_rc.py#L413, but bootstrap reuses the same stored IF vectors at diff_diff/staggered_bootstrap.py#L357. That cannot catch a common IF scaling error. The new HonestDiD tests also only exercise fully finite event-study inputs at tests/test_honest_did.py#L1158, so the new covariance-subsetting edge case is untested. Concrete fix: add fixed-data regression tests for panel=False reg / ipw / dr with covariates that assert at least one analytical ATT(g,t) SE and one multiplier-bootstrap SE against DRDID reference output, plus an HonestDiD test with an interior event-study period dropped for se=np.nan.

Path to Approval

  1. Fix _outcome_regression_rc() so the OLS nuisance correction is on the correct scale: either use the DRDID-side n_all factor in the OLS ALR or remove the current / n_all from M1.
  2. Fix _ipw_estimation_rc() and _doubly_robust_rc() so the PS nuisance moments are not divided by n_all after they have already been normalized.
  3. Subset event_study_vcov to the same finite-SE event-time index used to build beta_hat inside _extract_event_study_params().
  4. Add fixed-data regression tests against DRDID reference output for RCS reg / ipw / dr analytical SEs and multiplier-bootstrap SEs, and add one HonestDiD regression test with an interior dropped event-study period.

Restructure _outcome_regression_rc, _ipw_estimation_rc, _doubly_robust_rc
to compute leading IF terms in R's unnormalized psi convention (using
mean_w_* = sum_w_*/n_all normalizers matching R's mean()), then convert
to library phi = psi/n_all at the boundary. Makes DRDID correspondence
explicit with R variable name comments.

Fix HonestDiD event_study_vcov subsetting: when filtering NaN-SE event
times, subset the VCV matrix to match the surviving rel_times (was using
the full unfiltered matrix, causing dimension mismatch on interior drops).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: cb3f815c22cd29f4ad338cbb8d25fe3b3deaa577


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not run the suite here because this environment is missing numpy, pandas, scipy, and pytest.

Executive Summary

  • The prior bootstrap/HonestDiD covariance-leak finding appears fixed via the fit-state reset in diff_diff/staggered.py:L1324 and the bootstrap-time clearing of analytical event-study covariance in diff_diff/staggered.py:L1777.
  • The prior repeated-cross-section 1/n_all scaling dispute is now explicitly documented in docs/methodology/REGISTRY.md:L423 and docs/methodology/REGISTRY.md:L424, so it is no longer a blocking methodology defect under this rubric.
  • The prior HonestDiD covariance-subsetting finding remains unresolved: _extract_event_study_params() filters beta_hat down to finite-SE event times but still takes the full event_study_vcov, so a dropped interior event time can silently misalign original_se and the reported sensitivity bounds.
  • [Newly identified] The Phase 7a panel covariate-adjusted propensity-score correction is still under-scaled in the newly supported survey ipw path and the new dr IF-correction path, so analytical SEs are understated and the multiplier bootstrap inherits the same understatement. (rdrr.io)
  • The new Phase 7 tests are still mostly smoke/invariance coverage and would not catch either issue because they do not benchmark these SE-sensitive paths against DRDID/HonestDiD reference outputs.

Methodology

  • Severity: P1. Affected method(s): HonestDiD on CallawaySantAnnaResults. Impact: _extract_event_study_params() filters event times to those with finite SEs at diff_diff/honest_did.py:L645, but if event_study_vcov exists it still assigns the full matrix at diff_diff/honest_did.py:L666. That covariance is built over the full aggregated event-study sequence in diff_diff/staggered_aggregation.py:L714, and fit() silently falls back to the top-left block when dimensions do not match at diff_diff/honest_did.py:L1158. If a middle event time is dropped, the remaining coefficients and covariance matrix can become misaligned with no warning, producing wrong original_se and wrong Rambachan-Roth bounds. Concrete fix: subset event_study_vcov to the exact surviving rel_times before returning sigma, and add a regression test where one interior event time has se=np.nan but adjacent periods remain finite.

  • Severity: P1. Affected method(s): CallawaySantAnna panel ipw with covariates + survey, and panel dr with covariates with and without survey. Impact: the PR now advertises/supports these paths in docs/methodology/REGISTRY.md:L419, tests/test_survey_phase4.py:L929, and tests/test_survey_phase7a.py:L361, but the PS nuisance gradients still use np.mean(...) over only control rows at diff_diff/staggered.py:L2052, diff_diff/staggered.py:L2314, and diff_diff/staggered.py:L2375. In the official DRDID panel implementations, weights are normalized to mean one, the PS ALR uses solve(X'WX) * n, and the corresponding IPW/DR derivatives are colMeans(w.cont * ... * X) before the final division by mean(w.cont) in the control IF. This code instead uses the unscaled inverse of X'WX, which already puts the ALR in the library’s pre-scaled phi convention, so averaging the normalized gradient again over control rows introduces an extra 1/n_c shrinkage. That understates analytical SEs for these panel covariate-adjusted IPW/DR paths, and the bootstrap path reuses the same stored IF vectors at diff_diff/staggered_bootstrap.py:L377, so the understatement carries through there too. Concrete fix: either remove the extra control-row averaging and use sums in those M2/M2_dr terms, or rescale the ALR back to the DRDID psi convention consistently before multiplying by the gradient; then add fixed-data SE regressions against DRDID for panel ipw/dr. (rdrr.io)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The REGISTRY.md/TODO.md updates do not mitigate the two methodology defects above; incorrect variance output is not deferrable under this review rubric.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Subset event_study_vcov to the same finite-SE rel_times used to build beta_hat in _extract_event_study_params(), and add a regression test with an interior dropped event time.
  2. Fix the panel PS nuisance-gradient scaling in _ipw_estimation() and _doubly_robust() by removing the extra control-row averaging or by rescaling the ALR consistently to the DRDID psi convention.
  3. Add fixed-data reference tests against official DRDID outputs for panel ipw + survey + covariates, panel dr + survey + covariates, and panel dr + covariates without survey.
  4. Add one HonestDiD regression test where a middle event-study period has se=np.nan while neighboring periods remain finite, and assert the covariance matrix is subset to the same retained periods.

…2 gradients

Resolve TODO.md merge conflict (keep our resolved items from Phase 7).

Panel IPW/DR PS nuisance gradients (M2, M2_dr): change np.mean() over
control rows to np.sum() / (n_t + n_c), matching R's colMeans() over
ALL n observations (zero for treated). This makes the panel paths
consistent with the RC paths' R-convention refactoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7e127fb29e1ca2f59a14e9bb0391a1e728f8483f


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not run the suite here because this environment is missing numpy, so importing the package fails.

Executive Summary

  • The prior bootstrap/HonestDiD state-leak issue looks fixed: fit() now resets _event_study_vcov, and bootstrap results clear the analytical event-study covariance before storing results.
  • The panel covariate-adjusted propensity-score nuisance correction still appears under-scaled in the newly supported survey ipw path and the survey/non-survey dr paths, so the reported SEs still look too small.
  • HonestDiD’s new event-study VCV subsetting is only partial; with base_period="universal" plus any additional dropped event time, coefficients and covariance can still become misaligned.
  • The new panel=False threading, aggregation, and bootstrap plumbing looks internally coherent, and the RCS phi = psi / n convention is documented in the Methodology Registry, so I did not count that as a defect.
  • No security issues stood out. The main remaining gap is test coverage for the two SE-sensitive failure modes above.

Methodology

  • Severity: P1. Location: diff_diff/staggered.py:L2052-L2059, diff_diff/staggered.py:L2314-L2321, diff_diff/staggered.py:L2376-L2382, diff_diff/staggered_bootstrap.py:L377-L399, docs/methodology/REGISTRY.md:L419-L424. Impact: the panel PS correction still divides M2 / M2_dr by n_t + n_c even though asy_lin_rep_ps = score_ps @ H^{-1} is formed with the unnormalized Hessian H = X'WX, so that term is already on the library’s per-observation IF scale. That leaves the newly unblocked survey ipw path and the new survey/non-survey dr paths with a propensity-score correction that is smaller by another factor of 1/n, understating analytical SEs; because the bootstrap reuses the same stored IF vectors, the understatement carries through there as well. This is not covered by a REGISTRY.md deviation note for the panel estimator, so under the rubric it remains an undocumented methodology mismatch. Concrete fix: remove the extra / (n_t + n_c) from the panel PS gradients, or explicitly rescale asy_lin_rep_ps back to the DRDID/R psi convention before multiplying.

  • Severity: P1. Location: diff_diff/staggered_aggregation.py:L701-L755, diff_diff/honest_did.py:L666-L676, diff_diff/honest_did.py:L1165-L1173. Impact: the new HonestDiD subsetting logic still misses the base_period="universal" case when some other event time is also dropped. _aggregate_event_study() injects the e=-1 reference row into event_study_effects, but event_study_vcov is still indexed only by the aggregated non-reference event times. _extract_event_study_params() only subsets when vcov.shape[0] == len(all_event_times), which is false once that reference row exists. If an interior event time also has se=np.nan, beta_hat is filtered but sigma is not, and fit() can silently take a misaligned block. Concrete fix: store the exact event-time index that matches event_study_vcov and subset against that list, not against results.event_study_effects.keys().

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. The TODO.md/REGISTRY.md updates do not mitigate the two methodology defects above because incorrect SEs and covariance alignment are non-deferrable under this rubric.

Security

No findings.

Documentation/Tests

  • Severity: P2. Location: tests/test_honest_did.py:L746-L778, tests/test_honest_did.py:L1158-L1299, tests/test_survey_phase7a.py:L56-L389. Impact: the new tests cover smoke/invariance behavior, but they still do not pin the two remaining SE-sensitive failure modes. The HonestDiD tests check that the universal-base reference period is excluded, not the harder case where universal base and an additional dropped event time coexist. The Phase 7a tests check finite outputs and “DR SE differs from reg,” but they do not benchmark panel ipw/dr SEs against a DRDID reference, and analytical-vs-bootstrap agreement cannot catch a shared IF-scaling error because bootstrap reuses the same IF objects. Concrete fix: add one universal-base HonestDiD regression test with an interior se=np.nan, plus fixed-data reference tests against DRDID for panel survey ipw, panel survey dr, and panel non-survey dr.

Path to Approval

  1. Fix the panel PS nuisance-correction scaling in _ipw_estimation() and _doubly_robust() so the PS term is on the same IF scale as the rest of the panel estimator.
  2. Make HonestDiD subset event_study_vcov by the VCV’s actual event-time ordering rather than the full event_study_effects key set.
  3. Add fixed-data reference tests against DRDID for panel survey ipw, panel survey dr, and panel non-survey dr standard errors.
  4. Add a universal-base HonestDiD regression test where one non-reference event time has se=np.nan and assert the covariance matrix is subset to the retained relative times.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 9c8ebb48617f175ee24d42bfeb7c9453d32ae635


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The previous HonestDiD reference-period splitter issue appears fixed: diff_diff/honest_did.py:671 now keys off the injected normalization marker instead of treating any n_groups==0 bucket as the omitted period.
  • The previous HonestDiD covariance blocker is resolved for analytical Callaway-Sant’Anna event studies; the remaining bootstrap/replicate diagonal fallback is now documented in docs/methodology/REGISTRY.md:1640 and is no longer blocking under this rubric. (rdrr.io)
  • The panel covariate-adjusted ipw/dr nuisance-IF scaling blocker remains open: the new PS correction still carries an extra /n, so analytical SEs are likely understated in the newly enabled panel paths. (rdrr.io)
  • The new repeated-cross-section panel=False nuisance-IF code has the same scaling problem in reg, ipw, and dr, so analytical SEs for the new covariate-adjusted RC paths are also likely too small. (rdrr.io)
  • Static review only: python -m pytest -q tests/test_survey_phase7a.py tests/test_staggered_rc.py tests/test_honest_did.py could not run here because pytest is not installed.

Methodology

Affected methods: CallawaySantAnna panel ipw/dr with covariates; CallawaySantAnna(panel=False) reg/ipw/dr; and HonestDiD on Callaway-Sant’Anna event studies.

  • Severity P1. Impact: The panel propensity-score nuisance corrections are still under-scaled in diff_diff/staggered.py:2051, diff_diff/staggered.py:2312, and diff_diff/staggered.py:2375. DRDID’s panel IPW/DR code forms the logit linear representation on the solve(X'WX/n) scale and then applies an M2 term on the control-weight scale before SEs are formed from sd(att.inf.func)/sqrt(n). Here, score @ solve(X'WX/n) / n already puts asy_lin_rep_ps on the library’s phi scale, but M2/M2_dr are divided by n_all_panel again after the control term has already been normalized. That extra /n does not cancel; it shrinks the correction and understates analytical SEs. The new “algebraically equivalent” note in docs/methodology/REGISTRY.md:421 therefore does not mitigate this. Concrete fix: keep a single scale consistently. If you keep asy_lin_rep_ps on phi scale, drop the trailing /n_all_panel from M2 and M2_dr; otherwise revert the whole PS correction to R-style psi scale and divide once at the end. (rdrr.io)
  • Severity P1. Impact: The repeated-cross-section nuisance corrections are likewise under-scaled in diff_diff/staggered.py:2945, diff_diff/staggered.py:3144, and diff_diff/staggered.py:3447. In reg, the OLS breads use (X'WX)^{-1} but the resulting correction is still inserted into psi_all before the final /n_all, which leaves the OR correction one factor of n too small. In ipw and dr, asy_lin_rep_ps = score @ (X'WX)^{-1} is already the R representation divided by n, so the added /n_all in M2 shrinks the PS correction again. DRDID’s RC implementations instead use XpX/n or Hessian.ps = solve(X'WX) * n, colMeans(...), and then sd(att.inf.func)/sqrt(n), so the current formulas do not match the source material despite the claims in docs/methodology/REGISTRY.md:421 and docs/methodology/REGISTRY.md:426. Concrete fix: for reg, either use the R-style solve(X'WX/n) bread before assembling psi_all, or move the OR correction to inf_all after the /n_all conversion. For ipw/dr, remove the extra /n_all in M2 when asy_lin_rep_ps is already on phi scale. (rdrr.io)
  • Severity P3. Impact: The prior HonestDiD covariance blocker is mitigated for analytical CS event studies. diff_diff/staggered_aggregation.py:716 now builds a full event-study covariance matrix, and diff_diff/honest_did.py:709 consumes it. The remaining diagonal fallback for bootstrap-fitted or replicate-weight results is explicitly documented in docs/methodology/REGISTRY.md:1641, so under the stated rubric this is informational, not blocking. Concrete fix: none required for approval. (rdrr.io)
  • Severity P3. Impact: The prior HonestDiD universal-base handling issue appears resolved. diff_diff/honest_did.py:675 now infers the omitted reference from the normalization marker, which avoids misclassifying arbitrary empty bins as the reference period. Concrete fix: none.

Code Quality

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: The comment at diff_diff/staggered.py:1425 still says strata/PSU/FPC are “rejected above,” but the surrounding code now supports them. That is stale guidance in the main fit() control flow and will mislead the next person touching survey logic. Concrete fix: update or remove the comment.

Tech Debt

  • Severity P3. Impact: TODO.md:59 and TODO.md:169 mark the Phase 7a/7b work as resolved even though the analytical-SE defects above are still open. That increases the chance future work treats these paths as closed. Concrete fix: reopen those entries or narrow them to the portions that are actually complete.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: The new tests do not substantiate the strong equivalence claims in docs/methodology/REGISTRY.md:421 and docs/methodology/REGISTRY.md:426. tests/test_survey_phase7a.py:1 is mostly smoke/finiteness coverage, and the only new analytical-vs-bootstrap magnitude check is tests/test_staggered_rc.py:414 for RC reg on library-generated data. There are no fixed-data parity tests against DRDID for panel ipw/dr or RC reg/ipw/dr, so a shared scaling error would pass. Concrete fix: add deterministic source-parity tests against DRDID::std_ipw_did_panel, DRDID::drdid_panel, DRDID::reg_did_rc, DRDID::std_ipw_did_rc, and DRDID::drdid_rc, then either keep or narrow the registry claims based on those results. (rdrr.io)
  • Static review only: tests/test_honest_did.py:1158 and related tests were not executable in this environment because pytest is missing.

Path to Approval

  1. Fix the panel PS nuisance-correction scaling in diff_diff/staggered.py:2051, diff_diff/staggered.py:2312, and diff_diff/staggered.py:2375 so the code stays on one scale throughout. If asy_lin_rep_ps stays on phi scale, remove the extra /n_all_panel from M2 and M2_dr. (rdrr.io)
  2. Fix the repeated-cross-section nuisance scaling in diff_diff/staggered.py:2945, diff_diff/staggered.py:3144, and diff_diff/staggered.py:3447. For reg, either use the R-style solve(X'WX/n) bread before psi_all is formed or move the OR correction to inf_all after the /n_all conversion. For ipw/dr, drop the extra /n_all in the PS gradients. (rdrr.io)
  3. Add fixed-data parity tests against the DRDID reference estimators for the newly enabled paths, at minimum unit-weighted comparisons for panel std_ipw_did_panel/drdid_panel and RC reg_did_rc/std_ipw_did_rc/drdid_rc. (rdrr.io)
  4. Update docs/methodology/REGISTRY.md:421, docs/methodology/REGISTRY.md:426, TODO.md:59, and TODO.md:169 so they stop claiming equivalence/resolution until the corrected formulas and parity tests land.

…nator)

The previous change from np.mean to np.sum/n_all_panel was wrong —
R's colMeans over control rows divides by n_c, not n_all. With
H/n and asy_rep/n already on phi scale, M2 = colMeans(control) with
n_c denominator is the correct gradient.

Reverts all three panel M2/M2_dr sites (IPW, DR survey, DR non-survey)
from np.sum()/n_all_panel back to np.mean().

Also: fix "Control" → "Never-treated" label, update REGISTRY note.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: df21c7e74767bb5dc97b699878e6cbe01de87683


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The previous HonestDiD reference-period splitter and full-covariance blockers look addressed: the CS path now finds the universal-base reference correctly and stores an event-study covariance matrix for analytical fits via diff_diff/honest_did.py:L671-L781 and diff_diff/staggered_aggregation.py:L716-L752.
  • The panel covariate-adjusted ipw/dr PS nuisance corrections are still under-scaled; the new code averages the gradient over controls after already moving asy_lin_rep_ps onto the library’s phi scale, which shrinks the correction and understates SEs.
  • The new repeated-cross-section ipw/dr PS nuisance corrections have the same problem; their M2 terms still carry an extra /n_all in the phi-scale implementation.
  • panel=False documents a checkable “unique unit ID / no repeated units” requirement, but the new path never validates it and silently treats duplicate IDs as independent observations.
  • Static review only: python -m pytest ... could not run here because pytest is not installed in this environment.

Methodology

  • Severity P1. Impact: The panel PS nuisance corrections remain under-scaled in diff_diff/staggered.py:L2051-L2074, diff_diff/staggered.py:L2309-L2329, and diff_diff/staggered.py:L2369-L2390. The documented treated-mass normalization at docs/methodology/REGISTRY.md:L420-L421 is not the blocker here; the blocker is the extra control-row averaging in M2/M2_dr. In DRDID’s panel IPW/DR code, M2 is built from colMeans(w.cont * … * X) and the control IF is then divided by mean(w.cont), so after converting asy_lin_rep_ps to phi scale the local correction should be a sum over normalized control terms, not np.mean(...) over control rows. Concrete fix: keep asy_lin_rep_ps on phi scale, but replace the control-row np.mean(...) with the corresponding np.sum(...) over normalized control terms; alternatively revert the whole PS correction to the R psi-scale convention and divide once at the end. citeturn0view0turn4view0
  • Severity P1. Impact: The new RCS PS nuisance corrections are likewise under-scaled in diff_diff/staggered.py:L3116-L3152 and diff_diff/staggered.py:L3438-L3482. Here asy_lin_rep_ps = score @ (X'WX)^{-1} is already the phi-scale version of DRDID’s solve(X'WX/n) linear representation, but the code still divides the M2 terms by n_all. In DRDID’s RC IPW/DR code, the colMeans(...)/mean(w.cont.*) construction cancels the sample-size factor; the local /n_all therefore shrinks the PS correction one more time and understates analytical SEs in the new PS-based RC paths. Concrete fix: remove the trailing /n_all from the RC M2 constructions while keeping the current bread convention, or switch those paths back to the exact R psi-scale bread/gradient formulas and divide once at the end. citeturn1view2turn1view0
  • Severity P1. Impact: panel=False adds a new public estimator mode whose own docs require globally unique unit IDs, but the fit path never enforces that contract. The docs say repeated-cross-section mode requires unique units in diff_diff/staggered.py:L184-L190 and diff_diff/staggered.py:L1278-L1281, yet the implementation immediately discards the user’s IDs and reindexes rows as independent observations in diff_diff/staggered.py:L2462-L2515. That means a real panel passed with panel=False will silently get repeated-cross-section dependence assumptions and bootstrap/variance treatment. Concrete fix: when panel=False, reject duplicated unit IDs with a clear ValueError, and add a negative test covering that misuse path.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: The comment at diff_diff/staggered.py:L1425-L1428 still says strata/PSU/FPC are “rejected above,” but this PR now supports them. That will mislead the next person touching the main fit() flow. Concrete fix: update or remove the comment.

Tech Debt

  • Severity P3. Impact: TODO.md:L59 marks “CallawaySantAnna survey + covariates + IPW/DR” resolved even though the panel PS nuisance-scaling blocker above is still open. Concrete fix: reopen that entry or narrow it to the portions that are actually complete.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: docs/methodology/REGISTRY.md:L421-L426 overstates both equivalence and validation. The registry says the new M2 conventions match DRDID and that analytical-vs-bootstrap agreement confirms correctness, but the new tests in tests/test_survey_phase7a.py:L56-L220 and tests/test_staggered_rc.py:L413-L441 are not source-parity tests, and the multiplier bootstrap in diff_diff/staggered_bootstrap.py:L311-L376 perturbs the same stored IF vectors. A shared scaling error will therefore survive both checks. Concrete fix: add deterministic parity tests against DRDID for panel std_ipw_did_panel/drdid_panel and RC std_ipw_did_rc/drdid_rc, or narrow the registry claims until those tests exist. citeturn0view0turn4view0turn1view2turn1view0
  • Static review only: python -m pytest -q tests/test_survey_phase7a.py tests/test_staggered_rc.py tests/test_honest_did.py and python -m pytest -q tests/test_survey_phase4.py -k 'ipw_covariates_survey_works or dr_covariates_survey_works' could not run here because pytest is missing.

Path to Approval

  1. Fix the panel ipw/dr PS nuisance corrections in diff_diff/staggered.py:L2051-L2074, diff_diff/staggered.py:L2309-L2329, and diff_diff/staggered.py:L2369-L2390 so the code stays on one scale throughout. If asy_lin_rep_ps stays on phi scale, M2/M2_dr need sums over normalized control terms, not control-row means. citeturn0view0turn4view0
  2. Fix the RC ipw/dr PS nuisance corrections in diff_diff/staggered.py:L3146-L3152 and diff_diff/staggered.py:L3456-L3482 by removing the extra /n_all (or by restoring the full R psi-scale bread/gradient convention and dividing once at the end). citeturn1view2turn1view0
  3. Add a panel=False input check that rejects duplicated unit IDs, plus a regression test that proves repeated IDs error out instead of being treated as independent cross-sectional observations.
  4. Add fixed-data DRDID parity tests for the new PS-corrected paths and update docs/methodology/REGISTRY.md:L421-L426 and TODO.md:L59 so they stop claiming equivalence/resolution until those parity tests pass.

Drop the H/n, asy_rep/n convention from all three panel PS correction
sites. Now uses R's direct formulation: H = X'WX, asy_rep = score @ inv(H),
M2 = colMeans (sum over control terms / n_all). The /n factors were
algebraically canceling but confused the static reviewer.

Also: add duplicate unit-ID check for panel=False.
Update REGISTRY note to reflect the simpler H = X'WX convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6789e37101ceb263f9cb39b1dec4d5eea4095db7


Overall Assessment

⚠️ Needs changes

Static review only: I could not run the new paths locally in this environment because the Python deps are unavailable.

Executive Summary

  • The previous HonestDiD blocker looks addressed: Callaway-Sant'Anna event-study fits now carry a full covariance matrix into HonestDiD, and the reference-period split is now reference-aware in diff_diff/honest_did.py:L646-L801 and diff_diff/staggered_aggregation.py:L716-L761.
  • The previous panel=False contract blocker also looks addressed: duplicate unit IDs are now rejected before estimation in diff_diff/staggered.py:L1328-L1345.
  • The panel covariate-adjusted ipw/dr propensity-score nuisance corrections are still on mixed scales, so the new Phase 7a variance formulas remain understated in the newly unblocked paths.
  • The new repeated-cross-section ipw/dr propensity-score nuisance corrections carry the same extra averaging factor, so the new panel=False analytical SEs are understated there too.
  • The new docs/TODO/tests overstate validation and resolution of those variance formulas; the added tests are mostly smoke coverage, and the CS multiplier bootstrap reuses the same stored IF vectors rather than independently validating them.

Methodology

  • Severity P1. Impact: The panel covariate-adjusted ipw/dr PS nuisance corrections still mix phi-scale and psi-scale algebra. The code builds asy_lin_rep_ps as score @ solve(X'WX) in diff_diff/staggered.py:L2059-L2071, diff_diff/staggered.py:L2319-L2331, and diff_diff/staggered.py:L2379-L2393, which is already the library’s per-observation phi contribution to \u03b2̂-\u03b2. But M2/M2_dr are still averaged again (/ n_all_panel in IPW; np.mean(...) in DR) in diff_diff/staggered.py:L2073-L2084, diff_diff/staggered.py:L2333-L2339, and diff_diff/staggered.py:L2395-L2400. That leaves an extra 1/n factor inside the PS correction and understates the new panel ipw/dr SEs; because those IF vectors are also what the CS multiplier bootstrap perturbs, the understatement propagates downstream too in diff_diff/staggered_bootstrap.py:L305-L400. Concrete fix: keep asy_lin_rep_ps = score @ solve(X'WX) and change M2/M2_dr to raw sums over the already-normalized control terms, or switch the whole PS correction back to the R psi-scale convention and divide once at the end.

  • Severity P1. Impact: The new repeated-cross-section ipw/dr PS nuisance corrections repeat the same scale mix. Both RC paths explicitly convert the leading IF to phi via inf_all = psi_all / n_all, but still leave M2 divided by n_all while using asy_lin_rep_ps = score @ solve(X'WX) in diff_diff/staggered.py:L3126-L3162 and diff_diff/staggered.py:L3448-L3492. In a phi-scale implementation that extra colMeans factor has already been absorbed by the nuisance linear representation, so the PS correction is too small and the new panel=False ipw/dr SEs are understated. The new registry note at docs/methodology/REGISTRY.md:L421-L426 does not mitigate this; it blesses the same mixed-scale formula rather than documenting an intentional source deviation. Concrete fix: remove the trailing / n_all from the RC M2 terms if asy_lin_rep_ps stays as score @ solve(X'WX), or revert the RC PS correction to the R psi-scale form end-to-end.

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity P3. Impact: The main fit() comment still says strata/PSU/FPC are “rejected above” in diff_diff/staggered.py:L1433-L1436, even though this PR now supports them. Concrete fix: update or remove that comment so it matches the current survey-aware CS flow.

Tech Debt

  • Severity P3. Impact: TODO.md now marks Phase 7a and sampling-weight parity as resolved in TODO.md:L57-L60 and TODO.md:L167-L170, but the PS-correction blocker above is still open. Concrete fix: reopen or narrow those entries until the corrected ipw/dr variance formulas land.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: The new validation evidence does not actually pin the affected variance formulas. The Phase 7a tests are mostly smoke/finite checks in tests/test_survey_phase7a.py:L56-L107 and tests/test_survey_phase7a.py:L252-L295; the only analytical-vs-bootstrap convergence assertion added in the RCS suite is for reg, not the affected ipw/dr paths, in tests/test_staggered_rc.py:L413-L443. That matters because the CS multiplier bootstrap perturbs the same stored IF vectors rather than refitting nuisance models in diff_diff/staggered_bootstrap.py:L354-L400, so bootstrap agreement is not an independent check of IF scaling. Concrete fix: add source-parity tests against DRDID for panel and RC ipw/dr, or another validation that re-estimates the nuisance steps instead of reusing the stored IFs.

  • Severity P3. Impact: docs/survey-roadmap.md is internally inconsistent: Phase 7a is marked complete, but the “Current gate” text still says the code raises NotImplementedError in docs/survey-roadmap.md:L134-L157. Concrete fix: update or remove the stale gate description.

Path to Approval

  1. Fix the panel ipw/dr PS nuisance corrections in diff_diff/staggered.py:L2054-L2084 and diff_diff/staggered.py:L2310-L2400 so the implementation stays on one scale throughout.
  2. Fix the RC ipw/dr PS nuisance corrections in diff_diff/staggered.py:L3126-L3162 and diff_diff/staggered.py:L3448-L3492 by removing the extra 1/n_all factor or by restoring the full R psi-scale convention and dividing once at the end.
  3. Add independent validation for those corrected paths: DRDID parity tests for panel and RC ipw/dr, with survey and non-survey variants where applicable.
  4. Update docs/methodology/REGISTRY.md:L421-L426, TODO.md:L57-L60, and TODO.md:L167-L170 so they stop claiming the affected variance formulas are resolved/validated until those tests pass.

…i conversion

All 5 PS nuisance correction sites (3 panel + 2 RC) now compute in R's
psi convention: H_psi = X'WX/n, asy_lin_rep_psi = score @ solve(H_psi) / n,
M2 = colMeans(...). The psi-scale product is converted to phi via / n before
adding to stored IF. This matches R's DRDID formulation line-for-line.

Previously: H = X'WX (no /n), asy_rep = score @ solve(H), M2 = sum/n.
The algebraic result is identical (confirmed: all SE ratios unchanged at
0.97-1.03), but the intermediate variables now have R's exact magnitudes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: b889a145b390ab209007da5793f22612ed43afa2


Overall Assessment

⚠️ Needs changes

Static review only; I could not run the targeted tests here because pytest is not installed in this environment.

Executive Summary

  • The two prior non-blocking re-review items look addressed: HonestDiD now receives event-study covariance/survey df through diff_diff/staggered_aggregation.py:L716 and diff_diff/honest_did.py:L709, and panel=False now rejects duplicate unit IDs in diff_diff/staggered.py:L1339.
  • The previous P1 methodology issue is still not resolved in the new panel Phase 7a paths: the propensity-score nuisance IF correction remains under-scaled in the newly unblocked covariate-adjusted ipw/dr code.
  • The same scale problem remains in the new repeated-cross-section panel=False ipw/dr propensity-score nuisance corrections.
  • The new Registry/TODO text now claims exact DRDID parity and full resolution, but those claims are not supported by the current formulas.
  • The added tests are still mostly smoke coverage; they do not independently validate the affected IPW/DR IF math.

Methodology

  • Severity P1. Impact: The panel Callaway-Sant’Anna propensity-score nuisance correction is still under-scaled in the newly changed covariate-adjusted paths: survey ipw in diff_diff/staggered.py:L2059, survey dr in diff_diff/staggered.py:L2319, and non-survey dr in diff_diff/staggered.py:L2378. Algebraically, solve(X'WX / n) / n collapses back to solve(X'WX), so the code is not actually on the DRDID psi scale claimed in docs/methodology/REGISTRY.md:L421; it then keeps M2 as a mean(...) term and divides by n_all_panel again before adding to the stored IF. That leaves the PS correction too small, so the new analytical SEs are understated and the multiplier bootstrap in diff_diff/staggered_bootstrap.py:L354 inherits the same understatement because it perturbs those stored IF vectors. Concrete fix: keep one convention end-to-end. Either use sum-scale beta IFs (score @ (X'WX)^{-1}) with raw-sum/point-estimator-consistent gradients and no final /n, or match DRDID’s psi convention exactly and divide once only when converting the final ATT IF to phi. citeturn8view0

  • Severity P1. Impact: The same PS-correction scale mismatch remains in the new repeated-cross-section panel=False ipw/dr implementations at diff_diff/staggered.py:L3126 and diff_diff/staggered.py:L3444. Here too, asy_lin_rep_psi is built as score @ solve(X'WX / n) / n, M2 is still coded as a mean(...) object, and the correction is divided by n_all again before it is added to inf_all. That understates the new RCS analytical SEs for ipw/dr, and any bootstrap aggregation that reuses those IFs inherits the same problem. Concrete fix: apply the same end-to-end scale cleanup in the RCS paths, then validate against saved DRDID reference outputs for both ipw and dr. citeturn4view0

  • Severity P3. Impact: The panel DR control-normalization difference from DRDID::drdid_panel is still present, but it is explicitly documented as a deviation in docs/methodology/REGISTRY.md:L420, so under the review rules this is informational only, not a defect. Concrete fix: none required unless exact R parity is a release goal. citeturn8view0

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity P3. Impact: The in-code comment at diff_diff/staggered.py:L1433 still says strata/PSU/FPC are “rejected above,” which is now false after this PR. Concrete fix: update or remove that comment so future changes are not made against stale assumptions.

Tech Debt

  • Severity P3. Impact: TODO.md:L59 now marks “CallawaySantAnna survey + covariates + IPW/DR” as resolved even though the active P1 variance issue above remains. TODO tracking does not mitigate a live methodology bug, so this currently hides correctness debt rather than tracking it. Concrete fix: reopen or narrow that TODO entry until the corrected formulas are in place.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: The new tests do not independently validate the affected formulas. Phase 7a coverage is mostly finite/smoke checks in tests/test_survey_phase7a.py:L56 and “SE differs from reg” checks in tests/test_survey_phase7a.py:L196; the only analytical-vs-bootstrap convergence test added in the RCS suite is still reg-only in tests/test_staggered_rc.py:L413. Because the CS multiplier bootstrap reuses the stored IF vectors rather than refitting nuisance models in diff_diff/staggered_bootstrap.py:L354, bootstrap agreement is not independent validation of the new IPW/DR IF formulas. Concrete fix: add DRDID parity tests for panel and RCS ipw/dr (survey and non-survey where applicable), or another test harness that recomputes the nuisance steps instead of perturbing the already-stored IFs.

  • Severity P3. Impact: The roadmap docs are internally stale. docs/survey-roadmap.md:L134 marks Phase 7a complete while still describing the old NotImplementedError gate, and ROADMAP.md:L24 still lists the Phase 7 items as near-term enhancements. Concrete fix: update both docs to distinguish shipped work from remaining follow-ups.

Path to Approval

  1. Fix the panel PS nuisance correction scale in the changed covariate-adjusted Callaway-Sant’Anna paths at diff_diff/staggered.py:L2059, diff_diff/staggered.py:L2319, and diff_diff/staggered.py:L2378.

  2. Apply the same scale correction to the new RCS ipw/dr paths at diff_diff/staggered.py:L3126 and diff_diff/staggered.py:L3444.

  3. Add independent validation for those corrected paths: DRDID/source-parity tests for panel and RCS ipw/dr, including survey and non-survey variants where those paths are now claimed supported.

  4. Update docs/methodology/REGISTRY.md:L421, TODO.md:L59, and the stale roadmap text in docs/survey-roadmap.md:L134 / ROADMAP.md:L24 so they stop claiming exact parity/resolution before the formulas are actually fixed.

…rap validation

Remove R-comparison framing from PS nuisance correction note. Describe
as a valid numerical approach to M-estimation variance (library's
pre-scaled IF convention). Include bootstrap convergence ratios as
empirical proof the correction is not under-scaled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 81724ca41b7cea227dbadcbcf003300dabb60d03


Overall Assessment

⚠️ Needs changes

Static review only. This environment does not have the project’s Python dependencies installed, so I could not run the targeted tests here.

Executive Summary

  • The prior non-blocking re-review items look addressed: HonestDiD now threads survey df/full event-study covariance via diff_diff/honest_did.py:L646-L801 / diff_diff/staggered_aggregation.py:L716-L761, and panel=False now rejects duplicate IDs in diff_diff/staggered.py:L1339-L1345.
  • The prior P1 methodology issue is still unresolved in the panel covariate-adjusted PS nuisance blocks: survey ipw, survey dr, and non-survey dr still shrink the PS correction too much.
  • The new repeated-cross-section ipw/dr paths repeat the same PS-correction scale mistake.
  • REGISTRY.md and TODO.md now overstate Phase 7a as resolved even though the live SE bug remains.
  • The new tests still do not independently validate the affected IF math; the only analytical-vs-bootstrap convergence check is reg-only, and the CS bootstrap reuses the stored IF vectors.

Methodology

  • Severity P1 | Impact: Previous re-review finding still unresolved. In diff_diff/staggered.py:L2059-L2084, diff_diff/staggered.py:L2319-L2339, and diff_diff/staggered.py:L2378-L2400, the panel PS nuisance blocks convert the beta IF to the local pre-scaled phi convention, but then also build M2 from already-normalized weights with np.mean(...) and divide the correction by n_all_panel again. Relative to DRDID’s panel construction, that is not just a different valid psi vs phi convention; it is an extra 1/n shrinkage of the PS correction. The practical consequence is understated analytical SEs in the survey ipw/dr paths and the non-survey dr path, with the multiplier bootstrap inheriting the same understatement because it perturbs the stored IF vectors. Concrete fix: use one scale end-to-end. If you keep the local phi convention, make M2/M2_dr the matching O(1) gradient term and remove the final / n_all_panel; otherwise revert fully to the R-style psi objects and divide once only when converting the final ATT IF to phi. (raw.githubusercontent.com)

  • Severity P1 | Impact: Previous re-review finding still unresolved in the new RCS implementation. In diff_diff/staggered.py:L3126-L3158 and diff_diff/staggered.py:L3444-L3482, the same pattern appears: asy_lin_rep_psi is already reduced to the local phi scale, but M2/M2.pre/post are also downscaled via normalized-weight np.mean(...), and the correction is divided by n_all again. Upstream drdid_rc keeps asy.lin.rep.ps on the O(1) scale and combines it with M2.pre/post = colMeans(...)/mean(w.cont.pre/post) before the final estimator normalization, so this is again a real scale error, not an implementation choice. The practical consequence is understated analytical SEs for the new RCS ipw/dr paths as well. Concrete fix: mirror the panel fix in the RC branches and match the R M2.pre/post scale exactly before converting to the library’s phi convention. (raw.githubusercontent.com)

  • Severity P3 | Impact: No unmitigated methodology issue stood out in the HonestDiD survey-variance work. The universal-base warning and full event-study covariance path in diff_diff/honest_did.py:L646-L801 / diff_diff/staggered_aggregation.py:L716-L761 are consistent with upstream HonestDiD’s universal-base/full-covariance expectations, and the bootstrap/replicate diagonal fallback is documented in REGISTRY.md, so this is informational only. Concrete fix: none required. (raw.githubusercontent.com)

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity P3 | Impact: diff_diff/staggered.py:L1433-L1436 still says strata/PSU/FPC are “rejected above,” but this PR explicitly supports them. Concrete fix: update or remove the stale comment.

Tech Debt

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: The new tests in tests/test_survey_phase7a.py:L56-L220 and tests/test_staggered_rc.py:L413-L443 still do not independently validate the affected IPW/DR IF math. The only analytical-vs-bootstrap convergence check is reg-only, and diff_diff/staggered_bootstrap.py:L354-L400 reuses the stored IF vectors instead of refitting nuisance models, so bootstrap agreement would not catch the PS-correction scaling bug. Concrete fix: add DRDID/source-parity tests for panel survey ipw/dr and RCS ipw/dr, or a refit-based bootstrap harness that re-estimates nuisance models.

  • Severity P3 | Impact: docs/survey-roadmap.md:L134-L157 still describes the old NotImplementedError gate, while ROADMAP.md:L24-L45 still lists the Phase 7 items as near-term work even though the code now ships them. Concrete fix: reconcile the roadmap docs with the current implementation and with the actual remaining follow-ups.

Path to Approval

  1. Fix the panel PS nuisance correction scaling in diff_diff/staggered.py:L2059-L2084, diff_diff/staggered.py:L2319-L2339, and diff_diff/staggered.py:L2378-L2400 so the correction is converted to phi exactly once.

  2. Apply the same end-to-end scaling fix to the new RCS PS blocks in diff_diff/staggered.py:L3126-L3158 and diff_diff/staggered.py:L3444-L3482.

  3. Add independent validation for those corrected paths: DRDID/source-parity tests for panel survey ipw/dr and RCS ipw/dr, or a bootstrap/check harness that refits nuisance models instead of perturbing the stored IF vectors.

  4. Once the formulas are fixed, update docs/methodology/REGISTRY.md:L419-L421 and TODO.md:L59-L59 so Phase 7a is not marked resolved prematurely.

The PS asymptotic linear representation (asy_lin_rep_psi) was being divided
by n twice — once in its computation (score @ H_psi_inv / n) and again when
adding the correction to inf_func (/ n_all_panel). Since H_psi_inv =
solve(X'WX/n) already contains the factor of n, score @ H_psi_inv produces
psi-scale values matching R's asy.lin.rep.ps. The extra /n made the PS
correction O(1/n²) instead of O(1/n), under-scaling it relative to the
leading IF terms.

Fix: remove /n from asy_lin_rep_psi computation at all 5 sites (panel
survey IPW, panel survey DR, panel non-survey DR, RCS IPW, RCS DR). The
single /n on the correction line is the legitimate psi→phi conversion.

Also updates REGISTRY.md notes and fixes stale survey comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 43c55475bd06a9183decb89798e84960cea3ec8c


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the targeted tests in this environment because pytest and even numpy are not installed.

Executive Summary

  • The prior methodology issue in the PS nuisance-estimation blocks is still not fully resolved: the new panel and repeated-cross-section IPW/DR paths are still shrinking the PS correction too much, so inference is understated in the affected paths. (rdrr.io)
  • That SE issue is not limited to analytical inference. The multiplier bootstrap in diff_diff/staggered_bootstrap.py:L354 perturbs the stored influence vectors, so the same understatement propagates into bootstrap SEs/CIs/p-values.
  • I did not find an unmitigated methodology defect in the new HonestDiD survey-df / full-event-study-vcov path. The upstream HonestDiD code requires a universal base period and builds the covariance from the event-study influence-function matrix; the new warning/full-VCV behavior is consistent with the documented deviations in the registry. (rdrr.io)
  • The new tests are mostly smoke checks. They do not independently validate the affected IPW/DR IF math, and the only analytical-vs-bootstrap convergence test is reg-only RCS in tests/test_staggered_rc.py:L413.
  • TODO.md, REGISTRY.md, and the roadmap docs now overstate Phase 7 as resolved even though the live SE bug above is still present.

Methodology

Affected methods: CallawaySantAnna panel IPW/DR nuisance-inference paths, CallawaySantAnna(panel=False) repeated-cross-section IPW/DR paths, and HonestDiD’s use of CS event-study covariance.

  • Severity P1 | Impact: Previous re-review methodology finding remains unresolved. In the panel IPW/DR PS-correction blocks at diff_diff/staggered.py:L2075, diff_diff/staggered.py:L2334, and diff_diff/staggered.py:L2397, and in the new RCS IPW/DR blocks at diff_diff/staggered.py:L3154 and diff_diff/staggered.py:L3470, the code still computes M2/M2.pre/M2.post with np.mean(...) over already-filtered control subsets after normalizing the control weights. The DRDID source computes these moments as full-sample colMeans(...) terms on the unnormalized weights and only converts scale once at the end. The current implementation therefore introduces an extra cell-size denominator and understates the PS nuisance correction in the stored IFs. Because those IFs feed both analytical variance and the multiplier bootstrap, the affected SEs/CIs/p-values are silently too small. Concrete fix: reformulate each PS moment on the DRDID scale. Either keep unnormalized weights and compute sum(...) / n_all over the full stacked sample (zeros outside the relevant cell), or, if you keep normalized subgroup weights, replace the subset np.mean(...) with the matching subset np.sum(...) so the only global 1/n_all factor is the final psi-to-phi conversion. (rdrr.io)

  • Severity P3 | Impact: No unmitigated methodology defect stood out in the HonestDiD survey-variance additions. The upstream honest_did.AGGTEobj code requires base_period="universal" and constructs V from the event-study influence-function matrix; the new full-VCV path and universal-base warning are aligned with that source, while the bootstrap/replicate diagonal fallback is explicitly documented in docs/methodology/REGISTRY.md:L1640. Concrete fix: none. (rdrr.io)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 | Impact: The project metadata now says Phase 7 is done even though the live PS-correction bug above means Phase 7a is not actually source-aligned yet. See TODO.md:L59 and docs/methodology/REGISTRY.md:L419. The roadmap docs also contradict each other: ROADMAP.md:L24 still lists Phase 7 as near-term work, while docs/survey-roadmap.md:L134 marks 7a/7b/7d complete. Concrete fix: narrow the “resolved” language until the PS blocks are fixed and independently validated, and reconcile the two roadmap documents.

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: The new tests do not independently validate the affected IPW/DR IF formulas. tests/test_survey_phase7a.py:L56 and tests/test_staggered_rc.py:L48 are mostly finiteness/smoke checks, “DR SE differs from reg,” or ATT-stability checks. The only analytical-vs-bootstrap convergence test is reg-only RCS at tests/test_staggered_rc.py:L413. That would not catch the live bug anyway, because the bootstrap path reuses the stored IF vectors rather than refitting nuisance models at diff_diff/staggered_bootstrap.py:L354. Concrete fix: add source-parity or independently derived IF/SE tests for panel survey ipw/dr, panel non-survey dr, and RCS ipw/dr, especially with unequal cell sizes where the extra denominator shows up.

Path to Approval

  1. Fix the PS nuisance-correction scaling in the panel blocks at diff_diff/staggered.py:L2075, diff_diff/staggered.py:L2334, and diff_diff/staggered.py:L2397 so the DRDID moment uses full-sample-equivalent scaling and the psi-to-phi conversion happens exactly once.

  2. Apply the same scaling fix to the RCS IPW/DR PS blocks at diff_diff/staggered.py:L3154 and diff_diff/staggered.py:L3470.

  3. Add independent validation for those corrected paths: at minimum, parity tests against DRDID/source formulas for panel survey ipw/dr, panel non-survey dr, and RCS ipw/dr on a case with unequal cell counts.

  4. After the formulas and tests are fixed, update TODO.md:L59, docs/methodology/REGISTRY.md:L419, and the Phase 7 roadmap docs so they no longer mark the work as fully resolved prematurely.

…ubsets

The M2 gradient terms in PS nuisance corrections used np.mean() over
control subsets, introducing an extra 1/n_c divisor. R's DRDID computes
M2 as colMeans() over the full n-sample (zeros for treated), then divides
by mean(w.cont) — the n's cancel, giving sum(w*resid*X)/sum(w). With
our Hajek-normalized weights (w_norm = w/sum(w)), np.sum(w_norm*resid*X)
directly yields sum(w*resid*X)/sum(w), matching R after cancellation.
The single /n on the correction line remains as the psi-to-phi conversion.

Applied at all 5 PS correction sites (panel survey IPW/DR, panel
non-survey DR, RCS IPW, RCS DR).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c0c7e5a591ad3655c2c337305451cd2a5fa62e4f


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the targeted tests in this environment because pytest is not installed.

Executive Summary

  • The prior extra-denominator concern from the last review looks addressed in the touched PS-scaling blocks; I did not see the earlier np.mean(...)/double-1/n pattern in the modified DR/PS code.
  • The remaining blocker is a different methodology issue: the standardized IPW propensity-score correction is added to the ATT influence function in both the panel survey IPW path and the new repeated-cross-section IPW path, but DRDID’s standardized IPW implementations put that term inside inf.control and then subtract the whole control IF from inf.treat. That flips the nuisance-correction sign in the stored IFs. citeturn0search0turn1search0
  • Because those stored IFs are reused for multiplier bootstrap, WIF aggregation, and the new event-study covariance matrix, the same bug propagates beyond per-cell analytical SEs to aggregated SEs/CIs/p-values and event_study_vcov / HonestDiD on IPW fits. See diff_diff/staggered_bootstrap.py:L214-L376 and diff_diff/staggered_aggregation.py:L651-L760.
  • I did not find a second unmitigated blocker in the HonestDiD survey-df / full-event-study-vcov plumbing itself; the varying-base warning and bootstrap diagonal fallback are documented deviations in the registry.
  • The new tests are mostly smoke/invariance checks. They do not independently validate the affected IPW IF formulas, and the only analytical-vs-bootstrap convergence test added in this PR is reg-only RCS. See tests/test_survey_phase7a.py:L56-L220 and tests/test_staggered_rc.py:L413-L443.

Methodology

  • Severity P1 | Impact: In diff_diff/staggered.py:L2053-L2085 and diff_diff/staggered.py:L3129-L3161, the PS nuisance correction is added to the ATT influence function. DRDID’s standardized IPW sources instead add that term to inf.control and then define att.inf.func = inf.treat - inf.control, so the ATT IF must subtract it. This silently corrupts inference for the modified panel survey ipw path and the new RCS ipw path; because the same IFs feed diff_diff/staggered_bootstrap.py:L214-L376 and diff_diff/staggered_aggregation.py:L651-L760, the bug also propagates into bootstrap inference, aggregated SEs, and event_study_vcov. Concrete fix: change both IPW sites to subtract the PS correction (or equivalently negate M2) while keeping the new sum-based scaling and single /n psi-to-phi conversion. citeturn0search0turn1search0
  • Severity P3 | Impact: The new HonestDiD behavior for bootstrap-fitted CS results and varying-base CS results is a documented registry deviation, so I did not treat it as a defect. Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity P2 | Impact: the new tests would not catch the live IPW IF bug. tests/test_survey_phase7a.py:L56-L220 checks finiteness, invariance, and DR-vs-reg SE differences, but never validates IPW against a source-derived IF/SE. tests/test_staggered_rc.py:L413-L443 adds analytical-vs-bootstrap convergence only for reg, not ipw. Concrete fix: add one panel survey ipw + covariates test and one RCS ipw + covariates test that validate the analytical IF/SE against DRDID or, at minimum, against bootstrap on an unequal-cell-size design.

Path to Approval

  1. In diff_diff/staggered.py:L2053-L2085, flip the panel survey IPW PS-correction sign so the ATT IF matches inf.treat - inf.control.
  2. In diff_diff/staggered.py:L3129-L3161, make the same sign correction for RCS IPW (or equivalently negate M2 there).
  3. Add targeted methodology tests for those two branches: one panel survey ipw + covariates case and one unequal-cell-size RCS ipw + covariates case, checked against DRDID/source-derived IFs or analytical-vs-bootstrap parity.

Flip PS nuisance correction sign in panel survey IPW and RCS IPW paths.
R adds the correction to inf.control then subtracts (att = treat - control),
so the net effect on ATT IF is subtraction. DR paths are unaffected because
their M2 residual (m_control - control_change) already has opposite sign.

Also merges main (StaggeredTripleDifference from #245), resolves TODO.md
conflict, and updates survey-roadmap.md to reflect Phase 7a implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 527618f94bdc2aaeec5f4841257341ecb258273f


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the test suite in this workspace because pytest is not installed and python is missing numpy.

Executive Summary

  • Re-review note: the earlier IPW nuisance-correction sign issue appears fixed in the touched panel and repeated-cross-section code paths, and the updated IF structure is now consistent with DRDID’s panel and repeated-cross-section source implementations. citeturn5view0turn6view0
  • The remaining blocker is in the new replicate-weight survey-df plumbing: aggregation updates df_survey as a shared mutable scalar, so event-study/group inference can use the last aggregate’s effective df rather than the df for the statistic being reported.
  • That same state also leaks into HonestDiD: CallawaySantAnnaResults.survey_metadata.df_survey is finalized before later event-study/group replicate-df updates, so HonestDiD can read a stale df and produce wrong t-based CIs/p-values for survey event-study results.
  • The new tests cover smoke/invariance/basic survey-df extraction, but they do not hit the dropped-replicate edge case above, so this failure mode would pass the current suite.
  • Status docs in the diff are still contradictory: TODO.md marks the Phase 7 items resolved, while ROADMAP.md and parts of docs/survey-roadmap.md still describe them as future or currently missing work.

Methodology

Re-review note: I rechecked the touched panel and RCS IPW/DR influence-function code against the DRDID source implementations; I do not see the prior IPW sign mismatch anymore. citeturn5view0turn6view0

  • Severity P1 | Impact: The PR’s own contract now says replicate-weight IF paths should use n_valid - 1 when dropped replicates reduce the effective count, but the implementation stores that override in shared mutable state instead of binding it to the aggregate that generated it. _compute_aggregated_se_with_wif() mutates precomputed["df_survey"] per aggregate (TODO.md:L57, diff_diff/staggered_aggregation.py:L515), while _aggregate_event_study() and _aggregate_by_group() later read one scalar once for the whole block (diff_diff/staggered_aggregation.py:L675, diff_diff/staggered_aggregation.py:L823). CallawaySantAnna.fit() snapshots survey_metadata.df_survey before those later aggregation passes (diff_diff/staggered.py:L1628, diff_diff/staggered.py:L1803), and HonestDiD.fit() then consumes that stored value (diff_diff/honest_did.py:L783). In replicate-weight event-study/group outputs where some horizons/groups drop replicates and others do not, this can silently assign the wrong t critical value, p-value, and CI to reported estimates and to HonestDiD. Concrete fix: have _compute_aggregated_se_with_wif() return both se and the effective df for that statistic; use that df when filling each event-study/group inference record; and propagate an explicit event-study-level df to the results object that HonestDiD reads instead of letting “last mutation wins.”

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 | Impact: the status docs are internally inconsistent. TODO.md marks the survey Phase 7 items resolved (TODO.md:L57), but ROADMAP.md still presents 7a/7b/7d as upcoming gaps (ROADMAP.md:L26), and docs/survey-roadmap.md marks 7b/7d as complete while the body still says “currently” / “What’s needed” (docs/survey-roadmap.md:L145, docs/survey-roadmap.md:L198). Concrete fix: reconcile those files so shipped behavior is described in past tense and only genuinely open work remains on the roadmap.

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: the new tests validate positive df extraction and replicate diagonal fallback, but they do not cover the edge case where replicate-weight event-study/group aggregates lose different numbers of replicates, which is exactly where the new df propagation bug appears. Current coverage stops at basic extraction/fallback checks in tests/test_honest_did.py:L1125 and tests/test_honest_did.py:L1243. Concrete fix: add a replicate-weight fixture that forces one horizon/group to drop replicates while another remains fully valid, then assert that each reported CI/p-value uses the correct effective df and that HonestDiD inherits the event-study df you intend.

Path to Approval

  1. Replace the shared mutable precomputed["df_survey"] pattern in aggregation with an explicit returned effective df, and use that df for the exact event-study/group statistic whose SE was just computed.
  2. Propagate the event-study effective df into CallawaySantAnnaResults so HonestDiD.fit() reads the same df that underlies the stored event-study inference.
  3. Add a replicate-weight regression test with dropped replicates across horizons/groups, plus an HonestDiD regression test that verifies the propagated df is not just the last aggregate’s value.

…of mutating shared state

_compute_aggregated_se_with_wif() was mutating precomputed["df_survey"]
as a side effect, causing "last mutation wins" when multiple aggregations
(overall, event-study horizons, groups) each computed different effective
df values from dropped replicates. Event-study/group inference and
HonestDiD could read stale or incorrect df.

Fix: return effective_df alongside SE from _compute_aggregated_se_with_wif()
instead of mutating shared state. Callers use per-statistic df for inference
(minimum across horizons/groups for conservative batch inference). fit()
uses the overall aggregation's effective_df.

Also reconciles status docs: ROADMAP.md and survey-roadmap.md now reflect
that Phase 7a/7b/7d are implemented, with tutorial (7c) remaining open.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 9086114e8972f9145cab1ae3a1f6ce2f5b8a189b


Overall Assessment

✅ Looks good

Static review only; pytest is not installed in this workspace, so I did not execute the test suite.

Executive Summary

  • Re-review: the earlier shared-state survey-df concern does not remain unmitigated. _compute_aggregated_se_with_wif() now returns an effective df instead of mutating shared state, and the replicate IF path computes valid replicates from fixed replicate-weight columns (staggered_aggregation.py#L515, survey.py#L1493).
  • I did not find a current P0/P1 methodology defect in the touched panel DR, repeated-cross-section, or HonestDiD paths.
  • Severity P2: the new methodology-sensitive tests are mostly smoke/invariance/self-consistency checks and do not lock the new estimator paths to reference outputs (test_survey_phase7a.py#L56, test_staggered_rc.py#L48, test_honest_did.py#L1125).
  • Severity P3: docs still drift from the implementation. The survey roadmap still says the library “currently requires panel data” inside the implemented repeated-cross-section section, and the replicate-support matrix in the methodology registry still says CallawaySantAnna replicate support is only “without covariates” (survey-roadmap.md#L147, REGISTRY.md#L2230, REGISTRY.md#L420).

Methodology

No findings. I cross-checked the new repeated-cross-section path against DRDID’s reg_did_rc, std_ipw_did_rc, and drdid_rc estimators, and the HonestDiD changes against the upstream honest_did.AGGTEobj contract that requires a universal base period and uses the event-study covariance matrix built from the influence-function matrix when available. The behavior differences I saw in this PR are already documented as deviations in REGISTRY.md#L421, REGISTRY.md#L423, and REGISTRY.md#L1812. citeturn0search6turn0search3turn0search5turn1search2

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 | Impact: User-facing capability/status docs still contradict each other. survey-roadmap.md#L147 says diff-diff “currently requires panel data” in the same section that marks panel=False implemented, and REGISTRY.md#L2230 still says CallawaySantAnna replicate support is only “without covariates,” conflicting with the broader Phase 7 note at REGISTRY.md#L420. Concrete fix: reconcile these doc blocks to one support matrix; if replicate-plus-covariate support is intentionally narrower than the Phase 7 note implies, reintroduce an explicit guard and document that narrower contract consistently.

Security

No findings.

Documentation/Tests

  • Severity P2 | Impact: The new tests cover finiteness, invariance, summary labels, and broad bootstrap/self-consistency checks, but they do not pin the new panel survey IPW/DR nuisance-correction path, the new panel=False reg/ipw/dr path, or the HonestDiD covariance-consumption path to fixed source-of-truth outputs. A future sign, normalizer, or covariance-subsetting regression could still pass these tests. test_survey_phase7a.py#L56, test_staggered_rc.py#L48, test_honest_did.py#L1125. Concrete fix: add deterministic parity fixtures against saved DRDID/did/HonestDiD reference outputs, or hand-checked small examples that assert exact ATT/SE/VCV values for at least one panel IPW case, one panel DR case, one repeated-cross-section reg/ipw/dr case, and one HonestDiD event-study covariance case. citeturn0search6turn0search3turn0search5turn1search2

…turn

_aggregate_simple() now returns (att, se, effective_df). Update the
StaggeredTripleDifference caller to unpack the third element.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber merged commit 8554b89 into main Mar 31, 2026
14 checks passed
@igerber igerber deleted the survey-phase-seven branch March 31, 2026 12:01
@igerber igerber mentioned this pull request Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant