Add Staggered Triple Difference estimator by igerber · Pull Request #245 · igerber/diff-diff

igerber · 2026-03-29T20:34:33Z

Summary

Implement StaggeredTripleDifference estimator for staggered adoption DDD designs (Ortiz-Villavicencio & Sant'Anna 2025)
Group-time ATT(g,t) via three-DiD decomposition with DR, IPW, and regression adjustment methods
GMM-optimal combination across comparison cohorts with closed-form inverse-variance weights
Event study, group, and overall aggregation via CallawaySantAnna mixin reuse
Multiplier bootstrap with simultaneous confidence bands
control_group parameter supporting both "nevertreated" and "notyettreated" modes

Methodology references (required if estimator / math changes)

Method name(s): StaggeredTripleDifference (Staggered DDD)
Paper / source link(s): Ortiz-Villavicencio & Sant'Anna (2025). "Better Understanding Triple Differences Estimators." arXiv:2505.09942
Any intentional deviations from the source (and why):
- Aggregation uses CallawaySantAnna mixin (cohort-size weights) rather than R's agg_ddd() group-probability weights — documented in REGISTRY.md
- Propensity score trimming uses soft clipping vs R's hard exclusion — documented in REGISTRY.md
- Survey design support deferred (raises NotImplementedError)
- Cluster-robust analytical SEs deferred (bootstrap provides unit-level clustering)

Validation

Tests added/updated:
- tests/test_staggered_triple_diff.py — 40 unit tests (init, fit, recovery, methods, event study, GMM, bootstrap, edge cases)
- tests/test_methodology_staggered_triple_diff.py — 40 R cross-validation tests against pre-computed triplediff golden values
R cross-validation results: Group-time ATT and SE match R within 0.001% across 10 scenarios (3 seeds, 3 methods, both control group modes). Aggregation within 25% (documented weighting difference).
Paper review: docs/methodology/papers/ortiz-villavicencio-santanna-2025-review.md (not committed — gitignored)
R benchmark script: benchmarks/R/benchmark_staggered_triplediff.R

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…'Anna 2025) Implements the staggered DDD estimator with group-time ATT(g,t), GMM-optimal combination across comparison cohorts, event study aggregation, and multiplier bootstrap. Core pairwise DiD computation matches R triplediff package exactly (Riesz/Hajek normalization, PS + OR influence function corrections). Validated against R: group-time ATT and SE match within 0.001% across 10 scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-29T20:46:35Z

Overall Assessment

⚠️ Needs changes

This PR adds StaggeredTripleDifference, the staggered DDD estimator from Ortiz-Villavicencio & Sant’Anna (2025). The three-DiD / comparison-cohort / GMM structure is broadly aligned with the source material, but I found four unmitigated P1 issues: one undocumented methodology deviation, two inference/input-contract bugs, and one missing parameter-propagation issue. (arxiv.org)

Executive Summary

P1: The default propensity-score clipping changes the estimator when overlap is poor, but the new REGISTRY entry does not record that as a Note / Deviation from R.
P1: The implementation does not enforce the balanced-panel / stable-unit contract that the method and companion API expect.
P1: Bootstrap group-aggregation results are computed but never propagated back into returned group_effects.
P1: control_group is not propagated into the results object, so returned results lose essential estimand metadata.
P2: The new R cross-validation tests depend on CSV fixtures that are not committed in the PR, so they will skip in CI instead of validating parity.

Methodology

Severity: P1. Impact: diff_diff/staggered_triple_diff.py#L98 and diff_diff/staggered_triple_diff.py#L823 introduce substantive pscore_trim=0.01 clipping in the ATT weights, but the companion triplediff source only warns on poor overlap and then uses the estimated scores directly aside from an upper numerical cap near 1. The REGISTRY mention at docs/methodology/REGISTRY.md#L1269 is a plain bullet, not one of the allowed Note / Deviation from R labels, so under this review policy the deviation is still undocumented. Concrete fix: either remove substantive clipping from the default estimator path, or add a properly labeled REGISTRY deviation note and tests that lock in the intended divergence. (rdrr.io)
Severity: P1. Impact: diff_diff/staggered_triple_diff.py#L506 only validates eligibility and missingness, while diff_diff/staggered_triple_diff.py#L557 and diff_diff/staggered_triple_diff.py#L563 silently overwrite duplicate (unit,time) rows and collapse unit-level metadata with groupby(...).first(). That means unbalanced panels, time-varying first_treat, and time-varying covariates can be silently coerced into a different estimand instead of being rejected. Concrete fix: before precomputation, add explicit checks for unique (unit,time), equal period counts per unit, time-invariant first_treat, and time-invariant covariates when covariates is supplied. The companion API treats unbalanced panels as an explicit opt-in, and the paper still lists unbalanced / repeated-cross-section support as future work. (rdrr.io)
Severity: P1. Impact: diff_diff/staggered_bootstrap.py#L470 computes bootstrap group_effect_ses / CIs / p-values, but diff_diff/staggered_triple_diff.py#L423 only copies bootstrap inference back into overall ATT and event-study outputs. With n_bootstrap > 0 and aggregate="group" or "all", returned group_effects still show analytical SEs/p-values/CIs, which is the wrong inference path. Concrete fix: add a group-effect bootstrap update block matching the existing pattern in diff_diff/staggered.py#L1643.
Severity: P3. Impact: cluster is accepted in diff_diff/staggered_triple_diff.py#L99 but not wired into analytical SEs; this is explicitly documented in docs/methodology/REGISTRY.md#L1356, so it is non-blocking under the review rules. Concrete fix: optional warning when cluster is non-None.

Code Quality

No material findings beyond the issues above.

Performance

No material findings in the reviewed diff.

Maintainability

Severity: P1. Impact: control_group materially changes the estimand, but diff_diff/staggered_triple_diff.py#L473 does not store it in diff_diff/staggered_triple_diff_results.py#L60, and the reporting paths at diff_diff/staggered_triple_diff_results.py#L114 and diff_diff/staggered_triple_diff_results.py#L306 cannot tell the caller whether results came from "nevertreated" or "notyettreated". The analogous staggered DiD results type already preserves this metadata in diff_diff/staggered_results.py#L100. Concrete fix: add control_group to StaggeredTripleDiffResults, pass it from fit(), and surface it in summary() / to_dict().

Tech Debt

No existing TODO.md entry mitigates the P1 issues above, so no downgrade applies.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: tests/test_methodology_staggered_triple_diff.py#L44 loads per-scenario CSV fixtures, but the PR only adds benchmarks/data/synthetic/staggered_ddd_r_results.json plus the generator script benchmarks/R/benchmark_staggered_triplediff.R#L49. The CSVs are absent, so the new methodology tests will skip in CI instead of exercising the claimed R parity. Concrete fix: commit the CSV fixtures or generate them deterministically during test setup, and add a regression test for n_bootstrap > 0 with aggregate="group".

Path to Approval

Resolve the pscore_trim deviation by either matching the source / companion implementation in the default path or documenting the divergence with a properly labeled REGISTRY note.
Add hard input checks for balanced panels, unique (unit,time) rows, invariant first_treat, and invariant covariates.
Propagate bootstrap_results.group_effect_* into returned group_effects when bootstrap inference is requested.
Add control_group to StaggeredTripleDiffResults and surface it in summaries / serialized output.
Make the R cross-validation tests runnable in CI by supplying the missing CSV fixtures or generating them in test setup, and add a bootstrap+group-aggregation regression test.

- Add balanced-panel validation: unique (unit,time), equal periods per unit, time-invariant first_treat and covariates - Propagate bootstrap group-effect SEs/CIs/p-values into returned results - Add control_group to StaggeredTripleDiffResults and surface in summary/to_dict - Add Deviation from R label for pscore_trim in REGISTRY.md - Warn when cluster parameter is set (analytical cluster SEs not yet wired) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-29T20:57:48Z

/ai-review

github-actions · 2026-03-29T21:08:44Z

🔁 AI review rerun (requested by @igerber)

Head SHA: f2e25fb48eecd4fea0b3d693553fdad1c712f7d1

Overall Assessment

⛔ Blocker

The re-review shows that several prior issues are fixed, but one unmitigated P0 remains in the new estimator path, along with multiple P1s. The blocking problem is incorrect notyettreated comparison-group construction when anticipation > 0, which can silently contaminate controls and bias ATT(g,t).

Executive Summary

The earlier issues around control_group result metadata, group-effect bootstrap propagation, and the labeled pscore_trim registry note appear addressed.
P0: anticipation is not propagated into the notyettreated comparison-cohort filter, so cohorts inside the anticipation window can still enter the control pool.
P1: bootstrap group-time inference is computed but never written back into results.group_time_effects, so callers still see analytical SEs/CIs/p-values after requesting bootstrap inference.
P1: the balanced-panel validation is still incomplete; equal per-unit period counts can pass even when units are missing different periods.
P1: the aggregation-weight deviation from agg_ddd() is described, but not under a **Note:** / **Deviation from R:** label, so under this review policy it is still undocumented.
P2: the new R cross-validation suite still skips in CI because it requires CSV fixtures that are not committed in this PR.

Methodology

Severity: P0. Impact: diff_diff/staggered_triple_diff.py:L282-L295 builds valid_gc from max(g, t) only, while anticipation only affects the base period in diff_diff/staggered_triple_diff.py:L676-L684. The project’s methodology registry documents the not-yet-treated rule as first_treat > max(t, base_period) + anticipation in docs/methodology/REGISTRY.md:L352-L356 and docs/methodology/REGISTRY.md:L414-L417. As written, anticipation=1 still lets a cohort treated at t+1 serve as a control at time t, which changes control composition and can bias estimates. Concrete fix: derive the comparison-group threshold from max(t, base_period_val) + self.anticipation for control_group="notyettreated" (or reject that parameter combination explicitly), and add a regression test where a cohort inside the anticipation window must be excluded.
Severity: P1. Impact: diff_diff/staggered_triple_diff.py:L439-L500 replaces overall, event-study, and group-level inference from bootstrap_results, but never updates group_time_effects. That leaves results.group_time_effects and to_dataframe("group_time") in diff_diff/staggered_triple_diff_results.py:L228-L257 reporting analytical SEs/CIs/p-values even when n_bootstrap > 0. The existing staggered DiD implementation already updates group-time inference in diff_diff/staggered.py:L1600-L1615. Concrete fix: add the same group-time bootstrap update block here before the event-study/group updates, and lock it with a regression test that compares group_time_effects[(g, t)]["se"] against bootstrap_results.group_time_ses[(g, t)].
Severity: P1. Impact: the new balanced-panel guard in diff_diff/staggered_triple_diff.py:L586-L593 only checks that each unit has the same number of observed periods, not that each unit has the full global time set. A panel where all units have three observations but are missing different periods still passes validation, then gets silently dropped cell-by-cell by the finite-data check in diff_diff/staggered_triple_diff.py:L730-L735, which violates the “requires balanced panel” contract documented in docs/methodology/REGISTRY.md:L1262-L1278. Concrete fix: require each unit’s observed time set to equal the full df[time].unique() set, or equivalently require per-unit unique period counts to equal the global period count.
Severity: P1. Impact: the PR explicitly normalizes a different aggregation estimand in tests/test_methodology_staggered_triple_diff.py:L181-L190 and describes the difference in docs/methodology/REGISTRY.md:L1391-L1395, but the StaggeredTripleDifference registry entry does not record that as a labeled **Note:** / **Deviation from R:**. Under the review policy, that means the weighting change is still an undocumented methodology deviation. Concrete fix: add a labeled **Deviation from R:** (or **Note (deviation from R):**) paragraph in the StaggeredTripleDifference section that explicitly states that event-study / simple aggregation reuse Callaway-Sant’Anna cohort-size weights instead of agg_ddd()’s group-probability weights.

Code Quality

No additional findings beyond the methodology issues above.

Performance

No findings in the reviewed diff.

Maintainability

No additional findings beyond the missing bootstrap propagation branch already noted above.

Tech Debt

No findings. TODO.md does not already track the P0/P1 issues above, so no mitigation applies.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the methodology suite skips if per-scenario CSV fixtures are missing in tests/test_methodology_staggered_triple_diff.py:L48-L53, but this PR only commits benchmarks/data/synthetic/staggered_ddd_r_results.json plus the generator logic in benchmarks/R/benchmark_staggered_triplediff.R:L50-L55. In CI, the new R-parity tests therefore skip instead of validating the claimed cross-check. Concrete fix: commit the staggered_ddd_data_s*_dgp*.csv fixtures or generate them deterministically during test setup. Also add targeted regressions for anticipation>0 with control_group="notyettreated" and for bootstrap replacement of group_time_effects.

Path to Approval

Fix notyettreated comparison-group eligibility so anticipation affects control composition, or explicitly reject that parameter combination until it is supported.
Propagate bootstrap group_time_ses / CIs / p-values back into results.group_time_effects.
Tighten the balanced-panel validation to require full common time support, not just equal per-unit counts.
Add a labeled REGISTRY deviation note for the aggregation-weight change relative to agg_ddd().
Make the new R methodology tests runnable in CI by committing or generating the required CSV fixtures.

Verification note: I could not run the test suite in this environment because pytest and numpy are unavailable.

- P0: Propagate anticipation into notyettreated comparison-group threshold (use max(t, base_period) + anticipation, not just max(g, t)) - P1: Propagate bootstrap group-time SEs/CIs/p-values into group_time_effects - P1: Tighten balanced-panel check to require full common time support - P1: Add Deviation from R label for aggregation weight difference - P3: CI fix: cast first_treat to float before assigning inf in test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T10:52:08Z

/ai-review

github-actions · 2026-03-30T11:02:24Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8995f0ffba9e6ce865af81ba79ba2553b1d3d615

Overall Assessment

⚠️ Needs changes

The prior re-review P0/P1 items around anticipation propagation, bootstrap overwrite of group_time_effects, balanced-panel validation, and the labeled REGISTRY deviation note look addressed in the current diff. One newly identified P1 remains in the new estimator path: analytical group_effects inference uses the wrong variance formula when n_bootstrap=0. Static review only; I could not run pytest here because this environment does not have numpy/pytest.

Executive Summary

The earlier blocker-level issues from the last AI review appear fixed in the current diff.
[Newly identified] P1: analytical per-cohort group_effects SEs/CIs/p-values are still off in the new estimator path.
P2: the new R methodology suite still skips in a clean checkout because it depends on CSV fixtures that are not committed by this PR.
P2: generate_staggered_ddd_data() crashes for valid configurations that leave a cohort empty.
P3: one non-blocking REGISTRY wording mismatch remains around the scalar overall_att summary.

Methodology

Severity: P1 [Newly identified]. Impact: diff_diff/staggered_triple_diff.py:425 routes aggregate="group" / "all" through diff_diff/staggered_aggregation.py:659, and diff_diff/staggered_aggregation.py:704 then computes analytical SEs via the WIF-adjusted path starting at diff_diff/staggered_aggregation.py:220. In the triplediff reference implementation, cohort-specific group aggregation explicitly uses fixed within-group weights and wif = NULL because there are no estimated weights for that estimand, so the extra WIF term changes the analytical SE/CI/p-value formula for group_effects whenever n_bootstrap=0. Concrete fix: add a no-WIF analytical path for per-cohort group effects, and lock it with an R parity test for aggregate="group" under both control-group modes. citeturn2view0
Severity: P3. Impact: docs/methodology/REGISTRY.md:1321 labels the scalar overall_att as paper Equation 4.14, but the implementation in diff_diff/staggered_aggregation.py:37 is the simple post-treatment ATT(g,t) aggregation. Equation 4.14 in the paper is the average of post-treatment event-study effects, so the REGISTRY wording is misleading even if the code path is intentional. Concrete fix: relabel overall_att as the simple aggregation or expose a separate Eq. 4.14 summary, and update docs/methodology/REGISTRY.md:1327 so it does not claim that all aggregation SEs use WIF. citeturn1view2turn2view0

Code Quality

Severity: P2. Impact: diff_diff/prep_dgp.py:1018, diff_diff/prep_dgp.py:1021, and diff_diff/prep_dgp.py:1023 force n_eligible >= 1 for every requested cohort, including empty ones. Valid parameter combinations such as never_enabled_frac=0 or too many cohort_periods for the treated sample therefore fall into rng.choice(..., size=1) on an empty index set and crash. Concrete fix: skip empty masks before sampling and validate n_eligible <= n_g; if the generator requires both eligible and ineligible units, raise a clear parameter error instead of failing inside rng.choice.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. TODO.md does not currently track the P1/P2 items above, so no mitigation applies.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: tests/test_methodology_staggered_triple_diff.py:48 and tests/test_methodology_staggered_triple_diff.py:50 require per-scenario CSV fixtures and skip when they are absent, but this PR only commits benchmarks/data/synthetic/staggered_ddd_r_results.json plus a generator script that would write those CSVs at benchmarks/R/benchmark_staggered_triplediff.R:48. In a clean checkout, the new R parity suite therefore skips instead of validating the claimed benchmark coverage. Concrete fix: commit the staggered_ddd_data_s*_dgp*.csv fixtures or generate them deterministically during test setup; after fixing the group-SE issue, add one parity assertion for aggregate="group" as well.

Path to Approval

Remove WIF from analytical per-cohort group_effects inference in the staggered DDD path. A focused fix would let group aggregation reuse the existing combined-IF machinery with use_wif=False whenever a single cohort is being aggregated.
Add an R parity regression test for aggregate="group" with n_bootstrap=0, asserting that Python group-effect SEs/CIs match triplediff for at least one nevertreated and one notyettreated scenario.

…docs - Generate CSV fixtures via R at test time (skip if R unavailable) - Guard empty cohorts in generate_staggered_ddd_data - Fix REGISTRY: relabel overall_att (not Eq 4.14), document conservative per-cohort group-effect SEs (WIF included vs R's wif=NULL) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T12:34:19Z

/ai-review

github-actions · 2026-03-30T12:45:31Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 9ce0195b1b7966514521c281307f7a2467ca40ca

Overall Assessment

⚠️ Needs changes

The PR closes some of the earlier review surface: the new estimator now validates balanced panels and time-invariant eligibility, and the synthetic DGP no longer crashes on empty cohort masks. But there are still unmitigated P1 issues in the new staggered DDD inference paths. Static review only: I could not run pytest here because this environment does not have numpy or pytest.

Executive Summary

The prior empty-cohort generator issue appears fixed, and the new estimator adds the expected balanced-panel and time-invariance checks.
Severity P1 [Newly identified]: overall and event-study inference mix eligible-treated aggregation weights with WIF built from all-unit cohort shares, so SE/CI/p-values can be wrong when eligibility shares differ across cohorts.
Severity P1: analytical group_effects still use WIF even though within-cohort weights are fixed; the added REGISTRY prose does not use one of the accepted deviation labels, so the prior methodology issue is still unmitigated under this repo’s rules.
Severity P1: the public rank_deficient_action parameter is not honored by the outcome-regression path.
Severity P2: the new R methodology suite still skips in a clean checkout because the required CSV fixtures are not committed, and it still does not cover group aggregation or aggregation-SE parity.

Methodology

Severity P1 [Newly identified]. Impact: the new estimator stores n_treated from the eligible-treated mask in diff_diff/staggered_triple_diff.py:L306 and diff_diff/staggered_triple_diff.py:L381, and the simple/event-study aggregations use that field as the point-estimate weight in diff_diff/staggered_aggregation.py:L76 and diff_diff/staggered_aggregation.py:L535. But the inherited WIF path still builds pg from all units with first_treat == g in diff_diff/staggered_aggregation.py:L279, and that same combined IF is reused for bootstrap overall/event-study inference in diff_diff/staggered_bootstrap.py:L373. Once cohort eligibility shares differ, the reported overall/event-study SEs, CIs, and p-values no longer correspond to the reported weighting scheme. Concrete fix: introduce a single explicit aggregation-mass definition and use it everywhere. Either switch the point weights to all-cohort masses, or compute WIF on 1{S_i=g, Q_i=1}; then add a regression test with unequal cohort-specific eligibility shares that checks overall/event-study ATT and SE.
Severity P1. Impact: the prior aggregate="group" issue is still open. The new estimator still routes group aggregation through diff_diff/staggered_triple_diff.py:L425, and _aggregate_by_group() still applies WIF-adjusted SEs in diff_diff/staggered_aggregation.py:L673 and diff_diff/staggered_aggregation.py:L704, even though the within-cohort averaging weights are fixed. The new REGISTRY prose at docs/methodology/REGISTRY.md:L1329 describes that behavior, but not under one of the accepted **Note:** / **Deviation from R:** labels, so it does not mitigate the deviation under the project’s review rules. Concrete fix: either remove WIF from per-cohort group aggregation when all contributing (g,t) pairs share the same cohort, or relabel the REGISTRY entry as an explicit deviation and add a locked-in aggregate="group", n_bootstrap=0 parity/expectation test.

Code Quality

Severity P1. Impact: the public rank_deficient_action contract is declared in diff_diff/staggered_triple_diff.py:L78, but _compute_or() in diff_diff/staggered_triple_diff.py:L1042 never honors it. The DR/reg outcome-regression path always falls back to Cholesky, np.linalg.lstsq, and zero-filling, so rank_deficient_action="error" can still silently pseudo-invert a collinear design. Concrete fix: route _compute_or() through the same OLS helper used elsewhere in the library (solve_ols / _linear_regression) with rank_deficient_action=self.rank_deficient_action, and add collinear-covariate tests for warn, error, and silent.

Performance

No findings.

Maintainability

No findings beyond the issues already called out in Methodology and Code Quality.

Tech Debt

No findings. TODO.md does not track the P1/P2 items above, so no mitigation applies.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the methodology suite is still not self-contained. The helper only generates CSV fixtures when local R + triplediff are available in tests/test_methodology_staggered_triple_diff.py:L56, and each scenario skips when the CSV is absent in tests/test_methodology_staggered_triple_diff.py:L94. This PR commits benchmarks/data/synthetic/staggered_ddd_r_results.json, but not the staggered_ddd_data_s*_dgp*.csv inputs those tests require. The same file also only checks GT ATT/SE and loose ATT aggregation in tests/test_methodology_staggered_triple_diff.py:L148 and tests/test_methodology_staggered_triple_diff.py:L241; it does not cover aggregate="group" or aggregation SE parity. Concrete fix: commit deterministic CSV fixtures, or generate them during test setup without requiring local R, and add explicit assertions for group aggregation plus at least one overall/event-study SE case.

Path to Approval

Make aggregation weights and WIF use the same population definition. Add a regression test with cohort-specific eligibility rates that checks overall/event-study ATT and SE.
Resolve the analytical group_effects deviation: either remove WIF for per-cohort averages, or promote the REGISTRY prose to an accepted deviation label and lock it in with a aggregate="group", n_bootstrap=0 test.
Make _compute_or() honor rank_deficient_action, and add DR/reg tests with collinear covariates for warn, error, and silent.
Make the R parity suite self-contained by committing deterministic CSV fixtures, or generating them without local R, and extend it to group aggregation and aggregation-SE coverage.

- Use eligible-treated-only cohort assignments for aggregation WIF, matching the paper's Eq 4.13 (G_i defined only for Q=1 units). Ineligible units get cohort=0 so they don't contribute to pg. - Honor rank_deficient_action in _compute_or via solve_ols fallback - Add proper Deviation from R / Note labels in REGISTRY for aggregation weight choice and per-cohort WIF Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T12:54:16Z

/ai-review

github-actions · 2026-03-30T13:01:51Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 40c02fa2428da5a12f48898f832ca31a89eb735e

Overall Assessment

⚠️ Needs changes

Static review only; I could not run pytest here because this environment is missing numpy. On re-review, the earlier rank_deficient_action code defect appears fixed, and the prior group-aggregation documentation issue is now mitigated by accepted REGISTRY labels. One unmitigated P1 remains in the aggregation-inference path.

Executive Summary

StaggeredTripleDifference still mixes eligible-treated point weights with all-unit cohort shares inside the non-survey WIF path, so overall/event-study SEs, CIs, and p-values can be wrong when cohort eligibility shares differ.
The prior aggregate="group" documentation problem is now mitigated: the deviation is recorded under accepted **Deviation from R:** / **Note:** labels in REGISTRY.md.
The prior rank_deficient_action implementation gap appears addressed in code, but there is still no direct regression test for the new collinearity path.
The new R-parity suite still skips in a clean checkout without local R fixtures and still does not lock aggregation SE parity or aggregate="group" parity.

Methodology

Severity P1. Impact: the estimator now prepares an eligible-treated cohort map before aggregation in diff_diff/staggered_triple_diff.py:L408-L450, matching the documented deviation in docs/methodology/REGISTRY.md:L1332-L1336. But the non-survey WIF helper still computes group_sizes[g] from raw df["first_treat"] cohort counts in diff_diff/staggered_aggregation.py:L279-L294, while the indicator side of the same WIF uses precomputed["unit_cohorts"] in diff_diff/staggered_aggregation.py:L334-L341. That mixes P(S=g) with 1{S=g,Q=1}, so the combined IF is inconsistent with the reported weighting scheme. The same combined IF drives analytical aggregation SEs in diff_diff/staggered_aggregation.py:L413-L500 and overall/event-study bootstrap inference in diff_diff/staggered_bootstrap.py:L373-L410. Concrete fix: in the non-survey branch, derive group_sizes from precomputed["unit_cohorts"] whenever precomputed is supplied, or zero out ineligible units before computing pg; then add a regression test with unequal cohort-specific eligibility shares that locks analytical and bootstrap overall/event-study SEs.
Severity P3 informational. Impact: the earlier aggregate="group" documentation issue is now mitigated. The behavior is explicitly documented under accepted labels in docs/methodology/REGISTRY.md:L1332-L1340, so under this repo’s review rules it is no longer blocking. Concrete fix: none required.

Code Quality

No findings. The earlier rank_deficient_action propagation issue appears addressed: _compute_or() now falls back through solve_ols(..., rank_deficient_action=self.rank_deficient_action) in diff_diff/staggered_triple_diff.py:L1053-L1103.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. TODO.md does not currently track the remaining P1/P2 items, so no mitigation applies. TODO.md

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new R-parity suite is still not self-contained in a clean checkout. _load_r_data() skips when the per-scenario CSV fixtures are absent in tests/test_methodology_staggered_triple_diff.py:L94-L103, and _ensure_csv_fixtures() can only regenerate them when local R + triplediff are available in tests/test_methodology_staggered_triple_diff.py:L56-L77. The diff adds benchmarks/data/synthetic/staggered_ddd_r_results.json, but not the staggered_ddd_data_s*_dgp*.csv fixtures that the tests actually load. Concrete fix: commit deterministic CSV fixtures, or replace them with a committed Python-readable fixture format so the parity suite runs without local R.
Severity P2. Impact: the methodology suite still does not lock the remaining high-risk aggregation paths. It checks GT ATT/SE and loose ATT aggregation only in tests/test_methodology_staggered_triple_diff.py:L229-L277, with no assertions for aggregation SE parity, no aggregate="group" parity case, and no direct regression test for the new rank_deficient_action path in tests/test_staggered_triple_diff.py:L191-L197 and tests/test_staggered_triple_diff.py:L343-L405. Concrete fix: add locked tests for overall/event-study SEs, at least one aggregate="group" R-parity case, and a duplicated-covariate fixture that exercises rank_deficient_action="warn", "error", and "silent".

Path to Approval

Make the non-survey WIF use the same population definition as the point weights. The simplest fix is to compute group_sizes from precomputed["unit_cohorts"] in diff_diff/staggered_aggregation.py:L279-L294 when precomputed is provided.
Add a regression test with unequal cohort-specific eligibility shares that fails on the current code and asserts the corrected analytical and bootstrap overall/event-study SE behavior.

The non-survey WIF path reads group sizes from df["first_treat"] directly (staggered_aggregation.py:292), bypassing precomputed["unit_cohorts"]. Zero out df["first_treat"] for ineligible units alongside precomputed_agg so both code paths use the eligible-treated-only population definition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T13:04:08Z

/ai-review

github-actions · 2026-03-30T13:20:07Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 4b7e44041d46e2f3ccc18c818f9461768466bbd6

Overall Assessment

⚠️ Needs changes

Static review only; I could not run pytest here because this environment is missing numpy.

Executive Summary

The prior non-survey WIF aggregation issue appears resolved: the estimator now zeroes ineligible units before sending data into aggregation and bootstrap, which matches the documented deviation in REGISTRY.md.
One unmitigated methodology blocker remains: the new DDD registry says admissible comparison cohorts satisfy g_c > max(g, t), but the implementation uses g_c > max(t, base_period) + anticipation. Those rules diverge on pre-treatment cells for later cohorts.
base_period="universal" is not handled the same way as the existing staggered estimator: the new code estimates the normalization period instead of omitting it and letting the event-study aggregator inject the synthetic reference row.
The R-parity suite is still not self-contained; it skips on a clean checkout unless local R fixtures can be regenerated.
The parity harness still does not validate the covariate-adjusted DR/IPW/RA paths or aggregation SE / aggregate="group" parity, because the benchmark script hard-codes xformla = ~1.

Methodology

Severity P1. Impact: diff_diff/staggered_triple_diff.py:L291-L297 admits comparison cohorts using gc > max(t, base_period) + anticipation, while docs/methodology/REGISTRY.md:L1297-L1297 and docs/methodology/REGISTRY.md:L1399-L1399 say the method requires g_c > max(g, t). That is an undocumented methodology mismatch on control-group composition. The branch where these rules differ is exactly the kind of control-logic issue that can change placebo/event-study estimates for later cohorts under base_period="varying". Concrete fix: decide which rule is source-correct, then make code and REGISTRY.md agree and add a 3+ cohort regression/parity test where t < g and one candidate comparison cohort lies in (t, g).
Severity P1. Impact: diff_diff/staggered_triple_diff.py:L274-L280 loops over all periods in universal-base mode, so it estimates t == g - 1 - anticipation cells instead of omitting them. That bypasses the established reference-period contract in docs/methodology/REGISTRY.md:L379-L393 and the existing staggered implementation at diff_diff/staggered.py:L1467-L1470; the mixin only injects the synthetic normalization row when it is missing (diff_diff/staggered_aggregation.py:L644-L655). As written, the universal reference period becomes an ordinary event-study cell instead of the expected effect=0, se=NaN constraint. Concrete fix: mirror the existing staggered estimator by skipping t == g - 1 - anticipation before estimation, then add an event-study test like tests/test_staggered.py:L3107-L3134 for anticipation=0 and anticipation=1.
No findings on the documented R deviations themselves. The aggregation-weight, propensity-clipping, and deferred cluster/survey notes added in docs/methodology/REGISTRY.md:L1332-L1381 use accepted Deviation/Note labels, so they are not defects under this repo’s review policy.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: TODO.md:L47-L78 does not track the non-blocking fixture/parity gaps introduced by this PR, so they are not mitigated under the repo’s deferred-work policy. Concrete fix: if you plan to defer the fixture/parity cleanup, add explicit TODO.md entries under “Tech Debt from Code Reviews.”

Security

No findings.

Documentation/Tests

Severity P2. Impact: the methodology suite still is not self-contained. tests/test_methodology_staggered_triple_diff.py:L56-L103 requires staggered_ddd_data_s*.csv fixtures that are not committed, so a clean checkout without local R + triplediff will skip the parity tests even though benchmarks/data/synthetic/staggered_ddd_r_results.json is present. Concrete fix: commit deterministic CSV fixtures, or replace them with a committed Python-readable fixture format that the tests load directly.
Severity P2. Impact: the new parity story overstates method coverage. The R generator hard-codes xformla = ~1 in benchmarks/R/benchmark_staggered_triplediff.R:L57-L65, so the DR/IPW/RA scenarios collapse to the same no-covariate path, and tests/test_methodology_staggered_triple_diff.py:L229-L277 only asserts aggregation ATT with loose 25% tolerance, not aggregation SEs or aggregate="group" parity. Concrete fix: add at least one covariate-adjusted R fixture plus locked checks for event-study/overall/group SE parity.

Path to Approval

Resolve the admissible comparison-cohort rule by making diff_diff/staggered_triple_diff.py and docs/methodology/REGISTRY.md agree, and add a 3+ cohort test that exercises a pre-treatment (g, t) cell where the current code and current docs diverge.
In universal-base mode, exclude t == g - 1 - anticipation from GT estimation and add an event-study test asserting the injected reference row has effect=0.0, se=np.nan, and conf_int=(np.nan, np.nan).

- Skip reference period (t == g-1-anticipation) in universal mode, matching CallawaySantAnna behavior. Event-study mixin injects synthetic reference row with effect=0. - Update REGISTRY comparison-group rule to match code: uses max(t, base_period) + anticipation for notyettreated threshold. - Add TODO.md entries for deferred P2 items (CSV fixtures, covariate parity, group-effect WIF). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T13:24:43Z

/ai-review

github-actions · 2026-03-30T13:33:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 1a309c44cb93be61da3896ed0da47fd93cfa5269

Overall Assessment

⚠️ Needs changes

Static review only. I could not run the targeted test modules here because pytest is not installed in this environment.

Executive Summary

The prior universal-base-period blocker appears resolved: diff_diff/staggered_triple_diff.py:L274 now skips the normalization period in universal mode, and the shared event-study aggregator still injects the synthetic reference row at diff_diff/staggered_aggregation.py:L642.
One methodology issue remains unmitigated in the new StaggeredTripleDifference: the admissible notyettreated comparison-cohort rule is still the only P1.
The previous non-blocking parity/fixture gaps are now explicitly tracked in TODO.md:L62, TODO.md:L63, and TODO.md:L64, so they are mitigated under this repo’s deferred-work policy.
I did not find additional unmitigated P0/P1 issues in code quality, performance, maintainability, or security.

Methodology

Severity P1. Impact: Affected method is StaggeredTripleDifference ATT(g,t) with control_group="notyettreated". The implementation at diff_diff/staggered_triple_diff.py:L293 and the registry text at docs/methodology/REGISTRY.md:L1297 now agree on g_c > max(t, base_period) + anticipation, but the cited paper’s estimation section states admissible comparison cohorts satisfy g_c > max(g, t). The current triplediff source does use max(t, pret) gating, so the PR is effectively following companion-package behavior rather than the paper, but that choice is still undocumented as a deviation from the cited source material. This changes the control set for pre-treatment cells whenever a later cohort lies between t and g, so placebo/event-study estimates can shift depending on which rule is used. (arxiv.org) Concrete fix: either change the estimator and registry to the paper’s g_c > max(g, t) rule, or keep the current rule but add an explicit **Deviation from paper:** note in the registry explaining that the implementation intentionally follows the companion package instead.
Severity P3. Impact: The aggregation-weight differences, propensity-score clipping, and conservative per-cohort group-SE behavior are explicitly labeled in docs/methodology/REGISTRY.md:L1335, docs/methodology/REGISTRY.md:L1371, and docs/methodology/REGISTRY.md:L1380. Under repo policy, these documented deviations are informational only. Concrete fix: none required before merge.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: The prior review’s fixture/parity follow-ups are now properly tracked in TODO.md:L62, TODO.md:L63, and TODO.md:L64, which mitigates them under the project’s deferred-work rules. Concrete fix: none required before merge.

Security

No findings.

Documentation/Tests

No unmitigated findings. The remaining non-blocking parity gaps are tracked in TODO.md, and this review stayed static because pytest is unavailable in the current environment.

Path to Approval

Resolve the admissible notyettreated comparison-cohort rule called out above by either matching the cited paper or explicitly documenting the current companion-package rule as a deviation in docs/methodology/REGISTRY.md:L1297. (arxiv.org)
Add a regression test with at least three treated cohorts and base_period="varying" that exercises a pre-treatment cell t < g where one candidate comparison cohort lies in (t, g), then assert the intended comparison-set behavior in tests/test_staggered_triple_diff.py:L391 or a new targeted methodology test.

The implementation follows R's triplediff (max(t, base_period) + anticipation) rather than the paper's max(g, t). Add Deviation from paper label in REGISTRY. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T13:55:32Z

/ai-review

github-actions · 2026-03-30T14:05:57Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2a44d56330660404462e1779e30a1f853f49da3b

Overall Assessment

⚠️ Needs changes

Static review only. I could not execute the new test modules here because this environment does not have pytest, numpy, pandas, or scipy installed.

Executive Summary

The prior notyettreated methodology blocker is resolved: the registry now explicitly documents the paper/package comparison-set divergence, so that item is P3 informational rather than blocking.
P1: the estimator still silently drops unidentified cells/cohorts when the required base period is outside the panel or when one of the four required (S,Q) subgroups is empty; that can change the aggregated ATT with no warning.
P1: rank-deficient propensity-score fits are cached incorrectly. solve_logit() returns NaN placeholders for dropped columns, and the cache reuse path multiplies by those NaNs directly, which can turn later cached PS predictions into NaN and silently remove valid (g,t) cells.
The new R-parity/test-fixture limitations are tracked in TODO.md, so they are non-blocking P3 items under this repo’s deferred-work policy.
No unmitigated findings in performance, maintainability, or security.

Methodology

Affected methods: StaggeredTripleDifference ATT(g,t), the GMM combination across comparison cohorts, and the downstream overall/event-study/group aggregations.

Severity P1. Impact: fit() silently continues when the computed base period is absent, and _compute_ddd_gt_gc() silently returns None when any required DDD subgroup is empty. If that removes all usable comparisons for a (g,t) cell, the cell disappears from group_time_effects; if it happens for an entire cohort, the overall ATT is then aggregated over a subset of treated cohorts with no explicit warning. This is especially problematic when g - 1 - anticipation falls before the observed panel or when a cohort has only eligible or only ineligible units. Concrete fix: emit an explicit warning (or error) when a cohort lacks the required pre-treatment base, and emit a warning when a (g,g_c,t) comparison is unidentified because any of the four required cells is empty; add regression tests for both cases. staggered_triple_diff.py:288, staggered_triple_diff.py:351, staggered_triple_diff.py:772, staggered_triple_diff.py:434, REGISTRY.md:1267. The companion triplediff reference explicitly warns and drops groups with no pre-treatment periods rather than skipping them silently. citeturn1view0
Severity P3. Impact: the previously reported paper/package comparison-group mismatch is now explicitly documented as a deviation in the registry, along with the aggregation-weight and PS-trimming choices, so those are informational only in this re-review. Concrete fix: none required before merge. REGISTRY.md:1301, REGISTRY.md:1342, REGISTRY.md:1378. citeturn3view3turn1view0

Code Quality

Severity P1. Impact: _compute_pscore() caches the expanded coefficient vector returned by solve_logit(), but solve_logit() fills dropped rank-deficient columns with NaN. On cache reuse, np.dot(covX, beta_logistic) propagates those NaNs into the propensity scores, so later (g,t) cells that reuse the same PS model can become non-finite and be skipped from GMM/aggregation even though rank_deficient_action="warn" is supposed to leave a usable reduced-rank fit. Concrete fix: do not cache the NaN-expanded coefficient vector. Cache fitted probabilities, or cache a cleaned representation such as zero-filled dropped columns or (kept_cols, beta_reduced) and reconstruct predictions from that; add a regression test with collinear covariates across multiple post-treatment periods. staggered_triple_diff.py:1025, staggered_triple_diff.py:1030, staggered_triple_diff.py:1041, linalg.py:1351

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the PR now tracks the non-blocking follow-ups around missing committed CSV fixtures, missing covariate-adjusted R parity, and conservative per-cohort group SE parity. Under this repo’s policy, those are mitigated. Concrete fix: none required before merge. TODO.md:62, TODO.md:63, TODO.md:64

Security

No findings.

Documentation/Tests

No unmitigated findings beyond the tracked TODO.md items. Static review only here because the runtime dependencies needed to execute the new estimator/tests are unavailable in this environment.

Path to Approval

Add explicit unidentified-cell/cohort handling: warn once when g - 1 - anticipation is outside the observed panel, warn when a (g,g_c,t) comparison is missing any required (S,Q) subgroup, and add regression tests for both cases.
Fix the propensity-score cache so rank-deficient solve_logit() fits are not reused via NaN-expanded coefficient vectors, and add a regression test with duplicated/collinear covariates spanning multiple post-treatment periods.

- Warn when base period is outside observed panel (previously silent skip) - Warn when DDD subgroups are empty with details of which cells are missing - Zero-fill NaN logistic coefficients before caching to prevent NaN propagation on cache reuse with rank-deficient covariates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T14:15:00Z

/ai-review

github-actions · 2026-03-30T14:27:22Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 9bc0341dd5d15b6b149143c5e9c63d11ef074f36

Overall Assessment

✅ Looks good

Static review only. This environment does not have numpy, pandas, scipy, or pytest, so I could not execute the new estimator or test modules.

Executive Summary

This re-review resolves the two prior blocking issues I checked first: missing base-period / empty-subgroup paths now warn instead of silently disappearing, and the cached rank-deficient propensity-score path no longer reuses NaN coefficients. See diff_diff/staggered_triple_diff.py#L288, diff_diff/staggered_triple_diff.py#L789, and diff_diff/staggered_triple_diff.py#L1063.
I did not find any new unmitigated P0/P1 correctness issues in the changed estimator, inference, or parameter-propagation paths.
The remaining methodology differences versus the paper / companion R package are explicitly documented in the new registry entry, so they are P3 informational rather than blocking.
The remaining R-parity fixture and broader parity limitations are tracked in TODO.md, so they are non-blocking under this repo’s deferred-work policy.
One non-blocking P2 remains: the new tests still do not directly lock down the two regression branches that fixed the last review’s blockers.

Methodology

Severity P3. Impact: The main source-material deviations are now properly documented in the registry: the paper motivates cohort-specific staggered DDD comparisons and GMM / inverse-variance combination, while the companion triplediff implementation uses the max(t, pre_t) style admissible-control filter that this PR follows; the registry also explicitly documents the aggregation-weight and propensity-score-trimming differences. Under the project policy, that makes these informational only, not defects. Concrete fix: none before merge. See docs/methodology/REGISTRY.md#L1301, docs/methodology/REGISTRY.md#L1342, and docs/methodology/REGISTRY.md#L1378. (psantanna.com)
No other unmitigated methodology findings.

Code Quality

No unmitigated findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: The remaining non-blocking limitations are explicitly tracked: missing committed CSV fixtures for the R-parity suite, no covariate-adjusted R benchmark coverage yet, and the documented per-cohort group-SE WIF difference. Under repo policy, these do not block merge. Concrete fix: none before merge; follow the tracked TODO items. See TODO.md#L62, TODO.md#L63, and TODO.md#L64.

Security

No findings.

Documentation/Tests

Severity P2. Impact: The new suites cover happy-path estimation, aggregation, bootstrap, and R parity, but they still do not directly exercise the two branches that fixed the previous review’s blockers: the missing-base / empty-subgroup warning paths in diff_diff/staggered_triple_diff.py#L288 and diff_diff/staggered_triple_diff.py#L789, and the rank-deficient cached-PS reuse path in diff_diff/staggered_triple_diff.py#L1050. The added tests in tests/test_staggered_triple_diff.py#L343 and tests/test_methodology_staggered_triple_diff.py#L145 do not pin those regressions specifically, so a future refactor could reintroduce them quietly. Concrete fix: add one warning-focused test with a cohort whose base period falls outside the observed panel or whose required (S,Q) subgroup is empty, and add one collinear-covariate test that reuses the same PS cache across multiple (g,t) cells and asserts finite estimates / SEs.

- Test base period outside panel: warns and produces finite results - Test empty DDD subgroup: warns with cell details, still estimates - Test collinear covariates with PS cache reuse: all (g,t) finite Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T14:43:24Z

/ai-review

github-actions · 2026-03-30T14:56:05Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 9a87f4ecf1e17aff08286d8c22d7a4f7e14e4817

Overall Assessment

⚠️ Needs changes

Static review only; I could not execute the new tests in this workspace because the available Python environment is missing numpy.

Executive Summary

The prior blocking items from the last AI review look addressed: out-of-panel base periods now warn/skip, empty (S,Q) subgroups warn/skip, and the cached rank-deficient propensity path no longer reuses NaN coefficients.
Severity P1 [Newly identified]: the balanced-panel validation only checks per-unit period counts, not that every unit has the same set of periods, so invalid panels can pass and then lose group-time comparisons downstream.
Severity P1 [Newly identified]: non-finite outcomes are not rejected up front; Inf can pass validation and cause (g,g_c,t) comparisons to be silently dropped.
I did not find an additional undocumented source-material deviation beyond the deviations already disclosed in the new methodology registry entry.
Severity P2: the R-parity tests compare group-time outputs only by position and never assert the expected (g,t) labels or vector lengths, so they can miss dropped or reordered cells.

Methodology

Severity P1 [Newly identified]. Impact: this estimator is documented and implemented as requiring a balanced panel, but the validation only checks nunique() counts per unit. A unit with periods {1,2,4} and another with {1,2,3} both pass, even though the panel is not balanced. Once that happens, _compute_ddd_gt_gc() can silently discard affected comparisons when delta_y_all is non-finite, which is an unmitigated missing assumption check for a panel estimator. Location: diff_diff/staggered_triple_diff.py:641, diff_diff/staggered_triple_diff.py:803. Concrete fix: validate exact period-set equality per unit against the global period set, not just counts, and add a regression test with same-count/different-period units.
Severity P1 [Newly identified]. Impact: _validate_inputs() rejects NaN but not Inf, so a non-finite outcome can pass the front door and later cause _compute_ddd_gt_gc() to return None with no warning when delta_y_all is non-finite. That silently changes which comparison cohorts or GT cells feed the GMM step instead of failing fast on invalid input. Location: diff_diff/staggered_triple_diff.py:624, diff_diff/staggered_triple_diff.py:803. Concrete fix: reject non-finite outcomes up front with np.isfinite and raise explicitly if a cell is being skipped because delta_y_all is non-finite; I would make the same check for supplied covariates while touching this block.
Severity P3. Impact: I did not find another undocumented mismatch in the core staggered DDD method beyond the deviations already recorded for admissible comparison cohorts, aggregation weighting, and trimming. Location: docs/methodology/REGISTRY.md:1301, docs/methodology/REGISTRY.md:1342, docs/methodology/REGISTRY.md:1378. Concrete fix: none before merge. citeturn0search12turn1view0

Code Quality

No unmitigated findings. The previous re-review blockers appear fixed in diff_diff/staggered_triple_diff.py:288, diff_diff/staggered_triple_diff.py:789, and diff_diff/staggered_triple_diff.py:1063, with direct regression coverage in tests/test_staggered_triple_diff.py:413, tests/test_staggered_triple_diff.py:433, and tests/test_staggered_triple_diff.py:453.

Performance

No findings.

Maintainability

No unmitigated findings.

Tech Debt

Severity P3. Impact: the remaining non-blockers for missing CSV fixtures, missing covariate-adjusted parity coverage, and the documented group-SE/WIF difference are properly tracked in TODO.md:62, TODO.md:63, and TODO.md:64, so they are non-blocking under this repo’s deferred-work policy. Concrete fix: none before merge.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the methodology parity tests compare Python GT outputs to the stored R vectors only by positional index and never assert either the expected vector length or the expected (g,t) identities. If the estimator drops or reorders a GT cell, these tests can still pass on the shared prefix and miss the regression. Location: tests/test_methodology_staggered_triple_diff.py:154, tests/test_methodology_staggered_triple_diff.py:168, tests/test_methodology_staggered_triple_diff.py:199, tests/test_methodology_staggered_triple_diff.py:213. Concrete fix: assert len(py_gt) == len(r["gt_att"]) == len(r["gt_groups"]) == len(r["gt_periods"]) and compare each (g,t) to zip(r["gt_groups"], r["gt_periods"]) before checking ATT/SE closeness.

Path to Approval

Strengthen StaggeredTripleDifference._validate_inputs() so it checks exact per-unit period sets against the global period set, and add a regression test where units have equal counts but different observed periods.
Reject non-finite outcomes before estimation, and add a regression test showing that an np.inf outcome raises a deterministic ValueError instead of silently removing a (g,g_c,t) comparison from estimation.

- Validate exact period-set equality (not just counts) for balanced panel - Reject non-finite outcomes (Inf) and covariates up front - R parity tests now assert GT vector lengths and (g,t) label identity before comparing ATT/SE values Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T15:44:49Z

/ai-review

github-actions · 2026-03-30T15:56:12Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8ce7c5d893cb49d7374b49c518f2e72eed82fcd6

Overall Assessment

⚠️ Needs changes

Static review only. I could not run the new tests in this workspace because pytest is unavailable and the Python environment is missing numpy, pandas, and scipy.

Executive Summary

The prior P1 findings from the last review look addressed: balanced-panel validation now checks exact period sets, and non-finite outcomes/covariates are rejected up front.
The GT R-parity tests now assert vector length and (g,t) identity, which closes the earlier positional-comparison gap for group-time outputs.
Severity P1 [Newly identified]: analytical ATT(g,t) inference can be wrong when some timing-admissible comparison cohorts are later skipped as unidentified; default n_bootstrap=0 exposes the bad SE/p-value/CI path.
I did not find another undocumented methodology deviation beyond the deviations already documented in the Methodology Registry.
Remaining fixture/covariate-parity/aggregation-SE gaps are properly tracked in TODO.md and are non-blocking under this repo’s policy.

Methodology

Severity P1 [Newly identified]. Impact: StaggeredTripleDifference computes size_gt before it knows which comparison cohorts actually survive the inner g_c loop, then reuses that pre-filter size in the per-cohort IF rescaling and final analytical SE path. If a (g,t) cell has multiple timing-valid comparison cohorts and one or more are later skipped for empty (S,Q) cells or another None/non-finite return, the reported analytical SE no longer corresponds to the estimator that was actually combined. Because safe_inference() is then fed that SE, the cell’s se, p_value, and conf_int are all wrong on the default n_bootstrap=0 path. Location: diff_diff/staggered_triple_diff.py:L327, diff_diff/staggered_triple_diff.py:L343, diff_diff/staggered_triple_diff.py:L349, diff_diff/staggered_triple_diff.py:L361, diff_diff/staggered_triple_diff.py:L796. Concrete fix: collect successful comparison cohorts first, rebuild size_gt from the retained gc_labels only, and use that retained union size consistently in both the IF rescaling and the analytical GT SE formula.

No other unmitigated methodology mismatch stood out beyond the documented notes on admissible comparison cohorts, aggregation weighting/WIF, and propensity-score trimming in docs/methodology/REGISTRY.md:L1301, docs/methodology/REGISTRY.md:L1342, and docs/methodology/REGISTRY.md:L1370.

Code Quality

No unmitigated findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the remaining non-blockers are already tracked: missing committed CSV fixtures, missing covariate-adjusted R parity / aggregation-SE parity coverage, and the documented per-cohort group-effect WIF difference. These do not block approval under the repo’s deferred-work policy. Location: TODO.md:L62, TODO.md:L63, TODO.md:L64. Concrete fix: none before merge.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new regression coverage checks that empty subgroup cases warn and keep the estimator finite, but it does not exercise the branch that exposes the SE bug above: a (g,t) cell with at least three admissible comparison cohorts where one is skipped and at least two remain in the GMM combination. As written, test_empty_subgroup_warns() only covers a two-cohort setup, so it cannot catch the mis-normalized multi-comparison analytical SE path. Location: tests/test_staggered_triple_diff.py:L282, tests/test_staggered_triple_diff.py:L433. Concrete fix: add a regression with 3+ admissible comparison cohorts, force one later cohort to fail the (S,Q) identification check, and assert the affected analytical (g,t) se/p_value/conf_int are based only on the retained comparison set.

Path to Approval

Recompute size_gt from the actually retained comparison cohorts after the inner g_c loop filters out skipped/non-finite comparisons, and use that retained normalization consistently in both IF rescaling and analytical GT SE computation.
Add a regression test with at least three admissible comparison cohorts where one is skipped for an empty (S,Q) cell and at least two remain, then assert the affected analytical se/p_value/conf_int match the retained-comparison calculation.

Move size_gt computation to after the inner gc loop so it reflects only the comparison cohorts that actually survived identification. Prevents incorrect IF rescaling and analytical SE when some gc's are skipped due to empty subgroups. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-30T18:15:21Z

/ai-review

github-actions · 2026-03-30T18:23:37Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 4b5157c0ed8e0baca54fbccf50003c751767aa04

Overall Assessment

✅ Looks good

Static review only: this workspace still lacks pytest, numpy, pandas, and scipy, so I could not execute the new tests.

Executive Summary

The prior P1 from the last review appears resolved: the analytical ATT(g,t) path now rebuilds size_gt from the retained comparison cohorts before influence-function rescaling and SE calculation in diff_diff/staggered_triple_diff.py:L351-L376.
I did not find a new unmitigated P0/P1 in the changed estimator.
Methodology-wise, the implementation preserves the paper’s core staggered-DDD structure by avoiding pooled not-yet-treated controls and combining valid comparison cohorts with GMM weights; the remaining departures I saw are documented in the Methodology Registry and are informational only. (ar5iv.org)
The main remaining gap is test coverage: the exact multi-comparison skipped-cohort SE bug from the prior review still lacks a direct regression test.
The broader R-parity limitations are properly tracked in TODO.md and do not block approval.

Methodology

Affected method(s): StaggeredTripleDifference group-time ATT(g,t), GMM combination, and aggregation.

Severity P3. Impact: the substantive deviations I found are documented rather than silent: the notyettreated admissible-cohort rule follows the companion triplediff implementation instead of the paper’s stricter g_c > max(g,t) statement, and the reported overall_att uses a CS-style simple aggregation rather than the paper’s Eq. 4.14 scalar summary. Under this repo’s policy, those are informational only because they are called out in the Registry. Concrete fix: none before merge. Locations: diff_diff/staggered_triple_diff.py:L303-L313, diff_diff/staggered_triple_diff.py:L445-L465, docs/methodology/REGISTRY.md:L1301-L1307, docs/methodology/REGISTRY.md:L1331-L1391. (ar5iv.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the remaining non-blockers are explicitly tracked: missing committed CSV fixtures for the R cross-validation path, no covariate-adjusted R parity / aggregation-SE parity coverage yet, and the documented conservative per-cohort group-effect WIF behavior. Concrete fix: none before merge; already tracked in TODO.md:L62-L64.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the prior SE-normalization bug appears fixed in code, but the regression coverage still does not hit the exact failure shape from the last review. test_empty_subgroup_warns() uses only a two-comparison-cohort setup, so it does not directly verify the analytical path where 3+ admissible comparison cohorts exist, one is skipped, and at least two remain for the GMM combination. Concrete fix: add a focused regression with 3+ admissible comparison cohorts, force one later cohort to fail the (S,Q) identification check, and assert the affected analytical se/p_value/conf_int are computed from only the retained comparison set. Location: tests/test_staggered_triple_diff.py:L433-L451.

Flip PS nuisance correction sign in panel survey IPW and RCS IPW paths. R adds the correction to inf.control then subtracts (att = treat - control), so the net effect on ATT IF is subtraction. DR paths are unaffected because their M2 residual (m_control - control_change) already has opposite sign. Also merges main (StaggeredTripleDifference from #245), resolves TODO.md conflict, and updates survey-roadmap.md to reflect Phase 7a implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document comparison-cohort rule deviation from paper

2a44d56

The implementation follows R's triplediff (max(t, base_period) + anticipation) rather than the paper's max(g, t). Add Deviation from paper label in REGISTRY. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber merged commit 07e37fa into main Mar 30, 2026
14 checks passed

igerber deleted the staggered-ddd branch March 30, 2026 20:04

igerber mentioned this pull request Mar 31, 2026

Bump version to 2.8.0 #246

Merged

Conversation

igerber commented Mar 29, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Mar 29, 2026

Uh oh!

igerber commented Mar 29, 2026

Uh oh!

github-actions Bot commented Mar 29, 2026

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Path to Approval

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Uh oh!

igerber commented Mar 30, 2026

Uh oh!

github-actions Bot commented Mar 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant