Add AI practitioner guardrails (Baker et al. 2025) by igerber · Pull Request #239 · igerber/diff-diff

igerber · 2026-03-28T18:36:42Z

Summary

Add docs/llms-practitioner.txt: 8-step Baker et al. (2025) workflow mapped to concrete diff-diff API calls, structured as agent-executable instructions
Restructure docs/llms.txt with practitioner workflow header — the first thing AI agents see on discovery
Add diff_diff/practitioner.py: practitioner_next_steps(results) function providing context-aware guidance for all 13 result types via handler registry dispatch
Add tests/test_practitioner.py: 26 tests covering all result types, completed_steps filtering, verbose output, NaN handling, and unknown type fallback
Add docs/practitioner-guide-evaluation.md: empirical before/after comparison across 30 independent agent runs

Methodology references (required if estimator / math changes)

Method name(s): N/A — practitioner.py is a guidance module (no estimator math). References Baker, Larcker & Wang (2025) "Difference-in-Differences Designs: A Practitioner's Guide" (arXiv:2503.13323) for the 8-step workflow.
Paper / source link(s): https://arxiv.org/abs/2503.13323
Any intentional deviations from the source (and why): None — the guide maps Baker et al.'s framework to diff-diff's existing API without modifying any estimator behavior.

Validation

Tests added/updated: tests/test_practitioner.py (26 tests, all passing)
Empirical validation: 30 independent AI agent runs (10 before, 10 after v1, 10 after v2) scored against Baker et al. rubric. Results: 9.4/16 → 16.0/16 (+70%) with variance collapsing from SD=0.84 to SD=0.0. Full results in docs/practitioner-guide-evaluation.md.

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

github-actions · 2026-03-28T18:45:35Z

Overall Assessment

⚠️ Needs changes

The PR’s main feature is a new “Baker et al. 2025” practitioner workflow, but two P1 issues are unmitigated: it rewrites the cited eight-step framework without documenting the deviation, and it recommends compute_honest_did() for result types that the library does not support. The cited paper’s Step 8 is “Keep learning,” not “Robustness & Reporting,” and the paper explicitly lists few-cluster issues as outside its main coverage. (scribd.com)

Executive Summary

P1: The new AI-facing workflow is presented as a direct Baker et al. mapping, but it renumbers/redefines the paper’s steps and injects a hard few-cluster rule with no deviation note in the Methodology Registry. (scribd.com)
P1: practitioner_next_steps() and the new guide both recommend compute_honest_did(results) for SunAbraham, ImputationDiD, TwoStageDiD, StackedDiD, and EfficientDiD results, even though honest_did.py:543 only supports MultiPeriodDiDResults and CallawaySantAnnaResults.
P2: Several snippets emitted by diff_diff/practitioner.py:209 reference nonexistent contracts (result_no_cov.att, results.pre_treatment_rmse, results.sub_experiments, results.hausman_pretest).
P2: tests/test_practitioner.py:118 and neighboring fixtures use __new__ mocks with invented fields, so the suite does not validate the real result-object APIs it claims to cover.
P2/P3: The new primary guide has a broken equivalence_test_trends() example and visible mojibake in the estimator decision tree.

Methodology

Severity: P1. Impact: docs/llms-practitioner.txt:1, docs/llms-practitioner.txt:148, docs/llms-practitioner.txt:312, docs/llms.txt:13, and the baker_step=8 guidance emitted from diff_diff/practitioner.py:192 present this as Baker et al.’s eight-step workflow, but the cited paper’s Step 8 is “Keep learning,” and few-cluster issues are explicitly outside the paper’s covered topics. There is no corresponding REGISTRY.md note/deviation, so this is an undocumented methodology rewrite rather than a documented project convention. Concrete fix: realign the numbered workflow to the paper, or relabel the extra “robustness/reporting” and cluster-count rules as diff-diff extensions and document them in docs/methodology/REGISTRY.md. (scribd.com)
Severity: P1. Impact: docs/llms-practitioner.txt:244 says HonestDiD works with MultiPeriodDiD, CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, StackedDiD, and EfficientDiD results, and the same unsupported recommendation is emitted in _handle_sa, _handle_imputation, _handle_two_stage, _handle_stacked, and _handle_efficient at diff_diff/practitioner.py:292, diff_diff/practitioner.py:303, diff_diff/practitioner.py:314, diff_diff/practitioner.py:325, and diff_diff/practitioner.py:407. But diff_diff/honest_did.py:543 and diff_diff/honest_did.py:650 only support MultiPeriodDiDResults and CallawaySantAnnaResults. That makes the advertised Step 6 fail on five of the “supported” result types. Concrete fix: either implement extraction/support for those result classes, or gate the guidance to supported types only; for CS results, also mention the existing aggregate='event_study' requirement from diff_diff/honest_did.py:617.

Code Quality

Severity: P2. Impact: The generated snippets in diff_diff/practitioner.py:209, diff_diff/practitioner.py:325, diff_diff/practitioner.py:346, and diff_diff/practitioner.py:407 do not match current public contracts. result_no_cov.att is wrong for result objects that expose overall_att/overall_se such as diff_diff/sun_abraham.py:31, diff_diff/imputation_results.py:81, diff_diff/two_stage_results.py:80, diff_diff/continuous_did_results.py:88, and diff_diff/efficient_did_results.py:54. SyntheticDiDResults exposes pre_treatment_fit, not pre_treatment_rmse (diff_diff/results.py:663); StackedDiDResults exposes stacked_data and n_sub_experiments, not sub_experiments (diff_diff/stacked_did_results.py:44); and the Efficient DiD pretest is an estimator method starting at diff_diff/efficient_did.py:1396, not a results attribute. Concrete fix: make snippets handler-specific and validate every referenced symbol against the target result dataclass or estimator API.

Performance

No findings in the changed code.

Maintainability

Severity: P2. Impact: tests/test_practitioner.py:118, tests/test_practitioner.py:143, and nearby fixtures build __new__ mocks with fields that real result classes do not expose. For example, the EfficientDiD fixture invents .att, .se, and .hausman_pretest, even though the actual contract is overall_att/overall_se plus a separate HausmanPretestResult type (diff_diff/efficient_did_results.py:21, diff_diff/efficient_did_results.py:54). The same mismatch exists for SunAbrahamResults, ImputationDiDResults, TwoStageDiDResults, ContinuousDiDResults, and StackedDiDResults. This locks in the wrong API shape and lets broken guidance pass tests. Concrete fix: replace these mocks with real estimator outputs where feasible, or with dataclass-faithful fixtures that use the actual public fields and required preconditions.

Tech Debt

Severity: P3. Impact: None of the practitioner-guide issues above are tracked in TODO.md or documented as a methodology deviation in docs/methodology/REGISTRY.md, so merging would create untracked debt. Concrete fix: only defer the doc-only cleanups if they are explicitly added to TODO.md; the P1 items still need direct fixes before merge.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new primary guide uses equivalence_test_trends(..., threshold=0.5) at docs/llms-practitioner.txt:91, but the actual API parameter is equivalence_margin at diff_diff/utils.py:919. Because docs/llms.txt:36 tells agents to “start here,” this broken example is immediately user-facing. Concrete fix: rename the argument to equivalence_margin= and add the new file to doc-snippet coverage.
Severity: P3. Impact: The estimator decision tree in docs/llms-practitioner.txt:115 and docs/llms-practitioner.txt:127 contains mojibake (��, └��─), which makes the new AI-facing guide look corrupted. Concrete fix: re-save the file as UTF-8 and add a lightweight doc lint or snapshot check for non-printing/corrupted characters.

Path to Approval

Realign the Baker workflow in docs/llms-practitioner.txt, docs/llms.txt, and diff_diff/practitioner.py: restore the paper’s step structure, or explicitly mark diff-diff-specific additions as deviations in docs/methodology/REGISTRY.md.
In diff_diff/practitioner.py, only emit HonestDiD guidance for result types actually supported by diff_diff/honest_did.py, and update the docs to match; for Callaway-Sant’Anna, tell users to refit with aggregate='event_study' or aggregate='all'.
Fix the generated snippets in diff_diff/practitioner.py: use the correct result fields (overall_att, pre_treatment_fit, stacked_data/n_sub_experiments) and the actual EfficientDiD pretest API.
Replace the contract-inaccurate mocks in tests/test_practitioner.py with real or faithful fixtures, add assertions for the EfficientDiD and Synthetic/Stacked handler paths, and add doc-snippet coverage for the new practitioner guide.

Validation note: this was a static review only. I could not run pytest here because pytest is not installed in the environment, and importing diff_diff also fails here because numpy is unavailable.

igerber · 2026-03-28T18:56:27Z

/ai-review

github-actions · 2026-03-28T19:05:42Z

🔁 AI review rerun (requested by @igerber)

Head SHA: f34254254dfba5f0295655f4813c83cab2fe75d9

Overall Assessment

⚠️ Needs changes

The previous HonestDiD-support blocker looks resolved, but one P1 remains: the new Baker workflow is still numbered inconsistently across the exported helper, the long/short AI guides, the registry note, and the new tests. Validation note: this was a static review only; I could not run pytest or import diff_diff here because pytest and numpy are unavailable in this environment.

Executive Summary

No estimator math, weighting, variance/SE, or default-behavior changes were introduced; the review surface here is the new Baker practitioner-guidance layer.
Previous P1 on unsupported HonestDiD guidance appears resolved: the new docs and handlers now limit HonestDiD to MultiPeriodDiD and CallawaySantAnna, with the required CS event-study aggregation caveat at docs/llms-practitioner.txt:245, diff_diff/practitioner.py:266, and diff_diff/honest_did.py:616.
P1: the Baker step mapping is still inconsistent across diff_diff/practitioner.py:139, docs/llms-practitioner.txt:55, docs/llms.txt:18, docs/methodology/REGISTRY.md:2004, and tests/test_practitioner.py:283.
P2: the Bacon handler checks for negative weights instead of the actual forbidden-comparison signal, so its warning path will not fire when later-vs-earlier contamination is present at diff_diff/practitioner.py:524.
P2: the EfficientDiD handler still shows a nonexistent run_pretest=True API at diff_diff/practitioner.py:435.

Methodology

Affected method(s): Baker et al. (2025) practitioner workflow only; no estimator implementation changed.

Severity: P1. Impact: The PR adds a registry note saying the diff-diff adaptation makes parallel-trends testing a separate Step 3 and folds “sources of uncertainty” into Step 5 at docs/methodology/REGISTRY.md:2004, and the short AI guide follows that numbering at docs/llms.txt:18. But the exported helper still emits parallel-trends guidance as baker_step=2 at diff_diff/practitioner.py:139, the long-form guide keeps testing inside Step 2 and still has a separate Step 4 at docs/llms-practitioner.txt:55 and docs/llms-practitioner.txt:149, the evaluation rubric still scores inference as Step 4 at docs/practitioner-guide-evaluation.md:24, and the new tests hard-code baker_step == 2 as the parallel-trends step at tests/test_practitioner.py:283. Baker et al. describe Step 2 as articulating identifying assumptions, Step 3 as determining the estimation method, Step 4 as discussing sources of uncertainty, and Step 8 as “Keep learning.” citeturn0view1turn0view0 Concrete fix: choose one canonical mapping and align diff_diff/practitioner.py:137, docs/llms-practitioner.txt:55, docs/llms.txt:18, docs/practitioner-guide-evaluation.md:24, and tests/test_practitioner.py:277 to that same workflow. If the intended adaptation is the registry/short-guide version, change the helper’s parallel-trends step to the documented Step 3 and rebaseline the tests.

Code Quality

Severity: P2. Impact: The Bacon handler says it should warn when “later vs earlier” comparisons contaminate TWFE, but the implementation looks for negative comparison weights at diff_diff/practitioner.py:524. In this library, forbidden comparisons are represented by comparison_type="later_vs_earlier" and summarized via total_weight_later_vs_earlier, not by negative weights, at diff_diff/bacon.py:34, diff_diff/bacon.py:84, and diff_diff/bacon.py:230. As written, the new warning path will miss the very contamination pattern it describes. Concrete fix: key the warning off results.total_weight_later_vs_earlier > 0 or comparison_type == "later_vs_earlier" instead of negative weights.
Severity: P2. Impact: The EfficientDiD guidance correctly notes that the Hausman pretest is not a results attribute, but its example then suggests edid.fit(..., run_pretest=True) at diff_diff/practitioner.py:435. EfficientDiD.fit() has no run_pretest parameter at diff_diff/efficient_did.py:294; the supported API is EfficientDiD.hausman_pretest(...) at diff_diff/efficient_did.py:1396. Concrete fix: replace the snippet with the actual hausman_pretest(...) call and keep the comment that it is separate from the results object.

Performance

No findings.

Maintainability

No separate findings beyond the cross-file workflow drift noted above.

Tech Debt

No separate findings.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new tests ratify the current numbering bug instead of catching it at tests/test_practitioner.py:283, and they do not assert either the Bacon warning behavior or the EfficientDiD pretest guidance paths. Separately, the snippet smoke test only scans .rst files at tests/test_doc_snippets.py:25, so the new AI-facing .txt guides are outside automated snippet validation. Concrete fix: after deciding the canonical Baker-step mapping, update the practitioner tests to assert the correct step numbers, add handler-specific assertions for Bacon warnings and hausman_pretest, and either extend snippet coverage to the .txt AI guides or move executable examples into validated .rst docs.

Path to Approval

Choose a single canonical Baker-step mapping and align practitioner_next_steps(), both LLM guides, the registry note/evaluation rubric, and tests/test_practitioner.py to that same numbering. If the intended adaptation is the one documented in the registry, change the helper’s parallel-trends step from baker_step=2 to the documented Step 3 and update the surrounding headings/tests to match.

igerber · 2026-03-28T19:11:42Z

/ai-review

github-actions · 2026-03-28T19:21:39Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 50bc3fcd913bc85219642316979f4c0d5070eb82

Overall Assessment

⚠️ Needs changes

This re-review is down to one remaining blocker. No estimator math, weighting, variance/SE, or default-behavior logic changed in this diff, and the prior Bacon-warning / EfficientDiD API issues appear resolved. The remaining P1 is that the new Baker guardrail artifacts are still not internally consistent: the main guide and runtime helper now use the adapted Step 3/4/5 mapping, but the complete example and evaluation rubric still teach/score the old sequence.

Executive Summary

Review surface is the new practitioner-guidance layer; no estimator implementation changed.
Re-review status: the prior Bacon-warning and EfficientDiD Hausman-snippet issues appear resolved.
P1: the long-form guide’s complete example still labels Step 3 as estimator selection and Step 4 as inference, and it never performs the explicit Step 3 parallel-trends test.
P2: _check_nan_att() misses MultiPeriodDiDResults.avg_att, so NaN multi-period results do not trigger the new warning path.
P2: the new .txt AI guides are outside the doc-snippet smoke-test harness.

Methodology

Affected method(s): Baker et al. (2025) practitioner workflow only; no estimator implementation changed.

Severity: P1. Impact: docs/methodology/REGISTRY.md:2004 and diff_diff/practitioner.py:137 establish the adapted canonical numbering as Step 3 = test parallel trends, Step 4 = choose estimator, Step 5 = estimate. But the shipped “Complete Example” still labels Step 3 as estimator selection and Step 4 as inference at docs/llms-practitioner.txt:392 and docs/llms-practitioner.txt:399, the Step 6 Bacon note still cross-references “Step 3” at docs/llms-practitioner.txt:258, and the new evaluation rubric still scores S3/S4/S5 using the old mapping at docs/practitioner-guide-evaluation.md:24. Because the example never executes the explicit Step 3 pre-trends check, agents following the example can still skip the diagnostic that the registry separated into its own required step. Baker et al.’s paper keeps Step 3 as estimation method, Step 4 as sources of uncertainty, Step 5 as estimate, and Step 8 as “keep learning.” (ar5iv.org) Concrete fix: align the complete example, Step 6 Bacon cross-reference, and evaluation rubric to the canonical mapping, and add an explicit Step 3 check_parallel_trends() / equivalence_test_trends() block before Bacon decomposition and estimation.

Code Quality

Severity: P2. Impact: _check_nan_att() only inspects .att and .overall_att at diff_diff/practitioner.py:592, but MultiPeriodDiDResults stores its aggregate effect as .avg_att at diff_diff/results.py:302. A NaN multi-period/event-study average therefore skips the module’s new warning path for one of the advertised supported result types. Concrete fix: check avg_att as well, or centralize a small ordered list of supported effect field names such as ("att", "overall_att", "avg_att").

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. I did not find a TODO.md entry that mitigates the issues above.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the snippet smoke-test harness still only enumerates .rst files at tests/test_doc_snippets.py:25, so none of the new AI-facing docs/llms.txt, docs/llms-full.txt, or docs/llms-practitioner.txt code blocks are exercised in CI. The added practitioner tests also cover NaN warnings only for DiDResults at tests/test_practitioner.py:329, not the MultiPeriodDiDResults path above. Concrete fix: extend the snippet harness to the new .txt guides or move their runnable examples into validated .rst pages, and add a MultiPeriodDiDResults(avg_att=np.nan, ...) regression test.

Validation note: this was a static review only. I could not execute pytest or import diff_diff in this environment because pytest and numpy are unavailable.

Path to Approval

Update the remaining step-numbering drift in docs/llms-practitioner.txt:258, docs/llms-practitioner.txt:392, docs/llms-practitioner.txt:399, and docs/practitioner-guide-evaluation.md:24 so they match the canonical Step 3/4/5 mapping in docs/methodology/REGISTRY.md:2004.
Make the “Complete Example” actually perform Step 3 by adding a pre-trends block (check_parallel_trends() and/or equivalence_test_trends()) before the Bacon decomposition and cs.fit(...) call.

igerber · 2026-03-28T19:29:12Z

/ai-review

github-actions · 2026-03-28T19:38:15Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6ca617ea1f269a0a19df15907b85737eccda1851

Overall Assessment

⚠️ Needs changes

Executive Summary

No estimator math, weighting, variance/SE, identification assumptions, or defaults changed in this diff. The affected methodology surface is the new Baker-practitioner guidance layer.
Severity P1: the flagship load_mpdta() Step 3 example is still broken. docs/llms-practitioner.txt:394 passes first_treat into check_parallel_trends(), but the helper splits groups on == 1 and == 0 diff_diff/utils.py:651, diff_diff/utils.py:652; on mpdta, first_treat is a cohort-year field and the binary ever-treated column is treat diff_diff/datasets.py:654, diff_diff/datasets.py:655. That leaves the treated subset empty, and compute_trend() calls time_values.min() before its n < 2 guard diff_diff/utils.py:660, diff_diff/utils.py:664.
Previous re-review items are mostly resolved: the avg_att NaN-warning gap is fixed in diff_diff/practitioner.py:592 and covered in tests/test_practitioner.py:344; the EfficientDiD Hausman snippet now uses hausman_pretest in diff_diff/practitioner.py:438 and is checked in tests/test_practitioner.py:387; the .txt doc-smoke gap is now tracked in TODO.md:82.
Severity P2: the new evaluation doc still mixes old and new Baker step labels, so its score tables/narrative do not use the same canonical numbering as docs/methodology/REGISTRY.md:2004.
Static review only: I could not run pytest here because numpy and pytest are unavailable in this environment.

Methodology

Severity: P1. Impact: the new required Step 3 guardrail is still not executable in the shipped complete example. docs/llms-practitioner.txt:394 uses treatment_group='first_treat', but check_parallel_trends() expects a binary indicator split into == 1 and == 0 groups diff_diff/utils.py:632, diff_diff/utils.py:651. On load_mpdta(), first_treat takes values 0/2004/2006/2007 and the dataset years are only 2003-2007 diff_diff/datasets.py:651, diff_diff/datasets.py:714. So the treated subset is empty, and even a naive switch to treat would still leave only one universal pre-period before the 2004 cohort, while the helper needs at least two points to form a slope diff_diff/utils.py:660, diff_diff/utils.py:663. Concrete fix: rewrite the example to use a valid binary comparison with an explicit multi-period pre-window on a restricted later-treated subsample, or change the example dataset/helper so Step 3 can actually run before estimation.
Severity: P2. Impact: [Previous finding partially addressed] the core guide now matches the canonical remap in docs/methodology/REGISTRY.md:2004, but the evaluation writeup still labels S3/S4/S5 as “Estimator choice / Inference / Estimation” in docs/practitioner-guide-evaluation.md:54, docs/practitioner-guide-evaluation.md:55, docs/practitioner-guide-evaluation.md:56, and repeats that framing in the narrative at docs/practitioner-guide-evaluation.md:80 and docs/practitioner-guide-evaluation.md:104. That leaves the scorecard semantically inconsistent with the shipped workflow. Concrete fix: relabel the comparison/progression tables and surrounding narrative to the canonical S3 = Test parallel trends, S4 = Choose estimator, S5 = Estimate.

Code Quality

No unmitigated findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3 (tracked in TODO.md). Impact: the prior .txt doc-snippet smoke-test gap is now explicitly tracked at TODO.md:82, so under the review policy it is mitigated and not blocking. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

No unmitigated findings beyond the tracked .txt smoke-test item above.
Validation note: this was a static review only; I could not execute the new tests in this environment.

Path to Approval

Replace the load_mpdta() Step 3 block with a pre-trends diagnostic that is actually valid for the chosen data/helper combination. Concretely, either restrict to later-treated vs never-treated units and pass a binary treatment indicator plus explicit multi-period pre-periods, or switch the complete example to a dataset/workflow where check_parallel_trends() can run before estimation.

igerber · 2026-03-28T20:09:54Z

/ai-review

github-actions · 2026-03-28T20:23:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 528c2410bb8bfa0eee404235715a33bd2c58f93f

Overall Assessment

⚠️ Needs changes

Executive Summary

No estimator math, weighting, variance/SE, identification assumptions, or defaults changed in executable estimators; the review surface is the new Baker workflow docs plus practitioner_next_steps().
P1: the previous Step 3 issue is not fully resolved. The load_mpdta() example no longer crashes, but it still instructs a contaminated pre-trends test for staggered adoption, and the same generic pattern is emitted by practitioner_next_steps() for staggered result types.
The Baker-step renumbering, the cluster-count directive, and the mandatory with/without-covariates convention are explicitly documented in the Methodology Registry, so those deviations are non-blocking under the review policy.
P2: the flagship “Complete Example” still omits the guide’s own REQUIRED with/without-covariates comparison.
P2: the evaluation writeup changes model mix across the before/after arms, so its “documentation alone” causal claim is not isolated.
Static review only: I could not run imports/tests here because numpy and pytest are unavailable in this environment.

Methodology

P1 [Previous finding not fully resolved] Impact: the new Step 3 guidance is still methodologically wrong for staggered load_mpdta() analyses. The example now uses treat instead of first_treat, which fixes the old empty-treated-path bug, but check_parallel_trends() explicitly assumes the supplied pre_periods are genuinely pre-treatment and splits a binary 0/1 group. In load_mpdta(), treatment cohorts include 2004, so the example’s pre_periods=[2003, 2004, 2005] include post-treatment observations for the 2004 cohort. That contaminates the treated trend instead of testing a valid pre-trend. The same generic binary-group snippet is reused for staggered result handlers in practitioner_next_steps(), so the runtime guardrail is not actually context-aware for staggered designs. Concrete fix: replace the staggered Step 3 guidance in both the docs and practitioner.py with a valid staggered-design diagnostic, e.g. pre-treatment event-study / ATT(g,t) checks or a later-treated-vs-never-treated restricted sample whose selected years are truly untreated for the “treated” group. Refs: docs/llms-practitioner.txt:392, diff_diff/practitioner.py:137, diff_diff/practitioner.py:266, diff_diff/utils.py:619, diff_diff/datasets.py:651, diff_diff/datasets.py:714, docs/tutorials/02_staggered_did.ipynb:464.
P3 Impact: the PR’s Baker-step renumbering and the Step 5 / Step 8 conventions are documented in the Methodology Registry, so they are mitigated and non-defects under the stated policy. Concrete fix: none required. Refs: docs/methodology/REGISTRY.md:2000.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

P3 Impact: the .txt doc-snippet smoke-test gap is now explicitly tracked in TODO.md, so this is mitigated and not blocking. Concrete fix: none required for approval. Refs: TODO.md:82.

Security

No findings.

Documentation/Tests

P2 Impact: the guide says Step 8 “Report with and without covariates” is REQUIRED, but the flagship “Complete Example” stops after comparing CS/SA/BJS and never performs the no-covariate re-fit. That leaves the end-to-end example inconsistent with the workflow it tells agents to follow. Concrete fix: extend the complete example with a covariates=None fit and print the with/without comparison, or relax the Step 8 wording if it is not meant to be mandatory. Refs: docs/llms-practitioner.txt:311, docs/llms-practitioner.txt:427.
P2 Impact: the empirical evaluation is framed as a before/after documentation effect, but the two arms use different model mixes (1 Opus + 9 Sonnet before vs 10 Sonnet after). That is a confound, so the claimed gain cannot be attributed to documentation changes alone from the evidence shown here. Concrete fix: rerun with identical model allocation across arms, or report model-stratified results and tone down the causal phrasing. Refs: docs/practitioner-guide-evaluation.md:13, docs/practitioner-guide-evaluation.md:142.
P3 [Previous finding largely resolved] Impact: the major step-label mismatch in the evaluation doc is mostly fixed, but two residual sentences still call the cluster-count change “Step 4” even though the canonical numbering in the Registry and the rest of the doc use Step 5. Concrete fix: relabel those two references from Step 4 to Step 5. Refs: docs/practitioner-guide-evaluation.md:111, docs/practitioner-guide-evaluation.md:153, docs/methodology/REGISTRY.md:2004.

Path to Approval

Replace the staggered Step 3 guidance in both docs/llms-practitioner.txt and diff_diff/practitioner.py with a valid pre-trends diagnostic for staggered adoption that does not treat already-treated 2004-cohort observations as pre-treatment.

igerber · 2026-03-28T20:29:18Z

/ai-review

github-actions · 2026-03-28T20:36:05Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 158f72526543f4376fc4f0c85b539b3842b98e18

Overall Assessment

⚠️ Needs changes

Executive Summary

No executable estimator math, weighting, variance/SE formulas, identification checks, or defaults changed; the review surface is the new Baker workflow docs plus practitioner_next_steps().
P1 [Previous finding not fully resolved]: the staggered Step 3 fix is only partial. The detailed guide now correctly rejects generic check_parallel_trends() for staggered adoption, but the new discovery header in docs/llms.txt:L20 and the evaluation rubric in docs/practitioner-guide-evaluation.md:L30 still instruct/reward that invalid diagnostic for the staggered load_mpdta() task.
The Baker-step renumbering, the cluster-count directive, and the mandatory with/without-covariates convention are explicitly documented in docs/methodology/REGISTRY.md:L2004, so those deviations are non-blocking.
The prior full-example covariate-comparison gap, the step-label mismatch, and the .txt smoke-test tracking issue are resolved or mitigated.
Static review only: I could not run imports/tests here because numpy and pytest are unavailable in this environment.

Methodology

The cited Baker paper is framed as an organizing framework for DiD designs and estimators rather than a single estimator recipe, so consistency across the short guide, detailed guide, and evaluation rubric is the key methodology check for this PR. (ideas.repec.org)

Severity: P1 [Previous finding not fully resolved]. Impact: the detailed practitioner guide now correctly says generic check_parallel_trends() is invalid for staggered adoption because it assumes a single binary treatment with universal pre-periods, as reflected in docs/llms-practitioner.txt:L105 and the underlying helper in diff_diff/utils.py:L611. But the new top-level workflow header still lists check_parallel_trends() / equivalence_test_trends() as Step 3 with no staggered carve-out in docs/llms.txt:L20, and the evaluation doc still gives full Step-3 credit for those calls and says the “after” agents consistently ran them in docs/practitioner-guide-evaluation.md:L30 and docs/practitioner-guide-evaluation.md:L96. That means the PR’s discovery entrypoint and supporting validation still reward exactly the staggered diagnostic the detailed guide now rejects. Concrete fix: split Step 3 in docs/llms.txt into simple-2x2 vs staggered branches, and update the evaluation rubric/claims to score the staggered-safe diagnostic instead of generic check_parallel_trends() / equivalence_test_trends(). If the reported 16/16 runs were scored under the old rubric, rerun or explicitly soften the conclusion.
Severity: P3. Impact: the step renumbering, Step-5 cluster-count rule, and Step-8 mandatory with/without-covariates comparison are explicitly documented deviations in docs/methodology/REGISTRY.md:L2004, docs/methodology/REGISTRY.md:L2010, and docs/methodology/REGISTRY.md:L2013, so they are mitigated non-defects under the review policy. Concrete fix: none.

Code Quality

Severity: P2. Impact: practitioner_next_steps() is advertised as context-aware, but the shared staggered Step-3 snippet in [_parallel_trends_step()at diff_diff/practitioner.py:L137](/home/runner/work/diff-diff/diff-diff/diff_diff/practitioner.py#L137) tells users to “Fit withaggregate='event_study'or'all'” for every staggered handler. That is invalid for SunAbraham.fit()because it has noaggregateargument in [diff_diff/sun_abraham.py:L452](/home/runner/work/diff-diff/diff-diff/diff_diff/sun_abraham.py#L452), and partially invalid forStackedDiD.fit()becauseaggregate='all'` is rejected in diff_diff/stacked_did.py:L169 and diff_diff/stacked_did.py:L214. This is not an estimator-methodology bug, but it does make the new runtime guidance less than truly estimator-specific. Concrete fix: parameterize staggered Step-3 guidance by estimator family and add tests asserting that each handler’s emitted snippet only mentions supported fit parameters.

Performance

No findings.

Maintainability

No findings beyond the Code Quality issue above.

Tech Debt

Severity: P3. Impact: the .txt doc-snippet smoke-test gap is now explicitly tracked in TODO.md:L82, so it is mitigated and non-blocking under the stated policy. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

Severity: P2 [Previous finding not resolved]. Impact: the evaluation writeup still changes model mix across arms in docs/practitioner-guide-evaluation.md:L19 while concluding “Documentation alone” drove the improvement in docs/practitioner-guide-evaluation.md:L156. The note that the confound is “minor” does not remove the identification problem: the allocation is still unmatched across arms, so the causal attribution remains unisolated. Concrete fix: rerun with identical model allocation across conditions, or report model-stratified results and soften the causal language.
Verification note: static review only; numpy and pytest are unavailable here, so I could not run imports or tests/test_practitioner.py.

Path to Approval

Update docs/llms.txt so Step 3 explicitly distinguishes simple 2x2 designs from staggered adoption, and do not present check_parallel_trends() / equivalence_test_trends() as the staggered default.
Update docs/practitioner-guide-evaluation.md so the Step-3 rubric and “after agents” claims use the corrected staggered-safe diagnostic; if needed, rerun or reframe the 16/16 result under that corrected rubric.

Ensure AI agents follow the 8-step empiricist workflow from Baker et al. (2025) "Difference-in-Differences Designs: A Practitioner's Guide" when using diff-diff, rather than just calling fit() without diagnostics. New files: - docs/llms-practitioner.txt: Full 8-step workflow mapped to diff-diff API - diff_diff/practitioner.py: practitioner_next_steps() runtime guidance - tests/test_practitioner.py: 26 tests covering all 13 result types - docs/practitioner-guide-evaluation.md: Before/after empirical comparison Empirical validation (30 agent runs): analysis quality improved from 9.4/16 to 16.0/16 (+70%) with zero variance, driven by agents now running HonestDiD sensitivity analysis, formal parallel trends testing, 3-estimator robustness comparisons, and with/without covariate reporting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P1 fixes: - Relabel workflow as "based on Baker et al. 2025" (not a 1:1 mapping) - Add REGISTRY.md deviation notes for reorganized steps and diff-diff conventions - Gate HonestDiD guidance to MultiPeriodDiD and CallawaySantAnna only (the only types supported by compute_honest_did) - Add aggregate='event_study' requirement for CS + HonestDiD P2 fixes: - Fix attribute names: pre_treatment_rmse -> pre_treatment_fit, sub_experiments -> n_sub_experiments/stacked_data - Fix EfficientDiD Hausman pretest (estimator method, not results attr) - Fix equivalence_test_trends param: threshold -> equivalence_margin - Fix _covariates_step snippet to note .att vs .overall_att difference - Fix test mocks to use correct attributes (overall_att/overall_se for staggered result types) P3 fix: - Replace Unicode box-drawing chars with ASCII in decision tree Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P1: Align all files to canonical 8-step numbering from llms.txt: 1-Define, 2-Assumptions, 3-Test PT, 4-Choose estimator, 5-Estimate (with cluster check), 6-Sensitivity, 7-Heterogeneity, 8-Robustness. Moved PT testing from Step 2 into separate Step 3 in llms-practitioner.txt, folded uncertainty into Step 5, updated baker_step values in practitioner.py (PT: 2->3, estimator: 3->4). P2: Bacon handler now checks total_weight_later_vs_earlier > 0.01 instead of negative weights (matches actual BaconDecompositionResults API). EfficientDiD snippet uses actual hausman_pretest() classmethod instead of nonexistent run_pretest=True parameter. Tests: Updated step number assertions, added Bacon warning tests (2) and EfficientDiD handler tests (2). Suite now 30 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P1: Align complete example and Bacon cross-reference to canonical numbering (Step 3=Test PT, Step 4=Choose estimator, Step 5=Estimate). Complete example now executes check_parallel_trends() and prints cluster count before estimation. Evaluation rubric S3-S5 labels updated to match. P2: _check_nan_att() now checks .avg_att (MultiPeriodDiDResults) in addition to .att and .overall_att. Added regression test. P2: Added TODO.md entry for extending snippet smoke tests to .txt AI guides (deferred, low priority). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P1: Complete example Step 3 now uses treatment_group='treat' (binary ever-treated indicator) instead of 'first_treat' (cohort year), with explicit pre_periods=[2003, 2004, 2005]. The previous version would produce an empty treated subset since check_parallel_trends() splits on == 1 / == 0. P2: Evaluation doc comparison tables and narrative now use canonical step labels (S3=Test PT, S4=Choose estimator, S5=Estimate) matching REGISTRY.md and the practitioner guide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…mple P1: For staggered designs, replace generic check_parallel_trends() with CS event-study pre-period inspection. The generic helper assumes a single binary treatment with universal pre-periods, which is invalid when cohorts adopt at different times (early-cohort post-treatment observations contaminate the "pre-trend"). The guide and practitioner_next_steps() now emit staggered-specific guidance using event_study_effects pre-periods. P2: Complete example now includes REQUIRED with/without covariates comparison (was missing despite Step 8 marking it mandatory). P2: Evaluation doc model-mix caveat added, residual "Step 4" references changed to "Step 5" to match canonical numbering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P1: llms.txt Step 3 now distinguishes simple 2x2 (check_parallel_trends) from staggered (CS event-study pre-periods). Staggered Step 3 snippet in practitioner.py no longer references estimator-specific aggregate parameter that SA/Stacked don't support. P2: Evaluation rubric S3 updated to score staggered-safe diagnostic. Softened causal language re: model-mix confound. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T20:44:00Z

/ai-review

github-actions · 2026-03-28T20:53:03Z

🔁 AI review rerun (requested by @igerber)

Head SHA: bd8e565013c7f4064a7b01f4d88ee5c1d096ce09

Overall Assessment

⚠️ Needs changes

Executive Summary

The previous staggered Step 3 blocker is materially resolved: the discovery guide, practitioner guide, and scoring rubric now distinguish simple 2x2 pre-trends tests from staggered event-study pre-period inspection in docs/llms.txt:L20, docs/llms-practitioner.txt:L105, and docs/practitioner-guide-evaluation.md:L30.
P1: practitioner_next_steps() still emits the simple 2x2 check_parallel_trends() workflow for ContinuousDiDResults and TripleDifferenceResults, even though those estimators have different identifying assumptions in docs/methodology/REGISTRY.md:L444 and diff_diff/triple_diff.py:L383.
P1: the new Step 6 placebo guidance points staggered and advanced users to run_all_placebo_tests(), but that helper is implemented by refitting the basic DifferenceInDifferences estimator with binary treatment and time inputs rather than the staggered estimators the new handlers/docs target.
The Baker-step renumbering, the 50-cluster rule, and the mandatory with/without-covariates convention are explicitly documented in the registry, so those deviations are mitigated non-defects.
Static review only: I could not execute imports or tests/test_practitioner.py here because this environment is missing numpy and pytest.

Methodology

Severity: P1. Impact: practitioner_next_steps() routes ContinuousDiDResults and TripleDifferenceResults through the generic _parallel_trends_step() in diff_diff/practitioner.py:L157, diff_diff/practitioner.py:L469, and diff_diff/practitioner.py:L502, which emits check_parallel_trends(data, ..., treatment_group='treated') from the simple treated-vs-control helper in diff_diff/utils.py:L618. That does not match the dose-specific PT/SPT assumptions documented for ContinuousDiD in docs/methodology/REGISTRY.md:L444 or the DDD differential-trend assumption documented in diff_diff/triple_diff.py:L383. Users can therefore “complete” Baker Step 3 with a diagnostic that is not testing the estimator’s actual identifying assumption. Concrete fix: add estimator-specific Step 3 handlers for ContinuousDiDResults and TripleDifferenceResults, or explicitly state that no built-in formal test exists and point users to the closest supported diagnostic/caveat instead of emitting check_parallel_trends().
Severity: P1. Impact: the generic _placebo_step() in diff_diff/practitioner.py:L191 is attached to staggered/advanced handlers such as diff_diff/practitioner.py:L328 and diff_diff/practitioner.py:L443, and the docs advertise the same call as a generic Step 6 tool in docs/llms.txt:L23 and docs/llms-practitioner.txt:L260. But run_all_placebo_tests() takes treatment and time rather than first_treat in diff_diff/diagnostics.py:L766, and its core routines refit DifferenceInDifferences in diff_diff/diagnostics.py:L415 and diff_diff/diagnostics.py:L566, whose contract is a binary treatment-group indicator and a binary post indicator in diff_diff/estimators.py:L168 and diff_diff/estimators.py:L223. For staggered adoption this is not the falsification exercise the helper/docs claim to be recommending. Concrete fix: restrict this placebo guidance to simple 2x2 DiD, or add a staggered-aware placebo API keyed off cohort timing and update the helper/docs to reference that instead.
Severity: P3. Impact: the step renumbering, cluster-count inference rule, and mandatory with/without-covariate comparison are explicitly documented as diff-diff conventions in docs/methodology/REGISTRY.md:L2120, docs/methodology/REGISTRY.md:L2126, and docs/methodology/REGISTRY.md:L2129, so they are mitigated non-defects under the stated policy. Concrete fix: none.

Code Quality

Severity: P2. Impact: _check_nan_att() only warns when the ATT is a Python float NaN in diff_diff/practitioner.py:L611 and diff_diff/practitioner.py:L619. Surrounding estimator code often produces NumPy scalars from array aggregation, for example overall ATT in diff_diff/staggered_aggregation.py:L138, so real NaN results can skip the warning path. The new tests only cover float("nan") inputs in tests/test_practitioner.py:L329. Concrete fix: coerce numeric scalars robustly before testing (float(att) in a guarded block, then np.isfinite/np.isnan) and add coverage for np.float64(np.nan) and np.inf.

Performance

No findings.

Maintainability

No findings beyond the methodology and code-quality issues above.

Tech Debt

Severity: P3. Impact: the .txt doc-snippet smoke-test gap is explicitly tracked in TODO.md:L83, so it is mitigated and non-blocking under the review policy. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: tests/test_practitioner.py validates dispatch, schema, and a few label/code fragments in tests/test_practitioner.py:L211, tests/test_practitioner.py:L260, and tests/test_practitioner.py:L381, but it never asserts that each handler emits methodology-valid guidance for its estimator family. That is why the invalid Step 3 and Step 6 guidance passes the new suite. Concrete fix: add content tests that fail if ContinuousDiDResults or TripleDifferenceResults emit check_parallel_trends(), and if staggered handlers emit run_all_placebo_tests() without a staggered-aware wrapper.
Severity: P3. Impact: the formal evaluation rubric is now corrected, but the qualitative “After agents” summary still says the fixed staggered load_mpdta() task used check_parallel_trends or CS event-study pre-periods in docs/practitioner-guide-evaluation.md:L96. That is no longer a blocker, but it is still imprecise for a single staggered task. Concrete fix: tighten the sentence to the staggered-safe diagnostic actually intended for this task.
Verification note: static review only; numpy and pytest are unavailable in this environment, so I could not run tests/test_practitioner.py.

Path to Approval

Replace the generic Step 3 guidance for ContinuousDiDResults and TripleDifferenceResults with estimator-specific diagnostics or explicit “no built-in formal test” caveats that match their actual identifying assumptions.
Remove or re-scope run_all_placebo_tests() from the staggered/advanced handlers and the generic practitioner docs; if placebo guidance should remain for staggered designs, add a staggered-aware API that accepts cohort timing (first_treat) and targets the relevant estimator family.
Add regression tests that fail on both issues above, so the helper cannot reintroduce 2x2 pre-trends or 2x2 placebo guidance into non-2x2 estimator handlers.

P1: ContinuousDiD and TripleDifference handlers now emit honest "no built-in formal test" guidance instead of the invalid generic check_parallel_trends() which assumes a single binary treatment. P1: Staggered handlers now use _placebo_step(staggered=True) which recommends specification-based falsification (control group/anticipation/ subsample comparisons) instead of run_all_placebo_tests() (which refits a basic 2x2 DiD and is invalid for staggered designs). P2: _check_nan_att() now coerces to float() before testing, handling numpy scalars (np.float64(nan)) in addition to Python float("nan"). Docs updated: llms.txt Step 6 and llms-practitioner.txt placebo section now note that run_all_placebo_tests is for simple 2x2 only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T21:02:07Z

/ai-review

github-actions · 2026-03-28T21:11:29Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 51bcf369f763f634eb948cadc52af8db688bb22b

Overall Assessment

⚠️ Needs changes

Executive Summary

The previous re-review blockers are materially resolved: the practitioner helper and the main AI-facing docs now scope check_parallel_trends() to simple 2x2 designs and scope run_all_placebo_tests() away from staggered designs in diff_diff/practitioner.py:141, diff_diff/practitioner.py:195, docs/llms-practitioner.txt:105, and docs/llms-practitioner.txt:260.
P1: TripleDifferenceResults guidance still misstates the identifying assumption by telling users DDD requires separate parallel-trends checks “along two dimensions,” which conflicts with the library’s own DDD documentation and the cited Baker appendix.
P1: SyntheticDiDResults and TROPResults still inherit a staggered-adoption falsification template that tells users to vary control_group, anticipation, and cohorts, even though those are not the relevant APIs for SDID or TROP.
P2: tests/test_practitioner.py still validates dispatch/schema only; it would not fail on either of the methodology regressions above.
The renumbered Baker steps, the 50-cluster inference rule, and the mandatory with/without-covariates convention are explicitly documented in REGISTRY.md, so those deviations are mitigated non-defects.
Static review only: this environment is missing numpy and pytest, so I could not execute imports or run tests/test_practitioner.py.

Methodology

Severity: P1. Impact: _handle_triple() still tells users that DDD “requires parallel trends along two dimensions” and to inspect treatment and eligibility trends separately in diff_diff/practitioner.py:540 and diff_diff/practitioner.py:557. That conflicts with the library’s own DDD contract in diff_diff/triple_diff.py:383, which says DDD is weaker than requiring separate parallel trends for two DiDs. Baker et al.’s appendix says the same thing: DDD allows group-specific and partition-specific PT violations and generally cannot be expressed as “two DiDs.” Concrete fix: rewrite the TripleDifference handler to state the documented DDD identifying assumption directly, and replace the separate-dimension PT / placebo language with either a DDD-specific diagnostic or an explicit “no built-in formal test” caveat. (ar5iv.org)
Severity: P1. Impact: SyntheticDiDResults and TROPResults both reuse _placebo_step(staggered=True) via diff_diff/practitioner.py:419 and diff_diff/practitioner.py:456, so practitioner_next_steps() tells those users to vary control_group, set anticipation=1, and drop cohorts one-by-one from the template defined in diff_diff/practitioner.py:195. Those are staggered-adoption knobs; they are not the methodological contract of SDID or TROP as documented in docs/methodology/REGISTRY.md:1044, diff_diff/synthetic_did.py:34, diff_diff/synthetic_did.py:183, diff_diff/trop.py:47, and diff_diff/trop.py:391. Concrete fix: give SDID and TROP estimator-specific Step 6 handlers, or at minimum remove control_group/anticipation/cohort language and replace it with checks their actual APIs support.
Severity: P3. Impact: none. The step renumbering, explicit cluster-count rule, and mandatory with/without-covariates reporting are all documented deviations in docs/methodology/REGISTRY.md:2120, so they are mitigated and should not block approval. Concrete fix: none. (ar5iv.org)

Code Quality

No findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No additional findings.

Tech Debt

Severity: P3. Impact: .txt AI-guide smoke tests are still outside CI, but the PR correctly tracks that limitation in TODO.md:81, so it is mitigated and non-blocking. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new tests only assert that each handler returns some output; they never assert the methodology-critical content. The SDID, TROP, and DDD tests in tests/test_practitioner.py:251, tests/test_practitioner.py:256, and tests/test_practitioner.py:268 would all still pass with the current P1 guidance bugs. Concrete fix: add negative content assertions that DDD guidance does not describe separate PT requirements, and that SDID/TROP guidance does not mention control_group, anticipation, or staggered-cohort leave-one-out logic.
Severity: P3. Impact: the qualitative evaluation summary still says the “after” agents on the staggered load_mpdta() task ran run_all_placebo_tests() and calls CS pre-period inspection a “formal” diagnostic in docs/practitioner-guide-evaluation.md:96 and docs/practitioner-guide-evaluation.md:98. That no longer matches the corrected practitioner guide and overstates what the staggered-safe workflow is. Concrete fix: tighten the summary to the actual staggered-safe Step 3 and Step 6 actions being recommended/scored.

Path to Approval

Rewrite the TripleDifferenceResults guidance in diff_diff/practitioner.py so Step 3 and Step 7 match the documented DDD assumption in diff_diff/triple_diff.py, rather than describing DDD as two separate PT checks.
Split out SyntheticDiDResults and TROPResults Step 6 handlers in diff_diff/practitioner.py so they no longer emit control_group, anticipation, or cohort-drop guidance copied from staggered estimators.
Add regression tests in tests/test_practitioner.py that fail if DDD guidance says “parallel trends along two dimensions,” or if SDID/TROP guidance mentions control_group, anticipation, or run_all_placebo_tests().

Verification note: static review only; numpy and pytest are unavailable in this environment, so I could not run the new test file or import the package.

Eliminate reuse of staggered templates for non-staggered estimators: - SyntheticDiD/TROP: replace staggered _placebo_step with estimator- specific in-time/in-space placebo guidance (no control_group, anticipation, or cohort-dropping language) - TripleDifference: rewrite to state the actual DDD identifying assumption (weaker than separate PT for two DiDs), remove the "requires PT along two dimensions" mistatement - ContinuousDiD: already fixed (dose-specific guidance) Tests: add negative content assertions ensuring SDiD/TROP don't mention control_group/anticipation, CDiD doesn't emit check_parallel_trends, and DDD doesn't claim "two dimensions". Evaluation doc: tighten qualitative summary to match corrected staggered-safe diagnostics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T21:17:47Z

/ai-review

github-actions · 2026-03-28T21:24:08Z

🔁 AI review rerun (requested by @igerber)

Head SHA: bea06a59c6d4e8439573083a9b10853f24d08325

Overall Assessment

⚠️ Needs changes

Executive Summary

The previous methodology blockers are resolved: TripleDifference guidance now matches the library’s weaker DDD identifying assumption, and SyntheticDiD/TROP no longer inherit the staggered control_group/anticipation template in diff_diff/practitioner.py:L402-L489 and diff_diff/practitioner.py:L566-L597, with matching negative assertions in tests/test_practitioner.py:L251-L287.
Severity: P1. Impact: [Newly identified] the shared staggered Step 6 template still assumes the CS/SA control-group API and is reused for ImputationDiD, TwoStageDiD, StackedDiD, and EfficientDiD, where that guidance is invalid or uses the wrong parameter names. See diff_diff/practitioner.py:L191-L208, diff_diff/practitioner.py:L359-L399, diff_diff/practitioner.py:L493-L516, and docs/llms-practitioner.txt:L261-L264.
Severity: P2. Impact: the new tests still would not catch that estimator-contract bug, because those handlers only assert “some guidance exists” in tests/test_practitioner.py:L239-L270.
Severity: P3. Impact: the evaluation write-up still says the staggered load_mpdta() “after” runs used placebo tests, which no longer matches the corrected guide, in docs/practitioner-guide-evaluation.md:L59-L67.
Static review only: numpy is unavailable in this environment, so I could not import diff_diff or run tests/test_practitioner.py.

Methodology

Severity: P1. Impact: [Newly identified] the generic staggered falsification text says to vary control_group='never_treated' vs 'not_yet_treated' and reuse that advice across handlers, but that only fits CS/SA-style APIs. ImputationDiD and TwoStageDiD do not expose any control_group parameter at all in diff_diff/imputation.py:L126-L168 and diff_diff/two_stage.py:L132-L164; StackedDiD uses clean_control instead in diff_diff/stacked_did.py:L121-L157 and the registry’s clean-control contract in docs/methodology/REGISTRY.md:L996-L1015; EfficientDiD only allows never_treated or last_cohort, explicitly distinguished from not_yet_treated, in diff_diff/efficient_did.py:L197-L237 and docs/methodology/REGISTRY.md:L685-L687. As written, practitioner_next_steps() and the AI guide can send users toward runtime errors or invalid robustness exercises. Concrete fix: replace the single _placebo_step(staggered=True) template with estimator-specific Step 6 guidance. At minimum: remove control_group advice from Imputation/TwoStage; use clean_control for Stacked; use never_treated vs last_cohort for EfficientDiD; keep never_treated vs not_yet_treated only where the estimator actually supports it.
Severity: P3. Impact: none. The step renumbering, explicit cluster-count rule, and mandatory with/without-covariates reporting are all documented deviations in docs/methodology/REGISTRY.md:L2120-L2130, so they are mitigated non-defects. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: .txt AI-guide snippets remain outside CI smoke validation, but this PR correctly tracks that limitation in TODO.md:L83-L83, so it is mitigated and non-blocking. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the regression tests still do not pin the new estimator-contract bug down. ImputationDiDResults, TwoStageDiDResults, StackedDiDResults, and EfficientDiDResults only check that guidance exists in tests/test_practitioner.py:L239-L270, so the invalid shared control_group template would pass unchanged. Concrete fix: add handler-specific assertions that ImputationDiD/TwoStageDiD guidance does not mention control_group, StackedDiD uses clean_control rather than control_group, and EfficientDiD does not mention not_yet_treated and instead points to last_cohort where relevant.
Severity: P3. Impact: the evaluation document still describes the corrected staggered workflow as “HonestDiD + placebo,” which overstates what the guide now recommends for load_mpdta() in docs/practitioner-guide-evaluation.md:L59-L67. Concrete fix: update those lines to describe HonestDiD plus specification-based falsification / control-definition comparisons, not placebo tests.

Path to Approval

Split the shared staggered Step 6 guidance in diff_diff/practitioner.py:L191-L208 into estimator-specific handlers so each estimator only recommends supported falsification knobs.
Update the generic staggered wording in docs/llms-practitioner.txt:L261-L264 to stop presenting control_group comparisons as valid for all staggered estimators.
Add regression tests covering ImputationDiD, TwoStageDiD, StackedDiD, and EfficientDiD guidance content, not just dispatch presence, in tests/test_practitioner.py:L239-L270.

Replace all remaining _placebo_step(staggered=True) calls with estimator-specific Step 6 guidance: - SA: control_group comparison (never_treated vs not_yet_treated) - ImputationDiD: no control_group param — cohort subsetting + cross- estimator comparison - TwoStageDiD: same as Imputation (no control_group) - StackedDiD: clean_control variation (not control_group) - EfficientDiD: never_treated vs last_cohort (not not_yet_treated) Tests: negative content assertions on code snippets for Imputation, TwoStage (no control_group in code), Stacked (clean_control), Efficient (no not_yet_treated in code). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T21:29:45Z

/ai-review

github-actions · 2026-03-28T21:37:51Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d6b11c09753980e00678812817eee26462784e7

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior handler-level estimator-contract blocker is mostly resolved: the new runtime guidance now distinguishes ImputationDiD, TwoStageDiD, StackedDiD, EfficientDiD, SyntheticDiD, and TROP instead of reusing one staggered template.
One P1 remains: the old generic staggered Step 6 advice still survives in the shared _placebo_step(staggered=True) helper and in the new AI guide text, so the fix is only partial.
That remaining generic wording is still incompatible with several estimator APIs and can steer agents toward unsupported knobs or invalid falsification exercises.
The Baker-guide renumbering, cluster-count rule, and mandatory with/without-covariates reporting are documented in the Methodology Registry, so those deviations are non-defects.
The prior evaluation-writeup wording issue appears resolved.
Static review only: I could not run tests here because pytest is unavailable, and package import fails in this environment because numpy is missing.

Methodology

Severity: P1. Impact: The previous estimator-contract issue is only partially fixed. The per-estimator handlers are now corrected, but the old shared staggered helper still hard-codes CS/SA-style advice to compare control_group='never_treated' vs 'not_yet_treated' and vary anticipation, and the new AI guide still tells agents to do this for staggered designs generally. That conflicts with the actual APIs for ImputationDiD (no control_group), TwoStageDiD (no control_group), StackedDiD (clean_control instead), and EfficientDiD (last_cohort, not not_yet_treated). Because this PR explicitly promotes the practitioner guide as an AI reference, the remaining generic pattern can still push agents toward unsupported or methodologically invalid robustness exercises. Concrete fix: delete or rewrite the unused shared staggered helper to be estimator-agnostic, and rewrite the Step 6 AI-guide prose to say “compare supported specification choices for the estimator you used,” with estimator-family-specific examples. diff_diff/practitioner.py:L191-L208 docs/llms-practitioner.txt:L260-L264 CLAUDE.md:L141-L141 diff_diff/imputation.py:L114-L117 diff_diff/two_stage.py:L120-L122 diff_diff/stacked_did.py:L121-L140 diff_diff/efficient_did.py:L161-L167 docs/methodology/REGISTRY.md:L628-L628 docs/methodology/REGISTRY.md:L687-L687 docs/methodology/REGISTRY.md:L1011-L1015
Severity: P3. Impact: None. The guide’s Baker-step renumbering, explicit cluster-count rule, and mandatory with/without-covariates reporting are documented deviations in the Methodology Registry, so they are not defects. Concrete fix: none. docs/methodology/REGISTRY.md:L2120-L2130

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the methodology blocker above.

Tech Debt

Severity: P3. Impact: .txt AI-guide snippets are still outside doc-snippet CI smoke validation, but this PR correctly tracks that limitation in TODO.md, so it is mitigated and non-blocking. Concrete fix: none required for approval. TODO.md:L83-L83

Security

No findings.

Documentation/Tests

No unmitigated findings beyond the blocker above. The prior handler-specific regression gap is substantially addressed by the new assertions covering ImputationDiD, TwoStageDiD, StackedDiD, EfficientDiD, SyntheticDiD, TROP, ContinuousDiD, and DDD. tests/test_practitioner.py:L239-L282
Static review only: python -m pytest -q tests/test_practitioner.py could not run because pytest is not installed, and importing diff_diff in this environment fails because numpy is missing.

Path to Approval

Remove or rewrite the unused shared staggered helper in diff_diff/practitioner.py:L191-L208 so no remaining helper text hard-codes control_group/anticipation as generic staggered advice.
Rewrite Step 6 in docs/llms-practitioner.txt:L260-L264 to describe estimator-supported falsification choices, with explicit family-specific examples: CS/SA (control_group, anticipation), Stacked (clean_control, window width), EfficientDiD (never_treated vs last_cohort), Imputation/TwoStage (cohort-drop, covariate, or cross-estimator comparisons).
If the shared helper is kept, add a regression test that guards its wording directly so this generic staggered control_group pattern cannot be reintroduced outside the estimator-specific handlers.

igerber · 2026-03-28T21:48:07Z

/ai-review

github-actions · 2026-03-28T21:55:34Z

🔁 AI review rerun (requested by @igerber)

Head SHA: f8a3b857930fd62c098b09b438410b821b68e3d7

Overall Assessment

⚠️ Needs changes

The highest unmitigated finding is still a P1: the top-level AI-facing summary in docs/llms.txt:L24 continues to present aggregate='group' / aggregate='event_study' as if they were generic heterogeneity APIs, which remains incorrect for SunAbraham and can still misdirect the exact practitioner workflow this PR is trying to enforce.

Executive Summary

The previous Step 7 blocker is only partially resolved: the full guide and runtime helper now split out Sun-Abraham correctly in docs/llms-practitioner.txt:L319 and diff_diff/practitioner.py:L347, but the top-level summary in docs/llms.txt:L24 still does not.
That remaining docs/llms.txt summary can still steer agents toward unsupported SA calls, because SunAbraham.fit() has no aggregate parameter and SA heterogeneity lives on event_study_effects, cohort_effects, and to_dataframe(level=...) in diff_diff/sun_abraham.py:L191 and diff_diff/sun_abraham.py:L436.
The Baker-step renumbering, separate PT-testing step, cluster-count rule, and required with/without-covariates reporting are explicitly documented deviations in docs/methodology/REGISTRY.md:L2120 and are consistent with the paper’s original Step 3/4/8 structure, so those are non-defects.
The new practitioner guide cites arXiv:2503.13323 with the wrong author list in docs/llms-practitioner.txt:L509; the paper is by Andrew Baker, Brantly Callaway, Scott Cunningham, Andrew Goodman-Bacon, and Pedro H. C. Sant’Anna. (ideas.repec.org)
The .txt doc-smoke coverage gap is properly tracked in TODO.md:L83, so that item is mitigated and non-blocking.

Methodology

Severity: P1. Impact: docs/llms.txt:L24 still says Step 7 heterogeneity is aggregate='group' for cohort effects and aggregate='event_study' for dynamic effects. That remains wrong for SunAbraham, whose public API has no aggregate argument in diff_diff/sun_abraham.py:L436 and exposes heterogeneity through event_study_effects, cohort_effects, and to_dataframe(level='event_study'/'cohort') in diff_diff/sun_abraham.py:L191. Because docs/llms.txt is the first AI-facing summary, this can still send agents to unsupported SA calls before they ever reach the corrected full guide or runtime helper. Concrete fix: rewrite Step 7 in docs/llms.txt:L24 to split SA out explicitly from aggregate-based estimators, or reduce it to “see docs/llms-practitioner.txt for estimator-specific heterogeneity guidance.”
Severity: P3. Impact: None. The renumbering of Baker’s workflow, the separate PT-testing step, the Step-4-to-Step-5 uncertainty merge, and the Step-8 robustness/reporting additions are all documented deviations in docs/methodology/REGISTRY.md:L2120, and the paper’s own sequence is indeed Step 3 “justify estimation method,” Step 4 “sources of uncertainty,” and Step 8 “keep learning.” Concrete fix: none.
Severity: P2. Impact: The bibliography entry in docs/llms-practitioner.txt:L509 attributes arXiv:2503.13323 to Baker/Larcker/Wang, which is not the cited paper. That weakens source traceability for a PR whose main claim is adherence to source material. Concrete fix: replace the reference entry with the paper’s actual author list. (ideas.repec.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: None for approval. The new .txt AI guides are outside current doc-snippet smoke coverage, but this limitation is explicitly tracked in TODO.md:L83, so it is mitigated. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The targeted SA regression test requested by the prior review is still effectively missing. tests/test_practitioner.py:L234 only checks dispatch/name for SunAbrahamResults, so it would not fail if aggregate='group' or group_effects reappeared in SA guidance. Concrete fix: extend the SA test to assert that guidance uses to_dataframe(level='event_study') / to_dataframe(level='cohort') from diff_diff/practitioner.py:L347 and does not mention aggregate='group' or group_effects.
Static review only: pytest is unavailable in this environment (pytest: command not found), and importing diff_diff fails here because numpy is unavailable.

Path to Approval

Fix the Step 7 summary in docs/llms.txt:L24 so it no longer presents aggregate='group' / aggregate='event_study' as a generic heterogeneity API. Either split out Sun-Abraham explicitly with its real surface (event_study_effects, cohort_effects, to_dataframe(level=...)) or make the summary point readers directly to docs/llms-practitioner.txt:L300 for estimator-specific heterogeneity guidance.

Fixing item 1 would move the assessment to ✅. The remaining P2/P3 items are cleanup, not blockers.

The condensed Step 7 was still presenting aggregate='group' as generic for all staggered estimators. Now distinguishes CS (aggregate parameter) from SA (automatic event_study_effects/cohort_effects). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T21:58:11Z

/ai-review

github-actions · 2026-03-28T22:08:11Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 0857affa272ed8c6050be4b28df1ca28dec7e224

Overall Assessment

⚠️ Needs changes

The previous Sun-Abraham Step 7 blocker is fixed, but the new practitioner_next_steps() helper still has an unmitigated P1 contract/methodology issue: it advertises the Baker 8-step workflow, accepts target_parameter and assumptions as valid public step names, and can print “All Baker et al. steps completed!”, yet none of its handlers can ever emit Step 1 or Step 2 guidance. Baker et al. explicitly make defining target parameters and stating identifying assumptions the first two steps of the workflow. (ar5iv.org)

Executive Summary

The previous P1 from the prior review is resolved: the top-level summary in docs/llms.txt:L18-L25 now separates Callaway-Sant’Anna aggregate-based heterogeneity from Sun-Abraham’s event_study_effects / to_dataframe(level='cohort') interface.
Severity: P1. The new runtime helper’s public API still claims an 8-step Baker workflow, but its emitted guidance only covers later diagnostic steps; Steps 1 and 2 are unreachable even though they are exposed as valid step names in diff_diff/practitioner.py:L16-L25, advertised in diff_diff/practitioner.py:L49-L110, and completion is overstated in diff_diff/practitioner.py:L796-L820. This is an undocumented mismatch with the source workflow. (ar5iv.org)
Severity: P2. The bibliography entry in docs/llms-practitioner.txt:L507-L510 still names the wrong authors for arXiv:2503.13323; the cited paper is by Andrew Baker, Brantly Callaway, Scott Cunningham, Andrew Goodman-Bacon, and Pedro H. C. Sant’Anna. (econpapers.repec.org)
The documented deviations called out in docs/methodology/REGISTRY.md:L2116-L2131 remain non-defects: renumbering, separate PT testing, the 50-cluster rule, and “Robustness & Reporting” for Step 8 are all explicitly tracked there.
Static review only: python -m pytest tests/test_practitioner.py -q could not run here because pytest is not installed, and importing diff_diff in this environment fails because numpy is missing.

Methodology

Severity: P1. Impact: diff_diff/practitioner.py:L16-L25 exposes all 8 Baker step names, diff_diff/practitioner.py:L49-L110 says the helper recommends which Baker steps remain, and docs/llms-practitioner.txt:L15-L16 / docs/llms.txt:L27-L30 tell users to call it for remaining steps after estimation. But the handlers in diff_diff/practitioner.py:L251-L754 never emit target_parameter or assumptions, so the helper can report the Baker workflow complete while silently skipping the paper’s first two steps. Concrete fix: either add generic Step 1 and Step 2 reminder items to the helper and let completed_steps filter them, or explicitly narrow the helper to post-estimation diagnostics by removing those two names from STEPS and changing the docs/output strings accordingly. (ar5iv.org)
Severity: P2. Impact: docs/llms-practitioner.txt:L507-L510 misattributes the cited paper, which weakens source traceability in a PR whose core claim is adherence to Baker et al. Concrete fix: replace the reference entry with the actual author list from arXiv:2503.13323. (econpapers.repec.org)
Severity: P3. Impact: none. The PR’s renumbering and workflow adaptations are documented in docs/methodology/REGISTRY.md:L2116-L2131, so those deviations are informational rather than defects. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: none for approval. The new .txt AI-guide smoke-test gap is explicitly tracked in TODO.md:L83, so it is mitigated. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: tests/test_practitioner.py:L234-L237 still only checks Sun-Abraham dispatch/name, so it would not catch a regression back to aggregate-based SA guidance. The file also has no assertion covering the Step 1 / Step 2 contract implied by diff_diff/practitioner.py:L16-L25. Concrete fix: add targeted assertions that SA guidance continues to use to_dataframe(level='event_study'/'cohort') and add a contract test that either Step 1/2 are emitted by default or, if the design is narrowed, that the public API/docs no longer advertise them.
Static review only: python -m pytest tests/test_practitioner.py -q failed here because pytest is unavailable, and import diff_diff failed because numpy is unavailable.

Path to Approval

Make practitioner_next_steps() internally consistent with its advertised Baker workflow: either emit Step 1 (target_parameter) and Step 2 (assumptions) guidance by default, or remove those step names from the public API and update every docstring/document/output string that currently promises “remaining steps” / “All Baker et al. steps completed.”

Structural fix: practitioner_next_steps() now always emits Step 1 (define target parameter) and Step 2 (state assumptions) as reminders, prepended to every handler's output. These are filterable via completed_steps=['target_parameter', 'assumptions']. This closes the gap where the helper could report the Baker workflow "complete" while silently skipping the first two steps. Fix bibliography: correct author list for arXiv:2503.13323 (Baker, Callaway, Cunningham, Goodman-Bacon, Sant'Anna). Tests: added assertions that Steps 1-2 are always emitted and filterable. SA negative assertion checks code snippets only (excluding generic Step 2 text). 33 tests now. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T22:14:22Z

/ai-review

github-actions · 2026-03-28T22:29:26Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e5f10ff678bdbbfd7701276da5e135b163257bf0

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior Step 1 / Step 2 contract issue is fixed: practitioner_next_steps() now prepends those steps and the new tests cover both default emission and completed_steps filtering (diff_diff/practitioner.py:L97-L125, tests/test_practitioner.py:L193-L209).
Severity: P1 [Newly identified]. The new practitioner guide now frames PT-GT-NYT / control_group='not_yet_treated' as the general recommendation for staggered DiD, but the paper treats that choice as context-dependent and the existing estimator docstrings still recommend never_treated; that is an undocumented methodology mismatch in control-group guidance (docs/llms-practitioner.txt:L73-L79, docs/llms-practitioner.txt:L205-L219, diff_diff/staggered.py:L113-L116, diff_diff/sun_abraham.py:L313-L316). citeturn0view0
No estimator, weighting, or variance implementation changed in this diff; the blocking issue is in public methodology guidance, not numerical code.
The remaining workflow adaptations that differ from Baker et al. are explicitly documented in the registry, so they are mitigated P3 informational items, not blockers (docs/methodology/REGISTRY.md:L2116-L2130).
The new .txt guide smoke-test gap is tracked in TODO.md, so it is not a blocker (TODO.md:L83-L83).
I could not run python -m pytest tests/test_practitioner.py -q here because pytest is not installed. A no-write compile() syntax check passed for diff_diff/practitioner.py and tests/test_practitioner.py.

Methodology

Severity: P1 [Newly identified]. Impact: the guide says PT-GT-NYT is “recommended for most applications” and repeatedly uses control_group='not_yet_treated' in the lead CS examples, while the paper presents the control-group / PT-assumption choice as context-dependent and explicitly says the analyst should state and justify which assumption is being used; the existing CallawaySantAnna and SunAbraham docstrings still recommend never_treated instead (docs/llms-practitioner.txt:L73-L79, docs/llms-practitioner.txt:L120-L121, docs/llms-practitioner.txt:L347-L357, diff_diff/practitioner.py:L703-L719, diff_diff/staggered.py:L113-L116, diff_diff/sun_abraham.py:L313-L316). Concrete fix: remove the blanket recommendation and present never_treated vs not_yet_treated as a context-specific choice with explicit assumption trade-offs, or document a deliberate library-wide deviation in REGISTRY.md and update the estimator docstrings/examples to match. citeturn0view0
Severity: P3. Impact: none for approval. The guide’s renumbering of Baker’s workflow, the separate PT-testing step, the cluster-count convention in Step 5, and the “Robustness & Reporting” rewrite for Step 8 are all explicitly documented in the methodology registry, so they are mitigated deviations rather than defects (docs/methodology/REGISTRY.md:L2116-L2130). Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: none for approval. The new .txt AI guides are still outside the doc-snippet smoke tests, but that limitation is explicitly tracked in TODO.md for this PR (TODO.md:L83-L83). Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

No unmitigated findings.
Validation note: the new tests do cover the previously flagged Step 1 / Step 2 contract and result-type dispatch (tests/test_practitioner.py:L193-L209, tests/test_practitioner.py:L229-L322), but I could not execute them here because pytest is unavailable.

Path to Approval

Rewrite the practitioner guidance so PT-GT-Nev vs PT-GT-NYT / never_treated vs not_yet_treated is presented as an assumption choice that must be justified, not as a blanket recommendation.
If the project intentionally wants to standardize on not_yet_treated, add that as an explicit methodology note in docs/methodology/REGISTRY.md and update the CallawaySantAnna / SunAbraham docstrings so the AI guide, runtime snippets, and estimator docs all agree.

The guide was recommending PT-GT-NYT / not_yet_treated as the default, contradicting both Baker et al. (who present it as context-dependent) and the library's own docstrings (which default to never_treated). Changed: PT-GT-NYT description no longer says "recommended for most applications" — now describes the trade-off. All code examples changed from not_yet_treated to never_treated (matching library default), with comments noting the choice is context-dependent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T22:32:38Z

/ai-review

github-actions · 2026-03-28T22:39:34Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7d3f76488d95ac13026a609df64c92c7c06f2091

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior control-group blocker appears resolved. The new guide no longer gives a blanket not_yet_treated recommendation; it now uses control_group='never_treated' in examples and labels not_yet_treated as a context-dependent choice, which matches the existing estimator docstrings and runtime guidance (docs/llms-practitioner.txt:205, docs/llms-practitioner.txt:274, diff_diff/practitioner.py:113, diff_diff/staggered.py:113, diff_diff/sun_abraham.py:313).
Severity: P1 [Newly identified]. The new Step 4 guidance now routes “few treated units” users to SyntheticDiD/TROP, but the library’s own TROP docs define TROP by suspected interactive fixed effects / factor confounding, not by small treated-group size. That is an undocumented methodology mismatch in estimator-selection guidance (docs/llms-practitioner.txt:141, docs/llms.txt:21, docs/choosing_estimator.rst:37, docs/choosing_estimator.rst:391, docs/api/trop.rst:18, diff_diff/trop.py:45).
No estimator, weighting, or variance implementation changed in this diff; the blocking issue is in public methodology guidance, not numerical code.
The renumbered Baker workflow, separate PT-testing step, 50-cluster inference convention, and mandatory with/without-covariates comparison are explicitly documented in the methodology registry, so those are mitigated P3 informational items, not blockers (docs/methodology/REGISTRY.md:2116).
The .txt smoke-test gap remains tracked in TODO.md, so it is not blocking (TODO.md:83).

Methodology

Severity: P1 [Newly identified]. Impact: the new practitioner decision tree and the condensed docs/llms.txt header tell agents that “few treated units” points to SyntheticDiD/TROP, but the project’s own estimator-selection docs and TROP API docs reserve TROP for suspected factor confounding / interactive fixed effects. That can push agents toward the wrong estimator family in Step 4, which is exactly the methodology-selection step this PR is trying to standardize (docs/llms-practitioner.txt:141, docs/llms.txt:21, docs/choosing_estimator.rst:37, docs/choosing_estimator.rst:391, docs/api/trop.rst:18, diff_diff/trop.py:45, docs/llms-full.txt:1453). Concrete fix: split the decision rule so SyntheticDiD is recommended for “few treated units / synthetic control” cases, while TROP is recommended when interactive fixed effects / latent factor confounding is the motivating concern.
Severity: P3. Impact: none for approval. The Step 3/PT-test separation, Step 5 cluster-count convention, and Step 8 “Robustness & Reporting” rewrite are documented deviations in the registry, so they are mitigated (docs/methodology/REGISTRY.md:2120). Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: none for approval. The new .txt AI guides are still outside the doc-snippet smoke tests, but that limitation is explicitly tracked in TODO.md for this PR (TODO.md:83). Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

No unmitigated findings.
Validation note: the new tests do cover the restored Step 1/2 contract, per-result-type dispatch, NaN warning path, and handler-specific guidance checks (tests/test_practitioner.py:193, tests/test_practitioner.py:229, tests/test_practitioner.py:380, tests/test_practitioner.py:433). I could not run python -m pytest tests/test_practitioner.py -q here because pytest is not installed, and importing diff_diff also fails in this environment because numpy is unavailable.

Path to Approval

Rewrite the Step 4 estimator-selection guidance in docs/llms-practitioner.txt:141 and docs/llms.txt:21 so SyntheticDiD is tied to “few treated units / synthetic control” and TROP is tied to “interactive fixed effects / factor confounding,” consistent with docs/choosing_estimator.rst:37, docs/choosing_estimator.rst:391, and docs/api/trop.rst:18.
If the project intentionally wants a different TROP recommendation, add an explicit methodology note to docs/methodology/REGISTRY.md and update the existing TROP docs (docs/choosing_estimator.rst, docs/api/trop.rst, and any relevant docstrings) so all public guidance agrees.

SDiD is for few treated units (synthetic control hybrid). TROP is for suspected factor confounding / interactive fixed effects (nuclear norm regularization). These have different use cases — the previous grouping under "few treated units" contradicted TROP's own documentation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T22:41:35Z

/ai-review

github-actions · 2026-03-28T22:49:32Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a21845ce91143a808e83e52ec4ed597550a9c0e0

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior Step 4 documentation blocker is resolved: the new guides now separate few treated units → SyntheticDiD from factor confounding → TROP in docs/llms-practitioner.txt:154 and docs/llms.txt:21.
No estimator, weighting, variance, or default-behavior code changed in this diff; the remaining blocker is in the new public guidance layer.
Severity: P1. The new runtime helper still tells SyntheticDiDResults users to compare against TROP because both address “few treated units” in diff_diff/practitioner.py:523 and diff_diff/practitioner.py:525. That conflicts with the project’s own estimator guide and TROP docstrings, which reserve TROP for factor confounding / interactive fixed effects.
The Baker-workflow renumbering, separate PT-testing step, and Step 8 robustness/reporting rewrite are explicitly documented in docs/methodology/REGISTRY.md:2120, so those are mitigated P3s rather than defects. The paper itself is an organizing framework whose original steps include estimation-method selection, uncertainty, heterogeneity, and “keep learning.” (aeaweb.org)
The .txt doc-smoke-test gap is properly tracked in TODO.md:83, so it is non-blocking.
I could not run python -m pytest tests/test_practitioner.py -q or import diff_diff here because this environment is missing pytest and numpy.

Methodology

Severity: P1. Impact: the new runtime guidance reintroduces the same estimator-selection mismatch the docs just fixed. diff_diff/practitioner.py:523 and diff_diff/practitioner.py:525 say SyntheticDiD and TROP address the same “few treated units” setting, but the library’s own decision rules say SyntheticDiD is for few treated units / synthetic-control settings docs/choosing_estimator.rst:37, docs/choosing_estimator.rst:204, while TROP is for factor confounding / interactive fixed effects docs/choosing_estimator.rst:391, docs/api/trop.rst:18, docs/api/trop.rst:272, and diff_diff/trop.py:4. Because both AI guides explicitly tell agents to call practitioner_next_steps(results) after estimation docs/llms-practitioner.txt:15, docs/llms.txt:27, this is a user-facing methodology mismatch, not dead text. Concrete fix: rewrite the SyntheticDiD robustness step so it does not characterize TROP as the few-treated-units comparator; mention TROP only when latent-factor / interactive-fixed-effects concerns are the reason to branch away from SDiD.
Severity: P3. Impact: none for approval. The Baker-step renumbering and the added cluster-count / robustness conventions are explicitly documented in docs/methodology/REGISTRY.md:2116, so they are mitigated deviations rather than undocumented methodology drift. That is consistent with Baker et al. framing the paper as a general organizing framework rather than a fixed API recipe. (aeaweb.org) Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: none for approval. The new AI .txt guides are outside tests/test_doc_snippets.py, but this PR adds that limitation to TODO.md:83, so it is properly tracked. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new tests do not assert the estimator-selection semantics that mattered in the prior review. tests/test_practitioner.py:281 and tests/test_practitioner.py:291 only check that SDiD/TROP snippets avoid staggered-only knobs, so the contradictory SDiD↔TROP message in diff_diff/practitioner.py:523 was free to slip through. Concrete fix: add targeted assertions that SDiD guidance does not present TROP as the few-treated-units analogue, and that TROP guidance is tied to factor confounding / interactive fixed effects.
Validation note: I could not execute the new test file in this environment because pytest is unavailable, and package import fails before test collection because numpy is not installed.

Path to Approval

Update the SyntheticDiD handler in diff_diff/practitioner.py:488 so its robustness guidance no longer says TROP addresses the same “few treated units” setting. Reserve TROP for factor-confounding / interactive-fixed-effects cases, consistent with docs/api/trop.rst:18 and docs/choosing_estimator.rst:391.
Add a regression test in tests/test_practitioner.py:281 that fails if SDiD guidance characterizes TROP as the few-treated-units comparator, and a corresponding TROP assertion that mentions factor confounding / interactive fixed effects.

SDiD robustness step was still suggesting TROP comparison because "both address few treated units" — but TROP is for factor confounding, not few treated units. Now suggests comparing with CS/SA instead, with a note that TROP is a different use case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T23:00:20Z

/ai-review

github-actions · 2026-03-28T23:09:52Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8ce444101ad4c17677bf99738adfa9f8d57099b8

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior SDiD↔TROP methodology blocker is resolved in the new guidance text and runtime helper.
No estimator math, weighting, variance, or default behavior changed in this diff; the remaining blocker is in the new practitioner-guidance layer.
Severity P1: the new guidance is not actually result-type safe. MultiPeriodDiD guidance emits a 2x2 placebo workflow, and the shared snippet uses time='period' with multi-period pre/post lists even though run_all_placebo_tests() routes that time column into DifferenceInDifferences, which requires a binary time indicator.
The companion docs example then silently drops failed placebo subtests by filtering to objects with p_value, so agents can miss that permutation / leave-one-out did not run.
The Baker-workflow renumbering and Step 5 / Step 8 adaptations are explicitly documented in the methodology registry, so I treated those as mitigated P3s rather than defects.
I could not run pytest here because pytest is not installed, and importing diff_diff fails because numpy is missing in this environment.

Methodology

Severity: P1
Impact: The new “context-aware” sensitivity guidance is not conditioned tightly enough on supported result types. MultiPeriodDiDResults gets _placebo_step() in practitioner.py:305, even though the new guide itself says placebo tests are for simple 2x2 designs only in llms-practitioner.txt:270. The shared placebo snippet uses time='period' with multi-period pre_periods / post_periods in practitioner.py:221 and llms-practitioner.txt:286. But run_all_placebo_tests() forwards that time into permutation and leave-one-out refits in diagnostics.py:844 and diagnostics.py:865, and DifferenceInDifferences validates time as binary in estimators.py:223. The same helper family also tells basic DiD users to “use HonestDiD” in practitioner.py:187, even though compute_honest_did() only supports MultiPeriodDiDResults or CallawaySantAnnaResults in honest_did.py:1455. This means the new guidance can direct users/agents into unsupported diagnostics and partial-failure paths.
Concrete fix: Make the sensitivity guidance strictly result-type safe. Remove the 2x2 placebo recommendation from MultiPeriodDiD guidance unless you first show a valid binary-post workflow, and gate HonestDiD messaging to result types actually supported by compute_honest_did().
Severity: P3
Impact: None for approval. The Baker-guide renumbering, separate PT-testing step, Step 5 inference convention, and Step 8 robustness/reporting rewrite are explicitly documented in REGISTRY.md:2116, so I treated those deviations as mitigated.
Concrete fix: None.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3
Impact: None for approval. The .txt AI guides are still outside the doc-snippet smoke tests, but that limitation is already tracked in TODO.md:83.
Concrete fix: None required for approval.

Security

No findings.

Documentation/Tests

Severity: P3
Impact: The new tests do not cover the blocking guidance mismatch. test_multi_period_results() in test_practitioner.py:234 only asserts that guidance exists, and the new assertions around SDiD/TROP in test_practitioner.py:281 and test_practitioner.py:291 still do not exercise whether emitted diagnostics are valid for the result type. That let the broken placebo path slip through.
Concrete fix: Add regression tests that fail if MultiPeriodDiD guidance emits the 2x2 placebo snippet, if simple DiD guidance points to unsupported HonestDiD, or if practitioner examples suppress returned placebo error entries.
Validation note
I could not execute tests/test_practitioner.py here because pytest is unavailable, and import diff_diff fails in this environment because numpy is not installed.

Path to Approval

Make practitioner_next_steps() result-type safe for sensitivity guidance: remove or replace the 2x2 placebo recommendation from MultiPeriodDiD output in practitioner.py:305, and stop pointing unsupported result types at HonestDiD from practitioner.py:187.
Rewrite the placebo snippet in practitioner.py:221 and llms-practitioner.txt:286 so it uses a valid binary post indicator or another actually supported workflow.
Update the docs example in llms-practitioner.txt:291 to report returned {"error": ...} entries instead of silently skipping them.
Add focused regression tests in test_practitioner.py:234 covering the invalid placebo path and unsupported HonestDiD recommendation.

run_all_placebo_tests() requires binary time indicator, which MultiPeriodDiD does not use (it has multi-period time). Removed the placebo step from _handle_multi_period; HonestDiD (which IS supported for MultiPeriodDiDResults) remains as the sensitivity tool. DiD handler already correctly uses _placebo_step() (valid for 2x2 with binary time) and does not suggest HonestDiD (not supported for DiDResults). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T23:12:51Z

/ai-review

github-actions · 2026-03-28T23:18:56Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c8658167161f9e40a290f64cc1a979811bef710d

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior MultiPeriodDiD -> placebo blocker is only partially resolved: diff_diff/practitioner.py:L305 now omits placebo guidance for MultiPeriodDiDResults.
Two P1 guidance mismatches remain in the new practitioner layer: the basic DiD path still tells users to use unsupported HonestDiD, and the shared placebo example still uses an unsupported multi-period time='period' workflow while hiding returned error entries.
No estimator math, weighting, variance/SE formulas, or default estimator behavior changed in this diff; the approval blockers are guidance/API-contract mismatches, not numerical methodology changes.
The Baker-workflow renumbering, cluster-count convention, and with/without-covariates requirement are documented in the methodology registry, so I treated those as mitigated P3s rather than defects.
The added tests do not cover the two remaining result-type-safety gaps.
I could not run pytest here because pytest is not installed.

Methodology

This PR is aligning a guidance layer with Baker et al.’s practitioner framework rather than changing estimator formulas. citeturn5search0turn5search1

Severity: P1
Impact: DiDResults still inherit the generic parallel-trends step via diff_diff/practitioner.py:L281, and that shared step still tells users to “use HonestDiD” at diff_diff/practitioner.py:L190. But compute_honest_did() only accepts MultiPeriodDiDResults or CallawaySantAnnaResults, and otherwise raises a type error at diff_diff/honest_did.py:L561 and diff_diff/honest_did.py:L650. The runtime helper can still steer basic 2x2 users into an unsupported sensitivity workflow.
Concrete fix: Make the non-staggered PT copy result-type safe: remove the HonestDiD sentence from the generic/basic-DiD step, or gate it so only MultiPeriodDiDResults / CallawaySantAnnaResults receive that recommendation.
Severity: P1
Impact: The shared placebo guidance still reproduces the earlier partial-failure path. The new file itself acknowledges that run_all_placebo_tests() requires a binary time indicator at diff_diff/practitioner.py:L309, but _placebo_step() still shows time='period' with multi-period pre_periods / post_periods at diff_diff/practitioner.py:L231. The practitioner guide repeats the same example and then filters output with hasattr(result, 'p_value') at docs/llms-practitioner.txt:L286 and docs/llms-practitioner.txt:L292. Meanwhile, the permutation and leave-one-out branches pass the supplied time straight into DifferenceInDifferences.fit() at diff_diff/diagnostics.py:L567, diff_diff/diagnostics.py:L590, diff_diff/diagnostics.py:L693, and diff_diff/diagnostics.py:L712, and DifferenceInDifferences validates time as binary at diff_diff/estimators.py:L223. Following the published snippet can therefore suppress failed subtests and leave users/agents thinking the full placebo suite ran.
Concrete fix: Rewrite the shared snippet in both the runtime helper and the practitioner guide to use a binary post indicator, or switch the example to individual placebo helpers that genuinely support multi-period period. Also print returned error dicts explicitly instead of filtering them out. The existing diagnostics docs already use binary post for permutation/LOO at docs/api/diagnostics.rst:L82 and docs/api/diagnostics.rst:L108.
Severity: P3
Impact: The Baker-workflow renumbering, explicit PT-testing step, cluster-count rule, and “with and without covariates” reporting requirement are explicitly documented in docs/methodology/REGISTRY.md:L2116, so I treated those deviations as mitigated rather than defects.
Concrete fix: None.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3
Impact: .txt AI-guide snippets remain outside the doc-snippet smoke tests, but that limitation is already tracked in TODO.md:L83.
Concrete fix: None required for approval.

Security

No findings.

Documentation/Tests

Severity: P3
Impact: The added TestResultTypeDispatch coverage in tests/test_practitioner.py:L229 checks several handler-specific strings, but there is still no regression that fails when the DiD path mentions unsupported HonestDiD or when the practitioner-guide placebo example suppresses returned error entries.
Concrete fix: Add focused regressions for both cases: one on practitioner_next_steps(DiDResults) and one on the published placebo example/output handling.
Validation note
I could not execute python -m pytest tests/test_practitioner.py -q because pytest is not installed in this environment.

Path to Approval

Remove or gate the HonestDiD recommendation on the basic-DiD path so practitioner_next_steps(DiDResults) never points to an unsupported API.
Replace the shared run_all_placebo_tests() example in diff_diff/practitioner.py:L221 and docs/llms-practitioner.txt:L282 with a binary post workflow or with supported multi-period placebo helpers.
Update the practitioner-guide example loop so it reports returned {"error": ...} entries instead of silently skipping them.
Add regression tests covering items 1-3.

P1: Non-staggered parallel_trends_step now qualifies HonestDiD as available "For MultiPeriodDiD or CS results" instead of blanket advice. DiDResults users won't be steered toward unsupported compute_honest_did. P1: Placebo snippet now uses binary time='post' with pre_periods=[0], post_periods=[1] matching what run_all_placebo_tests() actually needs (it refits DifferenceInDifferences which validates binary time). Also shows error-dict handling in the guide instead of silently filtering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T23:21:21Z

/ai-review

github-actions · 2026-03-28T23:27:27Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b662a41805325cf77cb1c7f3b6cf5e76aa484025

Overall Assessment

⚠️ Needs changes

Executive Summary

Re-review result: one of the two prior P1s is fixed, one remains. The prior placebo/example issue is resolved in diff_diff/practitioner.py:L222-L241 and docs/llms-practitioner.txt:L270-L296.
This diff adds a guidance/documentation layer and tests; it does not change estimator math, weighting, variance/SE formulas, or estimator defaults.
One P1 remains: the generic parallel-trends step in diff_diff/practitioner.py:L167-L201 is still reused by the basic DiD handler at diff_diff/practitioner.py:L283-L286, so practitioner_next_steps(DiDResults) still tells users to use HonestDiD even though diff_diff/honest_did.py:L543-L545 and diff_diff/honest_did.py:L650-L653 only support MultiPeriodDiDResults or CallawaySantAnnaResults.
The Baker-workflow renumbering, separate PT-testing step, cluster-count convention, and with/without-covariates reporting rule are explicitly documented in docs/methodology/REGISTRY.md:L2114-L2131, so I treated those as mitigated P3s rather than defects. Baker et al. is an eight-step practitioner framework, and the registry is explicit that diff-diff is adapting rather than mirroring that framework 1:1. citeturn0search0turn1search0
The added tests still do not pin the remaining DiDResults/HonestDiD mismatch: tests/test_practitioner.py:L229-L233.

Methodology

Severity: P1
Impact: practitioner_next_steps(DiDResults) is still not fully result-type-safe. The basic 2x2 path inherits copy saying “For MultiPeriodDiD or CS results, use HonestDiD,” but the runtime helper is supposed to be context-aware for the actual result object it receives. For basic DiD users, this still points at an unsupported sensitivity workflow and keeps the prior blocker alive: diff_diff/practitioner.py:L167-L201, diff_diff/practitioner.py:L283-L286, diff_diff/honest_did.py:L650-L653.
Concrete fix: split _parallel_trends_step() into a basic-DiD version and an event-study version, or add a supports_honest_did flag so the DiDResults handler never mentions HonestDiD at all.
Severity: P3
Impact: The workflow renumbering and diff-diff-specific reporting conventions are documented deviations, not silent methodology drift: docs/methodology/REGISTRY.md:L2114-L2131.
Concrete fix: None.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3
Impact: .txt AI-guide snippets remain outside doc-snippet smoke tests, but this PR explicitly tracks that limitation in TODO.md:L74-L83, so it is mitigated and not a blocker.
Concrete fix: None required for approval.

Security

No findings.

Documentation/Tests

Severity: P3
Impact: The new test suite checks many handler-specific API mismatches, but it still does not fail when practitioner_next_steps(DiDResults) mentions unsupported HonestDiD guidance. That leaves the remaining P1 unguarded: tests/test_practitioner.py:L229-L320.
Concrete fix: add a regression asserting that the DiDResults path contains no HonestDiD/compute_honest_did recommendation in its emitted steps.
Validation note: I could not execute pytest here because pytest is not installed, and I also could not import the package for a runtime spot-check because this environment lacks numpy.

Path to Approval

Make the basic DiDResults guidance fully result-type-safe by removing the HonestDiD recommendation from the non-staggered parallel-trends step, or by gating that text to result types actually accepted by compute_honest_did().
Add a regression test on practitioner_next_steps(did_results) that asserts the emitted steps do not mention HonestDiD or compute_honest_did.

Basic DiD handler now has its own PT step that doesn't mention HonestDiD (which DiDResults doesn't support). The shared _parallel_trends_step() is now only used by MultiPeriodDiD (where HonestDiD IS supported) and the generic fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-28T23:29:22Z

/ai-review

github-actions · 2026-03-28T23:36:20Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c74e7ace8b9d4e99b0c9559821f5045bbabb798d

Overall Assessment

✅ Looks good

This re-review resolves the prior P1. The PR adds a guidance/documentation layer and a new helper, but it does not change estimator formulas, weighting, variance/SE computation, identification checks, or defaults.

Executive Summary

The prior blocker is fixed: DiDResults now has a dedicated 2x2 handler that no longer mentions HonestDiD, while the new docs/runtime guidance scope HonestDiD to supported result types and keep placebo tests on the simple 2x2 path (diff_diff/practitioner.py:L283-L316, docs/llms-practitioner.txt:L250-L296, diff_diff/honest_did.py:L616-L652, diff_diff/diagnostics.py:L766-L845).
Methodology-wise, this is a workflow/adaptation layer, not an estimator-math change. The cited Baker paper is an organizing framework for DiD designs and estimators, and the PR’s renumbering/reporting changes are explicitly documented as diff-diff-specific adaptations in the registry, so I treat them as mitigated P3s rather than defects (docs/methodology/REGISTRY.md:L2116-L2131). (aeaweb.org)
The new TODO.md entry properly tracks the .txt doc-snippet smoke-test gap, so that limitation is mitigated and non-blocking (TODO.md:L74-L83).
Residual gap: the new test suite still does not add a direct regression asserting that practitioner_next_steps(DiDResults) emits no HonestDiD / compute_honest_did guidance; coverage for DiDResults currently only checks that some steps are returned (tests/test_practitioner.py:L229-L233).

Methodology

Severity: P3
Impact: No unmitigated methodology defect. Cross-checking the cited Baker guide and the registry, this PR is implementing a library-specific practitioner workflow rather than altering estimator mathematics, and the step renumbering / cluster-threshold / robustness-reporting changes are explicitly labeled as diff-diff adaptations. The previous DiDResults → HonestDiD mismatch is resolved by the dedicated 2x2 handler and by constraining HonestDiD guidance to supported result types only (docs/methodology/REGISTRY.md:L2116-L2131, diff_diff/practitioner.py:L283-L316, docs/llms-practitioner.txt:L250-L296, diff_diff/honest_did.py:L616-L652). (aeaweb.org)
Concrete fix: None.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3
Impact: The new .txt AI guides are still outside the existing doc-snippet smoke tests, but this PR explicitly tracks that limitation in TODO.md, so it is mitigated and not a blocker (TODO.md:L74-L83).
Concrete fix: None required for approval.

Security

No findings.

Documentation/Tests

Severity: P3
Impact: The suite covers many result-type-specific API mismatches, but it still does not pin the exact previously-blocking regression that practitioner_next_steps(DiDResults) must not mention HonestDiD or compute_honest_did; current DiDResults coverage only asserts that guidance exists (tests/test_practitioner.py:L229-L233). That leaves the old failure mode unguarded against future refactors, even though the implementation is currently correct (diff_diff/practitioner.py:L283-L316).
Concrete fix: Add a targeted regression that calls practitioner_next_steps(did_results, verbose=False) and asserts that neither the emitted labels nor code snippets contain HonestDiD or compute_honest_did.
Severity: P3
Impact: practitioner_next_steps is now a top-level public API export, but it is not added to the standard Sphinx API index, so the regular API docs surface adjacent diagnostics/sensitivity helpers but not this new entry point (diff_diff/init.py:L171-L171, diff_diff/init.py:L348-L349, docs/api/index.rst:L82-L125).
Concrete fix: Add diff_diff.practitioner_next_steps to the autosummary/API docs and include a short usage example.

Replace the original adherence-only rubric (which inflated before scores by giving credit for running the wrong diagnostic) with a correctness rubric that penalizes methodologically invalid choices. Before scores drop from 9.4 to 8.4 (generic check_parallel_trends on staggered data scored as 1, not 2). Final scores: 15.55/16. Net improvement increases from +65% to +85%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber and others added 6 commits March 28, 2026 16:40

igerber force-pushed the ai-practitioner-guide branch from 158f725 to f3c17ff Compare March 28, 2026 20:40

igerber merged commit 8f92bc9 into main Mar 29, 2026
14 checks passed

igerber deleted the ai-practitioner-guide branch March 29, 2026 01:12

igerber mentioned this pull request Mar 29, 2026

Bump version to 2.7.6 #241

Merged

Conversation

igerber commented Mar 28, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Uh oh!

igerber commented Mar 28, 2026

Uh oh!

github-actions Bot commented Mar 28, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Path to Approval

Uh oh!

igerber commented Mar 28, 2026