Skip to content

Add AI practitioner guardrails (Baker et al. 2025)#239

Merged
igerber merged 21 commits into
mainfrom
ai-practitioner-guide
Mar 29, 2026
Merged

Add AI practitioner guardrails (Baker et al. 2025)#239
igerber merged 21 commits into
mainfrom
ai-practitioner-guide

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Mar 28, 2026

Summary

  • Add docs/llms-practitioner.txt: 8-step Baker et al. (2025) workflow mapped to concrete diff-diff API calls, structured as agent-executable instructions
  • Restructure docs/llms.txt with practitioner workflow header — the first thing AI agents see on discovery
  • Add diff_diff/practitioner.py: practitioner_next_steps(results) function providing context-aware guidance for all 13 result types via handler registry dispatch
  • Add tests/test_practitioner.py: 26 tests covering all result types, completed_steps filtering, verbose output, NaN handling, and unknown type fallback
  • Add docs/practitioner-guide-evaluation.md: empirical before/after comparison across 30 independent agent runs

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — practitioner.py is a guidance module (no estimator math). References Baker, Larcker & Wang (2025) "Difference-in-Differences Designs: A Practitioner's Guide" (arXiv:2503.13323) for the 8-step workflow.
  • Paper / source link(s): https://arxiv.org/abs/2503.13323
  • Any intentional deviations from the source (and why): None — the guide maps Baker et al.'s framework to diff-diff's existing API without modifying any estimator behavior.

Validation

  • Tests added/updated: tests/test_practitioner.py (26 tests, all passing)
  • Empirical validation: 30 independent AI agent runs (10 before, 10 after v1, 10 after v2) scored against Baker et al. rubric. Results: 9.4/16 → 16.0/16 (+70%) with variance collapsing from SD=0.84 to SD=0.0. Full results in docs/practitioner-guide-evaluation.md.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions
Copy link
Copy Markdown

Overall Assessment

⚠️ Needs changes

The PR’s main feature is a new “Baker et al. 2025” practitioner workflow, but two P1 issues are unmitigated: it rewrites the cited eight-step framework without documenting the deviation, and it recommends compute_honest_did() for result types that the library does not support. The cited paper’s Step 8 is “Keep learning,” not “Robustness & Reporting,” and the paper explicitly lists few-cluster issues as outside its main coverage. (scribd.com)

Executive Summary

  • P1: The new AI-facing workflow is presented as a direct Baker et al. mapping, but it renumbers/redefines the paper’s steps and injects a hard few-cluster rule with no deviation note in the Methodology Registry. (scribd.com)
  • P1: practitioner_next_steps() and the new guide both recommend compute_honest_did(results) for SunAbraham, ImputationDiD, TwoStageDiD, StackedDiD, and EfficientDiD results, even though honest_did.py:543 only supports MultiPeriodDiDResults and CallawaySantAnnaResults.
  • P2: Several snippets emitted by diff_diff/practitioner.py:209 reference nonexistent contracts (result_no_cov.att, results.pre_treatment_rmse, results.sub_experiments, results.hausman_pretest).
  • P2: tests/test_practitioner.py:118 and neighboring fixtures use __new__ mocks with invented fields, so the suite does not validate the real result-object APIs it claims to cover.
  • P2/P3: The new primary guide has a broken equivalence_test_trends() example and visible mojibake in the estimator decision tree.

Methodology

Code Quality

Performance

No findings in the changed code.

Maintainability

  • Severity: P2. Impact: tests/test_practitioner.py:118, tests/test_practitioner.py:143, and nearby fixtures build __new__ mocks with fields that real result classes do not expose. For example, the EfficientDiD fixture invents .att, .se, and .hausman_pretest, even though the actual contract is overall_att/overall_se plus a separate HausmanPretestResult type (diff_diff/efficient_did_results.py:21, diff_diff/efficient_did_results.py:54). The same mismatch exists for SunAbrahamResults, ImputationDiDResults, TwoStageDiDResults, ContinuousDiDResults, and StackedDiDResults. This locks in the wrong API shape and lets broken guidance pass tests. Concrete fix: replace these mocks with real estimator outputs where feasible, or with dataclass-faithful fixtures that use the actual public fields and required preconditions.

Tech Debt

  • Severity: P3. Impact: None of the practitioner-guide issues above are tracked in TODO.md or documented as a methodology deviation in docs/methodology/REGISTRY.md, so merging would create untracked debt. Concrete fix: only defer the doc-only cleanups if they are explicitly added to TODO.md; the P1 items still need direct fixes before merge.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: The new primary guide uses equivalence_test_trends(..., threshold=0.5) at docs/llms-practitioner.txt:91, but the actual API parameter is equivalence_margin at diff_diff/utils.py:919. Because docs/llms.txt:36 tells agents to “start here,” this broken example is immediately user-facing. Concrete fix: rename the argument to equivalence_margin= and add the new file to doc-snippet coverage.
  • Severity: P3. Impact: The estimator decision tree in docs/llms-practitioner.txt:115 and docs/llms-practitioner.txt:127 contains mojibake (��, └���─), which makes the new AI-facing guide look corrupted. Concrete fix: re-save the file as UTF-8 and add a lightweight doc lint or snapshot check for non-printing/corrupted characters.

Path to Approval

  1. Realign the Baker workflow in docs/llms-practitioner.txt, docs/llms.txt, and diff_diff/practitioner.py: restore the paper’s step structure, or explicitly mark diff-diff-specific additions as deviations in docs/methodology/REGISTRY.md.
  2. In diff_diff/practitioner.py, only emit HonestDiD guidance for result types actually supported by diff_diff/honest_did.py, and update the docs to match; for Callaway-Sant’Anna, tell users to refit with aggregate='event_study' or aggregate='all'.
  3. Fix the generated snippets in diff_diff/practitioner.py: use the correct result fields (overall_att, pre_treatment_fit, stacked_data/n_sub_experiments) and the actual EfficientDiD pretest API.
  4. Replace the contract-inaccurate mocks in tests/test_practitioner.py with real or faithful fixtures, add assertions for the EfficientDiD and Synthetic/Stacked handler paths, and add doc-snippet coverage for the new practitioner guide.

Validation note: this was a static review only. I could not run pytest here because pytest is not installed in the environment, and importing diff_diff also fails here because numpy is unavailable.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: f34254254dfba5f0295655f4813c83cab2fe75d9


Overall Assessment

⚠️ Needs changes

The previous HonestDiD-support blocker looks resolved, but one P1 remains: the new Baker workflow is still numbered inconsistently across the exported helper, the long/short AI guides, the registry note, and the new tests. Validation note: this was a static review only; I could not run pytest or import diff_diff here because pytest and numpy are unavailable in this environment.

Executive Summary

Methodology

Affected method(s): Baker et al. (2025) practitioner workflow only; no estimator implementation changed.

Code Quality

  • Severity: P2. Impact: The Bacon handler says it should warn when “later vs earlier” comparisons contaminate TWFE, but the implementation looks for negative comparison weights at diff_diff/practitioner.py:524. In this library, forbidden comparisons are represented by comparison_type="later_vs_earlier" and summarized via total_weight_later_vs_earlier, not by negative weights, at diff_diff/bacon.py:34, diff_diff/bacon.py:84, and diff_diff/bacon.py:230. As written, the new warning path will miss the very contamination pattern it describes. Concrete fix: key the warning off results.total_weight_later_vs_earlier > 0 or comparison_type == "later_vs_earlier" instead of negative weights.
  • Severity: P2. Impact: The EfficientDiD guidance correctly notes that the Hausman pretest is not a results attribute, but its example then suggests edid.fit(..., run_pretest=True) at diff_diff/practitioner.py:435. EfficientDiD.fit() has no run_pretest parameter at diff_diff/efficient_did.py:294; the supported API is EfficientDiD.hausman_pretest(...) at diff_diff/efficient_did.py:1396. Concrete fix: replace the snippet with the actual hausman_pretest(...) call and keep the comment that it is separate from the results object.

Performance

  • No findings.

Maintainability

  • No separate findings beyond the cross-file workflow drift noted above.

Tech Debt

  • No separate findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests ratify the current numbering bug instead of catching it at tests/test_practitioner.py:283, and they do not assert either the Bacon warning behavior or the EfficientDiD pretest guidance paths. Separately, the snippet smoke test only scans .rst files at tests/test_doc_snippets.py:25, so the new AI-facing .txt guides are outside automated snippet validation. Concrete fix: after deciding the canonical Baker-step mapping, update the practitioner tests to assert the correct step numbers, add handler-specific assertions for Bacon warnings and hausman_pretest, and either extend snippet coverage to the .txt AI guides or move executable examples into validated .rst docs.

Path to Approval

  1. Choose a single canonical Baker-step mapping and align practitioner_next_steps(), both LLM guides, the registry note/evaluation rubric, and tests/test_practitioner.py to that same numbering. If the intended adaptation is the one documented in the registry, change the helper’s parallel-trends step from baker_step=2 to the documented Step 3 and update the surrounding headings/tests to match.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 50bc3fcd913bc85219642316979f4c0d5070eb82


Overall Assessment

⚠️ Needs changes

This re-review is down to one remaining blocker. No estimator math, weighting, variance/SE, or default-behavior logic changed in this diff, and the prior Bacon-warning / EfficientDiD API issues appear resolved. The remaining P1 is that the new Baker guardrail artifacts are still not internally consistent: the main guide and runtime helper now use the adapted Step 3/4/5 mapping, but the complete example and evaluation rubric still teach/score the old sequence.

Executive Summary

  • Review surface is the new practitioner-guidance layer; no estimator implementation changed.
  • Re-review status: the prior Bacon-warning and EfficientDiD Hausman-snippet issues appear resolved.
  • P1: the long-form guide’s complete example still labels Step 3 as estimator selection and Step 4 as inference, and it never performs the explicit Step 3 parallel-trends test.
  • P2: _check_nan_att() misses MultiPeriodDiDResults.avg_att, so NaN multi-period results do not trigger the new warning path.
  • P2: the new .txt AI guides are outside the doc-snippet smoke-test harness.

Methodology

Affected method(s): Baker et al. (2025) practitioner workflow only; no estimator implementation changed.

  • Severity: P1. Impact: docs/methodology/REGISTRY.md:2004 and diff_diff/practitioner.py:137 establish the adapted canonical numbering as Step 3 = test parallel trends, Step 4 = choose estimator, Step 5 = estimate. But the shipped “Complete Example” still labels Step 3 as estimator selection and Step 4 as inference at docs/llms-practitioner.txt:392 and docs/llms-practitioner.txt:399, the Step 6 Bacon note still cross-references “Step 3” at docs/llms-practitioner.txt:258, and the new evaluation rubric still scores S3/S4/S5 using the old mapping at docs/practitioner-guide-evaluation.md:24. Because the example never executes the explicit Step 3 pre-trends check, agents following the example can still skip the diagnostic that the registry separated into its own required step. Baker et al.’s paper keeps Step 3 as estimation method, Step 4 as sources of uncertainty, Step 5 as estimate, and Step 8 as “keep learning.” (ar5iv.org) Concrete fix: align the complete example, Step 6 Bacon cross-reference, and evaluation rubric to the canonical mapping, and add an explicit Step 3 check_parallel_trends() / equivalence_test_trends() block before Bacon decomposition and estimation.

Code Quality

  • Severity: P2. Impact: _check_nan_att() only inspects .att and .overall_att at diff_diff/practitioner.py:592, but MultiPeriodDiDResults stores its aggregate effect as .avg_att at diff_diff/results.py:302. A NaN multi-period/event-study average therefore skips the module’s new warning path for one of the advertised supported result types. Concrete fix: check avg_att as well, or centralize a small ordered list of supported effect field names such as ("att", "overall_att", "avg_att").

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. I did not find a TODO.md entry that mitigates the issues above.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the snippet smoke-test harness still only enumerates .rst files at tests/test_doc_snippets.py:25, so none of the new AI-facing docs/llms.txt, docs/llms-full.txt, or docs/llms-practitioner.txt code blocks are exercised in CI. The added practitioner tests also cover NaN warnings only for DiDResults at tests/test_practitioner.py:329, not the MultiPeriodDiDResults path above. Concrete fix: extend the snippet harness to the new .txt guides or move their runnable examples into validated .rst pages, and add a MultiPeriodDiDResults(avg_att=np.nan, ...) regression test.

Validation note: this was a static review only. I could not execute pytest or import diff_diff in this environment because pytest and numpy are unavailable.

Path to Approval

  1. Update the remaining step-numbering drift in docs/llms-practitioner.txt:258, docs/llms-practitioner.txt:392, docs/llms-practitioner.txt:399, and docs/practitioner-guide-evaluation.md:24 so they match the canonical Step 3/4/5 mapping in docs/methodology/REGISTRY.md:2004.
  2. Make the “Complete Example” actually perform Step 3 by adding a pre-trends block (check_parallel_trends() and/or equivalence_test_trends()) before the Bacon decomposition and cs.fit(...) call.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6ca617ea1f269a0a19df15907b85737eccda1851


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Code Quality

  • No unmitigated findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 (tracked in TODO.md). Impact: the prior .txt doc-snippet smoke-test gap is now explicitly tracked at TODO.md:82, so under the review policy it is mitigated and not blocking. Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

  • No unmitigated findings beyond the tracked .txt smoke-test item above.
  • Validation note: this was a static review only; I could not execute the new tests in this environment.

Path to Approval

  1. Replace the load_mpdta() Step 3 block with a pre-trends diagnostic that is actually valid for the chosen data/helper combination. Concretely, either restrict to later-treated vs never-treated units and pass a binary treatment indicator plus explicit multi-period pre-periods, or switch the complete example to a dataset/workflow where check_parallel_trends() can run before estimation.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 528c2410bb8bfa0eee404235715a33bd2c58f93f


Overall Assessment

⚠️ Needs changes

Executive Summary

  • No estimator math, weighting, variance/SE, identification assumptions, or defaults changed in executable estimators; the review surface is the new Baker workflow docs plus practitioner_next_steps().
  • P1: the previous Step 3 issue is not fully resolved. The load_mpdta() example no longer crashes, but it still instructs a contaminated pre-trends test for staggered adoption, and the same generic pattern is emitted by practitioner_next_steps() for staggered result types.
  • The Baker-step renumbering, the cluster-count directive, and the mandatory with/without-covariates convention are explicitly documented in the Methodology Registry, so those deviations are non-blocking under the review policy.
  • P2: the flagship “Complete Example” still omits the guide’s own REQUIRED with/without-covariates comparison.
  • P2: the evaluation writeup changes model mix across the before/after arms, so its “documentation alone” causal claim is not isolated.
  • Static review only: I could not run imports/tests here because numpy and pytest are unavailable in this environment.

Methodology

  • P1 [Previous finding not fully resolved] Impact: the new Step 3 guidance is still methodologically wrong for staggered load_mpdta() analyses. The example now uses treat instead of first_treat, which fixes the old empty-treated-path bug, but check_parallel_trends() explicitly assumes the supplied pre_periods are genuinely pre-treatment and splits a binary 0/1 group. In load_mpdta(), treatment cohorts include 2004, so the example’s pre_periods=[2003, 2004, 2005] include post-treatment observations for the 2004 cohort. That contaminates the treated trend instead of testing a valid pre-trend. The same generic binary-group snippet is reused for staggered result handlers in practitioner_next_steps(), so the runtime guardrail is not actually context-aware for staggered designs. Concrete fix: replace the staggered Step 3 guidance in both the docs and practitioner.py with a valid staggered-design diagnostic, e.g. pre-treatment event-study / ATT(g,t) checks or a later-treated-vs-never-treated restricted sample whose selected years are truly untreated for the “treated” group. Refs: docs/llms-practitioner.txt:392, diff_diff/practitioner.py:137, diff_diff/practitioner.py:266, diff_diff/utils.py:619, diff_diff/datasets.py:651, diff_diff/datasets.py:714, docs/tutorials/02_staggered_did.ipynb:464.
  • P3 Impact: the PR’s Baker-step renumbering and the Step 5 / Step 8 conventions are documented in the Methodology Registry, so they are mitigated and non-defects under the stated policy. Concrete fix: none required. Refs: docs/methodology/REGISTRY.md:2000.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • P3 Impact: the .txt doc-snippet smoke-test gap is now explicitly tracked in TODO.md, so this is mitigated and not blocking. Concrete fix: none required for approval. Refs: TODO.md:82.

Security

No findings.

Documentation/Tests

  • P2 Impact: the guide says Step 8 “Report with and without covariates” is REQUIRED, but the flagship “Complete Example” stops after comparing CS/SA/BJS and never performs the no-covariate re-fit. That leaves the end-to-end example inconsistent with the workflow it tells agents to follow. Concrete fix: extend the complete example with a covariates=None fit and print the with/without comparison, or relax the Step 8 wording if it is not meant to be mandatory. Refs: docs/llms-practitioner.txt:311, docs/llms-practitioner.txt:427.
  • P2 Impact: the empirical evaluation is framed as a before/after documentation effect, but the two arms use different model mixes (1 Opus + 9 Sonnet before vs 10 Sonnet after). That is a confound, so the claimed gain cannot be attributed to documentation changes alone from the evidence shown here. Concrete fix: rerun with identical model allocation across arms, or report model-stratified results and tone down the causal phrasing. Refs: docs/practitioner-guide-evaluation.md:13, docs/practitioner-guide-evaluation.md:142.
  • P3 [Previous finding largely resolved] Impact: the major step-label mismatch in the evaluation doc is mostly fixed, but two residual sentences still call the cluster-count change “Step 4” even though the canonical numbering in the Registry and the rest of the doc use Step 5. Concrete fix: relabel those two references from Step 4 to Step 5. Refs: docs/practitioner-guide-evaluation.md:111, docs/practitioner-guide-evaluation.md:153, docs/methodology/REGISTRY.md:2004.

Path to Approval

  1. Replace the staggered Step 3 guidance in both docs/llms-practitioner.txt and diff_diff/practitioner.py with a valid pre-trends diagnostic for staggered adoption that does not treat already-treated 2004-cohort observations as pre-treatment.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 158f72526543f4376fc4f0c85b539b3842b98e18


Overall Assessment

⚠️ Needs changes

Executive Summary

  • No executable estimator math, weighting, variance/SE formulas, identification checks, or defaults changed; the review surface is the new Baker workflow docs plus practitioner_next_steps().
  • P1 [Previous finding not fully resolved]: the staggered Step 3 fix is only partial. The detailed guide now correctly rejects generic check_parallel_trends() for staggered adoption, but the new discovery header in docs/llms.txt:L20 and the evaluation rubric in docs/practitioner-guide-evaluation.md:L30 still instruct/reward that invalid diagnostic for the staggered load_mpdta() task.
  • The Baker-step renumbering, the cluster-count directive, and the mandatory with/without-covariates convention are explicitly documented in docs/methodology/REGISTRY.md:L2004, so those deviations are non-blocking.
  • The prior full-example covariate-comparison gap, the step-label mismatch, and the .txt smoke-test tracking issue are resolved or mitigated.
  • Static review only: I could not run imports/tests here because numpy and pytest are unavailable in this environment.

Methodology

The cited Baker paper is framed as an organizing framework for DiD designs and estimators rather than a single estimator recipe, so consistency across the short guide, detailed guide, and evaluation rubric is the key methodology check for this PR. (ideas.repec.org)

  • Severity: P1 [Previous finding not fully resolved]. Impact: the detailed practitioner guide now correctly says generic check_parallel_trends() is invalid for staggered adoption because it assumes a single binary treatment with universal pre-periods, as reflected in docs/llms-practitioner.txt:L105 and the underlying helper in diff_diff/utils.py:L611. But the new top-level workflow header still lists check_parallel_trends() / equivalence_test_trends() as Step 3 with no staggered carve-out in docs/llms.txt:L20, and the evaluation doc still gives full Step-3 credit for those calls and says the “after” agents consistently ran them in docs/practitioner-guide-evaluation.md:L30 and docs/practitioner-guide-evaluation.md:L96. That means the PR’s discovery entrypoint and supporting validation still reward exactly the staggered diagnostic the detailed guide now rejects. Concrete fix: split Step 3 in docs/llms.txt into simple-2x2 vs staggered branches, and update the evaluation rubric/claims to score the staggered-safe diagnostic instead of generic check_parallel_trends() / equivalence_test_trends(). If the reported 16/16 runs were scored under the old rubric, rerun or explicitly soften the conclusion.
  • Severity: P3. Impact: the step renumbering, Step-5 cluster-count rule, and Step-8 mandatory with/without-covariates comparison are explicitly documented deviations in docs/methodology/REGISTRY.md:L2004, docs/methodology/REGISTRY.md:L2010, and docs/methodology/REGISTRY.md:L2013, so they are mitigated non-defects under the review policy. Concrete fix: none.

Code Quality

  • Severity: P2. Impact: practitioner_next_steps() is advertised as context-aware, but the shared staggered Step-3 snippet in [_parallel_trends_step()at diff_diff/practitioner.py:L137](/home/runner/work/diff-diff/diff-diff/diff_diff/practitioner.py#L137) tells users to “Fit withaggregate='event_study'or'all'” for every staggered handler. That is invalid for SunAbraham.fit()because it has noaggregateargument in [diff_diff/sun_abraham.py:L452](/home/runner/work/diff-diff/diff-diff/diff_diff/sun_abraham.py#L452), and partially invalid forStackedDiD.fit()becauseaggregate='all'` is rejected in diff_diff/stacked_did.py:L169 and diff_diff/stacked_did.py:L214. This is not an estimator-methodology bug, but it does make the new runtime guidance less than truly estimator-specific. Concrete fix: parameterize staggered Step-3 guidance by estimator family and add tests asserting that each handler’s emitted snippet only mentions supported fit parameters.

Performance

No findings.

Maintainability

No findings beyond the Code Quality issue above.

Tech Debt

  • Severity: P3. Impact: the .txt doc-snippet smoke-test gap is now explicitly tracked in TODO.md:L82, so it is mitigated and non-blocking under the stated policy. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

  • Severity: P2 [Previous finding not resolved]. Impact: the evaluation writeup still changes model mix across arms in docs/practitioner-guide-evaluation.md:L19 while concluding “Documentation alone” drove the improvement in docs/practitioner-guide-evaluation.md:L156. The note that the confound is “minor” does not remove the identification problem: the allocation is still unmatched across arms, so the causal attribution remains unisolated. Concrete fix: rerun with identical model allocation across conditions, or report model-stratified results and soften the causal language.
  • Verification note: static review only; numpy and pytest are unavailable here, so I could not run imports or tests/test_practitioner.py.

Path to Approval

  1. Update docs/llms.txt so Step 3 explicitly distinguishes simple 2x2 designs from staggered adoption, and do not present check_parallel_trends() / equivalence_test_trends() as the staggered default.
  2. Update docs/practitioner-guide-evaluation.md so the Step-3 rubric and “after agents” claims use the corrected staggered-safe diagnostic; if needed, rerun or reframe the 16/16 result under that corrected rubric.

igerber and others added 6 commits March 28, 2026 16:40
Ensure AI agents follow the 8-step empiricist workflow from Baker et al.
(2025) "Difference-in-Differences Designs: A Practitioner's Guide" when
using diff-diff, rather than just calling fit() without diagnostics.

New files:
- docs/llms-practitioner.txt: Full 8-step workflow mapped to diff-diff API
- diff_diff/practitioner.py: practitioner_next_steps() runtime guidance
- tests/test_practitioner.py: 26 tests covering all 13 result types
- docs/practitioner-guide-evaluation.md: Before/after empirical comparison

Empirical validation (30 agent runs): analysis quality improved from
9.4/16 to 16.0/16 (+70%) with zero variance, driven by agents now
running HonestDiD sensitivity analysis, formal parallel trends testing,
3-estimator robustness comparisons, and with/without covariate reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 fixes:
- Relabel workflow as "based on Baker et al. 2025" (not a 1:1 mapping)
- Add REGISTRY.md deviation notes for reorganized steps and diff-diff conventions
- Gate HonestDiD guidance to MultiPeriodDiD and CallawaySantAnna only
  (the only types supported by compute_honest_did)
- Add aggregate='event_study' requirement for CS + HonestDiD

P2 fixes:
- Fix attribute names: pre_treatment_rmse -> pre_treatment_fit,
  sub_experiments -> n_sub_experiments/stacked_data
- Fix EfficientDiD Hausman pretest (estimator method, not results attr)
- Fix equivalence_test_trends param: threshold -> equivalence_margin
- Fix _covariates_step snippet to note .att vs .overall_att difference
- Fix test mocks to use correct attributes (overall_att/overall_se for
  staggered result types)

P3 fix:
- Replace Unicode box-drawing chars with ASCII in decision tree

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1: Align all files to canonical 8-step numbering from llms.txt:
  1-Define, 2-Assumptions, 3-Test PT, 4-Choose estimator,
  5-Estimate (with cluster check), 6-Sensitivity, 7-Heterogeneity,
  8-Robustness. Moved PT testing from Step 2 into separate Step 3
  in llms-practitioner.txt, folded uncertainty into Step 5, updated
  baker_step values in practitioner.py (PT: 2->3, estimator: 3->4).

P2: Bacon handler now checks total_weight_later_vs_earlier > 0.01
  instead of negative weights (matches actual BaconDecompositionResults
  API). EfficientDiD snippet uses actual hausman_pretest() classmethod
  instead of nonexistent run_pretest=True parameter.

Tests: Updated step number assertions, added Bacon warning tests (2)
  and EfficientDiD handler tests (2). Suite now 30 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1: Align complete example and Bacon cross-reference to canonical
numbering (Step 3=Test PT, Step 4=Choose estimator, Step 5=Estimate).
Complete example now executes check_parallel_trends() and prints
cluster count before estimation. Evaluation rubric S3-S5 labels
updated to match.

P2: _check_nan_att() now checks .avg_att (MultiPeriodDiDResults) in
addition to .att and .overall_att. Added regression test.

P2: Added TODO.md entry for extending snippet smoke tests to .txt
AI guides (deferred, low priority).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1: Complete example Step 3 now uses treatment_group='treat' (binary
ever-treated indicator) instead of 'first_treat' (cohort year), with
explicit pre_periods=[2003, 2004, 2005]. The previous version would
produce an empty treated subset since check_parallel_trends() splits
on == 1 / == 0.

P2: Evaluation doc comparison tables and narrative now use canonical
step labels (S3=Test PT, S4=Choose estimator, S5=Estimate) matching
REGISTRY.md and the practitioner guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mple

P1: For staggered designs, replace generic check_parallel_trends() with
CS event-study pre-period inspection. The generic helper assumes a single
binary treatment with universal pre-periods, which is invalid when
cohorts adopt at different times (early-cohort post-treatment observations
contaminate the "pre-trend"). The guide and practitioner_next_steps()
now emit staggered-specific guidance using event_study_effects pre-periods.

P2: Complete example now includes REQUIRED with/without covariates
comparison (was missing despite Step 8 marking it mandatory).

P2: Evaluation doc model-mix caveat added, residual "Step 4" references
changed to "Step 5" to match canonical numbering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the ai-practitioner-guide branch from 158f725 to f3c17ff Compare March 28, 2026 20:40
P1: llms.txt Step 3 now distinguishes simple 2x2 (check_parallel_trends)
from staggered (CS event-study pre-periods). Staggered Step 3 snippet
in practitioner.py no longer references estimator-specific aggregate
parameter that SA/Stacked don't support.

P2: Evaluation rubric S3 updated to score staggered-safe diagnostic.
Softened causal language re: model-mix confound.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: bd8e565013c7f4064a7b01f4d88ee5c1d096ce09


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The previous staggered Step 3 blocker is materially resolved: the discovery guide, practitioner guide, and scoring rubric now distinguish simple 2x2 pre-trends tests from staggered event-study pre-period inspection in docs/llms.txt:L20, docs/llms-practitioner.txt:L105, and docs/practitioner-guide-evaluation.md:L30.
  • P1: practitioner_next_steps() still emits the simple 2x2 check_parallel_trends() workflow for ContinuousDiDResults and TripleDifferenceResults, even though those estimators have different identifying assumptions in docs/methodology/REGISTRY.md:L444 and diff_diff/triple_diff.py:L383.
  • P1: the new Step 6 placebo guidance points staggered and advanced users to run_all_placebo_tests(), but that helper is implemented by refitting the basic DifferenceInDifferences estimator with binary treatment and time inputs rather than the staggered estimators the new handlers/docs target.
  • The Baker-step renumbering, the 50-cluster rule, and the mandatory with/without-covariates convention are explicitly documented in the registry, so those deviations are mitigated non-defects.
  • Static review only: I could not execute imports or tests/test_practitioner.py here because this environment is missing numpy and pytest.

Methodology

Code Quality

Performance

  • No findings.

Maintainability

  • No findings beyond the methodology and code-quality issues above.

Tech Debt

  • Severity: P3. Impact: the .txt doc-snippet smoke-test gap is explicitly tracked in TODO.md:L83, so it is mitigated and non-blocking under the review policy. Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: tests/test_practitioner.py validates dispatch, schema, and a few label/code fragments in tests/test_practitioner.py:L211, tests/test_practitioner.py:L260, and tests/test_practitioner.py:L381, but it never asserts that each handler emits methodology-valid guidance for its estimator family. That is why the invalid Step 3 and Step 6 guidance passes the new suite. Concrete fix: add content tests that fail if ContinuousDiDResults or TripleDifferenceResults emit check_parallel_trends(), and if staggered handlers emit run_all_placebo_tests() without a staggered-aware wrapper.
  • Severity: P3. Impact: the formal evaluation rubric is now corrected, but the qualitative “After agents” summary still says the fixed staggered load_mpdta() task used check_parallel_trends or CS event-study pre-periods in docs/practitioner-guide-evaluation.md:L96. That is no longer a blocker, but it is still imprecise for a single staggered task. Concrete fix: tighten the sentence to the staggered-safe diagnostic actually intended for this task.
  • Verification note: static review only; numpy and pytest are unavailable in this environment, so I could not run tests/test_practitioner.py.

Path to Approval

  1. Replace the generic Step 3 guidance for ContinuousDiDResults and TripleDifferenceResults with estimator-specific diagnostics or explicit “no built-in formal test” caveats that match their actual identifying assumptions.
  2. Remove or re-scope run_all_placebo_tests() from the staggered/advanced handlers and the generic practitioner docs; if placebo guidance should remain for staggered designs, add a staggered-aware API that accepts cohort timing (first_treat) and targets the relevant estimator family.
  3. Add regression tests that fail on both issues above, so the helper cannot reintroduce 2x2 pre-trends or 2x2 placebo guidance into non-2x2 estimator handlers.

P1: ContinuousDiD and TripleDifference handlers now emit honest
"no built-in formal test" guidance instead of the invalid generic
check_parallel_trends() which assumes a single binary treatment.

P1: Staggered handlers now use _placebo_step(staggered=True) which
recommends specification-based falsification (control group/anticipation/
subsample comparisons) instead of run_all_placebo_tests() (which refits
a basic 2x2 DiD and is invalid for staggered designs).

P2: _check_nan_att() now coerces to float() before testing, handling
numpy scalars (np.float64(nan)) in addition to Python float("nan").

Docs updated: llms.txt Step 6 and llms-practitioner.txt placebo section
now note that run_all_placebo_tests is for simple 2x2 only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 51bcf369f763f634eb948cadc52af8db688bb22b


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The previous re-review blockers are materially resolved: the practitioner helper and the main AI-facing docs now scope check_parallel_trends() to simple 2x2 designs and scope run_all_placebo_tests() away from staggered designs in diff_diff/practitioner.py:141, diff_diff/practitioner.py:195, docs/llms-practitioner.txt:105, and docs/llms-practitioner.txt:260.
  • P1: TripleDifferenceResults guidance still misstates the identifying assumption by telling users DDD requires separate parallel-trends checks “along two dimensions,” which conflicts with the library’s own DDD documentation and the cited Baker appendix.
  • P1: SyntheticDiDResults and TROPResults still inherit a staggered-adoption falsification template that tells users to vary control_group, anticipation, and cohorts, even though those are not the relevant APIs for SDID or TROP.
  • P2: tests/test_practitioner.py still validates dispatch/schema only; it would not fail on either of the methodology regressions above.
  • The renumbered Baker steps, the 50-cluster inference rule, and the mandatory with/without-covariates convention are explicitly documented in REGISTRY.md, so those deviations are mitigated non-defects.
  • Static review only: this environment is missing numpy and pytest, so I could not execute imports or run tests/test_practitioner.py.

Methodology

  • Severity: P1. Impact: _handle_triple() still tells users that DDD “requires parallel trends along two dimensions” and to inspect treatment and eligibility trends separately in diff_diff/practitioner.py:540 and diff_diff/practitioner.py:557. That conflicts with the library’s own DDD contract in diff_diff/triple_diff.py:383, which says DDD is weaker than requiring separate parallel trends for two DiDs. Baker et al.’s appendix says the same thing: DDD allows group-specific and partition-specific PT violations and generally cannot be expressed as “two DiDs.” Concrete fix: rewrite the TripleDifference handler to state the documented DDD identifying assumption directly, and replace the separate-dimension PT / placebo language with either a DDD-specific diagnostic or an explicit “no built-in formal test” caveat. (ar5iv.org)
  • Severity: P1. Impact: SyntheticDiDResults and TROPResults both reuse _placebo_step(staggered=True) via diff_diff/practitioner.py:419 and diff_diff/practitioner.py:456, so practitioner_next_steps() tells those users to vary control_group, set anticipation=1, and drop cohorts one-by-one from the template defined in diff_diff/practitioner.py:195. Those are staggered-adoption knobs; they are not the methodological contract of SDID or TROP as documented in docs/methodology/REGISTRY.md:1044, diff_diff/synthetic_did.py:34, diff_diff/synthetic_did.py:183, diff_diff/trop.py:47, and diff_diff/trop.py:391. Concrete fix: give SDID and TROP estimator-specific Step 6 handlers, or at minimum remove control_group/anticipation/cohort language and replace it with checks their actual APIs support.
  • Severity: P3. Impact: none. The step renumbering, explicit cluster-count rule, and mandatory with/without-covariates reporting are all documented deviations in docs/methodology/REGISTRY.md:2120, so they are mitigated and should not block approval. Concrete fix: none. (ar5iv.org)

Code Quality

  • No findings beyond the methodology issues above.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • Severity: P3. Impact: .txt AI-guide smoke tests are still outside CI, but the PR correctly tracks that limitation in TODO.md:81, so it is mitigated and non-blocking. Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests only assert that each handler returns some output; they never assert the methodology-critical content. The SDID, TROP, and DDD tests in tests/test_practitioner.py:251, tests/test_practitioner.py:256, and tests/test_practitioner.py:268 would all still pass with the current P1 guidance bugs. Concrete fix: add negative content assertions that DDD guidance does not describe separate PT requirements, and that SDID/TROP guidance does not mention control_group, anticipation, or staggered-cohort leave-one-out logic.
  • Severity: P3. Impact: the qualitative evaluation summary still says the “after” agents on the staggered load_mpdta() task ran run_all_placebo_tests() and calls CS pre-period inspection a “formal” diagnostic in docs/practitioner-guide-evaluation.md:96 and docs/practitioner-guide-evaluation.md:98. That no longer matches the corrected practitioner guide and overstates what the staggered-safe workflow is. Concrete fix: tighten the summary to the actual staggered-safe Step 3 and Step 6 actions being recommended/scored.

Path to Approval

  1. Rewrite the TripleDifferenceResults guidance in diff_diff/practitioner.py so Step 3 and Step 7 match the documented DDD assumption in diff_diff/triple_diff.py, rather than describing DDD as two separate PT checks.
  2. Split out SyntheticDiDResults and TROPResults Step 6 handlers in diff_diff/practitioner.py so they no longer emit control_group, anticipation, or cohort-drop guidance copied from staggered estimators.
  3. Add regression tests in tests/test_practitioner.py that fail if DDD guidance says “parallel trends along two dimensions,” or if SDID/TROP guidance mentions control_group, anticipation, or run_all_placebo_tests().

Verification note: static review only; numpy and pytest are unavailable in this environment, so I could not run the new test file or import the package.

Eliminate reuse of staggered templates for non-staggered estimators:

- SyntheticDiD/TROP: replace staggered _placebo_step with estimator-
  specific in-time/in-space placebo guidance (no control_group,
  anticipation, or cohort-dropping language)
- TripleDifference: rewrite to state the actual DDD identifying
  assumption (weaker than separate PT for two DiDs), remove the
  "requires PT along two dimensions" mistatement
- ContinuousDiD: already fixed (dose-specific guidance)

Tests: add negative content assertions ensuring SDiD/TROP don't
mention control_group/anticipation, CDiD doesn't emit
check_parallel_trends, and DDD doesn't claim "two dimensions".

Evaluation doc: tighten qualitative summary to match corrected
staggered-safe diagnostics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: bea06a59c6d4e8439573083a9b10853f24d08325


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • Severity: P1. Impact: [Newly identified] the generic staggered falsification text says to vary control_group='never_treated' vs 'not_yet_treated' and reuse that advice across handlers, but that only fits CS/SA-style APIs. ImputationDiD and TwoStageDiD do not expose any control_group parameter at all in diff_diff/imputation.py:L126-L168 and diff_diff/two_stage.py:L132-L164; StackedDiD uses clean_control instead in diff_diff/stacked_did.py:L121-L157 and the registry’s clean-control contract in docs/methodology/REGISTRY.md:L996-L1015; EfficientDiD only allows never_treated or last_cohort, explicitly distinguished from not_yet_treated, in diff_diff/efficient_did.py:L197-L237 and docs/methodology/REGISTRY.md:L685-L687. As written, practitioner_next_steps() and the AI guide can send users toward runtime errors or invalid robustness exercises. Concrete fix: replace the single _placebo_step(staggered=True) template with estimator-specific Step 6 guidance. At minimum: remove control_group advice from Imputation/TwoStage; use clean_control for Stacked; use never_treated vs last_cohort for EfficientDiD; keep never_treated vs not_yet_treated only where the estimator actually supports it.
  • Severity: P3. Impact: none. The step renumbering, explicit cluster-count rule, and mandatory with/without-covariates reporting are all documented deviations in docs/methodology/REGISTRY.md:L2120-L2130, so they are mitigated non-defects. Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: .txt AI-guide snippets remain outside CI smoke validation, but this PR correctly tracks that limitation in TODO.md:L83-L83, so it is mitigated and non-blocking. Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the regression tests still do not pin the new estimator-contract bug down. ImputationDiDResults, TwoStageDiDResults, StackedDiDResults, and EfficientDiDResults only check that guidance exists in tests/test_practitioner.py:L239-L270, so the invalid shared control_group template would pass unchanged. Concrete fix: add handler-specific assertions that ImputationDiD/TwoStageDiD guidance does not mention control_group, StackedDiD uses clean_control rather than control_group, and EfficientDiD does not mention not_yet_treated and instead points to last_cohort where relevant.
  • Severity: P3. Impact: the evaluation document still describes the corrected staggered workflow as “HonestDiD + placebo,” which overstates what the guide now recommends for load_mpdta() in docs/practitioner-guide-evaluation.md:L59-L67. Concrete fix: update those lines to describe HonestDiD plus specification-based falsification / control-definition comparisons, not placebo tests.

Path to Approval

  1. Split the shared staggered Step 6 guidance in diff_diff/practitioner.py:L191-L208 into estimator-specific handlers so each estimator only recommends supported falsification knobs.
  2. Update the generic staggered wording in docs/llms-practitioner.txt:L261-L264 to stop presenting control_group comparisons as valid for all staggered estimators.
  3. Add regression tests covering ImputationDiD, TwoStageDiD, StackedDiD, and EfficientDiD guidance content, not just dispatch presence, in tests/test_practitioner.py:L239-L270.

Replace all remaining _placebo_step(staggered=True) calls with
estimator-specific Step 6 guidance:

- SA: control_group comparison (never_treated vs not_yet_treated)
- ImputationDiD: no control_group param — cohort subsetting + cross-
  estimator comparison
- TwoStageDiD: same as Imputation (no control_group)
- StackedDiD: clean_control variation (not control_group)
- EfficientDiD: never_treated vs last_cohort (not not_yet_treated)

Tests: negative content assertions on code snippets for Imputation,
TwoStage (no control_group in code), Stacked (clean_control),
Efficient (no not_yet_treated in code).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d6b11c09753980e00678812817eee26462784e7


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior handler-level estimator-contract blocker is mostly resolved: the new runtime guidance now distinguishes ImputationDiD, TwoStageDiD, StackedDiD, EfficientDiD, SyntheticDiD, and TROP instead of reusing one staggered template.
  • One P1 remains: the old generic staggered Step 6 advice still survives in the shared _placebo_step(staggered=True) helper and in the new AI guide text, so the fix is only partial.
  • That remaining generic wording is still incompatible with several estimator APIs and can steer agents toward unsupported knobs or invalid falsification exercises.
  • The Baker-guide renumbering, cluster-count rule, and mandatory with/without-covariates reporting are documented in the Methodology Registry, so those deviations are non-defects.
  • The prior evaluation-writeup wording issue appears resolved.
  • Static review only: I could not run tests here because pytest is unavailable, and package import fails in this environment because numpy is missing.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings beyond the methodology blocker above.

Tech Debt

  • Severity: P3. Impact: .txt AI-guide snippets are still outside doc-snippet CI smoke validation, but this PR correctly tracks that limitation in TODO.md, so it is mitigated and non-blocking. Concrete fix: none required for approval. TODO.md:L83-L83

Security

  • No findings.

Documentation/Tests

  • No unmitigated findings beyond the blocker above. The prior handler-specific regression gap is substantially addressed by the new assertions covering ImputationDiD, TwoStageDiD, StackedDiD, EfficientDiD, SyntheticDiD, TROP, ContinuousDiD, and DDD. tests/test_practitioner.py:L239-L282
  • Static review only: python -m pytest -q tests/test_practitioner.py could not run because pytest is not installed, and importing diff_diff in this environment fails because numpy is missing.

Path to Approval

  1. Remove or rewrite the unused shared staggered helper in diff_diff/practitioner.py:L191-L208 so no remaining helper text hard-codes control_group/anticipation as generic staggered advice.
  2. Rewrite Step 6 in docs/llms-practitioner.txt:L260-L264 to describe estimator-supported falsification choices, with explicit family-specific examples: CS/SA (control_group, anticipation), Stacked (clean_control, window width), EfficientDiD (never_treated vs last_cohort), Imputation/TwoStage (cohort-drop, covariate, or cross-estimator comparisons).
  3. If the shared helper is kept, add a regression test that guards its wording directly so this generic staggered control_group pattern cannot be reintroduced outside the estimator-specific handlers.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: f8a3b857930fd62c098b09b438410b821b68e3d7


Overall Assessment

⚠️ Needs changes

The highest unmitigated finding is still a P1: the top-level AI-facing summary in docs/llms.txt:L24 continues to present aggregate='group' / aggregate='event_study' as if they were generic heterogeneity APIs, which remains incorrect for SunAbraham and can still misdirect the exact practitioner workflow this PR is trying to enforce.

Executive Summary

  • The previous Step 7 blocker is only partially resolved: the full guide and runtime helper now split out Sun-Abraham correctly in docs/llms-practitioner.txt:L319 and diff_diff/practitioner.py:L347, but the top-level summary in docs/llms.txt:L24 still does not.
  • That remaining docs/llms.txt summary can still steer agents toward unsupported SA calls, because SunAbraham.fit() has no aggregate parameter and SA heterogeneity lives on event_study_effects, cohort_effects, and to_dataframe(level=...) in diff_diff/sun_abraham.py:L191 and diff_diff/sun_abraham.py:L436.
  • The Baker-step renumbering, separate PT-testing step, cluster-count rule, and required with/without-covariates reporting are explicitly documented deviations in docs/methodology/REGISTRY.md:L2120 and are consistent with the paper’s original Step 3/4/8 structure, so those are non-defects.
  • The new practitioner guide cites arXiv:2503.13323 with the wrong author list in docs/llms-practitioner.txt:L509; the paper is by Andrew Baker, Brantly Callaway, Scott Cunningham, Andrew Goodman-Bacon, and Pedro H. C. Sant’Anna. (ideas.repec.org)
  • The .txt doc-smoke coverage gap is properly tracked in TODO.md:L83, so that item is mitigated and non-blocking.

Methodology

  • Severity: P1. Impact: docs/llms.txt:L24 still says Step 7 heterogeneity is aggregate='group' for cohort effects and aggregate='event_study' for dynamic effects. That remains wrong for SunAbraham, whose public API has no aggregate argument in diff_diff/sun_abraham.py:L436 and exposes heterogeneity through event_study_effects, cohort_effects, and to_dataframe(level='event_study'/'cohort') in diff_diff/sun_abraham.py:L191. Because docs/llms.txt is the first AI-facing summary, this can still send agents to unsupported SA calls before they ever reach the corrected full guide or runtime helper. Concrete fix: rewrite Step 7 in docs/llms.txt:L24 to split SA out explicitly from aggregate-based estimators, or reduce it to “see docs/llms-practitioner.txt for estimator-specific heterogeneity guidance.”
  • Severity: P3. Impact: None. The renumbering of Baker’s workflow, the separate PT-testing step, the Step-4-to-Step-5 uncertainty merge, and the Step-8 robustness/reporting additions are all documented deviations in docs/methodology/REGISTRY.md:L2120, and the paper’s own sequence is indeed Step 3 “justify estimation method,” Step 4 “sources of uncertainty,” and Step 8 “keep learning.” Concrete fix: none.
  • Severity: P2. Impact: The bibliography entry in docs/llms-practitioner.txt:L509 attributes arXiv:2503.13323 to Baker/Larcker/Wang, which is not the cited paper. That weakens source traceability for a PR whose main claim is adherence to source material. Concrete fix: replace the reference entry with the paper’s actual author list. (ideas.repec.org)

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: None for approval. The new .txt AI guides are outside current doc-snippet smoke coverage, but this limitation is explicitly tracked in TODO.md:L83, so it is mitigated. Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The targeted SA regression test requested by the prior review is still effectively missing. tests/test_practitioner.py:L234 only checks dispatch/name for SunAbrahamResults, so it would not fail if aggregate='group' or group_effects reappeared in SA guidance. Concrete fix: extend the SA test to assert that guidance uses to_dataframe(level='event_study') / to_dataframe(level='cohort') from diff_diff/practitioner.py:L347 and does not mention aggregate='group' or group_effects.
  • Static review only: pytest is unavailable in this environment (pytest: command not found), and importing diff_diff fails here because numpy is unavailable.

Path to Approval

  1. Fix the Step 7 summary in docs/llms.txt:L24 so it no longer presents aggregate='group' / aggregate='event_study' as a generic heterogeneity API. Either split out Sun-Abraham explicitly with its real surface (event_study_effects, cohort_effects, to_dataframe(level=...)) or make the summary point readers directly to docs/llms-practitioner.txt:L300 for estimator-specific heterogeneity guidance.

Fixing item 1 would move the assessment to ✅. The remaining P2/P3 items are cleanup, not blockers.

The condensed Step 7 was still presenting aggregate='group' as generic
for all staggered estimators. Now distinguishes CS (aggregate parameter)
from SA (automatic event_study_effects/cohort_effects).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 0857affa272ed8c6050be4b28df1ca28dec7e224


Overall Assessment

⚠️ Needs changes

The previous Sun-Abraham Step 7 blocker is fixed, but the new practitioner_next_steps() helper still has an unmitigated P1 contract/methodology issue: it advertises the Baker 8-step workflow, accepts target_parameter and assumptions as valid public step names, and can print “All Baker et al. steps completed!”, yet none of its handlers can ever emit Step 1 or Step 2 guidance. Baker et al. explicitly make defining target parameters and stating identifying assumptions the first two steps of the workflow. (ar5iv.org)

Executive Summary

  • The previous P1 from the prior review is resolved: the top-level summary in docs/llms.txt:L18-L25 now separates Callaway-Sant’Anna aggregate-based heterogeneity from Sun-Abraham’s event_study_effects / to_dataframe(level='cohort') interface.
  • Severity: P1. The new runtime helper’s public API still claims an 8-step Baker workflow, but its emitted guidance only covers later diagnostic steps; Steps 1 and 2 are unreachable even though they are exposed as valid step names in diff_diff/practitioner.py:L16-L25, advertised in diff_diff/practitioner.py:L49-L110, and completion is overstated in diff_diff/practitioner.py:L796-L820. This is an undocumented mismatch with the source workflow. (ar5iv.org)
  • Severity: P2. The bibliography entry in docs/llms-practitioner.txt:L507-L510 still names the wrong authors for arXiv:2503.13323; the cited paper is by Andrew Baker, Brantly Callaway, Scott Cunningham, Andrew Goodman-Bacon, and Pedro H. C. Sant’Anna. (econpapers.repec.org)
  • The documented deviations called out in docs/methodology/REGISTRY.md:L2116-L2131 remain non-defects: renumbering, separate PT testing, the 50-cluster rule, and “Robustness & Reporting” for Step 8 are all explicitly tracked there.
  • Static review only: python -m pytest tests/test_practitioner.py -q could not run here because pytest is not installed, and importing diff_diff in this environment fails because numpy is missing.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: none for approval. The new .txt AI-guide smoke-test gap is explicitly tracked in TODO.md:L83, so it is mitigated. Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: tests/test_practitioner.py:L234-L237 still only checks Sun-Abraham dispatch/name, so it would not catch a regression back to aggregate-based SA guidance. The file also has no assertion covering the Step 1 / Step 2 contract implied by diff_diff/practitioner.py:L16-L25. Concrete fix: add targeted assertions that SA guidance continues to use to_dataframe(level='event_study'/'cohort') and add a contract test that either Step 1/2 are emitted by default or, if the design is narrowed, that the public API/docs no longer advertise them.
  • Static review only: python -m pytest tests/test_practitioner.py -q failed here because pytest is unavailable, and import diff_diff failed because numpy is unavailable.

Path to Approval

  1. Make practitioner_next_steps() internally consistent with its advertised Baker workflow: either emit Step 1 (target_parameter) and Step 2 (assumptions) guidance by default, or remove those step names from the public API and update every docstring/document/output string that currently promises “remaining steps” / “All Baker et al. steps completed.”

Structural fix: practitioner_next_steps() now always emits Step 1
(define target parameter) and Step 2 (state assumptions) as reminders,
prepended to every handler's output. These are filterable via
completed_steps=['target_parameter', 'assumptions']. This closes the
gap where the helper could report the Baker workflow "complete" while
silently skipping the first two steps.

Fix bibliography: correct author list for arXiv:2503.13323 (Baker,
Callaway, Cunningham, Goodman-Bacon, Sant'Anna).

Tests: added assertions that Steps 1-2 are always emitted and
filterable. SA negative assertion checks code snippets only
(excluding generic Step 2 text). 33 tests now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: e5f10ff678bdbbfd7701276da5e135b163257bf0


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • Severity: P1 [Newly identified]. Impact: the guide says PT-GT-NYT is “recommended for most applications” and repeatedly uses control_group='not_yet_treated' in the lead CS examples, while the paper presents the control-group / PT-assumption choice as context-dependent and explicitly says the analyst should state and justify which assumption is being used; the existing CallawaySantAnna and SunAbraham docstrings still recommend never_treated instead (docs/llms-practitioner.txt:L73-L79, docs/llms-practitioner.txt:L120-L121, docs/llms-practitioner.txt:L347-L357, diff_diff/practitioner.py:L703-L719, diff_diff/staggered.py:L113-L116, diff_diff/sun_abraham.py:L313-L316). Concrete fix: remove the blanket recommendation and present never_treated vs not_yet_treated as a context-specific choice with explicit assumption trade-offs, or document a deliberate library-wide deviation in REGISTRY.md and update the estimator docstrings/examples to match. citeturn0view0
  • Severity: P3. Impact: none for approval. The guide’s renumbering of Baker’s workflow, the separate PT-testing step, the cluster-count convention in Step 5, and the “Robustness & Reporting” rewrite for Step 8 are all explicitly documented in the methodology registry, so they are mitigated deviations rather than defects (docs/methodology/REGISTRY.md:L2116-L2130). Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: none for approval. The new .txt AI guides are still outside the doc-snippet smoke tests, but that limitation is explicitly tracked in TODO.md for this PR (TODO.md:L83-L83). Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Rewrite the practitioner guidance so PT-GT-Nev vs PT-GT-NYT / never_treated vs not_yet_treated is presented as an assumption choice that must be justified, not as a blanket recommendation.
  2. If the project intentionally wants to standardize on not_yet_treated, add that as an explicit methodology note in docs/methodology/REGISTRY.md and update the CallawaySantAnna / SunAbraham docstrings so the AI guide, runtime snippets, and estimator docs all agree.

The guide was recommending PT-GT-NYT / not_yet_treated as the default,
contradicting both Baker et al. (who present it as context-dependent)
and the library's own docstrings (which default to never_treated).

Changed: PT-GT-NYT description no longer says "recommended for most
applications" — now describes the trade-off. All code examples changed
from not_yet_treated to never_treated (matching library default), with
comments noting the choice is context-dependent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7d3f76488d95ac13026a609df64c92c7c06f2091


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • Severity: P1 [Newly identified]. Impact: the new practitioner decision tree and the condensed docs/llms.txt header tell agents that “few treated units” points to SyntheticDiD/TROP, but the project’s own estimator-selection docs and TROP API docs reserve TROP for suspected factor confounding / interactive fixed effects. That can push agents toward the wrong estimator family in Step 4, which is exactly the methodology-selection step this PR is trying to standardize (docs/llms-practitioner.txt:141, docs/llms.txt:21, docs/choosing_estimator.rst:37, docs/choosing_estimator.rst:391, docs/api/trop.rst:18, diff_diff/trop.py:45, docs/llms-full.txt:1453). Concrete fix: split the decision rule so SyntheticDiD is recommended for “few treated units / synthetic control” cases, while TROP is recommended when interactive fixed effects / latent factor confounding is the motivating concern.
  • Severity: P3. Impact: none for approval. The Step 3/PT-test separation, Step 5 cluster-count convention, and Step 8 “Robustness & Reporting” rewrite are documented deviations in the registry, so they are mitigated (docs/methodology/REGISTRY.md:2120). Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: none for approval. The new .txt AI guides are still outside the doc-snippet smoke tests, but that limitation is explicitly tracked in TODO.md for this PR (TODO.md:83). Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Rewrite the Step 4 estimator-selection guidance in docs/llms-practitioner.txt:141 and docs/llms.txt:21 so SyntheticDiD is tied to “few treated units / synthetic control” and TROP is tied to “interactive fixed effects / factor confounding,” consistent with docs/choosing_estimator.rst:37, docs/choosing_estimator.rst:391, and docs/api/trop.rst:18.
  2. If the project intentionally wants a different TROP recommendation, add an explicit methodology note to docs/methodology/REGISTRY.md and update the existing TROP docs (docs/choosing_estimator.rst, docs/api/trop.rst, and any relevant docstrings) so all public guidance agrees.

SDiD is for few treated units (synthetic control hybrid). TROP is for
suspected factor confounding / interactive fixed effects (nuclear norm
regularization). These have different use cases — the previous grouping
under "few treated units" contradicted TROP's own documentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a21845ce91143a808e83e52ec4ed597550a9c0e0


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior Step 4 documentation blocker is resolved: the new guides now separate few treated units → SyntheticDiD from factor confounding → TROP in docs/llms-practitioner.txt:154 and docs/llms.txt:21.
  • No estimator, weighting, variance, or default-behavior code changed in this diff; the remaining blocker is in the new public guidance layer.
  • Severity: P1. The new runtime helper still tells SyntheticDiDResults users to compare against TROP because both address “few treated units” in diff_diff/practitioner.py:523 and diff_diff/practitioner.py:525. That conflicts with the project’s own estimator guide and TROP docstrings, which reserve TROP for factor confounding / interactive fixed effects.
  • The Baker-workflow renumbering, separate PT-testing step, and Step 8 robustness/reporting rewrite are explicitly documented in docs/methodology/REGISTRY.md:2120, so those are mitigated P3s rather than defects. The paper itself is an organizing framework whose original steps include estimation-method selection, uncertainty, heterogeneity, and “keep learning.” (aeaweb.org)
  • The .txt doc-smoke-test gap is properly tracked in TODO.md:83, so it is non-blocking.
  • I could not run python -m pytest tests/test_practitioner.py -q or import diff_diff here because this environment is missing pytest and numpy.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: none for approval. The new AI .txt guides are outside tests/test_doc_snippets.py, but this PR adds that limitation to TODO.md:83, so it is properly tracked. Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new tests do not assert the estimator-selection semantics that mattered in the prior review. tests/test_practitioner.py:281 and tests/test_practitioner.py:291 only check that SDiD/TROP snippets avoid staggered-only knobs, so the contradictory SDiD↔TROP message in diff_diff/practitioner.py:523 was free to slip through. Concrete fix: add targeted assertions that SDiD guidance does not present TROP as the few-treated-units analogue, and that TROP guidance is tied to factor confounding / interactive fixed effects.
  • Validation note: I could not execute the new test file in this environment because pytest is unavailable, and package import fails before test collection because numpy is not installed.

Path to Approval

  1. Update the SyntheticDiD handler in diff_diff/practitioner.py:488 so its robustness guidance no longer says TROP addresses the same “few treated units” setting. Reserve TROP for factor-confounding / interactive-fixed-effects cases, consistent with docs/api/trop.rst:18 and docs/choosing_estimator.rst:391.
  2. Add a regression test in tests/test_practitioner.py:281 that fails if SDiD guidance characterizes TROP as the few-treated-units comparator, and a corresponding TROP assertion that mentions factor confounding / interactive fixed effects.

SDiD robustness step was still suggesting TROP comparison because
"both address few treated units" — but TROP is for factor confounding,
not few treated units. Now suggests comparing with CS/SA instead,
with a note that TROP is a different use case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 8ce444101ad4c17677bf99738adfa9f8d57099b8


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior SDiD↔TROP methodology blocker is resolved in the new guidance text and runtime helper.
  • No estimator math, weighting, variance, or default behavior changed in this diff; the remaining blocker is in the new practitioner-guidance layer.
  • Severity P1: the new guidance is not actually result-type safe. MultiPeriodDiD guidance emits a 2x2 placebo workflow, and the shared snippet uses time='period' with multi-period pre/post lists even though run_all_placebo_tests() routes that time column into DifferenceInDifferences, which requires a binary time indicator.
  • The companion docs example then silently drops failed placebo subtests by filtering to objects with p_value, so agents can miss that permutation / leave-one-out did not run.
  • The Baker-workflow renumbering and Step 5 / Step 8 adaptations are explicitly documented in the methodology registry, so I treated those as mitigated P3s rather than defects.
  • I could not run pytest here because pytest is not installed, and importing diff_diff fails because numpy is missing in this environment.

Methodology

  • Severity: P1
    Impact: The new “context-aware” sensitivity guidance is not conditioned tightly enough on supported result types. MultiPeriodDiDResults gets _placebo_step() in practitioner.py:305, even though the new guide itself says placebo tests are for simple 2x2 designs only in llms-practitioner.txt:270. The shared placebo snippet uses time='period' with multi-period pre_periods / post_periods in practitioner.py:221 and llms-practitioner.txt:286. But run_all_placebo_tests() forwards that time into permutation and leave-one-out refits in diagnostics.py:844 and diagnostics.py:865, and DifferenceInDifferences validates time as binary in estimators.py:223. The same helper family also tells basic DiD users to “use HonestDiD” in practitioner.py:187, even though compute_honest_did() only supports MultiPeriodDiDResults or CallawaySantAnnaResults in honest_did.py:1455. This means the new guidance can direct users/agents into unsupported diagnostics and partial-failure paths.
    Concrete fix: Make the sensitivity guidance strictly result-type safe. Remove the 2x2 placebo recommendation from MultiPeriodDiD guidance unless you first show a valid binary-post workflow, and gate HonestDiD messaging to result types actually supported by compute_honest_did().

  • Severity: P3
    Impact: None for approval. The Baker-guide renumbering, separate PT-testing step, Step 5 inference convention, and Step 8 robustness/reporting rewrite are explicitly documented in REGISTRY.md:2116, so I treated those deviations as mitigated.
    Concrete fix: None.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3
    Impact: None for approval. The .txt AI guides are still outside the doc-snippet smoke tests, but that limitation is already tracked in TODO.md:83.
    Concrete fix: None required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P3
    Impact: The new tests do not cover the blocking guidance mismatch. test_multi_period_results() in test_practitioner.py:234 only asserts that guidance exists, and the new assertions around SDiD/TROP in test_practitioner.py:281 and test_practitioner.py:291 still do not exercise whether emitted diagnostics are valid for the result type. That let the broken placebo path slip through.
    Concrete fix: Add regression tests that fail if MultiPeriodDiD guidance emits the 2x2 placebo snippet, if simple DiD guidance points to unsupported HonestDiD, or if practitioner examples suppress returned placebo error entries.

  • Validation note
    I could not execute tests/test_practitioner.py here because pytest is unavailable, and import diff_diff fails in this environment because numpy is not installed.

Path to Approval

  1. Make practitioner_next_steps() result-type safe for sensitivity guidance: remove or replace the 2x2 placebo recommendation from MultiPeriodDiD output in practitioner.py:305, and stop pointing unsupported result types at HonestDiD from practitioner.py:187.
  2. Rewrite the placebo snippet in practitioner.py:221 and llms-practitioner.txt:286 so it uses a valid binary post indicator or another actually supported workflow.
  3. Update the docs example in llms-practitioner.txt:291 to report returned {"error": ...} entries instead of silently skipping them.
  4. Add focused regression tests in test_practitioner.py:234 covering the invalid placebo path and unsupported HonestDiD recommendation.

run_all_placebo_tests() requires binary time indicator, which
MultiPeriodDiD does not use (it has multi-period time). Removed
the placebo step from _handle_multi_period; HonestDiD (which IS
supported for MultiPeriodDiDResults) remains as the sensitivity tool.

DiD handler already correctly uses _placebo_step() (valid for 2x2
with binary time) and does not suggest HonestDiD (not supported
for DiDResults).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c8658167161f9e40a290f64cc1a979811bef710d


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior MultiPeriodDiD -> placebo blocker is only partially resolved: diff_diff/practitioner.py:L305 now omits placebo guidance for MultiPeriodDiDResults.
  • Two P1 guidance mismatches remain in the new practitioner layer: the basic DiD path still tells users to use unsupported HonestDiD, and the shared placebo example still uses an unsupported multi-period time='period' workflow while hiding returned error entries.
  • No estimator math, weighting, variance/SE formulas, or default estimator behavior changed in this diff; the approval blockers are guidance/API-contract mismatches, not numerical methodology changes.
  • The Baker-workflow renumbering, cluster-count convention, and with/without-covariates requirement are documented in the methodology registry, so I treated those as mitigated P3s rather than defects.
  • The added tests do not cover the two remaining result-type-safety gaps.
  • I could not run pytest here because pytest is not installed.

Methodology

This PR is aligning a guidance layer with Baker et al.’s practitioner framework rather than changing estimator formulas. citeturn5search0turn5search1

  • Severity: P1
    Impact: DiDResults still inherit the generic parallel-trends step via diff_diff/practitioner.py:L281, and that shared step still tells users to “use HonestDiD” at diff_diff/practitioner.py:L190. But compute_honest_did() only accepts MultiPeriodDiDResults or CallawaySantAnnaResults, and otherwise raises a type error at diff_diff/honest_did.py:L561 and diff_diff/honest_did.py:L650. The runtime helper can still steer basic 2x2 users into an unsupported sensitivity workflow.
    Concrete fix: Make the non-staggered PT copy result-type safe: remove the HonestDiD sentence from the generic/basic-DiD step, or gate it so only MultiPeriodDiDResults / CallawaySantAnnaResults receive that recommendation.

  • Severity: P1
    Impact: The shared placebo guidance still reproduces the earlier partial-failure path. The new file itself acknowledges that run_all_placebo_tests() requires a binary time indicator at diff_diff/practitioner.py:L309, but _placebo_step() still shows time='period' with multi-period pre_periods / post_periods at diff_diff/practitioner.py:L231. The practitioner guide repeats the same example and then filters output with hasattr(result, 'p_value') at docs/llms-practitioner.txt:L286 and docs/llms-practitioner.txt:L292. Meanwhile, the permutation and leave-one-out branches pass the supplied time straight into DifferenceInDifferences.fit() at diff_diff/diagnostics.py:L567, diff_diff/diagnostics.py:L590, diff_diff/diagnostics.py:L693, and diff_diff/diagnostics.py:L712, and DifferenceInDifferences validates time as binary at diff_diff/estimators.py:L223. Following the published snippet can therefore suppress failed subtests and leave users/agents thinking the full placebo suite ran.
    Concrete fix: Rewrite the shared snippet in both the runtime helper and the practitioner guide to use a binary post indicator, or switch the example to individual placebo helpers that genuinely support multi-period period. Also print returned error dicts explicitly instead of filtering them out. The existing diagnostics docs already use binary post for permutation/LOO at docs/api/diagnostics.rst:L82 and docs/api/diagnostics.rst:L108.

  • Severity: P3
    Impact: The Baker-workflow renumbering, explicit PT-testing step, cluster-count rule, and “with and without covariates” reporting requirement are explicitly documented in docs/methodology/REGISTRY.md:L2116, so I treated those deviations as mitigated rather than defects.
    Concrete fix: None.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3
    Impact: .txt AI-guide snippets remain outside the doc-snippet smoke tests, but that limitation is already tracked in TODO.md:L83.
    Concrete fix: None required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P3
    Impact: The added TestResultTypeDispatch coverage in tests/test_practitioner.py:L229 checks several handler-specific strings, but there is still no regression that fails when the DiD path mentions unsupported HonestDiD or when the practitioner-guide placebo example suppresses returned error entries.
    Concrete fix: Add focused regressions for both cases: one on practitioner_next_steps(DiDResults) and one on the published placebo example/output handling.

  • Validation note
    I could not execute python -m pytest tests/test_practitioner.py -q because pytest is not installed in this environment.

Path to Approval

  1. Remove or gate the HonestDiD recommendation on the basic-DiD path so practitioner_next_steps(DiDResults) never points to an unsupported API.
  2. Replace the shared run_all_placebo_tests() example in diff_diff/practitioner.py:L221 and docs/llms-practitioner.txt:L282 with a binary post workflow or with supported multi-period placebo helpers.
  3. Update the practitioner-guide example loop so it reports returned {"error": ...} entries instead of silently skipping them.
  4. Add regression tests covering items 1-3.

P1: Non-staggered parallel_trends_step now qualifies HonestDiD as
available "For MultiPeriodDiD or CS results" instead of blanket advice.
DiDResults users won't be steered toward unsupported compute_honest_did.

P1: Placebo snippet now uses binary time='post' with pre_periods=[0],
post_periods=[1] matching what run_all_placebo_tests() actually needs
(it refits DifferenceInDifferences which validates binary time). Also
shows error-dict handling in the guide instead of silently filtering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: b662a41805325cf77cb1c7f3b6cf5e76aa484025


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • Severity: P1
    Impact: practitioner_next_steps(DiDResults) is still not fully result-type-safe. The basic 2x2 path inherits copy saying “For MultiPeriodDiD or CS results, use HonestDiD,” but the runtime helper is supposed to be context-aware for the actual result object it receives. For basic DiD users, this still points at an unsupported sensitivity workflow and keeps the prior blocker alive: diff_diff/practitioner.py:L167-L201, diff_diff/practitioner.py:L283-L286, diff_diff/honest_did.py:L650-L653.
    Concrete fix: split _parallel_trends_step() into a basic-DiD version and an event-study version, or add a supports_honest_did flag so the DiDResults handler never mentions HonestDiD at all.

  • Severity: P3
    Impact: The workflow renumbering and diff-diff-specific reporting conventions are documented deviations, not silent methodology drift: docs/methodology/REGISTRY.md:L2114-L2131.
    Concrete fix: None.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3
    Impact: .txt AI-guide snippets remain outside doc-snippet smoke tests, but this PR explicitly tracks that limitation in TODO.md:L74-L83, so it is mitigated and not a blocker.
    Concrete fix: None required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P3
    Impact: The new test suite checks many handler-specific API mismatches, but it still does not fail when practitioner_next_steps(DiDResults) mentions unsupported HonestDiD guidance. That leaves the remaining P1 unguarded: tests/test_practitioner.py:L229-L320.
    Concrete fix: add a regression asserting that the DiDResults path contains no HonestDiD/compute_honest_did recommendation in its emitted steps.

  • Validation note: I could not execute pytest here because pytest is not installed, and I also could not import the package for a runtime spot-check because this environment lacks numpy.

Path to Approval

  1. Make the basic DiDResults guidance fully result-type-safe by removing the HonestDiD recommendation from the non-staggered parallel-trends step, or by gating that text to result types actually accepted by compute_honest_did().
  2. Add a regression test on practitioner_next_steps(did_results) that asserts the emitted steps do not mention HonestDiD or compute_honest_did.

Basic DiD handler now has its own PT step that doesn't mention
HonestDiD (which DiDResults doesn't support). The shared
_parallel_trends_step() is now only used by MultiPeriodDiD
(where HonestDiD IS supported) and the generic fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c74e7ace8b9d4e99b0c9559821f5045bbabb798d


Overall Assessment

✅ Looks good

This re-review resolves the prior P1. The PR adds a guidance/documentation layer and a new helper, but it does not change estimator formulas, weighting, variance/SE computation, identification checks, or defaults.

Executive Summary

  • The prior blocker is fixed: DiDResults now has a dedicated 2x2 handler that no longer mentions HonestDiD, while the new docs/runtime guidance scope HonestDiD to supported result types and keep placebo tests on the simple 2x2 path (diff_diff/practitioner.py:L283-L316, docs/llms-practitioner.txt:L250-L296, diff_diff/honest_did.py:L616-L652, diff_diff/diagnostics.py:L766-L845).
  • Methodology-wise, this is a workflow/adaptation layer, not an estimator-math change. The cited Baker paper is an organizing framework for DiD designs and estimators, and the PR’s renumbering/reporting changes are explicitly documented as diff-diff-specific adaptations in the registry, so I treat them as mitigated P3s rather than defects (docs/methodology/REGISTRY.md:L2116-L2131). (aeaweb.org)
  • The new TODO.md entry properly tracks the .txt doc-snippet smoke-test gap, so that limitation is mitigated and non-blocking (TODO.md:L74-L83).
  • Residual gap: the new test suite still does not add a direct regression asserting that practitioner_next_steps(DiDResults) emits no HonestDiD / compute_honest_did guidance; coverage for DiDResults currently only checks that some steps are returned (tests/test_practitioner.py:L229-L233).

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3
    Impact: The new .txt AI guides are still outside the existing doc-snippet smoke tests, but this PR explicitly tracks that limitation in TODO.md, so it is mitigated and not a blocker (TODO.md:L74-L83).
    Concrete fix: None required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity: P3
    Impact: The suite covers many result-type-specific API mismatches, but it still does not pin the exact previously-blocking regression that practitioner_next_steps(DiDResults) must not mention HonestDiD or compute_honest_did; current DiDResults coverage only asserts that guidance exists (tests/test_practitioner.py:L229-L233). That leaves the old failure mode unguarded against future refactors, even though the implementation is currently correct (diff_diff/practitioner.py:L283-L316).
    Concrete fix: Add a targeted regression that calls practitioner_next_steps(did_results, verbose=False) and asserts that neither the emitted labels nor code snippets contain HonestDiD or compute_honest_did.

  • Severity: P3
    Impact: practitioner_next_steps is now a top-level public API export, but it is not added to the standard Sphinx API index, so the regular API docs surface adjacent diagnostics/sensitivity helpers but not this new entry point (diff_diff/init.py:L171-L171, diff_diff/init.py:L348-L349, docs/api/index.rst:L82-L125).
    Concrete fix: Add diff_diff.practitioner_next_steps to the autosummary/API docs and include a short usage example.

Replace the original adherence-only rubric (which inflated before scores
by giving credit for running the wrong diagnostic) with a correctness
rubric that penalizes methodologically invalid choices.

Before scores drop from 9.4 to 8.4 (generic check_parallel_trends on
staggered data scored as 1, not 2). Final scores: 15.55/16.
Net improvement increases from +65% to +85%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber merged commit 8f92bc9 into main Mar 29, 2026
14 checks passed
@igerber igerber deleted the ai-practitioner-guide branch March 29, 2026 01:12
@igerber igerber mentioned this pull request Mar 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant