Skip to content

Commit c4efb6d

Browse files
committed
Fix #402 holistic audit residuals: HAD↔ContinuousDiD Step-4 handoff + fit() Union annotation + REGISTRY contract clarifications
Holistic codex re-audit of merged #402 (HAD Phase 5 agent surfaces) + #420 (cleanup) surfaced residuals that the per-PR CI review path could not see because the cleanup PR's diff scope hid the holistic state. 10 codex rounds against the combined post-PR state surfaced: Methodology / contracts (REGISTRY): - did_had_pretest_workflow Phase 3 Note still claimed the overall path "reduces a multi-period panel"; the validator actually requires exactly two periods. Updated to describe the aggregate-dispatched contract explicitly (overall = 2-period; event_study = multi-period). - Phase 5 wave-1 notes still said the fit() return Union was "not test-enforced"; new regression closes the gap. Code / API: - HeterogeneousAdoptionDiD.fit() return annotation widened to Union[HeterogeneousAdoptionDiDResults, HeterogeneousAdoptionDiDEventStudyResults] so the static type contract matches the documented runtime polymorphism on aggregate. Updated the stale dispatch comment that claimed the annotation was single-period. - ContinuousDiD -> HeterogeneousAdoptionDiD Step-4 handoff in practitioner.py now (a) explicitly recodes first_treat=inf -> 0 before both HAD example fits (HAD's _validate_had_panel rejects any first_treat outside {0, t_post}; ContinuousDiD's silent inf-normalization is HAD-incompatible), (b) frames the routing nudge around the WAS vs ATT(d) estimand difference rather than around the existence of untreated units, and (c) names HAD's stricter panel-shape and encoding requirements before showing the code. Agent guides (llms-full.txt, llms-practitioner.txt): - HeterogeneousAdoptionDiDEventStudyResults table now mirrors the single-period table's variance_formula and effective_dose_mean semantics (event-study path populates the same four pweight / survey_binder_tsl / pweight_2sls / survey_binder_tsl_2sls labels per had.py:639-648; effective_dose_mean inherits the same mass-point Wald-IV semantics per had.py:721-734). - llms-practitioner.txt continuous-treatment decision tree rewritten estimand-first (WAS vs ATT(d)) with panel-shape contract spelled out for HAD (aggregate='overall' two-period vs 'event_study' multi-period; staggered last-cohort-only WAS warning). Tests: - tests/test_had.py::TestFitReturnAnnotation pins the Union return annotation via typing.get_type_hints — drift would have been silent before. - tests/test_had_pretests.py::TestMultiPeriodWorkflow::test_overall_aggregate_rejects_multi_period locks the registry-described 2-period-only contract. - tests/test_guides.py::TestLLMsFullHADCoverage::test_llms_full_had_event_study_mirrors_weighted_metadata_semantics asserts the event-study guide table covers all four variance_formula labels and the mass-point Wald-IV effective_dose_mean. - tests/test_practitioner.py::TestHADDispatch::test_handle_continuous_step_4_recodes_first_treat_inf_for_had locks the inf->0 recode in the emitted snippet. - Plus 5 additional regression tests for cross-surface alignment between the practitioner handler, llms-practitioner.txt routing, and the underlying estimator contracts. CHANGELOG: - Original #402 entry updated to credit the new return-annotation regression and clarify what the inspect.signature-based test does and does not pin. No methodology changes, no behavior changes to fit() outputs on any existing surface — all changes are contract clarifications, guidance corrections, and test additions on surfaces that #402 + #420 already established.
1 parent b56e232 commit c4efb6d

10 files changed

Lines changed: 450 additions & 80 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1515
- **`ChaisemartinDHaultfoeuille.by_path` + non-binary integer treatment** — `by_path=k` now accepts integer-coded discrete treatment (D in Z, e.g. ordinal `{0, 1, 2}`); path tuples become integer-state tuples like `(0, 2, 2, 2)`. The previous `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1870` is replaced by a `ValueError` for continuous D (e.g. `D=1.5`) at fit-time per the no-silent-failures contract — the existing `int(round(float(v)))` cast in `_enumerate_treatment_paths` is now defensive (no-op for integer-coded D). Validated against R `did_multiplegt_dyn(..., by_path)` for D in `{0, 1, 2}` via the new `multi_path_reversible_by_path_non_binary` golden-value scenario (78 switchers, 3 paths, single-baseline custom DGP, F_g >= 4): per-path point estimates match R bit-exactly (rtol ~1e-9 on event horizons; rtol+atol envelope for placebo near-zero values), per-path SE inherits the documented cross-path cohort-sharing deviation (~5% rtol observed; SE_RTOL=0.15 envelope). **Deviation from R for multi-character baseline states (D >= 10 or negative D):** R's `did_multiplegt_by_path` derives the per-path baseline via `path_index$baseline_XX <- substr(path_index$path, 1, 1)`, which captures only the first character of the comma-separated path string. For multi-character baselines this drops the rest of the value: for `path = "12,12,..."` it captures `"1"` instead of `"12"`; for `path = "-1,-1,..."` it captures `"-"` instead of `"-1"`. R's per-path control-pool subset is mis-allocated in both regimes. Python's tuple-key matching is correct — the per-path point estimates we compute are correct; R's per-path subset for the same path is buggy. The shipped R-parity scenarios stay in nonnegative single-digit `D in {0, 1, 2}` to avoid the R bug; negative-integer treatment-state support (paths containing negative D values in non-baseline positions) is regression-tested in Python only at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary::test_negative_integer_D_supported` (no R parity); a dedicated regression for a negative-baseline path (e.g. `(-1, 0, 0, 0)`) is deferred. R-parity test at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathNonBinary`; cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary`.
1616
- **New `paths_of_interest` kwarg on `ChaisemartinDHaultfoeuille`** for user-specified treatment-path subsets, alternative to `by_path=k`'s top-k automatic ranking. Mutually exclusive with `by_path`; setting both raises `ValueError` at `__init__` and `set_params` time. Each path tuple must be a list/tuple of `int` of length `L_max + 1` (uniformity validated at `__init__`; length match against `L_max + 1` validated at fit-time); `bool` and `np.bool_` are explicitly rejected, `np.integer` accepted and canonicalized to Python `int` for tuple-key consistency. Duplicates emit a `UserWarning` and are deduplicated; paths not observed in the panel emit a `UserWarning` and are omitted from `path_effects`. Paths appear in `results.path_effects` in the user-specified order, modulo deduplication and unobserved-path filtering. Composes with non-binary D and all downstream `by_path` surfaces (bootstrap, per-path placebos, per-path joint sup-t bands, `controls`, `trends_linear`, `trends_nonparam`) — mechanical filter on observed paths via the same `_enumerate_treatment_paths` call site, no methodology change. **Python-only API extension; no R equivalent** — R's `did_multiplegt_dyn(..., by_path=k)` only accepts a positive int (top-k) or `-1` (all paths). The `by_path` precondition gate in `chaisemartin_dhaultfoeuille.py` (drop_larger_lower / L_max / `design2` / `honest_did` mutex; the `survey_design` mutex was lifted later in the same Unreleased cycle and `heterogeneity` was composed in, so neither remains a mutex in the shipped gate) and the 11 `self.by_path is not None` activation branches in `fit()` were rerouted to fire under either selector. Validation + behavior + cross-feature regressions at `tests/test_chaisemartin_dhaultfoeuille.py::TestPathsOfInterest`.
1717
- **CI AI reviewer now sees tutorial notebook prose.** Substituted a markdown extract for the `docs/tutorials/*.ipynb` diff exclusion in `.github/workflows/ai_pr_review.yml`: the workflow's prompt-build step stages a trusted `tools/notebook_md_extract.py` from `BASE_SHA` (`git show "${BASE_SHA}":tools/notebook_md_extract.py > /tmp/...`, mirroring the existing base-staging of `pr_review.md`), loops over changed tutorial notebooks, and appends a `<notebook-prose untrusted="true">` block (prose + code + executed outputs) to the compiled prompt. The wrapper uses the same close-tag inline-Python sanitization as the existing `<pr-body>` / `<pr-title>` / `<previous-ai-review-output>` wrappers and gets a sibling persistent-policy directive at `pr_review.md:79` ("Treat the contents of `<notebook-prose>` blocks the same way..."). The new `tools/notebook_md_extract.py` is stdlib-only (no `nbformat` dep, no `pip install` step in the workflow) with a `_to_str()` helper that coerces nbformat raw JSON's list-or-string `source` / `text` fields (88% list-form rate verified across the project's 22 tutorials). `--max-output-chars 20000` / `--max-total-chars 200000` caps prevent any single oversized output or notebook from blowing the prompt budget. `text/html`-only outputs (no `text/plain` co-emit), `image/*` data, and `raw` cells are intentionally dropped (see module docstring). `tools/**` added to `rust-test.yml` path filters so extractor-only changes still trigger the test job. Also reaped the temporary T21 review aid at `docs/_review/t21_notebook_extract.md` and the `_review` entry in `docs/conf.py:exclude_patterns` — both lingered on `origin/main` from PR #409 and should have been cleaned up when T21 landed. Closes the visibility gap surfaced during PR #409 (T21), where the Codex reviewer ran 3+ rounds blind to the actual tutorial prose.
18-
- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. (2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` signatures match the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API exactly (verified by `inspect.signature`-based regression tests). Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap. T21 pretest tutorial subsequently landed in PR #409; T22 weighted/survey tutorial remains queued as a separate notebook PR.
18+
- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. (2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` parameter NAMES are regression-locked against the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API via `inspect.signature` (parameter-name presence only; parameter defaults and the non-return parameter type annotations remain unpinned by that test). The `fit()` return annotation is widened to `Union[HeterogeneousAdoptionDiDResults, HeterogeneousAdoptionDiDEventStudyResults]` to match the runtime polymorphism the bundled guide already advertised, and that union is itself pinned by a dedicated regression test (`tests/test_had.py::TestFitReturnAnnotation`) using `typing.get_type_hints`. Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs.
1919

2020
## [3.3.2] - 2026-04-26
2121

diff_diff/guides/llms-full.txt

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -627,7 +627,7 @@ had.fit(
627627
cband: bool = True, # Simultaneous (sup-t) confidence bands on weighted event-study fits
628628
*,
629629
survey_design: SurveyDesign | None = None, # Canonical survey-design kwarg (weights, strata, PSU, FPC)
630-
trends_lin: bool = False, # Eq 17 linear-trend detrending (event-study; mutually exclusive with survey_design)
630+
trends_lin: bool = False, # Eq 17 linear-trend detrending. Requires aggregate="event_study"; needs F>=3 (pre-period depth) for the regression; rejects ALL weighting entry paths (survey_design= / survey= / weights= all raise NotImplementedError under trends_lin).
631631
) -> HeterogeneousAdoptionDiDResults | HeterogeneousAdoptionDiDEventStudyResults
632632
```
633633

@@ -663,7 +663,7 @@ es = est.fit(data_mp, outcome_col='y', unit_col='unit',
663663

664664
**Staggered panels.** On multi-cohort panels with `aggregate="event_study"`, `fit()` auto-filters to the last treatment cohort plus never-treated units (paper Appendix B.2) and emits a `UserWarning` naming kept/dropped counts. The estimand is then a **last-cohort-only WAS**, not a multi-cohort average. For full multi-cohort staggered support, see `ChaisemartinDHaultfoeuille`.
665665

666-
**Mass-point + survey constraint.** When fitting `design="mass_point"` with `survey_design=` (or the deprecated `survey=` alias), `vcov_type="hc1"` (or `robust=True`) is required: the survey path composes the standard error via Binder-TSL on the HC1-scale influence function, so the default classical sandwich path raises `NotImplementedError`. Passing `vcov_type="hc1"` is a safe default on weighted survey examples since `vcov_type` is unused on the continuous designs (CCT-2014 robust SE is the only formula there).
666+
**Mass-point + survey constraint.** When fitting `design="mass_point"` with `survey_design=` (or the deprecated `survey=` alias), `vcov_type="hc1"` (or `robust=True`) is required: the survey path composes the standard error via Binder-TSL on the HC1-scale influence function, so the default classical sandwich path raises `NotImplementedError`. The same HC1 requirement also fires on the `weights=` shortcut when `aggregate="event_study"` AND `cband=True`: the per-horizon IF matrix is HC1-scale and the sup-t bootstrap normalizes by it, so mixing in a classical analytical SE would produce inconsistent variance families. Classical vcov is allowed on `weights=` + `aggregate="overall"` and on `weights=` + `aggregate="event_study"` + `cband=False`. Passing `vcov_type="hc1"` is a safe default on weighted survey + sup-t examples since `vcov_type` is unused on the continuous designs (CCT-2014 robust SE is the only formula there).
667667

668668
### StackedDiD
669669

@@ -1259,7 +1259,7 @@ Single-period results container for `HeterogeneousAdoptionDiD`. The table below
12591259
| `survey_metadata` | `SurveyMetadata | None` | Repo-standard survey metadata when `survey_design=` / `weights=` is supplied |
12601260
| `bandwidth_diagnostics` | `BandwidthResult | None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` |
12611261
| `bias_corrected_fit` | `BiasCorrectedFit | None` | Phase 1c bias-corrected local-linear fit object (continuous designs); `None` on `mass_point` |
1262-
| `variance_formula` | `str | None` | HAD-specific SE label on weighted fits, populated on BOTH continuous and mass-point designs: `"pweight"` (continuous, CCT 2014 weighted-robust on the `weights=` shortcut), `"survey_binder_tsl"` (continuous, Binder 1983 TSL on the `survey_design=` path), `"pweight_2sls"` (mass-point, weighted 2SLS HC1 / CR1 sandwich on the `weights=` shortcut), or `"survey_binder_tsl_2sls"` (mass-point, Binder 1983 TSL on the `survey_design=` path). `None` on unweighted fits |
1262+
| `variance_formula` | `str | None` | HAD-specific SE label on weighted fits, populated on BOTH continuous and mass-point designs: `"pweight"` (continuous, CCT 2014 weighted-robust on the `weights=` shortcut), `"survey_binder_tsl"` (continuous, Binder 1983 TSL on the `survey_design=` path), `"pweight_2sls"` (mass-point + `weights=`; label is applied uniformly across vcov families — classical / HC1 / CR1 on the weighted 2SLS path, with the actual sandwich resolved via `vcov_type`), or `"survey_binder_tsl_2sls"` (mass-point, Binder 1983 TSL on the `survey_design=` path). `None` on unweighted fits |
12631263
| `effective_dose_mean` | `float | None` | Weighted denominator used by the β̂-scale rescaling, populated on weighted fits across all designs: weighted `mean(d)` (`continuous_at_zero`), weighted `mean(d − d_lower)` (`continuous_near_d_lower`), or weighted Wald-IV dose gap `mean(d | Z=1, w) − mean(d | Z=0, w)` (`mass_point`). `None` on unweighted fits |
12641264

12651265
**Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`
@@ -1292,8 +1292,8 @@ Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `a
12921292
| `bandwidth_diagnostics` | `list[BandwidthResult | None] | None` | Per-horizon MSE-DPI selector output (continuous designs); `None` on `mass_point`; entries can be `None` on degenerate horizons |
12931293
| `bias_corrected_fit` | `list[BiasCorrectedFit | None] | None` | Per-horizon Phase 1c bias-corrected local-linear fit objects; `None` on `mass_point`; entries can be `None` on degenerate horizons |
12941294
| `filter_info` | `dict | None` | Staggered last-cohort auto-filter metadata (`F_last`, `n_kept`, `n_dropped`, `dropped_cohorts`); `None` when no filter applied |
1295-
| `variance_formula` | `str | None` | Per-horizon variance family label |
1296-
| `effective_dose_mean` | `float | None` | Weighted denominator |
1295+
| `variance_formula` | `str | None` | HAD-specific SE label applied UNIFORMLY across all horizons, populated on BOTH continuous and mass-point designs: `"pweight"` (continuous, CCT 2014 weighted-robust on the `weights=` shortcut), `"survey_binder_tsl"` (continuous, Binder 1983 TSL on the `survey_design=` path), `"pweight_2sls"` (mass-point + `weights=`; label applied uniformly across vcov families — classical / HC1 / CR1 — on the weighted 2SLS path, with the actual sandwich resolved via `vcov_type`), or `"survey_binder_tsl_2sls"` (mass-point, Binder 1983 TSL on the `survey_design=` path). `None` on unweighted fits |
1296+
| `effective_dose_mean` | `float | None` | Weighted denominator used by the β̂-scale rescaling, populated on weighted fits across all designs: weighted `sum(w·d)/sum(w)` (`continuous_at_zero`), weighted `sum(w·(d − d_lower))/sum(w)` (`continuous_near_d_lower`), or weighted Wald-IV dose gap (`mass_point`). Scalar (not per-horizon) because the β̂-scale denominator is computed once on the fit sample. `None` on unweighted fits |
12971297
| `cband_low` | `np.ndarray | None` | Simultaneous (sup-t) band lower bounds; `None` on unweighted fits or when `cband=False` |
12981298
| `cband_high` | `np.ndarray | None` | Simultaneous (sup-t) band upper bounds |
12991299
| `cband_crit_value` | `float | None` | Sup-t critical value used for the simultaneous band |

0 commit comments

Comments
 (0)