Address PR #402 R2 review (1 P1, 1 P2)

igerber · claude · igerber · commit ed2842a83881 · 2026-05-09T11:09:38.000-04:00
P1 pretest assumption labels: _handle_had step-3 + llms-full.txt HAD
Pretests section + qug_test/stute_test bullets misstated which paper
assumption each shipped test actually tests:
- qug_test was labeled "Assumption 5 support condition", but QUG tests
  H_0: d_lower = 0 (paper Theorem 4 / step 1 of the workflow).
  Assumption 5 is the Design 1 sign-identification condition and is
  NOT testable via pre-trends per REGISTRY.md:2270.
- stute_test was labeled "Assumption 7 mean-independence", but
  stute_test is the Assumption 8 linearity test (paper Section 4.2
  step 3 / Appendix D). Assumption 7 is pre-trends (step 2).
- did_had_pretest_workflow(aggregate="overall") was implied to cover
  step 2, but the workflow runs steps 1 + 3 only. Step 2 is explicitly
  not covered on the overall path (had_pretests.py:4434-4441), and the
  workflow's verdict flags the gap.

Rewrote both surfaces to match the actual contracts: QUG = paper
Theorem 4 support-infimum test (step 1, decides Design 1' vs Design 1);
Stute / Yatchew-HR = Assumption 8 linearity tests (step 3); closing
step 2 (Assumption 7) requires aggregate="event_study" (joint Stute
pre-trends). The Assumption 7 / step 2 gap is explicitly flagged on the
overall path so agents do not assume coverage where there is none.
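
For illustration only, the coverage contract above can be written as a
lookup. This is a sketch, not the library's implementation: STEP_LABELS,
STEP_COVERAGE and uncovered_steps are hypothetical names that do not
exist in diff_diff.

```python
# Hypothetical sketch of the paper Section 4.2 step coverage per path.
STEP_LABELS = {
    1: "QUG support-infimum test (Theorem 4)",
    2: "Assumption 7 pre-trends",
    3: "Assumption 8 linearity",
}

# Steps the workflow runs for each aggregate= value, per this commit.
STEP_COVERAGE = {
    "overall": (1, 3),
    "event_study": (1, 2, 3),
}


def uncovered_steps(aggregate):
    """Return labels of the steps NOT run on the given aggregate path."""
    covered = STEP_COVERAGE[aggregate]
    return [label for step, label in STEP_LABELS.items() if step not in covered]


gap_overall = uncovered_steps("overall")
gap_event_study = uncovered_steps("event_study")
```

On the overall path the only uncovered entry is the Assumption 7
pre-trends step; on the event-study path the list is empty.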

P2 result-class field tables incomplete: HeterogeneousAdoptionDiDResults
table was missing n_mass_point, n_above_d_lower, cluster_name,
bias_corrected_fit, variance_formula, effective_dose_mean.
HeterogeneousAdoptionDiDEventStudyResults table was missing vcov_type,
cluster_name, bandwidth_diagnostics, bias_corrected_fit, filter_info.
Added all missing fields with correct types and descriptions.
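
A minimal sketch of how omissions like these can be detected
mechanically, assuming a markdown field table where each attribute name
appears in backticks. ExampleResults, DOC_TABLE and missing_fields are
hypothetical stand-ins, not the real result classes or guide text.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class ExampleResults:
    """Hypothetical stand-in for a result dataclass."""

    n_obs: int
    cluster_name: Optional[str] = None
    variance_formula: Optional[str] = None


# Hypothetical documented table that has drifted: one field is missing.
DOC_TABLE = """\
| Attribute | Type | Description |
|-----------|------|-------------|
| `n_obs` | `int` | Units contributing to estimation |
| `cluster_name` | `str | None` | Cluster column name |
"""


def missing_fields(cls, doc_block):
    """Return dataclass field names that never appear backticked in doc_block."""
    return [
        field.name
        for field in dataclasses.fields(cls)
        if f"`{field.name}`" not in doc_block
    ]


missing = missing_fields(ExampleResults, DOC_TABLE)
```

Here missing holds the single undocumented field, variance_formula.
Matching on the backticked name keeps the check independent of column
order and description wording.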

Tests added (3 new, 86 total):
- test_llms_full_had_results_class_field_lists_match_real_dataclass:
  uses dataclasses.fields() to enumerate every public field on both
  result classes and assert each appears in the documented table.
  Catches future drift where new fields land but the guide is not
  updated.
- test_llms_full_had_pretests_assumption_labels_correct: scans the
  qug_test and stute_test bullets in the HAD Pretests section and
  enforces positive labels (support-infimum / Theorem 4 / linearity)
  + forbids positive Assumption-5 / Assumption-7 misclaims (negative
  disclaimers like "QUG does NOT test Assumption 5" remain allowed).
- test_had_step_3_pretest_assumption_labels_correct: same checks on
  the practitioner.py _handle_had step-3 why-text; also requires
  positive acknowledgment of the Assumption 7 / step 2 gap on the
  overall workflow path.
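
The positive-label / forbidden-phrase pattern used by the two label
tests can be sketched as follows. This is an illustrative reduction, not
the shipped test code: check_qug_bullet and the two sample bullets are
hypothetical, and the phrase tuples are a subset of those in the diff.

```python
# At least one of these must appear: positive statement of what QUG tests.
REQUIRED_ANY = ("support-infimum", "Theorem 4", "H_0: d_lower")

# None of these may appear: positive claims that QUG tests Assumption 5.
FORBIDDEN = ("Assumption 5 support condition", "QUG (Assumption 5")


def check_qug_bullet(line):
    """Return a list of problems found in one qug_test guide bullet."""
    problems = []
    if not any(phrase in line for phrase in REQUIRED_ANY):
        problems.append("missing positive label")
    problems += [f"forbidden phrase: {p}" for p in FORBIDDEN if p in line]
    return problems


# A negative disclaimer mentions Assumption 5 but matches no forbidden phrase.
good = (
    "- `qug_test(d)`: paper Theorem 4 support-infimum test. "
    "The QUG itself does NOT test Assumption 5."
)
bad = "- `qug_test(d)`: Assumption 5 support condition."

good_problems = check_qug_bullet(good)
bad_problems = check_qug_bullet(bad)
```

Matching on full phrases rather than on the bare string "Assumption 5"
is what lets negative disclaimers pass while the original mislabel
fails.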

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt
@@ -1226,7 +1226,7 @@ Each event study effect dict contains: `effect`, `se`, `t_stat`, `p_value`, `con
 
 ### HeterogeneousAdoptionDiDResults
 
-Single-period results container for `HeterogeneousAdoptionDiD`.
+Single-period results container for `HeterogeneousAdoptionDiD`. The table below enumerates every public dataclass field; a regression test in `tests/test_guides.py` (`test_llms_full_had_results_class_field_lists_match_real_dataclass`) compares this list against the real `dataclasses.fields()` of the result class.
 
 | Attribute | Type | Description |
 |-----------|------|-------------|
@@ -1243,16 +1243,22 @@ Single-period results container for `HeterogeneousAdoptionDiD`.
 | `n_obs` | `int` | Units contributing to estimation |
 | `n_treated` | `int` | Units with `D > d_lower` |
 | `n_control` | `int` | Units at or below `d_lower` |
+| `n_mass_point` | `int | None` | Mass-point design only: units exactly at `d_lower`; `None` on continuous designs |
+| `n_above_d_lower` | `int | None` | Mass-point design only: units strictly above `d_lower`; `None` on continuous designs |
 | `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` |
 | `vcov_type` | `str | None` | Mass-point only: `"classical"`, `"hc1"`, or `"cr1"` |
-| `bandwidth_diagnostics` | `BandwidthResult | None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` |
+| `cluster_name` | `str | None` | Cluster column name when CR1 cluster-robust SE is requested; `None` otherwise |
 | `survey_metadata` | `SurveyMetadata | None` | Repo-standard survey metadata when `survey_design=` / `weights=` is supplied |
+| `bandwidth_diagnostics` | `BandwidthResult | None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` |
+| `bias_corrected_fit` | `BiasCorrectedFit | None` | Phase 1c bias-corrected local-linear fit object (continuous designs); `None` on `mass_point` |
+| `variance_formula` | `str | None` | HAD-specific SE label on the weighted continuous path: `"pweight"` (CCT 2014 weighted-robust) or `"survey_binder_tsl"` (Binder 1983); `None` on unweighted / mass-point fits |
+| `effective_dose_mean` | `float | None` | Weighted denominator used by the β̂-scale rescaling on the weighted continuous path; `None` on unweighted fits |
 
 **Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`
 
 ### HeterogeneousAdoptionDiDEventStudyResults
 
-Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `aggregate="event_study"`. The anchor horizon `e = -1` is excluded by construction.
+Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `aggregate="event_study"`. The anchor horizon `e = -1` is excluded by construction. The table below enumerates every public dataclass field; a regression test (`test_llms_full_had_results_class_field_lists_match_real_dataclass`) compares this list against the real `dataclasses.fields()`.
 
 | Attribute | Type | Description |
 |-----------|------|-------------|
@@ -1263,11 +1269,6 @@ Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `a
 | `p_value` | `np.ndarray` | Per-horizon p-values |
 | `conf_int_low` | `np.ndarray` | Pointwise CI lower bounds |
 | `conf_int_high` | `np.ndarray` | Pointwise CI upper bounds |
-| `cband_low` | `np.ndarray | None` | Simultaneous (sup-t) band lower bounds; `None` on unweighted fits or when `cband=False` |
-| `cband_high` | `np.ndarray | None` | Simultaneous (sup-t) band upper bounds |
-| `cband_crit_value` | `float | None` | Sup-t critical value used for the simultaneous band |
-| `cband_method` | `str | None` | `"multiplier_bootstrap"` when populated |
-| `cband_n_bootstrap` | `int | None` | Bootstrap iterations used for the band |
 | `n_obs_per_horizon` | `np.ndarray` | Per-horizon contributing-unit counts |
 | `alpha` | `float` | CI level used at fit time |
 | `design` | `str` | Shared across horizons (paper Appendix B.2 invariant) |
@@ -1277,9 +1278,19 @@ Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `a
 | `F` | `object` | First-treatment period label |
 | `n_units` | `int` | Unique units contributing to the fit (post last-cohort filter) |
 | `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` |
+| `vcov_type` | `str | None` | Mass-point only: `"classical"`, `"hc1"`, or `"cr1"`; `None` on continuous designs |
+| `cluster_name` | `str | None` | Cluster column name when CR1 is requested; `None` otherwise |
 | `survey_metadata` | `SurveyMetadata | None` | Populated on weighted fits |
+| `bandwidth_diagnostics` | `list[BandwidthResult | None] | None` | Per-horizon MSE-DPI selector output (continuous designs); `None` on `mass_point`; entries can be `None` on degenerate horizons |
+| `bias_corrected_fit` | `list[BiasCorrectedFit | None] | None` | Per-horizon Phase 1c bias-corrected local-linear fit objects; `None` on `mass_point`; entries can be `None` on degenerate horizons |
+| `filter_info` | `dict | None` | Staggered last-cohort auto-filter metadata (`F_last`, `n_kept`, `n_dropped`, `dropped_cohorts`); `None` when no filter applied |
 | `variance_formula` | `str | None` | Per-horizon variance family label |
 | `effective_dose_mean` | `float | None` | Weighted denominator |
+| `cband_low` | `np.ndarray | None` | Simultaneous (sup-t) band lower bounds; `None` on unweighted fits or when `cband=False` |
+| `cband_high` | `np.ndarray | None` | Simultaneous (sup-t) band upper bounds |
+| `cband_crit_value` | `float | None` | Sup-t critical value used for the simultaneous band |
+| `cband_method` | `str | None` | `"multiplier_bootstrap"` when populated |
+| `cband_n_bootstrap` | `int | None` | Bootstrap iterations used for the band |
 
 **Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`
 
@@ -1393,7 +1404,7 @@ results = did.fit(data, outcome='y', treatment='treated', time='post')
 
 ## HAD Pretests
 
-Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal.
+Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal. The workflow follows paper Section 4.2's three-step battery: **step 1** is the QUG support-infimum test (decides whether Design 1' or Design 1 applies); **step 2** is the Assumption 7 pre-trends test (joint Stute on the event-study path; explicitly NOT covered on the overall path because a single-pre-period panel cannot support the joint variant); **step 3** is the Assumption 8 linearity test (`stute_test` or `yatchew_hr_test`). On the default `aggregate="overall"` path the workflow runs steps 1 + 3 only and the returned `verdict` flags the Assumption 7 gap; pass `aggregate="event_study"` on a multi-period panel to close that gap.
 
 ```python
 from diff_diff import (
@@ -1402,24 +1413,29 @@ from diff_diff import (
     stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test,
 )
 
-# Composite workflow — bundles QUG + Stute + Yatchew per the paper's three-step battery
+# Composite workflow:
+#   aggregate="overall"      -> steps 1 + 3 (QUG + Assumption 8 linearity)
+#                               step 2 (Assumption 7 pre-trends) NOT covered;
+#                               verdict explicitly flags this gap.
+#   aggregate="event_study"  -> steps 1 + 2 + 3 (QUG + joint Stute pre-trends +
+#                               joint homogeneity-linearity Stute) on multi-period panels.
 report = did_had_pretest_workflow(
     data, outcome_col='y', unit_col='unit', time_col='t',
     dose_col='d', first_treat_col='first_treat',
-    aggregate='overall',  # or 'event_study' for joint Stute on multi-period panels
+    aggregate='overall',
     survey_design=None)   # SurveyDesign for survey-aware pretests (Phase 4.5 C)
 print(report.summary())
 print(report.all_pass, report.verdict)
 ```
 
 Individual tests:
 
-- `qug_test(d)` — Assumption 5 support condition. Extreme order statistics, Exp(1)/Exp(1) limit law. **Permanently rejects** non-`None` `survey_design=` / `weights=` (`NotImplementedError`) per Phase 4.5 C0 deferral — extreme-value functionals are not smooth in the empirical CDF, so standard survey machinery does not yield a calibrated test.
-- `stute_test(d, dy)` — Assumption 7 mean-independence of trends via Cramér-von Mises functional with Mammen wild bootstrap. Survey-aware via PSU-level Mammen multiplier bootstrap.
-- `yatchew_hr_test(d, dy, *, null="linearity")` — Assumption 8 linearity of `E[ΔY|D]` via Yatchew (1997) heteroskedasticity-robust variance-ratio test. The `null="mean_independence"` mode (R `YatchewTest::yatchew_test(order=0)`) is also exposed for placebo-style mean-independence testing. Survey-aware via closed-form weighted variance components (no bootstrap).
+- `qug_test(d)` — paper Theorem 4 support-infimum test (`H_0: d_lower = 0`; the QUG decides whether Design 1' or Design 1 applies in step 1 of the workflow). Extreme order statistics, Exp(1)/Exp(1) limit law. The QUG itself does NOT test Assumption 5 (which is the Design 1 sign-identification condition and is not testable via pre-trends per registry). **Permanently rejects** non-`None` `survey_design=` / `weights=` (`NotImplementedError`) per Phase 4.5 C0 deferral — extreme-value functionals are not smooth in the empirical CDF, so standard survey machinery does not yield a calibrated test.
+- `stute_test(d, dy)` — Assumption 8 linearity of `E[ΔY|D]` (paper Section 4.2 step 3) via Stute Cramér-von Mises functional with Mammen wild bootstrap. Survey-aware via PSU-level Mammen multiplier bootstrap.
+- `yatchew_hr_test(d, dy, *, null="linearity")` — Assumption 8 linearity of `E[ΔY|D]` (alternative test for step 3) via Yatchew (1997) heteroskedasticity-robust variance-ratio test. The `null="mean_independence"` mode (R `YatchewTest::yatchew_test(order=0)`) is also exposed for placebo-style mean-independence testing. Survey-aware via closed-form weighted variance components (no bootstrap).
 - `stute_joint_pretest(residuals_dict, d)` — joint Cramér-von Mises across K horizons with shared-η Mammen wild bootstrap (Delgado-Manteiga 2001 / Hlávka-Hušková 2020). Residuals-in core; the two data-in wrappers below construct residuals for the two paper-spelled nulls.
-- `joint_pretrends_test(...)` — joint pre-trends on K pre-periods (paper Section 4.2 step 2 closure on the event-study path).
-- `joint_homogeneity_test(...)` — joint linearity-and-homogeneity on K post-periods.
+- `joint_pretrends_test(...)` — Assumption 7 joint pre-trends on K pre-periods (paper Section 4.2 step 2 closure on the event-study path).
+- `joint_homogeneity_test(...)` — joint linearity-and-homogeneity on K post-periods (event-study step 3 alternative).
 
 The QUG-under-survey deferral is permanent; the linearity-family pretests support `survey_design=` (pweight, PSU, FPC) per Phase 4.5 C. Stratified designs and replicate-weight designs are deferred to follow-up PRs.
 
diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py
@@ -857,20 +857,29 @@ def _handle_had(results: Any):
             baker_step=3,
             label="Run the HAD pretest battery",
             why=(
-                "did_had_pretest_workflow bundles the paper's three "
-                "testable identifying assumptions: QUG (Assumption 5 "
-                "support condition), Stute (Assumption 7 mean-independence "
-                "of trends), and Yatchew-HR (Assumption 8 linearity of "
-                "E[ΔY|D]). Assumption 5/6 boundary continuity is not "
-                "testable - the workflow vets what can be vetted."
+                "On a two-period panel did_had_pretest_workflow runs "
+                "paper Section 4.2 step 1 (QUG support-infimum test - "
+                "decides Design 1' vs Design 1) and step 3 (Stute / "
+                "Yatchew-HR Assumption 8 linearity tests). Step 2 "
+                "(Assumption 7 pre-trends) is NOT covered on the overall "
+                "path - a single pre-period cannot support the joint "
+                "Stute variant - and the returned verdict explicitly "
+                "flags that gap. To close step 2, refit on a multi-period "
+                "panel with aggregate='event_study'. Assumptions 3 / 5 / 6 "
+                "(uniform continuity at the boundary, Design 1 sign / "
+                "WAS_d_lower identification) are NOT testable via "
+                "pre-trends - the workflow vets only what can be vetted."
             ),
             code=(
                 "from diff_diff import did_had_pretest_workflow\n"
                 "report = did_had_pretest_workflow(\n"
                 "    data, outcome_col='y', unit_col='unit',\n"
                 "    time_col='t', dose_col='d',\n"
                 "    first_treat_col='first_treat')\n"
-                "print(report.summary())"
+                "print(report.summary())\n"
+                "# verdict explicitly flags the Assumption 7 gap on the\n"
+                "# overall path; aggregate='event_study' on a multi-period\n"
+                "# panel adds joint Stute pre-trends + joint homogeneity-linearity."
             ),
             step_name="parallel_trends",
         ),
diff --git a/tests/test_guides.py b/tests/test_guides.py
@@ -413,3 +413,110 @@ def test_llms_full_paper_citation(self):
         had_end = text.index("### StackedDiD", had_start)
         had_text = text[had_start:had_end]
         assert "D'Haultfœuille" in had_text
+
+    def test_llms_full_had_results_class_field_lists_match_real_dataclass(self):
+        # Every public dataclass field on HeterogeneousAdoptionDiDResults
+        # and HeterogeneousAdoptionDiDEventStudyResults must appear in the
+        # documented field table. Catches the failure mode where new
+        # result fields land but the guide isn't updated, so agents
+        # treating llms-full.txt as the authoritative surface miss
+        # available diagnostics / metadata.
+        import dataclasses
+
+        from diff_diff import (
+            HeterogeneousAdoptionDiDEventStudyResults,
+            HeterogeneousAdoptionDiDResults,
+        )
+
+        text = get_llm_guide("full")
+
+        # Single-period result class
+        sp_start = text.index("### HeterogeneousAdoptionDiDResults")
+        sp_end = text.index("### HeterogeneousAdoptionDiDEventStudyResults", sp_start)
+        sp_block = text[sp_start:sp_end]
+        for field in dataclasses.fields(HeterogeneousAdoptionDiDResults):
+            assert f"`{field.name}`" in sp_block, (
+                f"HeterogeneousAdoptionDiDResults guide block is missing "
+                f"the public dataclass field {field.name!r}. The table "
+                f"must enumerate every field so agents see all available "
+                f"diagnostics / metadata."
+            )
+
+        # Event-study result class
+        es_start = text.index("### HeterogeneousAdoptionDiDEventStudyResults")
+        es_end = text.index("### TROPResults", es_start)
+        es_block = text[es_start:es_end]
+        for field in dataclasses.fields(HeterogeneousAdoptionDiDEventStudyResults):
+            assert f"`{field.name}`" in es_block, (
+                f"HeterogeneousAdoptionDiDEventStudyResults guide block "
+                f"is missing the public dataclass field {field.name!r}."
+            )
+
+    def test_llms_full_had_pretests_assumption_labels_correct(self):
+        # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD
+        # § "Assumptions / Theorems / Estimators":
+        #   - Assumption 5 = Design 1 sign identification (NOT testable)
+        #   - Assumption 6 = Design 1 WAS_d_lower identification (NOT testable)
+        #   - Assumption 7 = pre-trends (paper Section 4.2 step 2)
+        #   - Assumption 8 = linearity (paper Section 4.2 step 3)
+        # The HAD Pretests section must NOT mislabel these:
+        #   - qug_test is the support-infimum test (H0: d_lower = 0),
+        #     NOT "Assumption 5" (which is non-testable per registry).
+        #   - stute_test is Assumption 8 (linearity), NOT Assumption 7.
+        text = get_llm_guide("full")
+        pretests_start = text.index("## HAD Pretests")
+        pretests_end = text.index("## Honest DiD", pretests_start)
+        pretests_block = text[pretests_start:pretests_end]
+        # qug_test bullet: must positively label QUG as a support-infimum
+        # test, NOT as a positive "Assumption 5 support condition" claim
+        # (a negative disclaimer "does NOT test Assumption 5" is fine).
+        forbidden_qug_positive_claims = (
+            "Assumption 5 support condition",
+            "QUG (Assumption 5",
+            "qug_test`) — Assumption 5",
+            "qug_test(d)` — Assumption 5",
+        )
+        # stute_test bullet: must positively label as Assumption 8
+        # linearity, NOT as Assumption 7 mean-independence.
+        forbidden_stute_positive_claims = (
+            "stute_test(d, dy)` — Assumption 7",
+            "Stute (Assumption 7",
+            "Assumption 7 mean-independence",
+        )
+        for line in pretests_block.splitlines():
+            if line.startswith("- `qug_test"):
+                # Positive claim of what QUG IS:
+                assert (
+                    "support-infimum" in line
+                    or "support infimum" in line
+                    or "Theorem 4" in line
+                    or "H_0: d_lower" in line
+                ), (
+                    f"qug_test bullet must positively label QUG as the "
+                    f"support-infimum / Theorem-4 test. Line: {line!r}"
+                )
+                for phrase in forbidden_qug_positive_claims:
+                    assert phrase not in line, (
+                        f"qug_test bullet must not positively claim QUG "
+                        f"is an 'Assumption 5' test ({phrase!r}). QUG "
+                        f"tests H_0: d_lower = 0; Assumption 5 is the "
+                        f"Design 1 sign-identification condition (NOT "
+                        f"testable per registry). A negative disclaimer "
+                        f"that QUG does NOT test Assumption 5 is fine. "
+                        f"Line: {line!r}"
+                    )
+            if line.startswith("- `stute_test"):
+                # Positive claim of what Stute IS:
+                assert "Assumption 8" in line or "linearity" in line.lower(), (
+                    f"stute_test bullet must positively label as "
+                    f"Assumption 8 / linearity test. Line: {line!r}"
+                )
+                for phrase in forbidden_stute_positive_claims:
+                    assert phrase not in line, (
+                        f"stute_test bullet must not positively claim "
+                        f"Stute is an Assumption 7 mean-independence "
+                        f"test ({phrase!r}). stute_test is Assumption 8 "
+                        f"linearity (paper Section 4.2 step 3); "
+                        f"Assumption 7 is pre-trends (step 2, only "
+                        f"covered on the event-study path). Line: {line!r}"
+                    )
diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py