
Commit ed2842a

igerber and claude committed
Address PR #402 R2 review (1 P1, 1 P2)
P1 pretest assumption labels: _handle_had step-3 + llms-full.txt HAD Pretests section + qug_test/stute_test bullets misstated which paper assumption each shipped test actually tests:

- qug_test was labeled "Assumption 5 support condition", but QUG tests H_0: d_lower = 0 (paper Theorem 4 / step 1 of the workflow). Assumption 5 is the Design 1 sign-identification condition and is NOT testable via pre-trends per REGISTRY.md:2270.
- stute_test was labeled "Assumption 7 mean-independence", but stute_test is the Assumption 8 linearity test (paper Section 4.2 step 3 / Appendix D). Assumption 7 is pre-trends (step 2).
- did_had_pretest_workflow(aggregate="overall") was implied to cover step 2, but the workflow runs steps 1 + 3 only - step 2 is explicitly not covered on the overall path (had_pretests.py:4434-4441 + the workflow's verdict flags the gap).

Rewrote both surfaces to match the actual contracts: QUG = paper Theorem 4 support-infimum test (step 1, decides Design 1' vs Design 1); Stute / Yatchew-HR = Assumption 8 linearity tests (step 3); Assumption 7 step 2 closure requires aggregate="event_study" (joint Stute pre-trends). Assumption 7 / step 2 gap is explicitly flagged on the overall path so agents do not assume coverage where there is none.

P2 result-class field tables incomplete: HeterogeneousAdoptionDiDResults table was missing n_mass_point, n_above_d_lower, cluster_name, bias_corrected_fit, variance_formula, effective_dose_mean. HeterogeneousAdoptionDiDEventStudyResults table was missing vcov_type, cluster_name, bandwidth_diagnostics, bias_corrected_fit, filter_info. Added all missing fields with correct types and descriptions.

Tests added (3 new, 86 total):

- test_llms_full_had_results_class_field_lists_match_real_dataclass: uses dataclasses.fields() to enumerate every public field on both result classes and assert each appears in the documented table. Catches future drift where new fields land but the guide is not updated.
- test_llms_full_had_pretests_assumption_labels_correct: scans the qug_test and stute_test bullets in the HAD Pretests section and enforces positive labels (support-infimum / Theorem 4 / linearity) + forbids positive Assumption-5 / Assumption-7 misclaims (negative disclaimers like "QUG does NOT test Assumption 5" remain allowed).
- test_had_step_3_pretest_assumption_labels_correct: same checks on the practitioner.py _handle_had step-3 why-text; also requires positive acknowledgment of the Assumption 7 / step 2 gap on the overall workflow path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 6d6e950 commit ed2842a

4 files changed

Lines changed: 201 additions & 23 deletions
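
The field-drift check described in the commit message can be illustrated with a minimal, self-contained sketch. Everything below is hypothetical: `ToyResults`, the `GUIDE` string, and the `missing_fields` helper are stand-ins invented for illustration and are not part of `diff_diff`. The only real API used is the standard-library `dataclasses.fields()`.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class ToyResults:
    """Hypothetical stand-in for a results dataclass (not from diff_diff)."""

    att: float
    se: float
    cluster_name: Optional[str] = None


# Hypothetical guide excerpt; `cluster_name` is deliberately left undocumented.
GUIDE = """\
### ToyResults

| Attribute | Type | Description |
|-----------|------|-------------|
| `att` | `float` | Point estimate |
| `se` | `float` | Standard error |

### NextSection
"""


def missing_fields(cls, guide, heading, next_heading):
    """Return dataclass field names that are not backtick-quoted in the section."""
    start = guide.index(heading)
    end = guide.index(next_heading, start)
    block = guide[start:end]
    return [f.name for f in dataclasses.fields(cls) if f"`{f.name}`" not in block]


print(missing_fields(ToyResults, GUIDE, "### ToyResults", "### NextSection"))
# -> ['cluster_name']
```

Slicing the guide between two headings keeps a field name that happens to appear in some other section from masking a gap in this one.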

diff_diff/guides/llms-full.txt

Lines changed: 32 additions & 16 deletions
@@ -1226,7 +1226,7 @@ Each event study effect dict contains: `effect`, `se`, `t_stat`, `p_value`, `con
 
 ### HeterogeneousAdoptionDiDResults
 
-Single-period results container for `HeterogeneousAdoptionDiD`.
+Single-period results container for `HeterogeneousAdoptionDiD`. The table below enumerates every public dataclass field; a regression test in `tests/test_guides.py` (`test_llms_full_had_results_class_field_lists_match_real_dataclass`) compares this list against the real `dataclasses.fields()` of the result class.
 
 | Attribute | Type | Description |
 |-----------|------|-------------|
@@ -1243,16 +1243,22 @@ Single-period results container for `HeterogeneousAdoptionDiD`.
 | `n_obs` | `int` | Units contributing to estimation |
 | `n_treated` | `int` | Units with `D > d_lower` |
 | `n_control` | `int` | Units at or below `d_lower` |
+| `n_mass_point` | `int | None` | Mass-point design only: units exactly at `d_lower`; `None` on continuous designs |
+| `n_above_d_lower` | `int | None` | Mass-point design only: units strictly above `d_lower`; `None` on continuous designs |
 | `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` |
 | `vcov_type` | `str | None` | Mass-point only: `"classical"`, `"hc1"`, or `"cr1"` |
-| `bandwidth_diagnostics` | `BandwidthResult | None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` |
+| `cluster_name` | `str | None` | Cluster column name when CR1 cluster-robust SE is requested; `None` otherwise |
 | `survey_metadata` | `SurveyMetadata | None` | Repo-standard survey metadata when `survey_design=` / `weights=` is supplied |
+| `bandwidth_diagnostics` | `BandwidthResult | None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` |
+| `bias_corrected_fit` | `BiasCorrectedFit | None` | Phase 1c bias-corrected local-linear fit object (continuous designs); `None` on `mass_point` |
+| `variance_formula` | `str | None` | HAD-specific SE label on the weighted continuous path: `"pweight"` (CCT 2014 weighted-robust) or `"survey_binder_tsl"` (Binder 1983); `None` on unweighted / mass-point fits |
+| `effective_dose_mean` | `float | None` | Weighted denominator used by the β̂-scale rescaling on the weighted continuous path; `None` on unweighted fits |
 
 **Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`
 
 ### HeterogeneousAdoptionDiDEventStudyResults
 
-Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `aggregate="event_study"`. The anchor horizon `e = -1` is excluded by construction.
+Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `aggregate="event_study"`. The anchor horizon `e = -1` is excluded by construction. The table below enumerates every public dataclass field; a regression test (`test_llms_full_had_results_class_field_lists_match_real_dataclass`) compares this list against the real `dataclasses.fields()`.
 
 | Attribute | Type | Description |
 |-----------|------|-------------|
@@ -1263,11 +1269,6 @@ Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `a
 | `p_value` | `np.ndarray` | Per-horizon p-values |
 | `conf_int_low` | `np.ndarray` | Pointwise CI lower bounds |
 | `conf_int_high` | `np.ndarray` | Pointwise CI upper bounds |
-| `cband_low` | `np.ndarray | None` | Simultaneous (sup-t) band lower bounds; `None` on unweighted fits or when `cband=False` |
-| `cband_high` | `np.ndarray | None` | Simultaneous (sup-t) band upper bounds |
-| `cband_crit_value` | `float | None` | Sup-t critical value used for the simultaneous band |
-| `cband_method` | `str | None` | `"multiplier_bootstrap"` when populated |
-| `cband_n_bootstrap` | `int | None` | Bootstrap iterations used for the band |
 | `n_obs_per_horizon` | `np.ndarray` | Per-horizon contributing-unit counts |
 | `alpha` | `float` | CI level used at fit time |
 | `design` | `str` | Shared across horizons (paper Appendix B.2 invariant) |
@@ -1277,9 +1278,19 @@ Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `a
 | `F` | `object` | First-treatment period label |
 | `n_units` | `int` | Unique units contributing to the fit (post last-cohort filter) |
 | `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` |
+| `vcov_type` | `str | None` | Mass-point only: `"classical"`, `"hc1"`, or `"cr1"`; `None` on continuous designs |
+| `cluster_name` | `str | None` | Cluster column name when CR1 is requested; `None` otherwise |
 | `survey_metadata` | `SurveyMetadata | None` | Populated on weighted fits |
+| `bandwidth_diagnostics` | `list[BandwidthResult | None] | None` | Per-horizon MSE-DPI selector output (continuous designs); `None` on `mass_point`; entries can be `None` on degenerate horizons |
+| `bias_corrected_fit` | `list[BiasCorrectedFit | None] | None` | Per-horizon Phase 1c bias-corrected local-linear fit objects; `None` on `mass_point`; entries can be `None` on degenerate horizons |
+| `filter_info` | `dict | None` | Staggered last-cohort auto-filter metadata (`F_last`, `n_kept`, `n_dropped`, `dropped_cohorts`); `None` when no filter applied |
 | `variance_formula` | `str | None` | Per-horizon variance family label |
 | `effective_dose_mean` | `float | None` | Weighted denominator |
+| `cband_low` | `np.ndarray | None` | Simultaneous (sup-t) band lower bounds; `None` on unweighted fits or when `cband=False` |
+| `cband_high` | `np.ndarray | None` | Simultaneous (sup-t) band upper bounds |
+| `cband_crit_value` | `float | None` | Sup-t critical value used for the simultaneous band |
+| `cband_method` | `str | None` | `"multiplier_bootstrap"` when populated |
+| `cband_n_bootstrap` | `int | None` | Bootstrap iterations used for the band |
 
 **Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`
 
@@ -1393,7 +1404,7 @@ results = did.fit(data, outcome='y', treatment='treated', time='post')
 
 ## HAD Pretests
 
-Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal.
+Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal. The workflow follows paper Section 4.2's three-step battery: **step 1** is the QUG support-infimum test (decides whether Design 1' or Design 1 applies); **step 2** is the Assumption 7 pre-trends test (joint Stute on the event-study path; explicitly NOT covered on the overall path because a single-pre-period panel cannot support the joint variant); **step 3** is the Assumption 8 linearity test (`stute_test` or `yatchew_hr_test`). On the default `aggregate="overall"` path the workflow runs steps 1 + 3 only and the returned `verdict` flags the Assumption 7 gap; pass `aggregate="event_study"` on a multi-period panel to close that gap.
 
 ```python
 from diff_diff import (
@@ -1402,24 +1413,29 @@ from diff_diff import (
     stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test,
 )
 
-# Composite workflow — bundles QUG + Stute + Yatchew per the paper's three-step battery
+# Composite workflow:
+#   aggregate="overall"     -> steps 1 + 3 (QUG + Assumption 8 linearity)
+#                              step 2 (Assumption 7 pre-trends) NOT covered;
+#                              verdict explicitly flags this gap.
+#   aggregate="event_study" -> steps 1 + 2 + 3 (QUG + joint Stute pre-trends +
+#                              joint homogeneity-linearity Stute) on multi-period panels.
 report = did_had_pretest_workflow(
     data, outcome_col='y', unit_col='unit', time_col='t',
     dose_col='d', first_treat_col='first_treat',
-    aggregate='overall', # or 'event_study' for joint Stute on multi-period panels
+    aggregate='overall',
     survey_design=None) # SurveyDesign for survey-aware pretests (Phase 4.5 C)
 print(report.summary())
 print(report.all_pass, report.verdict)
 ```
 
 Individual tests:
 
-- `qug_test(d)` — Assumption 5 support condition. Extreme order statistics, Exp(1)/Exp(1) limit law. **Permanently rejects** non-`None` `survey_design=` / `weights=` (`NotImplementedError`) per Phase 4.5 C0 deferral — extreme-value functionals are not smooth in the empirical CDF, so standard survey machinery does not yield a calibrated test.
-- `stute_test(d, dy)` — Assumption 7 mean-independence of trends via Cramér-von Mises functional with Mammen wild bootstrap. Survey-aware via PSU-level Mammen multiplier bootstrap.
-- `yatchew_hr_test(d, dy, *, null="linearity")` — Assumption 8 linearity of `E[ΔY|D]` via Yatchew (1997) heteroskedasticity-robust variance-ratio test. The `null="mean_independence"` mode (R `YatchewTest::yatchew_test(order=0)`) is also exposed for placebo-style mean-independence testing. Survey-aware via closed-form weighted variance components (no bootstrap).
+- `qug_test(d)` — paper Theorem 4 support-infimum test (`H_0: d_lower = 0`; the QUG decides whether Design 1' or Design 1 applies in step 1 of the workflow). Extreme order statistics, Exp(1)/Exp(1) limit law. The QUG itself does NOT test Assumption 5 (which is the Design 1 sign-identification condition and is not testable via pre-trends per registry). **Permanently rejects** non-`None` `survey_design=` / `weights=` (`NotImplementedError`) per Phase 4.5 C0 deferral — extreme-value functionals are not smooth in the empirical CDF, so standard survey machinery does not yield a calibrated test.
+- `stute_test(d, dy)` — Assumption 8 linearity of `E[ΔY|D]` (paper Section 4.2 step 3) via Stute Cramér-von Mises functional with Mammen wild bootstrap. Survey-aware via PSU-level Mammen multiplier bootstrap.
+- `yatchew_hr_test(d, dy, *, null="linearity")` — Assumption 8 linearity of `E[ΔY|D]` (alternative test for step 3) via Yatchew (1997) heteroskedasticity-robust variance-ratio test. The `null="mean_independence"` mode (R `YatchewTest::yatchew_test(order=0)`) is also exposed for placebo-style mean-independence testing. Survey-aware via closed-form weighted variance components (no bootstrap).
 - `stute_joint_pretest(residuals_dict, d)` — joint Cramér-von Mises across K horizons with shared-η Mammen wild bootstrap (Delgado-Manteiga 2001 / Hlávka-Hušková 2020). Residuals-in core; the two data-in wrappers below construct residuals for the two paper-spelled nulls.
-- `joint_pretrends_test(...)` — joint pre-trends on K pre-periods (paper Section 4.2 step 2 closure on the event-study path).
-- `joint_homogeneity_test(...)` — joint linearity-and-homogeneity on K post-periods.
+- `joint_pretrends_test(...)` — Assumption 7 joint pre-trends on K pre-periods (paper Section 4.2 step 2 closure on the event-study path).
+- `joint_homogeneity_test(...)` — joint linearity-and-homogeneity on K post-periods (event-study step 3 alternative).
 
 The QUG-under-survey deferral is permanent; the linearity-family pretests support `survey_design=` (pweight, PSU, FPC) per Phase 4.5 C. Stratified designs and replicate-weight designs are deferred to follow-up PRs.
 
diff_diff/practitioner.py

Lines changed: 16 additions & 7 deletions
@@ -857,20 +857,29 @@ def _handle_had(results: Any):
             baker_step=3,
             label="Run the HAD pretest battery",
             why=(
-                "did_had_pretest_workflow bundles the paper's three "
-                "testable identifying assumptions: QUG (Assumption 5 "
-                "support condition), Stute (Assumption 7 mean-independence "
-                "of trends), and Yatchew-HR (Assumption 8 linearity of "
-                "E[ΔY|D]). Assumption 5/6 boundary continuity is not "
-                "testable - the workflow vets what can be vetted."
+                "On a two-period panel did_had_pretest_workflow runs "
+                "paper Section 4.2 step 1 (QUG support-infimum test - "
+                "decides Design 1' vs Design 1) and step 3 (Stute / "
+                "Yatchew-HR Assumption 8 linearity tests). Step 2 "
+                "(Assumption 7 pre-trends) is NOT covered on the overall "
+                "path - a single pre-period cannot support the joint "
+                "Stute variant - and the returned verdict explicitly "
+                "flags that gap. To close step 2, refit on a multi-period "
+                "panel with aggregate='event_study'. Assumptions 3 / 5 / 6 "
+                "(uniform continuity at the boundary, Design 1 sign / "
+                "WAS_d_lower identification) are NOT testable via "
+                "pre-trends - the workflow vets only what can be vetted."
             ),
             code=(
                 "from diff_diff import did_had_pretest_workflow\n"
                 "report = did_had_pretest_workflow(\n"
                 " data, outcome_col='y', unit_col='unit',\n"
                 " time_col='t', dose_col='d',\n"
                 " first_treat_col='first_treat')\n"
-                "print(report.summary())"
+                "print(report.summary())\n"
+                "# verdict explicitly flags the Assumption 7 gap on the\n"
+                "# overall path; aggregate='event_study' on a multi-period\n"
+                "# panel adds joint Stute pre-trends + joint homogeneity-linearity."
             ),
             step_name="parallel_trends",
         ),
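
The step coverage that the rewritten why-text describes can be summarized as a small lookup. This is only an illustrative sketch: `STEPS`, `COVERAGE`, and `uncovered_steps` are hypothetical names invented here, not part of `diff_diff`, and the mapping simply restates what the diff says about the `aggregate="overall"` and `aggregate="event_study"` paths.

```python
# Hypothetical summary of paper Section 4.2's three-step battery, as described above.
STEPS = {
    1: "QUG support-infimum test (Theorem 4)",
    2: "Assumption 7 pre-trends (joint Stute)",
    3: "Assumption 8 linearity (Stute / Yatchew-HR)",
}

# Which steps each workflow path runs, per the rewritten guide and why-text.
COVERAGE = {
    "overall": (1, 3),
    "event_study": (1, 2, 3),
}


def uncovered_steps(aggregate):
    """Return the step numbers that the given workflow path does not run."""
    if aggregate not in COVERAGE:
        raise ValueError(f"unknown aggregate: {aggregate!r}")
    return [step for step in STEPS if step not in COVERAGE[aggregate]]


for path in COVERAGE:
    gaps = uncovered_steps(path)
    print(path, "->", [STEPS[s] for s in gaps] or "no gaps")
```

On the `overall` path the only gap reported is step 2, which is the case the returned verdict is said to flag.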

tests/test_guides.py

Lines changed: 107 additions & 0 deletions
@@ -413,3 +413,110 @@ def test_llms_full_paper_citation(self):
         had_end = text.index("### StackedDiD", had_start)
         had_text = text[had_start:had_end]
         assert "D'Haultfœuille" in had_text
+
+    def test_llms_full_had_results_class_field_lists_match_real_dataclass(self):
+        # Every public dataclass field on HeterogeneousAdoptionDiDResults
+        # and HeterogeneousAdoptionDiDEventStudyResults must appear in the
+        # documented field table. Catches the failure mode where new
+        # result fields land but the guide isn't updated, so agents
+        # treating llms-full.txt as the authoritative surface miss
+        # available diagnostics / metadata.
+        import dataclasses
+
+        from diff_diff import (
+            HeterogeneousAdoptionDiDEventStudyResults,
+            HeterogeneousAdoptionDiDResults,
+        )
+
+        text = get_llm_guide("full")
+
+        # Single-period result class
+        sp_start = text.index("### HeterogeneousAdoptionDiDResults")
+        sp_end = text.index("### HeterogeneousAdoptionDiDEventStudyResults", sp_start)
+        sp_block = text[sp_start:sp_end]
+        for field in dataclasses.fields(HeterogeneousAdoptionDiDResults):
+            assert f"`{field.name}`" in sp_block, (
+                f"HeterogeneousAdoptionDiDResults guide block is missing "
+                f"the public dataclass field {field.name!r}. The table "
+                f"must enumerate every field so agents see all available "
+                f"diagnostics / metadata."
+            )
+
+        # Event-study result class
+        es_start = text.index("### HeterogeneousAdoptionDiDEventStudyResults")
+        es_end = text.index("### TROPResults", es_start)
+        es_block = text[es_start:es_end]
+        for field in dataclasses.fields(HeterogeneousAdoptionDiDEventStudyResults):
+            assert f"`{field.name}`" in es_block, (
+                f"HeterogeneousAdoptionDiDEventStudyResults guide block "
+                f"is missing the public dataclass field {field.name!r}."
+            )
+
+    def test_llms_full_had_pretests_assumption_labels_correct(self):
+        # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD
+        # § "Assumptions / Theorems / Estimators":
+        # - Assumption 5 = Design 1 sign identification (NOT testable)
+        # - Assumption 6 = Design 1 WAS_d_lower identification (NOT testable)
+        # - Assumption 7 = pre-trends (paper Section 4.2 step 2)
+        # - Assumption 8 = linearity (paper Section 4.2 step 3)
+        # The HAD Pretests section must NOT mislabel these:
+        # - qug_test is the support-infimum test (H0: d_lower = 0),
+        # NOT "Assumption 5" (which is non-testable per registry).
+        # - stute_test is Assumption 8 (linearity), NOT Assumption 7.
+        text = get_llm_guide("full")
+        pretests_start = text.index("## HAD Pretests")
+        pretests_end = text.index("## Honest DiD", pretests_start)
+        pretests_block = text[pretests_start:pretests_end]
+        # qug_test bullet: must positively label QUG as a support-infimum
+        # test, NOT as a positive "Assumption 5 support condition" claim
+        # (a negative disclaimer "does NOT test Assumption 5" is fine).
+        forbidden_qug_positive_claims = (
+            "Assumption 5 support condition",
+            "QUG (Assumption 5",
+            "qug_test`) — Assumption 5",
+            "qug_test(d)` — Assumption 5",
+        )
+        # stute_test bullet: must positively label as Assumption 8
+        # linearity, NOT as Assumption 7 mean-independence.
+        forbidden_stute_positive_claims = (
+            "stute_test(d, dy)` — Assumption 7",
+            "Stute (Assumption 7",
+            "Assumption 7 mean-independence",
+        )
+        for line in pretests_block.splitlines():
+            if line.startswith("- `qug_test"):
+                # Positive claim of what QUG IS:
+                assert (
+                    "support-infimum" in line
+                    or "support infimum" in line
+                    or "Theorem 4" in line
+                    or "H_0: d_lower" in line
+                ), (
+                    f"qug_test bullet must positively label QUG as the "
+                    f"support-infimum / Theorem-4 test. Line: {line!r}"
+                )
+                for phrase in forbidden_qug_positive_claims:
+                    assert phrase not in line, (
+                        f"qug_test bullet must not positively claim QUG "
+                        f"is an 'Assumption 5' test ({phrase!r}). QUG "
+                        f"tests H_0: d_lower = 0; Assumption 5 is the "
+                        f"Design 1 sign-identification condition (NOT "
+                        f"testable per registry). A negative disclaimer "
+                        f"that QUG does NOT test Assumption 5 is fine. "
+                        f"Line: {line!r}"
+                    )
+            if line.startswith("- `stute_test"):
+                # Positive claim of what Stute IS:
+                assert "Assumption 8" in line or "linearity" in line.lower(), (
+                    f"stute_test bullet must positively label as "
+                    f"Assumption 8 / linearity test. Line: {line!r}"
+                )
+                for phrase in forbidden_stute_positive_claims:
+                    assert phrase not in line, (
+                        f"stute_test bullet must not positively claim "
+                        f"Stute is an Assumption 7 mean-independence "
+                        f"test ({phrase!r}). stute_test is Assumption 8 "
+                        f"linearity (paper Section 4.2 step 3); "
+                        f"Assumption 7 is pre-trends (step 2, only "
+                        f"covered on the event-study path). Line: {line!r}"
+                    )
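
The label test above combines two checks on each bullet: at least one required positive marker must be present, and no forbidden positive claim may appear. A minimal standalone sketch of that pattern follows. The `check_bullet` helper and both sample bullets are hypothetical and abbreviated for illustration; they are not copied from the guide or the test suite.

```python
# Markers that positively identify QUG as the support-infimum test.
REQUIRED_ANY = ("support-infimum", "Theorem 4")

# Phrases that would positively (and wrongly) tie QUG to Assumption 5.
FORBIDDEN = ("Assumption 5 support condition", "QUG (Assumption 5")


def check_bullet(line):
    """Return a list of problems found in one documentation bullet."""
    problems = []
    if not any(marker in line for marker in REQUIRED_ANY):
        problems.append("missing positive label")
    problems += [f"forbidden phrase: {p!r}" for p in FORBIDDEN if p in line]
    return problems


# Abbreviated, hypothetical bullets in the old and new wording.
old = "- `qug_test(d)`: Assumption 5 support condition. Extreme order statistics."
new = "- `qug_test(d)`: paper Theorem 4 support-infimum test. Does NOT test Assumption 5."

print(check_bullet(old))
print(check_bullet(new))
```

Because the forbidden entries are full phrases rather than the bare string "Assumption 5", a negative disclaimer such as the one in `new` passes, which mirrors the allowance described in the commit message.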
