Address PR #409 R5 review (1 P1 — paper Step 4 vs Design 1' caveat)
Two methodology framing errors conflated in the original tutorial:
- "Paper Step 4" was described as "Boundary continuity (Assumptions 5/6)"
in the workflow taxonomy. Per REGISTRY's pretest workflow (lines
2482-2487 surrounding the four-step enumeration), Step 4 is actually
the DECISION RULE: "if Steps 1-3 don't reject, TWFE may be used."
Boundary-continuity assumptions are a separate concern.
- Assumptions 5/6 are Design 1 (continuous_near_d_lower / mass_point)
identification caveats — the library emits a UserWarning citing them
on Design 1 fits and stays silent on Design 1' (continuous_at_zero)
fits per REGISTRY:2532 and had.py. T21's panel resolves to Design 1'
via QUG fail-to-reject + the _detect_design() heuristic, so the
relevant non-testable caveat is **Assumption 3** (uniform continuity
of d -> Y_2(d) at zero, REGISTRY:2270), NOT Assumptions 5/6.
The tutorial inappropriately inherited the 5/6 framing from T20 (which IS Design 1).
Reframed across 7 surfaces in the build script:
- Section 1 four-step enumeration: Step 4 is now the decision rule
- Section 1: added a separate paragraph for the non-testable
identification caveat that's design-path-specific (Assumption 3 for
Design 1', Assumptions 5/6 for Design 1) and explicitly notes the
library's UserWarning behavior matches this split
- Section 4 event-study verdict reading: separated Step 4 (decision
rule) from the Design 1' caveat
- Section 4 horizon-detail closing: same split
- Section 6 leadership template: replaced "Step 4 / Assumptions 5/6"
caveat with the correct Design 1' caveat (Assumption 3); explicit
parenthetical noting T20's caveat was different because T20 was
Design 1
- Section 6 bottom line: same split (decision rule vs caveat)
- Section 8 summary checklist: replaced single Step-4-as-caveat
bullet with a two-part bullet on the workflow vs caveat distinction
Notebook re-executed, review extract regenerated. All 16 drift tests
still pass; nbmake clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/_review/t21_notebook_extract.md
9 additions & 7 deletions
Original file line number
Diff line number
Diff line change
@@ -26,15 +26,17 @@ This tutorial picks up where T20 left off. We re-run the brand campaign on a pan
26
26
27
27
## 1. The Pre-test Battery
28
28
29
-
de Chaisemartin et al. (2026) Section 4.2 lays out a four-step workflow for HAD identification:
29
+
de Chaisemartin et al. (2026) Section 4.2 lays out a four-step pre-test workflow for HAD identification:
30
30
31
31
1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1', `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1, `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters.
32
32
2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD.
33
33
3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias?
34
-
4. **Step 4 - Boundary continuity (paper Assumptions 5, 6):** local-linearity of the dose-response near the boundary `d_lower`. **Non-testable**; argued from domain knowledge.
34
+
4. **Step 4 - Decision rule:** if Steps 1-3 all fail to reject, TWFE may be used to estimate the treatment effect (paper Section 4.3).
35
35
36
36
The library bundles the testable steps into one entry point: `did_had_pretest_workflow`. It dispatches to a two-period implementation (steps 1 + 3 only - step 2 needs at least two pre-periods) or a multi-period implementation (steps 1 + 2 + 3 jointly). The Yatchew-HR test from Step 3 is also exposed standalone with two null modes; we exercise both in the side panel.
37
37
38
+
**Non-testable identification caveat (separate from the four-step workflow).** Identification of the WAS estimand under Design 1' (`continuous_at_zero`, target = `WAS`) requires **Assumption 3** (uniform continuity of `d -> Y_2(d)` at zero, holds if the dose-response is Lipschitz; not testable). The Design 1 paths (`continuous_near_d_lower` / `mass_point`, target = `WAS_d_lower`) instead need **Assumption 5** (sign identification) or **Assumption 6** (`WAS_d_lower` point identification) - that is the caveat T20's tutorial flagged because T20's panel was Design 1. T21's panel resolves to Design 1' (see Section 2 + Section 3), so the relevant non-testable caveat here is Assumption 3, NOT Assumptions 5/6. The library reflects this: it emits a UserWarning about Assumption 5/6 on Design 1 fits and does not emit it on `continuous_at_zero` (Design 1') fits.
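As a rough illustration of the design-path split, the sketch below maps each design name to its non-testable caveat and warns only on the Design 1 paths. The names `NON_TESTABLE_CAVEAT` and `caveat_for` are hypothetical, and the warning text is invented for the example; the library's actual message and call site are not reproduced here.

```python
import warnings

# Hypothetical lookup: design path -> non-testable identification caveat.
NON_TESTABLE_CAVEAT = {
    "continuous_at_zero": "Assumption 3",            # Design 1'
    "continuous_near_d_lower": "Assumptions 5/6",    # Design 1
    "mass_point": "Assumptions 5/6",                 # Design 1
}


def caveat_for(design: str) -> str:
    """Return the caveat for a design; warn only on the Design 1 paths."""
    caveat = NON_TESTABLE_CAVEAT[design]
    if design != "continuous_at_zero":
        # Illustrative wording only, not the library's message.
        warnings.warn(
            f"{design}: identification relies on {caveat} (non-testable)",
            UserWarning,
        )
    return caveat
```

On T21's panel the lookup would return `"Assumption 3"` silently, which mirrors the behavior described above for `continuous_at_zero` fits.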
39
+
38
40
## 2. The Panel
39
41
40
42
We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1') identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design="auto"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1' from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.
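The min/median rule quoted above is easy to check by hand. The sketch below is a minimal stand-in for that rule only, not a copy of `_detect_design`: `looks_continuous_at_zero` is a hypothetical name, the dose vectors are made-up illustrations (in $K), and the sketch does not try to distinguish `continuous_near_d_lower` from `mass_point`.

```python
import numpy as np


def looks_continuous_at_zero(d, ratio: float = 0.01) -> bool:
    """Hypothetical check: True when d.min() < ratio * median(|d|)."""
    d = np.asarray(d, dtype=float)
    return bool(d.min() < ratio * np.median(np.abs(d)))


# Illustrative dose vectors in $K, not the tutorial's simulated panel.
near_zero_doses = [0.01, 10.0, 20.0, 30.0, 50.0]   # smallest dose is about $10
t20_like_doses = [5.0, 10.0, 20.0, 30.0, 50.0]     # support starts at $5K

print(looks_continuous_at_zero(near_zero_doses))
print(looks_continuous_at_zero(t20_like_doses))
```

In the first vector the median is 20, so the threshold is 0.2 and the minimum of 0.01 falls below it; in the second the minimum of 5 does not. The rule looks only at the dose vector, which is why it can agree or disagree with the QUG test independently.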
**Reading the event-study verdict.** Now the verdict reads `"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)"`. The `"deferred"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.
262
264
263
-
A note on the verdict's "TWFE admissible" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05`. That is non-rejection evidence under the diagnostics' finite-sample power and specification, not a proof that the identifying assumptions hold. Step 4 (boundary continuity, paper Assumptions 5 / 6) remains non-testable from data and is not covered by any of the three diagnostics here.
265
+
A note on the verdict's "TWFE admissible" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05` (paper Step 4 decision rule). That is non-rejection evidence under the diagnostics' finite-sample power and specification, not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero, see Section 1) sits alongside this and is not covered by any of the three diagnostics.
264
266
265
267
The joint pre-trends test runs over `n_horizons = 3` (pre-periods 1, 2, 3, with week 4 reserved as the base period). The joint homogeneity test runs over `n_horizons = 4` (post-periods 5, 6, 7, 8). Let's inspect the per-horizon detail.
266
268
@@ -333,7 +335,7 @@ The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 thres
333
335
334
336
The joint homogeneity p-value (~0.76) is comfortably far from rejection. The diagnostic does not flag heterogeneity bias on the dose dimension across the four post-launch horizons.
335
337
336
-
Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. Step 4 (boundary continuity, Assumptions 5 / 6) remains non-testable from data and is argued from domain knowledge, as in T20.
338
+
Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. By paper Step 4 (the decision rule), TWFE may then be used. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero) remains and is argued from domain knowledge.
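The Step 4 decision rule amounts to a conjunction of non-rejections. The sketch below is a hedged illustration: `twfe_admissible` is a hypothetical function, and the QUG p-value in the example call is a placeholder, since only the joint pre-trends (~0.07) and joint homogeneity (~0.76) p-values are quoted in this section.

```python
from typing import Mapping


def twfe_admissible(p_values: Mapping[str, float], alpha: float = 0.05) -> bool:
    """Hypothetical Step 4 rule: True when no testable diagnostic rejects."""
    # A diagnostic rejects when its p-value falls below alpha.
    return all(p >= alpha for p in p_values.values())


diagnostics = {
    "qug": 0.50,          # placeholder value, not taken from the tutorial
    "pretrends": 0.07,    # approximate joint pre-trends p-value quoted above
    "homogeneity": 0.76,  # approximate joint homogeneity p-value quoted above
}
print(twfe_admissible(diagnostics))
```

A `True` result here is still only non-rejection evidence. The rule says nothing about Assumption 3, which no entry in the dictionary tests.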
337
339
338
340
## 5. Side Panel: Yatchew-HR Null Modes
339
341
@@ -417,9 +419,9 @@ Pre-test results travel awkwardly to non-technical audiences. The template below
417
419
> - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test does not reject (joint p approximately 0.07 across the three pre-period horizons). The p-value is close to alpha = 0.05, so the non-rejection here is not by a wide margin - in a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending.
418
420
> - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity does not reject (joint p approximately 0.76 across the four post-launch horizons). The diagnostic does not flag heterogeneity bias on the dose dimension under the test's specification.
419
421
>
420
-
> **Non-testable from data (Step 4, paper Assumptions 5 / 6, boundary continuity):** local-linearity of the dose-response near `d_lower`. Argued from domain knowledge - is there reason to believe the marginal effect of an additional $1K of regional spend is roughly constant across the dose range? In our case yes, by DGP construction; in a real analysis we would justify this from prior knowledge of the channel's response shape.
422
+
> **Non-testable from data (Design 1' identification, paper Assumption 3 / boundary regularity at zero):** uniform continuity of the dose-response `d -> Y_2(d)` at zero. Argued from domain knowledge - is there reason to believe outcomes are continuous in spend at the lower-dose boundary, with no extensive-margin discontinuity at $0? In our case yes, by DGP construction. (Note: this is the Design 1' caveat. T20's panel was Design 1, where the corresponding non-testable caveats are Assumptions 5/6 - the library actually emits a UserWarning surfacing those on Design 1 fits but stays silent on Design 1' fits like ours.)
421
423
>
422
-
> **Bottom line:** the workflow's three testable diagnostics do not flag a violation. Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and Step 4 (boundary continuity, non-testable from data). None of these are settled by non-rejection of the pre-tests.
424
+
> **Bottom line:** the workflow's three testable diagnostics do not flag a violation, so by paper Step 4 (decision rule) TWFE may be used. Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and the non-testable Design 1' caveat (Assumption 3 / boundary regularity at zero). None of these are settled by non-rejection of the pre-tests.
423
425
424
426
## 7. Extensions
425
427
@@ -442,7 +444,7 @@ See the [`HeterogeneousAdoptionDiD` API reference](../api/had.html) and the [`HA
442
444
- HAD's pre-test workflow `did_had_pretest_workflow` bundles paper Section 4.2 Steps 1 (QUG support infimum), 2 (joint Stute pre-trends - event-study path only), and 3 (Stute / Yatchew-HR linearity, joint variant on event-study path).
443
445
- The two-period (`aggregate="overall"`) path runs Steps 1 + 3 only - it cannot run Step 2 because a single pre-period structurally has nothing to test against. The verdict says so verbatim: "Assumption 7 pre-trends test NOT run".
444
446
- Upgrade to the multi-period (`aggregate="event_study"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads "TWFE admissible under Section 4 assumptions" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof.
445
-
- Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge.
447
+
- Paper Step 4 is the **decision rule** (if Steps 1-3 don't reject, use TWFE), not a non-testable assumption. The non-testable identification caveat is design-path-specific: **Assumption 3** (boundary regularity at zero) for `continuous_at_zero` (Design 1', T21), or **Assumptions 5/6** for the Design 1 paths (`continuous_near_d_lower` / `mass_point`, T20).
446
448
- The Yatchew-HR test exposes two null modes: `null="linearity"` (paper Theorem 7, default; what the workflow calls under the hood) and `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data).
447
449
- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The QUG test and HAD's `design="auto"` selector are independent rules: QUG is a statistical test on `H0: d_lower = 0`; `design="auto"` calls `_detect_design()` which uses a min/median heuristic on the dose vector. Both pointed to `continuous_at_zero` on this panel; finite-sample uncertainty in either decision is a remaining caveat.
448
450
- Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python).