Commit 162f45a

igerber and claude committed
Address PR #409 R5 review (1 P1 — paper Step 4 vs Design 1' caveat)
Two methodology framing errors conflated in the original tutorial:

- "Paper Step 4" was described as "Boundary continuity (Assumptions 5/6)" in the workflow taxonomy. Per REGISTRY's pretest workflow (lines 2482-2487 surrounding the four-step enumeration), Step 4 is actually the DECISION RULE: "if Steps 1-3 don't reject, TWFE may be used." Boundary-continuity assumptions are a separate concern.
- Assumptions 5/6 are Design 1 (continuous_near_d_lower / mass_point) identification caveats: the library emits a UserWarning citing them on Design 1 fits and stays silent on Design 1' (continuous_at_zero) fits per REGISTRY:2532 and had.py. T21's panel resolves to Design 1' via QUG fail-to-reject + the _detect_design() heuristic, so the relevant non-testable caveat is **Assumption 3** (uniform continuity of d -> Y_2(d) at zero, REGISTRY:2270), NOT Assumptions 5/6. Inherited the 5/6 framing from T20 (which IS Design 1) inappropriately.

Reframed across 7 surfaces in the build script:

- Section 1 four-step enumeration: Step 4 is now the decision rule
- Section 1: added a separate paragraph for the non-testable identification caveat that's design-path-specific (Assumption 3 for Design 1', Assumptions 5/6 for Design 1) and explicitly notes the library's UserWarning behavior matches this split
- Section 4 event-study verdict reading: separated Step 4 (decision rule) from the Design 1' caveat
- Section 4 horizon-detail closing: same split
- Section 6 leadership template: replaced "Step 4 / Assumptions 5/6" caveat with the correct Design 1' caveat (Assumption 3); explicit parenthetical noting T20's caveat was different because T20 was Design 1
- Section 6 bottom line: same split (decision rule vs caveat)
- Section 8 summary checklist: replaced single Step-4-as-caveat bullet with a two-part bullet on the workflow vs caveat distinction

Notebook re-executed, review extract regenerated. All 16 drift tests still pass; nbmake clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d8437a3 commit 162f45a

2 files changed

Lines changed: 63 additions & 59 deletions

File tree

docs/_review/t21_notebook_extract.md

Lines changed: 9 additions & 7 deletions
@@ -26,15 +26,17 @@ This tutorial picks up where T20 left off. We re-run the brand campaign on a pan
 
 ## 1. The Pre-test Battery
 
-de Chaisemartin et al. (2026) Section 4.2 lays out a four-step workflow for HAD identification:
+de Chaisemartin et al. (2026) Section 4.2 lays out a four-step pre-test workflow for HAD identification:
 
 1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1', `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1, `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters.
 2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD.
 3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias?
-4. **Step 4 - Boundary continuity (paper Assumptions 5, 6):** local-linearity of the dose-response near the boundary `d_lower`. **Non-testable**; argued from domain knowledge.
+4. **Step 4 - Decision rule:** if Steps 1-3 all fail to reject, TWFE may be used to estimate the treatment effect (paper Section 4.3).
 
 The library bundles the testable steps into one entry point: `did_had_pretest_workflow`. It dispatches to a two-period implementation (steps 1 + 3 only - step 2 needs at least two pre-periods) or a multi-period implementation (steps 1 + 2 + 3 jointly). The Yatchew-HR test from Step 3 is also exposed standalone with two null modes; we exercise both in the side panel.
 
+**Non-testable identification caveat (separate from the four-step workflow).** Identification of the WAS estimand under Design 1' (`continuous_at_zero`, target = `WAS`) requires **Assumption 3** (uniform continuity of `d -> Y_2(d)` at zero, holds if the dose-response is Lipschitz; not testable). The Design 1 paths (`continuous_near_d_lower` / `mass_point`, target = `WAS_d_lower`) instead need **Assumption 5** (sign identification) or **Assumption 6** (`WAS_d_lower` point identification) - that is the caveat T20's tutorial flagged because T20's panel was Design 1. T21's panel resolves to Design 1' (see Section 2 + Section 3), so the relevant non-testable caveat here is Assumption 3, NOT Assumptions 5/6. The library reflects this: it emits a UserWarning about Assumption 5/6 on Design 1 fits and does not emit it on `continuous_at_zero` (Design 1') fits.
+
 ## 2. The Panel
 
 We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1') identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design="auto"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1' from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.
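
Reviewer aside: the min/median rule quoted in the Section 2 context paragraph (`d.min() < 0.01 * median(|d|)`) can be illustrated with a minimal sketch. The helper name `fires_continuous_at_zero`, its `ratio` parameter, and both dose vectors are assumptions made up for illustration; this is not the library's `_detect_design()`, which may apply further checks that this diff does not show.

```python
import statistics


def fires_continuous_at_zero(doses, ratio=0.01):
    """Hypothetical stand-in for the quoted rule: d.min() < ratio * median(|d|)."""
    abs_median = statistics.median([abs(d) for d in doses])
    return min(doses) < ratio * abs_median


# Made-up doses in $K, chosen only to mirror the two supports described above.
near_zero_support = [0.01, 0.4, 3.0, 12.0, 25.0, 38.0, 50.0]     # T21-like
bounded_away_support = [5.0, 9.0, 14.0, 25.0, 33.0, 41.0, 50.0]  # T20-like

print(fires_continuous_at_zero(near_zero_support))     # True: 0.01 < 0.01 * 12.0
print(fires_continuous_at_zero(bounded_away_support))  # False: 5.0 >= 0.01 * 25.0
```

The rule looks only at the dose vector, which is why it can agree with the QUG test on a given panel without consuming the QUG p-value.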
@@ -260,7 +262,7 @@ homogeneity_joint populated? True
 
 **Reading the event-study verdict.** Now the verdict reads `"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject (TWFE admissible under Section 4 assumptions)"`. The `"deferred"` caveat from the overall path is gone because the joint pre-trends and joint homogeneity diagnostics now ran. The structural fields confirm: `pretrends_joint` and `homogeneity_joint` are both populated.
 
-A note on the verdict's "TWFE admissible" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05`. That is non-rejection evidence under the diagnostics' finite-sample power and specification, not a proof that the identifying assumptions hold. Step 4 (boundary continuity, paper Assumptions 5 / 6) remains non-testable from data and is not covered by any of the three diagnostics here.
+A note on the verdict's "TWFE admissible" language. This is the workflow's classifier output when none of the three testable diagnostics rejects at the configured `alpha = 0.05` (paper Step 4 decision rule). That is non-rejection evidence under the diagnostics' finite-sample power and specification, not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero, see Section 1) sits alongside this and is not covered by any of the three diagnostics.
 
 The joint pre-trends test runs over `n_horizons = 3` (pre-periods 1, 2, 3, with week 4 reserved as the base period). The joint homogeneity test runs over `n_horizons = 4` (post-periods 5, 6, 7, 8). Let's inspect the per-horizon detail.
 
@@ -333,7 +335,7 @@ The pre-trends p-value (~0.07) sits close to the conventional alpha = 0.05 thres
 
 The joint homogeneity p-value (~0.76) is comfortably far from rejection. The diagnostic does not flag heterogeneity bias on the dose dimension across the four post-launch horizons.
 
-Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. Step 4 (boundary continuity, Assumptions 5 / 6) remains non-testable from data and is argued from domain knowledge, as in T20.
+Together with QUG (Step 1's design decision) and joint linearity (Step 3), the workflow has now run all three testable steps and none reject at alpha = 0.05. By paper Step 4 (the decision rule), TWFE may then be used. That is the workflow's strongest non-rejection evidence; it is not proof that the identifying assumptions hold. The non-testable Design 1' identification caveat (Assumption 3 / boundary regularity at zero) remains and is argued from domain knowledge.
 
 ## 5. Side Panel: Yatchew-HR Null Modes
 
@@ -417,9 +419,9 @@ Pre-test results travel awkwardly to non-technical audiences. The template below
 > - **Step 2 (parallel pre-trends, Assumption 7):** the joint Stute pre-trends test does not reject (joint p approximately 0.07 across the three pre-period horizons). The p-value is close to alpha = 0.05, so the non-rejection here is not by a wide margin - in a high-stakes deployment we would inspect the per-horizon contributions (`per_horizon_stats`) and consider Pierce-Schott-style linear-trend detrending.
 > - **Step 3 (linearity, Assumption 8):** joint Stute homogeneity does not reject (joint p approximately 0.76 across the four post-launch horizons). The diagnostic does not flag heterogeneity bias on the dose dimension under the test's specification.
 >
-> **Non-testable from data (Step 4, paper Assumptions 5 / 6, boundary continuity):** local-linearity of the dose-response near `d_lower`. Argued from domain knowledge - is there reason to believe the marginal effect of an additional $1K of regional spend is roughly constant across the dose range? In our case yes, by DGP construction; in a real analysis we would justify this from prior knowledge of the channel's response shape.
+> **Non-testable from data (Design 1' identification, paper Assumption 3 / boundary regularity at zero):** uniform continuity of the dose-response `d -> Y_2(d)` at zero. Argued from domain knowledge - is there reason to believe outcomes are continuous in spend at the lower-dose boundary, with no extensive-margin discontinuity at $0? In our case yes, by DGP construction. (Note: this is the Design 1' caveat. T20's panel was Design 1, where the corresponding non-testable caveats are Assumptions 5/6 - the library actually emits a UserWarning surfacing those on Design 1 fits but stays silent on Design 1' fits like ours.)
 >
-> **Bottom line:** the workflow's three testable diagnostics do not flag a violation. Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and Step 4 (boundary continuity, non-testable from data). None of these are settled by non-rejection of the pre-tests.
+> **Bottom line:** the workflow's three testable diagnostics do not flag a violation, so by paper Step 4 (decision rule) TWFE may be used. Carrying the headline per-$1K lift forward should be paired with the standard caveats: finite-sample power of the diagnostics, the test specifications themselves, and the non-testable Design 1' caveat (Assumption 3 / boundary regularity at zero). None of these are settled by non-rejection of the pre-tests.
 
 ## 7. Extensions
 
@@ -442,7 +444,7 @@ See the [`HeterogeneousAdoptionDiD` API reference](../api/had.html) and the [`HA
 - HAD's pre-test workflow `did_had_pretest_workflow` bundles paper Section 4.2 Steps 1 (QUG support infimum), 2 (joint Stute pre-trends - event-study path only), and 3 (Stute / Yatchew-HR linearity, joint variant on event-study path).
 - The two-period (`aggregate="overall"`) path runs Steps 1 + 3 only - it cannot run Step 2 because a single pre-period structurally has nothing to test against. The verdict says so verbatim: "Assumption 7 pre-trends test NOT run".
 - Upgrade to the multi-period (`aggregate="event_study"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads "TWFE admissible under Section 4 assumptions" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof.
-- Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge.
+- Paper Step 4 is the **decision rule** (if Steps 1-3 don't reject, use TWFE), not a non-testable assumption. The non-testable identification caveat is design-path-specific: **Assumption 3** (boundary regularity at zero) for `continuous_at_zero` (Design 1', T21), or **Assumptions 5/6** for the Design 1 paths (`continuous_near_d_lower` / `mass_point`, T20).
 - The Yatchew-HR test exposes two null modes: `null="linearity"` (paper Theorem 7, default; what the workflow calls under the hood) and `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data).
 - QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The QUG test and HAD's `design="auto"` selector are independent rules: QUG is a statistical test on `H0: d_lower = 0`; `design="auto"` calls `_detect_design()` which uses a min/median heuristic on the dose vector. Both pointed to `continuous_at_zero` on this panel; finite-sample uncertainty in either decision is a remaining caveat.
 - Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python).
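
Reviewer aside: the reframed Step 4 (a decision rule over Steps 1-3, not an assumption) can be sketched in a few lines. The function `step4_decision` and the dictionary keys are hypothetical names, not library API. The pre-trends and homogeneity values are the approximate joint p-values quoted in the diff (~0.07 and ~0.76); the QUG value is a placeholder, because its p-value does not appear in this excerpt.

```python
def step4_decision(p_values, alpha=0.05):
    """Return (twfe_may_be_used, names of diagnostics that reject at alpha)."""
    rejecting = [name for name, p in p_values.items() if p < alpha]
    return (not rejecting, rejecting)


p_values = {
    "qug": 0.50,                # placeholder: not reported in this excerpt
    "joint_pretrends": 0.07,    # approximate, from the tutorial text
    "joint_homogeneity": 0.76,  # approximate, from the tutorial text
}

print(step4_decision(p_values, alpha=0.05))  # (True, [])
print(step4_decision(p_values, alpha=0.10))  # (False, ['joint_pretrends'])
```

The second call shows why the tutorial flags the pre-trends result as a narrow non-rejection: the verdict flips at `alpha = 0.10`. The sketch says nothing about the non-testable caveat (Assumption 3 on the Design 1' path), which no p-value covers.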
