Skip to content

Commit 7beef1a

Browse files
igerberclaude
andcommitted
BaconDecomposition R parity goldens
Closes the PR #454 deferred R parity follow-up (TODO.md row removed). Generated `benchmarks/data/r_bacondecomp_golden.json` from the committed `benchmarks/R/generate_bacon_golden.R` script against `bacondecomp 0.1.1` on R 4.5.2. Three DGP fixtures: `uniform_3groups_with_never_treated`, `two_groups_no_never_treated`, `always_treated_remapped`. Parity results at atol=1e-6 via `tests/test_methodology_bacon.py::TestBaconParityR`: - TWFE coefficient: ✅ matches across all 3 fixtures - Weights-sum: ✅ matches across all 3 fixtures - Per-component: ✅ on the 2 non-remap fixtures; **structural convention divergence** on `always_treated_remapped` (skipped per-component, kept aggregate). R keeps `first_treat=1` as a distinct timing cohort and emits `Later vs Always Treated` comparisons; Python's paper-footnote-11 convention remaps those units to `U` and folds them into a single `treated_vs_never` cell per treated cohort. The aggregate is invariant per Theorem 1 — the U bucket's weight is re-allocated across nested 2x2 cells but the total weight on {cohort_k vs U} is identical. Only the per-component breakdown differs structurally between conventions. Tracker promotions: - METHODOLOGY_REVIEW.md: BaconDecomposition status row → **Complete** (was `**Complete** (R parity pending)`); removed from In Progress prose mention; removed from Priority Order substantive-review list; Test Coverage count refreshed (24 → 33); R Comparison Results block rewritten as **Validated**. - docs/methodology/REGISTRY.md: Reference Implementations bullet + Verified Components checklist + Note (weight modes) updated; new Note (R parity convention divergence on always-treated) documents the convention. - TODO.md: BaconDecomposition R parity goldens row removed. - CHANGELOG.md: new `[Unreleased]` Added bullet for the close-out; PR-B Changed entry tightened ("intended to match" → "matching ... at atol=1e-6"). - diff_diff/bacon.py: `bacon_decompose` docstring example wording tightened from "intended to match" to "matches" with TestBaconParityR pointer. Tests: 33/33 pass in test_methodology_bacon.py (no skips; was 30+3 skipped); 32 pass in test_bacon.py; 101 pass across the broader bacon/decompose surface (was 98+3 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9aedd33 commit 7beef1a

7 files changed

Lines changed: 255 additions & 27 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
Large diffs are not rendered by default.

METHODOLOGY_REVIEW.md

Lines changed: 21 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@ A **Complete** entry has a documented review pass against the primary academic s
2424

2525
The catalog grew incrementally over several quarters, so formats vary across the existing Complete entries; the consistent invariant is that someone walked through the implementation against the academic source and captured the result here. New reviews going forward should aim for the fuller structure (Verified Components + Corrections Made + Deviations + dedicated methodology test file) used by the more recent entries.
2626

27-
**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures (e.g., DCDH has a methodology file, R parity, and a companion-paper review for the 2026 universal-rollout extension; HAD has its primary-source paper review and R parity but no dedicated methodology file; ContinuousDiD has the methodology file but no paper review); others have only the REGISTRY entry and unit tests (e.g., BaconDecomposition, PowerAnalysis). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
27+
**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures (e.g., DCDH has a methodology file, R parity, and a companion-paper review for the 2026 universal-rollout extension; HAD has its primary-source paper review and R parity but no dedicated methodology file; ContinuousDiD has the methodology file but no paper review); others have only the REGISTRY entry and unit tests (e.g., PowerAnalysis). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
2828

2929
**Not Started** entries have neither a tracker walk-through nor an REGISTRY.md section. This tracker no longer carries any Not Started rows; new estimators are expected to enter as In Progress when their REGISTRY entry lands.
3030

@@ -78,7 +78,7 @@ The catalog grew incrementally over several quarters, so formats vary across the
7878

7979
| Tool | Module | R Reference | Status | Last Review |
8080
|------|--------|-------------|--------|-------------|
81-
| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **Complete** (R parity pending) | 2026-05-16 |
81+
| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **Complete** | 2026-05-16 |
8282
| HonestDiD | `honest_did.py` | `HonestDiD` package | **Complete** | 2026-04-01 |
8383
| PreTrendsPower | `pretrends.py` | `pretrends` package | **In Progress** ||
8484
| PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | **In Progress** ||
@@ -909,7 +909,7 @@ and covariate-adjusted specifications.)
909909
| Module | `bacon.py` |
910910
| Primary Reference | Goodman-Bacon (2021), *Difference-in-differences with variation in treatment timing*, J. Econometrics 225(2), 254-277 |
911911
| R Reference | `bacondecomp::bacon()` |
912-
| Status | **Complete** (R parity goldens pending) |
912+
| Status | **Complete** |
913913
| Last Review | 2026-05-16 |
914914

915915
**Verified Components:**
@@ -926,14 +926,17 @@ and covariate-adjusted specifications.)
926926
- [x] No untreated group: `s_{kU}` terms drop, weights renormalize, sum-to-1 still holds
927927
- [x] Single timing group with U: only `treated_vs_never` comparisons
928928
- [x] Survey design composes cleanly with exact mode and warn+remap
929-
- [ ] R `bacondecomp::bacon()` parity at `atol=1e-6`R generator script committed; JSON goldens pending follow-up R install (see TODO.md)
929+
- [x] R `bacondecomp::bacon()` parity at `atol=1e-6`3 fixtures (`uniform_3groups_with_never_treated`, `two_groups_no_never_treated`, `always_treated_remapped`); TWFE coefficient + weights-sum match across all 3 fixtures; per-component estimate + weight parity locked on the 2 non-remap fixtures, with a documented convention divergence on `always_treated_remapped` (Python's footnote-11 U-remap vs R's distinct `Later vs Always Treated` cohort decomposition — aggregate is invariant, breakdown is structurally different). See `benchmarks/data/r_bacondecomp_golden.json` + `TestBaconParityR`.
930930

931931
**Test Coverage:**
932-
- 24 methodology tests in `tests/test_methodology_bacon.py` across 6 classes (21 active + 3 R-parity tests that skip on missing goldens)
932+
- 33 methodology tests in `tests/test_methodology_bacon.py` across 6 classes (all active; R parity activates once goldens are committed)
933933
- 32 existing tests in `tests/test_bacon.py` (basic decomposition, weight properties, weights-parameter API, TWFE integration, visualization, balanced-panel warnings, edge cases)
934934

935935
**R Comparison Results:**
936-
- **Pending**: `bacondecomp` R package not installed in the local R 4.5.2 library at PR-B authoring time. R generator script committed at `benchmarks/R/generate_bacon_golden.R`; running it requires `install.packages("bacondecomp")` + `install.packages("jsonlite")` then `cd benchmarks/R && Rscript generate_bacon_golden.R`. JSON goldens land at `benchmarks/data/r_bacondecomp_golden.json` once generated. `tests/test_methodology_bacon.py::TestBaconParityR` skips with a pointer until then. Tracked in TODO.md follow-up row.
936+
- **Validated** at `atol=1e-6` against `bacondecomp::bacon()` (version 0.1.1, R 4.5.2). Goldens at `benchmarks/data/r_bacondecomp_golden.json`; generator at `benchmarks/R/generate_bacon_golden.R`. Three DGP fixtures:
937+
- `uniform_3groups_with_never_treated`: 9 components covering all three comparison types — full per-component parity (estimate + weight at `atol=1e-6`).
938+
- `two_groups_no_never_treated`: 2 components, timing-only decomposition — full per-component parity.
939+
- `always_treated_remapped`: TWFE coefficient + weights-sum match at `atol=1e-6`; per-component breakdown diverges by convention (Python's paper-footnote-11 U-remap vs R's distinct `Later vs Always Treated` cohort decomposition). The aggregate is invariant to the re-bucketing per Theorem 1; only the breakdown differs. Per-component assertion skipped for this fixture with explicit documentation in `TestBaconParityR.test_component_estimates_match_r`.
937940

938941
**Corrections Made:**
939942
1. **Theorem 1 exact-weights rewrite** (`bacon.py:_recompute_exact_weights`, lines ~740-880). The previous "exact" mode implementation did not actually compute Eqs. 7-9 / 10e-g — it was missing the `(1 - n_kU)` factor in the within-subsample treatment variance, did not square the sample share, and added an extraneous `unit_share` factor not present in the paper. The post-hoc sum-to-1 normalization masked the relative-weight error but produced a decomposition error of ~0.3% (0.007 absolute) against TWFE on a 3-cohort + never-treated DGP. **Rewrote** the function to compute the exact numerators of Eqs. 10e/f/g (with proper Eqs. 7-9 variances) and let the post-hoc normalization handle the `V̂^D` denominator (Theorem 1 identity guarantees `V̂^D = Σ numerators`). Now matches TWFE at `atol=1e-10`. The existing `test_weighted_sum_equals_twfe` tolerance was tightened from `< 0.1` to `< 1e-10` to lock the contract.
@@ -1203,22 +1206,21 @@ Promotion priority for the **In Progress** entries, ordered by what's blocked on
12031206

12041207
**Substantive-review-blocked (no methodology test file, no paper review, no R parity):**
12051208

1206-
1. **BaconDecomposition** — chosen for next substantive review during the 2026-05-15 tracker refresh session. Smaller scope than estimator reviews; R reference (`bacondecomp::bacon()`) available; methodology is well-understood (Goodman-Bacon 2021); REGISTRY checklist provides a ready-made target.
1207-
2. **PreTrendsPower** — small surface, established R package (`pretrends`), Roth (2022) is short.
1208-
3. **PowerAnalysis** — larger surface (MDE / power / sample size / simulation paths); REGISTRY already lists Bloom (1995) and Burlig et al. (2020) as primary sources; least urgent if the library's power-analysis utilities are not heavily used.
1209-
4. **PlaceboTests** — decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way.
1210-
5. **EfficientDiD** — no paper review on file; substantial implementation work (`tests/test_efficient_did.py` + validation tests) needs paper-vs-code audit against Chen, Sant'Anna & Xie (2025).
1211-
6. **ImputationDiD / TwoStageDiD** — natural pair (both single-treatment-effect-imputation methods). Each needs paper review, methodology file, R parity fixture against `didimputation` / `did2s`.
1209+
1. **PreTrendsPower** — small surface, established R package (`pretrends`), Roth (2022) is short.
1210+
2. **PowerAnalysis** — larger surface (MDE / power / sample size / simulation paths); REGISTRY already lists Bloom (1995) and Burlig et al. (2020) as primary sources; least urgent if the library's power-analysis utilities are not heavily used.
1211+
3. **PlaceboTests** — decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way.
1212+
4. **EfficientDiD** — no paper review on file; substantial implementation work (`tests/test_efficient_did.py` + validation tests) needs paper-vs-code audit against Chen, Sant'Anna & Xie (2025).
1213+
5. **ImputationDiD / TwoStageDiD** — natural pair (both single-treatment-effect-imputation methods). Each needs paper review, methodology file, R parity fixture against `didimputation` / `did2s`.
12121214

12131215
**Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):**
12141216

1215-
7. **HeterogeneousAdoptionDiD (HAD)** — largest current surface, Phase 4.5 just shipped; shares the de Chaisemartin (2026) paper review with DCDH; needs a dedicated Verified Components block.
1216-
8. **ChaisemartinDHaultfoeuille (DCDH)** — methodology test file + 24 R parity tests + 347 unit tests + a companion-paper review for the 2026 universal-rollout extension. Primary-source reviews for the 2020 AER and 2022/2024 NBER WP 29873 papers are still outstanding alongside the Verified Components walk-through.
1217-
9. **WooldridgeDiD (ETWFE)** — companion-paper review (Wooldridge 2023 nonlinear extension) merged in PR #443; primary-source review for Wooldridge (2025) ETWFE not yet on file, and no dedicated methodology test file.
1218-
10. **ContinuousDiD** — 15 methodology tests already in place; mostly a consolidation pass with a documented boundary-knots deviation from R `contdid` v0.1.0.
1219-
11. **TROP** — paper review recently merged (PR #443); needs methodology file and cross-language anchor (when paper-author reference becomes available).
1220-
12. **StaggeredTripleDifference** — shares the primary paper (Ortiz-Villavicencio & Sant'Anna 2025) with TripleDifference, but no dedicated paper review on file yet; needs R parity (R fixtures gitignored — tracked in TODO.md, PR #245).
1221-
13. **ConleySpatialHAC** — paper review + committed R `conleyreg` goldens; needs dedicated methodology test file + summary R-parity table in this tracker.
1217+
6. **HeterogeneousAdoptionDiD (HAD)** — largest current surface, Phase 4.5 just shipped; shares the de Chaisemartin (2026) paper review with DCDH; needs a dedicated Verified Components block.
1218+
7. **ChaisemartinDHaultfoeuille (DCDH)** — methodology test file + 24 R parity tests + 347 unit tests + a companion-paper review for the 2026 universal-rollout extension. Primary-source reviews for the 2020 AER and 2022/2024 NBER WP 29873 papers are still outstanding alongside the Verified Components walk-through.
1219+
8. **WooldridgeDiD (ETWFE)** — companion-paper review (Wooldridge 2023 nonlinear extension) merged in PR #443; primary-source review for Wooldridge (2025) ETWFE not yet on file, and no dedicated methodology test file.
1220+
9. **ContinuousDiD** — 15 methodology tests already in place; mostly a consolidation pass with a documented boundary-knots deviation from R `contdid` v0.1.0.
1221+
10. **TROP** — paper review recently merged (PR #443); needs methodology file and cross-language anchor (when paper-author reference becomes available).
1222+
11. **StaggeredTripleDifference** — shares the primary paper (Ortiz-Villavicencio & Sant'Anna 2025) with TripleDifference, but no dedicated paper review on file yet; needs R parity (R fixtures gitignored — tracked in TODO.md, PR #245).
1223+
12. **ConleySpatialHAC** — paper review + committed R `conleyreg` goldens; needs dedicated methodology test file + summary R-parity table in this tracker.
12221224
14. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
12231225

12241226
---

TODO.md

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -74,7 +74,6 @@ Deferred items from PR reviews that were not addressed before merge.
7474

7575
| Issue | Location | PR | Priority |
7676
|-------|----------|----|----------|
77-
| BaconDecomposition R parity goldens: `bacondecomp` R package not installed in the local R 4.5.2 library at PR-B authoring time (2026-05-16). R generator script committed at `benchmarks/R/generate_bacon_golden.R`; running it requires `install.packages("bacondecomp")` + `install.packages("jsonlite")` then `cd benchmarks/R && Rscript generate_bacon_golden.R`, writing `benchmarks/data/r_bacondecomp_golden.json`. `tests/test_methodology_bacon.py::TestBaconParityR` (3 tests) skips with a pointer until the JSON lands. The PR-B audit substantiates Theorem 1 (Eqs. 7-9 + 10e-g) via hand-calculable + machine-precision identity tests; R parity is desirable as a cross-language anchor but not the only substantiation. Mirrors StaggeredTripleDifference precedent (PR #245). | `benchmarks/R/generate_bacon_golden.R`, `benchmarks/data/r_bacondecomp_golden.json` (TBD), `tests/test_methodology_bacon.py::TestBaconParityR` | follow-up | Medium |
7877
| dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. | `chaisemartin_dhaultfoeuille.py` | #294 | Low |
7978
| dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | #408 | Medium |
8079
| dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). | `test_chaisemartin_dhaultfoeuille_parity.py` | #294 | Low |

0 commit comments

Comments
 (0)