Add survey-aware DiD tutorial (Phase 7c) by igerber · Pull Request #249 · igerber/diff-diff

igerber · 2026-04-01T10:47:04Z

Summary

Add docs/tutorials/16_survey_did.ipynb — 35-cell tutorial framed around a state-level preventive care program evaluated with a stratified health survey (ACS/BRFSS-like)
Add generate_survey_did_data() DGP function to diff_diff/prep_dgp.py with realistic survey structure (strata, PSUs, FPC, weights, optional JK1 replicate weights)
Add 12 unit tests for the new DGP function
Mark Phase 7c complete in docs/survey-roadmap.md

Methodology references (required if estimator / math changes)

Method name(s): N/A — no estimator math changes; tutorial and DGP function only
Paper / source link(s): N/A
Any intentional deviations from the source (and why): None

Validation

Tests added/updated: tests/test_prep.py::TestGenerateSurveyDidData (12 tests)
Backtest / simulation / notebook evidence (if applicable): Tutorial notebook validates end-to-end via jupyter nbconvert --execute

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Add docs/tutorials/16_survey_did.ipynb — a 35-cell tutorial framed around a state-level preventive care program evaluated with a stratified health survey (ACS/BRFSS-like). Covers: why survey design matters for DiD, SurveyDesign setup, basic and staggered DiD with survey design, replicate weights (JK1), subpopulation analysis, DEFF diagnostics, repeated cross-sections, and estimator support reference. Also adds generate_survey_did_data() to diff_diff/prep_dgp.py with realistic survey structure (strata, PSUs, FPC, sampling weights, optional JK1 replicate weights) and 12 unit tests. Marks Phase 7c complete in the survey roadmap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-01T10:54:13Z

Overall Assessment

⚠️ Needs changes

Executive Summary

No estimator implementation changed in this PR; the review risk is concentrated in the new survey tutorial and the new generate_survey_did_data() public DGP.
The tutorial’s estimator-support matrix overstates survey support for ImputationDiD and TwoStageDiD: it says full strata/PSU/FPC (TSL) support, but both the registry and code explicitly reject fpc.
The repeated-cross-section section does not actually generate repeated cross-sections; it relabels panel observations, which does not match the panel=False methodology documented for CallawaySantAnna.
The committed notebook still contains a first-cell ImportError and was not re-executed after the new export was added.
The new JK1 replicate-weight path lacks a minimum-PSU guard; the arithmetic is undefined at one PSU, and the new tests do not cover that boundary.

Methodology

Methods affected in this PR are CallawaySantAnna tutorial coverage (including panel=False), plus survey-support claims for ImputationDiD and TwoStageDiD. I cross-checked these against the Methodology Registry and the in-code survey guards.

Severity: P1. Impact: The support table marks ImputationDiD and TwoStageDiD as Full under Strata/PSU/FPC (TSL), and the legend defines Full as including FPC-based TSL. That contradicts both the registry and the estimator code, which explicitly reject fpc for these estimators. Users following the tutorial will be told a survey-variance path exists that the library does not implement. References: docs/tutorials/16_survey_did.ipynb:718, docs/tutorials/16_survey_did.ipynb:719, docs/tutorials/16_survey_did.ipynb:727, docs/methodology/REGISTRY.md:854, docs/methodology/REGISTRY.md:932, diff_diff/imputation.py:261, diff_diff/two_stage.py:257. Concrete fix: Change those rows to reflect actual support (pweight + PSU/strata, no FPC), and tighten the summary text so it does not imply a uniform full-design SurveyDesign contract across estimators.
Severity: P1. Impact: The repeated-cross-section example says panel=False is for “different respondents sampled each period,” but it builds df_rc from the panel DGP and only renames unit. generate_survey_did_data() draws unit_fe once and reuses it across periods, so the same latent respondent is still present in every period; that is not the Sant'Anna–Zhao repeated-cross-section setting documented in the registry. References: docs/tutorials/16_survey_did.ipynb:651, docs/tutorials/16_survey_did.ipynb:662, docs/tutorials/16_survey_did.ipynb:670, docs/methodology/REGISTRY.md:426, diff_diff/prep_dgp.py:1264, diff_diff/prep_dgp.py:1272. Concrete fix: Add a true repeated-cross-section mode to generate_survey_did_data() (fresh respondent effects/covariates/assignments each period) or construct the tutorial example from fresh per-period draws instead of ID relabeling.

Code Quality

Severity: P2. Impact: JK1 replicate generation has no precondition check for the minimum number of PSUs. The code computes n_rep / (n_rep - 1), which is undefined when n_rep == 1, so include_replicate_weights=True can produce malformed replicate weights on boundary inputs instead of failing fast. The new tests only exercise a 12-PSU happy path. References: diff_diff/prep_dgp.py:1217, diff_diff/prep_dgp.py:1311, diff_diff/prep_dgp.py:1320, tests/test_prep.py:1234. Concrete fix: Validate positive n_strata/psu_per_stratum, and require at least two PSUs when include_replicate_weights=True; add a negative test for the one-PSU case.

Performance

No material findings in PR scope.

Maintainability

No material findings in PR scope.

Tech Debt

No material findings in PR scope.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The notebook was committed with a first-cell ImportError output from an environment where generate_survey_did_data was not importable, and none of the later cells were re-executed. That makes the tutorial artifact untrustworthy as committed, even though the code export was added in this PR. The new tests also only import from diff_diff.prep, not the top-level import path used by the notebook. References: docs/tutorials/16_survey_did.ipynb:38, docs/tutorials/16_survey_did.ipynb:60, diff_diff/init.py:91, tests/test_prep.py:1176. Concrete fix: Re-run or clear notebook outputs against the current source tree, and add either an nbconvert --execute smoke test or at least a top-level from diff_diff import generate_survey_did_data import test.

Path to Approval

Fix the tutorial support matrix and related summary text so ImputationDiD and TwoStageDiD match the documented/code reality: no FPC support, no “Full strata/PSU/FPC (TSL)” claim.
Replace the repeated-cross-section example with a genuine repeated-cross-section DGP path instead of relabeling panel IDs.

- Fix estimator support table: ImputationDiD and TwoStageDiD now show "Partial (no FPC)" instead of "Full" for strata/PSU/FPC support, matching the code which explicitly rejects FPC (P1) - Fix repeated cross-section example: add panel=False parameter to generate_survey_did_data() that draws fresh respondent effects each period instead of relabeling panel unit IDs (P1) - Add JK1 minimum-PSU guard: raise ValueError when n_psu < 2 to prevent division by zero in replicate weight generation (P2) - Clear stale notebook outputs committed from wrong environment (P2) - Add top-level import test and JK1 boundary test (P2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T12:17:54Z

/ai-review

github-actions · 2026-04-01T12:25:47Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7d381de4c03daf6a167ddf49879c9f4f64bcb415

Overall Assessment

⚠️ Needs changes

Executive Summary

Most prior review items are fixed in this revision: the ImputationDiD/TwoStageDiD FPC claims were corrected, the repeated-cross-section example now uses panel=False, notebook outputs were cleared, and the JK1 minimum-PSU guard/test were added.
No estimator implementation changed in this PR; the remaining risk is methodology accuracy in the new tutorial plus edge-case behavior in the new public DGP.
The new support matrix still misclassifies StackedDiD: the notebook says no strata/PSU/FPC support, but the registry and estimator implement TSL variance with full design metadata on composed weights.
The replicate-weight walkthrough builds plain JK1 delete-one-PSU replicates on a stratified/FPC DGP, then states that JK1 and TSL both “correctly account for the survey design.” That comparison is not design-consistent.
generate_survey_did_data() is exported correctly and has decent happy-path coverage, but it still lacks validation for documented parameter contracts like weight_variation and cohort_periods.
I could not execute pytest or the notebook in this workspace because pytest and numpy are unavailable.

Methodology

Severity: P1. Impact: The estimator support table marks StackedDiD as pweight only under Strata/PSU/FPC (TSL), and the legend defines pweight only as “no strata/PSU/FPC.” That contradicts both the Methodology Registry and the implementation, which compose Q-weights with survey weights and then compute TSL variance on the stacked design. Users are told a documented survey-variance path does not exist when it does. Locations: docs/tutorials/16_survey_did.ipynb#L700, docs/tutorials/16_survey_did.ipynb#L712, docs/methodology/REGISTRY.md#L1039, diff_diff/stacked_did.py#L242, diff_diff/stacked_did.py#L431. Concrete fix: change the StackedDiD TSL cell to reflect actual support and separate weight_type='pweight' restrictions from design-element support in the legend/table wording.
Severity: P1. Impact: The replicate-weight section generates global JK1 delete-one-PSU weights, but the comparison benchmark is a stratified/FPC TSL design. The notebook itself distinguishes JK1 from stratified JKn, and the registry states replicate weights must encode the design implicitly instead of being combined with strata/PSU/FPC metadata. In this example they do not, so the statement that both methods “correctly account for the survey design” is not supported for the constructed data. Locations: docs/tutorials/16_survey_did.ipynb#L439, docs/tutorials/16_survey_did.ipynb#L489, docs/tutorials/16_survey_did.ipynb#L502, diff_diff/prep_dgp.py#L1344, docs/methodology/REGISTRY.md#L2195, docs/methodology/REGISTRY.md#L2196, docs/methodology/REGISTRY.md#L2208. Concrete fix: either generate design-consistent stratified replicates for the stratified DGP (for example JKn) or keep the JK1 workflow example but stop presenting it as the same survey design as the strata/PSU/FPC TSL path.

Code Quality

Severity: P2. Impact: generate_survey_did_data() documents weight_variation as an enum and cohort_periods as treatment periods, but it does not validate either. Unknown weight_variation values silently fall through to the 1.0 branch, and cohort_periods has no explicit empty/range guard, unlike the existing staggered DGP. That lets the new public generator produce a different weight/treatment structure than the caller likely intended, with no clear error. Locations: diff_diff/prep_dgp.py#L1187, diff_diff/prep_dgp.py#L1244, diff_diff/prep_dgp.py#L1252, diff_diff/prep_dgp.py#L218, tests/test_prep.py#L1176. Concrete fix: validate weight_variation against the documented set, validate cohort_periods explicitly, and add negative tests for those contracts.

Performance

No material findings in PR scope.

Maintainability

No material findings beyond the table/legend ambiguity called out in Methodology.

Tech Debt

No material findings. I did not find matching TODO.md entries that mitigate the P1/P2 items above.

Security

No findings.

Documentation/Tests

No additional material findings beyond the coverage gap noted in Code Quality. The earlier stale-output/import concern appears addressed by clearing notebook outputs and adding a top-level import test at tests/test_prep.py#L1296. I could not run the notebook or test suite in this workspace because numpy and pytest are unavailable.

Path to Approval

Fix the survey-support matrix so StackedDiD reflects its actual strata/PSU/FPC TSL support, and adjust the legend so weight_type='pweight' restrictions are not conflated with “no full design support.”
Make the replicate-weight section design-consistent: either generate stratified replicate weights for the stratified DGP or remove the claim that the JK1 example is the same survey design as the strata/PSU/FPC TSL fit.

- Fix StackedDiD support table: show "Full (pweight only)" for strata/PSU/FPC since it supports full TSL on composed weights, just restricted to pweight weight type (P1) - Fix replicate weight section: remove misleading TSL vs JK1 equivalence claim; show JK1 as standalone API demo and note that stratified replicates (JKn) are needed for stratified designs (P1) - Add input validation for weight_variation and cohort_periods in generate_survey_did_data() with negative tests (P2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T12:33:50Z

/ai-review

github-actions · 2026-04-01T12:39:55Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2dc1888b028cfdd53d3d02dc8b1cb90c380485b1

Overall Assessment

✅ Looks good

Executive Summary

Re-review: the two prior P1 methodology issues appear resolved. The support table now reflects StackedDiD’s documented pweight-only/full-TSL behavior at docs/tutorials/16_survey_did.ipynb#L694 and docs/methodology/REGISTRY.md#L1039, and the JK1 walkthrough now explicitly labels the generated replicates as unstratified at docs/tutorials/16_survey_did.ipynb#L494, consistent with docs/methodology/REGISTRY.md#L2195.
No estimator implementation changed in this PR; the remaining review risk is tutorial accuracy and public DGP contract handling.
Severity P2: generate_survey_did_data() still does not validate that cohort_periods are in-range integers, so invalid inputs can silently create extra never-treated units or cohorts that are never treated in-sample.
Severity P3: the notebook globally suppresses warnings, which also hides the repeated-cross-section stationarity warning emitted by CallawaySantAnna(panel=False).
I could not run pytest or execute the notebook here because numpy, pandas, and pytest are unavailable in this workspace.

Methodology

Prior methodology blockers from the last review look resolved.

Severity: P3. Impact: docs/tutorials/16_survey_did.ipynb#L114 suppresses all warnings even though the comment says it is only suppressing numerical warnings. That also hides the repeated-cross-section assumption warning emitted at diff_diff/staggered.py#L1256, while Section 8 introduces CallawaySantAnna(panel=False) at docs/tutorials/16_survey_did.ipynb#L632. Concrete fix: replace the global ignore with a targeted filter for specific numerical warnings, or reset warnings before Section 8 and state the stationary cross-sectional sampling assumption explicitly.

Code Quality

Severity: P2. Impact: the new generator only checks that cohort_periods is non-empty at diff_diff/prep_dgp.py#L1221 and diff_diff/prep_dgp.py#L1224. Unlike the existing generate_staggered_data() guard at diff_diff/prep_dgp.py#L218, it does not enforce that each cohort period is an integer in [1, n_periods - 1] before assignment into the integer unit_cohort array at diff_diff/prep_dgp.py#L1262. Values like 0, 2.5, or n_periods + 1 can therefore be silently coerced or produce cohorts with no treated observations in-sample. Concrete fix: mirror the existing staggered-DGP range check, reject non-integer cohort periods explicitly, and fail fast before cohort assignment.

Performance

No findings in PR scope.

Maintainability

No additional findings in PR scope.

Tech Debt

No separate findings. I did not find a TODO.md entry that mitigates the P2/P3 items above.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new tests at tests/test_prep.py#L1326 and tests/test_prep.py#L1334 cover invalid weight_variation and empty cohort_periods, but not the remaining out-of-range or non-integer cohort_periods contract gap. Concrete fix: add negative tests for cohort_periods=[0], cohort_periods=[n_periods], and cohort_periods=[2.5].
Execution note: I could not run the suite or notebook in this workspace because numpy, pandas, and pytest are not installed.

Mirror the existing generate_staggered_data() guard: reject cohort periods that are non-integer, < 1, or >= n_periods. Add negative tests for out-of-range and non-integer inputs. Fix existing tests that used default cohort_periods with small n_periods (now caught by the new validation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T16:21:16Z

/ai-review

github-actions · 2026-04-01T16:29:27Z

🔁 AI review rerun (requested by @igerber)

Head SHA: df331ab1bb7ed8f560d905e26524965482cc2305

Overall Assessment

⚠️ Needs changes

Executive Summary

Re-review: the prior cohort_periods contract gap is fixed in diff_diff/prep_dgp.py:L1221, with negative tests added at tests/test_prep.py:L1342 and tests/test_prep.py:L1354. The earlier JK1/StackedDiD tutorial clarifications also remain aligned with docs/tutorials/16_survey_did.ipynb:L494, docs/tutorials/16_survey_did.ipynb:L694, docs/methodology/REGISTRY.md:L1040, and docs/methodology/REGISTRY.md:L2195.
Severity P1: generate_survey_did_data() validates JK1 on the configured PSU count, not the populated PSU count. With only one observed PSU, the new replicate-weight path can still fall through and hit a division-by-zero path in diff_diff/prep_dgp.py:L1282, diff_diff/prep_dgp.py:L1363, and diff_diff/prep_dgp.py:L1370.
Severity P2: the new survey-support table overstates non-pweight support for several estimators and omits the EfficientDiD survey-plus-covariates restriction in docs/tutorials/16_survey_did.ipynb:L686 and docs/tutorials/16_survey_did.ipynb:L728.
Severity P3: the notebook still globally suppresses warnings at docs/tutorials/16_survey_did.ipynb:L113, which hides the repeated-cross-section stationarity warning emitted at diff_diff/staggered.py:L1254.
I could not run pytest or execute the notebook here because numpy, pandas, pytest, and jupyter are not installed in this workspace.

Methodology

Severity: P2. Impact: the support matrix marks CallawaySantAnna, TripleDifference, ImputationDiD, and TwoStageDiD as having Full weight support in docs/tutorials/16_survey_did.ipynb:L691, docs/tutorials/16_survey_did.ipynb:L692, docs/tutorials/16_survey_did.ipynb:L695, and docs/tutorials/16_survey_did.ipynb:L696, but the implementations reject non-pweight survey designs in diff_diff/staggered.py:L1229, diff_diff/staggered.py:L1291, diff_diff/triple_diff.py:L496, diff_diff/imputation.py:L206, diff_diff/imputation.py:L255, diff_diff/two_stage.py:L202, and diff_diff/two_stage.py:L251. The same section also omits that EfficientDiD raises NotImplementedError when covariates and survey_design are both set in diff_diff/efficient_did.py:L331 and diff_diff/efficient_did.py:L384. Concrete fix: change those rows to pweight only / Full (pweight only) as appropriate and add a footnote for the EfficientDiD covariate restriction.
Severity: P3. Impact: docs/tutorials/16_survey_did.ipynb:L113 suppresses all warnings even though the comment says “numerical warnings,” so the repeated-cross-section assumption warning from diff_diff/staggered.py:L1254 is hidden during Section 8. Concrete fix: narrow the warning filter to the specific numerical warning(s) you want to silence, or reset warnings before the panel=False section and state the stationarity assumption explicitly.

Code Quality

Severity: P1. Impact: the JK1 guard uses the theoretical PSU count at diff_diff/prep_dgp.py:L1242 and diff_diff/prep_dgp.py:L1282, but replicate construction uses only populated PSUs at diff_diff/prep_dgp.py:L1363. If only one PSU is actually observed, n_rep == 1 and the rescaling in diff_diff/prep_dgp.py:L1370 becomes a division-by-zero path instead of the documented user-facing validation error. Concrete fix: validate len(np.unique(unit_psu)) or len(psu_ids) and raise ValueError when fewer than two populated PSUs remain.

Performance

No findings in PR scope.

Maintainability

No findings in PR scope.

Tech Debt

No findings in PR scope. I did not find a TODO.md entry that already tracks the P1/P2/P3 items above, so they remain unmitigated.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new tests cover the fixed cohort_periods issues, but test_jk1_minimum_psu_guard in tests/test_prep.py:L1303 only exercises the n_psu_total == 1 case. It does not cover the one-populated-PSU case that still reaches replicate construction. Concrete fix: add a regression test for n_units=1, n_strata=1, psu_per_stratum=2, include_replicate_weights=True and assert the public ValueError.
Execution note: I could not run pytest or jupyter nbconvert --execute because numpy, pandas, pytest, and jupyter are unavailable in this workspace.

Path to Approval

Change generate_survey_did_data() so the JK1 precondition is checked on populated PSUs, not n_strata * psu_per_stratum, and raise the documented ValueError when fewer than two PSUs are observed.
Add a regression test for the one-populated-PSU JK1 case so CI enforces that contract.

- CallawaySantAnna, TripleDifference, ImputationDiD, TwoStageDiD: change Weights column from "Full" to "pweight only" (all reject fweight/aweight) - Add EfficientDiD footnote: covariates + survey_design cannot be used simultaneously - Clarify legend for "pweight only" in both Weights and TSL columns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- JK1 replicate guard now checks populated PSU count (not just configured count), preventing division-by-zero when n_units < n_psu_total leaves only 1 PSU with observations (P1) - Add regression test for one-populated-PSU edge case - Narrow notebook warning filter from global ignore to RuntimeWarning only, so UserWarning (e.g. repeated cross-section stationarity assumption) is preserved (P3) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T17:27:44Z

/ai-review

github-actions · 2026-04-01T17:34:09Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 93f8103842b38bccb0d1a9e8ef9698c6211c425d

Overall Assessment

Looks good

Executive Summary

Re-review: the prior JK1 populated-PSU blocker is fixed. generate_survey_did_data() now re-checks populated PSUs before building replicates at diff_diff/prep_dgp.py:L1363, with regression coverage at tests/test_prep.py:L1315.
Re-review: the survey-support table and the EfficientDiD survey-plus-covariates caveat in docs/tutorials/16_survey_did.ipynb:L656 now line up with the current estimator restrictions and Registry notes.
Re-review: the notebook now suppresses only RuntimeWarnings at docs/tutorials/16_survey_did.ipynb:L112, so the repeated-cross-section assumption warning from diff_diff/staggered.py:L1254 is no longer hidden.
Severity P2: the new cohort_periods validator rejects NumPy integer scalars, so array-derived cohort inputs fail for the new public helper.
Severity P3: the new tutorial is only linked from the tutorials index; the main entry points still do not surface it.
I could not run pytest or execute the notebook here because pytest, numpy, and pandas are unavailable in this workspace.

Methodology

No findings. I cross-checked the updated support claims in docs/tutorials/16_survey_did.ipynb:L656 against diff_diff/staggered.py:L1291, diff_diff/triple_diff.py:L496, diff_diff/imputation.py:L255, diff_diff/two_stage.py:L251, diff_diff/efficient_did.py:L384, docs/methodology/REGISTRY.md:L420, and docs/methodology/REGISTRY.md:L691. The prior methodology/documentation mismatches appear resolved.

Code Quality

Severity: P2. Impact: generate_survey_did_data() only accepts built-in int values in cohort_periods at diff_diff/prep_dgp.py:L1225, so inputs like np.array([2, 3], dtype=np.int64) raise ValueError even though the existing staggered DGP only range-checks cohort periods at diff_diff/prep_dgp.py:L218. The new tests only cover float rejection at tests/test_prep.py:L1368. Concrete fix: accept np.integer alongside int while still rejecting bool, and add a regression test using NumPy integer cohort periods.

Performance

No findings in PR scope.

Maintainability

No findings in PR scope.

Tech Debt

No findings in PR scope.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the tutorial is indexed in docs/tutorials/README.md:L54, but README.md, docs/choosing_estimator.rst, and docs/quickstart.rst still do not reference it, so discoverability is lower than the “Implemented” wording in docs/survey-roadmap.md:L162 suggests. Concrete fix: add short cross-links in those entry points, or soften the roadmap wording to reflect that this PR ships the notebook plus the tutorials index entry.
Execution note: I could not run the added tests or notebook in this environment because pytest, numpy, and pandas are not installed.

…nces - Accept np.integer alongside int in cohort_periods validation, matching the existing generate_staggered_data() behavior (P2) - Add regression test for numpy integer cohort periods - Add tutorial 16 cross-links in README.md, quickstart.rst, and choosing_estimator.rst for discoverability (P3) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T17:43:59Z

/ai-review

github-actions · 2026-04-01T18:02:49Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 84fffb522772abd4c15bb6e409cee4cfcede0e83

Overall Assessment

⚠️ Needs changes

Executive Summary

Re-review: the prior discoverability gap is fixed. The tutorial is now surfaced from README.md:L143, docs/quickstart.rst:L208, and docs/choosing_estimator.rst:L573.
Re-review: the prior NumPy-cohort fix is only partial. Element validation now accepts np.integer, but direct array-like inputs still fail at the truthiness guard in diff_diff/prep_dgp.py:L1223, and the new regression only covers list(periods) at tests/test_prep.py:L1376.
Severity P1 [Newly identified]: generate_survey_did_data() still hard-codes cohort_periods=[3, 5] when cohort_periods=None, so otherwise valid calls like generate_survey_did_data(n_periods=4) or n_periods=5 fail before any data is generated at diff_diff/prep_dgp.py:L1221. The sibling staggered DGP adapts defaults to n_periods at diff_diff/prep_dgp.py:L216.
Severity P3: the new survey-support matrix says it covers survey design “across all estimators,” but omits StaggeredTripleDifference even though both the Registry and the estimator docstring document survey support at docs/tutorials/16_survey_did.ipynb:L656, docs/methodology/REGISTRY.md:L1393, and diff_diff/staggered_triple_diff.py:L214.
I could not run pytest or execute the notebook here because pytest, numpy, and pandas are not installed in this workspace.

Methodology

No findings. This PR adds a DGP and documentation only; I did not find an estimator-math, weighting, variance, or default-behavior change that conflicts with the Methodology Registry. The survey-support claims I spot-checked against the Registry and estimator docstrings line up aside from the documentation completeness issue noted below.

Code Quality

Severity P1 [Newly identified]. Impact: the new public helper’s default parameter interaction is broken. cohort_periods=None maps to [3, 5] unconditionally in diff_diff/prep_dgp.py:L1221, so small-but-valid panels (n_periods=4 or 5) now error out during validation at diff_diff/prep_dgp.py:L1230. The existing staggered helper already avoids this by deriving defaults from n_periods in diff_diff/prep_dgp.py:L216. The added tests only exercise n_periods=4 when the caller manually supplies cohort_periods at tests/test_prep.py:L1332, so this path remains uncovered. Concrete fix: derive default cohort periods from n_periods (mirroring the existing staggered DGP or another explicit small-panel rule), or reject cohort_periods=None with a direct explanatory error before validation; add regression coverage for default n_periods=4 and n_periods=5.
Severity P2. Impact: the NumPy-integer fix is incomplete because if not cohort_periods in diff_diff/prep_dgp.py:L1223 still throws NumPy’s ambiguous-truth-value error when callers pass np.array([3, 5], dtype=np.int64) directly. The new test at tests/test_prep.py:L1376 only covers list(periods), so the array-like API case that triggered the original review concern is still not exercised. Concrete fix: coerce cohort_periods to a list/tuple before the emptiness check, or switch to an explicit length check, and add a regression test that passes the NumPy array itself.

Performance

No findings in PR scope.

Maintainability

No findings in PR scope.

Tech Debt

No findings in TODO.md scope. No new deferral was added for the cohort_periods issues above, so they remain live defects rather than accepted debt.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the tutorial’s capability table presents itself as the survey-support reference for “all estimators,” but it omits StaggeredTripleDifference at docs/tutorials/16_survey_did.ipynb:L656, even though survey support is documented in both docs/methodology/REGISTRY.md:L1393 and diff_diff/staggered_triple_diff.py:L214. That makes the new tutorial slightly misleading as a support matrix. Concrete fix: add a StaggeredTripleDifference row, or soften the lead-in so the table is clearly a subset rather than a complete catalog.
Execution note: I could not run the added tests or the notebook here because pytest, numpy, and pandas are unavailable in this environment.

Path to Approval

Make generate_survey_did_data() handle cohort_periods=None correctly for small n_periods by deriving valid defaults from n_periods or by raising an explicit upfront error for unsupported combinations, and add regression tests covering at least generate_survey_did_data(n_periods=4) and generate_survey_did_data(n_periods=5) with default cohort_periods.

…eDiff to table - Derive default cohort_periods from n_periods (mirroring generate_staggered_data) so n_periods=4/5 work without explicit cohort_periods (P1) - Coerce cohort_periods to list before emptiness check, fixing numpy array direct input (P2) - Add StaggeredTripleDifference row to survey support table (P3) - Add regression tests for small n_periods defaults and numpy array cohort_periods Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T18:05:31Z

/ai-review

github-actions · 2026-04-01T18:15:32Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 036833c9df7206d42915a9af8719c423922baeb1

Overall Assessment

⚠️ Needs changes

Executive Summary

Re-review: the earlier discoverability gap is fixed. The new tutorial is now surfaced from README.md:L138-L143, docs/quickstart.rst:L202-L208, and docs/choosing_estimator.rst:L573-L575.
Re-review: the NumPy-array cohort_periods failure is fixed in both the helper and its regression coverage at diff_diff/prep_dgp.py:L1227-L1235 and tests/test_prep.py:L1376-L1387.
Severity P1: the small-panel default fix is still semantically off by one for a 1-indexed generator. diff_diff/prep_dgp.py:L1221-L1238 allows first_treat=1 and rejects first_treat=n_periods; for n_periods=3-5 that creates always-treated cohorts with no pre-period, and downstream CallawaySantAnna silently drops those groups when g-1 is absent at diff_diff/staggered.py:L491-L505, consistent with tests/test_csdid_ported.py:L613-L638.
The new regression only checks numeric bounds and row counts at tests/test_prep.py:L1389-L1401, so it misses the estimability failure above.
I could not run pytest or execute the notebook here because numpy, pandas, and pytest are not installed in this workspace.

Methodology

No unmitigated findings. This PR does not change estimator math, weighting formulas, or variance code paths, and the prior survey-support table omission is fixed: the tutorial now includes StaggeredTripleDifference in line with docs/tutorials/16_survey_did.ipynb:L656-L656, docs/methodology/REGISTRY.md:L1431-L1435, and diff_diff/staggered_triple_diff.py:L214-L220.

Code Quality

Severity P1. Impact: generate_survey_did_data() uses 1-indexed periods, but its cohort validation/default logic still follows the 0-indexed generate_staggered_data() contract at diff_diff/prep_dgp.py:L1221-L1238. That means it accepts g=1 even though those units have no observable pre-treatment period, and it rejects g=n_periods even though the base period g-1 exists. For n_periods=3-5, the new default path emits g=1, so the helper can silently generate treated cohorts that are dropped by CallawaySantAnna with no warning at diff_diff/staggered.py:L491-L505. Concrete fix: shift the 1-indexed contract so generated cohorts always have at least one pre-period; if last-period adoption is intended, allow g=n_periods, and if short panels cannot support multiple estimable cohorts, raise a direct early error instead of defaulting to g=1.

Performance

No findings in PR scope.

Maintainability

No findings in PR scope.

Tech Debt

No separate findings. The P1 above is not tracked under TODO.md:L47-L71, so it is not mitigated by existing deferral.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the supporting docs/tests still encode the wrong cohort contract. The helper docstring still says the default is [3, 5] at diff_diff/prep_dgp.py:L1171-L1172, test_cohort_period_out_of_range treats g=n_periods as invalid at tests/test_prep.py:L1356-L1366, and test_default_cohort_periods_small_n_periods only checks numeric range/row-count invariants at tests/test_prep.py:L1389-L1401. Concrete fix: after correcting the code, update the docstring and rewrite these regressions to reject g=1, allow any intended last-period cohort behavior explicitly, and assert that default cohorts all have at least one pre-treatment observation.

Path to Approval

Fix generate_survey_did_data()’s 1-indexed cohort contract so defaults never produce first_treat=1. Either allow last-period adoption and validate accordingly, or raise an explicit error for unsupported short-panel defaults.
Replace the current small-panel regression with an estimability check: for default n_periods=3/4/5, assert that every positive generated cohort has a pre-period and survives a downstream CallawaySantAnna(...).fit(...).group_time_effects pass.
Update the helper docstring and the out-of-range/default-cohort tests to match the corrected contract.

…lity - Default cohort_periods now requires g >= 2 so every cohort has at least one pre-treatment period (needed for CallawaySantAnna). For n_periods < 4, raises ValueError instead of producing inestimable cohorts (P1) - Allow g = n_periods (last-period adoption has base period g-1) - Reject g = 1 (no pre-period exists) - Update docstring to reflect derived defaults and g >= 2 contract - Strengthen tests: assert all cohorts have pre-periods, test n_periods=3 rejection, test g=n_periods is valid Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T18:18:56Z

/ai-review

github-actions · 2026-04-01T18:32:43Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 05fb867651d57d2a84f4f72263a3e08ec7622621

Overall Assessment

⚠️ Needs changes

Executive Summary

Re-review: the earlier generate_survey_did_data() cohort-contract defect is fixed. Defaults now enforce g >= 2, allow g = n_periods, and reject unsupported default small panels in diff_diff/prep_dgp.py:1223 and tests/test_prep.py:1356.
[Newly identified] The new survey tutorial still plots “survey-aware” event-study intervals as 1.96 * se, even though the notebook text and the library’s inference path use survey-d.f. t critical values. That makes the displayed survey intervals too narrow whenever df_survey is finite.
Tutorial discoverability is now fixed via the README, quickstart, and choosing-estimator links.
Non-blocking: the added small-panel regression still checks cohort bounds only, not end-to-end estimability through CallawaySantAnna.

Methodology

Estimator code paths are unchanged in this PR scope, and the prior DGP boundary issue is resolved.

Severity P1 [Newly identified]. Impact: the tutorial explicitly says survey inference uses df = n_PSU - n_strata t-distribution critical values at docs/tutorials/16_survey_did.ipynb:262, and the library applies survey df in inference at diff_diff/linalg.py:2050 and docs/methodology/REGISTRY.md:2117. But both event-study plots still draw the “survey-aware” intervals as 1.96 * se at docs/tutorials/16_survey_did.ipynb:134 and docs/tutorials/16_survey_did.ipynb:354. Under the tutorial’s own default design (docs/tutorials/16_survey_did.ipynb:81 with 5 strata and 40 PSUs), the correct survey critical value is larger than 1.96, so the plotted blue intervals understate uncertainty. Concrete fix: stop recomputing CIs in the notebook and plot from each effect’s returned conf_int instead, using asymmetric error bars (effect - ci_low, ci_high - effect).

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The P1 above is not mitigated by an existing TODO.md entry.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new small-panel regression at tests/test_prep.py:1392 only asserts 2 <= g <= n_periods; it still does not exercise a downstream CallawaySantAnna fit, so a future boundary regression could slip back in without breaking this test. Concrete fix: extend the test to fit CallawaySantAnna(control_group='never_treated') on the default n_periods=4..7 outputs and assert that each treated cohort contributes at least one group_time_effects entry.
I could not run pytest or execute the notebook here because numpy is not installed in this workspace.

Path to Approval

Replace the two hard-coded 1.96 * se event-study plot calculations in docs/tutorials/16_survey_did.ipynb:134 and docs/tutorials/16_survey_did.ipynb:354 with intervals derived from each result object’s conf_int.
Re-execute the notebook after that change and confirm the plotted survey-aware intervals match the estimator’s reported inference.

… plots Both event study plots now use the returned conf_int (which respects survey df t-critical values) instead of hardcoded z=1.96, ensuring plotted intervals match the library's inference output (P1). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T18:34:56Z

/ai-review

github-actions · 2026-04-01T18:56:39Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 4d7e169df8782cf7d03d55a6fbfef22a5d265b3c

Overall Assessment

⚠️ Needs changes

Executive Summary

Re-review: the prior notebook inference defect is fixed. Both event-study plotting cells now use each result’s returned conf_int rather than hard-coded 1.96 * se in docs/tutorials/16_survey_did.ipynb:L115-L119 and docs/tutorials/16_survey_did.ipynb:L305-L309.
No estimator math changed in this diff, and the tutorial’s survey-variance narrative is broadly aligned with the local methodology registry and survey implementation.
[Newly identified] The new public generate_survey_did_data() helper does not validate several basic parameter contracts, so invalid inputs can yield malformed output or internally impossible survey designs instead of clear ValueErrors. That is the blocking issue.
Non-blocking: the new small-panel regression test still checks cohort bounds only and does not actually fit CallawaySantAnna, so the exact end-to-end estimability contract from the earlier bug is still not locked in.
I could not run pytest or execute the notebook here because numpy, pandas, and pytest are not installed in this workspace.

Methodology

Severity P1 [Newly identified]. Impact: generate_survey_did_data() is a new public survey DGP, but it only validates cohort_periods and weight_variation before constructing survey structure and treatment assignment in diff_diff/prep_dgp.py:L1223-L1296. That leaves several silent or late-failing bad states unguarded: n_units=0 produces an empty DataFrame with none of the documented columns at diff_diff/prep_dgp.py:L1375-L1395; out-of-range never_treated_frac values feed directly into slice arithmetic at diff_diff/prep_dgp.py:L1283-L1296 and can silently generate the wrong cohort structure; and impossible FPC values are emitted unchanged in the fpc column even though the library’s own survey resolver requires FPC >= n_PSU in diff_diff/survey.py:L326-L360. Concrete fix: add upfront validation for positive count parameters, 0 <= never_treated_frac <= 1, and fpc_per_stratum >= the populated PSU count in each stratum, and raise explicit ValueErrors before any records are built.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The blocking validation gap above is not currently tracked in TODO.md, so it remains unmitigated.

Security

No findings.

Documentation/Tests

Severity P3. Impact: test_default_cohort_periods_small_n_periods() still only asserts that default cohorts stay within bounds, and never verifies that those defaults remain estimable through the downstream staggered estimator in tests/test_prep.py:L1392-L1412. A future regression could reintroduce a small-panel usability break without failing this test as long as the cohorts remain “in range.” Concrete fix: extend that test to fit CallawaySantAnna(control_group='never_treated') on the default n_periods=4..7 outputs and assert that each treated cohort contributes at least one group-time effect.

Path to Approval

Add explicit parameter validation in generate_survey_did_data() for n_units > 0, n_periods > 0, n_strata > 0, psu_per_stratum > 0, 0 <= never_treated_frac <= 1, and FPC coherence with populated PSU counts per stratum.
Add regression tests for invalid inputs such as n_units=0, never_treated_frac=-0.1, never_treated_frac=1.1, and fpc_per_stratum smaller than the populated PSU count, asserting clean ValueErrors instead of malformed output or late failures.
Strengthen the small-panel default-cohort test to run an actual CallawaySantAnna(control_group='never_treated') fit for n_periods=4..7 and assert non-empty downstream effects.

Validate n_units, n_periods, n_strata, psu_per_stratum > 0, never_treated_frac in [0, 1], and fpc_per_stratum >= psu_per_stratum before building any records. Prevents malformed output or late failures from invalid inputs (P1). Add negative tests for all boundary cases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-01T18:58:47Z

/ai-review

github-actions · 2026-04-01T19:11:45Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a5f1a3e105ae3867338925061e73e7582fcf5ad5

Overall Assessment

Looks good

Executive Summary

Re-review: the prior blocking validation issue in generate_survey_did_data() is resolved. The new helper now rejects zero/invalid count parameters, out-of-range never_treated_frac, invalid cohort periods, and bad JK1/FPC setups in diff_diff/prep_dgp.py:L1223-L1404, with matching regression tests in tests/test_prep.py:L1303-L1432.
No estimator formulas, weighting rules, variance estimators, or defaults changed in this PR; the methodology-sensitive tutorial content is broadly aligned with the survey registry and current implementation notes.
I did not find a new P0 or P1 issue in the DGP helper, exports, or documentation updates.
One non-blocking test gap remains: the small-panel default-cohort regression test still checks bounds only, not actual downstream estimability through CallawaySantAnna in tests/test_prep.py:L1392-L1404.
I could not execute pytest or the notebook here because numpy, pandas, and pytest are not installed in this workspace.

Methodology

No findings. No estimator math changed, and the new tutorial’s survey-design, repeated-cross-section, and subpopulation guidance is consistent with the registry notes at docs/methodology/REGISTRY.md:L420-L426 and docs/methodology/REGISTRY.md:L2273-L2292. The prior methodology-adjacent validation defect in the new public DGP is also addressed in diff_diff/prep_dgp.py:L1223-L1404.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity P3. Impact: test_default_cohort_periods_small_n_periods() only verifies that derived cohorts satisfy 2 <= g <= n_periods, while the generator’s default-selection logic is explicitly justified in terms of Callaway-Sant’Anna estimability in diff_diff/prep_dgp.py:L1242-L1254. A future regression could preserve those bounds while still producing no usable downstream effects on small panels. Concrete fix: extend tests/test_prep.py:L1392-L1404 to fit CallawaySantAnna(control_group="never_treated") on the n_periods=4..7 outputs and assert at least one finite ATT(g,t) or non-empty aggregation result per treated cohort.

igerber and others added 2 commits April 1, 2026 13:24

igerber merged commit 6e3628f into main Apr 1, 2026
14 checks passed

igerber deleted the survey-tutorial branch April 1, 2026 20:18

igerber mentioned this pull request Apr 1, 2026

Bump version to 2.8.1 #254

Merged

Conversation

igerber commented Apr 1, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

igerber commented Apr 1, 2026

Uh oh!

github-actions Bot commented Apr 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant