Skip to content

Add survey-aware DiD tutorial (Phase 7c)#249

Merged
igerber merged 11 commits into
mainfrom
survey-tutorial
Apr 1, 2026
Merged

Add survey-aware DiD tutorial (Phase 7c)#249
igerber merged 11 commits into
mainfrom
survey-tutorial

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 1, 2026

Summary

  • Add docs/tutorials/16_survey_did.ipynb — 35-cell tutorial framed around a state-level preventive care program evaluated with a stratified health survey (ACS/BRFSS-like)
  • Add generate_survey_did_data() DGP function to diff_diff/prep_dgp.py with realistic survey structure (strata, PSUs, FPC, weights, optional JK1 replicate weights)
  • Add 12 unit tests for the new DGP function
  • Mark Phase 7c complete in docs/survey-roadmap.md

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no estimator math changes; tutorial and DGP function only
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: tests/test_prep.py::TestGenerateSurveyDidData (12 tests)
  • Backtest / simulation / notebook evidence (if applicable): Tutorial notebook validates end-to-end via jupyter nbconvert --execute

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Add docs/tutorials/16_survey_did.ipynb — a 35-cell tutorial framed
around a state-level preventive care program evaluated with a
stratified health survey (ACS/BRFSS-like). Covers: why survey design
matters for DiD, SurveyDesign setup, basic and staggered DiD with
survey design, replicate weights (JK1), subpopulation analysis, DEFF
diagnostics, repeated cross-sections, and estimator support reference.

Also adds generate_survey_did_data() to diff_diff/prep_dgp.py with
realistic survey structure (strata, PSUs, FPC, sampling weights,
optional JK1 replicate weights) and 12 unit tests.

Marks Phase 7c complete in the survey roadmap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

Overall Assessment

⚠️ Needs changes

Executive Summary

  • No estimator implementation changed in this PR; the review risk is concentrated in the new survey tutorial and the new generate_survey_did_data() public DGP.
  • The tutorial’s estimator-support matrix overstates survey support for ImputationDiD and TwoStageDiD: it says full strata/PSU/FPC (TSL) support, but both the registry and code explicitly reject fpc.
  • The repeated-cross-section section does not actually generate repeated cross-sections; it relabels panel observations, which does not match the panel=False methodology documented for CallawaySantAnna.
  • The committed notebook still contains a first-cell ImportError and was not re-executed after the new export was added.
  • The new JK1 replicate-weight path lacks a minimum-PSU guard; the arithmetic is undefined at one PSU, and the new tests do not cover that boundary.

Methodology

Methods affected in this PR are CallawaySantAnna tutorial coverage (including panel=False), plus survey-support claims for ImputationDiD and TwoStageDiD. I cross-checked these against the Methodology Registry and the in-code survey guards.

Code Quality

  • Severity: P2. Impact: JK1 replicate generation has no precondition check for the minimum number of PSUs. The code computes n_rep / (n_rep - 1), which is undefined when n_rep == 1, so include_replicate_weights=True can produce malformed replicate weights on boundary inputs instead of failing fast. The new tests only exercise a 12-PSU happy path. References: diff_diff/prep_dgp.py:1217, diff_diff/prep_dgp.py:1311, diff_diff/prep_dgp.py:1320, tests/test_prep.py:1234. Concrete fix: Validate positive n_strata/psu_per_stratum, and require at least two PSUs when include_replicate_weights=True; add a negative test for the one-PSU case.

Performance

  • No material findings in PR scope.

Maintainability

  • No material findings in PR scope.

Tech Debt

  • No material findings in PR scope.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The notebook was committed with a first-cell ImportError output from an environment where generate_survey_did_data was not importable, and none of the later cells were re-executed. That makes the tutorial artifact untrustworthy as committed, even though the code export was added in this PR. The new tests also only import from diff_diff.prep, not the top-level import path used by the notebook. References: docs/tutorials/16_survey_did.ipynb:38, docs/tutorials/16_survey_did.ipynb:60, diff_diff/init.py:91, tests/test_prep.py:1176. Concrete fix: Re-run or clear notebook outputs against the current source tree, and add either an nbconvert --execute smoke test or at least a top-level from diff_diff import generate_survey_did_data import test.

Path to Approval

  1. Fix the tutorial support matrix and related summary text so ImputationDiD and TwoStageDiD match the documented/code reality: no FPC support, no “Full strata/PSU/FPC (TSL)” claim.
  2. Replace the repeated-cross-section example with a genuine repeated-cross-section DGP path instead of relabeling panel IDs.

- Fix estimator support table: ImputationDiD and TwoStageDiD now show
  "Partial (no FPC)" instead of "Full" for strata/PSU/FPC support,
  matching the code which explicitly rejects FPC (P1)
- Fix repeated cross-section example: add panel=False parameter to
  generate_survey_did_data() that draws fresh respondent effects each
  period instead of relabeling panel unit IDs (P1)
- Add JK1 minimum-PSU guard: raise ValueError when n_psu < 2 to
  prevent division by zero in replicate weight generation (P2)
- Clear stale notebook outputs committed from wrong environment (P2)
- Add top-level import test and JK1 boundary test (P2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 7d381de4c03daf6a167ddf49879c9f4f64bcb415


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Most prior review items are fixed in this revision: the ImputationDiD/TwoStageDiD FPC claims were corrected, the repeated-cross-section example now uses panel=False, notebook outputs were cleared, and the JK1 minimum-PSU guard/test were added.
  • No estimator implementation changed in this PR; the remaining risk is methodology accuracy in the new tutorial plus edge-case behavior in the new public DGP.
  • The new support matrix still misclassifies StackedDiD: the notebook says no strata/PSU/FPC support, but the registry and estimator implement TSL variance with full design metadata on composed weights.
  • The replicate-weight walkthrough builds plain JK1 delete-one-PSU replicates on a stratified/FPC DGP, then states that JK1 and TSL both “correctly account for the survey design.” That comparison is not design-consistent.
  • generate_survey_did_data() is exported correctly and has decent happy-path coverage, but it still lacks validation for documented parameter contracts like weight_variation and cohort_periods.
  • I could not execute pytest or the notebook in this workspace because pytest and numpy are unavailable.

Methodology

Code Quality

  • Severity: P2. Impact: generate_survey_did_data() documents weight_variation as an enum and cohort_periods as treatment periods, but it does not validate either. Unknown weight_variation values silently fall through to the 1.0 branch, and cohort_periods has no explicit empty/range guard, unlike the existing staggered DGP. That lets the new public generator produce a different weight/treatment structure than the caller likely intended, with no clear error. Locations: diff_diff/prep_dgp.py#L1187, diff_diff/prep_dgp.py#L1244, diff_diff/prep_dgp.py#L1252, diff_diff/prep_dgp.py#L218, tests/test_prep.py#L1176. Concrete fix: validate weight_variation against the documented set, validate cohort_periods explicitly, and add negative tests for those contracts.

Performance

No material findings in PR scope.

Maintainability

No material findings beyond the table/legend ambiguity called out in Methodology.

Tech Debt

No material findings. I did not find matching TODO.md entries that mitigate the P1/P2 items above.

Security

No findings.

Documentation/Tests

No additional material findings beyond the coverage gap noted in Code Quality. The earlier stale-output/import concern appears addressed by clearing notebook outputs and adding a top-level import test at tests/test_prep.py#L1296. I could not run the notebook or test suite in this workspace because numpy and pytest are unavailable.

Path to Approval

  1. Fix the survey-support matrix so StackedDiD reflects its actual strata/PSU/FPC TSL support, and adjust the legend so weight_type='pweight' restrictions are not conflated with “no full design support.”
  2. Make the replicate-weight section design-consistent: either generate stratified replicate weights for the stratified DGP or remove the claim that the JK1 example is the same survey design as the strata/PSU/FPC TSL fit.

- Fix StackedDiD support table: show "Full (pweight only)" for
  strata/PSU/FPC since it supports full TSL on composed weights,
  just restricted to pweight weight type (P1)
- Fix replicate weight section: remove misleading TSL vs JK1
  equivalence claim; show JK1 as standalone API demo and note
  that stratified replicates (JKn) are needed for stratified
  designs (P1)
- Add input validation for weight_variation and cohort_periods
  in generate_survey_did_data() with negative tests (P2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 2dc1888b028cfdd53d3d02dc8b1cb90c380485b1


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review: the two prior P1 methodology issues appear resolved. The support table now reflects StackedDiD’s documented pweight-only/full-TSL behavior at docs/tutorials/16_survey_did.ipynb#L694 and docs/methodology/REGISTRY.md#L1039, and the JK1 walkthrough now explicitly labels the generated replicates as unstratified at docs/tutorials/16_survey_did.ipynb#L494, consistent with docs/methodology/REGISTRY.md#L2195.
  • No estimator implementation changed in this PR; the remaining review risk is tutorial accuracy and public DGP contract handling.
  • Severity P2: generate_survey_did_data() still does not validate that cohort_periods are in-range integers, so invalid inputs can silently create extra never-treated units or cohorts that are never treated in-sample.
  • Severity P3: the notebook globally suppresses warnings, which also hides the repeated-cross-section stationarity warning emitted by CallawaySantAnna(panel=False).
  • I could not run pytest or execute the notebook here because numpy, pandas, and pytest are unavailable in this workspace.

Methodology

Prior methodology blockers from the last review look resolved.

  • Severity: P3. Impact: docs/tutorials/16_survey_did.ipynb#L114 suppresses all warnings even though the comment says it is only suppressing numerical warnings. That also hides the repeated-cross-section assumption warning emitted at diff_diff/staggered.py#L1256, while Section 8 introduces CallawaySantAnna(panel=False) at docs/tutorials/16_survey_did.ipynb#L632. Concrete fix: replace the global ignore with a targeted filter for specific numerical warnings, or reset warnings before Section 8 and state the stationary cross-sectional sampling assumption explicitly.

Code Quality

  • Severity: P2. Impact: the new generator only checks that cohort_periods is non-empty at diff_diff/prep_dgp.py#L1221 and diff_diff/prep_dgp.py#L1224. Unlike the existing generate_staggered_data() guard at diff_diff/prep_dgp.py#L218, it does not enforce that each cohort period is an integer in [1, n_periods - 1] before assignment into the integer unit_cohort array at diff_diff/prep_dgp.py#L1262. Values like 0, 2.5, or n_periods + 1 can therefore be silently coerced or produce cohorts with no treated observations in-sample. Concrete fix: mirror the existing staggered-DGP range check, reject non-integer cohort periods explicitly, and fail fast before cohort assignment.

Performance

  • No findings in PR scope.

Maintainability

  • No additional findings in PR scope.

Tech Debt

  • No separate findings. I did not find a TODO.md entry that mitigates the P2/P3 items above.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new tests at tests/test_prep.py#L1326 and tests/test_prep.py#L1334 cover invalid weight_variation and empty cohort_periods, but not the remaining out-of-range or non-integer cohort_periods contract gap. Concrete fix: add negative tests for cohort_periods=[0], cohort_periods=[n_periods], and cohort_periods=[2.5].
  • Execution note: I could not run the suite or notebook in this workspace because numpy, pandas, and pytest are not installed.

Mirror the existing generate_staggered_data() guard: reject
cohort periods that are non-integer, < 1, or >= n_periods.
Add negative tests for out-of-range and non-integer inputs.
Fix existing tests that used default cohort_periods with small
n_periods (now caught by the new validation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: df331ab1bb7ed8f560d905e26524965482cc2305


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

Code Quality

Performance

  • No findings in PR scope.

Maintainability

  • No findings in PR scope.

Tech Debt

  • No findings in PR scope. I did not find a TODO.md entry that already tracks the P1/P2/P3 items above, so they remain unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new tests cover the fixed cohort_periods issues, but test_jk1_minimum_psu_guard in tests/test_prep.py:L1303 only exercises the n_psu_total == 1 case. It does not cover the one-populated-PSU case that still reaches replicate construction. Concrete fix: add a regression test for n_units=1, n_strata=1, psu_per_stratum=2, include_replicate_weights=True and assert the public ValueError.
  • Execution note: I could not run pytest or jupyter nbconvert --execute because numpy, pandas, pytest, and jupyter are unavailable in this workspace.

Path to Approval

  1. Change generate_survey_did_data() so the JK1 precondition is checked on populated PSUs, not n_strata * psu_per_stratum, and raise the documented ValueError when fewer than two PSUs are observed.
  2. Add a regression test for the one-populated-PSU JK1 case so CI enforces that contract.

igerber and others added 2 commits April 1, 2026 13:24
- CallawaySantAnna, TripleDifference, ImputationDiD, TwoStageDiD:
  change Weights column from "Full" to "pweight only" (all reject
  fweight/aweight)
- Add EfficientDiD footnote: covariates + survey_design cannot be
  used simultaneously
- Clarify legend for "pweight only" in both Weights and TSL columns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- JK1 replicate guard now checks populated PSU count (not just
  configured count), preventing division-by-zero when n_units <
  n_psu_total leaves only 1 PSU with observations (P1)
- Add regression test for one-populated-PSU edge case
- Narrow notebook warning filter from global ignore to
  RuntimeWarning only, so UserWarning (e.g. repeated cross-section
  stationarity assumption) is preserved (P3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 93f8103842b38bccb0d1a9e8ef9698c6211c425d


Overall Assessment

Looks good

Executive Summary

  • Re-review: the prior JK1 populated-PSU blocker is fixed. generate_survey_did_data() now re-checks populated PSUs before building replicates at diff_diff/prep_dgp.py:L1363, with regression coverage at tests/test_prep.py:L1315.
  • Re-review: the survey-support table and the EfficientDiD survey-plus-covariates caveat in docs/tutorials/16_survey_did.ipynb:L656 now line up with the current estimator restrictions and Registry notes.
  • Re-review: the notebook now suppresses only RuntimeWarnings at docs/tutorials/16_survey_did.ipynb:L112, so the repeated-cross-section assumption warning from diff_diff/staggered.py:L1254 is no longer hidden.
  • Severity P2: the new cohort_periods validator rejects NumPy integer scalars, so array-derived cohort inputs fail for the new public helper.
  • Severity P3: the new tutorial is only linked from the tutorials index; the main entry points still do not surface it.
  • I could not run pytest or execute the notebook here because pytest, numpy, and pandas are unavailable in this workspace.

Methodology

Code Quality

  • Severity: P2. Impact: generate_survey_did_data() only accepts built-in int values in cohort_periods at diff_diff/prep_dgp.py:L1225, so inputs like np.array([2, 3], dtype=np.int64) raise ValueError even though the existing staggered DGP only range-checks cohort periods at diff_diff/prep_dgp.py:L218. The new tests only cover float rejection at tests/test_prep.py:L1368. Concrete fix: accept np.integer alongside int while still rejecting bool, and add a regression test using NumPy integer cohort periods.

Performance

  • No findings in PR scope.

Maintainability

  • No findings in PR scope.

Tech Debt

  • No findings in PR scope.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the tutorial is indexed in docs/tutorials/README.md:L54, but README.md, docs/choosing_estimator.rst, and docs/quickstart.rst still do not reference it, so discoverability is lower than the “Implemented” wording in docs/survey-roadmap.md:L162 suggests. Concrete fix: add short cross-links in those entry points, or soften the roadmap wording to reflect that this PR ships the notebook plus the tutorials index entry.
  • Execution note: I could not run the added tests or notebook in this environment because pytest, numpy, and pandas are not installed.

…nces

- Accept np.integer alongside int in cohort_periods validation,
  matching the existing generate_staggered_data() behavior (P2)
- Add regression test for numpy integer cohort periods
- Add tutorial 16 cross-links in README.md, quickstart.rst, and
  choosing_estimator.rst for discoverability (P3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 84fffb522772abd4c15bb6e409cee4cfcede0e83


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

No findings. This PR adds a DGP and documentation only; I did not find an estimator-math, weighting, variance, or default-behavior change that conflicts with the Methodology Registry. The survey-support claims I spot-checked against the Registry and estimator docstrings line up aside from the documentation completeness issue noted below.

Code Quality

  • Severity P1 [Newly identified]. Impact: the new public helper’s default parameter interaction is broken. cohort_periods=None maps to [3, 5] unconditionally in diff_diff/prep_dgp.py:L1221, so small-but-valid panels (n_periods=4 or 5) now error out during validation at diff_diff/prep_dgp.py:L1230. The existing staggered helper already avoids this by deriving defaults from n_periods in diff_diff/prep_dgp.py:L216. The added tests only exercise n_periods=4 when the caller manually supplies cohort_periods at tests/test_prep.py:L1332, so this path remains uncovered. Concrete fix: derive default cohort periods from n_periods (mirroring the existing staggered DGP or another explicit small-panel rule), or reject cohort_periods=None with a direct explanatory error before validation; add regression coverage for default n_periods=4 and n_periods=5.
  • Severity P2. Impact: the NumPy-integer fix is incomplete because if not cohort_periods in diff_diff/prep_dgp.py:L1223 still throws NumPy’s ambiguous-truth-value error when callers pass np.array([3, 5], dtype=np.int64) directly. The new test at tests/test_prep.py:L1376 only covers list(periods), so the array-like API case that triggered the original review concern is still not exercised. Concrete fix: coerce cohort_periods to a list/tuple before the emptiness check, or switch to an explicit length check, and add a regression test that passes the NumPy array itself.

Performance

No findings in PR scope.

Maintainability

No findings in PR scope.

Tech Debt

No findings in TODO.md scope. No new deferral was added for the cohort_periods issues above, so they remain live defects rather than accepted debt.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the tutorial’s capability table presents itself as the survey-support reference for “all estimators,” but it omits StaggeredTripleDifference at docs/tutorials/16_survey_did.ipynb:L656, even though survey support is documented in both docs/methodology/REGISTRY.md:L1393 and diff_diff/staggered_triple_diff.py:L214. That makes the new tutorial slightly misleading as a support matrix. Concrete fix: add a StaggeredTripleDifference row, or soften the lead-in so the table is clearly a subset rather than a complete catalog.
  • Execution note: I could not run the added tests or the notebook here because pytest, numpy, and pandas are unavailable in this environment.

Path to Approval

  1. Make generate_survey_did_data() handle cohort_periods=None correctly for small n_periods by deriving valid defaults from n_periods or by raising an explicit upfront error for unsupported combinations, and add regression tests covering at least generate_survey_did_data(n_periods=4) and generate_survey_did_data(n_periods=5) with default cohort_periods.

…eDiff to table

- Derive default cohort_periods from n_periods (mirroring
  generate_staggered_data) so n_periods=4/5 work without
  explicit cohort_periods (P1)
- Coerce cohort_periods to list before emptiness check, fixing
  numpy array direct input (P2)
- Add StaggeredTripleDifference row to survey support table (P3)
- Add regression tests for small n_periods defaults and numpy
  array cohort_periods

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 036833c9df7206d42915a9af8719c423922baeb1


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

No unmitigated findings. This PR does not change estimator math, weighting formulas, or variance code paths, and the prior survey-support table omission is fixed: the tutorial now includes StaggeredTripleDifference in line with docs/tutorials/16_survey_did.ipynb:L656-L656, docs/methodology/REGISTRY.md:L1431-L1435, and diff_diff/staggered_triple_diff.py:L214-L220.

Code Quality

  • Severity P1. Impact: generate_survey_did_data() uses 1-indexed periods, but its cohort validation/default logic still follows the 0-indexed generate_staggered_data() contract at diff_diff/prep_dgp.py:L1221-L1238. That means it accepts g=1 even though those units have no observable pre-treatment period, and it rejects g=n_periods even though the base period g-1 exists. For n_periods=3-5, the new default path emits g=1, so the helper can silently generate treated cohorts that are dropped by CallawaySantAnna with no warning at diff_diff/staggered.py:L491-L505. Concrete fix: shift the 1-indexed contract so generated cohorts always have at least one pre-period; if last-period adoption is intended, allow g=n_periods, and if short panels cannot support multiple estimable cohorts, raise a direct early error instead of defaulting to g=1.

Performance

No findings in PR scope.

Maintainability

No findings in PR scope.

Tech Debt

No separate findings. The P1 above is not tracked under TODO.md:L47-L71, so it is not mitigated by existing deferral.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the supporting docs/tests still encode the wrong cohort contract. The helper docstring still says the default is [3, 5] at diff_diff/prep_dgp.py:L1171-L1172, test_cohort_period_out_of_range treats g=n_periods as invalid at tests/test_prep.py:L1356-L1366, and test_default_cohort_periods_small_n_periods only checks numeric range/row-count invariants at tests/test_prep.py:L1389-L1401. Concrete fix: after correcting the code, update the docstring and rewrite these regressions to reject g=1, allow any intended last-period cohort behavior explicitly, and assert that default cohorts all have at least one pre-treatment observation.

Path to Approval

  1. Fix generate_survey_did_data()’s 1-indexed cohort contract so defaults never produce first_treat=1. Either allow last-period adoption and validate accordingly, or raise an explicit error for unsupported short-panel defaults.
  2. Replace the current small-panel regression with an estimability check: for default n_periods=3/4/5, assert that every positive generated cohort has a pre-period and survives a downstream CallawaySantAnna(...).fit(...).group_time_effects pass.
  3. Update the helper docstring and the out-of-range/default-cohort tests to match the corrected contract.

…lity

- Default cohort_periods now requires g >= 2 so every cohort has at
  least one pre-treatment period (needed for CallawaySantAnna). For
  n_periods < 4, raises ValueError instead of producing inestimable
  cohorts (P1)
- Allow g = n_periods (last-period adoption has base period g-1)
- Reject g = 1 (no pre-period exists)
- Update docstring to reflect derived defaults and g >= 2 contract
- Strengthen tests: assert all cohorts have pre-periods, test
  n_periods=3 rejection, test g=n_periods is valid

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 05fb867651d57d2a84f4f72263a3e08ec7622621


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review: the earlier generate_survey_did_data() cohort-contract defect is fixed. Defaults now enforce g >= 2, allow g = n_periods, and reject unsupported default small panels in diff_diff/prep_dgp.py:1223 and tests/test_prep.py:1356.
  • [Newly identified] The new survey tutorial still plots “survey-aware” event-study intervals as 1.96 * se, even though the notebook text and the library’s inference path use survey-d.f. t critical values. That makes the displayed survey intervals too narrow whenever df_survey is finite.
  • Tutorial discoverability is now fixed via the README, quickstart, and choosing-estimator links.
  • Non-blocking: the added small-panel regression still checks cohort bounds only, not end-to-end estimability through CallawaySantAnna.

Methodology

Estimator code paths are unchanged in this PR scope, and the prior DGP boundary issue is resolved.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The P1 above is not mitigated by an existing TODO.md entry.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new small-panel regression at tests/test_prep.py:1392 only asserts 2 <= g <= n_periods; it still does not exercise a downstream CallawaySantAnna fit, so a future boundary regression could slip back in without breaking this test. Concrete fix: extend the test to fit CallawaySantAnna(control_group='never_treated') on the default n_periods=4..7 outputs and assert that each treated cohort contributes at least one group_time_effects entry.
  • I could not run pytest or execute the notebook here because numpy is not installed in this workspace.

Path to Approval

  1. Replace the two hard-coded 1.96 * se event-study plot calculations in docs/tutorials/16_survey_did.ipynb:134 and docs/tutorials/16_survey_did.ipynb:354 with intervals derived from each result object’s conf_int.
  2. Re-execute the notebook after that change and confirm the plotted survey-aware intervals match the estimator’s reported inference.

… plots

Both event study plots now use the returned conf_int (which respects
survey df t-critical values) instead of hardcoded z=1.96, ensuring
plotted intervals match the library's inference output (P1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 4d7e169df8782cf7d03d55a6fbfef22a5d265b3c


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review: the prior notebook inference defect is fixed. Both event-study plotting cells now use each result’s returned conf_int rather than hard-coded 1.96 * se in docs/tutorials/16_survey_did.ipynb:L115-L119 and docs/tutorials/16_survey_did.ipynb:L305-L309.
  • No estimator math changed in this diff, and the tutorial’s survey-variance narrative is broadly aligned with the local methodology registry and survey implementation.
  • [Newly identified] The new public generate_survey_did_data() helper does not validate several basic parameter contracts, so invalid inputs can yield malformed output or internally impossible survey designs instead of clear ValueErrors. That is the blocking issue.
  • Non-blocking: the new small-panel regression test still checks cohort bounds only and does not actually fit CallawaySantAnna, so the exact end-to-end estimability contract from the earlier bug is still not locked in.
  • I could not run pytest or execute the notebook here because numpy, pandas, and pytest are not installed in this workspace.

Methodology

  • Severity P1 [Newly identified]. Impact: generate_survey_did_data() is a new public survey DGP, but it only validates cohort_periods and weight_variation before constructing survey structure and treatment assignment in diff_diff/prep_dgp.py:L1223-L1296. That leaves several silent or late-failing bad states unguarded: n_units=0 produces an empty DataFrame with none of the documented columns at diff_diff/prep_dgp.py:L1375-L1395; out-of-range never_treated_frac values feed directly into slice arithmetic at diff_diff/prep_dgp.py:L1283-L1296 and can silently generate the wrong cohort structure; and impossible FPC values are emitted unchanged in the fpc column even though the library’s own survey resolver requires FPC >= n_PSU in diff_diff/survey.py:L326-L360. Concrete fix: add upfront validation for positive count parameters, 0 <= never_treated_frac <= 1, and fpc_per_stratum >= the populated PSU count in each stratum, and raise explicit ValueErrors before any records are built.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The blocking validation gap above is not currently tracked in TODO.md, so it remains unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_default_cohort_periods_small_n_periods() still only asserts that default cohorts stay within bounds, and never verifies that those defaults remain estimable through the downstream staggered estimator in tests/test_prep.py:L1392-L1412. A future regression could reintroduce a small-panel usability break without failing this test as long as the cohorts remain “in range.” Concrete fix: extend that test to fit CallawaySantAnna(control_group='never_treated') on the default n_periods=4..7 outputs and assert that each treated cohort contributes at least one group-time effect.

Path to Approval

  1. Add explicit parameter validation in generate_survey_did_data() for n_units > 0, n_periods > 0, n_strata > 0, psu_per_stratum > 0, 0 <= never_treated_frac <= 1, and FPC coherence with populated PSU counts per stratum.
  2. Add regression tests for invalid inputs such as n_units=0, never_treated_frac=-0.1, never_treated_frac=1.1, and fpc_per_stratum smaller than the populated PSU count, asserting clean ValueErrors instead of malformed output or late failures.
  3. Strengthen the small-panel default-cohort test to run an actual CallawaySantAnna(control_group='never_treated') fit for n_periods=4..7 and assert non-empty downstream effects.

Validate n_units, n_periods, n_strata, psu_per_stratum > 0,
never_treated_frac in [0, 1], and fpc_per_stratum >= psu_per_stratum
before building any records. Prevents malformed output or late
failures from invalid inputs (P1).

Add negative tests for all boundary cases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: a5f1a3e105ae3867338925061e73e7582fcf5ad5


Overall Assessment

Looks good

Executive Summary

  • Re-review: the prior blocking validation issue in generate_survey_did_data() is resolved. The new helper now rejects zero/invalid count parameters, out-of-range never_treated_frac, invalid cohort periods, and bad JK1/FPC setups in diff_diff/prep_dgp.py:L1223-L1404, with matching regression tests in tests/test_prep.py:L1303-L1432.
  • No estimator formulas, weighting rules, variance estimators, or defaults changed in this PR; the methodology-sensitive tutorial content is broadly aligned with the survey registry and current implementation notes.
  • I did not find a new P0 or P1 issue in the DGP helper, exports, or documentation updates.
  • One non-blocking test gap remains: the small-panel default-cohort regression test still checks bounds only, not actual downstream estimability through CallawaySantAnna in tests/test_prep.py:L1392-L1404.
  • I could not execute pytest or the notebook here because numpy, pandas, and pytest are not installed in this workspace.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_default_cohort_periods_small_n_periods() only verifies that derived cohorts satisfy 2 <= g <= n_periods, while the generator’s default-selection logic is explicitly justified in terms of Callaway-Sant’Anna estimability in diff_diff/prep_dgp.py:L1242-L1254. A future regression could preserve those bounds while still producing no usable downstream effects on small panels. Concrete fix: extend tests/test_prep.py:L1392-L1404 to fit CallawaySantAnna(control_group="never_treated") on the n_periods=4..7 outputs and assert at least one finite ATT(g,t) or non-empty aggregation result per treated cohort.

@igerber igerber merged commit 6e3628f into main Apr 1, 2026
14 checks passed
@igerber igerber deleted the survey-tutorial branch April 1, 2026 20:18
@igerber igerber mentioned this pull request Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant