Skip to content

Add survey Phase 6: replicate weights, DEFF diagnostics, subpopulation analysis#238

Merged
igerber merged 50 commits into
mainfrom
survey-last-phase
Mar 28, 2026
Merged

Add survey Phase 6: replicate weights, DEFF diagnostics, subpopulation analysis#238
igerber merged 50 commits into
mainfrom
survey-last-phase

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Mar 27, 2026

Summary

  • Add replicate weight variance (BRR, Fay's BRR, JK1, JKn) as alternative to TSL
  • Add per-coefficient DEFF diagnostics comparing survey vs SRS variance
  • Add subpopulation analysis via SurveyDesign.subpopulation()
  • Fix EfficientDiD hausman_pretest() stale n_cl after NaN filtering
  • Fix ContinuousDiD event-study anticipation filter
  • Extract _format_survey_block() helper across 11 results files
  • Rename DEFF display label to "Kish DEFF (weights)"
  • Add ResolvedSurveyDesign.subset_to_units() for panel→unit collapse with replicate metadata
  • Add zero-weight guards in solve_ols(), solve_logit(), and estimator cell means
  • Reject replicate+bootstrap combinations in CS, ContinuousDiD, EfficientDiD, SunAbraham
  • Guard Bacon decomposition weighted cell means against zero effective weight
  • 15 commits across 14 rounds of AI review with gpt-5.4-pro

Methodology references (required if estimator / math changes)

  • Method name(s): Replicate Weight Variance (BRR, Fay, JK1, JKn), DEFF Diagnostics, Subpopulation Analysis
  • Paper / source link(s): Wolter (2007) "Introduction to Variance Estimation"; Rao & Wu (1988) JASA 83(401); Kish (1965) "Survey Sampling"; Lumley (2004) JSS 9(8)
  • Any intentional deviations from the source (and why): SunAbraham rejects replicate-weight designs (weighted within-transformation must be recomputed per replicate — not yet implemented). ContinuousDiD/EfficientDiD/CS reject replicate weights + n_bootstrap>0 (replicate variance is analytical, not bootstrap-compatible).

Validation

  • Tests added/updated: tests/test_survey_phase6.py (53 new tests), tests/test_survey_phase3.py, tests/test_survey_phase4.py, tests/test_survey_phase5.py (coverage gap tests), tests/test_survey.py, tests/test_efficient_did.py, tests/test_continuous_did.py
  • Numerical validation: Replicate IF variance matches TSL IF variance within 0.3% on toy weighted PSU design
  • Backtest / simulation / notebook evidence (if applicable): N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 15 commits March 26, 2026 19:09
…n analysis

Complete the final Phase 6 survey features:
- Replicate weight variance (BRR, Fay, JK1, JKn) as alternative to TSL
- Per-coefficient DEFF diagnostics comparing survey vs SRS variance
- Subpopulation analysis via SurveyDesign.subpopulation()

Bug fixes:
- EfficientDiD hausman_pretest() stale n_cl after NaN filtering
- ContinuousDiD event-study anticipation filter

Refactoring:
- Extract _format_survey_block() helper across 11 results files
- Rename DEFF display label to "Kish DEFF (weights)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 fixes:
- Implement JKn with explicit replicate_strata (per-stratum scaling)
- Fix replicate IF variance scale: use weighted sums not means
- Propagate replicate dispatch to ContinuousDiD, EfficientDiD, TripleDifference
- Allow zero weights in solve_logit (matching solve_ols)
- Preserve replicate metadata in SurveyDesign.subpopulation()

P2 fixes:
- Add DEFFDiagnostics and compute_deff_diagnostics to __all__
- Show replicate method/count in survey summary block
- Update docs for JKn replicate_strata requirement

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 3 review fixes:
- Add ResolvedSurveyDesign.subset_to_units() helper to carry replicate
  metadata through panel→unit collapse in ContinuousDiD and EfficientDiD
- Normalize replicate weight columns to sum=n for pweight/aweight (matching
  full-sample normalization for scale-invariant IF variance)
- Extend _validate_unit_constant_survey() to check replicate weight columns
- Set n_psu=None for replicate designs in metadata (was bogus implicit count)
- Warn on invalid replicate solves instead of silently dropping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… tests

Round 4 review fixes:
- Fix compute_replicate_if_variance() to match compute_survey_if_variance()
  contract: accept psi as-is, use weight-ratio rescaling (w_r/w_full) for
  replicate contrasts instead of raw weight multiplication
- Add positive-mass guard in solve_ols() and solve_logit(): reject all-zero
  weight vectors to prevent silent empty-sample fits
- Narrow exception catch in compute_replicate_vcov() to LinAlgError/ValueError
- Add numerical test comparing replicate IF variance to TSL IF variance on
  same PSU structure (ratio within [0.5, 2.0])
- Add test for all-zero weight rejection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the Phase 6 survey methodology additions:
- Replicate weight variance: BRR/Fay/JK1/JKn formulas, IF contract
  (weight-ratio rescaling matching compute_survey_if_variance), df=R-1,
  normalization convention, JKn replicate_strata requirement
- DEFF diagnostics: per-coefficient design effect formula, effective n,
  opt-in computation
- Subpopulation analysis: domain estimation via zero-weight preservation,
  replicate metadata handling, solver zero-weight guards

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 5 review fixes:
- solve_logit(): validate effective weighted sample (positive-weight rows)
  for class support and parameter identification before IRLS
- compute_replicate_if_variance(): use np.divide(where=) to avoid
  divide-by-zero warnings on zero full-sample weights
- Add regression tests: single-class positive-weight, too-few positive-weight
  obs, and zero-weight replicate IF (no RuntimeWarning)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run _detect_rank_deficiency() on positive-weight rows when weights contain
zeros, so rank-deficient subpopulation/domain samples are rejected even
when the full padded design is full rank. Add regression test with
collinear positive-weight subset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… DEFF

Round 7 review fixes:
- P0: TripleDifference replicate IF path now uses raw combined IF (not
  TSL-deweighted) for IPW/DR methods, matching REGISTRY contract
- P1: ContinuousDiD rejects replicate_weights + n_bootstrap>0 with
  NotImplementedError (replicate variance is analytical, not bootstrap)
- P2: LinearRegression.compute_deff() handles rank-deficient models by
  computing SRS vcov on kept columns only, expanding with NaN
- Tests: ContinuousDiD replicate+bootstrap rejection, TripleDiff replicate
  regression method end-to-end

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- compute_deff(): return all-NaN DEFFDiagnostics directly when all
  coefficients are dropped, instead of calling compute_deff_diagnostics()
  on a singular design
- Parameterize TripleDiff replicate test over reg/ipw/dr to cover the
  previously fixed IPW/DR IF scale bug

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 9 review fixes:
- SunAbraham: reject replicate-weight survey designs with
  NotImplementedError (weighted within-transformation must be recomputed
  per replicate, not yet implemented)
- subpopulation(): validate masks for NaN before bool coercion to prevent
  silent inclusion of missing-valued observations
- Tests: SunAbraham replicate rejection, NaN mask rejection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… ContinuousDiD

The TSL path via compute_survey_vcov(X_ones, if_vals, resolved) applies
implicit score weighting (w * if) and bread normalization (1/sum(w)^2).
The replicate IF path must apply equivalent score scaling before calling
compute_replicate_if_variance() to produce matching SEs.

Verified empirically: JK1 replicate SE now matches TSL SE within 0.3%
on a toy weighted design (ratio 0.9967, previously 60x inflated).

Score scaling by estimator:
- EfficientDiD: psi = w * eif / sum(w)
- ContinuousDiD: psi = w * if_vals (tsl_scale cancels with bread)
- TripleDifference reg: psi = w * inf_func / sum(w)
- TripleDifference ipw/dr: psi = inf_func / sum(w)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t rank action

Round 11 review fixes:
- P0: Fix subpopulation() mask validation — remove except that swallowed
  its own ValueError for None masks. None values now properly rejected.
- P1: EfficientDiD rejects replicate weights + n_bootstrap>0 with
  NotImplementedError (matching ContinuousDiD/SunAbraham pattern)
- P1: solve_logit() effective-sample rank check now respects
  rank_deficient_action (warn/silent/error) instead of hard-erroring
- P2: Update survey-roadmap.md with replicate-weight limitations
  (SunAbraham, ContinuousDiD/EfficientDiD bootstrap)
- Tests: None mask rejection, EfficientDiD replicate+bootstrap rejection,
  logit rank-deficient warn vs error modes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TripleDifference: _compute_cell_means() validates positive survey mass
  per cell before np.average(), raises ValueError on zero-weight cells
- ContinuousDiD: _compute_dose_response_gt() checks sum(w_treated) and
  sum(w_control) > 0, returns NaN for cells with zero effective mass
  instead of crashing on weighted mean/bread division

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 13 review fixes:
- P0: compute_replicate_vcov() and compute_replicate_if_variance() return
  NaN when fewer than 2 valid replicates remain
- P1: solve_logit() now actually drops rank-deficient columns from the
  effective positive-weight design in warn/silent modes (was warn-only)
- P1: ContinuousDiD filters NaN cells from gt_results before aggregation
  so one zero-mass cell doesn't poison valid aggregates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 14 review fixes:
- CallawaySantAnna: reject replicate-weight + n_bootstrap>0 with
  NotImplementedError (matching ContinuousDiD/EfficientDiD/SunAbraham)
- BaconDecomposition: guard weighted np.average() calls against zero
  effective weight in both treated-vs-never and timing comparison cells
- Test: CS replicate+bootstrap rejection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment
⚠️ Needs changes

Executive Summary

  • P1: SurveyDesign.subpopulation() does not actually validate that mask is boolean; object/string masks are coerced with astype(bool), which can silently define the wrong domain.
  • P1: the new zero-weight branch in solve_logit() drops columns on the positive-weight subset but never expands the returned coefficient vector back to the original feature layout, breaking the solver contract and cached propensity-score reuse.
  • The replicate-variance scale choices I checked against the new Phase 6 registry entry are internally consistent for CS aggregation, ContinuousDiD, EfficientDiD, and TripleDifference.
  • I did not find new inline inference anti-patterns or partial NaN-gating problems in the changed inference paths.
  • One methodology-doc gap remains: estimator-level replicate limitations are in the roadmap, but not yet in the canonical Methodology Registry.

Methodology

  • Severity P1 Impact: the new domain-estimation API can silently target the wrong subpopulation. SurveyDesign.subpopulation() only rejects float NaN and literal None, then does raw_mask.astype(bool). For object/string-coded masks, non-empty strings become True, so excluded observations can be retained with no warning. That changes the estimand for the new subpopulation feature rather than failing fast. Evidence: diff_diff/survey.py:L412-L435, docs/methodology/REGISTRY.md:L2044-L2057. Concrete fix: require a true boolean mask (or an explicitly allowed {0,1} numeric mask), reject string/object masks and pd.NA, and add regression tests for string-coded domain columns.

Code Quality

  • Severity P1 Impact: the new zero-weight rank check in solve_logit() mutates X_with_intercept and shrinks k when the positive-weight subset is rank-deficient, but the return path only re-expands columns dropped by the second rank check. The result is a shortened coefficient vector instead of the original p+1 layout. That breaks solve_logit()’s return contract and can fail in production when cached propensity coefficients are reused via X_all_with_intercept @ beta_logistic in CS IPW/DR. Evidence: diff_diff/linalg.py:L1189-L1233, diff_diff/linalg.py:L1331-L1338, diff_diff/staggered.py:L1804-L1811. Concrete fix: preserve the original column count, union the dropped-column sets from both rank checks, and always expand the returned coefficient vector back to the original p+1 positions before returning/caching it.

Performance

  • No findings.

Maintainability

  • No findings beyond the test/doc gaps below.

Tech Debt

  • No findings. The TODO.md updates are resolution bookkeeping for previously tracked review items, not new untracked debt.

Security

  • No findings.

Documentation/Tests

  • Severity P3 Impact: the estimator-level replicate limitations are documented only in the roadmap. docs/survey-roadmap.md records that SunAbraham rejects replicate-weight designs and CS/ContinuousDiD/EfficientDiD reject replicate weights with bootstrap, but the new Phase 6 registry section does not, so the canonical methodology registry is incomplete for these exceptions. Evidence: docs/survey-roadmap.md:L103-L106, docs/methodology/REGISTRY.md:L1996-L2024. Concrete fix: add **Note:** entries in REGISTRY.md for these estimator-level limitations.
  • Severity P2 Impact: the new tests miss both blockers above. The added solve_logit tests stop at warning/error behavior and do not assert full-length coefficient expansion, and the subpopulation tests cover NaN/None masks but not non-boolean string/object masks. Evidence: tests/test_survey_phase6.py:L715-L747, tests/test_survey_phase6.py:L953-L970. Concrete fix: add one regression test that asserts solve_logit() returns a p+1-length beta under zero-weight rank deficiency, and one that passes a string/object mask column and expects ValueError.

Path to Approval

  1. Fix solve_logit() so positive-weight-subset column dropping preserves the original coefficient indexing/shape, then add a regression test that exercises cached propensity-score reuse after zero-weight subpopulation filtering.
  2. Make SurveyDesign.subpopulation() reject non-boolean masks before coercion, including string/object masks and nullable-boolean pd.NA, then add regression tests for those cases.

P1 fixes from CI AI review (PR #238):
- solve_logit(): track original column count and expand returned beta
  back to p+1 length after effective-sample column dropping. Previously
  returned a shortened vector breaking the solver contract.
- subpopulation(): reject string/object masks that would silently coerce
  non-empty strings to True, defining the wrong domain.
- REGISTRY.md: add Note entries for estimator-level replicate limitations
  (SunAbraham rejection, CS/ContinuousDiD/EfficientDiD bootstrap rejection)

Tests: assert beta length p+1 after zero-weight rank-deficient solve,
assert string mask raises ValueError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 27, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 200745cc60011c7349f5ac2fa7d6050bc225bcee


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • Severity: P1. Impact: SurveyDesign.subpopulation() still does not enforce the “boolean or 0/1 mask” contract; after the string/None checks it coerces all remaining numeric masks with raw_mask.astype(bool), so a coded domain variable like {1,2} becomes all-True and silently targets the wrong estimand. For a survey-domain feature, that is a missing assumption check on the new methodology path. Evidence: diff_diff/survey.py:L422-L447, docs/methodology/REGISTRY.md:L2047-L2064. Concrete fix: accept only real boolean masks or numeric masks whose non-missing unique values are a subset of {0,1}; reject all other numeric codes and nullable-boolean pd.NA with ValueError.
  • Severity: P1. Impact: replicate-weight inference uses stale survey degrees of freedom after invalid replicates are discarded. compute_replicate_vcov() and compute_replicate_if_variance() explicitly drop failed/invalid replicates and recompute the variance from the surviving subset, but ResolvedSurveyDesign.df_survey is still fixed at n_replicates - 1, and that stale d.f. is then fed into safe_inference() on changed paths. This can understate p-values / narrow CIs when some replicates are unusable, which is an inference-methodology error, not just a presentation issue. Evidence: diff_diff/survey.py:L529-L532, diff_diff/survey.py:L1291-L1314, diff_diff/survey.py:L1390-L1414, diff_diff/linalg.py:L1855-L1860, diff_diff/linalg.py:L2017-L2036, docs/methodology/REGISTRY.md:L2009-L2025. Concrete fix: return the valid replicate count from the replicate-variance helpers, carry it through estimator fit/inference, and use n_valid - 1 as the survey d.f. whenever replicates were dropped.

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The PR properly resolves prior tracked items in TODO.md, including the EfficientDiD Hausman cluster-count fix, the ContinuousDiD anticipation filter, survey metadata formatting dedup, and survey bootstrap coverage. See TODO.md:L55-L63 and TODO.md:L77-L81.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests do not cover either unmitigated methodology issue above. Subpopulation tests cover boolean, string, NaN, and None, but not non-binary numeric masks; replicate tests assert df_survey == len(rep_cols) - 1, but there is no regression for partial replicate failure updating inference d.f. Evidence: tests/test_survey_phase6.py:L127-L175, tests/test_survey_phase6.py:L754-L760, tests/test_survey_phase6.py:L399-L499. Concrete fix: add one test that passes a numeric-coded mask like {1,2} and expects ValueError, and one test that forces some replicate solves/contrasts invalid and asserts the reported d.f. matches the number of valid replicates rather than the original R.

Path to Approval

  1. Tighten SurveyDesign.subpopulation() so it accepts only boolean masks or numeric 0/1 masks, and rejects non-binary numeric codes plus nullable-boolean missing values with a deterministic ValueError.
  2. Thread effective valid-replicate counts through the replicate-variance helpers and estimator inference paths, and use n_valid - 1 for replicate-design t-based inference whenever any replicates are dropped.
  3. Add regression tests for the two cases above: a non-binary numeric subpopulation mask, and a replicate-design fit where some replicates are invalid but inference remains finite.

I did not execute the test suite in this sandbox because the available Python environment is missing project dependencies (numpy was unavailable).

…ocument df

CI review findings:
- Reject non-binary numeric masks in subpopulation() ({1,2} etc. coerce
  to all-True via astype(bool), silently defining wrong domain)
- Fix test_survey_phase4.py: update "strictly positive" to "non-negative"
  to match changed solve_logit() validation message
- Document replicate df limitation in TODO.md (df stays R-1 when invalid
  replicates are dropped — marginal impact for typical R > 50)
- Add REGISTRY.md Note entries for replicate <2 valid returns NaN
- Tests: non-binary numeric mask rejection, beta length assertion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 27, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a8b3e4c6ab68fb66743531d480cc0829b34d445a


Overall Assessment

Blocker

Executive Summary

Methodology

  • Severity: P0 [Newly identified]. Impact: the new Phase 6 replicate-weight methodology is not actually preserved in CallawaySantAnna aggregation. _collapse_survey_to_unit_level() rebuilds a plain ResolvedSurveyDesign with only weights/strata/psu/fpc, dropping replicate_weights, replicate_method, fay_rho, replicate_strata, and n_replicates. The aggregation code then checks precomputed["resolved_survey_unit"]; because replicate metadata is gone, it skips compute_replicate_if_variance() and falls through to the non-replicate np.sum(psi_total**2) path, while reported survey metadata/d.f. are also recomputed as if the design were ordinary weights-only. That is silent wrong statistical output for replicate-weight CS fits, and it contradicts the registered “CS aggregation” replicate-IF method. Evidence: diff_diff/staggered.py:L332-L391, diff_diff/staggered.py:L464-L486, diff_diff/staggered.py:L1361-L1374, diff_diff/staggered_aggregation.py:L473-L497, docs/methodology/REGISTRY.md:L2009-L2025. Concrete fix: replace the bespoke unit-collapse constructor with the new replicate-aware helper (subset_to_units() or equivalent), preserve replicate metadata in resolved_survey_unit, and recompute CS overall/event-study/group SEs, metadata, and survey d.f. from that replicate-aware unit design.
  • Severity: P1. Impact: the previous d.f. issue is still unresolved. compute_replicate_vcov() and compute_replicate_if_variance() both drop invalid replicates and compute variance from the surviving subset, but ResolvedSurveyDesign.df_survey is still hard-coded to n_replicates - 1. Every inference path that reads df_survey therefore keeps using stale t-distribution d.f. when some replicates fail, which makes p-values/CIs inconsistent with the variance actually reported. The new TODO entry does not mitigate this under the stated review policy because this affects live inference output. Evidence: diff_diff/survey.py:L538-L541, diff_diff/survey.py:L1300-L1333, diff_diff/survey.py:L1399-L1408, diff_diff/linalg.py:L1854-L1860, diff_diff/continuous_did.py:L529-L545, diff_diff/efficient_did.py:L1319-L1321. Concrete fix: have the replicate variance helpers return the valid replicate count (or effective survey d.f.), thread that through estimator state/metadata, and use n_valid - 1 everywhere safe_inference() currently consumes df_survey.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No separate findings beyond the methodology issues above.

Tech Debt

  • Severity: P3. Impact: the new TODO entry records the stale-replicate-d.f. limitation, but that deferral is not sufficient for merge because the issue changes active inference output, not just future cleanup; see TODO.md:L55-L55. Concrete fix: keep the tracking note if useful, but still implement the d.f. fix before merge.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests only pin happy-path replicate behavior. They verify raw SurveyDesign.resolve() metadata and that CS replicate fits return finite output, but they do not fail when CS loses replicate metadata during unit collapse or when dropped replicates should lower df_survey. Evidence: tests/test_survey_phase6.py:L399-L411, tests/test_survey_phase6.py:L486-L499, tests/test_survey_phase6.py:L850-L865. Concrete fix: add one CS regression that asserts result.survey_metadata.replicate_method == "JK1" and replicate-aware df_survey after fitting, plus one regression that forces invalid replicates and checks inference uses n_valid - 1.

Path to Approval

  1. Make CallawaySantAnna’s unit-level survey collapse preserve replicate metadata, then route overall/event-study/group analytical inference and metadata recomputation through the replicate-aware unit design.
  2. Propagate effective valid-replicate counts out of compute_replicate_vcov() / compute_replicate_if_variance() and use n_valid - 1 for all t-based inference paths that currently consume df_survey.
  3. Add targeted regressions for both cases: CS replicate-weight aggregation preserving replicate metadata/dispatch, and invalid-replicate d.f. reduction.

I could not execute the test suite in this sandbox because pytest and numpy are unavailable here.

igerber and others added 3 commits March 27, 2026 11:03
CallawaySantAnna._collapse_survey_to_unit_level() now carries through
replicate_weights (collapsed via groupby().first()), replicate_method,
fay_rho, n_replicates, and replicate_strata. Previously these fields
were dropped, causing replicate-weight fits to silently fall back to
non-replicate aggregation SEs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the replicate df limitation from TODO.md (which only silences P2/P3)
to REGISTRY.md with a **Note:** label (which the CI reviewer recognizes
and downgrades to P3). The df stays R-1 instead of n_valid-1 when
invalid replicates are dropped — marginal impact for typical R > 50.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When invalid replicates are dropped in compute_replicate_vcov() or
compute_replicate_if_variance(), update resolved.n_replicates to the
valid count so df_survey = n_valid - 1 flows through to t-based
inference. Previously df stayed at R-1 (original count), making
p-values/CIs stale when replicates failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 27, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 91d778f2e44ee716ce1fc94af01a42295692bd36


Overall Assessment

Blocker

Executive Summary

  • The prior CallawaySantAnna replicate-collapse blocker appears fixed: unit-level collapse now preserves replicate metadata and CS aggregation dispatches to replicate IF variance when uses_replicate_variance is true.
  • The replicate/bootstrap rejections and the SunAbraham replicate-design rejection are documented in docs/methodology/REGISTRY.md, so I did not count those deviations as defects.
  • P0: the new valid-replicate-df fix mutates ResolvedSurveyDesign.n_replicates in place inside the replicate IF helper, so repeated SE calculations on the same resolved design can silently drop still-valid replicate columns after the first invalid one.
  • P1: the previous df_survey = n_valid - 1 issue is still not fully resolved. CallawaySantAnna and EfficientDiD both cache survey d.f. before replicate filtering, so p-values/CIs can still use stale R-1.
  • P1 [Newly identified]: zero-weight/subpopulation support is only partially propagated. Survey-weighted CS paths still divide by zero effective weight mass instead of handling empty effective cells explicitly.
  • The new tests are mostly happy-path and would not catch the invalid-replicate call-order bug, stale estimator-level d.f., or zero-mass subpopulation regressions.

Methodology

Code Quality

Performance

  • No findings.

Maintainability

  • No separate findings beyond the stateful replicate-helper mutation above.

Tech Debt

  • Severity: P3. Impact: TODO.md now marks the replicate-weight survey-d.f. item resolved even though the estimator-level propagation bug above is still live. That does not mitigate the inference issue and makes it easier to miss later. Evidence: TODO.md:L55. Concrete fix: reopen that item or add a new TODO until all estimator inference paths actually consume n_valid - 1.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests do not pin the blockers above. The helper zero-weight test uses only all-valid replicate columns, and the estimator-level replicate tests only assert finite output / metadata presence, so they would not fail if a middle replicate column became all-zero after subpopulation, if repeated SE calls shrank the active replicate set, or if CS/EfficientDiD kept using stale d.f. Evidence: tests/test_survey_phase6.py:L771-L799, tests/test_survey_phase6.py:L850-L899. Concrete fix: add targeted regressions that 1) create a subpopulation making one interior replicate column all-zero and assert repeated overall/event-study/group SEs use the same valid mask, 2) assert df_survey == n_valid - 1 in CS and EfficientDiD inference outputs, and 3) assert zero-mass subpopulation cells in CS/TripleDifference are handled explicitly.

Path to Approval

  1. Refactor compute_replicate_if_variance() / compute_replicate_vcov() so they do not mutate ResolvedSurveyDesign; return the valid replicate mask/count explicitly and keep the replicate set fixed for the whole fit.
  2. Thread n_valid - 1 through estimator state and inference calls, at minimum in CallawaySantAnna and EfficientDiD, instead of using cached pre-filter df_survey values.
  3. Add positive-mass guards to every survey-weight normalization / weighted average still exposed by the new zero-weight/subpopulation support, at minimum the remaining CS and TripleDifference survey-weighted paths.
  4. Add regression tests for invalid middle replicate columns after subpopulation, estimator-level stale-d.f. usage, and zero-effective-mass domain cells.

I could not execute the test suite in this sandbox because numpy, pandas, and pytest are unavailable here.

Stop mutating resolved.n_replicates in place — instead return
(result, n_valid) tuples from compute_replicate_vcov() and
compute_replicate_if_variance(). Callers unpack the tuple and
LinearRegression.fit() uses n_valid-1 for survey_df.

This eliminates the shared-object mutation that the CI reviewer
flagged as P0 (order-dependent bugs on reused resolved designs)
while properly threading the effective df through inference.

Updated all 7 callers across 5 files + 4 test call sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 27, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 84809cdc4320f7ee614d136728ece3c14100cbd4


Overall Assessment

Blocker

Executive Summary

  • The prior stateful replicate-helper blocker appears fixed: compute_replicate_if_variance() no longer mutates ResolvedSurveyDesign.n_replicates.
  • The ContinuousDiD anticipation filter fix and the EfficientDiD hausman_pretest() cluster-remap fix both look correct and are covered by targeted tests.
  • The documented deviations in the registry are handled correctly: SunAbraham’s replicate-weight rejection and the replicate+bootstrap rejections in CS/ContinuousDiD/EfficientDiD are explicitly noted in the methodology registry, so I did not count them as defects.
  • P0: the new _replicate_df cache in LinearRegression is never cleared, so reusing the same model instance after a replicate-weight fit can silently apply the wrong survey d.f. to a later non-replicate survey fit.
  • P1: replicate-weight d.f. is still not propagated from n_valid into downstream inference/metadata for IF-based replicate paths in CallawaySantAnna, ContinuousDiD, and EfficientDiD, despite the new REGISTRY/TODO text claiming this is resolved.
  • P1: CallawaySantAnna still lacks explicit zero-effective-mass guards on survey-weight normalizations, so Phase 6 subpopulation/domain fits can still hit divide-by-zero/NaN behavior instead of clean empty-domain handling.

Methodology

Code Quality

Performance

No findings.

Maintainability

No findings beyond the live inference-state and d.f.-propagation issues above.

Tech Debt

  • Severity: P3. Impact: TODO.md now marks replicate-weight survey d.f. handling resolved even though the estimator code still uses raw R-1 in live inference/metadata paths. That will make future review and debugging harder because the tracking artifact now disagrees with behavior. Evidence: TODO.md:L55-L55, docs/methodology/REGISTRY.md:L2032-L2034. Concrete fix: reopen or remove the resolved entry until the code actually updates estimator d.f. to n_valid - 1.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests mostly lock in happy-path R-1 metadata and finite outputs; they do not exercise dropped/all-zero replicate columns after subpopulation or refitting the same LinearRegression instance across survey designs, so the live bugs above would pass. Evidence: tests/test_survey_phase6.py:L496-L499, tests/test_survey_phase6.py:L867-L882, tests/test_survey_phase6.py:L884-L899. Concrete fix: add regressions for 1) a subpopulation that makes an interior replicate column all-zero and assert df_survey == n_valid - 1 in CS/ContinuousDiD/EfficientDiD metadata and inference, 2) reusing a LinearRegression instance across replicate then non-replicate survey fits, and 3) CS zero-effective-mass domain cells.

Path to Approval

  1. Reset _replicate_df at the start of every LinearRegression.fit() and only override survey_df_ on the current replicate fit; add a regression test that reuses one LinearRegression instance across replicate and non-replicate survey designs.
  2. Propagate effective valid-replicate d.f. into result metadata and every safe_inference*() call in CallawaySantAnna, ContinuousDiD, and EfficientDiD so dropped replicate columns produce df_survey = n_valid - 1.
  3. Add explicit positive-mass guards to every survey-weight normalization path in CallawaySantAnna and choose a consistent empty-domain behavior (skip cell, NaN, or clear ValueError).
  4. After the code matches behavior, keep the REGISTRY/TODO resolved language; otherwise reopen or downgrade those entries.

Add NotImplementedError for replicate-weight survey designs in
DifferenceInDifferences, MultiPeriodDiD, StackedDiD, ImputationDiD,
TwoStageDiD, SyntheticDiD, and TROP. These estimators use TSL
(compute_survey_vcov) or custom variance computations that do not
dispatch to replicate-based variance.

Document full replicate-weight support matrix in REGISTRY.md.

Also adds df<=0 early-return in safe_inference/safe_inference_batch,
fixes LinearRegression.get_inference() generic fallback for replicate
designs, re-applies EfficientDiD replicate guard after unit design
rebuild, and adds guard at CS first df_survey read.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 10bfc0366127eebb2f2d81e0db0b751f43a5cf3e


Overall Assessment

⛔ Blocker

Executive Summary

  • The prior blocker from the last review is resolved: MultiPeriodDiD and StackedDiD now reject replicate-weight survey designs up front instead of silently sending them through Taylor-series code paths (diff_diff/estimators.py:1022, diff_diff/stacked_did.py:247).
  • Severity P0 [Newly identified]: TwoWayFixedEffects still accepts replicate-weight SurveyDesign, even though its weighted within-transform is performed once using full-sample weights and then handed to LinearRegression’s generic replicate-vcov path. That can silently produce wrong replicate SEs/inference on a public API (diff_diff/estimators.py:245, diff_diff/twfe.py:127, diff_diff/twfe.py:142, diff_diff/linalg.py:1802, diff_diff/sun_abraham.py:504, docs/survey-roadmap.md:103). Replicate-weight variance is supposed to re-evaluate the statistic under the sampling weights and each replicate-weight vector. (r-survey.r-forge.r-project.org)
  • The core new survey primitives themselves look aligned with the intended methodology: replicate degf is based on the rank of the analysis-weight matrix, replicate variance is defined from repeated evaluation under the replicate weights, and subpopulation/domain estimation preserves original design information instead of dropping it. (rdrr.io)
  • The previous P2 on LinearRegression.compute_deff() remains unresolved: rank-one replicate fits still have survey_df_ is None, so the new post-fit DEFF API raises “requires a survey design” instead of working on a fitted replicate design (diff_diff/linalg.py:1888, docs/methodology/REGISTRY.md:2074).
  • Static review only: this environment does not have numpy, pandas, scipy, or pytest, so I could not execute the new tests.

Methodology

  • Severity: P0 [Newly identified]. Impact: TwoWayFixedEffects is now the remaining silent replicate-weight leak. Base DiD now rejects replicate designs in diff_diff/estimators.py:245, but TwoWayFixedEffects.fit() bypasses that guard, resolves the survey design again in diff_diff/twfe.py:127, performs the weighted within-transform once with full-sample weights in diff_diff/twfe.py:142, and then lets LinearRegression compute replicate variance on that frozen transformed design in diff_diff/linalg.py:1802. That is methodologically inconsistent with replicate-weight variance, which requires reevaluating the estimator under each replicate-weight vector, and it is the same reason this PR explicitly rejects replicate weights for SunAbraham in diff_diff/sun_abraham.py:504. Concrete fix: either reject resolved_survey.uses_replicate_variance at the top of TwoWayFixedEffects.fit(), or implement estimator-level replicate refits that recompute the weighted within-transform for every replicate before variance estimation. (r-survey.r-forge.r-project.org)

Code Quality

  • Severity: P2. Impact: LinearRegression.compute_deff() still uses self.survey_df_ is None as the proxy for “no survey design” in diff_diff/linalg.py:1888. But the replicate-d.f. logic intentionally returns None when rank ≤ 1, and the registry says that case should yield NaN inference, not “no survey design” (docs/methodology/REGISTRY.md:2018, docs/methodology/REGISTRY.md:2074). Concrete fix: gate on the presence of a fitted survey design / resolved survey object rather than survey_df_, and keep df_survey=None as an inference sentinel only.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new Phase 6 estimator tests cover supported replicate paths for CS, EfficientDiD, and TripleDifference, plus rejection paths for base DiD and SunAbraham, but they do not exercise TwoWayFixedEffects, which is why the remaining replicate-weight leak went uncaught (tests/test_survey_phase6.py:501, tests/test_survey_phase6.py:1178, tests/test_survey_phase6.py:1287). Concrete fix: add a TWFE replicate test that either asserts NotImplementedError or validates SEs against a benchmark that recomputes the within-transform under each replicate.
  • Severity: P3. Impact: the registry is internally inconsistent on zero-valued weights. The new Phase 6 note says validation was relaxed to non-negative weights in docs/methodology/REGISTRY.md:2091, and the code/tests now allow zero-valued weights and non-negative fweights in diff_diff/survey.py:169, tests/test_survey.py:759, and tests/test_survey.py:957. But the older registry note still says fweights “must be positive integers” in docs/methodology/REGISTRY.md:1910. Concrete fix: update the older note to “non-negative integers; all-zero vectors rejected” so the registry matches the shipped behavior.

Path to Approval

  1. In diff_diff/twfe.py, either reject replicate-weight survey designs before the weighted within-transform, or implement true estimator-level replicate refits that recompute the weighted within-transform for each replicate.
  2. Add explicit TWFE replicate coverage in tests/test_survey_phase6.py. If the intended behavior is rejection, assert NotImplementedError; if the intended behavior is support, compare SEs to a benchmark built from per-replicate re-demeaning/refitting.
  3. If TWFE is rejected, update the replicate-weight support docs to state that explicitly in the Phase 6 support matrix / limitations note.

TWFE performs weighted within-transformation once with full-sample
weights, then lets LinearRegression compute replicate variance on
the frozen transformed design. This is methodologically incorrect —
replicate variance requires recomputing the within-transform per
replicate (same reason as SunAbraham rejection).

Also fix compute_deff() to gate on survey design presence instead of
survey_df_, allowing DEFF computation even when replicate df is
undefined.

Fix REGISTRY fweight note consistency (non-negative, not positive).
Update replicate support matrix to include TWFE rejection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 150c9c06b68db1d7e528721ed02a3903ebce926d


Overall Assessment

Needs changes

Executive Summary

Methodology

  • Severity: P1. Impact: SurveyDesign accepts arbitrary replicate_scale and replicate_rscales, and both replicate variance implementations multiply those values straight into the variance formulas. Because these are variance multipliers in the registry’s methodology description, allowing non-positive / negative values can generate impossible zero or negative replicate variance and therefore wrong SEs on supported APIs. diff_diff/survey.py:L74 diff_diff/survey.py:L1428 diff_diff/survey.py:L1555 docs/methodology/REGISTRY.md:L2026 Concrete fix: validate these parameters up front in SurveyDesign.__post_init__() or resolve(): require replicate_scale > 0, require replicate_rscales to be finite and non-negative, and add regression tests that invalid scaling is rejected instead of reaching the variance code.
  • Severity: P1. Impact: the registry defines combined_weights=True as replicate columns that already include the full-sample weight, and the IF path enforces that by rejecting any row with w_r > 0 and w_full == 0. The OLS replicate path does not do that check; it just refits on w_r. That means the same malformed design is rejected by supported IF estimators but silently accepted by LinearRegression, and the resulting VCOV no longer matches the documented combined-weights contract. docs/methodology/REGISTRY.md:L2012 docs/methodology/REGISTRY.md:L2023 diff_diff/survey.py:L1349 diff_diff/survey.py:L1492 Concrete fix: enforce the same validation in SurveyDesign.resolve() or at the top of compute_replicate_vcov(), and add a LinearRegression regression test mirroring the existing IF-path contract check.

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new untracked deferrable debt findings. The TODO.md changes only mark prior items resolved.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new TwoWayFixedEffects rejection guard from the prior review is still not covered by Phase 6 tests. The new estimator-level replicate tests cover DifferenceInDifferences, CallawaySantAnna, EfficientDiD, TripleDifference, and SunAbraham, but not TwoWayFixedEffects, so the old leak could be reintroduced without a failing regression test. diff_diff/twfe.py:L127 tests/test_survey_phase6.py:L1178 Concrete fix: add an explicit TwoWayFixedEffects replicate-weight test asserting NotImplementedError; ideally add analogous rejection coverage for MultiPeriodDiD and StackedDiD too.
  • Severity: P3. Impact: the fweight wording change was only partially applied. The registry and SurveyDesign.resolve() now say “non-negative integers,” but _validate_weights() and two tests still use “positive integers.” Behavior is already correct, so this is informational only. docs/methodology/REGISTRY.md:L1910 diff_diff/linalg.py:L413 tests/test_survey.py:L2370 Concrete fix: update the remaining error string and test expectations to match the registry wording.

Path to Approval

  1. Add validation for custom replicate scaling in SurveyDesign: reject invalid replicate_scale / replicate_rscales before they reach compute_replicate_vcov() or compute_replicate_if_variance(), and add regression tests for invalid-scale rejection.
  2. Enforce the combined_weights=True contract on the OLS replicate path, preferably in SurveyDesign.resolve() or compute_replicate_vcov(), and add a LinearRegression test with w_full == 0 and w_r > 0 that now raises.

…etadata

- Validate replicate_scale > 0 and replicate_rscales finite non-negative
  in SurveyDesign.__post_init__
- Validate combined_weights=True contract in resolve(): reject w_r > 0
  where w_full == 0 (malformed design)
- Fix CS IPW/DR path: pass survey df to safe_inference_batch
- Fix ContinuousDiD/EfficientDiD: don't propagate df=0 sentinel to
  survey_metadata (keep as None for display)
- Add TWFE, StackedDiD rejection tests + scale/rscales validation tests
- Update survey-roadmap.md: CS now has full survey support, accurate
  replicate limitation descriptions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6c6494f908b892993e841e6ff48a186598d110ac


Overall Assessment
Needs changes

Executive Summary

Methodology

  • Severity P1. Impact: CallawaySantAnna’s Phase 6 replicate-weight behavior does not match the Methodology Registry. The registry’s support matrix says only reg is supported under replicate weights, but fit() only rejects survey ipw/dr when covariates are present, which leaves replicate-weight ipw/dr no-covariate fits callable. The newly updated roadmap documents that broader path as supported, so the code and methodology source now disagree on a changed estimator/variance surface. Concrete fix: either add an explicit replicate-weight NotImplementedError for estimation_method in {"ipw","dr"}, or update docs/methodology/REGISTRY.md to explicitly support replicate-weight ipw/dr without covariates and add direct Phase 6 tests for those methods. References: docs/methodology/REGISTRY.md:L2048-L2051 diff_diff/staggered.py:L1323-L1336 docs/survey-roadmap.md:L45-L49 tests/test_survey_phase6.py:L1178-L1193
  • No other unmitigated P0/P1 methodology findings stood out in the changed code. The earlier replicate-scaling and combined-weight-contract issues look fixed in diff_diff/survey.py:L134-L151 and diff_diff/survey.py:L228-L247.

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new blocker-level tech debt findings. The TODO.md updates appropriately mark several previously deferred items as resolved in TODO.md:L53-L79.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Align CallawaySantAnna’s replicate-weight support contract across code and docs/methodology/REGISTRY.md: either reject replicate-weight ipw/dr explicitly, or document those paths as supported in the registry.
  2. Add direct Phase 6 tests for the final supported CallawaySantAnna replicate-weight methods so the contract cannot drift again.

- Update REGISTRY CS support: reg/ipw/dr without covariates (not just reg)
- Fix DoseResponseCurve df_survey: pass None (not 0 sentinel) for display
- Fix fweight error string: "non-negative integers" matching REGISTRY
- Add BaconDecomposition replicate rejection guard
- Add rejection tests: MultiPeriodDiD, ImputationDiD, TwoStageDiD,
  BaconDecomposition (plus existing TWFE, StackedDiD coverage)
- Update REGISTRY support matrix to include BaconDecomposition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 8b8dcb30e11dce39ad74df32a3314abb0abd8b75


Overall Assessment
⛔ Blocker

Executive Summary

  • The prior CallawaySantAnna replicate-support contract mismatch appears resolved: the updated registry and code now agree on replicate support for reg/ipw/dr without covariates and on rejecting replicate-plus-bootstrap.
  • The prior EfficientDiD Hausman n_cl bug and the ContinuousDiD public df_survey sentinel leak also look fixed in the changed code.
  • Severity P0: zero weights are now accepted, but weighted OLS/HC1/classical inference still uses total row and cluster counts instead of the positive-weight sample. That silently returns wrong finite SEs/inference for zero-weight fits and also distorts the new DEFF diagnostics.
  • Severity P1: Bacon decomposition now skips zero-weight 2×2 cells, but fit() does not handle the “all comparisons dropped” case and can return an empty decomposition object instead of failing fast.
  • Static review only: this environment does not have numpy, pandas, scipy, or pytest, so I could not execute the new Phase 6 tests.

Methodology

  • Severity P0. Impact: the PR explicitly relaxes zero-weight handling in survey resolution and low-level weight validation, but the weighted OLS inference code still computes HC1/classical df corrections from raw row count n (and all clusters), not from the positive-weight effective sample. For any subpopulation-style fit with padded zero weights, this understates SEs, inflates t-stats/p-values, and also makes the new DEFF SRS baseline/effective-n too optimistic. References: diff_diff/survey.py:L175-L185, diff_diff/linalg.py:L403-L409, diff_diff/linalg.py:L1049-L1082, diff_diff/linalg.py:L1744-L1780, diff_diff/linalg.py:L1858-L1862, diff_diff/survey.py:L851-L865. Concrete fix: for pweight/aweight fits with zero weights, base HC1/classical df on count_nonzero(weights > 0), drop zero-total-weight clusters before clustered small-sample corrections, and use the same positive-weight count in DEFF effective-n. Add regression tests showing that a zero-weight-padded fit matches the same fit after physically dropping zero-weight rows.

No other unmitigated methodology-contract mismatches stood out in the changed estimator code.

Code Quality

  • Severity P1. Impact: the new Bacon zero-weight guards return None for individual 2×2 comparisons, but fit() never checks whether that filtering removed every comparison. In a subpopulation/zero-weight design, callers can now get a “successful” BaconDecompositionResults with comparisons=[] and a meaningless decomposition error instead of an explicit failure. References: diff_diff/bacon.py:L522-L625, diff_diff/bacon.py:L843-L858, diff_diff/bacon.py:L969-L983. Concrete fix: after building comparisons (and after exact-weight recomputation), raise ValueError if no valid comparisons remain or if total comparison weight is zero; add a regression test that zero-weight filtering of every cell fails fast.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No new blocker-level tech-debt findings. The TODO.md updates appropriately mark several previously tracked review items as resolved.

Security
No findings.

Documentation/Tests

Static review note: I could not execute the tests here because the Python scientific stack is unavailable.

Path to Approval

  1. Fix zero-weight weighted-OLS inference so HC1/classical/clustered df corrections use the positive-weight sample (and positive-weight clusters), and make DEFF effective-n use that same effective sample. Add equivalence tests against dropping zero-weight rows for solve_ols, compute_robust_vcov, LinearRegression.get_inference, and compute_deff_diagnostics.
  2. Add a top-level empty-result guard in BaconDecomposition.fit() so zero-weight filtering cannot return comparisons=[] as a normal result. Add a regression test with a zero-weight subpopulation that currently drops every 2×2 comparison.

… coverage

- Bacon: raise ValueError when all 2x2 comparisons dropped after
  zero-weight filtering (prevents empty decomposition results)
- Document survey df convention: total n (including zero-weight rows)
  matches R's survey::degf() after subset() — preserves design structure
- Add SyntheticDiD and TROP replicate rejection tests
- Add CS ipw/dr (no covariates) replicate positive-path tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a3d2343124a83e21e3b4ed9b11b088539df81fcc


Overall Assessment

⛔ Blocker

Executive Summary

  • The prior zero-weight inference blocker is still unresolved. This PR makes zero-weight rows a supported part of the API via SurveyDesign.subpopulation() and relaxed weight validation, but the generic weighted HC1/classical inference paths still use raw row and cluster counts instead of the effective weighted sample, so padding a fit with zero-weight rows can silently change SEs, p-values/CIs, and the new DEFF diagnostics. See diff_diff/survey.py:L434, diff_diff/linalg.py:L389, diff_diff/linalg.py:L1050, diff_diff/linalg.py:L1745, diff_diff/linalg.py:L1859, and diff_diff/survey.py:L851.
  • The prior Bacon empty-result issue appears fixed: zero-mass 2x2 comparisons now drop out and fit() raises when none remain, rather than returning an empty decomposition object. See diff_diff/bacon.py:L588.
  • The prior EfficientDiD Hausman stale-n_cl issue appears fixed by recomputing and remapping clusters after NaN-row filtering. See diff_diff/efficient_did.py:L1607.
  • The ContinuousDiD anticipation/event-study filtering change looks aligned with the registry and the new targeted test coverage.
  • The previous Phase 6 coverage gap around CS ipw/dr replicate support and SyntheticDiD/TROP replicate rejection appears addressed in the new tests.
  • Static review only: I could not execute the test suite here because numpy, pandas, scipy, and pytest are unavailable in this environment.

Methodology

  • Severity P0. Impact: The PR explicitly introduces subpopulation analysis by zeroing weights instead of dropping rows, and the registry documents preserving survey-design df_survey under subset()-style workflows. But the generic SRS baseline still is not zero-weight invariant. In compute_robust_vcov(), non-fweight fits hard-code n_eff = n and clustered corrections use all unique clusters, even when some rows or entire clusters have zero total weight and therefore contribute zero score mass. In the weighted classical path, LinearRegression.fit() likewise uses raw n for MSE/df, and compute_deff_diagnostics() builds the new DEFF/effective-n output from that same HC1 baseline plus n = X.shape[0]. The result is silent mis-inference: coefficients stay the same, but SEs, t-stats, p-values, CIs, and DEFF/effective-n change when the caller merely pads the dataset with zero-weight rows. That is a correctness bug, not an implementation choice. The new registry note at docs/methodology/REGISTRY.md:L2096 only documents preserving survey-design df/TSL behavior after subset(); it does not document or justify changing the generic HC1/classical/SRS-DEFF baseline this way. Concrete fix: keep the documented survey df_survey behavior, but change the generic weighted HC1/classical/SRS-DEFF paths to compute finite-sample adjustments from the effective weighted sample instead: count only weights > 0 observations for pweight/aweight, drop clusters whose total weight is zero before applying G/(G-1), and use that same effective sample definition in DEFF effective_n.

No other unmitigated methodology mismatches stood out in the changed estimator code. The replicate support/rejection matrix now matches REGISTRY.md, and the prior Bacon / EfficientDiD / ContinuousDiD review items look addressed.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. TODO.md appropriately marks the previously resolved review items; nothing there mitigates the blocker above.

Security

  • No findings.

Documentation/Tests

  • No separate findings. The added Phase 6 tests appear to close the earlier CS/SyntheticDiD/TROP coverage gaps, but I could not execute them in this environment.

Path to Approval

  1. Fix the generic weighted-inference stack so zero-weight padding is inference-invariant: in compute_robust_vcov() and the weighted classical path in LinearRegression.fit(), use the effective weighted sample for df adjustments and exclude zero-total-weight clusters from clustered small-sample corrections, while leaving the documented survey df_survey/TSL behavior unchanged.
  2. Add regression tests that compare a model fit on a positive-weight subset versus the same fit with the excluded rows retained at weight zero, covering HC1, classical WLS, clustered WLS, and compute_deff_diagnostics().

When pweight/aweight fits have zero-weight rows (from subpopulation),
use np.count_nonzero(weights > 0) for HC1 df adjustments instead of
total n. Zero-weight rows contribute nothing to the sandwich and
should not inflate df.

Also fix clustered SEs to exclude zero-total-weight clusters from
G/(G-1) small-sample correction, and DEFF effective_n to use
positive-weight count.

Survey design df (n_PSU - n_strata) unchanged — preserves design
structure matching R's survey::degf() convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c6c2c38ba526a0af5c01d53835cd77f6065def69


Overall Assessment

⛔ Blocker

Executive Summary

  • The prior zero-weight blocker is only partially resolved: the generic weighted-inference layer is still not fully invariant to zero-weight padding.
  • compute_robust_vcov() still lets zero-weight aweight rows contribute to the HC1/cluster meat, so robust SEs and the SRS baseline behind DEFF diagnostics can still change silently.
  • LinearRegression still stores raw-n degrees of freedom for zero-weight pweight/aweight fits, so get_inference() can return wrong p-values/CIs even when the vcov branch was updated.
  • The earlier Bacon empty-result fix, EfficientDiD hausman_pretest() cluster recount, and ContinuousDiD anticipation/event-study fix all appear addressed.
  • Static review only: I could not run the test suite here because numpy, pandas, scipy, and pytest are unavailable in this environment.

Methodology

  • Severity P0. Impact: the new registry note says generic HC1/classical inference should be zero-weight invariant outside the survey-vcov path, but the changed aweight robust path still builds scores = X * residuals for every row. Unlike the survey TSL path, it never zeros out weights == 0, so zero-weight rows and zero-weight clusters still enter the HC1/cluster meat and can silently change SEs. Because compute_deff_diagnostics() uses this HC1 baseline, the same bug also contaminates zero-weight aweight DEFF diagnostics. Locations: diff_diff/linalg.py:L1060-L1099, diff_diff/survey.py:L1252-L1256, docs/methodology/REGISTRY.md:L2096-L2101. Concrete fix: zero out aweight score rows where weights == 0 before both HC1 and cluster aggregation, mirroring the TSL implementation, and add zero-padding invariance tests for HC1, clustered vcov, and DEFF.
  • Severity P1. Impact: the classical weighted vcov branch now uses an effective positive-weight count, but the fitted LinearRegression object still stores self.df_ using raw n unless weight_type == "fweight". get_inference() therefore continues to use raw n-k for zero-weight pweight/aweight fits, so p-values and confidence intervals still move when zero-weight rows are padded in. That contradicts the new registry claim that the generic HC1/classical paths are zero-weight invariant. Locations: diff_diff/linalg.py:L1755-L1762, diff_diff/linalg.py:L1874-L1877, diff_diff/linalg.py:L2041-L2058, docs/methodology/REGISTRY.md:L2096-L2101. Concrete fix: compute self.df_ from the same effective-sample rule already used by the vcov branches (sum(w) for fweight, positive-weight count for zero-weight pweight/aweight) and add get_inference() invariance tests.

Code Quality

  • No separate findings.

Performance

  • No separate findings.

Maintainability

  • No separate findings.

Tech Debt

  • No separate findings. TODO.md does not mitigate the blockers above; P0/P1 correctness issues remain blocking even if tracked.

Security

  • No findings.

Documentation/Tests

  • No separate findings beyond the blocker coverage gap above. Static review only; I could not execute the new tests in this environment.

Path to Approval

  1. Fix the zero-weight aweight robust/cluster meat so zero-weight rows and clusters contribute exactly zero to HC1 and clustered vcov, matching the new subpopulation contract.
  2. Fix LinearRegression.fit() so stored inference degrees of freedom use the same effective-sample definition as the corrected vcov code.
  3. Add regression tests that compare a positive-weight sample against the same sample padded with zero-weight rows, covering LinearRegression.get_inference(), HC1, classical WLS, clustered WLS, and compute_deff_diagnostics() for both pweight and aweight.

Fix aweight HC1/clustered path: zero out score rows where weights==0
so zero-weight observations don't contribute to the sandwich meat
(mirrors the TSL survey vcov implementation).

Fix LinearRegression.df_: use positive-weight count for pweight/aweight
with zeros, matching compute_robust_vcov effective-n rule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: d682bf47abd55b67c22701ff2e4884c05c3fd3b4


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior zero-weight blockers from the last review appear addressed: the aweight HC1/cluster meat now zeros out zero-weight rows, and weighted df_ now uses the positive-weight count instead of raw n.
  • The previously reviewed EfficientDiD Hausman cluster recount and ContinuousDiD anticipation/event-study fixes also appear addressed in the changed code.
  • I did not find an unmitigated methodology mismatch in the new replicate-weight, DEFF, or subpopulation implementations; the new behavior is either aligned with or explicitly documented in docs/methodology/REGISTRY.md:L1994-L2110.
  • One new P1 remains: the new zero-weight guard in solve_logit() checks effective sample size against the raw parameter count before honoring effective-sample rank reduction, so valid zero-weight/subpopulation fits can now error unnecessarily.
  • Static compilation of the changed Python source succeeded. I could not run the test suite here because pytest, numpy, pandas, and scipy are unavailable in this environment.

Methodology

  • No unmitigated findings. The replicate-weight formulas, bootstrap rejections, documented estimator exclusions, Kish DEFF relabeling, and subpopulation semantics are all covered in docs/methodology/REGISTRY.md:L1994-L2110, and the corresponding code changes are consistent with that registry.

Code Quality

  • Severity P1. Impact: the new effective-sample guard in solve_logit() can reject valid weighted/subpopulation nuisance-logit fits whenever the positive-weight sample is rank-deficient but still estimable after dropping redundant columns. The function now errors on n_pos <= X_eff.shape[1] before applying the documented rank_deficient_action behavior, so warn/silent never get a chance to drop columns on the effective sample. This can break survey-weighted IPW/DR paths that rely on solve_logit(), especially after subpopulation() zero-padding. Locations: diff_diff/linalg.py:L1153-L1159, diff_diff/linalg.py:L1207-L1249. Concrete fix: compute effective-sample rank deficiency first, derive the post-drop effective parameter count, and only then enforce the n_pos > k_effective identification check.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. The TODO.md updates are appropriate, but they do not mitigate the P1 above.

Security

  • No findings.

Documentation/Tests

  • No separate findings beyond the missing regression coverage for the P1 above. The new tests already cover zero-weight class loss, too-few positive-weight observations, and effective-sample rank deficiency separately in tests/test_survey_phase6.py:L683-L744, but they do not cover the interaction where effective-sample collinearity reduces the parameter count enough to make the fit valid.

Path to Approval

  1. Reorder the new solve_logit() effective-sample checks so effective-sample rank reduction happens before the positive-weight sample-size gate, and base the gate on the post-drop parameter count.
  2. Add a unit test for a zero-weight-padded sample where the positive-weight subset has both outcome classes and one redundant covariate; with rank_deficient_action="warn" (and ideally "silent"), the fit should succeed and return an original-length coefficient vector instead of raising.
  3. Add one end-to-end survey-weighted IPW or DR subpopulation regression test in an estimator that uses solve_logit() (for example CallawaySantAnna or TripleDifference) so this interaction is covered outside the helper-level test.

…ze gate

Move the n_pos <= k identification check AFTER rank-deficient column
dropping so that warn/silent modes can drop redundant columns on the
positive-weight subset before rejecting. Previously a valid design
with one redundant covariate on the effective sample was rejected
instead of proceeding with column dropping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: d0aefae6092bfd776246296eaf7759d0e8bcdc40


Overall Assessment
Blocker

Executive Summary

  • The prior solve_logit() P1 from the last review appears fixed: the effective positive-weight sample is rank-reduced before the identification gate, and coefficients are expanded back to the original shape in diff_diff/linalg.py:L1207-L1250 and diff_diff/linalg.py:L1349-L1365.
  • The replicate-weight formulas, estimator support/rejection matrix, subpopulation semantics, ContinuousDiD anticipation fix, and EfficientDiD Hausman cluster recount all match the Methodology Registry or an explicitly documented deviation in docs/methodology/REGISTRY.md:L1996-L2065.
  • Severity P0: ContinuousDiD correctly uses a df=0 sentinel to force all-NaN inference when replicate survey d.f. are undefined, but then drops that sentinel when constructing DoseResponseCurve. As a result, dose_response_att.to_dataframe() and ContinuousDiDResults.to_dataframe(level="dose_response") can recompute finite t_stat/p_value under normal inference for the same rows whose CIs were already set to NaN.
  • The new tests cover undefined replicate d.f. at the helper level, but not the estimator-level dose-response serialization path where the bug survives.
  • I could not run the test suite here because numpy, pandas, and scipy are not installed in this environment.

Methodology

Code Quality
No findings. The previously reported solve_logit() issue looks resolved, with direct helper coverage in tests/test_survey_phase6.py:L727-L747.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings. The TODO.md updates appropriately mark prior review items as resolved, but they do not mitigate the P0 above.

Security
No findings.

Documentation/Tests

Path to Approval

  1. Preserve the undefined replicate-d.f. state through DoseResponseCurve construction and DataFrame serialization so rank-deficient replicate designs cannot fall back to normal-approximation t_stat/p_value.
  2. Add an estimator-level regression test covering ContinuousDiD dose-response export under a rank-one replicate design and assert t_stat, p_value, conf_int_lower, and conf_int_upper are all NaN.

Pass _survey_df directly (including the 0 sentinel) to DoseResponseCurve
so that to_dataframe() also produces NaN inference for rank-deficient
replicate designs. Previously converted 0→None which caused
to_dataframe() to fall back to z-distribution with finite t_stat/p_value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 28, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: b26eeaa63d35512d0935116b6e6dc5db30f23649


Overall Assessment
✅ Looks good. No unmitigated P0 or P1 findings.

Executive Summary

  • The previous blocker is resolved: ContinuousDiD now propagates replicate survey d.f. into DoseResponseCurve, and undefined replicate d.f. now force all inference fields to NaN consistently in the export path.
  • The replicate-weight formulas, QR-rank survey d.f., documented support/rejection matrix, and subpopulation semantics match the Methodology Registry.
  • The ContinuousDiD anticipation fix and EfficientDiD.hausman_pretest() cluster recount match the resolved TODO.md items.
  • Unsupported-estimator replicate rejections are documented in the registry, so they are informational, not defects.
  • Residual risk is coverage, not methodology: the new ContinuousDiD replicate analytical path still lacks a direct estimator-level regression test.
  • I could not execute the test suite here because pytest is not installed in this environment.

Methodology
No unmitigated findings. The replicate-weight implementation aligns with docs/methodology/REGISTRY.md:L1996-L2065, the prior ContinuousDiD NaN-inference blocker is fixed in diff_diff/continuous_did.py:L517-L523, diff_diff/continuous_did.py:L677-L698, diff_diff/continuous_did_results.py:L50-L67, and diff_diff/utils.py:L152-L221. The changed anticipation and hausman_pretest() logic also matches the resolved notes in TODO.md:L59-L62, diff_diff/continuous_did.py:L1087-L1095, and diff_diff/efficient_did.py:L1611-L1616.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings. The TODO.md updates correctly resolve the previously tracked review items, and I did not find a new correctness issue that needs deferred tracking.

Security
No findings.

Documentation/Tests
Severity P2. Impact: the new ContinuousDiD replicate analytical path is still only covered indirectly. The suite currently checks helper-level rank-one d.f. behavior in tests/test_survey_phase6.py:L1101-L1125, ordinary survey dose-response export in tests/test_survey_phase3.py:L1171-L1197, and only bootstrap rejection for replicate ContinuousDiD in tests/test_survey_phase6.py:L1229-L1243. That leaves the newly added analytical replicate branches in diff_diff/continuous_did.py:L498-L698 and diff_diff/continuous_did_results.py:L50-L67, including the exact export path that produced the prior blocker, without a direct estimator-level regression test. Concrete fix: add one ContinuousDiD(n_bootstrap=0) replicate-weight smoke test, and add one rank-one replicate regression test that asserts t_stat, p_value, conf_int_lower, and conf_int_upper all remain NaN in both dose_response_att.to_dataframe() and dose_response_acrt.to_dataframe().

@igerber igerber merged commit 92b4d97 into main Mar 28, 2026
14 checks passed
@igerber igerber deleted the survey-last-phase branch March 28, 2026 20:30
@igerber igerber mentioned this pull request Mar 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant