Add survey Phase 6: replicate weights, DEFF diagnostics, subpopulation analysis #238
…n analysis

Complete the final Phase 6 survey features:
- Replicate weight variance (BRR, Fay, JK1, JKn) as alternative to TSL
- Per-coefficient DEFF diagnostics comparing survey vs SRS variance
- Subpopulation analysis via SurveyDesign.subpopulation()

Bug fixes:
- EfficientDiD hausman_pretest() stale n_cl after NaN filtering
- ContinuousDiD event-study anticipation filter

Refactoring:
- Extract _format_survey_block() helper across 11 results files
- Rename DEFF display label to "Kish DEFF (weights)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 fixes:
- Implement JKn with explicit replicate_strata (per-stratum scaling)
- Fix replicate IF variance scale: use weighted sums not means
- Propagate replicate dispatch to ContinuousDiD, EfficientDiD, TripleDifference
- Allow zero weights in solve_logit (matching solve_ols)
- Preserve replicate metadata in SurveyDesign.subpopulation()

P2 fixes:
- Add DEFFDiagnostics and compute_deff_diagnostics to __all__
- Show replicate method/count in survey summary block
- Update docs for JKn replicate_strata requirement

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
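For readers unfamiliar with the four replicate methods named in these commits, the textbook variance formulas can be sketched as below. This is a minimal illustration, not the PR's implementation: the function name `replicate_variance` and its signature are hypothetical, and centering on the full-sample estimate (rather than the replicate mean) is an assumption.

```python
import numpy as np


def replicate_variance(theta_full, theta_reps, method, fay_rho=0.0,
                       replicate_strata=None):
    """Textbook replicate-weight variance of a scalar estimate.

    Hypothetical helper for illustration only. Deviations are centered
    on the full-sample estimate, which is an assumption.
    """
    theta_reps = np.asarray(theta_reps, dtype=float)
    R = theta_reps.size
    sq = (theta_reps - theta_full) ** 2

    if method == "BRR":
        return float(sq.sum() / R)
    if method == "Fay":
        # Fay's method damps the BRR perturbation by a factor rho in [0, 1).
        return float(sq.sum() / (R * (1.0 - fay_rho) ** 2))
    if method == "JK1":
        return float((R - 1) / R * sq.sum())
    if method == "JKn":
        if replicate_strata is None:
            raise ValueError("JKn requires replicate_strata")
        strata = np.asarray(replicate_strata)
        total = 0.0
        for h in np.unique(strata):
            in_h = strata == h
            n_h = int(in_h.sum())
            # Per-stratum scaling (n_h - 1) / n_h.
            total += (n_h - 1) / n_h * sq[in_h].sum()
        return float(total)
    raise ValueError(f"unknown replicate method: {method!r}")


reps = [0.0, 2.0, 1.0, 1.0]
v_brr = replicate_variance(1.0, reps, "BRR")
v_fay = replicate_variance(1.0, reps, "Fay", fay_rho=0.5)
v_jk1 = replicate_variance(1.0, reps, "JK1")
v_jkn = replicate_variance(1.0, reps, "JKn", replicate_strata=[0, 0, 1, 1])
```

JKn is the only method that needs to know which stratum each replicate belongs to, which is why the commit makes `replicate_strata` an explicit requirement.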
Round 3 review fixes:
- Add ResolvedSurveyDesign.subset_to_units() helper to carry replicate metadata through panel→unit collapse in ContinuousDiD and EfficientDiD
- Normalize replicate weight columns to sum=n for pweight/aweight (matching full-sample normalization for scale-invariant IF variance)
- Extend _validate_unit_constant_survey() to check replicate weight columns
- Set n_psu=None for replicate designs in metadata (was bogus implicit count)
- Warn on invalid replicate solves instead of silently dropping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… tests

Round 4 review fixes:
- Fix compute_replicate_if_variance() to match compute_survey_if_variance() contract: accept psi as-is, use weight-ratio rescaling (w_r/w_full) for replicate contrasts instead of raw weight multiplication
- Add positive-mass guard in solve_ols() and solve_logit(): reject all-zero weight vectors to prevent silent empty-sample fits
- Narrow exception catch in compute_replicate_vcov() to LinAlgError/ValueError
- Add numerical test comparing replicate IF variance to TSL IF variance on same PSU structure (ratio within [0.5, 2.0])
- Add test for all-zero weight rejection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the Phase 6 survey methodology additions:
- Replicate weight variance: BRR/Fay/JK1/JKn formulas, IF contract (weight-ratio rescaling matching compute_survey_if_variance), df=R-1, normalization convention, JKn replicate_strata requirement
- DEFF diagnostics: per-coefficient design effect formula, effective n, opt-in computation
- Subpopulation analysis: domain estimation via zero-weight preservation, replicate metadata handling, solver zero-weight guards

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
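The design-effect quantities mentioned here follow standard definitions: per-coefficient DEFF is the ratio of the survey variance to the simple-random-sample variance, effective n is n divided by DEFF, and the Kish DEFF depends on the weights alone. A minimal sketch, assuming hypothetical helper names (`design_effect`, `kish_deff`) that are not the library's API:

```python
import numpy as np


def design_effect(survey_var, srs_var, n):
    """Per-coefficient DEFF = survey variance / SRS variance.

    Hypothetical helper. Coefficients with non-positive SRS variance
    (for example dropped columns) come back as NaN.
    """
    survey_var = np.asarray(survey_var, dtype=float)
    srs_var = np.asarray(srs_var, dtype=float)
    deff = np.full(survey_var.shape, np.nan)
    np.divide(survey_var, srs_var, out=deff, where=srs_var > 0)
    effective_n = np.full(survey_var.shape, np.nan)
    np.divide(float(n), deff, out=effective_n, where=deff > 0)
    return deff, effective_n


def kish_deff(weights):
    """Kish design effect from weights alone: n * sum(w^2) / sum(w)^2.

    Restricting to positive-weight rows is an assumption made here.
    """
    w = np.asarray(weights, dtype=float)
    w = w[w > 0]
    return float(w.size * np.sum(w ** 2) / np.sum(w) ** 2)


deff, eff_n = design_effect([4.0, 1.0], [2.0, 0.0], n=100)
kish = kish_deff([1.0, 1.0, 2.0])
```

A DEFF of 2 means the survey design doubles the variance relative to simple random sampling, so 100 observations carry the information of roughly 50.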
Round 5 review fixes:
- solve_logit(): validate effective weighted sample (positive-weight rows) for class support and parameter identification before IRLS
- compute_replicate_if_variance(): use np.divide(where=) to avoid divide-by-zero warnings on zero full-sample weights
- Add regression tests: single-class positive-weight, too-few positive-weight obs, and zero-weight replicate IF (no RuntimeWarning)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
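The `np.divide(where=)` pattern referred to above can be sketched as follows. The helper name `weight_ratio` is hypothetical, and setting the ratio to zero for zero-weight rows is an assumption about the intended behavior; the point is only that masked elements are never divided, so no RuntimeWarning is raised.

```python
import warnings

import numpy as np


def weight_ratio(w_rep, w_full):
    """Elementwise w_rep / w_full, with 0 where the full-sample weight is 0.

    Hypothetical helper for illustration. `out` supplies the value for the
    rows that `where` masks out.
    """
    w_rep = np.asarray(w_rep, dtype=float)
    w_full = np.asarray(w_full, dtype=float)
    ratio = np.zeros_like(w_full)
    np.divide(w_rep, w_full, out=ratio, where=w_full > 0)
    return ratio


# Promote warnings to errors to show that the 0/0 row is never evaluated.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    ratios = weight_ratio([1.0, 0.0, 8.0], [2.0, 0.0, 4.0])
```

Passing `out=` is essential: without it, the masked positions of the result are left uninitialized.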
Run _detect_rank_deficiency() on positive-weight rows when weights contain zeros, so rank-deficient subpopulation/domain samples are rejected even when the full padded design is full rank. Add regression test with collinear positive-weight subset. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
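A small numerical case shows why the check has to run on the positive-weight rows. In the sketch below (the helper `effective_rank` is hypothetical and stands in for the library's own rank detection), the padded design is full rank, but the two rows that break the collinearity carry zero weight.

```python
import numpy as np


def effective_rank(X, weights):
    """Rank of the design restricted to rows with positive weight.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    return int(np.linalg.matrix_rank(X[w > 0]))


# Columns: intercept, x1, x2. On the positive-weight rows x2 == x1.
X = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],  # zero weight: breaks collinearity in the padded design
    [1.0, 0.0, 1.0],  # zero weight: breaks collinearity in the padded design
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 2.0],
])
weights = np.array([1.0, 0.0, 0.0, 1.0, 1.0])

full_rank = int(np.linalg.matrix_rank(X))
domain_rank = effective_rank(X, weights)
```

Here the full design has rank 3 while the effective sample has rank 2, so a fit that only inspected the padded design would report coefficients that the domain data cannot identify.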
… DEFF

Round 7 review fixes:
- P0: TripleDifference replicate IF path now uses raw combined IF (not TSL-deweighted) for IPW/DR methods, matching REGISTRY contract
- P1: ContinuousDiD rejects replicate_weights + n_bootstrap>0 with NotImplementedError (replicate variance is analytical, not bootstrap)
- P2: LinearRegression.compute_deff() handles rank-deficient models by computing SRS vcov on kept columns only, expanding with NaN
- Tests: ContinuousDiD replicate+bootstrap rejection, TripleDiff replicate regression method end-to-end

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- compute_deff(): return all-NaN DEFFDiagnostics directly when all coefficients are dropped, instead of calling compute_deff_diagnostics() on a singular design
- Parameterize TripleDiff replicate test over reg/ipw/dr to cover the previously fixed IPW/DR IF scale bug

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 9 review fixes:
- SunAbraham: reject replicate-weight survey designs with NotImplementedError (weighted within-transformation must be recomputed per replicate, not yet implemented)
- subpopulation(): validate masks for NaN before bool coercion to prevent silent inclusion of missing-valued observations
- Tests: SunAbraham replicate rejection, NaN mask rejection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… ContinuousDiD

The TSL path via compute_survey_vcov(X_ones, if_vals, resolved) applies implicit score weighting (w * if) and bread normalization (1/sum(w)^2). The replicate IF path must apply equivalent score scaling before calling compute_replicate_if_variance() to produce matching SEs.

Verified empirically: JK1 replicate SE now matches TSL SE within 0.3% on a toy weighted design (ratio 0.9967, previously 60x inflated).

Score scaling by estimator:
- EfficientDiD: psi = w * eif / sum(w)
- ContinuousDiD: psi = w * if_vals (tsl_scale cancels with bread)
- TripleDifference reg: psi = w * inf_func / sum(w)
- TripleDifference ipw/dr: psi = inf_func / sum(w)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t rank action

Round 11 review fixes:
- P0: Fix subpopulation() mask validation: remove except that swallowed its own ValueError for None masks. None values now properly rejected.
- P1: EfficientDiD rejects replicate weights + n_bootstrap>0 with NotImplementedError (matching ContinuousDiD/SunAbraham pattern)
- P1: solve_logit() effective-sample rank check now respects rank_deficient_action (warn/silent/error) instead of hard-erroring
- P2: Update survey-roadmap.md with replicate-weight limitations (SunAbraham, ContinuousDiD/EfficientDiD bootstrap)
- Tests: None mask rejection, EfficientDiD replicate+bootstrap rejection, logit rank-deficient warn vs error modes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TripleDifference: _compute_cell_means() validates positive survey mass per cell before np.average(), raises ValueError on zero-weight cells
- ContinuousDiD: _compute_dose_response_gt() checks sum(w_treated) and sum(w_control) > 0, returns NaN for cells with zero effective mass instead of crashing on weighted mean/bread division

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
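The guard matters because `np.average()` raises `ZeroDivisionError` when the weights sum to zero, which is exactly what a subpopulation mask can produce for an empty cell. A minimal sketch of the NaN-returning variant, using a hypothetical helper name (the TripleDifference path described above raises `ValueError` instead):

```python
import numpy as np


def weighted_mean_or_nan(values, weights):
    """Weighted mean that returns NaN for cells with no positive mass.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if not total > 0:
        # Zero effective mass: the cell carries no information.
        return float("nan")
    return float(np.average(values, weights=weights))


populated = weighted_mean_or_nan([1.0, 3.0], [1.0, 1.0])
empty = weighted_mean_or_nan([1.0, 3.0], [0.0, 0.0])
```

Returning NaN lets the caller filter the empty cell before aggregation, while raising is the safer choice when an empty cell means the estimate is undefined.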
Round 13 review fixes:
- P0: compute_replicate_vcov() and compute_replicate_if_variance() return NaN when fewer than 2 valid replicates remain
- P1: solve_logit() now actually drops rank-deficient columns from the effective positive-weight design in warn/silent modes (was warn-only)
- P1: ContinuousDiD filters NaN cells from gt_results before aggregation so one zero-mass cell doesn't poison valid aggregates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 14 review fixes:
- CallawaySantAnna: reject replicate-weight + n_bootstrap>0 with NotImplementedError (matching ContinuousDiD/EfficientDiD/SunAbraham)
- BaconDecomposition: guard weighted np.average() calls against zero effective weight in both treated-vs-never and timing comparison cells
- Test: CS replicate+bootstrap rejection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
P1 fixes from CI AI review (PR #238):
- solve_logit(): track original column count and expand returned beta back to p+1 length after effective-sample column dropping. Previously returned a shortened vector breaking the solver contract.
- subpopulation(): reject string/object masks that would silently coerce non-empty strings to True, defining the wrong domain.
- REGISTRY.md: add Note entries for estimator-level replicate limitations (SunAbraham rejection, CS/ContinuousDiD/EfficientDiD bootstrap rejection)

Tests: assert beta length p+1 after zero-weight rank-deficient solve, assert string mask raises ValueError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
I did not execute the test suite in this sandbox because the available Python environment is missing project dependencies ( |
…ocument df
CI review findings:
- Reject non-binary numeric masks in subpopulation() ({1,2} etc. coerce
to all-True via astype(bool), silently defining wrong domain)
- Fix test_survey_phase4.py: update "strictly positive" to "non-negative"
to match changed solve_logit() validation message
- Document replicate df limitation in TODO.md (df stays R-1 when invalid
replicates are dropped — marginal impact for typical R > 50)
- Add REGISTRY.md Note entries for replicate <2 valid returns NaN
- Tests: non-binary numeric mask rejection, beta length assertion
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
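The mask failure modes collected over these rounds (NaN, None, strings, non-binary numerics) all share one cause: `astype(bool)` treats every non-zero, non-empty value as True. A sketch of a validator covering them, assuming a hypothetical function name and assuming that 0/1 numeric masks are meant to be accepted:

```python
import numpy as np


def validate_domain_mask(mask):
    """Return a boolean mask, rejecting inputs that bool coercion would mangle.

    Hypothetical helper for illustration; not the library's implementation.
    """
    arr = np.asarray(mask)
    if arr.dtype == bool:
        return arr.copy()
    # Strings and object arrays (including any None entry) are rejected
    # outright: a non-empty string would coerce to True.
    if arr.dtype.kind in "USO":
        raise ValueError("mask must be boolean or 0/1 numeric")
    if arr.dtype.kind not in "iuf":
        raise ValueError("mask must be boolean or 0/1 numeric")
    # NaN is truthy under astype(bool), so check before coercing.
    if arr.dtype.kind == "f" and np.isnan(arr).any():
        raise ValueError("mask contains NaN")
    # {1, 2} would coerce to all-True and silently select everyone.
    if not np.isin(arr, (0, 1)).all():
        raise ValueError("numeric mask must contain only 0 and 1")
    return arr.astype(bool)


def is_rejected(mask):
    """Small harness: True when the validator raises ValueError."""
    try:
        validate_domain_mask(mask)
    except ValueError:
        return True
    return False


ok = validate_domain_mask([1, 0, 1])
naive = np.array([1, 2]).astype(bool)  # what unvalidated coercion produces
```

The `naive` array is all True, which is the silent wrong-domain behavior the commit describes.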
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
I could not execute the test suite in this sandbox because |
CallawaySantAnna._collapse_survey_to_unit_level() now carries through replicate_weights (collapsed via groupby().first()), replicate_method, fay_rho, n_replicates, and replicate_strata. Previously these fields were dropped, causing replicate-weight fits to silently fall back to non-replicate aggregation SEs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the replicate df limitation from TODO.md (which only silences P2/P3) to REGISTRY.md with a **Note:** label (which the CI reviewer recognizes and downgrades to P3). The df stays R-1 instead of n_valid-1 when invalid replicates are dropped — marginal impact for typical R > 50. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When invalid replicates are dropped in compute_replicate_vcov() or compute_replicate_if_variance(), update resolved.n_replicates to the valid count so df_survey = n_valid - 1 flows through to t-based inference. Previously df stayed at R-1 (original count), making p-values/CIs stale when replicates failed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
I could not execute the test suite in this sandbox because |
Stop mutating resolved.n_replicates in place — instead return (result, n_valid) tuples from compute_replicate_vcov() and compute_replicate_if_variance(). Callers unpack the tuple and LinearRegression.fit() uses n_valid-1 for survey_df. This eliminates the shared-object mutation that the CI reviewer flagged as P0 (order-dependent bugs on reused resolved designs) while properly threading the effective df through inference. Updated all 7 callers across 5 files + 4 test call sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
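The tuple-return pattern described here can be sketched as follows. The function name, the `scale` parameter, and the way invalid replicates are detected are all assumptions for illustration; the point is that the valid count travels with the result instead of being written back into a shared design object.

```python
import numpy as np


def replicate_variance_with_count(theta_full, theta_reps, scale=1.0):
    """Return (variance, n_valid) without mutating any shared state.

    Hypothetical helper. Non-finite replicate estimates are treated as
    failed solves and dropped; fewer than 2 valid replicates yields NaN.
    """
    theta_reps = np.asarray(theta_reps, dtype=float)
    valid = np.isfinite(theta_reps)
    n_valid = int(valid.sum())
    if n_valid < 2:
        return float("nan"), n_valid
    var = scale * np.sum((theta_reps[valid] - theta_full) ** 2)
    return float(var), n_valid


# The caller unpacks the tuple and derives the degrees of freedom locally.
var, n_valid = replicate_variance_with_count(2.0, [1.0, np.nan, 3.0], scale=0.5)
df = n_valid - 1

too_few_var, too_few_n = replicate_variance_with_count(2.0, [np.nan, 1.0])
```

Because nothing is mutated, calling the function twice on the same design gives the same degrees of freedom regardless of call order.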
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance
No findings.
Maintainability
No findings beyond the live inference-state and d.f.-propagation issues above.
Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
Add NotImplementedError for replicate-weight survey designs in DifferenceInDifferences, MultiPeriodDiD, StackedDiD, ImputationDiD, TwoStageDiD, SyntheticDiD, and TROP. These estimators use TSL (compute_survey_vcov) or custom variance computations that do not dispatch to replicate-based variance. Document full replicate-weight support matrix in REGISTRY.md. Also adds df<=0 early-return in safe_inference/safe_inference_batch, fixes LinearRegression.get_inference() generic fallback for replicate designs, re-applies EfficientDiD replicate guard after unit design rebuild, and adds guard at CS first df_survey read. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance
No findings.
Maintainability
No findings.
Tech Debt
No findings.
Security
No findings.
Documentation/Tests
Path to Approval
|
TWFE performs weighted within-transformation once with full-sample weights, then lets LinearRegression compute replicate variance on the frozen transformed design. This is methodologically incorrect — replicate variance requires recomputing the within-transform per replicate (same reason as SunAbraham rejection). Also fix compute_deff() to gate on survey design presence instead of survey_df_, allowing DEFF computation even when replicate df is undefined. Fix REGISTRY fweight note consistency (non-negative, not positive). Update replicate support matrix to include TWFE rejection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Needs changes Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…etadata

- Validate replicate_scale > 0 and replicate_rscales finite non-negative in SurveyDesign.__post_init__
- Validate combined_weights=True contract in resolve(): reject w_r > 0 where w_full == 0 (malformed design)
- Fix CS IPW/DR path: pass survey df to safe_inference_batch
- Fix ContinuousDiD/EfficientDiD: don't propagate df=0 sentinel to survey_metadata (keep as None for display)
- Add TWFE, StackedDiD rejection tests + scale/rscales validation tests
- Update survey-roadmap.md: CS now has full survey support, accurate replicate limitation descriptions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
- Update REGISTRY CS support: reg/ipw/dr without covariates (not just reg)
- Fix DoseResponseCurve df_survey: pass None (not 0 sentinel) for display
- Fix fweight error string: "non-negative integers" matching REGISTRY
- Add BaconDecomposition replicate rejection guard
- Add rejection tests: MultiPeriodDiD, ImputationDiD, TwoStageDiD, BaconDecomposition (plus existing TWFE, StackedDiD coverage)
- Update REGISTRY support matrix to include BaconDecomposition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
No other unmitigated methodology-contract mismatches stood out in the changed estimator code.
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Static review note: I could not execute the tests here because the Python scientific stack is unavailable.
Path to Approval
|
… coverage

- Bacon: raise ValueError when all 2x2 comparisons dropped after zero-weight filtering (prevents empty decomposition results)
- Document survey df convention: total n (including zero-weight rows) matches R's survey::degf() after subset(), preserving design structure
- Add SyntheticDiD and TROP replicate rejection tests
- Add CS ipw/dr (no covariates) replicate positive-path tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Executive Summary
Methodology
No other unmitigated methodology mismatches stood out in the changed estimator code. The replicate support/rejection matrix now matches
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
When pweight/aweight fits have zero-weight rows (from subpopulation), use np.count_nonzero(weights > 0) for HC1 df adjustments instead of total n. Zero-weight rows contribute nothing to the sandwich and should not inflate df. Also fix clustered SEs to exclude zero-total-weight clusters from G/(G-1) small-sample correction, and DEFF effective_n to use positive-weight count. Survey design df (n_PSU - n_strata) unchanged — preserves design structure matching R's survey::degf() convention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
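The effect of counting only positive-weight rows can be seen in the HC1 small-sample factor n / (n - k). A minimal sketch, assuming a hypothetical helper name and assuming NaN is the appropriate result when the effective sample cannot identify k parameters:

```python
import numpy as np


def hc1_factor(weights, k):
    """HC1 small-sample factor n_eff / (n_eff - k) using positive-weight rows.

    Hypothetical helper for illustration only.
    """
    w = np.asarray(weights, dtype=float)
    n_eff = int(np.count_nonzero(w > 0))
    if n_eff <= k:
        return float("nan")
    return n_eff / (n_eff - k)


weights = [1.0, 0.0, 2.0, 0.0, 1.0, 1.0]  # two rows zeroed out by a domain mask
effective = hc1_factor(weights, k=2)
padded = len(weights) / (len(weights) - 2)  # factor if total n were used
```

With four informative rows the factor is 2.0, while counting all six rows would give 1.5 and understate the variance.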
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
Fix aweight HC1/clustered path: zero out score rows where weights==0 so zero-weight observations don't contribute to the sandwich meat (mirrors the TSL survey vcov implementation). Fix LinearRegression.df_: use positive-weight count for pweight/aweight with zeros, matching compute_robust_vcov effective-n rule. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…ze gate Move the n_pos <= k identification check AFTER rank-deficient column dropping so that warn/silent modes can drop redundant columns on the positive-weight subset before rejecting. Previously a valid design with one redundant covariate on the effective sample was rejected instead of proceeding with column dropping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality Performance Maintainability Tech Debt Security Documentation/Tests
Path to Approval
|
Pass _survey_df directly (including the 0 sentinel) to DoseResponseCurve so that to_dataframe() also produces NaN inference for rank-deficient replicate designs. Previously converted 0→None which caused to_dataframe() to fall back to z-distribution with finite t_stat/p_value. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology Code Quality Performance Maintainability Tech Debt Security Documentation/Tests |
Summary
- SurveyDesign.subpopulation()
- hausman_pretest() stale n_cl after NaN filtering
- _format_survey_block() helper across 11 results files
- ResolvedSurveyDesign.subset_to_units() for panel→unit collapse with replicate metadata
- solve_ols(), solve_logit(), and estimator cell means
Methodology references (required if estimator / math changes)
Validation
- tests/test_survey_phase6.py (53 new tests), tests/test_survey_phase3.py, tests/test_survey_phase4.py, tests/test_survey_phase5.py (coverage gap tests), tests/test_survey.py, tests/test_efficient_did.py, tests/test_continuous_did.py
Security / privacy
Generated with Claude Code