Survey Phase 7: CS IPW/DR covariates, repeated cross-sections, HonestDiD survey variance #240
…DiD survey variance
Phase 7a: Remove NotImplementedError gate for IPW/DR + covariates + survey. Add DRDID panel nuisance IF corrections (PS + OR) for both survey and non-survey DR paths. Extract _safe_inv helper for matrix inversions.
Phase 7d: Thread survey df through HonestDiD for t-distribution critical values. Compute full event-study VCV from influence function vectors. Add event_study_vcov to CallawaySantAnnaResults.
Phase 7b: Add panel=False for repeated cross-section support in CallawaySantAnna. New _precompute_structures_rc, _compute_att_gt_rc, and three RC estimation methods (reg, ipw, dr) with covariates and survey weights. Canonical index abstraction in aggregation/bootstrap. RCS data generator in generate_staggered_data(panel=False).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
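The Phase 7d step "compute full event-study VCV from influence function vectors" can be pictured with a minimal sketch. It assumes each event-time estimate has a per-unit influence function that is already pre-scaled (so a variance is a plain sum of squares); the helper name `event_study_vcov_from_ifs` and the array layout are hypothetical and not taken from the library.

```python
import numpy as np

def event_study_vcov_from_ifs(phi):
    """Hypothetical helper: phi has shape (n_units, n_event_times).

    Column e holds the pre-scaled influence function of the event-time-e
    estimate, so variances are column sums of squares and covariances are
    column cross products.
    """
    phi = np.asarray(phi, dtype=float)
    return phi.T @ phi

rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 3)) / 200   # toy IFs for 3 event times
vcov = event_study_vcov_from_ifs(phi)
se = np.sqrt(np.diag(vcov))             # per-horizon SEs from the diagonal
```

The off-diagonal entries are what a diagonal-only fallback discards, which is why the full matrix matters for joint inference across horizons.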
PR Review
Overall assessment
Executive summary
Methodology
Cross-check basis: …
Code Quality
No additional findings beyond the methodology issues above.
Performance
No findings.
Maintainability
Tech Debt
No separate findings. The new …
Security
No findings.
Documentation/Tests
Path to Approval
Static review only; I was not able to do runtime spot-checks in this environment because the Python runtime here is missing required dependencies such as …
… VCV, panel on results
Fix 5 findings from PR #240 CI review:
- Add cross-sectional nuisance IF corrections (PS + OR) to _ipw_estimation_rc and _doubly_robust_rc, matching panel path methodology
- Use fixed full-sample cohort masses for unweighted RCS aggregation weights (consistency with WIF group-share denominator)
- Guard replicate-weight designs from full event-study VCV (diagonal fallback)
- Add panel field to CallawaySantAnnaResults, fix summary labels for RCS
- Add panel to class docstring, replicate VCV test, RCS IF correction test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
PR Review
Overall assessment
Executive Summary
Methodology
Affected methods: Callaway-Sant'Anna repeated cross sections (…
Code Quality
No findings beyond the methodology issues above.
Performance
No findings.
Maintainability
No findings.
Tech Debt
No separate findings. Neither remaining P1 issue is tracked in TODO.md:L48-L60, so …
Security
No findings.
Documentation/Tests
Path to Approval
Static review only: I could not execute the test suite here because this environment lacks …
…equal-count tests
- Use separate normalizer_t/normalizer_s in _doubly_robust_rc() IF corrections (post vs base period treated denominators must match the DR estimator)
- Precompute rcs_cohort_masses in _precompute_structures_rc() and return cohort mass as n_treated from _compute_att_gt_rc(), which fixes all downstream consumers (aggregation, bootstrap, balance_e) at the source instead of per-consumer patches
- Remove now-unnecessary rcs_cohort_counts blocks from aggregation
- Add unequal cohort count test fixture and regression tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
PR Review
Overall assessment
Executive summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
…ulas
_outcome_regression_rc: Pool all treated obs for OR correction term
(was: separate per-period averages). Period-specific treated means for Y.
Matches Sant'Anna & Zhao (2020) Eq 2.2 / R reg_did_rc exactly.
_doubly_robust_rc: Fit 4 OLS models (control+treated, pre+post) for
locally efficient DR estimator (was: 2 control-only). Implements tau_1
(AIPW) + tau_2 (local efficiency adjustment) with full 11-component IF.
Matches Sant'Anna & Zhao (2020) Eq 3.3+3.4 / R drdid_rc exactly.
Add agg_weight field to group_time_effects for RCS aggregation weight
(cohort mass), separate from n_treated (per-cell display count).
Aggregation uses data.get("agg_weight", data["n_treated"]) for
backward compatibility with panel data.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
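The backward-compatible lookup described in this commit message can be illustrated with a small sketch. The cell dictionaries and the helper `aggregation_weight` are hypothetical stand-ins for entries of group_time_effects; only the `data.get("agg_weight", data["n_treated"])` expression comes from the message itself.

```python
def aggregation_weight(cell):
    """Hypothetical helper: prefer the RCS cohort mass, else fall back to n_treated."""
    return cell.get("agg_weight", cell["n_treated"])

# Panel-style cell: no agg_weight key, so n_treated is used.
panel_cell = {"effect": 0.8, "n_treated": 120}
# RCS-style cell: per-cell display count differs from the fixed cohort mass.
rcs_cell = {"effect": 0.5, "n_treated": 37, "agg_weight": 410.0}

weights = [aggregation_weight(c) for c in (panel_cell, rcs_cell)]
overall = sum(w * c["effect"] for w, c in zip(weights, (panel_cell, rcs_cell))) / sum(weights)
```

Note that `dict.get` evaluates its default eagerly, so this pattern presumes `n_treated` is always present, which matches the message's description of it as the per-cell display count.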
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings.
Maintainability
Tech Debt
No separate findings. The P1 bootstrap-weight issue above is not currently tracked in TODO.md:L48-L60 or TODO.md:L167-L170, so it remains unmitigated.
Security
No findings.
Documentation/Tests
Path to Approval
- Bootstrap overall/event-study reaggregation now uses agg_weight (fixed cohort mass) for panel=False, matching the analytical aggregation path
- Reset self._event_study_vcov = None at start of fit() to prevent stale VCV from prior fit leaking into reused estimator objects
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only: I could not execute the changed tests in this environment because the default Python interpreter here does not have the project dependencies available.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
…dy SEs
Prevents HonestDiD from mixing analytical IF-based VCV with bootstrap SEs on bootstrap-fit CallawaySantAnna results. When n_bootstrap>0, the event_study_vcov is set to None so HonestDiD falls back to diagonal from the bootstrap SEs (consistent variance path).
Add regression test: bootstrap CS → HonestDiD asserts vcov is None.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
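The fallback described above, using a diagonal matrix built from the bootstrap SEs when no full VCV is stored, might look like the following sketch. The function `resolve_vcov` is a hypothetical illustration and not the HonestDiD code path itself.

```python
import numpy as np

def resolve_vcov(se, event_study_vcov=None):
    """Hypothetical helper: full VCV when available, else diagonal from SEs."""
    se = np.asarray(se, dtype=float)
    if event_study_vcov is None:
        # Bootstrap fits store no analytical VCV, so keep one variance source.
        return np.diag(se ** 2)
    return np.asarray(event_study_vcov, dtype=float)

bootstrap_se = np.array([0.10, 0.20, 0.15])
sigma = resolve_vcov(bootstrap_se, event_study_vcov=None)
```

The point of the guard is consistency: the variances on the diagonal and any covariances should come from the same estimator, not one from the bootstrap and the other from influence functions.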
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only: I could not execute the added tests here because …
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
…s sum not mean
- _outcome_regression_rc: M1 denominator changed from sum_w_D to n_all (matching R colMeans convention); inf_cont_2 / sum_w_D then gives correct single normalization by mean_w_D * n_all = sum_w_D
- _ipw_estimation_rc: PS M2 uses np.sum/n_all instead of np.mean (which divided by n_ct/n_cs instead of n_all, under-scaling the correction)
- _doubly_robust_rc: PS M2 already correct (np.sum/n_all), no change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review plus source cross-check only. I could not run …
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
/ai-review
…convergence test
REGISTRY.md: Document that RCS IFs use phi=psi/n convention (SE = sqrt(sum(phi^2))), algebraically equivalent to R's sd(psi)/sqrt(n). The 1/n_all denominator in gradient terms is the colMeans -> phi conversion, not extra shrinkage.
Add test proving correctness: analytical SE within 20% of bootstrap SE (499 iters) for RCS reg with covariates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
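The phi = psi/n convention mentioned in this commit can be checked numerically. The sketch below uses synthetic, centered psi values (an assumption made purely for illustration) and shows that sqrt(sum(phi^2)) equals the population-sd form exactly, while R's sd(), which divides by n-1, differs only by the factor sqrt(n/(n-1)).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
psi = rng.normal(scale=2.0, size=n)
psi -= psi.mean()                            # influence functions are mean zero

phi = psi / n                                # pre-scaled convention
se_phi = np.sqrt(np.sum(phi ** 2))           # SE = sqrt(sum(phi^2))
se_pop = np.std(psi, ddof=0) / np.sqrt(n)    # sd with n denominator
se_r = np.std(psi, ddof=1) / np.sqrt(n)      # R's sd() uses n - 1

ratio = se_r / se_phi                        # sqrt(n / (n - 1)), close to 1
```

For n in the hundreds the two conventions agree to a fraction of a percent, far inside the 20% tolerance used by the bootstrap comparison test.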
cfcc441 to c2f8fdc
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. I could not run the test suite here because this environment does not have …
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Restructure _outcome_regression_rc, _ipw_estimation_rc, _doubly_robust_rc to compute leading IF terms in R's unnormalized psi convention (using mean_w_* = sum_w_*/n_all normalizers matching R's mean()), then convert to library phi = psi/n_all at the boundary. Makes DRDID correspondence explicit with R variable name comments.
Fix HonestDiD event_study_vcov subsetting: when filtering NaN-SE event times, subset the VCV matrix to match the surviving rel_times (was using the full unfiltered matrix, causing dimension mismatch on interior drops).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
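The VCV subsetting fix can be sketched as follows. The arrays are toy data and the variable names are illustrative assumptions; the idea is only that rows and columns of the matrix must be dropped together with the filtered event times.

```python
import numpy as np

rng = np.random.default_rng(3)
rel_times = np.array([-2, -1, 0, 1, 2])
se = np.array([0.11, np.nan, 0.09, 0.12, 0.14])   # interior drop at e = -1

a = rng.normal(size=(50, 5))
vcov = a.T @ a / 50 ** 2                           # toy 5x5 event-study VCV

keep = np.isfinite(se)                             # surviving event times
rel_sub = rel_times[keep]
vcov_sub = vcov[np.ix_(keep, keep)]                # drop matching rows AND columns
```

Using the unfiltered 5x5 matrix with the four surviving estimates is exactly the dimension mismatch the commit describes.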
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. I could not run the suite here because this environment is missing …
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings.
Maintainability
No findings.
Tech Debt
No findings. The …
Security
No findings.
Documentation/Tests
Path to Approval
…2 gradients
Resolve TODO.md merge conflict (keep our resolved items from Phase 7).
Panel IPW/DR PS nuisance gradients (M2, M2_dr): change np.mean() over control rows to np.sum() / (n_t + n_c), matching R's colMeans() over ALL n observations (zero for treated). This makes the panel paths consistent with the RC paths' R-convention refactoring.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. I could not run the suite here because this environment is missing …
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings.
Maintainability
No findings.
Tech Debt
No separate findings. The …
Security
No findings.
Documentation/Tests
Path to Approval
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology
Affected methods: …
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
…nator)
The previous change from np.mean to np.sum/n_all_panel was wrong: R's colMeans over control rows divides by n_c, not n_all. With H/n and asy_rep/n already on phi scale, M2 = colMeans(control) with n_c denominator is the correct gradient. Reverts all three panel M2/M2_dr sites (IPW, DR survey, DR non-survey) from np.sum()/n_all_panel back to np.mean().
Also: fix "Control" → "Never-treated" label, update REGISTRY note.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Drop the H/n, asy_rep/n convention from all three panel PS correction sites. Now uses R's direct formulation: H = X'WX, asy_rep = score @ inv(H), M2 = colMeans (sum over control terms / n_all). The /n factors were algebraically canceling but confused the static reviewer.
Also: add duplicate unit-ID check for panel=False. Update REGISTRY note to reflect the simpler H = X'WX convention.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only: I could not run the new paths locally in this environment because the Python deps are unavailable.
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings.
Maintainability
Tech Debt
Security
No findings.
Documentation/Tests
Path to Approval
…i conversion
All 5 PS nuisance correction sites (3 panel + 2 RC) now compute in R's psi convention: H_psi = X'WX/n, asy_lin_rep_psi = score @ solve(H_psi) / n, M2 = colMeans(...). The psi-scale product is converted to phi via / n before adding to stored IF. This matches R's DRDID formulation line-for-line.
Previously: H = X'WX (no /n), asy_rep = score @ solve(H), M2 = sum/n. The algebraic result is identical (confirmed: all SE ratios unchanged at 0.97-1.03), but the intermediate variables now have R's exact magnitudes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only; I could not run the targeted tests here because …
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings.
Maintainability
Tech Debt
Security
No findings.
Documentation/Tests
Path to Approval
…rap validation
Remove R-comparison framing from PS nuisance correction note. Describe as a valid numerical approach to M-estimation variance (library's pre-scaled IF convention). Include bootstrap convergence ratios as empirical proof the correction is not under-scaled.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. This environment does not have the project’s Python dependencies installed, so I could not run the targeted tests here.
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings.
Maintainability
Tech Debt
Security
No findings.
Documentation/Tests
Path to Approval
The PS asymptotic linear representation (asy_lin_rep_psi) was being divided by n twice: once in its computation (score @ H_psi_inv / n) and again when adding the correction to inf_func (/ n_all_panel). Since H_psi_inv = solve(X'WX/n) already contains the factor of n, score @ H_psi_inv produces psi-scale values matching R's asy.lin.rep.ps. The extra /n made the PS correction O(1/n²) instead of O(1/n), under-scaling it relative to the leading IF terms.
Fix: remove /n from asy_lin_rep_psi computation at all 5 sites (panel survey IPW, panel survey DR, panel non-survey DR, RCS IPW, RCS DR). The single /n on the correction line is the legitimate psi→phi conversion.
Also updates REGISTRY.md notes and fixes stale survey comment.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
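The scaling argument in this commit rests on the identity inv(X'WX/n) = n * inv(X'WX). A minimal numerical sketch, with random stand-ins for the design matrix, weights, and per-observation scores (all assumptions for illustration), shows how the extra division drops the result by a full factor of n.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
W = rng.uniform(0.5, 1.5, size=n)
score = rng.normal(size=(n, 3))                 # stand-in for per-observation PS scores

XtWX = X.T @ (W[:, None] * X)
H_psi = XtWX / n                                # Hessian on the psi scale

asy_lin_rep_psi = score @ np.linalg.inv(H_psi)  # already carries the factor n
asy_double_div = asy_lin_rep_psi / n            # the removed extra division

unscaled = score @ np.linalg.inv(XtWX)          # H = X'WX formulation
```

Here `asy_lin_rep_psi` equals `n * unscaled`, so dividing it by n once more before the psi-to-phi conversion would apply the 1/n factor twice.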
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. I could not run the targeted tests in this environment because …
Executive Summary
Methodology
Affected methods: …
Code Quality
No findings.
Performance
No findings.
Maintainability
No findings.
Tech Debt
Security
No findings.
Documentation/Tests
Path to Approval
…ubsets
The M2 gradient terms in PS nuisance corrections used np.mean() over control subsets, introducing an extra 1/n_c divisor. R's DRDID computes M2 as colMeans() over the full n-sample (zeros for treated), then divides by mean(w.cont); the n's cancel, giving sum(w*resid*X)/sum(w). With our Hajek-normalized weights (w_norm = w/sum(w)), np.sum(w_norm*resid*X) directly yields sum(w*resid*X)/sum(w), matching R after cancellation. The single /n on the correction line remains as the psi-to-phi conversion.
Applied at all 5 PS correction sites (panel survey IPW/DR, panel non-survey DR, RCS IPW, RCS DR).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
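The cancellation argued in this commit can be verified on toy data. In the sketch below the treatment indicator, residuals, weights, and covariates are synthetic assumptions; it compares the full-sample colMeans route with the Hajek-normalized control-subset sum, and shows the extra 1/n_c introduced by np.mean().

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
D = rng.integers(0, 2, size=n)                  # 1 = treated, 0 = control
X = np.column_stack([np.ones(n), rng.normal(size=n)])
resid = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)

# Full-sample route: control weights are zero on treated rows.
w_cont = w * (1 - D)
m2_full = (w_cont[:, None] * resid[:, None] * X).mean(axis=0) / w_cont.mean()

# Control-subset route with Hajek-normalized weights.
c = D == 0
w_norm = w[c] / w[c].sum()
terms = w_norm[:, None] * resid[c][:, None] * X[c]
m2_hajek = terms.sum(axis=0)

# np.mean over the control subset divides by n_c one extra time.
m2_mean = terms.mean(axis=0)
n_c = int(c.sum())
```

Both routes reduce to sum(w*resid*X)/sum(w) over controls, while the np.mean() version is smaller by exactly the factor n_c.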
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. I could not run the targeted tests in this environment because …
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Flip PS nuisance correction sign in panel survey IPW and RCS IPW paths. R adds the correction to inf.control then subtracts (att = treat - control), so the net effect on ATT IF is subtraction. DR paths are unaffected because their M2 residual (m_control - control_change) already has opposite sign.
Also merges main (StaggeredTripleDifference from #245), resolves TODO.md conflict, and updates survey-roadmap.md to reflect Phase 7a implementation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. I could not run the test suite in this workspace because …
Executive Summary
Methodology
Re-review note: I rechecked the touched panel and RCS IPW/DR influence-function code against the DRDID source implementations; I do not see the prior IPW sign mismatch anymore.
Code Quality
No findings.
Performance
No findings.
Maintainability
No findings.
Tech Debt
Security
No findings.
Documentation/Tests
Path to Approval
…of mutating shared state
_compute_aggregated_se_with_wif() was mutating precomputed["df_survey"] as a side effect, causing "last mutation wins" when multiple aggregations (overall, event-study horizons, groups) each computed different effective df values from dropped replicates. Event-study/group inference and HonestDiD could read stale or incorrect df.
Fix: return effective_df alongside SE from _compute_aggregated_se_with_wif() instead of mutating shared state. Callers use per-statistic df for inference (minimum across horizons/groups for conservative batch inference). fit() uses the overall aggregation's effective_df.
Also reconciles status docs: ROADMAP.md and survey-roadmap.md now reflect that Phase 7a/7b/7d are implemented, with tutorial (7c) remaining open.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
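The "return instead of mutate" pattern described in this commit can be sketched in a few lines. The function `aggregated_se_with_df` and the rule `base_df - n_dropped` are hypothetical placeholders; the actual effective-df computation from dropped replicates is not shown in the message.

```python
import math

def aggregated_se_with_df(if_values, base_df, n_dropped):
    """Hypothetical stand-in: return (se, effective_df) without touching shared state."""
    se = math.sqrt(sum(v * v for v in if_values))
    effective_df = base_df - n_dropped          # placeholder rule, an assumption
    return se, effective_df

precomputed = {"df_survey": 30}                 # shared structure, only read here

results = {
    "overall": aggregated_se_with_df([0.01, -0.02, 0.015], precomputed["df_survey"], 0),
    "e=0": aggregated_se_with_df([0.02, -0.01, 0.01], precomputed["df_survey"], 2),
    "e=1": aggregated_se_with_df([0.03, 0.00, -0.02], precomputed["df_survey"], 5),
}

# Conservative batch inference: minimum df across event-study horizons.
batch_df = min(df for key, (_, df) in results.items() if key != "overall")
overall_df = results["overall"][1]
```

Because each call returns its own df, the order in which aggregations run no longer affects which value later inference reads.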
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good
Static review only; …
Executive Summary
Methodology
No findings. I cross-checked the new repeated-cross-section path against DRDID’s …
Code Quality
No findings.
Performance
No findings.
Maintainability
No findings.
Tech Debt
Security
No findings.
Documentation/Tests
…turn
_aggregate_simple() now returns (att, se, effective_df). Update the StaggeredTripleDifference caller to unpack the third element.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Remove the NotImplementedError gate for CallawaySantAnna IPW/DR + covariates + survey. Implement DRDID panel nuisance IF corrections (propensity score + outcome regression) for both survey-weighted and non-survey DR paths (Sant'Anna & Zhao 2020, Theorem 3.1). Extract _safe_inv() helper for matrix inversions.
- Add event_study_vcov field to CallawaySantAnnaResults and survey_metadata/df_survey to HonestDiDResults.
- Add panel=False for repeated cross-section support in CallawaySantAnna. New _precompute_structures_rc(), _compute_att_gt_rc(), and three RC estimation methods (_outcome_regression_rc, _ipw_estimation_rc, _doubly_robust_rc) with covariates and survey weights. Canonical index abstraction in aggregation/bootstrap mixins. RCS data generator via generate_staggered_data(panel=False).
Methodology references
Validation
- tests/test_survey_phase7a.py (22 tests): smoke, scale invariance, uniform-weight equivalence, IF correction, aggregation, bootstrap, edge cases
- tests/test_staggered_rc.py (23 tests): all methods, covariates, survey, aggregation, bootstrap, control groups, base periods, data generator, edge cases
- tests/test_honest_did.py (+4 tests): survey df extraction, VCV computation, bounds widening, no-survey baseline
- tests/test_survey_phase4.py: 2 negative tests converted to positive assertions
Security / privacy
Generated with Claude Code