Skip to content

Commit 176174c

Browse files
authored
Merge pull request #453 from igerber/wave-3-observability
Wave 3 estimator observability: HonestDiD M=0 test, Wooldridge canonical-link warning, ARP vertex diagnostic
2 parents b926a10 + 54efaea commit 176174c

3 files changed

Lines changed: 124 additions & 28 deletions

File tree

TODO.md

Lines changed: 2 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -94,13 +94,12 @@ Deferred items from PR reviews that were not addressed before merge.
9494
| Replicate weight tests use Fay-like BRR perturbations (0.5/1.5), not true half-sample BRR. Add true BRR regressions per estimator family. Existing `test_survey_phase6.py` covers true BRR at the helper level. | `tests/test_replicate_weight_expansion.py` | #253 | Low |
9595
| WooldridgeDiD: QMLE sandwich uses `aweight` cluster-robust adjustment `(G/(G-1))*(n-1)/(n-k)` vs Stata's `G/(G-1)` only. Conservative (inflates SEs). Add `qmle` weight type if Stata golden values confirm material difference. | `wooldridge.py`, `linalg.py` | #216 | Medium |
9696
| WooldridgeDiD: aggregation weights use cell-level n_{g,t} counts. Paper (W2025 Eqs. 7.2-7.4) defines cohort-share weights. Add optional `weights="cohort_share"` parameter to `aggregate()`. | `wooldridge_results.py` | #216 | Medium |
97-
| WooldridgeDiD: canonical link requirement (W2023 Prop 3.1) not enforced — no warning if user applies wrong method to outcome type. Estimator is consistent regardless, but equivalence with imputation breaks. | `wooldridge.py` | #216 | Low |
97+
| WooldridgeDiD: optional *efficiency hint* (NOT a canonical-link violation per W2023 Prop 3.1) when method/outcome pairing is sub-optimal — e.g., `method="ols"` on binary data is consistent under QMLE, but `method="logit"` is typically more efficient. The original framing in this row as a "canonical link requirement" tied to Prop 3.1 was incorrect: Wooldridge (2023) Table 1 lists Gaussian/OLS for "any response" and logistic-Bernoulli for "binary OR fractional". A useful hint exists (efficiency), but should not be framed as a methodology violation. See PR #453 R1 review for the corrected reading. | `wooldridge.py` | #216 | Low |
9898
| WooldridgeDiD: Stata `jwdid` golden value tests — add R/Stata reference script and `TestReferenceValues` class. | `tests/test_wooldridge.py` | #216 | Medium |
9999
| Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the 8 standalone estimators that expose `cluster=`: `CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`. Phase 1a added `vcov_type` to the `DifferenceInDifferences` inheritance chain only. | multiple | Phase 1a | Medium |
100100
| Weighted one-way Bell-McCaffrey (`vcov_type="hc2_bm"` + `weights`, no cluster) currently raises `NotImplementedError`. `_compute_bm_dof_from_contrasts` builds its hat matrix from the unscaled design via `X (X'WX)^{-1} X' W`, but `solve_ols` solves the WLS problem by transforming to `X* = sqrt(w) X`, so the correct symmetric idempotent residual-maker is `M* = I - sqrt(W) X (X'WX)^{-1} X' sqrt(W)`. Rederive the Satterthwaite `(tr G)^2 / tr(G^2)` ratio on the transformed design and add weighted parity tests before lifting the guard. | `linalg.py::_compute_bm_dof_from_contrasts`, `linalg.py::_validate_vcov_args` | Phase 1a | Medium |
101101
| HC2 / HC2 + Bell-McCaffrey on absorbed-FE fits currently raises `NotImplementedError` in three places: `TwoWayFixedEffects` unconditionally; `DifferenceInDifferences(absorb=..., vcov_type in {"hc2","hc2_bm"})`; `MultiPeriodDiD(absorb=..., vcov_type in {"hc2","hc2_bm"})`. Within-transformation preserves coefficients and residuals under FWL but not the hat matrix, so the reduced-design `h_ii` is not the diagonal of the full FE projection and CR2's block adjustment `A_g = (I - H_gg)^{-1/2}` is likewise wrong on absorbed cluster blocks. Lifting the guard needs HC2/CR2-BM computed from the full absorbed projection (unit/time FE dummies reconstructed internally, or a FE-aware hat-matrix formulation) and a parity harness against a full-dummy OLS run or R `fixest`/`clubSandwich`. HC1/CR1 are unaffected by this because they have no leverage term. | `twfe.py::fit`, `estimators.py::DifferenceInDifferences.fit`, `estimators.py::MultiPeriodDiD.fit` | Phase 1a | Medium |
102102
| Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
103-
| `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low |
104103
| Unify Rust local-method `estimate_model` solver path to `solve_wls_svd` (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (`tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local`) passes at `atol=1e-5` — the residual ~1e-7 gap is roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq`, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (`atol=1e-14` for `lambda_nn=inf`, `atol=1e-10` for finite `lambda_nn` — see `test_local_method_main_fit_parity`); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug. | `rust/src/trop.rs::estimate_model`, `rust/src/linalg.rs::solve_wls_svd` | follow-up | Low |
105104
| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
106105
| `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
@@ -158,7 +157,6 @@ Deferred items from PR reviews that were not addressed before merge.
158157
| CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path | `tests/test_methodology_callaway.py` | #202 | Low |
159158
| Doc-snippet smoke tests only cover `.rst` files; `.txt` AI guides outside CI validation | `tests/test_doc_snippets.py` | #239 | Low |
160159
| Add CI validation for `docs/doc-deps.yaml` integrity (stale paths, unmapped source files) | `docs/doc-deps.yaml` | #269 | Low |
161-
| HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` || Low |
162160
| SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release. | `synthetic_did.py`, `results.py` | follow-up | Medium |
163161
| AI review CI: pin workflow contract via test (uses `openai/codex-action@v1`, passes `prompt-file`, reads `steps.run_codex.outputs.final-message`, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #416 | Low |
164162
| `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #436 | Low |
@@ -172,12 +170,10 @@ Ordered paydown view across the tables above. Tier A → D is by effort × risk,
172170

173171
#### Tier A — Quick wins (≤1 day, ≤3 CI rounds expected)
174172

175-
- HonestDiD `test_m0_short_circuit`: replace wall-clock `elapsed < 0.5s` proxy with a state flag (`tests/test_methodology_honest_did.py:246`)
176173
- EfficientDiD `control_group="last_cohort"` REGISTRY-vs-code alignment with `anticipation>0` (`efficient_did.py`, one design decision)
177174
- TripleDifference: add `generate_ddd_panel_data` for panel DDD power analysis (`prep_dgp.py`, `power.py`)
178175
- TROP: extract shared data-setup helper between `fit()` and `_fit_global()` (~150 LoC dedup; `trop.py`, `trop_global.py`, `trop_local.py`)
179-
- WooldridgeDiD: emit warning when canonical-link is violated (`wooldridge.py`; W2023 Prop 3.1)
180-
- HonestDiD vertex-rejection diagnostic counter on ARP enumeration exhaust (`honest_did.py:1907`)
176+
- WooldridgeDiD: optional efficiency hint when method/outcome pairing is sub-optimal (NOT a canonical-link violation per W2023 Prop 3.1 — see Methodology/Correctness row for the corrected framing)
181177

182178
(SyntheticDiD `placebo_effects``variance_effects` rename moved to Tier B — the user-facing field rename + one-release deprecation alias is too large for ≤1 day / ≤3 CI rounds.)
183179

diff_diff/honest_did.py

Lines changed: 36 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@
1818
"""
1919

2020
from dataclasses import dataclass, field
21+
import warnings
2122
from typing import Any, Dict, List, Literal, Optional, Tuple, Union
2223

2324
import numpy as np
@@ -1890,11 +1891,15 @@ def _enumerate_vertices(
18901891
if n_eq > n_moments:
18911892
return []
18921893

1893-
vertices = []
1894+
vertices: List[np.ndarray] = []
1895+
n_total = 0
1896+
n_linalg_error = 0
1897+
n_infeasible = 0
18941898

18951899
# Each vertex has exactly n_eq non-zero (basic) variables
18961900
for basis_idx in itertools.combinations(range(n_moments), n_eq):
18971901
basis_idx = list(basis_idx)
1902+
n_total += 1
18981903

18991904
# Build the system for basic variables
19001905
# gamma[basis_idx]' @ X_tilde[basis_idx, :] = 0
@@ -1913,13 +1918,43 @@ def _enumerate_vertices(
19131918
try:
19141919
gamma_basic = np.linalg.solve(A_sys, b_sys)
19151920
except np.linalg.LinAlgError:
1921+
n_linalg_error += 1
19161922
continue
19171923

19181924
# Check feasibility: gamma >= 0
19191925
if np.all(gamma_basic >= -1e-10):
19201926
gamma = np.zeros(n_moments)
19211927
gamma[basis_idx] = np.maximum(gamma_basic, 0)
19221928
vertices.append(gamma)
1929+
else:
1930+
n_infeasible += 1
1931+
1932+
# Diagnostic warnings — surface vertex-search pathologies that would
1933+
# otherwise hide behind `_compute_arp_test` returning False
1934+
# (conservative non-rejection).
1935+
if n_total > 0 and len(vertices) == 0:
1936+
warnings.warn(
1937+
f"ARP vertex enumeration exhausted without feasible vertices: "
1938+
f"tried {n_total} bases, {n_linalg_error} rejected for "
1939+
f"LinAlgError, {n_infeasible} infeasible (negative basic "
1940+
f"variables). The caller (_compute_arp_test) will return False "
1941+
f"(conservative non-rejection). This may indicate near-singular "
1942+
f"nuisance constraints (X_tilde) or a degenerate "
1943+
f"moment-inequality system.",
1944+
RuntimeWarning,
1945+
stacklevel=3,
1946+
)
1947+
elif n_total > 0 and n_linalg_error / n_total >= 0.5:
1948+
warnings.warn(
1949+
f"ARP vertex enumeration heavily constrained: "
1950+
f"{n_linalg_error} of {n_total} bases ({100 * n_linalg_error / n_total:.0f}%) "
1951+
f"rejected for LinAlgError, {n_infeasible} infeasible. "
1952+
f"{len(vertices)} feasible vertex(es) recovered. Results may be "
1953+
f"numerically fragile; consider regularizing the moment-inequality "
1954+
f"system or reviewing the nuisance constraints (X_tilde).",
1955+
RuntimeWarning,
1956+
stacklevel=3,
1957+
)
19231958

19241959
return vertices
19251960

tests/test_methodology_honest_did.py

Lines changed: 86 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,8 @@
55
equations, known analytical cases, and expected mathematical properties.
66
"""
77

8-
import os
8+
9+
import warnings
910

1011
import numpy as np
1112
import pytest
@@ -245,34 +246,37 @@ def test_optimal_flci_is_finite_and_valid(self):
245246
assert ci_lb_opt <= lb, "CI lower should be <= identified set lower"
246247
assert ci_ub_opt >= ub, "CI upper should be >= identified set upper"
247248

248-
@pytest.mark.skipif(
249-
os.environ.get("CI") == "true",
250-
reason="wall-clock timing is flaky on shared CI runners; short-circuit "
251-
"correctness signal will be replaced with a mock/spy per TODO.md "
252-
"(see PR #330 follow-up note)",
253-
)
254249
def test_m0_short_circuit(self):
255-
"""M=0 should use standard CI without optimization.
256-
257-
Uses wall-clock elapsed time as a proxy for "short-circuit path
258-
taken" — fast path is ``<0.5s``, slow optimization would be ``>>
259-
0.5s``. Skipped on CI because neighbor-VM contention on shared
260-
runners can push even the short-circuit path past the threshold.
261-
Run locally to validate the fast-path invariant; the TODO.md entry
262-
added by PR #330 tracks replacing this with a mock/spy so the
263-
correctness signal becomes CI-safe.
250+
"""M=0 takes the bias=0 fast path and never invokes the LP solver.
251+
252+
``_compute_worst_case_bias`` returns ``0.0`` immediately when ``M=0``
253+
(diff_diff/honest_did.py:1650), so ``scipy.optimize.linprog`` is
254+
never reached. Patching the LP solver and asserting ``call_count
255+
== 0`` is a direct correctness signal — CI-safe (no wall-clock
256+
dependency) and faster than the prior timing-based proxy.
264257
"""
258+
from unittest.mock import patch
259+
265260
beta_pre = np.array([0.3, 0.2, 0.1])
266261
beta_post = np.array([2.0])
267262
sigma = np.eye(4) * 0.01
268263
l_vec = np.array([1.0])
269264

270-
import time
271-
t0 = time.time()
272-
_compute_optimal_flci(beta_pre, beta_post, sigma, l_vec, 3, 1, M=0.0)
273-
elapsed = time.time() - t0
265+
with patch("diff_diff.honest_did.optimize.linprog") as mock_linprog:
266+
ci_lb, ci_ub = _compute_optimal_flci(
267+
beta_pre, beta_post, sigma, l_vec, 3, 1, M=0.0
268+
)
274269

275-
assert elapsed < 0.5, f"M=0 should be fast, took {elapsed:.2f}s"
270+
assert mock_linprog.call_count == 0, (
271+
f"M=0 must skip the LP solver (fast path at "
272+
f"_compute_worst_case_bias:1650); got "
273+
f"{mock_linprog.call_count} linprog call(s)."
274+
)
275+
# End-to-end correctness: M=0 CI is still well-defined.
276+
assert np.isfinite(ci_lb) and np.isfinite(ci_ub), (
277+
f"M=0 CI must be finite; got [{ci_lb}, {ci_ub}]"
278+
)
279+
assert ci_lb <= ci_ub, f"M=0 CI must be ordered; got [{ci_lb}, {ci_ub}]"
276280

277281
def test_smoothness_flci_with_survey_df(self):
278282
"""Survey df should widen the smoothness FLCI (folded t vs folded normal)."""
@@ -493,3 +497,64 @@ def test_breakdown_monotonicity(self):
493497
# The optimal FLCI is efficient, so need large M for a weak effect.
494498
r_large = honest.fit(results, M=20.0)
495499
assert r_large.ci_lb <= 0 <= r_large.ci_ub, "Should lose significance at large M"
500+
501+
502+
class TestARPVertexEnumeration:
503+
"""Diagnostic warnings on `_enumerate_vertices` vertex-search pathologies."""
504+
505+
def test_enumerate_vertices_warns_on_exhausted_search(self):
506+
"""All-LinAlgError path: fully-zero nuisance column makes A_sys
507+
singular on every basis, so the enumeration exhausts without
508+
feasible vertices and the user should see a RuntimeWarning rather
509+
than a silent empty-list return."""
510+
from diff_diff.honest_did import _enumerate_vertices
511+
512+
# 4 moments, 1 nuisance column (all zeros) → A_sys singular on every basis
513+
X_tilde = np.zeros((4, 1))
514+
sigma_tilde_diag = np.array([1.0, 1.0, 1.0, 1.0])
515+
with pytest.warns(RuntimeWarning, match="exhausted"):
516+
vertices = _enumerate_vertices(X_tilde, sigma_tilde_diag, n_moments=4)
517+
assert vertices == []
518+
519+
def test_enumerate_vertices_warns_on_heavy_rejection(self):
520+
"""Mixed-basis path: 5 moments, 1 nuisance column. C(5, 2) = 10
521+
bases. By design, 6 bases hit LinAlgError (the singular pairs
522+
among indices {0,1,2,3} that share aligned nuisance/sigma values)
523+
and 4 bases produce feasible vertices (the (i, 4) pairs that pair
524+
a positive-X_tilde row with the unique negative-X_tilde row at
525+
index 4). 60% rejection rate trips the `heavily constrained`
526+
branch specifically, not the exhausted branch."""
527+
from diff_diff.honest_did import _enumerate_vertices
528+
529+
X_tilde = np.array([[1.0], [1.0], [1.0], [2.0], [-1.0]])
530+
sigma_tilde_diag = np.array([1.0, 1.0, 1.0, 2.0, 1.0])
531+
with pytest.warns(RuntimeWarning, match="heavily constrained"):
532+
vertices = _enumerate_vertices(X_tilde, sigma_tilde_diag, n_moments=5)
533+
assert len(vertices) >= 1, (
534+
f"Heavy-rejection construction must still produce some feasible "
535+
f"vertices (otherwise the exhausted branch fires); got "
536+
f"{len(vertices)} vertices."
537+
)
538+
539+
def test_enumerate_vertices_quiet_on_healthy_enumeration(self):
540+
"""Well-conditioned X_tilde: most bases solve cleanly and feasible
541+
vertices are recovered. No RuntimeWarning should fire."""
542+
from diff_diff.honest_did import _enumerate_vertices
543+
544+
rng = np.random.default_rng(0)
545+
# 4 moments, 1 nuisance — small and well-conditioned
546+
X_tilde = rng.normal(size=(4, 1))
547+
sigma_tilde_diag = np.array([1.0, 1.0, 1.0, 1.0])
548+
with warnings.catch_warnings(record=True) as caught:
549+
warnings.simplefilter("always", RuntimeWarning)
550+
vertices = _enumerate_vertices(X_tilde, sigma_tilde_diag, n_moments=4)
551+
diag_warnings = [
552+
w for w in caught
553+
if "exhausted" in str(w.message) or "heavily constrained" in str(w.message)
554+
]
555+
assert not diag_warnings, (
556+
f"Healthy enumeration must not emit ARP diagnostics; got "
557+
f"{[str(w.message) for w in diag_warnings]}"
558+
)
559+
# Sanity: we expect some feasible vertices on a well-conditioned input
560+
assert isinstance(vertices, list)

0 commit comments

Comments
 (0)