
Commit 986612b

igerber and claude committed
Codex Wave A R1: fix sparse haversine cutoff > πR and DiD absorb-time leak
Address Wave A codex review round 1 findings.

P0 #1 — Sparse haversine wrong at cutoff > π·R_earth

- `_compute_spatial_bartlett_meat_sparse` computed
  `chord_radius = 2·sin(cutoff/(2R))` directly. That function is monotone
  only on [0, π·R]; for cutoff > π·R (~20,015 km, half the Earth's
  circumference) it shrinks again, and the kd-tree silently drops pairs
  that still have positive Bartlett weight.
- The dense path saturates at π·R via `_haversine_km`'s `np.clip(a, 0, 1)`;
  the sparse path now mirrors that by clamping the arc to π radians before
  the chord computation:

      arc_radians  = min(cutoff / R_earth, π)
      chord_radius = 2 · sin(arc_radians / 2)

  At cutoff >= π·R, chord_radius reaches 2 (the sphere diameter), so the
  kd-tree captures all pairs.

P0 #2 — DiD + Conley + absorb leaks demeaned time into the helper

- `DifferenceInDifferences.fit()` built `_conley_time_arr`,
  `_conley_unit_arr`, and `_conley_coords_arr` from `working_data`, but
  `absorb` demeans columns in `working_data` (including `time` when `time`
  is listed in `absorb`, or whenever `time` appears in `vars_to_demean`).
  The Conley helper would then partition the within-period spatial sandwich
  on residualized floats instead of the true pre/post periods.
- Now reads from the original `data` frame, mirroring
  `TwoWayFixedEffects.fit`, which has the same FWL-composability contract:
  the meat is computed on demeaned scores but the kernel grid uses raw
  coords + time/unit.

P2 — Regression coverage

- `test_sparse_haversine_cutoff_above_half_earth_circumference`: forces
  the sparse path with cutoff = 25,000 km (> π·R_earth ≈ 20,015 km);
  asserts dense/sparse parity at atol=1e-10.
- `test_sparse_haversine_cutoff_at_exactly_half_earth_circumference`:
  boundary test at cutoff = π·R_earth exactly.
- `test_did_conley_with_absorb_uses_raw_time_labels`: monkeypatches
  `_compute_conley_vcov` to capture the `time` arg and asserts the helper
  sees raw integer 0/1 labels (not absorb-demeaned floats).
P3 — Stale docstrings reconciled

- `diff_diff/estimators.py` DiD class docstring `vcov_type` entry:
  "rejected at fit-time" → Wave A support description.
- `diff_diff/estimators.py` MultiPeriodDiD class docstring `vcov_type`
  entry: dropped the "cluster= raises" restriction (Wave A #119 lifts it
  via the combined kernel).
- `diff_diff/linalg.py` `compute_robust_vcov` docstring: `vcov_type`
  entry updated for the combined spatial + cluster kernel.
- `diff_diff/linalg.py` `LinearRegression.__init__` docstring: Conley
  contract updated for the combined cluster kernel + DiD support.

Full Wave A regression: 514 passed (up from 511 with the 3 new tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ecbb93b commit 986612b

4 files changed

Lines changed: 196 additions & 40 deletions


diff_diff/conley.py

Lines changed: 13 additions & 5 deletions
@@ -298,11 +298,19 @@ def _compute_spatial_bartlett_meat_sparse(
             np.sin(lat_r),
         ]
     )
-    # Chord on the unit sphere matching arc-length cutoff_km.
-    # Multiply by (1 + 1e-12) to absorb chord-projection float roundoff
-    # so that pairs whose arc-distance is just under cutoff aren't
-    # excluded by a chord-roundoff false-negative in the kd-tree query.
-    chord_radius = 2.0 * np.sin(cutoff / (2.0 * _CONLEY_EARTH_RADIUS_KM))
+    # Chord on the unit sphere matching arc-length cutoff_km. Clamp the
+    # arc to π radians (half-Earth circumference) before computing the
+    # chord: the function `arc -> 2 sin(arc/2)` is monotonically
+    # increasing only on [0, π] and reaches its global max of 2 (the
+    # unit-sphere diameter) at arc = π. Without the clamp, cutoffs
+    # above π·R (~20,015 km) would compute a SMALLER chord radius than
+    # at cutoff = π·R and silently drop antipodal pairs that still have
+    # positive Bartlett weight. The dense path saturates at π·R via
+    # `_haversine_km`'s `np.clip(a, 0, 1)` (arcsin caps at π/2), so the
+    # clamp mirrors that saturation. The 1+1e-12 epsilon then absorbs
+    # chord-projection float roundoff at sub-cutoff distances.
+    arc_radians = min(cutoff / _CONLEY_EARTH_RADIUS_KM, np.pi)
+    chord_radius = 2.0 * np.sin(arc_radians / 2.0)
     chord_radius *= 1.0 + 1e-12
     tree = cKDTree(xyz)
     neighbors = tree.query_ball_tree(tree, r=chord_radius, p=2.0)
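The clamp can be sanity-checked numerically outside the library. This standalone sketch reimplements both chord formulas; `R_EARTH = 6371.0` is the conventional mean radius assumed here and may differ slightly from the module's `_CONLEY_EARTH_RADIUS_KM`:

```python
import numpy as np

R_EARTH = 6371.0  # km; assumed mean radius, may differ from the module constant

def chord_radius_unclamped(cutoff_km: float) -> float:
    # Original formula: unit-sphere chord for an arc of cutoff_km.
    # Non-monotone once the arc exceeds pi radians.
    return 2.0 * np.sin(cutoff_km / (2.0 * R_EARTH))

def chord_radius_clamped(cutoff_km: float) -> float:
    # Fixed formula: clamp the arc to pi first, so the chord saturates
    # at 2.0 (the unit-sphere diameter) and the kd-tree keeps all pairs.
    arc = min(cutoff_km / R_EARTH, np.pi)
    return 2.0 * np.sin(arc / 2.0)

half_circ = np.pi * R_EARTH  # ~20,015 km
# Unclamped chord SHRINKS past half the circumference -> dropped pairs.
print(chord_radius_unclamped(25_000.0) < chord_radius_unclamped(half_circ))  # True
# Clamped chord saturates at the sphere diameter.
print(chord_radius_clamped(25_000.0))  # 2.0
```

Below π·R the two formulas agree exactly, so the clamp is behavior-preserving for ordinary cutoffs.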

diff_diff/estimators.py

Lines changed: 37 additions & 25 deletions
@@ -69,15 +69,18 @@ class DifferenceInDifferences:
         with ``cluster=``, Pustejovsky-Tipton (2018) CR2 cluster-robust.
         (Note: ``MultiPeriodDiD`` does NOT yet support ``cluster=`` with
         ``"hc2_bm"`` — see ``MultiPeriodDiD`` docstring and REGISTRY.md.)
-        - ``"conley"``: Conley 1999 spatial-HAC sandwich. **Accepted by the
-          constructor for sklearn-style API symmetry but rejected at
-          fit-time on ``DifferenceInDifferences``** because DiD is
-          intrinsically a two-period panel design, and Phase 1's cross-
-          sectional Conley does not handle the time dimension. The
-          supported Phase 1 path for Conley is direct
-          ``compute_robust_vcov`` / ``LinearRegression`` on a single-period
-          regression. Phase 2 will add the space-time product kernel
-          (Driscoll-Kraay) and lift the rejection.
+        - ``"conley"``: Conley 1999 spatial-HAC sandwich. Pass
+          ``conley_coords=(lat_col, lon_col)``, ``conley_cutoff_km=<float>``,
+          and ``conley_lag_cutoff=<int>`` on the constructor; pass
+          ``unit=<col>`` as a fit-time kwarg to :meth:`fit` (NOT on
+          ``__init__``; unused unless Conley is set; not part of
+          ``get_params()`` / ``set_params()``). The block-decomposed panel
+          sandwich (matches R ``conleyreg`` with ``lag_cutoff > 0``) sums
+          within-period spatial pairs plus within-unit Bartlett serial
+          pairs (lag=0 excluded). Explicit ``cluster=<col>`` enables the
+          combined spatial + cluster product kernel; ``survey_design=``
+          and ``inference='wild_bootstrap'`` both raise
+          ``NotImplementedError``.
     alpha : float, default=0.05
         Significance level for confidence intervals.
     inference : str, default="analytical"
@@ -514,24 +517,30 @@ def fit(
         # backward compatibility (see `_resolve_effective_vcov_type`).
         _fit_vcov_type = self._resolve_effective_vcov_type(effective_cluster_ids)

-        # Build Conley coord/time/unit arrays when applicable. The same
-        # `working_data` (post-absorb demean if absorb was used) supplies
-        # the rows; the validator above already enforced that `unit` is
-        # set and conley_coords/conley_cutoff_km/conley_lag_cutoff are
-        # consistent. The time column for the panel block-decomposed form
-        # is the `time` argument passed to fit() (i.e. the post-treatment
-        # period indicator, which is integer 0/1 for the standard DiD
-        # design — `_compute_conley_vcov` normalizes to dense codes
-        # internally so non-numeric labels still work).
+        # Build Conley coord/time/unit arrays when applicable. CRITICAL:
+        # read from the ORIGINAL `data` frame, NOT `working_data` — `absorb`
+        # demeans `time` (and any column listed in `absorb`) in working_data,
+        # so reading `working_data[time]` would silently partition the
+        # within-period spatial sandwich on residualized floats instead of
+        # the true pre/post periods (Codex Wave A R1 P0). Coords are likewise
+        # read from raw `data` for symmetry with TwoWayFixedEffects
+        # (`twfe.py::TwoWayFixedEffects.fit`) which has the same FWL-
+        # composability contract: the meat is computed on demeaned scores
+        # but the kernel grid uses the original space (coords) and time/unit
+        # indexing. `_compute_conley_vcov` normalizes time labels to dense
+        # codes 0..T-1 internally, so non-numeric `time` labels (datetime64,
+        # pd.Period, strings) still work on the MultiPeriodDiD path; DiD's
+        # binary `time` column is integer 0/1 by convention and is unaffected
+        # by the normalization.
         if _fit_vcov_type == "conley":
             _conley_coords_arr: Optional[np.ndarray] = np.column_stack(
                 [
-                    working_data[self.conley_coords[0]].values.astype(np.float64),
-                    working_data[self.conley_coords[1]].values.astype(np.float64),
+                    data[self.conley_coords[0]].values.astype(np.float64),
+                    data[self.conley_coords[1]].values.astype(np.float64),
                 ]
             )
-            _conley_time_arr: Optional[np.ndarray] = np.asarray(working_data[time].values)
-            _conley_unit_arr: Optional[np.ndarray] = working_data[unit].values
+            _conley_time_arr: Optional[np.ndarray] = np.asarray(data[time].values)
+            _conley_unit_arr: Optional[np.ndarray] = data[unit].values
         else:
             _conley_coords_arr = None
             _conley_time_arr = None
@@ -1123,9 +1132,12 @@ class MultiPeriodDiD(DifferenceInDifferences):
         auto-derived from the ``time`` column at fit-time and normalized
         to dense panel-period codes ``0..T-1`` so ``conley_lag_cutoff``
         always counts panel periods (works for int / datetime64 /
-        ``pd.Period`` / string encodings). Restrictions: ``cluster=``,
-        ``survey_design=``, and ``inference="wild_bootstrap"`` raise on
-        this path (Phase 5 / follow-up).
+        ``pd.Period`` / string encodings). Explicit ``cluster=<col>``
+        enables the combined spatial + cluster product kernel
+        (Wave A #119; cluster must be constant within each unit across
+        periods). Restrictions: ``survey_design=`` and
+        ``inference="wild_bootstrap"`` raise on this path
+        (Phase 5 / follow-up).
     alpha : float, default=0.05
         Significance level for confidence intervals.
     conley_coords, conley_cutoff_km, conley_metric, conley_kernel, conley_lag_cutoff
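The leak this hunk fixes is easy to reproduce in isolation: within-unit demeaning of a binary 0/1 `time` column (what absorbing a unit fixed effect does to every column) turns the two period labels into unit-specific floats. A minimal sketch with made-up data; the column names and the tiny unbalanced panel are illustrative only:

```python
import numpy as np
import pandas as pd

# Tiny unbalanced 2-period panel: units 0 and 1 observed pre and post,
# unit 2 observed only pre. time is the 0/1 post-treatment indicator.
df = pd.DataFrame({
    "unit": [0, 0, 1, 1, 2],
    "time": [0, 1, 0, 1, 0],
})

# What absorbing the unit fixed effect does to the time column:
# subtract each unit's own mean of time.
demeaned = df["time"] - df.groupby("unit")["time"].transform("mean")

print(np.unique(df["time"].values))  # [0 1]  -> the true pre/post partition
print(np.unique(demeaned.values))    # [-0.5  0.   0.5] -> residualized floats
```

Partitioning the within-period spatial sandwich on the demeaned column would create three "periods" here instead of two, which is exactly what the fix (reading `time` from the raw `data` frame) prevents.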

diff_diff/linalg.py

Lines changed: 16 additions & 10 deletions
@@ -1290,9 +1290,13 @@ def compute_robust_vcov(
     bandwidth). Two operating modes: cross-sectional (default) and panel
     block-decomposed (pass the three co-required kwargs ``conley_time`` /
     ``conley_unit`` / ``conley_lag_cutoff``, matching R ``conleyreg`` with
-    ``lag_cutoff > 0``). Combining with ``cluster_ids`` or ``weights``
-    raises ``NotImplementedError`` (combined spatial-cluster kernel and
-    Bertanha-Imbens weighted-Conley deferred to follow-up PRs).
+    ``lag_cutoff > 0``). Combining with ``cluster_ids`` applies the
+    combined spatial + cluster product kernel
+    ``K_total[i, j] = K_space(d_ij/h) · 1{c_i = c_j}`` (Wave A #119; on
+    the panel path the cluster must be constant within each unit across
+    periods). Combining with ``weights`` still raises
+    ``NotImplementedError`` (Bertanha-Imbens 2014 weighted-Conley
+    deferred to a follow-up PR).

     Parameters
     ----------
@@ -2485,13 +2489,15 @@ class LinearRegression:
         pass the three co-required kwargs ``conley_time`` / ``conley_unit`` /
         ``conley_lag_cutoff``). Requires ``conley_coords`` (n × 2 array) and
         a positive ``conley_cutoff_km``. Combining ``vcov_type="conley"``
-        with ``cluster_ids``, ``weights``, or ``survey_design`` raises
-        ``NotImplementedError`` (combined spatial-cluster kernel +
-        Bertanha-Imbens weighted-Conley deferred to follow-up PRs). The
-        panel DiD estimator rejects at fit-time (DiD.fit() has no unit
-        column declaration; use MultiPeriodDiD or TwoWayFixedEffects, which
-        thread their `unit`/`time` columns into the block-decomposed
-        sandwich via ``conley_lag_cutoff`` on the constructor).
+        with ``cluster_ids`` applies the combined spatial + cluster product
+        kernel (Wave A #119; cluster must be constant within each unit on
+        the panel path). Combining with ``weights`` / ``survey_design``
+        still raises ``NotImplementedError`` (Bertanha-Imbens 2014
+        weighted-Conley deferred to a follow-up PR). The DiD / MPD / TWFE
+        estimators all support panel Conley by passing ``unit`` at fit-time
+        (DiD as a fit-time kwarg; MPD/TWFE via the existing ``unit``
+        argument), threading ``conley_time`` / ``conley_unit`` into the
+        block-decomposed sandwich.
     conley_coords : ndarray of shape (n, 2), optional
         Required when ``vcov_type="conley"``. Two-column array of
         ``[lat, lon]`` (degrees, for ``conley_metric="haversine"``) or
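The product-kernel formula in the docstring, `K_total[i, j] = K_space(d_ij/h) · 1{c_i = c_j}`, can be sketched directly. This is an illustrative standalone implementation, not library API; it assumes a Bartlett spatial kernel `K_space(u) = max(0, 1 - u)` and made-up distances and cluster labels:

```python
import numpy as np

def product_kernel(dist_km: np.ndarray, clusters: np.ndarray,
                   cutoff_km: float) -> np.ndarray:
    """K_total[i, j] = K_space(d_ij / h) * 1{c_i == c_j} with a Bartlett
    spatial kernel K_space(u) = max(0, 1 - u) and bandwidth h = cutoff_km."""
    k_space = np.clip(1.0 - dist_km / cutoff_km, 0.0, None)
    same_cluster = (clusters[:, None] == clusters[None, :]).astype(float)
    return k_space * same_cluster

# 3 observations: 0 and 1 are 100 km apart and share a cluster;
# observation 2 is far away and in a different cluster.
dist = np.array([[0.0,    100.0,  5000.0],
                 [100.0,  0.0,    5000.0],
                 [5000.0, 5000.0, 0.0]])
clusters = np.array([0, 0, 1])

K = product_kernel(dist, clusters, cutoff_km=1000.0)
print(K[0, 1])  # 0.9: close pair, same cluster -> Bartlett weight 1 - 100/1000
print(K[0, 2])  # 0.0: cross-cluster (and beyond the cutoff anyway)
```

The cluster indicator zeroes every cross-cluster pair, so the combined kernel is at least as sparse as the pure spatial one.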

tests/test_conley_vcov.py

Lines changed: 130 additions & 0 deletions
@@ -1124,6 +1124,54 @@ def test_did_conley_matches_mpd_post_periods_1(self, two_period_panel):
         np.testing.assert_allclose(res_did.att, res_mpd.att, atol=1e-10)
         np.testing.assert_allclose(res_did.se, res_mpd.se, atol=1e-10)

+    def test_did_conley_with_absorb_uses_raw_time_labels(self, two_period_panel, monkeypatch):
+        """DiD + Conley + absorb=[<unit>] must feed the Conley helper the
+        ORIGINAL time/unit/coord columns from `data`, not the absorb-demeaned
+        `working_data` (in which time has been residualized to floats).
+        Otherwise the within-period spatial sandwich silently partitions on
+        per-unit demeaned floats instead of the true pre/post periods.
+        Codex Wave A R1 P0 #2.
+        """
+        import diff_diff.linalg as linalg_module
+        from diff_diff import DifferenceInDifferences
+
+        df = two_period_panel.copy()
+        captured: dict = {"time_arg": None, "unit_arg": None}
+        orig = linalg_module._compute_conley_vcov
+
+        def _spy(*args, **kwargs):
+            captured["time_arg"] = kwargs.get("time")
+            captured["unit_arg"] = kwargs.get("unit")
+            return orig(*args, **kwargs)
+
+        monkeypatch.setattr(linalg_module, "_compute_conley_vcov", _spy)
+        DifferenceInDifferences(
+            vcov_type="conley",
+            conley_coords=("lat", "lon"),
+            conley_cutoff_km=2000.0,
+            conley_lag_cutoff=1,
+        ).fit(
+            df,
+            outcome="y",
+            treatment="treated",
+            time="time",
+            unit="unit",
+            absorb=["unit"],
+        )
+        assert captured["time_arg"] is not None
+        # Raw labels are integer 0/1 (the binary post-treatment indicator);
+        # demeaned values would be floats from absorb's within-unit
+        # demeaning. np.unique on raw labels yields exactly 2 distinct
+        # values; on demeaned floats it would yield ~n_units distinct.
+        time_arg = np.asarray(captured["time_arg"])
+        uniques = np.unique(time_arg)
+        assert len(uniques) == 2, (
+            f"Expected 2 unique time labels (raw 0/1), got {len(uniques)}: "
+            f"{uniques[:5]} — absorb is leaking demeaned time into the "
+            "Conley helper."
+        )
+        assert set(uniques.tolist()) == {0, 1}, f"Expected raw integer labels 0/1, got {uniques}"
+
     def test_multi_period_did_with_conley_panel(self):
         """Phase 2 MultiPeriodDiD + vcov_type='conley' uses the block-decomposed
         sandwich (matches R conleyreg). Verifies that finite SEs are produced
@@ -2631,6 +2679,88 @@ def test_helper_direct_matches_dense_meat_haversine(self):
         meat_sparse = _compute_spatial_bartlett_meat_sparse(S, coords, cutoff, "haversine")
         np.testing.assert_allclose(meat_sparse, meat_dense, atol=1e-10, rtol=1e-10)

+    def test_sparse_haversine_cutoff_above_half_earth_circumference(self):
+        """Sparse haversine path with conley_cutoff_km > π·R_earth (~20,015 km)
+        must include all geometrically eligible pairs. Without the arc-radians
+        clamp, the chord-radius formula 2·sin(arc/2) shrinks for arc > π and
+        the kd-tree silently drops pairs that still have positive Bartlett
+        weight. The dense path saturates at π·R via _haversine_km's clip;
+        the sparse path matches via the clamp. Codex Wave A R1 P0 #1.
+        """
+        rng = np.random.default_rng(seed=101)
+        n = 200
+        lats = rng.uniform(-90.0, 90.0, size=n)
+        lons = rng.uniform(-180.0, 180.0, size=n)
+        coords = np.column_stack([lats, lons])
+        X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
+        y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n) * 0.4
+        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
+        residuals = y - X @ coefs
+        bread = X.T @ X
+        # Cutoff well above half-Earth circumference (~20,015 km). Without
+        # the clamp, the sparse path drops antipodal pairs and the meat
+        # diverges from the dense path.
+        cutoff_km = 25_000.0  # > π·R_earth ≈ 20015 km
+        V_dense = _compute_conley_vcov(
+            X,
+            residuals,
+            coords,
+            cutoff_km,
+            "haversine",
+            "bartlett",
+            bread,
+            _conley_sparse=False,
+        )
+        V_sparse = _compute_conley_vcov(
+            X,
+            residuals,
+            coords,
+            cutoff_km,
+            "haversine",
+            "bartlett",
+            bread,
+            _conley_sparse=True,
+        )
+        np.testing.assert_allclose(V_sparse, V_dense, atol=1e-10, rtol=1e-10)
+
+    def test_sparse_haversine_cutoff_at_exactly_half_earth_circumference(self):
+        """Cutoff = π·R_earth: chord radius = 2 (sphere diameter); all
+        pairs are included. Bartlett at u=1 returns 0, so the antipodal
+        pair contributes zero — but pairs at all other distances
+        contribute. Sparse and dense paths must agree."""
+        rng = np.random.default_rng(seed=103)
+        n = 150
+        lats = rng.uniform(-90.0, 90.0, size=n)
+        lons = rng.uniform(-180.0, 180.0, size=n)
+        coords = np.column_stack([lats, lons])
+        X = np.column_stack([np.ones(n), rng.standard_normal(n)])
+        y = X @ np.array([1.0, 1.5]) + rng.standard_normal(n) * 0.5
+        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
+        residuals = y - X @ coefs
+        bread = X.T @ X
+        cutoff_km = float(np.pi * _CONLEY_EARTH_RADIUS_KM)  # ≈ 20015.16 km
+        V_dense = _compute_conley_vcov(
+            X,
+            residuals,
+            coords,
+            cutoff_km,
+            "haversine",
+            "bartlett",
+            bread,
+            _conley_sparse=False,
+        )
+        V_sparse = _compute_conley_vcov(
+            X,
+            residuals,
+            coords,
+            cutoff_km,
+            "haversine",
+            "bartlett",
+            bread,
+            _conley_sparse=True,
+        )
+        np.testing.assert_allclose(V_sparse, V_dense, atol=1e-10, rtol=1e-10)
+
     def test_panel_block_decomposed_sparse_matches_dense(self):
         """Panel block-decomposed sandwich produces the same vcov whether
         the spatial component is computed dense or sparse. The serial
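For reference, the dense-path saturation both new tests depend on falls out of the standard haversine formula once the `arcsin` argument is clipped to [0, 1]. A self-contained version (assuming the conventional 6371 km mean radius, which may differ from the module's `_CONLEY_EARTH_RADIUS_KM`) returns exactly π·R for an antipodal pair:

```python
import numpy as np

R_EARTH = 6371.0  # km; assumed mean radius, may differ from the module constant

def haversine_km(lat1, lon1, lat2, lon2):
    # Standard haversine great-circle distance between two (lat, lon)
    # points given in degrees.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    # The clip guards arcsin's domain against float roundoff for
    # (near-)antipodal points; arcsin caps at pi/2, so the distance
    # saturates at pi * R_EARTH (half the circumference).
    return 2.0 * R_EARTH * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Antipodal pair on the equator: exactly half the Earth's circumference.
print(haversine_km(0.0, 0.0, 0.0, 180.0))  # ~20015.09 km = pi * 6371
```

No distance can exceed this saturation point, which is why the sparse path's arc clamp (rather than a larger chord) is the correct way to mirror the dense behavior.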
