
Commit 68678e9

igerber and claude committed
Codex CI R6 holistic: shared Conley validator + sparse density gate + P3 uniform-kernel limit fix
Three findings on R6, addressed via pattern-level fixes rather than single-site patches, after 5 cross-round findings across 2 mirror classes (validation-gap, raw-data-vs-working-data) signaled the symptoms were drifting between estimator surfaces.

P1 (Code Quality) — shared estimator validator
Validation gaps were mirrored across DiD (R1, R2) and MPD/TWFE (R6) under the same `vcov_type='conley'` entry-point contract: missing/unknown unit column, malformed conley_coords arity, missing coord column, survey_design rejection, wild_bootstrap rejection. Replaces three separate inline blocks (one per estimator) with a single `_validate_conley_estimator_inputs(...)` helper in `conley.py`, called from `DifferenceInDifferences.fit()`, `MultiPeriodDiD.fit()`, and `TwoWayFixedEffects.fit()`. Single source of truth for the seven-check union; future estimator surfaces (HAC-Conley, SDID-Conley) pick up the same contract by one-line opt-in.

P2 (Performance) — sparse density gate
The sparse k-d-tree path's CSR storage (~12 bytes/nnz: data + indices + indptr) loses its memory advantage over dense float64 (8 bytes/cell) at ~67% density; at high density, "sparse" can silently use MORE memory than dense. Adds `_CONLEY_SPARSE_DENSITY_THRESHOLD = 0.3`, well below break-even for a comfortable safety margin. After the k-d-tree build, `cKDTree.count_neighbors` (which shares traversal with the subsequent `query_ball_tree`) measures the actual neighbor density; if it exceeds the threshold, `_compute_spatial_bartlett_meat_sparse` returns None, the dispatcher falls back to the dense path, and a UserWarning surfaces the reason. This locks in the auto-sparse contract: it fires only when it actually saves memory.

P3 (Documentation/Tests) — uniform kernel for huge-cutoff limit
`test_combined_kernel_reduces_to_pure_cluster_at_huge_cutoff` used the Bartlett kernel at cutoff=1e9; Bartlett gives K = 1 - d/h < 1 for d > 0, so the reduction to pure CR1 is asymptotic, not exact. Switched to the uniform kernel (K = 1 exactly for all in-cutoff pairs). A REGISTRY note clarifies the kernel-choice rationale.

Tests:
- test_sparse_density_gate_falls_back_to_dense_and_warns: tight cluster + huge cutoff → 100% density → fallback + warning + result equals dense.
- test_sparse_density_gate_does_not_trigger_below_threshold: realistic cutoff → no fallback / no warning.

Full regression sweep: 493 passed (no regressions across conley_vcov, linalg, estimators, estimators_vcov_type, methodology_twfe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
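The P2 break-even figure is easy to sanity-check. A minimal sketch of the arithmetic, assuming float64 CSR data (8 B/nnz) plus int32 indices (4 B/nnz) and ignoring the O(n) indptr against the O(n²) terms:

```python
# Sanity-check of the CSR-vs-dense break-even cited in P2 (a sketch: assumes
# float64 CSR data at 8 B/nnz plus int32 indices at 4 B/nnz; the (n+1)-entry
# indptr is O(n) and is ignored against the O(n^2) terms).
BYTES_PER_NNZ_CSR = 8 + 4     # data + indices
BYTES_PER_CELL_DENSE = 8      # dense float64

# CSR wins while 12 * nnz < 8 * n^2, i.e. while density nnz/n^2 stays below 8/12.
break_even_density = BYTES_PER_CELL_DENSE / BYTES_PER_NNZ_CSR
print(f"break-even density: {break_even_density:.1%}")  # 66.7%

# At the 30% gate, CSR uses 12 * 0.3 * n^2 bytes against 8 * n^2 dense bytes.
gate = 0.30
csr_fraction_of_dense = gate * BYTES_PER_NNZ_CSR / BYTES_PER_CELL_DENSE
print(f"CSR memory at the gate: {csr_fraction_of_dense:.0%} of dense")  # 45%
```

This reproduces both numbers in the commit: the ~67% break-even and the ~45%-of-dense usage at the 30% gate.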
1 parent e03318d commit 68678e9

5 files changed

Lines changed: 295 additions & 124 deletions


diff_diff/conley.py

Lines changed: 149 additions & 8 deletions
@@ -47,7 +47,7 @@
 from __future__ import annotations
 
 import warnings
-from typing import Callable, Literal, Optional, Union, cast
+from typing import Any, Callable, Literal, Optional, Union, cast
 
 import numpy as np
 
@@ -87,6 +87,18 @@
 # the dense path is used regardless of n.
 _CONLEY_SPARSE_N_THRESHOLD = 5_000
 
+# Density gate: fraction of n*n pairs within cutoff above which the sparse
+# CSR matrix's storage overhead (~12 bytes/nnz: data + indices + indptr)
+# loses its memory advantage over a dense float64 (8 bytes/cell). The
+# break-even is at ~67% density; we gate at 30% (well below break-even)
+# to give a comfortable safety margin: at 30% nnz, CSR uses ~45% of
+# dense memory. Above 30%, fall back to dense + emit a UserWarning so
+# users with large cutoffs aren't surprised by the "sparse" path
+# materializing a near-dense matrix. Computed exactly via
+# ``cKDTree.count_neighbors`` (O(n log n + nnz), shares tree traversal
+# with the subsequent ``query_ball_tree``).
+_CONLEY_SPARSE_DENSITY_THRESHOLD = 0.3
+
 
 def _haversine_km(
     lat1: np.ndarray,
@@ -186,6 +198,109 @@ def _validate_callable_metric_result(result: object, n: int) -> np.ndarray:
     return arr
 
 
+def _validate_conley_estimator_inputs(
+    *,
+    estimator_name: str,
+    data: "Any",
+    unit: Optional[str],
+    conley_coords: object,
+    conley_cutoff_km: Optional[float],
+    conley_lag_cutoff: Optional[int],
+    survey_design: object,
+    inference: str,
+) -> None:
+    """Shared front-door validation for ``vcov_type='conley'`` on the
+    estimator entry points (``DifferenceInDifferences``, ``MultiPeriodDiD``,
+    ``TwoWayFixedEffects``).
+
+    Each estimator's ``fit()`` calls this BEFORE building Conley arrays or
+    threading them into the variance computation. The seven checks below
+    are the union of what each estimator needs; estimator-specific bits
+    (e.g. building the array from a column name) remain inline at the
+    caller. Centralizing this in one place prevents the validation-class
+    drift that surfaced repeatedly across Wave A CI rounds (DiD's
+    front-door guard for `unit` and `conley_coords` was missing on MPD
+    / TWFE).
+
+    Parameters
+    ----------
+    estimator_name : str
+        Class name surfaced in error messages (e.g.
+        ``"DifferenceInDifferences"``).
+    data : pandas.DataFrame
+        The dataset passed to ``fit()``. Used to check column existence
+        for ``unit`` and ``conley_coords``. Typed as ``Any`` here to
+        avoid importing pandas at module load time.
+    unit : str or None
+        Name of the unit identifier column (required for Conley).
+    conley_coords : tuple/list/None
+        Pair of column names ``(lat_col, lon_col)``.
+    conley_cutoff_km : float or None
+        Positive finite bandwidth.
+    conley_lag_cutoff : int or None
+        Non-negative integer lag for the within-unit Bartlett serial sum.
+    survey_design : object
+        ``SurveyDesign`` instance or ``None``. Survey + Conley is deferred.
+    inference : str
+        Estimator inference mode. ``"wild_bootstrap"`` + Conley is rejected.
+
+    Raises
+    ------
+    ValueError
+        Missing/malformed conley_coords, missing conley_cutoff_km,
+        missing/unknown unit, missing conley_lag_cutoff,
+        coord column not in ``data``.
+    NotImplementedError
+        ``survey_design`` is non-None, or ``inference == "wild_bootstrap"``.
+    """
+    if conley_coords is None or conley_cutoff_km is None:
+        raise ValueError(
+            f"{estimator_name}(vcov_type='conley') requires "
+            "conley_coords=(lat_col, lon_col) and conley_cutoff_km "
+            "on the constructor."
+        )
+    if (
+        not isinstance(conley_coords, (tuple, list))
+        or len(conley_coords) != 2
+        or not all(isinstance(c, str) for c in conley_coords)
+    ):
+        raise ValueError(
+            "conley_coords must be a 2-element tuple/list of column "
+            f"names (lat_col, lon_col); got {conley_coords!r}."
+        )
+    for _coord_col in conley_coords:
+        if _coord_col not in data.columns:
+            raise ValueError(f"conley_coords column '{_coord_col}' not found in data.")
+    if unit is None:
+        raise ValueError(
+            f"{estimator_name}(vcov_type='conley') requires `unit=<column_name>` "
+            "— the panel block-decomposed Conley sandwich needs the unit "
+            "identifier to compute the per-unit serial sum."
+        )
+    if unit not in data.columns:
+        raise ValueError(f"Unit column '{unit}' not found in data")
+    if conley_lag_cutoff is None:
+        raise ValueError(
+            f"{estimator_name}(vcov_type='conley') requires conley_lag_cutoff "
+            "(non-negative int; 0 means spatial-within-period only, no serial "
+            "component). See R conleyreg's `lag_cutoff` argument for the convention."
+        )
+    if survey_design is not None:
+        raise NotImplementedError(
+            f"{estimator_name}(vcov_type='conley') + survey_design is a "
+            "follow-up (Bertanha-Imbens 2014 weighted-Conley). Drop "
+            "survey_design for cross-sectional Conley, or use vcov_type='hc1' "
+            "for survey-aware cluster-robust without spatial HAC."
+        )
+    if inference == "wild_bootstrap":
+        raise NotImplementedError(
+            f"{estimator_name}(vcov_type='conley', inference='wild_bootstrap') "
+            "is not supported: the wild bootstrap is a separate inference path "
+            "that does not consume the analytical Conley sandwich. Use "
+            "inference='analytical' for Conley SEs."
+        )
+
+
 def _pairwise_distance_matrix(coords: np.ndarray, metric: ConleyMetric) -> np.ndarray:
     """Build the dense n×n pairwise distance matrix.
 
@@ -248,7 +363,8 @@ def _compute_spatial_bartlett_meat_sparse(
     metric: str,
     *,
     cluster_codes: Optional[np.ndarray] = None,
-) -> np.ndarray:
+    density_threshold: float = _CONLEY_SPARSE_DENSITY_THRESHOLD,
+) -> Optional[np.ndarray]:
     """Sparse k-d-tree-based spatial Bartlett meat: ``S.T @ K_bartlett @ S``.
 
     Used by :func:`_compute_conley_vcov` when ``_conley_sparse`` is True or
@@ -329,22 +445,41 @@
         # clamp mirrors that saturation. The 1+1e-12 epsilon then absorbs
         # chord-projection float roundoff at sub-cutoff distances.
         arc_radians = min(cutoff / _CONLEY_EARTH_RADIUS_KM, np.pi)
-        chord_radius = 2.0 * np.sin(arc_radians / 2.0)
-        chord_radius *= 1.0 + 1e-12
+        query_r = 2.0 * np.sin(arc_radians / 2.0)
+        query_r *= 1.0 + 1e-12
         tree = cKDTree(xyz)
-        neighbors = tree.query_ball_tree(tree, r=chord_radius, p=2.0)
     elif metric == "euclidean":
         # Small relative epsilon for symmetry with the haversine branch.
         # Bartlett's <=-vs-< boundary is moot since kernel is exactly 0 at u=1.
         query_r = cutoff * (1.0 + 1e-12)
         tree = cKDTree(coords)
-        neighbors = tree.query_ball_tree(tree, r=query_r, p=2.0)
     else:
         raise ValueError(
             "sparse Conley path requires metric in {'haversine', 'euclidean'}; "
             f"got {metric!r}. (Callable metrics fall back to the dense path.)"
         )
 
+    # Density gate: count in-range pairs cheaply via the same tree (shares
+    # traversal cost with query_ball_tree, no extra allocation). If density
+    # exceeds the threshold, return None so the caller falls back to dense
+    # (avoids materializing a near-dense CSR matrix that would use MORE
+    # memory than dense float64). Codex CI R6 P2.
+    n_pairs_in_range = int(tree.count_neighbors(tree, r=query_r, p=2.0))
+    density = n_pairs_in_range / float(n * n)
+    if density > density_threshold:
+        warnings.warn(
+            f"Conley sparse path: neighbor density {density:.1%} exceeds "
+            f"threshold {density_threshold:.1%}; falling back to dense "
+            "(CSR storage would use more memory than the dense float64 "
+            "matrix at this density). Consider a smaller conley_cutoff_km "
+            "for genuine memory savings.",
+            UserWarning,
+            stacklevel=3,
+        )
+        return None
+
+    neighbors = tree.query_ball_tree(tree, r=query_r, p=2.0)
+
     rows_list: list[np.ndarray] = []
     cols_list: list[np.ndarray] = []
     data_list: list[np.ndarray] = []
@@ -732,10 +867,16 @@ def _spatial_meat_for_mask(mask: Optional[np.ndarray] = None) -> np.ndarray:
         # The auto-toggle and explicit-True gate above both guarantee
         # metric is a str (haversine/euclidean), so this cast is safe;
         # using cast() avoids leaking the narrowing into the dense
-        # fallback below.
-        return _compute_spatial_bartlett_meat_sparse(
+        # fallback below. The sparse helper returns None when the
+        # neighbor density exceeds the threshold (sparse CSR storage
+        # would use more memory than dense); fall through to dense in
+        # that case (the warning is already emitted by the helper).
+        sparse_meat = _compute_spatial_bartlett_meat_sparse(
             S_sub, coords_sub, cutoff, cast(str, metric), cluster_codes=cluster_sub
         )
+        if sparse_meat is not None:
+            return sparse_meat
+        # Density-gated fallback: continue to the dense path below.
     D = _pairwise_distance_matrix(coords_sub, metric)
     K = _kernel_fn(D / cutoff)
     if cluster_sub is not None:
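The density gate added above can be illustrated standalone. A sketch using a brute-force numpy pair count in place of `cKDTree.count_neighbors` (it computes the same quantity without a scipy dependency); the point sets and cutoffs here are illustrative, not the library's test fixtures:

```python
import numpy as np

def neighbor_density(coords: np.ndarray, cutoff: float) -> float:
    """Fraction of the n*n pair matrix within `cutoff`, self-pairs included
    (matching tree.count_neighbors(tree, r), which counts i == j pairs too)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    n = coords.shape[0]
    return float((dist <= cutoff).sum()) / float(n * n)

rng = np.random.default_rng(0)

# Tight cluster + generous cutoff: every pair is in range, density 100%,
# so the 0.3 gate trips and the caller falls back to the dense path.
tight = rng.normal(scale=0.01, size=(50, 2))
assert neighbor_density(tight, cutoff=1.0) == 1.0

# Spread-out points + modest cutoff: only a sliver of pairs is in range,
# well under the 0.3 threshold, so the sparse path keeps its savings.
spread = rng.uniform(0.0, 100.0, size=(200, 2))
assert neighbor_density(spread, cutoff=5.0) < 0.3
```

The two cases mirror the new tests: the first is the falls-back-and-warns scenario, the second the stays-sparse scenario.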

diff_diff/estimators.py

Lines changed: 32 additions & 91 deletions
@@ -392,57 +392,22 @@ def fit(
         # sums; we mirror MultiPeriodDiD's reject pattern for missing args
         # and the survey/wild-bootstrap incompatibilities.
         if self.vcov_type == "conley":
-            if unit is None:
-                raise ValueError(
-                    "DifferenceInDifferences(vcov_type='conley').fit() requires "
-                    "`unit=<column_name>` — the panel block-decomposed Conley "
-                    "sandwich needs the unit identifier to compute the per-unit "
-                    "serial sum. Pass DiD(...).fit(data, ..., unit='<col>')."
-                )
-            if unit not in data.columns:
-                raise ValueError(f"Unit column '{unit}' not found in data")
-            if self.conley_lag_cutoff is None:
-                raise ValueError(
-                    "DifferenceInDifferences(vcov_type='conley') requires "
-                    "conley_lag_cutoff (non-negative int; 0 means spatial-"
-                    "within-period only, no serial component)."
-                )
-            if self.conley_coords is None or self.conley_cutoff_km is None:
-                raise ValueError(
-                    "DifferenceInDifferences(vcov_type='conley') requires "
-                    "conley_coords=(lat_col, lon_col) and conley_cutoff_km "
-                    "on the constructor."
-                )
-            # Validate conley_coords is a 2-element tuple/list of strings
-            # and both columns exist on `data`. Without these guards, a
-            # malformed tuple or missing column fell through to an opaque
-            # IndexError / pandas KeyError downstream. Codex CI R2 P1.
-            if (
-                not isinstance(self.conley_coords, (tuple, list))
-                or len(self.conley_coords) != 2
-                or not all(isinstance(c, str) for c in self.conley_coords)
-            ):
-                raise ValueError(
-                    "conley_coords must be a 2-element tuple/list of column "
-                    f"names (lat_col, lon_col); got {self.conley_coords!r}."
-                )
-            for _coord_col in self.conley_coords:
-                if _coord_col not in data.columns:
-                    raise ValueError(f"conley_coords column '{_coord_col}' not found in data.")
-            if survey_design is not None:
-                raise NotImplementedError(
-                    "DifferenceInDifferences(vcov_type='conley') + survey_design "
-                    "is a follow-up (Bertanha-Imbens 2014 weighted-Conley). Drop "
-                    "survey_design for cross-sectional Conley."
-                )
-            if self.inference == "wild_bootstrap":
-                raise NotImplementedError(
-                    "DifferenceInDifferences(vcov_type='conley', "
-                    "inference='wild_bootstrap') is not supported: the wild "
-                    "bootstrap is a separate inference path that does not "
-                    "consume the analytical Conley sandwich. Use "
-                    "inference='analytical' for Conley SEs."
-                )
+            # Shared front-door validation across DiD / MPD / TWFE entry
+            # points (Wave A holistic fix: replaces the inline drift that
+            # accumulated across CI R1/R2/R6 — same-class validation gaps
+            # mirrored across estimator surfaces).
+            from diff_diff.conley import _validate_conley_estimator_inputs
+
+            _validate_conley_estimator_inputs(
+                estimator_name="DifferenceInDifferences",
+                data=data,
+                unit=unit,
+                conley_coords=self.conley_coords,
+                conley_cutoff_km=self.conley_cutoff_km,
+                conley_lag_cutoff=self.conley_lag_cutoff,
+                survey_design=survey_design,
+                inference=self.inference,
+            )
 
         if absorb:
             # FWL theorem: demean ALL regressors alongside outcome.
@@ -1492,47 +1457,23 @@ def fit(  # type: ignore[override]
             )
 
         # MultiPeriodDiD is intrinsically a multi-period panel estimator;
-        # Phase 2 panel block-decomposed Conley (matches R conleyreg): require
-        # `unit` and `conley_lag_cutoff` when vcov_type="conley". The actual
-        # array extraction (and conley_coords resolution from column names)
-        # happens just below at the solve_ols call.
+        # Phase 2 panel block-decomposed Conley (matches R conleyreg) needs
+        # `unit`, `conley_lag_cutoff`, and `conley_coords` at fit-time. The
+        # validation is shared with DiD / TWFE to avoid the validation-class
+        # drift that surfaced across Wave A CI R1/R2/R6.
         if self.vcov_type == "conley":
-            if self.conley_coords is None or self.conley_cutoff_km is None:
-                raise ValueError(
-                    "MultiPeriodDiD(vcov_type='conley') requires "
-                    "conley_coords=(lat_col, lon_col) and conley_cutoff_km "
-                    "on the constructor."
-                )
-            if unit is None:
-                raise ValueError(
-                    "MultiPeriodDiD(vcov_type='conley') requires unit= at "
-                    "fit-time (the panel block-decomposed sandwich computes "
-                    "a per-unit serial sum, matching R conleyreg)."
-                )
-            if self.conley_lag_cutoff is None:
-                raise ValueError(
-                    "MultiPeriodDiD(vcov_type='conley') requires "
-                    "conley_lag_cutoff (non-negative int; 0 means spatial-"
-                    "within-period only, no serial component). See R "
-                    "conleyreg's `lag_cutoff` argument for the convention."
-                )
-            if survey_design is not None:
-                raise NotImplementedError(
-                    "MultiPeriodDiD(vcov_type='conley', survey_design=...) "
-                    "is not supported: Conley + survey weights / replicate "
-                    "vcov is deferred to a follow-up PR (Bertanha-Imbens 2014 "
-                    "territory). Use vcov_type='hc1' for survey-aware "
-                    "cluster-robust without spatial HAC, or drop survey_design= "
-                    "for panel Conley."
-                )
-            if self.inference == "wild_bootstrap":
-                raise NotImplementedError(
-                    "MultiPeriodDiD(vcov_type='conley', "
-                    "inference='wild_bootstrap') is not supported: wild "
-                    "bootstrap is a separate inference path that does not "
-                    "consume the analytical Conley sandwich. Use "
-                    "inference='analytical' for Conley SEs."
-                )
+            from diff_diff.conley import _validate_conley_estimator_inputs
+
+            _validate_conley_estimator_inputs(
+                estimator_name="MultiPeriodDiD",
+                data=data,
+                unit=unit,
+                conley_coords=self.conley_coords,
+                conley_cutoff_km=self.conley_cutoff_km,
+                conley_lag_cutoff=self.conley_lag_cutoff,
+                survey_design=survey_design,
+                inference=self.inference,
+            )
 
         # Pre-compute non_ref_periods (needed for absorb demeaning)
         non_ref_periods = [p for p in all_periods if p != reference_period]
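The shared-validator shape the estimators now rely on, one keyword-only helper parameterized by `estimator_name` and called identically from each `fit()`, reduces to a small pattern. A minimal sketch with hypothetical names (`_validate_inputs`, `EstimatorA`, `EstimatorB` are illustrations, not the library's API):

```python
from typing import Optional

def _validate_inputs(*, estimator_name: str, unit: Optional[str], coords: object) -> None:
    # Single source of truth: every entry point runs the identical checks,
    # so a gap fixed here is fixed for all estimator surfaces at once.
    if coords is None:
        raise ValueError(f"{estimator_name} requires coords=(lat_col, lon_col).")
    if unit is None:
        raise ValueError(f"{estimator_name} requires unit=<column_name>.")

class EstimatorA:
    def fit(self, unit=None, coords=None):
        _validate_inputs(estimator_name="EstimatorA", unit=unit, coords=coords)
        return "fitted"

class EstimatorB:
    def fit(self, unit=None, coords=None):
        # One-line opt-in: a new surface inherits the full check union.
        _validate_inputs(estimator_name="EstimatorB", unit=unit, coords=coords)
        return "fitted"

try:
    EstimatorB().fit(coords=("lat", "lon"))  # missing unit
except ValueError as exc:
    assert "EstimatorB requires unit" in str(exc)

assert EstimatorA().fit(unit="id", coords=("lat", "lon")) == "fitted"
```

Passing the class name in lets one helper produce estimator-specific error messages while keeping the check logic itself unduplicated.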

diff_diff/twfe.py

Lines changed: 12 additions & 21 deletions
@@ -173,27 +173,18 @@ def fit(  # type: ignore[override]
         # with the original (un-demeaned) time / unit vectors and coords
         # yields the correct block-decomposed sandwich.
         if self.vcov_type == "conley":
-            if self.conley_lag_cutoff is None:
-                raise ValueError(
-                    "TwoWayFixedEffects(vcov_type='conley') requires "
-                    "conley_lag_cutoff (non-negative int; 0 means spatial-"
-                    "within-period only, no serial component). See R "
-                    "conleyreg's `lag_cutoff` argument for the convention."
-                )
-            if self.conley_coords is None or self.conley_cutoff_km is None:
-                raise ValueError(
-                    "TwoWayFixedEffects(vcov_type='conley') requires "
-                    "conley_coords=(lat_col, lon_col) and "
-                    "conley_cutoff_km on the constructor."
-                )
-            if self.inference == "wild_bootstrap":
-                raise NotImplementedError(
-                    "TwoWayFixedEffects(vcov_type='conley', "
-                    "inference='wild_bootstrap') is not supported: the "
-                    "wild bootstrap is a separate inference path that does "
-                    "not consume the analytical Conley sandwich. Use "
-                    "inference='analytical' for Conley SEs."
-                )
+            from diff_diff.conley import _validate_conley_estimator_inputs
+
+            _validate_conley_estimator_inputs(
+                estimator_name="TwoWayFixedEffects",
+                data=data,
+                unit=unit,
+                conley_coords=self.conley_coords,
+                conley_cutoff_km=self.conley_cutoff_km,
+                conley_lag_cutoff=self.conley_lag_cutoff,
+                survey_design=survey_design,
+                inference=self.inference,
+            )
 
         # Check for staggered treatment timing and warn if detected
         self._check_staggered_treatment(data, treatment, time, unit)
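The P3 kernel-limit distinction from the commit message checks out numerically. A standalone sketch of the two kernel shapes (textbook definitions, not the library's implementation):

```python
import numpy as np

def bartlett(u: np.ndarray) -> np.ndarray:
    # K(u) = 1 - u for u <= 1, else 0: strictly below 1 whenever d > 0.
    return np.where(u <= 1.0, 1.0 - u, 0.0)

def uniform(u: np.ndarray) -> np.ndarray:
    # K(u) = 1 for u <= 1, else 0: exactly 1 for every in-cutoff pair.
    return np.where(u <= 1.0, 1.0, 0.0)

d = np.array([0.0, 120.0, 850.0])  # illustrative pair distances in km
cutoff = 1e9

# Bartlett at a huge cutoff only approaches the pure-cluster weights...
assert np.all(bartlett(d / cutoff)[1:] < 1.0)
# ...while the uniform kernel hits them exactly, making the test's
# reduction-to-CR1 equality exact rather than asymptotic.
assert np.all(uniform(d / cutoff) == 1.0)
```

This is why the huge-cutoff limit test had to switch kernels: no finite cutoff makes Bartlett weights equal 1 for d > 0.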
