Commit 3755a44

igerber and claude committed
Codex re-review: remove float64 coercion in MPD/TWFE Conley wire-up
P1 (Methodology): the prior fix normalized `time` inside `_compute_conley_vcov`, but `MultiPeriodDiD.fit()` and `TwoWayFixedEffects.fit()` still coerced `data[time].values.astype(np.float64)` before passing to the helper. datetime64 / pd.Period / string time labels fail before the helper's normalization runs, so the documented "normalizes to dense panel-period codes" contract was unreachable on the public estimator surfaces.

Fix: replace `.astype(np.float64)` with `np.asarray(...)` so the original ordered labels reach the helper, which then normalizes via `np.unique(return_inverse=True)`.

P3 (Documentation): updated the `MultiPeriodDiD` class docstring's `vcov_type="conley"` bullet to describe the Phase 2 block-decomposed contract (it still said "rejected at fit-time" / "Phase 2 will add the space-time product kernel"). Also updated the `unit` fit-arg docstring to note that it is REQUIRED when `vcov_type="conley"`, rather than "does NOT affect SE computation".

Regression: `test_multi_period_did_conley_with_datetime64_time` fits MPD with `time_dt` (pd.to_datetime) and `time_int` (0, 1, 2) on the same panel and asserts that the diagonal SEs match at atol=1e-10. This verifies the end-to-end estimator surface, not just the helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
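The "dense panel-period codes" contract described above can be illustrated with a minimal sketch. It is a pure-Python stand-in for the sorted-unique ranking that `np.unique(return_inverse=True)` produces; the function name `dense_period_codes` and the sample labels are hypothetical and are not part of the library.

```python
from datetime import date


def dense_period_codes(labels):
    """Map ordered time labels to dense codes 0..T-1 (rank among sorted uniques).

    Illustrative stand-in only; the real helper is `_compute_conley_vcov`,
    whose internals are not shown in this commit.
    """
    ordered = sorted(set(labels))
    rank = {label: code for code, label in enumerate(ordered)}
    return [rank[label] for label in labels]


# The same three-period panel for two units, under three label encodings.
as_int = [0, 1, 2, 0, 1, 2]
as_date = [date(2024, 1, 1), date(2024, 4, 1), date(2024, 8, 1)] * 2
as_str = ["2024Q1", "2024Q2", "2024Q3"] * 2

codes_int = dense_period_codes(as_int)
codes_date = dense_period_codes(as_date)
codes_str = dense_period_codes(as_str)
```

All three encodings rank to the same codes, which is why the labels only need to be orderable, not numeric, when they reach the normalizer.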
1 parent fa820a8 commit 3755a44

3 files changed

Lines changed: 99 additions & 18 deletions

diff_diff/estimators.py

Lines changed: 29 additions & 17 deletions
@@ -1033,21 +1033,28 @@ class MultiPeriodDiD(DifferenceInDifferences):
         - ``"hc2_bm"``: one-way HC2 + Imbens-Kolesar (2016) Satterthwaite DOF
           per coefficient plus a contrast-aware DOF for the post-period-average
           ATT. **Unsupported with** ``cluster=`` — see ``cluster`` above.
-        - ``"conley"``: Conley 1999 spatial-HAC sandwich. **Accepted by the
-          constructor for sklearn-style API symmetry but rejected at
-          fit-time on ``MultiPeriodDiD``** because MultiPeriodDiD is
-          intrinsically a multi-period panel estimator and Phase 1's
-          cross-sectional Conley does not handle the time dimension. The
-          supported Phase 1 path for Conley is direct
-          ``compute_robust_vcov`` / ``LinearRegression`` on a single-period
-          regression. Phase 2 will add the space-time product kernel
-          (Driscoll-Kraay) and lift the rejection.
+        - ``"conley"``: Conley 1999 spatial-HAC sandwich via the panel
+          block-decomposed form (matches R ``conleyreg`` with
+          ``lag_cutoff > 0``). Pass ``conley_coords=(lat_col, lon_col)``,
+          ``conley_cutoff_km=<float>``, and ``conley_lag_cutoff=<int>`` on
+          the constructor; ``unit=`` must be supplied at fit-time. The
+          sandwich sums within-period spatial pairs plus within-unit
+          Bartlett serial pairs (lag=0 excluded to avoid double-counting);
+          this is NOT a multiplicative product kernel. ``conley_time`` is
+          auto-derived from the ``time`` column at fit-time and normalized
+          to dense panel-period codes ``0..T-1`` so ``conley_lag_cutoff``
+          always counts panel periods (works for int / datetime64 /
+          ``pd.Period`` / string encodings). Restrictions: ``cluster=``,
+          ``survey_design=``, and ``inference="wild_bootstrap"`` raise on
+          this path (Phase 5 / follow-up).
     alpha : float, default=0.05
         Significance level for confidence intervals.
-    conley_coords, conley_cutoff_km, conley_metric, conley_kernel
-        Accepted by the constructor for sklearn-style API symmetry, but
-        ``vcov_type="conley"`` is rejected at fit-time on ``MultiPeriodDiD``
-        (see ``vcov_type`` above).
+    conley_coords, conley_cutoff_km, conley_metric, conley_kernel, conley_lag_cutoff
+        Constructor kwargs that take effect when ``vcov_type="conley"``.
+        ``conley_coords`` is a ``(lat_col, lon_col)`` tuple of column names
+        on ``data``. ``conley_lag_cutoff`` is the within-unit Bartlett lag
+        (non-negative int; 0 means within-period spatial only, no serial
+        component).

     Attributes
     ----------
@@ -1147,9 +1154,10 @@ def fit( # type: ignore[override]
         unit : str, optional
             Name of the unit identifier column. When provided, checks whether
             treatment timing varies across units and warns if staggered adoption
-            is detected (suggests CallawaySantAnna instead). Does NOT affect
-            standard error computation -- use the ``cluster`` parameter for
-            cluster-robust SEs.
+            is detected (suggests CallawaySantAnna instead). Required when
+            ``vcov_type="conley"`` (the panel block-decomposed sandwich computes
+            a per-unit serial sum). For other ``vcov_type`` values, use the
+            ``cluster`` parameter for cluster-robust SEs.
         survey_design : SurveyDesign, optional
             Survey design specification for design-based inference. When provided,
             uses Taylor Series Linearization for variance estimation and
@@ -1550,7 +1558,11 @@ def fit( # type: ignore[override]
                     working_data[self.conley_coords[1]].values.astype(np.float64),
                 ]
             )
-            _conley_time_arr: Optional[np.ndarray] = working_data[time].values.astype(np.float64)
+            # Preserve the original time-label dtype (int, datetime64, pd.Period,
+            # string). `_compute_conley_vcov` normalizes to dense 0..T-1 codes
+            # internally; float coercion here would break datetime64 / Period /
+            # string encodings before the normalizer runs.
+            _conley_time_arr: Optional[np.ndarray] = np.asarray(working_data[time].values)
             _conley_unit_arr: Optional[np.ndarray] = working_data[unit].values
         else:
             _conley_coords_arr = None
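The docstring's description of the block-decomposed sandwich (within-period spatial pairs plus within-unit Bartlett serial pairs, lag 0 excluded) can be sketched as a pairwise weight function. This is a hedged illustration, not the library's code: the name `pair_weight`, the uniform spatial kernel, and the Bartlett form `1 - lag / (lag_cutoff + 1)` are assumptions made for the example, and periods are taken to be dense codes already.

```python
def pair_weight(unit_i, period_i, unit_j, period_j, dist_km, cutoff_km, lag_cutoff):
    """Weight applied to the score cross-product of observations i and j.

    Assumptions (not confirmed by the source): uniform spatial kernel and
    Bartlett serial weights 1 - lag / (lag_cutoff + 1).
    """
    lag = abs(period_i - period_j)
    if lag == 0:
        # Within-period spatial block; also covers an observation with itself.
        return 1.0 if dist_km <= cutoff_km else 0.0
    if unit_i == unit_j and lag <= lag_cutoff:
        # Within-unit serial block; lag 0 was handled above, so it is not
        # counted a second time here.
        return 1.0 - lag / (lag_cutoff + 1)
    # Different unit and different period: no cross term. A multiplicative
    # space-time product kernel would generally give these pairs weight.
    return 0.0


same_period_neighbors = pair_weight(1, 0, 2, 0, 100.0, 2000.0, 1)
same_unit_one_lag = pair_weight(1, 0, 1, 1, 0.0, 2000.0, 1)
cross_unit_cross_period = pair_weight(1, 0, 2, 1, 100.0, 2000.0, 1)
same_unit_beyond_cutoff = pair_weight(1, 0, 1, 2, 0.0, 2000.0, 1)
```

The zero weight on cross-unit, cross-period pairs is what separates this additive form from a product kernel, and it is also why `unit=` has to be known at fit-time.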

diff_diff/twfe.py

Lines changed: 5 additions & 1 deletion
@@ -368,7 +368,11 @@ def fit( # type: ignore[override]
                     data[self.conley_coords[1]].values.astype(np.float64),
                 ]
             )
-            _conley_time_arr: Optional[np.ndarray] = data[time].values.astype(np.float64)
+            # Preserve the original time-label dtype (int, datetime64, pd.Period,
+            # string). `_compute_conley_vcov` normalizes to dense 0..T-1 codes
+            # internally; float coercion here would break datetime64 / Period /
+            # string encodings before the normalizer runs.
+            _conley_time_arr: Optional[np.ndarray] = np.asarray(data[time].values)
             _conley_unit_arr: Optional[np.ndarray] = data[unit].values
             # vcov_type="conley" + cluster_ids raises at the linalg validator
             # (combined kernel deferred). TWFE's auto-cluster would force that

tests/test_conley_vcov.py

Lines changed: 65 additions & 0 deletions
@@ -1178,6 +1178,71 @@ def test_multi_period_did_conley_with_survey_design_raises(self):
                 survey_design=sd_psu,
             )

+    def test_multi_period_did_conley_with_datetime64_time(self):
+        """End-to-end MPD + vcov_type='conley' with datetime64 time labels.
+        Closes Codex re-review P1: the wrapper must NOT coerce time to float64
+        before passing to _compute_conley_vcov; the helper normalizes to
+        dense codes internally. Verifies the SEs match an equivalent
+        dense-integer-coded fit.
+        """
+        import pandas as pd
+
+        from diff_diff import MultiPeriodDiD
+
+        rng = np.random.default_rng(seed=37)
+        n_units = 12
+        date_labels = pd.to_datetime(["2024-01-01", "2024-04-01", "2024-08-01"])
+        rows = []
+        for u in range(n_units):
+            lat = rng.uniform(-30, 30)
+            lon = rng.uniform(-100, 100)
+            for t_idx, dt in enumerate(date_labels):
+                treated = u < 6
+                y = 0.2 * t_idx + (1.0 if (treated and t_idx >= 1) else 0.0) + rng.normal(0, 0.4)
+                rows.append(
+                    {
+                        "unit": u,
+                        "time_dt": dt,
+                        "time_int": t_idx,
+                        "y": y,
+                        "treated": int(treated),
+                        "lat": lat,
+                        "lon": lon,
+                    }
+                )
+        df_mp = pd.DataFrame(rows)
+        kwargs = dict(
+            vcov_type="conley",
+            conley_coords=("lat", "lon"),
+            conley_cutoff_km=2000.0,
+            conley_lag_cutoff=1,
+        )
+        res_int = MultiPeriodDiD(**kwargs).fit(
+            df_mp,
+            outcome="y",
+            treatment="treated",
+            time="time_int",
+            post_periods=[1, 2],
+            unit="unit",
+            reference_period=0,
+        )
+        res_dt = MultiPeriodDiD(**kwargs).fit(
+            df_mp,
+            outcome="y",
+            treatment="treated",
+            time="time_dt",
+            post_periods=[date_labels[1], date_labels[2]],
+            unit="unit",
+            reference_period=date_labels[0],
+        )
+        # Per-coefficient SE should match across the two encodings (dense
+        # codes normalize identically). MPD orders coefficients by the
+        # reference-vs-non-reference period split; with reference_period=0
+        # and post_periods=[1,2] the coefficient ordering is bit-identical.
+        se_int = np.sqrt(np.diag(res_int.vcov))
+        se_dt = np.sqrt(np.diag(res_dt.vcov))
+        np.testing.assert_allclose(se_dt, se_int, atol=1e-10)
+
     def test_multi_period_did_conley_missing_coords_raises(self):
         """MPD + vcov_type='conley' without conley_coords raises a clean
         ValueError instead of a raw TypeError on `self.conley_coords[0]`.
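One property the regression fixture's dates happen to illustrate: the three labels are unevenly spaced on the calendar, yet adjacent periods are exactly one step apart in dense codes, so a lag cutoff of 1 reaches the neighboring period in both encodings. The standard-library sketch below shows this under that reading; the variable names are illustrative and nothing here calls the library.

```python
from datetime import date

# Same three dates as the test fixture, as plain datetime.date objects.
date_labels = [date(2024, 1, 1), date(2024, 4, 1), date(2024, 8, 1)]

# Calendar gaps between consecutive periods, in days (uneven).
calendar_gaps = [(b - a).days for a, b in zip(date_labels, date_labels[1:])]

# Dense codes: rank of each label among the sorted unique labels.
ordered = sorted(set(date_labels))
codes = [ordered.index(d) for d in date_labels]

# Gaps between consecutive periods measured in dense codes (always 1).
code_gaps = [b - a for a, b in zip(codes, codes[1:])]
```

Measured in raw calendar units the two gaps differ, so a lag cutoff expressed in days would treat them differently; measured in dense codes they are identical.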
