Skip to content

Commit ee0cc54

Browse files
igerberclaude
andcommitted
Densify PSU codes over eligible subset + always populate per-cell tensor
Addresses two P0 correctness regressions in the PR-4 bootstrap PSU-map plumbing flagged by CI review. **P0 #1 - valid_map gate discarded the per-cell tensor too eagerly.** When any variance-eligible group had no positive-weight cells (all- sentinel row in psu_codes_per_cell), the old code set valid_map=False and left BOTH group_id_to_psu_code_bootstrap AND psu_codes_per_cell_bootstrap as None. The bootstrap then silently dropped to unclustered group-level instead of excluding only that group's empty row. Fix: always populate psu_codes_per_cell_bootstrap once the tensor is built; the cell-level path already masks out -1 cells at unroll time. Always populate group_id_to_psu_code_bootstrap with a per-group code (use placeholder 0 for all-sentinel rows since those groups have no IF mass and the multiplier they receive is irrelevant on either the legacy or the cell-level path). **P0 #2 - dense PSU codes factorized over non-eligible subset.** `np.unique(obs_psu_codes[pos_mask_boot])` previously included PSU labels from groups that were filtered out of _eligible_group_ids (e.g., singleton-baseline-excluded groups). The excluded groups' PSUs contributed dense codes that formed gaps in the eligible subset's map. Downstream `_generate_psu_or_group_weights` computes `n_psu = max(code) + 1` and triggers the identity fast path when `n_psu >= n_groups_target`. A gapped map like `[1, 1]` or `[0, 2, 2]` silently activated independent-draws clustering for eligible groups that should have shared a multiplier. Fix: restrict the np.unique factorization to the eligible-subset positive-weight obs only (`elig_obs_mask = pos_mask_boot & (g_idx_arr >= 0) & (t_idx_arr >= 0)`), so the dense code domain exactly matches the PSUs actually used by variance-eligible groups. Tests: - `test_bootstrap_zero_weight_group_equivalent_to_removing_it`: fit with vs without an all-zero-weight eligible group must produce byte-identical bootstrap SE at the same seed (byte- identity would have failed before P0 #1 fix because valid_map flipped the PSU-aware path off for the with-zero-group fit). - `test_bootstrap_dense_codes_under_singleton_baseline_excluded_group`: spies on the group_id_to_psu_code dict passed to `_compute_dcdh_bootstrap` under a fixture with an always-treated singleton-baseline group and strictly-coarser PSU among eligible groups. Asserts the dict's values form a contiguous `[0, n_unique-1]` range (no gaps from the excluded group's PSU), and that eligible groups sharing a PSU label receive the same dense code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 77ce297 commit ee0cc54

2 files changed

Lines changed: 213 additions & 56 deletions

File tree

diff_diff/chaisemartin_dhaultfoeuille.py

Lines changed: 65 additions & 56 deletions
Original file line numberDiff line numberDiff line change
@@ -2314,81 +2314,90 @@ def fit(
23142314
_obs_survey_info["weights"], dtype=np.float64
23152315
)
23162316
pos_mask_boot = obs_weights_boot > 0
2317+
gid_to_idx = {
2318+
gid: i for i, gid in enumerate(_eligible_group_ids)
2319+
}
2320+
tid_to_idx = {t: i for i, t in enumerate(all_periods)}
2321+
n_elig_boot = len(_eligible_group_ids)
2322+
n_per_boot = len(all_periods)
2323+
g_idx_arr = np.array(
2324+
[gid_to_idx.get(g, -1) for g in obs_gids_boot],
2325+
dtype=np.int64,
2326+
)
2327+
t_idx_arr = np.array(
2328+
[tid_to_idx.get(t, -1) for t in obs_tids_boot],
2329+
dtype=np.int64,
2330+
)
23172331
# Factor PSU labels to dense int codes over the
2318-
# positive-weight subpopulation. Shared code domain
2319-
# for both the per-cell tensor and the group-level
2320-
# dict below.
2321-
pos_psu_labels = obs_psu_codes[pos_mask_boot]
2332+
# **eligible-subset** positive-weight observations only
2333+
# (not the full positive-weight population). Restricting
2334+
# to eligible obs ensures the resulting dense codes
2335+
# range ONLY over PSUs actually used by variance-
2336+
# eligible groups, so downstream n_psu = max(code) + 1
2337+
# is exact: no gaps from singleton-baseline-excluded
2338+
# groups that would silently trigger the identity
2339+
# fast path in `_generate_psu_or_group_weights`.
2340+
elig_obs_mask = (
2341+
pos_mask_boot & (g_idx_arr >= 0) & (t_idx_arr >= 0)
2342+
)
2343+
elig_psu_labels = obs_psu_codes[elig_obs_mask]
23222344
dense_per_row: Optional[np.ndarray] = None
2323-
if pos_psu_labels.size > 0:
2324-
_, pos_dense_codes = np.unique(
2325-
pos_psu_labels, return_inverse=True,
2345+
if elig_psu_labels.size > 0:
2346+
_, elig_dense_codes = np.unique(
2347+
elig_psu_labels, return_inverse=True,
23262348
)
2327-
pos_dense_codes = np.asarray(pos_dense_codes, dtype=np.int64)
2349+
elig_dense_codes = np.asarray(elig_dense_codes, dtype=np.int64)
23282350
dense_per_row = np.full(
23292351
len(obs_psu_codes), -1, dtype=np.int64,
23302352
)
2331-
dense_per_row[pos_mask_boot] = pos_dense_codes
2353+
dense_per_row[elig_obs_mask] = elig_dense_codes
23322354

23332355
# Per-cell PSU tensor: (n_eligible, n_periods), -1 sentinel
2334-
# for ineligible / zero-weight cells.
2356+
# for ineligible / zero-weight cells. Populated
2357+
# unconditionally when `dense_per_row` exists — a row
2358+
# that ends up all-sentinel (eligible group with no
2359+
# positive-weight obs) is masked out at unroll time,
2360+
# not by discarding the entire tensor. See also the
2361+
# dispatcher's `_psu_varies_within_group` helper which
2362+
# ignores sentinel entries row-wise.
23352363
if dense_per_row is not None:
2336-
gid_to_idx = {
2337-
gid: i for i, gid in enumerate(_eligible_group_ids)
2338-
}
2339-
tid_to_idx = {t: i for i, t in enumerate(all_periods)}
2340-
n_elig_boot = len(_eligible_group_ids)
2341-
n_per_boot = len(all_periods)
23422364
psu_codes_per_cell = np.full(
23432365
(n_elig_boot, n_per_boot), -1, dtype=np.int64,
23442366
)
2345-
g_idx_arr = np.array(
2346-
[gid_to_idx.get(g, -1) for g in obs_gids_boot],
2347-
dtype=np.int64,
2348-
)
2349-
t_idx_arr = np.array(
2350-
[tid_to_idx.get(t, -1) for t in obs_tids_boot],
2351-
dtype=np.int64,
2352-
)
2353-
valid_obs_boot = (
2354-
pos_mask_boot
2355-
& (g_idx_arr >= 0)
2356-
& (t_idx_arr >= 0)
2357-
)
23582367
psu_codes_per_cell[
2359-
g_idx_arr[valid_obs_boot],
2360-
t_idx_arr[valid_obs_boot],
2361-
] = dense_per_row[valid_obs_boot]
2362-
2363-
# Group-level dict: first non-sentinel code per row.
2364-
# Under within-group-constant PSU this matches the
2365-
# pre-PR-4 "first label per group" convention
2366-
# bit-for-bit; under varying PSU the dispatcher
2367-
# routes to the cell-level path which uses the
2368-
# full `psu_codes_per_cell` tensor.
2368+
g_idx_arr[elig_obs_mask],
2369+
t_idx_arr[elig_obs_mask],
2370+
] = dense_per_row[elig_obs_mask]
2371+
psu_codes_per_cell_bootstrap = psu_codes_per_cell
2372+
2373+
# Group-level dict: one PSU code per eligible
2374+
# group. For rows that are all-sentinel (eligible
2375+
# group has no positive-weight obs), assign code
2376+
# `0` as a harmless placeholder — the group's IF
2377+
# mass is zero, so the bootstrap multiplier it
2378+
# receives is irrelevant on either the legacy or
2379+
# the cell-level path. Always populate the dict
2380+
# so the legacy group-level path keeps clustering
2381+
# correctly when psu_varies=False even if some
2382+
# eligible groups happen to have no positive-
2383+
# weight obs.
23692384
group_psu_labels: List[int] = []
2370-
valid_map = True
23712385
for i in range(n_elig_boot):
23722386
row = psu_codes_per_cell[i]
23732387
valid = row[row >= 0]
23742388
if valid.size == 0:
2375-
valid_map = False
2376-
break
2377-
group_psu_labels.append(int(valid[0]))
2378-
if (
2379-
valid_map
2380-
and len(group_psu_labels) == n_groups_for_overall_var
2381-
):
2382-
group_id_to_psu_code_bootstrap = {
2383-
gid: code
2384-
for gid, code in zip(
2385-
_eligible_group_ids, group_psu_labels
2386-
)
2387-
}
2388-
eligible_group_ids_bootstrap = np.asarray(
2389-
_eligible_group_ids
2389+
group_psu_labels.append(0)
2390+
else:
2391+
group_psu_labels.append(int(valid[0]))
2392+
group_id_to_psu_code_bootstrap = {
2393+
gid: code
2394+
for gid, code in zip(
2395+
_eligible_group_ids, group_psu_labels
23902396
)
2391-
psu_codes_per_cell_bootstrap = psu_codes_per_cell
2397+
}
2398+
eligible_group_ids_bootstrap = np.asarray(
2399+
_eligible_group_ids
2400+
)
23922401

23932402
br = self._compute_dcdh_bootstrap(
23942403
n_groups_for_overall=n_groups_for_overall_var,

tests/test_survey_dcdh.py

Lines changed: 148 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2001,3 +2001,151 @@ def test_bootstrap_cell_level_with_all_zero_weight_group_does_not_crash(self):
20012001
# Bootstrap SE should be finite (zero-weight group does not
20022002
# disturb the other groups' contributions).
20032003
assert np.isfinite(res.bootstrap_results.overall_se)
2004+
2005+
def test_bootstrap_zero_weight_group_equivalent_to_removing_it(self):
2006+
"""Fixture A: 9 groups (1 all-zero-weighted + 8 positive).
2007+
Fixture B: 8 groups (same panel without the zero-weight
2008+
group). Under the fix, an eligible group that has no
2009+
positive-weight cells contributes nothing to the bootstrap
2010+
(its `psu_codes_per_cell` row is all sentinel). Both fits
2011+
therefore produce byte-identical bootstrap SE at the same
2012+
seed. Without the fix, the `valid_map` gate in fit() would
2013+
disable the entire PSU-aware path when any row is all
2014+
sentinel, silently dropping to unclustered group-level for
2015+
the other groups.
2016+
"""
2017+
def _make(include_zero_group: bool) -> pd.DataFrame:
2018+
rows = []
2019+
n_groups = 9 if include_zero_group else 8
2020+
for g in range(n_groups):
2021+
f = 3 if g < 4 else None
2022+
for t in range(5):
2023+
pw = 0.0 if (include_zero_group and g == 8) else 1.0
2024+
d = 1 if (f is not None and t >= f) else 0
2025+
y = float(g) + 0.1 * t + 1.0 * d
2026+
rows.append({
2027+
"group": int(g),
2028+
"period": int(t),
2029+
"treatment": int(d),
2030+
"outcome": y,
2031+
"pw": pw,
2032+
"psu": int(g), # PSU=group, constant path
2033+
})
2034+
return pd.DataFrame(rows)
2035+
2036+
sd = SurveyDesign(weights="pw", psu="psu")
2037+
res_a = ChaisemartinDHaultfoeuille(n_bootstrap=200, seed=7).fit(
2038+
_make(include_zero_group=True),
2039+
outcome="outcome", group="group",
2040+
time="period", treatment="treatment",
2041+
survey_design=sd,
2042+
)
2043+
res_b = ChaisemartinDHaultfoeuille(n_bootstrap=200, seed=7).fit(
2044+
_make(include_zero_group=False),
2045+
outcome="outcome", group="group",
2046+
time="period", treatment="treatment",
2047+
survey_design=sd,
2048+
)
2049+
assert res_a.bootstrap_results is not None
2050+
assert res_b.bootstrap_results is not None
2051+
se_a = float(res_a.bootstrap_results.overall_se)
2052+
se_b = float(res_b.bootstrap_results.overall_se)
2053+
assert np.isfinite(se_a) and np.isfinite(se_b)
2054+
assert se_a == pytest.approx(se_b, rel=0.0, abs=1e-15), (
2055+
f"Bootstrap SE must match when a zero-weight eligible "
2056+
f"group is added (fix P0 #1 — no silent dropback to "
2057+
f"unclustered group-level). Got SE_with_zero={se_a!r}, "
2058+
f"SE_without_zero={se_b!r}."
2059+
)
2060+
2061+
def test_bootstrap_dense_codes_under_singleton_baseline_excluded_group(self):
2062+
"""Regression for P0 #2: when a group is singleton-baseline-
2063+
excluded (e.g., an always-treated group whose baseline D=1
2064+
has no peer), its PSU label must NOT pollute the dense code
2065+
factorization used by `_compute_dcdh_bootstrap`. Otherwise
2066+
eligible groups that share a PSU receive gapped dense codes
2067+
(e.g., `[1, 1]`), `_generate_psu_or_group_weights` computes
2068+
`n_psu = max + 1 = 2 == n_groups_target = 2`, and the
2069+
identity fast path wrongly triggers — giving those eligible
2070+
groups independent multiplier draws instead of a shared
2071+
one. Assertion: instrument the call to capture the
2072+
`group_id_to_psu_code` dict actually passed and confirm its
2073+
values form a contiguous range `[0, n_unique - 1]`.
2074+
"""
2075+
# Fixture: one always-treated group (D=1 at period 0 → singleton-
2076+
# baseline-excluded), plus eligible groups that share a PSU
2077+
# label while the excluded group has a different PSU.
2078+
rows = []
2079+
for g in range(5):
2080+
for t in range(5):
2081+
if g == 0:
2082+
d = 1 # always-treated; baseline D=1 singleton
2083+
psu = 100 # distinct PSU for the excluded group
2084+
else:
2085+
d = 1 if t >= 3 else 0 # joiners at period 3
2086+
# Groups 1, 2 share PSU=200; groups 3, 4 share PSU=300.
2087+
psu = 200 if g in (1, 2) else 300
2088+
rows.append({
2089+
"group": int(g),
2090+
"period": int(t),
2091+
"treatment": int(d),
2092+
"outcome": float(g) + 0.1 * t + 0.5 * d,
2093+
"pw": 1.0,
2094+
"psu": psu,
2095+
})
2096+
df_ = pd.DataFrame(rows)
2097+
sd = SurveyDesign(weights="pw", psu="psu")
2098+
2099+
captured: dict = {}
2100+
2101+
est = ChaisemartinDHaultfoeuille(n_bootstrap=50, seed=1)
2102+
original_bootstrap = est._compute_dcdh_bootstrap
2103+
2104+
def _spy(**kwargs):
2105+
captured["group_id_to_psu_code"] = kwargs.get(
2106+
"group_id_to_psu_code"
2107+
)
2108+
return original_bootstrap(**kwargs)
2109+
2110+
est._compute_dcdh_bootstrap = _spy # type: ignore[method-assign]
2111+
2112+
import warnings as _w
2113+
with _w.catch_warnings():
2114+
_w.simplefilter("ignore") # singleton-baseline warning
2115+
est.fit(
2116+
df_, outcome="outcome", group="group",
2117+
time="period", treatment="treatment",
2118+
survey_design=sd,
2119+
)
2120+
2121+
dict_passed = captured["group_id_to_psu_code"]
2122+
assert dict_passed is not None, (
2123+
"bootstrap received group_id_to_psu_code=None — the "
2124+
"PSU-aware path was disabled instead of routing to the "
2125+
"cell/legacy path via densified codes."
2126+
)
2127+
codes = sorted(set(dict_passed.values()))
2128+
# Eligible groups share only two PSUs (200 for g=1,2;
2129+
# 300 for g=3,4). Dense codes must be [0, 1], NOT [1, 2]
2130+
# (which would happen if the excluded g=0's PSU=100 were
2131+
# dense-coded first).
2132+
assert codes == list(range(len(codes))), (
2133+
f"group_id_to_psu_code values must be contiguous "
2134+
f"dense codes starting at 0, got {codes}. A non-"
2135+
f"contiguous range signals the excluded group's PSU "
2136+
f"polluted the dense factorization (P0 #2 regression)."
2137+
)
2138+
# Sanity: eligible groups 1, 2 must share a code (PSU=200),
2139+
# and eligible groups 3, 4 must share a code (PSU=300).
2140+
assert dict_passed[1] == dict_passed[2], (
2141+
"Groups 1 and 2 share PSU=200 and must receive the same "
2142+
"dense code under correct densification."
2143+
)
2144+
assert dict_passed[3] == dict_passed[4], (
2145+
"Groups 3 and 4 share PSU=300 and must receive the same "
2146+
"dense code."
2147+
)
2148+
assert dict_passed[1] != dict_passed[3], (
2149+
"Groups in PSU=200 and PSU=300 must receive distinct "
2150+
"dense codes."
2151+
)

0 commit comments

Comments
 (0)