
Commit d4a3dc1

igerber and claude committed
Fix profile_panel() binary detection for degenerate panels
The treatment classifier required exactly two observed distinct values
to treat a column as binary. Panels that are entirely never-treated
(values = {0}) or entirely always-treated (values = {1}) were falling
through to "continuous", contradicting the documented taxonomy which
defines "binary_absorbing" as "values in {0, 1}".

Rule is now values_set <= {0, 1, 0.0, 1.0} with at least one observed
value; entirely-NaN treatment columns fall through to "categorical"
rather than "continuous". Docstring and the autonomous guide section §2
reference are updated to match.

New regression tests:
- all-zero treatment panel -> binary_absorbing, has_never_treated
- all-one treatment panel -> binary_absorbing, has_always_treated
- binary with NaNs, only zeros observed -> binary_absorbing
- all-NaN treatment -> categorical
- top-level import surface (profile_panel / PanelProfile / Alert)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
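To make the behaviour change concrete, here is a minimal sketch that compares the old and new value rules on the degenerate cases named in the message. It is not the library code: `observed_values`, `classify_old`, and `classify_new` are hypothetical helpers written for illustration, they work on plain Python lists rather than pandas columns, and the label "binary" stands in for the binary family before the absorbing / non-absorbing split is decided.

```python
import math

BINARY_VALUES = {0, 1}  # 0.0 and 1.0 compare equal to 0 and 1, so they are covered


def observed_values(column):
    """Distinct non-NaN values of a column (hypothetical helper)."""
    return {v for v in column if not (isinstance(v, float) and math.isnan(v))}


def classify_old(column):
    """Old rule: exactly two distinct values, both in {0, 1}."""
    values = observed_values(column)
    if len(values) == 2 and values <= BINARY_VALUES:
        return "binary"
    return "continuous"


def classify_new(column):
    """New rule: at least one observed value, all of them in {0, 1}."""
    values = observed_values(column)
    if not values:
        return "categorical"  # entirely NaN: no information to classify on
    if values <= BINARY_VALUES:
        return "binary"
    return "continuous"


nan = float("nan")
cases = {
    "all zero": [0, 0, 0],
    "all one": [1, 1, 1],
    "zeros with NaNs": [0, nan, 0],
    "all NaN": [nan, nan],
    "two values outside {0, 1}": [0, 2],
}
for name, column in cases.items():
    print(f"{name}: {classify_old(column)} -> {classify_new(column)}")
```

Under these assumptions the first three cases move from "continuous" to "binary", the all-NaN case moves from "continuous" to "categorical", and the last case stays "continuous".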
1 parent c672230 commit d4a3dc1

3 files changed

Lines changed: 100 additions & 21 deletions


diff_diff/guides/llms-autonomous.txt

Lines changed: 17 additions & 12 deletions
@@ -74,19 +74,24 @@ view. Every field below appears as a top-level key in that dict.
 
 - **`treatment_type: str`** - classification of the treatment column.
   Exactly one of:
-  - `"binary_absorbing"`: numeric with values in {0, 1}; each unit's
-    treatment sequence (ordered by `time`) is weakly monotone
-    non-decreasing. The canonical DiD setting.
-  - `"binary_non_absorbing"`: values in {0, 1} but at least one unit
-    switches from 1 back to 0. Only `ChaisemartinDHaultfoeuille` handles
-    this natively; the other absorbing-only estimators would misapply.
-  - `"continuous"`: numeric with more than two distinct values (e.g., a
-    dose, a discrete-integer partial-adoption score). Use
+  - `"binary_absorbing"`: observed non-NaN values are a subset of
+    {0, 1} (one or two distinct values, covering all-zero and all-one
+    panels as valid degenerate cases) and each unit's treatment
+    sequence (ordered by `time`) is weakly monotone non-decreasing.
+    The canonical DiD setting.
+  - `"binary_non_absorbing"`: values a subset of {0, 1} with at least
+    two distinct values observed, where at least one unit switches
+    from 1 back to 0. Only `ChaisemartinDHaultfoeuille` handles this
+    natively; the other absorbing-only estimators would misapply.
+  - `"continuous"`: numeric with more than two distinct values, or a
+    two-valued numeric column whose values are not in {0, 1} (e.g.,
+    a dose, a discrete-integer partial-adoption score). Use
     `ContinuousDiD` or `HeterogeneousAdoptionDiD`.
-  - `"categorical"`: non-numeric dtype (object / category) or bool dtype.
-    Often indicates a treatment arm. Encode each arm as a binary
-    indicator and fit separately, or use a multi-treatment workflow
-    outside the current estimator suite.
+  - `"categorical"`: non-numeric dtype (object / category), a bool
+    dtype column, or a column that is entirely NaN. Often indicates
+    a treatment arm. Encode each arm as a binary indicator and fit
+    separately, or use a multi-treatment workflow outside the
+    current estimator suite.
 - **`is_staggered: bool`** - true iff treatment is `binary_absorbing` and
   at least two distinct first-treatment periods are observed. Drives the
   choice between classic DiD/TWFE and staggered-robust estimators.
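The "weakly monotone non-decreasing" condition and the `is_staggered` definition in the guide text can be read as the following sketch. This is an illustration under assumptions, not the code in `diff_diff/profile.py`: the function names are hypothetical, rows are plain `(unit, time, treatment)` tuples with treatment already known to be 0 or 1 and no missing values, and an always-treated unit is simply given its first observed period as its first-treatment period, which may differ from how the library builds cohorts.

```python
from collections import defaultdict


def is_absorbing(rows):
    """True if no unit ever switches from 1 back to 0 (time-ordered)."""
    by_unit = defaultdict(list)
    for unit, time, treatment in rows:
        by_unit[unit].append((time, treatment))
    for sequence in by_unit.values():
        ordered = [treatment for _, treatment in sorted(sequence)]
        # Weakly monotone non-decreasing: each value >= the one before it.
        if any(later < earlier for earlier, later in zip(ordered, ordered[1:])):
            return False
    return True


def first_treatment_periods(rows):
    """Set of first periods in which some unit is observed treated."""
    firsts = {}
    for unit, time, treatment in rows:
        if treatment == 1:
            firsts[unit] = min(time, firsts.get(unit, time))
    return set(firsts.values())


def is_staggered(rows):
    """Absorbing treatment with at least two distinct first-treatment periods."""
    return is_absorbing(rows) and len(first_treatment_periods(rows)) >= 2


staggered = [("a", 0, 0), ("a", 1, 1), ("a", 2, 1),
             ("b", 0, 0), ("b", 1, 0), ("b", 2, 1)]
switcher = [("a", 0, 0), ("a", 1, 1), ("a", 2, 0)]
all_zero = [("a", 0, 0), ("a", 1, 0)]
```

Here `staggered` is absorbing with first-treatment periods {1, 2}, `switcher` fails the monotonicity check, and `all_zero` is absorbing but has no first-treatment period, which is the degenerate case this commit now classifies as binary.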

diff_diff/profile.py

Lines changed: 19 additions & 9 deletions
@@ -172,15 +172,23 @@ def profile_panel(
     -----
     Classification rules for ``treatment_type``:
 
-    - ``"binary_absorbing"``: numeric treatment taking values in :math:`\\{0, 1\\}`
-      where each unit's treatment sequence (ordered by ``time``) is weakly
-      monotone non-decreasing.
-    - ``"binary_non_absorbing"``: values in :math:`\\{0, 1\\}` but at least
-      one unit switches from 1 back to 0.
+    - ``"binary_absorbing"``: numeric treatment whose observed non-NaN
+      values are a subset of :math:`\\{0, 1\\}` (one or two distinct
+      values) AND each unit's treatment sequence (ordered by ``time``)
+      is weakly monotone non-decreasing. All-zero and all-one panels
+      are valid degenerate cases.
+    - ``"binary_non_absorbing"``: values a subset of :math:`\\{0, 1\\}`
+      with at least two distinct values observed, where at least one
+      unit switches from 1 back to 0.
     - ``"continuous"``: numeric treatment with more than two distinct
-      values (matches the ``ContinuousDiD`` convention).
-    - ``"categorical"``: non-numeric dtype (object / category) or a
-      boolean-dtype column.
+      values, or a 2-valued numeric whose values are not in
+      :math:`\\{0, 1\\}` (matches the ``ContinuousDiD`` convention).
+    - ``"categorical"``: non-numeric dtype (object / category), a
+      boolean-dtype column, or a column that is entirely NaN.
+
+    Boolean-dtype columns are intentionally classified as
+    ``"categorical"``; cast to ``int`` if you want binary-treatment
+    profiling.
 
     The profile does not recommend an estimator. Consult
     ``diff_diff.get_llm_guide("autonomous")`` for the estimator-support
@@ -297,7 +305,9 @@ def _classify_treatment(
     distinct = col.dropna().unique()
     n_distinct = len(distinct)
     values_set = set(distinct.tolist())
-    is_binary_valued = n_distinct == 2 and values_set <= {0, 1, 0.0, 1.0}
+    if n_distinct == 0:
+        return ("categorical", False, {}, False, False, None, None)
+    is_binary_valued = values_set <= {0, 1, 0.0, 1.0}
 
     if not is_binary_valued:
         return ("continuous", False, {}, False, False, None, None)
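Two facts about Python sets explain why the `n_distinct == 0` guard has to come before the subset test in this hunk. The snippet below only exercises built-in set behaviour; the name `BINARY_VALUES` is introduced here for illustration and does not appear in the diff.

```python
BINARY_VALUES = {0, 1, 0.0, 1.0}

# 0 == 0.0 and 1 == 1.0 (with equal hashes), so the literal holds only two
# elements; listing both spellings is a readability choice.
n_elements = len(BINARY_VALUES)

# The empty set is a subset of every set. Once the `n_distinct == 2` clause
# is dropped, an all-NaN column (empty values_set) would pass the subset
# test unless it is handled first.
empty_passes = set() <= BINARY_VALUES

# Single-valued and two-valued columns inside {0, 1} pass; anything else fails.
all_zero_passes = {0.0} <= BINARY_VALUES
all_one_passes = {1} <= BINARY_VALUES
dose_passes = {0, 2} <= BINARY_VALUES

print(n_elements, empty_passes, all_zero_passes, all_one_passes, dose_passes)
# prints: 2 True True True False
```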

tests/test_profile_panel.py

Lines changed: 64 additions & 0 deletions
@@ -252,3 +252,67 @@ def test_alert_dataclass_is_frozen():
     a = Alert(code="x", severity="info", message="m", observed=None)
     with pytest.raises(dataclasses.FrozenInstanceError):
         a.code = "y"  # type: ignore[misc]
+
+
+def test_all_zero_treatment_is_binary_absorbing():
+    """Degenerate binary: no unit is ever treated. Must classify as binary,
+    not continuous, so the documented taxonomy matches the implementation."""
+    df = _make_panel(n_units=20, periods=range(0, 4), first_treat=None)
+    profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y")
+    assert profile.treatment_type == "binary_absorbing"
+    assert profile.has_never_treated is True
+    assert profile.has_always_treated is False
+    assert profile.cohort_sizes == {}
+    assert profile.n_cohorts == 0
+
+
+def test_all_one_treatment_is_binary_absorbing_always_treated():
+    """Degenerate binary: every unit treated in every period. Must classify as
+    binary_absorbing with has_always_treated=True."""
+    rows = []
+    for u in range(1, 21):
+        for t in range(4):
+            rows.append({"u": u, "t": t, "tr": 1, "y": float(u) + 0.1 * t})
+    df = pd.DataFrame(rows)
+    profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y")
+    assert profile.treatment_type == "binary_absorbing"
+    assert profile.has_never_treated is False
+    assert profile.has_always_treated is True
+    codes = _alert_codes(profile)
+    assert "has_always_treated_units" in codes
+
+
+def test_binary_with_nans_only_zeros_observed_is_binary():
+    """Binary panel with some NaNs and only 0 observed among non-NaN values —
+    still classify as binary, not continuous."""
+    rows = []
+    for u in range(1, 11):
+        for t in range(4):
+            tr = 0 if (u + t) % 2 == 0 else np.nan
+            rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t})
+    df = pd.DataFrame(rows)
+    profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y")
+    assert profile.treatment_type == "binary_absorbing"
+
+
+def test_all_nan_treatment_is_categorical():
+    """Treatment column entirely NaN — classify as categorical (no info)."""
+    rows = []
+    for u in range(1, 11):
+        for t in range(4):
+            rows.append({"u": u, "t": t, "tr": np.nan, "y": float(u) + 0.1 * t})
+    df = pd.DataFrame(rows)
+    profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y")
+    assert profile.treatment_type == "categorical"
+
+
+def test_top_level_import_surface():
+    """profile_panel, PanelProfile, and Alert must be importable from the
+    top-level namespace so `help(diff_diff)` points at real symbols."""
+    import diff_diff
+
+    assert callable(diff_diff.profile_panel)
+    assert diff_diff.PanelProfile.__name__ == "PanelProfile"
+    assert diff_diff.Alert.__name__ == "Alert"
+    for name in ("profile_panel", "PanelProfile", "Alert"):
+        assert name in diff_diff.__all__, f"{name} missing from __all__"
