Use dedicated simultaneous-treatment dataset for MultiPeriodDiD tutorial

igerber · claude · igerber · commit 768ceea53950 · 2026-01-20T09:35:28.000-05:00
Instead of explaining rank deficiency behavior, create a clean dataset
using generate_did_data() with simultaneous treatment timing. This shows
MultiPeriodDiD working as intended without warnings or edge cases.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
diff --git a/docs/tutorials/02_staggered_did.ipynb b/docs/tutorials/02_staggered_did.ipynb
@@ -671,7 +671,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 11. Comparing with MultiPeriodDiD\n\nFor comparison, here's how you would use `MultiPeriodDiD` which estimates period-specific effects. \n\n**Important**: `MultiPeriodDiD` assumes **simultaneous treatment timing** (all treated units get treated at the same time). For staggered adoption, always use `CallawaySantAnna` or `SunAbraham` instead.\n\nTo demonstrate `MultiPeriodDiD` properly, we'll filter to a single treatment cohort (cohort 3) plus never-treated units.\n\n**Note on rank deficiency**: `MultiPeriodDiD` creates a design matrix with period dummies and treatment interactions. Depending on the data structure, this can create linear dependencies. When this happens, the solver will:\n- Emit a warning listing the dropped columns\n- Set coefficients of dropped columns to NA\n- Compute valid estimates for the remaining (identified) parameters\n\nThis R-style handling ensures you get useful results while being warned about the structural issue."
+   "source": "## 11. Comparing with MultiPeriodDiD\n\nFor comparison, here's how you would use `MultiPeriodDiD` which estimates period-specific effects. \n\n**Important**: `MultiPeriodDiD` assumes **simultaneous treatment timing** (all treated units get treated at the same time). For staggered adoption, always use `CallawaySantAnna` or `SunAbraham` instead.\n\nTo demonstrate `MultiPeriodDiD` properly, we'll create a simple dataset where all treated units receive treatment at the same time."
   },
   {
    "cell_type": "code",
@@ -685,7 +685,7 @@
     }
    },
    "outputs": [],
-   "source": "# Filter to cohort 3 only (single treatment timing) plus never-treated\n# This is the appropriate data structure for MultiPeriodDiD\ncohort3_df = df[df['cohort'].isin([0, 3])].copy()\n\nmp_did = MultiPeriodDiD()\nresults_mp = mp_did.fit(\n    cohort3_df,\n    outcome=\"outcome\",\n    treatment=\"treated\",\n    time=\"period\",\n    post_periods=[3, 4, 5, 6, 7]\n)\n\nprint(results_mp.summary())"
+   "source": "# Create a simple dataset with simultaneous treatment timing\n# This is the appropriate data structure for MultiPeriodDiD\nfrom diff_diff import generate_did_data\n\n# Generate data with simultaneous treatment at period 4\nmp_data = generate_did_data(\n    n_units=100,\n    n_periods=8,\n    treatment_period=4,  # All treated units get treatment at period 4\n    treatment_fraction=0.5,\n    treatment_effect=2.5,\n    seed=42\n)\n\nprint(f\"MultiPeriodDiD dataset: {len(mp_data)} obs\")\nprint(f\"Treatment starts at period 4 for all treated units\")\n\nmp_did = MultiPeriodDiD()\nresults_mp = mp_did.fit(\n    mp_data,\n    outcome=\"outcome\",\n    treatment=\"treated\",\n    time=\"period\",\n    post_periods=[4, 5, 6, 7]\n)\n\nprint(results_mp.summary())"
   },
   {
    "cell_type": "code",
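
Note: the tutorial text in this change says `MultiPeriodDiD` assumes simultaneous treatment timing. A minimal sketch of how a reader might check that property on their own panel before fitting is given below. It is not part of this commit. It assumes the `cohort` column convention visible in the removed cell (first treatment period per unit, with 0 marking never-treated units). The helper `has_simultaneous_timing` and the toy data are hypothetical and use only pandas.

```python
import pandas as pd


def has_simultaneous_timing(
    panel: pd.DataFrame,
    cohort_col: str = "cohort",   # assumed column name, as in the removed cell
    never_treated: int = 0,       # assumed code for never-treated units
) -> bool:
    """Return True when all treated units share one first-treatment period."""
    treated_cohorts = panel.loc[panel[cohort_col] != never_treated, cohort_col]
    # Exactly one distinct cohort among treated rows means simultaneous timing.
    return treated_cohorts.nunique() == 1


# Toy panel: unit 1 is never treated, unit 2 starts at period 3, unit 3 at period 5.
staggered = pd.DataFrame(
    {
        "unit": [1, 1, 2, 2, 3, 3],
        "period": [0, 1, 0, 1, 0, 1],
        "cohort": [0, 0, 3, 3, 5, 5],
    }
)

# Keeping a single cohort plus never-treated units (the old cell's approach)
# yields a panel with simultaneous timing.
simultaneous = staggered[staggered["cohort"].isin([0, 3])]

print(has_simultaneous_timing(staggered))
print(has_simultaneous_timing(simultaneous))
```

A check like this only confirms the timing structure. It says nothing about the rank-deficiency warnings that the old markdown cell described.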
