
Commit 2ac6605

igerber and claude committed
Address code review feedback for PR #102
- Fix misleading "smaller SEs" wording in Tutorial 02 Cell 43 → Changed to "lower t-statistics" which is more accurate - Update SA reference period notation throughout docs → e=-1 → e=-1-anticipation (defaults to e=-1 when anticipation=0) - Update REGISTRY.md SunAbraham section with correct reference period - Add meaningful assertions to pre-period test → Test now requires pre-periods exist (not vacuous) → Assert CS(universal) closer to SA than CS(varying) for pre-periods Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 17e4aa5 commit 2ac6605

3 files changed: 29 additions & 25 deletions

docs/methodology/REGISTRY.md — 2 additions & 2 deletions

@@ -231,7 +231,7 @@ Aggregations:
   - Matches R `did::att_gt()` base_period parameter
 - Base period interaction with Sun-Abraham comparison:
   - CS with `base_period="varying"` produces different pre-treatment estimates than SA
-  - This is expected: CS uses consecutive comparisons, SA uses fixed reference (e=-1)
+  - This is expected: CS uses consecutive comparisons, SA uses fixed reference (e=-1-anticipation)
   - Use `base_period="universal"` for methodologically comparable pre-treatment effects
   - Post-treatment effects match regardless of base_period setting
 - Control group with `control_group="not_yet_treated"`:
@@ -262,7 +262,7 @@ Aggregations:
 *Assumption checks / warnings:*
 - Requires never-treated units as control group
 - Warns if treatment effects may be heterogeneous across cohorts (which the method handles)
-- Reference period must be specified (default: e=-1)
+- Reference period: e=-1-anticipation (defaults to e=-1 when anticipation=0)

 *Estimator equation (as implemented):*
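The base-period distinction described in the registry entry above can be illustrated with a toy calculation. This is a sketch of the arithmetic only, not the library's internals; the outcome values are made up:

```python
# Toy mean outcomes for one cohort at event times e = -3..-1 (made-up numbers)
outcomes = {-3: 0.10, -2: 0.25, -1: 0.45}

# base_period="varying": each pre-period coefficient is a one-period change
# (period t minus period t-1)
varying = {e: outcomes[e] - outcomes[e - 1] for e in (-2, -1)}

# base_period="universal" (analogous to SA's fixed reference at e = -1):
# each coefficient is the cumulative deviation from the single reference period
universal = {e: outcomes[e] - outcomes[-1] for e in (-3, -2)}

print(varying)    # small one-period increments
print(universal)  # cumulative deviations from e = -1
```

The varying-base coefficients are increments (0.15, 0.20 here) while the universal-base coefficients are cumulative gaps from the reference (-0.35, -0.20), which is why the two settings can disagree on pre-period magnitude and significance while agreeing post-treatment.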

docs/tutorials/02_staggered_did.ipynb — 2 additions & 2 deletions

@@ -810,7 +810,7 @@
 {
  "cell_type": "markdown",
  "metadata": {},
- "source": "## 14. Comparing CS and SA as a Robustness Check\n\nRunning both estimators provides a useful robustness check. When they agree, results are more credible.\n\n### Understanding Pre-Period Differences\n\nYou may notice that **post-treatment effects align closely** between CS and SA, but **pre-treatment effects can differ in magnitude and significance**. This is expected methodological behavior, not a bug.\n\n**Why the difference?**\n\n1. **Callaway-Sant'Anna with `base_period=\"varying\"` (default)**:\n - Pre-treatment effects use **consecutive period comparisons** (period t vs period t-1)\n - Each pre-period coefficient represents a one-period change\n - Smaller changes → typically smaller SEs → may not reach significance\n\n2. **Sun-Abraham**:\n - Uses a **fixed reference period** (e=-1, the period just before treatment)\n - All coefficients are deviations from this single reference\n - Pre-period coefficients show cumulative difference from the reference\n\n**To make CS pre-periods more comparable to SA**, use `base_period=\"universal\"`:\n\n```python\ncs_universal = CallawaySantAnna(base_period=\"universal\")\n```\n\nThis makes CS compare all periods to g-1 (like SA), producing more similar pre-treatment estimates."
+ "source": "## 14. Comparing CS and SA as a Robustness Check\n\nRunning both estimators provides a useful robustness check. When they agree, results are more credible.\n\n### Understanding Pre-Period Differences\n\nYou may notice that **post-treatment effects align closely** between CS and SA, but **pre-treatment effects can differ in magnitude and significance**. This is expected methodological behavior, not a bug.\n\n**Why the difference?**\n\n1. **Callaway-Sant'Anna with `base_period=\"varying\"` (default)**:\n - Pre-treatment effects use **consecutive period comparisons** (period t vs period t-1)\n - Each pre-period coefficient represents a one-period change\n - These smaller incremental changes often yield lower t-statistics\n\n2. **Sun-Abraham**:\n - Uses a **fixed reference period** (e=-1 when anticipation=0, or e=-1-anticipation otherwise)\n - All coefficients are deviations from this single reference\n - Pre-period coefficients show cumulative difference from the reference\n\n**To make CS pre-periods more comparable to SA**, use `base_period=\"universal\"`:\n\n```python\ncs_universal = CallawaySantAnna(base_period=\"universal\")\n```\n\nThis makes CS compare all periods to g-1 (like SA), producing more similar pre-treatment estimates."
 },
 {
  "cell_type": "code",
@@ -822,7 +822,7 @@
 {
  "cell_type": "markdown",
  "metadata": {},
- "source": "## Summary\n\nKey takeaways:\n\n1. **TWFE can be biased** with staggered adoption and heterogeneous effects\n2. **Goodman-Bacon decomposition** reveals *why* TWFE fails by showing:\n - The implicit 2x2 comparisons and their weights\n - How much weight falls on \"forbidden comparisons\" (already-treated as controls)\n3. **Callaway-Sant'Anna** properly handles staggered adoption by:\n - Computing group-time specific effects ATT(g,t)\n - Only using valid comparison groups\n - Properly aggregating effects\n4. **Sun-Abraham** provides an alternative approach using:\n - Interaction-weighted regression with cohort x relative-time indicators\n - Different weighting scheme than CS\n - More efficient under homogeneous effects\n5. **Run both CS and SA** as a robustness check—when they agree, results are more credible\n6. **Aggregation options**:\n - `\"simple\"`: Overall ATT\n - `\"group\"`: ATT by cohort\n - `\"event\"`: ATT by event time (for event-study plots)\n7. **Bootstrap inference** provides valid standard errors and confidence intervals:\n - Use `n_bootstrap` parameter to enable multiplier bootstrap\n - Choose weight type: `'rademacher'`, `'mammen'`, or `'webb'`\n - Bootstrap results include SEs, CIs, and p-values for all aggregations\n8. **Pre-treatment effects** provide parallel trends diagnostics:\n - Use `base_period=\"varying\"` for consecutive period comparisons\n - Pre-treatment ATT(g,t) should be near zero\n - 95% CIs including zero is consistent with parallel trends\n - See Tutorial 07 for pre-trends power analysis (Roth 2022)\n9. **Control group choices** affect efficiency and assumptions:\n - `\"never_treated\"`: Stronger parallel trends assumption\n - `\"not_yet_treated\"`: Weaker assumption, uses more data\n10. **CS vs SA pre-period differences are expected**:\n - Post-treatment effects should be similar (robustness check)\n - Pre-treatment effects differ due to base period methodology\n - CS (varying): consecutive comparisons → one-period changes\n - SA: fixed reference (e=-1) → cumulative deviations\n - Use `base_period=\"universal\"` in CS for comparable pre-periods\n\nFor more details, see:\n- Callaway, B., & Sant'Anna, P. H. (2021). Difference-in-differences with multiple time periods. *Journal of Econometrics*.\n- Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. *Journal of Econometrics*.\n- Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. *Journal of Econometrics*."
+ "source": "## Summary\n\nKey takeaways:\n\n1. **TWFE can be biased** with staggered adoption and heterogeneous effects\n2. **Goodman-Bacon decomposition** reveals *why* TWFE fails by showing:\n - The implicit 2x2 comparisons and their weights\n - How much weight falls on \"forbidden comparisons\" (already-treated as controls)\n3. **Callaway-Sant'Anna** properly handles staggered adoption by:\n - Computing group-time specific effects ATT(g,t)\n - Only using valid comparison groups\n - Properly aggregating effects\n4. **Sun-Abraham** provides an alternative approach using:\n - Interaction-weighted regression with cohort x relative-time indicators\n - Different weighting scheme than CS\n - More efficient under homogeneous effects\n5. **Run both CS and SA** as a robustness check—when they agree, results are more credible\n6. **Aggregation options**:\n - `\"simple\"`: Overall ATT\n - `\"group\"`: ATT by cohort\n - `\"event\"`: ATT by event time (for event-study plots)\n7. **Bootstrap inference** provides valid standard errors and confidence intervals:\n - Use `n_bootstrap` parameter to enable multiplier bootstrap\n - Choose weight type: `'rademacher'`, `'mammen'`, or `'webb'`\n - Bootstrap results include SEs, CIs, and p-values for all aggregations\n8. **Pre-treatment effects** provide parallel trends diagnostics:\n - Use `base_period=\"varying\"` for consecutive period comparisons\n - Pre-treatment ATT(g,t) should be near zero\n - 95% CIs including zero is consistent with parallel trends\n - See Tutorial 07 for pre-trends power analysis (Roth 2022)\n9. **Control group choices** affect efficiency and assumptions:\n - `\"never_treated\"`: Stronger parallel trends assumption\n - `\"not_yet_treated\"`: Weaker assumption, uses more data\n10. **CS vs SA pre-period differences are expected**:\n - Post-treatment effects should be similar (robustness check)\n - Pre-treatment effects differ due to base period methodology\n - CS (varying): consecutive comparisons → one-period changes\n - SA: fixed reference (e=-1-anticipation) → cumulative deviations\n - Use `base_period=\"universal\"` in CS for comparable pre-periods\n\nFor more details, see:\n- Callaway, B., & Sant'Anna, P. H. (2021). Difference-in-differences with multiple time periods. *Journal of Econometrics*.\n- Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. *Journal of Econometrics*.\n- Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. *Journal of Econometrics*."
 },
 ],
 "metadata": {

tests/test_sun_abraham.py — 25 additions & 21 deletions

@@ -601,27 +601,31 @@ def test_pre_period_difference_expected_between_cs_sa(self):
             f"SA={sa_eff:.4f}, CS(univ)={cs_univ_eff:.4f}"
         )

-        # For pre-treatment periods, CS(universal) should be closer to SA than CS(varying)
-        # because both SA and CS(universal) use a fixed reference period
-        if len(pre_times) > 0:
-            total_diff_varying = 0.0
-            total_diff_universal = 0.0
-            for t in pre_times:
-                sa_eff = sa_results.event_study_effects[t]["effect"]
-                cs_vary_eff = cs_varying_results.event_study_effects[t]["effect"]
-                cs_univ_eff = cs_universal_results.event_study_effects[t]["effect"]
-
-                total_diff_varying += abs(sa_eff - cs_vary_eff)
-                total_diff_universal += abs(sa_eff - cs_univ_eff)
-
-            # CS(universal) should generally be closer to SA than CS(varying)
-            # for pre-treatment periods (due to similar reference period approach)
-            # Note: This is a soft assertion - in some data configurations
-            # the relationship may not hold perfectly due to weighting differences
-            # The key point is that the methodological difference exists
-            assert (
-                len(pre_times) > 0
-            ), "Test requires pre-treatment periods to verify methodology difference"
+        # Require pre-periods exist for this test to be meaningful
+        assert len(pre_times) > 0, (
+            "Test requires pre-treatment periods to validate methodology difference. "
+            "Increase n_periods or adjust cohort timing in test data."
+        )
+
+        # Compute total absolute differences
+        total_diff_varying = 0.0
+        total_diff_universal = 0.0
+        for t in pre_times:
+            sa_eff = sa_results.event_study_effects[t]["effect"]
+            cs_vary_eff = cs_varying_results.event_study_effects[t]["effect"]
+            cs_univ_eff = cs_universal_results.event_study_effects[t]["effect"]
+
+            total_diff_varying += abs(sa_eff - cs_vary_eff)
+            total_diff_universal += abs(sa_eff - cs_univ_eff)
+
+        # CS(universal) should generally be closer to SA than CS(varying)
+        # for pre-treatment periods (due to similar reference period approach)
+        # Allow some tolerance since weighting schemes still differ
+        assert total_diff_universal <= total_diff_varying + 0.5, (
+            f"Expected CS(universal) to be closer to SA than CS(varying) for pre-periods. "
+            f"Got: CS(univ)-SA diff={total_diff_universal:.4f}, "
+            f"CS(vary)-SA diff={total_diff_varying:.4f}"
+        )

     def test_agreement_under_homogeneous_effects(self):
         """Test that SA and CS agree under homogeneous treatment effects."""
