Merge pull request #330 from igerber/fix/flaky-timing-tests

igerber · web-flow · commit 328dc33d28d6 · 2026-04-19T09:29:16.000-04:00
Exclude flaky wall-clock timing tests from default CI
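
The TODO entry added in this change suggests replacing the wall-clock proxy in `test_m0_short_circuit` with a direct signal. A minimal sketch of the spy approach follows, using `unittest.mock.Mock`. `compute_bounds` and its `optimizer` parameter are hypothetical stand-ins invented for illustration; they are not the diff_diff API, and the real short-circuit condition may differ.

```python
from unittest.mock import Mock


def compute_bounds(m, estimate, optimizer):
    """Hypothetical stand-in for the code under test (not the diff_diff API)."""
    if m == 0:
        # Toy short-circuit: return without ever consulting the optimizer.
        return (estimate, estimate)
    return optimizer(m, estimate)


def test_m0_short_circuit():
    # The spy records calls, so the test asserts on the code path directly
    # instead of inferring it from elapsed time.
    spy = Mock(return_value=(-1.0, 1.0))
    assert compute_bounds(0, 0.3, spy) == (0.3, 0.3)
    spy.assert_not_called()


def test_nonzero_m_uses_optimizer():
    # Guard against a vacuous pass: the spy must be reachable on the slow path.
    spy = Mock(return_value=(-1.0, 1.0))
    assert compute_bounds(0.5, 0.3, spy) == (-1.0, 1.0)
    spy.assert_called_once_with(0.5, 0.3)
```

The second test matters as much as the first: without it, `assert_not_called()` would also pass if the spy were never wired in at all.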
diff --git a/TODO.md b/TODO.md
@@ -98,6 +98,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | Sphinx autodoc fails to import 3 result members: `DiDResults.ci`, `MultiPeriodDiDResults.att`, `CallawaySantAnnaResults.aggregate` — investigate whether these are renamed/removed or just unresolvable from autosummary template | `docs/api/results.rst`, `docs/api/staggered.rst` | — | Medium |
 | `EDiDBootstrapResults` cross-reference is ambiguous — class is exported from both `diff_diff` and `diff_diff.efficient_did_bootstrap`, producing 3 "more than one target found" warnings. Add `:noindex:` to one source or use full-path refs | `diff_diff/efficient_did_results.py`, `docs/api/efficient_did.rst` | — | Low |
 | Tracked Sphinx autosummary stubs in `docs/api/_autosummary/*.rst` are stale — every sphinx build regenerates them with new attributes (e.g., `coef_var`, `survey_metadata`) that have been added to result classes. Either commit a refresh or move the directory to `.gitignore` and treat as build output. Also 6 untracked stubs exist for newer estimators (`WooldridgeDiD`, `SimulationMDEResults`, etc.) that have never been committed. | `docs/api/_autosummary/` | — | Low |
+| HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "the short-circuit path was taken and the full optimizer was not called". Replace it with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but resting a correctness assertion on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` | — | Low |
 
 ---
 
diff --git a/tests/test_se_accuracy.py b/tests/test_se_accuracy.py
@@ -252,12 +252,18 @@ def test_se_vs_r_benchmark(self):
         assert se_diff_pct < 0.01, \
             f"SE differs from R by {se_diff_pct:.4f}%, expected <0.01%"
 
+    @pytest.mark.slow
     def test_timing_performance(self, cs_results):
         """
         Ensure estimation timing doesn't regress.
 
         Baseline: ~0.005s for 200 units x 8 periods (small scale)
-        Threshold: <0.1s (20x margin for CI variance)
+        Threshold: <0.1s.
+
+        Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time
+        on shared runners is noisy (BLAS path variation, neighbor VM
+        contention, cold caches) and produces false positives. Run locally
+        with ``pytest -m slow`` for ad-hoc performance sanity checks.
         """
         _, elapsed = cs_results
 
@@ -398,8 +404,15 @@ def test_influence_function_normalization(self):
             f"Python SE {se_py:.4f} doesn't match standard {se_standard:.4f}"
 
 
+@pytest.mark.slow
 class TestPerformanceRegression:
-    """Tests to prevent performance regression."""
+    """Tests to prevent performance regression.
+
+    Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time on
+    shared runners is noisy (BLAS path variation, neighbor VM contention,
+    cold caches) and produces false positives. Run locally with
+    ``pytest -m slow`` for ad-hoc performance sanity checks.
+    """
 
     @pytest.mark.parametrize("n_units,max_time", [
         (100, 0.15),   # Small: <150ms (CI runners need headroom)