Skip to content

Commit 328dc33

Browse files
authored
Merge pull request #330 from igerber/fix/flaky-timing-tests
Exclude flaky wall-clock timing tests from default CI
2 parents ac181b7 + 888b5d1 commit 328dc33

2 files changed

Lines changed: 16 additions & 2 deletions

File tree

TODO.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -98,6 +98,7 @@ Deferred items from PR reviews that were not addressed before merge.
9898
| Sphinx autodoc fails to import 3 result members: `DiDResults.ci`, `MultiPeriodDiDResults.att`, `CallawaySantAnnaResults.aggregate` — investigate whether these are renamed/removed or just unresolvable from autosummary template | `docs/api/results.rst`, `docs/api/staggered.rst` || Medium |
9999
| `EDiDBootstrapResults` cross-reference is ambiguous — class is exported from both `diff_diff` and `diff_diff.efficient_did_bootstrap`, producing 3 "more than one target found" warnings. Add `:noindex:` to one source or use full-path refs | `diff_diff/efficient_did_results.py`, `docs/api/efficient_did.rst` || Low |
100100
| Tracked Sphinx autosummary stubs in `docs/api/_autosummary/*.rst` are stale — every sphinx build regenerates them with new attributes (e.g., `coef_var`, `survey_metadata`) that have been added to result classes. Either commit a refresh or move the directory to `.gitignore` and treat as build output. Also 6 untracked stubs exist for newer estimators (`WooldridgeDiD`, `SimulationMDEResults`, etc.) that have never been committed. | `docs/api/_autosummary/` || Low |
101+
| HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` || Low |
101102

102103
---
103104

tests/test_se_accuracy.py

Lines changed: 15 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -252,12 +252,18 @@ def test_se_vs_r_benchmark(self):
252252
assert se_diff_pct < 0.01, \
253253
f"SE differs from R by {se_diff_pct:.4f}%, expected <0.01%"
254254

255+
@pytest.mark.slow
255256
def test_timing_performance(self, cs_results):
256257
"""
257258
Ensure estimation timing doesn't regress.
258259
259260
Baseline: ~0.005s for 200 units x 8 periods (small scale)
260-
Threshold: <0.1s (20x margin for CI variance)
261+
Threshold: <0.1s.
262+
263+
Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time
264+
on shared runners is noisy (BLAS path variation, neighbor VM
265+
contention, cold caches) and produces false positives. Run locally
266+
with ``pytest -m slow`` for ad-hoc performance sanity checks.
261267
"""
262268
_, elapsed = cs_results
263269

@@ -398,8 +404,15 @@ def test_influence_function_normalization(self):
398404
f"Python SE {se_py:.4f} doesn't match standard {se_standard:.4f}"
399405

400406

407+
@pytest.mark.slow
401408
class TestPerformanceRegression:
402-
"""Tests to prevent performance regression."""
409+
"""Tests to prevent performance regression.
410+
411+
Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time on
412+
shared runners is noisy (BLAS path variation, neighbor VM contention,
413+
cold caches) and produces false positives. Run locally with
414+
``pytest -m slow`` for ad-hoc performance sanity checks.
415+
"""
403416

404417
@pytest.mark.parametrize("n_units,max_time", [
405418
(100, 0.15), # Small: <150ms (CI runners need headroom)

0 commit comments

Comments
 (0)