Summary
Third sub-issue of three for umbrella #1075. Build a periodic-aggregation pass that produces a machine-readable test-baseline snapshot across sessions — which tests have been failing for how long, with cross-session trend visibility.
Blocked by
#1098 (docs sweep — first variant). Build that first.
Context
Pre-existing test failures get re-verified each pipeline by narrative comparison ("implementer says these 230 are pre-existing"). No machine-readable baseline aggregates across sessions, so cross-session trends are lost: a test that has been failing for 12 sessions looks the same as a test that started failing today.
This session's #1086 pipeline surfaced the gap directly: the STEP 1 baseline-failing-tests capture timed out at 120s and silently produced a 0-byte file (filed as #1094 by CIA). The fix-forward classification relied on manual git stash verification at reviewer stage. A periodic baseline pass would have a known-good snapshot to compare against — and would obviate the per-pipeline STEP 1 capture entirely.
Implementation Approach
- Periodic command (e.g.,
/baseline --tests) that runs pytest --tb=no -q against the full suite with generous timeout.
- Persists results to a versioned artifact (e.g.,
.claude/local/test_baseline.json) with: test ID → first-failed-at-commit + last-seen-at-commit + current status + consecutive-failures count.
- Subsequent
/implement pipelines read this artifact at STEP 1 instead of running their own baseline capture.
- Idempotent — re-running with no test changes produces an updated snapshot but no spurious diff.
Test Scenarios
- Initial run captures full pass/fail snapshot to artifact path.
- Subsequent run with no changes produces identical artifact contents (idempotent).
/implement STEP 1 consumes this artifact instead of running its own baseline capture.
- Tests fixed since last snapshot are correctly classified as "fixed".
- New failures since last snapshot are correctly classified as "new" — implement gate blocks on these.
- Cross-session signal: a test that has been failing for N sessions shows a consecutive-failures count of N.
Acceptance Criteria
Relation to umbrella
Plugin Version: 3.50.0 (8035385)
Summary
Third sub-issue of three for umbrella #1075. Build a periodic-aggregation pass that produces a machine-readable test-baseline snapshot across sessions — which tests have been failing for how long, with cross-session trend visibility.
Blocked by
#1098 (docs sweep — first variant). Build that first.
Context
Pre-existing test failures get re-verified each pipeline by narrative comparison ("implementer says these 230 are pre-existing"). No machine-readable baseline aggregates across sessions, so cross-session trends are lost: a test that has been failing for 12 sessions looks the same as a test that started failing today.
This session's #1086 pipeline surfaced the gap directly: the STEP 1 baseline-failing-tests capture timed out at 120s and silently produced a 0-byte file (filed as #1094 by CIA). The fix-forward classification relied on manual
git stashverification at reviewer stage. A periodic baseline pass would have a known-good snapshot to compare against — and would obviate the per-pipeline STEP 1 capture entirely.Implementation Approach
/baseline --tests) that runspytest --tb=no -qagainst the full suite with generous timeout..claude/local/test_baseline.json) with: test ID → first-failed-at-commit + last-seen-at-commit + current status + consecutive-failures count./implementpipelines read this artifact at STEP 1 instead of running their own baseline capture.Test Scenarios
/implementSTEP 1 consumes this artifact instead of running its own baseline capture.Acceptance Criteria
/baseline --tests(or equivalent) runs the full pytest suite with generous timeout./implementpipelines consume this artifact instead of running baseline capture.docs/COMMANDS.md(or equivalent) with usage example.Relation to umbrella
Plugin Version: 3.50.0 (8035385)