Skip to content

[#1075 sub-issue] /baseline --tests periodic pass — cross-session test-baseline snapshot #1100

@akaszubski

Description

@akaszubski

Summary

Third sub-issue of three for umbrella #1075. Build a periodic-aggregation pass that produces a machine-readable test-baseline snapshot across sessions — which tests have been failing for how long, with cross-session trend visibility.

Blocked by

#1098 (docs sweep — first variant). Build that first.

Context

Pre-existing test failures get re-verified each pipeline by narrative comparison ("implementer says these 230 are pre-existing"). No machine-readable baseline aggregates across sessions, so cross-session trends are lost: a test that has been failing for 12 sessions looks the same as a test that started failing today.

This session's #1086 pipeline surfaced the gap directly: the STEP 1 baseline-failing-tests capture timed out at 120s and silently produced a 0-byte file (filed as #1094 by CIA). The fix-forward classification relied on manual git stash verification at reviewer stage. A periodic baseline pass would have a known-good snapshot to compare against — and would obviate the per-pipeline STEP 1 capture entirely.

Implementation Approach

  • Periodic command (e.g., /baseline --tests) that runs pytest --tb=no -q against the full suite with generous timeout.
  • Persists results to a versioned artifact (e.g., .claude/local/test_baseline.json) with: test ID → first-failed-at-commit + last-seen-at-commit + current status + consecutive-failures count.
  • Subsequent /implement pipelines read this artifact at STEP 1 instead of running their own baseline capture.
  • Idempotent — re-running with no test changes produces an updated snapshot but no spurious diff.

Test Scenarios

  1. Initial run captures full pass/fail snapshot to artifact path.
  2. Subsequent run with no changes produces identical artifact contents (idempotent).
  3. /implement STEP 1 consumes this artifact instead of running its own baseline capture.
  4. Tests fixed since last snapshot are correctly classified as "fixed".
  5. New failures since last snapshot are correctly classified as "new" — implement gate blocks on these.
  6. Cross-session signal: a test that has been failing for N sessions shows a consecutive-failures count of N.

Acceptance Criteria

  • /baseline --tests (or equivalent) runs the full pytest suite with generous timeout.
  • Persists results to versioned artifact with test ID, first-failed-at-commit, last-seen-at-commit, status, consecutive-failures count.
  • Subsequent /implement pipelines consume this artifact instead of running baseline capture.
  • Idempotent on a clean repo.
  • Structural test for the artifact path.
  • Documented in docs/COMMANDS.md (or equivalent) with usage example.

Relation to umbrella

Plugin Version: 3.50.0 (8035385)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions