Description
Problem Statement
Our integration tests currently use a blanket retry: 1 at the vitest project level for both the integ-node and integ-browser configs (see vitest.config.ts). This was added to handle transient failures from model providers, but it's a coarse approach:
- Every failed test gets retried regardless of the failure reason (transient API error vs. genuine bug).
- Retries against model providers cost real money and add wall-clock time.
- We have no visibility into which tests are flaky and why they fail transiently, making it hard to fix root causes.
Example transient failure: CI run with timeout failures
Proposed Solution
Investigate and improve our retry strategy across a few dimensions:
- Root cause analysis: Audit recent CI runs to identify which integration tests fail transiently and why (throttling, timeouts, model non-determinism, cold starts, etc.). Categorize the failure modes.
- Granular retry configuration: Instead of a global retry: 1, consider:
  - Per-test or per-file retry annotations (vitest supports retry in test() options) for tests known to hit transient issues.
  - Different retry counts based on failure type (e.g., retry on throttling/timeout but not on assertion failures).
  - Custom retry logic using onTestFailed or vitest hooks to inspect the error before deciding to retry.
- Reduce the need for retries: Where possible, make tests more resilient:
  - Add request-level retries with backoff in the test setup/fixtures for API calls (separate from test-level retries).
  - Use more deterministic prompts or lower temperature settings in tests to reduce model non-determinism.
  - Add appropriate waits or rate limiting between concurrent tests to avoid throttling.
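As a rough illustration of the request-level retry idea, here is a minimal sketch of a backoff wrapper that a test fixture could put around a provider call. The names withBackoff, RetryOptions, and isRetryable are hypothetical and do not exist in the repository or in vitest. The injectable sleep is there only so the helper can be exercised without real delays.

```typescript
// Hypothetical helper: a sketch, not code from the repository or from vitest.
type RetryOptions = {
  attempts: number; // total attempts, including the first one (expected >= 1)
  baseDelayMs: number; // delay before the second attempt; doubles after each failure
  isRetryable: (err: unknown) => boolean; // decides whether an error is transient
  sleep?: (ms: number) => Promise<void>; // injectable for tests; defaults to setTimeout
};

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withBackoff<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Give up on the last attempt, or immediately for non-transient errors.
      if (attempt === opts.attempts || !opts.isRetryable(err)) throw err;
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await sleep(opts.baseDelayMs * 2 ** (attempt - 1));
    }
  }
  // Only reached when attempts < 1, in which case fn was never called.
  throw lastError;
}
```

Because the retry happens inside the request, a throttled call can recover without re-running the whole test, and an assertion failure is never retried by this layer.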
- Observability: Track retry frequency and reasons in CI artifacts so we can measure improvement over time.
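To make the categorization and observability items more concrete, the sketch below classifies a failure message into a coarse category, decides whether that category is worth retrying, and tallies categories into a summary that could be written to a CI artifact. The category names, the regular expressions, and the functions classifyFailure, shouldRetry, and summarize are all assumptions for illustration. The real patterns would have to come from the CI audit, and how failure messages are collected (for example from a reporter) is left open.

```typescript
// Hypothetical categories and patterns: placeholders until real CI failures are audited.
type FailureCategory = 'throttling' | 'timeout' | 'assertion' | 'unknown';

function classifyFailure(message: string): FailureCategory {
  const text = message.toLowerCase();
  // Check assertions first so a genuine bug is never mistaken for a transient error.
  if (/assertionerror|expected .* to /.test(text)) return 'assertion';
  if (/throttl|rate limit|too many requests|\b429\b/.test(text)) return 'throttling';
  if (/timed out|timeout|etimedout/.test(text)) return 'timeout';
  return 'unknown';
}

// Only failure modes believed to be transient are worth a retry.
function shouldRetry(category: FailureCategory): boolean {
  return category === 'throttling' || category === 'timeout';
}

// Count failures per category, e.g. to serialize as JSON into a CI artifact.
function summarize(messages: string[]): Record<FailureCategory, number> {
  const counts: Record<FailureCategory, number> = {
    throttling: 0,
    timeout: 0,
    assertion: 0,
    unknown: 0,
  };
  for (const message of messages) {
    counts[classifyFailure(message)] += 1;
  }
  return counts;
}
```

Comparing these counts across CI runs would show whether throttling or timeouts dominate, which in turn suggests where request-level fixes would pay off first.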
Use Case
This impacts every CI run on every PR. Fewer unnecessary retries means:
- Lower model API costs in CI.
- Faster CI feedback loops for contributors.
- Better signal-to-noise ratio — when a test fails after retries, it's more likely a real issue.
- Clearer understanding of our test suite's reliability.
Alternative Solutions
- Keep the status quo: Global retry: 1 is simple and catches most transient issues, but we pay the cost/speed penalty on every flaky run without learning from it.
- Remove retries entirely: Would surface all transient failures but would make CI unreliable and block PRs on flaky tests.
- Move to a separate flaky-test quarantine: Mark known-flaky tests and run them in a separate non-blocking job. This isolates the problem but doesn't fix it.
Additional Context
Current config in vitest.config.ts — both integ projects use retry: 1:
// integ-node
{
  test: {
    include: ['test/integ/**/*.test.ts', 'test/integ/**/*.test.node.ts'],
    name: { label: 'integ-node', color: 'magenta' },
    testTimeout: 60 * 1000,
    retry: 1,
    sequence: { concurrent: true },
  },
}
// integ-browser
{
  test: {
    include: ['test/integ/**/*.test.ts', 'test/integ/**/*.test.browser.ts'],
    name: { label: 'integ-browser', color: 'yellow' },
    testTimeout: 60 * 1000,
    retry: 1,
    sequence: { concurrent: true },
  },
}
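For comparison with the project-level setting above, here is a sketch of what per-test retry annotations might look like. It assumes a recent Vitest version in which test() accepts an options object as its second argument, and it assumes the project-level retry has been lowered to 0 so that only annotated tests are retried. The callModel function is a hypothetical stand-in for a real provider call, not something from the repository.

```typescript
import { expect, test } from 'vitest';

// Hypothetical stand-in for a real model call; it just echoes the prompt.
async function callModel(prompt: string): Promise<string> {
  return `echo: ${prompt}`;
}

// A test that talks to a provider opts in to a single retry.
test('model responds to a prompt', { retry: 1 }, async () => {
  const reply = await callModel('ping');
  expect(reply).toContain('ping');
});

// A deterministic test is never retried, so a failure here is reported at once.
test('formats the prompt', { retry: 0 }, () => {
  expect('ping'.toUpperCase()).toBe('PING');
});
```

With annotations like these, the retry budget lives next to the tests that need it, and reviewers can see which tests are considered flaky.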