[FEATURE] Improve integration test retry strategies for cost and speed efficiency #642

@pgrayy

Problem Statement

Our integration tests currently use a blanket retry: 1 at the vitest project level for both the integ-node and integ-browser configs (see vitest.config.ts). This was added to handle transient failures from model providers, but it's a coarse approach:

  • Every failed test gets retried regardless of the failure reason (transient API error vs. genuine bug).
  • Retries against model providers cost real money and add wall-clock time.
  • We have no visibility into which tests are flaky and why they fail transiently, making it hard to fix root causes.

Example transient failure: CI run with timeout failures

Proposed Solution

Investigate and improve our retry strategy across a few dimensions:

  1. Root cause analysis — Audit recent CI runs to identify which integration tests fail transiently and why (throttling, timeouts, model non-determinism, cold starts, etc.). Categorize the failure modes.

  2. Granular retry configuration — Instead of a global retry: 1, consider:

    • Per-test or per-file retry annotations (Vitest supports retry in test() options) for tests known to hit transient issues.
    • Different retry counts based on failure type (e.g., retry on throttling/timeout but not on assertion failures).
    • Custom retry logic using onTestFailed or other Vitest hooks to inspect the error before deciding to retry.
  3. Reduce the need for retries — Where possible, make tests more resilient:

    • Add request-level retries with backoff in the test setup/fixtures for API calls (separate from test-level retries).
    • Use more deterministic prompts or lower temperature settings in tests to reduce model non-determinism.
    • Add appropriate waits or rate limiting between concurrent tests to avoid throttling.
  4. Observability — Track retry frequency and reasons in CI artifacts so we can measure improvement over time.
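
As a rough illustration of items 2 to 4, the sketch below wraps a provider call in a request-level retry that backs off only on errors classified as transient, and records the reason for each retry. The `statusCode` field, the name/message patterns, and the helpers `classifyTransient`, `withBackoff`, and `retryLog` are all assumptions made for illustration; real provider SDKs may surface throttling and timeouts differently.

```typescript
// Hypothetical error shape: the exact fields a provider SDK exposes are an assumption.
interface ProviderError extends Error {
  statusCode?: number
}

type RetryReason = 'throttled' | 'timeout' | 'server-error'

interface RetryRecord {
  label: string
  attempt: number
  reason: RetryReason
}

// In-memory log of retries; a real setup might dump this to a CI artifact.
const retryLog: RetryRecord[] = []

// Returns a reason when the error looks transient, otherwise undefined.
function classifyTransient(err: unknown): RetryReason | undefined {
  if (!(err instanceof Error)) return undefined
  const code = (err as ProviderError).statusCode
  if (code === 429 || /throttl/i.test(err.name)) return 'throttled'
  if (/timeout|timed out/i.test(`${err.name} ${err.message}`)) return 'timeout'
  if (code !== undefined && code >= 500) return 'server-error'
  return undefined
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Retries `call` up to `maxAttempts` times, but only for transient errors.
async function withBackoff<T>(
  label: string,
  call: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let attempt = 0
  while (true) {
    attempt++
    try {
      return await call()
    } catch (err) {
      const reason = classifyTransient(err)
      // Non-transient errors (e.g. assertion failures) propagate immediately.
      if (reason === undefined || attempt >= maxAttempts) throw err
      retryLog.push({ label, attempt, reason })
      // Exponential backoff with full jitter.
      await sleep(Math.random() * baseDelayMs * 2 ** (attempt - 1))
    }
  }
}
```

A fixture could wrap each model call as `withBackoff('converse', () => client.send(request))`, where `client` and `request` stand in for whatever the test already uses. Because the retry happens at the request level, a transient error no longer re-runs the whole test, and `retryLog` gives a per-run count of retries by reason.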

Use Case

This impacts every CI run on every PR. Fewer unnecessary retries means:

  • Lower model API costs in CI.
  • Faster CI feedback loops for contributors.
  • Better signal-to-noise ratio — when a test fails after retries, it's more likely a real issue.
  • Clearer understanding of our test suite's reliability.

Alternative Solutions

  • Keep the status quo — Global retry: 1 is simple and catches most transient issues, but we pay the cost/speed penalty on every flaky run without learning from it.
  • Remove retries entirely — Would surface all transient failures but would make CI unreliable and block PRs on flaky tests.
  • Move to a separate flaky-test quarantine — Mark known-flaky tests and run them in a separate non-blocking job. This isolates the problem but doesn't fix it.

Additional Context

Current config in vitest.config.ts — both integ projects use retry: 1:

// integ-node
{
  test: {
    include: ['test/integ/**/*.test.ts', 'test/integ/**/*.test.node.ts'],
    name: { label: 'integ-node', color: 'magenta' },
    testTimeout: 60 * 1000,
    retry: 1,
    sequence: { concurrent: true },
  },
}

// integ-browser
{
  test: {
    include: ['test/integ/**/*.test.ts', 'test/integ/**/*.test.browser.ts'],
    name: { label: 'integ-browser', color: 'yellow' },
    testTimeout: 60 * 1000,
    retry: 1,
    sequence: { concurrent: true },
  },
}
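
One possible direction, sketched under assumptions: set `retry: 0` in both projects and opt in per test instead. Recent Vitest versions accept an options object with `retry` and `timeout` as the second argument to `test()`. The `Flakiness` categories, the `integOptions` helper, and the specific counts below are hypothetical placeholders, to be replaced by whatever the CI audit finds.

```typescript
// Hypothetical flakiness categories; the real ones would come from the CI audit.
type Flakiness = 'stable' | 'throttle-prone' | 'slow-model'

// The subset of Vitest's per-test options used in this sketch.
interface IntegTestOptions {
  retry: number
  timeout: number
}

// Matches testTimeout in the current integ projects.
const BASE_TIMEOUT_MS = 60 * 1000

// Maps a category to per-test options; the numbers are illustrative only.
function integOptions(kind: Flakiness): IntegTestOptions {
  switch (kind) {
    case 'stable':
      return { retry: 0, timeout: BASE_TIMEOUT_MS }
    case 'throttle-prone':
      return { retry: 2, timeout: BASE_TIMEOUT_MS }
    case 'slow-model':
      return { retry: 1, timeout: 2 * BASE_TIMEOUT_MS }
  }
}
```

In a test file this might be used as `test('streams a response', integOptions('throttle-prone'), async () => { /* test body */ })`. Unannotated tests would then fail fast, and the annotations themselves double as an inventory of known-flaky tests.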
