[FEATURE] Improve integration test retry strategies for cost and speed efficiency #642

## Problem Statement

Our integration tests currently use a blanket `retry: 1` at the vitest project level for both the `integ-node` and `integ-browser` configs (see `vitest.config.ts`). This was added to handle transient failures from model providers, but it's a coarse approach:

- Every failed test gets retried regardless of the failure reason (transient API error vs. genuine bug).
- Retries against model providers cost real money and add wall-clock time.
- We have no visibility into *which* tests are flaky and *why* they fail transiently, making it hard to fix root causes.

**Example transient failure:** [CI run with timeout failures](https://github.com/strands-agents/sdk-typescript/actions/runs/22961445357/job/66652956758)

## Proposed Solution

Investigate and improve our retry strategy across a few dimensions:

1. **Root cause analysis** — Audit recent CI runs to identify which integration tests fail transiently and why (throttling, timeouts, model non-determinism, cold starts, etc.). Categorize the failure modes.

2. **Granular retry configuration** — Instead of a global `retry: 1`, consider:
   - Per-test or per-file retry annotations (vitest supports `retry` in `test()` options) for tests known to hit transient issues.
   - Different retry counts based on failure type (e.g., retry on throttling/timeout but not on assertion failures).
   - Custom retry logic using `onTestFailed` or vitest hooks to inspect the error before deciding to retry.

3. **Reduce the need for retries** — Where possible, make tests more resilient:
   - Add request-level retries with backoff in the test setup/fixtures for API calls (separate from test-level retries).
   - Use more deterministic prompts or lower temperature settings in tests to reduce model non-determinism.
   - Add appropriate waits or rate limiting between concurrent tests to avoid throttling.

4. **Observability** — Track retry frequency and reasons in CI artifacts so we can measure improvement over time.
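As a starting point for discussion, items 2 to 4 could be combined in a small fixture helper along the following lines. This is a minimal sketch, not an existing SDK or vitest API: the names `classifyFailure`, `withBackoff`, `RetryOptions`, and `onRetry` are hypothetical, and the error-message patterns are assumptions that would need to be replaced with whatever the root cause analysis in item 1 actually finds. It retries only failures classified as throttling or timeouts, backs off exponentially, and reports each retry through a callback so the reason can be recorded.

```typescript
// Hypothetical helper for test fixtures; none of these names exist in the SDK or vitest.
type FailureKind = 'throttling' | 'timeout' | 'other';

// Classify an error by its message. The patterns are assumptions; real provider
// errors may expose status codes or error names that are more reliable.
function classifyFailure(error: unknown): FailureKind {
  const message = error instanceof Error ? error.message : String(error);
  if (/throttl|rate limit|429/i.test(message)) return 'throttling';
  if (/timed? ?out/i.test(message)) return 'timeout';
  return 'other';
}

interface RetryOptions {
  // Total number of attempts, including the first one.
  maxAttempts: number;
  // Delay before the first retry; doubled for each later retry.
  baseDelayMs: number;
  // Optional observer, e.g. to log retry reasons for a CI artifact.
  onRetry?: (kind: FailureKind, attempt: number) => void;
}

// Run `call`, retrying only on failures classified as transient.
// Anything classified as 'other' (e.g. an assertion failure) is rethrown immediately.
async function withBackoff<T>(call: () => Promise<T>, options: RetryOptions): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (error) {
      const kind = classifyFailure(error);
      if (kind === 'other' || attempt >= options.maxAttempts) throw error;
      if (options.onRetry) options.onRetry(kind, attempt);
      const delayMs = options.baseDelayMs * Math.pow(2, attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A test would wrap only the provider call in `withBackoff` and keep its assertions outside the wrapper, so a genuine assertion failure is never retried and never triggers a second paid request.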

## Use Case

This impacts every CI run on every PR. Fewer unnecessary retries mean:
- Lower model API costs in CI.
- Faster CI feedback loops for contributors.
- Better signal-to-noise ratio — when a test fails after retries, it's more likely a real issue.
- Clearer understanding of our test suite's reliability.

## Alternative Solutions

- **Keep the status quo** — Global `retry: 1` is simple and catches most transient issues, but we pay the cost/speed penalty on every flaky run without learning from it.
- **Remove retries entirely** — Would surface all transient failures but would make CI unreliable and block PRs on flaky tests.
- **Move to a separate flaky-test quarantine** — Mark known-flaky tests and run them in a separate non-blocking job. This isolates the problem but doesn't fix it.

## Additional Context

Current config in `vitest.config.ts` — both integ projects use `retry: 1`:

```ts
// integ-node
{
  test: {
    include: ['test/integ/**/*.test.ts', 'test/integ/**/*.test.node.ts'],
    name: { label: 'integ-node', color: 'magenta' },
    testTimeout: 60 * 1000,
    retry: 1,
    sequence: { concurrent: true },
  },
}

// integ-browser
{
  test: {
    include: ['test/integ/**/*.test.ts', 'test/integ/**/*.test.browser.ts'],
    name: { label: 'integ-browser', color: 'yellow' },
    testTimeout: 60 * 1000,
    retry: 1,
    sequence: { concurrent: true },
  },
}
```
