Hotfix Soniox fallback keys for staging#3220

Open
isaiahb wants to merge 4 commits into staging from codex/soniox-fallback-hotfix-staging

Conversation

isaiahb commented Jun 21, 2026

Contributor

Summary

Backports the Soniox fallback key hotfix to staging for legacy Cloud v1.

Why

When the primary Soniox org/key hits account-level limits, every new transcription or translation stream fails, because each one goes through the same exhausted credential. This branch mirrors the main hotfix so staging has the same operational behavior.

What changed

  • Adds SONIOX_FALLBACK_API_KEYS support for transcription and translation.
  • Uses a shared Soniox key-pool helper with failure classification and cooldowns.
  • Keeps SONIOX_API_KEY as primary while healthy.
  • Logs only safe key fingerprints.
  • Includes issue docs under cloud/issues/108-soniox-fallback-api-keys/.

Validation

  • bun test cloud/packages/cloud/src/services/session/soniox/__tests__/SonioxKeyPool.test.ts
  • git diff --check HEAD~1..HEAD
  • cd cloud && bun run build
  • cd cloud/packages/cloud && bun run build

Note

Medium Risk
Touches live transcription/translation stream creation and retry paths; misconfigured fallbacks or aggressive cooldowns could delay or fail streams, but behavior is additive behind env config and mirrors an existing main hotfix.

Overview
Adds SONIOX_FALLBACK_API_KEYS (comma-separated) alongside SONIOX_API_KEY so Cloud v1 can open Soniox streams on alternate credentials when the primary key hits limits or errors.
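
As an illustration of the configuration described above, here is a minimal sketch of turning the two environment variables into an ordered credential list. The helper name `parseSonioxCredentials`, the `SonioxCredential` shape, and the `primary` / `fallback-N` ids are assumptions for illustration; only the variable names and the comma-separated format come from this PR.

```typescript
// Hypothetical credential shape; the real one in the PR may differ.
interface SonioxCredential {
  id: string; // illustrative label: "primary" or "fallback-<position>"
  apiKey: string;
}

// Builds an ordered list: primary first, then fallbacks in env order.
function parseSonioxCredentials(
  env: Record<string, string | undefined>,
): SonioxCredential[] {
  const credentials: SonioxCredential[] = [];
  const seen = new Set<string>();

  const add = (id: string, rawKey: string | undefined): void => {
    const apiKey = rawKey?.trim();
    if (!apiKey || seen.has(apiKey)) return; // skip blanks and duplicates
    seen.add(apiKey);
    credentials.push({ id, apiKey });
  };

  add("primary", env["SONIOX_API_KEY"]);
  // Fallback ids use the 1-based position within the comma-separated list.
  (env["SONIOX_FALLBACK_API_KEYS"] ?? "")
    .split(",")
    .forEach((key, index) => add(`fallback-${index + 1}`, key));

  return credentials;
}
```

Dropping duplicates keeps a fallback that repeats the primary key from looking like extra capacity.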

Introduces SonioxKeyPool: prefers primary, round-robins healthy fallbacks, classifies failures (concurrency, rate limit, quota, auth, transient) with per-key cooldowns, and logs SHA-256 fingerprints only. Cooldown state is shared in-process across transcription and translation via getSharedSonioxKeyPool.
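
The pool behavior described above can be sketched roughly as follows. This is not the PR's `SonioxKeyPool`: the class name `KeyPoolSketch`, its method names, the injected clock, and every cooldown duration are assumptions chosen for illustration. The no-op `recordSuccess` reflects the statement elsewhere in this conversation that a later success does not clear an active cooldown.

```typescript
import { createHash } from "node:crypto";

type FailureKind = "concurrency" | "rate_limit" | "quota" | "auth" | "transient";

// Assumed durations; the PR does not state the real values.
const COOLDOWN_MS: Record<FailureKind, number> = {
  concurrency: 15_000,
  rate_limit: 30_000,
  quota: 3_600_000,
  auth: Number.POSITIVE_INFINITY, // effectively disabled for the process
  transient: 5_000,
};

// Safe-to-log identifier: a short prefix of the key's SHA-256 digest.
function fingerprint(apiKey: string): string {
  return createHash("sha256").update(apiKey).digest("hex").slice(0, 12);
}

class KeyPoolSketch {
  private readonly primary: string;
  private readonly fallbacks: string[];
  private readonly now: () => number;
  private readonly cooldownUntil = new Map<string, number>();
  private nextFallback = 0;

  constructor(primary: string, fallbacks: string[], now: () => number = Date.now) {
    this.primary = primary;
    this.fallbacks = fallbacks;
    this.now = now;
  }

  private isHealthy(key: string): boolean {
    return (this.cooldownUntil.get(key) ?? 0) <= this.now();
  }

  // Prefer the primary while healthy, otherwise round-robin healthy fallbacks.
  acquire(): string | undefined {
    if (this.isHealthy(this.primary)) return this.primary;
    for (let i = 0; i < this.fallbacks.length; i++) {
      const index = (this.nextFallback + i) % this.fallbacks.length;
      const key = this.fallbacks[index];
      if (key !== undefined && this.isHealthy(key)) {
        this.nextFallback = (index + 1) % this.fallbacks.length;
        return key;
      }
    }
    return undefined; // every key is cooling down
  }

  recordFailure(key: string, kind: FailureKind): void {
    const until = this.now() + COOLDOWN_MS[kind];
    // Never shorten a cooldown that is already in place.
    this.cooldownUntil.set(key, Math.max(until, this.cooldownUntil.get(key) ?? 0));
    console.warn(`soniox key ${fingerprint(key)} cooling down (${kind})`);
  }

  // Intentionally a no-op: a success does not clear an active cooldown.
  recordSuccess(_key: string): void {}
}
```

Injecting the clock keeps the cooldown logic testable without real waiting.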

Transcription and translation providers loop credentials on stream creation, record failures back to the pool from SDK/WebSocket streams (credentialId), and managers retry when all keys are cooling down or on quota/concurrency/rate-limit errors so a new stream can pick another key.
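
The credential loop on stream creation might look something like the sketch below. The `CredentialPool` interface and the `openWithFallback` function are hypothetical stand-ins, not the PR's actual provider code; the sketch only shows the try, record, move-on pattern described above.

```typescript
interface Credential {
  id: string;
  apiKey: string;
}

// Hypothetical pool surface; the real SonioxKeyPool API may differ.
interface CredentialPool {
  acquire(excluded: ReadonlySet<string>): Credential | undefined;
  recordFailure(id: string, error: unknown): void;
  recordSuccess(id: string): void;
}

// Tries each available credential once until a stream opens.
async function openWithFallback<T>(
  pool: CredentialPool,
  open: (credential: Credential) => Promise<T>,
): Promise<T> {
  const tried = new Set<string>();
  let lastError: unknown;

  while (true) {
    const credential = pool.acquire(tried);
    if (!credential) break; // nothing left that is healthy and untried
    tried.add(credential.id);
    try {
      const stream = await open(credential);
      pool.recordSuccess(credential.id);
      return stream;
    } catch (error) {
      lastError = error;
      pool.recordFailure(credential.id, error); // pool decides the cooldown
    }
  }

  throw lastError ?? new Error("all Soniox credentials are cooling down");
}
```

Because only the last error is thrown here, a manager-level retry policy sees just that final failure.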

Default Soniox model bumps from stt-rt-v4 to stt-rt-v5 in config and stream defaults. Issue docs and unit tests for the key pool are included.

Reviewed by Cursor Bugbot for commit ac44ca0. Bugbot is set up for automated code reviews on this repo.


Summary by cubic

Backports Soniox fallback API keys to staging for Cloud v1, adding automatic key failover with shared cooldowns so transcription and translation keep working when the primary key hits limits. Also defaults the Soniox model to stt-rt-v5 and tightens quota cooldown handling to prevent premature reuse.

  • New Features

    • Adds SONIOX_FALLBACK_API_KEYS (comma‑separated) alongside SONIOX_API_KEY.
    • Introduces SonioxKeyPool: prefers primary, round‑robins healthy fallbacks, classifies failures (concurrency/rate/quota/auth/transient) with per‑key cooldowns; quota cooldowns are long and a later success does not clear an active cooldown; cooldowns shared across providers in‑process; logs SHA‑256 key fingerprints only.
    • Providers select credentials per stream; SDK/WebSocket streams report credentialId failures back to the pool; one @soniox/node client per credential; managers retry on capacity errors and when no credentials are temporarily available; includes unit tests.
  • Migration

    • Set SONIOX_FALLBACK_API_KEYS in the staging env and redeploy.
    • No breaking changes; SONIOX_API_KEY remains primary.

Written for commit ac44ca0. Summary will update on new commits.


@isaiahb isaiahb requested a review from a team as a code owner June 21, 2026 01:53
@github-actions

📋 PR Review Helper

📱 Mobile App Build

Waiting for build...

🕶️ ASG Client Build

Waiting for build...


🔀 Test Locally

gh pr checkout 3220

);
return true;
}

Auth errors block fallback retry

Medium Severity

After this change, live Soniox transcription streams call recordCredentialFailure, which can disable or cool down the key that was in use. isRetryableError still returns false for Soniox 401/403, so the manager does not schedule a new stream that would run the provider’s credential loop and select a fallback key. Capacity-related failures were updated to retry for that reason, but auth-style failures on an already-open stream leave fallback keys unused.

Reviewed by Cursor Bugbot for commit ec26e83.

Contributor Author

Leaving this open intentionally for follow-up. The ASAP prod hotfix keeps auth-style runtime errors non-retryable and focuses on the 402/budget quota fallback path that took prod down.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec26e83126

this.logger,
this.config,
throw (
lastError ??

P2 Badge Preserve retryable credential failures after fallback auth errors

When the preferred credential fails with a retryable capacity condition (for example a concurrent-stream limit) and a later fallback credential is misconfigured or invalid, this throws only the last fallback error. Because stream creation failures reach handleStreamError with no current provider, isRetryableError sees the final 401/403-style Soniox error and gives up, so the stream is never retried when the primary key’s cooldown expires. Consider surfacing a pool-level retryable error whenever any attempted credential is cooling down instead of letting a later permanent fallback error mask it.
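
One way the suggestion above could be sketched: track each attempt's outcome and, if any attempted credential failed retryably, throw a pool-level error instead of the last per-credential error. The names `AttemptOutcome`, `SonioxPoolRetryableError`, and `selectErrorToThrow` are hypothetical and do not appear in the PR.

```typescript
// Hypothetical record of one credential attempt during stream creation.
interface AttemptOutcome {
  credentialId: string;
  retryable: boolean; // e.g. concurrency, rate-limit, or quota cooldown
  error: Error;
}

// Hypothetical pool-level error a manager could treat as retryable.
class SonioxPoolRetryableError extends Error {
  readonly attempts: AttemptOutcome[];

  constructor(attempts: AttemptOutcome[]) {
    super("Soniox credentials temporarily unavailable; retry after cooldown");
    this.name = "SonioxPoolRetryableError";
    this.attempts = attempts;
  }
}

// Picks the error to surface once every credential has been tried.
function selectErrorToThrow(attempts: AttemptOutcome[]): Error {
  // A retryable failure on any credential wins over a later permanent one,
  // so an invalid fallback cannot mask a primary that is only cooling down.
  if (attempts.some((attempt) => attempt.retryable)) {
    return new SonioxPoolRetryableError(attempts);
  }
  const last = attempts[attempts.length - 1];
  return last ? last.error : new Error("no Soniox credentials configured");
}
```

A retry check such as isRetryableError could then test for the pool-level error by name before inspecting status codes.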

Contributor Author

Leaving this open intentionally for follow-up. This is a real edge case, but I kept the outage hotfix narrow after product guidance: 402/budget cooldown plus shared fallback key rotation.

lower.includes("unauthorized")
) {
return { kind: "auth", cooldownMs: Number.POSITIVE_INFINITY, disabled: true };
}

403 not treated as auth

Medium Severity

classifySonioxCredentialFailure disables credentials only for HTTP 401, while stream retry logic in TranscriptionManager also treats Soniox 403 as non-retryable. A Soniox error 403 response gets a short transient cooldown instead of being disabled, so the key pool can select that credential again after cooldown despite authorization-style failure.
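
For reference, a follow-up that treats 403 like 401 might look like the sketch below. It is not the PR's `classifySonioxCredentialFailure`: the function name, its parameters, the `"forbidden"` substring check, and the 5-second transient cooldown are all assumptions. Only the shape of the returned auth classification mirrors the snippet under review.

```typescript
interface Classification {
  kind: "auth" | "transient";
  cooldownMs: number;
  disabled: boolean;
}

// Hypothetical classifier covering only the auth-versus-transient decision.
function classifyAuthStyleFailure(
  status: number | undefined,
  message: string,
): Classification {
  const lower = message.toLowerCase();
  const isAuth =
    status === 401 ||
    status === 403 || // assumption: 403 handled the same way as 401
    lower.includes("unauthorized") ||
    lower.includes("forbidden"); // assumption: not confirmed by the PR

  if (isAuth) {
    return { kind: "auth", cooldownMs: Number.POSITIVE_INFINITY, disabled: true };
  }
  // Assumed short cooldown for anything not recognized as auth.
  return { kind: "transient", cooldownMs: 5_000, disabled: false };
}
```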

Reviewed by Cursor Bugbot for commit 5a3d149.

Contributor Author

Leaving this open intentionally for follow-up. We are not broadening 403 into process-disable behavior in the ASAP hotfix.

isaiahb commented Jun 21, 2026

Contributor Author

Related Soniox fallback PRs:

Current prod/main hotfix head: e06781b.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit ac44ca0.

stream = this.createStreamForCredential(credential, language, options);
await stream.initialize();
this.keyPool.recordSuccess(credential.id);
return stream;

Pool success before stream ready

Medium Severity

createTranscriptionStream and createTranslationStream call keyPool.recordSuccess as soon as initialize() resolves, but TranscriptionManager and TranslationManager still run waitForStreamReady afterward. When initialization hangs (e.g. SDK connect() returns while state stays INITIALIZING), the manager throws a timeout without recordFailure, so the pool keeps treating that credential as healthy and retries the primary instead of fallbacks.
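
The ordering issue described above can be illustrated with a sketch that records success only after the stream confirms readiness, and records a failure if either step throws. `ReadyStream`, `waitForReady`, `OutcomePool`, and `openAndConfirm` are hypothetical names; in the PR the readiness wait lives in the managers (waitForStreamReady), not on the stream.

```typescript
// Hypothetical stream surface with an explicit readiness wait.
interface ReadyStream {
  initialize(): Promise<void>;
  waitForReady(timeoutMs: number): Promise<void>; // rejects on timeout
}

// Hypothetical subset of the pool API used here.
interface OutcomePool {
  recordSuccess(credentialId: string): void;
  recordFailure(credentialId: string, error: unknown): void;
}

async function openAndConfirm(
  pool: OutcomePool,
  credentialId: string,
  stream: ReadyStream,
  readyTimeoutMs: number,
): Promise<void> {
  try {
    await stream.initialize();
    // Success is not recorded until the stream is actually ready.
    await stream.waitForReady(readyTimeoutMs);
  } catch (error) {
    // A hang or timeout now counts against the credential.
    pool.recordFailure(credentialId, error);
    throw error;
  }
  pool.recordSuccess(credentialId);
}
```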

Reviewed by Cursor Bugbot for commit ac44ca0.

Contributor Author

Leaving this open intentionally for follow-up. The latest hotfix prevents a success from clearing an active cooldown, but moving success recording later in the manager is a larger lifecycle change than I want in the prod outage patch.
