Make provider availability snapshot reliable for MCP-heavy providers#1314
Make provider availability snapshot reliable for MCP-heavy providers#1314jms830 wants to merge 1 commit into
Conversation
Two changes to stop MCP-heavy providers (omp/pi with many configured MCP servers, opencode) from spuriously showing 'error'/'unavailable': 1. Raise DEFAULT_REFRESH_TIMEOUT_MS 30s -> 90s. RPC session + model/command enumeration for MCP-heavy providers can exceed 30s. 2. Probe providers sequentially instead of concurrently. Running several MCP-heavy probes at once starves CPU/IO on smaller hosts; the resulting timeouts/crashes were cached and then gated on-demand model fetches. Separate from the OMP provider feature; kept as its own commit.
|
| Filename | Overview |
|---|---|
| packages/server/src/server/agent/provider-snapshot-manager.ts | Sequential probing loop replaces Promise.allSettled and DEFAULT_REFRESH_TIMEOUT_MS is raised from 30s to 90s; logic is correct but worst-case full-refresh time now scales linearly with provider count and no tests cover the new ordering behavior |
Sequence Diagram
sequenceDiagram
participant LPs as loadProviders
participant LP as loadProvider(A)
participant LP2 as loadProvider(B)
participant RP as refreshProvider
Note over LPs: Before: Promise.allSettled (concurrent)
LPs->>LP: start A
LPs->>LP2: start B
LP-->>RP: refreshProvider(A) [up to 30s]
LP2-->>RP: refreshProvider(B) [up to 30s]
RP-->>LP: A resolves/errors
RP-->>LP2: B resolves/errors
LP-->>LPs: settled
LP2-->>LPs: settled
Note over LPs: After: sequential for-of (this PR)
LPs->>LP: await loadProvider(A)
LP-->>RP: refreshProvider(A) [up to 90s]
RP-->>LP: A resolves/errors
LP-->>LPs: .catch() → continue
LPs->>LP2: await loadProvider(B)
LP2-->>RP: refreshProvider(B) [up to 90s]
RP-->>LP2: B resolves/errors
LP2-->>LPs: .catch() → continue
Reviews (1): Last reviewed commit: "Make provider snapshot reliable for MCP-..." | Re-trigger Greptile
| for (const provider of options.providers) { | ||
| await this.loadProvider({ ...options, provider }).catch(() => undefined); | ||
| } |
There was a problem hiding this comment.
Worst-case refresh time scales linearly with provider count
refreshProvider applies this.refreshTimeoutMs twice per provider — once to isAvailable() and again to fetchModels + fetchModes. If both phases hit the ceiling, a single provider consumes up to 2 × 90 s = 3 min. For an N-provider setup in a daemon restart, the sequential loop can now block for up to N × 3 min before any snapshot is considered fresh. Lightweight providers are fine in practice (binary-presence checks return in milliseconds), but if any slow provider is positioned early in options.providers it delays all providers behind it. A small concurrency limit (e.g. 2–3 parallel probes) would bound the latency regression while still resolving the contention problem the PR targets — the PR description mentions this as a ready alternative.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| for (const provider of options.providers) { | ||
| await this.loadProvider({ ...options, provider }).catch(() => undefined); | ||
| } |
There was a problem hiding this comment.
No test coverage for sequential probe ordering
The behavioral change from Promise.allSettled to sequential probing is not exercised by the existing test suite (loadProviders, sequential ordering, and the timeout constant are all absent from the test file). A test with two fake providers — one fast, one slow — could verify that the slow provider's error is isolated and does not prevent the fast provider from resolving to ready, and that the status: "error" snapshot entry is correctly emitted for the timed-out provider. Without this, a regression back to concurrent probing (or a bug in the loop's .catch path) would go undetected.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
What
Makes the provider availability snapshot reliable for MCP-heavy providers. Two changes in
provider-snapshot-manager.ts:Probe providers sequentially instead of concurrently.
loadProviderspreviously didPromise.allSettled(providers.map(...)), so every provider's availability probe ran at once. Providers that start an RPC session and connect their configured MCP servers during the probe (Pi/OMP with many MCP servers, OpenCode) then contend for CPU/IO on smaller hosts, and the resulting timeouts/crashes get cached and gate on-demand model fetches (paseo provider models <id>returns "not available"). Sequential probing removes the contention.Raise
DEFAULT_REFRESH_TIMEOUT_MS30s → 90s. An RPC session + model/command enumeration for an MCP-heavy provider can legitimately exceed 30s on a loaded host.Why
On a host running several MCP-heavy providers, the concurrent refresh made availability flap between
availableanderror/unavailableacross restarts, even though each provider was individually healthy and fast in isolation. Because the cached snapshot status also gates on-demand operations, a provider that lost the probe race became unusable until the next successful refresh. Sequential probing + a more generous budget makes the snapshot deterministic.Observed on a real multi-provider daemon: with concurrent probing, OMP/Pi/OpenCode intermittently showed
error; with these two changes they consistently resolve toavailable.Tradeoff
Sequential probing makes a full refresh take longer (sum of per-provider probe times instead of the max). For the lightweight providers this is negligible (binary-presence checks return in well under a second); the cost is paid only by genuinely slow MCP-heavy providers, which is exactly where reliability matters. A bounded-concurrency pool (e.g. 2) would be a reasonable middle ground if preferred — happy to switch to that.
Scope
Single file,
+15 / −4. No protocol changes, no provider-specific code. Independent of #1177 (OMP provider) — that PR surfaced the issue but does not depend on this one.Testing
@getpaseo/servertypecheckclean; existing provider-snapshot tests unaffected (behavior change is probe scheduling + a timeout constant). Verified live: after this change the daemon'spaseo provider lsconsistently reports MCP-heavy providers asavailableinstead of flapping toerror.