Make provider availability snapshot reliable for MCP-heavy providers#1314

Open
jms830 wants to merge 1 commit into
getpaseo:mainfrom
jms830:feat/provider-snapshot-reliability

@jms830 jms830 commented Jun 3, 2026

What

Makes the provider availability snapshot reliable for MCP-heavy providers. Two changes in provider-snapshot-manager.ts:

  1. Probe providers sequentially instead of concurrently. loadProviders previously did Promise.allSettled(providers.map(...)), so every provider's availability probe ran at once. Providers that start an RPC session and connect their configured MCP servers during the probe (Pi/OMP with many MCP servers, OpenCode) then contend for CPU/IO on smaller hosts, and the resulting timeouts/crashes get cached and gate on-demand model fetches (paseo provider models <id> returns "not available"). Sequential probing removes the contention.

  2. Raise DEFAULT_REFRESH_TIMEOUT_MS 30s → 90s. An RPC session + model/command enumeration for an MCP-heavy provider can legitimately exceed 30s on a loaded host.
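
A minimal sketch of the scheduling difference in change 1, assuming a hypothetical `Probe` type in place of the real provider probe (none of these names come from `provider-snapshot-manager.ts`). The `makeTrackedProbes` helper only exists to show how many probes are in flight at once under each strategy.

```typescript
// Hypothetical stand-in for a provider availability probe; not the Paseo API.
type Probe = () => Promise<void>;

// Before: every probe starts at once, so heavy probes compete for CPU/IO.
async function probeConcurrently(probes: Probe[]): Promise<void> {
  await Promise.allSettled(probes.map((probe) => probe()));
}

// After: each probe settles before the next one starts.
async function probeSequentially(probes: Probe[]): Promise<void> {
  for (const probe of probes) {
    // Swallow the rejection so one failing probe cannot abort the rest.
    await probe().catch(() => undefined);
  }
}

// Builds `count` fake probes and records the peak number running at once.
function makeTrackedProbes(count: number, delayMs: number) {
  let inFlight = 0;
  let peak = 0;
  const probes: Probe[] = Array.from({ length: count }, () => async () => {
    inFlight += 1;
    peak = Math.max(peak, inFlight);
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    inFlight -= 1;
  });
  return { probes, peak: () => peak };
}
```

With three tracked probes, the sequential variant never has more than one probe in flight, while the concurrent variant starts all three together.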

Why

On a host running several MCP-heavy providers, the concurrent refresh made availability flap between available and error/unavailable across restarts, even though each provider was individually healthy and fast in isolation. Because the cached snapshot status also gates on-demand operations, a provider that lost the probe race became unusable until the next successful refresh. Sequential probing + a more generous budget makes the snapshot deterministic.

Observed on a real multi-provider daemon: with concurrent probing, OMP/Pi/OpenCode intermittently showed error; with these two changes they consistently resolve to available.

Tradeoff

Sequential probing makes a full refresh take longer (sum of per-provider probe times instead of the max). For the lightweight providers this is negligible (binary-presence checks return in well under a second); the cost is paid only by genuinely slow MCP-heavy providers, which is exactly where reliability matters. A bounded-concurrency pool (e.g. 2) would be a reasonable middle ground if preferred — happy to switch to that.
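
For reference, a rough sketch of what the bounded-concurrency alternative could look like. `Task` and `runWithLimit` are hypothetical names for illustration, not code from this PR, and tasks are assumed to be async functions that report failure by rejecting.

```typescript
// Hypothetical task type; stands in for a single provider probe.
type Task = () => Promise<void>;

// Runs all tasks with at most `limit` of them in flight at any time.
async function runWithLimit(tasks: Task[], limit: number): Promise<void> {
  let next = 0;

  // Each worker claims the next unstarted task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const task = tasks[next];
      next += 1;
      // Isolate failures so one bad probe does not stop this worker.
      await task().catch(() => undefined);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
}
```

Calling `runWithLimit(probes, 2)` would keep at most two probes running, and `runWithLimit(probes, 1)` degenerates to the sequential loop used in this PR.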

Scope

Single file, +15 / −4. No protocol changes, no provider-specific code. Independent of #1177 (OMP provider) — that PR surfaced the issue but does not depend on this one.

Testing

@getpaseo/server typecheck clean; existing provider-snapshot tests unaffected (behavior change is probe scheduling + a timeout constant). Verified live: after this change the daemon's paseo provider ls consistently reports MCP-heavy providers as available instead of flapping to error.

Two changes to stop MCP-heavy providers (omp/pi with many configured MCP
servers, opencode) from spuriously showing 'error'/'unavailable':

1. Raise DEFAULT_REFRESH_TIMEOUT_MS 30s -> 90s. RPC session + model/command
   enumeration for MCP-heavy providers can exceed 30s.
2. Probe providers sequentially instead of concurrently. Running several
   MCP-heavy probes at once starves CPU/IO on smaller hosts; the resulting
   timeouts/crashes were cached and then gated on-demand model fetches.

Separate from the OMP provider feature; kept as its own commit.


greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR makes provider availability probing sequential (replacing Promise.allSettled) and raises the per-probe timeout from 30 s to 90 s to prevent CPU/IO contention on hosts running several MCP-heavy providers from causing spurious error snapshot entries that block on-demand model fetches.

  • loadProviders now iterates providers with await ... .catch(() => undefined) instead of Promise.allSettled, ensuring each probe completes before the next starts and one failure cannot abort the rest.
  • DEFAULT_REFRESH_TIMEOUT_MS is raised from 30 000 to 90 000 ms; the generous budget gives RPC-session-starting providers room to enumerate models and commands even on a loaded host.

Confidence Score: 4/5

Safe to merge; the logic change is correct and the reliability improvement is real, though the full-refresh latency trade-off is worth keeping in mind as provider count grows.

The sequential loop correctly mirrors the error-isolation behavior of the original Promise.allSettled — refreshProvider already catches all errors internally, so the .catch(() => undefined) just protects against edge-case rejections from event listeners. The timeout increase is straightforward. The two non-blocking observations are that worst-case full-refresh time now scales as N × 2 × 90 s (a slow early provider delays all later ones), and no test exercises the new ordering behavior, making it hard to catch a regression back to concurrent probing.

packages/server/src/server/agent/provider-snapshot-manager.ts — specifically the sequential loop and the interaction between provider ordering and total refresh latency.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/server/src/server/agent/provider-snapshot-manager.ts | Sequential probing loop replaces Promise.allSettled and DEFAULT_REFRESH_TIMEOUT_MS is raised from 30s to 90s; logic is correct but worst-case full-refresh time now scales linearly with provider count and no tests cover the new ordering behavior |

Sequence Diagram

sequenceDiagram
    participant LPs as loadProviders
    participant LP as loadProvider(A)
    participant LP2 as loadProvider(B)
    participant RP as refreshProvider

    Note over LPs: Before: Promise.allSettled (concurrent)
    LPs->>LP: start A
    LPs->>LP2: start B
    LP-->>RP: refreshProvider(A) [up to 30s]
    LP2-->>RP: refreshProvider(B) [up to 30s]
    RP-->>LP: A resolves/errors
    RP-->>LP2: B resolves/errors
    LP-->>LPs: settled
    LP2-->>LPs: settled

    Note over LPs: After: sequential for-of (this PR)
    LPs->>LP: await loadProvider(A)
    LP-->>RP: refreshProvider(A) [up to 90s]
    RP-->>LP: A resolves/errors
    LP-->>LPs: .catch() → continue
    LPs->>LP2: await loadProvider(B)
    LP2-->>RP: refreshProvider(B) [up to 90s]
    RP-->>LP2: B resolves/errors
    LP2-->>LPs: .catch() → continue

Reviews (1): Last reviewed commit: "Make provider snapshot reliable for MCP-..."

Comment on lines +534 to +536
for (const provider of options.providers) {
  await this.loadProvider({ ...options, provider }).catch(() => undefined);
}

P2 Worst-case refresh time scales linearly with provider count

refreshProvider applies this.refreshTimeoutMs twice per provider: once to isAvailable() and again to fetchModels + fetchModes. If both phases hit the ceiling, a single provider consumes up to 2 × 90 s = 3 min. For an N-provider setup, the sequential loop can therefore block for up to N × 3 min after a daemon restart before any snapshot is considered fresh. Lightweight providers are fine in practice (binary-presence checks return in milliseconds), but a slow provider positioned early in options.providers delays every provider behind it. A small concurrency limit (e.g. 2–3 parallel probes) would bound the latency regression while still resolving the contention problem the PR targets; the PR description mentions this as a ready alternative.
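
The bound above can be worked through as a small calculation. `worstCaseRefreshMs` is a hypothetical helper for illustration, and the two-phase assumption is taken from the description of refreshProvider in this comment rather than from the source file.

```typescript
// Worst-case duration of a fully sequential refresh, assuming every provider
// spends the full timeout in each of its timed phases.
function worstCaseRefreshMs(
  providerCount: number,
  timeoutMs: number,
  phasesPerProvider = 2, // isAvailable(), then fetchModels + fetchModes
): number {
  return providerCount * phasesPerProvider * timeoutMs;
}

// One provider at the new 90 s timeout: 180_000 ms = 3 min.
const perProviderMs = worstCaseRefreshMs(1, 90_000);

// Five providers probed sequentially: 900_000 ms = 15 min.
const fiveProvidersMs = worstCaseRefreshMs(5, 90_000);
```

This is an upper bound only; it is reached when every phase of every provider times out, which the PR description suggests is not the common case.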

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +534 to +536
for (const provider of options.providers) {
  await this.loadProvider({ ...options, provider }).catch(() => undefined);
}

P2 No test coverage for sequential probe ordering

The behavioral change from Promise.allSettled to sequential probing is not exercised by the existing test suite (loadProviders, sequential ordering, and the timeout constant are all absent from the test file). A test with two fake providers — one fast, one slow — could verify that the slow provider's error is isolated and does not prevent the fast provider from resolving to ready, and that the status: "error" snapshot entry is correctly emitted for the timed-out provider. Without this, a regression back to concurrent probing (or a bug in the loop's .catch path) would go undetected.
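
One possible shape for such a test, written as a self-contained sketch. `FakeProvider`, `withTimeout`, and `loadSequentially` are hypothetical and only model the behavior described here; a real test would drive the snapshot manager itself and use the repository's existing test helpers.

```typescript
// All names below are hypothetical; they model the described behavior,
// not the real provider-snapshot-manager API.
type Status = "ready" | "error";

interface FakeProvider {
  id: string;
  probe: () => Promise<void>;
}

// Rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Probes one provider at a time and records a status and an event log.
async function loadSequentially(
  providers: FakeProvider[],
  timeoutMs: number,
  order: string[] = [],
): Promise<Map<string, Status>> {
  const statuses = new Map<string, Status>();
  for (const provider of providers) {
    order.push(`start:${provider.id}`);
    // A timeout or rejection becomes an "error" entry instead of aborting the loop.
    const status = await withTimeout(provider.probe(), timeoutMs).then(
      (): Status => "ready",
      (): Status => "error",
    );
    statuses.set(provider.id, status);
    order.push(`end:${provider.id}`);
  }
  return statuses;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// A slow provider that exceeds the timeout, followed by a fast one.
async function runExample() {
  const order: string[] = [];
  const statuses = await loadSequentially(
    [
      { id: "slow", probe: () => sleep(200) },
      { id: "fast", probe: () => sleep(1) },
    ],
    50,
    order,
  );
  return { statuses, order };
}
```

Asserting on the event log as well as the statuses is what would catch a regression to concurrent probing, since in that case `start:fast` would appear before `end:slow`.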

