SSE Resumption Between execd and SDKs #507

@Pangjiping

SSE Resumption Between execd and SDKs

Overview

This document describes the capabilities and core requirements for resumable Server-Sent Events (SSE) streams between the execd daemon and language SDKs. The goal is to support disconnect/reconnect scenarios without losing or reordering execution output when the server can still replay recent events.


Capabilities to Deliver

1. Resumable execution streams

After a transient failure (network blip, client sleep/wake, load balancer rotation, SDK process restart), the SDK can open a new SSE connection and continue the same logical execution stream, provided the run is still active and the server can still replay the missed events. A new run is started only when resumption is not possible.

2. Deterministic replay from a client cursor

On reconnect, the client sends a monotonic cursor (e.g., last seen sequence number, byte offset, or opaque server-issued token). The server replays missed events from that point, or returns a clear, machine-readable outcome if:

  • the run has already completed, or
  • the cursor is unknown/invalid, or
  • the cursor points to events that are already outside the replay window.
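
The outcomes above can be sketched as a small classification step on the server. This is only an illustration under stated assumptions: the cursor is taken to be an integer sequence number, and `ResumeOutcome`, `StreamState`, and `classify_resume` are hypothetical names that do not come from execd. It also assumes the "tail replay then completion" model for finished runs, which is one of several options this proposal leaves open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ResumeOutcome(Enum):
    """Hypothetical machine-readable outcomes for a resume request."""
    REPLAY = "replay"                  # missed events (if any) can be replayed
    COMPLETED = "completed"            # run finished and client saw everything
    UNKNOWN_CURSOR = "unknown_cursor"  # cursor was never issued for this stream
    OUT_OF_WINDOW = "out_of_window"    # needed events were already evicted


@dataclass
class StreamState:
    oldest_seq: Optional[int]  # oldest sequence still buffered, None if empty
    latest_seq: int            # last sequence emitted so far (0 means none)
    completed: bool            # True once the run reached a terminal state


def classify_resume(state: StreamState, cursor: int) -> ResumeOutcome:
    """Decide how to answer a reconnect that presents `cursor`."""
    if cursor < 0 or cursor > state.latest_seq:
        return ResumeOutcome.UNKNOWN_CURSOR
    if cursor == state.latest_seq:
        # Nothing was missed: either report completion or keep streaming live.
        return ResumeOutcome.COMPLETED if state.completed else ResumeOutcome.REPLAY
    first_needed = cursor + 1
    if state.oldest_seq is None or first_needed < state.oldest_seq:
        return ResumeOutcome.OUT_OF_WINDOW
    return ResumeOutcome.REPLAY
```

Each outcome would then be mapped to a documented SSE event type or HTTP status; that mapping is deliberately left out here because the proposal does not fix it yet.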

3. Clean consumer semantics in the SDK

The SDK presents application code with a single ordered sequence of execution events. The implementation must avoid duplicate delivery at the public API surface (via deduplication, idempotent handling, or strict server replay guarantees).
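
As one possible reading of the deduplication option, the SDK could keep the highest sequence number it has handed to application code and drop anything at or below it. The class below is a minimal sketch; `OrderedDeduper` is a hypothetical name, and it assumes integer, strictly increasing sequence numbers per stream.

```python
class OrderedDeduper:
    """Drops events the application has already received.

    Assumes each event carries an integer sequence number that strictly
    increases within one logical stream.
    """

    def __init__(self, last_seq: int = 0) -> None:
        self.last_seq = last_seq  # highest sequence delivered so far

    def accept(self, seq: int) -> bool:
        """Return True if the event is new and should reach the app."""
        if seq <= self.last_seq:
            return False  # replayed duplicate, already delivered
        self.last_seq = seq
        return True


# First connection delivers 1..3, then a reconnect replays from 2.
deduper = OrderedDeduper()
received = [1, 2, 3, 2, 3, 4, 5]
delivered = [seq for seq in received if deduper.accept(seq)]
```

Because `last_seq` lives outside any single connection, the same object can be reused across reconnects, and its value doubles as the cursor to send on the next attempt.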

4. Explicit terminal semantics

Terminal conditions (success, failure, cancellation, and not resumable) must be distinguishable through documented SSE event types and/or HTTP status and error payloads, so the SDK can stop retrying and release resources.

5. Operability

Reconnect behavior should be testable and observable. Cursor advancement, replay hits and misses, retries, and fallbacks should be visible to operators and tests, and failures should map to stable error categories that SDK backoff policies can act on.


Core Requirements

Cursor and ordering

  • Every application-relevant SSE event must be addressable by a stable cursor scoped to a single logical stream (e.g., one command/code execution).
  • Cursors must define a total order for that stream (strictly increasing sequence or equivalent).
  • Replay must preserve the same delivery order as the original stream.

Wire contract and versioning

  • Resumption inputs (cursor, stream/run identifier, optional flags) must be specified in OpenAPI (or an adjacent protocol note for non-generated SSE paths) and versioned with the API.
  • Field names should align with existing models and generated clients where possible; handwritten SSE transport must stay contract-compatible.

Replay window and retention

  • execd must define bounded replay: limits by time, event count, and/or buffered payload size.
  • If replay is impossible, the response must not imply success with silent loss; use documented HTTP status and structured errors (e.g., expired buffer, unknown run).
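
A bounded replay buffer along these lines could look like the sketch below. It limits by event count and payload bytes (a time limit is omitted for brevity), and signals "replay impossible" explicitly instead of returning a partial result. `ReplayBuffer` and its methods are hypothetical names, not execd code.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class ReplayBuffer:
    """Keeps the most recent events, bounded by count and total bytes."""

    def __init__(self, max_events: int, max_bytes: int) -> None:
        self.max_events = max_events
        self.max_bytes = max_bytes
        self._events: Deque[Tuple[int, bytes]] = deque()
        self._bytes = 0
        self._next_seq = 1

    def append(self, payload: bytes) -> int:
        """Store an event, evict old ones if over budget, return its seq."""
        seq = self._next_seq
        self._next_seq += 1
        self._events.append((seq, payload))
        self._bytes += len(payload)
        while self._events and (
            len(self._events) > self.max_events or self._bytes > self.max_bytes
        ):
            _, evicted = self._events.popleft()
            self._bytes -= len(evicted)
        return seq

    def replay_after(self, cursor: int) -> Optional[List[Tuple[int, bytes]]]:
        """Return events with seq > cursor, or None if some were evicted."""
        latest = self._next_seq - 1
        if cursor < 0 or cursor > latest:
            raise ValueError("unknown cursor")
        if cursor == latest:
            return []  # client is up to date
        if not self._events or self._events[0][0] > cursor + 1:
            return None  # gap: caller must answer with a structured error
        return [(seq, data) for seq, data in self._events if seq > cursor]
```

Returning `None` rather than the surviving tail is what prevents "success with silent loss": the caller has to translate it into a documented error such as an expired buffer.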

Idempotency and side effects

  • Reconnect with the same cursor is read-only with respect to execution side effects: it must not start a new run or mutate execution state unless that is an explicitly separate API.
  • Starting a new run remains a distinct operation from resuming an existing stream.

Terminal runs

  • For completed runs, reconnect behavior must be deterministic and documented: either immediate completion metadata, tail replay then completion, or non-resumable with a clear reason. Each endpoint family must pick one of these models and apply it consistently.

Concurrency

  • Either only one active consumer per logical stream is supported, or multi-consumer semantics are explicitly defined (e.g., shared cursor, fan-out rules). Concurrent attach with undefined behavior must be rejected.
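
For the single-consumer option, a minimal sketch of an attach guard is shown below. `AttachRegistry` and the notion of a `consumer_id` are assumptions made for illustration; a real server would also need locking and a way to expire a consumer whose connection died without a clean detach.

```python
from typing import Dict


class AttachRegistry:
    """Tracks which consumer currently owns each logical stream."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def attach(self, stream_id: str, consumer_id: str) -> bool:
        """Grant the stream unless a different consumer already holds it."""
        owner = self._owners.get(stream_id)
        if owner is not None and owner != consumer_id:
            return False  # second consumer is rejected, not silently shared
        self._owners[stream_id] = consumer_id
        return True

    def detach(self, stream_id: str, consumer_id: str) -> None:
        """Release the stream if this consumer is the current owner."""
        if self._owners.get(stream_id) == consumer_id:
            del self._owners[stream_id]
```

Re-attaching with the same `consumer_id` succeeds, which lets a client reconnect before the server has noticed that its previous connection dropped.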

Security and tenancy

  • Stream identifiers and cursors must be cryptographically or logically bound to the authenticated principal and sandbox/session context so clients cannot resume another tenant’s stream by identifier guessing.
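
One way to bind a cursor cryptographically to its context is to sign it with a server-side secret using HMAC. The sketch below is an assumption about how this might be done, not a description of execd: `issue_cursor`, `verify_cursor`, and the `seq.tag` token layout are all hypothetical, and it assumes principal and stream identifiers never contain newlines.

```python
import base64
import hashlib
import hmac
from typing import Optional


def issue_cursor(secret: bytes, principal: str, stream_id: str, seq: int) -> str:
    """Return a token of the form '<seq>.<tag>' bound to principal and stream."""
    message = f"{principal}\n{stream_id}\n{seq}".encode()
    tag = hmac.new(secret, message, hashlib.sha256).digest()
    tag_text = base64.urlsafe_b64encode(tag).decode().rstrip("=")
    return f"{seq}.{tag_text}"


def verify_cursor(
    secret: bytes, principal: str, stream_id: str, token: str
) -> Optional[int]:
    """Return the sequence number if the token is valid for this context."""
    seq_text, sep, _ = token.partition(".")
    if not sep or not (seq_text.isascii() and seq_text.isdigit()):
        return None
    seq = int(seq_text)
    expected = issue_cursor(secret, principal, stream_id, seq)
    # Constant-time comparison avoids leaking how much of the tag matched.
    if hmac.compare_digest(expected.encode(), token.encode()):
        return seq
    return None
```

A token issued for one principal or stream fails verification for any other, so guessing a stream identifier alone is not enough to resume it.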

Performance and safety

  • Replay must be chunked or limited per response to protect memory and CPU on execd and on clients.
  • Heartbeats, comments, and keep-alive frames must not break cursor or ordering semantics.
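
To illustrate the heartbeat rule on the client side, the sketch below parses SSE lines and advances the cursor only when a real event is dispatched. It follows the standard SSE framing (lines starting with `:` are comments, a blank line dispatches the event). Carrying the cursor in the SSE `id:` field is an assumption, and `iter_sse_events` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (last_event_id, data) per dispatched event; skip comments."""
    last_id: Optional[str] = None
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line ends a frame; frames without data are not dispatched.
            if data:
                yield last_id, "\n".join(data)
            data = []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive: never touches the cursor
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "data":
            data.append(value)
        elif field == "id" and "\0" not in value:
            last_id = value
        # Other fields (event, retry) are ignored in this sketch.


frames = [
    ": keep-alive", "",
    "id: 1", "data: hello", "",
    ": ping",
    "id: 2", "data: a", "data: b", "",
]
events = list(iter_sse_events(frames))
```

Here `events` contains two entries, with ids `"1"` and `"2"`; the two comment lines produce no events and leave the id untouched.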

SDK responsibilities

  • Implement automatic reconnect with exponential backoff and jitter where appropriate.
  • Persist and advance the cursor for the lifetime of the run.
  • Map server errors to retryable vs terminal outcomes.
  • Preserve backwards compatibility: clients that do not send resumption parameters continue to work unchanged.
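
The first and third responsibilities can be sketched together: a delay generator using exponential backoff with full jitter, and a lookup that separates retryable from terminal errors. The error code strings are placeholders invented for this example, since the proposal does not define the actual categories, and the default timings are arbitrary.

```python
import random
from typing import Iterator, Optional

# Placeholder categories; the real names would come from the documented contract.
RETRYABLE_CODES = {"connection_reset", "timeout", "server_unavailable"}
TERMINAL_CODES = {"run_completed", "cursor_expired", "unknown_run", "forbidden"}


def should_retry(code: str) -> bool:
    """Retry only known-transient errors; unknown codes are treated as terminal."""
    return code in RETRYABLE_CODES


def backoff_delays(
    base: float = 0.5,
    cap: float = 30.0,
    attempts: int = 6,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one delay per attempt: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)
```

Treating unknown codes as terminal is a conservative choice that avoids retrying forever against a server that has introduced a new non-resumable condition; an SDK could reasonably choose the opposite default with a strict attempt cap.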

Non-Goals

  • Unbounded history: resumption is not a full audit log; retention is finite by design.
  • General stream editing: resumption is for replay of server-emitted events, not arbitrary mutation of in-flight execution unless specified elsewhere.

Success Criteria (summary)

  • Reconnect after a drop yields no silent loss within the documented replay window.
  • Event order at the SDK boundary is stable and duplicate-free from the app’s perspective.
  • Terminal and non-resumable cases are explicit and test-covered.
  • Legacy clients remain compatible without resumption parameters.
