Perf: drop O(n²) string accumulation in agent SSE response loop #8

Draft
mimeding wants to merge 3 commits into main from cursor/agent-sse-string-concat-2812

@mimeding

Owner

Summary

Why this matters (business)

The agent SSE handler (POST /agents/{id}/run) is what powers long-running agent turns over HTTP — Work mode, server-side tool invocations, the new background-task pipeline. On long generations the handler was doing more work per token than the inference engine itself, just to assemble the assistant message for the next iteration.

For short turns this is invisible. For multi-thousand-character agent answers (file analyses, multi-file code reviews, summarizations) the accumulator's CPU cost grows quadratically and starts to dominate. Streaming feels increasingly choppy near the end of long turns, and the server-side latency profile is misleading because the slowness is in our own accumulation loop, not in the model.

What's wrong (technical)

                var responseContent = ""
                var toolInvoked: ServiceToolInvocation?

                do {
                    let stream = try await chatEngine.streamChat(request: iterationReq)
                    for try await delta in stream {
                        if StreamingToolHint.isSentinel(delta) { continue }
                        responseContent += delta
                        hop {
                            writerBound.value.writeContent(
                                delta,
                                model: model,
                                responseId: responseId,
                                created: created,
                                context: ctx.value
                            )
                        }
                    }
                } catch let inv as ServiceToolInvocation {

Swift String is a copy-on-write value type. += amortizes capacity growth, but whenever an append exhausts capacity (or the storage is not uniquely referenced) the existing content is copied into new storage. If that copy happens on every delta, a full streamed turn costs O(n²) in characters: for a 20 KB assistant answer that is up to hundreds of millions of byte copies just to build the next message context, on top of the actual SSE flush.

responseContent is only used after the stream completes (messages.append(ChatMessage(role: "assistant", content: responseContent))). Nothing reads it mid-stream. The tool-invocation branch doesn't use it at all.

Fix

Accumulate streamed deltas in a [String] and a running utf8.count. After the stream ends and only on the "final text response" branch (no tool invocation), reserve a single String of the right capacity and append the chunks once. That's O(n) total work and a single backing allocation, regardless of stream length.

var deltaBuffer: [String] = []
var accumulatedLength = 0
...
for try await delta in stream {
    if StreamingToolHint.isSentinel(delta) { continue }
    deltaBuffer.append(delta)
    accumulatedLength += delta.utf8.count
    hop { writerBound.value.writeContent(delta, ...) }
}
...
var responseContent = String()
responseContent.reserveCapacity(accumulatedLength)
for chunk in deltaBuffer { responseContent.append(chunk) }
messages.append(ChatMessage(role: "assistant", content: responseContent))

No SSE-wire change — the per-delta writeContent is unchanged. Only the in-memory accumulation strategy is different.
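The buffer-and-join strategy can be tried in isolation with a minimal sketch like the one below. `DeltaAccumulator` is a hypothetical helper introduced only for illustration (the PR keeps the array and counter inline in the handler), and the sample deltas are made up.

```swift
/// Hypothetical helper, not part of the PR: collects streamed deltas
/// and joins them once after the stream has ended.
struct DeltaAccumulator {
    private var chunks: [String] = []
    private var utf8Length = 0

    /// Records one streamed delta without touching previously stored text.
    mutating func append(_ delta: String) {
        chunks.append(delta)
        utf8Length += delta.utf8.count
    }

    /// Builds the full message with one up-front capacity reservation.
    func materialize() -> String {
        var result = String()
        result.reserveCapacity(utf8Length)
        for chunk in chunks { result.append(chunk) }
        return result
    }
}

// Made-up deltas standing in for the SSE stream.
var accumulator = DeltaAccumulator()
for delta in ["Hel", "lo, ", "wörld"] {
    accumulator.append(delta)
}
let message = accumulator.materialize()
print(message) // Hello, wörld
```

Tracking `utf8.count` rather than `count` matches what `reserveCapacity` expects, since native Swift strings are stored as UTF-8 code units.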

Scope decisions

  • This is the only loop in HTTPHandler.swift that does responseContent += delta on the hot path. /v1/chat/completions and /messages streaming loops use a different shape (they don't materialize the full message in HTTP-handler memory).
  • The ResponseWriters per-token encode pattern (also flagged by the audit) is a separate, larger PR — it touches three writers and needs careful flush-coalescing benchmarks before landing.

Changes

  • Behavior change (in-memory only — wire response is byte-identical)
  • UI change
  • Refactor / chore (perf)
  • Tests (no behavior change on the wire; a perf benchmark would belong in scripts/benchmark/ and is a follow-up)
  • Docs

Test Plan

  1. Run a long agent turn against /agents/{id}/run (e.g. ask it to summarize a multi-thousand-line file). Streamed deltas should arrive at the same rate; the final SSE event closes the stream as before.
  2. Server-side timing: profile a 20 KB+ answer with Instruments → CPU. The handler's own CPU time on the streaming loop should drop dramatically (quadratic → linear).
  3. Tool-invocation path: ask for an action that triggers a tool call — messages.append(.assistant(...)) does not run on that branch, so behavior is unchanged.
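Before reaching for Instruments, a rough micro-benchmark along the following lines may help compare the two accumulation strategies on synthetic input. This is only a sketch: the `timed` helper, the chunk size, and the chunk count are assumptions, and the timings will vary with machine, optimization level, and Swift version, so no expected numbers are given.

```swift
import Foundation

/// Hypothetical helper: runs a closure and reports elapsed seconds
/// measured with the monotonic system uptime clock.
func timed<T>(_ body: () -> T) -> (seconds: TimeInterval, value: T) {
    let start = ProcessInfo.processInfo.systemUptime
    let value = body()
    return (ProcessInfo.processInfo.systemUptime - start, value)
}

// Synthetic stand-in for streamed deltas: 5,000 four-byte chunks (20 KB).
let deltas = Array(repeating: "tok ", count: 5_000)

// Strategy 1: append every delta to a growing String.
let concatenated = timed { () -> String in
    var s = ""
    for d in deltas { s += d }
    return s
}

// Strategy 2: buffer the deltas, then join once into reserved storage.
let joined = timed { () -> String in
    var buffer: [String] = []
    var length = 0
    for d in deltas {
        buffer.append(d)
        length += d.utf8.count
    }
    var s = String()
    s.reserveCapacity(length)
    for chunk in buffer { s.append(chunk) }
    return s
}

print("+= per delta:    \(concatenated.seconds) s")
print("buffer and join: \(joined.seconds) s")
```

Building with optimizations (`swiftc -O`) and repeating the measurement several times gives a fairer picture than a single debug-build run.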

Checklist

  • I have read CONTRIBUTING.md
  • I added/updated tests where reasonable (wire behavior is unchanged; a benchmark would be the right tool)
  • I updated docs/README as needed (n/a — internal optimization)
  • I verified build on macOS with Xcode 16.4+ (authored in a Linux sandbox; verified the touched file via swiftc -frontend -parse)

cursoragent and others added 3 commits May 27, 2026 04:13
The agent /run SSE handler accumulated the assistant turn by appending
each streamed delta to a String with '+=' on the hot streaming path.
String concatenation in Swift is O(n) per append because String is a
copy-on-write value type and concatenation must rebuild the internal
storage when capacity is exhausted. Across a full streamed turn this
makes assembling the assistant message O(n^2) in characters, on top
of the inference latency itself. Long agent runs (file analysis,
summarization, code review) were the visible victims.

Change the accumulation to an array of delta strings plus a running
utf8.count. After the stream completes (and only if it ended without
a tool invocation — the tool-invocation branch doesn't need the
joined content), allocate a single String of the right capacity and
append the chunks once. That's O(n) total work and a single backing
allocation instead of O(log n) reallocations.

No SSE-wire change — the per-delta writeContent call is unchanged.
Only the in-memory accumulation strategy is different. The
'/run' endpoint is the only one using this pattern; the
/chat/completions and /messages streaming loops already use a
different (delta-only) shape.

Co-authored-by: Michael Meding <mimeding@users.noreply.github.com>
ModelManager.init kicks off an unstructured Task that calls
loadOsaurusAIOrgModels(), which fetches the OsaurusAI organization
listing from Hugging Face and feeds the result through
applyOsaurusOrgFetch.

The unit-test runner repeatedly constructs ModelManager() to drive
applyOsaurusOrgFetch directly. The background launch-time fetch
races with those test calls — whichever finishes last wins, and
the merge result is non-deterministic. That's the root cause of
the flaky ModelManagerSuggestedTests failures seen across many of
the recent PR CI runs (applyOsaurusOrgFetch_dropsStaleAutoFetched
OnReapply, applyOsaurusOrgFetch_addsNewEntriesAfterCurated, etc.).

Gate the launch-time fetch on a small isRunningInTestEnvironment
helper that checks for any of XCTestConfigurationFilePath,
XCTestBundlePath, or XCTestSessionIdentifier in the process
environment. Those variables are only present inside an xctest host
process; production app launches still get the HF fetch exactly as
before.

This is a network call, so removing it under tests also has the
side benefit of making the test suite work offline / on hermetic
CI runners.

Co-authored-by: Michael Meding <mimeding@users.noreply.github.com>
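The environment check described in the commit above could look roughly like this. The three variable names come from the commit message; the function signature and the injectable `environment` parameter are assumptions made so the check can be exercised without an xctest host, and the sample paths are made up.

```swift
import Foundation

/// Sketch of the helper named in the commit message. The default argument
/// reads the real process environment; passing a dictionary is an
/// assumption added here for testability.
func isRunningInTestEnvironment(
    _ environment: [String: String] = ProcessInfo.processInfo.environment
) -> Bool {
    let xctestKeys = [
        "XCTestConfigurationFilePath",
        "XCTestBundlePath",
        "XCTestSessionIdentifier",
    ]
    // True as soon as any one of the XCTest variables is present.
    return xctestKeys.contains { environment[$0] != nil }
}

let underTest = isRunningInTestEnvironment(
    ["XCTestConfigurationFilePath": "/tmp/example.xctestconfiguration"]
)
let production = isRunningInTestEnvironment(["HOME": "/Users/example"])
print(underTest, production) // true false
```

A caller would then skip the launch-time fetch when the helper returns true and run it unchanged otherwise.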
PR #8 moved the assistant-turn assembly into the no-tool-invocation
branch only, but the tool-invocation branch a few lines below ALSO
reads 'responseContent' to record the assistant's pre-tool-call text
on the ChatMessage. That left 'responseContent' undefined in the
tool-invocation scope and broke the build with two 'cannot find
responseContent in scope' errors.

Materialize 'responseContent' once, before the 'guard let invocation
= toolInvoked' branch, so both successful paths see the same
already-joined String. Asymptotic shape is unchanged from the
previous commit (single allocation, linear-time join); we just hoist
it one level up so its lifetime covers both consumers.

Co-authored-by: Michael Meding <mimeding@users.noreply.github.com>
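The hoisting described in the last commit can be pictured with the reduced sketch below. `ChatMessage` and `ToolInvocation` here are hypothetical stand-ins, not the project's real types, and the way the tool branch decorates the text is purely illustrative; the only point is that the join happens once, before the `guard`, so both branches read the same String.

```swift
/// Hypothetical stand-ins for the project's types, reduced to what the sketch needs.
struct ChatMessage: Equatable {
    let role: String
    let content: String
}
struct ToolInvocation {
    let toolName: String
}

/// Joins the buffered deltas once, then lets both outcomes read the result.
func assistantMessage(
    deltas: [String],
    utf8Length: Int,
    toolInvoked: ToolInvocation?
) -> ChatMessage {
    // Materialized before branching so its lifetime covers both consumers.
    var responseContent = String()
    responseContent.reserveCapacity(utf8Length)
    for chunk in deltas { responseContent.append(chunk) }

    guard let invocation = toolInvoked else {
        // Final text response: the joined content is the whole assistant turn.
        return ChatMessage(role: "assistant", content: responseContent)
    }
    // Tool invocation: the joined content is the pre-tool-call text.
    // The bracketed suffix is illustrative only.
    return ChatMessage(
        role: "assistant",
        content: responseContent + " [tool: \(invocation.toolName)]"
    )
}

let deltas = ["Reading ", "the file"]
let length = deltas.reduce(0) { $0 + $1.utf8.count }
let finalTurn = assistantMessage(deltas: deltas, utf8Length: length, toolInvoked: nil)
let toolTurn = assistantMessage(
    deltas: deltas,
    utf8Length: length,
    toolInvoked: ToolInvocation(toolName: "read_file")
)
print(finalTurn.content) // Reading the file
print(toolTurn.content)  // Reading the file [tool: read_file]
```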

2 participants