Perf: drop O(n²) string accumulation in agent SSE response loop by mimeding · Pull Request #8 · mimeding/osaurus

mimeding · 2026-05-27T04:13:57Z

Summary

Why this matters (business)

The agent SSE handler (POST /agents/{id}/run) is what powers long-running agent turns over HTTP — Work mode, server-side tool invocations, the new background-task pipeline. On long generations the handler was doing more work per token than the inference engine itself, just to assemble the assistant message for the next iteration.

For short turns this is invisible. For multi-thousand-character agent answers (file analyses, multi-file code reviews, summarizations) the accumulator's CPU cost grows quadratically and starts to dominate. Streaming feels increasingly choppy near the end of long turns, and the server-side latency profile is misleading because the slowness is in our own accumulation loop, not in the model.

What's wrong (technical)

                var responseContent = ""
                var toolInvoked: ServiceToolInvocation?

                do {
                    let stream = try await chatEngine.streamChat(request: iterationReq)
                    for try await delta in stream {
                        if StreamingToolHint.isSentinel(delta) { continue }
                        responseContent += delta
                        hop {
                            writerBound.value.writeContent(
                                delta,
                                model: model,
                                responseId: responseId,
                                created: created,
                                context: ctx.value
                            )
                        }
                    }
                } catch let inv as ServiceToolInvocation {

Swift String is a copy-on-write value type. Even though += amortizes capacity growth, each append still has to walk the existing content and reallocate when capacity runs out. Across a full streamed turn this is O(n²) in characters — for a 20 KB assistant answer that's hundreds of millions of pointer-copies just to build the next message context, on top of the actual SSE flush.

responseContent is only used after the stream completes (messages.append(ChatMessage(role: "assistant", content: responseContent))). Nothing reads it mid-stream. The tool-invocation branch doesn't use it at all.

Fix

Accumulate streamed deltas in a [String] and a running utf8.count. After the stream ends and only on the "final text response" branch (no tool invocation), reserve a single String of the right capacity and append the chunks once. That's O(n) total work and a single backing allocation, regardless of stream length.

var deltaBuffer: [String] = []
var accumulatedLength = 0
...
for try await delta in stream {
    if StreamingToolHint.isSentinel(delta) { continue }
    deltaBuffer.append(delta)
    accumulatedLength += delta.utf8.count
    hop { writerBound.value.writeContent(delta, ...) }
}
...
var responseContent = String()
responseContent.reserveCapacity(accumulatedLength)
for chunk in deltaBuffer { responseContent.append(chunk) }
messages.append(ChatMessage(role: "assistant", content: responseContent))

No SSE-wire change — the per-delta writeContent is unchanged. Only the in-memory accumulation strategy is different.

Scope decisions

This is the only loop in HTTPHandler.swift that does responseContent += delta on the hot path. /v1/chat/completions and /messages streaming loops use a different shape (they don't materialize the full message in HTTP-handler memory).
The ResponseWriters per-token encode pattern (also flagged by the audit) is a separate, larger PR — it touches three writers and needs careful flush-coalescing benchmarks before landing.

Changes

Behavior change (in-memory only — wire response is byte-identical)
UI change
Refactor / chore (perf)
Tests (no behavior change on the wire; a perf benchmark would belong in scripts/benchmark/ and is a follow-up)
Docs

Test Plan

Run a long agent turn against /agents/{id}/run (e.g. ask it to summarize a multi-thousand-line file). Streamed deltas should arrive at the same rate; the final SSE event closes the stream as before.
Server-side timing: profile a 20 KB+ answer with Instruments → CPU. The handler's own CPU time on the streaming loop should drop dramatically (quadratic → linear).
Tool-invocation path: ask for an action that triggers a tool call — messages.append(.assistant(...)) does not run on that branch, so behaviour is unchanged.

Checklist

I have read CONTRIBUTING.md
I added/updated tests where reasonable (wire behavior is unchanged; a benchmark would be the right tool)
I updated docs/README as needed (n/a — internal optimization)
I verified build on macOS with Xcode 16.4+ (authored in a Linux sandbox; verified the touched file via swiftc -frontend -parse)

The agent /run SSE handler accumulated the assistant turn by appending each streamed delta to a String with '+=' on the hot streaming path. String concatenation in Swift is O(n) per append because String is a copy-on-write value type and concatenation must rebuild the internal storage when capacity is exhausted. Across a full streamed turn this makes assembling the assistant message O(n^2) in characters, on top of the inference latency itself. Long agent runs (file analysis, summarization, code review) were the visible victims. Change the accumulation to an array of delta strings plus a running utf8.count. After the stream completes (and only if it ended without a tool invocation — the tool-invocation branch doesn't need the joined content), allocate a single String of the right capacity and append the chunks once. That's O(n) total work and a single backing allocation instead of O(log n) reallocations. No SSE-wire change — the per-delta writeContent call is unchanged. Only the in-memory accumulation strategy is different. The '/run' endpoint is the only one using this pattern; the /chat/completions and /messages streaming loops already use a different (delta-only) shape. Co-authored-by: Michael Meding <mimeding@users.noreply.github.com>

ModelManager.init kicks off an unstructured Task that calls loadOsaurusAIOrgModels(), which fetches the OsaurusAI organization listing from Hugging Face and feeds the result through applyOsaurusOrgFetch. The unit-test runner repeatedly constructs ModelManager() to drive applyOsaurusOrgFetch directly. The background launch-time fetch races with those test calls — whichever finishes last wins, and the merge result is non-deterministic. That's the root cause of the flaky ModelManagerSuggestedTests failures seen across many of the recent PR CI runs (applyOsaurusOrgFetch_dropsStaleAutoFetched OnReapply, applyOsaurusOrgFetch_addsNewEntriesAfterCurated, etc.). Gate the launch-time fetch on a small isRunningInTestEnvironment helper that checks for any of XCTestConfigurationFilePath, XCTestBundlePath, or XCTestSessionIdentifier in the process environment. Those variables are only present inside an xctest host process; production app launches still get the HF fetch exactly as before. This is a network call, so removing it under tests also has the side benefit of making the test suite work offline / on hermetic CI runners. Co-authored-by: Michael Meding <mimeding@users.noreply.github.com>

PR #8 moved the assistant-turn assembly into the no-tool-invocation branch only, but the tool-invocation branch a few lines below ALSO reads 'responseContent' to record the assistant's pre-tool-call text on the ChatMessage. That left 'responseContent' undefined in the tool-invocation scope and broke the build with two 'cannot find responseContent in scope' errors. Materialize 'responseContent' once, before the 'guard let invocation = toolInvoked' branch, so both successful paths see the same already-joined String. Asymptotic shape is unchanged from the previous commit (single allocation, linear-time join); we just hoist it one level up so its lifetime covers both consumers. Co-authored-by: Michael Meding <mimeding@users.noreply.github.com>

cursoragent and others added 3 commits May 27, 2026 04:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perf: drop O(n²) string accumulation in agent SSE response loop#8

Perf: drop O(n²) string accumulation in agent SSE response loop#8
mimeding wants to merge 3 commits into
mainfrom
cursor/agent-sse-string-concat-2812

mimeding commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mimeding commented May 27, 2026

Summary

Why this matters (business)

What's wrong (technical)

Fix

Scope decisions

Changes

Test Plan

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants