Perf: drop O(n²) string accumulation in agent SSE response loop#8
Draft
mimeding wants to merge 3 commits into
Draft
Perf: drop O(n²) string accumulation in agent SSE response loop#8mimeding wants to merge 3 commits into
mimeding wants to merge 3 commits into
Conversation
The agent /run SSE handler accumulated the assistant turn by appending each streamed delta to a String with '+=' on the hot streaming path. String concatenation in Swift is O(n) per append because String is a copy-on-write value type and concatenation must rebuild the internal storage when capacity is exhausted. Across a full streamed turn this makes assembling the assistant message O(n^2) in characters, on top of the inference latency itself. Long agent runs (file analysis, summarization, code review) were the visible victims. Change the accumulation to an array of delta strings plus a running utf8.count. After the stream completes (and only if it ended without a tool invocation — the tool-invocation branch doesn't need the joined content), allocate a single String of the right capacity and append the chunks once. That's O(n) total work and a single backing allocation instead of O(log n) reallocations. No SSE-wire change — the per-delta writeContent call is unchanged. Only the in-memory accumulation strategy is different. The '/run' endpoint is the only one using this pattern; the /chat/completions and /messages streaming loops already use a different (delta-only) shape. Co-authored-by: Michael Meding <mimeding@users.noreply.github.com>
ModelManager.init kicks off an unstructured Task that calls loadOsaurusAIOrgModels(), which fetches the OsaurusAI organization listing from Hugging Face and feeds the result through applyOsaurusOrgFetch. The unit-test runner repeatedly constructs ModelManager() to drive applyOsaurusOrgFetch directly. The background launch-time fetch races with those test calls — whichever finishes last wins, and the merge result is non-deterministic. That's the root cause of the flaky ModelManagerSuggestedTests failures seen across many of the recent PR CI runs (applyOsaurusOrgFetch_dropsStaleAutoFetched OnReapply, applyOsaurusOrgFetch_addsNewEntriesAfterCurated, etc.). Gate the launch-time fetch on a small isRunningInTestEnvironment helper that checks for any of XCTestConfigurationFilePath, XCTestBundlePath, or XCTestSessionIdentifier in the process environment. Those variables are only present inside an xctest host process; production app launches still get the HF fetch exactly as before. This is a network call, so removing it under tests also has the side benefit of making the test suite work offline / on hermetic CI runners. Co-authored-by: Michael Meding <mimeding@users.noreply.github.com>
PR #8 moved the assistant-turn assembly into the no-tool-invocation branch only, but the tool-invocation branch a few lines below ALSO reads 'responseContent' to record the assistant's pre-tool-call text on the ChatMessage. That left 'responseContent' undefined in the tool-invocation scope and broke the build with two 'cannot find responseContent in scope' errors. Materialize 'responseContent' once, before the 'guard let invocation = toolInvoked' branch, so both successful paths see the same already-joined String. Asymptotic shape is unchanged from the previous commit (single allocation, linear-time join); we just hoist it one level up so its lifetime covers both consumers. Co-authored-by: Michael Meding <mimeding@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Why this matters (business)
The agent SSE handler (
POST /agents/{id}/run) is what powers long-running agent turns over HTTP — Work mode, server-side tool invocations, the new background-task pipeline. On long generations the handler was doing more work per token than the inference engine itself, just to assemble the assistant message for the next iteration.For short turns this is invisible. For multi-thousand-character agent answers (file analyses, multi-file code reviews, summarizations) the accumulator's CPU cost grows quadratically and starts to dominate. Streaming feels increasingly choppy near the end of long turns, and the server-side latency profile is misleading because the slowness is in our own accumulation loop, not in the model.
What's wrong (technical)
Swift
Stringis a copy-on-write value type. Even though+=amortizes capacity growth, each append still has to walk the existing content and reallocate when capacity runs out. Across a full streamed turn this is O(n²) in characters — for a 20 KB assistant answer that's hundreds of millions of pointer-copies just to build the next message context, on top of the actual SSE flush.responseContentis only used after the stream completes (messages.append(ChatMessage(role: "assistant", content: responseContent))). Nothing reads it mid-stream. The tool-invocation branch doesn't use it at all.Fix
Accumulate streamed deltas in a
[String]and a runningutf8.count. After the stream ends and only on the "final text response" branch (no tool invocation), reserve a singleStringof the right capacity and append the chunks once. That's O(n) total work and a single backing allocation, regardless of stream length.No SSE-wire change — the per-delta
writeContentis unchanged. Only the in-memory accumulation strategy is different.Scope decisions
HTTPHandler.swiftthat doesresponseContent += deltaon the hot path./v1/chat/completionsand/messagesstreaming loops use a different shape (they don't materialize the full message in HTTP-handler memory).ResponseWritersper-token encode pattern (also flagged by the audit) is a separate, larger PR — it touches three writers and needs careful flush-coalescing benchmarks before landing.Changes
scripts/benchmark/and is a follow-up)Test Plan
/agents/{id}/run(e.g. ask it to summarize a multi-thousand-line file). Streamed deltas should arrive at the same rate; the final SSE event closes the stream as before.messages.append(.assistant(...))does not run on that branch, so behaviour is unchanged.Checklist
CONTRIBUTING.mdswiftc -frontend -parse)