Optimize performance and improve time-to-first-token #50
Summary
Agents currently exhibit slow time to first token (TTFT), causing a sub-par user experience. This issue covers instrumenting the full latency path to identify bottlenecks, implementing targeted optimizations to reduce TTFT to a few seconds, and providing administrators with a latency analysis dashboard to prioritize and validate future improvements.
Context
The current invocation path in `backend/app/routers/invocations.py` performs the following operations sequentially before the first SSE token reaches the browser:
- Auth token resolution (`invocations.py:828-894`) — For agents with Cognito authorizers, the backend calls AWS Secrets Manager (`get_secret`) synchronously in the request path, then makes a Cognito token endpoint HTTP call (`get_cognito_token`). These two network round trips add latency to every cold-start invocation.
- Session and invocation DB writes (`invocations.py:780-826`) — Two `db.commit()` calls create session and invocation records before streaming begins.
- First token — Only after all of the above does `invoke_agent_stream` yield the first `chunk` SSE event.
Additionally, TTFT itself (time from `client_invoke_time` to receipt of the first `chunk` event) is not tracked as a separate metric — only `cold_start_latency_ms` (container init time) and `client_duration_ms` (total end-to-end time) are stored. This makes it impossible to distinguish network/auth/DB overhead from container cold-start overhead.
Key existing latency data available in `Invocation`:
- `client_invoke_time` — when the backend received the invoke request
- `agent_start_time` — when the agent container started (from CloudWatch)
- `cold_start_latency_ms` — `agent_start_time - client_invoke_time` (ms)
- `client_done_time` — when the last token was received
- `client_duration_ms` — `client_done_time - client_invoke_time` (ms)
Missing:
- `ttft_ms` — time from `client_invoke_time` to first `chunk` SSE event (TTFT)
- Warm vs. cold start classification
Requirements
R1: TTFT Instrumentation
R1.1: Track TTFT Per Invocation
Add a `ttft_ms` field to the `Invocation` model to record the time from the backend receiving the invoke request to when the first text chunk is yielded to the SSE stream.
Current behavior:
- `backend/app/models/invocation.py` — no `ttft_ms` column.
- `backend/app/routers/invocations.py:552-554` — `client_done_time` is set after the stream drains; no first-chunk timestamp exists.
Desired behavior:
- Add `ttft_ms = Column(Float, nullable=True)` to the `Invocation` ORM model.
- In `invoke_agent_stream` (`invocations.py:440`), record `first_chunk_time = time.time()` immediately before the first `chunk` SSE event is yielded.
- Compute `ttft_ms = (first_chunk_time - client_invoke_time) * 1000` and persist it alongside the existing latency fields in the same `db.commit()` that finalizes the invocation.
- Include `ttft_ms` in `InvocationResponse` and `invocation.to_dict()`.
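The timing logic above can be sketched as a thin wrapper around the streaming generator. This is a minimal illustration, not the actual `invoke_agent_stream` code: `chunks` and the `record_ttft` callback are stand-ins for the real SSE event source and the DB persistence step.

```python
import time

def stream_with_ttft(chunks, client_invoke_time, record_ttft):
    """Yield SSE chunks, capturing TTFT just before the first chunk is emitted.

    `chunks` and `record_ttft` are illustrative stand-ins; in the real code the
    computed value would be persisted in the same db.commit() that finalizes
    the invocation.
    """
    first_chunk_seen = False
    for chunk in chunks:
        if not first_chunk_seen:
            first_chunk_seen = True
            first_chunk_time = time.time()
            # ttft_ms = (first_chunk_time - client_invoke_time) * 1000
            record_ttft((first_chunk_time - client_invoke_time) * 1000)
        yield chunk
```

Measuring immediately before the first yield (rather than after the stream drains) is what distinguishes `ttft_ms` from the existing `client_duration_ms`.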
R1.2: Classify Warm vs. Cold Invocations
Add a boolean `was_cold_start` field to the `Invocation` model so that warm and cold invocations can be filtered and compared separately.
Desired behavior:
- Add `was_cold_start = Column(Boolean, nullable=True)` to `Invocation`.
- Set `was_cold_start = True` when `cold_start_latency_ms` is populated and exceeds a configurable threshold (default: 500 ms, configurable via environment variable `LOOM_COLD_START_THRESHOLD_MS`).
- Set `was_cold_start = False` when `cold_start_latency_ms` is below the threshold.
- Leave `was_cold_start = None` when `cold_start_latency_ms` is not available (CloudWatch logs not yet retrieved).
- Include `was_cold_start` in `InvocationResponse` and `invocation.to_dict()`.
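The three-way classification is small enough to capture in one helper. A sketch under the issue's rules — the function name `classify_cold_start` is an illustrative choice, not an existing API:

```python
import os

def classify_cold_start(cold_start_latency_ms):
    """Return True (cold), False (warm), or None (unknown).

    Threshold default (500 ms) and the LOOM_COLD_START_THRESHOLD_MS env var
    follow the issue text; the helper name is illustrative.
    """
    if cold_start_latency_ms is None:
        return None  # CloudWatch logs not yet retrieved
    threshold = float(os.environ.get("LOOM_COLD_START_THRESHOLD_MS", "500"))
    return cold_start_latency_ms > threshold
```

Keeping the column nullable lets the dashboard exclude not-yet-classified invocations rather than miscounting them as warm.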
R2: Token Caching to Reduce Auth Overhead
R2.1: Cache M2M Cognito Tokens
The Cognito client credentials grant flow (`backend/app/services/cognito.py`, called from `invocations.py:851-862` and `invocations.py:884-894`) fetches a fresh token from Secrets Manager and Cognito on every invocation. Cognito access tokens are valid for up to one hour, so caching them avoids two unnecessary network round trips per invocation.
Desired behavior:
- Add a simple in-process token cache in `backend/app/services/cognito.py` (or a new `backend/app/services/token_cache.py`) keyed by `(client_id, scopes)`.
- Store `(access_token, expires_at)` tuples. Expire cache entries 60 seconds before the token's actual expiry to allow for clock skew.
- On a cache hit, return the cached token without calling Secrets Manager or Cognito.
- On a cache miss or expired entry, fetch a fresh token and populate the cache.
- The cache is process-local (no external dependency required). Document that it resets on process restart.
- Log cache hits and misses at `DEBUG` level for observability.
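A minimal sketch of such a cache, assuming the fetch path is injected as a callable (the real implementation would wrap the existing `get_secret` + `get_cognito_token` calls and add the `DEBUG` logging):

```python
import time
from threading import Lock

class TokenCache:
    """Process-local cache of (client_id, scopes) -> (access_token, expires_at).

    An illustrative sketch for backend/app/services/token_cache.py. Entries
    expire 60 s before the real token expiry to absorb clock skew, and the
    cache resets on process restart by design.
    """

    SKEW_S = 60

    def __init__(self):
        self._entries = {}
        self._lock = Lock()

    def get_token(self, client_id, scopes, fetch):
        """Return a cached token, or call fetch() -> (token, expires_in_s)."""
        key = (client_id, tuple(sorted(scopes)))
        now = time.time()
        with self._lock:
            entry = self._entries.get(key)
            if entry and entry[1] > now:
                return entry[0]  # cache hit: skip Secrets Manager and Cognito
        token, expires_in_s = fetch()  # cache miss or expired: fetch fresh token
        with self._lock:
            self._entries[key] = (token, now + expires_in_s - self.SKEW_S)
        return token
```

Sorting the scopes in the key makes hits order-insensitive; the lock keeps the dict safe if multiple worker threads resolve tokens concurrently.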
R2.2: Parallelize Auth Token Resolution and Session Creation
Auth token resolution and the initial session/invocation DB writes currently happen sequentially. Both can be initiated at the same time because they are independent.
Current flow (`invocations.py:780-897`):
1. Resolve session or create new session (DB write + commit)
2. Create invocation record (DB write + commit)
3. Resolve access token (sequential: Secrets Manager → Cognito token endpoint)
4. Call `invoke_agent_stream`
Desired flow:
- Where the agent is known to require an access token (Priority 1: `credential_id` provided; Priority 3: agent config M2M), start token resolution concurrently with the session and invocation DB writes using `asyncio.gather` or `asyncio.create_task`.
- Token resolution must complete before the boto3 `invoke_agent_runtime` call (`agentcore.py:137`), but can overlap with the DB writes.
- Preserve the existing priority order for token resolution (manual → `credential_id` → user login token → agent config M2M).
R3: Latency Analysis Dashboard for Administrators
R3.1: Latency Summary API Endpoint
Add a backend endpoint that returns aggregated latency statistics across all agents and invocations, using the data already stored in the `Invocation` table.
Endpoint: `GET /api/admin/latency` (requires `agent:read` scope)
Query parameters:
- `agent_id` (optional) — filter to a specific agent
- `range` — time range: `1h`, `24h`, `7d`, `30d` (default: `24h`)
- `min_invocations` (optional, default: 1) — exclude agents with fewer than this many invocations
Response shape:

```json
{
  "range": "24h",
  "total_invocations": 120,
  "cold_start_pct": 0.35,
  "agents": [
    {
      "agent_id": 1,
      "agent_name": "My Agent",
      "invocation_count": 40,
      "cold_start_count": 14,
      "cold_start_pct": 0.35,
      "ttft_p50_ms": 1200,
      "ttft_p95_ms": 8400,
      "ttft_p99_ms": 12000,
      "cold_start_ttft_p50_ms": 6500,
      "warm_ttft_p50_ms": 900,
      "duration_p50_ms": 3200,
      "duration_p95_ms": 14000
    }
  ]
}
```

Use SQLAlchemy to compute percentiles. For SQLite (local dev), compute them in Python after fetching the raw values; for PostgreSQL, use `percentile_cont`. Add the endpoint to `backend/app/routers/admin.py` or a new `backend/app/routers/latency.py`.
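For the SQLite fallback, a linearly interpolated percentile (matching the semantics of PostgreSQL's `percentile_cont`) is a few lines. A sketch, with `p` in [0, 1]:

```python
def percentile(values, p):
    """Linearly interpolated percentile of `values` for p in [0, 1].

    Python-side fallback for SQLite; on PostgreSQL the same number comes from
    percentile_cont(p) WITHIN GROUP (ORDER BY ...). Returns None for no data.
    """
    if not values:
        return None
    ordered = sorted(values)
    idx = p * (len(ordered) - 1)          # fractional rank into the sorted list
    lo = int(idx)
    hi = min(lo + 1, len(ordered) - 1)
    frac = idx - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac
```

Using the same interpolation rule in both backends keeps p50/p95/p99 comparable between local dev and production.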
R3.2: Latency Analysis Tab in the Admin Dashboard
Add a Latency tab to `frontend/src/pages/AdminDashboardPage.tsx` (or a dedicated `LatencyPage.tsx` reachable from the admin navigation) that displays the data returned by `GET /api/admin/latency`.
Required visualizations:
- Summary cards — Total invocations, cold start percentage, and overall p50/p95 TTFT for the selected time range.
- TTFT distribution bar chart — Histogram of TTFT values bucketed into ranges (e.g., 0–1 s, 1–3 s, 3–5 s, 5–10 s, >10 s). Use the Recharts library already present in the project (see `AdminDashboardPage.tsx:18-26`).
- Per-agent latency table — Sortable table showing each agent's invocation count, cold start %, p50 TTFT, p95 TTFT, warm TTFT p50, and cold TTFT p50. Highlight agents with p95 TTFT > 10 s in a warning color.
- Cold vs. warm comparison — Side-by-side bar chart comparing average TTFT for cold-start vs. warm invocations per agent.
- Time range selector — Filter by 1h / 24h / 7d / 30d (consistent with the existing time range selector pattern in `AdminDashboardPage.tsx:46-60`).
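Whether the histogram bucketing happens server-side or in the frontend is an open design choice; a Python sketch of the server-side option, using the bucket edges from the example above (labels are illustrative):

```python
def bucket_ttft(ttft_values_ms):
    """Count TTFT values (ms) into the histogram ranges from R3.2.

    Bucket edges mirror the issue's example (0-1 s, 1-3 s, 3-5 s, 5-10 s,
    >10 s); the label strings are an illustrative choice.
    """
    edges = [1000, 3000, 5000, 10000]
    labels = ["0-1s", "1-3s", "3-5s", "5-10s", ">10s"]
    counts = {label: 0 for label in labels}
    for value in ttft_values_ms:
        for edge, label in zip(edges, labels):
            if value < edge:
                counts[label] += 1
                break
        else:
            counts[">10s"] += 1  # fell past every edge
    return counts
```

Returning fixed labels (including zero-count buckets) keeps the Recharts bar chart stable across time ranges with sparse data.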
R3.3: Per-Invocation TTFT in Invocation Detail
Surface `ttft_ms` and `was_cold_start` in the existing invocation detail view so individual invocations can be inspected.
Current behavior:
- `frontend/src/pages/InvocationDetailPage.tsx` displays `cold_start_latency_ms` and `client_duration_ms`.
Desired behavior:
- Add TTFT and Cold Start rows to the latency section of `InvocationDetailPage.tsx`.
- Format TTFT as `Xs` (e.g., "1.2s") for readability.
- Display a Cold start badge when `was_cold_start = true`, and a Warm badge when `was_cold_start = false`.
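The display rule is trivial but worth pinning down; a Python sketch (the actual formatting would be the equivalent few characters of TypeScript in `InvocationDetailPage.tsx`):

```python
def format_ttft(ttft_ms):
    """Render a TTFT in milliseconds as a short seconds string, e.g. 1234 -> "1.2s"."""
    return f"{ttft_ms / 1000:.1f}s"
```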
R4: Backend Unit Tests
Add unit tests in `backend/tests/test_latency.py` covering:
- `ttft_ms` is populated correctly when the first chunk is received.
- `was_cold_start` is `True` when `cold_start_latency_ms` exceeds `LOOM_COLD_START_THRESHOLD_MS`, `False` when below, and `None` when `cold_start_latency_ms` is absent.
- Token cache returns the cached token on a hit and fetches a fresh token on a miss or expiry.
- Latency summary endpoint returns correct p50/p95 values for a known set of invocations.
Files to Modify
| File | Changes |
|---|---|
| `backend/app/models/invocation.py` | Add `ttft_ms`, `was_cold_start` columns |
| `backend/app/routers/invocations.py` | Record `ttft_ms`, set `was_cold_start`, parallelize auth + DB writes |
| `backend/app/services/cognito.py` | Add M2M token cache |
| `backend/app/routers/admin.py` (or new `latency.py`) | Add `GET /api/admin/latency` endpoint |
| `frontend/src/pages/AdminDashboardPage.tsx` | Add Latency tab |
| `frontend/src/pages/InvocationDetailPage.tsx` | Surface `ttft_ms` and `was_cold_start` |
| `etc/environment.sh` | Document `LOOM_COLD_START_THRESHOLD_MS` |
Files to Create
| File | Description |
|---|---|
| `backend/app/services/token_cache.py` | In-process M2M token cache (if extracted from `cognito.py`) |
| `backend/tests/test_latency.py` | Unit tests for all new latency-related behavior |
Acceptance Criteria
- R1.1: `ttft_ms` is recorded on every invocation where at least one text chunk is received
- R1.1: `ttft_ms` is included in `InvocationResponse` and returned by the API
- R1.2: `was_cold_start` is `True`/`False` based on `cold_start_latency_ms` vs. the threshold; `None` when unknown
- R2.1: Repeated invocations with the same credential reuse a cached Cognito token without calling Secrets Manager or Cognito on cache hits
- R2.1: Cache entries expire before the actual token expiry (60 s buffer)
- R2.2: Auth token resolution overlaps with session/invocation DB writes where possible
- R3.1: `GET /api/admin/latency` returns TTFT percentiles (p50, p95, p99) broken down by agent and cold/warm classification
- R3.2: Latency tab displays the TTFT histogram, per-agent table, and cold vs. warm comparison chart
- R3.2: Agents with p95 TTFT > 10 s are highlighted in the table
- R3.3: `InvocationDetailPage` displays TTFT and the cold/warm badge
- R4: All unit tests in `test_latency.py` pass
- No regressions to existing invocation streaming, cost tracking, or CloudWatch log retrieval