
Optimize performance and improve time-to-first-token #50

@heeki

Description

Summary

Agents currently exhibit slow time to first token (TTFT), causing a sub-par user experience. This issue covers instrumenting the full latency path to identify bottlenecks, implementing targeted optimizations to reduce TTFT to a few seconds, and providing administrators with a latency analysis dashboard to prioritize and validate future improvements.

Context

The current invocation path in backend/app/routers/invocations.py performs the following operations sequentially before the first SSE token reaches the browser:

  1. Auth token resolution (invocations.py:828-894) — For agents with Cognito authorizers, the backend calls AWS Secrets Manager (get_secret) synchronously in the request path, then makes a Cognito token endpoint HTTP call (get_cognito_token). These two network round trips add latency to every cold-start invocation.
  2. Session and invocation DB writes (invocations.py:780-826) — Two db.commit() calls create session and invocation records before streaming begins.
  3. First token — Only after all of the above does invoke_agent_stream yield the first chunk SSE event.

Additionally, TTFT itself (time from client_invoke_time to receipt of the first chunk event) is not tracked as a separate metric — only cold_start_latency_ms (container init time) and client_duration_ms (total end-to-end time) are stored. This makes it impossible to distinguish network/auth/DB overhead from container cold-start overhead.

Key existing latency data available in Invocation:

  • client_invoke_time — when the backend received the invoke request
  • agent_start_time — when the agent container started (from CloudWatch)
  • cold_start_latency_ms — agent_start_time - client_invoke_time (ms)
  • client_done_time — when the last token was received
  • client_duration_ms — client_done_time - client_invoke_time (ms)

Missing:

  • ttft_ms — time from client_invoke_time to first chunk SSE event (TTFT)
  • Warm vs. cold start classification

Requirements

R1: TTFT Instrumentation

R1.1: Track TTFT Per Invocation

Add a ttft_ms field to the Invocation model to record the time from the backend receiving the invoke request to when the first text chunk is yielded to the SSE stream.

Current behavior:

  • backend/app/models/invocation.py — no ttft_ms column.
  • backend/app/routers/invocations.py:552-554 — client_done_time is set after the stream drains; no first-chunk timestamp exists.

Desired behavior:

  • Add ttft_ms = Column(Float, nullable=True) to the Invocation ORM model.
  • In invoke_agent_stream (invocations.py:440), record first_chunk_time = time.time() immediately before the first chunk SSE event is yielded.
  • Compute ttft_ms = (first_chunk_time - client_invoke_time) * 1000 and persist it alongside the existing latency fields in the same db.commit() that finalizes the invocation.
  • Include ttft_ms in InvocationResponse and invocation.to_dict().
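A minimal sketch of the R1.1 timing logic, assuming a hypothetical stream_with_ttft wrapper (in the real code this would live inside invoke_agent_stream) and a record_ttft callback that stands in for persisting ttft_ms in the finalizing db.commit():

```python
import time
from typing import AsyncIterator, Callable


async def stream_with_ttft(
    chunks: AsyncIterator[str],
    client_invoke_time: float,
    record_ttft: Callable[[float], None],
) -> AsyncIterator[str]:
    """Wrap an agent chunk stream as SSE events, recording TTFT exactly once,
    immediately before the first chunk is yielded."""
    first_seen = False
    async for chunk in chunks:
        if not first_seen:
            first_seen = True
            # ttft_ms = (first_chunk_time - client_invoke_time) * 1000
            record_ttft((time.time() - client_invoke_time) * 1000)
        yield f"data: {chunk}\n\n"
```

The callback keeps the timing concern separate from persistence, so the same wrapper can be unit-tested without a database session.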

R1.2: Classify Warm vs. Cold Invocations

Add a boolean was_cold_start field to the Invocation model so that warm and cold invocations can be filtered and compared separately.

Desired behavior:

  • Add was_cold_start = Column(Boolean, nullable=True) to Invocation.
  • Set was_cold_start = True when cold_start_latency_ms is populated and exceeds a configurable threshold (default: 500 ms, configurable via environment variable LOOM_COLD_START_THRESHOLD_MS).
  • Set was_cold_start = False when cold_start_latency_ms is below the threshold.
  • Leave was_cold_start = None when cold_start_latency_ms is not available (CloudWatch logs not yet retrieved).
  • Include was_cold_start in InvocationResponse and invocation.to_dict().

R2: Token Caching to Reduce Auth Overhead

R2.1: Cache M2M Cognito Tokens

The Cognito client credentials grant flow (backend/app/services/cognito.py, called from invocations.py:851-862 and invocations.py:884-894) fetches a fresh token from Secrets Manager and Cognito on every invocation. Cognito access tokens are valid for up to one hour. Caching them avoids two unnecessary network round trips per invocation.

Desired behavior:

  • Add a simple in-process token cache in backend/app/services/cognito.py (or a new backend/app/services/token_cache.py) keyed by (client_id, scopes).
  • Store (access_token, expires_at) tuples. Expire cache entries 60 seconds before the token's actual expiry to allow for clock skew.
  • On a cache hit, return the cached token without calling Secrets Manager or Cognito.
  • On a cache miss or expired entry, fetch a fresh token and populate the cache.
  • The cache is process-local (no external dependency required). Document that it resets on process restart.
  • Log cache hits and misses at DEBUG level for observability.
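A sketch of the cache shape described above. TokenCache and its fetch parameter are illustrative names, not the existing cognito.py API; fetch stands in for the Secrets Manager + Cognito round trips and returns (access_token, expires_in_seconds):

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


class TokenCache:
    """Process-local M2M token cache keyed by (client_id, scopes).

    Entries expire 60 s before the token's real expiry to allow for
    clock skew; the cache resets on process restart."""

    SKEW_SECONDS = 60

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], tuple[str, float]] = {}

    def get_token(
        self,
        client_id: str,
        scopes: str,
        fetch: Callable[[], tuple[str, int]],
    ) -> str:
        key = (client_id, scopes)
        entry = self._entries.get(key)
        now = time.time()
        if entry is not None and now < entry[1] - self.SKEW_SECONDS:
            logger.debug("token cache hit for %s", key)
            return entry[0]
        logger.debug("token cache miss for %s", key)
        token, expires_in = fetch()
        self._entries[key] = (token, now + expires_in)
        return token
```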

R2.2: Parallelize Auth Token Resolution and Session Creation

Auth token resolution and the initial session/invocation DB writes currently happen sequentially. Both can be initiated at the same time because they are independent.

Current flow (invocations.py:780-897):

  1. Resolve session or create new session (DB write + commit)
  2. Create invocation record (DB write + commit)
  3. Resolve access token (sequential: Secrets Manager → Cognito token endpoint)
  4. Call invoke_agent_stream

Desired flow:

  • Where the agent is known to require an access token (Priority 1: credential_id provided; Priority 3: agent config M2M), start token resolution concurrently with the session and invocation DB writes using asyncio.gather or asyncio.create_task.
  • Token resolution must complete before the boto3 invoke_agent_runtime call (agentcore.py:137), but can overlap with DB writes.
  • Preserve the existing priority order for token resolution (manual → credential_id → user login token → agent config M2M).
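The overlap can be sketched with asyncio.gather; resolve_token and create_records are stand-ins for the real auth and DB steps, and the point is that total wait becomes the max, not the sum, of the two latencies:

```python
import asyncio


async def resolve_token() -> str:
    """Stand-in for the Secrets Manager + Cognito round trips."""
    await asyncio.sleep(0.05)
    return "token"


async def create_records() -> int:
    """Stand-in for the session/invocation DB writes and commits."""
    await asyncio.sleep(0.05)
    return 42


async def prepare_invocation() -> tuple[str, int]:
    # Start both concurrently; gather awaits both before invoke_agent_runtime
    # is called, preserving the ordering constraint in the bullet above.
    token, invocation_id = await asyncio.gather(resolve_token(), create_records())
    return token, invocation_id
```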

R3: Latency Analysis Dashboard for Administrators

R3.1: Latency Summary API Endpoint

Add a backend endpoint that returns aggregated latency statistics across all agents and invocations, using the data already stored in the Invocation table.

Endpoint: GET /api/admin/latency (require agent:read scope)

Query parameters:

  • agent_id (optional) — filter to a specific agent
  • range — time range: 1h, 24h, 7d, 30d (default: 24h)
  • min_invocations (optional, default: 1) — exclude agents with fewer than this many invocations

Response shape:

{
  "range": "24h",
  "total_invocations": 120,
  "cold_start_pct": 0.35,
  "agents": [
    {
      "agent_id": 1,
      "agent_name": "My Agent",
      "invocation_count": 40,
      "cold_start_count": 14,
      "cold_start_pct": 0.35,
      "ttft_p50_ms": 1200,
      "ttft_p95_ms": 8400,
      "ttft_p99_ms": 12000,
      "cold_start_ttft_p50_ms": 6500,
      "warm_ttft_p50_ms": 900,
      "duration_p50_ms": 3200,
      "duration_p95_ms": 14000
    }
  ]
}

Compute percentiles at the data layer: for SQLite (local dev), fetch the raw latency values and compute percentiles in Python; for PostgreSQL, use percentile_cont via SQLAlchemy. Add the endpoint to backend/app/routers/admin.py or a new backend/app/routers/latency.py.

R3.2: Latency Analysis Tab in the Admin Dashboard

Add a Latency tab to frontend/src/pages/AdminDashboardPage.tsx (or a dedicated LatencyPage.tsx reachable from the admin navigation) that displays the data returned by GET /api/admin/latency.

Required visualizations:

  1. Summary cards — Total invocations, cold start percentage, and overall p50/p95 TTFT for the selected time range.
  2. TTFT distribution bar chart — Histogram of TTFT values bucketed into ranges (e.g., 0–1 s, 1–3 s, 3–5 s, 5–10 s, >10 s). Use the Recharts library already present in the project (see AdminDashboardPage.tsx:18-26).
  3. Per-agent latency table — Sortable table showing each agent's invocation count, cold start %, p50 TTFT, p95 TTFT, warm TTFT p50, and cold TTFT p50. Highlight agents with p95 TTFT > 10 s in a warning color.
  4. Cold vs. warm comparison — Side-by-side bar chart comparing average TTFT for cold-start vs. warm invocations per agent.
  5. Time range selector — Filter by 1h / 24h / 7d / 30d (consistent with the existing time range selector pattern in AdminDashboardPage.tsx:46-60).

R3.3: Per-Invocation TTFT in Invocation Detail

Surface ttft_ms and was_cold_start in the existing invocation detail view so individual invocations can be inspected.

Current behavior:

  • frontend/src/pages/InvocationDetailPage.tsx displays cold_start_latency_ms and client_duration_ms.

Desired behavior:

  • Add TTFT and Cold Start rows to the latency section of InvocationDetailPage.tsx.
  • Format TTFT as Xs (e.g., "1.2s") for readability.
  • Display a Cold start badge when was_cold_start = true, and a Warm badge when was_cold_start = false.

R4: Backend Unit Tests

Add unit tests in backend/tests/test_latency.py covering:

  • ttft_ms is populated correctly when the first chunk is received.
  • was_cold_start is True when cold_start_latency_ms exceeds LOOM_COLD_START_THRESHOLD_MS, False when below, and None when cold_start_latency_ms is absent.
  • Token cache returns cached token on hit and fetches fresh token on miss or expiry.
  • Latency summary endpoint returns correct p50/p95 values for a known set of invocations.

Files to Modify

  • backend/app/models/invocation.py — add ttft_ms and was_cold_start columns
  • backend/app/routers/invocations.py — record ttft_ms, set was_cold_start, parallelize auth + DB writes
  • backend/app/services/cognito.py — add M2M token cache
  • backend/app/routers/admin.py (or new latency.py) — add GET /api/admin/latency endpoint
  • frontend/src/pages/AdminDashboardPage.tsx — add Latency tab
  • frontend/src/pages/InvocationDetailPage.tsx — surface ttft_ms and was_cold_start
  • etc/environment.sh — document LOOM_COLD_START_THRESHOLD_MS

Files to Create

  • backend/app/services/token_cache.py — in-process M2M token cache (if extracted from cognito.py)
  • backend/tests/test_latency.py — unit tests for all new latency-related behavior

Acceptance Criteria

  • R1.1: ttft_ms is recorded on every invocation where at least one text chunk is received
  • R1.1: ttft_ms is included in InvocationResponse and returned by the API
  • R1.2: was_cold_start is True/False based on cold_start_latency_ms vs threshold; None when unknown
  • R2.1: Repeated invocations with the same credential reuse a cached Cognito token without calling Secrets Manager or Cognito on cache hits
  • R2.1: Cache entries expire before the actual token expiry (60 s buffer)
  • R2.2: Auth token resolution overlaps with session/invocation DB writes where possible
  • R3.1: GET /api/admin/latency returns TTFT percentiles (p50, p95, p99) broken down by agent and cold/warm classification
  • R3.2: Latency tab displays TTFT histogram, per-agent table, and cold vs. warm comparison chart
  • R3.2: Agents with p95 TTFT > 10 s are highlighted in the table
  • R3.3: InvocationDetailPage displays TTFT and cold/warm badge
  • R4: All unit tests in test_latency.py pass
  • No regressions to existing invocation streaming, cost tracking, or CloudWatch log retrieval

Metadata

Labels: enhancement (New feature or request)