
Optimize performance and improve time-to-first-token #50

@heeki

Description

Summary

Agents currently exhibit slow time to first token (TTFT), causing a sub-par user experience. This issue covers instrumenting the full latency path to identify bottlenecks, implementing targeted optimizations to reduce TTFT to a few seconds, and providing administrators with a latency analysis dashboard to prioritize and validate future improvements.

Context

The current invocation path in backend/app/routers/invocations.py performs the following operations sequentially before the first SSE token reaches the browser:

  1. Auth token resolution (invocations.py:828-894) — For agents with Cognito authorizers, the backend calls AWS Secrets Manager (get_secret) synchronously in the request path, then makes a Cognito token endpoint HTTP call (get_cognito_token). These two network round trips add latency to every cold-start invocation.
  2. Session and invocation DB writes (invocations.py:780-826) — Two db.commit() calls create session and invocation records before streaming begins.
  3. First token — Only after all of the above does invoke_agent_stream yield the first chunk SSE event.

Additionally, TTFT itself (time from client_invoke_time to receipt of the first chunk event) is not tracked as a separate metric — only cold_start_latency_ms (container init time) and client_duration_ms (total end-to-end time) are stored. This makes it impossible to distinguish network/auth/DB overhead from container cold-start overhead.

Key existing latency data available in Invocation:

  • client_invoke_time — when the backend received the invoke request
  • agent_start_time — when the agent container started (from CloudWatch)
  • cold_start_latency_ms — agent_start_time - client_invoke_time (ms)
  • client_done_time — when the last token was received
  • client_duration_ms — client_done_time - client_invoke_time (ms)

Missing:

  • ttft_ms — time from client_invoke_time to first chunk SSE event (TTFT)
  • Warm vs. cold start classification

Requirements

R1: TTFT Instrumentation

R1.1: Track TTFT Per Invocation

Add a ttft_ms field to the Invocation model to record the time from the backend receiving the invoke request to when the first text chunk is yielded to the SSE stream.

Current behavior:

  • backend/app/models/invocation.py — no ttft_ms column.
  • backend/app/routers/invocations.py:552-554 — client_done_time is set after the stream drains; no first-chunk timestamp exists.

Desired behavior:

  • Add ttft_ms = Column(Float, nullable=True) to the Invocation ORM model.
  • In invoke_agent_stream (invocations.py:440), record first_chunk_time = time.time() immediately before the first chunk SSE event is yielded.
  • Compute ttft_ms = (first_chunk_time - client_invoke_time) * 1000 and persist it alongside the existing latency fields in the same db.commit() that finalizes the invocation.
  • Include ttft_ms in InvocationResponse and invocation.to_dict().
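A minimal sketch of the R1.1 timing logic, assuming a hypothetical stream_with_ttft wrapper (in the real code this would live inside invoke_agent_stream) and a record_ttft callback that stands in for persisting ttft_ms in the finalizing db.commit():

```python
import time
from typing import AsyncIterator, Callable


async def stream_with_ttft(
    chunks: AsyncIterator[str],
    client_invoke_time: float,
    record_ttft: Callable[[float], None],
) -> AsyncIterator[str]:
    """Wrap an agent chunk stream as SSE events, recording TTFT exactly once,
    immediately before the first chunk is yielded."""
    first_seen = False
    async for chunk in chunks:
        if not first_seen:
            first_seen = True
            # ttft_ms = (first_chunk_time - client_invoke_time) * 1000
            record_ttft((time.time() - client_invoke_time) * 1000)
        yield f"data: {chunk}\n\n"
```

The callback keeps the timing concern separate from persistence, so the same wrapper can be unit-tested without a database session.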

R1.2: Classify Warm vs. Cold Invocations

Add a boolean was_cold_start field to the Invocation model so that warm and cold invocations can be filtered and compared separately.

Desired behavior:

  • Add was_cold_start = Column(Boolean, nullable=True) to Invocation.
  • Set was_cold_start = True when cold_start_latency_ms is populated and exceeds a configurable threshold (default: 500 ms, configurable via environment variable LOOM_COLD_START_THRESHOLD_MS).
  • Set was_cold_start = False when cold_start_latency_ms is below the threshold.
  • Leave was_cold_start = None when cold_start_latency_ms is not available (CloudWatch logs not yet retrieved).
  • Include was_cold_start in InvocationResponse and invocation.to_dict().

R2: Token Caching to Reduce Auth Overhead

R2.1: Cache M2M Cognito Tokens

The Cognito client credentials grant flow (backend/app/services/cognito.py, called from invocations.py:851-862 and invocations.py:884-894) fetches a fresh token from Secrets Manager and Cognito on every invocation. Cognito access tokens are valid for up to one hour. Caching them avoids two unnecessary network round trips per invocation.

Desired behavior:

  • Add a simple in-process token cache in backend/app/services/cognito.py (or a new backend/app/services/token_cache.py) keyed by (client_id, scopes).
  • Store (access_token, expires_at) tuples. Expire cache entries 60 seconds before the token's actual expiry to allow for clock skew.
  • On a cache hit, return the cached token without calling Secrets Manager or Cognito.
  • On a cache miss or expired entry, fetch a fresh token and populate the cache.
  • The cache is process-local (no external dependency required). Document that it resets on process restart.
  • Log cache hits and misses at DEBUG level for observability.
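A sketch of the cache shape described above. TokenCache and its fetch parameter are illustrative names, not the existing cognito.py API; fetch stands in for the Secrets Manager + Cognito round trips and returns (access_token, expires_in_seconds):

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


class TokenCache:
    """Process-local M2M token cache keyed by (client_id, scopes).

    Entries expire 60 s before the token's real expiry to allow for
    clock skew; the cache resets on process restart."""

    SKEW_SECONDS = 60

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], tuple[str, float]] = {}

    def get_token(
        self,
        client_id: str,
        scopes: str,
        fetch: Callable[[], tuple[str, int]],
    ) -> str:
        key = (client_id, scopes)
        entry = self._entries.get(key)
        now = time.time()
        if entry is not None and now < entry[1] - self.SKEW_SECONDS:
            logger.debug("token cache hit for %s", key)
            return entry[0]
        logger.debug("token cache miss for %s", key)
        token, expires_in = fetch()
        self._entries[key] = (token, now + expires_in)
        return token
```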

R2.2: Parallelize Auth Token Resolution and Session Creation

Auth token resolution and the initial session/invocation DB writes currently happen sequentially. Both can be initiated at the same time because they are independent.

Current flow (invocations.py:780-897):

  1. Resolve session or create new session (DB write + commit)
  2. Create invocation record (DB write + commit)
  3. Resolve access token (sequential: Secrets Manager → Cognito token endpoint)
  4. Call invoke_agent_stream

Desired flow:

  • Where the agent is known to require an access token (Priority 1: credential_id provided; Priority 3: agent config M2M), start token resolution concurrently with the session and invocation DB writes using asyncio.gather or asyncio.create_task.
  • Token resolution must complete before the boto3 invoke_agent_runtime call (agentcore.py:137), but can overlap with DB writes.
  • Preserve the existing priority order for token resolution (manual → credential_id → user login token → agent config M2M).
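The overlap can be sketched with asyncio.gather; resolve_token and create_records are stand-ins for the real auth and DB steps, and the point is that total wait becomes the max, not the sum, of the two latencies:

```python
import asyncio


async def resolve_token() -> str:
    """Stand-in for the Secrets Manager + Cognito round trips."""
    await asyncio.sleep(0.05)
    return "token"


async def create_records() -> int:
    """Stand-in for the session/invocation DB writes and commits."""
    await asyncio.sleep(0.05)
    return 42


async def prepare_invocation() -> tuple[str, int]:
    # Start both concurrently; gather awaits both before invoke_agent_runtime
    # is called, preserving the ordering constraint in the bullet above.
    token, invocation_id = await asyncio.gather(resolve_token(), create_records())
    return token, invocation_id
```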

R3: Latency Analysis Dashboard for Administrators

R3.1: Latency Summary API Endpoint

Add a backend endpoint that returns aggregated latency statistics across all agents and invocations, using the data already stored in the Invocation table.

Endpoint: GET /api/admin/latency (require agent:read scope)

Query parameters:

  • agent_id (optional) — filter to a specific agent
  • range — time range: 1h, 24h, 7d, 30d (default: 24h)
  • min_invocations (optional, default: 1) — exclude agents with fewer than this many invocations

Response shape:

{
  "range": "24h",
  "total_invocations": 120,
  "cold_start_pct": 0.35,
  "agents": [
    {
      "agent_id": 1,
      "agent_name": "My Agent",
      "invocation_count": 40,
      "cold_start_count": 14,
      "cold_start_pct": 0.35,
      "ttft_p50_ms": 1200,
      "ttft_p95_ms": 8400,
      "ttft_p99_ms": 12000,
      "cold_start_ttft_p50_ms": 6500,
      "warm_ttft_p50_ms": 900,
      "duration_p50_ms": 3200,
      "duration_p95_ms": 14000
    }
  ]
}

Compute percentiles at the data layer: for SQLite (local dev), fetch the raw latency values and compute percentiles in Python; for PostgreSQL, use percentile_cont via SQLAlchemy. Add the endpoint to backend/app/routers/admin.py or a new backend/app/routers/latency.py.

R3.2: Latency Analysis Tab in the Admin Dashboard

Add a Latency tab to frontend/src/pages/AdminDashboardPage.tsx (or a dedicated LatencyPage.tsx reachable from the admin navigation) that displays the data returned by GET /api/admin/latency.

Required visualizations:

  1. Summary cards — Total invocations, cold start percentage, and overall p50/p95 TTFT for the selected time range.
  2. TTFT distribution bar chart — Histogram of TTFT values bucketed into ranges (e.g., 0–1 s, 1–3 s, 3–5 s, 5–10 s, >10 s). Use the Recharts library already present in the project (see AdminDashboardPage.tsx:18-26).
  3. Per-agent latency table — Sortable table showing each agent's invocation count, cold start %, p50 TTFT, p95 TTFT, warm TTFT p50, and cold TTFT p50. Highlight agents with p95 TTFT > 10 s in a warning color.
  4. Cold vs. warm comparison — Side-by-side bar chart comparing average TTFT for cold-start vs. warm invocations per agent.
  5. Time range selector — Filter by 1h / 24h / 7d / 30d (consistent with the existing time range selector pattern in AdminDashboardPage.tsx:46-60).

R3.3: Per-Invocation TTFT in Invocation Detail

Surface ttft_ms and was_cold_start in the existing invocation detail view so individual invocations can be inspected.

Current behavior:

  • frontend/src/pages/InvocationDetailPage.tsx displays cold_start_latency_ms and client_duration_ms.

Desired behavior:

  • Add TTFT and Cold Start rows to the latency section of InvocationDetailPage.tsx.
  • Format TTFT as Xs (e.g., "1.2s") for readability.
  • Display a Cold start badge when was_cold_start = true, and a Warm badge when was_cold_start = false.

R4: Backend Unit Tests

Add unit tests in backend/tests/test_latency.py covering:

  • ttft_ms is populated correctly when the first chunk is received.
  • was_cold_start is True when cold_start_latency_ms exceeds LOOM_COLD_START_THRESHOLD_MS, False when below, and None when cold_start_latency_ms is absent.
  • Token cache returns cached token on hit and fetches fresh token on miss or expiry.
  • Latency summary endpoint returns correct p50/p95 values for a known set of invocations.

Files to Modify

  • backend/app/models/invocation.py — add ttft_ms and was_cold_start columns
  • backend/app/routers/invocations.py — record ttft_ms, set was_cold_start, parallelize auth + DB writes
  • backend/app/services/cognito.py — add M2M token cache
  • backend/app/routers/admin.py (or new latency.py) — add GET /api/admin/latency endpoint
  • frontend/src/pages/AdminDashboardPage.tsx — add Latency tab
  • frontend/src/pages/InvocationDetailPage.tsx — surface ttft_ms and was_cold_start
  • etc/environment.sh — document LOOM_COLD_START_THRESHOLD_MS

Files to Create

  • backend/app/services/token_cache.py — in-process M2M token cache (if extracted from cognito.py)
  • backend/tests/test_latency.py — unit tests for all new latency-related behavior

Acceptance Criteria

  • R1.1: ttft_ms is recorded on every invocation where at least one text chunk is received
  • R1.1: ttft_ms is included in InvocationResponse and returned by the API
  • R1.2: was_cold_start is True/False based on cold_start_latency_ms vs threshold; None when unknown
  • R2.1: Repeated invocations with the same credential reuse a cached Cognito token without calling Secrets Manager or Cognito on cache hits
  • R2.1: Cache entries expire before the actual token expiry (60 s buffer)
  • R2.2: Auth token resolution overlaps with session/invocation DB writes where possible
  • R3.1: GET /api/admin/latency returns TTFT percentiles (p50, p95, p99) broken down by agent and cold/warm classification
  • R3.2: Latency tab displays TTFT histogram, per-agent table, and cold vs. warm comparison chart
  • R3.2: Agents with p95 TTFT > 10 s are highlighted in the table
  • R3.3: InvocationDetailPage displays TTFT and cold/warm badge
  • R4: All unit tests in test_latency.py pass
  • No regressions to existing invocation streaming, cost tracking, or CloudWatch log retrieval

Metadata

Labels: enhancement (New feature or request)