Add prompt caching for model inference cost reduction #35

@heeki

Description

Overview

Add prompt caching support to reduce costs associated with model inference. Prompt caching allows the system prompt and repeated context to be cached across invocations, significantly reducing input token costs for multi-turn conversations. The implementation must account for differences across model families — Anthropic Claude models support prompt caching natively via cache control breakpoints, while Amazon Nova models do not currently support this feature. Users should be able to enable caching per agent or at the model family level, and should see reporting on cache hit rates, efficiency gains, and cost savings.

This issue depends on issue #30 for token usage tracking and cost estimation infrastructure.

Context

Current State

  • The Strands BedrockModel is initialized with only model_id, max_tokens, and streaming=True in agents/strands_agent/src/agent.py (lines 32-37) — no caching parameters are passed
  • SUPPORTED_MODELS in agents.py (lines 52-63) lists 5 Anthropic Claude models and 5 Amazon Nova models with model_id, display_name, group, and max_tokens — no caching capability flags
  • The AGENT_CONFIG_JSON structure contains system_prompt, model_id, max_tokens, and integrations — no caching configuration
  • The AgentConfig dataclass in agents/strands_agent/src/config.py has no caching-related fields
  • The invocation flow passes only the prompt text through the API chain (invoke_agent_runtime() in agentcore.py) — no cache control directives
  • The system prompt is static per agent deployment, making it an ideal candidate for caching since it does not change between invocations
  • No references to prompt caching, cache_control, or cache breakpoints exist in the codebase (the only "cache" references are HTTP Cache-Control: no-cache for SSE streaming and __pycache__ cleanup during artifact builds)
  • The hook system (BeforeInvocationEvent, AfterInvocationEvent) in the Strands agent provides injection points at invocation boundaries
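The hook system noted above offers a natural extraction point for the cache metrics this issue will need. A minimal sketch of parsing an Anthropic-style usage payload (field names follow Anthropic's documented response metadata; the exact shape surfaced through Strands or AgentCore may differ and should be verified):

```python
# Sketch: pull cache metrics out of an Anthropic-style `usage` payload.
# Field names follow Anthropic's API docs; the shape delivered through
# Strands / AgentCore may differ and should be verified.
def extract_cache_metrics(usage: dict) -> dict:
    read = usage.get("cache_read_input_tokens")
    write = usage.get("cache_creation_input_tokens")
    return {
        "cache_read_tokens": read,
        "cache_write_tokens": write,
        # A cache hit means at least one token was served from cache; None if unknown.
        "cache_hit": (read or 0) > 0 if read is not None else None,
    }

print(extract_cache_metrics({"cache_read_input_tokens": 1024, "cache_creation_input_tokens": 0}))
```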

Key Files

  • agents/strands_agent/src/agent.py — Builds Strands Agent with BedrockModel (no cache params)
  • agents/strands_agent/src/config.py — AgentConfig dataclass (no caching fields)
  • agents/strands_agent/src/handler.py — Entry point, invokes agent.stream_async(prompt)
  • agents/strands_agent/requirements.txt — Dependencies: strands-agents[a2a]>=0.1.0, boto3>=1.35.0
  • backend/app/routers/agents.py — SUPPORTED_MODELS, _deploy_agent(), AGENT_CONFIG_JSON construction (lines 520-529)
  • backend/app/routers/invocations.py — SSE invocation endpoint
  • backend/app/services/agentcore.py — invoke_agent_runtime() boto3 call (lines 74-162)
  • backend/app/models/invocation.py — Invocation ORM model (timing + content, no cache metrics)
  • frontend/src/components/LatencySummary.tsx — Timing metrics display
  • frontend/src/pages/AgentDetailPage.tsx — Agent detail with invocation panel

Technology Stack

  • Agent Framework: Strands Agents SDK (strands-agents), BedrockModel
  • Runtime: AWS Bedrock AgentCore (Python 3.13, ARM64)
  • Backend: Python, FastAPI, SQLAlchemy, SQLite
  • Frontend: TypeScript, React, Vite, shadcn/ui, Tailwind CSS
  • AWS SDK: boto3 (bedrock-agentcore client for invocation)

Prompt Caching by Model Family

  • Anthropic Claude: Supports prompt caching via cache_control breakpoints in the messages API. Cached input tokens are billed at a reduced rate (typically 90% discount). A cache write occurs on the first request; subsequent requests with the same prefix get cache reads.
  • Amazon Nova: Does not currently support prompt caching. When caching is requested for a Nova model, the configuration should be skipped or disabled without failing the invocation.
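If the Strands SDK does not expose caching directly, the Bedrock Converse API accepts a cachePoint content block that marks everything before it as a cacheable prefix. A hedged sketch of gating that breakpoint by model capability (the supports_caching flag mirrors the capability flag this issue proposes; verify the cachePoint shape against the boto3 version in use):

```python
# Sketch: build the `system` array for a Bedrock Converse call, inserting a
# cachePoint block after the system prompt only when the model supports caching.
# The cachePoint shape follows the Bedrock Converse API documentation.
def build_system_blocks(system_prompt: str, supports_caching: bool, caching_enabled: bool) -> list:
    blocks = [{"text": system_prompt}]
    if caching_enabled and supports_caching:
        # Everything before the cachePoint becomes a cacheable prefix.
        blocks.append({"cachePoint": {"type": "default"}})
    return blocks

print(len(build_system_blocks("You are helpful.", True, True)))   # → 2 (Claude: breakpoint added)
print(len(build_system_blocks("You are helpful.", False, True)))  # → 1 (Nova: no breakpoint)
```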

Requirements

R1: Enable prompt caching per agent or model family

Users should be able to enable prompt caching for individual agents or at the model family level. The implementation must correctly handle differences between model families.

  • Add a supports_prompt_caching boolean flag to each entry in SUPPORTED_MODELS:
    • Anthropic Claude models: true
    • Amazon Nova models: false
  • Expose this capability in the models endpoint (or GET /api/pricing/models from issue #30) so the frontend knows which models support caching
  • Add a prompt_caching_enabled field to AgentConfig (dataclass in config.py) with a default of false
  • Add a prompt_caching_enabled field to AgentDeployRequest and include it in the AGENT_CONFIG_JSON environment variable during deployment
  • Add a toggle in AgentRegistrationForm.tsx to enable/disable prompt caching:
    • The toggle should only be enabled when a model that supports caching is selected
    • If the user selects a model that does not support caching, the toggle should be disabled and display a tooltip explaining why
    • Include the prompt_caching_enabled field in JSON import/export (issue #27)
  • When prompt_caching_enabled is true and the model supports it, configure the Strands BedrockModel or the underlying Bedrock API call with cache control parameters:
    • Apply a cache breakpoint after the system prompt so it is cached across invocations within the same session
    • If the Strands SDK exposes cache control parameters on BedrockModel, use them directly
    • If the Strands SDK does not expose caching natively, explore passing additional model kwargs or extending the model wrapper to inject cache_control in the messages payload
  • When prompt_caching_enabled is true but the model does not support caching (e.g. user switches models after enabling), log a warning and proceed without caching — do not fail the invocation
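The fallback behavior in the last bullet could be centralized in one helper, assuming a supports_prompt_caching flag per model as proposed above (the model IDs and capability map below are illustrative; real entries would come from SUPPORTED_MODELS in agents.py):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative capability map mirroring the proposed supports_prompt_caching
# flag in SUPPORTED_MODELS (real entries live in agents.py).
SUPPORTS_PROMPT_CACHING = {
    "anthropic.claude-3-5-sonnet-20241022-v2:0": True,
    "amazon.nova-pro-v1:0": False,
}

def effective_caching(model_id: str, prompt_caching_enabled: bool) -> bool:
    """Decide whether caching is actually applied for this invocation."""
    supported = SUPPORTS_PROMPT_CACHING.get(model_id, False)
    if prompt_caching_enabled and not supported:
        # Warn and proceed without caching; never fail the invocation.
        logger.warning("Prompt caching enabled but unsupported for %s; skipping", model_id)
    return prompt_caching_enabled and supported

print(effective_caching("amazon.nova-pro-v1:0", True))  # → False
```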

R2: Cache efficiency reporting

Users should see backend reporting on cache hit rates, efficiency gains, and cost savings from prompt caching.

  • Extend the Invocation model with caching metrics (builds on the token fields from issue #30):
    • cache_read_tokens (integer, nullable) — number of tokens served from cache
    • cache_write_tokens (integer, nullable) — number of tokens written to cache on first request
    • cache_hit (boolean, nullable) — whether the invocation resulted in a cache hit
  • Extract cache metrics from the model response:
    • Anthropic Claude responses include usage.cache_creation_input_tokens and usage.cache_read_input_tokens in the response metadata
    • Parse these values from the AgentCore invocation response or the Strands SDK callback
    • Store them on the Invocation record at completion
  • Create a backend endpoint (e.g. GET /api/agents/{agent_id}/cache-stats) that returns aggregate caching statistics:
    • Total invocations with caching enabled
    • Cache hit rate (percentage of invocations with cache hits)
    • Total cache read tokens vs. total input tokens (efficiency ratio)
    • Estimated cost savings: (cache_read_tokens * (regular_input_price - cached_input_price)) / 1000, where prices are quoted per 1,000 tokens
    • Breakdown by time period (e.g. last 7 days, 30 days)
  • Extend the cost dashboard (issue #30, R2) to include a caching section:
    • Show aggregate cache savings across all agents in the group
    • Highlight which agents benefit most from caching
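The aggregate statistics above reduce to a few sums over Invocation rows. A sketch, with illustrative rows and prices (prices assumed per 1,000 tokens, matching the savings formula above):

```python
# Sketch of the aggregation behind the proposed cache-stats endpoint.
# Row field names match the proposed Invocation columns; prices per 1,000 tokens.
def cache_stats(invocations: list, input_price_per_1k: float, cached_price_per_1k: float) -> dict:
    cached = [i for i in invocations if i.get("caching_enabled")]
    hits = sum(1 for i in cached if i.get("cache_hit"))
    read = sum(i.get("cache_read_tokens") or 0 for i in cached)
    total_input = sum(i.get("input_tokens") or 0 for i in cached)
    return {
        "total_with_caching": len(cached),
        "cache_hit_rate": hits / len(cached) if cached else 0.0,
        "efficiency_ratio": read / total_input if total_input else 0.0,
        # Savings formula from R2: cache reads billed at the cached rate.
        "estimated_savings_usd": read * (input_price_per_1k - cached_price_per_1k) / 1000,
    }

rows = [
    {"caching_enabled": True, "cache_hit": False, "cache_read_tokens": 0, "input_tokens": 1200},
    {"caching_enabled": True, "cache_hit": True, "cache_read_tokens": 1000, "input_tokens": 1300},
]
print(cache_stats(rows, input_price_per_1k=0.003, cached_price_per_1k=0.0003))
```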

R3: Cache status visibility and usage impact

Users should see that prompt caching is enabled on their agents and understand the impact on token usage and cost.

  • Display a visual indicator on agent cards (AgentCard.tsx) when prompt caching is enabled:
    • A small badge or icon (e.g. a cache/lightning icon) next to the model name
    • Tooltip showing "Prompt caching enabled"
  • On the agent detail page (AgentDetailPage.tsx), add a caching section:
    • Show whether caching is enabled/disabled with a toggle to change it (triggers redeployment of AGENT_CONFIG_JSON)
    • Display cache statistics: hit rate, total cache reads, estimated savings
    • Show a per-session breakdown — cache writes typically occur on the first invocation of a session, with subsequent invocations in the same session getting cache reads
  • In the invocation detail (InvocationDetailPage.tsx) and LatencySummary:
    • Show cache read/write token counts alongside input/output token counts
    • If a cache hit occurred, highlight the cost savings for that invocation (e.g. "Saved $X.XX from cache")
    • Differentiate between cached and non-cached input tokens in the token breakdown
  • In the sessions table on the agent detail page:
    • Add a column or indicator showing cache utilization per session
    • Sessions with caching should show the ratio of cached vs. uncached tokens

Testing

  • Run backend tests: cd backend && make test
  • Run frontend typecheck: cd frontend && npx tsc --noEmit
  • Verify caching toggle:
    • Select a Claude model → caching toggle is enabled and functional
    • Select a Nova model → caching toggle is disabled with tooltip
    • Enable caching → deploy agent → AGENT_CONFIG_JSON includes prompt_caching_enabled: true
  • Verify cache behavior during invocation:
    • Deploy an agent with caching enabled (Claude model)
    • First invocation in a session: expect cache write tokens (system prompt cached)
    • Subsequent invocations in the same session: expect cache read tokens (cache hit)
    • Verify cache_read_tokens, cache_write_tokens, and cache_hit are stored on the Invocation record
  • Verify cache reporting:
    • GET /api/agents/{agent_id}/cache-stats returns accurate hit rate and savings
    • Agent detail page displays cache statistics
    • Cost dashboard includes caching savings section
  • Verify model family handling:
    • Agent with Nova model and caching enabled → caching silently skipped, no errors
    • Agent switches from Claude to Nova → caching toggle auto-disables
  • Verify UI indicators:
    • Agent card shows cache badge when enabled
    • Invocation detail shows cache token breakdown
    • LatencySummary shows cache savings

Out of Scope

  • Cross-session cache sharing (caching is per-session within AgentCore)
  • Tool result caching or response caching (only input/system prompt caching)
  • Custom cache TTL configuration (use provider defaults)
  • Cache warming or pre-loading strategies
  • Caching for non-Bedrock model providers
  • Automatic caching recommendations based on usage patterns

Metadata

Labels: enhancement (New feature or request)