Add prompt caching for model inference cost reduction #35

@heeki

Description

Overview

Add prompt caching support to reduce costs associated with model inference. Prompt caching allows the system prompt and repeated context to be cached across invocations, significantly reducing input token costs for multi-turn conversations. The implementation must account for differences across model families — Anthropic Claude models support prompt caching natively via cache control breakpoints, while Amazon Nova models do not currently support this feature. Users should be able to enable caching per agent or at the model family level, and should see reporting on cache hit rates, efficiency gains, and cost savings.

This issue depends on issue #30 for token usage tracking and cost estimation infrastructure.

Context

Current State

  • The Strands BedrockModel is initialized with only model_id, max_tokens, and streaming=True in agents/strands_agent/src/agent.py (lines 32-37) — no caching parameters are passed
  • SUPPORTED_MODELS in agents.py (lines 52-63) lists 5 Anthropic Claude models and 5 Amazon Nova models with model_id, display_name, group, and max_tokens — no caching capability flags
  • The AGENT_CONFIG_JSON structure contains system_prompt, model_id, max_tokens, and integrations — no caching configuration
  • The AgentConfig dataclass in agents/strands_agent/src/config.py has no caching-related fields
  • The invocation flow passes only the prompt text through the API chain (invoke_agent_runtime() in agentcore.py) — no cache control directives
  • The system prompt is static per agent deployment, making it an ideal candidate for caching since it does not change between invocations
  • No references to prompt caching, cache_control, or cache breakpoints exist in the codebase (the only "cache" references are HTTP Cache-Control: no-cache for SSE streaming and __pycache__ cleanup during artifact builds)
  • The hook system (BeforeInvocationEvent, AfterInvocationEvent) in the Strands agent provides injection points at invocation boundaries
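The hook system noted above offers a natural extraction point for the cache metrics this issue will need. A minimal sketch of parsing an Anthropic-style usage payload (field names follow Anthropic's documented response metadata; the exact shape surfaced through Strands or AgentCore may differ and should be verified):

```python
# Sketch: pull cache metrics out of an Anthropic-style `usage` payload.
# Field names follow Anthropic's API docs; the shape delivered through
# Strands / AgentCore may differ and should be verified.
def extract_cache_metrics(usage: dict) -> dict:
    read = usage.get("cache_read_input_tokens")
    write = usage.get("cache_creation_input_tokens")
    return {
        "cache_read_tokens": read,
        "cache_write_tokens": write,
        # A cache hit means at least one token was served from cache; None if unknown.
        "cache_hit": (read or 0) > 0 if read is not None else None,
    }

print(extract_cache_metrics({"cache_read_input_tokens": 1024, "cache_creation_input_tokens": 0}))
```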

Key Files

  • agents/strands_agent/src/agent.py — Builds Strands Agent with BedrockModel (no cache params)
  • agents/strands_agent/src/config.py — AgentConfig dataclass (no caching fields)
  • agents/strands_agent/src/handler.py — Entry point, invokes agent.stream_async(prompt)
  • agents/strands_agent/requirements.txt — Dependencies: strands-agents[a2a]>=0.1.0, boto3>=1.35.0
  • backend/app/routers/agents.py — SUPPORTED_MODELS, _deploy_agent(), AGENT_CONFIG_JSON construction (lines 520-529)
  • backend/app/routers/invocations.py — SSE invocation endpoint
  • backend/app/services/agentcore.py — invoke_agent_runtime() boto3 call (lines 74-162)
  • backend/app/models/invocation.py — Invocation ORM model (timing + content, no cache metrics)
  • frontend/src/components/LatencySummary.tsx — Timing metrics display
  • frontend/src/pages/AgentDetailPage.tsx — Agent detail with invocation panel

Technology Stack

  • Agent Framework: Strands Agents SDK (strands-agents), BedrockModel
  • Runtime: AWS Bedrock AgentCore (Python 3.13, ARM64)
  • Backend: Python, FastAPI, SQLAlchemy, SQLite
  • Frontend: TypeScript, React, Vite, shadcn/ui, Tailwind CSS
  • AWS SDK: boto3 (bedrock-agentcore client for invocation)

Prompt Caching by Model Family

  • Anthropic Claude: Supports prompt caching via cache_control breakpoints in the messages API. Cached input tokens are billed at a reduced rate (typically 90% discount). A cache write occurs on the first request; subsequent requests with the same prefix get cache reads.
  • Amazon Nova: Does not currently support prompt caching. When caching is requested for a Nova model, the configuration should be skipped or disabled without failing the invocation.
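If the Strands SDK does not expose caching directly, the Bedrock Converse API accepts a cachePoint content block that marks everything before it as a cacheable prefix. A hedged sketch of gating that breakpoint by model capability (the supports_caching flag mirrors the capability flag this issue proposes; verify the cachePoint shape against the boto3 version in use):

```python
# Sketch: build the `system` array for a Bedrock Converse call, inserting a
# cachePoint block after the system prompt only when the model supports caching.
# The cachePoint shape follows the Bedrock Converse API documentation.
def build_system_blocks(system_prompt: str, supports_caching: bool, caching_enabled: bool) -> list:
    blocks = [{"text": system_prompt}]
    if caching_enabled and supports_caching:
        # Everything before the cachePoint becomes a cacheable prefix.
        blocks.append({"cachePoint": {"type": "default"}})
    return blocks

print(len(build_system_blocks("You are helpful.", True, True)))   # → 2 (Claude: breakpoint added)
print(len(build_system_blocks("You are helpful.", False, True)))  # → 1 (Nova: no breakpoint)
```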

Requirements

R1: Enable prompt caching per agent or model family

Users should be able to enable prompt caching for individual agents or at the model family level. The implementation must correctly handle differences between model families.

  • Add a supports_prompt_caching boolean flag to each entry in SUPPORTED_MODELS:
    • Anthropic Claude models: true
    • Amazon Nova models: false
  • Expose this capability in the models endpoint (or GET /api/pricing/models from issue #30) so the frontend knows which models support caching
  • Add a prompt_caching_enabled field to AgentConfig (dataclass in config.py) with a default of false
  • Add a prompt_caching_enabled field to AgentDeployRequest and include it in the AGENT_CONFIG_JSON environment variable during deployment
  • Add a toggle in AgentRegistrationForm.tsx to enable/disable prompt caching:
    • The toggle should only be enabled when a model that supports caching is selected
    • If the user selects a model that does not support caching, the toggle should be disabled and display a tooltip explaining why
    • Include the prompt_caching_enabled field in JSON import/export (issue #27)
  • When prompt_caching_enabled is true and the model supports it, configure the Strands BedrockModel or the underlying Bedrock API call with cache control parameters:
    • Apply a cache breakpoint after the system prompt so it is cached across invocations within the same session
    • If the Strands SDK exposes cache control parameters on BedrockModel, use them directly
    • If the Strands SDK does not expose caching natively, explore passing additional model kwargs or extending the model wrapper to inject cache_control in the messages payload
  • When prompt_caching_enabled is true but the model does not support caching (e.g. user switches models after enabling), log a warning and proceed without caching — do not fail the invocation
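The fallback behavior in the last bullet could be centralized in one helper, assuming a supports_prompt_caching flag per model as proposed above (the model IDs and capability map below are illustrative; real entries would come from SUPPORTED_MODELS in agents.py):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative capability map mirroring the proposed supports_prompt_caching
# flag in SUPPORTED_MODELS (real entries live in agents.py).
SUPPORTS_PROMPT_CACHING = {
    "anthropic.claude-3-5-sonnet-20241022-v2:0": True,
    "amazon.nova-pro-v1:0": False,
}

def effective_caching(model_id: str, prompt_caching_enabled: bool) -> bool:
    """Decide whether caching is actually applied for this invocation."""
    supported = SUPPORTS_PROMPT_CACHING.get(model_id, False)
    if prompt_caching_enabled and not supported:
        # Warn and proceed without caching; never fail the invocation.
        logger.warning("Prompt caching enabled but unsupported for %s; skipping", model_id)
    return prompt_caching_enabled and supported

print(effective_caching("amazon.nova-pro-v1:0", True))  # → False
```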

R2: Cache efficiency reporting

Users should see backend reporting on cache hit rates, efficiency gains, and cost savings from prompt caching.

  • Extend the Invocation model with caching metrics (builds on the token fields from issue #30):
    • cache_read_tokens (integer, nullable) — number of tokens served from cache
    • cache_write_tokens (integer, nullable) — number of tokens written to cache on first request
    • cache_hit (boolean, nullable) — whether the invocation resulted in a cache hit
  • Extract cache metrics from the model response:
    • Anthropic Claude responses include usage.cache_creation_input_tokens and usage.cache_read_input_tokens in the response metadata
    • Parse these values from the AgentCore invocation response or the Strands SDK callback
    • Store them on the Invocation record at completion
  • Create a backend endpoint (e.g. GET /api/agents/{agent_id}/cache-stats) that returns aggregate caching statistics:
    • Total invocations with caching enabled
    • Cache hit rate (percentage of invocations with cache hits)
    • Total cache read tokens vs. total input tokens (efficiency ratio)
    • Estimated cost savings: (cache_read_tokens * (regular_input_price - cached_input_price)) / 1000, where prices are quoted per 1,000 tokens
    • Breakdown by time period (e.g. last 7 days, 30 days)
  • Extend the cost dashboard (issue #30, R2) to include a caching section:
    • Show aggregate cache savings across all agents in the group
    • Highlight which agents benefit most from caching
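The aggregate statistics above reduce to a few sums over Invocation rows. A sketch, with illustrative rows and prices (prices assumed per 1,000 tokens, matching the savings formula above):

```python
# Sketch of the aggregation behind the proposed cache-stats endpoint.
# Row field names match the proposed Invocation columns; prices per 1,000 tokens.
def cache_stats(invocations: list, input_price_per_1k: float, cached_price_per_1k: float) -> dict:
    cached = [i for i in invocations if i.get("caching_enabled")]
    hits = sum(1 for i in cached if i.get("cache_hit"))
    read = sum(i.get("cache_read_tokens") or 0 for i in cached)
    total_input = sum(i.get("input_tokens") or 0 for i in cached)
    return {
        "total_with_caching": len(cached),
        "cache_hit_rate": hits / len(cached) if cached else 0.0,
        "efficiency_ratio": read / total_input if total_input else 0.0,
        # Savings formula from R2: cache reads billed at the cached rate.
        "estimated_savings_usd": read * (input_price_per_1k - cached_price_per_1k) / 1000,
    }

rows = [
    {"caching_enabled": True, "cache_hit": False, "cache_read_tokens": 0, "input_tokens": 1200},
    {"caching_enabled": True, "cache_hit": True, "cache_read_tokens": 1000, "input_tokens": 1300},
]
print(cache_stats(rows, input_price_per_1k=0.003, cached_price_per_1k=0.0003))
```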

R3: Cache status visibility and usage impact

Users should see that prompt caching is enabled on their agents and understand the impact on token usage and cost.

  • Display a visual indicator on agent cards (AgentCard.tsx) when prompt caching is enabled:
    • A small badge or icon (e.g. a cache/lightning icon) next to the model name
    • Tooltip showing "Prompt caching enabled"
  • On the agent detail page (AgentDetailPage.tsx), add a caching section:
    • Show whether caching is enabled/disabled with a toggle to change it (triggers redeployment of AGENT_CONFIG_JSON)
    • Display cache statistics: hit rate, total cache reads, estimated savings
    • Show a per-session breakdown — cache writes typically occur on the first invocation of a session, with subsequent invocations in the same session getting cache reads
  • In the invocation detail (InvocationDetailPage.tsx) and LatencySummary:
    • Show cache read/write token counts alongside input/output token counts
    • If a cache hit occurred, highlight the cost savings for that invocation (e.g. "Saved $X.XX from cache")
    • Differentiate between cached and non-cached input tokens in the token breakdown
  • In the sessions table on the agent detail page:
    • Add a column or indicator showing cache utilization per session
    • Sessions with caching should show the ratio of cached vs. uncached tokens

Testing

  • Run backend tests: cd backend && make test
  • Run frontend typecheck: cd frontend && npx tsc --noEmit
  • Verify caching toggle:
    • Select a Claude model → caching toggle is enabled and functional
    • Select a Nova model → caching toggle is disabled with tooltip
    • Enable caching → deploy agent → AGENT_CONFIG_JSON includes prompt_caching_enabled: true
  • Verify cache behavior during invocation:
    • Deploy an agent with caching enabled (Claude model)
    • First invocation in a session: expect cache write tokens (system prompt cached)
    • Subsequent invocations in the same session: expect cache read tokens (cache hit)
    • Verify cache_read_tokens, cache_write_tokens, and cache_hit are stored on the Invocation record
  • Verify cache reporting:
    • GET /api/agents/{agent_id}/cache-stats returns accurate hit rate and savings
    • Agent detail page displays cache statistics
    • Cost dashboard includes caching savings section
  • Verify model family handling:
    • Agent with Nova model and caching enabled → caching silently skipped, no errors
    • Agent switches from Claude to Nova → caching toggle auto-disables
  • Verify UI indicators:
    • Agent card shows cache badge when enabled
    • Invocation detail shows cache token breakdown
    • LatencySummary shows cache savings

Out of Scope

  • Cross-session cache sharing (caching is per-session within AgentCore)
  • Tool result caching or response caching (only input/system prompt caching)
  • Custom cache TTL configuration (use provider defaults)
  • Cache warming or pre-loading strategies
  • Caching for non-Bedrock model providers
  • Automatic caching recommendations based on usage patterns

Metadata

Labels: enhancement (New feature or request)