Add prompt caching for model inference cost reduction #35
Description
Overview
Add prompt caching support to reduce costs associated with model inference. Prompt caching allows the system prompt and repeated context to be cached across invocations, significantly reducing input token costs for multi-turn conversations. The implementation must account for differences across model families — Anthropic Claude models support prompt caching natively via cache control breakpoints, while Amazon Nova models do not currently support this feature. Users should be able to enable caching per agent or at the model family level, and should see reporting on cache hit rates, efficiency gains, and cost savings.
This issue depends on issue #30 for token usage tracking and cost estimation infrastructure.
Context
Current State
- The Strands `BedrockModel` is initialized with only `model_id`, `max_tokens`, and `streaming=True` in `agents/strands_agent/src/agent.py` (lines 32-37) — no caching parameters are passed
- `SUPPORTED_MODELS` in `agents.py` (lines 52-63) lists 5 Anthropic Claude models and 5 Amazon Nova models with `model_id`, `display_name`, `group`, and `max_tokens` — no caching capability flags
- The `AGENT_CONFIG_JSON` structure contains `system_prompt`, `model_id`, `max_tokens`, and `integrations` — no caching configuration
- The `AgentConfig` dataclass in `agents/strands_agent/src/config.py` has no caching-related fields
- The invocation flow passes only the prompt text through the API chain (`invoke_agent_runtime()` in `agentcore.py`) — no cache control directives
- The system prompt is static per agent deployment, making it an ideal candidate for caching since it does not change between invocations
- No references to prompt caching, `cache_control`, or cache breakpoints exist in the codebase (the only "cache" references are the HTTP `Cache-Control: no-cache` header for SSE streaming and `__pycache__` cleanup during artifact builds)
- The hook system (`BeforeInvocationEvent`, `AfterInvocationEvent`) in the Strands agent provides injection points at invocation boundaries
Key Files
- `agents/strands_agent/src/agent.py` — Builds the Strands `Agent` with `BedrockModel` (no cache params)
- `agents/strands_agent/src/config.py` — `AgentConfig` dataclass (no caching fields)
- `agents/strands_agent/src/handler.py` — Entry point, invokes `agent.stream_async(prompt)`
- `agents/strands_agent/requirements.txt` — Dependencies: `strands-agents[a2a]>=0.1.0`, `boto3>=1.35.0`
- `backend/app/routers/agents.py` — `SUPPORTED_MODELS`, `_deploy_agent()`, `AGENT_CONFIG_JSON` construction (lines 520-529)
- `backend/app/routers/invocations.py` — SSE invocation endpoint
- `backend/app/services/agentcore.py` — `invoke_agent_runtime()` boto3 call (lines 74-162)
- `backend/app/models/invocation.py` — Invocation ORM model (timing + content, no cache metrics)
- `frontend/src/components/LatencySummary.tsx` — Timing metrics display
- `frontend/src/pages/AgentDetailPage.tsx` — Agent detail with invocation panel
Technology Stack
- Agent Framework: Strands Agents SDK (`strands-agents`), `BedrockModel`
- Runtime: AWS Bedrock AgentCore (Python 3.13, ARM64)
- Backend: Python, FastAPI, SQLAlchemy, SQLite
- Frontend: TypeScript, React, Vite, shadcn/ui, Tailwind CSS
- AWS SDK: boto3 (`bedrock-agentcore` client for invocation)
Prompt Caching by Model Family
- Anthropic Claude: Supports prompt caching via `cache_control` breakpoints in the messages API. Cached input tokens are billed at a reduced rate (typically a 90% discount). A cache write occurs on the first request; subsequent requests with the same prefix get cache reads.
- Amazon Nova: Does not currently support prompt caching. Caching configuration should be silently skipped or disabled for Nova models.
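The per-family behavior above can be sketched as a small helper that appends a cache breakpoint to the system blocks only when the model family supports it. This is a minimal sketch using the Anthropic messages-API `cache_control` shape; the helper names and the substring check on `model_id` are assumptions, not code from the repo.

```python
# Hypothetical helper: build system content blocks for a messages-API
# payload, adding a cache_control breakpoint only for Claude models.

def supports_prompt_caching(model_id: str) -> bool:
    """Assumption: Anthropic Claude supports caching, Amazon Nova does not."""
    return "anthropic.claude" in model_id

def build_system_blocks(system_prompt: str, model_id: str) -> list[dict]:
    """Return system blocks, with a cache breakpoint after the static
    system prompt when the model family allows it."""
    if supports_prompt_caching(model_id):
        # cache_control marks a breakpoint: everything up to and including
        # this block is cached. The first request writes the cache; later
        # requests with the same prefix get cache reads.
        return [{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}]
    # Unsupported families (e.g. Nova): plain system block, no caching.
    return [{"type": "text", "text": system_prompt}]

claude = build_system_blocks("You are a helpful agent.",
                             "anthropic.claude-3-5-sonnet-20241022-v2:0")
nova = build_system_blocks("You are a helpful agent.", "amazon.nova-pro-v1:0")
print("cache_control" in claude[0])  # True
print("cache_control" in nova[0])    # False
```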
Requirements
R1: Enable prompt caching per agent or model family
Users should be able to enable prompt caching for individual agents or at the model family level. The implementation must correctly handle differences between model families.
- Add a `supports_prompt_caching` boolean flag to each entry in `SUPPORTED_MODELS`:
  - Anthropic Claude models: `true`
  - Amazon Nova models: `false`
- Expose this capability in the models endpoint (or `GET /api/pricing/models` from issue #30) so the frontend knows which models support caching
- Add a `prompt_caching_enabled` field to `AgentConfig` (dataclass in `config.py`) with a default of `false`
- Add a `prompt_caching_enabled` field to `AgentDeployRequest` and include it in the `AGENT_CONFIG_JSON` environment variable during deployment
- Add a toggle in `AgentRegistrationForm.tsx` to enable/disable prompt caching:
  - The toggle should only be enabled when a model that supports caching is selected
  - If the user selects a model that does not support caching, the toggle should be disabled and display a tooltip explaining why
- Include the `prompt_caching_enabled` field in JSON import/export (issue #27)
- When `prompt_caching_enabled` is true and the model supports it, configure the Strands `BedrockModel` or the underlying Bedrock API call with cache control parameters:
  - Apply a cache breakpoint after the system prompt so it is cached across invocations within the same session
  - If the Strands SDK exposes cache control parameters on `BedrockModel`, use them directly
  - If the Strands SDK does not expose caching natively, explore passing additional model kwargs or extending the model wrapper to inject `cache_control` in the messages payload
- When `prompt_caching_enabled` is true but the model does not support caching (e.g. the user switches models after enabling), log a warning and proceed without caching — do not fail the invocation
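The config-side changes and the graceful-degradation rule above could look like the following sketch. The `prompt_caching_enabled` field and the `supports_prompt_caching` flag match the proposal; the `resolve_caching()` helper and the trimmed `SUPPORTED_MODELS` mapping are hypothetical illustrations.

```python
# Sketch of the proposed AgentConfig field plus a resolver that warns and
# proceeds without caching when the model family cannot cache.
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

# Illustrative subset mirroring the proposed supports_prompt_caching flag.
SUPPORTED_MODELS = {
    "anthropic.claude-3-5-sonnet-20241022-v2:0": {"supports_prompt_caching": True},
    "amazon.nova-pro-v1:0": {"supports_prompt_caching": False},
}

@dataclass
class AgentConfig:
    system_prompt: str
    model_id: str
    max_tokens: int
    prompt_caching_enabled: bool = False  # new field, default off

def resolve_caching(config: AgentConfig) -> bool:
    """Return whether caching should actually be applied.

    Caching is used only when the user enabled it AND the model supports
    it; otherwise log a warning and continue — never fail the invocation."""
    if not config.prompt_caching_enabled:
        return False
    model = SUPPORTED_MODELS.get(config.model_id, {})
    if model.get("supports_prompt_caching"):
        return True
    logger.warning(
        "Prompt caching enabled but model %s does not support it; "
        "proceeding without caching", config.model_id)
    return False
```

A Nova agent with `prompt_caching_enabled=True` then resolves to `False` with a warning, while the same flag on a Claude agent resolves to `True`.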
R2: Cache efficiency reporting
Users should see backend reporting on cache hit rates, efficiency gains, and cost savings from prompt caching.
- Extend the `Invocation` model with caching metrics (builds on the token fields from issue #30):
  - `cache_read_tokens` (integer, nullable) — number of tokens served from cache
  - `cache_write_tokens` (integer, nullable) — number of tokens written to cache on the first request
  - `cache_hit` (boolean, nullable) — whether the invocation resulted in a cache hit
- Extract cache metrics from the model response:
  - Anthropic Claude responses include `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` in the response metadata
  - Parse these values from the AgentCore invocation response or the Strands SDK callback
  - Store them on the Invocation record at completion
- Create a backend endpoint (e.g. `GET /api/agents/{agent_id}/cache-stats`) that returns aggregate caching statistics:
  - Total invocations with caching enabled
  - Cache hit rate (percentage of invocations with cache hits)
  - Total cache read tokens vs. total input tokens (efficiency ratio)
  - Estimated cost savings: `(cache_read_tokens * (regular_input_price - cached_input_price)) / 1000`
  - Breakdown by time period (e.g. last 7 days, 30 days)
- Extend the cost dashboard (issue #30 R2) to include a caching section:
  - Show aggregate cache savings across all agents in the group
  - Highlight which agents benefit most from caching
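The aggregation behind the proposed cache-stats endpoint is a straightforward fold over invocation records. A minimal sketch, assuming the field names proposed above and treating prices as per-1K-token placeholders (not actual Bedrock rates):

```python
# Sketch: aggregate cache stats for GET /api/agents/{agent_id}/cache-stats.
# Savings use the formula from R2:
# (cache_read_tokens * (regular_input_price - cached_input_price)) / 1000

def cache_stats(invocations: list[dict],
                regular_input_price: float,
                cached_input_price: float) -> dict:
    """Compute hit rate, efficiency ratio, and estimated savings (USD).
    Prices are per 1K input tokens; token fields may be None (nullable)."""
    total = len(invocations)
    hits = sum(1 for inv in invocations if inv.get("cache_hit"))
    cache_reads = sum(inv.get("cache_read_tokens") or 0 for inv in invocations)
    input_tokens = sum(inv.get("input_tokens") or 0 for inv in invocations)
    savings = cache_reads * (regular_input_price - cached_input_price) / 1000
    return {
        "total_invocations": total,
        "cache_hit_rate": hits / total if total else 0.0,
        "efficiency_ratio": cache_reads / input_tokens if input_tokens else 0.0,
        "estimated_savings_usd": round(savings, 6),
    }

# Typical session: first invocation writes the cache, the next two read it.
stats = cache_stats(
    [
        {"cache_hit": False, "cache_write_tokens": 1000, "input_tokens": 1200},
        {"cache_hit": True, "cache_read_tokens": 1000, "input_tokens": 1200},
        {"cache_hit": True, "cache_read_tokens": 1000, "input_tokens": 1200},
    ],
    regular_input_price=0.003,   # hypothetical $/1K regular input tokens
    cached_input_price=0.0003,   # hypothetical 90% cached-token discount
)
print(stats)  # hit rate 2/3, 2000 cached of 3600 input tokens
```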
R3: Cache status visibility and usage impact
Users should see that prompt caching is enabled on their agents and understand the impact on token usage and cost.
- Display a visual indicator on agent cards (`AgentCard.tsx`) when prompt caching is enabled:
  - A small badge or icon (e.g. a cache/lightning icon) next to the model name
  - Tooltip showing "Prompt caching enabled"
- On the agent detail page (`AgentDetailPage.tsx`), add a caching section:
  - Show whether caching is enabled/disabled with a toggle to change it (triggers redeployment of `AGENT_CONFIG_JSON`)
  - Display cache statistics: hit rate, total cache reads, estimated savings
  - Show a per-session breakdown — cache writes typically occur on the first invocation of a session, with subsequent invocations in the same session getting cache reads
- In the invocation detail (`InvocationDetailPage.tsx`) and `LatencySummary`:
  - Show cache read/write token counts alongside input/output token counts
  - If a cache hit occurred, highlight the cost savings for that invocation (e.g. "Saved $X.XX from cache")
  - Differentiate between cached and non-cached input tokens in the token breakdown
- In the sessions table on the agent detail page:
  - Add a column or indicator showing cache utilization per session
  - Sessions with caching should show the ratio of cached vs. uncached tokens
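The per-session cache utilization in the sessions table reduces to one ratio over the session's invocations. A sketch of the backend computation, assuming the nullable token columns proposed in R2 (the helper name is hypothetical):

```python
# Sketch: fraction of a session's input tokens that were served from cache,
# suitable for the per-session utilization column.

def session_cache_ratio(invocations: list[dict]) -> float:
    """Return cached/total input-token ratio for one session (0.0-1.0)."""
    cached = sum(inv.get("cache_read_tokens") or 0 for inv in invocations)
    total = sum(inv.get("input_tokens") or 0 for inv in invocations)
    return cached / total if total else 0.0

# Typical shape: the first invocation writes the cache (no reads),
# subsequent invocations in the same session read the cached prefix.
session = [
    {"input_tokens": 1000, "cache_read_tokens": 0, "cache_write_tokens": 800},
    {"input_tokens": 1000, "cache_read_tokens": 800},
    {"input_tokens": 1000, "cache_read_tokens": 800},
]
print(f"{session_cache_ratio(session):.0%}")  # 1600 of 3000 input tokens cached
```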
Testing
- Run backend tests: `cd backend && make test`
- Run frontend typecheck: `cd frontend && npx tsc --noEmit`
- Verify the caching toggle:
  - Select a Claude model → caching toggle is enabled and functional
  - Select a Nova model → caching toggle is disabled with a tooltip
  - Enable caching → deploy agent → `AGENT_CONFIG_JSON` includes `prompt_caching_enabled: true`
- Verify cache behavior during invocation:
  - Deploy an agent with caching enabled (Claude model)
  - First invocation in a session: expect cache write tokens (system prompt cached)
  - Subsequent invocations in the same session: expect cache read tokens (cache hit)
  - Verify `cache_read_tokens`, `cache_write_tokens`, and `cache_hit` are stored on the Invocation record
- Verify cache reporting:
  - `GET /api/agents/{agent_id}/cache-stats` returns an accurate hit rate and savings
  - Agent detail page displays cache statistics
  - Cost dashboard includes the caching savings section
- Verify model family handling:
  - Agent with a Nova model and caching enabled → caching silently skipped, no errors
  - Agent switches from Claude to Nova → caching toggle auto-disables
- Verify UI indicators:
  - Agent card shows the cache badge when enabled
  - Invocation detail shows the cache token breakdown
  - LatencySummary shows cache savings
Out of Scope
- Cross-session cache sharing (caching is per-session within AgentCore)
- Tool result caching or response caching (only input/system prompt caching)
- Custom cache TTL configuration (use provider defaults)
- Cache warming or pre-loading strategies
- Caching for non-Bedrock model providers
- Automatic caching recommendations based on usage patterns