Conversation

adase11
Contributor

@adase11 adase11 commented Sep 9, 2025

Summary:

Introduces a structured prompt caching API for Anthropic via AnthropicCacheOptions, applies cache_control deterministically, adds per-message TTLs (5m/1h), content-length eligibility, and clarifies tool caching behavior. Updates docs and adds tests to validate wire format, headers, and limits.

API

  • AnthropicChatOptions: adds cacheOptions(AnthropicCacheOptions); default remains no caching.
  • AnthropicCacheOptions: strategy (NONE, SYSTEM_ONLY, SYSTEM_AND_TOOLS, CONVERSATION_HISTORY); per-message TTLs via AnthropicCacheTtl (FIVE_MINUTES, ONE_HOUR);
    messageTypeMinContentLength; contentLengthFunction for custom token estimates.
  • AnthropicChatModel: applies cache_control only when eligible; caches the last tool definition; never caches the latest user question; uses array system format when caching;
    auto-sets Anthropic beta header for 1h TTL.
  • AnthropicApi: models cache_control on content blocks and tools; exposes Usage cacheCreationInputTokens and cacheReadInputTokens.
  • Utilities: CacheEligibilityResolver (with CacheBreakpointTracker) enforces Anthropic’s 4-breakpoint limit.
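
A minimal usage sketch based on the names above (builder method names are taken from this PR's description and review comments, so exact signatures may differ in the merged code; the chatModel and largeSystemPrompt variables are hypothetical):

    AnthropicChatOptions options = AnthropicChatOptions.builder()
            .model("claude-sonnet-4-20250514")                      // illustrative model id
            .cacheOptions(AnthropicCacheOptions.builder()
                .strategy(AnthropicCacheStrategy.SYSTEM_AND_TOOLS)  // cache the system prompt and the last tool definition
                .build())
            .build();

    ChatResponse response = chatModel.call(
            new Prompt(List.of(new SystemMessage(largeSystemPrompt), new UserMessage("What changed in v2?")), options));

Whether the cache was written or read can then be confirmed through the new cacheCreationInputTokens / cacheReadInputTokens usage fields.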

Tests

  • Unit: AnthropicCacheOptionsTests; CacheEligibilityResolverTests; AnthropicPromptCachingMockTest (wire format, TTL beta header, 4-breakpoint limit).
  • IT: AnthropicPromptCachingIT (guarded by ANTHROPIC_API_KEY; validates real usage fields when available).

Documentation

  • Updates spring-ai-docs Anthropic page to use cacheOptions with strategy examples.
  • Adds per-message TTL examples (1h) and notes automatic beta header.
  • Adds eligibility guidance (min lengths, custom contentLengthFunction).
  • Clarifies tool caching (last tool definition) and that latest user message is not cached in conversation history.
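
For example, the eligibility guidance could translate into something like the following (a sketch only, assuming contentLengthFunction receives the message text and that the configured minimum is compared against whatever that function returns):

    AnthropicCacheOptions cacheOptions = AnthropicCacheOptions.builder()
            .strategy(AnthropicCacheStrategy.CONVERSATION_HISTORY)
            .contentLengthFunction(text -> text.length() / 4)        // rough characters-per-token heuristic
            .messageTypeMinContentLength(MessageType.USER, 1024)     // Anthropic's documented 1024-token minimum (2048 for some models)
            .build();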

Compatibility:

  • Default behavior unchanged (caching disabled unless cacheOptions provided).
  • Prior doc examples using cacheStrategy/cacheTtl are replaced with cacheOptions; docs updated accordingly.

Closes: #4325

@adase11
Contributor Author

adase11 commented Sep 9, 2025

I'm aware of https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#mixing-different-ttls where

You can use both 1-hour and 5-minute cache controls in the same request, but with an important constraint: Cache entries with longer TTL must appear before shorter TTLs (i.e., a 1-hour cache entry must appear before any 5-minute cache entries).

I didn't make any attempt to enforce that since, in my opinion, it's up to the user to configure that properly (and it's also not worth the complexity). I probably could have mentioned that in the documentation, though.
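
One way a user could satisfy that ordering with the options introduced here is to give the longer TTL to the system prompt, which is serialized before the conversation messages (a sketch; it assumes CONVERSATION_HISTORY also marks the system prompt, and exact builder method names may differ from the merged code):

    AnthropicCacheOptions cacheOptions = AnthropicCacheOptions.builder()
            .strategy(AnthropicCacheStrategy.CONVERSATION_HISTORY)
            .messageTypeTtl(MessageType.SYSTEM, AnthropicCacheTtl.ONE_HOUR)   // 1-hour entry appears first in the request
            .messageTypeTtl(MessageType.USER, AnthropicCacheTtl.FIVE_MINUTES) // shorter TTLs follow it
            .build();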

List<ContentBlock> mediaContent = userMessage.getMedia().stream().map(media -> {
Type contentBlockType = getContentBlockTypeByMedia(media);
var source = getSourceByMedia(media);
return new ContentBlock(contentBlockType, source);
Contributor Author


Technically these can be cached too - maybe as a next step including these content blocks. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#what-can-be-cached

@sobychacko
Contributor

Thanks for the PR! We will start reviewing it soon.

@sobychacko
Contributor

@adase11 Can you add your name as an author to the classes you changed?

I have a couple of questions regarding the messageTypeMinContentLengths in AnthropicCacheOptions. Why are we defaulting to a minimum content length of 1? Aren't the model requirements much higher for minimum tokens, 1024 and 2048? Also, the content length function as it stands does not map 1-to-1 with the number of tokens. I know the function is flexible so that more sophisticated token counting mechanisms can be injected, but I still wonder if using the basic content length is the way to go there.

But more importantly, my concern is with the global minimum requirements in the map. Since we default to 1, and since it's unlikely users will override that, we may end up with a situation where we try to cache content shorter than what the models stipulate. You have a check for length < this.messageTypeMinContentLengths.get(messageType) in CacheEligibilityResolver, in which you return null if true. Imagine content with a length of 200. That check won't reject it because the content is longer than the default minimum of 1, so it ends up being cached and thus wastes a breakpoint, right? Please correct me if I am wrong.

Another general question, how does token count requirements work in the case of tool segments?

Thanks!

@adase11
Contributor Author

adase11 commented Sep 12, 2025

@sobychacko Thanks! I'll add my name as the author to the classes I changed. With regard to your questions:

Why are we defaulting to a minimum content length of 1? Aren't the model requirements much higher for minimum tokens, 1024 and 2048? Also, the content length function as it stands does not map 1-to-1 with the number of tokens.

I didn't want to be prescriptive to users about a minimum out of the gate, so rather than guessing at a default that will continue to be valid for Anthropic's API spec, I wanted to leave that up to the user to decide. Plus, as you said, string text length doesn't map one-to-one with token count, so rather than guess at the right minimum string length I thought it was easier to allow everything by default and let users tweak their configuration to fit their specific use case.

Also, the content length function as it stands does not map 1-to-1 with the number of tokens. I know the function is flexible so that more sophisticated token counting mechanisms can be injected, but I still wonder if using the basic content length is the way to go there.

I definitely understand what you're saying; I wanted to take the least intrusive, lowest-overhead approach possible by default. My expectation is that a rough approximation will be sufficient the vast majority of the time. For the times when it's not, introducing a more sophisticated token approximation function (the most robust way would be to actually introduce a tokenizer) probably means more overhead than I would want to impose on users without their explicit opt-in.

Since we default to 1, and since it's unlikely users will override that, we may end up with a situation where we try to cache content shorter than what the models stipulate. You have a check for length < this.messageTypeMinContentLengths.get(messageType) in CacheEligibilityResolver, in which you return null if true. Imagine content with a length of 200. That check won't reject it because the content is longer than the default minimum of 1, so it ends up being cached and thus wastes a breakpoint, right? Please correct me if I am wrong.

That's correct, and I believe it reflects the current behavior of the prompt caching (where users can optimize a bit using the different strategies). The goal of my PR is partly to address this behavior in order to give users more control over which content we attempt to cache (optimizing their allotted cache breakpoints). However, my intent is to let users opt into this optimization rather than enforcing it by default, since user contexts vary widely. Any default I set risks being a poor fit—or worse, masking inefficiencies until usage scales (e.g., small initial workloads seem fine under defaults, but performance degrades as volume grows).

I chose 1 instead of 0 because the Anthropic API doesn’t allow caching of empty text blocks (docs).
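
As an aside, callers who do want token-accurate eligibility could back contentLengthFunction with a real tokenizer. A sketch using the jtokkit library (assumptions: the function receives the message text, jtokkit is not a dependency this change adds, and CL100K_BASE is an OpenAI encoding, so the counts are only approximate for Claude):

    import com.knuddels.jtokkit.Encodings;
    import com.knuddels.jtokkit.api.Encoding;
    import com.knuddels.jtokkit.api.EncodingType;

    Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);

    AnthropicCacheOptions cacheOptions = AnthropicCacheOptions.builder()
            .strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
            .contentLengthFunction(encoding::countTokens)            // estimate tokens instead of counting characters
            .messageTypeMinContentLength(MessageType.SYSTEM, 1024)   // compare directly against the documented token minimum
            .build();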

Another general question, how does token count requirements work in the case of tool segments?

For tool definitions I took the easy way out. I should have documented it better, but I left a comment in the code that addresses it: "Tool definition messages are always considered for caching if the strategy includes system messages."

Let me know what you think and I'm happy to make any changes. I don't feel especially strongly about the default minimum of 1 vs. something like 1000, so I'm happy to change that and adjust the documentation if you like.

@adase11
Contributor Author

adase11 commented Sep 12, 2025

@sobychacko - I went ahead and added the author tag as well as updated the documentation to talk about how the tool definitions are handled.

@sobychacko
Contributor

@adase11 Sounds good to me. I will let @markpollack take a look before we can proceed with the PR.

@adase11
Contributor Author

adase11 commented Sep 12, 2025

Thanks!

.cacheTtl("1h") // 1-hour cache lifetime
.cacheOptions(AnthropicCacheOptions.builder()
.strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
.messageTypeTtls(MessageType.SYSTEM, AnthropicCacheTtl.ONE_HOUR)
Member


I think just 'ttl' as a builder name is cleaner.

Contributor Author


sounds good

Contributor Author


Done

}
else {
contents.add(new ContentBlock(message.getText()));
contentBlocks.add(cacheAwareContentBlock(contentBlock, messageType, cacheEligibilityResolver));
Contributor


@adase11 Doesn't this make all the user messages in the request get cached? The first if conditional checks if it's the last user message and then skips adding the cache, and in the else, you just add it to the cache as long as it's a user message (based on the outer if condition). Isn't that wrong though? I think we only need to add the cache control to the last user message, right?

Contributor Author


You are correct, thanks

Contributor


Do you mean it is not correct right now as it stands in this PR?

Contributor Author


Yes, it's currently incorrect; I should have only attempted to cache the last user message. And I can check the content size based on the last 20 user messages: https://docs.claude.com/en/docs/build-with-claude/prompt-caching#continuing-a-multi-turn-conversation

Contributor


Ok. Are you going to update the PR?

Contributor Author


Yeah, I see

Cache the entire conversation history up to (but not including) the current user question. This is ideal for multi-turn conversations where you want to reuse the conversation context while asking new questions.

in the documentation for AnthropicCacheStrategy.CONVERSATION_HISTORY - I think, based on the Anthropic docs, that we could be caching up to and including the last user message. I'll make an update to keep the old logic - only apply cache_control to the next-to-last user message - and then, if we agree that the last one is actually fine to make eligible for caching, that's an easy change to make.
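
To make that placement concrete, with CONVERSATION_HISTORY and the next-to-last-user-message behavior described above, a multi-turn prompt would be marked roughly like this (hypothetical variable names; this just restates the behavior discussed in this thread):

    List<Message> messages = List.of(
            new SystemMessage(systemText),            // cacheable when the strategy also covers the system prompt
            new UserMessage("Summarize chapter 1"),   // next-to-last user message: the history breakpoint lands here
            new AssistantMessage(chapterOneSummary),  // prior assistant turn
            new UserMessage("And chapter 2?"));       // latest user question: never marked with cache_control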

Contributor Author


@sobychacko - updated, and added some ITs in AnthropicPromptCachingIT for better coverage. If we do decide that all user messages are eligible, including the last one, that would make things a little simpler. But for now I left that logic the same and just considered the content that is eligible (according to Anthropic) when looking at the content size (i.e. the prior 20 content blocks: https://docs.claude.com/en/docs/build-with-claude/prompt-caching#when-to-use-multiple-breakpoints).

Contributor Author


I could have gotten more sophisticated and attempted to add additional breakpoints if there are > 20 messages in the conversation history. I chose not to consider that for the moment in order to keep things simple, but if you'd prefer me to do that I can.

Contributor


Let's review your changes with Mark first and we can proceed from there. Thanks!

Contributor Author


sounds good

…racking

spring-projectsgh-4325: Enhance cache management for Anthropic API by introducing per-message TTL and configurable content block usage optimization.
Signed-off-by: Austin Dase <[email protected]>
@markpollack
Member

Thanks! It is merged and will be in 1.1 M2. I do want to experiment a bit more to fine-tune.

merged in 1d5ab9b

Successfully merging this pull request may close these issues.

Make Anthropic prompt caching message-type aware (TTLs, eligibility, min-size) and optimize cache-block usage