
Move TSC took-time policy to guard both heap and disk tier #17190

Open

wants to merge 8 commits into base: main
Conversation

peteralfonsi
Contributor

Description

Changes the TieredSpilloverCache's minimum took-time policy so that queries must take >10 ms to enter either tier, instead of just the disk tier. This is desirable because once we allow caching queries with size > 0 in the request cache, the number of cacheable queries may increase dramatically, and we should avoid flooding the heap tier with very cheap queries.

Note the setting key tiered_spillover.disk.store.policies.took_time.threshold is unchanged, for backwards compatibility.
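As a rough illustration of the changed admission flow, a single took-time policy can guard the heap tier's entry point; since the disk tier is only fed by heap evictions, this guards both tiers. This is a hypothetical sketch (names like `admitToHeap` and the 10 ms constant are assumptions based on the description above, not the actual TieredSpilloverCache code):

```java
import java.util.function.Predicate;

// Hypothetical sketch: one took-time policy consulted before a value is
// admitted to the heap tier. Because the disk tier is populated only from
// heap evictions, rejecting here keeps the entry out of both tiers.
public class TookTimePolicySketch {
    // Assumed default mirroring the PR description: entries must take > 10 ms.
    static final long THRESHOLD_NANOS = 10_000_000L;

    // The policy expressed as a predicate over the value's recorded took time.
    static final Predicate<Long> TOOK_TIME_POLICY = tookNanos -> tookNanos > THRESHOLD_NANOS;

    static boolean admitToHeap(long tookTimeNanos) {
        return TOOK_TIME_POLICY.test(tookTimeNanos);
    }
}
```

A 15 ms query would be admitted, while a 5 ms query would be rejected before ever reaching the heap tier.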

Related Issues

Resolves #16162

Check List

  • Functionality includes testing.
  • [N/A] API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Peter Alfonsi <[email protected]>
@peteralfonsi peteralfonsi added backport PRs or issues specific to backporting features or enhancements and removed RFC Issues requesting major changes labels Jan 29, 2025
@peteralfonsi peteralfonsi added the backport 2.x Backport to 2.x branch label Jan 29, 2025
Contributor

❌ Gradle check result for 36c7ba2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Peter Alfonsi <[email protected]>
@github-actions github-actions bot added the RFC Issues requesting major changes label Jan 29, 2025
Contributor

❌ Gradle check result for 0359de4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Peter Alfonsi added 2 commits February 6, 2025 12:38
Contributor

github-actions bot commented Feb 7, 2025

❌ Gradle check result for 294ebae: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Peter Alfonsi <[email protected]>
Contributor

❕ Gradle check result for 2b3b0f0: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Feb 10, 2025

Codecov Report

Attention: Patch coverage is 97.95918% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.48%. Comparing base (77e4112) to head (d138d28).
Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
...search/cache/common/tier/TieredSpilloverCache.java 97.29% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17190      +/-   ##
============================================
+ Coverage     72.40%   72.48%   +0.08%     
- Complexity    65554    65619      +65     
============================================
  Files          5292     5292              
  Lines        304493   304519      +26     
  Branches      44218    44222       +4     
============================================
+ Hits         220463   220733     +270     
+ Misses        65975    65714     -261     
- Partials      18055    18072      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines 394 to 399
// The old indices stats API runs onCached() whenever loader.load() runs. We can't change this as it's PublicApi.
// But now it may not be the case that the key actually enters the cache on loader.load().
// To cancel this out, send the removalListener (the IRC) a removal notification with evicted = false,
// so that it subtracts the stats off again without incrementing evictions.
// These old stats will be removed anyway in 3.0.
removalListener.onRemoval(new RemovalNotification<>(key, value, RemovalReason.EXPLICIT));
Contributor

I guess this should be the case with any caller, not just the IRC? Mentioning this because the TSC in general is designed to be used anywhere, not just by the IRC, so writing logic/comments specific to the IRC doesn't seem right to me.

Contributor

In addition, we can refactor this to something like the snippet below. This looks much simpler to me and won't require any changes in the handler.

else {
    if (evaluatePoliciesList(value, policiesList)) {
        future.complete(new Tuple<>(key, value));
    } else {
        future.complete(null); // Passing null skips the logic that puts this into the onHeap cache.
        removalListener.onRemoval(new RemovalNotification<>(key, value, RemovalReason.EXPLICIT));
    }
}

Contributor Author

You're right, this shouldn't be coding around the IRC specifically. I'll remove the comments.

Actually I don't like that we have to do this removal notification at all, since it's pretty specific to how the IRC implements its logic for the old cache stats API. For example we don't have to do it on TSC.put() as I described in the other comment. But I'm not sure there's a way around it. By design the TSC shouldn't be aware of this external stats tracking system, but we do want the old stats API to be accurate until we remove it (probably) in 3.0...

Contributor

Hmm, I don't think so. Even outside the IRC, if someone calls put() or computeIfAbsent() on the TSC, they expect the item to be added to the cache; if it isn't, we should notify them. The caller can then decide what to do with the removal notification, whether that involves adjusting stats or not.

With the IRC particularly, it was calculating stats after calling computeIfAbsent(), so in this case the removal notification is handled by the IRC explicitly to deduct stats. For other cases like the ones mentioned above, callers can handle it with their own logic or simply ignore it.
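The caller-side pattern described here can be sketched as follows. This is a hypothetical illustration (the class and method names are assumptions, not the IRC's actual code): the caller optimistically counts an entry whenever the loader runs, and its removal listener deducts it again on a non-eviction removal without touching the eviction count.

```java
// Hypothetical sketch of a caller tracking its own stats around a tiered cache.
// Mirrors the discussion: onCached() fires optimistically when the loader runs,
// and a policy rejection arrives later as a removal with evicted = false.
public class CallerStatsSketch {
    long cachedCount = 0;
    long evictions = 0;

    // Called whenever the loader runs, before admission is known (like onCached()).
    void onCached() {
        cachedCount++;
    }

    // Removal listener: deduct the optimistic count; only real evictions count.
    void onRemoval(boolean evicted) {
        cachedCount--;
        if (evicted) {
            evictions++;
        }
    }
}
```

With this shape, a policy rejection (onCached() followed by onRemoval(false)) leaves both counters back at zero, which is the "cancel it out" behavior the code comment in the diff describes.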

@@ -257,19 +263,23 @@ public void put(ICacheKey<K> key, V value) {
Tuple<V, String> cacheValueTuple = getValueFromTieredCache(true).apply(key);
if (cacheValueTuple == null) {
// In case it is not present in any tier, put it inside onHeap cache by default.
try (ReleasableLock ignore = writeLock.acquire()) {
onHeapCache.put(key, value);
if (evaluatePoliciesList(value, policies)) {
Contributor

Similar to the removal listener logic in computeIfAbsent(), should we also trigger the removal listener from here?
Otherwise there is no way to communicate to the caller that the put call was skipped because the policy returned false.

Contributor Author

Yes good catch, I only noticed the removal listener issue later on and forgot to add it back into put().

Contributor Author

Actually, it isn't necessary, since the old indices stats API onCached listener only gets called on loader.load(), and for put() there's no loader involved.

Contributor

I think we should consider tiered cache in generic terms, as the framework is agnostic of its users. If someone is using TSC and calls put(), but the item isn't added to the on-heap cache, we should notify them via a removal listener?

Contributor Author

I think in a fully generic setup, we shouldn't call onRemoval() in either put() or computeIfAbsent() as nothing was ever actually removed from the cache. Adding the call in computeIfAbsent() is kind of a gross hack we do because we know the IRC's external stats tracking incorrectly assumes loader.load() always results in a value entering the cache.

I'm wondering if it might be better to keep this generic and not have either method call the removal listener? This would mean the old indices stats API can be incorrect, but we're planning on removing that in 3.0 anyway.

Or, we can put it in both and define "cache removal" to include "rejected by policy". If the IRC ever called put() directly it would make the old stats API incorrect, but it never actually does call put(), so it should be fine?
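The second option, defining "cache removal" to include "rejected by policy", could look roughly like the standalone sketch below. All names here are assumptions for illustration (this is not the PR's actual diff; the real TSC uses ICacheKey, locks, and tier-specific stores):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: put() consults a policy list before admitting a value,
// and notifies a removal listener (evicted = false) when the policy rejects it.
public class PutPolicySketch<K, V> {
    public interface RemovalListener<K, V> {
        void onRemoval(K key, V value, boolean evicted);
    }

    private final Map<K, V> onHeap = new HashMap<>();
    private final List<Predicate<V>> policies;
    private final RemovalListener<K, V> removalListener;

    public PutPolicySketch(List<Predicate<V>> policies, RemovalListener<K, V> listener) {
        this.policies = policies;
        this.removalListener = listener;
    }

    public void put(K key, V value) {
        if (policies.stream().allMatch(p -> p.test(value))) {
            onHeap.put(key, value);
        } else {
            // "Rejected by policy" is reported as a non-eviction removal,
            // so the caller can react (e.g. deduct stats) or ignore it.
            removalListener.onRemoval(key, value, false);
        }
    }

    public V get(K key) {
        return onHeap.get(key);
    }
}
```

Under this definition, callers that never inspect removal notifications are unaffected, while callers like the IRC could deduct stats on rejection exactly as they do for computeIfAbsent().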

Signed-off-by: Peter Alfonsi <[email protected]>
Contributor

✅ Gradle check result for d138d28: SUCCESS

Signed-off-by: Peter Alfonsi <[email protected]>
Contributor

❌ Gradle check result for a317b9f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels
  • backport: PRs or issues specific to backporting features or enhancements
  • backport 2.x: Backport to 2.x branch
  • enhancement: Enhancement or improvement to existing feature or request
  • RFC: Issues requesting major changes
  • Search:Performance
Development

Successfully merging this pull request may close these issues.

[RFC] Optimize caching policy for Request cache
2 participants