
Conversation

@ankrgyl

@ankrgyl ankrgyl commented Nov 9, 2025

Which issue does this PR close?

Closes #541 (Heavy contention on credentials cache).

What changes are included in this PR?

Use an ArcSwap to hold the cached credential, and acquire a mutex only if the credential's TTL has expired.
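
A rough sketch of that shape (illustrative only; the get_or_refresh signature and the fetch closure are assumptions, not this PR's actual code):

use std::sync::Arc;
use std::time::Instant;

use arc_swap::ArcSwapOption;
use tokio::sync::Mutex;

struct CacheEntry<T> {
    token: T,
    expires_at: Instant,
}

struct TokenCache<T> {
    cache: ArcSwapOption<CacheEntry<T>>, // lock-free reads
    refresh: Mutex<()>,                  // serializes refreshes only
}

impl<T: Clone> TokenCache<T> {
    async fn get_or_refresh<F, Fut>(&self, fetch: F) -> T
    where
        F: Fn() -> Fut,
        Fut: std::future::Future<Output = (T, Instant)>,
    {
        // Fast path: an atomic pointer load, no lock at all.
        if let Some(entry) = self.cache.load_full() {
            if entry.expires_at > Instant::now() {
                return entry.token.clone();
            }
        }
        // Slow path: only one task refreshes; the rest wait here.
        let _guard = self.refresh.lock().await;
        // Re-check in case another task refreshed while we waited.
        if let Some(entry) = self.cache.load_full() {
            if entry.expires_at > Instant::now() {
                return entry.token.clone();
            }
        }
        let (token, expires_at) = fetch().await;
        self.cache.store(Some(Arc::new(CacheEntry {
            token: token.clone(),
            expires_at,
        })));
        token
    }
}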

Are there any user-facing changes?

No

@crepererum
Contributor

Instead of a new dependency, a RwLock would probably do what you need for #541. This way the queries wouldn't contend, and you could still override/update/replace the credentials using a writer thread.

Contributor

@alamb alamb left a comment

Thank you @ankrgyl

#[derive(Debug)]
pub(crate) struct TokenCache<T> {
-    cache: Mutex<Option<(TemporaryToken<T>, Instant)>>,
+    cache: ArcSwapOption<CacheEntry<T>>,

So in general, unless we have benchmark results that show arc-swap is necessary, I am opposed to adding a new dependency.

Did you try a RWLock before reaching for a new crate? I always worry about adding new crates like arc-swap as I don't want to have to deal with a RUSTSEC report if/when it becomes abandoned.

I do see there are many other users

RWLocks would allow multiple concurrent readers, but if you had a lot of writers you might still have contention. If you find update contention is too much, you could change to use RWLock<Arc<..>> so that the lock only needs to be held to clone an Arc.

I understand the docs for arc-swap claim (https://docs.rs/arc-swap/latest/arc_swap/):

Better option would be to have RwLock<Arc>. Then one would lock, clone the Arc and unlock. This suffers from CPU-level contention (on the lock and on the reference count of the Arc) which makes it relatively slow. Depending on the implementation, an update may be blocked for arbitrary long time by a steady inflow of readers.

I would imagine the overhead of actually using the token (making an HTTP request) is pretty huge compared to getting a lock.
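
As a reference point, a minimal sketch (mine, not from this PR) of the RwLock<Arc<..>> pattern described above, where the read lock is held only long enough to clone the Arc:

use std::sync::{Arc, RwLock};

struct Cached<T> {
    slot: RwLock<Option<Arc<T>>>,
}

impl<T> Cached<T> {
    fn get(&self) -> Option<Arc<T>> {
        // Cheap read: clone the Arc and drop the lock immediately.
        self.slot.read().unwrap().clone()
    }

    fn set(&self, value: T) {
        // Writers hold the lock only for the pointer swap.
        *self.slot.write().unwrap() = Some(Arc::new(value));
    }
}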

Author

@ankrgyl ankrgyl Nov 10, 2025

Sorry I just saw this after writing my other comment.

I would imagine the overhead of actually using the token (making an HTTP request) is pretty huge compared to getting a lock.

The problem with the previous design, which may not apply to an RwLock (and sure, I will benchmark it and report back), is that "waiting in line" for the mutex became so expensive with a high number of concurrent requests (e.g. HEAD requests with 8ms p50 latencies) that it actually overwhelmed tokio's worker threads and dominated the execution time (we saw p50 HEAD operation latency spike to 700ms, and realized the mutex was the root cause).

Let me run a benchmark with arc swap vs. RwLock and report back

Contributor

Let me run a benchmark with arc swap vs. RwLock and report back

👍

@ankrgyl
Author

ankrgyl commented Nov 10, 2025

Instead of a new dependency, a RwLock would probably do what you need for #541. This way the queries wouldn't contend, and you could still override/update/replace the credentials using a writer thread.

It's true, but I figured the read:write ratio is SO high (basically, 100% reads within the TTL window), that I'd like to avoid the (relatively) higher cost of an RwLock for the majority case. Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

@alamb
Contributor

alamb commented Nov 10, 2025

It's true, but I figured the read:write ratio is SO high (basically, 100% reads within the TTL window), that I'd like to avoid the (relatively) higher cost of an RwLock for the majority case. Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

I think a super read heavy workload will work well with RWLock

Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

In my mind, the burden of evidence is much higher to add a new dependency. I know it sounds somewhat like a curmudgeon, but each new dependency adds some small (but real) additional maintenance overhead (and downstream work). Unless there is compelling demonstrated value to add a new dependency, we try and avoid it.

Also, there are very few contributions removing dependencies for some reason 😆 so once we add one we are typically stuck with them.

@ankrgyl
Author

ankrgyl commented Nov 10, 2025

It's true, but I figured the read:write ratio is SO high (basically, 100% reads within the TTL window), that I'd like to avoid the (relatively) higher cost of an RwLock for the majority case. Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

I think a super read heavy workload will work well with RWLock

Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

In my mind, the burden of evidence is much higher to add a new dependency. I know it sounds somewhat like a curmudgeon, but each new dependency adds some small (but real) additional maintenance overhead (and downstream work). Unless there is compelling demonstrated value to add a new dependency, we try and avoid it.

Also, there are very few contributions removing dependencies for some reason 😆 so once we add one we are typically stuck with them.

That's totally fine with me! I am only writing in to contribute in a helpful manner. If I staunchly feel that arc_swap is the right solution, I can easily run a fork, so I'm aligned with doing whatever you feel is the right thing for the repo.

Here are the benchmark numbers run on my macbook pro. What I saw in production was dramatically more pronounced (mutex vs. arc swap, not arc swap vs. rwlock), but I don't have much more bandwidth to investigate that in depth.

=== Results Summary ===

Median Latency (p50):
Concurrency  │      Mutex │   Arc-swap │     RwLock │ Arc vs Mutex │ RwLock vs Mutex
────────────────────────────────────────────────────────────────────────────────────────────
100          │     2.31ms │     2.34ms │     2.35ms │        -1.2% │          -1.8%
500          │     2.57ms │     2.62ms │     2.65ms │        -1.8% │          -2.9%
5k           │     5.01ms │     3.46ms │     4.07ms │        44.9% │          23.2%
5k           │     5.15ms │     3.50ms │     4.15ms │        46.9% │          24.1%
10k          │     8.18ms │     7.11ms │     8.69ms │        15.0% │          -5.9%
25k          │    21.94ms │    19.04ms │    23.18ms │        15.3% │          -5.3%


Tail Latency (p99):
Concurrency  │      Mutex │   Arc-swap │     RwLock │ Arc vs Mutex │ RwLock vs Mutex
────────────────────────────────────────────────────────────────────────────────────────────
100          │     5.98ms │     3.54ms │     3.57ms │        68.9% │          67.8%
500          │     4.01ms │     3.93ms │     3.93ms │         2.2% │           2.1%
5k           │     6.60ms │     5.36ms │     5.94ms │        23.2% │          11.1%
5k           │     7.83ms │     5.81ms │     6.16ms │        34.8% │          27.0%
10k          │    11.59ms │    12.67ms │    15.97ms │        -8.5% │         -27.4%
25k          │    31.44ms │    37.85ms │    47.92ms │       -16.9% │         -34.4%

I'm happy to update the code however you'd like based on these findings.

@alamb
Contributor

alamb commented Nov 10, 2025

Here are the benchmark numbers run on my macbook pro. What I saw in production was dramatically more pronounced (mutex vs. arc swap, not arc swap vs. rwlock), but I don't have much more bandwidth to investigate that in depth.

What did you benchmark? Is this a micro benchmark for locking, or is it actually a workload running object_store requests?

Given that the total request time reported is in ms, I am guessing it is the actual workload you have.

@ankrgyl
Author

ankrgyl commented Nov 11, 2025

Here are the benchmark numbers run on my macbook pro. What I saw in production was dramatically more pronounced (mutex vs. arc swap, not arc swap vs. rwlock), but I don't have much more bandwidth to investigate that in depth.

What did you benchmark? Is this a micro benchmark for locking, or is it actually a workload running object_store requests?

Given that the total request time reported is in ms, I am guessing it is the actual workload you have.

The benchmark is here: https://github.com/apache/arrow-rs-object-store/pull/542/files#diff-4e18aa7d15e47cfe7440ad519403de83caed450f907bc533b9aa414ca7f9c7de. I simulated object store requests via a 1-20ms sleep, which obviously is not perfect...

You can run it with

cargo bench --bench cache_benchmark

@crepererum
Contributor

crepererum commented Nov 11, 2025

These are the results on my machine:

=== Results Summary ===

Median Latency (p50):
Concurrency  │      Mutex │   Arc-swap │     RwLock │ Arc vs Mutex │ RwLock vs Mutex
────────────────────────────────────────────────────────────────────────────────────────────
100          │     2.14ms │     2.13ms │     2.13ms │         0.3% │           0.2%
500          │     2.33ms │     2.25ms │     2.25ms │         3.8% │           3.6%
5k           │     4.45ms │     3.48ms │     3.26ms │        28.0% │          36.3%
5k           │     4.36ms │     3.47ms │     3.28ms │        25.5% │          32.8%
10k          │     9.53ms │     7.12ms │     6.65ms │        33.8% │          43.3%
25k          │    23.42ms │    18.78ms │    17.56ms │        24.7% │          33.4%


Tail Latency (p99):
Concurrency  │      Mutex │   Arc-swap │     RwLock │ Arc vs Mutex │ RwLock vs Mutex
────────────────────────────────────────────────────────────────────────────────────────────
100          │     3.25ms │     3.20ms │     3.21ms │         1.7% │           1.2%
500          │     3.51ms │     3.39ms │     3.39ms │         3.7% │           3.7%
5k           │     6.02ms │     7.07ms │     6.59ms │       -14.8% │          -8.6%
5k           │     5.92ms │     7.01ms │     6.52ms │       -15.5% │          -9.2%
10k          │    10.93ms │    12.78ms │    11.68ms │       -14.4% │          -6.3%
25k          │    24.91ms │    36.22ms │    33.63ms │       -31.2% │         -25.9%

So here RwLock is consistently better than Arc-swap, but one also has to note that the overall improvement of this entire endeavor is fairly limited (like 35% latency improvement at best). In fact the tail latency is the best using a plain Mutex.

@crepererum
Contributor

crepererum commented Nov 11, 2025

Also looking at the code, you roughly have something like this:

if token_fresh() {
  return token;
}

let _guard = refresh.lock().await;

...

write_token();

I don't see how an ArcSwap is really gonna help here. In fact with that construct the RwLock doesn't even need to be a tokio async lock. You can just use a stdlib or parking_lot RwLock since getting the fresh token doesn't involve I/O and replacing it while being under the refresh guard at the very end is also just a sync write operation. My suspicion is that that would be even faster, since async locks also have some overhead that we technically don't need here.
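
Concretely, a minimal sketch of that shape, assuming a stdlib RwLock for the read path and a tokio Mutex only around the refresh (names and the fetch signature are hypothetical; the real cache stores a (TemporaryToken<T>, Instant) pair):

use std::sync::RwLock;
use std::time::Instant;

struct TokenCache<T> {
    token: RwLock<Option<(T, Instant)>>, // sync lock: read/write without .await
    refresh: tokio::sync::Mutex<()>,     // async lock: held only while fetching
}

impl<T: Clone> TokenCache<T> {
    async fn get(&self, fetch: impl std::future::Future<Output = (T, Instant)>) -> T {
        // Fast path: sync read lock, held only to clone the cached token.
        if let Some((token, expires_at)) = self.token.read().unwrap().clone() {
            if expires_at > Instant::now() {
                return token;
            }
        }
        // Slow path: one task fetches, others queue on the async mutex.
        let _guard = self.refresh.lock().await;
        if let Some((token, expires_at)) = self.token.read().unwrap().clone() {
            if expires_at > Instant::now() {
                return token; // someone refreshed while we waited
            }
        }
        let (token, expires_at) = fetch.await;
        *self.token.write().unwrap() = Some((token.clone(), expires_at));
        token
    }
}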

@ankrgyl
Author

ankrgyl commented Nov 11, 2025

Also looking at the code, you roughly have something like this:

if token_fresh() {
  return token;
}

let _guard = refresh.lock().await;

...

write_token();

I don't see how an ArcSwap is really gonna help here. In fact with that construct the RwLock doesn't even need to be a tokio async lock. You can just use a stdlib or parking_lot RwLock since getting the fresh token doesn't involve I/O and replacing it while being under the refresh guard at the very end is also just a sync write operation. My suspicion is that that would be even faster, since async locks also have some overhead that we technically don't need here.

Correct. I explicitly did not test any credential refreshes and just wanted to illustrate the difference with read contention. It's not very hard to fill that in.

And yes I stated above that this benchmark does not illustrate the difference as extremely as what I saw in production. Perhaps it's because in a real workload, we're doing a lot more on the runtime than just GET operations, and that accentuates the impact of additional polls.

@alamb
Contributor

alamb commented Nov 11, 2025

And yes I stated above that this benchmark does not illustrate the difference as extremely as what I saw in production. Perhaps it's because in a real workload, we're doing a lot more on the runtime than just GET operations, and that accentuates the impact of additional polls.

It might make sense to look into using a separate threadpool for CPU and IO work.

For example, you can move all your object store work to a different threadpool (tokio runtime) using the SpawnedReqwestConnector. There is an end to end example in datafusion: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/thread_pools.rs

Something we spent quite a long time on at InfluxData was that IO/network latencies increased substantially with highly concurrent workloads. We eventually tracked this down to using the same threadpool (tokio pool) for CPU and IO work -- doing so basically starves the IO of the CPU it needs to make progress in the TCP state machine, and it seems that the TCP stack then treats the system as being congested and slows down traffic.
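
As a rough illustration of the idea (plain tokio only; the object_store-specific wiring would go through the SpawnedReqwestConnector mentioned above, which is not shown here):

use tokio::runtime::Runtime;

fn main() {
    // Runtime reserved for network IO, so CPU-heavy work on the main runtime
    // cannot starve the TCP state machine of polling time.
    let io_rt = Runtime::new().expect("io runtime");
    let io_handle = io_rt.handle().clone();

    // Runtime for the CPU-bound query/indexing work.
    let cpu_rt = Runtime::new().expect("cpu runtime");
    cpu_rt.block_on(async move {
        // Hand the request to the IO runtime and await its result from here.
        let response = io_handle
            .spawn(async {
                // ... issue the object store / HTTP request here ...
                "response"
            })
            .await
            .expect("io task panicked");
        println!("{response}");
    });
}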

@ankrgyl
Author

ankrgyl commented Nov 12, 2025

And yes I stated above that this benchmark does not illustrate the difference as extremely as what I saw in production. Perhaps it's because in a real workload, we're doing a lot more on the runtime than just GET operations, and that accentuates the impact of additional polls.

It might make sense to look into using a separate threadpool for CPU and IO work.

For example, you can move all your object store work to a different threadpool (tokio runtime) using the SpawnedReqwestConnector. There is an end to end example in datafusion: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/thread_pools.rs

Something we spent quite a long time on at InfluxData was that IO/network latencies increased substantially with highly concurrent workloads. We eventually tracked this down to using the same threadpool (tokio pool) for CPU and IO work -- doing so basically starves the IO of the CPU it needs to make progress in the TCP state machine, and it seems that the TCP stack then treats the system as being congested and slows down traffic.

I appreciate it! We do a lot of this and are constantly optimizing it. I deployed the change on our end using arc_swap, and with no other changes, saw a pretty substantial impact on reported latency (and for us, indexing throughput).

I don't mind at all if you are uninterested in this contribution. I was mostly submitting it to pay-it-back and say thank you for this library. I'm very happy to just run our fork. Feel free to let me know what you'd like from this point onwards.

@crepererum
Contributor

crepererum commented Nov 12, 2025

FWIW: I don't question that ArcSwap is better than the plain Mutex, but I don't think it's actually necessary. The way the code is written, the same effect can likely be gained using RwLock.

It might make sense to look into using a separate threadpool for CPU and IO work.

I don't think there's CPU contention here, and IMHO using a separate threadpool would be totally overkill. The code in main has reader contention under many concurrent requests, though.

@alamb
Contributor

alamb commented Nov 12, 2025

I don't mind at all if you are uninterested in this contribution. I was mostly submitting it to pay-it-back and say thank you for this library. I'm very happy to just run our fork. Feel free to let me know what you'd like from this point onwards.

Thank you -- we very much appreciate it

I think we are trying to get to the bottom of what is going on / figure out the best solution. In my opinion, switching from Mutex to RWLock is a clear win and I would be happy to accept such a PR (or maybe I will find time to write one myself).

I feel like we are still trying to figure out how much benefit, if any, the new arc-swap library really brings, so we can make a final judgement call about whether the new dependency is worth it.

@ankrgyl
Author

ankrgyl commented Nov 12, 2025

I don't mind at all if you are uninterested in this contribution. I was mostly submitting it to pay-it-back and say thank you for this library. I'm very happy to just run our fork. Feel free to let me know what you'd like from this point onwards.

Thank you -- we very much appreciate it

I think we are trying to get to the bottom of what is going on / figure out the best solution. In my opinion, switching from Mutex to RWLock is a clear win and I would be happy to accept such a PR (or maybe I will find time to write one myself).

I feel like we are still trying to figure out how much benefit, if any, the new arc-swap library really brings, so we can make a final judgement call about whether the new dependency is worth it.

I personally have no issues with using an RWLock instead of arc swap (I agree, the evidence is not glaringly obvious from these micro benchmarks).

@crepererum
Contributor

the evidence is not glaringly obvious from these micro benchmarks

Do you see a difference between RwLock and AtomicArc in your production environment? Because if you do, we could work on better benchmarks to replicate that.

@ankrgyl
Author

ankrgyl commented Nov 13, 2025

the evidence is not glaringly obvious from these micro benchmarks

Do you see a difference between RwLock and AtomicArc in your production environment? Because if you do, we could work on better benchmarks to replicate that.

I have not tested it yet. I can give it a go and report back.
