
Conversation

@ankrgyl

@ankrgyl ankrgyl commented Nov 9, 2025

Which issue does this PR close?

Closes #541 (Heavy contention on credentials cache).

What changes are included in this PR?

Use an ArcSwap to hold the cached credential, and acquire a mutex only if the credential's TTL has expired.
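
A rough sketch of that shape (illustrative only; the get_or_refresh signature and the fetch closure are assumptions, not this PR's actual code):

use std::sync::Arc;
use std::time::Instant;

use arc_swap::ArcSwapOption;
use tokio::sync::Mutex;

struct CacheEntry<T> {
    token: T,
    expires_at: Instant,
}

struct TokenCache<T> {
    cache: ArcSwapOption<CacheEntry<T>>, // lock-free reads
    refresh: Mutex<()>,                  // serializes refreshes only
}

impl<T: Clone> TokenCache<T> {
    async fn get_or_refresh<F, Fut>(&self, fetch: F) -> T
    where
        F: Fn() -> Fut,
        Fut: std::future::Future<Output = (T, Instant)>,
    {
        // Fast path: an atomic pointer load, no lock at all.
        if let Some(entry) = self.cache.load_full() {
            if entry.expires_at > Instant::now() {
                return entry.token.clone();
            }
        }
        // Slow path: only one task refreshes; the rest wait here.
        let _guard = self.refresh.lock().await;
        // Re-check in case another task refreshed while we waited.
        if let Some(entry) = self.cache.load_full() {
            if entry.expires_at > Instant::now() {
                return entry.token.clone();
            }
        }
        let (token, expires_at) = fetch().await;
        self.cache.store(Some(Arc::new(CacheEntry {
            token: token.clone(),
            expires_at,
        })));
        token
    }
}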

Are there any user-facing changes?

No

@crepererum
Contributor

Instead of a new dependency, a RwLock would probably do what you need for #541. This way the queries wouldn't contend, and you could still override/update/replace the credentials using a writer thread.

Contributor

@alamb alamb left a comment

Thank you @ankrgyl

#[derive(Debug)]
pub(crate) struct TokenCache<T> {
-    cache: Mutex<Option<(TemporaryToken<T>, Instant)>>,
+    cache: ArcSwapOption<CacheEntry<T>>,

So in general, unless we have benchmark results that show arc-swap is necessary, I am opposed to adding a new dependency.

Did you try a RWLock before reaching for a new crate? I always worry about adding new crates like arc-swap as I don't want to have to deal with a RUSTSEC report if/when it becomes abandoned.

I do see there are many other users

RWLocks would allow multiple concurrent readers, but if you had a lot of writers you might still have contention. If you find update contention is too much, you could change to use RWLock<Arc<..>> so that the lock only needs to be held to clone an Arc.

I understand the docs for arc-swap claim (https://docs.rs/arc-swap/latest/arc_swap/):

Better option would be to have RwLock<Arc>. Then one would lock, clone the Arc and unlock. This suffers from CPU-level contention (on the lock and on the reference count of the Arc) which makes it relatively slow. Depending on the implementation, an update may be blocked for arbitrary long time by a steady inflow of readers.

I would imagine the overhead of actually using the token (making an HTTP request) is pretty huge compared to getting a lock.
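
As a reference point, a minimal sketch (mine, not from this PR) of the RwLock<Arc<..>> pattern described above, where the read lock is held only long enough to clone the Arc:

use std::sync::{Arc, RwLock};

struct Cached<T> {
    slot: RwLock<Option<Arc<T>>>,
}

impl<T> Cached<T> {
    fn get(&self) -> Option<Arc<T>> {
        // Cheap read: clone the Arc and drop the lock immediately.
        self.slot.read().unwrap().clone()
    }

    fn set(&self, value: T) {
        // Writers hold the lock only for the pointer swap.
        *self.slot.write().unwrap() = Some(Arc::new(value));
    }
}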

Author

@ankrgyl ankrgyl Nov 10, 2025

Sorry I just saw this after writing my other comment.

I would imagine the overhead of actually using the token (making an HTTP request) is pretty huge compared to getting a lock.

The problem with the previous design, which may not apply to an RwLock (and sure, I will benchmark it and report back), is that "waiting in line" for the mutex became so expensive with a high number of concurrent requests (e.g. HEAD requests with 8ms p50 latencies) that it actually overwhelmed tokio's worker threads and dominated the execution time (we saw p50 HEAD operation latency spike to 700ms, and realized the mutex was the root cause).

Let me run a benchmark with arc swap vs. RwLock and report back

Contributor

Let me run a benchmark with arc swap vs. RwLock and report back

👍

@ankrgyl
Author

ankrgyl commented Nov 10, 2025

Instead of a new dependency, a RwLock would probably do what you need for #541. This way the queries wouldn't contend, and you could still override/update/replace the credentials using a writer thread.

It's true, but I figured the read:write ratio is SO high (basically, 100% reads within the TTL window), that I'd like to avoid the (relatively) higher cost of an RwLock for the majority case. Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

@alamb
Contributor

alamb commented Nov 10, 2025

It's true, but I figured the read:write ratio is SO high (basically, 100% reads within the TTL window), that I'd like to avoid the (relatively) higher cost of an RwLock for the majority case. Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

I think a super read heavy workload will work well with RWLock

Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

In my mind, the burden of evidence is much higher to add a new dependency. I know it sounds somewhat like a curmudgeon, but each new dependency adds some small (but real) additional maintenance overhead (and downstream work). Unless there is compelling demonstrated value to add a new dependency, we try and avoid it.

Also, there are very few contributions removing dependencies for some reason 😆 so once we add one we are typically stuck with them.

@ankrgyl
Author

ankrgyl commented Nov 10, 2025

It's true, but I figured the read:write ratio is SO high (basically, 100% reads within the TTL window), that I'd like to avoid the (relatively) higher cost of an RwLock for the majority case. Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

I think a super read heavy workload will work well with RWLock

Would it be useful to illustrate this with a benchmark, or are you categorically against adding new dependencies?

In my mind, the burden of evidence is much higher to add a new dependency. I know it sounds somewhat like a curmudgeon, but each new dependency adds some small (but real) additional maintenance overhead (and downstream work). Unless there is compelling demonstrated value to add a new dependency, we try and avoid it.

Also, there are very few contributions removing dependencies for some reason 😆 so once we add one we are typically stuck with them.

That's totally fine with me! I am only writing in to contribute in a helpful manner. If I staunchly feel that arc_swap is the right solution, I can easily run a fork, so I'm aligned with doing whatever you feel is the right thing for the repo.

Here are the benchmark numbers run on my macbook pro. What I saw in production was dramatically more pronounced (mutex vs. arc swap, not arc swap vs. rwlock), but I don't have much more bandwidth to investigate that in depth.

=== Results Summary ===

Median Latency (p50):
Concurrency  │      Mutex │   Arc-swap │     RwLock │ Arc vs Mutex │ RwLock vs Mutex
────────────────────────────────────────────────────────────────────────────────────────────
100          │     2.31ms │     2.34ms │     2.35ms │        -1.2% │          -1.8%
500          │     2.57ms │     2.62ms │     2.65ms │        -1.8% │          -2.9%
5k           │     5.01ms │     3.46ms │     4.07ms │        44.9% │          23.2%
5k           │     5.15ms │     3.50ms │     4.15ms │        46.9% │          24.1%
10k          │     8.18ms │     7.11ms │     8.69ms │        15.0% │          -5.9%
25k          │    21.94ms │    19.04ms │    23.18ms │        15.3% │          -5.3%


Tail Latency (p99):
Concurrency  │      Mutex │   Arc-swap │     RwLock │ Arc vs Mutex │ RwLock vs Mutex
────────────────────────────────────────────────────────────────────────────────────────────
100          │     5.98ms │     3.54ms │     3.57ms │        68.9% │          67.8%
500          │     4.01ms │     3.93ms │     3.93ms │         2.2% │           2.1%
5k           │     6.60ms │     5.36ms │     5.94ms │        23.2% │          11.1%
5k           │     7.83ms │     5.81ms │     6.16ms │        34.8% │          27.0%
10k          │    11.59ms │    12.67ms │    15.97ms │        -8.5% │         -27.4%
25k          │    31.44ms │    37.85ms │    47.92ms │       -16.9% │         -34.4%

I'm happy to update the code however you'd like based on these findings.

@alamb
Contributor

alamb commented Nov 10, 2025

Here are the benchmark numbers run on my macbook pro. What I saw in production was dramatically more pronounced (mutex vs. arc swap, not arc swap vs. rwlock), but I don't have much more bandwidth to investigate that in depth.

What did you benchmark? Is this a micro benchmark for locking, or is it actually a workload running object_store requests?

Given that the total request time reported is in ms, I am guessing it is the actual workload you have.

@ankrgyl
Author

ankrgyl commented Nov 11, 2025

Here are the benchmark numbers run on my macbook pro. What I saw in production was dramatically more pronounced (mutex vs. arc swap, not arc swap vs. rwlock), but I don't have much more bandwidth to investigate that in depth.

What did you benchmark? Is this a micro benchmark for locking, or is it actually a workload running object_store requests?

Given that the total request time reported is in ms, I am guessing it is the actual workload you have.

The benchmark is here: https://github.com/apache/arrow-rs-object-store/pull/542/files#diff-4e18aa7d15e47cfe7440ad519403de83caed450f907bc533b9aa414ca7f9c7de. I simulated object store requests via a 1-20ms sleep, which obviously is not perfect...

You can run it with

cargo bench --bench cache_benchmark

@crepererum
Contributor

crepererum commented Nov 11, 2025

These are the results on my machine:

=== Results Summary ===

Median Latency (p50):
Concurrency  │      Mutex │   Arc-swap │     RwLock │ Arc vs Mutex │ RwLock vs Mutex
────────────────────────────────────────────────────────────────────────────────────────────
100          │     2.14ms │     2.13ms │     2.13ms │         0.3% │           0.2%
500          │     2.33ms │     2.25ms │     2.25ms │         3.8% │           3.6%
5k           │     4.45ms │     3.48ms │     3.26ms │        28.0% │          36.3%
5k           │     4.36ms │     3.47ms │     3.28ms │        25.5% │          32.8%
10k          │     9.53ms │     7.12ms │     6.65ms │        33.8% │          43.3%
25k          │    23.42ms │    18.78ms │    17.56ms │        24.7% │          33.4%


Tail Latency (p99):
Concurrency  │      Mutex │   Arc-swap │     RwLock │ Arc vs Mutex │ RwLock vs Mutex
────────────────────────────────────────────────────────────────────────────────────────────
100          │     3.25ms │     3.20ms │     3.21ms │         1.7% │           1.2%
500          │     3.51ms │     3.39ms │     3.39ms │         3.7% │           3.7%
5k           │     6.02ms │     7.07ms │     6.59ms │       -14.8% │          -8.6%
5k           │     5.92ms │     7.01ms │     6.52ms │       -15.5% │          -9.2%
10k          │    10.93ms │    12.78ms │    11.68ms │       -14.4% │          -6.3%
25k          │    24.91ms │    36.22ms │    33.63ms │       -31.2% │         -25.9%

So here RwLock is consistently better than Arc-swap, but one also has to note that the overall improvement of this entire endeavor is fairly limited (like 35% latency improvement at best). In fact the tail latency is the best using a plain Mutex.

@crepererum
Contributor

crepererum commented Nov 11, 2025

Also looking at the code, you roughly have something like this:

if token_fresh() {
  return token;
}

let _guard = refresh.lock().await;

...

write_token();

I don't see how an ArcSwap is really gonna help here. In fact with that construct the RwLock doesn't even need to be a tokio async lock. You can just use a stdlib or parking_lot RwLock since getting the fresh token doesn't involve I/O and replacing it while being under the refresh guard at the very end is also just a sync write operation. My suspicion is that that would be even faster, since async locks also have some overhead that we technically don't need here.
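
Concretely, a minimal sketch of that shape, assuming a stdlib RwLock for the read path and a tokio Mutex only around the refresh (names and the fetch signature are hypothetical; the real cache stores a (TemporaryToken<T>, Instant) pair):

use std::sync::RwLock;
use std::time::Instant;

struct TokenCache<T> {
    token: RwLock<Option<(T, Instant)>>, // sync lock: read/write without .await
    refresh: tokio::sync::Mutex<()>,     // async lock: held only while fetching
}

impl<T: Clone> TokenCache<T> {
    async fn get(&self, fetch: impl std::future::Future<Output = (T, Instant)>) -> T {
        // Fast path: sync read lock, held only to clone the cached token.
        if let Some((token, expires_at)) = self.token.read().unwrap().clone() {
            if expires_at > Instant::now() {
                return token;
            }
        }
        // Slow path: one task fetches, others queue on the async mutex.
        let _guard = self.refresh.lock().await;
        if let Some((token, expires_at)) = self.token.read().unwrap().clone() {
            if expires_at > Instant::now() {
                return token; // someone refreshed while we waited
            }
        }
        let (token, expires_at) = fetch.await;
        *self.token.write().unwrap() = Some((token.clone(), expires_at));
        token
    }
}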

@ankrgyl
Author

ankrgyl commented Nov 11, 2025

Also looking at the code, you roughly have something like this:

if token_fresh() {
  return token;
}

let _guard = refresh.lock().await;

...

write_token();

I don't see how an ArcSwap is really gonna help here. In fact with that construct the RwLock doesn't even need to be a tokio async lock. You can just use a stdlib or parking_lot RwLock since getting the fresh token doesn't involve I/O and replacing it while being under the refresh guard at the very end is also just a sync write operation. My suspicion is that that would be even faster, since async locks also have some overhead that we technically don't need here.

Correct. I explicitly did not test any credential refreshes and just wanted to illustrate the difference with read contention. It's not very hard to fill that in.

And yes I stated above that this benchmark does not illustrate the difference as extremely as what I saw in production. Perhaps it's because in a real workload, we're doing a lot more on the runtime than just GET operations, and that accentuates the impact of additional polls.

@alamb
Contributor

alamb commented Nov 11, 2025

And yes I stated above that this benchmark does not illustrate the difference as extremely as what I saw in production. Perhaps it's because in a real workload, we're doing a lot more on the runtime than just GET operations, and that accentuates the impact of additional polls.

It might make sense to look into using a separate threadpool for CPU and IO work.

For example, you can move all your object store work to a different threadpool (tokio runtime) using the SpawnedReqwestConnector. There is an end to end example in datafusion: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/thread_pools.rs

Something we spent quite a long time on at InfluxData was that IO/network latencies increased substantially with highly concurrent workloads. We eventually tracked this down to using the same threadpool (tokio pool) for CPU and IO work -- doing so basically starves the IO of the CPU it needs to make progress in the TCP state machine, and it seems that the TCP stack then treats the system as being congested and slows down traffic.
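
As a rough illustration of the idea (plain tokio only; the object_store-specific wiring would go through the SpawnedReqwestConnector mentioned above, which is not shown here):

use tokio::runtime::Runtime;

fn main() {
    // Runtime reserved for network IO, so CPU-heavy work on the main runtime
    // cannot starve the TCP state machine of polling time.
    let io_rt = Runtime::new().expect("io runtime");
    let io_handle = io_rt.handle().clone();

    // Runtime for the CPU-bound query/indexing work.
    let cpu_rt = Runtime::new().expect("cpu runtime");
    cpu_rt.block_on(async move {
        // Hand the request to the IO runtime and await its result from here.
        let response = io_handle
            .spawn(async {
                // ... issue the object store / HTTP request here ...
                "response"
            })
            .await
            .expect("io task panicked");
        println!("{response}");
    });
}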

@ankrgyl
Author

ankrgyl commented Nov 12, 2025

And yes I stated above that this benchmark does not illustrate the difference as extremely as what I saw in production. Perhaps it's because in a real workload, we're doing a lot more on the runtime than just GET operations, and that accentuates the impact of additional polls.

It might make sense to look into using a separate threadpool for CPU and IO work.

For example, you can move all your object store work to a different threadpool (tokio runtime) using the SpawnedReqwestConnector. There is an end to end example in datafusion: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/thread_pools.rs

Something we spent quite a long time on at InfluxData was that IO/network latencies increased substantially with highly concurrent workloads. We eventually tracked this down to using the same threadpool (tokio pool) for CPU and IO work -- doing so basically starves the IO of the CPU it needs to make progress in the TCP state machine, and it seems that the TCP stack then treats the system as being congested and slows down traffic.

I appreciate it! We do a lot of this and are constantly optimizing it. I deployed the change on our end using arc_swap, and with no other changes, saw a pretty substantial impact on reported latency (and for us, indexing throughput).

I don't mind at all if you are uninterested in this contribution. I was mostly submitting it to pay-it-back and say thank you for this library. I'm very happy to just run our fork. Feel free to let me know what you'd like from this point onwards.

@crepererum
Contributor

crepererum commented Nov 12, 2025

FWIW: I don't question that ArcSwap is better than the plain Mutex, but I don't think it's actually necessary. The way the code is written, the same effect can likely be gained using RwLock.

It might make sense to look into using a separate threadpool for CPU and IO work.

I don't think there's CPU contention here, and IMHO using a separate threadpool would be totally overkill. The code in main has reader contention under many concurrent requests, though.

@alamb
Contributor

alamb commented Nov 12, 2025

I don't mind at all if you are uninterested in this contribution. I was mostly submitting it to pay-it-back and say thank you for this library. I'm very happy to just run our fork. Feel free to let me know what you'd like from this point onwards.

Thank you -- we very much appreciate it

I think we are trying to get to the bottom of what is going on / figure out the best solution. In my opinion, switching from Mutex to RWLock is a clear win and I would be happy to accept such a PR (or maybe I will find time to write one myself).

I feel like we are still trying to figure out how much benefit, if any, the new arc-swap library really brings, so we can make a final judgement call about whether the new dependency is worth it.

@ankrgyl
Author

ankrgyl commented Nov 12, 2025

I don't mind at all if you are uninterested in this contribution. I was mostly submitting it to pay-it-back and say thank you for this library. I'm very happy to just run our fork. Feel free to let me know what you'd like from this point onwards.

Thank you -- we very much appreciate it

I think we are trying to get to the bottom of what is going on / figure out the best solution. In my opinion, switching from Mutex to RWLock is a clear win and I would be happy to accept such a PR (or maybe I will find time to write one myself).

I feel like we are still trying to figure out how much benefit, if any, the new arc-swap library really brings, so we can make a final judgement call about whether the new dependency is worth it.

I personally have no issues with using an RWLock instead of arc swap (I agree, the evidence is not glaringly obvious from these micro benchmarks).

@crepererum
Contributor

the evidence is not glaringly obvious from these micro benchmarks

Do you see a difference between RwLock and AtomicArc in your production environment? Because if you do, we could work on better benchmarks to replicate that.

@ankrgyl
Author

ankrgyl commented Nov 13, 2025

the evidence is not glaringly obvious from these micro benchmarks

Do you see a difference between RwLock and AtomicArc in your production environment? Because if you do, we could work on better benchmarks to replicate that.

I have not tested it yet. I can give it a go and report back.
