Skip to content

fix: claim clone approach to prevent duplicate clones#1101

Open
thlorenz wants to merge 21 commits intomasterfrom
thlorenz/fetch-sub-race
Open

fix: claim clone approach to prevent duplicate clones#1101
thlorenz wants to merge 21 commits intomasterfrom
thlorenz/fetch-sub-race

Conversation

@thlorenz
Copy link
Copy Markdown
Collaborator

@thlorenz thlorenz commented Mar 27, 2026

Summary

Fixes a race condition where both the on-demand fetch path and the subscription update path independently submit clone requests for the same (pubkey, slot), resulting in duplicate clone transactions. The fix adds a pending-clone ownership mechanism inside FetchCloner so the first caller performs the clone while subsequent callers wait for the result.

Details

Problem

When an account is being fetched on-demand while a subscription update arrives for the same account at the same slot, both paths pass their independent checks and call clone_account(), producing duplicate work.

Initial Attempt to Fix

The original approach (shared SubMux dedup cache part of first commits in this PR, practically now reverted) did not properly work:

A cache hit only means "the subscription path touched this entry," not "the account was successfully cloned into bank."
This caused delegated accounts to silently never materialize.

Solution

FetchCloner now maintains a pending_clones: HashMap<(Pubkey, u64), Vec<Sender<CloneCompletion>>>. The first caller to claim_pending_clone for a given (pubkey, slot) becomes the owner and performs the actual clone. Any subsequent caller for the same key becomes a waiter and receives the outcome (success/failure) via a oneshot channel. After the clone completes, finish_pending_clone removes the entry and notifies all waiters.

Initial PR description
Details

Summary

Fixes a race condition where both the on-demand fetch path and the subscription update path independently submit clone requests for the same (pubkey, slot), resulting in duplicate clone transactions. The fix shares SubMuxClient's existing dedup cache with FetchCloner so both paths coordinate on (pubkey, slot) deduplication.

Details

Problem

When an account is being fetched on-demand while a subscription update arrives for the same account at the same slot, both paths pass their independent checks and call clone_account(), producing duplicate work.

Solution

SubMuxClient already maintains a (Pubkey, u64) → Instant dedup cache and inserts entries via should_forward_dedup() before forwarding subscription updates. This PR exposes that cache and passes it into FetchCloner, which checks it in clone_accounts_and_programs() before cloning. If the subscription path already inserted an entry, the fetch path skips its clone (and vice versa).

Key Changes

  • SubMuxClient: Added dedup_cache() and dedup_window() accessors to expose the shared cache
  • RemoteAccountProvider: try_from_urls_and_config and try_new_from_endpoints now return the dedup cache and window alongside the provider
  • FetchCloner: Takes required DedupCache and dedup_window, with a check_dedup_cache() method that does atomic check-and-insert. The dedup filter runs in the pipeline's clone_accounts_and_programs() step
  • ClonerStub: Added clone request tracking and configurable delay for testing
  • AccountCloneRequest: Derived Clone to support test assertions
  • Test: Added test_fetch_subscription_race_duplicate_clone that reproduces the race and verifies only one clone request is submitted

Summary by CodeRabbit

  • New Features
    • Added a shared deduplication mechanism that skips redundant account/program clone requests within a configurable time window, reducing duplicate cloning work.
  • Tests
    • Added tests and test utilities validating dedup behavior under concurrent fetch/subscription races to ensure only one clone occurs for duplicates.
  • Chores
    • Made a request type cloneable and widened a test size-deviation tolerance to reduce flakiness.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 27, 2026

Manual Deploy Available

You can trigger a manual deploy of this PR branch to testnet:

Deploy to Testnet 🚀

Alternative: Comment /deploy on this PR to trigger deployment directly.

⚠️ Note: Manual deploy requires authorization. Only authorized users can trigger deployments.

Comment updated automatically when the PR is synchronized.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request implements an in-flight clone coordination mechanism in FetchCloner to deduplicate concurrent clone requests targeting the same account and slot. When multiple clone requests arrive for the same (pubkey, slot) pair, only the first caller executes the actual clone operation while subsequent callers wait for the result. Changes include: adding pending clone tracking with ownership-based flow control, updating pipeline code to use the new coordination method, adding comprehensive test coverage for race conditions, enhancing ClonerStub with request tracking and failure injection capabilities, and refactoring RemoteAccountProvider's unsubscribe logic with atomic claim semantics.

Suggested reviewers

  • GabrielePicco
  • snawaz
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch thlorenz/fetch-sub-race

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs`:
- Around line 183-206: The mutex lock in check_dedup_cache uses .unwrap() on
self.dedup_cache.lock(), which must be either justified or handled; replace the
.unwrap() with explicit handling of PoisonError (e.g., match
self.dedup_cache.lock() { Ok(guard) => guard, Err(poison) => { /* recover: log
warning about poisoned mutex and continue with poison.into_inner() */
poison.into_inner() } }) or, if you decide to keep the assumption, add a short
justification comment immediately above the lock call referencing the invariant
that the mutex can only be poisoned by a panic and that recovery via
into_inner() is acceptable in this context; reference the function name
check_dedup_cache and the symbol self.dedup_cache.lock() when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 83702121-3c76-458b-a67e-4dd053afa2f9

📥 Commits

Reviewing files that changed from the base of the PR and between f8a5871 and e6366e7.

📒 Files selected for processing (11)
  • magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner/pipeline.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner/tests.rs
  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/src/cloner/mod.rs
  • magicblock-chainlink/src/remote_account_provider/mod.rs
  • magicblock-chainlink/src/submux/mod.rs
  • magicblock-chainlink/src/testing/cloner_stub.rs
  • magicblock-chainlink/tests/utils/test_context.rs
  • test-integration/test-chainlink/src/ixtest_context.rs
  • test-integration/test-chainlink/src/test_context.rs
👮 Files not reviewed due to content moderation or server errors (8)
  • magicblock-chainlink/src/cloner/mod.rs
  • magicblock-chainlink/tests/utils/test_context.rs
  • test-integration/test-chainlink/src/ixtest_context.rs
  • test-integration/test-chainlink/src/test_context.rs
  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner/pipeline.rs
  • magicblock-chainlink/src/submux/mod.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner/tests.rs

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs (1)

195-197: ⚠️ Potential issue | 🟠 Major

Handle dedup_cache lock poisoning explicitly.

Line 197 still panics on a poisoned mutex. Please recover or document the invariant instead of unwrapping.

🛠️ Minimal fix
         let now = Instant::now();
-        let mut cache = self.dedup_cache.lock().unwrap();
+        let mut cache = match self.dedup_cache.lock() {
+            Ok(cache) => cache,
+            Err(poisoned) => {
+                warn!(
+                    pubkey = %pubkey,
+                    slot,
+                    "dedup cache lock poisoned; recovering"
+                );
+                poisoned.into_inner()
+            }
+        };

As per coding guidelines: "Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs` around lines 195 -
197, The check_dedup_cache function currently calls
self.dedup_cache.lock().unwrap(), which will panic on a poisoned mutex; change
this to explicitly handle Mutex::lock() returning Err(PoisonError) by matching
the result (e.g., match self.dedup_cache.lock() { Ok(mut cache) => ...,
Err(poisoned) => { let mut cache = poisoned.into_inner(); /* log a warning about
poison */ ... } }) so the function recovers and continues using the inner cache
instead of panicking; ensure the code in check_dedup_cache still uses the
recovered cache to perform the dedup logic and consider logging the mutex
poisoning event for observability.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs`:
- Around line 195-197: The check_dedup_cache function currently calls
self.dedup_cache.lock().unwrap(), which will panic on a poisoned mutex; change
this to explicitly handle Mutex::lock() returning Err(PoisonError) by matching
the result (e.g., match self.dedup_cache.lock() { Ok(mut cache) => ...,
Err(poisoned) => { let mut cache = poisoned.into_inner(); /* log a warning about
poison */ ... } }) so the function recovers and continues using the inner cache
instead of panicking; ensure the code in check_dedup_cache still uses the
recovered cache to perform the dedup logic and consider logging the mutex
poisoning event for observability.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8d473a45-068c-4d09-8302-3394029a7dc2

📥 Commits

Reviewing files that changed from the base of the PR and between e6366e7 and 0629e7e.

📒 Files selected for processing (4)
  • magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs
  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/src/remote_account_provider/mod.rs
  • magicblock-chainlink/src/submux/mod.rs

thlorenz and others added 6 commits March 27, 2026 20:36
* master:
  feat: action callbacks (#1061)
* master:
  fix: allow not existing accounts in actions (#1120)
  fix: published npm wrapper (#1115)
  release: 0.8.5 (#1113)
  fix: actions dropping on eata cold cloning (#1112)
  fix (CloneAccount): enforce incoming_slot > current_slot when cloning an account (#1110)
  fix: flaky restore_ledger integration test (#1107)
  Fix: remove conditions that allow to overrides undelegating ata (trough mapped eata) (#1105)
  feat: add logs for startup time on chainlink and committer service (#1106)
@thlorenz thlorenz requested a review from GabrielePicco March 31, 2026 08:35
@thlorenz thlorenz changed the title fix: share submux dedup cache with fetch path to prevent duplicate clones fix: claim clone approach to prevent duplicate clones Apr 2, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs (1)

217-250: ⚠️ Potential issue | 🟠 Major

Handle poisoned pending_clones locks without expect().

Lines 219 and 246 currently panic the clone-coordination path on mutex poison. This is production code, so please recover/report explicitly instead of using expect().

Suggested handling
-        let mut map = self
-            .pending_clones
-            .lock()
-            .expect("pending_clones mutex poisoned");
+        let mut map = match self.pending_clones.lock() {
+            Ok(guard) => guard,
+            Err(poison) => {
+                warn!("pending_clones mutex poisoned in claim_pending_clone; recovering");
+                poison.into_inner()
+            }
+        };
@@
-            let mut map = self
-                .pending_clones
-                .lock()
-                .expect("pending_clones mutex poisoned");
+            let mut map = match self.pending_clones.lock() {
+                Ok(guard) => guard,
+                Err(poison) => {
+                    warn!("pending_clones mutex poisoned in finish_pending_clone; recovering");
+                    poison.into_inner()
+                }
+            };

As per coding guidelines: "Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue. These should not be categorized as trivial or nit-level concerns. Request proper error handling or explicit justification with invariants."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs` around lines 217 -
250, Replace the panic-on-poisoned-mutex usage in claim_pending_clone and
finish_pending_clone: instead of .lock().expect("pending_clones mutex poisoned")
match on the Result from .lock(), log/report the PoisonError and recover by
calling PoisonError::into_inner() to obtain the guard so the method can continue
(for claim_pending_clone continue with creating/inserting the Vec and returning
CloneClaim::Owner or CloneClaim::Waiter as before; for finish_pending_clone
obtain the guard, remove the key and notify waiters). Reference the
pending_clones mutex and the functions claim_pending_clone and
finish_pending_clone when making the change so you handle both lock sites
consistently and avoid any .expect()/unwrap() usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs`:
- Around line 261-296: clone_account_with_ownership currently dedupes based only
on (pubkey, slot) via claim_pending_clone, which discards differing
AccountCloneRequest fields (e.g., commit_frequency_ms) and can lose commit
scheduling metadata; fix by changing the dedupe key or call to
claim_pending_clone to include the request's materially significant fields
(e.g., pass AccountCloneRequest or a computed dedupe_key that includes pubkey,
slot and commit_frequency_ms/other relevant fields) and/or merge differing
requests before claiming ownership (compute a merged AccountCloneRequest and use
that for cloning and for finish_pending_clone) so that CloneClaim::Waiter
receives equivalent semantics to CloneClaim::Owner; update references to
claim_pending_clone, finish_pending_clone, clone_account_with_ownership, and the
AccountCloneRequest handling to use the richer key/merged request.
- Around line 268-276: Wrap the clone task’s critical section with a drop guard
type (e.g., PendingCloneGuard) that on Drop calls finish_pending_clone(pubkey,
slot, CloneCompletion::Failed) unless it has been explicitly marked completed;
construct the guard immediately after claim_pending_clone(pubkey, slot) and
dismiss it after you call finish_pending_clone(...) on success/failure so that
if the async await of self.cloner.clone_account(request).await is cancelled or
panics the guard will notify waiters. Also replace the pending_clones mutex uses
that currently call .expect() (the calls around claim_pending_clone and
finish_pending_clone) with proper poisoning handling (e.g., propagate the lock
error as a Result or recover via into_inner() with explicit documentation) or
add an explicit comment documenting the invariant if you intentionally allow
poisoning—do not use .expect() so mutex poison is handled explicitly.

---

Duplicate comments:
In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs`:
- Around line 217-250: Replace the panic-on-poisoned-mutex usage in
claim_pending_clone and finish_pending_clone: instead of
.lock().expect("pending_clones mutex poisoned") match on the Result from
.lock(), log/report the PoisonError and recover by calling
PoisonError::into_inner() to obtain the guard so the method can continue (for
claim_pending_clone continue with creating/inserting the Vec and returning
CloneClaim::Owner or CloneClaim::Waiter as before; for finish_pending_clone
obtain the guard, remove the key and notify waiters). Reference the
pending_clones mutex and the functions claim_pending_clone and
finish_pending_clone when making the change so you handle both lock sites
consistently and avoid any .expect()/unwrap() usage.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 63c8e43c-1724-4925-9c83-71834563e080

📥 Commits

Reviewing files that changed from the base of the PR and between 5f53a46 and bbeb418.

📒 Files selected for processing (8)
  • magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner/pipeline.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner/tests.rs
  • magicblock-chainlink/src/remote_account_provider/mod.rs
  • magicblock-chainlink/src/testing/cloner_stub.rs
  • magicblock-chainlink/tests/utils/test_context.rs
  • test-integration/test-chainlink/src/test_context.rs
  • test-integration/test-chainlink/tests/ix_remote_account_provider.rs

@thlorenz thlorenz marked this pull request as ready for review April 3, 2026 04:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants