feat: move blockhash calculation to the scheduler #1072
bmuddha wants to merge 20 commits into bmuddha/replicator/integration from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Refactors block-update signaling to use a ledger-broadcasted LatestBlockInner via tokio::broadcast and removes the flume BlockUpdate channel/types. Moves slot advancement and superblock/snapshot control into the scheduler: the scheduler maintains a streaming blake3 hasher, finalizes blockhashes at slot boundaries, emits Block/Transaction/SuperBlock replication messages, and triggers superblock snapshot publication. Introduces execution-permit semaphore coordination (for executors and coordinator wait-for-idle), adds a configurable ledger.superblock_size, updates tests to poll/wait for scheduler-driven slot progress, and adjusts various consumers/producers and replication plumbing accordingly.

Assessment against linked issues
Assessment — Out-of-scope changes
Suggested reviewers
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
magicblock-processor/src/scheduler/state.rs (1)

109-113: ⚠️ Potential issue | 🟠 Major

Replace .expect() with proper error handling.

The .expect() on line 111 violates the coding guidelines for production Rust code in the magicblock-* paths. If register_function fails, this will panic instead of gracefully handling the error.

🛠️ Proposed fix

```diff
 let runtime_v2 = {
     let mut runtime = create_program_runtime_environment_v2(
         &Default::default(),
         false,
     );
-    runtime
+    if let Err(e) = runtime
         .register_function("sol_matmul_i8", SyscallMatmulI8::vm)
-        .expect(
-            "failed to register sol_matmul_i8 in runtime environment v2",
-        );
+    {
+        tracing::error!("failed to register sol_matmul_i8 in runtime environment v2: {e}");
+    }
     runtime
 };
```

As per coding guidelines: {magicblock-*,programs,storage-proto}/**: Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/state.rs` around lines 109-113: the use of .expect() when registering the function "sol_matmul_i8" via runtime.register_function (SyscallMatmulI8::vm) should be replaced with proper error handling: propagate the error from register_function (change the surrounding function to return Result if needed) or handle it by logging the failure and returning an Err instead of panicking, ensuring the runtime registration failure is handled gracefully and reported rather than calling .expect().

magicblock-aperture/tests/websocket.rs (1)
28-50: ⚠️ Potential issue | 🟡 Minor

Restore a lower-bound check for the notification slot.

This only checks that the slot is not ahead of current_slot. A stale notification from far in the past still passes. Capture the slot before Line 29 and assert the notification is >= that baseline.

Proposed fix

```diff
 // Trigger an update by sending lamports to the account.
+let slot_before_transfer = env.latest_slot();
 env.transfer_lamports(account, amount).await;
@@
 let current_slot = env.latest_slot();
 assert!(
-    notification.context.slot <= current_slot + 1,
+    notification.context.slot >= slot_before_transfer
+        && notification.context.slot <= current_slot + 1,
     "notification slot {} should be reasonable compared to current slot {}",
     notification.context.slot,
     current_slot
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@magicblock-aperture/tests/websocket.rs` around lines 28-50: before triggering the transfer, capture the current slot into a baseline (call env.latest_slot() and store it, e.g., baseline_slot) so you can assert the notification slot is not stale; after receiving the notification (notification.context.slot) assert it is >= baseline_slot and also keep the existing upper-bound check against current_slot + 1. Update the test around env.transfer_lamports and the assertions (referencing env.transfer_lamports, env.latest_slot, and notification.context.slot) to use the baseline_slot lower-bound check and the existing upper-bound check.

magicblock-processor/src/executor/mod.rs (1)
176-184: ⚠️ Potential issue | 🔴 Critical

Acquire the semaphore before transition_to_slot().

Line 181 is too late. wait_for_idle() uses this semaphore to make the superblock snapshot path exclusive, but transition_to_slot() on Line 179 mutates per-slot executor state before the permit is held. That leaves a window where accountsdb.take_snapshot() can run concurrently with a slot/sysvar transition.

Proposed fix

```diff
 txn = self.rx.recv() => {
     let Some(transaction) = txn else { break };
+    let Ok(_permit) = self.execution_permits.acquire().await else {
+        warn!("execution permit semaphore closed");
+        break;
+    };
     if transaction.slot != self.processor.slot {
         self.transition_to_slot(transaction.slot);
     }
-    let _permit = self.execution_permits.acquire().await;
     match transaction.txn.mode {
         TransactionProcessingMode::Execution(_) => {
             self.execute(transaction, None);
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/executor/mod.rs` around lines 176-184: the code currently calls transition_to_slot(transaction.slot) before acquiring the execution_permits, which permits wait_for_idle()/accountsdb.take_snapshot() to run concurrently with the slot/sysvar transition; move the semaphore acquisition so the permit is awaited and held (call self.execution_permits.acquire().await and bind _permit) before checking and calling transition_to_slot(transaction.slot), ensuring the permit is held across transition_to_slot and subsequent execute() calls (symbols: transition_to_slot, execution_permits, wait_for_idle, take_snapshot, execute, rx.recv, processor.slot).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config.example.toml`:
- Around line 192-197: The comment above the superblock-size setting contains a
grammar error; update the sentence "accountsdb checksum/snapshot takes at
superblock boundary" to "accountsdb checksum/snapshot is taken at superblock
boundary" in the comment block immediately preceding the superblock-size key so
the message reads clearly and correctly.
In `@magicblock-aperture/tests/blocks.rs`:
- Around line 187-201: The current assertions around the get_blocks result
(blocks from env.rpc.get_blocks(start, Some(end))) only check boundaries; update
the test to iterate the returned blocks and assert for each slot that start <=
slot <= end, that each slot is strictly greater than the previous slot (strictly
increasing), and — if the invariant is one block per slot — that each slot
equals previous_slot + 1 to enforce contiguity; apply the same checks in the
other occurrence of these assertions (the block range at lines 239-252).
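The strengthened assertions described above can be modeled as a small std-only helper. This is a hypothetical sketch: `check_block_range` and its inputs stand in for the slots returned by `env.rpc.get_blocks(start, Some(end))`, and the `contiguous` flag captures the optional one-block-per-slot invariant.

```rust
// Every returned slot must lie in [start, end] and the sequence must be
// strictly increasing; if the ledger guarantees one block per slot,
// consecutive slots must additionally differ by exactly 1.
fn check_block_range(blocks: &[u64], start: u64, end: u64, contiguous: bool) -> bool {
    let mut prev: Option<u64> = None;
    for &slot in blocks {
        if slot < start || slot > end {
            return false; // out of requested range
        }
        if let Some(p) = prev {
            if slot <= p {
                return false; // not strictly increasing
            }
            if contiguous && slot != p + 1 {
                return false; // gap between consecutive blocks
            }
        }
        prev = Some(slot);
    }
    true
}

fn main() {
    assert!(check_block_range(&[5, 6, 7, 8], 5, 8, true));
    assert!(!check_block_range(&[5, 7, 8], 5, 8, true)); // gap at slot 6
    assert!(check_block_range(&[5, 7, 8], 5, 8, false)); // gaps allowed
    assert!(!check_block_range(&[5, 6, 9], 5, 8, false)); // 9 out of range
}
```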
In `@magicblock-aperture/tests/transactions.rs`:
- Around line 468-469: Replace the fixed 100ms sleep with a polling loop that
repeatedly checks env.execution.get_transaction(fail_sig) until it returns
Some(...) or a short timeout/iteration limit is reached; use small sleeps (e.g.
20ms) between polls via tokio::time::sleep and after the loop assert that the
transaction is present to fail fast if it never appears. Locate the wait near
the test that references fail_sig and env.execution.get_transaction and
implement the loop there instead of the single
tokio::time::sleep(Duration::from_millis(100)).await call.
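The suggested polling loop can be sketched generically. This is a std-only model, assuming a blocking context: the real test would use `tokio::time::sleep` between polls, and the `lookup` closure stands in for `env.execution.get_transaction(fail_sig)`.

```rust
use std::thread::sleep;
use std::time::Duration;

// Retry a lookup at a fixed interval up to a bounded number of attempts,
// returning Some(value) as soon as it is available, or None on timeout so
// the caller can assert and fail fast.
fn poll_until<T>(
    mut lookup: impl FnMut() -> Option<T>,
    attempts: u32,
    interval: Duration,
) -> Option<T> {
    for _ in 0..attempts {
        if let Some(value) = lookup() {
            return Some(value);
        }
        sleep(interval);
    }
    None
}

fn main() {
    // Stand-in for the transaction lookup: becomes ready on the 3rd poll.
    let mut calls = 0;
    let found = poll_until(
        || {
            calls += 1;
            (calls >= 3).then_some("txn")
        },
        10,
        Duration::from_millis(1),
    );
    assert_eq!(found, Some("txn"));
}
```

The key difference from the fixed 100ms sleep is the explicit attempt bound: the test either observes the transaction promptly or fails with a clear assertion instead of racing the scheduler.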
In `@magicblock-api/src/tickers.rs`:
- Around line 22-29: init_slot_ticker() used to increment SLOT_COUNT via
metrics::inc_slot() but that call was removed during the refactor, so the
scheduler-owned slot transition path must now perform the metric increment;
update the scheduler transition logic (the code that advances slots inside
TransactionSchedulerHandle / the scheduler's slot transition function) to call
metrics::inc_slot() whenever a new slot is accepted/advanced, ensuring
ScheduledCommitsProcessor-related paths still advance slots and trigger the same
metrics increment so SLOT_COUNT reflects real slot progress.
In `@magicblock-config/src/config/ledger.rs`:
- Around line 16-17: The superblock_size field in the LedgerConfig (pub
superblock_size: u64) lacks constraints and can be zero, which leads to a
division-by-zero in handle_superblock; update the code by either (a) adding
documentation to the superblock_size field stating it must be > 0 and include an
example/rationale, or (b) adding serde/runtime validation (e.g., in
LedgerConfig::validate or a custom Deserialize) to reject values == 0 and return
a clear error; reference the superblock_size field in ledger.rs and the
handle_superblock function when adding the check or doc so readers and callers
know the valid range.
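Option (b) above, runtime validation, can be sketched as follows. The struct mirrors the field under discussion; the `validate` method and its `String` error type are illustrative, not the crate's actual API.

```rust
// Minimal validation sketch: reject superblock_size == 0 at config load time
// so the `slot % superblock_size` check in handle_superblock can never
// divide by zero.
struct LedgerConfig {
    /// Must be > 0; used as a divisor when detecting superblock boundaries.
    superblock_size: u64,
}

impl LedgerConfig {
    fn validate(&self) -> Result<(), String> {
        if self.superblock_size == 0 {
            return Err(
                "ledger.superblock_size must be > 0 (it is used as a divisor \
                 at superblock boundaries)"
                    .to_string(),
            );
        }
        Ok(())
    }
}

fn main() {
    assert!(LedgerConfig { superblock_size: 0 }.validate().is_err());
    assert!(LedgerConfig { superblock_size: 1024 }.validate().is_ok());
}
```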
- Around line 39-40: The constant name is misspelled: update the definition and
all uses from DEFAULT_SUPEBLOCK_SIZE to DEFAULT_SUPERBLOCK_SIZE; specifically
rename the constant in the consts definition (where DEFAULT_SUPEBLOCK_SIZE is
declared) and update its usage in the ledger config where the struct field
superblock_size is assigned, ensuring all references (e.g.,
DEFAULT_SUPEBLOCK_SIZE → DEFAULT_SUPERBLOCK_SIZE) are renamed identically to
avoid compilation errors.
In `@magicblock-config/src/consts.rs`:
- Around line 57-58: Rename the mis-typed constant DEFAULT_SUPEBLOCK_SIZE to
DEFAULT_SUPERBLOCK_SIZE in magicblock-config/src/consts.rs and update all usages
(e.g., the reference in magicblock-config/src/config/ledger.rs where
DEFAULT_SUPEBLOCK_SIZE is read) to the new identifier; ensure you update the
public constant name and any places that import or reference it so compilation
and tests succeed.
In `@magicblock-core/src/link/blocks.rs`:
- Around line 7-10: Update the doc comment for BlockUpdateRx to match its
generic payload: replace the statement that the notification payload is empty
with a description that BlockUpdateRx<T> is a generic broadcast::Receiver<T> and
that callers typically instantiate it as BlockUpdateRx<LatestBlockInner> to
receive the full block data (slot, blockhash, timestamp); ensure the comment
mentions the generic T and the common usage pattern (e.g., LatestBlockInner) so
readers know the payload carries block details rather than being empty.
In `@magicblock-processor/src/scheduler/coordinator.rs`:
- Around line 345-350: Change wait_for_idle to return
Result<OwnedSemaphorePermit, tokio::sync::AcquireError> instead of
unconditionally unwrapping: replace the .expect(...) on the acquire_many_owned
await with a ?-style propagation of the AcquireError (i.e., return the Err when
acquire_many_owned fails). Update the signature of pub(super) async fn
wait_for_idle(&self) to reflect the Result and adjust all callers of
wait_for_idle to handle the Err case (handle the tokio::sync::AcquireError)
rather than assuming success; reference the execution_permits,
blocked_transactions, and OwnedSemaphorePermit symbols when locating the change.
In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 476-491: The finalize_block path must abort advancing local slot
when ledger.write_block fails: in async fn finalize_block(&self, block:
&LatestBlockInner) call self.ledger.write_block(...).await and match its Result
instead of just inspect_err; if self.coordinator.is_primary() and write_block
returns Err(error) then log the error (using error!(%error, ...)) and return
early from finalize_block so you do not call
self.accountsdb.set_slot(block.slot); only call set_slot when either not primary
or when write_block succeeded.
- Around line 162-165: The Hasher is created empty (Hasher::new()) but must be
seeded with the last persisted block hash so the first slot uses
H(previous_blockhash || txs...); before starting the slot loop (after computing
latest_block and slot_ticker) load the previous block hash from
latest_block.load().block_hash and feed it into the hasher (via the appropriate
method on Hasher, e.g., seed/update/absorb) so that hasher reflects the previous
block state before processing the first slot.
- Around line 437-442: The current call to
SystemTime::now().duration_since(UNIX_EPOCH).unwrap() (the computation that
assigns unix_timestamp) can panic if the host clock is behind UNIX_EPOCH;
replace the unwrap with proper error handling: match or map_err the Result, log
the SystemTimeError (using the module's logger), and recover instead of
panicking (e.g., use a safe fallback value such as the previous slot timestamp,
0, or SystemTime::UNIX_EPOCH.elapsed() equivalent) so the scheduler continues
running; locate the unix_timestamp assignment in mod.rs and update that
expression to handle Err by logging and returning/using the chosen fallback
rather than unwrap().
In `@magicblock-processor/src/scheduler/state.rs`:
- Around line 69-71: Validate that TransactionSchedulerState::superblock_size is
> 0 to avoid divide-by-zero in handle_superblock where
slot.is_multiple_of(self.superblock_size) is called; add a check in the
TransactionSchedulerState constructor (or in config deserialization) that
returns an error or panics with a clear message if superblock_size == 0, or use
an assert ensuring the invariant, and document the requirement so callers cannot
pass 0.
In `@magicblock-replicator/src/nats/consumer.rs`:
- Around line 88-107: The pending() method currently busy-loops on
self.inner.get_info() failures causing unbounded retries; change pending(&self,
cancel: &CancellationToken) -> Option<u64> to return a Result<Option<u64>, E>
(propagate the underlying error type) and stop retrying on get_info() errors —
instead return Err(error) so callers like Standby::has_pending() can decide how
to handle/backoff; keep the tokio::select! that respects cancel.cancelled() and
only loop if you intend to implement an explicit backoff strategy (sleep with
jitter) before retrying.
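The revised `pending()` control flow can be modeled without the NATS and tokio machinery. In this hypothetical sketch, the `get_info` closure stands in for `self.inner.get_info()`, `cancelled` models the `cancel.cancelled()` branch of the `select!`, and `String` stands in for the real error type.

```rust
// A single attempt instead of a busy loop: on failure the error is
// propagated so the caller (e.g. Standby::has_pending) can decide whether
// to back off and retry.
fn pending(
    mut get_info: impl FnMut() -> Result<u64, String>,
    cancelled: bool,
) -> Result<Option<u64>, String> {
    if cancelled {
        // Mirrors the cancellation branch: no pending count is reported.
        return Ok(None);
    }
    let num_pending = get_info()?; // error reaches the caller, no retry here
    Ok(Some(num_pending))
}

fn main() {
    assert_eq!(pending(|| Ok(3), false), Ok(Some(3)));
    assert_eq!(pending(|| Ok(3), true), Ok(None));
    assert!(pending(|| Err("stream info unavailable".to_string()), false).is_err());
}
```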
In `@magicblock-replicator/src/service/standby.rs`:
- Around line 63-66: The expiry signal from watcher.wait_for_expiry() is being
dropped by the immediate continue after has_pending() returns true; instead set
and use a lock_expired flag to remember the expiry and retry takeover/promotion
as soon as has_pending() becomes false. Modify the loop around
watcher.wait_for_expiry() and has_pending() to: on expiry set lock_expired =
true (do not discard the event), when has_pending() is false and lock_expired is
true call the promotion/takeover routine (e.g., the existing promotion
function), reset lock_expired on successful takeover, and only fall back to the
idle-timeout branch if no expiry is recorded—this preserves the expiry signal
until takeover succeeds.
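The latched-expiry behavior can be captured in a synchronous model. This sketch replaces `watcher.wait_for_expiry()` and `has_pending()` with a hypothetical event sequence; the point is the control flow, not the async plumbing.

```rust
#[derive(Clone, Copy)]
enum Event {
    LockExpired,    // watcher.wait_for_expiry() fired
    PendingDrained, // has_pending() flipped to false
}

// An expiry observed while the backlog is non-empty is latched, and takeover
// fires as soon as the backlog drains. Returns whether takeover ran.
fn run(events: &[Event], mut has_pending: bool) -> bool {
    let mut lock_expired = false;
    for event in events {
        match event {
            Event::LockExpired => lock_expired = true, // latch, never drop
            Event::PendingDrained => has_pending = false,
        }
        if lock_expired && !has_pending {
            lock_expired = false; // reset only after a successful takeover
            return true;
        }
    }
    false
}

fn main() {
    // Expiry arrives while backlog exists; takeover still happens after drain.
    assert!(run(&[Event::LockExpired, Event::PendingDrained], true));
    // No expiry recorded: draining the backlog alone never promotes.
    assert!(!run(&[Event::PendingDrained], true));
}
```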
- Around line 95-101: The standby path currently warns when self.can_promote is
false but still calls self.ctx.try_acquire_producer() and may promote via
self.ctx.into_primary(...), bypassing ReplicaOnly; update the logic in the
function that contains can_promote/try_acquire_producer (in standby.rs) so
try_acquire_producer() and the subsequent self.ctx.into_primary(producer,
self.messages).await.map(Some) are only executed when self.can_promote is true
(e.g., change the sequence to check if self.can_promote before attempting
try_acquire_producer or early-continue/return when can_promote is false).
---
Outside diff comments:
In `@magicblock-aperture/tests/websocket.rs`:
- Around line 28-50: Before triggering the transfer, capture the current slot
into a baseline (call env.latest_slot() and store it, e.g., baseline_slot) so
you can assert the notification slot is not stale; after receiving the
notification (notification.context.slot) assert it is >= baseline_slot and also
keep the existing upper-bound check against current_slot + 1. Update the test
around env.transfer_lamports and the assertions (referencing
env.transfer_lamports, env.latest_slot, and notification.context.slot) to use
the baseline_slot lower-bound check and the existing upper-bound check.
In `@magicblock-processor/src/executor/mod.rs`:
- Around line 176-184: The code currently calls
transition_to_slot(transaction.slot) before acquiring the execution_permits,
which permits wait_for_idle()/accountsdb.take_snapshot() to run concurrently
with the slot/sysvar transition; move the semaphore acquisition so the permit is
awaited and held (call self.execution_permits.acquire().await and bind _permit)
before checking and calling transition_to_slot(transaction.slot), ensuring the
permit is held across transition_to_slot and subsequent execute() calls
(symbols: transition_to_slot, execution_permits, wait_for_idle, take_snapshot,
execute, rx.recv, processor.slot).
In `@magicblock-processor/src/scheduler/state.rs`:
- Around line 109-113: The use of .expect() when registering the function
"sol_matmul_i8" via runtime.register_function (SyscallMatmulI8::vm) should be
replaced with proper error handling: propagate the error from register_function
(change the surrounding function to return Result if needed) or handle it by
logging the failure and returning an Err instead of panicking, ensuring the
runtime registration failure is handled gracefully and reported rather than
calling .expect().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4f261bb4-4f29-4a40-bef4-8f169de6dc91
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (34)
- Cargo.toml
- config.example.toml
- magicblock-aperture/src/geyser.rs
- magicblock-aperture/src/processor.rs
- magicblock-aperture/src/state/blocks.rs
- magicblock-aperture/tests/blocks.rs
- magicblock-aperture/tests/mocked.rs
- magicblock-aperture/tests/setup.rs
- magicblock-aperture/tests/transactions.rs
- magicblock-aperture/tests/websocket.rs
- magicblock-api/src/lib.rs
- magicblock-api/src/magic_validator.rs
- magicblock-api/src/slot.rs
- magicblock-api/src/tickers.rs
- magicblock-config/src/config/ledger.rs
- magicblock-config/src/consts.rs
- magicblock-core/src/link.rs
- magicblock-core/src/link/blocks.rs
- magicblock-core/src/link/replication.rs
- magicblock-ledger/src/blockstore_processor/mod.rs
- magicblock-ledger/src/lib.rs
- magicblock-processor/Cargo.toml
- magicblock-processor/src/executor/mod.rs
- magicblock-processor/src/scheduler/coordinator.rs
- magicblock-processor/src/scheduler/mod.rs
- magicblock-processor/src/scheduler/state.rs
- magicblock-processor/src/scheduler/tests.rs
- magicblock-processor/tests/fees.rs
- magicblock-processor/tests/replay.rs
- magicblock-replicator/src/nats/consumer.rs
- magicblock-replicator/src/nats/mod.rs
- magicblock-replicator/src/service/context.rs
- magicblock-replicator/src/service/standby.rs
- test-kit/src/lib.rs
💤 Files with no reviewable changes (3)
- magicblock-api/src/lib.rs
- magicblock-api/src/slot.rs
- magicblock-core/src/link.rs
873881e to fd44715 (Compare)
8d1b35d to 45462c8 (Compare)
Actionable comments posted: 3
♻️ Duplicate comments (6)
magicblock-processor/src/scheduler/coordinator.rs (1)

345-350: ⚠️ Potential issue | 🟠 Major

Replace the panicking .expect() in the idle barrier with fallible error propagation.

Line 350 can panic coordinator/replication flows on semaphore closure. This should return Result and let callers handle shutdown/error paths.

Proposed fix

```diff
-    pub(super) async fn wait_for_idle(&self) -> OwnedSemaphorePermit {
-        self.execution_permits
-            .clone()
-            .acquire_many_owned(self.blocked_transactions.len() as u32)
-            .await
-            .expect("semaphore can never be closed")
-    }
+    pub(super) async fn wait_for_idle(
+        &self,
+    ) -> Result<OwnedSemaphorePermit, tokio::sync::AcquireError> {
+        self.execution_permits
+            .clone()
+            .acquire_many_owned(self.blocked_transactions.len() as u32)
+            .await
+    }
```

Please verify all .wait_for_idle().await call sites are updated to handle Err(AcquireError) explicitly:

```bash
#!/bin/bash
set -euo pipefail
rg -n -C2 --type rust 'pub\(super\)\s+async\s+fn\s+wait_for_idle|acquire_many_owned|expect\("semaphore can never be closed"\)'
rg -n -C2 --type rust '\bwait_for_idle\s*\(\)\.await'
```

As per coding guidelines: Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/coordinator.rs` around lines 345-350: change wait_for_idle to return Result<OwnedSemaphorePermit, tokio::sync::AcquireError> instead of panicking: replace the expect call on acquire_many_owned with ? (i.e., propagate the AcquireError) and update the signature of pub(super) async fn wait_for_idle to reflect Result. Update all call sites of wait_for_idle().await to handle the Err(AcquireError) path explicitly (return Err, map to a coordinator shutdown error, or continue as appropriate) so callers of execution_permits.acquire_many_owned and users of OwnedSemaphorePermit handle semaphore closure gracefully; references to execution_permits, blocked_transactions, wait_for_idle, OwnedSemaphorePermit and AcquireError will help locate affected code.

magicblock-config/src/config/ledger.rs (1)
16-17: 🧹 Nitpick | 🔵 Trivial

Fix typo in documentation.

The doc comment has a grammatical error: "before they accountsdb" should be "before the accountsdb".

📝 Proposed fix

```diff
-    /// The number of slots that must elapse before they accountsdb snapshot/checksum is taken
+    /// The number of slots that must elapse before the accountsdb snapshot/checksum is taken
     pub superblock_size: u64,
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@magicblock-config/src/config/ledger.rs` around lines 16-17: update the doc comment for the struct field superblock_size to fix the typo: change "before they accountsdb snapshot/checksum is taken" to "before the accountsdb snapshot/checksum is taken." Locate the triple-slash doc comment immediately above the pub superblock_size: u64 field in ledger.rs and edit the text to use "the accountsdb" so the comment reads correctly.

magicblock-processor/src/scheduler/mod.rs (3)
477-492: ⚠️ Potential issue | 🟠 Major

Abort the slot transition when write_block() fails.

This still advances accountsdb after a ledger write error. Once that happens, local slot/Clock/SlotHashes move ahead of persisted ledger state and RPC/replication can observe a block that never landed.

Suggested fix

```diff
 async fn finalize_block(&self, block: &LatestBlockInner) {
     if self.coordinator.is_primary() {
-        let _ = self
-            .ledger
-            .write_block(
-                block.slot,
-                block.clock.unix_timestamp,
-                block.blockhash,
-            )
-            .inspect_err(|error| {
-                error!(%error, "failed to write block to the ledger")
-            });
+        if let Err(error) = self.ledger.write_block(
+            block.slot,
+            block.clock.unix_timestamp,
+            block.blockhash,
+        ) {
+            error!(%error, "failed to write block to the ledger");
+            return;
+        }
     }
     self.accountsdb.set_slot(block.slot);
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/mod.rs` around lines 477-492: finalize_block currently advances accountsdb.set_slot even when ledger.write_block fails; change the logic so that when self.coordinator.is_primary() and ledger.write_block(...) returns an Err the function returns early (aborting the slot transition) instead of calling self.accountsdb.set_slot. Concretely, replace the current inspect_err usage with explicit error handling around self.ledger.write_block (in finalize_block): call ledger.write_block, log the error via error!(%error, ...) on failure and return immediately; only call self.accountsdb.set_slot(block.slot) when write_block succeeded or when not primary.
438-443: ⚠️ Potential issue | 🟠 Major

Replace this unwrap() on the clock read.

A host clock before UNIX_EPOCH will panic the scheduler at Line 440 during slot finalization. Handle SystemTimeError and choose a fallback timestamp instead.

Suggested fix

```diff
-        let unix_timestamp = SystemTime::now()
-            .duration_since(UNIX_EPOCH)
-            .unwrap()
-            // NOTE: since we can tick very frequently, a lot
-            // of blocks might have identical timestamps
-            .as_secs() as i64;
+        let unix_timestamp = match SystemTime::now().duration_since(UNIX_EPOCH) {
+            Ok(duration) => duration.as_secs() as i64,
+            Err(error) => {
+                warn!(%error, "system clock is before UNIX_EPOCH; using 0");
+                0
+            }
+        };
```

As per coding guidelines, "Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue. These should not be categorized as trivial or nit-level concerns. Request proper error handling or explicit justification with invariants."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/mod.rs` around lines 438-443: the current call to SystemTime::now().duration_since(UNIX_EPOCH).unwrap() can panic if the host clock is before UNIX_EPOCH; replace the unwrap by explicitly handling the Result from duration_since(UNIX_EPOCH) (around the unix_timestamp calculation), e.g., match the Ok(duration) branch to use duration.as_secs() as i64 and the Err(_) branch to choose a safe fallback timestamp (for example 0 or another agreed sentinel) and emit a warning/log; update the code that assigns unix_timestamp accordingly so slot finalization no longer panics on SystemTimeError.
163-167: ⚠️ Potential issue | 🔴 Critical

Make the hasher mutable and seed it with bytes, or this code will not compile.

blake3::Hasher::update requires &mut self and &[u8] input. The current code declares hasher as immutable and attempts to pass a Hash type directly, causing compilation failure. Other uses of hasher in this file confirm the pattern: self.hasher.update(block.blockhash.as_ref()).

Suggested fix

```diff
-    let hasher = Hasher::new();
+    let mut hasher = Hasher::new();
     let slot_ticker = interval(state.block_time);
     let latest_block = state.ledger.latest_block().clone();
-    hasher.update(latest_block.load().blockhash);
-    let slot = latest_block.load().slot;
+    let latest = latest_block.load();
+    hasher.update(latest.blockhash.as_ref());
+    let slot = latest.slot;
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/mod.rs` around lines 163-167: the Hasher is created immutably and update is called with a non-byte type; change to create a mutable hasher (e.g., use let mut hasher = Hasher::new()) and call hasher.update(...) with a byte slice from the block hash (use latest_block.load().blockhash.as_ref() or equivalent) so the update receives &[u8] and compiles like other usages in this file.

magicblock-replicator/src/service/standby.rs (1)
63-66: ⚠️ Potential issue | 🟠 Major

Keep the lock-expiry signal latched until takeover actually runs.
Lines 64-65 drop the expiry event when backlog still exists. Once those pending messages drain, this loop no longer retries takeover until another watcher event or the idle-timeout path fires.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-replicator/src/service/standby.rs` around lines 63 - 66, When watcher.wait_for_expiry() fires, don't drop that signal if has_pending().await returns true; instead set a latched flag (e.g., expiry_latched) when wait_for_expiry() returns, continue draining pending items while keeping expiry_latched true, and only allow the takeover path to run once has_pending().await is false and expiry_latched is set; clear expiry_latched after takeover completes. Update the loop in standby.rs around watcher.wait_for_expiry(), has_pending(), and the takeover invocation to use this latched boolean so the expiry event is remembered across the backlog drain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@magicblock-processor/src/executor/mod.rs`:
- Around line 195-209: The recv() loop in the executor treats any recv error as
terminal (the pattern using Ok(latest) => ... else => break), which causes the
executor to exit on tokio::broadcast::error::RecvError::Lagged; change the loop
in the run/processing task that calls block_updated.recv() to match on the
Result and explicitly handle Err(RecvError::Lagged(_)) by resynchronizing (e.g.,
call or implement a resync method to repair/refresh self.block_history or fetch
the latest missing blocks) and continue the loop, handle Err(RecvError::Closed)
by breaking/terminating, and keep the existing path of calling
register_new_block(latest) for Ok(latest); reference the recv call and
register_new_block(…) so you update only the error handling branch rather than
exiting on any recv error.
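The intended branching can be shown with local stand-ins for the tokio types, since the real loop depends on `tokio::sync::broadcast`. In this sketch, `Recv` models the receiver's result and `Action` models what the run loop should do; only the mapping is the point.

```rust
// Local stand-ins for tokio::sync::broadcast's receiver outcomes.
enum Recv {
    Ok(u64),    // a block update arrived
    Lagged(u64), // receiver fell behind by n messages
    Closed,     // all senders dropped
}

enum Action {
    Register(u64), // register_new_block(latest)
    Resync,        // repair block_history, then continue the loop
    Shutdown,      // break out of the run loop
}

// Lagged must resync and continue; only Closed is terminal.
fn handle(recv: Recv) -> Action {
    match recv {
        Recv::Ok(latest) => Action::Register(latest),
        Recv::Lagged(_) => Action::Resync, // NOT a reason to exit
        Recv::Closed => Action::Shutdown,
    }
}

fn main() {
    assert!(matches!(handle(Recv::Ok(42)), Action::Register(42)));
    assert!(matches!(handle(Recv::Lagged(5)), Action::Resync));
    assert!(matches!(handle(Recv::Closed), Action::Shutdown));
}
```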
In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 362-376: The streaming blockhash must include replayed
transactions, but current code only updates self.hasher when is_execution is
true, causing replicas (TransactionProcessingMode::Replay) to diverge in
verify_block_as_replica; change the logic so that replay and execution both
contribute to the hasher: keep the self.index += 1 behavior only inside the
TransactionProcessingMode::Execution arm, but ensure you call
self.hasher.update(txn.transaction.signature().as_ref()) for both
TransactionProcessingMode::Replay and TransactionProcessingMode::Execution (or
set a should_hash flag in those two arms), leaving
TransactionProcessingMode::Simulation un-hashed.
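The per-mode rule above can be made explicit as a small decision table. This is a sketch with a simplified, unit-variant version of the mode enum; the real variants carry payloads and the hashing itself uses the blake3 hasher.

```rust
enum TransactionProcessingMode {
    Execution,
    Replay,
    Simulation,
}

// Which modes feed the streaming blockhash and which bump the per-slot
// index, per the suggested fix: execution and replay both hash the
// signature (so replicas converge), only execution bumps the index, and
// simulation touches neither.
fn effects(mode: &TransactionProcessingMode) -> (bool, bool) {
    // (should_hash, should_bump_index)
    match mode {
        TransactionProcessingMode::Execution => (true, true),
        TransactionProcessingMode::Replay => (true, false),
        TransactionProcessingMode::Simulation => (false, false),
    }
}

fn main() {
    assert_eq!(effects(&TransactionProcessingMode::Execution), (true, true));
    assert_eq!(effects(&TransactionProcessingMode::Replay), (true, false));
    assert_eq!(effects(&TransactionProcessingMode::Simulation), (false, false));
}
```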
- Around line 368-375: The code advances self.index and updates self.hasher
before sending work to the executor, so if try_send fails the slot index and
hash move forward incorrectly; change the flow in the scheduling path that uses
is_execution and try_send so you compute the candidate index/slot and hash input
without mutating self, then attempt try_send, and only on successful send update
self.index and call self.hasher.update(...) (and similarly ensure the
replica-path mutations around try_send at the 392-396 region are delayed until
after a successful enqueue). Reference self.index, self.hasher, is_execution,
and try_send when making these changes.
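The "mutate only after a successful enqueue" ordering can be sketched with the channel modeled as a bounded Vec. Everything here is a hypothetical stand-in for the scheduler's state: the point is that a failed `try_send` leaves `index` and the hash input untouched.

```rust
struct Scheduler {
    index: u64,
    hashed: Vec<[u8; 4]>, // stand-in for self.hasher.update(signature)
    queue: Vec<[u8; 4]>,  // stand-in for the executor channel
    capacity: usize,
}

impl Scheduler {
    // Attempt the enqueue first; advance index/hash state only on success.
    fn try_send(&mut self, sig: [u8; 4]) -> bool {
        if self.queue.len() >= self.capacity {
            return false; // queue full: no streaming state was mutated
        }
        self.queue.push(sig);
        // Only now is it safe to advance the per-slot index and hash input.
        self.index += 1;
        self.hashed.push(sig);
        true
    }
}

fn main() {
    let mut s = Scheduler { index: 0, hashed: vec![], queue: vec![], capacity: 1 };
    assert!(s.try_send([1; 4]));
    assert!(!s.try_send([2; 4])); // rejected: channel full
    assert_eq!(s.index, 1);       // index did not advance on failure
    assert_eq!(s.hashed.len(), 1);
}
```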
---
Duplicate comments:
In `@magicblock-config/src/config/ledger.rs`:
- Around line 16-17: Update the doc comment for the struct field superblock_size
to fix the typo: change "before they accountsdb snapshot/checksum is taken" to
"before the accountsdb snapshot/checksum is taken." Locate the triple-slash doc
comment immediately above the pub superblock_size: u64 field in ledger.rs and
edit the text to use "the accountsdb" so the comment reads correctly.
In `@magicblock-processor/src/scheduler/coordinator.rs`:
- Around line 345-350: Change wait_for_idle to return
Result<OwnedSemaphorePermit, tokio::sync::AcquireError> instead of panicking:
replace the expect call on acquire_many_owned with ? (i.e., propagate the
AcquireError) and update the signature of pub(super) async fn wait_for_idle to
reflect Result. Update all call sites of wait_for_idle().await to handle the
Err(AcquireError) path explicitly (return Err, map to a coordinator shutdown
error, or continue as appropriate) so callers of
execution_permits.acquire_many_owned and users of OwnedSemaphorePermit handle
semaphore closure gracefully; references to execution_permits,
blocked_transactions, wait_for_idle, OwnedSemaphorePermit and AcquireError will
help locate affected code.
In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 477-492: finalize_block currently advances accountsdb.set_slot
even when ledger.write_block fails; change the logic so that when
self.coordinator.is_primary() and ledger.write_block(...) returns an Err the
function returns early (aborting the slot transition) instead of calling
self.accountsdb.set_slot. Concretely, replace the current inspect_err usage with
explicit error handling around self.ledger.write_block (in finalize_block) —
call ledger.write_block, log the error via error!(%error, ...) on failure and
return immediately; only call self.accountsdb.set_slot(block.slot) when
write_block succeeded or when not primary.
- Around line 438-443: The current call to
SystemTime::now().duration_since(UNIX_EPOCH).unwrap() can panic if the host
clock is before UNIX_EPOCH; replace the unwrap by explicitly handling the Result
from duration_since(UNIX_EPOCH) (around the unix_timestamp calculation), e.g.,
match the Ok(duration) branch to use duration.as_secs() as i64 and the Err(_)
branch to choose a safe fallback timestamp (for example 0 or another agreed
sentinel) and emit a warning/log; update the code that assigns unix_timestamp
accordingly so slot finalization no longer panics on SystemTimeError.
- Around line 163-167: The Hasher is created immutably and update is called with
a non-byte type; change to create a mutable hasher (e.g., use let mut hasher =
Hasher::new()) and call hasher.update(...) with a byte slice from the block hash
(use latest_block.load().blockhash.as_ref() or equivalent) so the update
receives &[u8] and compiles like other usages in this file.
In `@magicblock-replicator/src/service/standby.rs`:
- Around line 63-66: When watcher.wait_for_expiry() fires, don't drop that
signal if has_pending().await returns true; instead set a latched flag (e.g.,
expiry_latched) when wait_for_expiry() returns, continue draining pending items
while keeping expiry_latched true, and only allow the takeover path to run once
has_pending().await is false and expiry_latched is set; clear expiry_latched
after takeover completes. Update the loop in standby.rs around
watcher.wait_for_expiry(), has_pending(), and the takeover invocation to use
this latched boolean so the expiry event is remembered across the backlog drain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b8813f34-7cb6-4d10-9c4e-82d0108a714f
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (35)
- Cargo.toml
- config.example.toml
- magicblock-aperture/src/geyser.rs
- magicblock-aperture/src/processor.rs
- magicblock-aperture/src/state/blocks.rs
- magicblock-aperture/tests/blocks.rs
- magicblock-aperture/tests/mocked.rs
- magicblock-aperture/tests/setup.rs
- magicblock-aperture/tests/transactions.rs
- magicblock-aperture/tests/websocket.rs
- magicblock-api/src/lib.rs
- magicblock-api/src/magic_validator.rs
- magicblock-api/src/slot.rs
- magicblock-api/src/tickers.rs
- magicblock-config/src/config/ledger.rs
- magicblock-config/src/consts.rs
- magicblock-core/src/link.rs
- magicblock-core/src/link/blocks.rs
- magicblock-core/src/link/replication.rs
- magicblock-ledger/src/blockstore_processor/mod.rs
- magicblock-ledger/src/lib.rs
- magicblock-metrics/src/metrics/mod.rs
- magicblock-processor/Cargo.toml
- magicblock-processor/src/executor/mod.rs
- magicblock-processor/src/scheduler/coordinator.rs
- magicblock-processor/src/scheduler/mod.rs
- magicblock-processor/src/scheduler/state.rs
- magicblock-processor/src/scheduler/tests.rs
- magicblock-processor/tests/fees.rs
- magicblock-processor/tests/replay.rs
- magicblock-replicator/src/nats/consumer.rs
- magicblock-replicator/src/nats/mod.rs
- magicblock-replicator/src/service/context.rs
- magicblock-replicator/src/service/standby.rs
- test-kit/src/lib.rs
💤 Files with no reviewable changes (3)
- magicblock-api/src/lib.rs
- magicblock-api/src/slot.rs
- magicblock-core/src/link.rs
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@magicblock-aperture/tests/blocks.rs`:
- Around line 125-163: The single-pass scan from slot_before..=current_slot can
miss a block that appears after current_slot; modify the logic around
env.wait_for_slot_progress / env.rpc.get_slot so you implement a bounded polling
loop that repeatedly refreshes current_slot and rescans slots (using
env.rpc.get_block_with_config and RpcBlockConfig with
UiTransactionEncoding::Base64) until you find the signature or a timeout/attempt
limit is reached; use a short delay between iterations, keep the scan window
bounded (e.g., last N slots or from slot_before to refreshed current_slot), and
return a clear error if the transaction isn't found within the timeout.
- Around line 46-77: The test incorrectly compares the second value returned by
get_latest_blockhash_with_commitment (captured as last_valid_slot) to the slot
from get_slot(); that value is actually lastValidBlockHeight. Update the test to
rename last_valid_slot to last_valid_block_height (or similar) and fetch the
current block height via env.rpc.get_block_height().await.expect(...) and assert
that last_valid_block_height > current_block_height, updating the assert message
to reference block heights rather than slots; keep the BlockHash non-default
assertions unchanged.
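A std-only sketch of such a bounded polling helper; this is a synchronous stand-in for the suggested async test loop, and `poll_until` with its parameters is a hypothetical name (in the real test, `probe` would refresh the current slot and rescan recent blocks for the signature):

```rust
use std::time::{Duration, Instant};

// Poll `probe` until it yields Some(T) or the timeout elapses, sleeping
// `delay` between attempts to keep the scan loop bounded.
fn poll_until<T>(
    mut probe: impl FnMut() -> Option<T>,
    timeout: Duration,
    delay: Duration,
) -> Result<T, &'static str> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(found) = probe() {
            return Ok(found);
        }
        if Instant::now() >= deadline {
            return Err("not found within the timeout");
        }
        std::thread::sleep(delay);
    }
}
```

Returning a clear error on timeout (rather than panicking mid-scan) keeps the test failure message actionable.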
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 58e3857e-c980-4479-a411-f32ffea77aa5
📒 Files selected for processing (1)
magicblock-aperture/tests/blocks.rs
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
magicblock-api/src/tickers.rs (1)
43-44:⚠️ Potential issue | 🟠 Major
`.expect()` usage flagged per coding guidelines. The `.expect("Validator found to be running without MagicContext account!")` on line 44 is flagged per coding guidelines, which treat any `.unwrap()` or `.expect()` in `magicblock-*/**` as a MAJOR issue.
While this invariant (MagicContext always exists) is fundamental to validator operation and the error message is descriptive, consider returning an error or logging and continuing rather than panicking in the async loop.
🛡️ Suggested defensive handling
```diff
-let magic_context_acc = accountsdb.get_account(&magic_program::MAGIC_CONTEXT_PUBKEY)
-    .expect("Validator found to be running without MagicContext account!");
+let Some(magic_context_acc) = accountsdb.get_account(&magic_program::MAGIC_CONTEXT_PUBKEY) else {
+    error!("Validator found to be running without MagicContext account!");
+    continue;
+};
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-api/src/tickers.rs` around lines 43-44: replace the panic-causing expect call when fetching the MagicContext account; instead of calling accountsdb.get_account(&magic_program::MAGIC_CONTEXT_PUBKEY).expect(...), handle the missing account defensively in the surrounding async loop by checking the Option returned, logging a clear error via the existing logger, and either returning an Err from the containing function or skipping/continuing the loop iteration while preserving state; update code paths that use magic_context_acc to accept the Option or the Result so the function signature (e.g., the ticker handler) returns Result<..., Error> or handles the None case gracefully rather than panicking.

magicblock-processor/src/executor/mod.rs (1)
224-234:⚠️ Potential issue | 🟠 Major
`.unwrap()` on RwLock write guard flagged per coding guidelines. Line 226 uses `.unwrap()` on `writable_sysvar_cache().write()`. While lock poisoning is rare and this pattern is common in Rust, the coding guidelines treat any `.unwrap()` in `magicblock-*/**` as a MAJOR issue.
Consider whether this should propagate an error or use `.expect()` with an explicit invariant justification.
🛡️ Suggested fix with explicit invariant
```diff
 fn set_sysvars(&self, block: &LatestBlockInner) {
-    let mut cache = self.processor.writable_sysvar_cache().write().unwrap();
+    let mut cache = self.processor
+        .writable_sysvar_cache()
+        .write()
+        .expect("sysvar cache lock should never be poisoned in single-threaded executor");
     cache.set_sysvar_for_tests(&block.clock);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/executor/mod.rs` around lines 224 - 234, The write() call on writable_sysvar_cache used in set_sysvars currently calls .unwrap() which is disallowed; replace the .unwrap() with a non-panicking handling: either change set_sysvars to return a Result and propagate the PoisonError from writable_sysvar_cache().write(), or (if you can assert the lock poisoning is an unrecoverable invariant) replace .unwrap() with .expect("invariant: writable_sysvar_cache RwLock must not be poisoned in set_sysvars; if poisoned, recovery is undefined") to provide an explicit justification; update references to set_sysvars/LatestBlockInner accordingly when changing the signature.
♻️ Duplicate comments (6)
magicblock-replicator/src/nats/consumer.rs (1)
88-107: ⚠️ Potential issue | 🟠 Major
Propagate cancellation distinctly from "no pending messages."
`magicblock-replicator/src/service/standby.rs:151-158` turns `None` into `false` with `unwrap_or_default()`. If `cancel` fires while `get_info()` is in flight, the expiry branch in `magicblock-replicator/src/service/standby.rs:63-76` can keep going and attempt promotion during shutdown. This method needs a distinct cancelled/error state instead of overloading `Option<u64>`.
🤖 Prompt for AI Agents
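One way to model the distinct outcomes this comment asks for; the names are illustrative, and the real method would be async and query the NATS consumer via `get_info()`:

```rust
// A dedicated outcome type so cancellation is not conflated with
// "zero pending messages".
#[derive(Debug, PartialEq)]
enum PendingOutcome {
    Cancelled,    // the cancellation token fired mid-query
    Pending(u64), // the consumer reported this many pending messages
}

// Caller-side handling: promote only on a fully drained backlog,
// never while shutting down.
fn may_promote(outcome: PendingOutcome) -> bool {
    match outcome {
        PendingOutcome::Cancelled => false,
        PendingOutcome::Pending(n) => n == 0,
    }
}
```

With `Option<u64>` the `Cancelled` and `Pending(0)` cases collapse into the same `false`, which is exactly the ambiguity the review flags.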
Verify each finding against the current code and only fix it if needed. In `@magicblock-replicator/src/nats/consumer.rs` around lines 88 - 107, The pending method currently returns Option<u64> and conflates "cancelled" with "no pending messages"; change its signature and behaviour to return a distinct result for cancellation/errors (e.g. Result<Option<u64>, PendingCancelled> or a small enum like PendingOutcome::{Cancelled, Pending(u64), Error}), update the loop in pending (referencing pending, self.inner.get_info(), CancellationToken) to return the appropriate variant when cancel.cancelled() fires versus when get_info() errors or returns a count, and then update the callers (e.g. the code in standby.rs that currently calls pending(...).unwrap_or_default()) to match on the new return type and handle Cancelled distinctly from zero pending so promotion is not attempted during shutdown.magicblock-replicator/src/service/standby.rs (1)
63-76: ⚠️ Potential issue | 🟠 Major
Lock-expiry signal may be lost if pending messages exist.
When `has_pending()` returns true at line 64, the code continues without preserving the fact that the lock expired. Once catch-up finishes and `has_pending()` becomes false, the code won't automatically retry takeover; it will only respond to a new expiry event or fall back to the idle timeout (adding at least `LEADER_TIMEOUT` delay).
Consider tracking the expiry state with a flag and retrying promotion as soon as
+ lock_expired: bool, } impl Standby { pub fn new(...) -> Self { ... Self { ... + lock_expired: false, } } pub async fn run(mut self) -> Result<Option<Primary>> { ... loop { tokio::select! { biased; _ = self.watcher.wait_for_expiry() => { + self.lock_expired = true; if self.has_pending().await { continue; } - if !self.can_promote { - warn!("leader lock expired, but takeover disabled (ReplicaOnly mode)"); - continue - } - info!("leader lock expired, attempting takeover"); - if let Ok(Some(producer)) = self.ctx.try_acquire_producer().await { - info!("acquired leadership, promoting"); - return self.ctx.into_primary(producer, self.messages).await.map(Some); - } + self.try_promote_if_ready().await?; } result = stream.next() => { // ... handle message ... + // After processing, check if we can promote now + if self.lock_expired && !self.has_pending().await { + self.try_promote_if_ready().await?; + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-replicator/src/service/standby.rs` around lines 63 - 76, The expiry event handled in the watcher.wait_for_expiry branch is currently discarded when has_pending().await is true, so promotion isn't retried once pending clears; add a flag (e.g., pending_expired_lock or expiry_pending) in the Standby state that is set when watcher.wait_for_expiry fires and cleared only after a successful promotion attempt or when can_promote is false, then modify the loop so that after has_pending().await becomes false it checks this flag and immediately retries the same promotion logic (calling ctx.try_acquire_producer().await and ctx.into_primary(...).await with messages) instead of waiting for the next expiry; ensure the flag is also checked/cleared on other early exits to avoid stuck state and keep existing logging (info/warn) paths.magicblock-processor/src/scheduler/mod.rs (4)
362-376: ⚠️ Potential issue | 🔴 Critical
Hash replayed transactions too.
Replicas schedule `TransactionProcessingMode::Replay`, but that arm never updates `self.hasher`. `verify_block_as_replica()` will diverge on any non-empty replicated slot.
⚙️ Minimal fix
```diff
-    let mut is_execution = false;
+    let mut is_execution = false;
+    let mut should_hash = false;
     let (slot, index) = match txn.mode {
-        TransactionProcessingMode::Replay(ctx) => (ctx.slot, ctx.index),
+        TransactionProcessingMode::Replay(ctx) => {
+            should_hash = true;
+            (ctx.slot, ctx.index)
+        }
         TransactionProcessingMode::Simulation(_) => (self.slot, 0),
         TransactionProcessingMode::Execution(_) => {
             is_execution = true;
+            should_hash = true;
             let index = self.index;
             // we only advance the index if we are executing (primary mode)
             self.index += 1;
             (self.slot, index)
         }
     };
-    if is_execution {
+    if should_hash {
         self.hasher.update(txn.transaction.signature().as_ref());
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/mod.rs` around lines 362 - 376, The replay branch currently skips updating the block hasher, causing divergence in verify_block_as_replica; update self.hasher with the transaction signature when handling TransactionProcessingMode::Replay (same way as when is_execution is true) so both Replay and Execution contribute to the hash—i.e., in the match arm for TransactionProcessingMode::Replay (and/or by removing the is_execution gating) call self.hasher.update(txn.transaction.signature().as_ref()) so replicated transactions are hashed consistently with verify_block_as_replica.
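The mode-dependent hashing rule can be sketched in isolation; `Mode` is an illustrative stand-in for `TransactionProcessingMode` with payloads omitted:

```rust
// Both replayed and executed transactions must feed the block hasher,
// otherwise a replica's recomputed blockhash diverges from the
// primary's on any non-empty slot; simulations never touch the hash.
enum Mode {
    Replay,
    Simulation,
    Execution,
}

fn should_hash(mode: &Mode) -> bool {
    match mode {
        Mode::Replay | Mode::Execution => true,
        Mode::Simulation => false,
    }
}
```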
405-413: ⚠️ Potential issue | 🟠 Major
Abort the slot transition when `write_block()` fails.
`finalize_block()` only logs ledger errors, and `transition_to_new_slot()` immediately advances sysvars anyway. That leaves the local slot, `Clock`, and `SlotHashes` ahead of persisted history. Return a `Result` here and stop the transition on error, or make the failure fatal.
Also applies to: 477-491
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/mod.rs` around lines 405 - 413, transition_to_new_slot must abort on ledger write failures: change its signature to return a Result<u64, E> (or make it fatal) and propagate errors from prepare_block/finalize_block (which calls write_block) instead of always proceeding; stop calling update_sysvars and metrics::set_slot when finalize_block/write_block fails. Locate transition_to_new_slot, prepare_block, finalize_block and any calls to write_block, update_sysvars and metrics::set_slot (also apply the same change to the other transition block around lines 477-491) and ensure errors from finalize_block/write_block are returned up so the slot transition is not applied on persistent write failure.
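The suggested control flow can be sketched with closures standing in for the real ledger write and slot update; names and types here are illustrative, not the actual scheduler API:

```rust
// On the primary, a failed ledger write aborts the slot transition so
// the in-memory slot never runs ahead of persisted history.
fn finalize_block(
    is_primary: bool,
    write_block: impl FnOnce() -> Result<(), String>,
    set_slot: &mut dyn FnMut(),
) -> Result<(), String> {
    if is_primary {
        write_block()?; // return early instead of merely logging
    }
    set_slot(); // reached only if the write succeeded (or not primary)
    Ok(())
}
```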
366-396:⚠️ Potential issue | 🟠 MajorDon't advance the block stream before enqueue succeeds.
Lines 368-375 mutate
self.indexandself.hasherbefore Line 392 attemptstry_send(). If that send fails, the transaction is dropped but the slot index and blockhash stream have already moved. Defer those mutations until after a successful enqueue, and surface the send failure instead of returningNone.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/mod.rs` around lines 366 - 396, The code currently increments self.index and updates self.hasher before calling self.executors[...].try_send(...) so on send failure the stream and hash advance incorrectly; fix by computing local slot and index variables without mutating self, build the IndexedTransaction and call try_send first, and only when try_send returns Ok do you mutate self.index (self.index += 1) and call self.hasher.update(...); similarly only call self.send_replication(...) after confirming sent and coordinator.is_primary(); on send failure return or propagate an error instead of silently returning None so the caller can observe the enqueue failure (references: self.index, self.hasher.update, self.executors[..].try_send, IndexedTransaction, send_replication, coordinator.is_primary, Message::Transaction).
438-443: ⚠️ Potential issue | 🟠 Major
Replace this `unwrap()` in the slot-boundary path.
A backward host clock panics the scheduler here. Handle `Err(SystemTimeError)` explicitly or make the failure fatal with context instead of panicking via `unwrap()`.
🛡️ Minimal fix
```diff
-    let unix_timestamp = SystemTime::now()
-        .duration_since(UNIX_EPOCH)
-        .unwrap()
-        // NOTE: since we can tick very frequently, a lot
-        // of blocks might have identical timestamps
-        .as_secs() as i64;
+    let unix_timestamp = match SystemTime::now().duration_since(UNIX_EPOCH) {
+        Ok(duration) => duration.as_secs() as i64,
+        Err(error) => {
+            error!(%error, "system clock before UNIX_EPOCH");
+            0
+        }
+    };
```
As per coding guidelines
{magicblock-*,programs,storage-proto}/**: Treat any usage of.unwrap()or.expect()in production Rust code as a MAJOR issue.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/mod.rs` around lines 438 - 443, The current call to SystemTime::now().duration_since(UNIX_EPOCH).unwrap() can panic if the host clock is before the epoch; replace the unwrap with explicit error handling: call duration_since(UNIX_EPOCH) and match or map_err the Result, set unix_timestamp from Ok(d.as_secs() as i64) and on Err(SystemTimeError) either return a Result::Err from the enclosing function (propagate the error) or log and convert it into a fatal error with clear context (include the SystemTimeError in the message). Update the function signature to return a Result if you choose propagation, and reference SystemTime::now, duration_since, UNIX_EPOCH, SystemTimeError and the unix_timestamp variable when making the change.
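The non-panicking timestamp calculation is straightforward to sketch with std alone; `unix_timestamp_or_zero` is a hypothetical helper name, and in the real code the error branch would also log a warning:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// A host clock set before the epoch yields the fallback 0 instead of
// panicking the scheduler at a slot boundary.
fn unix_timestamp_or_zero(now: SystemTime) -> i64 {
    match now.duration_since(UNIX_EPOCH) {
        Ok(duration) => duration.as_secs() as i64,
        Err(_) => 0,
    }
}
```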
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@magicblock-config/src/config/ledger.rs`:
- Around line 16-17: The doc comment for the ledger field superblock_size
contains a typo ("before they accountsdb"); update the documentation string
above the pub superblock_size: u64 field in ledger.rs to read "before the
accountsdb snapshot/checksum is taken" (replace "they" with "the") so the
comment is grammatically correct and clear.
In `@magicblock-config/src/consts.rs`:
- Around line 57-58: Replace the hardcoded "20" with a calculation using
DEFAULT_LEDGER_BLOCK_TIME_MS so DEFAULT_SUPERBLOCK_SIZE stays in sync with
ledger block time; compute slots per second as 1000 /
DEFAULT_LEDGER_BLOCK_TIME_MS (using the same integer type as
DEFAULT_LEDGER_BLOCK_TIME_MS) and set DEFAULT_SUPERBLOCK_SIZE = 3600 *
slots_per_second, keeping the doc comment updated to reflect the derivation;
reference DEFAULT_SUPERBLOCK_SIZE and DEFAULT_LEDGER_BLOCK_TIME_MS when making
the change.
In `@magicblock-ledger/src/lib.rs`:
- Around line 32-35: Update the field documentation for notifier to reflect that
the broadcast channel now sends the full LatestBlockInner payload rather than
merely a signal; mention that subscribers receive LatestBlockInner instances
(not just a notification) and clarify whether callers should still use the load
method or prefer the payload from the channel for the current state, referencing
the notifier field and LatestBlockInner type.
In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 511-518: In update_program_cache, avoid using write().unwrap() on
self.program_cache which will panic if the RwLock is poisoned; instead handle
the PoisonError explicitly by matching on self.program_cache.write() (or using
.into_inner() on the Err branch) so you can log or record the poisoning and
still obtain a mutable cache guard to run cache.prune(slot, 0) and set
cache.latest_root_slot = slot, then drop the guard before the syscall lookup;
update references to program_cache, update_program_cache, cache.prune, and
cache.latest_root_slot accordingly and remove the unwrap()/expect() usage.
---
Outside diff comments:
In `@magicblock-api/src/tickers.rs`:
- Around line 43-44: Replace the panic-causing expect call when fetching the
MagicContext account: instead of calling
accountsdb.get_account(&magic_program::MAGIC_CONTEXT_PUBKEY).expect(...), handle
the missing account defensively in the surrounding async loop by checking the
Option returned, logging a clear error via the existing logger, and either
returning an Err from the containing function or skipping/continuing the loop
iteration while preserving state; update code paths that use magic_context_acc
to accept the Option or the Result so the function signature (e.g., the ticker
handler) returns Result<..., Error> or handles the None case gracefully rather
than panicking.
In `@magicblock-processor/src/executor/mod.rs`:
- Around line 224-234: The write() call on writable_sysvar_cache used in
set_sysvars currently calls .unwrap() which is disallowed; replace the .unwrap()
with a non-panicking handling: either change set_sysvars to return a Result and
propagate the PoisonError from writable_sysvar_cache().write(), or (if you can
assert the lock poisoning is an unrecoverable invariant) replace .unwrap() with
.expect("invariant: writable_sysvar_cache RwLock must not be poisoned in
set_sysvars; if poisoned, recovery is undefined") to provide an explicit
justification; update references to set_sysvars/LatestBlockInner accordingly
when changing the signature.
---
Duplicate comments:
In `@magicblock-processor/src/scheduler/mod.rs`:
- Around line 362-376: The replay branch currently skips updating the block
hasher, causing divergence in verify_block_as_replica; update self.hasher with
the transaction signature when handling TransactionProcessingMode::Replay (same
way as when is_execution is true) so both Replay and Execution contribute to the
hash—i.e., in the match arm for TransactionProcessingMode::Replay (and/or by
removing the is_execution gating) call
self.hasher.update(txn.transaction.signature().as_ref()) so replicated
transactions are hashed consistently with verify_block_as_replica.
- Around line 405-413: transition_to_new_slot must abort on ledger write
failures: change its signature to return a Result<u64, E> (or make it fatal) and
propagate errors from prepare_block/finalize_block (which calls write_block)
instead of always proceeding; stop calling update_sysvars and metrics::set_slot
when finalize_block/write_block fails. Locate transition_to_new_slot,
prepare_block, finalize_block and any calls to write_block, update_sysvars and
metrics::set_slot (also apply the same change to the other transition block
around lines 477-491) and ensure errors from finalize_block/write_block are
returned up so the slot transition is not applied on persistent write failure.
- Around line 366-396: The code currently increments self.index and updates
self.hasher before calling self.executors[...].try_send(...) so on send failure
the stream and hash advance incorrectly; fix by computing local slot and index
variables without mutating self, build the IndexedTransaction and call try_send
first, and only when try_send returns Ok do you mutate self.index (self.index +=
1) and call self.hasher.update(...); similarly only call
self.send_replication(...) after confirming sent and coordinator.is_primary();
on send failure return or propagate an error instead of silently returning None
so the caller can observe the enqueue failure (references: self.index,
self.hasher.update, self.executors[..].try_send, IndexedTransaction,
send_replication, coordinator.is_primary, Message::Transaction).
- Around line 438-443: The current call to
SystemTime::now().duration_since(UNIX_EPOCH).unwrap() can panic if the host
clock is before the epoch; replace the unwrap with explicit error handling: call
duration_since(UNIX_EPOCH) and match or map_err the Result, set unix_timestamp
from Ok(d.as_secs() as i64) and on Err(SystemTimeError) either return a
Result::Err from the enclosing function (propagate the error) or log and convert
it into a fatal error with clear context (include the SystemTimeError in the
message). Update the function signature to return a Result if you choose
propagation, and reference SystemTime::now, duration_since, UNIX_EPOCH,
SystemTimeError and the unix_timestamp variable when making the change.
In `@magicblock-replicator/src/nats/consumer.rs`:
- Around line 88-107: The pending method currently returns Option<u64> and
conflates "cancelled" with "no pending messages"; change its signature and
behaviour to return a distinct result for cancellation/errors (e.g.
Result<Option<u64>, PendingCancelled> or a small enum like
PendingOutcome::{Cancelled, Pending(u64), Error}), update the loop in pending
(referencing pending, self.inner.get_info(), CancellationToken) to return the
appropriate variant when cancel.cancelled() fires versus when get_info() errors
or returns a count, and then update the callers (e.g. the code in standby.rs
that currently calls pending(...).unwrap_or_default()) to match on the new
return type and handle Cancelled distinctly from zero pending so promotion is
not attempted during shutdown.
In `@magicblock-replicator/src/service/standby.rs`:
- Around line 63-76: The expiry event handled in the watcher.wait_for_expiry
branch is currently discarded when has_pending().await is true, so promotion
isn't retried once pending clears; add a flag (e.g., pending_expired_lock or
expiry_pending) in the Standby state that is set when watcher.wait_for_expiry
fires and cleared only after a successful promotion attempt or when can_promote
is false, then modify the loop so that after has_pending().await becomes false
it checks this flag and immediately retries the same promotion logic (calling
ctx.try_acquire_producer().await and ctx.into_primary(...).await with messages)
instead of waiting for the next expiry; ensure the flag is also checked/cleared
on other early exits to avoid stuck state and keep existing logging (info/warn)
paths.
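The latched-expiry idea above can be modeled as a tiny synchronous state machine; `ExpiryLatch` is an illustrative name, and the real loop is async over the watcher and message stream:

```rust
// The expiry signal is remembered across the backlog drain instead of
// being dropped when pending messages exist.
struct ExpiryLatch {
    latched: bool,
}

impl ExpiryLatch {
    fn new() -> Self {
        Self { latched: false }
    }

    // Record that the leader lock expired.
    fn on_expiry(&mut self) {
        self.latched = true;
    }

    // Takeover runs only once the backlog is empty *and* an expiry was
    // observed; the latch is cleared so takeover fires once per expiry.
    fn should_take_over(&mut self, has_pending: bool) -> bool {
        if self.latched && !has_pending {
            self.latched = false;
            true
        } else {
            false
        }
    }
}
```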
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bdb9c6c7-2d9c-49be-9799-40893b584f04
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (37)
- Cargo.toml
- config.example.toml
- magicblock-aperture/src/geyser.rs
- magicblock-aperture/src/processor.rs
- magicblock-aperture/src/state/blocks.rs
- magicblock-aperture/tests/blocks.rs
- magicblock-aperture/tests/mocked.rs
- magicblock-aperture/tests/setup.rs
- magicblock-aperture/tests/transactions.rs
- magicblock-aperture/tests/websocket.rs
- magicblock-api/src/lib.rs
- magicblock-api/src/magic_validator.rs
- magicblock-api/src/slot.rs
- magicblock-api/src/tickers.rs
- magicblock-config/src/config/ledger.rs
- magicblock-config/src/consts.rs
- magicblock-core/src/link.rs
- magicblock-core/src/link/blocks.rs
- magicblock-core/src/link/replication.rs
- magicblock-ledger/src/blockstore_processor/mod.rs
- magicblock-ledger/src/lib.rs
- magicblock-metrics/src/metrics/mod.rs
- magicblock-processor/Cargo.toml
- magicblock-processor/src/executor/mod.rs
- magicblock-processor/src/scheduler/coordinator.rs
- magicblock-processor/src/scheduler/mod.rs
- magicblock-processor/src/scheduler/state.rs
- magicblock-processor/src/scheduler/tests.rs
- magicblock-processor/tests/fees.rs
- magicblock-processor/tests/replay.rs
- magicblock-replicator/src/nats/consumer.rs
- magicblock-replicator/src/nats/mod.rs
- magicblock-replicator/src/nats/producer.rs
- magicblock-replicator/src/service/context.rs
- magicblock-replicator/src/service/primary.rs
- magicblock-replicator/src/service/standby.rs
- test-kit/src/lib.rs
💤 Files with no reviewable changes (3)
- magicblock-api/src/lib.rs
- magicblock-core/src/link.rs
- magicblock-api/src/slot.rs

Summary
Adds streaming blockhash computation to the transaction scheduler, moving it out of slot_ticker. It also introduces the notion of superblocks: every N slots the scheduler stops the world and takes a snapshot of accountsdb along with its checksum.
Compatibility
Blockhash computation has moved entirely to the scheduler, and the hash is now computed differently.
Config changes: new parameter ledger.superblock-size; accountsdb no longer has a snapshot frequency parameter.
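The relationship between the block time and the superblock size can be sketched as derived constants; the 50 ms block time here is an assumption (it yields the 72,000-slot default mentioned in the summary below, i.e. one hour of slots between accountsdb snapshots):

```rust
// Hypothetical constants showing how a default superblock size could be
// computed from the block time instead of being hardcoded, so the two
// values stay in sync if the block time ever changes.
const DEFAULT_LEDGER_BLOCK_TIME_MS: u64 = 50; // assumed block time
const SLOTS_PER_SECOND: u64 = 1000 / DEFAULT_LEDGER_BLOCK_TIME_MS;
const DEFAULT_SUPERBLOCK_SIZE: u64 = 3600 * SLOTS_PER_SECOND; // one hour
```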
Testing
Checklist
Summary by CodeRabbit
New Features
superblock-size, default 72,000).
Bug Fixes
Refactor
Chores
Tests