
epic: replication service#1076

Draft
bmuddha wants to merge 5 commits into master from bmuddha/epic/replication-service

Conversation

@bmuddha
Collaborator

@bmuddha bmuddha commented Mar 20, 2026

Summary

An accumulator branch for replication service PRs

Checklist

Summary by CodeRabbit

  • Documentation
    • Added comprehensive docs for a leader–follower replication service: startup flow, roles, event formats, snapshot recovery, failover rules, and threading model.
  • New Features
    • New NATS-backed replication service (primary/standby) for streaming validator events.
    • Snapshots now stored as compressed archives and support external ingestion with fast‑forward recovery.
  • Chores
    • Updated ignore rules to exclude AI-related config files.
  • Breaking Changes
    • Removed configurable automatic snapshot cadence and its config/default entry.

@github-actions

github-actions bot commented Mar 20, 2026

Manual Deploy Available

You can trigger a manual deploy of this PR branch to testnet:

Deploy to Testnet 🚀

Alternative: Comment /deploy on this PR to trigger deployment directly.

⚠️ Note: Manual deploy requires authorization. Only authorized users can trigger deployments.

Comment updated automatically when the PR is synchronized.

Collaborator Author

bmuddha commented Mar 20, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@coderabbitai
Contributor

coderabbitai bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR removes the accountsdb snapshot-frequency config, migrates snapshot persistence to per-slot compressed archives (snapshot-<slot>.tar.gz) with archive creation/registration/validation/extraction and atomic-swap restore, adds explicit AccountsDb snapshot APIs and lock/checksum primitives, updates tests to the archive flow, adds tar/flate2 deps, introduces a new magicblock-replicator crate implementing a NATS JetStream-based replication protocol (broker/producer/consumer/lock-watcher/service/primary/standby/proto/watcher), updates workspace/Cargo, and updates docs/.gitignore and issue/PR templates.
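The walkthrough describes per-slot archives named snapshot-<slot>.tar.gz. A minimal stdlib sketch of recovering a slot from such a name; the helper name and the exact pattern handling here are assumptions for illustration, not the crate's actual implementation:

```rust
use std::path::Path;

/// Extract the slot number from an archive path like "snapshot-1234.tar.gz".
/// Returns None for names that don't match the expected pattern.
/// (Sketch only; the real parsing in magicblock-accounts-db may differ.)
fn parse_slot(path: &Path) -> Option<u64> {
    let name = path.file_name()?.to_str()?;
    let digits = name
        .strip_prefix("snapshot-")?
        .strip_suffix(".tar.gz")?;
    digits.parse::<u64>().ok()
}

fn main() {
    assert_eq!(parse_slot(Path::new("/snapshots/snapshot-42.tar.gz")), Some(42));
    assert_eq!(parse_slot(Path::new("snapshot-.tar.gz")), None);
    assert_eq!(parse_slot(Path::new("other.tar.gz")), None);
    println!("ok");
}
```

Embedding the slot in the filename lets the registry sort and binary-search archives by path without opening them.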

Assessment against linked issues

Objectives (all from issue #935):
  • Document replication design and protocol
  • Provide implementation of replication service
  • Snapshot upload/recovery and JetStream event handling implemented

Assessment against linked issues: Out-of-scope changes

Code change (file path) and explanation:
  • Snapshot storage refactor to tar.gz archives and archive-based restore/ingest (magicblock-accounts-db/src/snapshot.rs): major local persistence redesign not required to meet the replication-epic documentation; changes the on-disk snapshot format and restore semantics.
  • AccountsDb API and locking/unsafe checksum changes (magicblock-accounts-db/src/lib.rs): public API/signature changes (take_snapshot, insert_external_snapshot, lock_database, unsafe checksum, changed set_slot) alter callers' responsibilities and safety model beyond replication integration.
  • Removal of snapshot_frequency config & default constant (magicblock-config/src/config/accounts.rs, magicblock-config/src/consts.rs, config.example.toml, various test configs): global config and test changes remove the user-visible snapshot-frequency setting; this is a policy/behavior change distinct from the replication protocol objective.

Suggested reviewers

  • GabrielePicco
  • thlorenz
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch bmuddha/epic/replication-service

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitignore:
- Around line 34-35: The .gitignore currently uses global basename patterns
"config.json" and "config.toml" which will ignore those files anywhere in the
repo; change these to more specific, local-scoped patterns (for example a
specific directory like "secrets/config.json" or a dev-only filename like
"config.local.json" or prefix with a directory such as "dev/config.toml") so
that tracked templates (e.g., config.json in example/ or templates/) are not
masked—update the entries that reference "config.json" and "config.toml" in the
.gitignore to narrowed names or paths that reflect the intended secret/dev
override files.

In `@magicblock-replicator/README.md`:
- Line 49: Replace the misspelled word "replys" in the standby behavior sentence
("Consumes events from the stream and replys them locally") with the correct
term "replays" so the sentence reads "Consumes events from the stream and
replays them locally" in README.md.
- Around line 7-22: The README's fenced ASCII-art code block is missing a
language tag (triggering MD040); update the opening triple-backtick fence in
README.md to include a plain-language tag such as ```text (or ```plaintext) so
the block is explicitly marked as text—locate the ASCII diagram fenced by ```
and add the language identifier to the opening fence.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 922ddb0c-6e71-4b11-81bb-5ccb43acf562

📥 Commits

Reviewing files that changed from the base of the PR and between e7ed957 and 3238fb7.

📒 Files selected for processing (2)
  • .gitignore
  • magicblock-replicator/README.md

Comment on lines +34 to +35
config.json
config.toml
Contributor


⚠️ Potential issue | 🟠 Major

Scope these ignore rules to avoid masking valid tracked config files.

config.json and config.toml here are global basename patterns, so they ignore matching files anywhere in the repo. That can silently block committing legitimate config files (examples, fixtures, package configs). Prefer local-only patterns (for secrets/dev overrides) and keep templates tracked.

Suggested narrowing
 # AI related
 **/CLAUDE.md
-config.json
-config.toml
+/config.local.json
+/config.local.toml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 34 - 35, The .gitignore currently uses global
basename patterns "config.json" and "config.toml" which will ignore those files
anywhere in the repo; change these to more specific, local-scoped patterns (for
example a specific directory like "secrets/config.json" or a dev-only filename
like "config.local.json" or prefix with a directory such as "dev/config.toml")
so that tracked templates (e.g., config.json in example/ or templates/) are not
masked—update the entries that reference "config.json" and "config.toml" in the
.gitignore to narrowed names or paths that reflect the intended secret/dev
override files.

Comment on lines +7 to +22
```
               ┌─────────────┐
               │   Service   │
               └──────┬──────┘
         ┌────────────┴────────────┐
         ▼                         ▼
    ┌─────────┐               ┌─────────┐
    │ Primary │ ← ─ ─ ─ ─ ─ → │ Standby │
    └────┬────┘               └────┬────┘
         │                         │
     ┌───┴───┐                 ┌───┴───┐
     │Publish│                 │Consume│
     │Upload │                 │Apply  │
     │Refresh│                 │Verify │
     └───────┘                 └───────┘
```
Contributor


⚠️ Potential issue | 🟡 Minor

Add a language to the fenced block to satisfy Markdown linting.

Line 7 opens a fenced code block without a language tag (```), which triggers MD040. Use text (or plaintext) for the ASCII diagram.

Suggested patch
-```
+```text
               ┌─────────────┐
               │   Service   │
               └──────┬──────┘
         ┌────────────┴────────────┐
         ▼                         ▼
    ┌─────────┐               ┌─────────┐
    │ Primary │ ← ─ ─ ─ ─ ─ → │ Standby │
    └────┬────┘               └────┬────┘
         │                         │
     ┌───┴───┐                 ┌───┴───┐
     │Publish│                 │Consume│
     │Upload │                 │Apply  │
     │Refresh│                 │Verify │
     └───────┘                 └───────┘
```
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-replicator/README.md` around lines 7 - 22, The README's fenced
ASCII-art code block is missing a language tag (triggering MD040); update the
opening triple-backtick fence in README.md to include a plain-language tag such
as ```text (or ```plaintext) so the block is explicitly marked as text—locate
the ASCII diagram fenced by ``` and add the language identifier to the opening
fence.


### Standby Node

- Consumes events from the stream and replys them locally
Contributor


⚠️ Potential issue | 🟡 Minor

Fix typo in standby behavior description.

Line 49 says “replys”; this should be “replays” (or “replies,” depending on intent). Based on surrounding text (“consume and replay events”), “replays” appears correct.

Suggested patch
-- Consumes events from the stream and replys them locally
+- Consumes events from the stream and replays them locally
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-- Consumes events from the stream and replys them locally
+- Consumes events from the stream and replays them locally
🧰 Tools
🪛 LanguageTool

[grammar] ~49-~49: Ensure spelling is correct
Context: ... - Consumes events from the stream and replys them locally - Watches the leader lock ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-replicator/README.md` at line 49, Replace the misspelled word
"replys" in the standby behavior sentence ("Consumes events from the stream and
replys them locally") with the correct term "replays" so the sentence reads
"Consumes events from the stream and replays them locally" in README.md.

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
.gitignore (1)

34-35: ⚠️ Potential issue | 🟠 Major

Scope ignore patterns to avoid hiding legitimate config files repo-wide.

Line 34 and Line 35 use global basename rules, so any config.json/config.toml anywhere in the repo will be ignored. Narrow these to intended local/dev-secret filenames or paths.

Suggested narrowing
 # AI related
 **/CLAUDE.md
-config.json
-config.toml
+/config.local.json
+/config.local.toml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 34 - 35, The .gitignore currently uses broad
basename patterns "config.json" and "config.toml" which ignore those filenames
repository-wide; replace them with scoped paths or more specific names (for
example a local/dev-secret filename like "config.local.json", a folder-specific
path like "secrets/config.toml" or "./.env/config.json") so only the intended
local/dev secret files are ignored—update the entries that currently list
"config.json" and "config.toml" to the narrowed names/paths used by your
tooling.
magicblock-replicator/README.md (2)

7-7: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced code block.

The fenced code block is missing a language specifier, which triggers the MD040 linting rule. Add text (or plaintext) to the opening fence.

📝 Proposed fix
-```
+```text
               ┌─────────────┐
               │   Service   │
               └──────┬──────┘
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-replicator/README.md` at line 7, The fenced code block in
README.md is missing a language tag and triggers MD040; update the opening
triple-backtick fence for the ASCII diagram to include a language specifier
(e.g., change ``` to ```text or ```plaintext) so the markdown linter accepts the
block and preserves plain text formatting.

49-49: ⚠️ Potential issue | 🟡 Minor

Fix the typo in the standby behavior description.

Line 49 contains "replys" which should be "replays" (consuming and replaying events locally).

📝 Proposed fix
-- Consumes events from the stream and replys them locally
+- Consumes events from the stream and replays them locally
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-replicator/README.md` at line 49, Replace the misspelled word
"replys" with "replays" in the README sentence that currently reads "Consumes
events from the stream and replys them locally" so it becomes "Consumes events
from the stream and replays them locally"; search for that exact phrase in the
README to locate and update the standby behavior description.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.gitignore:
- Around line 34-35: The .gitignore currently uses broad basename patterns
"config.json" and "config.toml" which ignore those filenames repository-wide;
replace them with scoped paths or more specific names (for example a
local/dev-secret filename like "config.local.json", a folder-specific path like
"secrets/config.toml" or "./.env/config.json") so only the intended local/dev
secret files are ignored—update the entries that currently list "config.json"
and "config.toml" to the narrowed names/paths used by your tooling.

In `@magicblock-replicator/README.md`:
- Line 7: The fenced code block in README.md is missing a language tag and
triggers MD040; update the opening triple-backtick fence for the ASCII diagram
to include a language specifier (e.g., change ``` to ```text or ```plaintext) so
the markdown linter accepts the block and preserves plain text formatting.
- Line 49: Replace the misspelled word "replys" with "replays" in the README
sentence that currently reads "Consumes events from the stream and replys them
locally" so it becomes "Consumes events from the stream and replays them
locally"; search for that exact phrase in the README to locate and update the
standby behavior description.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1627e5bf-a6f2-4907-83dc-e59e87a79e12

📥 Commits

Reviewing files that changed from the base of the PR and between 3238fb7 and cb7cff5.

📒 Files selected for processing (2)
  • .gitignore
  • magicblock-replicator/README.md

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magicblock-accounts-db/src/snapshot.rs`:
- Around line 235-246: The validate_archive function only verifies gzip/tar
structure and not AccountsDb content integrity; update the function
(validate_archive) to include a clear TODO comment explaining that content
validation / checksums should be added in future when the snapshot format is
extended (e.g., include per-file or archive-wide checksums and verify entries
via Archive::entries()/read) and reference relevant types used (GzDecoder,
Archive, AccountsDbResult) so future implementers know where to add checksum
verification logic.
- Around line 248-268: The code uses unwrap() on registry.remove(index) in
find_and_remove_snapshot which must be replaced with explicit error handling;
before removing, verify the index is within registry.len() and return
AccountsDbError::SnapshotMissing(target_slot) if out of bounds, then safely
remove (or clone via registry.get(index).cloned()) into chosen_archive and
proceed with the existing parse_slot/error flow; reference the function
find_and_remove_snapshot and the registry.remove(index) call to locate the
change.
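The validate_archive finding above hinges on lazy decoding: constructing the tar::Archive reads nothing, so corruption only surfaces once entries are actually drained. A stdlib-only sketch of that draining idea, with a plain Read standing in for the GzDecoder/Archive stack (the real fix would iterate tar.entries() and read each entry to EOF):

```rust
use std::io::{self, Read};

/// Fully drain a reader so that any deferred decode/IO error surfaces.
/// In validate_archive this role is played by reading every tar entry;
/// here a generic reader stands in for the decompression pipeline.
fn drain_fully<R: Read>(mut reader: R) -> io::Result<u64> {
    io::copy(&mut reader, &mut io::sink())
}

/// A reader that errors after yielding some bytes, mimicking a
/// truncated gzip stream that a lazy constructor would not notice.
struct Truncated {
    remaining: usize,
}

impl Read for Truncated {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.remaining == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "truncated archive",
            ));
        }
        let n = buf.len().min(self.remaining);
        self.remaining -= n;
        buf[..n].fill(0);
        Ok(n)
    }
}

fn main() {
    // A healthy stream drains cleanly and reports its length...
    assert_eq!(drain_fully(&[1u8, 2, 3][..]).unwrap(), 3);
    // ...while deferred corruption is caught only because we drained.
    assert!(drain_fully(Truncated { remaining: 8 }).is_err());
    println!("ok");
}
```

The same principle applies to any streaming decoder: validation must consume the stream, not just construct the adapter.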
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e33ed757-35cf-4d57-b14e-0e1e808870a1

📥 Commits

Reviewing files that changed from the base of the PR and between cb7cff5 and 2ebffec.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • test-integration/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (22)
  • config.example.toml
  • magicblock-accounts-db/Cargo.toml
  • magicblock-accounts-db/src/lib.rs
  • magicblock-accounts-db/src/snapshot.rs
  • magicblock-accounts-db/src/tests.rs
  • magicblock-config/src/config/accounts.rs
  • magicblock-config/src/consts.rs
  • magicblock-config/src/tests.rs
  • test-integration/configs/api-conf.ephem.toml
  • test-integration/configs/chainlink-conf.devnet.toml
  • test-integration/configs/claim-fees-test.toml
  • test-integration/configs/cloning-conf.devnet.toml
  • test-integration/configs/cloning-conf.ephem.toml
  • test-integration/configs/committor-conf.devnet.toml
  • test-integration/configs/config-conf.devnet.toml
  • test-integration/configs/restore-ledger-conf.devnet.toml
  • test-integration/configs/schedulecommit-conf-fees.ephem.toml
  • test-integration/configs/schedulecommit-conf.devnet.toml
  • test-integration/configs/schedulecommit-conf.ephem.frequent-commits.toml
  • test-integration/configs/schedulecommit-conf.ephem.toml
  • test-integration/configs/validator-offline.devnet.toml
  • test-integration/test-ledger-restore/src/lib.rs
💤 Files with no reviewable changes (17)
  • test-integration/configs/chainlink-conf.devnet.toml
  • test-integration/configs/config-conf.devnet.toml
  • test-integration/configs/committor-conf.devnet.toml
  • test-integration/configs/schedulecommit-conf.ephem.frequent-commits.toml
  • test-integration/configs/schedulecommit-conf.devnet.toml
  • test-integration/configs/cloning-conf.ephem.toml
  • test-integration/configs/claim-fees-test.toml
  • test-integration/configs/api-conf.ephem.toml
  • test-integration/configs/schedulecommit-conf.ephem.toml
  • test-integration/configs/restore-ledger-conf.devnet.toml
  • config.example.toml
  • test-integration/configs/cloning-conf.devnet.toml
  • magicblock-config/src/consts.rs
  • magicblock-config/src/tests.rs
  • test-integration/configs/validator-offline.devnet.toml
  • test-integration/configs/schedulecommit-conf-fees.ephem.toml
  • magicblock-config/src/config/accounts.rs

Comment on lines +248 to +268
/// Finds the best snapshot for target_slot, removes from registry.
/// Returns (archive_path, slot, index).
fn find_and_remove_snapshot(
    &self,
    target_slot: u64,
) -> AccountsDbResult<(PathBuf, u64, usize)> {
    let mut registry = self.registry.lock();
    let search_key = self.slot_to_archive_path(target_slot);
    let index = match registry.binary_search(&search_key) {
        Ok(i) => i,
        Err(i) if i > 0 => i - 1,
        _ => return Err(AccountsDbError::SnapshotMissing(target_slot)),
    };

    let chosen_archive = registry.remove(index).unwrap();
    let chosen_slot = Self::parse_slot(&chosen_archive)
        .ok_or(AccountsDbError::SnapshotMissing(target_slot))?;

    // 2. Prune Invalidated Futures
    // Any snapshot strictly newer than the chosen one is now on a diverging timeline.
    for invalidated in registry.drain(index..) {
        warn!(
            invalidated_path = %invalidated.display(),
            "Pruning invalidated snapshot"
        );
        let _ = fs::remove_dir_all(&invalidated);
    }

    info!(chosen_slot, target_slot, "Restoring snapshot");
    Ok((chosen_archive, chosen_slot, index))
}
Contributor


⚠️ Potential issue | 🟠 Major

Use proper error handling instead of .unwrap() on line 262.

The .unwrap() on registry.remove(index) violates the coding guidelines for production Rust code in magicblock-*. While the index is validated by the preceding binary search logic, the guidelines require explicit error handling.

🔧 Proposed fix
-        let chosen_archive = registry.remove(index).unwrap();
+        let chosen_archive = registry.remove(index).ok_or_else(|| {
+            AccountsDbError::Internal("Snapshot registry index out of bounds".into())
+        })?;

As per coding guidelines: {magicblock-*,programs,storage-proto}/**: Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 248 - 268, The code uses
unwrap() on registry.remove(index) in find_and_remove_snapshot which must be
replaced with explicit error handling; before removing, verify the index is
within registry.len() and return AccountsDbError::SnapshotMissing(target_slot)
if out of bounds, then safely remove (or clone via registry.get(index).cloned())
into chosen_archive and proceed with the existing parse_slot/error flow;
reference the function find_and_remove_snapshot and the registry.remove(index)
call to locate the change.
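The fix requested above can be sketched with a hypothetical error enum standing in for AccountsDbError (since `remove(index)` returns an Option here, the registry is assumed to be VecDeque-like; both the enum and function names below are illustrative, not the crate's actual API):

```rust
use std::collections::VecDeque;
use std::path::PathBuf;

/// Hypothetical stand-in for AccountsDbError; the real enum lives in
/// magicblock-accounts-db and has more variants.
#[derive(Debug, PartialEq)]
enum SnapshotError {
    SnapshotMissing(u64),
}

/// Remove the chosen archive with an explicit error path instead of
/// `.unwrap()`, so an out-of-range index surfaces as an error value
/// rather than a panic in production code.
fn remove_checked(
    registry: &mut VecDeque<PathBuf>,
    index: usize,
    target_slot: u64,
) -> Result<PathBuf, SnapshotError> {
    registry
        .remove(index)
        .ok_or(SnapshotError::SnapshotMissing(target_slot))
}

fn main() {
    let mut registry: VecDeque<PathBuf> =
        VecDeque::from([PathBuf::from("snapshot-10.tar.gz")]);
    assert!(remove_checked(&mut registry, 0, 10).is_ok());
    // Out-of-bounds index yields an error rather than a panic.
    assert_eq!(
        remove_checked(&mut registry, 5, 10),
        Err(SnapshotError::SnapshotMissing(10))
    );
    println!("ok");
}
```

This is the `.ok_or(...)?` shape both suggested patches converge on: the invariant ("index was validated by binary_search") becomes a recoverable error instead of an implicit panic.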

@bmuddha force-pushed the bmuddha/epic/replication-service branch from 2ebffec to 6af735c (March 31, 2026 15:35)
@bmuddha force-pushed the bmuddha/epic/replication-service branch from 6af735c to 0adfe05 (March 31, 2026 15:38)
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
magicblock-accounts-db/src/lib.rs (1)

424-433: ⚠️ Potential issue | 🟠 Major

Don't hide LMDB scan failures behind a checksum.

This walks iter_all(), and iter_all() turns get_all_accounts() errors into an empty iterator. A failed index scan therefore produces a valid-looking hash of zero accounts instead of surfacing the error. For divergence detection, this path should be fallible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/lib.rs` around lines 424 - 433, The checksum
function currently hides LMDB scan failures because iter_all() converts
get_all_accounts() errors into an empty iterator; change checksum (unsafe fn
checksum(&self) -> u64) to return a fallible result (e.g. -> Result<u64, E>) and
propagate any error from the underlying scan instead of treating it as empty.
Locate checksum and the iterator producer iter_all()/get_all_accounts(), make
iter_all propagate or return an iterator over Result items (or have checksum
call get_all_accounts() directly and return Err on failure), and only
compute/finish the xxhash3_64::Hasher when the scan succeeds so real LMDB errors
surface to callers.
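The fallible-checksum shape this comment asks for can be sketched with std's DefaultHasher standing in for xxhash3_64 and an iterator of Results standing in for the LMDB scan (both substitutions are assumptions for illustration):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Sketch of a fallible checksum: hash account bytes only if every item
/// of the scan succeeds, otherwise propagate the first error instead of
/// silently hashing zero accounts.
fn checksum<I>(scan: I) -> Result<u64, String>
where
    I: IntoIterator<Item = Result<Vec<u8>, String>>,
{
    let mut hasher = DefaultHasher::new();
    for item in scan {
        let bytes = item?; // a failed scan surfaces here
        hasher.write(&bytes);
    }
    Ok(hasher.finish())
}

fn main() {
    let ok_scan = vec![Ok(vec![1u8, 2]), Ok(vec![3u8])];
    let bad_scan = vec![Ok(vec![1u8, 2]), Err("lmdb scan failed".to_string())];
    assert!(checksum(ok_scan).is_ok());
    // Divergence detection must see the failure, not an empty-set hash.
    assert_eq!(checksum(bad_scan), Err("lmdb scan failed".to_string()));
    println!("ok");
}
```

The key property is that a scan error can never be confused with a legitimately empty database, which is exactly the hazard the review describes.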
♻️ Duplicate comments (3)
magicblock-accounts-db/src/snapshot.rs (1)

250-263: ⚠️ Potential issue | 🟠 Major

Replace the unwrap() in snapshot selection.

The search narrows the candidate index, but this still needs an explicit error path in production code.

Suggested patch
-        let chosen_archive = registry.remove(index).unwrap();
+        let chosen_archive = registry
+            .remove(index)
+            .ok_or(AccountsDbError::SnapshotMissing(target_slot))?;

As per coding guidelines: {magicblock-*,programs,storage-proto}/**: Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue. These should not be categorized as trivial or nit-level concerns. Request proper error handling or explicit justification with invariants.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 250 - 263, The code in
find_and_remove_snapshot uses registry.remove(index).unwrap(), which can panic
in production; replace this unwrap with explicit error handling: perform
registry.remove(index) into an Option, return
Err(AccountsDbError::SnapshotMissing(target_slot)) (or a more specific
AccountsDbError if appropriate) when None, otherwise bind the removed value to
chosen_archive and continue to call Self::parse_slot(&chosen_archive) as before;
ensure the error path covers both removal failure and any parse_slot failure by
propagating/parsing errors consistently.
magicblock-replicator/README.md (2)

49-49: ⚠️ Potential issue | 🟡 Minor

Fix the standby behavior typo.

replys should be replays here.

Suggested patch
-- Consumes events from the stream and replys them locally
+- Consumes events from the stream and replays them locally
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-replicator/README.md` at line 49, Update the README sentence
containing the phrase "Consumes events from the stream and replys them locally"
to correct the typo by replacing "replys" with "replays" so the line reads
"Consumes events from the stream and replays them locally"; ensure any other
occurrences of "replys" in the repository are similarly corrected.

7-22: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the ASCII-art fence.

The opening fence is still bare, so Markdown lint will keep flagging it.

Suggested patch
-```
+```text
               ┌─────────────┐
               │   Service   │
               └──────┬──────┘
         ┌────────────┴────────────┐
         ▼                         ▼
    ┌─────────┐               ┌─────────┐
    │ Primary │ ← ─ ─ ─ ─ ─ → │ Standby │
    └────┬────┘               └────┬────┘
         │                         │
     ┌───┴───┐                 ┌───┴───┐
     │Publish│                 │Consume│
     │Upload │                 │Apply  │
     │Refresh│                 │Verify │
     └───────┘                 └───────┘
```
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @magicblock-replicator/README.md around lines 7 - 22, The Markdown code fence
for the ASCII-art diagram currently has no language tag which triggers lint
warnings; update the opening fence (the triple backticks that begin the ASCII
block) to include a language tag (e.g., "text") so it reads ```text, leaving the
ASCII art content and the closing triple backticks unchanged to satisfy the
linter.



🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @magicblock-accounts-db/src/lib.rs:

  • Around line 295-318: take_snapshot currently computes and returns a checksum
    even when snapshot_manager.create_snapshot_dir(slot, used_storage) fails; make
    take_snapshot fallible (e.g., return Result<u64, E> or Option) and return
    early on phase‑1 errors instead of a successful checksum. Specifically, call
    snapshot_manager.create_snapshot_dir(slot, used_storage) while you still hold
    the write lock (or otherwise check its Result) and if it returns Err propagate
    that error immediately; only compute the checksum (unsafe { this.checksum() })
    and spawn the background thread that calls archive_and_register(&dir) when
    create_snapshot_dir succeeded, and update callers of take_snapshot to handle the
    new Result/Option return type.

In @magicblock-accounts-db/src/snapshot.rs:

  • Around line 220-230: The code currently calls find_and_remove_snapshot() which
    mutates the in-memory registry before extract_archive() and atomic_swap(),
    causing the snapshot to be evicted even if restore fails; change the flow so
    restore_from_snapshot finds the snapshot without removing it (or modify
    find_and_remove_snapshot to only mark/return the candidate but not mutate) and
    only remove/prune the registry entry and fs::remove_file(&chosen_archive) after
    atomic_swap() completes successfully; keep using the same identifiers
    (chosen_archive, chosen_slot, index) but defer
    prune_invalidated_snapshots(index) and the file removal until after
    extract_archive() and atomic_swap() succeed so failed restores leave the
    snapshot selectable for retry.
  • Around line 103-119: create_snapshot_dir is currently calling prune_registry
    before reliably inserting the new snapshot into the registry, and the registry
    is appended unsorted which breaks find_and_remove_snapshot's binary_search when
    archives complete out of order; change the registration step to insert the new
    snapshot into the registry at the correct sorted position (use binary_search on
    slot to compute insertion index rather than push_back) and only call
    prune_registry after the insertion/reporting succeeds (i.e., after execute
    returns Ok and after adding the snapshot to the registry), enforcing
    max_snapshots post-registration so the registry remains sorted and pruning never
    removes the last valid snapshot prematurely.
  • Around line 239-245: The validate_archive function currently only constructs
    the tar::Archive iterator which is lazy; update validate_archive(bytes: &[u8])
    to fully iterate and drain the archive so gzip/tar decoding errors surface:
    create the Cursor and GzDecoder as before (Cursor::new(bytes),
    GzDecoder::new(cursor)), call tar.entries()? to get the iterator, then loop over
    entries (e.g., for entry in tar.entries()? { let mut e = entry.map_err(...)?;
    read or drain each entry's contents to EOF such as by copying to /dev/null or
    read_to_end into a buffer) so any corrupt/truncated data causes an error which
    you propagate (using the same AccountsDbResult error path and log_err wrapper
    used today). Ensure you preserve the current error message context ("Invalid
    snapshot archive: not a valid gzip tar") when mapping/propagating errors from
    entries or reading entry bodies.

Outside diff comments:
In @magicblock-accounts-db/src/lib.rs:

  • Around line 424-433: The checksum function currently hides LMDB scan failures
    because iter_all() converts get_all_accounts() errors into an empty iterator;
    change checksum (unsafe fn checksum(&self) -> u64) to return a fallible result
    (e.g. -> Result<u64, E>) and propagate any error from the underlying scan
    instead of treating it as empty. Locate checksum and the iterator producer
    iter_all()/get_all_accounts(), make iter_all propagate or return an iterator
    over Result items (or have checksum call get_all_accounts() directly and return
    Err on failure), and only compute/finish the xxhash3_64::Hasher when the scan
    succeeds so real LMDB errors surface to callers.
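
The fallible-checksum flow described above can be sketched with std types. The real code hashes LMDB records with xxhash3_64; here `DefaultHasher`, the `String` error, and the `Vec`-backed "scan" are illustrative stand-ins:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hash account records from a fallible scan, propagating the first
/// storage error instead of silently treating it as an empty database.
fn checksum<I>(scan: I) -> Result<u64, String>
where
    I: IntoIterator<Item = Result<Vec<u8>, String>>,
{
    let mut hasher = DefaultHasher::new();
    for record in scan {
        // A scan failure aborts the checksum rather than producing
        // the hash of an empty iterator
        hasher.write(&record?);
    }
    Ok(hasher.finish())
}

fn main() {
    let ok_scan = vec![Ok(vec![1, 2, 3]), Ok(vec![4])];
    let failing_scan = vec![Ok(vec![1, 2, 3]), Err("LMDB page fault".to_string())];

    assert!(checksum(ok_scan).is_ok());
    assert!(checksum(failing_scan).is_err());
    println!("scan errors now surface to callers");
}
```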

Duplicate comments:
In @magicblock-accounts-db/src/snapshot.rs:

  • Around line 250-263: The code in find_and_remove_snapshot uses
    registry.remove(index).unwrap(), which can panic in production; replace this
    unwrap with explicit error handling: perform registry.remove(index) into an
    Option, return Err(AccountsDbError::SnapshotMissing(target_slot)) (or a more
    specific AccountsDbError if appropriate) when None, otherwise bind the removed
    value to chosen_archive and continue to call Self::parse_slot(&chosen_archive)
    as before; ensure the error path covers both removal failure and any parse_slot
    failure by propagating/parsing errors consistently.

In @magicblock-replicator/README.md:

  • Line 49: Update the README sentence containing the phrase "Consumes events
    from the stream and replys them locally" to correct the typo by replacing
    "replys" with "replays" so the line reads "Consumes events from the stream and
    replays them locally"; ensure any other occurrences of "replys" in the
    repository are similarly corrected.
  • Around line 7-22: The Markdown code fence for the ASCII-art diagram currently
    has no language tag which triggers lint warnings; update the opening fence (the
    triple backticks that begin the ASCII block) to include a language tag (e.g.,
    "text") so it reads ```text, leaving the ASCII art content and the closing
    triple backticks unchanged to satisfy the linter.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] Push a commit to this branch (recommended)
- [ ] Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: ASSERTIVE

**Plan**: Pro

**Run ID**: `41a69455-5c17-4cf8-8d93-c56fec1092eb`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 2ebffec80ea133496ade8e4e550c24ebad2ffea4 and 0adfe05eed385ba44458ce1fd92efc7c35e3633c.

</details>

<details>
<summary>⛔ Files ignored due to path filters (2)</summary>

* `Cargo.lock` is excluded by `!**/*.lock`
* `test-integration/Cargo.lock` is excluded by `!**/*.lock`

</details>

<details>
<summary>📒 Files selected for processing (24)</summary>

* `.gitignore`
* `config.example.toml`
* `magicblock-accounts-db/Cargo.toml`
* `magicblock-accounts-db/src/lib.rs`
* `magicblock-accounts-db/src/snapshot.rs`
* `magicblock-accounts-db/src/tests.rs`
* `magicblock-config/src/config/accounts.rs`
* `magicblock-config/src/consts.rs`
* `magicblock-config/src/tests.rs`
* `magicblock-replicator/README.md`
* `test-integration/configs/api-conf.ephem.toml`
* `test-integration/configs/chainlink-conf.devnet.toml`
* `test-integration/configs/claim-fees-test.toml`
* `test-integration/configs/cloning-conf.devnet.toml`
* `test-integration/configs/cloning-conf.ephem.toml`
* `test-integration/configs/committor-conf.devnet.toml`
* `test-integration/configs/config-conf.devnet.toml`
* `test-integration/configs/restore-ledger-conf.devnet.toml`
* `test-integration/configs/schedulecommit-conf-fees.ephem.toml`
* `test-integration/configs/schedulecommit-conf.devnet.toml`
* `test-integration/configs/schedulecommit-conf.ephem.frequent-commits.toml`
* `test-integration/configs/schedulecommit-conf.ephem.toml`
* `test-integration/configs/validator-offline.devnet.toml`
* `test-integration/test-ledger-restore/src/lib.rs`

</details>

<details>
<summary>💤 Files with no reviewable changes (17)</summary>

* test-integration/configs/committor-conf.devnet.toml
* test-integration/configs/schedulecommit-conf.devnet.toml
* test-integration/configs/schedulecommit-conf.ephem.frequent-commits.toml
* test-integration/configs/config-conf.devnet.toml
* test-integration/configs/restore-ledger-conf.devnet.toml
* test-integration/configs/api-conf.ephem.toml
* test-integration/configs/chainlink-conf.devnet.toml
* test-integration/configs/cloning-conf.devnet.toml
* magicblock-config/src/consts.rs
* config.example.toml
* test-integration/configs/schedulecommit-conf-fees.ephem.toml
* test-integration/configs/validator-offline.devnet.toml
* test-integration/configs/claim-fees-test.toml
* test-integration/configs/schedulecommit-conf.ephem.toml
* magicblock-config/src/config/accounts.rs
* test-integration/configs/cloning-conf.ephem.toml
* magicblock-config/src/tests.rs

</details>

</details>


Comment on lines +295 to +318
```rust
pub fn take_snapshot(self: &Arc<Self>, slot: u64) -> u64 {
    let this = self.clone();

    // Phase 1: Create snapshot directory (with write lock)
    let locked = this.write_lock.write();
    this.flush();
    // SAFETY:
    // we have acquired the write lock above
    let checksum = unsafe { this.checksum() };
    let used_storage = this.storage.active_segment();

    let snapshot_dir = this
        .snapshot_manager
        .create_snapshot_dir(slot, used_storage);
    drop(locked);
    thread::spawn(move || {
        // Acquire write lock to ensure consistent state capture
        let write_guard = this.write_lock.write();
        this.flush();

        // Capture the active memory map region for the snapshot
        let used_storage = this.storage.active_segment();

        let _ = this.snapshot_manager.create_snapshot(
            slot,
            used_storage,
            write_guard,
        );
        // Phase 2: Archive directory (no lock needed)
        let _ = snapshot_dir
            .and_then(|dir| {
                this.snapshot_manager.archive_and_register(&dir)
            })
            .log_err(|| "failed to create accountsdb snapshot");
    });
    checksum
```
⚠️ Potential issue | 🟠 Major

Don't return a checksum when phase 1 already failed.

create_snapshot_dir() can fail before the background thread is spawned, but take_snapshot() still returns a checksum as if the snapshot request succeeded. That hides local snapshot-I/O failures from the replication flow. Please make this API fallible and return early on phase-1 errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/lib.rs` around lines 295 - 318, take_snapshot
currently computes and returns a checksum even when
snapshot_manager.create_snapshot_dir(slot, used_storage) fails; make
take_snapshot fallible (e.g., return Result<u64, E> or Option<u64>) and return
early on phase‑1 errors instead of a successful checksum. Specifically, call
snapshot_manager.create_snapshot_dir(slot, used_storage) while you still hold
the write lock (or otherwise check its Result) and if it returns Err propagate
that error immediately; only compute the checksum (unsafe { this.checksum() })
and spawn the background thread that calls archive_and_register(&dir) when
create_snapshot_dir succeeded, and update callers of take_snapshot to handle the
new Result/Option return type.
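
A minimal sketch of the fallible two-phase flow under these assumptions: `create_snapshot_dir` and the checksum constant are stand-ins for the real snapshot manager and locked checksum pass, and `io::Result` stands in for the crate's error type:

```rust
use std::io;

// Hypothetical stand-in: fails for slot 0 to simulate a disk error.
fn create_snapshot_dir(slot: u64) -> io::Result<String> {
    if slot == 0 {
        return Err(io::Error::new(io::ErrorKind::Other, "disk full"));
    }
    Ok(format!("/tmp/snapshot-{slot}"))
}

fn take_snapshot(slot: u64) -> io::Result<u64> {
    let checksum = 0xDEAD_BEEF_u64; // stands in for the locked checksum pass

    // Phase 1: propagate directory-creation errors instead of hiding them
    let dir = create_snapshot_dir(slot)?;

    // Phase 2: archiving runs in the background; only the fallible
    // phase-1 result gates whether a checksum is returned at all.
    std::thread::spawn(move || {
        let _ = dir; // archive_and_register(&dir) would run here
    });
    Ok(checksum)
}

fn main() {
    assert!(take_snapshot(0).is_err());
    assert_eq!(take_snapshot(42).unwrap(), 0xDEAD_BEEF);
    println!("phase-1 failures now surface to the caller");
}
```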

Comment on lines +103 to +119
```rust
pub fn create_snapshot_dir(
    &self,
    slot: u64,
    active_mem: &[u8],
    lock: RwLockWriteGuard<()>,
) -> AccountsDbResult<()> {
    let snap_path = self.generate_path(slot);

    // 1. Maintain retention policy
    self.prune_registry();

    // 2. Prepare Data Capture
    // If legacy copy, we must capture state while lock is held.
) -> AccountsDbResult<PathBuf> {
    let snap_path = self.slot_to_dir_path(slot);
    let memory_capture =
        matches!(self.strategy, SnapshotStrategy::LegacyCopy)
            .then(|| active_mem.to_vec())
            .unwrap_or_default();

    // 3. Execute Snapshot
    self.strategy
        .execute(&self.db_path, &snap_path, memory_capture, lock)
        .execute(&self.db_path, &snap_path, memory_capture)
        .log_err(|| "Snapshot failed")?;
    self.prune_registry();

    // 4. Register success
    self.registry.lock().push_back(snap_path);
    Ok(())
    Ok(snap_path)
```
⚠️ Potential issue | 🔴 Critical

Keep the registry sorted and prune only after successful registration.

find_and_remove_snapshot() relies on binary_search(), but push_back() keeps the registry ordered only when archives register in slot order. Background archiving can complete out of order, and the current pre-archive prune can delete the last good snapshot before the new archive is durable. Insert at the sorted position and enforce max_snapshots only after registration succeeds.

Suggested direction
```diff
     pub fn create_snapshot_dir(
         &self,
         slot: u64,
         active_mem: &[u8],
     ) -> AccountsDbResult<PathBuf> {
@@
         self.strategy
             .execute(&self.db_path, &snap_path, memory_capture)
             .log_err(|| "Snapshot failed")?;
-        self.prune_registry();
 
         Ok(snap_path)
     }
@@
     fn register_archive(&self, archive_path: PathBuf) {
         info!(archive_path = %archive_path.display(), "Snapshot registered");
-        self.registry.lock().push_back(archive_path);
+        let mut registry = self.registry.lock();
+        let index = registry.binary_search(&archive_path).unwrap_or_else(|i| i);
+        registry.insert(index, archive_path);
+        while registry.len() > self.max_snapshots {
+            let Some(path) = registry.pop_front() else {
+                break;
+            };
+            if let Err(e) = fs::remove_file(&path) {
+                warn!(path = %path.display(), error = ?e, "Failed to prune snapshot archive");
+            }
+        }
     }
```

Also applies to: 326-329

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 103 - 119,
create_snapshot_dir is currently calling prune_registry before reliably
inserting the new snapshot into the registry, and the registry is appended
unsorted which breaks find_and_remove_snapshot's binary_search when archives
complete out of order; change the registration step to insert the new snapshot
into the registry at the correct sorted position (use binary_search on slot to
compute insertion index rather than push_back) and only call prune_registry
after the insertion/reporting succeeds (i.e., after execute returns Ok and after
adding the snapshot to the registry), enforcing max_snapshots post-registration
so the registry remains sorted and pruning never removes the last valid snapshot
prematurely.
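
The sorted-insertion-then-prune ordering can be sketched with plain std types. Here `register_snapshot`, the `u64` slot keys, and the returned prune list are illustrative stand-ins for the crate's `PathBuf` registry and `fs::remove_file` cleanup:

```rust
use std::collections::VecDeque;

/// Insert `slot` at its sorted position, then prune oldest entries down
/// to `max_snapshots`, returning what was pruned.
fn register_snapshot(
    registry: &mut VecDeque<u64>,
    slot: u64,
    max_snapshots: usize,
) -> Vec<u64> {
    // binary_search keeps the deque sorted even when archives finish
    // out of order
    let index = registry.binary_search(&slot).unwrap_or_else(|i| i);
    registry.insert(index, slot);

    // Enforce retention only after the new snapshot is registered, so
    // pruning can never remove the last valid snapshot prematurely.
    let mut pruned = Vec::new();
    while registry.len() > max_snapshots {
        if let Some(old) = registry.pop_front() {
            pruned.push(old); // caller would fs::remove_file the archive here
        }
    }
    pruned
}

fn main() {
    let mut registry = VecDeque::from([10, 30]);
    // Archive for slot 20 completes late, after slot 30 already registered
    let pruned = register_snapshot(&mut registry, 20, 2);
    assert_eq!(registry, VecDeque::from([20, 30]));
    assert_eq!(pruned, vec![10]);
    println!("registry: {registry:?}, pruned: {pruned:?}");
}
```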

Comment on lines 220 to +230
```rust
pub fn restore_from_snapshot(
    &self,
    target_slot: u64,
) -> AccountsDbResult<u64> {
    let mut registry = self.registry.lock();
    let (chosen_archive, chosen_slot, index) =
        self.find_and_remove_snapshot(target_slot)?;

    let extracted_dir = self.extract_archive(&chosen_archive)?;
    self.atomic_swap(&extracted_dir)?;
    self.prune_invalidated_snapshots(index);
    let _ = fs::remove_file(&chosen_archive);
```
⚠️ Potential issue | 🟠 Major

Don't evict the chosen snapshot before restore commits.

find_and_remove_snapshot() mutates the in-memory registry before extract_archive() and atomic_swap() run. If either step fails, the archive is still on disk but the current process can no longer pick it for another restore attempt. Remove it from the registry only after the swap succeeds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 220 - 230, The code
currently calls find_and_remove_snapshot() which mutates the in-memory registry
before extract_archive() and atomic_swap(), causing the snapshot to be evicted
even if restore fails; change the flow so restore_from_snapshot finds the
snapshot without removing it (or modify find_and_remove_snapshot to only
mark/return the candidate but not mutate) and only remove/prune the registry
entry and fs::remove_file(&chosen_archive) after atomic_swap() completes
successfully; keep using the same identifiers (chosen_archive, chosen_slot,
index) but defer prune_invalidated_snapshots(index) and the file removal until
after extract_archive() and atomic_swap() succeed so failed restores leave the
snapshot selectable for retry.
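
The deferred-eviction flow can be sketched like this; `restore` stands in for the extract-plus-atomic-swap step, and the `u64` slots and `String` errors are illustrative, not the crate's real types:

```rust
/// Look up a snapshot without mutating the registry, and evict it only
/// after the restore steps succeed.
fn restore_from_snapshot(
    registry: &mut Vec<u64>,
    target_slot: u64,
    restore: impl Fn(u64) -> Result<(), String>,
) -> Result<u64, String> {
    // Find the newest snapshot at or below the target without removing it
    let index = registry
        .iter()
        .rposition(|&slot| slot <= target_slot)
        .ok_or_else(|| format!("no snapshot for slot {target_slot}"))?;
    let chosen_slot = registry[index];

    // extract_archive + atomic_swap stand-in; on failure the registry
    // is untouched
    restore(chosen_slot)?;

    // Only now is it safe to prune: a failed restore above left the
    // entry selectable for retries
    registry.remove(index);
    Ok(chosen_slot)
}

fn main() {
    let mut registry = vec![10, 20, 30];

    // A failed restore leaves the snapshot available for another attempt
    let err = restore_from_snapshot(&mut registry, 25, |_| Err("swap failed".into()));
    assert!(err.is_err());
    assert_eq!(registry, vec![10, 20, 30]);

    // A successful restore evicts the chosen entry afterwards
    let slot = restore_from_snapshot(&mut registry, 25, |_| Ok(())).unwrap();
    assert_eq!(slot, 20);
    assert_eq!(registry, vec![10, 30]);
}
```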

Comment on lines +239 to +245
```rust
fn validate_archive(bytes: &[u8]) -> AccountsDbResult<()> {
    let cursor = Cursor::new(bytes);
    let dec = GzDecoder::new(cursor);
    let mut tar = Archive::new(dec);
    tar.entries()
        .log_err(|| "Invalid snapshot archive: not a valid gzip tar")?;
    Ok(())
```
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

tmp="$(mktemp)"
if ! curl -fsSL 'https://docs.rs/crate/tar/latest/source/src/archive.rs?plain=1' > "$tmp"; then
  curl -fsSL https://raw.githubusercontent.com/alexcrichton/tar-rs/master/src/archive.rs > "$tmp"
fi

rg -n -A6 -B2 'pub fn entries' "$tmp"
```

Repository: magicblock-labs/magicblock-validator

Length of output: 62


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

# Try alternative URLs for tar-rs source code
echo "Attempting to fetch tar-rs source from alternative locations..."

tmp1="$(mktemp)"
tmp2="$(mktemp)"
tmp3="$(mktemp)"

# Try GitHub releases/main branch
curl -fsSL "https://raw.githubusercontent.com/alexcrichton/tar-rs/main/src/archive.rs" > "$tmp1" 2>/dev/null || echo "main branch failed"

# Try crates.io source link
curl -fsSL "https://docs.rs/tar/*/source/src/archive.rs" > "$tmp2" 2>/dev/null || echo "docs.rs failed"

# Check if any succeeded and search for entries
for tmp in "$tmp1" "$tmp2"; do
  if [ -s "$tmp" ]; then
    echo "=== Found tar source, searching for entries() ==="
    rg -n -A10 'pub fn entries' "$tmp" || echo "No match found"
    break
  fi
done
```

Repository: magicblock-labs/magicblock-validator

Length of output: 1021


🌐 Web query:

tar-rs Archive::entries() implementation lazy evaluation

💡 Result:

Yes, the tar-rs Archive::entries implementation uses lazy evaluation. It returns a struct Entries<'a, R> that implements Iterator<Item = io::Result<Entry<'a, R>>>. The iterator's next method is implemented by delegating to self.fields.next, where EntriesFields holds references to the Archive's reader (via RefCell) and position tracking state. Entries are yielded sequentially as the iterator advances, reading headers and preparing lazy Entry readers (EntryFields with EntryIo variants like Data(EntryData) that read from the shared underlying reader on demand via Read impl). This design ensures streaming/lazy processing without loading the entire archive into memory upfront, but requires sequential consumption to avoid corrupting entry contents due to shared mutable reader state. For seekable readers, entries_with_seek uses seeking to skip data more efficiently while maintaining laziness.

Citations:


validate_archive() doesn't actually walk the archive.

tar::Archive::entries() is lazy; constructing the iterator does not force a full tar/gzip read. Truncated or corrupt archives can still pass this check and only fail later during extract/restore. Iterate the entries and drain them before accepting the bytes.

Possible fix
```diff
     fn validate_archive(bytes: &[u8]) -> AccountsDbResult<()> {
         let cursor = Cursor::new(bytes);
         let dec = GzDecoder::new(cursor);
         let mut tar = Archive::new(dec);
-        tar.entries()
-            .log_err(|| "Invalid snapshot archive: not a valid gzip tar")?;
+        let entries = tar
+            .entries()
+            .log_err(|| "Invalid snapshot archive: not a valid gzip tar")?;
+        for entry in entries {
+            let mut entry = entry
+                .log_err(|| "Invalid snapshot archive: failed to read tar entry")?;
+            io::copy(&mut entry, &mut io::sink())
+                .log_err(|| "Invalid snapshot archive: truncated tar entry")?;
+        }
         Ok(())
     }
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```rust
fn validate_archive(bytes: &[u8]) -> AccountsDbResult<()> {
    let cursor = Cursor::new(bytes);
    let dec = GzDecoder::new(cursor);
    let mut tar = Archive::new(dec);
    tar.entries()
        .log_err(|| "Invalid snapshot archive: not a valid gzip tar")?;
    Ok(())
fn validate_archive(bytes: &[u8]) -> AccountsDbResult<()> {
    let cursor = Cursor::new(bytes);
    let dec = GzDecoder::new(cursor);
    let mut tar = Archive::new(dec);
    let entries = tar
        .entries()
        .log_err(|| "Invalid snapshot archive: not a valid gzip tar")?;
    for entry in entries {
        let mut entry = entry
            .log_err(|| "Invalid snapshot archive: failed to read tar entry")?;
        io::copy(&mut entry, &mut io::sink())
            .log_err(|| "Invalid snapshot archive: truncated tar entry")?;
    }
    Ok(())
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 239 - 245, The
validate_archive function currently only constructs the tar::Archive iterator
which is lazy; update validate_archive(bytes: &[u8]) to fully iterate and drain
the archive so gzip/tar decoding errors surface: create the Cursor and GzDecoder
as before (Cursor::new(bytes), GzDecoder::new(cursor)), call tar.entries()? to
get the iterator, then loop over entries (e.g., for entry in tar.entries()? {
let mut e = entry.map_err(...)?; read or drain each entry's contents to EOF such
as by copying to /dev/null or read_to_end into a buffer) so any
corrupt/truncated data causes an error which you propagate (using the same
AccountsDbResult error path and log_err wrapper used today). Ensure you preserve
the current error message context ("Invalid snapshot archive: not a valid gzip
tar") when mapping/propagating errors from entries or reading entry bodies.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
magicblock-config/src/config/validator.rs (1)

27-37: ⚠️ Potential issue | 🟠 Major

Preserve backward compatibility for ReplicationMode config values.

This serde rename policy changes accepted enum strings (Standalone/StandBy/ReplicateOnly → kebab-case forms). Existing configs can fail deserialization and prevent startup. Please add aliases (or migration handling) for legacy values.

Proposed compatibility patch
```diff
 #[derive(Deserialize, Serialize, Debug, Clone)]
 #[serde(rename_all = "kebab-case")]
 pub enum ReplicationMode {
+    #[serde(alias = "Standalone")]
     Standalone,
+    #[serde(alias = "StandBy")]
     StandBy(Url, SerdePubkey),
+    #[serde(alias = "ReplicateOnly")]
     ReplicateOnly(Url, SerdePubkey),
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-config/src/config/validator.rs` around lines 27 - 37, The enum
ReplicationMode now uses serde(rename_all = "kebab-case") which breaks configs
that use the legacy variant names; update ReplicationMode to accept legacy names
by adding serde alias attributes to each variant (e.g., add #[serde(alias =
"Standalone")] to the Standalone variant, #[serde(alias = "StandBy")] to
StandBy(Url, SerdePubkey), and #[serde(alias = "ReplicateOnly")] to
ReplicateOnly(Url, SerdePubkey)) so deserialization accepts both the new
kebab-case names and the old legacy names; this keeps backward compatibility
without a custom deserializer.
♻️ Duplicate comments (1)
magicblock-accounts-db/src/lib.rs (1)

294-317: ⚠️ Potential issue | 🟠 Major

Propagate phase-1 snapshot failures instead of returning a checksum.

create_snapshot_dir() can fail before the archiver starts, but take_snapshot() still returns a successful checksum and only logs the error later from the spawned thread. That hides local snapshot-I/O failures from the replication path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/lib.rs` around lines 294 - 317, The current
take_snapshot(self: &Arc<Self>, slot: u64) -> u64 hides phase-1 errors because
create_snapshot_dir can fail but the function still returns checksum; change
take_snapshot to return a Result<u64, E> (or appropriate SnapshotError) so
phase-1 I/O failures are propagated. Concretely: call
this.snapshot_manager.create_snapshot_dir(slot, used_storage) and if it returns
Err, return Err immediately instead of spawning the thread; if Ok(dir) then
spawn the thread that calls this.snapshot_manager.archive_and_register(&dir) and
finally return Ok(checksum). Update all callers of take_snapshot to handle the
Result. Use the existing symbols take_snapshot, create_snapshot_dir,
snapshot_manager.archive_and_register to locate and apply the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/ISSUE_TEMPLATE/bug.yml:
- Around line 26-37: The deployment dropdown (type: dropdown, id: deployment)
should capture extra details when "Other" is selected; add an optional free-text
field (e.g., type: input, id: deployment_other, attributes.label: "If Other,
describe deployment") and wire it so it's shown/available when deployment ==
"Other" with no required validation, ensuring the new field's id and label are
descriptive for triage and easy to locate in the template.
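
A minimal sketch of that optional free-text field. Note that GitHub issue forms have no conditional visibility, so the input is always rendered and simply left optional; the `id`, `label`, and `placeholder` below are suggestions, not values from the template:

```yaml
  - type: input
    id: deployment_other
    attributes:
      label: If Other, describe your deployment
      placeholder: e.g. bare-metal k3s behind a reverse proxy
    validations:
      required: false
```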

In @.github/ISSUE_TEMPLATE/good-first-issue.yml:
- Around line 22-30: The acceptance criteria placeholder under the form field
with id "acceptance" is too minimal; replace the generic "- [ ]" placeholder
with a brief, explicit example checklist to help contributors (e.g., "- [ ]
Reproduced the bug; - [ ] Added tests; - [ ] Updated README"). Update the
"placeholder" attribute for id: acceptance to include that concrete example text
so new issues show a useful, copyable acceptance-criteria template.

In @.github/PULL_REQUEST_TEMPLATE.md:
- Around line 1-10: The file begins with an HTML comment before the first
markdown heading which triggers markdownlint MD041; fix this by ensuring the
document starts with a top-level H1 before any comments—either add a leading "#
Pull Request" (or similar) as the first line, or move the existing title-rule
HTML comment to follow the "## Summary" line; update the file around the
existing "## Summary" heading and the HTML comment so the first non-empty line
is the H1.
- Around line 13-15: The markdown heading "## Breaking Changes" is immediately
followed by list items; insert a single blank line after the "## Breaking
Changes" heading so the heading and the subsequent checklist are separated
(i.e., add one empty line between the "## Breaking Changes" line and the "- [ ]
None" list item).

In `@magicblock-accounts-db/src/tests.rs`:
- Around line 231-242: The test contains a duplicated archive
existence/assertion block that reconstructs archive_path using
env.snapshot_manager.database_path() and SNAPSHOT_SLOT and repeats the same
assert!(archive_path.exists()) and assert!(archive_path.is_file()); remove the
redundant second block so only the original assertions remain (locate the
duplicated code that defines archive_path and the two asserts and delete that
duplicate).

In `@magicblock-replicator/src/nats/broker.rs`:
- Around line 137-144: The current code in broker.rs materializes the entire
snapshot into memory by creating a Vec with capacity info.size and calling
object.read_to_end(&mut data) in the Snapshot construction; instead, change
Snapshot handling to stream the object to disk (or provide a streaming reader)
rather than allocating info.size bytes: open a temporary file (or return a file
path/AsyncRead impl) and use async streaming copy (e.g., tokio::fs::File +
tokio::io::copy or tokio_util::io helpers) to write the object body
incrementally, set self.sequence = meta.sequence as before, and construct
Snapshot to reference the temp file/reader rather than an in-memory Vec (update
the Snapshot type accordingly where used).
- Around line 85-101: The init_resources() function currently always calls
create_object_store() and create_key_value(), which fail if the buckets already
exist; change it to use a get-first-or-create pattern: call
ctx.get_object_store(cfg::SNAPSHOTS) and only call ctx.create_object_store(...)
if the get returns a NotFound error (propagate other errors), and likewise call
ctx.get_key_value(cfg::PRODUCER_LOCK) and only call ctx.create_key_value(...)
when NotFound is returned; also update the init_resources() docstring (around
the current line 34) to state that resources are created idempotently by first
attempting to get the existing object store and KV and creating them only on
NotFound.

In `@magicblock-replicator/src/nats/consumer.rs`:
- Around line 61-75: The messages() loop currently retries
self.inner.stream().max_messages_per_batch(cfg::BATCH_SIZE).messages().await in
a tight loop and will spam warnings on JetStream outages; change it to either
(A) return a Result<MessageStream, Error> so the caller can decide
backoff/shutdown by updating the signature of messages() and propagating the
Err(error) instead of swallowing it, or (B) implement an exponential backoff
with jitter inside messages() (use tokio::time::sleep, start with a small delay
and cap it) and log backoff attempts before retrying; locate the retry logic in
messages(), reference self.inner.stream() and cfg::BATCH_SIZE, and apply one of
these fixes consistently (do not both swallow errors and busy-loop).
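
The capped exponential backoff from option (B) can be sketched with std types; `BASE` and `CAP` are illustrative constants, jitter is omitted (it would need a randomness source), and the async service would `tokio::time::sleep(delay)` between attempts:

```rust
use std::time::Duration;

const BASE: Duration = Duration::from_millis(250);
const CAP: Duration = Duration::from_secs(5);

/// Double the delay after each consecutive failure, up to CAP.
fn next_delay(current: Duration) -> Duration {
    (current * 2).min(CAP)
}

fn main() {
    // On each consecutive failure the consumer would sleep `delay`
    // before retrying, and reset to BASE after the first success.
    let mut delay = BASE;
    let mut schedule = Vec::new();
    for _ in 0..6 {
        schedule.push(delay);
        delay = next_delay(delay);
    }
    // 250ms, 500ms, 1s, 2s, 4s, then capped at 5s
    assert_eq!(schedule.last(), Some(&Duration::from_secs(5)));
    println!("{schedule:?}");
}
```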

In `@magicblock-replicator/src/nats/lock_watcher.rs`:
- Around line 20-37: The new(...) constructor in lock_watcher.rs currently
hot-loops on errors from broker.ctx.get_key_value and store.watch; add a backoff
delay between retries to avoid log flooding and CPU spin. Modify the loop in
pub(crate) async fn new(broker: &Broker) to track a retry delay (e.g., start
250–500ms, multiply on each failure up to a max like 5–10s, optionally add small
jitter), and await tokio::time::sleep(delay) before continuing after any Err
branch (both the get_key_value and watch error handling), resetting the delay to
the base value on success (e.g., after Ok(s) or Ok(w)). Ensure you reference the
same function name new and the variables broker, store, and watch so the change
is localized to this retry loop.
- Around line 41-59: The method wait_for_expiry currently treats the watch
stream ending as a successful expiry; instead change its contract to surface
stream termination as an error so callers can distinguish transport failure from
a real Delete/Purge. Update pub async fn wait_for_expiry(&mut self) to return a
Result<(), E> (or a concrete error type you use) and, inside the loop, return
Ok(()) only on matches!(operation, Operation::Delete | Operation::Purge) while
returning Err(...) when self.watch.next().await yields None (and include context
in the error); keep the existing warn! log but do not return success on stream
end. Use the symbols wait_for_expiry, self.watch.next(), and Operation::Delete |
Operation::Purge to locate and modify the code.
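
A synchronous analog of the proposed contract, with an `Iterator` standing in for the async watch stream (`self.watch.next().await`); the `Operation` and `WatchError` types here are illustrative:

```rust
#[derive(Debug, PartialEq)]
enum Operation {
    Put,
    Delete,
    Purge,
}

#[derive(Debug, PartialEq)]
enum WatchError {
    StreamEnded,
}

/// Drain watch events until the lock key is deleted or purged; a stream
/// that ends without one is a transport failure, not an expiry.
fn wait_for_expiry<I: Iterator<Item = Operation>>(watch: I) -> Result<(), WatchError> {
    for operation in watch {
        if matches!(operation, Operation::Delete | Operation::Purge) {
            return Ok(()); // genuine lock expiry
        }
    }
    // The stream yielded None: surface it so the caller can rebuild the
    // watcher instead of treating termination as a successful expiry
    Err(WatchError::StreamEnded)
}

fn main() {
    let events = vec![Operation::Put, Operation::Delete];
    assert_eq!(wait_for_expiry(events.into_iter()), Ok(()));
    assert_eq!(
        wait_for_expiry(vec![Operation::Put].into_iter()),
        Err(WatchError::StreamEnded)
    );
    println!("stream termination is now distinguishable from expiry");
}
```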

In `@magicblock-replicator/src/service/context.rs`:
- Around line 98-101: The formatted error message in context.rs constructs msg
with expected and actual checksum reversed; update the format string in the let
msg = format!(...) call (referencing msg, sb.slot and sb.checksum) so it prints
"expected {expected}, got {actual}" with the expected value being the known good
checksum and the actual value coming from sb.checksum (i.e., swap the
placeholders/arguments or reorder the interpolation) to correctly reflect
expected vs. observed values.
- Around line 121-128: The methods enter_replica_mode and enter_primary_mode
currently ignore send errors (let _ = ...) which can silently fail; change both
to return Result<(), crate::ErrorType> (or anyhow::Result) and propagate the
error from self.mode_tx.send(...).await so callers can handle failures; update
callers (into_primary and into_standby) to await and propagate or handle the
Result (i.e., bubble up errors from enter_replica_mode/enter_primary_mode and
adjust into_primary/into_standby signatures to return Result as needed) so
mode-change send failures are not dropped.
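
The propagation pattern can be shown with a std mpsc channel standing in for the service's async channel (the real code would `.send(...).await` on a tokio sender the same way); `Mode` and `enter_replica_mode` here are illustrative:

```rust
use std::sync::mpsc::{channel, SendError, Sender};

#[derive(Debug, PartialEq)]
enum Mode {
    Primary,
    Replica,
}

/// Return the send result instead of discarding it with `let _ = ...`.
fn enter_replica_mode(mode_tx: &Sender<Mode>) -> Result<(), SendError<Mode>> {
    mode_tx.send(Mode::Replica)
}

fn main() {
    let (tx, rx) = channel();
    enter_replica_mode(&tx).expect("receiver alive");
    assert_eq!(rx.recv().unwrap(), Mode::Replica);

    // Once the receiver is gone, the failure now surfaces to the caller
    drop(rx);
    assert!(enter_replica_mode(&tx).is_err());
    println!("mode-change send failures propagate");
}
```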

In `@magicblock-replicator/src/service/mod.rs`:
- Around line 110-115: The current-thread Tokio runtime is built without drivers
and the build error is unconditionally panicked; update the runtime construction
(where Builder::new_current_thread() is used) to enable required drivers (at
minimum enable_time() and enable_io() or simply enable_all()) so Primary::run()
and Standby::run() can safely use tokio::time::interval and broker I/O, and
change the .expect("Failed to build replication service runtime") to propagate
the Result error (return Err(...) or map the error into the function's Result)
rather than panicking so callers can handle build failures instead of aborting;
locate and modify the block that builds the runtime and the call to
runtime.block_on(self.run()) accordingly.

In `@magicblock-replicator/src/service/primary.rs`:
- Around line 82-90: The code advances the replay position unconditionally
(calling ctx.update_position(slot, index)) even for messages that require a
durable JetStream ack (ack == true when Message::SuperBlock), but
Broker::publish only waits for PubAck when ack is true; move the call to
ctx.update_position so it only happens after a successful durable publish
acknowledgement for messages with ack==true (i.e., await publish and, if ack was
requested and publish succeeded, then call ctx.update_position(slot, index));
use the existing ack variable, msg.slot_and_index(), Broker::publish() result,
and Message::SuperBlock to locate and implement this conditional update.
- Around line 42-47: The code currently drops the message when publish() fails
because msg has already been popped from self.messages; before calling
self.ctx.into_standby(...), requeue or persist the failed msg so it isn't lost:
inside the Some(msg) = self.messages.recv() branch, on Err(error) from
self.publish(msg).await attempt to push the same msg back onto self.messages
(e.g., using its async send or try_send equivalent) or write it to a durable
retry queue, logging any requeue error, and only then call
self.ctx.into_standby(self.messages, true). Ensure you reference the
recv()/publish() failure path and use the same msg instance (or clone it if
necessary) so the event stream can be replayed after demotion.

In `@magicblock-replicator/src/service/standby.rs`:
- Around line 70-75: The watcher currently conflates a real lock expiry with an
unexpected watch-stream end; update LockWatcher::wait_for_expiry() to return a
discriminated result (e.g., Result<ExpiryKind, WatcherError> or an enum like
Expiry::Deleted | Expiry::StreamEnded) so callers can distinguish a genuine
Delete/Purge from stream termination, then modify the standby loop that calls
self.watcher.wait_for_expiry() to match on that result: on a genuine expiry,
proceed to call self.ctx.try_acquire_producer().await and on success call
self.ctx.into_primary(...).await as before, but on StreamEnded/Err log
appropriately and stop/recreate the watcher (do not spin on
try_acquire_producer()); reference symbols: LockWatcher::wait_for_expiry,
self.watcher, self.ctx.try_acquire_producer(), and self.ctx.into_primary.

---

Outside diff comments:
In `@magicblock-config/src/config/validator.rs`:
- Around line 27-37: The enum ReplicationMode now uses serde(rename_all =
"kebab-case") which breaks configs that use the legacy variant names; update
ReplicationMode to accept legacy names by adding serde alias attributes to each
variant (e.g., add #[serde(alias = "Standalone")] to the Standalone variant,
#[serde(alias = "StandBy")] to StandBy(Url, SerdePubkey), and #[serde(alias =
"ReplicateOnly")] to ReplicateOnly(Url, SerdePubkey)) so deserialization accepts
both the new kebab-case names and the old legacy names; this keeps backward
compatibility without a custom deserializer.

---

Duplicate comments:
In `@magicblock-accounts-db/src/lib.rs`:
- Around line 294-317: The current take_snapshot(self: &Arc<Self>, slot: u64) ->
u64 hides phase-1 errors because create_snapshot_dir can fail but the function
still returns checksum; change take_snapshot to return a Result<u64, E> (or
appropriate SnapshotError) so phase-1 I/O failures are propagated. Concretely:
call this.snapshot_manager.create_snapshot_dir(slot, used_storage) and if it
returns Err, return Err immediately instead of spawning the thread; if Ok(dir)
then spawn the thread that calls
this.snapshot_manager.archive_and_register(&dir) and finally return
Ok(checksum). Update all callers of take_snapshot to handle the Result. Use the
existing symbols take_snapshot, create_snapshot_dir,
snapshot_manager.archive_and_register to locate and apply the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b98b3800-f56d-4a35-94ad-664b6bb69c80

📥 Commits

Reviewing files that changed from the base of the PR and between 0adfe05 and c0c77aa.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (29)
  • .github/ISSUE_TEMPLATE/bug.yml
  • .github/ISSUE_TEMPLATE/feature.yml
  • .github/ISSUE_TEMPLATE/good-first-issue.yml
  • .github/PULL_REQUEST_TEMPLATE.md
  • Cargo.toml
  • magicblock-accounts-db/src/lib.rs
  • magicblock-accounts-db/src/tests.rs
  • magicblock-api/src/magic_validator.rs
  • magicblock-config/src/config/validator.rs
  • magicblock-core/src/link/transactions.rs
  • magicblock-processor/src/scheduler/mod.rs
  • magicblock-processor/src/scheduler/state.rs
  • magicblock-replicator/Cargo.toml
  • magicblock-replicator/src/error.rs
  • magicblock-replicator/src/lib.rs
  • magicblock-replicator/src/nats/broker.rs
  • magicblock-replicator/src/nats/consumer.rs
  • magicblock-replicator/src/nats/lock_watcher.rs
  • magicblock-replicator/src/nats/mod.rs
  • magicblock-replicator/src/nats/producer.rs
  • magicblock-replicator/src/nats/snapshot.rs
  • magicblock-replicator/src/proto.rs
  • magicblock-replicator/src/service/context.rs
  • magicblock-replicator/src/service/mod.rs
  • magicblock-replicator/src/service/primary.rs
  • magicblock-replicator/src/service/standby.rs
  • magicblock-replicator/src/tests.rs
  • magicblock-replicator/src/watcher.rs
  • test-kit/src/lib.rs

Comment on lines +26 to +37
  - type: dropdown
    id: deployment
    attributes:
      label: Deployment
      options:
        - Local (native)
        - Local (Docker)
        - Testnet
        - Mainnet
        - Other
    validations:
      required: true

🧹 Nitpick | 🔵 Trivial

Optional: capture details when deployment is “Other”.

Adding a small free-text field can reduce back-and-forth during triage for non-standard environments.

Suggested addition
   - type: dropdown
     id: deployment
     attributes:
       label: Deployment
       options:
         - Local (native)
         - Local (Docker)
         - Testnet
         - Mainnet
         - Other
     validations:
       required: true
+
+  - type: input
+    id: deployment_details
+    attributes:
+      label: Deployment details (if Other)
+      placeholder: "Describe custom environment"
+    validations:
+      required: false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/ISSUE_TEMPLATE/bug.yml around lines 26 - 37, The deployment dropdown
(type: dropdown, id: deployment) should capture extra details when "Other" is
selected; add an optional free-text field (e.g., type: input, id:
deployment_other, attributes.label: "If Other, describe deployment") and wire it
so it's shown/available when deployment == "Other" with no required validation,
ensuring the new field's id and label are descriptive for triage and easy to
locate in the template.

Comment on lines +22 to +30
  - type: textarea
    id: acceptance
    attributes:
      label: Acceptance Criteria
      description: How do we know this is done?
      placeholder: |
        - [ ]
    validations:
      required: false

🧹 Nitpick | 🔵 Trivial

Consider a more explicit acceptance-criteria example.

The current placeholder (- [ ]) is valid, but adding one concrete example usually improves issue quality for first-time contributors.

Suggested tweak
   - type: textarea
     id: acceptance
     attributes:
       label: Acceptance Criteria
       description: How do we know this is done?
       placeholder: |
-        - [ ]
+        - [ ] Added/updated tests for the changed behavior
+        - [ ] Documentation updated (if applicable)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/ISSUE_TEMPLATE/good-first-issue.yml around lines 22 - 30, The
acceptance criteria placeholder under the form field with id "acceptance" is too
minimal; replace the generic "- [ ]" placeholder with a brief, explicit example
checklist to help contributors (e.g., "- [ ] Reproduced the bug; - [ ] Added
tests; - [ ] Updated README"). Update the "placeholder" attribute for id:
acceptance to include that concrete example text so new issues show a useful,
copyable acceptance-criteria template.

Comment on lines 1 to +10
<!--
PR title must match:
type(scope): summary
PR title must match: type(scope): summary
Types: feat|fix|docs|chore|refactor|test|perf|ci|build
Examples:
fix: avoid panic on empty slot
feat(rpc): add getFoo endpoint
-->

## Summary
-
<!-- One sentence. Link to issue if applicable. -->

⚠️ Potential issue | 🟡 Minor

Make the first line a top-level heading to satisfy markdownlint

markdownlint warns that the file should start with an H1. Right now, an HTML comment appears before ## Summary, which triggers MD041. Move the title-rule comment below the first heading (or add a top-level heading on line 1).

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 9-9: First line in a file should be a top-level heading

(MD041, first-line-heading, first-line-h1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/PULL_REQUEST_TEMPLATE.md around lines 1 - 10, The file begins with
an HTML comment before the first markdown heading which triggers markdownlint
MD041; fix this by ensuring the document starts with a top-level H1 before any
comments—either add a leading "# Pull Request" (or similar) as the first line,
or move the existing title-rule HTML comment to follow the "## Summary" line;
update the file around the existing "## Summary" heading and the HTML comment so
the first non-empty line is the H1.

Comment on lines +13 to +15
## Breaking Changes
- [ ] None
- [ ] Yes — migration path described below

⚠️ Potential issue | 🟡 Minor

Add a blank line after ## Breaking Changes

At line 13, the heading is immediately followed by list items. Add one blank line below the heading to satisfy MD022 and keep formatting consistent.

Suggested patch
 ## Breaking Changes
+
 - [ ] None
 - [ ] Yes — migration path described below
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 13-13: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/PULL_REQUEST_TEMPLATE.md around lines 13 - 15, The markdown heading
"## Breaking Changes" is immediately followed by list items; insert a single
blank line after the "## Breaking Changes" heading so the heading and the
subsequent checklist are separated (i.e., add one empty line between the "##
Breaking Changes" line and the "- [ ] None" list item).

Comment on lines +231 to +242
// Verify archive file exists (not directory)
let archive_path = env
    .snapshot_manager
    .database_path()
    .parent()
    .unwrap()
    .join(format!("snapshot-{:0>12}.tar.gz", SNAPSHOT_SLOT));
assert!(archive_path.exists(), "Archive file should exist");
assert!(
    archive_path.is_file(),
    "Snapshot should be a file, not directory"
);

🧹 Nitpick | 🔵 Trivial

Remove the duplicated archive assertion block.

Lines 231-242 repeat the exact same path construction and assertions from Lines 218-229, so this adds no coverage and will drift on future filename changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/tests.rs` around lines 231 - 242, The test
contains a duplicated archive existence/assertion block that reconstructs
archive_path using env.snapshot_manager.database_path() and SNAPSHOT_SLOT and
repeats the same assert!(archive_path.exists()) and
assert!(archive_path.is_file()); remove the redundant second block so only the
original assertions remain (locate the duplicated code that defines archive_path
and the two asserts and delete that duplicate).

Comment on lines +110 to +115
let runtime = Builder::new_current_thread()
    .thread_name("replication-service")
    .build()
    .expect("Failed to build replication service runtime");

runtime.block_on(tokio::task::unconstrained(self.run()))

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check the file and lines in question
cat -n magicblock-replicator/src/service/mod.rs | sed -n '100,125p'

Repository: magicblock-labs/magicblock-validator

Length of output: 822


🏁 Script executed:

# Check the Primary::run() method for tokio::time::interval usage
fd -t f "primary.rs" magicblock-replicator/ | head -5

Repository: magicblock-labs/magicblock-validator

Length of output: 124


🏁 Script executed:

# Check the Standby::run() method
fd -t f "standby.rs" magicblock-replicator/ | head -5

Repository: magicblock-labs/magicblock-validator

Length of output: 124


🏁 Script executed:

# Check the broker code for Tokio I/O
fd -t f "broker.rs" magicblock-replicator/ | head -5

Repository: magicblock-labs/magicblock-validator

Length of output: 120


🏁 Script executed:

# Check Primary::run() for tokio::time::interval usage
cat -n magicblock-replicator/src/service/primary.rs | head -100

Repository: magicblock-labs/magicblock-validator

Length of output: 3764


🏁 Script executed:

# Check Standby::run() for tokio::time::interval usage
cat -n magicblock-replicator/src/service/standby.rs | head -100

Repository: magicblock-labs/magicblock-validator

Length of output: 4138


🏁 Script executed:

# Check broker.rs for async I/O
cat -n magicblock-replicator/src/nats/broker.rs | head -80

Repository: magicblock-labs/magicblock-validator

Length of output: 3352


🏁 Script executed:

# Verify the spawn() return type and context
rg -A 5 "pub fn spawn" magicblock-replicator/src/service/mod.rs

Repository: magicblock-labs/magicblock-validator

Length of output: 372


Build this runtime with Tokio drivers enabled and return the build failure.

This current-thread runtime is built without enable_all(), but Primary::run() (line 38) and Standby::run() (line 51) both use tokio::time::interval, and the broker relies on Tokio I/O. Without the time and I/O drivers, these will panic at runtime. Additionally, line 113's .expect() converts a recoverable error into a panic despite spawn() returning JoinHandle<Result<()>>, violating the MAJOR guideline for this codebase.

Suggested fix
             let runtime = Builder::new_current_thread()
                 .thread_name("replication-service")
-                .build()
-                .expect("Failed to build replication service runtime");
+                .enable_all()
+                .build()
+                .map_err(|error| crate::Error::Internal(format!(
+                    "failed to build replication service runtime: {error}"
+                )))?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-replicator/src/service/mod.rs` around lines 110 - 115, The
current-thread Tokio runtime is built without drivers and the build error is
unconditionally panicked; update the runtime construction (where
Builder::new_current_thread() is used) to enable required drivers (at minimum
enable_time() and enable_io() or simply enable_all()) so Primary::run() and
Standby::run() can safely use tokio::time::interval and broker I/O, and change
the .expect("Failed to build replication service runtime") to propagate the
Result error (return Err(...) or map the error into the function's Result)
rather than panicking so callers can handle build failures instead of aborting;
locate and modify the block that builds the runtime and the call to
runtime.block_on(self.run()) accordingly.

Comment on lines +42 to +47
Some(msg) = self.messages.recv() => {
    if let Err(error) = self.publish(msg).await {
        // publish should not easily fail, if that happens, it means
        // the message broker has become unrecoverably unreachable
        warn!(%error, "failed to publish the message");
        return self.ctx.into_standby(self.messages, true).await;

⚠️ Potential issue | 🔴 Critical

Don't drop the message that triggered demotion.

Line 42 has already removed msg from the channel. If publish() fails here, Line 47 demotes without retrying or requeueing it, so recovery now depends on a later full snapshot instead of the event stream.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-replicator/src/service/primary.rs` around lines 42 - 47, The code
currently drops the message when publish() fails because msg has already been
popped from self.messages; before calling self.ctx.into_standby(...), requeue or
persist the failed msg so it isn't lost: inside the Some(msg) =
self.messages.recv() branch, on Err(error) from self.publish(msg).await attempt
to push the same msg back onto self.messages (e.g., using its async send or
try_send equivalent) or write it to a durable retry queue, logging any requeue
error, and only then call self.ctx.into_standby(self.messages, true). Ensure you
reference the recv()/publish() failure path and use the same msg instance (or
clone it if necessary) so the event stream can be replayed after demotion.

Comment on lines +70 to +75
_ = self.watcher.wait_for_expiry() => {
    info!("leader lock expired, attempting takeover");
    if let Ok(Some(producer)) = self.ctx.try_acquire_producer().await {
        info!("acquired leadership, promoting");
        return self.ctx.into_primary(producer, self.messages).await;
    }

⚠️ Potential issue | 🟠 Major

Don't treat a dead watch stream as lock expiry.

LockWatcher::wait_for_expiry() also returns when its watch stream ends unexpectedly (magicblock-replicator/src/nats/lock_watcher.rs:41-59). From this branch, that case is indistinguishable from a real Delete/Purge, and once the watcher is exhausted you'll spin on try_acquire_producer() every loop iteration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-replicator/src/service/standby.rs` around lines 70 - 75, The
watcher currently conflates a real lock expiry with an unexpected watch-stream
end; update LockWatcher::wait_for_expiry() to return a discriminated result
(e.g., Result<ExpiryKind, WatcherError> or an enum like Expiry::Deleted |
Expiry::StreamEnded) so callers can distinguish a genuine Delete/Purge from
stream termination, then modify the standby loop that calls
self.watcher.wait_for_expiry() to match on that result: on a genuine expiry,
proceed to call self.ctx.try_acquire_producer().await and on success call
self.ctx.into_primary(...).await as before, but on StreamEnded/Err log
appropriately and stop/recreate the watcher (do not spin on
try_acquire_producer()); reference symbols: LockWatcher::wait_for_expiry,
self.watcher, self.ctx.try_acquire_producer(), and self.ctx.into_primary.

Comment on lines +104 to +120
let result = match message {
    Message::Transaction(tx) => {
        self.replay_tx(tx.slot, tx.index, tx.payload).await
    }
    Message::Block(block) => self.ctx.write_block(&block).await,
    Message::SuperBlock(sb) => {
        self.ctx.verify_checksum(&sb).inspect_err(|error|
            error!(slot, %error, "accountsdb state has diverged")
        )
    }
};

if let Err(error) = result {
    warn!(slot, index, %error, "message processing error");
    return;
}
self.ctx.update_position(slot, index);

⚠️ Potential issue | 🔴 Critical

Stop this standby on replay/apply failures.

Lines 116-119 only warn and keep consuming. A failed transaction replay, block write, or superblock verification leaves a known-bad replica running and still eligible for promotion later. Bubble this out of run() and force rebootstrap instead of continuing.



Development

Successfully merging this pull request may close these issues.

Replication

1 participant