
feat(accountsdb): implement external snapshot insertion#1031

Merged
bmuddha merged 8 commits into bmuddha/epic/replication-service from bmuddha/accountsdb/archive-snapshots
Mar 31, 2026

Conversation

Collaborator
@bmuddha bmuddha commented Mar 9, 2026

Summary

  • Implemented a new accountsdb feature that allows insertion of external snapshots, fast-forwarding state when the snapshot is newer than the current accountsdb. A second feature, snapshot archival, stores snapshots in archived form on disk to facilitate replication.

Compatibility

  • Breaking change: snapshots are now stored as tar archives.

Testing

  • new tests were added

Checklist

Summary by CodeRabbit

  • New Features

    • Snapshots are now produced as compressed tar.gz archives.
    • Two‑phase snapshot workflow: create snapshot directory, then archive in background.
    • Import external snapshot archives with optional fast‑forward; manual restore still supported.
  • Tests

    • Added end‑to‑end tests for archival snapshots, orphan‑directory cleanup, external snapshot import/fast‑forward, and deterministic snapshot control.
  • Chores

    • Removed the configurable snapshot‑frequency option; snapshot timing now follows built‑in defaults.

Contributor

coderabbitai bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Replaces directory-only snapshots with tar.gz archives and adds tar/flate2 dependencies. Snapshot creation is two-phase: create snapshot directory while holding a write lock, then archive the directory to tar.gz in background and register the archive (removing the directory). SnapshotManager gains archive creation/validation/extraction/atomic-swap/pruning and external-archive insertion/fast-forward. AccountsDb API adds take_snapshot, insert_external_snapshot, lock_database, and an unsafe checksum variant. Tests updated to use deterministic snapshot slots and validate archive lifecycle and fast-forward behavior.

Assessment against linked issues

Objectives from issue #1030:

  • Archive snapshots into compressed archives for replication and space savings
  • Perform archival in background without holding the write lock / without blocking block production
  • Remove the original snapshot directory after archiving to save space
  • Provide a mechanism to import/send snapshots for replication (external snapshot insertion / fast-forward)

Out-of-scope changes

  • Code change: removal of the snapshot_frequency field and its related constant and config entries (magicblock-accounts-db/src/lib.rs; magicblock-config/src/config/accounts.rs; magicblock-config/src/consts.rs; config.example.toml; multiple test-integration config files).
    Explanation: deleting the configuration field and public constant changes the public config surface and tests, but is not required to implement archival; it modifies unrelated configuration behavior.

Suggested reviewers

  • GabrielePicco
  • thlorenz
  • Dodecahedr0x

Collaborator Author

bmuddha commented Mar 9, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magicblock-accounts-db/src/snapshot.rs`:
- Around line 226-246: In find_and_remove_snapshot, avoid
registry.remove(index).unwrap(): replace the unwrap with explicit handling by
matching registry.remove(index) (or using
registry.get(index).cloned().ok_or(...)? then remove) and return an
AccountsDbError::SnapshotMissing(target_slot) (or a more specific error) if
removal fails; ensure chosen_archive is a PathBuf and preserve existing parsing
via Self::parse_slot, keeping the info! log and returning Ok((chosen_archive,
chosen_slot, index)).
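The suggested handling can be sketched minimally as follows; `SnapshotError` here is a hypothetical stand-in for the crate's `AccountsDbError`, and the registry type is assumed to be a `VecDeque<PathBuf>` based on the review's references to `remove(index)`:

```rust
use std::collections::VecDeque;
use std::path::PathBuf;

// Hypothetical error type standing in for AccountsDbError in this sketch.
#[derive(Debug)]
enum SnapshotError {
    Missing(u64),
}

// Remove the archive at `index` without panicking: VecDeque::remove
// returns Option, so a stale index becomes an Err instead of a panic.
fn remove_snapshot(
    registry: &mut VecDeque<PathBuf>,
    index: usize,
    target_slot: u64,
) -> Result<PathBuf, SnapshotError> {
    registry
        .remove(index)
        .ok_or(SnapshotError::Missing(target_slot))
}
```

Even when the index was just produced by `binary_search` on the same locked registry, the explicit `ok_or` documents the invariant and keeps the failure path a recoverable error.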

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7207e584-ba2a-40d2-819f-84caa4f9f124

📥 Commits

Reviewing files that changed from the base of the PR and between 190bd7a and 99c2c1f.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • magicblock-accounts-db/Cargo.toml
  • magicblock-accounts-db/src/lib.rs
  • magicblock-accounts-db/src/snapshot.rs
  • magicblock-accounts-db/src/tests.rs

@bmuddha bmuddha self-assigned this Mar 9, 2026
@bmuddha bmuddha changed the base branch from bmuddha/scheduler/dual-mode to graphite-base/1031 March 10, 2026 09:32
@bmuddha bmuddha force-pushed the bmuddha/accountsdb/archive-snapshots branch from 99c2c1f to 43a9a23 on March 10, 2026 19:23
@bmuddha bmuddha force-pushed the graphite-base/1031 branch from 190bd7a to b71e317 on March 10, 2026 19:23
@bmuddha bmuddha changed the base branch from graphite-base/1031 to master March 10, 2026 19:23
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
magicblock-accounts-db/src/snapshot.rs (1)

226-246: ⚠️ Potential issue | 🟠 Major

Replace .unwrap() with proper error handling.

Line 240 uses .unwrap() on registry.remove(index). While the index is derived from binary_search on the same locked registry (making this practically safe), the coding guidelines require explicit error handling for any .unwrap() in production code.

🛠️ Suggested fix
-        let chosen_archive = registry.remove(index).unwrap();
+        let chosen_archive = registry.remove(index).ok_or_else(|| {
+            AccountsDbError::Internal(format!(
+                "Registry index {} out of bounds during snapshot lookup",
+                index
+            ))
+        })?;

As per coding guidelines: "Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 226 - 246, In
find_and_remove_snapshot, avoid the .unwrap() on registry.remove(index): replace
it with explicit handling (e.g., match or if let) so that if removal yields None
or an unexpected value you return
Err(AccountsDbError::SnapshotMissing(target_slot)) instead of panicking; keep
the rest of the flow (parse_slot -> ok_or(...), logging of
chosen_slot/target_slot, and returning (chosen_archive, chosen_slot, index))
unchanged and reference registry, chosen_archive, chosen_slot, and
Self::parse_slot to locate the exact change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magicblock-accounts-db/src/snapshot.rs`:
- Around line 291-302: fast_forward currently pushes the new archive into the
registry via registry.lock().push_back(...) without invoking pruning, which can
let the registry exceed max_snapshots; update fast_forward in snapshot.rs to
call the existing prune_registry() (or the method that enforces max_snapshots)
after pushing the new entry (or call it before/after atomic_swap as appropriate)
so the registry is trimmed to max_snapshots, referencing the fast_forward,
extract_archive, atomic_swap, prune_registry, and registry.lock().push_back
symbols to locate and modify the code.
- Around line 216-224: The current validate_archive(bytes: &[u8]) implementation
only calls tar.entries() which doesn't detect truncated archives; update
validate_archive to iterate the Archive::entries() iterator and for each entry
attempt to fully read or validate the entry (e.g., read its header and drain its
contents) so any I/O or truncated-data errors surface during validation, and
propagate/log those errors via AccountsDbResult (replace the single
tar.entries() check with a loop that calls a read/drain operation on each Entry
and propagates failures).

---

Duplicate comments:
In `@magicblock-accounts-db/src/snapshot.rs`:
- Around line 226-246: In find_and_remove_snapshot, avoid the .unwrap() on
registry.remove(index): replace it with explicit handling (e.g., match or if
let) so that if removal yields None or an unexpected value you return
Err(AccountsDbError::SnapshotMissing(target_slot)) instead of panicking; keep
the rest of the flow (parse_slot -> ok_or(...), logging of
chosen_slot/target_slot, and returning (chosen_archive, chosen_slot, index))
unchanged and reference registry, chosen_archive, chosen_slot, and
Self::parse_slot to locate the exact change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d51a36a2-27d5-4a9f-8242-6aebee487b2a

📥 Commits

Reviewing files that changed from the base of the PR and between 99c2c1f and 43a9a23.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • magicblock-accounts-db/Cargo.toml
  • magicblock-accounts-db/src/lib.rs
  • magicblock-accounts-db/src/snapshot.rs
  • magicblock-accounts-db/src/tests.rs

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 11, 2026

Manual Deploy Available

You can trigger a manual deploy of this PR branch to testnet:

Deploy to Testnet 🚀

Alternative: Comment /deploy on this PR to trigger deployment directly.

⚠️ Note: Manual deploy requires authorization. Only authorized users can trigger deployments.

Comment updated automatically when the PR is synchronized.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
magicblock-accounts-db/src/snapshot.rs (1)

46-58: ⚠️ Potential issue | 🔴 Critical

Do not release the global write lock before copying the live index files.

Line 56 resumes writers while deep_copy_dir is still copying everything except accounts.db from the live directory. On non-CoW filesystems, that can snapshot accounts.db at one point in time and the LMDB index at another, yielding a mismatched or unrecoverable archive.

🛠️ Minimal correctness fix
             Self::LegacyCopy => {
-                drop(lock); // Release lock before slow I/O
-                fs_backend::deep_copy_dir(src, dst, &memory_state)
+                fs_backend::deep_copy_dir(src, dst, &memory_state)?;
+                drop(lock);
+                Ok(())
             }

Based on learnings: "In magicblock-validator, the AccountsDb 'stop-the-world' synchronizer is managed at the processor/executor level... Snapshot operations acquire a write lock, blocking until all executors release their read locks."

Also applies to: 435-457

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 46 - 58, The LegacyCopy
branch in execute currently drops the RwLockWriteGuard (lock) before calling
fs_backend::deep_copy_dir, which allows writers to resume while the live index
files are still being copied; instead keep the write lock held for the entire
duration of the copy to ensure a consistent snapshot: remove the drop(lock) and
call fs_backend::deep_copy_dir while the RwLockWriteGuard<()>, passed as lock,
remains in scope (i.e., held) so the write lock isn't released until after
deep_copy_dir returns; apply the same change to the other execute-like copy path
referenced (the similar LegacyCopy handling around the other occurrence).
♻️ Duplicate comments (1)
magicblock-accounts-db/src/snapshot.rs (1)

228-241: ⚠️ Potential issue | 🟠 Major

Replace the production unwrap() with an explicit error path.

Even if the index is expected to exist, Line 240 should not panic in production code.

🛠️ Proposed fix
-        let chosen_archive = registry.remove(index).unwrap();
+        let Some(chosen_archive) = registry.remove(index) else {
+            return Err(AccountsDbError::Internal(format!(
+                "snapshot registry lost index {index} while restoring slot {target_slot}"
+            )));
+        };

As per coding guidelines: "Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 228 - 241, The code in
find_and_remove_snapshot uses registry.remove(index).unwrap(), which can panic
in production; replace the unwrap with an explicit error branch that returns an
AccountsDbResult::Err (e.g., AccountsDbError::SnapshotMissing(target_slot) or a
more specific error) when registry.remove(index) yields None, then proceed to
parse the chosen_archive via Self::parse_slot(&chosen_archive) as before; ensure
you also propagate any parse errors from Self::parse_slot by returning a proper
AccountsDbError rather than panicking.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magicblock-accounts-db/src/lib.rs`:
- Around line 397-408: The code currently swaps in the new snapshot as soon as
snapshot_manager.insert_external_snapshot returns true, then calls
self.storage.reload and self.index.reload which can fail and leave the DB
replaced; instead, change the flow so you validate the extracted snapshot before
deleting the backup or flipping db_path: modify insert_external_snapshot
behavior or add a new operation on SnapshotManager (e.g.
extract_external_snapshot or snapshot_manager.extracted_path()) that only
extracts and returns the extracted path without committing; then open/validate
that path (e.g. attempt to open a temporary Storage/Index instance or run a
light integrity check) and only if both validations succeed perform the actual
swap/commit (or call snapshot_manager.commit_fast_forward); alternatively, keep
the backup until both self.storage.reload(path) and self.index.reload(path)
succeed and on any error call snapshot_manager.restore_backup() to revert to the
previous db_path; reference snapshot_manager.insert_external_snapshot,
fast_forwarded, snapshot_manager.database_path, self.storage.reload,
self.index.reload when applying this change.
- Around line 286-289: Change the method signature of set_slot from taking
&Arc<Self> to taking &self and adjust its implementation to call
self.storage.update_slot(slot) as before; specifically, update the receiver on
the pub fn set_slot declaration to &self (not &Arc<Self>) so callers can call
set_slot on any &State without requiring an Arc, and leave the body using
storage.update_slot(slot) since storage already uses interior mutability.

In `@magicblock-accounts-db/src/snapshot.rs`:
- Around line 111-123: Call prune_registry only after the new snapshot/archive
has been durably created and recorded: move the prune_registry() invocation from
before SnapshotStrategy::execute to after strategy.execute(...) has returned Ok
and any registration of the new snapshot (the code path that makes the snapshot
visible/recorded) has completed, so you never delete the last good restore point
on a failed copy/archive; apply the same change to the other occurrence
referenced around lines 151-152 (the other prune_registry call), and ensure any
in-flight snapshot accounting (used to enforce max_snapshots) is updated before
pruning so overlapping background snapshots are counted correctly.
- Around line 342-368: The current startup recovery collects snapshot archive
paths into `paths`, trims to the newest `max` by computing `offset` and
returning `paths.into_iter().skip(offset).collect()`, but it does not remove the
trimmed files from disk causing orphaned archives; update this logic to remove
on-disk files for entries that will be dropped (e.g., iterate the older set:
`paths.into_iter().take(offset)`), call `fs::remove_file` (or
`fs::remove_dir_all` for any orphan dirs if applicable), log success/failure
(use the existing logger and include the path), and then return the remaining
paths (the ones after `offset`) so `parse_slot`, `SNAPSHOT_PREFIX`, `max`, and
`offset` behavior remain intact.
- Around line 304-307: register_archive currently always appends
(registry.lock().push_back(archive_path)) which breaks the binary-search
assumptions in find_and_remove_snapshot and snapshot_exists; change
register_archive to insert the new archive_path into registry in sorted order
instead of always pushing back: compute the slot for archive_path (reuse or call
the same helper used by snapshot_exists/find_and_remove_snapshot to extract slot
from a PathBuf), lock the registry, find the insertion index with
binary_search_by or partition_point comparing slots of existing PathBuf entries,
and insert at that index (registry.lock().insert(idx, archive_path)) so the
deque remains sorted by slot and binary-search lookups continue to work.
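Sorted insertion as suggested above can be sketched like this; the `parse_slot` naming scheme (`snapshot-<slot>.tar.gz`) and the `VecDeque<PathBuf>` registry type are assumptions for illustration:

```rust
use std::collections::VecDeque;
use std::path::PathBuf;

// Hypothetical stand-in for SnapshotManager::parse_slot: extracts the
// slot number from names like "snapshot-123.tar.gz".
fn parse_slot(path: &PathBuf) -> Option<u64> {
    path.file_name()?
        .to_str()?
        .strip_prefix("snapshot-")?
        .strip_suffix(".tar.gz")?
        .parse()
        .ok()
}

// Insert keeping the deque sorted by slot, so binary_search-based
// lookups stay valid even when an external archive arrives out of order.
fn register_archive_sorted(registry: &mut VecDeque<PathBuf>, archive: PathBuf) {
    let slot = parse_slot(&archive).unwrap_or(u64::MAX);
    let idx = registry.partition_point(|p| parse_slot(p).unwrap_or(0) <= slot);
    registry.insert(idx, archive);
}
```

`partition_point` finds the first index whose slot exceeds the new entry's, which also handles the ordinary append case (the new archive is newest) without a special path.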

---

Outside diff comments:
In `@magicblock-accounts-db/src/snapshot.rs`:
- Around line 46-58: The LegacyCopy branch in execute currently drops the
RwLockWriteGuard (lock) before calling fs_backend::deep_copy_dir, which allows
writers to resume while the live index files are still being copied; instead
keep the write lock held for the entire duration of the copy to ensure a
consistent snapshot: remove the drop(lock) and call fs_backend::deep_copy_dir
while the RwLockWriteGuard<()>, passed as lock, remains in scope (i.e., held) so
the write lock isn't released until after deep_copy_dir returns; apply the same
change to the other execute-like copy path referenced (the similar LegacyCopy
handling around the other occurrence).

---

Duplicate comments:
In `@magicblock-accounts-db/src/snapshot.rs`:
- Around line 228-241: The code in find_and_remove_snapshot uses
registry.remove(index).unwrap(), which can panic in production; replace the
unwrap with an explicit error branch that returns an AccountsDbResult::Err
(e.g., AccountsDbError::SnapshotMissing(target_slot) or a more specific error)
when registry.remove(index) yields None, then proceed to parse the
chosen_archive via Self::parse_slot(&chosen_archive) as before; ensure you also
propagate any parse errors from Self::parse_slot by returning a proper
AccountsDbError rather than panicking.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fdc6023b-6117-49c4-8dcb-ada0a3d68452

📥 Commits

Reviewing files that changed from the base of the PR and between 43a9a23 and 67b913f.

📒 Files selected for processing (7)
  • config.example.toml
  • magicblock-accounts-db/src/lib.rs
  • magicblock-accounts-db/src/snapshot.rs
  • magicblock-accounts-db/src/tests.rs
  • magicblock-config/src/config/accounts.rs
  • magicblock-config/src/consts.rs
  • magicblock-config/src/tests.rs
💤 Files with no reviewable changes (4)
  • magicblock-config/src/config/accounts.rs
  • magicblock-config/src/tests.rs
  • magicblock-config/src/consts.rs
  • config.example.toml

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test-integration/test-ledger-restore/src/lib.rs`:
- Line 61: The helper currently calls AccountsDbConfig::default(), which wipes
any reset flag set by the caller and can leave on-disk AccountsDb/snapshots that
make restores boot from stale state; change the creation of accountsdb_config so
it preserves the caller's reset_ledger flag (e.g., construct AccountsDbConfig
with reset: reset_ledger and the rest from Default::default()) instead of
calling AccountsDbConfig::default(); update the variable used by
setup_offline_validator(...) so the same config (AccountsDbConfig with reset set
from reset_ledger) is passed through.
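The construction the comment suggests is Rust's struct-update syntax; the field names below are illustrative, not the crate's actual `AccountsDbConfig` definition:

```rust
// Hypothetical config type mirroring the reviewed AccountsDbConfig.
#[derive(Default)]
struct AccountsDbConfig {
    reset: bool,
    max_snapshots: u16,
}

// Preserve the caller's reset flag instead of wiping it with
// AccountsDbConfig::default(); all other fields keep their defaults.
fn accountsdb_config_for(reset_ledger: bool) -> AccountsDbConfig {
    AccountsDbConfig {
        reset: reset_ledger,
        ..Default::default()
    }
}
```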

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a1c42ab8-339d-402f-a08b-e2d698332151

📥 Commits

Reviewing files that changed from the base of the PR and between 67b913f and 182b730.

⛔ Files ignored due to path filters (1)
  • test-integration/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • test-integration/configs/api-conf.ephem.toml
  • test-integration/configs/chainlink-conf.devnet.toml
  • test-integration/configs/claim-fees-test.toml
  • test-integration/configs/cloning-conf.devnet.toml
  • test-integration/configs/cloning-conf.ephem.toml
  • test-integration/configs/committor-conf.devnet.toml
  • test-integration/configs/config-conf.devnet.toml
  • test-integration/configs/restore-ledger-conf.devnet.toml
  • test-integration/configs/schedulecommit-conf-fees.ephem.toml
  • test-integration/configs/schedulecommit-conf.devnet.toml
  • test-integration/configs/schedulecommit-conf.ephem.frequent-commits.toml
  • test-integration/configs/schedulecommit-conf.ephem.toml
  • test-integration/configs/validator-offline.devnet.toml
  • test-integration/test-ledger-restore/src/lib.rs
💤 Files with no reviewable changes (13)
  • test-integration/configs/validator-offline.devnet.toml
  • test-integration/configs/schedulecommit-conf.ephem.toml
  • test-integration/configs/config-conf.devnet.toml
  • test-integration/configs/schedulecommit-conf.ephem.frequent-commits.toml
  • test-integration/configs/cloning-conf.ephem.toml
  • test-integration/configs/schedulecommit-conf.devnet.toml
  • test-integration/configs/api-conf.ephem.toml
  • test-integration/configs/restore-ledger-conf.devnet.toml
  • test-integration/configs/schedulecommit-conf-fees.ephem.toml
  • test-integration/configs/chainlink-conf.devnet.toml
  • test-integration/configs/claim-fees-test.toml
  • test-integration/configs/cloning-conf.devnet.toml
  • test-integration/configs/committor-conf.devnet.toml

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
magicblock-accounts-db/src/snapshot.rs (4)

114-119: ⚠️ Potential issue | 🟠 Major

Prune after archive is durably registered, not after directory creation.

prune_registry() is called in create_snapshot_dir (line 117) after the snapshot directory is created but before archive_and_register completes. If archiving fails, you've already deleted the oldest archive, potentially leaving fewer restore points than expected.

Consider moving the prune to archive_and_register after the archive is successfully written and registered.

🛠️ Suggested fix
 // In create_snapshot_dir:
         self.strategy
             .execute(&self.db_path, &snap_path, memory_capture)
             .log_err(|| "Snapshot failed")?;
-        self.prune_registry();

         Ok(snap_path)
     }

 // In archive_and_register, after register_archive:
         self.register_archive(archive_path.clone());
+        self.prune_registry();
         Ok(archive_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 114 - 119, The
prune_registry() call is happening too early in create_snapshot_dir (after the
snapshot directory is created) which can delete oldest archives even if
archive_and_register/strategy.execute fails; move the prune_registry()
invocation out of create_snapshot_dir and call it only after
archive_and_register (or after self.strategy.execute completes successfully) so
pruning runs only when the archive is durably written and registered; update
create_snapshot_dir to no longer prune and add the prune_registry() call
immediately after archive_and_register (or after the log_err? check) in
create_snapshot so error paths do not trigger pruning.

259-259: ⚠️ Potential issue | 🟠 Major

Replace .unwrap() with explicit error handling.

Line 259 uses .unwrap() on registry.remove(index). While the index was just validated via binary_search, the coding guidelines require proper error handling for all .unwrap() in production code. The remove method returns Option<T>, and while logically safe here, an explicit .ok_or_else() would satisfy the guidelines and document the invariant.

🛠️ Suggested fix
-        let chosen_archive = registry.remove(index).unwrap();
+        // INVARIANT: index is valid from binary_search on the same locked registry
+        let chosen_archive = registry.remove(index).ok_or_else(|| {
+            AccountsDbError::Internal(format!(
+                "Registry index {index} became invalid during snapshot lookup"
+            ))
+        })?;

As per coding guidelines: "Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` at line 259, Replace the .unwrap() on
registry.remove(index) with explicit error handling: call
registry.remove(index).ok_or_else(|| /* create a descriptive error e.g.,
SnapshotError::InvariantViolation(...) */) and propagate or map that error where
appropriate from the surrounding function (e.g., by returning a Result). Update
any callers if necessary to handle the new error return; reference the variables
chosen_archive, registry.remove, and index and ensure the error message
documents the invariant that binary_search validated the index.

323-327: ⚠️ Potential issue | 🟠 Major

Maintain sorted order when registering archives.

register_archive always appends via push_back, but find_and_remove_snapshot and snapshot_exists use binary_search which requires sorted order. When an external snapshot with slot <= current_slot is inserted (the non-fast-forward path), it will be appended after potentially newer archives, breaking the binary search.

🛠️ Suggested fix to maintain sorted order
     fn register_archive(&self, archive_path: PathBuf) {
         info!(archive_path = %archive_path.display(), "Snapshot registered");
-        self.registry.lock().push_back(archive_path);
+        let mut registry = self.registry.lock();
+        // Maintain sorted order for binary_search compatibility
+        let insert_pos = registry
+            .binary_search(&archive_path)
+            .unwrap_or_else(|i| i);
+        registry.insert(insert_pos, archive_path);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 323 - 327,
register_archive currently appends with registry.lock().push_back(archive_path)
which breaks the sorted invariant required by find_and_remove_snapshot and
snapshot_exists (they use binary_search). Instead of push_back, compute the sort
key used by those functions (the snapshot slot extracted from the PathBuf) and
insert archive_path into registry at the correct sorted position: obtain the
mutex guard on registry, use binary_search_by (or binary_search_by_key) with the
same comparator/key extraction as snapshot_exists/find_and_remove_snapshot to
get an index, then call insert(index_or_insertion_point, archive_path) on the
underlying collection (VecDeque supports insert) so the registry remains sorted;
ensure you handle the Ok (already present) and Err (insertion point) cases
consistently.

386-388: ⚠️ Potential issue | 🟠 Major

Delete over-limit archives during startup recovery.

The code trims the registry to max snapshots but doesn't delete the excluded files from disk. After restart, archives beyond max_snapshots become orphans that are never pruned, defeating the space-saving goal.

🛠️ Suggested fix to delete excess archives
         let offset = paths.len().saturating_sub(max);
+        // Delete excess archives from disk
+        for stale in &paths[..offset] {
+            if let Err(e) = fs::remove_file(stale) {
+                warn!(path = %stale.display(), error = ?e, "Failed to prune excess snapshot archive during recovery");
+            }
+        }
         Ok(paths.into_iter().skip(offset).collect())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/snapshot.rs` around lines 386 - 388, The current
startup recovery logic trims the in-memory list (using paths, max, offset) but
does not remove the excluded archive files from disk; update the function that
builds/returns paths (the block using let offset =
paths.len().saturating_sub(max); Ok(paths.into_iter().skip(offset).collect()))
to delete the files that are being dropped before returning the trimmed list:
iterate over paths[..offset], call std::fs::remove_file (or async equivalent)
for each path, handle and log errors (don’t panic on failure) and only
collect/return the remaining paths via paths.into_iter().skip(offset).collect(),
ensuring deletion occurs during startup recovery to prevent orphaned archives.
magicblock-accounts-db/src/lib.rs (1)

386-406: ⚠️ Potential issue | 🟠 Major

Verify error recovery when reload fails after fast-forward.

If insert_external_snapshot returns true (fast-forward completed with atomic swap), but storage.reload() or index.reload() subsequently fails, the method returns Err while the filesystem has already been modified. The caller has no way to recover the in-memory state.

Consider either:

  1. Validating the extracted snapshot can be loaded before committing the atomic swap, or
  2. Keeping the backup until both reloads succeed
#!/bin/bash
# Check if atomic_swap keeps backup available for rollback after swap completes
rg -n "atomic_swap|backup|\.bak" magicblock-accounts-db/src/snapshot.rs -B 2 -A 10
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-accounts-db/src/lib.rs` around lines 386 - 406, The fast-forward
path in insert_external_snapshot calls snapshot_manager.insert_external_snapshot
which performs an atomic swap before calling storage.reload and index.reload,
leaving no way to recover if either reload fails; change this to a two-step
flow: modify snapshot_manager.insert_external_snapshot (or add new
prepare_external_snapshot / commit_external_snapshot APIs) so you can first
extract and prepare the snapshot without performing the atomic swap, then call
storage.reload(path) and index.reload(path) to validate the new on-disk state,
and only after both reloads succeed call a commit/atomic_swap method (or remove
the backup) to finalize; alternatively, if you cannot change
insert_external_snapshot, add a rollback path that uses
snapshot_manager.database_path/backup support to restore the previous DB on
reload failure and ensure insert_external_snapshot returns a handle or flag you
can use to trigger rollback/commit.
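One way to realize the two-step flow suggested above is a prepare/commit split. The sketch below is a toy model only: `prepare_external_snapshot` and `commit_external_snapshot` are hypothetical names, and a plain `fs::write` stands in for real tar.gz extraction. The point is the ordering: nothing touches the live database until validation (the storage/index reloads) has succeeded, and the old directory survives as a backup until the swap itself completes.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical staged snapshot: extracted next to the database,
/// but not yet swapped in.
struct PreparedSnapshot {
    staging: PathBuf,
}

/// Phase 1: extract the archive without touching the live database.
/// (The single write below stands in for real tar.gz extraction.)
fn prepare_external_snapshot(db: &Path, contents: &str) -> io::Result<PreparedSnapshot> {
    let staging = db.with_extension("staging");
    fs::create_dir_all(&staging)?;
    fs::write(staging.join("accounts.db"), contents)?;
    Ok(PreparedSnapshot { staging })
}

/// Phase 2: called only after validation (storage/index reload) succeeded.
/// The old directory is kept as a backup until the swap completes.
fn commit_external_snapshot(db: &Path, prepared: PreparedSnapshot) -> io::Result<()> {
    let backup = db.with_extension("bak");
    if db.exists() {
        fs::rename(db, &backup)?;
    }
    fs::rename(&prepared.staging, db)?;
    if backup.exists() {
        fs::remove_dir_all(&backup)?;
    }
    Ok(())
}
```

On failure between the phases, the caller simply deletes the staging directory; the live database was never modified.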
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magicblock-accounts-db/src/tests.rs`:
- Around line 337-342: The test unnecessarily wraps AccountsDb in an Arc then
immediately unwraps it; replace the Arc::new + Arc::try_unwrap sequence by
instantiating AccountsDb directly so you can mutate it: call
AccountsDb::new(&config, temp_dir.path(), 0).unwrap() into a mutable variable,
then call set_slot(SNAPSHOT_SLOT + 1000) on that variable (remove Arc::new and
Arc::try_unwrap around AccountsDb and references to Arc).

---

Duplicate comments:
In `@magicblock-accounts-db/src/snapshot.rs`:
- Around line 114-119: The prune_registry() call is happening too early in
create_snapshot_dir (after the snapshot directory is created) which can delete
oldest archives even if archive_and_register/strategy.execute fails; move the
prune_registry() invocation out of create_snapshot_dir and call it only after
archive_and_register (or after self.strategy.execute completes successfully) so
pruning runs only when the archive is durably written and registered; update
create_snapshot_dir to no longer prune and add the prune_registry() call
immediately after archive_and_register (or after the log_err? check) in
create_snapshot so error paths do not trigger pruning.
- Line 259: Replace the .unwrap() on registry.remove(index) with explicit error
handling: call registry.remove(index).ok_or_else(|| /* create a descriptive
error e.g., SnapshotError::InvariantViolation(...) */) and propagate or map that
error where appropriate from the surrounding function (e.g., by returning a
Result). Update any callers if necessary to handle the new error return;
reference the variables chosen_archive, registry.remove, and index and ensure
the error message documents the invariant that binary_search validated the
index.
- Around line 323-327: register_archive currently appends with
registry.lock().push_back(archive_path) which breaks the sorted invariant
required by find_and_remove_snapshot and snapshot_exists (they use
binary_search). Instead of push_back, compute the sort key used by those
functions (the snapshot slot extracted from the PathBuf) and insert archive_path
into registry at the correct sorted position: obtain the mutex guard on
registry, use binary_search_by (or binary_search_by_key) with the same
comparator/key extraction as snapshot_exists/find_and_remove_snapshot to get an
index, then call insert(index_or_insertion_point, archive_path) on the
underlying collection (VecDeque supports insert) so the registry remains sorted;
ensure you handle the Ok (already present) and Err (insertion point) cases
consistently.
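The line-259 `.unwrap()` finding above amounts to replacing a panic with a descriptive error. A minimal sketch of that shape, with a hypothetical error type standing in for the crate's own:

```rust
use std::collections::VecDeque;
use std::path::PathBuf;

// Hypothetical error type standing in for the crate's snapshot error.
#[derive(Debug)]
enum SnapshotError {
    InvariantViolation(&'static str),
}

/// Remove the archive at `index` without unwrapping: binary_search has
/// already validated the index, but if that invariant is ever broken we
/// surface a descriptive error instead of panicking.
fn take_archive(
    registry: &mut VecDeque<PathBuf>,
    index: usize,
) -> Result<PathBuf, SnapshotError> {
    registry.remove(index).ok_or(SnapshotError::InvariantViolation(
        "binary_search produced an index outside the registry",
    ))
}
```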
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 60531684-0351-4a7b-abcb-d6d5d84fbbe7

📥 Commits

Reviewing files that changed from the base of the PR and between 182b730 and 2a3b174.

📒 Files selected for processing (3)
  • magicblock-accounts-db/src/lib.rs
  • magicblock-accounts-db/src/snapshot.rs
  • magicblock-accounts-db/src/tests.rs

@bmuddha bmuddha requested a review from thlorenz March 12, 2026 18:49
@bmuddha bmuddha marked this pull request as ready for review March 12, 2026 18:49
Comment on lines +311 to 321
fn fast_forward(
&self,
slot: u64,
archive_path: &Path,
) -> AccountsDbResult<()> {
let extracted_dir = self.extract_archive(archive_path)?;
self.atomic_swap(&extracted_dir)?;
self.registry.lock().push_back(archive_path.to_path_buf());
info!(slot, "Fast-forward complete");
Ok(())
}
Collaborator

@thlorenz thlorenz Mar 13, 2026


After fast_forward extracts the archive and swaps the db directory, it pushes the archive path to the back of the registry — but prune_registry was never called first. If the registry is already at max_snapshots, the next prune_registry call (from a subsequent create_snapshot_dir or insert_external_snapshot no-ff path) will remove the oldest snapshot, which may be more valuable than a pruned one.

Seems like the above is ok (saw that coderabbit found the same issue).

However, register_archive uses push_back which assumes the new entry is always the newest. For externally inserted snapshots this should generally hold, but there's no sorted-insert guarantee if an older external snapshot arrives after a newer one is already registered.

Collaborator Author


this should only be triggered if the snapshot is newer than the current database, which in turn is always newer than any snapshot in the registry
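Even granting that reply, a sorted insert makes the invariant explicit rather than assumed. A minimal sketch of the `binary_search`-based insertion suggested by the review; the slot-parsing key here is hypothetical (it assumes names like `snapshot-<slot>.tar.gz`), and real code should reuse whatever key `snapshot_exists` and `find_and_remove_snapshot` compare on:

```rust
use std::collections::VecDeque;
use std::path::{Path, PathBuf};

/// Hypothetical key extraction, assuming archive names like
/// "snapshot-<slot>.tar.gz".
fn snapshot_slot(path: &Path) -> u64 {
    path.file_stem()
        .and_then(|s| s.to_str())
        .map(|s| s.trim_end_matches(".tar"))
        .and_then(|s| s.rsplit('-').next())
        .and_then(|slot| slot.parse().ok())
        .unwrap_or(0)
}

/// Insert at the sorted position instead of push_back, so the registry
/// stays ordered by slot even if an older external snapshot arrives
/// after a newer one is already registered.
fn register_archive_sorted(registry: &mut VecDeque<PathBuf>, archive: PathBuf) {
    let slot = snapshot_slot(&archive);
    let pos = registry
        .binary_search_by_key(&slot, |p| snapshot_slot(p))
        .unwrap_or_else(|insertion_point| insertion_point);
    registry.insert(pos, archive);
}
```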

Collaborator

@thlorenz thlorenz left a comment


Solid refactor, the two-phase snapshot (dir + archive), external snapshot insertion, and snapshot_frequency removal all look clean.
I pointed out a few nits that should be fixed before merging.

Also note: magicblock-api/src/slot.rs lines 36-42 still reference "snapshot frequency" and automatic snapshot triggering. Since set_slot no longer triggers snapshots, that comment is now stale and should be updated or removed.

@bmuddha bmuddha force-pushed the bmuddha/accountsdb/archive-snapshots branch from 2a3b174 to c501cd7 Compare March 13, 2026 13:58
@bmuddha bmuddha force-pushed the bmuddha/accountsdb/archive-snapshots branch from c501cd7 to 7e553f7 Compare March 17, 2026 15:13
@bmuddha bmuddha changed the base branch from master to graphite-base/1031 March 20, 2026 07:51
@bmuddha bmuddha force-pushed the bmuddha/accountsdb/archive-snapshots branch from 352f903 to 11c39e6 Compare March 20, 2026 07:51
@bmuddha bmuddha changed the base branch from graphite-base/1031 to bmuddha/epic/replication-service March 20, 2026 07:51
@bmuddha bmuddha mentioned this pull request Mar 20, 2026
1 task
@bmuddha bmuddha force-pushed the bmuddha/epic/replication-service branch from 3238fb7 to cb7cff5 Compare March 30, 2026 11:49
@bmuddha bmuddha force-pushed the bmuddha/accountsdb/archive-snapshots branch from 11c39e6 to 1e53f27 Compare March 30, 2026 11:49
@bmuddha bmuddha merged commit 2ebffec into bmuddha/epic/replication-service Mar 31, 2026
18 of 19 checks passed
@bmuddha bmuddha deleted the bmuddha/accountsdb/archive-snapshots branch March 31, 2026 09:39


Development

Successfully merging this pull request may close these issues.

feat: archive accountsdb snapshots

2 participants