Skip to content

refactor: extract shared wal file substrate#134

Merged
skel84 merged 1 commit intomainfrom
feat/extract-wal-file
Mar 26, 2026
Merged

refactor: extract shared wal file substrate#134
skel84 merged 1 commit intomainfrom
feat/extract-wal-file

Conversation

@skel84
Copy link
Copy Markdown
Owner

@skel84 skel84 commented Mar 26, 2026

Summary

  • extract the shared append/rewrite/truncate WAL file lifecycle into allocdb-wal-file
  • update allocdb-core, quota-core, and reservation-core to wrap the shared substrate while keeping engine-local scan semantics
  • normalize recovery to pass mutable WAL handles where truncation may reopen the file

Testing

  • cargo fmt --all --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test
  • scripts/check_repo.sh

Closes #123
Refs #120

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 26, 2026

Summary by CodeRabbit

Release Notes

  • Refactor

    • Consolidated WAL file I/O operations into a dedicated abstraction layer, improving consistency and maintainability across all database recovery and durability mechanisms.
    • Simplified file handling patterns for truncation and atomic rewrites in WAL operations.
  • Chores

    • Updated internal workspace dependencies and dependency checks to support the new WAL file abstraction component.

Walkthrough

A new crate allocdb-wal-file is introduced containing an AppendWalFile type that encapsulates file append/sync/truncate/recovery operations. Three crates—allocdb-core, quota-core, and reservation-core—refactor their WalFile implementations to delegate to this shared abstraction. Recovery API signatures are updated to require mutable WAL references.

Changes

Cohort / File(s) Summary
Workspace and new crate setup
Cargo.toml, crates/allocdb-wal-file/Cargo.toml
Workspace members list and allocdb-wal-file crate manifest with standard package metadata and lint configuration.
AppendWalFile abstraction
crates/allocdb-wal-file/src/lib.rs
New public AppendWalFile struct managing on-disk WAL with methods for append, sync, read, truncate, and atomic rewrite; includes helper functions and unit tests for durability operations.
Recovery API signatures
crates/allocdb-core/src/recovery.rs, crates/quota-core/src/recovery.rs
Updated recover_allocdb/recover_allocdb_with_observer and recover_quota/recover_quota_with_observer to accept &mut WalFile instead of &WalFile.
WalFile refactoring
crates/allocdb-core/src/wal_file.rs, crates/quota-core/src/wal_file.rs, crates/reservation-core/src/wal_file.rs
Each WalFile now delegates file I/O to AppendWalFile; removed local File/OpenOptions/temp-file/parent-sync logic; truncate_to_valid_prefix changed to &mut self.
WalFile dependency additions
crates/allocdb-core/Cargo.toml, crates/quota-core/Cargo.toml, crates/reservation-core/Cargo.toml
Added allocdb-wal-file workspace path dependency to each crate.
Test updates
crates/allocdb-core/src/recovery_*_tests.rs, crates/allocdb-node/src/engine.rs
Updated recovery test calls and engine integration to pass &mut wal instead of &wal to recovery functions.
Dependency allowlist
scripts/check_repo.sh
Extended allocdb-core dependency allowlist to include allocdb-wal-file.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: extracting shared WAL file substrate into a new shared component.
Description check ✅ Passed The description covers the key sections: summary of changes, testing performed, and linked issues. All critical details are present.
Linked Issues check ✅ Passed All objectives from issue #123 are met: a common WAL file helper layer is extracted with append/sync/truncate/recovery operations, engine-local replay semantics are preserved separately, and fail-closed truncation behavior is maintained.
Out of Scope Changes check ✅ Passed All changes are directly aligned with extracting the shared WAL file substrate: new crate creation, dependency additions, API signature updates for mutability, and migration of file operations to the shared layer.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/extract-wal-file

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/reservation-core/src/wal_file.rs (1)

132-138: ⚠️ Potential issue | 🟡 Minor

Fix test path prefix to use reservation-core instead of quota-core.

The test file path format uses "quota-core-{name}-{nanos}.wal" which appears to be a copy-paste artifact. This should be "reservation-core-{name}-{nanos}.wal" for clarity and to avoid potential temp file collisions if tests run concurrently across crates.

Proposed fix
     fn test_path(name: &str) -> PathBuf {
         let nanos = SystemTime::now()
             .duration_since(UNIX_EPOCH)
             .expect("system time should be after epoch")
             .as_nanos();
-        std::env::temp_dir().join(format!("quota-core-{name}-{nanos}.wal"))
+        std::env::temp_dir().join(format!("reservation-core-{name}-{nanos}.wal"))
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/reservation-core/src/wal_file.rs` around lines 132 - 138, In the test
helper function test_path, update the temp file name format string from
"quota-core-{name}-{nanos}.wal" to "reservation-core-{name}-{nanos}.wal" so the
generated temp files use the reservation-core prefix and avoid cross-crate
collisions; locate the format call inside fn test_path and modify the string
passed to std::env::temp_dir().join(...) accordingly.
🧹 Nitpick comments (5)
crates/quota-core/src/recovery.rs (1)

152-236: Consider adding recovery logging for operational clarity.

Unlike allocdb-core/src/recovery.rs which logs recovery completion and failures at info/error levels, quota-core lacks equivalent structured logging. For consistency and debuggability, consider adding similar lifecycle logging around recovery boundaries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/quota-core/src/recovery.rs` around lines 152 - 236, Add structured
info/error logs around recover_quota_with_observer lifecycle to mirror
allocdb-core: log start of recovery (including snapshot
presence/loaded_snapshot_lsn) right after snapshot load, log an info after
wal_file.truncate_to_valid_prefix completes, log progress at
RecoveryBoundary::AfterReplayFrame (include lsn, record_type,
replay_ordinal/replayable_frame_count), and log a final info on successful
completion before returning RecoveryResult; also ensure errors during snapshot
load, WAL truncate, or replay are logged with error level where the ? propagates
(e.g., wrap or map_err to log before returning) so failures are visible in logs.
Use the existing observer/RecoveryBoundary places and
db.set_progress/RecoveryResult return point to attach these logs.
crates/allocdb-wal-file/src/lib.rs (4)

63-71: Unnecessary seek before dropping the truncation handle.

The seek on line 66 has no effect since file is dropped immediately after and the new append handle (line 69) positions at EOF automatically due to append mode.

♻️ Suggested simplification
     pub fn truncate_to(&mut self, valid_prefix: u64) -> Result<(), std::io::Error> {
         let mut file = OpenOptions::new().write(true).open(&self.path)?;
         file.set_len(valid_prefix)?;
-        file.seek(SeekFrom::Start(valid_prefix))?;
         file.sync_data()?;
         drop(file);
         self.file = open_append_file(&self.path)?;
         Ok(())
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-wal-file/src/lib.rs` around lines 63 - 71, In truncate_to
remove the redundant file.seek(SeekFrom::Start(valid_prefix)) call: after
opening a write handle to set_len and sync_data, drop that handle and reopen the
append handle via self.file = open_append_file(&self.path)?; so delete the seek
on the temporary `file` in fn truncate_to to simplify behavior (the new append
handle returned by open_append_file positions at EOF automatically).

115-126: Consider documenting the Windows durability caveat.

The non-Unix sync_parent_dir is a no-op, meaning atomic rename durability guarantees are weaker on Windows. Consider adding a doc comment noting this platform-specific behavior for maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-wal-file/src/lib.rs` around lines 115 - 126, Add a doc comment
to the non-Unix implementation of sync_parent_dir (the function annotated with
#[cfg(not(unix))]) explaining that it is a no-op on Windows and therefore atomic
rename/durability guarantees are weaker on that platform; mention that callers
relying on fsyncing parent directories to persist renames should be aware this
behavior is platform-specific and may require alternate strategies on Windows.

11-94: Consider adding structured logging for lifecycle and state-transition events.

Per coding guidelines, logging should be added "where it materially improves debuggability or operational clarity." For a critical durability component, consider logging:

  • info! on file open (lifecycle event)
  • info! or debug! on truncation and atomic replacement (significant state transitions)

This aids debugging recovery issues and operational observability without cluttering hot-path append operations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-wal-file/src/lib.rs` around lines 11 - 94, Add structured
lifecycle/state-transition logs to AppendWalFile: in open(...) emit an info!
with the file path and whether it was created or opened; in truncate_to(...)
emit info! or debug! with the path and the valid_prefix value (and resulting
file length after truncate if available); in replace_with_bytes(...) emit info!
or debug! with the path and bytes.len() before writing and after successful
rename; use the existing function names (AppendWalFile::open, append_bytes,
sync, read_all, truncate_to, replace_with_bytes) to locate insertion points and
prefer the crate's logging macro (info!/debug!) so logs include path and numeric
details for observability without touching hot-path append_bytes/sync.

144-185: Add edge-case and negative-path tests per coding guidelines.

The current test suite covers happy paths but lacks edge-case, negative-path, and recovery tests. Per coding guidelines for **/*.rs files, favor these over shallow happy-path coverage.

Consider adding tests for:

  • Truncating to zero
  • Truncating to length greater than file size (documents set_len behavior)
  • Replacing with empty bytes
  • Error propagation when underlying operations fail (e.g., after truncate or replace, verify append still works; or simulate permission errors)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-wal-file/src/lib.rs` around lines 144 - 185, Add focused
edge-case and negative-path tests for AppendWalFile: create tests named like
truncate_to_zero, truncate_beyond_length, replace_with_empty_bytes, and
error_propagation_on_failed_ops that exercise AppendWalFile::open,
::append_bytes, ::sync, ::truncate_to and ::replace_with_bytes; specifically
assert behavior after truncating to 0, truncating to a length greater than
current file size (documenting set_len semantics), replacing with an empty slice
produces an empty file and allows future appends, and simulate/verify error
propagation when underlying filesystem ops fail (for example by opening a
read-only handle or changing permissions) to ensure append_bytes returns an
error and state is consistent; follow existing test structure (use test_path,
fs::remove_file) and add assertions using ::read_all and unwrap/expect as
appropriate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/reservation-core/src/wal_file.rs`:
- Around line 132-138: In the test helper function test_path, update the temp
file name format string from "quota-core-{name}-{nanos}.wal" to
"reservation-core-{name}-{nanos}.wal" so the generated temp files use the
reservation-core prefix and avoid cross-crate collisions; locate the format call
inside fn test_path and modify the string passed to
std::env::temp_dir().join(...) accordingly.

---

Nitpick comments:
In `@crates/allocdb-wal-file/src/lib.rs`:
- Around line 63-71: In truncate_to remove the redundant
file.seek(SeekFrom::Start(valid_prefix)) call: after opening a write handle to
set_len and sync_data, drop that handle and reopen the append handle via
self.file = open_append_file(&self.path)?; so delete the seek on the temporary
`file` in fn truncate_to to simplify behavior (the new append handle returned by
open_append_file positions at EOF automatically).
- Around line 115-126: Add a doc comment to the non-Unix implementation of
sync_parent_dir (the function annotated with #[cfg(not(unix))]) explaining that
it is a no-op on Windows and therefore atomic rename/durability guarantees are
weaker on that platform; mention that callers relying on fsyncing parent
directories to persist renames should be aware this behavior is
platform-specific and may require alternate strategies on Windows.
- Around line 11-94: Add structured lifecycle/state-transition logs to
AppendWalFile: in open(...) emit an info! with the file path and whether it was
created or opened; in truncate_to(...) emit info! or debug! with the path and
the valid_prefix value (and resulting file length after truncate if available);
in replace_with_bytes(...) emit info! or debug! with the path and bytes.len()
before writing and after successful rename; use the existing function names
(AppendWalFile::open, append_bytes, sync, read_all, truncate_to,
replace_with_bytes) to locate insertion points and prefer the crate's logging
macro (info!/debug!) so logs include path and numeric details for observability
without touching hot-path append_bytes/sync.
- Around line 144-185: Add focused edge-case and negative-path tests for
AppendWalFile: create tests named like truncate_to_zero, truncate_beyond_length,
replace_with_empty_bytes, and error_propagation_on_failed_ops that exercise
AppendWalFile::open, ::append_bytes, ::sync, ::truncate_to and
::replace_with_bytes; specifically assert behavior after truncating to 0,
truncating to a length greater than current file size (documenting set_len
semantics), replacing with an empty slice produces an empty file and allows
future appends, and simulate/verify error propagation when underlying filesystem
ops fail (for example by opening a read-only handle or changing permissions) to
ensure append_bytes returns an error and state is consistent; follow existing
test structure (use test_path, fs::remove_file) and add assertions using
::read_all and unwrap/expect as appropriate.

In `@crates/quota-core/src/recovery.rs`:
- Around line 152-236: Add structured info/error logs around
recover_quota_with_observer lifecycle to mirror allocdb-core: log start of
recovery (including snapshot presence/loaded_snapshot_lsn) right after snapshot
load, log an info after wal_file.truncate_to_valid_prefix completes, log
progress at RecoveryBoundary::AfterReplayFrame (include lsn, record_type,
replay_ordinal/replayable_frame_count), and log a final info on successful
completion before returning RecoveryResult; also ensure errors during snapshot
load, WAL truncate, or replay are logged with error level where the ? propagates
(e.g., wrap or map_err to log before returning) so failures are visible in logs.
Use the existing observer/RecoveryBoundary places and
db.set_progress/RecoveryResult return point to attach these logs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5c2665df-8dd8-4a35-bcea-1623c63f2b46

📥 Commits

Reviewing files that changed from the base of the PR and between 079f2fa and f3d15d2.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • Cargo.toml
  • crates/allocdb-core/Cargo.toml
  • crates/allocdb-core/src/recovery.rs
  • crates/allocdb-core/src/recovery_issue_30_tests.rs
  • crates/allocdb-core/src/recovery_issue_31_tests.rs
  • crates/allocdb-core/src/recovery_revoke_tests.rs
  • crates/allocdb-core/src/recovery_tests.rs
  • crates/allocdb-core/src/wal_file.rs
  • crates/allocdb-node/src/engine.rs
  • crates/allocdb-wal-file/Cargo.toml
  • crates/allocdb-wal-file/src/lib.rs
  • crates/quota-core/Cargo.toml
  • crates/quota-core/src/recovery.rs
  • crates/quota-core/src/wal_file.rs
  • crates/reservation-core/Cargo.toml
  • crates/reservation-core/src/wal_file.rs
  • scripts/check_repo.sh
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: semgrep-cloud-platform/scan
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-core/src/recovery_revoke_tests.rs
  • crates/allocdb-node/src/engine.rs
  • crates/allocdb-core/src/recovery_issue_30_tests.rs
  • crates/allocdb-core/src/recovery_issue_31_tests.rs
  • crates/allocdb-core/src/recovery_tests.rs
  • crates/allocdb-core/src/recovery.rs
  • crates/quota-core/src/recovery.rs
  • crates/allocdb-core/src/wal_file.rs
  • crates/quota-core/src/wal_file.rs
  • crates/allocdb-wal-file/src/lib.rs
  • crates/reservation-core/src/wal_file.rs
🔇 Additional comments (34)
scripts/check_repo.sh (1)

38-38: LGTM!

The dependency whitelist correctly includes allocdb-wal-file to permit the new workspace crate as a dependency of allocdb-core.

Cargo.toml (1)

4-4: LGTM!

The new allocdb-wal-file crate is correctly added to the workspace members, enabling path dependencies from allocdb-core, quota-core, and reservation-core.

crates/reservation-core/Cargo.toml (1)

9-9: LGTM!

The path dependency on allocdb-wal-file is correctly added, consistent with the same pattern in quota-core/Cargo.toml and allocdb-core/Cargo.toml.

crates/allocdb-core/src/recovery_issue_31_tests.rs (1)

70-70: LGTM!

The call site correctly passes &mut wal to match the updated recover_allocdb signature that now requires mutable WAL access for potential truncation during recovery.

crates/allocdb-core/Cargo.toml (1)

9-9: LGTM!

The path dependency on allocdb-wal-file is correctly added and is now permitted by the updated scripts/check_repo.sh whitelist.

crates/allocdb-wal-file/Cargo.toml (1)

1-8: LGTM!

The new crate manifest correctly inherits workspace configuration and has no external dependencies, keeping the shared WAL substrate minimal and self-contained.

crates/allocdb-core/src/recovery_revoke_tests.rs (2)

92-93: LGTM!

The call site correctly passes &mut wal to match the updated recover_allocdb signature.


160-161: LGTM!

The call site correctly passes &mut wal to match the updated recover_allocdb signature, consistent with the change in the first test.

crates/quota-core/Cargo.toml (1)

9-9: LGTM!

The path dependency on allocdb-wal-file is correctly added, enabling quota-core to delegate WAL I/O to the shared AppendWalFile abstraction.

crates/allocdb-node/src/engine.rs (1)

420-441: LGTM!

The recovery call site is correctly updated to pass &mut wal to match the new recover_allocdb_with_observer signature. The mutable binding and error mapping chain are properly preserved.

crates/allocdb-core/src/recovery_issue_30_tests.rs (1)

82-83: LGTM!

Test call sites correctly updated to pass mutable WAL references, matching the new recover_allocdb signature.

Also applies to: 130-131, 148-148, 177-177

crates/allocdb-core/src/recovery_tests.rs (1)

95-95: LGTM!

All recovery test call sites correctly updated to use mutable WAL references. Test coverage for various recovery scenarios is preserved.

Also applies to: 151-151, 196-197, 216-216, 222-222, 266-267, 313-314

crates/allocdb-core/src/recovery.rs (3)

149-164: LGTM!

The signature change to &mut WalFile is correctly justified by the underlying truncate_to_valid_prefix(&mut self) call at line 221. The recovery logic and error handling remain intact.


173-201: LGTM!

The observer-based recovery variant correctly propagates the mutable borrow requirement. Logging at info level for successful recovery and error level for failures via log_recovery_failure follows the coding guidelines.


203-222: LGTM!

Internal implementation correctly receives and uses the mutable WAL reference for truncate_to_valid_prefix(). The observer callback pattern and replay logic remain unchanged.

crates/quota-core/src/recovery.rs (2)

135-150: LGTM!

The signature change to &mut WalFile correctly enables the truncate_to_valid_prefix mutation at line 170.


377-377: LGTM!

Test call sites correctly updated to pass mutable WAL references.

Also applies to: 415-415, 465-465, 520-520

crates/allocdb-core/src/wal_file.rs (6)

1-10: LGTM!

Clean refactor to delegate WAL operations to AppendWalFile. The struct simplification removes duplicate path storage since AppendWalFile owns the path internally.


37-50: LGTM!

Opening and path access correctly delegate to AppendWalFile. The error conversion via From<std::io::Error> handles I/O failures from the underlying abstraction.


63-77: LGTM!

Frame append and sync operations cleanly delegate to AppendWalFile::append_bytes and AppendWalFile::sync. Payload validation occurs before delegation, preserving the existing contract.


84-115: LGTM!

Recovery and truncation correctly use the recover_file helper and delegate truncation to AppendWalFile::truncate_to. The fail-closed behavior on mid-log corruption is preserved (returns WalFileError::Corruption without truncating).


123-134: LGTM!

The replace_with_frames implementation correctly validates all frames first, then encodes into a contiguous buffer before atomic replacement via replace_with_bytes. This preserves crash safety (atomic rename + parent dir sync per context snippet 1).


148-155: LGTM!

The recover_file helper cleanly reads the full WAL via read_all() and scans frames. The u64 conversion panic is acceptable as WAL files exceeding u64::MAX bytes are unrealistic.

crates/quota-core/src/wal_file.rs (4)

1-10: LGTM!

Clean delegation to AppendWalFile, mirroring the allocdb-core pattern for the shared WAL substrate.


37-61: LGTM!

Open, path, append, and sync operations correctly delegate to AppendWalFile.


67-85: LGTM!

Truncation logic preserves fail-closed behavior on corruption and correctly delegates torn-tail truncation to AppendWalFile::truncate_to.


87-119: LGTM!

Frame replacement and recovery helper follow the same pattern as allocdb-core, maintaining consistency across the shared substrate.

crates/reservation-core/src/wal_file.rs (3)

1-10: LGTM!

Clean delegation to AppendWalFile, consistent with the other crate-local WAL wrappers.


37-65: LGTM!

Core WAL operations correctly delegate to AppendWalFile.


67-119: LGTM!

Truncation, replacement, and recovery helper implementations are consistent with the shared substrate pattern.

crates/allocdb-wal-file/src/lib.rs (4)

1-9: LGTM!

Clean struct definition with appropriate fields for the append-only WAL file abstraction.


11-44: LGTM!

The open, path, append_bytes, and sync methods are straightforward delegations with correct semantics. Using sync_data() is appropriate for WAL durability.


46-56: LGTM!

Opening a fresh read handle in read_all avoids position conflicts with the append handle and correctly reads the full on-disk state.


78-94: LGTM on the atomic replacement pattern.

The write-to-temp → sync → rename → parent-dir-sync → reopen sequence is correct for crash-safe atomic replacement.

@skel84 skel84 merged commit 18bef24 into main Mar 26, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

M12-T03: Extract shared wal_file helper layer after wal extraction

1 participant