fix: --ipv6-prefix for routed mode, stale state cleanup, snapshot restore#559
fix: --ipv6-prefix for routed mode, stale state cleanup, snapshot restore#559
Conversation
…estore timeout - load_state_by_pid: retry after cleanup_stale_state when PID not found - Routed mode: verify forwarding/accept_ra/default route at setup, warn if wrong - Snapshot restore: remove 30s timeout on fc-agent output reconnect (VM may be CPU-starved after restore; proceeding early causes exec failures)
Add --ipv6-prefix flag for routed mode VM addressing. When set, VMs get addresses in the specified /64 prefix via NDP proxy, and MASQUERADE is skipped (the prefix is directly routable and covered by the machine cert's IP SANs, so VM source IPs pass IP binding checks). Without --ipv6-prefix, detect_host_ipv6() auto-detects from interfaces, skipping deprecated addresses. For hosts where all /64s are deprecated, --ipv6-prefix is required. Also: preflight_check is now an instance method (&self) so the ipv6_prefix configuration cannot be mismatched between preflight and setup.
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] Snapshot restore path doesn't propagate In FcNetworkMode::Routed => {
let mut net =
RoutedNetwork::new(vm_id.clone(), tap_device.clone(), port_mappings.clone());
net.preflight_check().context("routed mode preflight check failed")?;
// ...
}The
To fix: either store [MEDIUM] No validation of The user-supplied prefix string is used directly in string formatting ( A minimal check in [LOW] DESIGN.md and CLAUDE.md not updated for The SummaryThe Review by Claude | Run |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ca2ef8c238
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
src/commands/snapshot.rs
Outdated
| Err(_) => { | ||
| warn!(vm_id = %vm_id, "fc-agent did not connect within 30s, proceeding anyway") | ||
| } | ||
| match output_connected_rx.await { |
There was a problem hiding this comment.
Avoid indefinite wait for output reconnect gate
In non-TTY snapshot restores, output_connected_rx.await can block forever when fc-agent never reconnects (for example, if the restored VM exits/crashes before opening the output vsock). In that case the sender is still held inside run_output_listener's initial accept() wait, so this function never reaches the later vm_manager.wait()/cleanup path and the command hangs indefinitely. Please gate this wait against VM exit/cancellation (or fail after a bounded timeout) so failed restores terminate.
Useful? React with 👍 / 👎.
src/commands/snapshot.rs
Outdated
| let mut net = | ||
| RoutedNetwork::new(vm_id.clone(), tap_device.clone(), port_mappings.clone()); | ||
| net.preflight_check().context("routed mode preflight check failed")?; |
There was a problem hiding this comment.
Propagate routed IPv6 prefix into snapshot restore path
This routed restore path constructs RoutedNetwork without ever setting an explicit prefix, so restores always run the auto-detect/MASQUERADE behavior even after adding --ipv6-prefix support in podman run. That creates a behavior mismatch where cold boots can succeed with a configured routable prefix but snapshot/cache restores cannot reproduce it (and may fail preflight on hosts that require explicit prefixing). The restore path needs to carry the prefix into with_ipv6_prefix(...).
Useful? React with 👍 / 👎.
- cargo fmt: line wrapping for .context() chains - clippy: collapse nested if in detect_host_ipv6() - Add missing ipv6_prefix field to RunArgs in test helpers Tested: make lint (fmt, clippy, audit, deny all pass)
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] Snapshot restore doesn't propagate As noted in previous review, the snapshot restore path ( [MEDIUM] No validation of As noted in previous review, the user-supplied prefix is used directly in [MEDIUM] PR description/docs inconsistency with code changes
SummaryThe two [MEDIUM] issues from the previous review (snapshot restore not propagating Review by Claude | Run |
Three issues found by code review: 1. ipv6_prefix not propagated through snapshot metadata — clones of routed-mode VMs with --ipv6-prefix would fail preflight (no auto-detect) or incorrectly add MASQUERADE. Added ipv6_prefix to VmConfig, SnapshotMetadata, and the snapshot restore path. 2. No validation of --ipv6-prefix input — invalid values like "foobar" produced nonsense addresses. Added validate_ipv6_prefix() that checks for 4 colon-separated hex groups. 3. Bare .await on output_connected_rx could hang forever if Firecracker crashes before fc-agent connects. Replaced with tokio::select! loop that polls vm_manager.try_wait() every 5s as a liveness check. Tested: make lint (fmt, clippy, audit, deny all pass)
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] Stale documentation: The two code-level [MEDIUM] issues from previous review (snapshot propagation, input validation) are now fixed in commit 06b9b1a. The documentation staleness from previous review remains:
[LOW] Doc comment on Line 70 of SummaryThe snapshot propagation and validation bugs are fixed. The remaining issue is stale documentation across three files that doesn't reflect the new Review by Claude | Run |
Keep ipv6_prefix as a builder method (with_ipv6_prefix), consistent with loopback_ip (with_loopback_ip). Both are optional config set after construction — loopback_ip because it's allocated async after preflight_check, ipv6_prefix because it's genuinely optional. The real compile-time safety comes from storing ipv6_prefix in SnapshotMetadata, so both call sites (podman run + snapshot run) have it available and use symmetric code. Tested: cargo fmt --check + clippy pass
- README: ip6tables not required with --ipv6-prefix, add to CLI Reference - DESIGN: add ipv6_prefix field to RoutedNetwork, note conditional checks - CLAUDE.md: note MASQUERADE skipped with --ipv6-prefix
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] Stale documentation: As noted in previous review, the code-level issues (snapshot propagation, input validation) are now fixed. The documentation staleness remains across three files:
SummaryThe two original code bugs (snapshot restore not propagating Review by Claude | Run |
🔍 Claude ReviewSEVERITY: none FindingsAll previously reported issues are now fixed:
No new issues found. The PR is clean and ready to merge. SummaryAll [MEDIUM] and [LOW] issues from previous reviews have been addressed across commits 06b9b1a and bac740f. The implementation is consistent: Review by Claude | Run |
- 10 tests in routed.rs: prefix validation (valid/invalid/hex/full-addr), VM IPv6 generation (deterministic/format), parse_host_ipv6 (deprecated address filtering, link-local/ULA skipping, prefix extraction) - 2 tests in snapshot.rs: ipv6_prefix roundtrip through SnapshotMetadata serde, backward compatibility with old snapshots missing the field - 3 tests in test_state_manager.rs: load_state_by_pid found/not-found, stale state cleanup on retry (verifies dead PID files are removed) Extracted parse_host_ipv6(output, check_onlink) from detect_host_ipv6() for testability without shelling out to `ip addr`.
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] In vm_state.config.health_check_url = snapshot_config.metadata.health_check_url.clone();
vm_state.config.health_check_timeout = snapshot_config.metadata.health_check_timeout;
vm_state.config.username = snapshot_config.metadata.username.clone();
vm_state.config.user = snapshot_config.metadata.user.clone();
vm_state.config.port_mappings = port_mappings;
vm_state.config.network_mode = network_mode;
vm_state.config.tty = tty_mode;
vm_state.config.interactive = interactive;
// Missing: vm_state.config.ipv6_prefix = snapshot_config.metadata.ipv6_prefix.clone();The
Fix: add [LOW] README says README.md line 339: SummaryThe Review by Claude | Run |
🔧 Auto-Fix CreatedI found issues and created a fix PR: #560 Please review and merge the fix PR first, then this PR. |
🔧 CI Auto-FixCreated fix PR: #561 Root cause: |
ca2ef8c to
8feec2e
Compare
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] Snapshot restore unconditionally skips MASQUERADE for all routed VMs In if let Some(ref guest_ipv6) = snapshot_config.metadata.network_config.guest_ipv6 {
let addr_part = guest_ipv6.split('/').next().unwrap_or(guest_ipv6);
if let Ok(ip) = addr_part.parse::<std::net::Ipv6Addr>() {
let s = ip.segments();
let prefix = format!("{:x}:{:x}:{:x}:{:x}", s[0], s[1], s[2], s[3]);
net = net.with_ipv6_prefix(prefix);
}
}This unconditionally sets For VMs originally started with This is already fixed in PR #560, which stores The previously reported issues (previous review) — input validation and SummaryThe Review by Claude | Run |
🔧 CI Auto-FixCreated fix PR: #562 Problem: The Fix: Added |
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] As noted in previous review, [LOW] README says As noted in previous review, README.md line 339 says SummaryThe [MEDIUM] config propagation bug and [LOW] README mismatch from PR #560 (closed without merge) remain unfixed. Creating a fix PR. Review by Claude | Run |
🔧 Auto-Fix CreatedI found issues and created a fix PR: #563 Please review and merge the fix PR first, then this PR. |
Add the missing `vm_state.config.ipv6_prefix` restoration alongside other config fields in the snapshot run path. Without this, cascaded snapshots (creating a snapshot from a clone) lose the prefix because `build_snapshot_config()` reads from `vm_state.config`. Also fix README CLI reference: `<CIDR>` → `<PREFIX>` to match the actual input format (4 colon-separated hex groups, not CIDR notation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
🔍 Claude ReviewSEVERITY: none FindingsAll previously reported issues are now fixed:
No new issues found. The SummaryThe PR is clean. All 8 commits are consistent with the description, all previously flagged issues are resolved, and code/docs are in sync. Review by Claude | Run |
On first failure (or when FCVM_FORCE_DIAG=1), dump 16 diagnostic checks: - Verbose curl via pasta (exact failure point) - ss -tlnp and ss -tanp (listening + all TCP states) - nsenter curl 10.0.2.100:80 (bypass pasta, test guest directly) - nsenter ip neigh, bridge link, ss -tanp (namespace state) - nsenter ping 10.0.2.100 (L3 reachability) - exec in VM: nginx pids, localhost curl, connection count - exec in VM: ss -tan, ARP, somaxconn, tcp_max_syn_backlog - exec in VM: dmesg for TCP/conntrack errors - Raw TCP via nc from namespace (bypass HTTP) - /proc/net/sockstat in namespace (socket counts) - Pasta fd count, VmRSS, thread count - tcpdump on br0 port 80 while doing curl (captures actual RST) - conntrack entries for port 80 in namespace - Per-request timing logged every 5th request The decisive test is nsenter curl: if it works but pasta curl doesn't, the bug is in pasta's splice relay. If both fail, the bug is in the guest's TCP stack after snapshot restore.
🔍 Claude ReviewSEVERITY: none FindingsAll previously reported issues are now fixed:
Latest commit [LOW] Duplicate comment numbering in test diagnostics — SummaryThe PR is clean and ready to merge. All 9 commits are consistent with the description, all previously flagged issues are resolved, and code/docs are in sync. The Review by Claude | Run |
Summary
--ipv6-prefixflag for routed mode — specifies a routable /64 prefix for VM addressing. When set, MASQUERADE is skipped (the prefix is directly routable and covered by the machine cert's IP SANs). Without it,detect_host_ipv6()auto-detects from interfaces, skipping deprecated addresses.load_state_by_pid: retry aftercleanup_stale_statewhen target PID not found (stale files from killed VMs shadow live ones)setup(): verify host forwarding/accept_ra/default route and warn if misconfiguredpreflight_checkis now an instance method (&self) so ipv6_prefix config cannot be mismatched between preflight and setupTest Results