fix: skip pasta port probe during snapshot restore#555
Conversation
During snapshot restore, post_start() runs BEFORE the VM snapshot is loaded into Firecracker. wait_for_port_forwarding() probes ports via TcpStream::connect(), which pasta accepts (it's listening), but pasta can't forward to the non-existent guest. This forces pasta to attempt L2 forwarding that fails, potentially poisoning its internal connection tracking state and causing subsequent data connections to return 0 bytes. Add with_restore_mode() to PastaNetwork that skips the premature port probe in post_start(). Port forwarding is verified later via verify_port_forwarding() which runs after the VM is resumed and fc-agent has sent its gratuitous ARP.
Stress test that spawns 3 clones from a snapshot with port forwarding and makes 20 rapid HTTP requests to each clone. Catches the "connect succeeded but 0 bytes" pattern seen in CI bench-vm failures. Specifically detects pasta connection tracking poisoning where TCP connect succeeds but no data flows through the forwarded port.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ffa6facb6b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !self.restore_mode && !self.port_mappings.is_empty() { | ||
| self.wait_for_port_forwarding().await?; |
There was a problem hiding this comment.
Preserve hard failure when restore-mode probe is skipped
When restore_mode is enabled, this branch skips the only startup-time port-forward readiness check that previously failed snapshot run on broken forwarding. The later check in cmd_snapshot_run (network.verify_port_forwarding()) only logs a warning on error, so a clone can now come up “healthy” while all published ports are unusable; this is a behavior regression specific to rootless snapshot restores with --publish where pasta never reaches a working forwarding state.
Useful? React with 👍 / 👎.
🔍 Claude ReviewSEVERITY: none AnalysisThe core fix is correct and well-targeted. The ordering in
Probing ports in step 1 means pasta tries L2 forwarding to a non-existent guest. The Scope check: Only Test: Well-structured stress test with proper feature gating ( FindingsNo issues found. The fix is minimal, well-documented in code comments, and addresses a real race condition in the snapshot restore ordering. SummaryClean PR — the root cause analysis is thorough, the fix is surgical (one boolean flag + one condition check), and the new stress test validates the specific failure pattern. LGTM. Review by Claude | Run |
…one spawning - Extract get_loopback_ip() and curl_check()/CurlResult to tests/common (was duplicated 4x across test files) - Replace inline loopback IP parsing in test_snapshot_clone.rs (2 instances) - Spawn clones concurrently via JoinSet instead of sequential loop - Send HTTP requests concurrently across all clones (true stress test) - Remove 50ms sleep between requests and local CurlResult/curl_check
🔍 Claude ReviewSEVERITY: none AnalysisCore fix is correct. The
Refactoring is clean. The third commit extracts Stress test is well-targeted. Feature-gated ( FindingsNo issues found. As noted in previous review, the fix is minimal and well-documented. The additional refactoring commit (extracting common test helpers, concurrent spawning) is a good cleanup. SummaryClean PR. The fix is surgical (one boolean flag, one condition check), the test targets the specific failure mode, and the refactoring reduces code duplication. LGTM. Review by Claude | Run |
The caller in snapshot.rs wrapped verify_port_forwarding() in if-let-Err that only logged a warning, allowing clones to come up with broken port forwarding. Now that post_start() skips the port probe in restore mode, verify_port_forwarding() is the only gate. Propagate its error so clones fail fast on broken forwarding.
🔍 Claude ReviewSEVERITY: none AnalysisTwo previous reviews found no issues — the core fix and refactoring remain correct. This review focused on additional verification:
SummaryClean PR. The fix is correct, the error promotion is safe across all network backends, and the description matches the commits. LGTM. Review by Claude | Run |
…arding With arp_accept=0 (Linux default), gratuitous ARP only updates existing neighbor entries — never creates new ones. The old code checked `ip neigh show` which would never find an entry because fc-agent's gratuitous arping can't create one. Active ping forces a normal ARP exchange (request+reply) that always creates a REACHABLE entry. Changes: - Replace `ip neigh show` with `ping -c 1 -W 0.2` to the guest - Successful ping proves both ARP resolution and guest reachability - 200ms ping timeout allows ~16 retries within 5s deadline - Remove 10ms sleep (ping provides natural 200ms backoff) Tested: make test-root FILTER=clone_port_forward_stress — 60/60 OK, 0 errors
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] PR description missing newest commit: The description doesn't cover commit [MEDIUM] Swallowed spawn errors in
Fix: propagate the [LOW] Stale comment in SummaryThe core fix (skip port probe during restore, use ping for ARP) is correct. The main issue is that command spawn errors in the new ping loop are silently swallowed, which would cause a busy-loop and misleading error message if Review by Claude | Run |
🔧 Auto-Fix CreatedI found issues and created a fix PR: #556 Please review and merge the fix PR first, then this PR. |
The new ping-based ARP check used `if let Ok(...)` which silently swallowed command spawn failures (nsenter/ping not found, permission denied). This could cause a 5-second busy loop with a misleading "ARP not resolved" error. Propagate spawn errors immediately with `.context()?` to match the old code's behavior. Also update stale comment in snapshot.rs to reflect the ping-based ARP resolution approach. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] PR description still missing two commits — As noted in previous review, the description doesn't cover commit Previously reported issues — now fixed:
SummaryThe code changes are correct and complete. The only remaining issue is the PR description being out of date with the last two commits — the description should mention the switch from passive Review by Claude | Run |
- Fix curl_check to use -sS (show errors) instead of -s (silent) - Add pre-storm verification curl for each clone before stress test - On first curl error per clone, dump ss -tlnp and verbose curl to capture the actual error (connection refused, timeout, etc.)
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] PR description still out of date with commits — As noted in previous review, the description doesn't cover commits SummaryThe code changes are all correct — the newest commit ( Review by Claude | Run |
- Add iputils-ping to Containerfile.nested so ping-based ARP probe works in container CI (was missing, causing all container clones to fail with "ARP not resolved within 5s") - Move loop comment above the loop in verify_port_forwarding - Trim snapshot.rs comment to match actual behavior
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] PR description still out of date — As noted in previous review, the description covers only the original 5 commits. It is now missing 3 additional commits: Newest commit (
SummaryNo new code issues. The only remaining finding is the stale PR description (previously reported). All code changes across all 8 commits are correct and well-structured. Review by Claude | Run |
🔍 Claude ReviewSEVERITY: medium Findings[MEDIUM] PR description still out of date — As noted in previous review, the description covers only the original 5 commits but there are now 9 commits. The newest commit No new code issues. The newest commit is purely SummaryAll code is correct. The only remaining issue is the stale PR description (reported in 4 previous reviews). No fix PR needed — this requires an author update to the PR description. Review by Claude | Run |
Add more diagnostic dumps on first curl error per clone: - pgrep -a pasta: check if pasta is alive, find matching process - ss -tnp: show active connections (not just listening sockets) - All pasta processes listed for cross-clone comparison
🔍 Claude ReviewSEVERITY: none AnalysisReviewed the newest commit
The additions are well-structured — numbered comments (1-4) make the diagnostic sequence clear, and each diagnostic silently skips on failure ( Previously reported issues — all resolved:
No new issues found. The code changes are correct, the test diagnostics are useful for CI debugging, and the PR description is now in sync with the commits. SummaryClean PR. The newest commit adds useful diagnostic output for debugging CI failures — no code correctness issues. All previously reported issues have been resolved. LGTM. Review by Claude | Run |
🔧 CI Auto-FixCreated fix PR: #557 Root cause: Note: Host-Root-arm64 |
- Add curl_check_retry() to tests/common: retries HTTP with 500ms backoff and dumps network diagnostics on failure (ARP cache, namespace sockets, bridge links, VM listening sockets, curl -v) - Extract dump_clone_network_diagnostics() as reusable test helper - Capture ping stderr in verify_port_forwarding for better error messages when ARP resolution fails - Use curl_check_retry in clone port forward tests (rootless, routed, stress) instead of single-shot curl - Revert wait_for_port_forwarding to TCP-connect-only (appropriate for initial boot when guest app isn't ready yet)
🔍 Claude ReviewSEVERITY: none AnalysisReviewed the newest commit Newest commit changes are correct:
PR description vs commits: The description now covers items 1-9, which maps to all 11 commits. The newest commit ( Previously reported issues — all resolved. SummaryClean PR. All 11 commits are correct. The newest additions (curl retry with diagnostics, ping stderr capture) improve test reliability and debugging. LGTM. Review by Claude | Run |
verify_port_forwarding uses nsenter + ping to check ARP resolution after snapshot restore. In container tests, nsenter runs in the container's mount namespace, so ping must be installed there. Was added to Containerfile.nested but not Containerfile, causing: nsenter: failed to execute ping: No such file or directory
🔍 Claude ReviewSEVERITY: none AnalysisReviewed the newest commit Newest commit is correct and necessary. The main PR description vs commits: Item 6 says "Add iputils-ping to nested container" but the newest commit adds it to the main All previously reported issues are resolved:
SummaryClean PR. The newest commit completes the Review by Claude | Run |
🔧 CI Auto-FixCreated fix PR: #558 Problem: Fix: Removed the TCP port probe from |
- Dump full clone log instead of last 30 lines on clone_http failure - Add namespace diagnostics: ARP cache, namespace sockets, bridge links via nsenter using holder PID from fcvm ls - Add VM listening sockets via fcvm exec ss -tnl - Dump full serve log instead of last 30 lines on clone_exec failure These diagnostics match what tests/common/mod.rs already captures, giving visibility into pasta splice state when bench fails.
🔍 Claude ReviewSEVERITY: none AnalysisReviewed all 13 commits on this PR. The core fix and all supporting changes are correct. Core fix verification: The PR description vs commits: All 9 description items map to the 13 commits. Accurate and complete. Documentation check: No doc updates needed — changes are internal plumbing (restore ordering, ARP resolution method, test helpers). No user-facing CLI, flags, or behavior changes. Code correctness:
Note on PR #558: That open PR proposes removing All previously reported issues are resolved. SummaryClean PR. The fix is correct and well-targeted, the refactoring reduces duplication, and the stress test validates the specific failure pattern. LGTM. Review by Claude | Run |
Summary
Fix pasta port forwarding reliability for snapshot-restored clones. After snapshot restore, the previous
wait_for_port_forwarding()probe ran before the VM was loaded, poisoning pasta's state. Now we skip the probe during restore and instead verify forwarding after restore completes using an active ping + port probe.Stacked on: main
Changes
start_pasta()acceptsrestore_modeflag; when true, skipswait_for_port_forwarding()since the guest doesn't exist yetverify_port_forwarding()pings guest from namespace to trigger ARP resolution, then probes each forwarded port to confirm end-to-end forwardingping -c 1 -W 0.2to the guest instead of passiveip neigh showARP table checks. Witharp_accept=0(Linux default), gratuitous ARP only updates existing entries, never creates new ones — the active ping forces a proper ARP exchange.context()?so clones fail immediately if port forwarding can't be verified.context("running ping via nsenter in namespace")?instead of silently ignoring command failuresContainerfile.nestedwas missing thepingbinary needed by the new verificationtest_clone_port_forward_stress_rootless: 3 clones × 20 concurrent HTTP requests, checking for 0-byte responses (pasta poisoning pattern)curl -v,ss -tlnp(listening),ss -tnp(connections),pgrep -a pasta(process alive check)get_loopback_ip()andcurl_check()tocommon/mod.rs, concurrent clone spawning viaJoinSetTest Results
Local stress test passes (60/60 OK, 0 errors):