fix: deterministic exec rebind handshake eliminates 60s post-restore hang (#468)
Conversation
After snapshot restore, two exec rebind signals were fired:

1. `handle_clone_restore()` in restore.rs (correct, via the restore-epoch watcher)
2. agent.rs after `notify_cache_ready_and_wait()` returns (duplicate)

The double `re_register()` left the VsockListener's AsyncFd epoll registration in a broken state where readability events weren't delivered for ~150 seconds, causing all health monitor exec calls to time out.

Fix:
- Remove the duplicate rebind signal from agent.rs (root cause)
- Swap ordering in restore.rs: exec rebind BEFORE output reconnect (the host uses output reconnection as its readiness signal)
- Add a post-restore exec stress test: 10 parallel calls with a 10s timeout each; catches this regression instantly

Tested: `make test-root FILTER=localhost_rootless_btrfs_snapshot_restore`
- 10 parallel exec calls after restore: 11-19ms each (20.8ms total)
- Full test: 133s (was timing out at 120s before the fix)
🔍 Claude Review

SEVERITY: none

Findings

No issues found. The changes are correct and well-documented.
- Core fix (agent.rs): removing the duplicate rebind signal addresses the root cause.
- Ordering change (restore.rs): moving exec rebind before output reconnect is the right call; the host treats output vsock reconnection as a readiness signal, so exec must be functional first.
- Doc comments (exec.rs, vsock.rs): accurately document the single-signal invariant.
- Regression test: 10 parallel exec calls with 10s individual timeouts directly target the failure mode (150s epoll stall after a double rebind).

Summary

Clean, well-motivated bug fix; the root cause (the double rebind signal) is removed directly.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c15aa6b0ce
fc-agent/src/agent.rs (outdated diff)
```rust
if container::notify_cache_ready_and_wait(&digest, &restore_flag) {
    eprintln!("[fc-agent] cache ready notification acknowledged");
    // Pre-start snapshot was taken and we've been restored into a new
    // Firecracker instance. The vsock transport was reset, which
    // invalidates the exec server's listener socket (stale AsyncFd epoll).
    // Signal it to re-bind. Set flag BEFORE notify to prevent race where
    // select! drops the Notified future (see exec.rs doc comment).
    exec_rebind_needed.store(true, std::sync::atomic::Ordering::Release);
    exec_rebind.notify_one();
    // Exec rebind is handled by handle_clone_restore() (via the
    // restore-epoch watcher). Do NOT signal a second rebind here —
    // a double re-register leaves the listener in a broken state
```
Reintroduce exec rebind fallback after cache-wait success
This branch now does nothing after notify_cache_ready_and_wait() returns true, which makes exec rebind depend entirely on watch_restore_epoch(). That is unsafe because notify_cache_ready_and_wait() can succeed from a transport reset without restore_flag being set (see fc-agent/src/container.rs reset paths), and the host’s post-snapshot MMDS bump is explicitly best-effort (src/commands/common.rs logs and continues on patch_mmds failure at lines 1444-1452). In that failure mode, the watcher never signals rebind and the exec listener stays on stale epoll registration, so post-snapshot exec/health checks can hang.
Valid concern, partially. The two MMDS bump sites have different error handling:

- Clone restore path (src/commands/common.rs:1006): uses `put_mmds().await.context("...")?`; a hard error that propagates failure. If MMDS fails, the clone fails entirely, so clones are safe.
- Baseline pre-start snapshot path (src/commands/common.rs:1444): uses `if let Err(e)`; best-effort. If this fails, the restore-epoch watcher on the baseline won't fire, and exec stays stale on the baseline.
The baseline case is real but low-impact: the baseline continues to execute `podman run` and start the container, and the health monitor was already running before the snapshot. The baseline itself isn't serving traffic; it's a template. The startup snapshot (taken when healthy) would hit the same best-effort MMDS bump, but by that point the exec server has been working for the entire boot.
That said, upgrading the baseline's patch_mmds to hard-error (or adding a fallback rebind signal) is the right hardening. Will address in a follow-up.
Replace the racy double-rebind approach with a deterministic handshake chain that guarantees the exec server's AsyncFd epoll is re-registered before the host starts health-checking. Reduces restore-to-healthy from ~61s to ~0.5s.

## The Problem

After snapshot restore, Firecracker's vsock transport reset (VIRTIO_VSOCK_EVENT_TRANSPORT_RESET) leaves the exec server's AsyncFd epoll registration stale. The previous fix (c15aa6b) removed the duplicate rebind signal from agent.rs but left a timing gap: if the restore-epoch watcher's single signal arrived late, the host's health monitor would start exec calls against a stale listener, hanging for ~60s until the kernel's vsock cleanup expired the stale connections.

## Trace Evidence (the smoking gun)

From the vsock muxer log of a failing run (vm-ba97c):

```
T+0.009s   Exec call #1  → WORKS (167+144+176+123+71+27 bytes response)
T+0.076s   Exec call #2  → WORKS
T+0.520s   Exec call #3  → guest ACKs, receives request, sends NOTHING → 5s timeout
T+5.5-55s  Exec calls #4-#9 → same pattern: kernel accepts, app never processes
T+60.5s    Guest sends RST for ALL stale connections simultaneously
T+60.5s   Exec call #10 → WORKS → "container running status running=true"
```

The container started at T+0.28s. The exec server was broken for 60 more seconds because the duplicate re_register() from agent.rs corrupted the edge-triggered epoll: the old AsyncFd consumed the edge notification, and the new AsyncFd never received events for pending connections.

## The Fix: Deterministic Handshake Chain

```
exec_rebind_signal → exec_re_register → rebind_done → output.reconnect()
                                                           ↓
                                           host accepts output connection
                                                           ↓
                                              health monitor spawns
```

Every transition has an explicit signal. Zero timing dependencies.
### fc-agent side (4 files)

- exec.rs: after re_register(), signals rebind_done (AtomicBool + Notify)
- restore.rs: signals exec rebind, waits for rebind_done confirmation (5s timeout), THEN reconnects the output vsock
- agent.rs: removed the duplicate rebind signal after notify_cache_ready_and_wait; added exec_rebind_done/exec_rebind_done_notify Arcs
- mmds.rs: threads the new params through watch_restore_epoch to both handle_clone_restore call sites

### Host side (3 files)

- listeners.rs: added a connected_tx oneshot to run_output_listener(), fired on the first output connection accept
- snapshot.rs: waits for output_connected_rx (30s timeout) before spawning the health monitor; removed the stale output_reconnect.notify_one() for startup snapshots
- podman/mod.rs: passes None for connected_tx (non-snapshot path)

## Results

Before: restore-to-healthy = ~61s (exec broken, 9 consecutive 5s timeouts)
After: restore-to-healthy = ~0.5s (35ms to output connected, 533ms to healthy)

Post-restore exec stress test: 10 parallel calls completed in 16.3ms (max single: 15.3ms), zero timeouts.

Tested: `make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1`
🔍 Claude Review

SEVERITY: none

Findings

No issues found. This review covers commit b7714df (the deterministic handshake chain), added after the previous review, which covered only commit c15aa6b.
- Handshake chain correctness: the deterministic ordering (exec rebind → rebind_done → output reconnect) is sound.
- Race condition analysis / host-side gate: waiting for the output connection before spawning the health monitor closes the timing gap.
- Test: 10 parallel exec calls with 10s individual timeouts directly target the failure mode. The 30s total assertion is conservative but appropriate.
- PR description: accurately describes both commits and matches the implementation.

Summary

Solid follow-up to commit 1. The handshake chain is deterministic and race-free. The host-side output connection gate correctly serializes health monitor startup after exec readiness confirmation.
The restore-epoch MMDS update in create_snapshot_core() was best-effort (`if let Err` → warn). This was dangerous because the entire exec rebind chain depends on it: epoch watcher → handle_clone_restore() → exec re_register → output reconnect. If patch_mmds failed silently, the exec server's AsyncFd epoll stayed stale after the vsock transport reset, causing health checks to hang for ~60s.

The clone path already uses a hard error (`put_mmds().context()?`). The baseline pre-start snapshot path now matches.

Found by code review on PR #468.

Tested: `make test-root FILTER=localhost_rootless_btrfs_snapshot_restore`
- First run: snapshot created, patch_mmds succeeds (hard error path)
- Second run: restored from snapshot, exec stress 10/10 in 14ms
Summary
The Problem
After snapshot restore, Firecracker's vsock transport reset leaves the exec server's AsyncFd epoll registration stale. Two signals fired per restore:
1. handle_clone_restore() via the restore-epoch watcher (correct)
2. agent.rs after notify_cache_ready_and_wait() (duplicate)

The duplicate re_register() corrupted edge-triggered epoll: the old AsyncFd consumed the notification, and the new AsyncFd never saw pending connections. Health checks hung for ~60s until kernel vsock cleanup expired the stale connections.

Trace evidence (vsock muxer log, vm-ba97c):
Container started at T+0.28s. Exec broken for 60 more seconds.
The Fix: Deterministic Handshake Chain
Every transition has an explicit signal. Zero timing dependencies.
Commit 1: Remove duplicate rebind (c15aa6b)

- agent.rs: removed the duplicate rebind signal
- restore.rs: exec rebind BEFORE output reconnect

Commit 2: Deterministic handshake chain (b7714df)
fc-agent side (4 files):
- exec.rs: After re_register(), signals rebind_done (AtomicBool + Notify)
- restore.rs: Signals exec rebind, waits for rebind_done confirmation (5s timeout), THEN reconnects output
- agent.rs: Added exec_rebind_done/exec_rebind_done_notify Arcs, removed duplicate signal
- mmds.rs: Threads new params through watch_restore_epoch to both handle_clone_restore call sites

Host side (3 files):
- listeners.rs: Added connected_tx oneshot to run_output_listener(), fired on first accept
- snapshot.rs: Waits for output_connected_rx (30s timeout) before spawning health monitor; removed stale output_reconnect.notify_one()
- podman/mod.rs: Passes None for connected_tx (non-snapshot path)

Test Results
Primary test (localhost_rootless_btrfs_snapshot_restore): PASSED

Exec stress: 10 parallel calls in 16.3ms (max single: 15.3ms), zero timeouts
Full snapshot suite: 82/82 passed, 0 failures
Clone restore timing across all tests (MMDS signal → healthy):
Before: ~61s. After: <500ms. Zero exec timeouts across all 82 tests.
Test plan
- `make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1`
- `make test-root FILTER=snapshot STREAM=1` (82/82 passed)
- No `exec timed out` messages in any test log
- `fc-agent output connected, exec server ready` appears before health monitor start in all clone restores