fix: deterministic exec rebind handshake eliminates 60s post-restore hang #468

Merged
ejc3 merged 2 commits into main from fix-exec-double-rebind
Feb 23, 2026
Conversation

@ejc3 (Owner) commented Feb 22, 2026

Summary

  • Replace racy double-rebind with deterministic handshake chain for exec server recovery after snapshot restore
  • Reduce restore-to-healthy from ~61s to ~0.5s (114x faster)
  • Add host-side output connection gate: health monitor waits until fc-agent confirms exec is ready
  • Add post-restore exec stress test: 10 parallel calls, catches regressions instantly

The Problem

After snapshot restore, Firecracker's vsock transport reset leaves the exec server's AsyncFd epoll registration stale. Two signals fired per restore:

  1. handle_clone_restore() via restore-epoch watcher (correct)
  2. agent.rs after notify_cache_ready_and_wait() (duplicate)

The duplicate re_register() corrupted edge-triggered epoll: old AsyncFd consumed the notification, new AsyncFd never saw pending connections. Health checks hung for ~60s until kernel vsock cleanup expired stale connections.

Trace evidence (vsock muxer log, vm-ba97c):

T+0.009s  Exec #1 → WORKS
T+0.076s  Exec #2 → WORKS
T+0.520s  Exec #3 → guest ACKs, app never processes → 5s timeout
T+5-55s   Exec #4-#9 → same: kernel accepts, exec server dead
T+60.5s   Guest RSTs all stale connections → Exec #10 WORKS

Container started at T+0.28s. Exec broken for 60 more seconds.

The Fix: Deterministic Handshake Chain

exec_rebind_signal → exec_re_register → rebind_done → output.reconnect()
                                                            ↓
                                             host accepts output connection
                                                            ↓
                                                health monitor spawns

Every transition has an explicit signal. Zero timing dependencies.

Commit 1: Remove duplicate rebind (c15aa6b)

  • Remove duplicate exec rebind signal from agent.rs
  • Swap ordering in restore.rs: exec rebind BEFORE output reconnect
  • Add post-restore exec stress test (10 parallel calls, 10s timeout each)

Commit 2: Deterministic handshake chain (b7714df)

fc-agent side (4 files):

  • exec.rs: After re_register(), signals rebind_done (AtomicBool + Notify)
  • restore.rs: Signals exec rebind, waits for rebind_done confirmation (5s timeout), THEN reconnects output
  • agent.rs: Added exec_rebind_done / exec_rebind_done_notify Arcs, removed duplicate signal
  • mmds.rs: Threads new params through watch_restore_epoch to both handle_clone_restore call sites

Host side (3 files):

  • listeners.rs: Added connected_tx oneshot to run_output_listener(), fired on first accept
  • snapshot.rs: Waits for output_connected_rx (30s timeout) before spawning health monitor; removed stale output_reconnect.notify_one()
  • podman/mod.rs: Passes None for connected_tx (non-snapshot path)

Test Results

Primary test (localhost_rootless_btrfs_snapshot_restore): PASSED

T+0ms    VM resume (total_snapshot_ms=6)
T+35ms   fc-agent output connected, exec server ready
T+35ms   Health monitor started
T+533ms  VM healthy

Exec stress: 10 parallel calls in 16.3ms (max single: 15.3ms), zero timeouts

Full snapshot suite: 82/82 passed, 0 failures

Clone restore timing across all tests (MMDS signal → healthy):

Test                 Signal→Connected   Connected→Healthy
btrfs-rw-warm        35ms               369ms
diff-perf-2          97ms               16ms
user-parent-clone    100ms              144ms
diskdir-only-clone   115ms              171ms
fuse-large-clone     107ms              169ms

Before: ~61s. After: <500ms. Zero exec timeouts across all 82 tests.

Test plan

  • make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
  • make test-root FILTER=snapshot STREAM=1 (82/82 passed)
  • Zero exec timed out messages in any test log
  • fc-agent output connected, exec server ready appears before health monitor start in all clone restores

Commit message (c15aa6b):

After snapshot restore, two exec rebind signals were fired:
1. handle_clone_restore() in restore.rs (correct, via restore-epoch watcher)
2. agent.rs after notify_cache_ready_and_wait() returns (duplicate)

The double re_register() left the VsockListener's AsyncFd epoll in a broken
state where readability events weren't delivered for ~150 seconds, causing
all health monitor exec calls to timeout.

Fix:
- Remove duplicate rebind signal from agent.rs (root cause)
- Swap ordering in restore.rs: exec rebind BEFORE output reconnect
  (host uses output reconnection as readiness signal)
- Add post-restore exec stress test: 10 parallel calls with 10s timeout
  each — catches this regression instantly

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore
  10 parallel exec calls after restore: 11-19ms each (20.8ms total)
  Full test: 133s (was timing out at 120s before fix)
claude-claude bot (Contributor) commented Feb 22, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. The changes are correct and well-documented.

Core fix (agent.rs): Removing the duplicate exec_rebind_needed.store(true) + exec_rebind.notify_one() after notify_cache_ready_and_wait() is correct — the restore-epoch watcher in mmds.rs already triggers handle_clone_restore() which sends the rebind signal. The double re_register() call was the root cause of the ~150s epoll stall.

Ordering change (restore.rs): Moving exec rebind before output reconnect is the right call — the host treats output vsock reconnection as a readiness signal, so exec must be functional first.

Doc comments (exec.rs, vsock.rs): Accurately document the single-signal invariant and the re_register mechanism.

Regression test: 10 parallel exec calls with 10s individual timeouts directly targets the failure mode (150s epoll stall after double rebind). Good use of tokio::spawn for true parallelism and std::time::Instant for wall-clock measurement.

Summary

Clean, well-motivated bug fix. The root cause (double re_register() leaving AsyncFd epoll broken) is clearly identified, the fix is minimal (remove the duplicate signal source), and the regression test directly validates the fix.



@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c15aa6b0ce


Comment on lines 233 to 237:

```rust
if container::notify_cache_ready_and_wait(&digest, &restore_flag) {
    eprintln!("[fc-agent] cache ready notification acknowledged");
    // Pre-start snapshot was taken and we've been restored into a new
    // Firecracker instance. The vsock transport was reset, which
    // invalidates the exec server's listener socket (stale AsyncFd epoll).
    // Signal it to re-bind. Set flag BEFORE notify to prevent race where
    // select! drops the Notified future (see exec.rs doc comment).
    exec_rebind_needed.store(true, std::sync::atomic::Ordering::Release);
    exec_rebind.notify_one();
    // Exec rebind is handled by handle_clone_restore() (via the
    // restore-epoch watcher). Do NOT signal a second rebind here —
    // a double re-register leaves the listener in a broken state
```

P1: Reintroduce exec rebind fallback after cache-wait success

This branch now does nothing after notify_cache_ready_and_wait() returns true, which makes exec rebind depend entirely on watch_restore_epoch(). That is unsafe because notify_cache_ready_and_wait() can succeed from a transport reset without restore_flag being set (see fc-agent/src/container.rs reset paths), and the host’s post-snapshot MMDS bump is explicitly best-effort (src/commands/common.rs logs and continues on patch_mmds failure at lines 1444-1452). In that failure mode, the watcher never signals rebind and the exec listener stays on stale epoll registration, so post-snapshot exec/health checks can hang.


ejc3 (Owner, Author) replied:
Valid concern, partially. The two MMDS bump sites have different error handling:

  1. Clone restore path (src/commands/common.rs:1006): Uses put_mmds().await.context("...")? — hard error, propagates failure. If MMDS fails, the clone fails entirely. So clones are safe.

  2. Baseline pre-start snapshot path (src/commands/common.rs:1444): Uses if let Err(e) — best-effort. If this fails, the restore-epoch watcher on the baseline won't fire, and exec stays stale on the baseline.

The baseline case is real but low-impact: the baseline continues to run podman run and start the container, but the health monitor was already running before the snapshot. The baseline itself isn't serving traffic — it's a template. The startup snapshot (taken when healthy) would hit the same best-effort MMDS bump, but by that point the exec server has been working for the entire boot.

That said, upgrading the baseline's patch_mmds to hard-error (or adding a fallback rebind signal) is the right hardening. Will address in a follow-up.

Commit message (b7714df):

Replace the racy double-rebind approach with a deterministic handshake chain
that guarantees the exec server's AsyncFd epoll is re-registered before the
host starts health-checking. Reduces restore-to-healthy from ~61s to ~0.5s.

## The Problem

After snapshot restore, Firecracker's vsock transport reset
(VIRTIO_VSOCK_EVENT_TRANSPORT_RESET) leaves the exec server's AsyncFd epoll
registration stale. The previous fix (c15aa6b) removed the duplicate rebind
signal from agent.rs but left a timing gap: if the restore-epoch watcher's
single signal arrived late, the host's health monitor would start exec calls
against a stale listener, hanging for ~60s until the kernel's vsock cleanup
expired the stale connections.

## Trace Evidence (the smoking gun)

From the vsock muxer log of a failing run (vm-ba97c):

  T+0.009s  Exec call #1 → WORKS (167+144+176+123+71+27 bytes response)
  T+0.076s  Exec call #2 → WORKS
  T+0.520s  Exec call #3 → guest ACKs, receives request, sends NOTHING → 5s timeout
  T+5.5-55s Exec calls #4-#9 → same pattern: kernel accepts, app never processes
  T+60.5s   Guest sends RST for ALL stale connections simultaneously
  T+60.5s   Exec call #10 → WORKS → "container running status running=true"

The container started at T+0.28s. The exec server was broken for 60 more
seconds because the duplicate re_register() from agent.rs corrupted the
edge-triggered epoll: the old AsyncFd consumed the edge notification, and
the new AsyncFd never received events for pending connections.

## The Fix: Deterministic Handshake Chain

  exec_rebind_signal → exec_re_register → rebind_done → output.reconnect()
                                                              ↓
                                               host accepts output connection
                                                              ↓
                                                  health monitor spawns

Every transition has an explicit signal. Zero timing dependencies.

### fc-agent side (4 files):

- exec.rs: After re_register(), signals rebind_done (AtomicBool + Notify)
- restore.rs: Signals exec rebind, waits for rebind_done confirmation (5s
  timeout), THEN reconnects output vsock
- agent.rs: Removed duplicate rebind signal after notify_cache_ready_and_wait;
  added exec_rebind_done/exec_rebind_done_notify Arcs
- mmds.rs: Threads new params through watch_restore_epoch to both
  handle_clone_restore call sites

### Host side (3 files):

- listeners.rs: Added connected_tx oneshot to run_output_listener(), fired
  on first output connection accept
- snapshot.rs: Waits for output_connected_rx (30s timeout) before spawning
  health monitor; removed stale output_reconnect.notify_one() for startup
  snapshots
- podman/mod.rs: Passes None for connected_tx (non-snapshot path)

## Results

Before: restore-to-healthy = ~61s (exec broken, 9 consecutive 5s timeouts)
After:  restore-to-healthy = ~0.5s (35ms to output connected, 533ms to healthy)

Post-restore exec stress test: 10 parallel calls completed in 16.3ms
(max single: 15.3ms), zero timeouts.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
@ejc3 ejc3 changed the title from "fix: remove duplicate exec rebind signal causing 150s post-restore hang" to "fix: deterministic exec rebind handshake eliminates 60s post-restore hang" on Feb 23, 2026
claude-claude bot (Contributor) commented Feb 23, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. This review covers commit b7714df (the deterministic handshake chain), which was added after the previous review that covered only commit c15aa6b.

Handshake chain correctness: The deterministic ordering — exec rebind → rebind_done confirmation → output reconnect → host health monitor start — is sound. Each transition is gated by an explicit signal with no timing dependencies.

Race condition analysis:

  • tokio::sync::Notify stores a permit if notify_one() fires before .notified() is created, so the exec_rebind_done_notify handshake cannot lose notifications.
  • Double-restore races are prevented because watch_restore_epoch() is a single async loop that .awaits handle_clone_restore() — no two restores can overlap.
  • The exec_rebind_done flag reset (store(false)) before signaling is safe because the previous restore cycle's wait has already completed by the time the watcher loops back.
  • Memory ordering (Release/AcqRel) is correct for the flag + notify pattern.

Host-side gate (output_connected_rx in snapshot.rs): Correctly fires on the first output listener accept, which in cmd_snapshot_run is always the post-restore fc-agent connection (listener is created fresh before VM restore). The --exec early-return path correctly bypasses this gate (it doesn't spawn a health monitor).

Test: 10 parallel exec calls with 10s individual timeouts directly targets the failure mode. The 30s total assertion is conservative but appropriate.

PR description: Accurately describes both commits and matches the implementation.

Summary

Solid follow-up to commit 1. The handshake chain is deterministic and race-free. The host-side output connection gate correctly serializes health monitor startup after exec readiness confirmation.



@ejc3 ejc3 merged commit 5e465a8 into main Feb 23, 2026
18 checks passed
ejc3 added a commit that referenced this pull request Feb 23, 2026
The restore-epoch MMDS update in create_snapshot_core() was best-effort
(if let Err → warn). This was dangerous because the entire exec rebind
chain depends on it: epoch watcher → handle_clone_restore() → exec
re_register → output reconnect. If patch_mmds failed silently, the
exec server's AsyncFd epoll stayed stale after vsock transport reset,
causing health checks to hang for ~60s.

The clone path already uses hard error (put_mmds().context()?). The
baseline pre-start snapshot path now matches.

Found by code review on PR #468.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore
  - First run: snapshot created, patch_mmds succeeds (hard error path)
  - Second run: restored from snapshot, exec stress 10/10 in 14ms