fix: deterministic exec rebind handshake eliminates 60s post-restore hang #468

Merged
ejc3 merged 2 commits into main from fix-exec-double-rebind
Feb 23, 2026
Conversation

@ejc3 (Owner) commented Feb 22, 2026

Summary

  • Replace racy double-rebind with deterministic handshake chain for exec server recovery after snapshot restore
  • Reduce restore-to-healthy from ~61s to ~0.5s (114x faster)
  • Add host-side output connection gate: health monitor waits until fc-agent confirms exec is ready
  • Add post-restore exec stress test: 10 parallel calls, catches regressions instantly

The Problem

After snapshot restore, Firecracker's vsock transport reset leaves the exec server's AsyncFd epoll registration stale. Two signals fired per restore:

  1. handle_clone_restore() via restore-epoch watcher (correct)
  2. agent.rs after notify_cache_ready_and_wait() (duplicate)

The duplicate re_register() corrupted edge-triggered epoll: old AsyncFd consumed the notification, new AsyncFd never saw pending connections. Health checks hung for ~60s until kernel vsock cleanup expired stale connections.

Trace evidence (vsock muxer log, vm-ba97c):

T+0.009s  Exec #1 → WORKS
T+0.076s  Exec #2 → WORKS
T+0.520s  Exec #3 → guest ACKs, app never processes → 5s timeout
T+5-55s   Exec #4-#9 → same: kernel accepts, exec server dead
T+60.5s   Guest RSTs all stale connections → Exec #10 WORKS

Container started at T+0.28s. Exec broken for 60 more seconds.

The Fix: Deterministic Handshake Chain

exec_rebind_signal → exec_re_register → rebind_done → output.reconnect()
                                                            ↓
                                             host accepts output connection
                                                            ↓
                                                health monitor spawns

Every transition has an explicit signal. Zero timing dependencies.

Commit 1: Remove duplicate rebind (c15aa6b)

  • Remove duplicate exec rebind signal from agent.rs
  • Swap ordering in restore.rs: exec rebind BEFORE output reconnect
  • Add post-restore exec stress test (10 parallel calls, 10s timeout each)

Commit 2: Deterministic handshake chain (b7714df)

fc-agent side (4 files):

  • exec.rs: After re_register(), signals rebind_done (AtomicBool + Notify)
  • restore.rs: Signals exec rebind, waits for rebind_done confirmation (5s timeout), THEN reconnects output
  • agent.rs: Added exec_rebind_done / exec_rebind_done_notify Arcs, removed duplicate signal
  • mmds.rs: Threads new params through watch_restore_epoch to both handle_clone_restore call sites

Host side (3 files):

  • listeners.rs: Added connected_tx oneshot to run_output_listener(), fired on first accept
  • snapshot.rs: Waits for output_connected_rx (30s timeout) before spawning health monitor; removed stale output_reconnect.notify_one()
  • podman/mod.rs: Passes None for connected_tx (non-snapshot path)

Test Results

Primary test (localhost_rootless_btrfs_snapshot_restore): PASSED

T+0ms    VM resume (total_snapshot_ms=6)
T+35ms   fc-agent output connected, exec server ready
T+35ms   Health monitor started
T+533ms  VM healthy

Exec stress: 10 parallel calls in 16.3ms (max single: 15.3ms), zero timeouts

Full snapshot suite: 82/82 passed, 0 failures

Clone restore timing across all tests (MMDS signal → healthy):

Test                 Signal→Connected   Connected→Healthy
btrfs-rw-warm        35ms               369ms
diff-perf-2          97ms               16ms
user-parent-clone    100ms              144ms
diskdir-only-clone   115ms              171ms
fuse-large-clone     107ms              169ms

Before: ~61s. After: <500ms. Zero exec timeouts across all 82 tests.

Test plan

  • make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
  • make test-root FILTER=snapshot STREAM=1 (82/82 passed)
  • Zero exec timed out messages in any test log
  • fc-agent output connected, exec server ready appears before health monitor start in all clone restores

Commit message (c15aa6b):

After snapshot restore, two exec rebind signals were fired:
1. handle_clone_restore() in restore.rs (correct, via restore-epoch watcher)
2. agent.rs after notify_cache_ready_and_wait() returns (duplicate)

The double re_register() left the VsockListener's AsyncFd epoll in a broken
state where readability events weren't delivered for ~150 seconds, causing
all health monitor exec calls to timeout.

Fix:
- Remove duplicate rebind signal from agent.rs (root cause)
- Swap ordering in restore.rs: exec rebind BEFORE output reconnect
  (host uses output reconnection as readiness signal)
- Add post-restore exec stress test: 10 parallel calls with 10s timeout
  each — catches this regression instantly

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore
  10 parallel exec calls after restore: 11-19ms each (20.8ms total)
  Full test: 133s (was timing out at 120s before fix)
claude-claude bot (Contributor) commented Feb 22, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. The changes are correct and well-documented.

Core fix (agent.rs): Removing the duplicate exec_rebind_needed.store(true) + exec_rebind.notify_one() after notify_cache_ready_and_wait() is correct — the restore-epoch watcher in mmds.rs already triggers handle_clone_restore() which sends the rebind signal. The double re_register() call was the root cause of the ~150s epoll stall.

Ordering change (restore.rs): Moving exec rebind before output reconnect is the right call — the host treats output vsock reconnection as a readiness signal, so exec must be functional first.

Doc comments (exec.rs, vsock.rs): Accurately document the single-signal invariant and the re_register mechanism.

Regression test: 10 parallel exec calls with 10s individual timeouts directly targets the failure mode (150s epoll stall after double rebind). Good use of tokio::spawn for true parallelism and std::time::Instant for wall-clock measurement.

Summary

Clean, well-motivated bug fix. The root cause (double re_register() leaving AsyncFd epoll broken) is clearly identified, the fix is minimal (remove the duplicate signal source), and the regression test directly validates the fix.



@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c15aa6b0ce


Comment on lines 233 to 237:

```rust
if container::notify_cache_ready_and_wait(&digest, &restore_flag) {
    eprintln!("[fc-agent] cache ready notification acknowledged");
    // Pre-start snapshot was taken and we've been restored into a new
    // Firecracker instance. The vsock transport was reset, which
    // invalidates the exec server's listener socket (stale AsyncFd epoll).
    // Signal it to re-bind. Set flag BEFORE notify to prevent race where
    // select! drops the Notified future (see exec.rs doc comment).
    exec_rebind_needed.store(true, std::sync::atomic::Ordering::Release);
    exec_rebind.notify_one();
    // Exec rebind is handled by handle_clone_restore() (via the
    // restore-epoch watcher). Do NOT signal a second rebind here —
    // a double re-register leaves the listener in a broken state
```

P1: Reintroduce exec rebind fallback after cache-wait success

This branch now does nothing after notify_cache_ready_and_wait() returns true, which makes exec rebind depend entirely on watch_restore_epoch(). That is unsafe because notify_cache_ready_and_wait() can succeed from a transport reset without restore_flag being set (see fc-agent/src/container.rs reset paths), and the host’s post-snapshot MMDS bump is explicitly best-effort (src/commands/common.rs logs and continues on patch_mmds failure at lines 1444-1452). In that failure mode, the watcher never signals rebind and the exec listener stays on stale epoll registration, so post-snapshot exec/health checks can hang.


ejc3 (Owner, Author) replied:
Valid concern, partially. The two MMDS bump sites have different error handling:

  1. Clone restore path (src/commands/common.rs:1006): Uses put_mmds().await.context("...")? — hard error, propagates failure. If MMDS fails, the clone fails entirely. So clones are safe.

  2. Baseline pre-start snapshot path (src/commands/common.rs:1444): Uses if let Err(e) — best-effort. If this fails, the restore-epoch watcher on the baseline won't fire, and exec stays stale on the baseline.

The baseline case is real but low-impact: the baseline continues to run podman run and start the container, but the health monitor was already running before the snapshot. The baseline itself isn't serving traffic — it's a template. The startup snapshot (taken when healthy) would hit the same best-effort MMDS bump, but by that point the exec server has been working for the entire boot.

That said, upgrading the baseline's patch_mmds to hard-error (or adding a fallback rebind signal) is the right hardening. Will address in a follow-up.

Commit message (b7714df):

Replace the racy double-rebind approach with a deterministic handshake chain
that guarantees the exec server's AsyncFd epoll is re-registered before the
host starts health-checking. Reduces restore-to-healthy from ~61s to ~0.5s.

## The Problem

After snapshot restore, Firecracker's vsock transport reset
(VIRTIO_VSOCK_EVENT_TRANSPORT_RESET) leaves the exec server's AsyncFd epoll
registration stale. The previous fix (c15aa6b) removed the duplicate rebind
signal from agent.rs but left a timing gap: if the restore-epoch watcher's
single signal arrived late, the host's health monitor would start exec calls
against a stale listener, hanging for ~60s until the kernel's vsock cleanup
expired the stale connections.

## Trace Evidence (the smoking gun)

From the vsock muxer log of a failing run (vm-ba97c):

  T+0.009s  Exec call #1 → WORKS (167+144+176+123+71+27 bytes response)
  T+0.076s  Exec call #2 → WORKS
  T+0.520s  Exec call #3 → guest ACKs, receives request, sends NOTHING → 5s timeout
  T+5.5-55s Exec calls #4-#9 → same pattern: kernel accepts, app never processes
  T+60.5s   Guest sends RST for ALL stale connections simultaneously
  T+60.5s   Exec call #10 → WORKS → "container running status running=true"

The container started at T+0.28s. The exec server was broken for 60 more
seconds because the duplicate re_register() from agent.rs corrupted the
edge-triggered epoll: the old AsyncFd consumed the edge notification, and
the new AsyncFd never received events for pending connections.

## The Fix: Deterministic Handshake Chain

  exec_rebind_signal → exec_re_register → rebind_done → output.reconnect()
                                                              ↓
                                               host accepts output connection
                                                              ↓
                                                  health monitor spawns

Every transition has an explicit signal. Zero timing dependencies.

### fc-agent side (4 files):

- exec.rs: After re_register(), signals rebind_done (AtomicBool + Notify)
- restore.rs: Signals exec rebind, waits for rebind_done confirmation (5s
  timeout), THEN reconnects output vsock
- agent.rs: Removed duplicate rebind signal after notify_cache_ready_and_wait;
  added exec_rebind_done/exec_rebind_done_notify Arcs
- mmds.rs: Threads new params through watch_restore_epoch to both
  handle_clone_restore call sites

### Host side (3 files):

- listeners.rs: Added connected_tx oneshot to run_output_listener(), fired
  on first output connection accept
- snapshot.rs: Waits for output_connected_rx (30s timeout) before spawning
  health monitor; removed stale output_reconnect.notify_one() for startup
  snapshots
- podman/mod.rs: Passes None for connected_tx (non-snapshot path)

## Results

Before: restore-to-healthy = ~61s (exec broken, 9 consecutive 5s timeouts)
After:  restore-to-healthy = ~0.5s (35ms to output connected, 533ms to healthy)

Post-restore exec stress test: 10 parallel calls completed in 16.3ms
(max single: 15.3ms), zero timeouts.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
@ejc3 ejc3 changed the title from "fix: remove duplicate exec rebind signal causing 150s post-restore hang" to "fix: deterministic exec rebind handshake eliminates 60s post-restore hang" on Feb 23, 2026
claude-claude bot (Contributor) commented Feb 23, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. This review covers commit b7714df (the deterministic handshake chain), which was added after the previous review that covered only commit c15aa6b.

Handshake chain correctness: The deterministic ordering — exec rebind → rebind_done confirmation → output reconnect → host health monitor start — is sound. Each transition is gated by an explicit signal with no timing dependencies.

Race condition analysis:

  • tokio::sync::Notify stores a permit if notify_one() fires before .notified() is created, so the exec_rebind_done_notify handshake cannot lose notifications.
  • Double-restore races are prevented because watch_restore_epoch() is a single async loop that .awaits handle_clone_restore() — no two restores can overlap.
  • The exec_rebind_done flag reset (store(false)) before signaling is safe because the previous restore cycle's wait has already completed by the time the watcher loops back.
  • Memory ordering (Release/AcqRel) is correct for the flag + notify pattern.

Host-side gate (output_connected_rx in snapshot.rs): Correctly fires on the first output listener accept, which in cmd_snapshot_run is always the post-restore fc-agent connection (listener is created fresh before VM restore). The --exec early-return path correctly bypasses this gate (it doesn't spawn a health monitor).

Test: 10 parallel exec calls with 10s individual timeouts directly targets the failure mode. The 30s total assertion is conservative but appropriate.

PR description: Accurately describes both commits and matches the implementation.

Summary

Solid follow-up to commit 1. The handshake chain is deterministic and race-free. The host-side output connection gate correctly serializes health monitor startup after exec readiness confirmation.



@ejc3 ejc3 merged commit 5e465a8 into main Feb 23, 2026
18 checks passed
ejc3 added a commit that referenced this pull request Feb 23, 2026
The restore-epoch MMDS update in create_snapshot_core() was best-effort
(if let Err → warn). This was dangerous because the entire exec rebind
chain depends on it: epoch watcher → handle_clone_restore() → exec
re_register → output reconnect. If patch_mmds failed silently, the
exec server's AsyncFd epoll stayed stale after vsock transport reset,
causing health checks to hang for ~60s.

The clone path already uses hard error (put_mmds().context()?). The
baseline pre-start snapshot path now matches.

Found by code review on PR #468.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore
  - First run: snapshot created, patch_mmds succeeds (hard error path)
  - Second run: restored from snapshot, exec stress 10/10 in 14ms