fix: upgrade snapshot epoch bump to hard error #470
The restore-epoch MMDS update in create_snapshot_core() was best-effort (if let Err → warn). This was dangerous because the entire exec rebind chain depends on it: epoch watcher → handle_clone_restore() → exec re_register → output reconnect. If patch_mmds failed silently, the exec server's AsyncFd epoll stayed stale after vsock transport reset, causing health checks to hang for ~60s.

The clone path already uses a hard error (put_mmds().context()?). The baseline pre-start snapshot path now matches.

Found by code review on PR #468.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore
- First run: snapshot created, patch_mmds succeeds (hard error path)
- Second run: restored from snapshot, exec stress 10/10 in 14ms
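The failure mode described above can be sketched in a few lines of std-only Rust. This is a minimal illustration, not the project's code: `Mmds`, `EpochWatcher`, `bump_best_effort`, and `bump_hard_error` are hypothetical stand-ins, and plain `String` errors replace the `.context()` call used in the real change.

```rust
/// Hypothetical stand-in for the MMDS store (not the project's real API).
struct Mmds {
    restore_epoch: u64,
    reachable: bool,
}

impl Mmds {
    /// Writes a new epoch, or fails if the store cannot be reached.
    fn patch_epoch(&mut self, epoch: u64) -> Result<(), String> {
        if !self.reachable {
            return Err("MMDS unreachable".to_string());
        }
        self.restore_epoch = epoch;
        Ok(())
    }
}

/// Hypothetical guest-side watcher: reports a restore only when the epoch
/// it reads differs from the last value it saw.
struct EpochWatcher {
    last_seen: u64,
}

impl EpochWatcher {
    fn poll(&mut self, mmds: &Mmds) -> bool {
        if mmds.restore_epoch != self.last_seen {
            self.last_seen = mmds.restore_epoch;
            true
        } else {
            false
        }
    }
}

/// Old behaviour (assumed shape): log the failure and report success anyway.
fn bump_best_effort(mmds: &mut Mmds, epoch: u64) -> Result<(), String> {
    if let Err(e) = mmds.patch_epoch(epoch) {
        eprintln!("warning: failed to bump restore-epoch: {e}");
    }
    Ok(())
}

/// New behaviour (assumed shape): attach context and propagate the failure.
fn bump_hard_error(mmds: &mut Mmds, epoch: u64) -> Result<(), String> {
    mmds.patch_epoch(epoch)
        .map_err(|e| format!("bumping restore-epoch in MMDS after snapshot: {e}"))
}
```

With an unreachable store, `bump_best_effort` returns `Ok(())` while the watcher never observes a new epoch, so nothing downstream would rebind; `bump_hard_error` surfaces the same condition to the caller instead.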
🔍 Claude Review
SEVERITY: none
Findings
No issues found. This is a clean, well-motivated change:
Summary
Small, focused fix that closes a correctness gap identified during code review of PR #468. No issues.
Review by Claude
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bffc6fcc3e
src/commands/common.rs
Outdated
{
    warn!(error = %e, "failed to bump restore-epoch after snapshot (FUSE remount may be delayed)");
}
.context("bumping restore-epoch in MMDS after snapshot")?;
Clean up temp snapshot data on MMDS bump failure
This new hard-error path returns immediately from create_snapshot_core when patch_mmds fails, but unlike earlier failure checks in this function it does not remove temp_snapshot_dir first. In the failure scenario this commit is explicitly targeting (MMDS update errors), that leaves a large .creating directory (including memory snapshot files) on disk, which can consume significant space and cause ENOSPC for later VM/snapshot operations until another snapshot attempt happens to clean it.
The hard-error patch_mmds path returned early from create_snapshot_core() without removing temp_snapshot_dir first. That left a large .creating directory (memory snapshot files) on disk, which can cause ENOSPC. Match the cleanup pattern used by all other early-return error paths in this function. Found by code review on PR #470.
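The cleanup-before-early-return pattern described above might look roughly like this. It is a sketch under assumptions: `with_temp_cleanup` is a hypothetical helper, not a function from this repository, and the real code may perform the removal inline and asynchronously.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical helper: run `step`; if it fails, remove `temp_dir`
/// before handing the original error back to the caller.
fn with_temp_cleanup<T, E>(
    temp_dir: &Path,
    step: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    match step() {
        Ok(value) => Ok(value),
        Err(e) => {
            // Removal is best-effort: the step's error is the one worth
            // reporting, so a failure to delete is deliberately ignored.
            let _ = fs::remove_dir_all(temp_dir);
            Err(e)
        }
    }
}
```

Wrapping the fallible step this way keeps the temporary directory on success (so it can be promoted to the final snapshot location) and deletes it on any failure, which is the property the review comment asked for.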
🔍 Claude Review
SEVERITY: none
Findings
No issues found. This is a clean, well-targeted two-commit fix:
Consistency check: The cleanup pattern (
Docs check: No documentation updates needed. This is an internal error-handling change with no user-facing behavioral change (snapshot creation already failed; this just fails earlier and more explicitly).
Summary
Clean two-commit fix that closes a correctness gap: MMDS epoch bump failures during snapshot creation are now hard errors with proper temp directory cleanup, consistent with all other error paths in the function.
Review by Claude
Fix: upgrade snapshot epoch bump to hard error + cleanup
Found by code review on PR #468 (inline comment on fc-agent/src/agent.rs:237).

The Problem
In create_snapshot_core(), the restore-epoch MMDS update was best-effort (if let Err → warn).
The entire exec rebind chain from PR #468 depends on this succeeding: epoch watcher → handle_clone_restore() → exec re_register → output reconnect.
If patch_mmds fails silently, the epoch watcher never fires, handle_clone_restore() is never called, and the exec server's AsyncFd epoll stays stale after vsock transport reset, causing health checks to hang for ~60s.
The clone path already uses a hard error (put_mmds().context("...")?). The baseline pre-start snapshot path was the only weak link.

Changes
- Upgrade patch_mmds to hard error: if MMDS is unreachable after VM resume, the snapshot fails rather than leaving the exec server silently broken.
- Clean up temp_snapshot_dir on failure: the new error path removes the .creating directory (memory snapshot files) before returning, matching all other early-return error paths in the function. Prevents ENOSPC from orphaned snapshot data.

Test Results