fix: upgrade snapshot epoch bump to hard error by ejc3 · Pull Request #470 · ejc3/fcvm

ejc3 · 2026-02-23T04:14:49Z

Fix: upgrade snapshot epoch bump to hard error + cleanup

Found by code review on PR #468 (inline comment on fc-agent/src/agent.rs:237).

The Problem

In create_snapshot_core(), the restore-epoch MMDS update was best-effort:

if let Err(e) = client.patch_mmds(...).await {
    warn!(error = %e, "failed to bump restore-epoch after snapshot");
}

The entire exec rebind chain from PR #468 depends on this succeeding:

patch_mmds → epoch watcher → handle_clone_restore() → exec re_register → output reconnect

If patch_mmds fails silently, the epoch watcher never fires, handle_clone_restore() is never called, and the exec server's AsyncFd epoll stays stale after vsock transport reset — causing health checks to hang for ~60s.
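
To make the dependency concrete, here is a minimal, std-only sketch of the chain. It is not the fc-agent code: `EpochWatcher`, `observe`, `bump_epoch`, and the `restores_handled` counter are hypothetical stand-ins, and the real watcher reads the epoch from MMDS rather than from a local integer. The only point is that the restore handler runs when the observed epoch changes, so a failed bump means it never runs.

```rust
/// Hypothetical stand-in for the guest-side epoch watcher.
struct EpochWatcher {
    last_seen: u64,
    /// Counts how often the restore handler ran (illustration only).
    restores_handled: u32,
}

impl EpochWatcher {
    fn new(initial_epoch: u64) -> Self {
        Self { last_seen: initial_epoch, restores_handled: 0 }
    }

    /// Called whenever the watcher reads the current epoch.
    /// Returns true if the restore handler ran.
    fn observe(&mut self, current_epoch: u64) -> bool {
        if current_epoch == self.last_seen {
            // No bump seen: nothing downstream is triggered.
            return false;
        }
        self.last_seen = current_epoch;
        self.handle_clone_restore();
        true
    }

    /// Placeholder for the real handler, which would re-register the exec
    /// server and reconnect output. Here it only increments a counter.
    fn handle_clone_restore(&mut self) {
        self.restores_handled += 1;
    }
}

/// Hypothetical host-side bump. `reachable == false` models an MMDS failure.
fn bump_epoch(store: &mut u64, reachable: bool) -> Result<(), String> {
    if !reachable {
        return Err("MMDS unreachable".to_string());
    }
    *store += 1;
    Ok(())
}
```

If `bump_epoch` fails and the caller only logs the error, every later `observe` call returns `false` and the handler count stays at zero, which is the stale state described above.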

The clone path already uses hard error (put_mmds().context("...")?). The baseline pre-start snapshot path was the only weak link.

Changes

Upgrade patch_mmds to hard error — if MMDS is unreachable after VM resume, the snapshot fails rather than leaving the exec server silently broken.
Clean up temp_snapshot_dir on failure — the new error path removes the .creating directory (memory snapshot files) before returning, matching all other early-return error paths in the function. Prevents ENOSPC from orphaned snapshot data.
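
The two changes combine into one pattern: clean up, then return the error. The sketch below illustrates it with the standard library only. `create_snapshot_sketch`, the `bump_epoch` closure, and the `memory.bin` file name are assumptions made for illustration; the real function is async and uses `.context(...)` rather than `io::Error`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical, simplified stand-in for the tail of create_snapshot_core().
/// `bump_epoch` models the patch_mmds call.
fn create_snapshot_sketch<F>(temp_snapshot_dir: &Path, bump_epoch: F) -> io::Result<()>
where
    F: FnOnce() -> io::Result<()>,
{
    // Pretend the snapshot files were already written to the temp directory.
    fs::create_dir_all(temp_snapshot_dir)?;
    fs::write(temp_snapshot_dir.join("memory.bin"), b"snapshot data")?;

    if let Err(e) = bump_epoch() {
        // Best-effort cleanup so a failed snapshot does not leave data behind.
        let _ = fs::remove_dir_all(temp_snapshot_dir);
        // Hard error: the caller sees the failure instead of a log line.
        return Err(io::Error::new(
            e.kind(),
            format!("bumping restore-epoch in MMDS after snapshot: {}", e),
        ));
    }
    Ok(())
}
```

Using `if let Err(e)` instead of a bare `?` leaves room for the cleanup step before the early return.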

Test Results

make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
  First run: snapshot created, patch_mmds succeeds
  Second run: restored from snapshot, VM healthy in <1s
  Exec stress: 10 parallel calls completed in 14.0ms (max 13.1ms)
  PASSED

The restore-epoch MMDS update in create_snapshot_core() was best-effort (if let Err → warn). This was dangerous because the entire exec rebind chain depends on it: epoch watcher → handle_clone_restore() → exec re_register → output reconnect. If patch_mmds failed silently, the exec server's AsyncFd epoll stayed stale after vsock transport reset, causing health checks to hang for ~60s.

The clone path already uses hard error (put_mmds().context()?). The baseline pre-start snapshot path now matches.

Found by code review on PR #468.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore
- First run: snapshot created, patch_mmds succeeds (hard error path)
- Second run: restored from snapshot, exec stress 10/10 in 14ms

claude-claude · 2026-02-23T04:17:01Z

🔍 Claude Review

SEVERITY: none

Findings

No issues found. This is a clean, well-motivated change:

Correctness: Upgrading patch_mmds from best-effort (if let Err → warn) to hard error (.context("...")?) is the right call. The entire exec rebind chain (epoch watcher → handle_clone_restore() → AsyncFd re-register → output reconnect) depends on the epoch bump succeeding. Silent failure here causes health checks to hang for ~60s.
Consistency: The clone path already uses hard error for put_mmds (line 1014). This makes the snapshot path match.
Comment quality: The updated comment clearly explains why this must succeed, referencing the specific failure mode.
Error propagation: The function returns Result<()>, so ? propagation is correct. The .context() message is descriptive.
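
For readers comparing the two styles, the following std-only sketch contrasts best-effort handling with `?` propagation. The `Context` trait here is a hand-rolled stand-in that only imitates the `.context(...)` call shape seen in the diff; `patch_mmds`, `snapshot_best_effort`, and `snapshot_hard_error` are hypothetical names, and the `reachable` flag is an assumption used to simulate an MMDS failure.

```rust
use std::fmt;

/// Hand-rolled error type that carries a context message (illustration only).
#[derive(Debug)]
struct ContextError {
    context: &'static str,
    source: String,
}

impl fmt::Display for ContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.context, self.source)
    }
}

impl std::error::Error for ContextError {}

/// Minimal stand-in for a `.context(...)` extension method on Result.
trait Context<T> {
    fn context(self, msg: &'static str) -> Result<T, ContextError>;
}

impl<T, E: fmt::Display> Context<T> for Result<T, E> {
    fn context(self, msg: &'static str) -> Result<T, ContextError> {
        self.map_err(|e| ContextError { context: msg, source: e.to_string() })
    }
}

/// Hypothetical MMDS update; `reachable == false` simulates a failure.
fn patch_mmds(reachable: bool) -> Result<(), String> {
    if reachable { Ok(()) } else { Err("MMDS unreachable".to_string()) }
}

/// Old behavior: log and carry on, so the caller never learns of the failure.
fn snapshot_best_effort(reachable: bool) -> Result<(), ContextError> {
    if let Err(e) = patch_mmds(reachable) {
        eprintln!("warn: failed to bump restore-epoch after snapshot: {}", e);
    }
    Ok(())
}

/// New behavior: attach context and propagate with `?`.
fn snapshot_hard_error(reachable: bool) -> Result<(), ContextError> {
    patch_mmds(reachable).context("bumping restore-epoch in MMDS after snapshot")?;
    Ok(())
}
```

With `reachable` set to `false`, `snapshot_best_effort` still returns `Ok(())`, while `snapshot_hard_error` returns an error whose message starts with the context string.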

Summary

Small, focused fix that closes a correctness gap identified during code review of PR #468. No issues.

Review by Claude | Run

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bffc6fcc3e


chatgpt-codex-connector · 2026-02-23T04:17:21Z

src/commands/common.rs

-    {
-        warn!(error = %e, "failed to bump restore-epoch after snapshot (FUSE remount may be delayed)");
-    }
+        .context("bumping restore-epoch in MMDS after snapshot")?;


Clean up temp snapshot data on MMDS bump failure

This new hard-error path returns immediately from create_snapshot_core when patch_mmds fails, but unlike earlier failure checks in this function it does not remove temp_snapshot_dir first. In the failure scenario this commit is explicitly targeting (MMDS update errors), that leaves a large .creating directory (including memory snapshot files) on disk, which can consume significant space and cause ENOSPC for later VM/snapshot operations until another snapshot attempt happens to clean it.

The hard-error patch_mmds path returns early from create_snapshot_core() but does not remove temp_snapshot_dir first. This leaves a large .creating directory (memory snapshot files) on disk, which can cause ENOSPC. Match the cleanup pattern used by all other early-return error paths in this function. Found by code review on PR #470.

claude-claude · 2026-02-23T04:21:25Z

🔍 Claude Review

SEVERITY: none

Findings

No issues found. This is a clean, well-targeted two-commit fix:

Commit bffc6fc — hard error upgrade: Correctly upgrades the patch_mmds call from best-effort warning to hard error via .context("...")?. This matches the clone path at line 1014 (put_mmds().context("...")?) and closes the correctness gap where a silent MMDS failure would leave the exec server stale. (Covered in previous review.)
Commit 2134b7f — temp dir cleanup: The new hard error path now returns early via return Err(e).context(...), so it needs the same remove_dir_all(&temp_snapshot_dir) cleanup that all other early-return error paths in the function use (lines 1423, 1427, 1431). The if let Err(e) pattern is the right way to handle this since it allows inserting cleanup before the return.

Consistency check: The cleanup pattern (let _ = remove_dir_all + return Err(e).context(...)) exactly matches the three existing error paths at lines 1422–1432. Comment accurately explains the failure mode and why this must succeed.

Docs check: No documentation updates needed — this is an internal error-handling change with no user-facing behavioral change (snapshot creation already failed; this just fails earlier and more explicitly).

Summary

Clean two-commit fix that closes a correctness gap: MMDS epoch bump failures during snapshot creation are now hard errors with proper temp directory cleanup, consistent with all other error paths in the function.

Review by Claude | Run

chatgpt-codex-connector bot reviewed Feb 23, 2026

ejc3 merged commit 2134b7f into main Feb 23, 2026
18 checks passed

ejc3 merged 2 commits into main from fix-snapshot-epoch-hard-error
