Skip to content

Comments

fix: upgrade snapshot epoch bump to hard error#470

Merged
ejc3 merged 2 commits intomainfrom
fix-snapshot-epoch-hard-error
Feb 23, 2026
Merged

fix: upgrade snapshot epoch bump to hard error#470
ejc3 merged 2 commits intomainfrom
fix-snapshot-epoch-hard-error

Conversation

@ejc3
Copy link
Owner

@ejc3 ejc3 commented Feb 23, 2026

Fix: upgrade snapshot epoch bump to hard error + cleanup

Found by code review on PR #468 (inline comment on fc-agent/src/agent.rs:237).

The Problem

In create_snapshot_core(), the restore-epoch MMDS update was best-effort:

if let Err(e) = client.patch_mmds(...).await {
    warn!("failed to bump restore-epoch after snapshot");
}

The entire exec rebind chain from PR #468 depends on this succeeding:

patch_mmds → epoch watcher → handle_clone_restore() → exec re_register → output reconnect

If patch_mmds fails silently, the epoch watcher never fires, handle_clone_restore() is never called, and the exec server's AsyncFd epoll stays stale after vsock transport reset — causing health checks to hang for ~60s.

The clone path already uses hard error (put_mmds().context("...")?). The baseline pre-start snapshot path was the only weak link.

Changes

  1. Upgrade patch_mmds to hard error — if MMDS is unreachable after VM resume, the snapshot fails rather than leaving the exec server silently broken.

  2. Clean up temp_snapshot_dir on failure — the new error path removes the .creating directory (memory snapshot files) before returning, matching all other early-return error paths in the function. Prevents ENOSPC from orphaned snapshot data.

Test Results

make test-root FILTER=localhost_rootless_btrfs_snapshot_restore STREAM=1
  First run: snapshot created, patch_mmds succeeds
  Second run: restored from snapshot, VM healthy in <1s
  Exec stress: 10 parallel calls completed in 14.0ms (max 13.1ms)
  PASSED

The restore-epoch MMDS update in create_snapshot_core() was best-effort
(if let Err → warn). This was dangerous because the entire exec rebind
chain depends on it: epoch watcher → handle_clone_restore() → exec
re_register → output reconnect. If patch_mmds failed silently, the
exec server's AsyncFd epoll stayed stale after vsock transport reset,
causing health checks to hang for ~60s.

The clone path already uses hard error (put_mmds().context()?). The
baseline pre-start snapshot path now matches.

Found by code review on PR #468.

Tested: make test-root FILTER=localhost_rootless_btrfs_snapshot_restore
  - First run: snapshot created, patch_mmds succeeds (hard error path)
  - Second run: restored from snapshot, exec stress 10/10 in 14ms
@claude-claude
Copy link
Contributor

claude-claude bot commented Feb 23, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. This is a clean, well-motivated change:

  • Correctness: Upgrading patch_mmds from best-effort (if let Err → warn) to hard error (.context("...")?) is the right call. The entire exec rebind chain (epoch watcher → handle_clone_restore() → AsyncFd re-register → output reconnect) depends on the epoch bump succeeding. Silent failure here causes health checks to hang for ~60s.
  • Consistency: The clone path already uses hard error for put_mmds (line 1014). This makes the snapshot path match.
  • Comment quality: The updated comment clearly explains why this must succeed, referencing the specific failure mode.
  • Error propagation: The function returns Result<()>, so ? propagation is correct. The .context() message is descriptive.

Summary

Small, focused fix that closes a correctness gap identified during code review of PR #468. No issues.


Review by Claude | Run

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bffc6fcc3e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

{
warn!(error = %e, "failed to bump restore-epoch after snapshot (FUSE remount may be delayed)");
}
.context("bumping restore-epoch in MMDS after snapshot")?;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clean up temp snapshot data on MMDS bump failure

This new hard-error path returns immediately from create_snapshot_core when patch_mmds fails, but unlike earlier failure checks in this function it does not remove temp_snapshot_dir first. In the failure scenario this commit is explicitly targeting (MMDS update errors), that leaves a large .creating directory (including memory snapshot files) on disk, which can consume significant space and cause ENOSPC for later VM/snapshot operations until another snapshot attempt happens to clean it.

Useful? React with 👍 / 👎.

The hard-error patch_mmds path returns early from create_snapshot_core()
but didn't remove temp_snapshot_dir first. This leaves a large .creating
directory (memory snapshot files) on disk, which can cause ENOSPC.

Match the cleanup pattern used by all other early-return error paths in
this function.

Found by code review on PR #470.
@claude-claude
Copy link
Contributor

claude-claude bot commented Feb 23, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. This is a clean, well-targeted two-commit fix:

  1. Commit bffc6fc — hard error upgrade: Correctly upgrades the patch_mmds call from best-effort warning to hard error via .context("...")?. This matches the clone path at line 1014 (put_mmds().context("...")?) and closes the correctness gap where a silent MMDS failure would leave the exec server stale. (Covered in previous review.)

  2. Commit 2134b7f — temp dir cleanup: The new hard error path now returns early via return Err(e).context(...), so it needs the same remove_dir_all(&temp_snapshot_dir) cleanup that all other early-return error paths in the function use (lines 1423, 1427, 1431). The if let Err(e) pattern is the right way to handle this since it allows inserting cleanup before the return.

Consistency check: The cleanup pattern (let _ = remove_dir_all + return Err(e).context(...)) exactly matches the three existing error paths at lines 1422–1432. Comment accurately explains the failure mode and why this must succeed.

Docs check: No documentation updates needed — this is an internal error-handling change with no user-facing behavioral change (snapshot creation already failed; this just fails earlier and more explicitly).

Summary

Clean two-commit fix that closes a correctness gap: MMDS epoch bump failures during snapshot creation are now hard errors with proper temp directory cleanup, consistent with all other error paths in the function.


Review by Claude | Run

@ejc3 ejc3 merged commit 2134b7f into main Feb 23, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant