Skip to content

fix(box_impl): offload blocking handler.stop() and metrics() to spawn_blocking#415

Open
lilongen wants to merge 2 commits into
boxlite-ai:mainfrom
lilongen:fix/spawn-blocking-handler
Open

fix(box_impl): offload blocking handler.stop() and metrics() to spawn_blocking#415
lilongen wants to merge 2 commits into
boxlite-ai:mainfrom
lilongen:fix/spawn-blocking-handler

Conversation

@lilongen
Copy link
Copy Markdown

@lilongen lilongen commented Apr 1, 2026

Summary

  • ShimHandler::stop() uses a std::thread::sleep(50ms) polling loop (up to 2 seconds).
    Called directly from async BoxImpl::stop(), this blocks a Tokio worker thread.
    Similarly, handler.metrics() performs sysinfo syscalls (~5ms), also a blocking operation.
  • Wrap LiveState.handler from std::sync::Mutex<...> to Arc<std::sync::Mutex<...>>
    so it can be cloned and moved into spawn_blocking closures.
  • Both handler.stop() and handler.metrics() now run on Tokio's dedicated blocking
    thread pool via tokio::task::spawn_blocking, keeping worker threads free.
  • Lock poisoning uses asymmetric handling: stop() swallows it (shutdown must not be
    blocked by a poisoned lock), metrics() propagates it (monitoring should surface
    anomalies).

Files changed

File Change
boxlite/src/litebox/box_impl.rs Arc-wrap handler field; spawn_blocking for stop() and metrics()

Test plan

  • cargo test -p boxlite --no-default-features --lib — 593 passed (main branch)
  • cargo clippy -p boxlite --no-default-features --lib -- -D warnings — clean
  • cargo fmt -- --check — clean

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 1, 2026 02:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates boxlite’s BoxImpl lifecycle/observability paths to avoid blocking Tokio worker threads by running VM handler shutdown and metrics collection on Tokio’s blocking thread pool.

Changes:

  • Wrap LiveState.handler in Arc<std::sync::Mutex<...>> so it can be moved into spawn_blocking closures.
  • Move handler.metrics() into tokio::task::spawn_blocking to avoid blocking async workers during syscalls.
  • Move handler.stop() into tokio::task::spawn_blocking to avoid blocking async workers during stop’s polling/sleep loop.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/boxlite/src/litebox/box_impl.rs Outdated
h.metrics()
})
.await
.map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??;
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The join error message from spawn_blocking is too generic; when this triggers it will be hard to tell whether the failure happened during metrics collection vs some other blocking task. Consider including operation context (e.g., metrics spawn_blocking join) and/or the box_id in the error message to make debugging easier.

Suggested change
.map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??;
.map_err(|e| {
BoxliteError::Internal(format!(
"metrics spawn_blocking join (box_id={}): {}",
self.config.id, e
))
})??;

Copilot uses AI. Check for mistakes.
Comment thread src/boxlite/src/litebox/box_impl.rs Outdated
}
})
.await
.map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??;
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: the spawn_blocking join error message here doesn't indicate which operation failed. Adding context (e.g., stop spawn_blocking join) and/or the box_id would make operational debugging significantly easier.

Suggested change
.map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??;
.map_err(|e| {
BoxliteError::Internal(format!(
"stop spawn_blocking join (box_id={}): {}",
self.config.id, e
))
})??;

Copilot uses AI. Check for mistakes.
Comment thread src/boxlite/src/litebox/box_impl.rs Outdated
Comment on lines +365 to +368
if let Ok(mut h) = handler.lock() {
h.stop()
} else {
Ok(())
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the handler mutex is poisoned, this path currently returns Ok(()) without attempting to stop the handler. That can leave the VM/shim running and leak resources. If the intent is to not fail shutdown on poisoning, consider recovering the guard via PoisonError::into_inner() and still calling stop(), while continuing to swallow the poisoning error itself.

Suggested change
if let Ok(mut h) = handler.lock() {
h.stop()
} else {
Ok(())
match handler.lock() {
Ok(mut h) => h.stop(),
Err(poisoned) => {
tracing::warn!("Handler mutex poisoned during shutdown; recovering guard to stop handler");
let mut h = poisoned.into_inner();
h.stop()
}

Copilot uses AI. Check for mistakes.
@lilongen lilongen force-pushed the fix/spawn-blocking-handler branch from 8aaef1f to 8d043d1 Compare April 2, 2026 14:36
.map_err(|e| BoxliteError::Internal(format!("handler lock poisoned: {}", e)))?;
let raw = handler.metrics()?;
let handler = Arc::clone(&live.handler);
let raw = tokio::task::spawn_blocking(move || {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this raw not being used?

Copy link
Copy Markdown
Contributor

@uran0sH uran0sH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a mistake before; I think there's no problem moving them into spawn_blocking. Overall LGTM

lile and others added 2 commits April 21, 2026 16:08
…_blocking

ShimHandler::stop() uses a std::thread::sleep(50ms) polling loop (up to 2s).
Called directly from async BoxImpl::stop(), this blocks a Tokio worker thread.
Similarly, handler.metrics() performs sysinfo syscalls (~5ms blocking I/O).

Changes:
- Wrap LiveState.handler in Arc so it can be cloned into spawn_blocking closures
- handler.stop() and handler.metrics() now run on Tokio's blocking thread pool
- Lock poisoning uses asymmetric handling: stop() swallows (shutdown must not
  be blocked), metrics() propagates (monitoring should surface anomalies)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed mutex

- Add operation name and box_id to spawn_blocking join error messages
  for metrics() and stop() to aid debugging
- Recover poisoned handler mutex during shutdown via into_inner()
  instead of silently returning Ok(()), preventing VM/shim resource leaks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 08:14
@lilongen lilongen force-pushed the fix/spawn-blocking-handler branch from 8d043d1 to d740da1 Compare April 21, 2026 08:14
@lilongen
Copy link
Copy Markdown
Author

Review Fixes

Addressed 3 of 4 review comments:

Fixed

  1. Descriptive spawn_blocking join errors (Copilot #1 & [feature] Add user volume support with resolved paths and read-only option #2) — Added operation name (metrics/stop) and box_id to join error messages for both metrics() and stop() paths.

  2. Recover poisoned mutex during shutdown (Copilot [feature][python] Add support for additional configuration options #3) — Instead of silently returning Ok(()) when handler mutex is poisoned (which leaks the VM/shim), now recovers the guard via PoisonError::into_inner() and still calls handler.stop().

Not Fixed

  1. "Is this raw not being used?" (uran0sH) — raw is used on lines 309-310: raw.cpu_percent and raw.memory_bytes are passed to BoxMetrics::from_storage(). This was a false positive.

Verification: 637 tests pass, clippy clean, fmt clean.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +369 to +373
let box_id_for_join = self.config.id.clone();
tokio::task::spawn_blocking(move || match handler.lock() {
Ok(mut h) => h.stop(),
Err(poisoned) => {
tracing::warn!(
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tracing::warn! emitted when the handler mutex is poisoned doesn’t include the box_id, unlike surrounding logs in stop(). Consider adding box_id (e.g., by cloning it into the spawn_blocking closure) to make this warning easier to correlate during concurrent shutdowns.

Suggested change
let box_id_for_join = self.config.id.clone();
tokio::task::spawn_blocking(move || match handler.lock() {
Ok(mut h) => h.stop(),
Err(poisoned) => {
tracing::warn!(
let box_id_for_join = self.config.id.clone();
let box_id_for_warn = self.config.id.clone();
tokio::task::spawn_blocking(move || match handler.lock() {
Ok(mut h) => h.stop(),
Err(poisoned) => {
tracing::warn!(
box_id = %box_id_for_warn,

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants