fix: kill bash tool's full process tree on timeout by bpamiri · Pull Request #2482 · wheels-dev/wheels

bpamiri · 2026-05-07T18:57:02Z

Summary

The docs-validation agent deadlocked for 3+ hours on its first guide-mode run. Root cause: the run_bash tool's per-call timeout couldn't actually kill commands that spawn long-running grandchildren (the pnpm verify:docs chain → node → harness → wheels → lucli/Lucee).

When the timer fired, child.kill('SIGKILL') terminated the bash process directly, but its descendants were reparented to init and kept stdio open — child.on('close') never fired, the Promise never resolved, the agent's turn loop hung waiting for tool output, and the workflow ate the 240-min job timeout doing nothing.

Fix

Three changes to doBash() in tools/docs-validation/lib/tools.mjs:

detached: true on spawn. The bash subprocess becomes a process-group leader, so we can deliver signals to every descendant via process.kill(-child.pid, signal).
5-second grace timer after SIGKILL. If descendants still hold stdio open and prevent close from firing, the Promise resolves anyway with a marker in stderr so the agent can see what happened.
error event handler. Spawn failures now resolve cleanly instead of leaving the Promise pending.

Smoke test

// Plain long-running command
exec('run_bash', { command: 'sleep 60', timeout_seconds: 3 })
  // elapsed: 3012ms, timed_out: true ✓

// Daemonized grandchildren (the actual harness pattern)
exec('run_bash', { command: 'bash -c "(sleep 60 &) ; (sleep 60 &) ; sleep 60"', timeout_seconds: 3 })
  // elapsed: 3011ms, timed_out: true ✓
  // no orphan `sleep 60` processes after exit

What this unblocks

After merge, redispatch the start-here directory:

gh workflow run docs-validation.yml --ref develop \
  -f mode=guide -f directory=start-here -f dry_run=false

If the harness genuinely takes a long time on tutorial chapters (real possibility — the cumulative-step tutorial driver replays prior steps each run), the per-call timeouts will trip and the agent will see "harness timed out" and fall back to alternate annotation strategies. Worst case: the agent reports needs_human for tutorial pages, we tune the prompt or harness-invocation pattern in a follow-up.

Risk

Low. Pure timeout-handling change in tooling. API mode remains unaffected (it's already proven through 8 sections × 378 functions). The new code is more defensive than the old.

https://claude.ai/code/session_014puccJJixwdjRgMx7mPLmz

Generated by Claude Code

The agent's run_bash tool was deadlocking on commands that spawn long-running grandchildren (pnpm verify:docs -> node -> harness -> wheels -> lucli/Lucee). When the per-call timeout fired, child.kill('SIGKILL') terminated the bash process directly, but descendants were reparented to init and kept stdio open — child.on('close') never fired, the Promise never resolved, and the agent's turn loop hung waiting for tool output. A live workflow run on the start-here directory deadlocked at turn 1 of page 1 for 3+ hours before the user spotted it. This change: 1. Spawns the bash subprocess with detached:true so it becomes a process-group leader. We can then process.kill(-child.pid, signal) to deliver the signal to every descendant in the group. 2. Adds a 5-second backstop timer after the SIGKILL fires. If descendants still hold stdio open and prevent close from firing, the Promise resolves anyway with a marker in stderr so the agent can see what happened. 3. Adds an error event handler so spawn failures resolve cleanly. Smoke test: // plain long-running command exec('run_bash', { command: 'sleep 60', timeout_seconds: 3 }) elapsed: 3012ms, timed_out: true ✓ // daemonized grandchildren (the harness pattern) cmd = 'bash -c "(sleep 60 &) ; (sleep 60 &) ; sleep 60"' exec('run_bash', { command: cmd, timeout_seconds: 3 }) elapsed: 3011ms, timed_out: true ✓ no orphan sleep processes after exit https://claude.ai/code/session_014puccJJixwdjRgMx7mPLmz

claude · 2026-05-07T19:09:43Z

+    const timer = setTimeout(() => {
      killed = true;
-      child.kill('SIGKILL');
+      killTree('SIGKILL');
    }, ms);
+    // Backstop: if descendants still hold stdio open after SIGKILL, force-resolve
+    // 5s after the timeout fires so the agent loop never hangs forever on a
+    // single tool call.
+    const graceTimer = setTimeout(() => {
+      finish({
+        ok: false,
+        exit_code: null,


🟡 When the grace timer force-resolves doBash(), it never detaches from the live child: child.stdout/child.stderr keep their 'data' listeners and remain in flowing mode, and detached:true without a matching child.unref() keeps the parent's event loop ref-counted. In the exact scenario this timer is designed for (a descendant escaped the process group via setsid and is still holding stdio open), this leaves the per-tool-call deadlock fixed but re-introduces a per-run hang at shutdown — runApiMode/runGuideMode in orchestrate.mjs complete naturally without calling process.exit, so Node will wait on the open pipes indefinitely. Fix is one line in the grace-timer body: child.stdout?.destroy(); child.stderr?.destroy(); child.unref(); before finish(...).

Extended reasoning...

What the bug is

The grace timer added at lines 219-228 force-resolves the doBash Promise 5s after SIGKILL fires, but finish() only resolves the Promise and clears the two timers. It never disengages from the still-live child:

No child.unref(), despite the spawn using detached: true (line 208).

No child.stdout.destroy() / child.stderr.destroy().

The 'data' listeners (lines 241-242) stay attached and the streams remain in flowing mode.

Why this matters

Per Node's documentation, a spawned child keeps the parent's event loop ref-counted until 'close' fires; 'close' fires only when all stdio streams have closed. The graceTimer exists precisely because some descendant escaped the process group (e.g., via its own setsid call) and is still holding pipe write-ends open after the SIGKILL — i.e., the very scenario where 'close' will not fire naturally. Without destroy() on the read-end and unref() on the child, libuv's active-handle count never drops.

Why orchestrate.mjs makes this user-visible

orchestrate.mjs:66-67:

if (mode === 'api') await runApiMode(); else await runGuideMode();

Both runApiMode (ends at line 158 with console.log('Usage totals:'...)) and runGuideMode (ends at line 241 the same way) fall off the end without calling process.exit() on the success path. Top-level execution then attempts to drain the event loop. A leaked, ref-counted child with active piped streams keeps libuv's loop alive, so node will hang at end-of-run instead of exiting — re-introducing the very deadlock this PR was designed to prevent, just shifted from per-tool-call to per-process granularity.

A secondary concern: the 'data' handlers are closures over the local stdout/stderr strings. After finish() returns, the orphan can keep writing and those strings keep growing unbounded — cap() is applied at result-build time only, so it does not bound the in-memory growth post-resolution.

Step-by-step proof

Agent invokes run_bash with timeout_seconds: 60 on a command whose tree contains a descendant that calls setsid (or is otherwise orphaned out of the bash pgrp).

At t=60s, timer fires → killTree('SIGKILL') kills bash and any descendants still in the pgrp. The escaped descendant survives and continues to hold the pipe write-ends.

'close' does NOT fire on the child, because not all stdio handles are closed.

At t=65s, graceTimer fires → finish({ ok:false, timed_out:true, ... }) resolves the Promise with snapshots of stdout/stderr and clears both timers.

The agent loop continues and eventually completes its run; runApiMode/runGuideMode returns normally.

Top-level await resolves; the script falls off the end. Node tries to exit. It cannot: libuv still sees the live child handle (no unref) and the two open pipe handles in flowing mode (active 'data' listeners, no destroy).

The orphan keeps writing → the closure-held stdout/stderr strings grow unboundedly until the orphan finally exits or something kills the process.

Why existing code does not prevent this

The 'close' handler self-cleans, but its precondition (all stdio closed) is exactly what fails in the grace-timer scenario. The SIGKILL path doesn't help here either — by hypothesis the descendant ignored or escaped the pgrp signal.

Fix

In the graceTimer callback, before/around finish(...):

child.stdout?.destroy(); child.stderr?.destroy(); child.unref();

This forcibly closes the parent's read-ends (so libuv stops waiting on the pipes) and tells the event loop to stop ref-counting the orphaned child handle, allowing Node to exit cleanly at end-of-run. Same treatment in the SIGKILL timer body is reasonable belt-and-braces, though strictly only the grace path requires it.

bpamiri force-pushed the claude/fix-bash-tool-process-tree-kill branch from 70faad0 to f805d19 Compare May 7, 2026 18:58

bpamiri merged commit 3c7befb into develop May 7, 2026
3 checks passed

bpamiri deleted the claude/fix-bash-tool-process-tree-kill branch May 7, 2026 19:03

claude Bot reviewed May 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: kill bash tool's full process tree on timeout#2482

fix: kill bash tool's full process tree on timeout#2482
bpamiri merged 1 commit intodevelopfrom
claude/fix-bash-tool-process-tree-kill

bpamiri commented May 7, 2026

Uh oh!

Uh oh!

claude Bot May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

bpamiri commented May 7, 2026

Summary

Fix

Smoke test

What this unblocks

Risk

Uh oh!

Uh oh!

claude Bot May 7, 2026

Choose a reason for hiding this comment

What the bug is

Why this matters

Why orchestrate.mjs makes this user-visible

Step-by-step proof

Why existing code does not prevent this

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants