Skip to content

fix: kill bash tool's full process tree on timeout#2482

Merged
bpamiri merged 1 commit intodevelopfrom
claude/fix-bash-tool-process-tree-kill
May 7, 2026
Merged

fix: kill bash tool's full process tree on timeout#2482
bpamiri merged 1 commit intodevelopfrom
claude/fix-bash-tool-process-tree-kill

Conversation

@bpamiri
Copy link
Copy Markdown
Collaborator

@bpamiri bpamiri commented May 7, 2026

Summary

The docs-validation agent deadlocked for 3+ hours on its first guide-mode run. Root cause: the run_bash tool's per-call timeout couldn't actually kill commands that spawn long-running grandchildren (the pnpm verify:docs chain → node → harness → wheels → lucli/Lucee).

When the timer fired, child.kill('SIGKILL') terminated the bash process directly, but its descendants were reparented to init and kept stdio open — child.on('close') never fired, the Promise never resolved, the agent's turn loop hung waiting for tool output, and the workflow ate the 240-min job timeout doing nothing.

Fix

Three changes to doBash() in tools/docs-validation/lib/tools.mjs:

  1. detached: true on spawn. The bash subprocess becomes a process-group leader, so we can deliver signals to every descendant via process.kill(-child.pid, signal).
  2. 5-second grace timer after SIGKILL. If descendants still hold stdio open and prevent close from firing, the Promise resolves anyway with a marker in stderr so the agent can see what happened.
  3. error event handler. Spawn failures now resolve cleanly instead of leaving the Promise pending.

Smoke test

// Plain long-running command
exec('run_bash', { command: 'sleep 60', timeout_seconds: 3 })
  // elapsed: 3012ms, timed_out: true ✓

// Daemonized grandchildren (the actual harness pattern)
exec('run_bash', { command: 'bash -c "(sleep 60 &) ; (sleep 60 &) ; sleep 60"', timeout_seconds: 3 })
  // elapsed: 3011ms, timed_out: true ✓
  // no orphan `sleep 60` processes after exit

What this unblocks

After merge, redispatch the start-here directory:

gh workflow run docs-validation.yml --ref develop \
  -f mode=guide -f directory=start-here -f dry_run=false

If the harness genuinely takes a long time on tutorial chapters (real possibility — the cumulative-step tutorial driver replays prior steps each run), the per-call timeouts will trip and the agent will see "harness timed out" and fall back to alternate annotation strategies. Worst case: the agent reports needs_human for tutorial pages, we tune the prompt or harness-invocation pattern in a follow-up.

Risk

Low. Pure timeout-handling change in tooling. API mode remains unaffected (it's already proven through 8 sections × 378 functions). The new code is more defensive than the old.

https://claude.ai/code/session_014puccJJixwdjRgMx7mPLmz


Generated by Claude Code

The agent's run_bash tool was deadlocking on commands that spawn
long-running grandchildren (pnpm verify:docs -> node -> harness ->
wheels -> lucli/Lucee). When the per-call timeout fired,
child.kill('SIGKILL') terminated the bash process directly, but
descendants were reparented to init and kept stdio open —
child.on('close') never fired, the Promise never resolved, and the
agent's turn loop hung waiting for tool output.

A live workflow run on the start-here directory deadlocked at turn 1
of page 1 for 3+ hours before the user spotted it.

This change:

1. Spawns the bash subprocess with detached:true so it becomes a
   process-group leader. We can then process.kill(-child.pid, signal)
   to deliver the signal to every descendant in the group.

2. Adds a 5-second backstop timer after the SIGKILL fires. If
   descendants still hold stdio open and prevent close from firing,
   the Promise resolves anyway with a marker in stderr so the agent
   can see what happened.

3. Adds an error event handler so spawn failures resolve cleanly.

Smoke test:
  // plain long-running command
  exec('run_bash', { command: 'sleep 60', timeout_seconds: 3 })
    elapsed: 3012ms, timed_out: true ✓

  // daemonized grandchildren (the harness pattern)
  cmd = 'bash -c "(sleep 60 &) ; (sleep 60 &) ; sleep 60"'
  exec('run_bash', { command: cmd, timeout_seconds: 3 })
    elapsed: 3011ms, timed_out: true ✓
    no orphan sleep processes after exit

https://claude.ai/code/session_014puccJJixwdjRgMx7mPLmz
@bpamiri bpamiri force-pushed the claude/fix-bash-tool-process-tree-kill branch from 70faad0 to f805d19 Compare May 7, 2026 18:58
@bpamiri bpamiri merged commit 3c7befb into develop May 7, 2026
3 checks passed
@bpamiri bpamiri deleted the claude/fix-bash-tool-process-tree-kill branch May 7, 2026 19:03
Comment on lines +225 to +235
const timer = setTimeout(() => {
killed = true;
child.kill('SIGKILL');
killTree('SIGKILL');
}, ms);
// Backstop: if descendants still hold stdio open after SIGKILL, force-resolve
// 5s after the timeout fires so the agent loop never hangs forever on a
// single tool call.
const graceTimer = setTimeout(() => {
finish({
ok: false,
exit_code: null,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 When the grace timer force-resolves doBash(), it never detaches from the live child: child.stdout/child.stderr keep their 'data' listeners and remain in flowing mode, and detached:true without a matching child.unref() keeps the parent's event loop ref-counted. In the exact scenario this timer is designed for (a descendant escaped the process group via setsid and is still holding stdio open), this leaves the per-tool-call deadlock fixed but re-introduces a per-run hang at shutdown — runApiMode/runGuideMode in orchestrate.mjs complete naturally without calling process.exit, so Node will wait on the open pipes indefinitely. Fix is one line in the grace-timer body: child.stdout?.destroy(); child.stderr?.destroy(); child.unref(); before finish(...).

Extended reasoning...

What the bug is

The grace timer added at lines 219-228 force-resolves the doBash Promise 5s after SIGKILL fires, but finish() only resolves the Promise and clears the two timers. It never disengages from the still-live child:

  • No child.unref(), despite the spawn using detached: true (line 208).
  • No child.stdout.destroy() / child.stderr.destroy().
  • The 'data' listeners (lines 241-242) stay attached and the streams remain in flowing mode.

Why this matters

Per Node's documentation, a spawned child keeps the parent's event loop ref-counted until 'close' fires; 'close' fires only when all stdio streams have closed. The graceTimer exists precisely because some descendant escaped the process group (e.g., via its own setsid call) and is still holding pipe write-ends open after the SIGKILL — i.e., the very scenario where 'close' will not fire naturally. Without destroy() on the read-end and unref() on the child, libuv's active-handle count never drops.

Why orchestrate.mjs makes this user-visible

orchestrate.mjs:66-67:

if (mode === 'api') await runApiMode();
else await runGuideMode();

Both runApiMode (ends at line 158 with console.log('Usage totals:'...)) and runGuideMode (ends at line 241 the same way) fall off the end without calling process.exit() on the success path. Top-level execution then attempts to drain the event loop. A leaked, ref-counted child with active piped streams keeps libuv's loop alive, so node will hang at end-of-run instead of exiting — re-introducing the very deadlock this PR was designed to prevent, just shifted from per-tool-call to per-process granularity.

A secondary concern: the 'data' handlers are closures over the local stdout/stderr strings. After finish() returns, the orphan can keep writing and those strings keep growing unbounded — cap() is applied at result-build time only, so it does not bound the in-memory growth post-resolution.

Step-by-step proof

  1. Agent invokes run_bash with timeout_seconds: 60 on a command whose tree contains a descendant that calls setsid (or is otherwise orphaned out of the bash pgrp).
  2. At t=60s, timer fires → killTree('SIGKILL') kills bash and any descendants still in the pgrp. The escaped descendant survives and continues to hold the pipe write-ends.
  3. 'close' does NOT fire on the child, because not all stdio handles are closed.
  4. At t=65s, graceTimer fires → finish({ ok:false, timed_out:true, ... }) resolves the Promise with snapshots of stdout/stderr and clears both timers.
  5. The agent loop continues and eventually completes its run; runApiMode/runGuideMode returns normally.
  6. Top-level await resolves; the script falls off the end. Node tries to exit. It cannot: libuv still sees the live child handle (no unref) and the two open pipe handles in flowing mode (active 'data' listeners, no destroy).
  7. The orphan keeps writing → the closure-held stdout/stderr strings grow unboundedly until the orphan finally exits or something kills the process.

Why existing code does not prevent this

The 'close' handler self-cleans, but its precondition (all stdio closed) is exactly what fails in the grace-timer scenario. The SIGKILL path doesn't help here either — by hypothesis the descendant ignored or escaped the pgrp signal.

Fix

In the graceTimer callback, before/around finish(...):

child.stdout?.destroy();
child.stderr?.destroy();
child.unref();

This forcibly closes the parent's read-ends (so libuv stops waiting on the pipes) and tells the event loop to stop ref-counting the orphaned child handle, allowing Node to exit cleanly at end-of-run. Same treatment in the SIGKILL timer body is reasonable belt-and-braces, though strictly only the grace path requires it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants