fix: kill bash tool's full process tree on timeout#2482
Conversation
The agent's run_bash tool was deadlocking on commands that spawn
long-running grandchildren (pnpm verify:docs -> node -> harness ->
wheels -> lucli/Lucee). When the per-call timeout fired,
child.kill('SIGKILL') terminated the bash process directly, but
descendants were reparented to init and kept stdio open —
child.on('close') never fired, the Promise never resolved, and the
agent's turn loop hung waiting for tool output.
A live workflow run on the start-here directory deadlocked at turn 1
of page 1 for 3+ hours before the user spotted it.
This change:
1. Spawns the bash subprocess with detached:true so it becomes a
process-group leader. We can then process.kill(-child.pid, signal)
to deliver the signal to every descendant in the group.
2. Adds a 5-second backstop timer after the SIGKILL fires. If
descendants still hold stdio open and prevent close from firing,
the Promise resolves anyway with a marker in stderr so the agent
can see what happened.
3. Adds an error event handler so spawn failures resolve cleanly.
Smoke test:
// plain long-running command
exec('run_bash', { command: 'sleep 60', timeout_seconds: 3 })
elapsed: 3012ms, timed_out: true ✓
// daemonized grandchildren (the harness pattern)
cmd = 'bash -c "(sleep 60 &) ; (sleep 60 &) ; sleep 60"'
exec('run_bash', { command: cmd, timeout_seconds: 3 })
elapsed: 3011ms, timed_out: true ✓
no orphan sleep processes after exit
https://claude.ai/code/session_014puccJJixwdjRgMx7mPLmz
70faad0 to
f805d19
Compare
| const timer = setTimeout(() => { | ||
| killed = true; | ||
| child.kill('SIGKILL'); | ||
| killTree('SIGKILL'); | ||
| }, ms); | ||
| // Backstop: if descendants still hold stdio open after SIGKILL, force-resolve | ||
| // 5s after the timeout fires so the agent loop never hangs forever on a | ||
| // single tool call. | ||
| const graceTimer = setTimeout(() => { | ||
| finish({ | ||
| ok: false, | ||
| exit_code: null, |
There was a problem hiding this comment.
🟡 When the grace timer force-resolves doBash(), it never detaches from the live child: child.stdout/child.stderr keep their 'data' listeners and remain in flowing mode, and detached:true without a matching child.unref() keeps the parent's event loop ref-counted. In the exact scenario this timer is designed for (a descendant escaped the process group via setsid and is still holding stdio open), this leaves the per-tool-call deadlock fixed but re-introduces a per-run hang at shutdown — runApiMode/runGuideMode in orchestrate.mjs complete naturally without calling process.exit, so Node will wait on the open pipes indefinitely. Fix is one line in the grace-timer body: child.stdout?.destroy(); child.stderr?.destroy(); child.unref(); before finish(...).
Extended reasoning...
What the bug is
The grace timer added at lines 219-228 force-resolves the doBash Promise 5s after SIGKILL fires, but finish() only resolves the Promise and clears the two timers. It never disengages from the still-live child:
- No
child.unref(), despite the spawn usingdetached: true(line 208). - No
child.stdout.destroy()/child.stderr.destroy(). - The
'data'listeners (lines 241-242) stay attached and the streams remain in flowing mode.
Why this matters
Per Node's documentation, a spawned child keeps the parent's event loop ref-counted until 'close' fires; 'close' fires only when all stdio streams have closed. The graceTimer exists precisely because some descendant escaped the process group (e.g., via its own setsid call) and is still holding pipe write-ends open after the SIGKILL — i.e., the very scenario where 'close' will not fire naturally. Without destroy() on the read-end and unref() on the child, libuv's active-handle count never drops.
Why orchestrate.mjs makes this user-visible
orchestrate.mjs:66-67:
if (mode === 'api') await runApiMode();
else await runGuideMode();Both runApiMode (ends at line 158 with console.log('Usage totals:'...)) and runGuideMode (ends at line 241 the same way) fall off the end without calling process.exit() on the success path. Top-level execution then attempts to drain the event loop. A leaked, ref-counted child with active piped streams keeps libuv's loop alive, so node will hang at end-of-run instead of exiting — re-introducing the very deadlock this PR was designed to prevent, just shifted from per-tool-call to per-process granularity.
A secondary concern: the 'data' handlers are closures over the local stdout/stderr strings. After finish() returns, the orphan can keep writing and those strings keep growing unbounded — cap() is applied at result-build time only, so it does not bound the in-memory growth post-resolution.
Step-by-step proof
- Agent invokes
run_bashwithtimeout_seconds: 60on a command whose tree contains a descendant that callssetsid(or is otherwise orphaned out of the bash pgrp). - At t=60s,
timerfires →killTree('SIGKILL')kills bash and any descendants still in the pgrp. The escaped descendant survives and continues to hold the pipe write-ends. 'close'does NOT fire on the child, because not all stdio handles are closed.- At t=65s,
graceTimerfires →finish({ ok:false, timed_out:true, ... })resolves the Promise with snapshots ofstdout/stderrand clears both timers. - The agent loop continues and eventually completes its run;
runApiMode/runGuideModereturns normally. - Top-level
awaitresolves; the script falls off the end. Node tries to exit. It cannot: libuv still sees the live child handle (nounref) and the two open pipe handles in flowing mode (active'data'listeners, nodestroy). - The orphan keeps writing → the closure-held
stdout/stderrstrings grow unboundedly until the orphan finally exits or something kills the process.
Why existing code does not prevent this
The 'close' handler self-cleans, but its precondition (all stdio closed) is exactly what fails in the grace-timer scenario. The SIGKILL path doesn't help here either — by hypothesis the descendant ignored or escaped the pgrp signal.
Fix
In the graceTimer callback, before/around finish(...):
child.stdout?.destroy();
child.stderr?.destroy();
child.unref();This forcibly closes the parent's read-ends (so libuv stops waiting on the pipes) and tells the event loop to stop ref-counting the orphaned child handle, allowing Node to exit cleanly at end-of-run. Same treatment in the SIGKILL timer body is reasonable belt-and-braces, though strictly only the grace path requires it.
Summary
The docs-validation agent deadlocked for 3+ hours on its first guide-mode run. Root cause: the
run_bashtool's per-call timeout couldn't actually kill commands that spawn long-running grandchildren (thepnpm verify:docschain → node → harness →wheels→ lucli/Lucee).When the timer fired,
child.kill('SIGKILL')terminated the bash process directly, but its descendants were reparented to init and kept stdio open —child.on('close')never fired, the Promise never resolved, the agent's turn loop hung waiting for tool output, and the workflow ate the 240-min job timeout doing nothing.Fix
Three changes to
doBash()intools/docs-validation/lib/tools.mjs:detached: trueon spawn. The bash subprocess becomes a process-group leader, so we can deliver signals to every descendant viaprocess.kill(-child.pid, signal).closefrom firing, the Promise resolves anyway with a marker in stderr so the agent can see what happened.errorevent handler. Spawn failures now resolve cleanly instead of leaving the Promise pending.Smoke test
What this unblocks
After merge, redispatch the
start-heredirectory:If the harness genuinely takes a long time on tutorial chapters (real possibility — the cumulative-step tutorial driver replays prior steps each run), the per-call timeouts will trip and the agent will see "harness timed out" and fall back to alternate annotation strategies. Worst case: the agent reports
needs_humanfor tutorial pages, we tune the prompt or harness-invocation pattern in a follow-up.Risk
Low. Pure timeout-handling change in tooling. API mode remains unaffected (it's already proven through 8 sections × 378 functions). The new code is more defensive than the old.
https://claude.ai/code/session_014puccJJixwdjRgMx7mPLmz
Generated by Claude Code