43 changes: 37 additions & 6 deletions tools/docs-validation/lib/tools.mjs
@@ -200,26 +200,57 @@ async function doEdit(path, oldStr, newStr) {
 function doBash(command, timeoutSeconds = 60) {
   return new Promise((resolveP) => {
     const ms = Math.min(Math.max((timeoutSeconds | 0) || 60, 1), 180) * 1000;
-    const child = spawn('bash', ['-lc', command], { cwd: REPO_ROOT });
+    // detached:true puts the child into its own process group so we can kill
+    // the whole tree (bash -> pnpm -> node -> harness -> wheels -> lucli ...)
+    // by signalling the negative pid. Without this, SIGKILL on bash leaves
+    // descendants orphaned to init and they keep stdio open, so close never
+    // fires and the agent loop deadlocks waiting for tool output.
+    const child = spawn('bash', ['-lc', command], { cwd: REPO_ROOT, detached: true });
     let stdout = '';
     let stderr = '';
     let killed = false;
-    const t = setTimeout(() => {
+    let resolved = false;
+    const finish = (result) => {
+      if (resolved) return;
+      resolved = true;
+      clearTimeout(timer);
+      clearTimeout(graceTimer);
+      resolveP(result);
+    };
+    const cap = (s) => (s.length > 32_000 ? s.slice(0, 32_000) + `\n[...truncated ${s.length - 32_000} bytes]` : s);
+    const killTree = (signal) => {
+      try { process.kill(-child.pid, signal); } catch {}
+      try { child.kill(signal); } catch {}
+    };
+    const timer = setTimeout(() => {
       killed = true;
-      child.kill('SIGKILL');
+      killTree('SIGKILL');
     }, ms);
+    // Backstop: if descendants still hold stdio open after SIGKILL, force-resolve
+    // 5s after the timeout fires so the agent loop never hangs forever on a
+    // single tool call.
+    const graceTimer = setTimeout(() => {
+      finish({
+        ok: false,
+        exit_code: null,
Comment on lines +225 to +235

🟡 When the grace timer force-resolves doBash(), it never detaches from the live child: child.stdout/child.stderr keep their 'data' listeners and remain in flowing mode, and nothing calls child.unref() on the detached child, so the parent's event loop stays ref-counted. In the exact scenario this timer is designed for (a descendant escaped the process group via setsid and is still holding stdio open), the per-tool-call deadlock is fixed but a per-run hang at shutdown takes its place: runApiMode/runGuideMode in orchestrate.mjs complete naturally without calling process.exit, so Node will wait on the open pipes indefinitely. The fix is three short calls in the grace-timer body, placed before finish(...): child.stdout?.destroy(); child.stderr?.destroy(); child.unref();

Extended reasoning...

What the bug is

The grace timer added at lines 219-228 force-resolves the doBash Promise 5s after SIGKILL fires, but finish() only resolves the Promise and clears the two timers. It never disengages from the still-live child:

  • No child.unref(), despite the spawn using detached: true (line 208).
  • No child.stdout.destroy() / child.stderr.destroy().
  • The 'data' listeners (lines 241-242) stay attached and the streams remain in flowing mode.

Why this matters

Per Node's documentation, a spawned child keeps the parent's event loop ref-counted until 'close' fires; 'close' fires only when all stdio streams have closed. The graceTimer exists precisely because some descendant escaped the process group (e.g., via its own setsid call) and is still holding pipe write-ends open after the SIGKILL — i.e., the very scenario where 'close' will not fire naturally. Without destroy() on the read-end and unref() on the child, libuv's active-handle count never drops.

Why orchestrate.mjs makes this user-visible

orchestrate.mjs:66-67:

if (mode === 'api') await runApiMode();
else await runGuideMode();

Both runApiMode (ends at line 158 with console.log('Usage totals:'...)) and runGuideMode (ends at line 241 the same way) fall off the end without calling process.exit() on the success path. Top-level execution then attempts to drain the event loop. A leaked, ref-counted child with active piped streams keeps libuv's loop alive, so node will hang at end-of-run instead of exiting — re-introducing the very deadlock this PR was designed to prevent, just shifted from per-tool-call to per-process granularity.

A secondary concern: the 'data' handlers are closures over the local stdout/stderr strings. After finish() returns, the orphan can keep writing and those strings keep growing unbounded — cap() is applied at result-build time only, so it does not bound the in-memory growth post-resolution.
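
One possible way to bound that growth is to cap at append time rather than at result-build time. The sketch below is only an illustration: makeBoundedBuffer is a hypothetical helper that does not exist in tools.mjs, and the 32_000 limit and truncation marker simply mirror what cap() does today.

```javascript
// Hypothetical helper (not part of this PR): accumulate output, but stop
// storing text once `limit` characters have been kept. Like cap(), the
// count is in UTF-16 code units even though the marker says "bytes".
function makeBoundedBuffer(limit = 32_000) {
  let text = '';
  let dropped = 0;
  return {
    push(chunk) {
      const s = String(chunk);
      const room = Math.max(limit - text.length, 0);
      if (s.length <= room) {
        text += s;
      } else {
        // Keep what still fits and only count the rest.
        text += s.slice(0, room);
        dropped += s.length - room;
      }
    },
    toString() {
      return dropped > 0 ? `${text}\n[...truncated ${dropped} bytes]` : text;
    },
  };
}
```

With something like this, the 'data' handlers would call push(d) on a buffer instead of appending to a bare string, so a long-lived orphan could no longer grow the captured output past the limit.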

Step-by-step proof

  1. Agent invokes run_bash with timeout_seconds: 60 on a command whose tree contains a descendant that calls setsid (or is otherwise orphaned out of the bash pgrp).
  2. At t=60s, timer fires → killTree('SIGKILL') kills bash and any descendants still in the pgrp. The escaped descendant survives and continues to hold the pipe write-ends.
  3. 'close' does NOT fire on the child, because not all stdio handles are closed.
  4. At t=65s, graceTimer fires → finish({ ok:false, timed_out:true, ... }) resolves the Promise with snapshots of stdout/stderr and clears both timers.
  5. The agent loop continues and eventually completes its run; runApiMode/runGuideMode returns normally.
  6. Top-level await resolves; the script falls off the end. Node tries to exit. It cannot: libuv still sees the live child handle (no unref) and the two open pipe handles in flowing mode (active 'data' listeners, no destroy).
  7. The orphan keeps writing → the closure-held stdout/stderr strings grow unboundedly until the orphan finally exits or something kills the process.
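
The t=60s and t=65s figures in steps 2 and 4 follow from the clamp at the top of doBash plus the 5000 ms grace offset. A minimal sketch of that arithmetic, where bashDeadlines and its graceMs parameter are hypothetical names introduced only for illustration:

```javascript
// Hypothetical helper: reproduces the timeout clamp from doBash and adds
// the grace offset used by graceTimer (ms + 5000 in the diff).
function bashDeadlines(timeoutSeconds = 60, graceMs = 5000) {
  // Truncate to an integer, fall back to 60 when the result is 0,
  // then clamp to the range [1, 180] seconds.
  const seconds = Math.min(Math.max((timeoutSeconds | 0) || 60, 1), 180);
  const killAtMs = seconds * 1000;
  return { killAtMs, forceResolveAtMs: killAtMs + graceMs };
}

console.log(bashDeadlines(60));  // { killAtMs: 60000, forceResolveAtMs: 65000 }
console.log(bashDeadlines(500)); // { killAtMs: 180000, forceResolveAtMs: 185000 }
```

So even at the maximum timeout, a single tool call is force-resolved no later than 185 seconds after it starts.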

Why existing code does not prevent this

The 'close' handler self-cleans, but its precondition (all stdio closed) is exactly what fails in the grace-timer scenario. The SIGKILL path doesn't help here either: SIGKILL cannot be ignored, but by hypothesis the descendant has left the process group, so the group signal never reaches it.

Fix

In the graceTimer callback, before finish(...):

child.stdout?.destroy();
child.stderr?.destroy();
child.unref();

This forcibly closes the parent's read-ends (so libuv stops waiting on the pipes) and tells the event loop to stop ref-counting the orphaned child handle, allowing Node to exit cleanly at end-of-run. Same treatment in the SIGKILL timer body is reasonable belt-and-braces, though strictly only the grace path requires it.
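
As a rough sketch, that cleanup could be pulled into one place so both timer bodies can share it. releaseChild is a hypothetical name, not something in this PR. Removing the 'data' listeners goes slightly beyond the three calls above; it is an extra precaution so the captured strings stop growing.

```javascript
// Hypothetical helper: stop tracking a child whose descendants may still
// hold its stdio pipes open. Expects a ChildProcess-like object.
function releaseChild(child) {
  for (const stream of [child.stdout, child.stderr]) {
    if (!stream) continue;              // stdio is null when it was not piped
    stream.removeAllListeners('data');  // stop appending to captured output
    stream.destroy();                   // close the parent's read end of the pipe
  }
  child.unref();                        // child no longer keeps the event loop alive
}
```

In the grace-timer callback this would be called as releaseChild(child) immediately before finish(...).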

+        timed_out: true,
+        stdout: cap(stdout),
+        stderr: cap(stderr) + '\n[doBash: forced resolve after SIGKILL — descendants still holding stdio]',
+      });
+    }, ms + 5000);
     child.stdout.on('data', (d) => (stdout += d.toString()));
     child.stderr.on('data', (d) => (stderr += d.toString()));
     child.on('close', (code) => {
-      clearTimeout(t);
-      const cap = (s) => (s.length > 32_000 ? s.slice(0, 32_000) + `\n[...truncated ${s.length - 32_000} bytes]` : s);
-      resolveP({
+      finish({
         ok: !killed,
         exit_code: code,
         timed_out: killed,
         stdout: cap(stdout),
         stderr: cap(stderr),
       });
     });
+    child.on('error', (err) => {
+      finish({ ok: false, exit_code: null, timed_out: false, stdout: cap(stdout), stderr: cap(stderr) + `\n[spawn error: ${err.message}]` });
+    });
   });
 }