Skip to content

Unblock WaitForExit when a grandchild inherits redirected pipes#1223

Draft
LukeButters wants to merge 4 commits into
mainfrom
luke/silentprocessrunner-no-hang-on-grandchild-pipes
Draft

Unblock WaitForExit when a grandchild inherits redirected pipes#1223
LukeButters wants to merge 4 commits into
mainfrom
luke/silentprocessrunner-no-hang-on-grandchild-pipes

Conversation

@LukeButters
Copy link
Copy Markdown
Contributor

@LukeButters LukeButters commented May 14, 2026

Fixes: EFT-3354
SilentProcessRunner.ExecuteCommand could hang indefinitely on cancellation when the launched process spawned a grandchild that inherited the redirected stdout/stderr handles and outlived the immediate child (observed in a real dump: PowerShell -> AutoReport spawning a child and exiting). Process.Kill(true) walks descendants by ParentProcessId — once the intermediate parent dies, the grandchild is no longer reachable from the walk, so the pipes stay open and the parameterless WaitForExit() blocks forever waiting on the async readers to drain.

On cancellation, after the kill, close the Process's redirected streams so the in-flight ReadAsync calls throw and the AsyncStreamReader signals EOF, letting WaitForExit return. Gated on asyncReadsStarted so the early-cancel path (before BeginOutputReadLine) is unaffected.

Adds a Windows integration test that reproduces the hang via a PowerShell -> cmd -> ping chain where cmd exits after start /b returns, breaking the PPID chain and bypassing Kill(true)'s descendant walk. The test guards against a true infinite hang by wrapping the ExecuteCommand call in Task.Wait(30s).

Are you a customer of Octopus Deploy? Please contact our support team so we can triage your PR, so that we can make sure it's handled appropriately.

Background

Results

Fixes https://github.com/OctopusDeploy/OctopusTentacle/issues/... (optional public issue)

See How we use GitHub Issues (including this flowchart

Before

After

How to review this PR

Quality ✔️

Pre-requisites

  • I have read How we use GitHub Issues for help deciding when and where it's appropriate to make an issue.
  • I have considered informing or consulting the right people, according to the ownership map.
  • I have considered appropriate testing for my change.

SilentProcessRunner.ExecuteCommand could hang indefinitely on
cancellation when the launched process spawned a grandchild that
inherited the redirected stdout/stderr handles and outlived the
immediate child (observed in a real dump: PowerShell -> AutoReport
spawning a child and exiting). Process.Kill(true) walks descendants
by ParentProcessId — once the intermediate parent dies, the
grandchild is no longer reachable from the walk, so the pipes stay
open and the parameterless WaitForExit() blocks forever waiting on
the async readers to drain.

On cancellation, after the kill, close the Process's redirected
streams so the in-flight ReadAsync calls throw and the AsyncStreamReader
signals EOF, letting WaitForExit return. Gated on asyncReadsStarted
so the early-cancel path (before BeginOutputReadLine) is unaffected.

Adds a Windows integration test that reproduces the hang via a
PowerShell -> cmd -> ping chain where cmd exits after `start /b`
returns, breaking the PPID chain and bypassing Kill(true)'s descendant
walk. The test guards against a true infinite hang by wrapping the
ExecuteCommand call in Task.Wait(30s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

process.BeginOutputReadLine();
process.BeginErrorReadLine();
Volatile.Write(ref asyncReadsStarted, true);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AI says:

● asyncReadsStarted is written by one thread (the thread running ExecuteCommand) and read by another (whoever fires cancel.Cancel(), which invokes the Register callback synchronously on the cancelling thread). It's a textbook cross-thread bool flag.

  Without any synchronisation:

  - JIT can hoist the read. The compiler could cache the captured local in a register and not re-fetch it. Less likely in a one-shot lambda, but not guaranteed.
  - CPU can reorder. On weak memory models (ARM — the project's RIDs include linux-arm64/linux-arm), a plain read might not observe a prior write from another core without a memory fence.

  Volatile.Write does a release fence after writing true; Volatile.Read does an acquire fence before reading. That pairs to give you the happens-before guarantee you need: if the writer set it before the cancel fired, the callback is guaranteed to see true.

  Why not other options:

  - volatile keyword — only allowed on fields, not locals. asyncReadsStarted is a captured local, so it's lifted into a compiler-generated closure class field that you can't decorate.
  - Interlocked — doesn't support bool. You'd have to widen to int, which is uglier for no real benefit here.
  - lock — overkill for a one-way flag.

  Honest caveat: in this specific case it's belt-and-braces. CancellationTokenSource.Cancel() internally uses interlocked operations that act as a full fence, so if Thread A's write happens before Cancel(), other threads observing the cancelled state will also see the write
  transitively. So a plain bool would probably work in practice. But Volatile.Read/Write makes the cross-thread intent explicit, is cheap, and is the correct portable answer that doesn't rely on knowing CTS internals — worth the two lines.

// descendants by ParentProcessId and finds NOTHING (cmd's PID is gone, so
// ping no longer appears as a descendant). Without the fix that closes
// StandardOutput/StandardError after Kill, WaitForExit() hangs forever.
var psScript = @"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep AI generated.
The important thing is that the test failed before hand, which the AI tells me it did :D

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a hang in SilentProcessRunner.ExecuteCommand where, on cancellation, a grandchild process that inherited the redirected stdout/stderr pipes and survived Process.Kill(true) (because the PPID chain was broken when the intermediate parent already exited) would keep the pipe write-ends open. The parameterless WaitForExit() blocks waiting for the async readers to drain, so cancellation never completes. The fix closes the parent Process on the cancellation callback so EOF propagates to the async readers and WaitForExit returns.

Changes:

  • In the cancellation callback, after DoOurBestToCleanUp, call process.Close() to release the redirected pipe handles so async readers see EOF and WaitForExit() can return.
  • Gate the new process.Close() on an asyncReadsStarted flag (set with Volatile.Write after BeginOutputReadLine/BeginErrorReadLine) so the early-cancel path is unchanged.
  • Add a Windows-only integration test reproducing the hang via a PowerShell → cmd → ping chain (cmd exits after start /b, orphaning ping with the inherited pipes), wrapped in a 30s Task.Wait to fail rather than hang.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
source/Octopus.Tentacle.Core/Util/CommandLine/SilentProcessRunner.cs Closes the process on cancel (after async reads are wired up) to unblock WaitForExit when a grandchild holds the redirected pipes.
source/Octopus.Tentacle.Tests.Integration/Util/SilentProcessRunnerFixture.cs New Windows integration test reproducing the grandchild/redirected-pipe hang scenario, with PID-file based orchestration and cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +129 to +136
// Reproduces a dump scenario: a child process spawns a grandchild that
// inherits the redirected stdout/stderr handles AND is detached from the
// .NET-managed job (CREATE_BREAKAWAY_FROM_JOB). When cancellation fires:
// - Hitman.Kill terminates the immediate child.
// - The grandchild survives and keeps the write-end of the pipes open.
// - Process.WaitForExit() blocks forever waiting for the readers to EOF.
// The fix is to close StandardOutput/StandardError after Kill so the
// readers see EOF on our end and WaitForExit returns.
LukeButters and others added 2 commits May 14, 2026 16:48
[Retry(3)] was cargo-culted from a neighbouring cancellation test;
the new test's assertion is deterministic so retries only hide real
regressions. Bump the grandchild-spawn deadline from 30s to 60s to
cover slow CI boxes (AV scanning newly-launched cmd.exe/ping.exe,
Roslyn Add-Type compile, WMI lookup) — a setup timeout now means
something is actually broken. Use Task.Delay instead of Thread.Sleep
so the test does not block a thread-pool thread while waiting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The header comment was left over from an earlier iteration that
tried to detach the grandchild via CREATE_BREAKAWAY_FROM_JOB. The
script only passes CREATE_NO_WINDOW (0x08000000); what actually
makes Kill(true) miss the grandchild is the PPID chain breaking
when cmd exits after `start /b` returns. The comment lower down
already describes this correctly, so drop the stale header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@xwipeoutx
Copy link
Copy Markdown
Contributor

Ensure you make a public issue on this one

Replace the embedded Add-Type C# (STARTUPINFO, PROCESS_INFORMATION,
CreateProcessW, GetStdHandle, SetHandleInformation, CloseHandle plus
the StringBuilder commandline quote-juggling) with a plain
System.Diagnostics.Process.Start. .NET's Process.Start sets
bInheritHandles=true and passes parent-inherited handles via
GetStdHandle for any std stream not explicitly redirected -- as long
as at least one IS redirected. Redirecting stdin (which we don't use)
flips the switch, so cmd inherits PowerShell's stdout/stderr (which
came from SilentProcessRunner's pipes) for free.

Verified the simplified script still hangs the unfixed code at the
30s assertion deadline, and passes in ~1s with the fix in place. -84
LOC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@xwipeoutx xwipeoutx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we're solving the wrong problem... the control flow itself is kinda borked, right? There's chances for stale bools all over it.

Like, should DoOurBestToCleanUp call Close() after killing it? It'll cause the BeginOutputReadLine to throw but... who cares? We're dead anyway?

I think I'd need to dive in further to actually understand it

// active ReadAsync calls never see EOF and the call
// hangs forever. Only valid once BeginOutputReadLine
// has wired up the AsyncStreamReaders.
if (Volatile.Read(ref asyncReadsStarted))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[formatting] Can you improve the formatting here? Jamming a whole try/catch onto a single line to avoid a brace is ... a decision.

@@ -128,17 +128,30 @@ void WriteData(Action<string> action, ManualResetEventSlim resetEvent, DataRecei
process.Start();

var running = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this through Claude and I think running has similar problems... can running get hoisted and cause issues?

Honestly I'm struggling to follow along here, and this is still in draft so don't want to spend much time on it.

if (cancel.IsCancellationRequested)
DoOurBestToCleanUp(process, error);

process.BeginOutputReadLine();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still race conditions here - we can cancel while we're beginning still. Again dunno if this is a problem... I think this reduces the window, but I'm not convinced the volatile gets us much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants