Force-kill stalled E2E teardown (verify Windows E2E hang fix)#4017
Draft
mokagio wants to merge 4 commits into
Draft
Force-kill stalled E2E teardown (verify Windows E2E hang fix)#4017mokagio wants to merge 4 commits into
mokagio wants to merge 4 commits into
Conversation
`closeApp()` waited 30s for the app process to exit, then on timeout only logged and moved on — leaving an orphaned, still-alive Electron process. That process keeps the Playwright runner from exiting, so after the suite prints its summary the job hangs until the CI timeout (the ~2h Windows E2E stalls behind AINFRA-2588). On timeout, walk and kill the process tree instead: `taskkill /F /T` on Windows (renderer + php.exe descendants orphan otherwise), `SIGKILL` elsewhere. Mirrors `killChild` in `tools/common/lib/cli-process.ts`. --- Generated with the help of Claude Code, https://claude.com/claude-code Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Windows E2E was disabled on trunk under AINFRA-2588. Re-enable the windows-x64 matrix entry so CI exercises the closeApp() force-kill and we can confirm the job now lands under the 180-min timeout. Revert before merge. --- Generated with the help of Claude Code, https://claude.com/claude-code Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Windows E2E job hangs after the Playwright suite finishes (~26 min) and runs to the 180-min timeout — the runner never exits. closeApp() teardowns all succeed, so the leak is a surviving child process/handle downstream of the app. Add a Windows-only background watchdog that dumps the surviving node/Studio/php process tree (pid, parent pid, command line) at 40m and 70m, i.e. while the runner is hung, to pinpoint what holds it open. Probe only; revert with the AINFRA-2588 fix. --- Generated with the help of Claude Code, https://claude.com/claude-code Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The probe on build #18462 caught the Windows E2E hang: after the suite finishes, Studio's CLI process-manager daemon (process-manager-daemon.mjs) is orphaned when the Electron app quits but survives with its php-server-child.mjs and php.exe subtree, accumulated across every test session. The Playwright runner holds a handle to it and never exits, so the job runs to the 180-min timeout. closeApp() and `site stop --all` don't reap it. Add a Playwright globalTeardown that, once the whole suite is done, kills any surviving daemon and its process tree (`taskkill /F /T` on Windows, `pkill` on unix) so the runner can exit. --- Generated with the help of Claude Code, https://claude.com/claude-code Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Related issues
How AI was used in this PR
Claude Code (Opus 4.8) traced the Windows E2E hang through the Buildkite logs, located the teardown gap, and drafted the fix. The Playwright-Electron force-kill approach was grounded in upstream issues (playwright#39248, #20016, #27048); I reviewed the diff.
Proposed Changes
Draft / verification PR — not merge-ready. It re-enables the Windows E2E matrix entry (disabled on trunk under AINFRA-2588) purely so CI can exercise the fix; that commit must be reverted before merge.
The Windows E2E job was hanging for hours. From the logs, the Playwright suite actually finishes (e.g. build #18295:
33 passed, 12 failedin 37 min), then the job sits idle until Buildkite's 180-min SIGKILL. Root cause:E2ESession.closeApp()waits 30s for the app to exit and, on timeout, only logs — leaving an orphaned Electron process alive, which keeps the Playwright runner from exiting.The fix force-kills the process tree on that timeout (
taskkill /F /Ton Windows,SIGKILLelsewhere), so a stalled teardown can no longer block the run. This is unrelated to the Windows agent AMI (ruled out — identical image on passing and hanging runs).Note: the migrated E2E tests failing on Windows (the
sites.test.ts/ customize-links cluster) is a separate bug; this PR only stops a stalled teardown from hanging the job.Testing Instructions
E2E Tests on windows-x64in the Buildkite build). Confirm the job now reaches a terminal state under the 180-min timeout — that validates the fix, regardless of individual test pass/fail.timed_outat ~181 min.Pre-merge Checklist
npm run typecheckpasses;eslintclean)