Force-kill stalled E2E teardown (verify Windows E2E hang fix) #4017

Draft
mokagio wants to merge 4 commits into trunk from mokagio/verify-windows-e2e-teardown-killfix

@mokagio mokagio commented Jul 1, 2026

Contributor

Related issues

How AI was used in this PR

Claude Code (Opus 4.8) traced the Windows E2E hang through the Buildkite logs, located the teardown gap, and drafted the fix. The Playwright-Electron force-kill approach was grounded in upstream issues (playwright#39248, #20016, #27048); I reviewed the diff.

Proposed Changes

Draft / verification PR — not merge-ready. It re-enables the Windows E2E matrix entry (disabled on trunk under AINFRA-2588) purely so CI can exercise the fix; that commit must be reverted before merge.

The Windows E2E job was hanging for hours. From the logs, the Playwright suite actually finishes (e.g. build #18295: 33 passed, 12 failed in 37 min), then the job sits idle until Buildkite's 180-min SIGKILL. Root cause: E2ESession.closeApp() waits 30s for the app to exit and, on timeout, only logs — leaving an orphaned Electron process alive, which keeps the Playwright runner from exiting.

The fix force-kills the process tree on that timeout (taskkill /F /T on Windows, SIGKILL elsewhere), so a stalled teardown can no longer block the run. This is unrelated to the Windows agent AMI (ruled out — identical image on passing and hanging runs).
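For reviewers, a minimal sketch of the wait-then-force-kill behaviour described above. It is not the actual `E2ESession.closeApp()` code: `killCommandFor`, `forceKillTree`, and `waitForExitOrKill` are hypothetical names, and the injectable `kill` parameter exists only to make the sketch testable. The 30s default mirrors the wait mentioned above.

```typescript
import { spawnSync } from 'node:child_process';

// Hypothetical helper: returns the external command used to kill a process
// tree, or null when a plain SIGKILL is used instead.
function killCommandFor(pid: number, platform: string): { command: string; args: string[] } | null {
  if (platform === 'win32') {
    // /F forces termination, /T also terminates child processes.
    return { command: 'taskkill', args: ['/F', '/T', '/PID', String(pid)] };
  }
  return null;
}

function forceKillTree(pid: number, platform: string = process.platform): void {
  const cmd = killCommandFor(pid, platform);
  if (cmd) {
    spawnSync(cmd.command, cmd.args, { stdio: 'ignore' });
    return;
  }
  try {
    // Note: this signals only the given pid, not its descendants.
    process.kill(pid, 'SIGKILL');
  } catch {
    // The process has already exited; nothing to do.
  }
}

// Waits for `exited` to settle; if it does not within `timeoutMs`,
// force-kills the process instead of only logging.
async function waitForExitOrKill(
  exited: Promise<void>,
  pid: number,
  timeoutMs: number = 30_000,
  kill: (pid: number) => void = forceKillTree
): Promise<'exited' | 'killed'> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  const outcome = await Promise.race([exited.then(() => 'exited' as const), timedOut]);
  clearTimeout(timer);
  if (outcome === 'timeout') {
    kill(pid);
    return 'killed';
  }
  return 'exited';
}
```

Keeping the command construction separate from the spawn makes the Windows branch checkable on any platform.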

Note: the migrated E2E tests that fail on Windows (the sites.test.ts / customize-links cluster) are a separate bug; this PR only stops a stalled teardown from hanging the job.

Testing Instructions

  • CI on this branch runs Windows E2E (E2E Tests on windows-x64 in the Buildkite build). Confirm the job now reaches a terminal state under the 180-min timeout — that validates the fix, regardless of individual test pass/fail.
  • Compare against the pre-fix behaviour: Windows E2E jobs on recent builds hit timed_out at ~181 min.

⚠️ Visual change: none. Do not merge with the Windows E2E re-enable commit in place.

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors? (npm run typecheck passes; eslint clean)

mokagio and others added 2 commits July 1, 2026 10:01

`closeApp()` waited 30s for the app process to exit, then on timeout only
logged and moved on — leaving an orphaned, still-alive Electron process.
That process keeps the Playwright runner from exiting, so after the suite
prints its summary the job hangs until the CI timeout (the ~2h Windows E2E
stalls behind AINFRA-2588).

On timeout, walk and kill the process tree instead: `taskkill /F /T` on
Windows (renderer + php.exe descendants orphan otherwise), `SIGKILL`
elsewhere. Mirrors `killChild` in `tools/common/lib/cli-process.ts`.

---

Generated with the help of Claude Code, https://claude.com/claude-code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Windows E2E was disabled on trunk under AINFRA-2588. Re-enable the
windows-x64 matrix entry so CI exercises the closeApp() force-kill and we
can confirm the job now lands under the 180-min timeout. Revert before merge.

---

Generated with the help of Claude Code, https://claude.com/claude-code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@mokagio mokagio self-assigned this Jul 1, 2026

mokagio and others added 2 commits July 1, 2026 15:40

The Windows E2E job hangs after the Playwright suite finishes (~26 min) and
runs to the 180-min timeout — the runner never exits. closeApp() teardowns all
succeed, so the leak is a surviving child process/handle downstream of the app.

Add a Windows-only background watchdog that dumps the surviving
node/Studio/php process tree (pid, parent pid, command line) at 40m and 70m,
i.e. while the runner is hung, to pinpoint what holds it open. Probe only;
revert with the AINFRA-2588 fix.
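A rough sketch of what such a probe could look like, assuming it runs as a standalone Node script in the background and queries processes through PowerShell's `Get-CimInstance Win32_Process`. The actual watchdog may be implemented differently; `selectWatched`, `formatRow`, `listWindowsProcesses`, `startWatchdog`, and the name pattern are all hypothetical.

```typescript
import { execFileSync } from 'node:child_process';

interface ProcRow {
  ProcessId: number;
  ParentProcessId: number;
  Name: string;
  CommandLine: string | null;
}

// Assumed image names for the node/Studio/php processes of interest.
const WATCHED = /^(node|studio|php)(\.exe)?$/i;

function selectWatched(rows: ProcRow[]): ProcRow[] {
  return rows.filter((row) => WATCHED.test(row.Name));
}

function formatRow(row: ProcRow): string {
  const cmd = row.CommandLine ?? '(no command line)';
  return `pid=${row.ProcessId} ppid=${row.ParentProcessId} ${row.Name} :: ${cmd}`;
}

function listWindowsProcesses(): ProcRow[] {
  const script =
    'Get-CimInstance Win32_Process | ' +
    'Select-Object ProcessId,ParentProcessId,Name,CommandLine | ' +
    'ConvertTo-Json -Compress';
  const json = execFileSync('powershell.exe', ['-NoProfile', '-Command', script], {
    encoding: 'utf8',
    maxBuffer: 64 * 1024 * 1024,
  });
  const parsed = JSON.parse(json);
  // ConvertTo-Json emits a bare object when there is exactly one result.
  return Array.isArray(parsed) ? parsed : [parsed];
}

// Schedules one dump per entry in `minutes`; does nothing off Windows.
function startWatchdog(minutes: number[] = [40, 70]): void {
  if (process.platform !== 'win32') return;
  for (const m of minutes) {
    setTimeout(() => {
      console.log(`[watchdog] surviving processes at ${m}m`);
      for (const row of selectWatched(listWindowsProcesses())) {
        console.log(formatRow(row));
      }
    }, m * 60_000);
  }
}
```

Logging the parent pid alongside the command line is what lets the dump be read as a tree rather than a flat list.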

---

Generated with the help of Claude Code, https://claude.com/claude-code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The probe on build #18462 caught the Windows E2E hang: when the Electron app
quits, Studio's CLI process-manager daemon (process-manager-daemon.mjs) is
orphaned but survives along with its php-server-child.mjs and php.exe subtree,
and these survivors accumulate across test sessions. The Playwright runner
holds a handle to the daemon and never exits, so the job runs to the 180-min
timeout. closeApp() and `site stop --all` don't reap it.

Add a Playwright globalTeardown that, once the whole suite is done, kills any
surviving daemon and its process tree (`taskkill /F /T` on Windows, `pkill` on
unix) so the runner can exit.
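As an illustration only, a global teardown along these lines might look like the sketch below. It assumes the daemon can be identified by `process-manager-daemon.mjs` appearing in its command line; `parsePids`, `taskkillArgs`, and `findWindowsDaemonPids` are hypothetical helpers, and the real teardown may locate the daemon differently. Playwright runs the module's default export once after all tests when the config's `globalTeardown` option points at the file.

```typescript
import { execFileSync, spawnSync } from 'node:child_process';

// Assumption: the daemon is recognisable by this script name in its command line.
const DAEMON_MARKER = 'process-manager-daemon.mjs';

// Extracts numeric pids from line-oriented command output.
function parsePids(output: string): number[] {
  return output
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => /^\d+$/.test(line))
    .map(Number);
}

// taskkill accepts several /PID flags; /T also terminates child processes.
function taskkillArgs(pids: number[]): string[] {
  const args = ['/F', '/T'];
  for (const pid of pids) {
    args.push('/PID', String(pid));
  }
  return args;
}

function findWindowsDaemonPids(): number[] {
  // Exclude the querying PowerShell process, whose own command line
  // contains the marker string.
  const script =
    `Get-CimInstance Win32_Process | ` +
    `Where-Object { $_.ProcessId -ne $PID -and $_.CommandLine -like '*${DAEMON_MARKER}*' } | ` +
    `ForEach-Object { $_.ProcessId }`;
  const out = execFileSync('powershell.exe', ['-NoProfile', '-Command', script], {
    encoding: 'utf8',
  });
  return parsePids(out);
}

export default async function globalTeardown(): Promise<void> {
  if (process.platform === 'win32') {
    const pids = findWindowsDaemonPids();
    if (pids.length > 0) {
      spawnSync('taskkill', taskkillArgs(pids), { stdio: 'ignore' });
    }
    return;
  }
  // pkill -f matches against the full command line; it signals the matching
  // processes themselves, not their descendants.
  spawnSync('pkill', ['-f', DAEMON_MARKER], { stdio: 'ignore' });
}
```

Running this once at the end of the suite, rather than per session, matches the observation that the surviving daemons are only a problem once the runner tries to exit.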

---

Generated with the help of Claude Code, https://claude.com/claude-code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>