ENGINE-1383: Handle Windows exit code 3010 and fix reboot after MCR install #624
- installer.go: remove dead-code branch that blocked installer download
  path; GetInstaller() now always attempts the download when no cached
  path exists.
- common.go: hoist rebootable interface to package scope so it is
  accessible from both windows.go and any future configurers.
- windows.go (InstallMCR):
  * Detect exit code 3010 (ERROR_SUCCESS_REBOOT_REQUIRED) via
    isExitCode3010() helper and treat it as a reboot-required signal
    instead of a hard failure.
  * Preserve fallback: if the installer exits 0 but prints
    'Your machine needs to be rebooted', still trigger a reboot.
  * Fix reboot success fall-through: after rh.Reboot() succeeds,
    return nil instead of falling through to the 'host isn't
    rebootable' error return.
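The InstallMCR changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: finishInstall, fakeHost, sampleExit3010 and the single-method shape of rebootable are assumptions made for the example, and the matched string 'non-zero exit code: 3010' is taken from the later review fix in this conversation.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// rebootable is assumed here to have a single Reboot method; the real
// package-scope interface may differ.
type rebootable interface {
	Reboot() error
}

// errRebootRequired is returned when a reboot is needed but the host cannot do it.
var errRebootRequired = errors.New("reboot required but host isn't rebootable")

// sampleExit3010 is an illustrative error in the format this sketch assumes.
var sampleExit3010 = errors.New("command failed: non-zero exit code: 3010")

// isExitCode3010 reports whether err looks like a command failure with
// exit code 3010 (ERROR_SUCCESS_REBOOT_REQUIRED).
func isExitCode3010(err error) bool {
	return err != nil && strings.Contains(err.Error(), "non-zero exit code: 3010")
}

// finishInstall (hypothetical) decides what to do with the installer result.
func finishInstall(host any, output string, runErr error) error {
	needsReboot := isExitCode3010(runErr)
	if runErr != nil && !needsReboot {
		return fmt.Errorf("installer failed: %w", runErr)
	}
	// Fallback: the installer exited 0 but asked for a reboot in its output.
	if strings.Contains(output, "Your machine needs to be rebooted") {
		needsReboot = true
	}
	if !needsReboot {
		return nil
	}
	rh, ok := host.(rebootable)
	if !ok {
		return errRebootRequired
	}
	if err := rh.Reboot(); err != nil {
		return fmt.Errorf("reboot failed: %w", err)
	}
	// Return here so a successful reboot does not fall through to the
	// "not rebootable" error.
	return nil
}

// fakeHost is a stand-in host used only to exercise the sketch.
type fakeHost struct{ rebooted bool }

func (f *fakeHost) Reboot() error { f.rebooted = true; return nil }
```

Calling finishInstall(&fakeHost{}, "", sampleExit3010) returns nil after rebooting the fake host, while a value that does not implement rebootable gets errRebootRequired.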
rig's Windows Reboot implementation uses 'shutdown /r /t 5' which is silently ignored when Windows has a pending-reboot state (e.g. after an MCR install that exits 3010). Override Reboot() on WindowsConfigurer to use PowerShell's Restart-Computer -Force which bypasses pending-reboot locks and reliably triggers the restart. TODO: move this fix upstream into k0sproject/rig.
The unit tests did not fail; rather, the unit test teardown failed after the tests passed.
shutdown /r /t 0 returns immediately (exit 0) before Windows has begun tearing down its network stack. waitForHost starts polling right after Reboot() returns, so all 60 echo probes (3s apart) succeed before the host ever drops WinRM — the offline window is never observed. Adding a 15-second sleep gives Windows time to start its shutdown sequence so the subsequent waitForHost(false) poll loop actually catches the host going offline.
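The timing problem described above can be sketched as a settle delay followed by two polling phases. This is a rough illustration only: waitForState and rebootAndWait are hypothetical names, and the probe, delay, and attempt count are parameters rather than the 15-second sleep and 60 probes at 3-second intervals mentioned in the commit message.

```go
package main

import (
	"errors"
	"time"
)

// waitForState polls probe until it reports wantOnline, trying at most
// attempts times with interval between tries.
func waitForState(probe func() bool, wantOnline bool, attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		if probe() == wantOnline {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("host did not reach the expected state")
}

// rebootAndWait (hypothetical) triggers a reboot, pauses so the OS has time
// to begin shutting down, then waits for the host to go offline and return.
func rebootAndWait(reboot func() error, probe func() bool, settle, interval time.Duration, attempts int) error {
	if err := reboot(); err != nil {
		return err
	}
	// Without this pause the offline poll can start while the host is
	// still answering probes.
	time.Sleep(settle)
	if err := waitForState(probe, false, attempts, interval); err != nil {
		return err
	}
	return waitForState(probe, true, attempts, interval)
}
```

With the values from the commit message, a call would pass 15*time.Second as settle and 3*time.Second as interval.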
The /f flag forces running applications to close, which is necessary on Windows Server 2025 where processes can block a pending reboot. Without it, 'shutdown /r /t 0' completes but the host never actually reboots.
AWS EC2 WinRM sessions run under a filtered Administrator token that lacks SeShutdownPrivilege. 'shutdown /r /f /t 0' succeeds (exit 0) but is silently ignored because the token has insufficient privilege. Fix: create a one-shot scheduled task running as SYSTEM (which always holds SeShutdownPrivilege) and trigger it immediately. SYSTEM-context tasks bypass the WinRM token restriction and reliably trigger the reboot.
/sc once /st 00:00 causes schtasks to write a warning to stderr when the scheduled time is in the past. Rig treats any stderr output as an error, causing Reboot() to fail even though the task was created successfully. /sc onstart requires no start time and creates the task silently.
Each worker locked mutex at entry (line 106) and deferred unlock (line 107), then attempted a second mutex.Lock() on the error path (line 114). The second lock deadlocked the goroutine since it already held the mutex. workerpool.StopWait() then blocked forever waiting for the deadlocked worker to finish. Fix: remove the outer lock/defer and only lock when recording an error, using an early-return guard so only the first error is kept.
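The locking fix described above can be illustrated with a minimal sketch. It uses sync.WaitGroup from the standard library in place of the workerpool package, and pullAll and errSamplePull are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errSamplePull is an illustrative failure used when trying out the sketch.
var errSamplePull = errors.New("pull failed")

// pullAll runs pull for every image in parallel and returns the first error
// that was recorded, or nil if all pulls succeeded.
func pullAll(images []string, pull func(image string) error) error {
	var (
		mu       sync.Mutex
		firstErr error
		wg       sync.WaitGroup
	)
	for _, image := range images {
		wg.Add(1)
		go func(image string) {
			defer wg.Done()
			err := pull(image)
			if err == nil {
				return
			}
			// Lock only on the error path; there is no outer lock, so the
			// worker can never try to re-acquire a mutex it already holds.
			mu.Lock()
			defer mu.Unlock()
			if firstErr != nil {
				return // keep only the first recorded error
			}
			firstErr = fmt.Errorf("pull %s: %w", image, err)
		}(image)
	}
	wg.Wait()
	return firstErr
}
```

Because the mutex is held only while recording an error, the pulls themselves still run concurrently.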
The schtask is scheduled with /sc onstart, meaning it fires on every system startup. Without cleanup, the task triggers a second reboot when the machine comes back up after the MCR-install reboot, causing docker swarm join (and any subsequent operation) to fail because the host reboots again mid-flight. Delete the task immediately after rh.Reboot() returns (machine is back up and WinRM is reconnected) to prevent it from firing on subsequent startups.
The ONSTART schtask fires on every startup, causing repeated reboots. The post-reboot cleanup in InstallMCR is too late -- the task has already triggered a second reboot by the time the cleanup runs. Fix: use /t 5 (5-second countdown) so the task can be deleted immediately after it is triggered but before the OS actually executes the shutdown. This prevents the task from re-firing on subsequent startups. The post-reboot cleanup in InstallMCR is kept as a fallback in case the pre-delete fails (e.g. the WinRM session is dropped in the 5s window).
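The create, trigger, and delete sequence described in the commit messages above might look roughly like this. It is a sketch under assumptions: the task name launchpad-reboot, the exec callback, and the exact command strings are illustrative and not taken from the PR; only the schtasks and shutdown switches themselves are standard.

```go
package main

import "fmt"

// rebootTaskName is a hypothetical task name chosen for this sketch.
const rebootTaskName = "launchpad-reboot"

// createRebootTaskCmd builds a schtasks command that registers a SYSTEM task
// running a forced restart with a 5-second countdown.
func createRebootTaskCmd(name string) string {
	return fmt.Sprintf(`schtasks /create /tn %q /tr "shutdown /r /f /t 5" /sc onstart /ru SYSTEM /f`, name)
}

// runTaskCmd builds the command that triggers the task immediately.
func runTaskCmd(name string) string {
	return fmt.Sprintf(`schtasks /run /tn %q`, name)
}

// deleteTaskCmd builds the command that removes the task without prompting.
func deleteTaskCmd(name string) string {
	return fmt.Sprintf(`schtasks /delete /tn %q /f`, name)
}

// rebootViaTask (hypothetical) runs the three steps through exec, which
// stands in for whatever remote command runner the host provides.
func rebootViaTask(exec func(cmd string) error, name string) error {
	for _, cmd := range []string{createRebootTaskCmd(name), runTaskCmd(name)} {
		if err := exec(cmd); err != nil {
			return fmt.Errorf("reboot task step %q failed: %w", cmd, err)
		}
	}
	// Delete inside the countdown so the ONSTART trigger cannot fire on the
	// next boot. A failure is tolerated because a later cleanup can retry.
	_ = exec(deleteTaskCmd(name))
	return nil
}
```

The delete error is deliberately ignored, mirroring the fallback described above where the post-reboot cleanup removes the task if the session drops during the countdown.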
Pull request overview
Fixes multiple issues in the Windows Mirantis Container Runtime (MCR) install/reboot path (reboot-required exit code handling, reboot fall-through, and a more reliable reboot implementation), and removes dead code that prevented installer downloads.
Changes:
- Treat Windows installer exit code 3010 as “reboot required” and return nil after a successful reboot.
- Replace/override Windows reboot behavior with a scheduled-task-driven restart approach and task cleanup.
- Remove dead branch in GetInstaller that made the download path unreachable.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/docker/image.go | Adjusts parallel image pull error handling (but introduces a return-value timing issue; see comment). |
| pkg/configurer/windows.go | Handles exit code 3010, fixes reboot fall-through, and adds a Windows reboot implementation plus cleanup. |
| pkg/configurer/installer.go | Removes unreachable early-return to allow installer download; concurrency concern remains (see comment). |
| pkg/configurer/common.go | Hoists rebootable interface to package scope for reuse. |
Comments suppressed due to low confidence (1)
pkg/configurer/installer.go:32
downloadedInstallers is a package-level map accessed without synchronization. GetInstaller is called from Windows install/uninstall paths that run across hosts in parallel, so concurrent reads/writes can panic with "concurrent map read and map write". Protect this cache with a sync.Mutex/sync.RWMutex, use sync.Map, or remove the global cache and let callers manage caching.
func GetInstaller(source string) (string, error) {
path, ok := downloadedInstallers[source]
if ok {
return path, nil
}
path, getErr := downloadInstaller(source)
if getErr != nil {
return "", fmt.Errorf("%w, installer download failed; %s", ErrInstallerDownloadFailed, getErr.Error())
}
downloadedInstallers[source] = path
return path, nil
}
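One of the options the review comment suggests, guarding the cache with a sync.Mutex, could be sketched as below. The installerCache type, its constructor, and the injected download function are assumptions for illustration; the PR's actual GetInstaller keeps a package-level map.

```go
package main

import (
	"fmt"
	"sync"
)

// installerCache (hypothetical) wraps the path cache and the mutex that
// guards it, so callers cannot touch the map without holding the lock.
type installerCache struct {
	mu       sync.Mutex
	paths    map[string]string
	download func(source string) (string, error)
}

// newInstallerCache builds an empty cache around the given download function.
func newInstallerCache(download func(source string) (string, error)) *installerCache {
	return &installerCache{paths: make(map[string]string), download: download}
}

// Get returns the cached path for source, downloading it on first use.
func (c *installerCache) Get(source string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if path, ok := c.paths[source]; ok {
		return path, nil
	}
	path, err := c.download(source)
	if err != nil {
		return "", fmt.Errorf("installer download failed: %w", err)
	}
	c.paths[source] = path
	return path, nil
}
```

Holding the lock for the duration of the download serializes concurrent callers, which also prevents the same source from being downloaded twice. If downloads of different sources need to overlap, a per-source lock would be the next step.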
@james-nesbitt Overall, it looks good. Could you please address the Copilot review comments?
pkg/docker/image.go:
- Fix StopWait() race: call wp.StopWait() explicitly before reading lastError instead of deferring it. A deferred StopWait() evaluates the return expression before workers finish, potentially returning nil when a worker later records an error.
pkg/configurer/windows.go:
- isExitCode3010: tighten match string from '3010' to 'non-zero exit code: 3010' to avoid false positives on error messages that incidentally contain those digits.
- Reboot: fix comment '/t 0' -> '/t 5' to match the actual command.
- Reboot: update top-level comment to describe the actual schtask mechanism; remove stale reference to Restart-Computer -Force from an earlier iteration.
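The StopWait() ordering issue comes from how Go evaluates a return expression before running deferred calls. The sketch below reproduces the effect with a sync.WaitGroup standing in for the worker pool; startWorker, deferredWait, and explicitWait are hypothetical names used only for this illustration.

```go
package main

import (
	"errors"
	"sync"
)

// startWorker launches one worker that records an error once release is closed.
func startWorker(wg *sync.WaitGroup, release <-chan struct{}, lastError *error) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-release
		*lastError = errors.New("image pull failed")
	}()
}

// deferredWait waits for the worker in a deferred call. The return value is
// copied from lastError before the deferred function runs, so the error the
// worker records afterwards is lost.
func deferredWait() error {
	var lastError error
	var wg sync.WaitGroup
	release := make(chan struct{})
	startWorker(&wg, release, &lastError)
	defer func() {
		close(release)
		wg.Wait()
	}()
	return lastError
}

// explicitWait waits for the worker before reading lastError, so the
// recorded error is returned.
func explicitWait() error {
	var lastError error
	var wg sync.WaitGroup
	release := make(chan struct{})
	startWorker(&wg, release, &lastError)
	close(release)
	wg.Wait()
	return lastError
}
```

Here deferredWait returns nil even though the worker fails, while explicitWait returns the worker's error. The release channel only makes the ordering deterministic for the demonstration.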
b61c7a0 to 78d209d
All Copilot issues resolved.
I realized that I need to run one more test, this time using the current/existing install.ps1, to see if we have broken anything.
Regression test for the older install.ps1 passed without concern. The only Copilot suggestion that was not exactly addressed was: #624 (comment). This is ready for final review (@smerkviladze).
Summary
Fixes two bugs in the Windows MCR installation path and adds a reliable reboot mechanism.
Bug 1: Exit code 3010 treated as hard failure
When the MCR installer exits with code 3010 (ERROR_SUCCESS_REBOOT_REQUIRED), rig wraps the non-zero exit as ErrCommandFailed. InstallMCR was returning immediately on any non-zero exit, so the reboot-detection logic was unreachable.
Fix: Intercept the error in InstallMCR via isExitCode3010() and treat it as a reboot-required signal instead of a hard failure.
Bug 2: Successful reboot fell through to error return
After a successful rh.Reboot() call, execution fell through to the errRebootRequired return, making every successful reboot appear as a failure.
Fix: Return nil after a successful reboot.
Bug 3: rig's Windows Reboot uses shutdown /r /t 5
rig's Windows.Reboot() issues shutdown /r /t 5 which is silently ignored when Windows has a pending-reboot state (as is the case after a 3010 exit). The host never goes offline and the wait loop times out.
Fix: Override Reboot() on WindowsConfigurer to use Restart-Computer -Force via PowerShell, which bypasses pending-reboot locks.
A TODO comment marks this for upstreaming into k0sproject/rig.
Additional fix: GetInstaller dead code
A dead-code branch in GetInstaller made the installer download path unreachable when no cached path existed.
Fix: Remove the unreachable branch.
Files changed
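| File | Description |
|---|---|
| pkg/configurer/installer.go | Remove dead-code branch in GetInstaller |
| pkg/configurer/common.go | Hoist rebootable interface to package scope |
| pkg/configurer/windows.go | Override Reboot() with Restart-Computer -Force |
Testing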
End-to-end tested against fresh AWS infrastructure (1x Ubuntu 22.04 manager, 3x Windows workers: 2019/2022/2025) using the ENGINE1383 custom installer script.