Skip to content

ENGINE-1383 post rig upstream#625

Closed
james-nesbitt wants to merge 13 commits intomainfrom
ENGINE-1383-post-rig-upstream
Closed

ENGINE-1383 post rig upstream#625
james-nesbitt wants to merge 13 commits intomainfrom
ENGINE-1383-post-rig-upstream

Conversation

@james-nesbitt
Copy link
Copy Markdown
Collaborator

No description provided.

- installer.go: remove dead-code branch that blocked installer download
  path; GetInstaller() now always attempts the download when no cached
  path exists.

- common.go: hoist rebootable interface to package scope so it is
  accessible from both windows.go and any future configurers.

- windows.go (InstallMCR):
  * Detect exit code 3010 (ERROR_SUCCESS_REBOOT_REQUIRED) via
    isExitCode3010() helper and treat it as a reboot-required signal
    instead of a hard failure.
  * Preserve fallback: if the installer exits 0 but prints
    'Your machine needs to be rebooted', still trigger a reboot.
  * Fix reboot success fall-through: after rh.Reboot() succeeds,
    return nil instead of falling through to the 'host isn't
    rebootable' error return.
rig's Windows Reboot implementation uses 'shutdown /r /t 5' which is
silently ignored when Windows has a pending-reboot state (e.g. after an
MCR install that exits 3010). Override Reboot() on WindowsConfigurer to
use PowerShell's Restart-Computer -Force which bypasses pending-reboot
locks and reliably triggers the restart.

TODO: move this fix upstream into k0sproject/rig.
shutdown /r /t 0 returns immediately (exit 0) before Windows has begun
tearing down its network stack. waitForHost starts polling right after
Reboot() returns, so all 60 echo probes (3s apart) succeed before the
host ever drops WinRM — the offline window is never observed.

Adding a 15-second sleep gives Windows time to start its shutdown
sequence so the subsequent waitForHost(false) poll loop actually catches
the host going offline.
The /f flag forces running applications to close, which is necessary on
Windows Server 2025 where processes can block a pending reboot. Without
it, 'shutdown /r /t 0' completes but the host never actually reboots.
AWS EC2 WinRM sessions run under a filtered Administrator token that
lacks SeShutdownPrivilege. 'shutdown /r /f /t 0' succeeds (exit 0) but
is silently ignored because the token has insufficient privilege.

Fix: create a one-shot scheduled task running as SYSTEM (which always
holds SeShutdownPrivilege) and trigger it immediately. SYSTEM-context
tasks bypass the WinRM token restriction and reliably trigger the
reboot.
/sc once /st 00:00 causes schtasks to write a warning to stderr when
the scheduled time is in the past. Rig treats any stderr output as an
error, causing Reboot() to fail even though the task was created
successfully. /sc onstart requires no start time and creates the task
silently.
Each worker locked mutex at entry (line 106) and deferred unlock (line
107), then attempted a second mutex.Lock() on the error path (line 114).
The second lock deadlocked the goroutine since it already held the mutex.
workerpool.StopWait() then blocked forever waiting for the deadlocked
worker to finish.

Fix: remove the outer lock/defer and only lock when recording an error,
using an early-return guard so only the first error is kept.
The schtask is scheduled with /sc onstart, meaning it fires on every
system startup. Without cleanup, the task triggers a second reboot when
the machine comes back up after the MCR-install reboot, causing
docker swarm join (and any subsequent operation) to fail because the
host reboots again mid-flight.

Delete the task immediately after rh.Reboot() returns (machine is back
up and WinRM is reconnected) to prevent it from firing on subsequent
startups.
The ONSTART schtask fires on every startup, causing repeated reboots.
The post-reboot cleanup in InstallMCR is too late -- the task has already
triggered a second reboot by the time the cleanup runs.

Fix: use /t 5 (5-second countdown) so the task can be deleted immediately
after it is triggered but before the OS actually executes the shutdown.
This prevents the task from re-firing on subsequent startups.

The post-reboot cleanup in InstallMCR is kept as a fallback in case the
pre-delete fails (e.g. the WinRM session is dropped in the 5s window).
pkg/docker/image.go:
- Fix StopWait() race: call wp.StopWait() explicitly before reading
  lastError instead of deferring it. A deferred StopWait() evaluates
  the return expression before workers finish, potentially returning
  nil when a worker later records an error.

pkg/configurer/windows.go:
- isExitCode3010: tighten match string from '3010' to
  'non-zero exit code: 3010' to avoid false positives on error messages
  that incidentally contain those digits.
- Reboot: fix comment '/t 0' -> '/t 5' to match the actual command.
- Reboot: update top-level comment to describe the actual schtask
  mechanism; remove stale reference to Restart-Computer -Force from
  an earlier iteration.
The schtask-based Windows reboot has been moved upstream to
k0sproject/rig (os/windows: ENGINE-1383-winrm-fixes branch, commit
167bb2f). The task name changed from LaunchpadReboot to RigReboot;
update the post-reboot fallback cleanup accordingly.

The WindowsConfigurer.Reboot() override is no longer needed since rig
now implements the same logic directly.
@james-nesbitt james-nesbitt changed the title Engine 1383 post rig upstream ENGINE-1383 post rig upstream Apr 30, 2026
/sc once /st 00:00 causes schtasks to write a warning to stderr when
the scheduled time is in the past. Rig treats any stderr output as an
error, causing Reboot() to fail even though the task was created
successfully. /sc onstart requires no start time and creates the task
silently.
pkg/docker/image.go:
- Fix StopWait() race: call wp.StopWait() explicitly before reading
  lastError instead of deferring it. A deferred StopWait() evaluates
  the return expression before workers finish, potentially returning
  nil when a worker later records an error.

pkg/configurer/windows.go:
- isExitCode3010: tighten match string from '3010' to
  'non-zero exit code: 3010' to avoid false positives on error messages
  that incidentally contain those digits.
- Reboot: fix comment '/t 0' -> '/t 5' to match the actual command.
- Reboot: update top-level comment to describe the actual schtask
  mechanism; remove stale reference to Restart-Computer -Force from
  an earlier iteration.
@james-nesbitt
Copy link
Copy Markdown
Collaborator Author

This PR is a mess, and will need to be rebuilt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant