fix(ci): fail over across zones when an integration-test VM stockouts #10822
Open
gustavovalverde wants to merge 1 commit into
Pull request overview
This PR makes the GCP integration-test VM creation resilient to single-zone capacity stockouts. Previously the instance was pinned to vars.GCP_ZONE; when that zone ran out of capacity the create step failed after ~5 minutes and reddened the whole test even when the rest of the region had capacity. The change adds a sequential per-zone failover loop within the us-east1 region, publishes the winning zone as an instance_zone job output, and routes all downstream zone-scoped operations (SSH steps, checkpoint upload, cached-state image capture, and cleanup) to that resolved zone instead of the primary zone.
Changes:
- Wrap instance/disk creation in a `create_in_zone` helper and iterate the region's zones over two sweeps (with a 45s backoff), failing over only on capacity errors and fast-failing otherwise, cleaning up partial instances/disks between attempts.
- Expose the successful zone via the `instance_zone` output and consume it in all in-job SSH steps plus the downstream checkpoint-upload and image-capture jobs.
- Make the cleanup job discover the instance's actual zone by name so a VM that landed in a fallback zone is still deleted.
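The control flow in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow's actual script: the zone list, the `FAKE_*` variables, the stubbed `create_in_zone` body, `cleanup_partial`, `is_capacity_error`, and the exit codes are all hypothetical. Only the helper name `create_in_zone`, the two sweeps, the 45s backoff, and the capacity-only failover rule come from the summary above.

```shell
# Assumed zone list for us-east1, primary zone first; the real list lives in the workflow.
ZONES="us-east1-b us-east1-c us-east1-d"
SWEEPS=2
BACKOFF_SECONDS="${BACKOFF_SECONDS:-45}"

# Stub standing in for the real instance/disk creation.
# FAKE_STOCKOUT_ZONES and FAKE_BROKEN_ZONE are hypothetical knobs for simulating failures.
create_in_zone() {
  case " ${FAKE_STOCKOUT_ZONES:-} " in
    *" $1 "*) echo "reason: stockout" >&2; return 1 ;;
  esac
  if [ "$1" = "${FAKE_BROKEN_ZONE:-}" ]; then
    echo "permission denied" >&2
    return 1
  fi
  return 0
}

# Stub standing in for deleting a partially created instance or disk in zone $1.
cleanup_partial() { :; }

# Treats output containing "stockout" as a capacity error. Matching on that word is an
# assumption based on the `reason: stockout` failure quoted in this PR.
is_capacity_error() {
  case "$1" in
    *stockout*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the winning zone on stdout. Returns 1 on a non-capacity error (fail fast)
# and 3 when every zone reported a stockout in every sweep.
create_with_failover() {
  local sweep=1 zone output
  while [ "${sweep}" -le "${SWEEPS}" ]; do
    for zone in ${ZONES}; do   # intentional word splitting over the zone list
      if output="$(create_in_zone "${zone}" 2>&1)"; then
        printf '%s\n' "${zone}"
        return 0
      fi
      if is_capacity_error "${output}"; then
        cleanup_partial "${zone}"   # remove leftovers, then try the next zone
      else
        echo "create failed in ${zone}: ${output}" >&2
        return 1
      fi
    done
    if [ "${sweep}" -lt "${SWEEPS}" ]; then
      sleep "${BACKOFF_SECONDS}"
    fi
    sweep=$((sweep + 1))
  done
  echo "all zones reported a capacity stockout" >&2
  return 3
}
```

With the stub as written, `FAKE_STOCKOUT_ZONES="us-east1-b" create_with_failover` skips the first zone and prints the second. Keeping the capacity check in its own function means the fail-fast rule can be tested without touching the loop.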
c026ece to 153a1b4
Integration-test VMs were pinned to a single zone with no fallback, so when that zone ran out of capacity the create step failed at ~5 minutes with `reason: stockout` and the whole test went red, even though the rest of the region had capacity. Try the primary zone first, then the other zones in the region, with one short backoff sweep before giving up. Only capacity stockouts trigger failover; any other create error fails fast. The winning zone is published as the `instance_zone` output so the in-job SSH steps, the checkpoint upload, and the cached-state image capture target it, and the cleanup job discovers the instance's actual zone by name.
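The last sentence of the commit message describes two mechanics: publishing the winning zone as a job output, and looking up the instance's zone by name at cleanup time. A minimal sketch of both follows, assuming hypothetical function names (`publish_instance_zone`, `find_instance_zone`, `delete_instance_by_name`) and assuming the instance name is unique within the project. The exact commands the workflow runs are not shown in this PR page, so treat the `gcloud` invocations as one plausible way to do it.

```shell
# Append the winning zone to the step-output file provided by the GitHub Actions runner.
publish_instance_zone() {
  printf 'instance_zone=%s\n' "$1" >> "${GITHUB_OUTPUT}"
}

# Print the short zone name of the instance called $1, or nothing if it does not exist.
# Listing without --zones searches every zone in the project.
find_instance_zone() {
  gcloud compute instances list \
    --filter="name=$1" \
    --format="value(zone.basename())"
}

# Delete the instance called $1 in whichever zone it actually landed.
delete_instance_by_name() {
  local name="$1" zone
  zone="$(find_instance_zone "${name}")"
  if [ -z "${zone}" ]; then
    echo "no instance named ${name}; nothing to delete" >&2
    return 0
  fi
  gcloud compute instances delete "${name}" --zone="${zone}" --quiet
}
```

Resolving the zone at cleanup time, rather than reading the `instance_zone` output, would also cover runs where the create job failed before it could publish an output.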
153a1b4 to 2981706
Motivation
Integration-test VMs are pinned to a single zone with no fallback. When that zone runs out of capacity, the instance-create step fails after about five minutes with `reason: stockout` and the whole test goes red, even when the rest of the region has capacity.
Solution
Try the primary zone first, then the region's other zones, with one short backoff sweep before giving up. Only capacity stockouts trigger failover; any other create error fails fast. The zone that wins is published as the `instance_zone` output, so the in-job SSH steps, the checkpoint upload, and the cached-state image capture all target it. The cleanup job discovers the instance's actual zone by name, so a VM that landed in a fallback zone is still deleted. Partial instances and disks are removed before advancing to the next zone, so the zone-scoped disk-reuse check cannot leave an orphan.
Tests
`actionlint` reports no new findings, and `zizmor` reports fewer findings than main, none of them new. The failover control flow was exercised against a mocked `gcloud` for the success, failover, fast-fail, and exhausted-capacity paths. End-to-end behavior is validated by this PR's integration-test run.
Follow-up Work
The region's zone list is now duplicated between this workflow and `zfnd-deploy-nodes-gcp.yml`. Promoting it to a shared repo variable would remove the duplication, but the two uses differ (sequential failover here, parallel matrix there), so it is deferred rather than forced now.
AI Disclosure
PR Checklist