fix(ci): fail over across zones when an integration-test VM stockouts #10822

Open
gustavovalverde wants to merge 1 commit into main from fix/gcp-integration-zone-failover

Conversation

@gustavovalverde

Member

Motivation

Integration-test VMs are pinned to a single zone with no fallback. When that zone runs out of capacity, the instance-create step fails after about five minutes with reason: stockout and the whole test goes red, even when the rest of the region has capacity.

Solution

Try the primary zone first, then the region's other zones, with one short backoff sweep before giving up. Only capacity stockouts trigger failover; any other create error fails fast. The zone that wins is published as the instance_zone output, so the in-job SSH steps, the checkpoint upload, and the cached-state image capture all target it. The cleanup job discovers the instance's actual zone by name, so a VM that landed in a fallback zone is still deleted. Partial instances and disks are removed before advancing to the next zone, so the zone-scoped disk-reuse check cannot leave an orphan.
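
The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual script: `try_create` is a hypothetical stub standing in for the real `gcloud compute instances create` call, `create_with_failover` and `is_stockout` are illustrative names, the zone names are example us-east1 zones, and recognizing a capacity error by the text `stockout` or `ZONE_RESOURCE_POOL_EXHAUSTED` is an assumption.

```shell
# Hypothetical stub for `gcloud compute instances create ... --zone "$1"`.
# Illustrative data: only us-east1-c has capacity; us-east1-x fails for a
# non-capacity reason; every other zone reports a stockout.
try_create() {
  case "$1" in
    us-east1-c) return 0 ;;
    us-east1-x) echo "ERROR: permission denied" >&2; return 1 ;;
    *) echo "ERROR: reason: stockout" >&2; return 1 ;;
  esac
}

# Returns 0 when the error text looks like a capacity stockout (assumed markers).
is_stockout() {
  case "$1" in
    *stockout*|*ZONE_RESOURCE_POOL_EXHAUSTED*) return 0 ;;
    *) return 1 ;;
  esac
}

# create_with_failover SWEEPS BACKOFF_SECONDS ZONE...
# Prints the winning zone on stdout. Returns 1 on a non-capacity error
# (fail fast) and 2 when every zone stayed out of capacity.
create_with_failover() {
  local sweeps="$1" backoff="$2" sweep=1 zone err
  shift 2
  while [ "$sweep" -le "$sweeps" ]; do
    for zone in "$@"; do
      # Capture stderr only; discard the stub's stdout.
      if err="$(try_create "$zone" 2>&1 >/dev/null)"; then
        echo "$zone"
        return 0
      fi
      if ! is_stockout "$err"; then
        echo "$err" >&2
        return 1
      fi
      # A real script would remove any partial instance or disk here
      # before moving on to the next zone.
    done
    # Back off between sweeps, but not after the last one.
    if [ "$sweep" -lt "$sweeps" ]; then
      sleep "$backoff"
    fi
    sweep=$((sweep + 1))
  done
  return 2
}
```

In a workflow step, a call such as `zone="$(create_with_failover 2 45 us-east1-b us-east1-c us-east1-d)"` could be followed by `echo "instance_zone=${zone}" >> "$GITHUB_OUTPUT"` to publish the winning zone as a step output.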

Tests

actionlint reports no new findings, and zizmor reports fewer findings than on main, none of them new. The failover control flow was exercised against a mocked gcloud for the success, failover, fast-fail, and exhausted-capacity paths. End-to-end behavior is validated by this PR's integration-test run.
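
For readers who want to reproduce that kind of check locally, one possible shape for a mocked gcloud is a shell function that shadows the real CLI. The sketch below is an assumption about how such a mock could look, not the mock used for this PR; `MOCK_OK_ZONES` and `MOCK_ERROR` are hypothetical variable names.

```shell
# Hypothetical mock: a shell function named gcloud shadows the real CLI.
# MOCK_OK_ZONES is a space-separated list of zones that "have capacity";
# MOCK_ERROR overrides the failure text (defaults to a stockout message).
gcloud() {
  local zone="" prev="" arg
  # Find the value that follows --zone in the argument list.
  for arg in "$@"; do
    if [ "$prev" = "--zone" ]; then
      zone="$arg"
    fi
    prev="$arg"
  done
  if [ -n "$zone" ]; then
    case " ${MOCK_OK_ZONES:-} " in
      *" ${zone} "*) echo "created in ${zone}"; return 0 ;;
    esac
  fi
  echo "ERROR: ${MOCK_ERROR:-reason: stockout}" >&2
  return 1
}
```

Setting `MOCK_OK_ZONES` to the first zone, a later zone, or nothing would drive the success, failover, and exhausted-capacity paths, while setting `MOCK_ERROR` to a non-capacity message would drive the fast-fail path.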

Follow-up Work

The region's zone list is now duplicated between this workflow and zfnd-deploy-nodes-gcp.yml. Promoting it to a shared repo variable would remove the duplication, but the two uses differ (sequential failover here, parallel matrix there), so it is deferred rather than forced now.

AI Disclosure

  • AI tools were used: Claude for diagnosis, implementation, and this description.

PR Checklist

  • The PR title follows conventional commits format.
  • The PR follows the contribution guidelines.
  • The solution is tested.
  • The documentation and changelogs are up to date.

Copilot AI review requested due to automatic review settings June 26, 2026 08:17
10 comment threads on .github/workflows/zfnd-deploy-integration-tests-gcp.yml

Copilot AI left a comment

Contributor

Pull request overview

This PR makes the GCP integration-test VM creation resilient to single-zone capacity stockouts. Previously the instance was pinned to vars.GCP_ZONE; when that zone ran out of capacity the create step failed after ~5 minutes and reddened the whole test even when the rest of the region had capacity. The change adds a sequential per-zone failover loop within the us-east1 region, publishes the winning zone as an instance_zone job output, and routes all downstream zone-scoped operations (SSH steps, checkpoint upload, cached-state image capture, and cleanup) to that resolved zone instead of the primary zone.

Changes:

  • Wrap instance/disk creation in a create_in_zone helper and iterate the region's zones over two sweeps (with a 45s backoff), failing over only on capacity errors and fast-failing otherwise, cleaning up partial instances/disks between attempts.
  • Expose the successful zone via instance_zone output and consume it in all in-job SSH steps plus the downstream checkpoint-upload and image-capture jobs.
  • Make the cleanup job discover the instance's actual zone by name so a VM that landed in a fallback zone is still deleted.
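
The last bullet, zone discovery by name, might look something like the sketch below. The cleanup script itself is not shown in this conversation, so the function names (`find_instance_zone`, `delete_instance_anywhere`) and the exact filter and format expressions are assumptions; only the general approach of looking the instance up by name before deleting it comes from the description.

```shell
# find_instance_zone NAME
# Prints the short zone name of the instance called NAME, or nothing if no
# such instance exists. The filter/format expressions are assumed, not taken
# from the workflow.
find_instance_zone() {
  gcloud compute instances list \
    --filter="name=$1" \
    --format="value(zone.basename())" | head -n 1
}

# delete_instance_anywhere NAME
# Deletes the instance in whichever zone it actually landed in, so a VM
# created in a fallback zone is not left behind.
delete_instance_anywhere() {
  local name="$1" zone
  zone="$(find_instance_zone "$name")"
  if [ -z "$zone" ]; then
    echo "no instance named ${name}; nothing to delete"
    return 0
  fi
  gcloud compute instances delete "$name" --zone "$zone" --quiet
}
```

Looking the zone up at cleanup time, rather than passing it between jobs, means the cleanup still works when the create job failed before it could publish its output.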

Comment thread .github/workflows/zfnd-deploy-integration-tests-gcp.yml Outdated
@gustavovalverde force-pushed the fix/gcp-integration-zone-failover branch from c026ece to 153a1b4 on June 26, 2026 08:52
@gustavovalverde added the run-stateful-tests label (Allows manually triggering a stateful tests run in GCP in PRs) on Jun 26, 2026
Integration-test VMs were pinned to a single zone with no fallback, so
when that zone ran out of capacity the create step failed at ~5 minutes
with `reason: stockout` and the whole test went red, even though the
rest of the region had capacity.

Try the primary zone first, then the other zones in the region, with one
short backoff sweep before giving up. Only capacity stockouts trigger
failover; any other create error fails fast. The winning zone is
published as the `instance_zone` output so the in-job SSH steps, the
checkpoint upload, and the cached-state image capture target it, and the
cleanup job discovers the instance's actual zone by name.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
