fix(ci): fail over across zones when an integration-test VM stockouts by gustavovalverde · Pull Request #10822 · ZcashFoundation/zebra

gustavovalverde · 2026-06-26T08:17:51Z

Motivation

Integration-test VMs are pinned to a single zone with no fallback. When that zone runs out of capacity, the instance-create step fails after about five minutes with reason: stockout and the whole test goes red, even when the rest of the region has capacity.

Solution

Try the primary zone first, then the region's other zones, with one short backoff sweep before giving up. Only capacity stockouts trigger failover; any other create error fails fast. The zone that wins is published as the instance_zone output, so the in-job SSH steps, the checkpoint upload, and the cached-state image capture all target it. The cleanup job discovers the instance's actual zone by name, so a VM that landed in a fallback zone is still deleted. Partial instances and disks are removed before advancing to the next zone, so the zone-scoped disk-reuse check cannot leave an orphan.
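The control flow described above can be sketched as follows. This is a minimal illustration, not the workflow's actual shell code: the `Stockout` exception, the `create` and `cleanup` callables, and the zone names are hypothetical stand-ins, while the two sweeps and the 45-second backoff follow the review summary on this PR.

```python
import time
from typing import Callable, Iterable, Optional

# Illustrative zone list for the region; the real list lives in the workflow.
ZONES = ["us-east1-b", "us-east1-c", "us-east1-d"]


class Stockout(Exception):
    """Hypothetical marker for a capacity stockout reported by the create call."""


def create_with_failover(
    primary: str,
    region_zones: Iterable[str],
    create: Callable[[str], None],
    cleanup: Callable[[str], None],
    sweeps: int = 2,
    backoff_seconds: float = 45.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the zone where creation succeeded, or None if every zone stocked out."""
    # Primary zone first, then the region's other zones in their listed order.
    order = [primary] + [z for z in region_zones if z != primary]
    for sweep in range(sweeps):
        if sweep > 0:
            sleep(backoff_seconds)  # one short pause before the retry sweep
        for zone in order:
            try:
                create(zone)
                return zone
            except Stockout:
                # Remove any partial instance or disk before trying the next zone.
                cleanup(zone)
            # Any other exception propagates immediately (fail fast).
    return None


# Usage with a mocked create step: the primary zone stocks out, the next one wins.
attempts = []
cleaned = []


def fake_create(zone: str) -> None:
    attempts.append(zone)
    if zone != "us-east1-c":
        raise Stockout(zone)


winner = create_with_failover(
    "us-east1-b", ZONES, fake_create, cleaned.append, sleep=lambda s: None
)
```

Injecting `create`, `cleanup`, and `sleep` keeps the loop testable without touching real infrastructure, which mirrors the mocked-gcloud approach mentioned under Tests.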

Tests

actionlint reports no new findings, and zizmor reports fewer findings than main, none of them new. The failover control flow was exercised against a mocked gcloud for the success, failover, fast-fail, and exhausted-capacity paths. End-to-end behavior is validated by this PR's integration-test run.

Follow-up Work

The region's zone list is now duplicated between this workflow and zfnd-deploy-nodes-gcp.yml. Promoting it to a shared repo variable would remove the duplication, but the two uses differ (sequential failover here, parallel matrix there), so it is deferred rather than forced now.

AI Disclosure

AI tools were used: Claude for diagnosis, implementation, and this description.

PR Checklist

The PR title follows conventional commits format.
The PR follows the contribution guidelines.
The solution is tested.
The documentation and changelogs are up to date.

Copilot

Pull request overview

This PR makes the GCP integration-test VM creation resilient to single-zone capacity stockouts. Previously the instance was pinned to vars.GCP_ZONE; when that zone ran out of capacity the create step failed after ~5 minutes and reddened the whole test even when the rest of the region had capacity. The change adds a sequential per-zone failover loop within the us-east1 region, publishes the winning zone as an instance_zone job output, and routes all downstream zone-scoped operations (SSH steps, checkpoint upload, cached-state image capture, and cleanup) to that resolved zone instead of the primary zone.

Changes:

Wrap instance/disk creation in a create_in_zone helper and iterate the region's zones over two sweeps (with a 45s backoff), failing over only on capacity errors and fast-failing otherwise, cleaning up partial instances/disks between attempts.
Expose the successful zone via instance_zone output and consume it in all in-job SSH steps plus the downstream checkpoint-upload and image-capture jobs.
Make the cleanup job discover the instance's actual zone by name so a VM that landed in a fallback zone is still deleted.
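One way the cleanup job's zone lookup could work is sketched below. It assumes the lookup uses `gcloud compute instances list` with a name filter; the PR text does not show the actual command, and the helper names here are hypothetical.

```python
import subprocess
from typing import List, Optional


def list_zone_command(instance_name: str) -> List[str]:
    # Ask for the zone of every instance with this exact name, across all zones.
    return [
        "gcloud", "compute", "instances", "list",
        f"--filter=name={instance_name}",
        "--format=value(zone)",
    ]


def parse_zone(output: str) -> Optional[str]:
    """Return the bare zone name from the first non-empty output line, if any."""
    for line in output.splitlines():
        line = line.strip()
        if line:
            # The zone may be printed as a bare name or as a resource URL.
            return line.rsplit("/", 1)[-1]
    return None


def discover_zone(instance_name: str) -> Optional[str]:
    # Requires an authenticated gcloud CLI; returns None when no instance matches.
    result = subprocess.run(
        list_zone_command(instance_name),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_zone(result.stdout)
```

Because the lookup is by name rather than by a configured zone, the delete call can target wherever the instance actually landed, and a `None` result means there is nothing left to delete.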

Integration-test VMs were pinned to a single zone with no fallback, so when that zone ran out of capacity the create step failed at ~5 minutes with `reason: stockout` and the whole test went red, even though the rest of the region had capacity. Try the primary zone first, then the other zones in the region, with one short backoff sweep before giving up. Only capacity stockouts trigger failover; any other create error fails fast. The winning zone is published as the `instance_zone` output so the in-job SSH steps, the checkpoint upload, and the cached-state image capture target it, and the cleanup job discovers the instance's actual zone by name.
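Publishing the winning zone relies on the standard GitHub Actions mechanism of appending `name=value` lines to the file named by `GITHUB_OUTPUT`. The sketch below shows that mechanism only; the helper name is hypothetical, the zone value is illustrative, and a temporary file stands in for the real output file so the example runs anywhere.

```python
import os
import tempfile


def publish_instance_zone(zone: str, output_path: str) -> None:
    # GitHub Actions reads step outputs as name=value lines appended to this file.
    with open(output_path, "a", encoding="utf-8") as fh:
        fh.write(f"instance_zone={zone}\n")


# In a workflow step, output_path would be os.environ["GITHUB_OUTPUT"].
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "github_output")
    publish_instance_zone("us-east1-c", path)
    with open(path, encoding="utf-8") as fh:
        written = fh.read()
```

A step output only becomes visible to other jobs once it is mapped under the job's `outputs` key, which is how the downstream checkpoint-upload and image-capture jobs would read it.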

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings June 26, 2026 08:17

Copilot started reviewing on behalf of gustavovalverde June 26, 2026 08:18

github-advanced-security AI found potential problems Jun 26, 2026

Copilot AI reviewed Jun 26, 2026

Comment thread .github/workflows/zfnd-deploy-integration-tests-gcp.yml Outdated

gustavovalverde force-pushed the fix/gcp-integration-zone-failover branch from c026ece to 153a1b4 Compare June 26, 2026 08:52

gustavovalverde added the run-stateful-tests label (Allows manually triggering a stateful tests run in GCP in PRs) Jun 26, 2026

Copilot AI review requested due to automatic review settings June 26, 2026 11:31

gustavovalverde force-pushed the fix/gcp-integration-zone-failover branch from 153a1b4 to 2981706 Compare June 26, 2026 11:31

gustavovalverde temporarily deployed to dev June 26, 2026 11:31 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 11:32 — with GitHub Actions Inactive

Copilot started reviewing on behalf of gustavovalverde June 26, 2026 11:32

Copilot AI reviewed Jun 26, 2026

gustavovalverde temporarily deployed to dev June 26, 2026 11:44 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 11:45 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:05 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:07 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:08 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:09 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:10 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:11 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:13 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:16 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:20 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 12:26 — with GitHub Actions Inactive

gustavovalverde temporarily deployed to dev June 26, 2026 13:00 — with GitHub Actions Inactive

gustavovalverde wants to merge 1 commit into main from fix/gcp-integration-zone-failover
