Skip to content

fix(ci): repair GCP resource cleanup so it deletes old disks#10767

Open
gustavovalverde wants to merge 3 commits into
mainfrom
fix/gcp-disk-cleanup
Open

fix(ci): repair GCP resource cleanup so it deletes old disks#10767
gustavovalverde wants to merge 3 commits into
mainfrom
fix/gcp-disk-cleanup

Conversation

@gustavovalverde

@gustavovalverde gustavovalverde commented Jun 22, 2026

Copy link
Copy Markdown
Member

Motivation

The daily "Delete GCP resources" job deleted nothing: it passed each disk to gcloud as a single quoted "<name> --zone=<zone>" argument, which gcloud rejected as underspecified. Orphaned per-commit test disks then piled up until the dev project's regional SSD quota filled and integration tests failed at instance creation with Quota 'SSD_TOTAL_GB' exceeded.

Solution

Pass the disk name and its --zone/--region flag as separate arguments, add --quiet, and delete only unattached disks. The instance, template, and cache-image cleanups get the same --quiet fix. Scope the cleanup to the dev project.

AI Disclosure

  • AI tools were used: Claude Code.

PR Checklist

  • The PR title follows conventional commits format.
  • The PR follows the contribution guidelines.
  • This change was discussed in an issue or with the team beforehand.
  • The solution is tested.
  • The documentation and changelogs are up to date.

Comment thread .github/workflows/zfnd-deploy-integration-tests-gcp.yml Fixed
Comment thread .github/workflows/zfnd-deploy-integration-tests-gcp.yml Fixed
Comment thread .github/workflows/zfnd-deploy-integration-tests-gcp.yml Fixed
The daily cleanup deleted nothing: it passed each disk to gcloud as a single
quoted "<name> --zone=<zone>" argument, which gcloud rejected as underspecified.
The swallowed failure let orphaned per-commit test disks pile up until the
regional SSD_TOTAL_GB quota filled and integration tests failed at instance
creation.

Pass the name and --zone/--region as separate arguments, add --quiet, and
delete only unattached disks. Scope the cleanup to the dev project.
@gustavovalverde gustavovalverde changed the title fix(ci): repair GCP disk cleanup and close test-disk leak fix(ci): repair GCP resource cleanup so it deletes old disks Jun 22, 2026
Comment thread .github/workflows/scripts/gcp-delete-old-disks.sh Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the daily "Delete GCP resources" workflow, which was silently deleting nothing. The previous scripts used sed to fold a resource's name and its --zone/--region flag into a single string, then passed that string as one shell‑quoted argument (e.g. "<name> --zone=<zone>"); gcloud rejected it as underspecified. As a result, orphaned per‑commit test disks accumulated until the dev project's regional SSD quota filled and integration tests failed with Quota 'SSD_TOTAL_GB' exceeded. The fix passes the name and location flag as separate arguments, adds --quiet so non‑interactive runs don't abort at the confirmation prompt, restricts disk deletion to unattached disks (-users:*), and scopes the job to the dev environment.

Changes:

  • Rewrote disk/instance deletion loops to parse tab‑separated value(...) output with while IFS=$'\t' read and pass --zone/--region as separate args.
  • Added --quiet to disk, instance, template, and cache‑image deletes; limited disk cleanup to unattached disks via the -users:* filter.
  • Replaced the [dev, prod] cleanup matrix with a single environment: dev, narrowing cleanup to the dev project.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
.github/workflows/scripts/gcp-delete-old-disks.sh Refactors to a delete_disks helper; separates name/location args, adds --quiet, and filters to unattached disks.
.github/workflows/scripts/gcp-delete-old-instances.sh Replaces IFS-juggling loop with while read; passes --zone separately and adds --quiet.
.github/workflows/scripts/gcp-delete-old-templates.sh Adds --quiet to the template delete.
.github/workflows/scripts/gcp-delete-old-cache-images.sh Adds --quiet to the image delete.
.github/workflows/zfnd-delete-gcp-resources.yml Removes the [dev, prod] matrix in favor of environment: dev and drops now‑obsolete comments.

The shell refactors are correct: the here-string loops run in the current shell (so exit and error handling work), IFS=$'\t' is scoped to the read builtin, empty input is guarded with the [[ -z "${NAME}" ]] check, and the -users:* filter is valid gcloud syntax for unattached disks. The one item worth a maintainer's attention is that this also drops prod cleanup, which is a behavioral change bundled with the bugfix.

Comment thread .github/workflows/zfnd-delete-gcp-resources.yml
The "zebrad-" disk filter matched zebrad-cache-* chain-state disks, which the
-users:* guard only protects while attached. Exclude them so an unattached
cached-state disk is never deleted.
@gustavovalverde

Copy link
Copy Markdown
Member Author

@oxarbitrage ready

@gustavovalverde gustavovalverde added C-bug Category: This is a bug A-devops Area: Pipelines, CI/CD and Dockerfiles I-heavy Problems with excessive memory, disk, or CPU usage P-High 🔥 labels Jun 26, 2026
@mergify

mergify Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Queued — the merge queue status continues in this comment ↓.

@mergify mergify Bot added the queued label Jun 26, 2026
@mergify

mergify Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Merge Queue Status

This pull request spent 15 minutes 40 seconds in the queue, including 5 minutes 16 seconds running CI.

Waiting for
  • any of: [🛡 GitHub repository ruleset rule PR Requirements]
    • check-neutral = @github-actions/lint
    • check-skipped = @github-actions/lint
    • check-success = @github-actions/lint
  • any of: [🛡 GitHub repository ruleset rule PR Requirements]
    • check-neutral = test-crates
    • check-skipped = test-crates
    • check-success = test-crates
  • any of: [🛡 GitHub repository ruleset rule PR Requirements]
    • check-neutral = unit-tests
    • check-skipped = unit-tests
    • check-success = unit-tests
All conditions

Reason

The merge conditions cannot be satisfied due to failing checks

  • @github-actions/lint

Hint

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio queue comment.

Tick the box to put this pull request back in the merge queue (same as @mergifyio queue).

  • Requeue this pull request

mergify Bot added a commit that referenced this pull request Jun 26, 2026
@mergify mergify Bot added dequeued and removed queued labels Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-devops Area: Pipelines, CI/CD and Dockerfiles C-bug Category: This is a bug dequeued I-heavy Problems with excessive memory, disk, or CPU usage P-High 🔥

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants