bug(ci): image-tag race between concurrent workflows breaks Branch Kubernetes E2E #1343

@TaylorMutch

Agent Diagnostic

Investigation done while reviewing CI on PR #1316.

Skills used: watch-github-actions to track workflow runs, plus direct gh api and docker manifest inspect calls to inspect registry state.

Findings:

A single git SHA triggers three independent workflows that all call .github/workflows/docker-build.yml and push images to the same registry tag:

| Workflow | Builds | What it pushes |
|---|---|---|
| Branch Kubernetes E2E | amd64 only | single-arch amd64 image to :SHA (bare tag) |
| Branch E2E Checks | arm64 only | single-arch arm64 image to :SHA (bare tag) |
| GPU Test | amd64 + arm64 | per-arch to :SHA-amd64 / :SHA-arm64, then a manifest list to :SHA |

The relevant logic in .github/workflows/docker-build.yml:184:

IMAGE_TAG: ${{ needs.resolve.outputs.platform_count == '1'
  && needs.resolve.outputs.image_tag_base
  || format('{0}-{1}', needs.resolve.outputs.image_tag_base, matrix.arch) }}

This expression collapses single-arch builds onto the unsuffixed tag, while the merge step (if: platform_count != '1') runs only for multi-arch builds. Three writers target the same tag, and the last one wins.
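To make the collapse concrete, here is a minimal Python sketch that mirrors the expression's short-circuit `&&` / `||` logic. It is an illustration only, not part of the workflow: the function name `resolve_image_tag` and the SHA `abc1234` are placeholders, and the platform counts are taken from the workflow table in this report.

```python
def resolve_image_tag(platform_count: str, image_tag_base: str, arch: str) -> str:
    """Mirror `count == '1' && base || format('{0}-{1}', base, arch)`."""
    # Workflow outputs are strings, so the count is compared against "1".
    return (platform_count == "1" and image_tag_base) or f"{image_tag_base}-{arch}"


# "abc1234" is a placeholder SHA used only for this illustration.
for workflow, count, arch in [
    ("Branch Kubernetes E2E", "1", "amd64"),
    ("Branch E2E Checks", "1", "arm64"),
    ("GPU Test", "2", "amd64"),
    ("GPU Test", "2", "arm64"),
]:
    tag = resolve_image_tag(count, "abc1234", arch)
    print(f"{workflow:24} {arch:6} -> :{tag}")
```

Both single-arch callers resolve to the same bare tag, and that tag is also the one GPU Test's merge step writes its manifest list to.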

Failure manifestation:

In Branch Kubernetes E2E (run 25765424071 on commit d4fdfb1b27...):

ERROR: failed to load image: command "docker exec ... ctr ... images import
  --all-platforms --digests --snapshotter=overlayfs -" failed with error: exit status 1
Command Output: ctr: content digest sha256:a1537bbca22883a1f7c01c6d162252199017b2b01b19e5b12b6dde224de26b98: not found

docker manifest inspect ghcr.io/nvidia/openshell/supervisor:d4fdfb1b27... confirmed the tag is a manifest list (written last by GPU Test's merge step) pointing at amd64 sha256:4b5468c6… and arm64 sha256:a1537bbca…. The K8s E2E runs on amd64, so docker pull "$image" only fetched amd64 layers. Then kind load docker-image invokes ctr import --all-platforms --digests, which insists on importing every platform in the manifest list — and the arm64 layers are absent locally.
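The mismatch can be sketched as a set difference between what the manifest list references and what the amd64-only pull left on disk. This is a simplified model, not how ctr is implemented: it treats each platform entry as a single digest rather than tracking individual layers. The `manifests` / `platform` / `digest` field names follow the JSON shape that docker manifest inspect prints for a manifest list. The digests are placeholders, because the real ones are truncated in this report.

```python
def missing_platform_digests(manifest_list: dict, local_digests: set) -> list:
    """Return (architecture, digest) pairs the list references but that are absent locally."""
    return [
        (entry["platform"]["architecture"], entry["digest"])
        for entry in manifest_list["manifests"]
        if entry["digest"] not in local_digests
    ]


# Placeholder digests for illustration; they are not the values from the failing run.
manifest_list = {
    "manifests": [
        {"digest": "sha256:AMD64-PLACEHOLDER",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:ARM64-PLACEHOLDER",
         "platform": {"architecture": "arm64", "os": "linux"}},
    ]
}

# An amd64 host that ran `docker pull` only has the amd64 content.
local = {"sha256:AMD64-PLACEHOLDER"}

print(missing_platform_digests(manifest_list, local))
```

Any non-empty result corresponds to the situation where an import of all platforms cannot succeed.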

Depending on race ordering, this surfaces differently:

  • GPU Test merges last → manifest list at the tag → kind load fails with content digest not found (current state).
  • Branch E2E Checks (arm64) finishes last → bare tag becomes arm64-only → K8s pod hits exec /usr/local/bin/openshell-gateway: exec format error on its amd64 host.
  • K8s E2E (amd64) finishes last → works briefly until the next workflow overwrites.
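The three outcomes can be reproduced with a small last-writer-wins simulation. This is a sketch under the assumption that each workflow's push to the bare tag is an unconditional overwrite; the dictionary standing in for the registry and the symptom strings are illustrative, summarizing the cases listed in this report.

```python
from itertools import permutations

# What each workflow finally leaves at the bare :SHA tag.
WRITES = {
    "Branch Kubernetes E2E": "amd64 image",
    "Branch E2E Checks": "arm64 image",
    "GPU Test": "manifest list (amd64 + arm64)",
}

# Symptom seen by the amd64 K8s E2E consumer for each final tag content.
SYMPTOM = {
    "amd64 image": "works until the next overwrite",
    "arm64 image": "exec format error in the gateway pod",
    "manifest list (amd64 + arm64)": "kind load: content digest not found",
}


def bare_tag_after(finish_order):
    """Apply the writes in finish order; each push overwrites the tag."""
    registry = {}
    for workflow in finish_order:
        registry["SHA"] = WRITES[workflow]
    return registry["SHA"]


for order in permutations(WRITES):
    content = bare_tag_after(order)
    print(f"last={order[-1]:22} tag={content:30} -> {SYMPTOM[content]}")
```

Only the workflow that finishes last determines the tag's content, so two of the three possible endings break the amd64 consumer.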

This started biting once the Dockerfile was split (#1316 ancestor commits) and the per-arch single-image build path activated alongside GPU Test's multi-arch merge.

Description

Actual behavior: Branch Kubernetes E2E fails on every recent commit of PR #1316 (and presumably on any PR where concurrent workflows race on the same image tag). The failure mode is either ctr: content digest … not found during kind load, or exec format error in the gateway pod when the wrong-arch image happens to win the tag race.

Expected behavior: Concurrent workflows for the same SHA should not corrupt each other's registry state. Each consumer should be able to deterministically pull the arch it needs.

Reproduction Steps

  1. Push a commit to a PR that triggers Branch Kubernetes E2E, Branch E2E Checks, and GPU Test simultaneously (i.e. any branch with test:e2e and test:e2e-gpu semantics applied).
  2. Observe the Branch Kubernetes E2E job at the "Load gateway and supervisor images into kind" step.
  3. Run docker manifest inspect ghcr.io/nvidia/openshell/<component>:<sha> to confirm a manifest list was published while one of the per-arch workflows also targeted the bare tag.

Environment

Logs

ERROR: failed to load image: command "docker exec --privileged -i kube-e2e-25765424071-control-plane ctr --namespace=k8s.io images import --all-platforms --digests --snapshotter=overlayfs -" failed with error: exit status 1
Command Output: ctr: content digest sha256:a1537bbca22883a1f7c01c6d162252199017b2b01b19e5b12b6dde224de26b98: not found

Failing run: https://github.com/NVIDIA/OpenShell/actions/runs/25765424071/job/75677991018

Proposed Fix

Recommended (option 1): Stop letting single-arch builds collapse onto the bare tag. Change IMAGE_TAG in .github/workflows/docker-build.yml:184 to always include the arch suffix:

IMAGE_TAG: ${{ format('{0}-{1}', needs.resolve.outputs.image_tag_base, matrix.arch) }}

Then only the merge step writes the bare tag (and it can also run for platform_count == 1 to produce a single-platform manifest list). Consumers (Branch Kubernetes E2E's "Load gateway and supervisor images into kind" step, the e2e harness, etc.) reference the arch-suffixed tag matching their host. The bare tag becomes a stable manifest list assembled deterministically per workflow rather than a shared mutable write target.
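On the consumer side, option 1 amounts to deriving the arch suffix from the host and appending it to the tag base. The following is a hedged sketch of that lookup, not the harness's actual code: `host_arch`, `consumer_image_ref`, the machine-name mapping, and the SHA `abc1234` are all assumptions made for illustration.

```python
import platform

# Assumed mapping from platform.machine() values to the arch names used in tags.
_MACHINE_TO_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def host_arch(machine=None):
    """Translate a machine name (default: this host's) into a tag arch suffix."""
    machine = (machine or platform.machine()).lower()
    try:
        return _MACHINE_TO_ARCH[machine]
    except KeyError:
        raise ValueError(f"unsupported machine type: {machine!r}") from None


def consumer_image_ref(repository, image_tag_base, machine=None):
    """Build the arch-suffixed reference a consumer would pull under option 1."""
    return f"{repository}:{image_tag_base}-{host_arch(machine)}"


# "abc1234" is a placeholder SHA.
print(consumer_image_ref("ghcr.io/nvidia/openshell/supervisor", "abc1234", "x86_64"))
```

Because each arch-suffixed tag has exactly one writer per build, a consumer resolving its reference this way no longer depends on which workflow finished last.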

Alternatives considered:

  • Per-workflow tag scoping (e.g. :SHA-<workflow>-<arch>): smaller change, but still leaks workflow names into image references.
  • Have consumers resolve the platform digest via docker buildx imagetools inspect then docker pull <image>@sha256:<digest>: local workaround only — the next consumer hits the same race.
  • Replace kind load docker-image with docker save | kind load image-archive to sidestep ctr --all-platforms: works around kind's behavior rather than fixing the registry race.
  • Concurrency-group all three workflows on SHA: kills parallelism, slows CI significantly.
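For reference, the digest-pinning alternative could look roughly like the sketch below. It assumes the consumer has already obtained manifest-list JSON in the shape docker manifest inspect prints; the helper name `pinned_reference` and the digests are placeholders. It shares the limitation noted in the list: it still fails whenever the bare tag currently holds a single-arch image instead of a manifest list.

```python
def pinned_reference(repository, manifest_list, arch, os_name="linux"):
    """Return `<repository>@<digest>` for the entry matching the requested platform."""
    for entry in manifest_list.get("manifests", []):
        plat = entry.get("platform", {})
        if plat.get("architecture") == arch and plat.get("os") == os_name:
            # Digests already carry their "sha256:" prefix.
            return f"{repository}@{entry['digest']}"
    raise LookupError(f"no {os_name}/{arch} entry in manifest list")


# Placeholder digests for illustration only.
example_list = {
    "manifests": [
        {"digest": "sha256:AMD64-PLACEHOLDER",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:ARM64-PLACEHOLDER",
         "platform": {"architecture": "arm64", "os": "linux"}},
    ]
}

print(pinned_reference("ghcr.io/nvidia/openshell/supervisor", example_list, "amd64"))
```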

Surfaced in PR #1316.
