Agent Diagnostic
Investigation done while reviewing CI on PR #1316.
Skills used: watch-github-actions to track workflow runs; direct gh api and docker manifest inspect calls to inspect registry state.
Findings:
A single git SHA triggers three independent workflows that all call .github/workflows/docker-build.yml and push images to the same registry tag:
| Workflow | Builds | What it pushes |
| --- | --- | --- |
| Branch Kubernetes E2E | amd64 only | single-arch amd64 image to :SHA (bare tag) |
| Branch E2E Checks | arm64 only | single-arch arm64 image to :SHA (bare tag) |
| GPU Test | amd64 + arm64 | per-arch to :SHA-amd64 / :SHA-arm64, then a manifest list to :SHA |
The relevant logic in .github/workflows/docker-build.yml:184:
IMAGE_TAG: ${{ needs.resolve.outputs.platform_count == '1'
&& needs.resolve.outputs.image_tag_base
|| format('{0}-{1}', needs.resolve.outputs.image_tag_base, matrix.arch) }}
collapses single-arch builds onto the unsuffixed tag. The merge step (if: platform_count != '1') is gated on multi-arch. Three writers, last one wins.
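The collapse can be reproduced outside of Actions. The sketch below mirrors the expression in plain shell; image_tag is a hypothetical helper name, and SHA stands in for the real image_tag_base value. It is an illustration of the tag logic only, not the workflow's actual code.

```shell
# Mirrors the IMAGE_TAG expression: with exactly one platform the arch
# suffix is dropped, otherwise "<base>-<arch>" is used.
image_tag() {
  local base="$1" platform_count="$2" arch="$3"
  if [ "$platform_count" = "1" ]; then
    printf '%s\n' "$base"
  else
    printf '%s-%s\n' "$base" "$arch"
  fi
}

image_tag SHA 1 amd64   # Branch Kubernetes E2E -> prints SHA
image_tag SHA 1 arm64   # Branch E2E Checks     -> prints SHA
image_tag SHA 2 amd64   # GPU Test, per-arch    -> prints SHA-amd64
```

The first two calls resolve to the same tag, which is the write collision described above.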
Failure manifestation:
In Branch Kubernetes E2E (run 25765424071 on commit d4fdfb1b27...):
ERROR: failed to load image: command "docker exec ... ctr ... images import
--all-platforms --digests --snapshotter=overlayfs -" failed with error: exit status 1
Command Output: ctr: content digest sha256:a1537bbca22883a1f7c01c6d162252199017b2b01b19e5b12b6dde224de26b98: not found
docker manifest inspect ghcr.io/nvidia/openshell/supervisor:d4fdfb1b27... confirmed the tag is a manifest list (written last by GPU Test's merge step) pointing at amd64 sha256:4b5468c6… and arm64 sha256:a1537bbca…. The K8s E2E runs on amd64, so docker pull "$image" only fetched amd64 layers. Then kind load docker-image invokes ctr import --all-platforms --digests, which insists on importing every platform in the manifest list — and the arm64 layers are absent locally.
Depending on race ordering, this surfaces differently:
- GPU Test merges last → manifest list at the tag → kind load fails with content digest not found (current state).
- Branch E2E Checks (arm64) finishes last → bare tag becomes arm64-only → K8s pod hits exec /usr/local/bin/openshell-gateway: exec format error on its amd64 host.
- K8s E2E (amd64) finishes last → works briefly until the next workflow overwrites the tag.
This started biting once the Dockerfile was split (#1316 ancestor commits) and the per-arch single-image build path activated alongside GPU Test's multi-arch merge.
Description
Actual behavior: Branch Kubernetes E2E fails on every recent commit of PR #1316 (and presumably on any PR where concurrent workflows race on the same image tag). The failure mode is either ctr: content digest … not found during kind load, or exec format error in the gateway pod when the wrong-arch image happens to win the tag race.
Expected behavior: Concurrent workflows for the same SHA should not corrupt each other's registry state. Each consumer should be able to deterministically pull the arch it needs.
Reproduction Steps
- Push a commit to a PR that triggers Branch Kubernetes E2E, Branch E2E Checks, and GPU Test simultaneously (i.e. any branch with test:e2e and test:e2e-gpu semantics applied).
- Observe the Branch Kubernetes E2E job at the "Load gateway and supervisor images into kind" step.
- Run docker manifest inspect ghcr.io/nvidia/openshell/<component>:<sha> to confirm a manifest list was published while one of the per-arch workflows also targeted the bare tag.
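When checking what currently sits at the bare tag, the top-level mediaType of the inspected manifest is usually enough to tell a manifest list from a single image. A minimal sketch follows; manifest_kind is a hypothetical helper, and the four media types listed are the standard Docker and OCI ones. A single-platform build that publishes attestations may also produce an index, so treat the result as a hint rather than proof.

```shell
# Classify a manifest mediaType string as a multi-platform index
# (manifest list) or a single-image manifest.
manifest_kind() {
  case "$1" in
    application/vnd.docker.distribution.manifest.list.v2+json|application/vnd.oci.image.index.v1+json)
      echo "manifest-list" ;;
    application/vnd.docker.distribution.manifest.v2+json|application/vnd.oci.image.manifest.v1+json)
      echo "single-image" ;;
    *)
      echo "unknown" ;;
  esac
}
```

One way to feed it, assuming jq is installed and the mediaType field is present in the output: manifest_kind "$(docker manifest inspect "$image" | jq -r .mediaType)".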
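Environment
- .github/workflows/branch-kubernetes-e2e.yml via .github/workflows/docker-build.yml
- bbc46d37 and later (any commit since the Dockerfile.images split started exercising multi-workflow single-arch pushes)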
Logs
ERROR: failed to load image: command "docker exec --privileged -i kube-e2e-25765424071-control-plane ctr --namespace=k8s.io images import --all-platforms --digests --snapshotter=overlayfs -" failed with error: exit status 1
Command Output: ctr: content digest sha256:a1537bbca22883a1f7c01c6d162252199017b2b01b19e5b12b6dde224de26b98: not found
Failing run: https://github.com/NVIDIA/OpenShell/actions/runs/25765424071/job/75677991018
Proposed Fix
Recommended (option 1): Stop letting single-arch builds collapse onto the bare tag. Change IMAGE_TAG in .github/workflows/docker-build.yml:184 to always include the arch suffix:
IMAGE_TAG: ${{ format('{0}-{1}', needs.resolve.outputs.image_tag_base, matrix.arch) }}
Then only the merge step writes the bare tag (and it can also run for platform_count == 1 to produce a single-platform manifest list). Consumers (Branch Kubernetes E2E's "Load gateway and supervisor images into kind" step, the e2e harness, etc.) reference the arch-suffixed tag matching their host. The bare tag becomes a stable manifest list assembled deterministically per workflow rather than a shared mutable write target.
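On the consumer side, picking the arch-suffixed tag could be as small as the sketch below. arch_suffix and arch_image_ref are hypothetical helper names, and the mapping assumes the only hosts in play report x86_64 or aarch64 from uname -m; the real step may already have the arch available from the workflow matrix.

```shell
# Map a `uname -m` value to the arch names used in the tag suffixes.
arch_suffix() {
  case "$1" in
    x86_64|amd64) echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported machine type: $1" >&2; return 1 ;;
  esac
}

# Build the arch-suffixed image reference for a repo and tag base.
# The third argument defaults to the current host's machine type.
arch_image_ref() {
  local repo="$1" tag_base="$2" machine="${3:-$(uname -m)}"
  local arch
  arch="$(arch_suffix "$machine")" || return 1
  printf '%s:%s-%s\n' "$repo" "$tag_base" "$arch"
}
```

For example, arch_image_ref ghcr.io/nvidia/openshell/supervisor SHA x86_64 yields ghcr.io/nvidia/openshell/supervisor:SHA-amd64, which no single-arch workflow for the other architecture can overwrite.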
Alternatives considered:
- Per-workflow tag scoping (e.g. :SHA-<workflow>-<arch>): smaller change, but still leaks workflow names into image references.
- Have consumers resolve the platform digest via docker buildx imagetools inspect, then docker pull <image>@sha256:<digest>: local workaround only; the next consumer hits the same race.
- Replace kind load docker-image with docker save | kind load image-archive to sidestep ctr --all-platforms: works around kind's behavior rather than fixing the registry race.
- Concurrency-group all three workflows on SHA: kills parallelism, slows CI significantly.
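For reference, the image-archive alternative listed above might look roughly like this. It is an untested sketch: run and load_via_archive are hypothetical helpers, the archive is written to an explicit path instead of being piped, and whether docker save emits only the locally pulled platform depends on the Docker image store in use. Setting DRY_RUN prints the commands instead of executing them.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Save a locally pulled image to a tar archive and load that archive
# into the named kind cluster, bypassing `kind load docker-image`.
load_via_archive() {
  local image="$1" cluster="$2" archive="$3"
  run docker save -o "$archive" "$image" || return 1
  run kind load image-archive "$archive" --name "$cluster"
}
```

As noted in the list, this only avoids the ctr import failure; it does nothing about the wrong-arch image winning the tag.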
Surfaced in PR #1316.