Skip to content

feat(ci): route UAT through the reservation broker (#1274)#1569

Merged
mchmarny merged 3 commits into
NVIDIA:mainfrom
njhensley:ci/uat-broker-cutover
Jul 1, 2026
Merged

feat(ci): route UAT through the reservation broker (#1274)#1569
mchmarny merged 3 commits into
NVIDIA:mainfrom
njhensley:ci/uat-broker-cutover

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

DC1 PR-2 — the broker cutover: route all real-hardware UAT through a shared dispatch surface that holds a per-reservation concurrency lease, so contending runs queue instead of racing the reserved GPU hardware. Registry-driven; behavior is otherwise the existing UAT lane, rewired.

Motivation / Context

Second of four PRs implementing DC1 — day/night scheduler + reservation broker (#1274), part of the UAT epic (#1264), building on the reservation registry + uat-broker helper merged in #1559. Today each per-cloud UAT workflow serializes only against itself, and AWS hard-fails when its reservation is busy; this PR replaces that with a reservation-keyed lease so a contending run queues rather than racing (or failing).

Fixes: N/A
Related: #1274, #1264, #1559

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Build/CI/tooling

Component(s) Affected

  • Docs/examples (docs/, examples/)
  • Other: .github/workflows/ (UAT lane), infra/uat-*-account/ READMEs

Implementation Notes

  • New uat-run.yaml — the lease owner + dispatch surface. Top-level concurrency: uat-${{ github.event.inputs.reservation || inputs.reservation }} (cancel-in-progress: false); a resolve job builds the uat-broker helper and resolves the reservation row from infra/uat/reservations.yaml; run-aws/run-gcp invoke the cloud pipeline with the permission superset the nested pipeline + evidence-ingest need. Both workflow_dispatch (human) and workflow_call (batch).
  • uat-aws.yaml / uat-gcp.yaml converted in place to workflow_call-only: dropped the per-workflow schedule cron and self-concurrency; CLUSTER_CONFIG/TEST_CONFIG now come from the registry via inputs. Filenames kept, so the keyless-cosign identity pins stay valid; the pipeline bodies are otherwise untouched and ingest-evidence stays nested.
  • New uat-nightly-batch.yaml — the cron night side: enumerate reservations from the registry into a fromJSON matrix that uses: uat-run.yaml (mirrors the existing kwok-recipes.yaml dynamic-matrix pattern, so no gh workflow run token dependency). Different reservations run in parallel; same-reservation runs serialize on the lease.
  • Docsdocs/contributor/uat.md (day/night model, requesting a run, queuing + the one-in-progress-plus-one-pending limit + standing-broker escalation, adding a reservation, and the ACL/non-secret note), registered in docs/index.yml. Also fixes the now-stale gh workflow run uat-gcp.yaml command in demos/cuj1-training.md and clarifies the two infra-account READMEs.

Scope / boundaries

Exercises the main × training cell on both reservations (aws-h100, gcp-h100). Deferred to their consuming work to avoid dead pass-through inputs:

  • the time-boxed main-first version matrix → PR-3 / DC5 (uat-broker schedule + versioned install)
  • intent → test-config / recipe selection → DC2 / DC3 (inference)
  • superseded-run surfacing → DC6

reservation is a free-form string validated against the registry (not a hardcoded choice), so adding a GPU pool stays a one-row data edit.

Testing

actionlint .github/workflows/uat-run.yaml .github/workflows/uat-nightly-batch.yaml \
           .github/workflows/uat-aws.yaml .github/workflows/uat-gcp.yaml
yamllint -c .yamllint.yaml <the four workflows> docs/index.yml
./tools/check-docs-mdx && ./tools/check-docs-filenames

Risk Assessment

  • Medium — Touches the live UAT lane (a rewiring), though isolated to that lane and straightforward to revert.

Rollout notes: The cutover is atomic — the per-cloud crons move to uat-nightly-batch.yaml in the same change, so there is no window without nightly coverage and no window where two triggers can hit a reservation without sharing the lease. Real-hardware behavior can only be validated post-merge (the nightly cron fires on the default branch); the smoke test is a manual gh workflow run uat-run.yaml -f reservation=aws-h100. One watch-item: the nightly path nests four reusable-workflow levels (batch → uat-run → uat-aws → evidence-ingest), which is GitHub's documented maximum.

Checklist

  • Tests pass locally (make test with -race) — N/A (no Go changes; helper tested in feat(ci): add UAT reservation registry and broker helper (#1274) #1559)
  • Linter passes (make lint) — actionlint + yamllint + doc checks green
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — workflow-level; actionlint is the merge gate
  • I updated docs if user-facing behavior changed — new uat.md + demo/README fixes
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@njhensley njhensley requested review from a team as code owners July 1, 2026 01:16
@njhensley njhensley added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This change refactors UAT GitHub Actions workflows around a reservation-based model. uat-aws.yaml and uat-gcp.yaml are converted from workflow_dispatch/schedule workflows into reusable workflow_call targets with reservation and config path inputs. A new uat-run.yaml resolves reservations and dispatches to the appropriate cloud workflow under a per-reservation concurrency lock. A new uat-nightly-batch.yaml enumerates reservations and triggers uat-run.yaml per reservation on a nightly cron. Documentation and READMEs are updated to reflect the new entry points.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/aicr#1559: Introduces the reservation schema and tools/uat-broker behavior that the new UAT workflows depend on.

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: routing UAT through a reservation broker.
Description check ✅ Passed The description is detailed and clearly matches the workflow and docs changes in the pull request.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/uat-aws.yaml (1)

377-384: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Move reservation through env before printing it. Line 384 expands a caller-controlled input directly in a shell run block, so shell metacharacters can execute. Pass it via env and print it with printf '%s'.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-aws.yaml around lines 377 - 384, The Test Summary step
is expanding the caller-controlled reservation input directly inside the shell
script, which should be avoided. Update the workflow step that prints the UAT
summary to pass reservation through the step’s env mapping first, then reference
that env value inside the run block for the Test Summary step. Use the existing
Test Summary / reservation fields to locate the block, and print the value with
a safe literal output method like printf rather than direct shell interpolation.

Source: Linters/SAST tools

.github/workflows/uat-gcp.yaml (1)

350-357: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Move reservation through env before printing it. run expands ${{ inputs.reservation }} directly into bash, so a crafted value can trigger command substitution in this step. Pass it via env and print it with printf instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-gcp.yaml around lines 350 - 357, The Test Summary step
in the UAT GCP workflow is interpolating inputs.reservation directly inside the
shell script, which can be unsafe. Update the workflow step to pass reservation
through env on the Test Summary job step, then print it from the shell using
printf instead of inline GitHub expression expansion. Use the existing Test
Summary step and inputs.reservation reference to locate the change.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/uat-nightly-batch.yaml:
- Around line 27-30: The nightly schedule in the workflow trigger is
inconsistent with the timezone comment; update the `schedule` cron in the
workflow definition or revise the comment so they match. If `workflow_dispatch`
and the scheduled trigger in `.github/workflows/uat-nightly-batch.yaml` are
meant to run around Pacific midnight, adjust the cron accordingly; otherwise,
keep `0 4 * * *` and change the comment to reflect the actual UTC/Pacific
execution time.

In @.github/workflows/uat-run.yaml:
- Around line 17-24: Clarify the concurrency behavior in the uat-run workflow
comment/docs: the current GitHub Actions concurrency setup in uat-run.yaml only
guarantees one active and one pending run per reservation, so a newer request
can replace an older pending one. Update the existing explanatory comment near
the workflow_dispatch/workflow_call handling to state this explicitly, and
mention that preserving every request would require a separate queue/broker
mechanism rather than relying on the current concurrency lease.

In `@docs/contributor/uat.md`:
- Line 14: Reword the passage in docs/contributor/uat.md so it does not suggest
a general “full working day” GitHub Actions limitation. Update the text around
the phased scheduling/lease explanation to say the hosted UAT jobs are capped by
the runner’s job timeout, the lease only needs to cover the brief transition
windows, and the steady-state daytime cluster is protected by its lease-tag.

---

Outside diff comments:
In @.github/workflows/uat-aws.yaml:
- Around line 377-384: The Test Summary step is expanding the caller-controlled
reservation input directly inside the shell script, which should be avoided.
Update the workflow step that prints the UAT summary to pass reservation through
the step’s env mapping first, then reference that env value inside the run block
for the Test Summary step. Use the existing Test Summary / reservation fields to
locate the block, and print the value with a safe literal output method like
printf rather than direct shell interpolation.

In @.github/workflows/uat-gcp.yaml:
- Around line 350-357: The Test Summary step in the UAT GCP workflow is
interpolating inputs.reservation directly inside the shell script, which can be
unsafe. Update the workflow step to pass reservation through env on the Test
Summary job step, then print it from the shell using printf instead of inline
GitHub expression expansion. Use the existing Test Summary step and
inputs.reservation reference to locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bc3cdb95-5fa2-41a9-a686-8711e961079a

📥 Commits

Reviewing files that changed from the base of the PR and between a0befce and 599b796.

📒 Files selected for processing (9)
  • .github/workflows/uat-aws.yaml
  • .github/workflows/uat-gcp.yaml
  • .github/workflows/uat-nightly-batch.yaml
  • .github/workflows/uat-run.yaml
  • demos/cuj1-training.md
  • docs/contributor/uat.md
  • docs/index.yml
  • infra/uat-aws-account/README.md
  • infra/uat-gcp-account/README.md

Comment thread .github/workflows/uat-nightly-batch.yaml
Comment thread .github/workflows/uat-run.yaml
Comment thread docs/contributor/uat.md Outdated
@njhensley njhensley force-pushed the ci/uat-broker-cutover branch from 599b796 to 6079c0e Compare July 1, 2026 01:27

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/uat-aws.yaml:
- Line 384: The reservation value is being expanded directly inside the shell in
the workflow step, which should be avoided. Update the step that prints the
reservation in uat-aws.yaml to pass inputs.reservation through the env block and
echo the environment variable instead, following the same safer pattern used in
uat-run.yaml. Locate the affected workflow step by its reservation print
statement and adjust that step only.

In @.github/workflows/uat-gcp.yaml:
- Around line 82-84: The GCP bringup/teardown shell snippets still read
CLUSTER_CONFIG with an unquoted redirection, which can break if the
caller-provided path contains special characters. Update the downstream shell
commands that consume CLUSTER_CONFIG in the workflow to quote the variable in
the input redirection, and check both the bringup and teardown sections so they
use the same safe pattern.
- Line 357: The reservation value is being expanded directly inside the shell in
the workflow step, which should be avoided in this elevated-permissions job.
Update the step that echoes reservation in the uat-gcp workflow to pass
inputs.reservation through env first, then reference the environment variable in
the shell command instead of interpolating the expression inline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6cee5958-1a81-453e-b860-cac079658042

📥 Commits

Reviewing files that changed from the base of the PR and between 599b796 and 6079c0e.

📒 Files selected for processing (9)
  • .github/workflows/uat-aws.yaml
  • .github/workflows/uat-gcp.yaml
  • .github/workflows/uat-nightly-batch.yaml
  • .github/workflows/uat-run.yaml
  • demos/cuj1-training.md
  • docs/contributor/uat.md
  • docs/index.yml
  • infra/uat-aws-account/README.md
  • infra/uat-gcp-account/README.md

Comment thread .github/workflows/uat-aws.yaml Outdated
Comment thread .github/workflows/uat-gcp.yaml
Comment thread .github/workflows/uat-gcp.yaml Outdated
@njhensley njhensley force-pushed the ci/uat-broker-cutover branch from 6079c0e to 53a5a9e Compare July 1, 2026 01:37

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/uat-nightly-batch.yaml:
- Around line 93-97: The permissions block in the uat-nightly-batch workflow
needs an explicit comment explaining why actions: read is required, since the
uat-run workflow and its nested uat-aws and uat-gcp jobs depend on it. Keep
actions: read in the permissions for the caller job and add a concrete
zizmor-focused comment near the permissions declaration so the intent is
documented instead of removing the permission.

In @.github/workflows/uat-run.yaml:
- Around line 122-126: The UAT caller jobs in the workflow are granting actions:
read without any local explanation, so either add a short inline comment on the
permissions block in each caller job or remove actions: read if the nested
workflow does not require it. Update both caller job permission sections
together so the rationale is consistent and easy to find near the permissions
declaration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 393bcdae-55a5-4deb-9245-b46a36e916a4

📥 Commits

Reviewing files that changed from the base of the PR and between 6079c0e and 53a5a9e.

📒 Files selected for processing (9)
  • .github/workflows/uat-aws.yaml
  • .github/workflows/uat-gcp.yaml
  • .github/workflows/uat-nightly-batch.yaml
  • .github/workflows/uat-run.yaml
  • demos/cuj1-training.md
  • docs/contributor/uat.md
  • docs/index.yml
  • infra/uat-aws-account/README.md
  • infra/uat-gcp-account/README.md

Comment thread .github/workflows/uat-nightly-batch.yaml Outdated
Comment thread .github/workflows/uat-run.yaml Outdated
DC1 PR-2 (the broker cutover), building on the reservation registry +
uat-broker helper from NVIDIA#1559. Every UAT run now goes through a shared
dispatch surface that holds a per-reservation concurrency lease, so
contending runs QUEUE instead of racing the reserved GPU hardware.

- uat-run.yaml (new): the lease owner + dispatch surface. Holds the
  concurrency group uat-<reservation>, resolves the reservation row from
  infra/uat/reservations.yaml via the uat-broker helper, and invokes the
  cloud-appropriate reusable pipeline with the permission superset it needs.
  Dispatchable by a human (workflow_dispatch) or the nightly batch
  (workflow_call).
- uat-aws.yaml / uat-gcp.yaml: converted in place to workflow_call-only
  (dropped the per-workflow cron + self-concurrency; CLUSTER_CONFIG /
  TEST_CONFIG now come from the registry via inputs). Filenames and the
  keyless-cosign identity pins are unchanged; the pipeline bodies are
  otherwise untouched, and evidence-ingest stays nested.
- uat-nightly-batch.yaml (new): the night side of the cycle — a cron that
  enumerates reservations from the registry and runs one main pass per
  reservation (data-driven matrix -> uat-run.yaml). Different reservations
  run in parallel; same-reservation runs serialize on the lease.
- docs/contributor/uat.md (new, registered in docs/index.yml): the day/night
  model, how to request a run, how queuing + the one-pending limit work, and
  how to add a reservation. Also fixes the stale gh-workflow-run command in
  the CUJ1 demo and clarifies the two infra-account READMEs.

Scope: exercises the main x training cell on both reservations. The version
matrix (PR-3/DC5), the inference intent (DC2/DC3), and superseded-run
surfacing (DC6) light up the remaining axes; the mechanism already scales to
them.

Signed-off-by: Nathan Hensley <nhensley@nvidia.com>
@njhensley njhensley force-pushed the ci/uat-broker-cutover branch from 53a5a9e to fa18e21 Compare July 1, 2026 01:47

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean cutover. The per-reservation lease on uat-run.yaml (with the reusable pipelines correctly dropping their own concurrency) is the right shape, and keeping the uat-aws/gcp.yaml filenames preserves the keyless-cosign identity + EXPECTED_IDENTITY_REGEXP pins — the load-bearing detail here. Verified the repository guard casing (matches repo convention, GH == is case-insensitive), env-passed reservation avoids injection, and the permission superset is scoped to the calling jobs. Atomic cron move means no coverage gap. actionlint + full CI green. No blocking concerns.

@mchmarny mchmarny enabled auto-merge (squash) July 1, 2026 12:20
@mchmarny mchmarny merged commit e4fdf16 into NVIDIA:main Jul 1, 2026
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ci area/docs area/infra size/L theme/ci-dx CI pipelines, developer experience, and build tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants