feat(ci): route UAT through the reservation broker (#1274)#1569
Conversation
|
🌿 Preview your docs: https://nvidia-preview-ci-uat-broker-cutover.docs.buildwithfern.com/aicr |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis change refactors UAT GitHub Actions workflows around a reservation-based model. Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/uat-aws.yaml (1)
377-384: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winMove
reservationthroughenvbefore printing it. Line 384 expands a caller-controlled input directly in a shellrunblock, so shell metacharacters can execute. Pass it viaenvand print it withprintf '%s'.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/uat-aws.yaml around lines 377 - 384, The Test Summary step is expanding the caller-controlled reservation input directly inside the shell script, which should be avoided. Update the workflow step that prints the UAT summary to pass reservation through the step’s env mapping first, then reference that env value inside the run block for the Test Summary step. Use the existing Test Summary / reservation fields to locate the block, and print the value with a safe literal output method like printf rather than direct shell interpolation.Source: Linters/SAST tools
.github/workflows/uat-gcp.yaml (1)
350-357: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winMove
reservationthroughenvbefore printing it.runexpands${{ inputs.reservation }}directly into bash, so a crafted value can trigger command substitution in this step. Pass it viaenvand print it withprintfinstead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/uat-gcp.yaml around lines 350 - 357, The Test Summary step in the UAT GCP workflow is interpolating inputs.reservation directly inside the shell script, which can be unsafe. Update the workflow step to pass reservation through env on the Test Summary job step, then print it from the shell using printf instead of inline GitHub expression expansion. Use the existing Test Summary step and inputs.reservation reference to locate the change.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/uat-nightly-batch.yaml:
- Around line 27-30: The nightly schedule in the workflow trigger is
inconsistent with the timezone comment; update the `schedule` cron in the
workflow definition or revise the comment so they match. If `workflow_dispatch`
and the scheduled trigger in `.github/workflows/uat-nightly-batch.yaml` are
meant to run around Pacific midnight, adjust the cron accordingly; otherwise,
keep `0 4 * * *` and change the comment to reflect the actual UTC/Pacific
execution time.
In @.github/workflows/uat-run.yaml:
- Around line 17-24: Clarify the concurrency behavior in the uat-run workflow
comment/docs: the current GitHub Actions concurrency setup in uat-run.yaml only
guarantees one active and one pending run per reservation, so a newer request
can replace an older pending one. Update the existing explanatory comment near
the workflow_dispatch/workflow_call handling to state this explicitly, and
mention that preserving every request would require a separate queue/broker
mechanism rather than relying on the current concurrency lease.
In `@docs/contributor/uat.md`:
- Line 14: Reword the passage in docs/contributor/uat.md so it does not suggest
a general “full working day” GitHub Actions limitation. Update the text around
the phased scheduling/lease explanation to say the hosted UAT jobs are capped by
the runner’s job timeout, the lease only needs to cover the brief transition
windows, and the steady-state daytime cluster is protected by its lease-tag.
---
Outside diff comments:
In @.github/workflows/uat-aws.yaml:
- Around line 377-384: The Test Summary step is expanding the caller-controlled
reservation input directly inside the shell script, which should be avoided.
Update the workflow step that prints the UAT summary to pass reservation through
the step’s env mapping first, then reference that env value inside the run block
for the Test Summary step. Use the existing Test Summary / reservation fields to
locate the block, and print the value with a safe literal output method like
printf rather than direct shell interpolation.
In @.github/workflows/uat-gcp.yaml:
- Around line 350-357: The Test Summary step in the UAT GCP workflow is
interpolating inputs.reservation directly inside the shell script, which can be
unsafe. Update the workflow step to pass reservation through env on the Test
Summary job step, then print it from the shell using printf instead of inline
GitHub expression expansion. Use the existing Test Summary step and
inputs.reservation reference to locate the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: bc3cdb95-5fa2-41a9-a686-8711e961079a
📒 Files selected for processing (9)
.github/workflows/uat-aws.yaml.github/workflows/uat-gcp.yaml.github/workflows/uat-nightly-batch.yaml.github/workflows/uat-run.yamldemos/cuj1-training.mddocs/contributor/uat.mddocs/index.ymlinfra/uat-aws-account/README.mdinfra/uat-gcp-account/README.md
599b796 to
6079c0e
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/uat-aws.yaml:
- Line 384: The reservation value is being expanded directly inside the shell in
the workflow step, which should be avoided. Update the step that prints the
reservation in uat-aws.yaml to pass inputs.reservation through the env block and
echo the environment variable instead, following the same safer pattern used in
uat-run.yaml. Locate the affected workflow step by its reservation print
statement and adjust that step only.
In @.github/workflows/uat-gcp.yaml:
- Around line 82-84: The GCP bringup/teardown shell snippets still read
CLUSTER_CONFIG with an unquoted redirection, which can break if the
caller-provided path contains special characters. Update the downstream shell
commands that consume CLUSTER_CONFIG in the workflow to quote the variable in
the input redirection, and check both the bringup and teardown sections so they
use the same safe pattern.
- Line 357: The reservation value is being expanded directly inside the shell in
the workflow step, which should be avoided in this elevated-permissions job.
Update the step that echoes reservation in the uat-gcp workflow to pass
inputs.reservation through env first, then reference the environment variable in
the shell command instead of interpolating the expression inline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6cee5958-1a81-453e-b860-cac079658042
📒 Files selected for processing (9)
.github/workflows/uat-aws.yaml.github/workflows/uat-gcp.yaml.github/workflows/uat-nightly-batch.yaml.github/workflows/uat-run.yamldemos/cuj1-training.mddocs/contributor/uat.mddocs/index.ymlinfra/uat-aws-account/README.mdinfra/uat-gcp-account/README.md
6079c0e to
53a5a9e
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/uat-nightly-batch.yaml:
- Around line 93-97: The permissions block in the uat-nightly-batch workflow
needs an explicit comment explaining why actions: read is required, since the
uat-run workflow and its nested uat-aws and uat-gcp jobs depend on it. Keep
actions: read in the permissions for the caller job and add a concrete
zizmor-focused comment near the permissions declaration so the intent is
documented instead of removing the permission.
In @.github/workflows/uat-run.yaml:
- Around line 122-126: The UAT caller jobs in the workflow are granting actions:
read without any local explanation, so either add a short inline comment on the
permissions block in each caller job or remove actions: read if the nested
workflow does not require it. Update both caller job permission sections
together so the rationale is consistent and easy to find near the permissions
declaration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 393bcdae-55a5-4deb-9245-b46a36e916a4
📒 Files selected for processing (9)
.github/workflows/uat-aws.yaml.github/workflows/uat-gcp.yaml.github/workflows/uat-nightly-batch.yaml.github/workflows/uat-run.yamldemos/cuj1-training.mddocs/contributor/uat.mddocs/index.ymlinfra/uat-aws-account/README.mdinfra/uat-gcp-account/README.md
DC1 PR-2 (the broker cutover), building on the reservation registry + uat-broker helper from NVIDIA#1559. Every UAT run now goes through a shared dispatch surface that holds a per-reservation concurrency lease, so contending runs QUEUE instead of racing the reserved GPU hardware. - uat-run.yaml (new): the lease owner + dispatch surface. Holds the concurrency group uat-<reservation>, resolves the reservation row from infra/uat/reservations.yaml via the uat-broker helper, and invokes the cloud-appropriate reusable pipeline with the permission superset it needs. Dispatchable by a human (workflow_dispatch) or the nightly batch (workflow_call). - uat-aws.yaml / uat-gcp.yaml: converted in place to workflow_call-only (dropped the per-workflow cron + self-concurrency; CLUSTER_CONFIG / TEST_CONFIG now come from the registry via inputs). Filenames and the keyless-cosign identity pins are unchanged; the pipeline bodies are otherwise untouched, and evidence-ingest stays nested. - uat-nightly-batch.yaml (new): the night side of the cycle — a cron that enumerates reservations from the registry and runs one main pass per reservation (data-driven matrix -> uat-run.yaml). Different reservations run in parallel; same-reservation runs serialize on the lease. - docs/contributor/uat.md (new, registered in docs/index.yml): the day/night model, how to request a run, how queuing + the one-pending limit work, and how to add a reservation. Also fixes the stale gh-workflow-run command in the CUJ1 demo and clarifies the two infra-account READMEs. Scope: exercises the main x training cell on both reservations. The version matrix (PR-3/DC5), the inference intent (DC2/DC3), and superseded-run surfacing (DC6) light up the remaining axes; the mechanism already scales to them. Signed-off-by: Nathan Hensley <nhensley@nvidia.com>
53a5a9e to
fa18e21
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Clean cutover. The per-reservation lease on uat-run.yaml (with the reusable pipelines correctly dropping their own concurrency) is the right shape, and keeping the uat-aws/gcp.yaml filenames preserves the keyless-cosign identity + EXPECTED_IDENTITY_REGEXP pins — the load-bearing detail here. Verified the repository guard casing (matches repo convention, GH == is case-insensitive), env-passed reservation avoids injection, and the permission superset is scoped to the calling jobs. Atomic cron move means no coverage gap. actionlint + full CI green. No blocking concerns.
Summary
DC1 PR-2 — the broker cutover: route all real-hardware UAT through a shared dispatch surface that holds a per-reservation concurrency lease, so contending runs queue instead of racing the reserved GPU hardware. Registry-driven; behavior is otherwise the existing UAT lane, rewired.
Motivation / Context
Second of four PRs implementing DC1 — day/night scheduler + reservation broker (#1274), part of the UAT epic (#1264), building on the reservation registry +
uat-brokerhelper merged in #1559. Today each per-cloud UAT workflow serializes only against itself, and AWS hard-fails when its reservation is busy; this PR replaces that with a reservation-keyed lease so a contending run queues rather than racing (or failing).Fixes: N/A
Related: #1274, #1264, #1559
Type of Change
Component(s) Affected
docs/,examples/).github/workflows/(UAT lane),infra/uat-*-account/READMEsImplementation Notes
uat-run.yaml— the lease owner + dispatch surface. Top-levelconcurrency: uat-${{ github.event.inputs.reservation || inputs.reservation }}(cancel-in-progress: false); aresolvejob builds theuat-brokerhelper and resolves the reservation row frominfra/uat/reservations.yaml;run-aws/run-gcpinvoke the cloud pipeline with the permission superset the nested pipeline + evidence-ingest need. Bothworkflow_dispatch(human) andworkflow_call(batch).uat-aws.yaml/uat-gcp.yamlconverted in place toworkflow_call-only: dropped the per-workflowschedulecron and self-concurrency;CLUSTER_CONFIG/TEST_CONFIGnow come from the registry via inputs. Filenames kept, so the keyless-cosign identity pins stay valid; the pipeline bodies are otherwise untouched andingest-evidencestays nested.uat-nightly-batch.yaml— the cron night side: enumerate reservations from the registry into afromJSONmatrix thatuses: uat-run.yaml(mirrors the existingkwok-recipes.yamldynamic-matrix pattern, so nogh workflow runtoken dependency). Different reservations run in parallel; same-reservation runs serialize on the lease.docs/contributor/uat.md(day/night model, requesting a run, queuing + the one-in-progress-plus-one-pending limit + standing-broker escalation, adding a reservation, and the ACL/non-secret note), registered indocs/index.yml. Also fixes the now-stalegh workflow run uat-gcp.yamlcommand indemos/cuj1-training.mdand clarifies the two infra-account READMEs.Scope / boundaries
Exercises the
main× training cell on both reservations (aws-h100,gcp-h100). Deferred to their consuming work to avoid dead pass-through inputs:main-first version matrix → PR-3 / DC5 (uat-broker schedule+ versioned install)intent→ test-config / recipe selection → DC2 / DC3 (inference)reservationis a free-form string validated against the registry (not a hardcodedchoice), so adding a GPU pool stays a one-row data edit.Testing
actionlint .github/workflows/uat-run.yaml .github/workflows/uat-nightly-batch.yaml \ .github/workflows/uat-aws.yaml .github/workflows/uat-gcp.yaml yamllint -c .yamllint.yaml <the four workflows> docs/index.yml ./tools/check-docs-mdx && ./tools/check-docs-filenamesactionlintclean on all four workflows (this validates the reusable-workflow input wiring betweenuat-run.yamland the per-cloud pipelines).yamllintclean;check-docs-mdx+check-docs-filenamespass.uat-brokerhelper (with its unit tests) landed in feat(ci): add UAT reservation registry and broker helper (#1274) #1559.Risk Assessment
Rollout notes: The cutover is atomic — the per-cloud crons move to
uat-nightly-batch.yamlin the same change, so there is no window without nightly coverage and no window where two triggers can hit a reservation without sharing the lease. Real-hardware behavior can only be validated post-merge (the nightly cron fires on the default branch); the smoke test is a manualgh workflow run uat-run.yaml -f reservation=aws-h100. One watch-item: the nightly path nests four reusable-workflow levels (batch → uat-run → uat-aws → evidence-ingest), which is GitHub's documented maximum.Checklist
make testwith-race) — N/A (no Go changes; helper tested in feat(ci): add UAT reservation registry and broker helper (#1274) #1559)make lint) — actionlint + yamllint + doc checks greenactionlintis the merge gateuat.md+ demo/README fixesgit commit -S)