Skip to content

feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check#640

Merged
mchmarny merged 34 commits into
NVIDIA:mainfrom
njhensley:feat/nccl-gb200-eks-variants
Apr 24, 2026
Merged

feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check#640
mchmarny merged 34 commits into
NVIDIA:mainfrom
njhensley:feat/nccl-gb200-eks-variants

Conversation

@njhensley

@njhensley njhensley commented Apr 22, 2026

Copy link
Copy Markdown
Member

Summary

Adds nccl-all-reduce-bw-net and nccl-all-reduce-bw-nvls validator variants so recipes for p6e-gb200.36xlarge on EKS can independently verify the EFA fabric and the NVL72 MNNVL fabric — ending the silent skip that hid fabric misconfiguration on GB200/EKS. Also extracts a shared defaults.ProbeImage constant and plumbs per-check timeout from catalog → validator container.

Motivation / Context

The existing nccl-all-reduce-bw validator silently skipped on GB200/EKS because supportedNCCLCombinations listed GB200 only under CriteriaServiceAny, and no testdata/gb200/eks/ template existed. A recipe targeting p6e-gb200.36xlarge logged "Skipping NCCL All Reduce bandwidth validation: unsupported service/accelerator combination" and reported a passing skip — so the recipe shipped without a real fabric-bandwidth check.

GB200/EKS nodes expose two inter-node GPU fabrics at once: MNNVL (NVLink-C2C across the NVL72 rack, multiple hundreds of GB/s) and EFA (the AWS network fabric). Operators need per-fabric assertions — a single auto-detect test would always select MNNVL and silently give no signal on EFA health (or vice versa if MNNVL is misconfigured). The two variants let a recipe opt in to either or both, with per-variant thresholds.

Fixes: #531
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)  
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)  
  • Core libraries (pkg/errors, pkg/k8s)  
  • Docs/examples (docs/, examples/)
  • Other: validators/performance, validators/conformance  

Implementation Notes

One catalog entry per variant, with NCCL's own transport vocabulary. No schema change to recipe.Constraint; variants are surfaced as distinct catalog names (nccl-all-reduce-bw-net, nccl-all-reduce-bw-nvls) that a recipe can list independently with their own thresholds. The existing nccl-all-reduce-bw entry stays as the default (behavior preserved). Variant names follow NCCL's own vocabulary — NET (the generic network-transport class) and NVLS (NVLink SHARP / MNNVL). This keeps the catalog stable as future NET implementations land: adding testdata/{accelerator}/{service}/runtime-net.yaml extends coverage with no new catalog entry.

Per-variant runtime.yaml files. templatePath gains a variant parameter so runtime YAMLs resolve to testdata/{accelerator}/{service}/runtime[-{variant}].yaml. Bare runtime.yaml is unchanged for the default variant. Both new templates hardcode the same AWS-prebuilt nccl-tests image main already uses for H100/EKS (public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9, multi-arch).

verifyTransportFromLogs asserts the expected fabric carried traffic. Parses NCCL_DEBUG=INFO output and fails the check if the observed transport doesn't match the selected variant. Keys on NCCL 2.27's authoritative signals (the version shipped in the AWS image):

Variant Signal
NET NCCL INFO Using network <plugin> with non-Socket plugin
NVLS NVLS comm 0x<addr> — communicator-init line, emitted only when NCCL actually constructs an NVLS communicator (not merely detects hardware support)

This turns the variant label into a hard guarantee: a cluster with a broken IMEX domain fails loudly on the NVLS variant instead of silently falling back to EFA, and a cluster missing the NVreg EFA flag fails loudly on the NET variant instead of passing at Socket-fallback bandwidth.

NVLS variant self-provisions the IMEX domain. Before the TrainJob, the validator creates a resource.nvidia.com/v1beta1 ComputeDomain CR; the NVIDIA DRA driver reconciles it into a ResourceClaimTemplate named nccl-all-reduce-imex that worker pods reference to get /dev/nvidia-caps-imex-channels mounted. No hand-wired IMEX setup required on the cluster beyond having nvidia-dra-driver-gpu installed; cleanup deletes the CR alongside the TrainJob.

New NVreg_GrdmaPciTopoCheckOverride=1 preflight for the NET+GB200+EKS tuple. Without this NVIDIA kernel parameter, the driver rejects EFA's dma-buf attach to GPU HBM on the Grace CPU topology (NVRM: dma-buf attach failed: topology not supported for mapping type FORCE_PCIE) and NCCL silently falls back to Socket — below the NET bandwidth threshold but well above what looks like noise. The preflight spawns one short-lived probe pod per target GPU node with /proc/driver/nvidia hostPath-mounted, greps for the active parameter line, and fails fast with an actionable GPU-Operator-ClusterPolicy fix hint if any node is missing it.

Probe-pod approach over exec-into-GPU-Operator-DaemonSet — considered and prototyped an exec fast path (reuse the existing driver DaemonSet pod, avoid spawning our own), but the client-go/tools/remotecommand dependency chain vendored in three new top-level modules (gorilla/websocket, moby/spdystream, mxk/go-flowrate) and ~22 k lines of transport-infrastructure code. The seconds of per-node latency saved didn't justify that footprint. The probe-pod path is also install-agnostic (works whether the driver came from GPU Operator, Helm, DKMS-on-AMI, or a bare-metal install).

defaults.ProbeImage extraction. Three probe-style validators (pod_autoscaling_check, secure_access_check, the new NVreg preflight) each hardcoded "busybox:1.37". Centralized in pkg/defaults/images.go — consistent with how other shared tunables (timeouts, HTTP settings) live there. No override machinery; if a caller needs airgapped-registry override, the existing AICR_CHECK_TIMEOUT / AICR_VALIDATOR_IMAGE_REGISTRY pattern (read env var at the call site, fall back to the constant) extends naturally.

Catalog-entry timeout now reaches the validator container. pkg/validator/job/deployer.go already translated the catalog timeout into the Job's ActiveDeadlineSeconds, but the value never reached the inner validator process — validators/context.go:LoadContext hardcoded defaults.CheckExecutionTimeout as the parent context deadline regardless. A catalog entry saying timeout: 30m still cut off the check at the package default because the inner context cancelled first. Propagated via a new AICR_CHECK_TIMEOUT env var (Go duration string, parsed on the validator side, fallback preserved when unset). Unlocks catalog-level per-check timeout configuration across the whole validator framework, not just NCCL.

Design decisions worth reviewing

  • Why not variantEFA instead of variantNET? NET is NCCL's own term for the generic network-transport class. A future TCPXO-on-GKE or IB-on-prem implementation would also be a NET variant under the same catalog entry — different testdata/{accelerator}/{service}/runtime-net.yaml, no new constraint name. Naming the variant after today's single concrete provider (EFA) would close that door.
  • Why not refactor the platform dispatch to a strategy/interface? Surveyed the rest of the codebase — no other validator uses CriteriaService* branching at all; they solve platform independence via recipe + label-selector discovery (e.g., validators/conformance/ai_service_metrics_check.go's discoverPrometheusURL). NCCL is unavoidably more platform-specific because the MPI pod spec itself differs per provider. The existing approach (per-tuple testdata + sibling helper files for platform-specific runtime discovery) predated this PR and handles 4 already-supported platforms without the conditionals becoming painful. Deferring a platform-strategy refactor until a second concrete NET implementation (e.g. GKE TCPXO) forces its shape.
  • AWS image, not a vendored one. An early iteration built our own nccl-tests image from a vendored Dockerfile on the theory that aws-ofi-nccl 1.16.3's CUDA 13 handling was broken. Cluster-side testing confirmed the theory was wrong — the AWS prebuilt image reaches identical bandwidth on the fixed cluster. The vendored Dockerfile path was scrapped in favor of reusing the image main already pins for H100/EKS.

Testing

make qualify

Per-package coverage vs origin/main:

Package Baseline HEAD Delta
pkg/defaults 100.0% 100.0%
pkg/validator/job 10.6% 10.5% -0.1% (within 0.5% floor)
validators 15.2% 19.8% +4.6%
validators/conformance 15.3% 15.3%
validators/performance 20.2% 20.3% +0.1%

New exported functions are covered by new unit tests (TestParseNVregFromParams, TestGB200NetPreflightApplies, TestFilterPreflightNodes, expanded TestTemplatePath, TestVerifyTransportFromLogs, TestBuildComputeDomain, TestNVLSRuntimeYAMLReferencesIMEXClaim, TestCheckTimeoutFromEnv). TestDeployJobEnvVars additionally asserts the new AICR_CHECK_TIMEOUT env var is injected with the value from the catalog entry's timeout.

Hardware validation on 2 × p6e-gb200.36xlarge EKS cluster (EKS 1.34, Ubuntu 24.04, NVIDIA driver 580.105.08 — the recipe's shipped default, EFA installer 1.47.0, nvidia-dra-driver-gpu v25.12.0, NVreg_GrdmaPciTopoCheckOverride=1 set):

nccl-all-reduce-bw-net  : 327.94 GB/s  (threshold  >=  40 GB/s)   PASS ✓
nccl-all-reduce-bw-nvls : 841.29 GB/s  (threshold  >= 500 GB/s)   PASS ✓

Both variants also verified on driver 580.126.20 (NVIDIA's recommended floor for GB200+EFA): 327.46 GB/s / 840.18 GB/s. The recipe ships 580.105.08 unchanged; operators who want to match NVIDIA's guidance can bump gpu-operator.driver.version in their overlay.

verifyTransportFromLogs confirmed "NCCL INFO Using network Libfabric" on the NET run and "NVLS comm 0x..." on the NVLS run. Both runs observed clean $variant-appropriate paths with no Socket fallback.

The NET variant also disables NCCL_MNNVL_ENABLE: on GB200 NCCL probes the multi-node NVLink path independently of NVLS, and the NET pods have no IMEX ResourceClaim attached so the MNNVL init aborts (Cuda failure 800 operation not permitted) before any EFA traffic. Disabling both NVLink paths keeps NCCL on the intended NET transport end-to-end.

Follow-ups (not in this PR)

  • Adoption in shipped overlays. Extend an existing gb200-eks-* overlay to enable nccl-all-reduce-bw-net and nccl-all-reduce-bw-nvls by default; they are opt-in per recipe today. The legacy nccl-all-reduce-bw entry stays as the universal default for non-GB200 accelerators.
  • Threshold calibration. Current thresholds (>= 40 GB/s NET, >= 500 GB/s NVLS) are deliberately conservative — sized for a 2-node GB200 pair rather than a full NVL72 UltraServer. Once we have data from production-scale deployments we can raise the floors so the checks catch soft regressions instead of only hard failures.
  • Driver version. Recipe ships 580.105.08 and validation passes there; NVIDIA's recommended floor for GB200+EFA is 580.126.20. A follow-up could either (a) bump recipes/components/gpu-operator/values.yaml globally, or (b) introduce a GB200-specific overlay that bumps only for this accelerator, leaving H100/B200 recipes unchanged. Option (b) has a narrower blast radius.
  • Driver kernel module flag. NVreg_GrdmaPciTopoCheckOverride=1 is a cluster prerequisite for the NET variant on GB200/EKS, enforced today by the preflight with an actionable error message if missing. A follow-up could make it self-fulfilling by shipping a ConfigMap in the GB200/EKS overlay and wiring gpu-operator.ClusterPolicy.spec.driver.kernelModuleConfig to reference it — same scoping tradeoff as the driver-version bullet (GB200-only overlay vs. global values.yaml).
  • PSA hardening for hostPath-ing validators. The NET preflight is the first validator in this repo to hostPath-mount a node path (/proc/driver/nvidia). ensureNamespace currently creates aicr-validation with no pod-security.kubernetes.io/enforce label, so any future hostPath-ing validator would hit the same admission question. Follow-up to (a) label the namespace privileged in ensureNamespace, or (b) add a minimal baseline-compatible SecurityContext where possible and document the PSA fallback where not. Doing this cross-cutting rather than bolting onto the NVreg preflight.
  • Catalog-level NCCL image pinning. verifyTransportFromLogs regexes key off NCCL 2.27+ banner strings (Using network <plugin>, NVLS comm 0x<addr>). Today the NCCL version is pinned by construction because the AWS prebuilt image tag is hardcoded in the runtime YAMLs (cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9), but that pin lives in two files. Follow-up to move the image to a catalog entry field so a single bump rolls through all variants with a visible gate — avoids a future silent NCCL 2.28+ rename false-failing a healthy cluster.
  • User-facing docs. docs/user/validation.md currently documents only nccl-all-reduce-bw. Variants are opt-in per recipe so this is lower priority — the natural moment to update that page is when a shipped GB200/EKS overlay adopts -net / -nvls as defaults (see the first follow-up).

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

  • Backwards compatible: Existing nccl-all-reduce-bw catalog entry is preserved with identical behavior; existing recipes keep working. The two new entries are opt-in per recipe.
  • Timeout propagation behaviour change: Validator images built before this change continue to use defaults.CheckExecutionTimeout as their parent context deadline (the new env var is unset → fallback). Validator images built after this change honor the catalog entry's timeout. Mixed-version rolling deployments are safe.
  • Cluster prerequisite for the NET variant on GB200/EKS: NVreg_GrdmaPciTopoCheckOverride=1 must be set on the NVIDIA kernel driver (via the GPU Operator ClusterPolicy.spec.driver.kernelModuleConfig). The preflight fails loudly with an actionable fix hint if this is missing, rather than silently passing at Socket-fallback bandwidth.
  • Cluster prerequisite for the NVLS variant on GB200/EKS: nvidia-dra-driver-gpu ≥ v25.12.0 must be installed with the resource.nvidia.com/v1beta1 ComputeDomain CRD present. The validator creates a ComputeDomain with spec.numNodes: 0, which relies on the IMEXDaemonsWithDNSNames=true default introduced in v25.12.0; older drivers would wait for a quorum that never forms and fail at DiagnosticTimeout (2m). The CR creation itself fails up front with a clear NotFound error if the CRD is missing.
  • Pod Security Admission: The NET preflight probe pod hostPath-mounts /proc/driver/nvidia read-only. Clusters that enforce baseline or restricted PSA as the default on the aicr-validation namespace will reject the pod at admission. On those clusters, either label the namespace pod-security.kubernetes.io/enforce: privileged or set the cluster-wide PSA default to privileged for this namespace. (A follow-up will generalize this to cover any future hostPath-ing validator — see Follow-ups.)

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

…ts for GB200/EKS

Adds two transport-class-specific NCCL all-reduce bandwidth checks for
p6e-gb200.36xlarge on EKS:

* nccl-all-reduce-bw-net — forces NCCL onto the NET transport (AWS EFA via
  the aws-ofi-nccl plugin) so operators can measure the EFA fabric
  independently of the NVL72 NVLink domain.

* nccl-all-reduce-bw-nvls — exercises MNNVL / NVLink SHARP across the NVL72
  IMEX domain. The validator auto-provisions a resource.nvidia.com/v1beta1
  ComputeDomain CR before the TrainJob; the NVIDIA DRA driver reconciles
  that into a ResourceClaimTemplate that worker pods reference to get
  /dev/nvidia-caps-imex-channels mounted. No hand-wired IMEX setup required.

Both variants share the same AWS-prebuilt nccl-tests image main already
uses for H100/EKS (public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-
efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9, multi-arch). The image
is hardcoded in the runtime YAMLs; no vendored image or registry
templating is needed.

verifyTransportFromLogs parses NCCL_DEBUG=INFO output to assert the
selected transport actually carried traffic, keying on NCCL 2.27's
authoritative signals:

  NET  — "NCCL INFO Using network <plugin>" with non-Socket plugin
  NVLS — "NVLS comm 0x<addr>" communicator-init line

This turns the variant label into a hard guarantee: a GB200/EKS cluster
with a broken IMEX domain or unset NVreg_GrdmaPciTopoCheckOverride will
fail loudly on the appropriate variant instead of silently falling back
to Socket.

supportedNCCLCombinations is restructured to map variant → service →
accelerators. The default variant preserves existing H100/EKS, H100/GKE,
B200/Any, GB200/Any coverage; NET and NVLS each opt in to GB200/EKS only.

templatePath gains a variant parameter so runtime YAMLs load from
testdata/{accelerator}/{service}/runtime[-{variant}].yaml — bare
runtime.yaml when variant is the default. Unit tests updated to cover
the new path shapes and the transport-verification regexes.
Catalog entries have a `timeout` field that the Job deployer (pkg/validator/
job/deployer.go) translates into the Job's ActiveDeadlineSeconds. But that
value never reached the inner validator container — validators/context.go
LoadContext() hardcoded defaults.CheckExecutionTimeout as the parent
context deadline regardless. Result: a catalog entry saying `timeout: 45m`
still cut off the check at the package default because the inner context
cancelled first.

This particularly impacts long-running benchmark-style checks whose real
runtime can exceed the package default depending on cluster hardware and
fabric characteristics — the validator pod completes the test setup and
produces real output, but ctx.Done() fires mid-benchmark and
waitForLauncherPodAndGetLogs() returns a generic "pod did not succeed"
with a TIMEOUT error chain that obscures the fact that a longer timeout
would have let the check finish successfully.

Fix, both sides of the pod boundary:
- pkg/validator/job/deployer.go buildEnvApply() now injects
  AICR_CHECK_TIMEOUT=<catalog timeout as Go duration string> into the
  validator container (same fallback to ValidatorDefaultTimeout when
  catalog entry has Timeout=0 as buildApplyConfig already uses).
- validators/context.go LoadContext() now calls checkTimeoutFromEnv()
  which parses AICR_CHECK_TIMEOUT and falls back to
  defaults.CheckExecutionTimeout if unset or malformed. Behavior is
  identical for any validator image that hasn't been rebuilt with this
  change (env var absent → existing default).

Unlocks catalog-level per-check timeout configuration across the whole
validator framework. Any check that legitimately needs more than the
package default can now just set its catalog timeout to the real value.
…de for GB200 NET

Without NVreg_GrdmaPciTopoCheckOverride=1 set on the NVIDIA kernel driver,
a p6e-gb200.36xlarge EKS node silently fails EFA GPUDirect RDMA — the
NVIDIA driver rejects EFA's dma-buf attach request with:

  NVRM: dma-buf attach failed: topology not supported for mapping type FORCE_PCIE

and NCCL falls back to the Socket transport. The NET variant's bandwidth
threshold is set to a permissive value so it passes whatever Socket-over-
TCP bandwidth the cluster delivers, easy to mistake for a real EFA number
when reading only the final row.

Add a preflight that runs before the benchmark whenever the NET variant
targets GB200/EKS: spin up one short-lived Pod per target GPU node, pinned
via NodeName, hostPath-mounting /proc/driver/nvidia read-only, and grep
for the "GrdmaPciTopoCheckOverride: 1" line the NVIDIA kernel module
writes when the parameter is set on driver load. Pod exits 0 if found,
non-zero otherwise; the validator consolidates missing nodes into a
single error with an actionable GPU Operator ClusterPolicy fix hint.

Scope: only NET + GB200 + EKS triggers the check. NVLS traffic stays on
the Grace C2C NVLink fabric (no PCIe dma-buf involvement) so the flag
doesn't matter there. H100, B200, and GKE variants are unaffected.

The runtime-parameter parsing is factored into parseNVregFromParams so
the happy path, driver-default, explicit-zero, commented-out, and
suffix-collision cases are directly unit-tested; the pod-based check uses
grep with the same anchored pattern.
Three validators — pod_autoscaling_check, secure_access_check, and the new
nccl NVreg preflight — each spawn a short-lived busybox pod on target nodes
to check host state. All three hardcoded the same literal "busybox:1.37"
string. Centralize in pkg/defaults as defaults.ProbeImage so bumping the
pinned tag is a one-file change, and every future probe-style validator
has an obvious shared home to pull the reference from.

pkg/defaults is the established home for compile-time tunables (timeouts,
limits, HTTP client settings); adding an images.go file with a ProbeImage
const extends the same pattern. No override machinery — callers that ever
need runtime override (airgapped registries, fork-specific mirrors) should
read an AICR_PROBE_IMAGE env var at the call site and fall back to the
constant, matching the existing AICR_CHECK_TIMEOUT / AICR_VALIDATOR_IMAGE_
REGISTRY patterns rather than baking override logic into pkg/defaults.

Tag choice (busybox:1.37) is unchanged — this is pure de-duplication, not
a version bump.
…X probe

On GB200 NCCL probes MNNVL (multi-node NVLink via IMEX) independently of
NVLS. The NET-variant pods have no IMEX ResourceClaim attached, so the
probe aborts with "Cuda failure 800 operation not permitted" before any
EFA traffic and the launcher exits with unhandled system error.

Setting NCCL_MNNVL_ENABLE=0 alongside the existing NCCL_NVLS_ENABLE=0 keeps
NCCL on the NET transport as intended. Verified on two p6e-gb200.36xlarge
nodes: peak 317.87 GB/s (8 GiB messages), transport banner confirms
[send] via NET/Libfabric/GDRDMA(PCI).
… for IMEX + NVreg waits

Three small follow-ups on top of the NCCL GB200 variant work:

* pkg/defaults: introduce EnvCheckTimeout constant for AICR_CHECK_TIMEOUT
  so the deployer-side producer and validator-side consumer reference the
  same symbol instead of two string literals that can drift.

* pkg/validator/job/deployer: extract resolvedTimeout() to de-duplicate the
  catalog.Timeout == 0 fallback between buildApplyConfig and buildEnvApply.

* validators/performance: replace polling waits with the watch API per
  CLAUDE.md "Kubernetes Patterns" — waitForIMEXClaimTemplate keeps a fast
  Get path for warm clusters and falls through to a name-scoped watch;
  preflightGB200NetNVregFlag runs the per-node probe pods in parallel
  via errgroup and watches pod phase transitions instead of polling.

No behavior change beyond lower API-server load and faster ready-signal
pickup on warm clusters. Scoped lint + tests pass on affected packages.
…l inline

The constant was introduced in the preceding commit to centralize the
orchestrator→container env-var contract. The callsites (one producer in
pkg/validator/job/deployer.go, one consumer in validators/context.go)
already document the string in comments, and AICR_CHECK_TIMEOUT is shorter
and more greppable than defaults.EnvCheckTimeout. Not enough repetition to
justify a new exported symbol.

The other parts of the preceding refactor (resolvedTimeout() helper, watch
API for IMEX/NVreg waits, parallel preflight errgroup) are unaffected.
@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@njhensley njhensley marked this pull request as draft April 22, 2026 17:53
njhensley and others added 2 commits April 22, 2026 11:05
…e race window

Both waitForIMEXClaimTemplate and waitForPreflightPodPhase use the same
pattern: an initial Get, then a Watch on the expected name/field. The
Watch does not replay prior events — so if the state transition we're
waiting for lands between the initial Get and the Watch call, the loop
blocks until the context deadline even though the resource is already
in the target state.

Add a defensive Get right after Watch is established:

* waitForIMEXClaimTemplate: if the DRA driver reconciled the RCT into
  existence in that window, return immediately instead of timing out.

* waitForPreflightPodPhase: if the probe pod transitioned to Succeeded
  or Failed in that window, return the phase immediately.

Both re-checks are idempotent and cheap (one GET each). Found by
coderabbit review of the watch-API refactor.
@njhensley njhensley marked this pull request as ready for review April 22, 2026 19:04
@mchmarny mchmarny added this to the v0.12 milestone Apr 23, 2026
@github-actions

Copy link
Copy Markdown
Contributor

@njhensley this PR now has merge conflicts with main. Please rebase to resolve them.

Signed-off-by: njhensley <229213852+njhensley@users.noreply.github.com>
coderabbitai[bot]

This comment was marked as resolved.

Two linked fixups to the orchestrator Job env-var pipeline:

1. orchestratorEnvCount := 8 hint under-allocates after this PR's
   AICR_CHECK_TIMEOUT addition. The always-injected set is now 7
   (SNAPSHOT_PATH, RECIPE_PATH, VALIDATOR_NAME, VALIDATOR_PHASE,
   RUN_ID, NAMESPACE, CHECK_TIMEOUT) plus up to 5 conditionally-
   injected (NODE_SELECTOR, TOLERATIONS, CLI_VERSION, CLI_COMMIT,
   VALIDATOR_IMAGE_REGISTRY) = 12 realistic max. Promote to a
   package-level const with a comment enumerating the contributors,
   so the next contributor has a single place to bump when the list
   changes. No behavior change — append() resizes either way.

2. TestDeployJobEnvVars asserted every injected var except the new
   AICR_CHECK_TIMEOUT. A regression dropping or malforming the env
   wouldn't be caught. Add a positive assertion: env must be present
   and its value must equal testEntry().Timeout.String() (matches
   what buildEnvApply writes via time.Duration.String()).
Three doc gaps called out in Yuan's review:

- docs/contributor/validator.md env-var table omitted AICR_CHECK_TIMEOUT,
  which the Job deployer now injects unconditionally. Add a row that
  explains what sets it (catalog entry timeout), how it's consumed
  (validators.checkTimeoutFromEnv → ctx.Ctx deadline), the fallback
  path, and the WARN-on-malformed behavior added alongside.

- Same file's LoadContext paragraph still said the timeout comes from
  defaults.CheckExecutionTimeout. Rewrite to point at
  checkTimeoutFromEnv and reflect the env-first precedence.

- recipes/validators/README.md Performance Phase table listed only the
  pre-PR nccl-all-reduce-bw entry. Add rows for the two new NET and
  NVLS variants. Descriptions copied verbatim from catalog.yaml so a
  grep for the name surfaces the same one-liner in both places.
Both runtime-net.yaml and runtime-nvls.yaml had comments next to
NCCL_DEBUG=INFO that named "[send] via NET/..." / "[send] via NVLS"
per-channel log lines — banners that the Go docstring at
validators/performance/nccl_all_reduce_bw_constraint.go:117-124
explicitly documents as dropped in NCCL 2.27+. The actual regexes
verifyTransportFromLogs matches are the NCCL-2.27-era banners
"NCCL INFO Using network <plugin>" (NET) and "NVLS comm 0x<addr>"
(NVLS).

Rewrite both comments to describe what NCCL actually emits in the
image this runtime pins (ncclv2.27.7) and what verifyTransportFromLogs
actually greps. Pure documentation change in testdata; no YAML body
change, so no runtime impact.
…/defaults

Project convention per CLAUDE.md "Constants Rules" is that timeouts
live in pkg/defaults. The preflight-path cleanup bound was the only
timeout on this branch that still lived as a local const in the
caller, flagged as a nit in Mark's review.

Promote to defaults.PreflightCleanupTimeout (same value, 30s, same
semantics) and update the single call site. Generic name leaves room
for future validator preflights to reuse it — e.g., the NVreg path
is the only preflight today, but the conformance-phase probe-pod
patterns would reach for the same bound.
@njhensley

Copy link
Copy Markdown
Member Author

@yuanchen8911 thanks — all findings and nits addressed; open questions triaged inline below. Commits referenced are on feat/nccl-gb200-eks-variants.

Findings (all addressed)

  1. AICR_CHECK_TIMEOUT missing from docs/contributor/validator.md env table → added in 10c1c086, describes injection source (Job deployer from catalog timeout), fallback, and the WARN-on-malformed behavior.
  2. LoadContext() prose still said timeout came from defaults.CheckExecutionTimeout → rewritten in 10c1c086 to point at checkTimeoutFromEnv and the env-first precedence.
  3. recipes/validators/README.md performance table missing variants → two new rows in 10c1c086 for nccl-all-reduce-bw-net and nccl-all-reduce-bw-nvls.
  4. TestDeployJobEnvVars missed AICR_CHECK_TIMEOUT + no unit test for checkTimeoutFromEnv → positive assertion added in 34c4a999 (env present, value equals testEntry().Timeout.String()); 6-case table-driven TestCheckTimeoutFromEnv added in c102bca6 covering unset / empty / valid / malformed / negative / zero, asserting the duration and the WARN log for the malformed/non-positive paths.

Nits (both addressed)

  • Stale NCCL banner references in runtime-net.yaml / runtime-nvls.yaml → rewritten in 2590f84f to reference Using network <plugin> (NET) and NVLS comm 0x<addr> (NVLS), matching the regexes verifyTransportFromLogs actually greps for.
  • orchestratorEnvCount := 8 under-allocates34c4a999 promotes it to a package-level orchestratorEnvMax = 12 const with a comment enumerating the 7 always-injected + up to 5 conditionally-injected contributors, so the next contributor has a single place to bump.

Open questions

  • PSA on aicr-validation namespace — real concern and out of scope for this PR. The probe-pod uses hostPath on /proc/driver/nvidia read-only, so it will trip baseline/restricted admission on clusters that enforce it by default. I'd rather fix this as a cross-cutting change covering ensureNamespace + any future hostPath-ing validator, not only the NVreg preflight — will file a follow-up. For now adding a line to the rollout notes calling out the PSA prerequisite alongside NVreg_GrdmaPciTopoCheckOverride=1.
  • NCCL 2.27+ banner stability — agreed; catalog-level image pinning is the right shape. Also out of scope — the AWS prebuilt image tag (cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9) is hardcoded in the runtime templates today, which locks NCCL to 2.27 by construction, but that pin is in two files instead of one catalog field. Follow-up to move it to a catalog attribute so a single bump rolls through all variants with a visible gate.
  • nvidia-dra-driver-gpu >= v25.12.0 for ComputeDomain numNodes: 0 — added to the PR description's rollout notes under the NVLS prerequisite (IMEXDaemonsWithDNSNames default is what makes numNodes: 0 work).
  • docs/user/validation.md mention of variants — noted. Left out of 10c1c086 because the page is end-user oriented and variants are opt-in per recipe; adopting them in shipped GB200/EKS overlays is the natural moment to document on that page. Tracked in the follow-up bullet that was already in the PR description.

Both new follow-ups (PSA hardening + catalog-level NCCL image pin) appended to the "Follow-ups" section of the PR description.

@yuanchen8911

yuanchen8911 commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Verified against 94f0a83.

1. [Major] TrainJob sizing/scoping is still asymmetric on mixed EKS clusters

The PR now filters the NET preflight to the nodes it wants to probe at nccl_all_reduce_bw_constraint.go#L221-L240, but the actual TrainJob is still sized from the unfiltered GPU fleet. determineGPUConfig() and template substitution still derive WORKER_COUNT, GPU_COUNT, and per-node shape from all schedulable GPU nodes at nccl_all_reduce_bw_constraint.go#L379-L401, while EKS worker placement later narrows to a single instance type at nccl_all_reduce_bw_constraint.go#L446-L451 and platformWorkerScheduling EKS path.

On a cluster with 2 GB200 + 3 H100 nodes, the TrainJob can request 5 workers while only the 2 GB200 nodes match the selector, leaving 3 workers Pending and mis-stating mpirun -np.

The fix I want is not another preflight-only filter. I want one shared target-node resolution step applied before sizing:

  • start from all schedulable GPU nodes
  • apply ctx.NodeSelector if present; otherwise, on EKS, apply the same default instance-type selector the worker pods use
  • rebuild gpuConfiguration from that filtered node set
  • use that filtered config for the >=2 nodes check, NVreg preflight, EFA discovery, WORKER_COUNT, GPU_COUNT, and runNCCLTrainJob()

If the effective selector matches zero GPU nodes, fail clearly. If it matches only one node, skip as a non-multi-node topology.

Non-blocking follow-ups

I am not blocking on the defaults.ProbeImage / AICR_VALIDATOR_IMAGE_REGISTRY gap from my earlier summary. That limitation is real, but it already exists in other helper-pod sites in this same SHA, so I would treat it as follow-up debt rather than a regression introduced here.

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified against 94f0a83 and against my earlier summary in #640 (comment).

One blocking issue remains:

[Major] TrainJob sizing/scoping is still asymmetric on mixed EKS clusters

The new NET preflight now filters the NVreg probe to target nodes at nccl_all_reduce_bw_constraint.go#L221-L240, but runNCCLTrainJob() still receives the full gpuConfig, and determineGPUConfig() plus template substitution still derive WORKER_COUNT, GPU_COUNT, and per-node shape from all schedulable GPU nodes at nccl_all_reduce_bw_constraint.go#L379-L401. Later, EKS worker placement narrows to a single instance type at nccl_all_reduce_bw_constraint.go#L446-L451 and platformWorkerScheduling EKS path.

On a mixed cluster such as 2 GB200 + 3 H100 nodes, the TrainJob can request 5 workers while only 2 nodes match the selector, leaving 3 workers Pending and mis-stating mpirun -np. The preflight change made the asymmetry explicit; the benchmark path needs to use the same filtered node subset, or these GB200 EKS variants need a heterogeneous-cluster fail-fast.

I am not blocking on the defaults.ProbeImage / AICR_VALIDATOR_IMAGE_REGISTRY gap from my earlier summary. That limitation is real, but it already exists in other helper-pod sites in this same SHA (validators/conformance/pod_autoscaling_check.go, validators/conformance/secure_access_check.go), so I would treat it as follow-up debt rather than a regression introduced here.

…node set

determineGPUConfig sized WORKER_COUNT / GPU_COUNT / TotalGPUCount from
every schedulable GPU node, while applyNCCLResources later narrowed the
worker podSpec to a single instance type on EKS (via
platformWorkerScheduling). On a mixed cluster — 2× p6e-gb200 + 3× p5-h100
under one EKS control plane — the TrainJob would request 5 workers while
only the 2 GB200 nodes match the selector, leaving 3 Pending and
mis-stating `mpirun -np`.

Per review feedback, fix with one shared target-node resolution step
applied before sizing:

 - Extract resolveTargetGPUNodes that returns both the filtered node set
   and the effective selector used to derive it.
 - Precedence: ctx.NodeSelector override wins; else on EKS use the first
   node's node.kubernetes.io/instance-type label (same key
   platformWorkerScheduling stamps into the worker podSpec); else no
   filter. Zero-match returns ErrCodeInternal naming the selector and
   the total GPU-node count so the misconfiguration is diagnosable.
 - Call it inside determineGPUConfig so WorkerCount, GPUCountPerNode,
   TotalGPUCount, and gpuConfig.Nodes are all derived from the filtered
   set. The existing WorkerCount < 2 gate continues to handle the
   single-match-skip case.
 - Delete the preflight-only filterPreflightNodes helper; the preflight
   is now called with gpuConfig.Nodes directly, since that slice is
   already scoped to the target set.

Heterogeneous-cluster contract: the automatic path handles pools that
are homogeneous within the matched subset. On a truly mixed cluster the
operator is expected to pass --node-selector to name the pool they want,
per the convention already used by inference_perf_constraint.go. A
cross-validator improvement to auto-select by Criteria.Accelerator +
nvidia.com/gpu.product label (the existing TODO in nccl_eks_utils.go)
would close the "first node's instance type" arbitrariness on mixed
clusters — out of scope here because it needs to land in inference-perf
the same way.

TestResolveTargetGPUNodes moved to nccl_test.go with 7 cases: EKS mixed
narrow, EKS homogeneous no-op, user override, non-EKS no-op, unlabeled
first node, zero-match error, empty input. Asserts both the returned
node set and the effective selector.
…a gpu.product

resolveTargetGPUNodes previously picked the first GPU node's instance-type
as the EKS narrow selector. On a heterogeneous cluster (e.g. 2× GB200 +
3× H100 under one EKS control plane) this was non-deterministic w.r.t.
the recipe's accelerator intent: if H100 nodes sorted first, a recipe
with accelerator: gb200 would validate H100 hardware. NET would then
fail the NVreg preflight with a misleading "set the kernel flag" advice,
and NVLS would fail on the transport-verification step after spending
TrainJob wall time — both wrong-cause errors for the same root problem.

Add a gpu.product filter as the second-precedence rule in the resolver.
NVIDIA GPU Feature Discovery (part of the GPU Operator AICR deploys)
labels every GPU node with nvidia.com/gpu.product carrying the concrete
product string (NVIDIA-GB200, NVIDIA-H100-80GB-HBM3, NVIDIA-B200, ...).
Define acceleratorProductMatchers mapping each Criteria.Accelerator enum
to a predicate — exact match for single-SKU accelerators (GB200, B200,
RTX Pro 6000), prefix match for families with multiple SKUs (H100 SXM /
PCIe / NVL, A100 SXM / PCIe variants), custom for L40 / L40S.

New resolver precedence:
 1. --node-selector override (unchanged).
 2. NEW: gpu.product ↔ recipe accelerator filter, when a matcher exists
    and at least one node carries the label. Zero-match returns a
    diagnostic error naming the expected accelerator and the products
    observed — no more misattributed preflight failures.
 3. On EKS, further narrow to first-filtered-node instance-type so the
    pool matches the worker podSpec's platformWorkerScheduling selector.
    Also the sole narrow on non-GFD clusters (graceful degrade).
 4. Otherwise return the accelerator-filtered set as-is.

CriteriaAcceleratorAny has no matcher, so variantDefault + any-accelerator
recipes skip the filter (unchanged behavior). Clusters without GFD labels
also skip the filter cleanly — the existing first-node-instance-type
heuristic remains the fallback.

Tests: TestResolveTargetGPUNodes grows to 12 cases covering accelerator
filtering on mixed clusters, H100 SXM+PCIe sub-narrowing, mismatch error
with products-seen diagnostic, non-GFD degrade, override-wins, empty
input, and non-EKS flow. New TestAcceleratorProductMatchers pins each
accelerator's matcher behavior (including exact-vs-prefix boundary cases
like NVIDIA-L40 vs NVIDIA-L4 and NVIDIA-GB200 vs NVIDIA-GB200-96GB).

Inference-perf keeps its --node-selector-or-first-most-free heuristic
for now; adopting gpu.product there is a separate follow-up (tracked in
the PR description).
Post-review simplification pass on the accelerator-aware node resolver.
No behavior change — all 12 resolver cases and 17 matcher cases still
pass; log output and error messages are unchanged modulo phrasing.

 - Drop filterByLabels duplicate; call the existing nodesMatchingSelector
   from inference_perf_constraint.go (same package, drop-in).
 - Drop the returned selector map from resolveTargetGPUNodes (caller only
   used it for one log line). Each narrowing step now logs inline, which
   also kills the synthetic `<gb200>`-bracketed selector the error path
   used to emit as a stand-in for a real label value.
 - Split the decision tree into narrowByAccelerator + narrowByInstanceType
   helpers so the top-level flow is a four-line cascade instead of three
   levels of nested conditionals. Zero-match-after-attempted-filter is
   now a local concern of narrowByAccelerator.
 - Defer uniqueGPUProducts sort+set construction to the zero-match
   diagnostic path; happy paths no longer allocate it.
 - Hoist `node.kubernetes.io/instance-type` to a package constant
   (instanceTypeLabel), matching gpuProductLabel.
 - Trim narrating comments at call sites ("nodes are already narrowed",
   "gpuConfig.Nodes is already the target subset") — they restated the
   code.

Tests: TestResolveTargetGPUNodes now asserts error messages by substring
instead of a faked selector map, which was brittle and only testable
because of the synthetic. Cases and coverage unchanged.
@njhensley

Copy link
Copy Markdown
Member Author

@yuanchen8911 addressed across three commits:

Scope fix4a0a8c19 fix(validators/performance): size NCCL TrainJob from filtered target node set

Extracted one shared target-node resolution step applied before sizing. determineGPUConfig now takes service, calls resolveTargetGPUNodes, and gpuConfig.Nodes (plus WorkerCount / GPUCountPerNode / TotalGPUCount) is the filtered set. The >= 2 node gate, NVreg preflight, EFA discovery, and runNCCLTrainJob all consume that same slice. filterPreflightNodes deleted — the preflight is now called with gpuConfig.Nodes directly. Zero-match on an explicit selector returns ErrCodeInternal naming the selector and the total count.

On your mixed 2 GB200 + 3 H100 example, gpuConfig.WorkerCount is now 2 / TotalGPUCount 8 / mpirun -np agrees, and the worker podSpec's platformWorkerScheduling selector hits exactly those two nodes.

Accelerator-aware auto-pathf1820c2e feat(validators/performance): accelerator-aware GPU node selection via gpu.product

Went beyond the reviewer ask to fix a related hole the first-node-instance-type default has: on a mixed cluster without --node-selector, node list order decided which accelerator got validated. Added a GFD-based filter as the second-precedence rule:

  1. ctx.NodeSelector override wins.
  2. NEW: filter by nvidia.com/gpu.productCriteria.Accelerator when a matcher exists and the cluster carries the label. Zero-match → hard error (no schedulable GPU nodes match recipe accelerator "gb200" (products seen: [NVIDIA-H100-80GB-HBM3])) so operators aren't pointed at misleading secondary errors like the NVreg preflight failing on H100 nodes.
  3. On EKS, further narrow to first-filtered-node instance-type so the pool is homogeneous in what platformWorkerScheduling stamps into the worker podSpec.
  4. Otherwise return the accelerator-filtered set as-is.

acceleratorProductMatchers covers GB200, B200, H100 family, A100 family, L40/L40S, RTX Pro 6000 (exact for single-SKU accelerators, prefix for families). CriteriaAcceleratorAny intentionally has no matcher. Clusters without GFD labels fall through to the first-node-instance-type heuristic unchanged.

Same shape should land in inference_perf_constraint.go — it currently only consults --node-selector and not gpu.product. Tracking as a follow-up in the PR description.

Simplification pass5d35fedd refactor(validators/performance): simplify resolveTargetGPUNodes

Post-review cleanup: filterByLabels was a dup of the existing nodesMatchingSelector in inference-perf (same package) — deleted and reused. Dropped the returned-selector map (was only used for logging; each narrowing step logs inline now) which also killed a synthetic <gb200> bracketed value the error path emitted as a selector stand-in. Decision tree split into narrowByAccelerator + narrowByInstanceType helpers so the top-level cascade is flat instead of three levels deep. Zero behavior change.

Tests: TestResolveTargetGPUNodes covers 12 cases — mixed-accelerator determinism regardless of list order, H100 SXM+PCIe sub-narrowing, accelerator mismatch with products seen: diagnostic, non-GFD fallback, override-wins, override-zero-match, non-EKS flow, empty input. TestAcceleratorProductMatchers pins 17 matcher boundaries (e.g. NVIDIA-GB200 exact vs NVIDIA-GB200-96GB, NVIDIA-H100-* prefix vs NVIDIA-H200-*, NVIDIA-L40/NVIDIA-L40S vs NVIDIA-L4).

@yuanchen8911 yuanchen8911 dismissed their stale review April 23, 2026 23:09

Blocker resolved by 4a0a8c1 + f1820c2 + 5d35fed — resolveTargetGPUNodes now feeds preflight, sizing, EFA discovery, and the TrainJob from one filtered node set.

yuanchen8911
yuanchen8911 previously approved these changes Apr 23, 2026

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Blocker resolved: resolveTargetGPUNodes provides a single target-node pipeline feeding preflight, sizing, EFA discovery, and the TrainJob uniformly. The GFD-based accelerator narrow is a nice improvement beyond the prescription — auto-path stays correct on mixed accelerator pools without requiring --node-selector.

Two Low follow-ups remain and are non-blocking: inference-perf catalog-comment drift, and checkNVregOnNode conflating flag-absent with grep-can't-open-file errors.

@yuanchen8911 yuanchen8911 requested a review from mchmarny April 24, 2026 04:44
@mchmarny mchmarny enabled auto-merge (squash) April 24, 2026 12:26
Signed-off-by: Mark Chmarny <mchmarny@users.noreply.github.com>
@mchmarny mchmarny merged commit 4e158cf into NVIDIA:main Apr 24, 2026
65 checks passed
lockwobr pushed a commit that referenced this pull request Apr 28, 2026
…h check (#640)

Signed-off-by: njhensley <229213852+njhensley@users.noreply.github.com>
Co-authored-by: Mark Chmarny <mchmarny@users.noreply.github.com>
@njhensley njhensley deleted the feat/nccl-gb200-eks-variants branch June 23, 2026 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check

5 participants