feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check by njhensley · Pull Request #640 · NVIDIA/aicr

njhensley · 2026-04-22T17:37:39Z

Summary

Adds nccl-all-reduce-bw-net and nccl-all-reduce-bw-nvls validator variants so recipes for p6e-gb200.36xlarge on EKS can independently verify the EFA fabric and the NVL72 MNNVL fabric — ending the silent skip that hid fabric misconfiguration on GB200/EKS. Also extracts a shared defaults.ProbeImage constant and plumbs per-check timeout from catalog → validator container.

Motivation / Context

The existing nccl-all-reduce-bw validator silently skipped on GB200/EKS because supportedNCCLCombinations listed GB200 only under CriteriaServiceAny, and no testdata/gb200/eks/ template existed. A recipe targeting p6e-gb200.36xlarge logged "Skipping NCCL All Reduce bandwidth validation: unsupported service/accelerator combination" and reported a passing skip — so the recipe shipped without a real fabric-bandwidth check.

GB200/EKS nodes expose two inter-node GPU fabrics at once: MNNVL (NVLink-C2C across the NVL72 rack, multiple hundreds of GB/s) and EFA (the AWS network fabric). Operators need per-fabric assertions — a single auto-detect test would always select MNNVL and silently give no signal on EFA health (or vice versa if MNNVL is misconfigured). The two variants let a recipe opt in to either or both, with per-variant thresholds.

Fixes: #531
Related: N/A

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: validators/performance, validators/conformance

Implementation Notes

One catalog entry per variant, with NCCL's own transport vocabulary. No schema change to recipe.Constraint; variants are surfaced as distinct catalog names (nccl-all-reduce-bw-net, nccl-all-reduce-bw-nvls) that a recipe can list independently with their own thresholds. The existing nccl-all-reduce-bw entry stays as the default (behavior preserved). Variant names follow NCCL's own vocabulary — NET (the generic network-transport class) and NVLS (NVLink SHARP / MNNVL). This keeps the catalog stable as future NET implementations land: adding testdata/{accelerator}/{service}/runtime-net.yaml extends coverage with no new catalog entry.

Per-variant runtime.yaml files. templatePath gains a variant parameter so runtime YAMLs resolve to testdata/{accelerator}/{service}/runtime[-{variant}].yaml. Bare runtime.yaml is unchanged for the default variant. Both new templates hardcode the same AWS-prebuilt nccl-tests image main already uses for H100/EKS (public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9, multi-arch).

verifyTransportFromLogs asserts the expected fabric carried traffic. Parses NCCL_DEBUG=INFO output and fails the check if the observed transport doesn't match the selected variant. Keys on NCCL 2.27's authoritative signals (the version shipped in the AWS image):

Variant	Signal
NET	`NCCL INFO Using network <plugin>` with non-`Socket` plugin
NVLS	`NVLS comm 0x<addr>` — communicator-init line, emitted only when NCCL actually constructs an NVLS communicator (not merely detects hardware support)

This turns the variant label into a hard guarantee: a cluster with a broken IMEX domain fails loudly on the NVLS variant instead of silently falling back to EFA, and a cluster missing the NVreg EFA flag fails loudly on the NET variant instead of passing at Socket-fallback bandwidth.

NVLS variant self-provisions the IMEX domain. Before the TrainJob, the validator creates a resource.nvidia.com/v1beta1 ComputeDomain CR; the NVIDIA DRA driver reconciles it into a ResourceClaimTemplate named nccl-all-reduce-imex that worker pods reference to get /dev/nvidia-caps-imex-channels mounted. No hand-wired IMEX setup required on the cluster beyond having nvidia-dra-driver-gpu installed; cleanup deletes the CR alongside the TrainJob.

New NVreg_GrdmaPciTopoCheckOverride=1 preflight for the NET+GB200+EKS tuple. Without this NVIDIA kernel parameter, the driver rejects EFA's dma-buf attach to GPU HBM on the Grace CPU topology (NVRM: dma-buf attach failed: topology not supported for mapping type FORCE_PCIE) and NCCL silently falls back to Socket — below the NET bandwidth threshold but well above what looks like noise. The preflight spawns one short-lived probe pod per target GPU node with /proc/driver/nvidia hostPath-mounted, greps for the active parameter line, and fails fast with an actionable GPU-Operator-ClusterPolicy fix hint if any node is missing it.

Probe-pod approach over exec-into-GPU-Operator-DaemonSet — considered and prototyped an exec fast path (reuse the existing driver DaemonSet pod, avoid spawning our own), but the client-go/tools/remotecommand dependency chain vendored in three new top-level modules (gorilla/websocket, moby/spdystream, mxk/go-flowrate) and ~22 k lines of transport-infrastructure code. The seconds of per-node latency saved didn't justify that footprint. The probe-pod path is also install-agnostic (works whether the driver came from GPU Operator, Helm, DKMS-on-AMI, or a bare-metal install).

defaults.ProbeImage extraction. Three probe-style validators (pod_autoscaling_check, secure_access_check, the new NVreg preflight) each hardcoded "busybox:1.37". Centralized in pkg/defaults/images.go — consistent with how other shared tunables (timeouts, HTTP settings) live there. No override machinery; if a caller needs airgapped-registry override, the existing AICR_CHECK_TIMEOUT / AICR_VALIDATOR_IMAGE_REGISTRY pattern (read env var at the call site, fall back to the constant) extends naturally.

Catalog-entry timeout now reaches the validator container. pkg/validator/job/deployer.go already translated the catalog timeout into the Job's ActiveDeadlineSeconds, but the value never reached the inner validator process — validators/context.go:LoadContext hardcoded defaults.CheckExecutionTimeout as the parent context deadline regardless. A catalog entry saying timeout: 30m still cut off the check at the package default because the inner context cancelled first. Propagated via a new AICR_CHECK_TIMEOUT env var (Go duration string, parsed on the validator side, fallback preserved when unset). Unlocks catalog-level per-check timeout configuration across the whole validator framework, not just NCCL.

Design decisions worth reviewing

Why not variantEFA instead of variantNET? NET is NCCL's own term for the generic network-transport class. A future TCPXO-on-GKE or IB-on-prem implementation would also be a NET variant under the same catalog entry — different testdata/{accelerator}/{service}/runtime-net.yaml, no new constraint name. Naming the variant after today's single concrete provider (EFA) would close that door.
Why not refactor the platform dispatch to a strategy/interface? Surveyed the rest of the codebase — no other validator uses CriteriaService* branching at all; they solve platform independence via recipe + label-selector discovery (e.g., validators/conformance/ai_service_metrics_check.go's discoverPrometheusURL). NCCL is unavoidably more platform-specific because the MPI pod spec itself differs per provider. The existing approach (per-tuple testdata + sibling helper files for platform-specific runtime discovery) predated this PR and handles 4 already-supported platforms without the conditionals becoming painful. Deferring a platform-strategy refactor until a second concrete NET implementation (e.g. GKE TCPXO) forces its shape.
AWS image, not a vendored one. An early iteration built our own nccl-tests image from a vendored Dockerfile on the theory that aws-ofi-nccl 1.16.3's CUDA 13 handling was broken. Cluster-side testing confirmed the theory was wrong — the AWS prebuilt image reaches identical bandwidth on the fixed cluster. The vendored Dockerfile path was scrapped in favor of reusing the image main already pins for H100/EKS.

Testing

make qualify

Per-package coverage vs origin/main:

Package	Baseline	HEAD	Delta
`pkg/defaults`	100.0%	100.0%	—
`pkg/validator/job`	10.6%	10.5%	-0.1% (within 0.5% floor)
`validators`	15.2%	19.8%	+4.6%
`validators/conformance`	15.3%	15.3%	—
`validators/performance`	20.2%	20.3%	+0.1%

New exported functions are covered by new unit tests (TestParseNVregFromParams, TestGB200NetPreflightApplies, TestFilterPreflightNodes, expanded TestTemplatePath, TestVerifyTransportFromLogs, TestBuildComputeDomain, TestNVLSRuntimeYAMLReferencesIMEXClaim, TestCheckTimeoutFromEnv). TestDeployJobEnvVars additionally asserts the new AICR_CHECK_TIMEOUT env var is injected with the value from the catalog entry's timeout.

Hardware validation on 2 × p6e-gb200.36xlarge EKS cluster (EKS 1.34, Ubuntu 24.04, NVIDIA driver 580.105.08 — the recipe's shipped default, EFA installer 1.47.0, nvidia-dra-driver-gpu v25.12.0, NVreg_GrdmaPciTopoCheckOverride=1 set):

nccl-all-reduce-bw-net  : 327.94 GB/s  (threshold  >=  40 GB/s)   PASS ✓
nccl-all-reduce-bw-nvls : 841.29 GB/s  (threshold  >= 500 GB/s)   PASS ✓

Both variants also verified on driver 580.126.20 (NVIDIA's recommended floor for GB200+EFA): 327.46 GB/s / 840.18 GB/s. The recipe ships 580.105.08 unchanged; operators who want to match NVIDIA's guidance can bump gpu-operator.driver.version in their overlay.

verifyTransportFromLogs confirmed "NCCL INFO Using network Libfabric" on the NET run and "NVLS comm 0x..." on the NVLS run. Both runs observed clean $variant-appropriate paths with no Socket fallback.

The NET variant also disables NCCL_MNNVL_ENABLE: on GB200 NCCL probes the multi-node NVLink path independently of NVLS, and the NET pods have no IMEX ResourceClaim attached so the MNNVL init aborts (Cuda failure 800 operation not permitted) before any EFA traffic. Disabling both NVLink paths keeps NCCL on the intended NET transport end-to-end.

Follow-ups (not in this PR)

Adoption in shipped overlays. Extend an existing gb200-eks-* overlay to enable nccl-all-reduce-bw-net and nccl-all-reduce-bw-nvls by default; they are opt-in per recipe today. The legacy nccl-all-reduce-bw entry stays as the universal default for non-GB200 accelerators.
Threshold calibration. Current thresholds (>= 40 GB/s NET, >= 500 GB/s NVLS) are deliberately conservative — sized for a 2-node GB200 pair rather than a full NVL72 UltraServer. Once we have data from production-scale deployments we can raise the floors so the checks catch soft regressions instead of only hard failures.
Driver version. Recipe ships 580.105.08 and validation passes there; NVIDIA's recommended floor for GB200+EFA is 580.126.20. A follow-up could either (a) bump recipes/components/gpu-operator/values.yaml globally, or (b) introduce a GB200-specific overlay that bumps only for this accelerator, leaving H100/B200 recipes unchanged. Option (b) has a narrower blast radius.
Driver kernel module flag. NVreg_GrdmaPciTopoCheckOverride=1 is a cluster prerequisite for the NET variant on GB200/EKS, enforced today by the preflight with an actionable error message if missing. A follow-up could make it self-fulfilling by shipping a ConfigMap in the GB200/EKS overlay and wiring gpu-operator.ClusterPolicy.spec.driver.kernelModuleConfig to reference it — same scoping tradeoff as the driver-version bullet (GB200-only overlay vs. global values.yaml).
PSA hardening for hostPath-ing validators. The NET preflight is the first validator in this repo to hostPath-mount a node path (/proc/driver/nvidia). ensureNamespace currently creates aicr-validation with no pod-security.kubernetes.io/enforce label, so any future hostPath-ing validator would hit the same admission question. Follow-up to (a) label the namespace privileged in ensureNamespace, or (b) add a minimal baseline-compatible SecurityContext where possible and document the PSA fallback where not. Doing this cross-cutting rather than bolting onto the NVreg preflight.
Catalog-level NCCL image pinning. verifyTransportFromLogs regexes key off NCCL 2.27+ banner strings (Using network <plugin>, NVLS comm 0x<addr>). Today the NCCL version is pinned by construction because the AWS prebuilt image tag is hardcoded in the runtime YAMLs (cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9), but that pin lives in two files. Follow-up to move the image to a catalog entry field so a single bump rolls through all variants with a visible gate — avoids a future silent NCCL 2.28+ rename false-failing a healthy cluster.
User-facing docs. docs/user/validation.md currently documents only nccl-all-reduce-bw. Variants are opt-in per recipe so this is lower priority — the natural moment to update that page is when a shipped GB200/EKS overlay adopts -net / -nvls as defaults (see the first follow-up).

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Backwards compatible: Existing nccl-all-reduce-bw catalog entry is preserved with identical behavior; existing recipes keep working. The two new entries are opt-in per recipe.
Timeout propagation behaviour change: Validator images built before this change continue to use defaults.CheckExecutionTimeout as their parent context deadline (the new env var is unset → fallback). Validator images built after this change honor the catalog entry's timeout. Mixed-version rolling deployments are safe.
Cluster prerequisite for the NET variant on GB200/EKS: NVreg_GrdmaPciTopoCheckOverride=1 must be set on the NVIDIA kernel driver (via the GPU Operator ClusterPolicy.spec.driver.kernelModuleConfig). The preflight fails loudly with an actionable fix hint if this is missing, rather than silently passing at Socket-fallback bandwidth.
Cluster prerequisite for the NVLS variant on GB200/EKS: nvidia-dra-driver-gpu ≥ v25.12.0 must be installed with the resource.nvidia.com/v1beta1 ComputeDomain CRD present. The validator creates a ComputeDomain with spec.numNodes: 0, which relies on the IMEXDaemonsWithDNSNames=true default introduced in v25.12.0; older drivers would wait for a quorum that never forms and fail at DiagnosticTimeout (2m). The CR creation itself fails up front with a clear NotFound error if the CRD is missing.
Pod Security Admission: The NET preflight probe pod hostPath-mounts /proc/driver/nvidia read-only. Clusters that enforce baseline or restricted PSA as the default on the aicr-validation namespace will reject the pod at admission. On those clusters, either label the namespace pod-security.kubernetes.io/enforce: privileged or set the cluster-wide PSA default to privileged for this namespace. (A follow-up will generalize this to cover any future hostPath-ing validator — see Follow-ups.)

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

…ts for GB200/EKS Adds two transport-class-specific NCCL all-reduce bandwidth checks for p6e-gb200.36xlarge on EKS: * nccl-all-reduce-bw-net — forces NCCL onto the NET transport (AWS EFA via the aws-ofi-nccl plugin) so operators can measure the EFA fabric independently of the NVL72 NVLink domain. * nccl-all-reduce-bw-nvls — exercises MNNVL / NVLink SHARP across the NVL72 IMEX domain. The validator auto-provisions a resource.nvidia.com/v1beta1 ComputeDomain CR before the TrainJob; the NVIDIA DRA driver reconciles that into a ResourceClaimTemplate that worker pods reference to get /dev/nvidia-caps-imex-channels mounted. No hand-wired IMEX setup required. Both variants share the same AWS-prebuilt nccl-tests image main already uses for H100/EKS (public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1- efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9, multi-arch). The image is hardcoded in the runtime YAMLs; no vendored image or registry templating is needed. verifyTransportFromLogs parses NCCL_DEBUG=INFO output to assert the selected transport actually carried traffic, keying on NCCL 2.27's authoritative signals: NET — "NCCL INFO Using network <plugin>" with non-Socket plugin NVLS — "NVLS comm 0x<addr>" communicator-init line This turns the variant label into a hard guarantee: a GB200/EKS cluster with a broken IMEX domain or unset NVreg_GrdmaPciTopoCheckOverride will fail loudly on the appropriate variant instead of silently falling back to Socket. supportedNCCLCombinations is restructured to map variant → service → accelerators. The default variant preserves existing H100/EKS, H100/GKE, B200/Any, GB200/Any coverage; NET and NVLS each opt in to GB200/EKS only. templatePath gains a variant parameter so runtime YAMLs load from testdata/{accelerator}/{service}/runtime[-{variant}].yaml — bare runtime.yaml when variant is the default. Unit tests updated to cover the new path shapes and the transport-verification regexes.

Catalog entries have a `timeout` field that the Job deployer (pkg/validator/ job/deployer.go) translates into the Job's ActiveDeadlineSeconds. But that value never reached the inner validator container — validators/context.go LoadContext() hardcoded defaults.CheckExecutionTimeout as the parent context deadline regardless. Result: a catalog entry saying `timeout: 45m` still cut off the check at the package default because the inner context cancelled first. This particularly impacts long-running benchmark-style checks whose real runtime can exceed the package default depending on cluster hardware and fabric characteristics — the validator pod completes the test setup and produces real output, but ctx.Done() fires mid-benchmark and waitForLauncherPodAndGetLogs() returns a generic "pod did not succeed" with a TIMEOUT error chain that obscures the fact that a longer timeout would have let the check finish successfully. Fix, both sides of the pod boundary: - pkg/validator/job/deployer.go buildEnvApply() now injects AICR_CHECK_TIMEOUT=<catalog timeout as Go duration string> into the validator container (same fallback to ValidatorDefaultTimeout when catalog entry has Timeout=0 as buildApplyConfig already uses). - validators/context.go LoadContext() now calls checkTimeoutFromEnv() which parses AICR_CHECK_TIMEOUT and falls back to defaults.CheckExecutionTimeout if unset or malformed. Behavior is identical for any validator image that hasn't been rebuilt with this change (env var absent → existing default). Unlocks catalog-level per-check timeout configuration across the whole validator framework. Any check that legitimately needs more than the package default can now just set its catalog timeout to the real value.

…de for GB200 NET Without NVreg_GrdmaPciTopoCheckOverride=1 set on the NVIDIA kernel driver, a p6e-gb200.36xlarge EKS node silently fails EFA GPUDirect RDMA — the NVIDIA driver rejects EFA's dma-buf attach request with: NVRM: dma-buf attach failed: topology not supported for mapping type FORCE_PCIE and NCCL falls back to the Socket transport. The NET variant's bandwidth threshold is set to a permissive value so it passes whatever Socket-over- TCP bandwidth the cluster delivers, easy to mistake for a real EFA number when reading only the final row. Add a preflight that runs before the benchmark whenever the NET variant targets GB200/EKS: spin up one short-lived Pod per target GPU node, pinned via NodeName, hostPath-mounting /proc/driver/nvidia read-only, and grep for the "GrdmaPciTopoCheckOverride: 1" line the NVIDIA kernel module writes when the parameter is set on driver load. Pod exits 0 if found, non-zero otherwise; the validator consolidates missing nodes into a single error with an actionable GPU Operator ClusterPolicy fix hint. Scope: only NET + GB200 + EKS triggers the check. NVLS traffic stays on the Grace C2C NVLink fabric (no PCIe dma-buf involvement) so the flag doesn't matter there. H100, B200, and GKE variants are unaffected. The runtime-parameter parsing is factored into parseNVregFromParams so the happy path, driver-default, explicit-zero, commented-out, and suffix-collision cases are directly unit-tested; the pod-based check uses grep with the same anchored pattern.

Three validators — pod_autoscaling_check, secure_access_check, and the new nccl NVreg preflight — each spawn a short-lived busybox pod on target nodes to check host state. All three hardcoded the same literal "busybox:1.37" string. Centralize in pkg/defaults as defaults.ProbeImage so bumping the pinned tag is a one-file change, and every future probe-style validator has an obvious shared home to pull the reference from. pkg/defaults is the established home for compile-time tunables (timeouts, limits, HTTP client settings); adding an images.go file with a ProbeImage const extends the same pattern. No override machinery — callers that ever need runtime override (airgapped registries, fork-specific mirrors) should read an AICR_PROBE_IMAGE env var at the call site and fall back to the constant, matching the existing AICR_CHECK_TIMEOUT / AICR_VALIDATOR_IMAGE_ REGISTRY patterns rather than baking override logic into pkg/defaults. Tag choice (busybox:1.37) is unchanged — this is pure de-duplication, not a version bump.

…X probe On GB200 NCCL probes MNNVL (multi-node NVLink via IMEX) independently of NVLS. The NET-variant pods have no IMEX ResourceClaim attached, so the probe aborts with "Cuda failure 800 operation not permitted" before any EFA traffic and the launcher exits with unhandled system error. Setting NCCL_MNNVL_ENABLE=0 alongside the existing NCCL_NVLS_ENABLE=0 keeps NCCL on the NET transport as intended. Verified on two p6e-gb200.36xlarge nodes: peak 317.87 GB/s (8 GiB messages), transport banner confirms [send] via NET/Libfabric/GDRDMA(PCI).

… for IMEX + NVreg waits Three small follow-ups on top of the NCCL GB200 variant work: * pkg/defaults: introduce EnvCheckTimeout constant for AICR_CHECK_TIMEOUT so the deployer-side producer and validator-side consumer reference the same symbol instead of two string literals that can drift. * pkg/validator/job/deployer: extract resolvedTimeout() to de-duplicate the catalog.Timeout == 0 fallback between buildApplyConfig and buildEnvApply. * validators/performance: replace polling waits with the watch API per CLAUDE.md "Kubernetes Patterns" — waitForIMEXClaimTemplate keeps a fast Get path for warm clusters and falls through to a name-scoped watch; preflightGB200NetNVregFlag runs the per-node probe pods in parallel via errgroup and watches pod phase transitions instead of polling. No behavior change beyond lower API-server load and faster ready-signal pickup on warm clusters. Scoped lint + tests pass on affected packages.

…l inline The constant was introduced in the preceding commit to centralize the orchestrator→container env-var contract. The callsites (one producer in pkg/validator/job/deployer.go, one consumer in validators/context.go) already document the string in comments, and AICR_CHECK_TIMEOUT is shorter and more greppable than defaults.EnvCheckTimeout. Not enough repetition to justify a new exported symbol. The other parts of the preceding refactor (resolvedTimeout() helper, watch API for IMEX/NVreg waits, parallel preflight errgroup) are unaffected.

…e race window Both waitForIMEXClaimTemplate and waitForPreflightPodPhase use the same pattern: an initial Get, then a Watch on the expected name/field. The Watch does not replay prior events — so if the state transition we're waiting for lands between the initial Get and the Watch call, the loop blocks until the context deadline even though the resource is already in the target state. Add a defensive Get right after Watch is established: * waitForIMEXClaimTemplate: if the DRA driver reconciled the RCT into existence in that window, return immediately instead of timing out. * waitForPreflightPodPhase: if the probe pod transitioned to Succeeded or Failed in that window, return the phase immediately. Both re-checks are idempotent and cheap (one GET each). Found by coderabbit review of the watch-API refactor.

github-actions · 2026-04-23T15:41:25Z

@njhensley this PR now has merge conflicts with main. Please rebase to resolve them.

Signed-off-by: njhensley <229213852+njhensley@users.noreply.github.com>

Two linked fixups to the orchestrator Job env-var pipeline: 1. orchestratorEnvCount := 8 hint under-allocates after this PR's AICR_CHECK_TIMEOUT addition. The always-injected set is now 7 (SNAPSHOT_PATH, RECIPE_PATH, VALIDATOR_NAME, VALIDATOR_PHASE, RUN_ID, NAMESPACE, CHECK_TIMEOUT) plus up to 5 conditionally- injected (NODE_SELECTOR, TOLERATIONS, CLI_VERSION, CLI_COMMIT, VALIDATOR_IMAGE_REGISTRY) = 12 realistic max. Promote to a package-level const with a comment enumerating the contributors, so the next contributor has a single place to bump when the list changes. No behavior change — append() resizes either way. 2. TestDeployJobEnvVars asserted every injected var except the new AICR_CHECK_TIMEOUT. A regression dropping or malforming the env wouldn't be caught. Add a positive assertion: env must be present and its value must equal testEntry().Timeout.String() (matches what buildEnvApply writes via time.Duration.String()).

Three doc gaps called out in Yuan's review: - docs/contributor/validator.md env-var table omitted AICR_CHECK_TIMEOUT, which the Job deployer now injects unconditionally. Add a row that explains what sets it (catalog entry timeout), how it's consumed (validators.checkTimeoutFromEnv → ctx.Ctx deadline), the fallback path, and the WARN-on-malformed behavior added alongside. - Same file's LoadContext paragraph still said the timeout comes from defaults.CheckExecutionTimeout. Rewrite to point at checkTimeoutFromEnv and reflect the env-first precedence. - recipes/validators/README.md Performance Phase table listed only the pre-PR nccl-all-reduce-bw entry. Add rows for the two new NET and NVLS variants. Descriptions copied verbatim from catalog.yaml so a grep for the name surfaces the same one-liner in both places.

Both runtime-net.yaml and runtime-nvls.yaml had comments next to NCCL_DEBUG=INFO that named "[send] via NET/..." / "[send] via NVLS" per-channel log lines — banners that the Go docstring at validators/performance/nccl_all_reduce_bw_constraint.go:117-124 explicitly documents as dropped in NCCL 2.27+. The actual regexes verifyTransportFromLogs matches are the NCCL-2.27-era banners "NCCL INFO Using network <plugin>" (NET) and "NVLS comm 0x<addr>" (NVLS). Rewrite both comments to describe what NCCL actually emits in the image this runtime pins (ncclv2.27.7) and what verifyTransportFromLogs actually greps. Pure documentation change in testdata; no YAML body change, so no runtime impact.

…/defaults Project convention per CLAUDE.md "Constants Rules" is that timeouts live in pkg/defaults. The preflight-path cleanup bound was the only timeout on this branch that still lived as a local const in the caller, flagged as a nit in Mark's review. Promote to defaults.PreflightCleanupTimeout (same value, 30s, same semantics) and update the single call site. Generic name leaves room for future validator preflights to reuse it — e.g., the NVreg path is the only preflight today, but the conformance-phase probe-pod patterns would reach for the same bound.

njhensley · 2026-04-23T20:37:25Z

@yuanchen8911 thanks — all findings and nits addressed; open questions triaged inline below. Commits referenced are on feat/nccl-gb200-eks-variants.

Findings (all addressed)

AICR_CHECK_TIMEOUT missing from docs/contributor/validator.md env table → added in 10c1c086, describes injection source (Job deployer from catalog timeout), fallback, and the WARN-on-malformed behavior.
LoadContext() prose still said timeout came from defaults.CheckExecutionTimeout → rewritten in 10c1c086 to point at checkTimeoutFromEnv and the env-first precedence.
recipes/validators/README.md performance table missing variants → two new rows in 10c1c086 for nccl-all-reduce-bw-net and nccl-all-reduce-bw-nvls.
TestDeployJobEnvVars missed AICR_CHECK_TIMEOUT + no unit test for checkTimeoutFromEnv → positive assertion added in 34c4a999 (env present, value equals testEntry().Timeout.String()); 6-case table-driven TestCheckTimeoutFromEnv added in c102bca6 covering unset / empty / valid / malformed / negative / zero, asserting the duration and the WARN log for the malformed/non-positive paths.

Nits (both addressed)

Stale NCCL banner references in runtime-net.yaml / runtime-nvls.yaml → rewritten in 2590f84f to reference Using network <plugin> (NET) and NVLS comm 0x<addr> (NVLS), matching the regexes verifyTransportFromLogs actually greps for.
orchestratorEnvCount := 8 under-allocates → 34c4a999 promotes it to a package-level orchestratorEnvMax = 12 const with a comment enumerating the 7 always-injected + up to 5 conditionally-injected contributors, so the next contributor has a single place to bump.

Open questions

PSA on aicr-validation namespace — real concern and out of scope for this PR. The probe-pod uses hostPath on /proc/driver/nvidia read-only, so it will trip baseline/restricted admission on clusters that enforce it by default. I'd rather fix this as a cross-cutting change covering ensureNamespace + any future hostPath-ing validator, not only the NVreg preflight — will file a follow-up. For now adding a line to the rollout notes calling out the PSA prerequisite alongside NVreg_GrdmaPciTopoCheckOverride=1.
NCCL 2.27+ banner stability — agreed; catalog-level image pinning is the right shape. Also out of scope — the AWS prebuilt image tag (cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9) is hardcoded in the runtime templates today, which locks NCCL to 2.27 by construction, but that pin is in two files instead of one catalog field. Follow-up to move it to a catalog attribute so a single bump rolls through all variants with a visible gate.
nvidia-dra-driver-gpu >= v25.12.0 for ComputeDomain numNodes: 0 — added to the PR description's rollout notes under the NVLS prerequisite (IMEXDaemonsWithDNSNames default is what makes numNodes: 0 work).
docs/user/validation.md mention of variants — noted. Left out of 10c1c086 because the page is end-user oriented and variants are opt-in per recipe; adopting them in shipped GB200/EKS overlays is the natural moment to document on that page. Tracked in the follow-up bullet that was already in the PR description.

Both new follow-ups (PSA hardening + catalog-level NCCL image pin) appended to the "Follow-ups" section of the PR description.

yuanchen8911 · 2026-04-23T21:23:36Z

Verified against 94f0a83.

1. [Major] TrainJob sizing/scoping is still asymmetric on mixed EKS clusters

The PR now filters the NET preflight to the nodes it wants to probe at nccl_all_reduce_bw_constraint.go#L221-L240, but the actual TrainJob is still sized from the unfiltered GPU fleet. determineGPUConfig() and template substitution still derive WORKER_COUNT, GPU_COUNT, and per-node shape from all schedulable GPU nodes at nccl_all_reduce_bw_constraint.go#L379-L401, while EKS worker placement later narrows to a single instance type at nccl_all_reduce_bw_constraint.go#L446-L451 and platformWorkerScheduling EKS path.

On a cluster with 2 GB200 + 3 H100 nodes, the TrainJob can request 5 workers while only the 2 GB200 nodes match the selector, leaving 3 workers Pending and mis-stating mpirun -np.

The fix I want is not another preflight-only filter. I want one shared target-node resolution step applied before sizing:

start from all schedulable GPU nodes
apply ctx.NodeSelector if present; otherwise, on EKS, apply the same default instance-type selector the worker pods use
rebuild gpuConfiguration from that filtered node set
use that filtered config for the >=2 nodes check, NVreg preflight, EFA discovery, WORKER_COUNT, GPU_COUNT, and runNCCLTrainJob()

If the effective selector matches zero GPU nodes, fail clearly. If it matches only one node, skip as a non-multi-node topology.

Non-blocking follow-ups

[Low] The inference-perf timeout comment is now stale. recipes/validators/catalog.yaml#L70-L77 still describes a 5-minute ctx-to-Job stagger, but both ActiveDeadlineSeconds and AICR_CHECK_TIMEOUT now come from the same resolvedTimeout() path in pkg/validator/job/deployer.go#L150-L165, pkg/validator/job/deployer.go#L218-L230, and validators/context.go#L74-L91. At minimum the comment should be updated to match reality.
[Low] checkNVregOnNode() still conflates "flag absent" with "grep could not read /host-proc-nvidia/params". The code comment says it distinguishes harder failures by inspecting logs, but the implementation only warns on non-empty output and still returns (false, nil). Distinguishing empty output from grep: /host-proc-nvidia/params:-style errors would avoid sending operators to the wrong fix.

I am not blocking on the defaults.ProbeImage / AICR_VALIDATOR_IMAGE_REGISTRY gap from my earlier summary. That limitation is real, but it already exists in other helper-pod sites in this same SHA, so I would treat it as follow-up debt rather than a regression introduced here.

yuanchen8911

Verified against 94f0a83 and against my earlier summary in #640 (comment).

One blocking issue remains:

[Major] TrainJob sizing/scoping is still asymmetric on mixed EKS clusters

The new NET preflight now filters the NVreg probe to target nodes at nccl_all_reduce_bw_constraint.go#L221-L240, but runNCCLTrainJob() still receives the full gpuConfig, and determineGPUConfig() plus template substitution still derive WORKER_COUNT, GPU_COUNT, and per-node shape from all schedulable GPU nodes at nccl_all_reduce_bw_constraint.go#L379-L401. Later, EKS worker placement narrows to a single instance type at nccl_all_reduce_bw_constraint.go#L446-L451 and platformWorkerScheduling EKS path.

On a mixed cluster such as 2 GB200 + 3 H100 nodes, the TrainJob can request 5 workers while only 2 nodes match the selector, leaving 3 workers Pending and mis-stating mpirun -np. The preflight change made the asymmetry explicit; the benchmark path needs to use the same filtered node subset, or these GB200 EKS variants need a heterogeneous-cluster fail-fast.

I am not blocking on the defaults.ProbeImage / AICR_VALIDATOR_IMAGE_REGISTRY gap from my earlier summary. That limitation is real, but it already exists in other helper-pod sites in this same SHA (validators/conformance/pod_autoscaling_check.go, validators/conformance/secure_access_check.go), so I would treat it as follow-up debt rather than a regression introduced here.

…node set determineGPUConfig sized WORKER_COUNT / GPU_COUNT / TotalGPUCount from every schedulable GPU node, while applyNCCLResources later narrowed the worker podSpec to a single instance type on EKS (via platformWorkerScheduling). On a mixed cluster — 2× p6e-gb200 + 3× p5-h100 under one EKS control plane — the TrainJob would request 5 workers while only the 2 GB200 nodes match the selector, leaving 3 Pending and mis-stating `mpirun -np`. Per review feedback, fix with one shared target-node resolution step applied before sizing: - Extract resolveTargetGPUNodes that returns both the filtered node set and the effective selector used to derive it. - Precedence: ctx.NodeSelector override wins; else on EKS use the first node's node.kubernetes.io/instance-type label (same key platformWorkerScheduling stamps into the worker podSpec); else no filter. Zero-match returns ErrCodeInternal naming the selector and the total GPU-node count so the misconfiguration is diagnosable. - Call it inside determineGPUConfig so WorkerCount, GPUCountPerNode, TotalGPUCount, and gpuConfig.Nodes are all derived from the filtered set. The existing WorkerCount < 2 gate continues to handle the single-match-skip case. - Delete the preflight-only filterPreflightNodes helper; the preflight is now called with gpuConfig.Nodes directly, since that slice is already scoped to the target set. Heterogeneous-cluster contract: the automatic path handles pools that are homogeneous within the matched subset. On a truly mixed cluster the operator is expected to pass --node-selector to name the pool they want, per the convention already used by inference_perf_constraint.go. A cross-validator improvement to auto-select by Criteria.Accelerator + nvidia.com/gpu.product label (the existing TODO in nccl_eks_utils.go) would close the "first node's instance type" arbitrariness on mixed clusters — out of scope here because it needs to land in inference-perf the same way. TestResolveTargetGPUNodes moved to nccl_test.go with 7 cases: EKS mixed narrow, EKS homogeneous no-op, user override, non-EKS no-op, unlabeled first node, zero-match error, empty input. Asserts both the returned node set and the effective selector.

…a gpu.product resolveTargetGPUNodes previously picked the first GPU node's instance-type as the EKS narrow selector. On a heterogeneous cluster (e.g. 2× GB200 + 3× H100 under one EKS control plane) this was non-deterministic w.r.t. the recipe's accelerator intent: if H100 nodes sorted first, a recipe with accelerator: gb200 would validate H100 hardware. NET would then fail the NVreg preflight with a misleading "set the kernel flag" advice, and NVLS would fail on the transport-verification step after spending TrainJob wall time — both wrong-cause errors for the same root problem. Add a gpu.product filter as the second-precedence rule in the resolver. NVIDIA GPU Feature Discovery (part of the GPU Operator AICR deploys) labels every GPU node with nvidia.com/gpu.product carrying the concrete product string (NVIDIA-GB200, NVIDIA-H100-80GB-HBM3, NVIDIA-B200, ...). Define acceleratorProductMatchers mapping each Criteria.Accelerator enum to a predicate — exact match for single-SKU accelerators (GB200, B200, RTX Pro 6000), prefix match for families with multiple SKUs (H100 SXM / PCIe / NVL, A100 SXM / PCIe variants), custom for L40 / L40S. New resolver precedence: 1. --node-selector override (unchanged). 2. NEW: gpu.product ↔ recipe accelerator filter, when a matcher exists and at least one node carries the label. Zero-match returns a diagnostic error naming the expected accelerator and the products observed — no more misattributed preflight failures. 3. On EKS, further narrow to first-filtered-node instance-type so the pool matches the worker podSpec's platformWorkerScheduling selector. Also the sole narrow on non-GFD clusters (graceful degrade). 4. Otherwise return the accelerator-filtered set as-is. CriteriaAcceleratorAny has no matcher, so variantDefault + any-accelerator recipes skip the filter (unchanged behavior). Clusters without GFD labels also skip the filter cleanly — the existing first-node-instance-type heuristic remains the fallback. Tests: TestResolveTargetGPUNodes grows to 12 cases covering accelerator filtering on mixed clusters, H100 SXM+PCIe sub-narrowing, mismatch error with products-seen diagnostic, non-GFD degrade, override-wins, empty input, and non-EKS flow. New TestAcceleratorProductMatchers pins each accelerator's matcher behavior (including exact-vs-prefix boundary cases like NVIDIA-L40 vs NVIDIA-L4 and NVIDIA-GB200 vs NVIDIA-GB200-96GB). Inference-perf keeps its --node-selector-or-first-most-free heuristic for now; adopting gpu.product there is a separate follow-up (tracked in the PR description).

Post-review simplification pass on the accelerator-aware node resolver. No behavior change — all 12 resolver cases and 17 matcher cases still pass; log output and error messages are unchanged modulo phrasing. - Drop filterByLabels duplicate; call the existing nodesMatchingSelector from inference_perf_constraint.go (same package, drop-in). - Drop the returned selector map from resolveTargetGPUNodes (caller only used it for one log line). Each narrowing step now logs inline, which also kills the synthetic `<gb200>`-bracketed selector the error path used to emit as a stand-in for a real label value. - Split the decision tree into narrowByAccelerator + narrowByInstanceType helpers so the top-level flow is a four-line cascade instead of three levels of nested conditionals. Zero-match-after-attempted-filter is now a local concern of narrowByAccelerator. - Defer uniqueGPUProducts sort+set construction to the zero-match diagnostic path; happy paths no longer allocate it. - Hoist `node.kubernetes.io/instance-type` to a package constant (instanceTypeLabel), matching gpuProductLabel. - Trim narrating comments at call sites ("nodes are already narrowed", "gpuConfig.Nodes is already the target subset") — they restated the code. Tests: TestResolveTargetGPUNodes now asserts error messages by substring instead of a faked selector map, which was brittle and only testable because of the synthetic. Cases and coverage unchanged.

njhensley · 2026-04-23T22:57:37Z

@yuanchen8911 addressed across three commits:

Scope fix — 4a0a8c19 fix(validators/performance): size NCCL TrainJob from filtered target node set

Extracted one shared target-node resolution step applied before sizing. determineGPUConfig now takes service, calls resolveTargetGPUNodes, and gpuConfig.Nodes (plus WorkerCount / GPUCountPerNode / TotalGPUCount) is the filtered set. The >= 2 node gate, NVreg preflight, EFA discovery, and runNCCLTrainJob all consume that same slice. filterPreflightNodes deleted — the preflight is now called with gpuConfig.Nodes directly. Zero-match on an explicit selector returns ErrCodeInternal naming the selector and the total count.

On your mixed 2 GB200 + 3 H100 example, gpuConfig.WorkerCount is now 2 / TotalGPUCount 8 / mpirun -np agrees, and the worker podSpec's platformWorkerScheduling selector hits exactly those two nodes.

Accelerator-aware auto-path — f1820c2e feat(validators/performance): accelerator-aware GPU node selection via gpu.product

Went beyond the reviewer ask to fix a related hole the first-node-instance-type default has: on a mixed cluster without --node-selector, node list order decided which accelerator got validated. Added a GFD-based filter as the second-precedence rule:

ctx.NodeSelector override wins.
NEW: filter by nvidia.com/gpu.product ↔ Criteria.Accelerator when a matcher exists and the cluster carries the label. Zero-match → hard error (no schedulable GPU nodes match recipe accelerator "gb200" (products seen: [NVIDIA-H100-80GB-HBM3])) so operators aren't pointed at misleading secondary errors like the NVreg preflight failing on H100 nodes.
On EKS, further narrow to first-filtered-node instance-type so the pool is homogeneous in what platformWorkerScheduling stamps into the worker podSpec.
Otherwise return the accelerator-filtered set as-is.

acceleratorProductMatchers covers GB200, B200, H100 family, A100 family, L40/L40S, RTX Pro 6000 (exact for single-SKU accelerators, prefix for families). CriteriaAcceleratorAny intentionally has no matcher. Clusters without GFD labels fall through to the first-node-instance-type heuristic unchanged.

Same shape should land in inference_perf_constraint.go — it currently only consults --node-selector and not gpu.product. Tracking as a follow-up in the PR description.

Simplification pass — 5d35fedd refactor(validators/performance): simplify resolveTargetGPUNodes

Post-review cleanup: filterByLabels was a dup of the existing nodesMatchingSelector in inference-perf (same package) — deleted and reused. Dropped the returned-selector map (was only used for logging; each narrowing step logs inline now) which also killed a synthetic <gb200> bracketed value the error path emitted as a selector stand-in. Decision tree split into narrowByAccelerator + narrowByInstanceType helpers so the top-level cascade is flat instead of three levels deep. Zero behavior change.

Tests: TestResolveTargetGPUNodes covers 12 cases — mixed-accelerator determinism regardless of list order, H100 SXM+PCIe sub-narrowing, accelerator mismatch with products seen: diagnostic, non-GFD fallback, override-wins, override-zero-match, non-EKS flow, empty input. TestAcceleratorProductMatchers pins 17 matcher boundaries (e.g. NVIDIA-GB200 exact vs NVIDIA-GB200-96GB, NVIDIA-H100-* prefix vs NVIDIA-H200-*, NVIDIA-L40/NVIDIA-L40S vs NVIDIA-L4).

Blocker resolved by 4a0a8c1 + f1820c2 + 5d35fed — resolveTargetGPUNodes now feeds preflight, sizing, EFA discovery, and the TrainJob from one filtered node set.

yuanchen8911

/lgtm

Blocker resolved: resolveTargetGPUNodes provides a single target-node pipeline feeding preflight, sizing, EFA discovery, and the TrainJob uniformly. The GFD-based accelerator narrow is a nice improvement beyond the prescription — auto-path stays correct on mixed accelerator pools without requiring --node-selector.

Two Low follow-ups remain and are non-blocking: inference-perf catalog-comment drift, and checkNVregOnNode conflating flag-absent with grep-can't-open-file errors.

Signed-off-by: Mark Chmarny <mchmarny@users.noreply.github.com>

…h check (#640) Signed-off-by: njhensley <229213852+njhensley@users.noreply.github.com> Co-authored-by: Mark Chmarny <mchmarny@users.noreply.github.com>

njhensley added 7 commits April 21, 2026 16:29

njhensley requested review from a team as code owners April 22, 2026 17:37

github-actions Bot added area/recipes area/validator size/XL labels Apr 22, 2026

Merge branch 'main' into feat/nccl-gb200-eks-variants

a282315

This comment was marked as resolved.

Sign in to view

njhensley marked this pull request as draft April 22, 2026 17:53

mchmarny assigned njhensley Apr 22, 2026

njhensley and others added 2 commits April 22, 2026 11:05

Merge branch 'main' into feat/nccl-gb200-eks-variants

a3eb936

njhensley marked this pull request as ready for review April 22, 2026 19:04

njhensley added 2 commits April 22, 2026 13:56

Merge branch 'main' into feat/nccl-gb200-eks-variants

29f1a66

Merge branch 'NVIDIA:main' into feat/nccl-gb200-eks-variants

c46d0bc

mchmarny added this to the v0.12 milestone Apr 23, 2026

Merge branch 'main' into feat/nccl-gb200-eks-variants

f56c28a

github-actions Bot added the needs-rebase label Apr 23, 2026

Merge branch 'main' into feat/nccl-gb200-eks-variants

74ee6cd

Signed-off-by: njhensley <229213852+njhensley@users.noreply.github.com>

github-actions Bot removed the needs-rebase label Apr 23, 2026

Merge branch 'main' into feat/nccl-gb200-eks-variants

1793d72

This comment was marked as resolved.

Sign in to view

njhensley added 4 commits April 23, 2026 12:45

github-actions Bot added the area/docs label Apr 23, 2026

Merge branch 'main' into feat/nccl-gb200-eks-variants

94f0a83

yuanchen8911 previously requested changes Apr 23, 2026

View reviewed changes

njhensley added 3 commits April 23, 2026 15:21

yuanchen8911 previously approved these changes Apr 23, 2026

View reviewed changes

yuanchen8911 requested a review from mchmarny April 24, 2026 04:44

Merge branch 'main' into feat/nccl-gb200-eks-variants

c784941

mchmarny enabled auto-merge (squash) April 24, 2026 12:26

Merge branch 'main' into feat/nccl-gb200-eks-variants

6826009

Signed-off-by: Mark Chmarny <mchmarny@users.noreply.github.com>

mchmarny dismissed yuanchen8911’s stale review via 6826009 April 24, 2026 12:33

mchmarny added 2 commits April 24, 2026 05:42

Merge branch 'main' into feat/nccl-gb200-eks-variants

2de89cb

Merge branch 'main' into feat/nccl-gb200-eks-variants

cd9d9cf

ArangoGutierrez approved these changes Apr 24, 2026

View reviewed changes

lalitadithya approved these changes Apr 24, 2026

View reviewed changes

mchmarny merged commit 4e158cf into NVIDIA:main Apr 24, 2026
65 checks passed

njhensley mentioned this pull request Apr 24, 2026

feat: GB200 EKS NET/NVLS NCCL validation and driver bump #668

Merged

12 tasks

njhensley deleted the feat/nccl-gb200-eks-variants branch June 23, 2026 16:21

This was referenced Jun 23, 2026

feat(validators): RoCE NET variant for nccl-all-reduce-bw #1428

Merged

Add OpenShift Container Platform (OCP) as a service type #1380

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check#640

feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check#640
mchmarny merged 34 commits into
NVIDIA:mainfrom
njhensley:feat/nccl-gb200-eks-variants

njhensley commented Apr 22, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

github-actions Bot commented Apr 23, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

njhensley commented Apr 23, 2026

Uh oh!

yuanchen8911 commented Apr 23, 2026 •

edited

Loading

Uh oh!

yuanchen8911 left a comment

Uh oh!

njhensley commented Apr 23, 2026

Uh oh!

yuanchen8911 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

njhensley commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Design decisions worth reviewing

Testing

Follow-ups (not in this PR)

Risk Assessment

Checklist

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

github-actions Bot commented Apr 23, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

njhensley commented Apr 23, 2026

Findings (all addressed)

Nits (both addressed)

Open questions

Uh oh!

yuanchen8911 commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yuanchen8911 left a comment

Choose a reason for hiding this comment

Uh oh!

njhensley commented Apr 23, 2026

Uh oh!

yuanchen8911 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

njhensley commented Apr 22, 2026 •

edited

Loading

yuanchen8911 commented Apr 23, 2026 •

edited

Loading