Description
How is this issue impacting you?
Data corruption
Share Your Debug Logs
ncclDebug.gpu-1-7.192846.txt
ncclDebug.gpu-1-7.192847.txt
ncclDebug.gpu-1-7.192848.txt
ncclDebug.gpu-1-7.192849.txt
ncclSystem.txt
Steps to Reproduce the Issue
I'm testing NCCL using nccl-tests on a GH200 system with Slingshot 11 and the aws-ofi-nccl plugin. I compile the libraries against CUDA 12.6 using the Cray Programming Environment. Starting from nccl-2.27.3, I am seeing occasional #wrong transfers reported by all_reduce_perf. I could not reproduce these problems with nccl-2.26.6. Interestingly, the errors appear almost exclusively at message size 1048576 and in roughly 30% of runs; most runs finish without errors.
I tried a number of libfabric and aws-ofi-nccl versions, including the newest available (libfabric-2.4.0 and aws-ofi-nccl-1.17.3); this had no impact. The only trigger I found is switching the NCCL version to 2.27.3 or newer (I also tried the newest nccl-2.29.2 and see the same issue).
NCCL Version
2.27.3 and higher
Your platform details
HPE Cray EX system, GH200 nodes with 4 GPUs each, Slingshot 11 interconnect between compute nodes, bare metal. The issue occurs with cross-node transfers. Tested with 1 GPU per node and with 4 GPUs per node.
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV6 NV6 NV6 0-71 0 4
GPU1 NV6 X NV6 NV6 72-143 1 12
GPU2 NV6 NV6 X NV6 144-215 2 20
GPU3 NV6 NV6 NV6 X 216-287 3 28
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Error Message & Behavior
Note the nonzero #wrong counts for message size 1048576 in the output below.
srun ~/src/nccl/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nccl-tests version 2.17.8 nccl-headers=22902 nccl-library=22703
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 2772724 on gpu-1-1 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 1 Group 0 Pid 2772725 on gpu-1-1 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 2 Group 0 Pid 2772726 on gpu-1-1 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 3 Group 0 Pid 2772727 on gpu-1-1 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 4 Group 0 Pid 140817 on gpu-1-7 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 5 Group 0 Pid 140818 on gpu-1-7 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 6 Group 0 Pid 140819 on gpu-1-7 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 7 Group 0 Pid 140820 on gpu-1-7 device 3 [0039:01:00] NVIDIA GH200 120GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 20.85 0.00 0.00 0 336.26 0.00 0.00 0
16 4 float sum -1 21.13 0.00 0.00 0 20.77 0.00 0.00 0
32 8 float sum -1 21.70 0.00 0.00 0 20.79 0.00 0.00 0
64 16 float sum -1 22.17 0.00 0.01 0 22.17 0.00 0.01 0
128 32 float sum -1 22.59 0.01 0.01 0 21.96 0.01 0.01 0
256 64 float sum -1 21.87 0.01 0.02 0 22.79 0.01 0.02 0
512 128 float sum -1 22.32 0.02 0.04 0 22.16 0.02 0.04 0
1024 256 float sum -1 23.11 0.04 0.08 0 22.35 0.05 0.08 0
2048 512 float sum -1 23.12 0.09 0.15 0 23.21 0.09 0.15 0
4096 1024 float sum -1 24.96 0.16 0.29 0 24.14 0.17 0.30 0
8192 2048 float sum -1 28.25 0.29 0.51 0 27.19 0.30 0.53 0
16384 4096 float sum -1 29.64 0.55 0.97 0 29.34 0.56 0.98 0
32768 8192 float sum -1 32.09 1.02 1.79 0 31.82 1.03 1.80 0
65536 16384 float sum -1 56.54 1.16 2.03 0 60.05 1.09 1.91 0
131072 32768 float sum -1 57.02 2.30 4.02 0 57.37 2.28 4.00 0
262144 65536 float sum -1 98.90 2.65 4.64 0 94.74 2.77 4.84 0
524288 131072 float sum -1 124.21 4.22 7.39 0 151.36 3.46 6.06 0
1048576 262144 float sum -1 110.47 9.49 16.61 0 159.24 6.58 11.52 96
2097152 524288 float sum -1 99.42 21.09 36.91 0 98.81 21.22 37.14 0
4194304 1048576 float sum -1 146.90 28.55 49.97 0 147.99 28.34 49.60 0
8388608 2097152 float sum -1 210.38 39.87 69.78 0 209.63 40.02 70.03 0
16777216 4194304 float sum -1 373.67 44.90 78.57 0 371.44 45.17 79.04 0
33554432 8388608 float sum -1 888.03 37.79 66.12 0 804.90 41.69 72.95 0
67108864 16777216 float sum -1 1213.26 55.31 96.80 0 1209.98 55.46 97.06 0
134217728 33554432 float sum -1 1765.55 76.02 133.04 0 1766.69 75.97 132.95 0
# Out of bounds values : 6 FAILED
# Avg bus bandwidth : 22.8153
#
# Collective test concluded: all_reduce_perf
#
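Since only roughly 30% of runs hit the error, it helps to run the benchmark repeatedly and scan each log for nonzero #wrong counts. A minimal sketch of such a scanner (the function name and row-parsing details are my own assumptions, based on the 13-column data-row layout shown above):

```python
# Scan nccl-tests all_reduce_perf output for data rows with nonzero #wrong.
# Assumed column layout (matching the table above):
#   size count type redop root time algbw busbw #wrong time algbw busbw #wrong
def find_wrong_rows(output: str):
    bad = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, comment/header lines, and non-data lines.
        if not fields or fields[0].startswith("#") or len(fields) < 13:
            continue
        try:
            size = int(fields[0])
            oop_wrong = int(fields[8])    # out-of-place #wrong column
            ip_wrong = int(fields[12])    # in-place #wrong column
        except ValueError:
            continue
        if oop_wrong or ip_wrong:
            bad.append((size, oop_wrong, ip_wrong))
    return bad
```

Feeding the log above to this function would flag only the 1048576-byte row (96 in-place wrong values), making it easy to grep many runs for intermittent corruption.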