[Issue]: nccl-tests fail with wrong transfers #2001

@angainor

Description

How is this issue impacting you?

Data corruption

Share Your Debug Logs

ncclDebug.gpu-1-7.192846.txt
ncclDebug.gpu-1-7.192847.txt
ncclDebug.gpu-1-7.192848.txt
ncclDebug.gpu-1-7.192849.txt
ncclSystem.txt

Steps to Reproduce the Issue

I'm testing NCCL using nccl-tests on a GH200 system with Slingshot 11 and the aws-ofi-nccl plugin. I compile the libraries against CUDA 12.6 using the Cray Programming Environment. Starting with nccl-2.27.3 I see occasional #wrong transfers reported by all_reduce_perf. I could not reproduce the problem with nccl-2.26.6. Interestingly, the errors occur mostly at message size 1048576, in around 30% of runs; most runs finish without errors.

I tried a number of libfabric and aws-ofi-nccl versions, including the newest ones available (libfabric-2.4.0 and aws-ofi-nccl-1.17.3). This has no impact. The only trigger I found is switching the NCCL version to 2.27.3 or newer (I also tried the newest nccl-2.29.2 and see the same issue).
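Since the failure is intermittent, it may help to quantify the rate per NCCL version rather than eyeballing individual runs. Below is a minimal Python sketch that reruns a command and counts how often the nccl-tests summary line ends in FAILED. The helper names (`run_once`, `has_wrong_transfers`, `failure_rate`) are hypothetical, and the check relies only on the `# Out of bounds values : ... FAILED` line seen in the log in this report.

```python
import subprocess
from typing import Callable, Sequence


def run_once(cmd: Sequence[str]) -> str:
    """Run one benchmark invocation and return its stdout as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


def has_wrong_transfers(output: str) -> bool:
    """True if the summary line reports a failed validation.

    Assumes the '# Out of bounds values : N FAILED' format shown in this issue.
    """
    return any(
        line.startswith("# Out of bounds values") and "FAILED" in line
        for line in output.splitlines()
    )


def failure_rate(
    cmd: Sequence[str],
    runs: int,
    runner: Callable[[Sequence[str]], str] = run_once,
) -> float:
    """Fraction of `runs` invocations of `cmd` that report wrong transfers."""
    failures = sum(has_wrong_transfers(runner(cmd)) for _ in range(runs))
    return failures / runs


# Example invocation (the path and launcher are placeholders; adjust to your site):
# cmd = ["srun", "all_reduce_perf", "-b", "1M", "-e", "1M", "-g", "1"]
# print(failure_rate(cmd, runs=20))
```

Restricting the run to `-b 1M -e 1M` targets the 1048576-byte size where the errors were seen, which keeps each repetition short. The `runner` argument is there only so the counting logic can be exercised without a cluster.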

NCCL Version

2.27.3 and higher
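
For reference, the nccl-tests header in the log below prints the versions as integer codes (`nccl-headers=22902 nccl-library=22703`). A small sketch for decoding them, assuming the `major*10000 + minor*100 + patch` encoding that NCCL uses from 2.9 onward (the function name is hypothetical):

```python
def decode_nccl_version(code: int) -> str:
    """Decode an NCCL version code into 'major.minor.patch'.

    Assumes the major*10000 + minor*100 + patch encoding (NCCL 2.9 and later).
    """
    major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


print(decode_nccl_version(22902))  # nccl-headers -> 2.29.2
print(decode_nccl_version(22703))  # nccl-library -> 2.27.3
```

Read this way, the logged run used a test binary built against 2.29.2 headers with the 2.27.3 library loaded at run time. Whether that mismatch plays any role in the wrong transfers is not established here.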

Your platform details

HPE Cray EX system, GH200 nodes with 4 GPUs, Slingshot 11 interconnect between compute nodes, bare metal. Happens when using cross-node transfers. Tested with 1 GPU per node, and 4 GPUs per node.

nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV6	NV6	NV6	0-71	0		4
GPU1	NV6	 X 	NV6	NV6	72-143	1		12
GPU2	NV6	NV6	 X 	NV6	144-215	2		20
GPU3	NV6	NV6	NV6	 X 	216-287	3		28

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error Message & Behavior

Note the wrong transfers at message size 1048576 (96 in the in-place #wrong column).

srun ~/src/nccl/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

# nccl-tests version 2.17.8 nccl-headers=22902 nccl-library=22703
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2772724 on    gpu-1-1 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  1 Group  0 Pid 2772725 on    gpu-1-1 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  2 Group  0 Pid 2772726 on    gpu-1-1 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  3 Group  0 Pid 2772727 on    gpu-1-1 device  3 [0039:01:00] NVIDIA GH200 120GB
#  Rank  4 Group  0 Pid 140817 on    gpu-1-7 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  5 Group  0 Pid 140818 on    gpu-1-7 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  6 Group  0 Pid 140819 on    gpu-1-7 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  7 Group  0 Pid 140820 on    gpu-1-7 device  3 [0039:01:00] NVIDIA GH200 120GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
           8             2     float     sum      -1    20.85    0.00    0.00       0   336.26    0.00    0.00       0
          16             4     float     sum      -1    21.13    0.00    0.00       0    20.77    0.00    0.00       0
          32             8     float     sum      -1    21.70    0.00    0.00       0    20.79    0.00    0.00       0
          64            16     float     sum      -1    22.17    0.00    0.01       0    22.17    0.00    0.01       0
         128            32     float     sum      -1    22.59    0.01    0.01       0    21.96    0.01    0.01       0
         256            64     float     sum      -1    21.87    0.01    0.02       0    22.79    0.01    0.02       0
         512           128     float     sum      -1    22.32    0.02    0.04       0    22.16    0.02    0.04       0
        1024           256     float     sum      -1    23.11    0.04    0.08       0    22.35    0.05    0.08       0
        2048           512     float     sum      -1    23.12    0.09    0.15       0    23.21    0.09    0.15       0
        4096          1024     float     sum      -1    24.96    0.16    0.29       0    24.14    0.17    0.30       0
        8192          2048     float     sum      -1    28.25    0.29    0.51       0    27.19    0.30    0.53       0
       16384          4096     float     sum      -1    29.64    0.55    0.97       0    29.34    0.56    0.98       0
       32768          8192     float     sum      -1    32.09    1.02    1.79       0    31.82    1.03    1.80       0
       65536         16384     float     sum      -1    56.54    1.16    2.03       0    60.05    1.09    1.91       0
      131072         32768     float     sum      -1    57.02    2.30    4.02       0    57.37    2.28    4.00       0
      262144         65536     float     sum      -1    98.90    2.65    4.64       0    94.74    2.77    4.84       0
      524288        131072     float     sum      -1   124.21    4.22    7.39       0   151.36    3.46    6.06       0
     1048576        262144     float     sum      -1   110.47    9.49   16.61       0   159.24    6.58   11.52      96
     2097152        524288     float     sum      -1    99.42   21.09   36.91       0    98.81   21.22   37.14       0
     4194304       1048576     float     sum      -1   146.90   28.55   49.97       0   147.99   28.34   49.60       0
     8388608       2097152     float     sum      -1   210.38   39.87   69.78       0   209.63   40.02   70.03       0
    16777216       4194304     float     sum      -1   373.67   44.90   78.57       0   371.44   45.17   79.04       0
    33554432       8388608     float     sum      -1   888.03   37.79   66.12       0   804.90   41.69   72.95       0
    67108864      16777216     float     sum      -1  1213.26   55.31   96.80       0  1209.98   55.46   97.06       0
   134217728      33554432     float     sum      -1  1765.55   76.02  133.04       0  1766.69   75.97  132.95       0
# Out of bounds values : 6 FAILED
# Avg bus bandwidth    : 22.8153 
#
# Collective test concluded: all_reduce_perf
#
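
When comparing many runs, picking out the affected sizes by hand is tedious. A minimal Python sketch for extracting rows with a nonzero #wrong count from output like the above follows. It assumes the 13-column row layout shown in this log (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and in-place); the names `WrongRow` and `find_wrong_rows` are hypothetical.

```python
from typing import List, NamedTuple


class WrongRow(NamedTuple):
    size_bytes: int
    out_of_place_wrong: int
    in_place_wrong: int


def find_wrong_rows(output: str) -> List[WrongRow]:
    """Return the result rows whose #wrong count is nonzero in either column."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Skip comment/header lines and anything that is not a 13-column row.
        if len(fields) != 13 or fields[0].startswith("#"):
            continue
        try:
            row = WrongRow(int(fields[0]), int(fields[8]), int(fields[12]))
        except ValueError:
            # Non-numeric #wrong fields (e.g. validation disabled) are ignored.
            continue
        if row.out_of_place_wrong or row.in_place_wrong:
            rows.append(row)
    return rows


sample = """\
      524288        131072     float     sum      -1   124.21    4.22    7.39       0   151.36    3.46    6.06       0
     1048576        262144     float     sum      -1   110.47    9.49   16.61       0   159.24    6.58   11.52      96
"""
print(find_wrong_rows(sample))
# [WrongRow(size_bytes=1048576, out_of_place_wrong=0, in_place_wrong=96)]
```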
