Open
Description
How is this issue impacting you?
NCCL crashes our training tasks.
Share Your Debug Logs
We found that NCCL 2.27.7 triggers segmentation faults in RAS-related code.
The crash can be worked around by setting export NCCL_RAS_ENABLE=0 before launching the job.
We did not actively trigger any ncclRas operation; the crash happens inside NCCL's own RAS thread.
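A minimal sketch of applying the workaround above, assuming the job is launched from a shell that the training processes inherit their environment from. The launch command in the last comment is hypothetical and only marks where your own launcher would go; NCCL_DEBUG=INFO is optional and is included only so the NCCL version shows up in the job log.

```shell
# Disable the NCCL RAS subsystem for this shell and every process started from it.
export NCCL_RAS_ENABLE=0

# Optional: have NCCL log its version and setup details at startup.
export NCCL_DEBUG=INFO

# Hypothetical launch command; replace with your own launcher and script.
# torchrun --nproc_per_node=8 train.py
```

On multi-node jobs the variable has to reach every rank, so it may need to be passed through the launcher or scheduler rather than set only on the head node.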
- Case 1: H100, NCCL 2.27.7 + CUDA 12.4
[n125-011-039:288505:0:296459] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x48)
==== backtrace (tid: 296459) ====
0 /usr/lib/libucs.so.0(ucs_handle_error+0x2d4) [0x7fedc925c274]
1 /usr/lib/libucs.so.0(+0x2d44c) [0x7fedc925c44c]
2 /usr/lib/libucs.so.0(+0x2d61a) [0x7fedc925c61a]
3 /opt/tiger/nccl/lib/libnccl.so(+0x120057) [0x7ff1c18c1057]
4 /opt/tiger/nccl/lib/libnccl.so(+0x12209d) [0x7ff1c18c309d]
5 /opt/tiger/nccl/lib/libnccl.so(+0x125331) [0x7ff1c18c6331]
6 /opt/tiger/nccl/lib/libnccl.so(+0x122e70) [0x7ff1c18c3e70]
7 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x89144) [0x7ff1c151e144]
8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x7ff1c159e7dc]
- Case 2: H20, NCCL 2.27.7 + CUDA 12.4
==== backtrace (tid: 136570) ====
0 0x000000000003c050 _sigaction() ???:0
1 0x0000000000122927 rasMsgHandlePeersUpdate() /opt/tiger/compile_path/src/code.byted.org/data/nccl/src/ras/peers.cc:54
2 0x000000000012496d rasMsgHandle() /opt/tiger/compile_path/src/code.byted.org/data/nccl/src/ras/ras.cc:402
3 0x0000000000127c01 rasSockEventLoop() /opt/tiger/compile_path/src/code.byted.org/data/nccl/src/ras/rasnet.cc:64
4 0x0000000000125740 rasThreadMain() /opt/tiger/compile_path/src/code.byted.org/data/nccl/src/ras/ras.cc:648
5 0x0000000000089144 pthread_condattr_setpshared() ???:0
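The case 1 backtrace only shows raw offsets into libnccl.so, while case 2 is already symbolized. The sketch below shows one way the case 1 offsets could be resolved with addr2line from binutils to check whether they fall in the same RAS code path. It assumes it is run on the affected host with the same libnccl.so build; the offsets and library path are copied from the case 1 frames, and the output is only meaningful if the library carries symbol or debug information.

```shell
# Offsets copied from the libnccl.so frames (3 to 6) of the case 1 backtrace.
offsets="0x120057 0x12209d 0x125331 0x122e70"
lib=/opt/tiger/nccl/lib/libnccl.so

if command -v addr2line >/dev/null 2>&1 && [ -r "$lib" ]; then
  # -f prints function names, -C demangles C++ symbols, -e selects the binary.
  # $offsets is intentionally unquoted so each offset becomes its own argument.
  addr2line -f -C -e "$lib" $offsets
else
  echo "addr2line or $lib not available; run this on the affected host"
fi
```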
Steps to Reproduce the Issue
No response
NCCL Version
NCCL 2.27.7 + CUDA 12.4
Your platform details
No response
Error Message & Behavior
No response