Description
How is this issue impacting you?
Application crash
Share Your Debug Logs
ncclCommGetAsyncError() internally calls ncclGinQueryLastError(), which iterates over all GIN contexts and queries GDAKI QP error states via doca_gpu_verbs_query_last_error(). The problem is that GIN context creation (ncclGinConnectOnce → ncclGinGdakiCreateContext) happens lazily — triggered inside any collective call (ncclAllReduce, etc.) that needs new transport channels. During this initialization, IB QPs are still transitioning through states (INIT → RTR → RTS) and doca_gpu_verbs_query_last_error() reports spurious errors that surface as ncclRemoteError.
We should fix this by gating the GIN error query block in ncclCommGetAsyncError() on ginState->connected, which is only set to true at the end of ncclGinConnectOnce() after all GIN contexts are fully created and QPs have completed their IB state transitions. This skips the query entirely during the vulnerable initialization window while still checking for real GIN errors once the subsystem is ready.
Concretely, in src/init.cc, replace
if (*asyncError == ncclSuccess && comm->sharedRes && comm->sharedRes->ginState.ncclGin) {
with
if (*asyncError == ncclSuccess && comm->sharedRes && comm->sharedRes->ginState.ncclGin
    && comm->sharedRes->ginState.connected) {
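To illustrate why the extra condition matters, here is a minimal standalone sketch of the gating logic. It does not use NCCL itself: every type and function prefixed with Mock/mock is a hypothetical stand-in, and only the field names ncclGin and connected are taken from the snippet above. The qpInTransition and realError flags are assumptions used to model what the QP error query would report.

```cpp
#include <cassert>

// Hypothetical stand-ins for the NCCL result codes.
enum MockResult { mockSuccess = 0, mockRemoteError = 1 };

// Hypothetical model of the GIN state. Only `ncclGin` and `connected`
// mirror names from the issue; the other two flags are illustrative.
struct MockGinState {
  void* ncclGin = nullptr;      // non-null once a GIN backend is attached
  bool connected = false;       // set only after all contexts are created
  bool qpInTransition = false;  // assumption: QPs still moving INIT -> RTR -> RTS
  bool realError = false;       // assumption: a genuine QP error occurred
};

struct MockSharedRes { MockGinState ginState; };
struct MockComm { MockSharedRes* sharedRes = nullptr; };

// Models the error query: it reports an error both for genuine failures
// and (spuriously) while QPs are still transitioning.
MockResult mockGinQueryLastError(const MockGinState& s) {
  return (s.qpInTransition || s.realError) ? mockRemoteError : mockSuccess;
}

// Current behaviour: query as soon as a GIN backend exists.
MockResult mockGetAsyncErrorUngated(const MockComm* comm) {
  MockResult asyncError = mockSuccess;
  if (asyncError == mockSuccess && comm->sharedRes &&
      comm->sharedRes->ginState.ncclGin) {
    asyncError = mockGinQueryLastError(comm->sharedRes->ginState);
  }
  return asyncError;
}

// Proposed behaviour: additionally require `connected` before querying.
MockResult mockGetAsyncErrorGated(const MockComm* comm) {
  MockResult asyncError = mockSuccess;
  if (asyncError == mockSuccess && comm->sharedRes &&
      comm->sharedRes->ginState.ncclGin &&
      comm->sharedRes->ginState.connected) {
    asyncError = mockGinQueryLastError(comm->sharedRes->ginState);
  }
  return asyncError;
}
```

In this model, the ungated check returns an error while qpInTransition is set, whereas the gated check returns success until connected becomes true and then still reports realError. If the application polls for async errors from a different thread than the one issuing collectives, the read of connected would presumably also need suitable memory ordering; the sketch ignores that.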
Steps to Reproduce the Issue
No response
NCCL Version
2.28+cuda12.8
Your platform details
No response
Error Message & Behavior
No response