[Issue]: Need to guard GIN error query on ginState->connected in ncclCommGetAsyncError #2019

@goelayu

How is this issue impacting you?

Application crash

Share Your Debug Logs

ncclCommGetAsyncError() internally calls ncclGinQueryLastError(), which iterates over all GIN contexts and queries GDAKI QP error states via doca_gpu_verbs_query_last_error(). The problem is that GIN context creation (ncclGinConnectOnce → ncclGinGdakiCreateContext) happens lazily: it is triggered inside any collective call (ncclAllReduce, etc.) that needs new transport channels. During this initialization, the IB QPs are still transitioning through states (INIT → RTR → RTS), so doca_gpu_verbs_query_last_error() reports spurious errors that surface as ncclRemoteError.

We should fix this by gating the GIN error query block in ncclCommGetAsyncError() on ginState->connected, which is only set to true at the end of ncclGinConnectOnce() after all GIN contexts are fully created and QPs have completed their IB state transitions. This skips the query entirely during the vulnerable initialization window while still checking for real GIN errors once the subsystem is ready.

Basically, replace

if (*asyncError == ncclSuccess && comm->sharedRes && comm->sharedRes->ginState.ncclGin) {

with

if (*asyncError == ncclSuccess && comm->sharedRes && comm->sharedRes->ginState.ncclGin
      && comm->sharedRes->ginState.connected) {

in src/init.cc
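
To make the intended behavior concrete, here is a minimal, self-contained sketch of the guard. It does not use NCCL's real types: Result, GinState, SharedRes, Comm, mockGinQueryLastError and getAsyncError are all hypothetical stand-ins that only mirror the field names from the snippet above, and lastQpError is an invented field that plays the role of whatever the QP query would report at that moment.

```cpp
// Hypothetical stand-ins for NCCL types; names and layout are illustrative only.
enum Result { kSuccess = 0, kRemoteError = 1 };

struct GinState {
  void* ncclGin = nullptr;        // non-null once a GIN backend is present
  bool connected = false;         // assumed to be set only at the end of connect
  Result lastQpError = kSuccess;  // invented: what a QP error query would return now
};

struct SharedRes {
  GinState ginState;
};

struct Comm {
  SharedRes* sharedRes = nullptr;
  Result asyncError = kSuccess;
};

// Stand-in for the real GIN error query; it just reports the mocked QP state.
Result mockGinQueryLastError(const GinState& gin) {
  return gin.lastQpError;
}

// Sketch of the proposed check: the GIN query runs only after
// ginState.connected is true, so states observed while QPs are
// still being brought up are never surfaced to the caller.
Result getAsyncError(const Comm& comm) {
  Result err = comm.asyncError;
  if (err == kSuccess && comm.sharedRes && comm.sharedRes->ginState.ncclGin
      && comm.sharedRes->ginState.connected) {
    err = mockGinQueryLastError(comm.sharedRes->ginState);
  }
  return err;
}
```

With this guard, a poll that lands while connected is still false returns the communicator's own async error unchanged. Once connected flips to true, the same poll starts reporting GIN errors again.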

Steps to Reproduce the Issue

No response

NCCL Version

2.28+cuda12.8

Your platform details

No response

Error Message & Behavior

No response
