[BUG] fix opCount not self-increasing in single node communicator by alpha-baby · Pull Request #2021 · NVIDIA/nccl

alpha-baby · 2026-02-10T10:21:30Z

Description

In a single-node communicator, opCount does not increment.

after fix:

Related Issues

Changes & Impact

Performance Impact

xiaofanl-nvidia · 2026-02-12T04:39:27Z

@stephenmsachs can you take a look?

teou · 2026-02-12T10:18:30Z

In 2.27+:
Single node: opCount is always 0.
Multi-node: opCount increments normally.

sjeaugey · 2026-02-12T10:41:43Z

@teou can you explain what the problem is? opCount is an internal counter which we happen to dump in our INFO lines, but I'm not sure why users should care whether it is increasing or not. It is normal that it is not increasing as we are not launching a collective operation, only a local operation.

alpha-baby · 2026-02-12T10:56:18Z

> @teou can you explain what the problem is? opCount is an internal counter which we happen to dump in our INFO lines, but I'm not sure why users should care whether it is increasing or not. It is normal that it is not increasing as we are not launching a collective operation, only a local operation.

We use the opCount output in the logs for troubleshooting. After checking, I found that this problem was introduced in version 2.27.3.

sjeaugey · 2026-02-12T12:56:38Z

I'm still not sure why this should be a problem. I understand you built some tool on top of that, but arguably you should not, given this is an implementation detail and not a supported NCCL API. We may need to change how things are implemented internally and INFO logs may change at any time depending on our needs to refactor or extend NCCL.

Some questions for the team though, to confirm that this change is not causing other problems:
@gcongiu is the opCount reported in the profiler plugin?
@kiskra-nvidia same question for RAS: is RAS relying on the opCount to detect mismatches?

teou · 2026-02-13T02:46:59Z

Hi @sjeaugey,

Thank you for the explanation. I understand that opCount is an internal implementation detail of NCCL and not a user-facing API, and I also understand that fields printed in the INFO logs may change across versions.

The reason we noticed this change is mainly due to operational troubleshooting needs in production systems. When a distributed training job hangs in collective communication, there is often no time to enable profilers or introduce heavier debugging tools. We have built some simple, zero-intrusion troubleshooting tools that correlate NCCL INFO COLL logs with eBPF hooks on collective functions such as ncclAllReduce, in order to quickly identify which rank starts to behave differently (for example, based on combinations of parameters such as commHash, root, opCount, count, etc.). This allows us to make a preliminary assessment of from which collective the divergence begins (not as a definitive conclusion), and to provide additional context to developers during production system troubleshooting.

In earlier versions (e.g. 2.26.3), the opCount printed in the INFO logs (and seen from eBPF hooks) was a convenient signal for cross-rank alignment: we could use it to align the collective operation sequences of different ranks within the same communicator and provide more context to developers. However, in 2.27.7 (NGC 25.08), in a single-node multi-process setup (e.g. nranks=4), we observed that opCount in the INFO logs always remains 0, while in multi-node scenarios opCount still increments normally.
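
To make the alignment idea concrete, here is a minimal sketch of grouping already-parsed records by (commHash, opCount) and reporting the first operation where a rank is missing or disagrees. The `CollRecord` type, its field names, and the sample values are hypothetical and chosen only for illustration; this is not NCCL code and it does not assume any particular INFO log format.

```python
from collections import defaultdict
from typing import NamedTuple


class CollRecord(NamedTuple):
    # Hypothetical, already-parsed record; field names are assumptions.
    rank: int
    comm_hash: str
    op_count: int
    func: str
    count: int


def first_divergence(records, nranks):
    """Return ((comm_hash, op_count), missing_ranks) for the earliest
    operation where a rank is missing or ranks disagree, else None."""
    groups = defaultdict(dict)
    for r in records:
        # Later records with the same key overwrite earlier ones per rank.
        groups[(r.comm_hash, r.op_count)][r.rank] = r
    for key in sorted(groups):
        by_rank = groups[key]
        missing = sorted(set(range(nranks)) - set(by_rank))
        signatures = {(r.func, r.count) for r in by_rank.values()}
        if missing or len(signatures) > 1:
            return key, missing
    return None


# Rank 0 issued three collectives; rank 1 stopped after two.
incrementing = (
    [CollRecord(0, "0xabc", i, "AllReduce", 1024) for i in range(3)]
    + [CollRecord(1, "0xabc", i, "AllReduce", 1024) for i in range(2)]
)
# Same trace, but with opCount stuck at 0 as in the single-node case.
stuck_at_zero = [r._replace(op_count=0) for r in incrementing]

result_incrementing = first_divergence(incrementing, nranks=2)
result_stuck = first_divergence(stuck_at_zero, nranks=2)
```

With an incrementing opCount the sketch points at operation 2, which rank 1 never reached. With opCount stuck at 0, every record collapses onto one key and the divergence is no longer visible.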

I would like to ask:

1. Is this behavior change (i.e. opCount no longer incrementing in INFO logs in the single-node case) expected, for example due to taking a certain “local op” execution path, or is it an unintended side effect?
2. If this is expected behavior, is there a recommended, relatively more stable alternative identifier or sequencing mechanism (ideally zero-intrusion and something that can be enabled at runtime, e.g. via INFO logs) that can be used for cross-rank correlation, so that we can minimize reliance on internal fields such as opCount?
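
For illustration only, one way a tracing tool could avoid depending on opCount is to derive its own sequence index by counting the collective calls it observes per rank and per communicator. The sketch below is an assumption about how such a tool might work, not an NCCL mechanism or a recommendation from the NCCL team. The function name `assign_sequence` and the event tuple layout are hypothetical. It only yields comparable indices if every rank is traced from the first collective on the communicator and events arrive in call order.

```python
from collections import defaultdict
from itertools import count


def assign_sequence(events):
    """Attach a tool-side sequence index to each observed call.

    events: iterable of (rank, comm_hash, func) tuples, in the order
    each rank issued them (hypothetical layout).
    Returns a list of (rank, comm_hash, seq, func) tuples.
    """
    # One independent counter, starting at 0, per (rank, communicator).
    counters = defaultdict(count)
    out = []
    for rank, comm_hash, func in events:
        seq = next(counters[(rank, comm_hash)])
        out.append((rank, comm_hash, seq, func))
    return out


observed = [
    (0, "0xabc", "AllReduce"),
    (1, "0xabc", "AllReduce"),
    (0, "0xabc", "Broadcast"),
    (0, "0xdef", "AllReduce"),
]
sequenced = assign_sequence(observed)
```

The resulting index can then replace opCount as the alignment key, at the cost of being valid only for the lifetime of the tracing session.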

sjeaugey · 2026-02-13T10:47:41Z

Setting NCCL_DEBUG=INFO and having NCCL log each NCCL collective operation seems actually pretty intrusive to me.

What you describe seems to me to be what the profiler plugin was intended for, allowing the recording of every NCCL operation in a much more efficient manner. And enabling such a plugin should not be much different than enabling your current tracing tool.

That being said, if the goal is to identify ranks which are stuck and causing hangs, the RAS system does exactly that when you request a report: checking that all ranks are still there, checking where all ranks are, and printing mismatches if one rank did not reach a particular collective operation.

alpha-baby · 2026-02-16T01:28:14Z

> Setting NCCL_DEBUG=INFO and having NCCL log each NCCL collective operation seems actually pretty intrusive to me.
>
> What you describe seems to me to be what the profiler plugin was intended for, allowing the recording of every NCCL operation in a much more efficient manner. And enabling such a plugin should not be much different than enabling your current tracing tool.
>
> That being said, if the goal is to identify ranks which are stuck, and causing hangs, the RAS system is doing exactly that when you request a report: checking that all ranks are still there, and check where all ranks are, printing mismatches if one rank did not reach a particular collective operation.

Thank you for your reply. I understand that the RAS system can solve some hang problems, but at present opCount no longer increments the way it did in previous versions. As I understand it, the change in my commit restores the previous logic. If so, can it be merged?

fix opCount not self-increasing in single node communicator

c8c5835

alpha-baby changed the title ~~fix opCount not self-increasing in single node communicator~~ [BUG] fix opCount not self-increasing in single node communicator Feb 10, 2026

fix

edf1205

alpha-baby wants to merge 2 commits into NVIDIA:master from alpha-baby:fujh/fix_opCount_not_self-increasing

4 participants
