[BUG] fix opCount not self-increasing in single node communicator#2021

Open
alpha-baby wants to merge 2 commits into NVIDIA:master from alpha-baby:fujh/fix_opCount_not_self-increasing

@alpha-baby

Description

In a single-node communicator, opCount does not increment.

image

After the fix:

image

Related Issues

Changes & Impact

Performance Impact

@alpha-baby changed the title from "fix opCount not self-increasing in single node communicator" to "[BUG] fix opCount not self-increasing in single node communicator" on Feb 10, 2026
@xiaofanl-nvidia
Collaborator

@stephenmsachs can you take a look

@teou

teou commented Feb 12, 2026

In 2.27+:
Single node: opCount is always 0.
Multi-node: opCount increments normally.

@sjeaugey
Member

@teou can you explain what the problem is? opCount is an internal counter which we happen to dump in our INFO lines, but I'm not sure why users should care whether it is increasing or not. It is normal that it is not increasing as we are not launching a collective operation, only a local operation.

@alpha-baby
Author

> @teou can you explain what the problem is? opCount is an internal counter which we happen to dump in our INFO lines, but I'm not sure why users should care whether it is increasing or not. It is normal that it is not increasing as we are not launching a collective operation, only a local operation.

We use the opCount output in the logs for troubleshooting. After checking, I found that this behavior was introduced in version 2.27.3.

@sjeaugey
Member

I'm still not sure why this should be a problem. I understand you built some tool on top of that, but arguably you should not, given this is an implementation detail and not a supported NCCL API. We may need to change how things are implemented internally and INFO logs may change at any time depending on our needs to refactor or extend NCCL.

Some questions for the team, though, to confirm that this change does not cause other problems:
@gcongiu is the opCount reported in the profiler plugin?
@kiskra-nvidia same question for RAS: does RAS rely on the opCount to detect mismatches?

@teou

teou commented Feb 13, 2026

Hi @sjeaugey,

Thank you for the explanation. I understand that opCount is an internal implementation detail of NCCL and not a user-facing API, and I also understand that fields printed in the INFO logs may change across versions.

We noticed this change mainly because of operational troubleshooting needs in production systems. When a distributed training job hangs in collective communication, there is often no time to enable profilers or introduce heavier debugging tools. We have built some simple, zero-intrusion troubleshooting tools that correlate NCCL INFO COLL logs with eBPF hooks on collective functions such as ncclAllReduce, in order to quickly identify which rank starts to behave differently (for example, based on combinations of parameters such as commHash, root, opCount, count, etc.). This lets us make a preliminary assessment of the collective at which the divergence begins (not a definitive conclusion) and provide additional context to developers during production troubleshooting.

In earlier versions (e.g. 2.26.3), the opCount printed in the INFO logs (and captured by the eBPF hooks) was a convenient signal for cross-rank alignment: we could use it to align the collective operation sequences of different ranks within the same communicator and provide more context to developers. However, in 2.27.7 (NGC 25.08), in a single-node multi-process setup (e.g. nranks=4), we observed that opCount in the INFO logs always remains 0, while in multi-node scenarios opCount still increments normally.

I would like to ask:

  • Is this behavior change (i.e. opCount no longer incrementing in INFO logs in the single-node case) expected, for example due to taking a certain “local op” execution path, or is it an unintended side effect?
  • If this is expected behavior, is there a recommended, relatively more stable alternative identifier or sequencing mechanism (ideally zero-intrusion and something that can be enabled at runtime, e.g. via INFO logs) that can be used for cross-rank correlation, so that we can minimize reliance on internal fields such as opCount?
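
To illustrate the cross-rank alignment described in this comment, here is a minimal sketch. It assumes the log lines or eBPF events have already been parsed into records; the `CollRecord` type, its field names, and the `first_divergence` helper are hypothetical and are not part of NCCL or of the tool mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class CollRecord(NamedTuple):
    """One collective call as seen by one rank (hypothetical, pre-parsed record)."""
    rank: int
    comm_hash: str
    op_count: int
    coll: str
    count: int


Divergence = Tuple[int, Dict[int, Tuple[str, int]]]


def first_divergence(records: Iterable[CollRecord]) -> Dict[str, Divergence]:
    """For each communicator, find the lowest op_count at which ranks disagree.

    A divergence means that some rank seen on the communicator has no record
    for that op_count, or that ranks recorded different (coll, count) pairs.
    Communicators whose ranks all agree are omitted from the result.
    """
    # comm_hash -> op_count -> rank -> (coll, count)
    seen = defaultdict(lambda: defaultdict(dict))
    ranks = defaultdict(set)
    for rec in records:
        seen[rec.comm_hash][rec.op_count][rec.rank] = (rec.coll, rec.count)
        ranks[rec.comm_hash].add(rec.rank)

    result: Dict[str, Divergence] = {}
    for comm_hash, by_op in seen.items():
        for op_count in sorted(by_op):
            calls = by_op[op_count]
            missing = ranks[comm_hash] - set(calls)
            if missing or len(set(calls.values())) > 1:
                result[comm_hash] = (op_count, dict(calls))
                break
    return result
```

Note that this keying only works if opCount advances with each collective: if it stays at 0, every call from a rank lands on the same key and later records overwrite earlier ones, so the sequences can no longer be aligned this way.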

@sjeaugey
Member

Setting NCCL_DEBUG=INFO and having NCCL log each NCCL collective operation seems actually pretty intrusive to me.

What you describe seems to me to be what the profiler plugin was intended for, allowing the recording of every NCCL operation in a much more efficient manner. And enabling such a plugin should not be much different than enabling your current tracing tool.

That being said, if the goal is to identify ranks which are stuck and causing hangs, the RAS system does exactly that when you request a report: it checks that all ranks are still there and where they are, and prints mismatches if one rank did not reach a particular collective operation.
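
For reference, the per-collective INFO logging discussed in this thread is turned on through environment variables. The sketch below builds such an environment for a child process; `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_DEBUG_FILE` are documented NCCL variables, while the helper name `nccl_coll_logging_env` and the log path in the usage note are assumptions made for illustration.

```python
import os
from typing import Dict, Mapping, Optional


def nccl_coll_logging_env(
    base: Optional[Mapping[str, str]] = None,
    log_file: Optional[str] = None,
) -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with NCCL collective logging on.

    Hypothetical helper: it only sets environment variables and does not
    start or configure NCCL itself.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"         # INFO-level debug output
    env["NCCL_DEBUG_SUBSYS"] = "COLL"  # restrict INFO output to collective calls
    if log_file is not None:
        # NCCL expands %h (hostname) and %p (process ID) in this path.
        env["NCCL_DEBUG_FILE"] = log_file
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching a job, for example with `log_file="/tmp/nccl.%h.%p.log"` (an illustrative path) so that each process writes its own file.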

@alpha-baby
Author

> Setting NCCL_DEBUG=INFO and having NCCL log each NCCL collective operation seems actually pretty intrusive to me.
>
> What you describe seems to me to be what the profiler plugin was intended for, allowing the recording of every NCCL operation in a much more efficient manner. And enabling such a plugin should not be much different than enabling your current tracing tool.
>
> That being said, if the goal is to identify ranks which are stuck and causing hangs, the RAS system does exactly that when you request a report: it checks that all ranks are still there and where they are, and prints mismatches if one rank did not reach a particular collective operation.

Thank you for your reply. I understand that the RAS system can solve some hang problems, but at present opCount no longer increments strictly, as it did under the logic of previous versions. My understanding is that the change in this commit restores the previous logic. If so, can it be merged?
