[BUG] fix opCount not self-increasing in single node communicator#2021
alpha-baby wants to merge 2 commits into NVIDIA:master
Conversation
@stephenmsachs can you take a look
2.27+
@teou can you explain what the problem is? opCount is an internal counter which we happen to dump in our INFO lines, but I'm not sure why users should care whether it is increasing or not. It is normal that it is not increasing as we are not launching a collective operation, only a local operation.
We used the opCount output in the logs to troubleshoot the issue. After checking, I found that this problem was introduced in version 2.27.3.
I'm still not sure why this should be a problem. I understand you built some tool on top of that, but arguably you should not, given this is an implementation detail and not a supported NCCL API. We may need to change how things are implemented internally, and INFO logs may change at any time depending on our needs to refactor or extend NCCL. Some questions for the team though, to confirm that this change is not causing other problems:
hi @sjeaugey Thank you for the explanation. I understand that The reason we noticed this change is mainly due to operational troubleshooting needs in production systems. When a distributed training job hangs in collective communication, there is often no time to enable profilers or introduce heavier debugging tools. We have built some simple, zero-intrusion troubleshooting tools that correlate NCCL INFO In earlier versions (e.g. 2.26.3), the I would like to ask:
Setting NCCL_DEBUG=INFO and having NCCL log each NCCL collective operation actually seems pretty intrusive to me. What you describe seems to me to be what the profiler plugin was intended for: recording every NCCL operation in a much more efficient manner. Enabling such a plugin should not be much different from enabling your current tracing tool. That being said, if the goal is to identify ranks which are stuck and causing hangs, the RAS system does exactly that when you request a report: it checks that all ranks are still there and where each rank is, and prints mismatches if a rank did not reach a particular collective operation.
Thank you for your reply. I understand that the RAS system can solve some hang problems, but at present opCount no longer increases strictly the way it did in previous versions. As I understand it, my commit restores the previous logic. If so, can it be merged?
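For readers following the thread, the kind of log-based check being discussed can be sketched roughly as below. This is only an illustration, not the tool mentioned above and not a supported NCCL interface. The line layout matched by `LINE_RE`, the sample lines, and the helper names `parse`, `last_op_counts`, `lagging` and `repeated` are all assumptions made for this example. As noted in the conversation, INFO output is an implementation detail and may change at any time.

```python
import re

# ASSUMPTION: lines look like "<host>:<pid>:<tid> [<n>] NCCL INFO <Op>: opCount <hex> ...".
# The bracketed number is used here only as a per-process key.
LINE_RE = re.compile(r"\[(\d+)\] NCCL INFO (\w+): opCount ([0-9a-fA-F]+)\b")


def parse(lines):
    """Yield (key, op_count) for every line that carries an opCount field."""
    for line in lines:
        m = LINE_RE.search(line)
        if m is not None:
            # opCount is parsed as hexadecimal (an assumption about the log format).
            yield int(m.group(1)), int(m.group(3), 16)


def last_op_counts(lines):
    """Return {key: highest opCount seen for that key}."""
    last = {}
    for key, op_count in parse(lines):
        last[key] = max(op_count, last.get(key, -1))
    return last


def lagging(last):
    """Keys whose highest opCount is behind the furthest-advanced key."""
    if not last:
        return []
    front = max(last.values())
    return sorted(k for k, c in last.items() if c < front)


def repeated(lines):
    """(key, opCount) pairs where the counter did not increase between two lines."""
    prev, out = {}, []
    for key, op_count in parse(lines):
        if key in prev and op_count <= prev[key]:
            out.append((key, op_count))
        prev[key] = op_count
    return out


# Hypothetical sample lines, written for this sketch (not real NCCL output).
SAMPLE = [
    "node1:100:100 [0] NCCL INFO AllReduce: opCount 0 count 1024",
    "node1:100:100 [0] NCCL INFO AllReduce: opCount 1 count 1024",
    "node1:101:101 [1] NCCL INFO AllReduce: opCount 0 count 1024",
    "node1:100:100 [0] NCCL INFO AllReduce: opCount 1 count 1024",
]

if __name__ == "__main__":
    print(last_op_counts(SAMPLE))
    print(lagging(last_op_counts(SAMPLE)))
    print(repeated(SAMPLE))
```

A check like `repeated` is what stops being meaningful when the counter is not incremented for local operations, which is the behavior change this PR is about.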
Description
In a single-node communicator, opCount does not increment.
after fix:
Related Issues
Changes & Impact
Performance Impact