How is this issue impacting you?
Application crash
Share Your Debug Logs
cudaMemcpyBatchAsync fails with an invalid attribute error when the default stream is passed to it.
The latest PyTorch now defaults to using the user stream when async_op=False. So when we use ce_collectives with PyTorch in async_op=False mode, the default stream gets passed to cudaMemcpyBatchAsync and NCCL fails with: RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Maybe this can't be solved in NCCL, but if so, it would be better to handle it gracefully by raising a clear error saying that the stream can't be the default stream.
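To illustrate the kind of graceful handling meant here, below is a minimal sketch of a guard that could run before the batched copy is enqueued. It is not NCCL's actual code: `StreamHandle`, `isLegacyDefaultStream`, and `requireExplicitStream` are hypothetical names, and `StreamHandle` is a stand-in for `cudaStream_t` so the snippet compiles without the CUDA toolkit. It assumes the failure comes from the legacy default stream, which the CUDA headers represent as the null handle or `cudaStreamLegacy` (0x1); whether the per-thread default stream is also affected is not something this report establishes.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Stand-in for CUDA's opaque cudaStream_t (assumption: lets the sketch build without CUDA headers).
using StreamHandle = void*;

// Returns true for the two handle values that name the legacy default stream:
// the null stream (0) and cudaStreamLegacy (0x1).
inline bool isLegacyDefaultStream(StreamHandle stream) {
  const auto value = reinterpret_cast<std::uintptr_t>(stream);
  return value == 0x0 || value == 0x1;
}

// Hypothetical guard: fail early with a descriptive message instead of
// surfacing a generic "unhandled cuda error" later.
inline void requireExplicitStream(StreamHandle stream, const char* caller) {
  if (isLegacyDefaultStream(stream)) {
    throw std::invalid_argument(
        std::string(caller) +
        ": the default stream cannot be used for batched async copies; "
        "pass a stream created with cudaStreamCreate instead");
  }
}
```

A caller would invoke `requireExplicitStream(stream, "ce_collectives")` just before the copy, so the user sees which stream was rejected and why. In real code the same check would more likely return an NCCL error code with a warning than throw.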
Steps to Reproduce the Issue
No response
NCCL Version
2.29.2+cuda13.1
Your platform details
No response
Error Message & Behavior
No response