[Issue]: CE collectives fail when run on the default CUDA stream #2003

### How is this issue impacting you?

Application crash

### Share Your Debug Logs

`cudaMemcpyBatchAsync` fails with `invalid attribute error` when the default stream is passed to it.

The latest PyTorch now defaults to running collectives on the user stream when `async_op=False`. So when we use ce_collectives with PyTorch in `async_op=False` mode, the default stream gets passed to `cudaMemcpyBatchAsync`, and NCCL fails with `RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)`


Maybe this can't be solved in NCCL, but if so, it would be better to handle it gracefully by raising an error stating that the stream can't be the default stream.
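
To make the suggestion concrete, here is a rough sketch of the kind of check that could run before the batched copy is issued. It is not NCCL code: `isLegacyDefaultStream`, `checkCeStream`, and the `CeStreamCheck` enum are hypothetical names, and it assumes the restriction applies to the legacy default stream (handle `0` or `cudaStreamLegacy`), which is how I read the failure above. The stand-in typedef is only there so the snippet compiles without the CUDA toolkit.

```cpp
#include <cstdio>

// Use the real CUDA types when the toolkit headers are available.
#ifdef __has_include
#  if __has_include(<cuda_runtime.h>)
#    include <cuda_runtime.h>
#    define HAVE_CUDA_RUNTIME 1
#  endif
#endif
#ifndef HAVE_CUDA_RUNTIME
// Stand-ins mirroring the declarations in the CUDA runtime headers.
struct CUstream_st;
typedef CUstream_st* cudaStream_t;
#define cudaStreamLegacy ((cudaStream_t)0x1)
#endif

// Hypothetical result codes; real code would map these onto NCCL's own errors.
enum CeStreamCheck { kCeStreamOk = 0, kCeStreamInvalid = 1 };

// Hypothetical helper: true if `stream` refers to the legacy default stream.
// Assumption: handle 0 means the legacy stream, i.e. the library was not
// built with per-thread default streams.
static bool isLegacyDefaultStream(cudaStream_t stream) {
  return stream == nullptr || stream == cudaStreamLegacy;
}

// Hypothetical guard: reject the stream with a readable message instead of
// letting the batched copy fail later with an opaque CUDA error.
static CeStreamCheck checkCeStream(cudaStream_t stream) {
  if (isLegacyDefaultStream(stream)) {
    std::fprintf(stderr,
                 "CE collectives cannot run on the default CUDA stream; "
                 "pass a stream created with cudaStreamCreate instead\n");
    return kCeStreamInvalid;
  }
  return kCeStreamOk;
}
```

A caller would run `checkCeStream(stream)` at the start of the CE collective path and return early on `kCeStreamInvalid`, so the user sees the message above rather than `unhandled cuda error`.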

### Steps to Reproduce the Issue

_No response_

### NCCL Version

2.29.2+cuda13.1

### Your platform details

_No response_

### Error Message & Behavior

_No response_
