[Issue]: CE collectives fail if they are run on default cuda stream #2003

@ngoyal2707

How is this issue impacting you?

Application crash

Share Your Debug Logs

cudaMemcpyBatchAsync fails with an invalid attribute error when the default stream is passed to it.

The latest PyTorch now defaults to using the user's stream when async_op=False. So when we use ce_collectives with PyTorch in async_op=False mode, the default stream gets passed to cudaMemcpyBatchAsync and NCCL fails with RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

Maybe this can't be solved in NCCL, but if so, it would be better to handle it gracefully by raising an error saying that the stream can't be the default stream.
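To make the request concrete, here is a minimal Python sketch of the kind of guard meant above, plus a possible caller-side workaround. The helper names (`is_default_stream_handle`, `check_stream_for_ce_collective`, `all_reduce_off_default_stream`) are hypothetical and not part of NCCL or PyTorch. The sketch assumes that the null stream and the two sentinel handles from the CUDA runtime headers (`cudaStreamLegacy`, `cudaStreamPerThread`) are all rejected; only the null/default stream case has actually been observed here. It has not been verified that moving the collective to a side stream avoids the failure.

```python
# Raw handle values of CUDA's special streams (from the CUDA runtime headers).
CUDA_STREAM_NULL = 0x0        # the implicit default (null) stream
CUDA_STREAM_LEGACY = 0x1      # cudaStreamLegacy
CUDA_STREAM_PER_THREAD = 0x2  # cudaStreamPerThread

# Assumption: all three are unsuitable for CE collectives.
_DEFAULT_HANDLES = {CUDA_STREAM_NULL, CUDA_STREAM_LEGACY, CUDA_STREAM_PER_THREAD}


def is_default_stream_handle(handle: int) -> bool:
    """Return True if the raw stream handle names a default stream."""
    return handle in _DEFAULT_HANDLES


def check_stream_for_ce_collective(handle: int) -> None:
    """Hypothetical guard: fail early with a readable message."""
    if is_default_stream_handle(handle):
        raise ValueError(
            "CE collectives cannot run on the default CUDA stream; "
            "launch the collective on a non-default stream instead"
        )


def all_reduce_off_default_stream(tensor):
    """Possible workaround: run a blocking all_reduce on a side stream."""
    # Imported lazily so the guard helpers above work without PyTorch.
    import torch
    import torch.distributed as dist

    current = torch.cuda.current_stream()
    if not is_default_stream_handle(current.cuda_stream):
        # Already on a user-created stream, nothing to do.
        dist.all_reduce(tensor, async_op=False)
        return tensor

    side = torch.cuda.Stream()
    # The side stream must see work already queued on the current stream.
    side.wait_stream(current)
    with torch.cuda.stream(side):
        dist.all_reduce(tensor, async_op=False)
    # Later work on the current stream must wait for the collective.
    current.wait_stream(side)
    return tensor
```

With an initialized process group, `all_reduce_off_default_stream(t)` would replace a plain `dist.all_reduce(t)` call. The guard itself only needs the raw handle, so the same check could in principle live on the NCCL side before the cudaMemcpyBatchAsync call.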

Steps to Reproduce the Issue

No response

NCCL Version

2.29.2+cuda13.1

Your platform details

No response

Error Message & Behavior

No response
