
Conversation

@Chamberlain0w0
Contributor

No description provided.

@Chamberlain0w0 Chamberlain0w0 changed the title from "[WIP] feat: DDP gradient bucketing" to "feat: DDP gradient bucketing" on Nov 17, 2025
@Chamberlain0w0 Chamberlain0w0 force-pushed the feature/gradient_bucketing branch from 19c9727 to ed1a608 on November 25, 2025 10:00
@Chamberlain0w0
Contributor Author

The original stream wait logic was wrong: right after each bucket's allreduce call, the compute stream was made to wait on that bucket's done_event, so communication and computation did not overlap at all. The wait is now deferred until after all buckets have launched their allreduces.

To support this, the caller uses the wait operation provided by Work, and Work now exposes two variants: WaitBlocking and WaitNonBlocking. The former is a CPU-side cudaEventSynchronize, which matches torch's behavior; the latter is a cudaStreamWaitEvent that merely inserts a wait point into the stream without blocking the host.
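
A minimal sketch of the two wait flavors and the deferred-wait pattern described above. Only `Work`, `WaitBlocking`, `WaitNonBlocking`, and `done_event` come from this PR; the bucket type, the `launchAllReduce` placeholder, and the event plumbing are hypothetical, not the actual implementation:

```cpp
// Hypothetical sketch, not the PR's real code. Work wraps a CUDA event
// recorded on the communication stream after a bucket's allreduce.
#include <cuda_runtime.h>
#include <vector>

class Work {
public:
    explicit Work(cudaEvent_t doneEvent) : doneEvent_(doneEvent) {}

    // CPU-side wait: blocks the host until the collective has finished,
    // analogous to torch's blocking Work::wait().
    void WaitBlocking() { cudaEventSynchronize(doneEvent_); }

    // Stream-side wait: makes `stream` wait for the collective without
    // blocking the host; later kernels on `stream` are ordered after it.
    void WaitNonBlocking(cudaStream_t stream) {
        cudaStreamWaitEvent(stream, doneEvent_, 0);
    }

private:
    cudaEvent_t doneEvent_;
};

// Deferred-wait pattern: launch every bucket's allreduce on the comm
// stream first, then insert the stream waits, so communication for
// earlier buckets overlaps with the remaining backward computation.
void allReduceBuckets(std::vector<void*>& buckets,  // void* stands in for a Bucket type
                      cudaStream_t commStream, cudaStream_t computeStream) {
    std::vector<Work> works;
    works.reserve(buckets.size());
    for (auto* bucket : buckets) {
        // launchAllReduce(bucket, commStream);  // stand-in for the real
        //                                       // collective, e.g. ncclAllReduce
        cudaEvent_t done;
        cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
        cudaEventRecord(done, commStream);
        works.emplace_back(done);
    }
    // Only after all buckets are in flight does the compute stream wait.
    for (auto& w : works) {
        w.WaitNonBlocking(computeStream);
    }
}
```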

@kilinchange
Collaborator

kilinchange commented Nov 27, 2025

Work usage and example:
[screenshot: Work usage example]
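
The screenshot itself is not recoverable; the following hedged reconstruction reuses the `Work` sketch above and shows what such a usage might look like. The helper wiring is an assumption:

```cpp
// Hypothetical usage example (not the screenshot's actual code): enqueue
// a collective, keep the returned Work, and pick a wait flavor.
void example(cudaStream_t commStream, cudaStream_t computeStream) {
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    // ... enqueue an allreduce on commStream here ...
    cudaEventRecord(done, commStream);
    Work work(done);

    // Overlap-friendly path: computeStream waits in-device, host keeps going.
    work.WaitNonBlocking(computeStream);

    // Synchronous path: host blocks until the collective is done,
    // matching torch's blocking Work::wait().
    work.WaitBlocking();
}
```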

@kilinchange kilinchange merged commit e7d57db into master Nov 27, 2025
2 checks passed
@kilinchange kilinchange deleted the feature/gradient_bucketing branch November 27, 2025 06:41