
Conversation

@rattus128 (Contributor) commented Nov 1, 2025

Draft of a generic module prefetcher. It implements the core feature and gives one example of how to use it with QWEN.

This gets very close to compute saturation, whereas --async-offload as-is still has a few compute stalls.

Leaving as a draft for now, as I am still trying to find a better way.

Start Comfy with QWEN to try it out. You need the following startup args:

--async-offload --fast pinned_memory --reserve-vram 3 

It consumes a bit of extra VRAM, so you need --reserve-vram to avoid OOMing.
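For context, the general technique here is overlapping host-to-device weight copies with compute on a side CUDA stream. A minimal illustrative sketch, not this PR's actual code (`prefetch` and `run` are hypothetical names, and lifetime management of the CPU copies is omitted):

```python
import torch
import torch.nn as nn

# Illustrative only: upload layer N+1's weights on a side CUDA stream
# while layer N computes, so the H2D copy hides behind the compute.
copy_stream = torch.cuda.Stream()

def prefetch(module: nn.Module):
    # Pinned CPU memory (--fast pinned_memory) makes these copies truly async.
    with torch.cuda.stream(copy_stream):
        for p in module.parameters():
            p.data = p.data.to("cuda", non_blocking=True)

def run(layers, x):
    if layers:
        prefetch(layers[0])
    for i, layer in enumerate(layers):
        # The compute stream must wait until this layer's weights have landed.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(layers):
            prefetch(layers[i + 1])  # overlaps with the compute below
        x = layer(x)
    return x
```

The extra VRAM cost comes from holding at least two modules' weights on the device at once, which is why --reserve-vram is needed.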

@rattus128 rattus128 force-pushed the prs/prefetching branch 2 times, most recently from ec37c80 to 944c3cc on November 3, 2025 at 23:56
@rattus128 rattus128 changed the title Implement asynchronous module prefetching (QWEN only so far) Implement asynchronous module prefetching (QWEN+WAN so far) Nov 3, 2025
@rattus128 (Contributor, Author) commented:

Added WAN support

@contentis (Contributor) commented:

I wasn't able to check the PR yet, but have you looked at GroupOffloading from diffusers: https://github.com/huggingface/diffusers/blob/main/src/diffusers/hooks/group_offloading.py ?

It is similar but should have the advantage of not requiring any model code changes.

@rattus128 (Contributor, Author) commented:

> I wasn't able to check the PR yet, but have you looked at GroupOffloading from diffusers: https://github.com/huggingface/diffusers/blob/main/src/diffusers/hooks/group_offloading.py ?
>
> It is similar but should have the advantage of not requiring any model code changes.

I had a very quick skim, though. I see it has awareness of nn.ModuleList, which may actually short-circuit the prefetching block code instrumentation I did here and make it frictionless. It's definitely a good idea if going with long-range prefetchers.

That approach is slightly fragile in that a model author could do something weird or have multiple or hierarchical lists, whereas this open-coded system gives you just that tiny bit of control a model author might want anyway.

The design goal is simplicity at the moment; ideally we get away with totally generic layer-level prefetching as just an incremental improvement to --async-offload.
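For illustration, the ModuleList-aware variant being discussed could look roughly like this. A sketch, not diffusers' actual implementation; `prefetch_fn` is assumed to be an async weight-upload callable like the one sketched above:

```python
import torch.nn as nn

def install_prefetch_hooks(model: nn.Module, prefetch_fn):
    # Walk the model, find nn.ModuleList blocks, and hook each block so
    # that running block i triggers the upload of block i+1. No model
    # code changes required, but nested or multiple lists need care.
    for child in model.modules():
        if isinstance(child, nn.ModuleList):
            blocks = list(child)
            for i, block in enumerate(blocks[:-1]):
                nxt = blocks[i + 1]

                def hook(mod, args, nxt=nxt):
                    prefetch_fn(nxt)  # returning None leaves inputs untouched

                block.register_forward_pre_hook(hook)
```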

Implement an API that allows instrumenting a model with a prefetch
queue. Units of work are at the nn.Module level.
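In sketch form, such an API might be shaped like this (hypothetical names; the PR's real interface may differ):

```python
from collections import deque
import torch.nn as nn

class PrefetchQueue:
    """Hypothetical sketch: modules are enqueued in execution order and
    popped by the offload runtime to decide what to upload next."""

    def __init__(self, prefetch_fn):
        self._queue = deque()
        self._prefetch_fn = prefetch_fn

    def enqueue(self, module: nn.Module):
        self._queue.append(module)

    def kick(self):
        # Start the async upload of the next unit of work, if any.
        if self._queue:
            self._prefetch_fn(self._queue.popleft())
```

A model's forward would then call kick() as it enters each block, keeping the copy stream one module ahead of compute.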
