
Conversation

@rattus128 (Contributor) commented Nov 1, 2025

Draft of a generic module prefetcher. It implements the core feature and gives one example of how to use it with QWEN.

This gets very close to compute saturation, whereas --async-offload as-is still has a few compute stalls.

Leaving as a draft for now, as I am still trying to find a better way.

Start Comfy with QWEN to try it out. You need the following startup args:

--async-offload --fast pinned_memory --reserve-vram 3 

It consumes a bit of extra VRAM, so you need --reserve-vram to avoid OOMing.
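For context, the general technique here is overlapping host-to-device weight copies with compute on a side CUDA stream. A minimal illustrative sketch, not this PR's actual code (`prefetch` and `run` are hypothetical names, and lifetime management of the CPU copies is omitted):

```python
import torch
import torch.nn as nn

# Illustrative only: upload layer N+1's weights on a side CUDA stream
# while layer N computes, so the H2D copy hides behind the compute.
copy_stream = torch.cuda.Stream()

def prefetch(module: nn.Module):
    # Pinned CPU memory (--fast pinned_memory) makes these copies truly async.
    with torch.cuda.stream(copy_stream):
        for p in module.parameters():
            p.data = p.data.to("cuda", non_blocking=True)

def run(layers, x):
    if layers:
        prefetch(layers[0])
    for i, layer in enumerate(layers):
        # The compute stream must wait until this layer's weights have landed.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(layers):
            prefetch(layers[i + 1])  # overlaps with the compute below
        x = layer(x)
    return x
```

The extra VRAM cost comes from holding at least two modules' weights on the device at once, which is why --reserve-vram is needed.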

@rattus128 rattus128 force-pushed the prs/prefetching branch 2 times, most recently from ec37c80 to 944c3cc on November 3, 2025 at 23:56
@rattus128 rattus128 changed the title Implement asynchronous module prefetching (QWEN only so far) Implement asynchronous module prefetching (QWEN+WAN so far) Nov 3, 2025
@rattus128 (Contributor, Author) commented:

Added WAN support

@contentis (Contributor) commented:

I wasn't able to check the PR yet, but have you looked at GroupOffloading from diffusers: https://github.com/huggingface/diffusers/blob/main/src/diffusers/hooks/group_offloading.py ?

It is similar but should have the advantage of not requiring any model code changes.

@rattus128 (Contributor, Author) commented:

> I wasn't able to check the PR yet, but have you looked at GroupOffloading from diffusers: https://github.com/huggingface/diffusers/blob/main/src/diffusers/hooks/group_offloading.py ?
>
> It is similar but should have the advantage of not requiring any model code changes.

I had a very quick skim, though. I see it has awareness of nn.ModuleList, which may actually short-circuit the prefetching block code instrumentation I did here and make it frictionless. It's definitely a good idea if going with long-range prefetchers.

That approach is slightly fragile in that a model author could do something weird or have multiple or hierarchical lists, whereas this open-coded system gives you just that tiny bit of control a model author might want anyway.

The design goal is simplicity at the moment; ideally we get away with totally generic layer-level prefetching as just an incremental improvement to --async-offload.
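For illustration, the ModuleList-aware variant being discussed could look roughly like this. A sketch, not diffusers' actual implementation; `prefetch_fn` is assumed to be an async weight-upload callable like the one sketched above:

```python
import torch.nn as nn

def install_prefetch_hooks(model: nn.Module, prefetch_fn):
    # Walk the model, find nn.ModuleList blocks, and hook each block so
    # that running block i triggers the upload of block i+1. No model
    # code changes required, but nested or multiple lists need care.
    for child in model.modules():
        if isinstance(child, nn.ModuleList):
            blocks = list(child)
            for i, block in enumerate(blocks[:-1]):
                nxt = blocks[i + 1]

                def hook(mod, args, nxt=nxt):
                    prefetch_fn(nxt)  # returning None leaves inputs untouched

                block.register_forward_pre_hook(hook)
```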

Implement an API that allows instrumenting a model with a prefetch
queue. Units of work are at the nn.Module level.
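In sketch form, such an API might be shaped like this (hypothetical names; the PR's real interface may differ):

```python
from collections import deque
import torch.nn as nn

class PrefetchQueue:
    """Hypothetical sketch: modules are enqueued in execution order and
    popped by the offload runtime to decide what to upload next."""

    def __init__(self, prefetch_fn):
        self._queue = deque()
        self._prefetch_fn = prefetch_fn

    def enqueue(self, module: nn.Module):
        self._queue.append(module)

    def kick(self):
        # Start the async upload of the next unit of work, if any.
        if self._queue:
            self._prefetch_fn(self._queue.popleft())
```

A model's forward would then call kick() as it enters each block, keeping the copy stream one module ahead of compute.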
