RAM cache implementation - part II #10779
base: master
Conversation
Hey, is this PR in a state where it can be taken off draft + reviewed, or is it still in the oven?
Hey, we are stuck on draft as it conflicts with async offloading and I will need to do a small rebase and retest. Feel free to review though.
647d88a to ff66d8a (Compare)
comfy_extras/nodes_custom_sampler.py (Outdated)
outputs=[io.Sigmas.Output()]
)

@classmethod
This should probably be in the v3 schema stuff instead of on every node.
@Kosinkadink What do you think?
I think I would be able to make it so that in v3, the check_lazy_status function would be autogenerated if at least one input is marked as lazy and no override was created. For this PR, I think having it be per node is fine, and I can edit these when I add that autogeneration.
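For reference, a rough sketch of what that autogenerated default could behave like, written against the existing v1-style check_lazy_status convention; the node, input names, and the v3 hook shape here are illustrative assumptions, not part of this PR:

```python
# Illustrative sketch only: a default check_lazy_status that requests every
# lazy input which has not been evaluated yet. Node and input names are made up.
class ExampleLazyNode:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "sigmas_a": ("SIGMAS", {"lazy": True}),
                "sigmas_b": ("SIGMAS", {"lazy": True}),
            }
        }

    def check_lazy_status(self, sigmas_a=None, sigmas_b=None):
        # An autogenerated override could simply request every lazy input
        # whose value is still None (i.e. not yet computed upstream).
        needed = []
        if sigmas_a is None:
            needed.append("sigmas_a")
        if sigmas_b is None:
            needed.append("sigmas_b")
        return needed
```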
I've backed out the lazy changes so we can get the loader awareness in, in its own right. There's some ongoing discussion about how to do the lazy stuff.
58267a8 to ac4378e (Compare)
Move the headroom logic into the RAM cache so it is easier to call as "free me some RAM". Rename the API to free_ram(). Split the clean_list creation off into a completely separate function to avoid any stray strong reference to the content-to-be-freed lingering on the stack.
Add the free_ram() API and a means to install implementations of the freer (i.e. the RAM cache).
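A minimal sketch of that shape, assuming made-up names (register_ram_freer, the psutil headroom check); the real hook lives in comfy.model_management and may differ:

```python
import psutil

# Hypothetical sketch: freer implementations (e.g. the RAM cache) register a
# callback, and free_ram() asks them to release host RAM until enough is free.
_ram_freers = []

def register_ram_freer(freer):
    """Install a callable that can release host RAM on request."""
    _ram_freers.append(freer)

def free_ram(bytes_needed):
    """Ask installed freers for headroom until bytes_needed RAM is available."""
    for freer in _ram_freers:
        if psutil.virtual_memory().available >= bytes_needed:
            break
        freer(bytes_needed)
```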
Currently this hard-assumes that the caller of model_unload will keep current_loaded_models in sync. With RAMPressureCache it's possible for garbage collection to occur in the middle of the model-free process, which can split these two steps.
This is currently put together as a list of indexes, assuming current_loaded_models doesn't change. However, we might need to purge a model as part of the offload process, which means this list can change in the middle of the freeing process. Handle this by taking independent references to the LoadedModel objects and doing safe by-value deletion from current_loaded_models.
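A simplified sketch of the by-value deletion, reusing the existing current_loaded_models / LoadedModel / model_unload() names; the body is illustrative, not the actual implementation:

```python
# Take independent references first; an index list would go stale if the RAM
# cache purges an entry from current_loaded_models while we are offloading.
def free_loaded_models(current_loaded_models, indexes_to_free):
    to_free = [current_loaded_models[i] for i in indexes_to_free]
    for loaded_model in to_free:
        loaded_model.model_unload()
        # By-value removal tolerates the entry having already been dropped by
        # a garbage-collection pass that ran mid-free.
        if loaded_model in current_loaded_models:
            current_loaded_models.remove(loaded_model)
```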
RAMPressure caching may need to purge the same model that is currently being offloaded to free VRAM. In this case the RAMPressure cache takes priority and needs to be able to pull the trigger on dumping the whole model and freeing the ModelPatcher in question. To do this, defer the actual transfer of model weights from GPU to RAM to model_management state rather than handling it inside ModelPatcher. This is done as a list of weakrefs. If the RAM cache decides to free the model currently being unloaded, the ModelPatcher and the refs simply disappear in the middle of the unloading process, and both RAM and VRAM are freed.

The unpatcher now queues the individual leaf modules to be offloaded one by one so that RAM levels can be monitored. Note that the UnloadPartially that is potentially done as part of a load will not be freeable this way, but it shouldn't be anyway: that is the currently active model, and the RAM cache cannot save you if you can't even fit the one model you are currently trying to use.
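A rough sketch of the weakref-based deferred offload, with assumed names (offload_queue, queue_offload, drain_offload_queue) and a torch.nn.Module model; the real bookkeeping lives in model_management:

```python
import weakref

# Leaf modules are queued as weak references; if the RAM cache frees the whole
# ModelPatcher mid-unload, the refs go dead and both copies are released.
offload_queue = []

def queue_offload(model):
    for module in model.modules():
        if len(list(module.children())) == 0:  # leaf modules only
            offload_queue.append(weakref.ref(module))

def drain_offload_queue(free_ram, headroom_bytes):
    while offload_queue:
        module = offload_queue.pop(0)()
        if module is None:
            continue  # the model was purged; nothing left to transfer
        free_ram(headroom_bytes)   # keep RAM headroom before each transfer
        module.to("cpu")           # move this leaf's weights from GPU to RAM
```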
87568ba to e65e642 (Compare)

This PR improves the robustness of the RAM cache implementation. It makes the RAM cache much friendlier to use and avoids users needing to size the cache specifically for their workflow. It also avoids OOMs in more cases, especially flows with multiple large models. There are three key changes:
1: Loosening the executor's pre-emptive cache pin on models, so that cached models late in a workflow can be freed to make space for earlier ones
2: Pre-emptively freeing space for large models on load (see the sketch after this list)
3: Freeing space on demand during the GPU -> RAM weight offload process
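As a caller-side illustration of point 2; the names here, including load_with_headroom and the 1 GiB margin, are assumptions rather than the PR's actual code:

```python
# Hypothetical use of free_ram() before pulling a large model into host memory:
# ask the RAM cache to evict cached models until there is room plus some margin.
def load_with_headroom(model_patcher, free_ram, extra_headroom=1 << 30):
    needed = model_patcher.model_size() + extra_headroom
    free_ram(needed)
    return model_patcher  # continue with the normal load path
```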
Example test conditions:
Linux, RTX5090, swapoff, 96GB RAM
Workflow: Flux FP16 -> Qwen FP16 -> Wan 2.2 FP16
giant-flow.json
In the screenshot it's executing Wan. The RAM trace shows it dropping from 95% to make space for Wan after Qwen.
On rerun it still has all the text encodings for re-use.