
Conversation

@rattus128
Contributor

@rattus128 rattus128 commented Nov 18, 2025

This PR expands the robustness of the RAM cache implementation. This makes the RAM cache much friendlier to use and avoids users needing to specifically size the cache based on their workflow. It also avoids OOMs in more cases, especially flows with multiple large models. There are three key changes:

1: Loosening the executor's pre-emptive cache pin on models, so that cached models late in a workflow can be freed to make space for earlier ones
2: Pre-emptively freeing space for large models on load
3: Freeing space on demand during the GPU -> RAM weight offload process (sketched below)
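Roughly, change 3 amounts to interleaving the GPU -> RAM weight moves with RAM pressure checks, asking the cache to evict before each transfer if headroom is low. A minimal sketch of that shape, assuming a pluggable "free me some RAM" callback (all identifiers below are illustrative, not the PR's actual ones):

```python
import psutil

# Illustrative headroom target; the real PR derives its own thresholds.
MIN_HEADROOM = 4 * 1024**3  # 4 GiB

def ram_headroom_bytes():
    """Free system RAM as reported by psutil."""
    return psutil.virtual_memory().available

def offload_module_to_ram(module, ram_cache_free):
    """Move one leaf module GPU -> CPU, asking the RAM cache to free space
    first if headroom would drop below the target. ram_cache_free(n) is
    expected to evict cached items until roughly n bytes are released."""
    needed = sum(p.numel() * p.element_size() for p in module.parameters())
    shortfall = needed + MIN_HEADROOM - ram_headroom_bytes()
    if shortfall > 0:
        ram_cache_free(shortfall)
    module.to("cpu")
```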

Example test conditions:

Linux, RTX5090, swapoff, 96GB RAM
Workflow: Flux FP16 -> qwen FP16 -> wan 2.2 FP16
giant-flow.json

[screenshot: giant-flow-scr]

In the screenshot it's executing wan. The RAM trace shows usage dropping from 95% to make space for wan after qwen.

On rerun it still has all the text encodings for re-use.

@rattus128 rattus128 changed the title from "RAM cache implementation - part II (after next stable)" to "RAM cache implementation - part II" on Nov 19, 2025
@Kosinkadink
Collaborator

Hey, is this PR in a state where it can be taken off draft + reviewed, or is it still in the oven?

@rattus128
Contributor Author

> Hey, is this PR in a state where it can be taken off draft + reviewed, or is it still in the oven?

Hey, we are stuck on draft as it conflicts with async offloading and I will need to do a small rebase and retest. Feel free to review though.

@rattus128 rattus128 force-pushed the prs/ram-cache-2 branch 2 times, most recently from 647d88a to ff66d8a on November 27, 2025 07:17
@rattus128 rattus128 marked this pull request as ready for review November 27, 2025 07:17
outputs=[io.Sigmas.Output()]
)

@classmethod
Owner

This should probably be in the v3 schema stuff instead of on every node.

@Kosinkadink What do you think?

Collaborator

I think I would be able to make it so that in v3, the check_lazy_status function would be autogenerated if at least one input is marked as lazy and no override was created. For this PR, I think having it be per node is fine, and I can edit these when I add that autogeneration.
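As a rough illustration of what that autogeneration could look like (a hypothetical helper, not the actual v3 code):

```python
# Hypothetical sketch: if a node schema has lazy inputs and the class defines
# no check_lazy_status override, synthesize one that requests every lazy
# input that has not been evaluated yet.
def autogenerate_check_lazy_status(node_cls, lazy_input_names):
    if "check_lazy_status" in vars(node_cls):
        return  # the node brought its own override; leave it alone
    def check_lazy_status(cls, **kwargs):
        return [name for name in lazy_input_names if kwargs.get(name) is None]
    node_cls.check_lazy_status = classmethod(check_lazy_status)
```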

Contributor Author

I've backed out the lazy changes so we can get the loader awareness in, in its own right. There's some ongoing discussion about how to do the lazy stuff.

Move the headroom logic into the RAM cache to make it a little easier to call as "free me some RAM".

Rename the API to free_ram().

Split off the clean_list creation into a completely separate function to avoid any stray strong reference to the content-to-be-freed on the stack.
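A minimal sketch of the cache-side shape this describes (attribute names like last_used and size_bytes are assumptions, not the PR's real fields):

```python
import gc

class RAMPressureCache:
    """Sketch of the eviction path only; not the PR's actual class body."""

    def __init__(self):
        self.entries = {}

    def _build_clean_list(self):
        # Kept as a separate function, and returning only keys, so nothing on
        # the caller's stack retains a strong reference to the values that are
        # about to be freed.
        return sorted(self.entries, key=lambda k: self.entries[k].last_used)

    def free_ram(self, bytes_needed):
        """Evict least-recently-used entries until roughly bytes_needed is freed."""
        freed = 0
        for key in self._build_clean_list():
            if freed >= bytes_needed:
                break
            # pop and read size in one expression: no named local keeps the entry alive
            freed += self.entries.pop(key).size_bytes
        gc.collect()
        return freed
```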
Add the free_ram() API and a means to install implementations of the freer (i.e. the RAM cache).
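One plausible shape for that install hook, with illustrative names for the registry (the PR's actual registration mechanism may differ):

```python
# Hypothetical model_management-side hook: "freer" callables that try to
# release roughly the requested number of bytes and report how much they freed.
_ram_freers = []

def register_ram_freer(freer):
    """Install a freer implementation, e.g. the RAM cache's own free_ram()."""
    _ram_freers.append(freer)

def free_ram(bytes_needed):
    """Ask each registered freer for space; returns the bytes actually freed."""
    remaining = bytes_needed
    for freer in _ram_freers:
        if remaining <= 0:
            break
        remaining -= freer(remaining)
    return bytes_needed - remaining
```

In this sketch the cache would call register_ram_freer(cache.free_ram) once at startup, so loading code can request space without importing the cache directly.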
Currently this hard-assumes that the caller of model_unload will keep current_loaded_models in sync. With RAMPressureCache it's possible for garbage collection to occur in the middle of the model-free process, which can split these two steps.
This is currently put together as a list of indexes, assuming current_loaded_models doesn't change. However, we might need to purge a model as part of the offload process, which means this list can change in the middle of the freeing process. Handle this by taking independent refs to the LoadedModel objects and doing safe by-value deletion from current_loaded_models.
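A sketch of the by-value deletion described here, assuming model_management's current_loaded_models list and a LoadedModel.model_unload() method (the snapshot parameter is illustrative):

```python
current_loaded_models = []   # stands in for model_management's module-level list

def unload_models(models_to_free):
    """Unload a snapshot of LoadedModel objects rather than positions in
    current_loaded_models, tolerating entries that a RAM-pressure purge
    removed while we were iterating."""
    for loaded in list(models_to_free):            # independent refs, not indexes
        loaded.model_unload()
        try:
            current_loaded_models.remove(loaded)   # safe by-value deletion
        except ValueError:
            pass   # already purged by the RAM cache mid-unload
```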
RAMPressure caching may need to purge the same model that you are currently trying to offload for VRAM freeing. In this case, the RAMPressure cache takes priority and needs to be able to pull the trigger on dumping the whole model and freeing the ModelPatcher in question. To do this, defer the actual transfer of model weights from GPU to RAM to model_management state rather than doing it as part of ModelPatcher. This is done as a list of weakrefs.

If the RAM cache decides to free the model you are currently unloading, then the ModelPatcher and its refs simply disappear in the middle of the unloading process, and both RAM and VRAM will be freed.

The unpatcher now queues the individual leaf modules to be offloaded one by one so that RAM levels can be monitored.

Note that the UnloadPartially that is potentially done as part of a load will not be freeable this way; however, it shouldn't be anyway, as that is the currently active model, and the RAM cache cannot save you if you can't even fit the one model you are currently trying to use.
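Put together, the deferred, weakref-based transfer could be sketched like this (illustrative names; assumes the queued objects are torch leaf modules):

```python
import weakref

# Hypothetical offload queue: weakrefs to leaf modules waiting to move GPU -> RAM.
# If the RAM cache frees the owning ModelPatcher first, the weakrefs die and the
# transfer is simply skipped, releasing both RAM and VRAM in one go.
_offload_queue = []

def queue_module_offload(module):
    _offload_queue.append(weakref.ref(module))

def drain_offload_queue(ensure_headroom):
    """Offload queued leaf modules one by one, so RAM levels can be checked
    (and free_ram() invoked via ensure_headroom) before each transfer."""
    while _offload_queue:
        module = _offload_queue.pop(0)()
        if module is None:
            continue  # owner was purged mid-unload; nothing left to move
        ensure_headroom(sum(p.numel() * p.element_size()
                            for p in module.parameters()))
        module.to("cpu")
```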
@comfyanonymous
Owner

There's a bug when I enable the RAM cache on simulated 50GB RAM + 24GB VRAM.

I run this workflow twice in a row:

[workflow screenshot: ComfyUI_277027_]

It unloads the high noise model on the first workflow run, which is good, but the second time it gets stuck on the first sampler node.
