
fix: evict idle models before loading reranker on low-VRAM GPUs#276

Open
Jua004 wants to merge 1 commit into tobi:main from Jua004:fix/low-vram-reranker-eviction

Conversation


@Jua004 Jua004 commented Mar 1, 2026

Problem

On GPUs with limited VRAM (e.g. 2 GB GTX 960M), qmd query crashes at the reranking step:

Error: Failed to create any rerank context

The three models (generation ~1.2 GB, embedding ~314 MB, reranker ~610 MB) plus a rerank context (~960 MB) total ~3.1 GB — well beyond 2 GB. Since query expansion and embedding are already done by the time reranking starts, those models sit idle in VRAM while the reranker fails to allocate.
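For concreteness, the budget arithmetic from the figures above can be sketched as follows (sizes are the approximate footprints quoted in this PR, in binary units):

```typescript
// Rough VRAM budget from the numbers above (illustrative only).
const GiB = 1024 ** 3;
const MiB = 1024 ** 2;

const generationModel = 1.2 * GiB; // ~1.2 GB generation model
const embeddingModel = 314 * MiB;  // ~314 MB embedding model
const rerankerModel = 610 * MiB;   // ~610 MB reranker model
const rerankContext = 960 * MiB;   // ~960 MB rerank context

const totalNeeded =
  generationModel + embeddingModel + rerankerModel + rerankContext;
const budget = 2 * GiB; // GTX 960M

console.log((totalNeeded / GiB).toFixed(2)); // ≈ 3.04 GiB vs. a 2 GiB budget
```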

Fix

In ensureRerankModel(), before loading the reranker, check free VRAM against the reranker model file size + context overhead. If insufficient, dispose the generation model and embedding model/contexts — they've already completed their work and can be reloaded from disk later if needed.

  • Uses statSync on the model file (already imported) so the threshold adapts if models change
  • 15% headroom factor for runtime allocations
  • On machines with enough VRAM, the check passes and nothing is evicted — zero impact on high-VRAM setups
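A minimal sketch of the check described above, with stubbed-out internals — `getFreeVram` and `disposeIdleModels` are hypothetical stand-ins for qmd's model management, and the stub sizes simulate a 2 GiB GPU (the real code reads the reranker's file size with `statSync`):

```typescript
// Hypothetical sketch of the eviction check; names and sizes are illustrative.
const HEADROOM = 1.15;                        // 15% headroom for runtime allocations
const RERANK_CONTEXT_BYTES = 960 * 1024 ** 2; // observed rerank context size (~960 MB)

// In the real code this would be statSync(modelPath).size + context overhead,
// so the threshold adapts if the model file changes.
function requiredVram(modelBytes: number): number {
  return (modelBytes + RERANK_CONTEXT_BYTES) * HEADROOM;
}

// --- stubs simulating a 2 GiB GPU with generation + embedding resident ---
let freeVram = 0.5 * 1024 ** 3; // ~0.5 GiB free before eviction
function getFreeVram(): number { return freeVram; }
function disposeIdleModels(): void {
  // Generation and embedding already finished their pipeline stages;
  // disposing them frees their ~1.5 GiB of VRAM.
  freeVram += 1.5 * 1024 ** 3;
}

function ensureRerankModel(modelBytes: number): boolean {
  if (getFreeVram() < requiredVram(modelBytes)) {
    disposeIdleModels(); // evict idle models only when the reranker won't fit
  }
  return getFreeVram() >= requiredVram(modelBytes);
}
```

On a high-VRAM machine the first check passes and `disposeIdleModels` is never called, matching the "zero impact" claim above.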

Testing

Tested on GTX 960M (2 GB VRAM), driver 560.35.05, CUDA 12.6:

  • Before: qmd query crashes at reranking every time
  • After: full query pipeline (expansion → embedding → reranking) completes, VRAM peaks at ~1.9 GB

Relates to #275

On GPUs with limited VRAM (e.g. 2 GB), `qmd query` crashes at the
reranking step because the generation model (~1.2 GB) and embedding
model (~314 MB) remain resident while the reranker (~610 MB + 960 MB
context) tries to allocate.

Before loading the reranker, check free VRAM against the model file
size plus context overhead. If insufficient, dispose the generation
and embedding models first — they've already completed their work in
the pipeline and can be reloaded from disk later if needed.

On machines with enough VRAM, the check passes and nothing is evicted.

Tested on GTX 960M (2 GB VRAM), driver 560.35.05, CUDA 12.6.
Jua004 force-pushed the fix/low-vram-reranker-eviction branch from 6abb410 to 8d61ba0 on March 1, 2026 at 23:01
