fix: evict idle models before loading reranker on low-VRAM GPUs#276
On GPUs with limited VRAM (e.g. 2 GB), `qmd query` crashes at the reranking step because the generation model (~1.2 GB) and embedding model (~314 MB) remain resident while the reranker (~610 MB + 960 MB context) tries to allocate. Before loading the reranker, check free VRAM against the model file size plus context overhead. If insufficient, dispose the generation and embedding models first — they've already completed their work in the pipeline and can be reloaded from disk later if needed. On machines with enough VRAM, the check passes and nothing is evicted. Tested on GTX 960M (2 GB VRAM), driver 560.35.05, CUDA 12.6.
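A rough sketch of that check-and-evict flow is below. The model registry, the in-memory `freeVram` counter, and the function names are hypothetical stand-ins for illustration, not qmd's actual API; only the `statSync`-plus-overhead threshold mirrors what the description above says the patch does:

```typescript
import { statSync } from "node:fs";

// Hypothetical handle for a resident model; qmd's real objects differ.
interface LoadedModel {
  name: string;
  vramBytes: number;
  dispose(): void;
}

const CONTEXT_OVERHEAD = 960 * 1024 * 1024; // observed rerank-context cost

let freeVram = 2 * 1024 ** 3;       // stand-in for querying the driver for free VRAM
const resident: LoadedModel[] = []; // generation/embedding models land here when loaded

// Dispose idle models (oldest first) until enough VRAM is free.
function evictIdleModels(bytesNeeded: number): void {
  while (freeVram < bytesNeeded && resident.length > 0) {
    const idle = resident.shift()!; // generation/embedding are idle by reranking time
    idle.dispose();                 // release the model's VRAM
    freeVram += idle.vramBytes;
  }
}

// Before loading the reranker, check that its file size plus context fits.
function ensureRerankModel(modelPath: string): void {
  const needed = statSync(modelPath).size + CONTEXT_OVERHEAD;
  if (freeVram < needed) evictIdleModels(needed);
  // ...load the reranker as before (unchanged happy path)
}
```

On a machine with enough free VRAM the `freeVram < needed` check fails and nothing is evicted, matching the behavior described above.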
Problem
On GPUs with limited VRAM (e.g. a 2 GB GTX 960M), `qmd query` crashes at the reranking step. The three models (generation ~1.2 GB, embedding ~314 MB, reranker ~610 MB) plus a rerank context (~960 MB) total ~3.1 GB, well beyond 2 GB. Since query expansion and embedding are already done by the time reranking starts, those models sit idle in VRAM while the reranker fails to allocate.
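Spelling out the budget arithmetic from the numbers above (treating each reported size as MiB for simplicity):

```typescript
// VRAM needed at reranking time, using the sizes reported above (in MiB).
const generation = 1200;   // ~1.2 GB generation model, still resident
const embedding = 314;     // ~314 MB embedding model, still resident
const reranker = 610;      // ~610 MB reranker model file
const rerankContext = 960; // ~960 MB rerank context
const neededMiB = generation + embedding + reranker + rerankContext;
const availableMiB = 2048; // 2 GB GTX 960M
console.log(`${neededMiB} MiB needed, ${availableMiB} MiB available`);
// → 3084 MiB needed, 2048 MiB available
```

Even after dropping the two idle models, the reranker plus its context (~1570 MiB) fits within the 2048 MiB budget.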
Fix
In `ensureRerankModel()`, before loading the reranker, check free VRAM against the reranker model file size plus context overhead. If insufficient, dispose the generation model and the embedding model/contexts; they have already completed their work and can be reloaded from disk later if needed. The reranker's size is read with `statSync` on the model file (already imported), so the threshold adapts if the models change.

Testing
Tested on GTX 960M (2 GB VRAM), driver 560.35.05, CUDA 12.6: without the fix, `qmd query` crashes at reranking every time.

Relates to #275