feat(llm): add env vars for low-VRAM GPU configuration #330
Open
NguyenQS504092s wants to merge 1 commit into tobi:main
Summary
- QMD_RERANK_CONTEXT_SIZE env var (default: 2048) for tuning rerank context window
- QMD_EMBED_CONTEXT_SIZE env var (default: auto) for tuning embedding context window
- QMD_MAX_PARALLELISM env var (default: auto) to cap parallel contexts
- QMD_EMBED_BATCH_SIZE env var (default: 32) for embed loop batch size

Follows the per-setting env var + config pattern established in #313.
Problem
On GPUs with ≤4GB VRAM (RTX 3050, GTX 960M, etc.), qmd query, qmd vsearch, and qmd embed crash with OOM errors. The default settings require ~7GB peak VRAM.
Solution
Add granular env vars so users can tune VRAM usage without modifying source:
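As a rough illustration of how such settings might be resolved, here is a minimal sketch. It is not the code from this PR: the helper names parsePositiveInt, resolveSetting, and capParallelism are hypothetical, the precedence (config value, then env var, then default) is an assumption, and the env values "1024" and "1" are only example inputs. The defaults 2048 and 32 are the ones listed in the Summary.

```typescript
// Hypothetical sketch: only the QMD_* variable names come from the PR description.
type Env = Record<string, string | undefined>;

// Parse a positive integer from an env value; undefined when unset or invalid.
function parsePositiveInt(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

// Assumed precedence: explicit config value, then env var, then default.
function resolveSetting(
  configValue: number | undefined,
  envName: string,
  fallback: number | undefined,
  env: Env,
): number | undefined {
  return configValue ?? parsePositiveInt(env[envName]) ?? fallback;
}

// Cap an automatically computed parallelism value when a limit is set.
function capParallelism(computed: number, env: Env): number {
  const limit = parsePositiveInt(env["QMD_MAX_PARALLELISM"]);
  return limit === undefined ? computed : Math.min(computed, limit);
}

// Example inputs; a real caller would likely pass process.env instead.
const env: Env = { QMD_RERANK_CONTEXT_SIZE: "1024", QMD_MAX_PARALLELISM: "1" };

const rerankContextSize = resolveSetting(undefined, "QMD_RERANK_CONTEXT_SIZE", 2048, env);
const embedBatchSize = resolveSetting(undefined, "QMD_EMBED_BATCH_SIZE", 32, env);
const parallelism = capParallelism(4, env);

console.log({ rerankContextSize, embedBatchSize, parallelism });
```

Passing the environment in as a parameter keeps the helpers easy to test. Invalid or empty values fall through to the default rather than raising an error, which is a design choice of this sketch and not necessarily what the PR does.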
Each new env var follows the same resolver pattern as QMD_EXPAND_CONTEXT_SIZE from #313.
Changes
src/llm.ts
- Add resolveRerankContextSize(), resolveEmbedContextSize(), resolveMaxParallelism() resolver functions
- Add rerankContextSize and embedContextSize fields to LlamaCppConfig type
- Replace RERANK_CONTEXT_SIZE with instance field resolved from config/env
- Pass contextSize to createEmbeddingContext() when QMD_EMBED_CONTEXT_SIZE is set
- Cap computeParallelism() result when QMD_MAX_PARALLELISM is set

src/qmd.ts
- Make BATCH_SIZE configurable via QMD_EMBED_BATCH_SIZE env var

README.md

Tested on RTX 3050 Laptop (4GB VRAM, Vulkan)
- qmd embed -f (107 docs)
- qmd query
- qmd vsearch

Search quality impact: ~1-2% (rare edge-case truncation).
Test plan
- Type check (tsc --noEmit)

Fixes #329
Relates to #275, #303