feat(llm): add env vars for low-VRAM GPU configuration #330

Open
NguyenQS504092s wants to merge 1 commit into tobi:main from NguyenQS504092s:feat/low-vram-config

Summary

  • Add QMD_RERANK_CONTEXT_SIZE env var (default: 2048) for tuning rerank context window
  • Add QMD_EMBED_CONTEXT_SIZE env var (default: auto) for tuning embedding context window
  • Add QMD_MAX_PARALLELISM env var (default: auto) to cap parallel contexts
  • Add QMD_EMBED_BATCH_SIZE env var (default: 32) for embed loop batch size

Follows the per-setting env var + config pattern established in #313.

Problem

On GPUs with ≤4GB VRAM (RTX 3050, GTX 960M, etc.), qmd query, qmd vsearch, and qmd embed crash with OOM errors. The default settings require ~7GB peak VRAM.

Solution

Add granular env vars so users can tune VRAM usage without modifying source:

# Example for 4GB GPU:
export QMD_RERANK_CONTEXT_SIZE=1024
export QMD_EMBED_CONTEXT_SIZE=1024
export QMD_MAX_PARALLELISM=2
export QMD_EMBED_BATCH_SIZE=8
export QMD_EXPAND_CONTEXT_SIZE=1024  # already exists

Each new env var follows the same resolver pattern as QMD_EXPAND_CONTEXT_SIZE from #313:

  • Config value takes precedence over env var
  • Invalid values print a warning to stderr and fall back to defaults
  • Input validation rejects non-positive integers
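The three rules above can be sketched as a single helper. This is a minimal illustration, not the code in src/llm.ts: the names `resolvePositiveInt` and `parsePositiveInt`, the signature, and the warning wording are all hypothetical, and the environment is passed in as a plain object so the sketch stays self-contained. Falling back to the default (rather than to the env var) when the config value is invalid is also an assumption.

```typescript
// Hypothetical helper; names and signature are illustrative, not taken from src/llm.ts.
type Env = Record<string, string | undefined>;

// Returns the value as a positive integer, or undefined if it is not one.
function parsePositiveInt(raw: string | number): number | undefined {
  const n = typeof raw === "number" ? raw : Number(raw.trim());
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

function resolvePositiveInt(
  name: string,
  configValue: number | undefined,
  env: Env,
  fallback: number | undefined, // undefined stands in for "auto"
  warn: (msg: string) => void = (msg) => console.error(msg), // stderr in Node
): number | undefined {
  // 1. A config value, when present, takes precedence over the env var.
  if (configValue !== undefined) {
    const parsed = parsePositiveInt(configValue);
    if (parsed !== undefined) return parsed;
    warn(`Invalid config value for ${name}: ${configValue}; using default`);
    return fallback; // assumption: invalid config goes straight to the default
  }

  // 2. Otherwise consult the env var; unset or empty means "use the default".
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;

  // 3. Reject zero, negatives, fractions, and non-numeric input with a warning.
  const parsed = parsePositiveInt(raw);
  if (parsed !== undefined) return parsed;
  warn(`Invalid ${name}="${raw}"; using default`);
  return fallback;
}
```

With this shape, `resolvePositiveInt("QMD_RERANK_CONTEXT_SIZE", undefined, { QMD_RERANK_CONTEXT_SIZE: "1024" }, 2048)` yields 1024, while a value such as `"-1"` or `"abc"` emits one warning and yields 2048.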

Changes

src/llm.ts

  • Add resolveRerankContextSize(), resolveEmbedContextSize(), resolveMaxParallelism() resolver functions
  • Add rerankContextSize and embedContextSize fields to LlamaCppConfig type
  • Replace static RERANK_CONTEXT_SIZE with instance field resolved from config/env
  • Pass contextSize to createEmbeddingContext() when QMD_EMBED_CONTEXT_SIZE is set
  • Cap computeParallelism() result when QMD_MAX_PARALLELISM is set
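The last two bullets describe conditional behavior: both settings apply only when the user has set them. A rough sketch of that logic, assuming the resolved values arrive as `number | undefined` (with `undefined` meaning "auto"); `capParallelism` and `embeddingContextOptions` are hypothetical names, not functions from src/llm.ts:

```typescript
// Hypothetical helpers illustrating the "only when set" behavior described above.

// Cap the automatically computed parallelism; leave it untouched when no cap is set.
function capParallelism(computed: number, maxParallelism: number | undefined): number {
  if (maxParallelism === undefined) return computed;
  return Math.min(computed, maxParallelism);
}

// Build the options object for the embedding context. The contextSize key is
// included only when a size was configured, so the "auto" default is preserved.
function embeddingContextOptions(
  embedContextSize: number | undefined,
): { contextSize?: number } {
  return embedContextSize === undefined ? {} : { contextSize: embedContextSize };
}
```

Omitting the key entirely, instead of passing `contextSize: undefined`, keeps the default code path identical to the behavior before this change.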

src/qmd.ts

  • Make BATCH_SIZE configurable via QMD_EMBED_BATCH_SIZE env var
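To show what the batch size controls, here is a minimal sketch of splitting documents into batches. `toBatches` is a hypothetical helper and not the loop in src/qmd.ts; it assumes the batch size has already been validated as a positive integer.

```typescript
// Hypothetical helper: split items into consecutive batches of at most batchSize.
function toBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError(`batchSize must be a positive integer, got ${batchSize}`);
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    out.push(items.slice(i, i + batchSize));
  }
  return out;
}

// Example: 107 documents, as in the test run reported in this PR.
const docs = Array.from({ length: 107 }, (_, i) => `doc-${i}`);
const defaultBatches = toBatches(docs, 32); // 4 batches: 32, 32, 32, 11
const lowVramBatches = toBatches(docs, 8); // 14 batches: thirteen of 8, one of 3
```

A smaller batch size means more iterations, each holding fewer documents in flight at once.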

README.md

  • Document all env vars in the Environment Variables table
  • Add "Low-VRAM GPU Configuration" section with example values

Tested on RTX 3050 Laptop (4GB VRAM, Vulkan)

| Metric | Default (crash) | With env vars |
| --- | --- | --- |
| qmd embed -f (107 docs) | ❌ OOM | ✅ 2m19s |
| qmd query | ❌ timeout/OOM | ✅ ~60s |
| qmd vsearch | ❌ OOM | ✅ works |
| Peak VRAM | ~7GB | ~3GB |

Search quality impact: ~1-2% (rare edge-case truncation).

Test plan

  • TypeScript compiles cleanly (tsc --noEmit)
  • Tested on RTX 3050 4GB with all env vars set
  • Verify default behavior unchanged when no env vars are set
  • Verify invalid env var values produce warnings and use defaults

Fixes #329
Relates to #275, #303

Add QMD_RERANK_CONTEXT_SIZE, QMD_EMBED_CONTEXT_SIZE,
QMD_MAX_PARALLELISM, and QMD_EMBED_BATCH_SIZE env vars to allow
tuning for GPUs with limited VRAM (≤4GB).

Follows the per-setting env var pattern from PR tobi#313.

Fixes tobi#329
Relates to tobi#275, tobi#303