feat(llm): add env vars for low-VRAM GPU configuration by NguyenQS504092s · Pull Request #330 · tobi/qmd

NguyenQS504092s · 2026-03-09T03:34:31Z

Summary

Add QMD_RERANK_CONTEXT_SIZE env var (default: 2048) for tuning rerank context window
Add QMD_EMBED_CONTEXT_SIZE env var (default: auto) for tuning embedding context window
Add QMD_MAX_PARALLELISM env var (default: auto) to cap parallel contexts
Add QMD_EMBED_BATCH_SIZE env var (default: 32) for embed loop batch size

Follows the per-setting env var + config pattern established in #313.

Problem

On GPUs with ≤4GB VRAM (RTX 3050, GTX 960M, etc.), qmd query, qmd vsearch, and qmd embed crash with OOM errors. The default settings require ~7GB peak VRAM.

Solution

Add granular env vars so users can tune VRAM usage without modifying source:

# Example for 4GB GPU:
export QMD_RERANK_CONTEXT_SIZE=1024
export QMD_EMBED_CONTEXT_SIZE=1024
export QMD_MAX_PARALLELISM=2
export QMD_EMBED_BATCH_SIZE=8
export QMD_EXPAND_CONTEXT_SIZE=1024  # already exists

Each new env var follows the same resolver pattern as QMD_EXPAND_CONTEXT_SIZE from #313:

Config value takes precedence over env var
Invalid values print a warning to stderr and fall back to defaults
Input validation rejects non-positive integers
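
The resolver rules above can be sketched roughly as follows. This is a minimal illustration, not the code in src/llm.ts: `resolvePositiveInt`, its signature, the injected `env` and `warn` parameters, and the warning text are all hypothetical, and treating an invalid config value the same way as an invalid env value is an assumption.

```typescript
// Hypothetical helper illustrating the resolver pattern; not the PR's actual code.
type Env = Record<string, string | undefined>;

function resolvePositiveInt(
  name: string,
  configValue: number | undefined,
  env: Env,
  fallback: number | undefined, // undefined models an "auto" default
  warn: (msg: string) => void = (msg) => console.error(msg), // stderr by default
): number | undefined {
  // 1. An explicit config value takes precedence over the env var.
  if (configValue !== undefined) {
    if (Number.isInteger(configValue) && configValue > 0) return configValue;
    warn(`Invalid config value for ${name}: ${configValue}; using default`);
    return fallback;
  }

  // 2. Otherwise consult the environment; unset or empty means "use the default".
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;

  // 3. Accept only plain digit strings so "abc", "-8", "1.5", "1024x" are rejected.
  const trimmed = raw.trim();
  const parsed = /^\d+$/.test(trimmed) ? Number(trimmed) : NaN;
  if (!Number.isSafeInteger(parsed) || parsed <= 0) {
    warn(`Invalid ${name}="${raw}"; using default`);
    return fallback;
  }
  return parsed;
}
```

With this shape, the rerank setting would be resolved with a fallback of 2048 and the embed setting with a fallback of `undefined` (auto), matching the defaults listed in the Summary.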

Changes

src/llm.ts

Add resolveRerankContextSize(), resolveEmbedContextSize(), resolveMaxParallelism() resolver functions
Add rerankContextSize and embedContextSize fields to LlamaCppConfig type
Replace static RERANK_CONTEXT_SIZE with instance field resolved from config/env
Pass contextSize to createEmbeddingContext() when QMD_EMBED_CONTEXT_SIZE is set
Cap computeParallelism() result when QMD_MAX_PARALLELISM is set
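
As a rough sketch of the last two items, the cap and the conditional context size could look like the following. Both helpers (`capParallelism`, `embeddingContextOptions`) are hypothetical names introduced for illustration; the real computeParallelism() and createEmbeddingContext() calls in src/llm.ts are not reproduced here.

```typescript
// Hypothetical helpers; they only illustrate the "apply when set" behavior.

// Cap an auto-computed parallelism value. When no cap is configured
// (undefined), the computed value passes through unchanged.
function capParallelism(computed: number, maxParallelism: number | undefined): number {
  const base = Math.max(1, Math.floor(computed)); // always keep at least one context
  return maxParallelism === undefined ? base : Math.min(base, maxParallelism);
}

// Build the options object for an embedding context. The contextSize key is
// only present when a size was resolved, so the "auto" default is untouched.
function embeddingContextOptions(
  embedContextSize: number | undefined,
): { contextSize?: number } {
  return embedContextSize === undefined ? {} : { contextSize: embedContextSize };
}
```

Leaving the key out entirely, rather than passing an explicit default, is one way to keep behavior unchanged when no env var is set.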

src/qmd.ts

Make BATCH_SIZE configurable via QMD_EMBED_BATCH_SIZE env var
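
To show why the batch size matters for memory, here is a minimal sketch of splitting documents into batches of a configurable size. `toBatches` is a hypothetical helper, not the embed loop in src/qmd.ts; a smaller batch means fewer items are held and processed at once, at the cost of more iterations.

```typescript
// Hypothetical helper: split items into consecutive batches of at most batchSize.
function toBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError(`batchSize must be a positive integer, got ${batchSize}`);
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize)); // the last batch may be shorter
  }
  return batches;
}

// Example: 107 documents, as in the test run described in this PR.
const docs = Array.from({ length: 107 }, (_, i) => `doc-${i}`);
const defaultBatches = toBatches(docs, 32); // 4 batches: 32, 32, 32, 11
const lowVramBatches = toBatches(docs, 8);  // 14 batches: thirteen of 8, one of 3
```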

README.md

Document all env vars in the Environment Variables table
Add "Low-VRAM GPU Configuration" section with example values

Tested on RTX 3050 Laptop (4GB VRAM, Vulkan)

| Metric | Default (crash) | With env vars |
| --- | --- | --- |
| `qmd embed -f` (107 docs) | ❌ OOM | ✅ 2m19s |
| `qmd query` | ❌ timeout/OOM | ✅ ~60s |
| `qmd vsearch` | ❌ OOM | ✅ works |
| Peak VRAM | ~7GB | ~3GB |

Estimated search quality impact: ~1-2%, from rare edge cases where input is truncated at the smaller context sizes.

Test plan

TypeScript compiles cleanly (tsc --noEmit)
Tested on RTX 3050 4GB with all env vars set
Verify default behavior unchanged when no env vars are set
Verify invalid env var values produce warnings and use defaults

Fixes #329
Relates to #275, #303

Add QMD_RERANK_CONTEXT_SIZE, QMD_EMBED_CONTEXT_SIZE, QMD_MAX_PARALLELISM, and QMD_EMBED_BATCH_SIZE env vars to allow tuning for GPUs with limited VRAM (≤4GB). Follows the per-setting env var pattern from PR tobi#313.

Fixes tobi#329
Relates to tobi#275, tobi#303

NguyenQS504092s mentioned this pull request Mar 9, 2026

feat: Add low-VRAM configuration for GPUs with ≤4GB memory #329

Open

NguyenQS504092s wants to merge 1 commit into tobi:main from NguyenQS504092s:feat/low-vram-config
