diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index 01edd9b..dc91fe4 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -20,7 +20,7 @@ jobs: strategy: matrix: - model: [qwen3.5-35b-a3b, hermes-4.3-36b] + model: [qwen3.5-35b-a3b, qwen3-coder-30b-a3b, hermes-4.3-36b] steps: - name: Checkout diff --git a/AGENTS.md b/AGENTS.md index 792ac83..720efdf 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -51,6 +51,10 @@ No API key is required by default. If your client demands one, any non-empty str "qwen": { "id": "qwen3.5-35b-a3b", "name": "Qwen 3.5 35B A3B" + }, + "qwen-coder": { + "id": "qwen3-coder-30b-a3b", + "name": "Qwen 3 Coder 30B A3B" } } } @@ -68,7 +72,7 @@ API Key: sk-local Model: qwen3.5-35b-a3b ``` -Cursor uses streaming by default. Foundry supports SSE streaming natively. With 4 parallel slots, you can run Cursor's background indexing and active chat simultaneously without blocking. +Cursor uses streaming by default. Foundry supports SSE streaming natively. With multiple parallel slots, you can run Cursor's background indexing and active chat simultaneously without blocking. ### Continue (VS Code / JetBrains) @@ -115,7 +119,7 @@ Model ID: qwen3.5-35b-a3b ## Multi-Agent Frameworks -Foundry's 4 parallel inference slots make it particularly suited for multi-agent workflows where multiple agents share a single model. Each slot processes requests independently with minimal throughput degradation. +Foundry's parallel inference slots make it particularly suited for multi-agent workflows where multiple agents share a single model. Each slot processes requests independently with minimal throughput degradation. ### CrewAI @@ -162,7 +166,7 @@ crew = Crew( result = crew.kickoff(inputs={"topic": "GPU inference optimization"}) ``` -With 4 parallel slots, CrewAI can run 4 agents simultaneously at ~80 tok/s each (Qwen MoE) or ~16 tok/s each (Hermes Dense). 
+With Qwen3-Coder's 3 parallel slots, CrewAI can run 3 agents simultaneously at ~168 tok/s each; with Hermes Dense's 4 slots, it can run 4 agents at ~33 tok/s each. ### AutoGen @@ -237,7 +241,7 @@ docker run -d -p 3000:8080 \ ghcr.io/open-webui/open-webui:main ``` -Open WebUI supports multi-user chat with conversation history. Each user session uses one of Foundry's 4 inference slots. +Open WebUI supports multi-user chat with conversation history. Each user session uses one of Foundry's inference slots. ### text-generation-webui (oobabooga) @@ -315,13 +319,14 @@ console.log(response.choices[0].message.content); | Use case | Recommended model | Why | |----------|-------------------|-----| -| **Coding agents** (OpenCode, Cursor, Aider) | Qwen3.5-35B-A3B | Fast decode (181 tok/s), 192K context for large codebases, good at code | -| **Multi-agent orchestration** (CrewAI, AutoGen) | Qwen3.5-35B-A3B | 4-concurrent at 320 tok/s aggregate, MoE batching advantage | +| **Coding agents** (OpenCode, Cursor, Aider) | Qwen3-Coder-30B-A3B | Fastest decode (275 tok/s), purpose-built for code, tool calling support | +| **Multi-agent orchestration** (CrewAI, AutoGen) | Qwen3-Coder-30B-A3B | 3-concurrent at 497 tok/s aggregate, best MoE batching efficiency | +| **General coding + long context** | Qwen3.5-35B-A3B | 192K effective context for large codebases, hybrid recurrent architecture | | **Reasoning-heavy tasks** | Hermes-4.3-36B | Thinking mode with `<think>` tags, stronger reasoning on hard problems | -| **Tool use / function calling** | Hermes-4.3-36B | Trained specifically for structured tool calling with `<tool_call>` XML | +| **Tool use / function calling** | Qwen3-Coder-30B-A3B or Hermes-4.3-36B | Both have strong tool calling; Coder is ~4x faster, Hermes more reliable on complex schemas | | **Roleplay / creative writing** | Hermes-4.3-36B | NousResearch fine-tune optimized for personality and narrative | | **Long document Q&A** | Qwen3.5-35B-A3B | 192K context window, recurrent layers handle long sequences
efficiently | -| **16 GB VRAM GPUs** | Qwen3.5-35B-A3B | MoE expert offloading works on 16 GB; Hermes needs 24 GB minimum | +| **16 GB VRAM GPUs** | Qwen3-Coder-30B-A3B | Smallest disk footprint (17.7 GB), MoE expert offloading works on 16 GB | ## Performance Considerations @@ -331,10 +336,11 @@ Single-stream decode latency (time to generate one token): | Model | Latency per token | Tokens per second | |-------|-------------------|-------------------| +| Qwen3-Coder-30B-A3B | ~3.6 ms | ~275 tok/s | | Qwen3.5-35B-A3B | ~5.5 ms | ~181 tok/s | | Hermes-4.3-36B | ~15.5 ms | ~64 tok/s | -For interactive coding agents, Qwen delivers a visibly faster typing experience. For batch/background tasks where latency is less critical, Hermes' stronger reasoning may be worth the tradeoff. +For interactive coding agents, Qwen3-Coder delivers the fastest typing experience. Qwen3.5 trades some speed for 192K effective context. For batch/background tasks where latency is less critical, Hermes' stronger reasoning may be worth the tradeoff. ### Prompt processing @@ -342,13 +348,21 @@ Prompt processing (prefill) runs at ~1,163 tok/s for Qwen on RTX 5090. A 10K tok ### Concurrent agent scaling +Qwen3-Coder-30B-A3B (fastest, 3 slots): +``` +1 agent: 275 tok/s (100% per-agent speed) +2 agents: 405 tok/s (~204 tok/s each, 74% per-agent) +3 agents: 497 tok/s (~168 tok/s each, 61% per-agent) +``` + +Qwen3.5-35B-A3B (4 slots): ``` 1 agent: 181 tok/s (100% per-agent speed) 2 agents: 234 tok/s (~117 tok/s each, 65% per-agent) 4 agents: 320 tok/s (~80 tok/s each, 44% per-agent) ``` -If your workflow has >4 concurrent agents, requests queue until a slot is free. Consider multi-GPU routing (below) for higher concurrency. +If your workflow has more concurrent agents than slots, requests queue until a slot is free. Consider multi-GPU routing (below) for higher concurrency. ### Context window usage @@ -356,6 +370,7 @@ VRAM scales with context usage. 
The default RTX 5090 profiles are tuned for maxi | Model | Default context | VRAM at idle | VRAM at full context | |-------|----------------|--------------|---------------------| +| Qwen3-Coder-30B-A3B | 192K | 25.0 GB | ~28.9 GB | | Qwen3.5-35B-A3B | 192K | 25.3 GB | ~26.1 GB | | Hermes-4.3-36B | 32K | 24.5 GB | ~27.8 GB | @@ -370,7 +385,7 @@ docker run --gpus all -p 8080:8080 \ ## Structured Output -Both models support JSON mode for structured outputs: +All three models support JSON mode for structured outputs: ```python response = client.chat.completions.create( @@ -440,7 +455,7 @@ if tool_calls: print(f"Args: {tool_calls[0].function.arguments}") ``` -Both models support Jinja chat templates for tool calling. The entrypoint enables `--jinja` by default. +All models support Jinja chat templates for tool calling. The entrypoint enables `--jinja` by default. ## Thinking / Reasoning Mode @@ -483,7 +498,7 @@ The Docker Compose configuration includes TCP tuning (BBR congestion control, bu ## Multi-GPU Agent Routing -For workloads requiring more than 4 concurrent agents, run multiple Foundry instances and load-balance across them. +For workloads requiring more concurrent agents than slots, run multiple Foundry instances and load-balance across them. 
### Simple round-robin with nginx @@ -511,14 +526,15 @@ For deterministic routing (each agent always hits the same GPU): ```python import os -# Route based on agent ID +# Route based on agent ID (adjust slots_per_gpu to match your model's --parallel setting) +slots_per_gpu = 3 # Qwen3-Coder default; use 4 for Qwen3.5/Hermes gpu_endpoints = [ - "http://localhost:8080/v1", # GPU 0: agents 0-3 - "http://localhost:8081/v1", # GPU 1: agents 4-7 + "http://localhost:8080/v1", # GPU 0 + "http://localhost:8081/v1", # GPU 1 ] def get_client(agent_id: int) -> OpenAI: - endpoint = gpu_endpoints[agent_id // 4] + endpoint = gpu_endpoints[agent_id // slots_per_gpu] return OpenAI(base_url=endpoint, api_key="sk-local") ``` @@ -536,7 +552,7 @@ If you see `no devices with dedicated memory found` in the logs, the CUDA backen 1. Check if all layers are on GPU: look for `offloaded N/N layers to GPU` in container logs 2. Check VRAM: `nvidia-smi` -- if VRAM is full, reduce context with `FOUNDRY_CTX_LENGTH` -3. Check if all 4 slots are occupied: `curl http://localhost:8080/metrics | grep slots` +3. 
Check if all slots are occupied: `curl http://localhost:8080/metrics | grep slots` ### Connection refused diff --git a/Makefile b/Makefile index ab14819..6dc6406 100644 --- a/Makefile +++ b/Makefile @@ -12,8 +12,8 @@ MODELS_DIR ?= $(HOME)/.cache/foundry .PHONY: help build run run-profile test benchmark monitoring down push push-all clean clean-models download help: ## Show this help - @echo "Available models: qwen3.5-35b-a3b (default), hermes-4.3-36b" - @echo "Usage: make run MODEL=hermes-4.3-36b" + @echo "Available models: qwen3.5-35b-a3b (default), qwen3-coder-30b-a3b, hermes-4.3-36b" + @echo "Usage: make run MODEL=qwen3-coder-30b-a3b" @echo "" @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | \ awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}' diff --git a/README.md b/README.md index 5be0e9e..a22da10 100644 --- a/README.md +++ b/README.md @@ -89,6 +89,40 @@ Dense transformer. 36B total parameters, all active per token. ByteDance Seed-OS Dense models activate all parameters per token, making them compute-bound rather than memory-bandwidth-bound. Expect ~3x slower decode than equivalently-sized MoE models on the same hardware. +### Qwen3-Coder-30B-A3B (MoE) + +Standard Mixture-of-Experts optimized for code generation. 30B total parameters, only 3B active per token. + +- 48 transformer layers, standard attention (GQA 32:4) +- 128 experts per MoE layer, top-8 active per token +- Quantization: UD-Q4_K_XL via [Unsloth](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) (Dynamic 2.0) +- Disk: ~17.7 GB | Min VRAM: 16 GB (with expert offloading) | Max context: 262K native +- Built-in tool calling support via `--jinja` chat template + +| GPU | VRAM | Context | Decode | 3-concurrent | VRAM used | +|-----|------|---------|--------|--------------|-----------| +| RTX 5090 | 32 GB | 64K/slot | ~275 tok/s | ~497 tok/s | 28.9 GB | +| Other NVIDIA (16 GB+) | 16+ GB | 16K/slot | varies | varies | varies | + +
+RTX 5090 detailed benchmark + +``` +SINGLE-STREAM DECODE: ~275 tok/s +3-CONCURRENT AGGREGATE: ~497 tok/s (+81% via MoE expert batching) +3-CONCURRENT PER-SLOT: ~168 tok/s each +PROMPT PROCESSING: ~345-1,038 tok/s (varies with batch position) +VRAM USAGE: 28.9 GB / 32.6 GB (3.7 GB headroom) +CONTEXT: 64K per slot (3 slots, auto-fitted from 192K request) +``` + +Benchmarked 2026-03-02 with native sm_120a (Blackwell) compilation and `BLACKWELL_NATIVE_FP4=1` enabled. + +**Why 3 slots (not 4)?** With 3 slots, `--fit on` allocates 64K context per slot instead of 48K. Aggregate throughput is nearly identical (497 vs 495 tok/s), but per-agent speed at full load is 35% faster (168 vs 124 tok/s). On a single-GPU workstation, a 4th simultaneous request is uncommon, and when one arrives it simply queues. Override with `FOUNDRY_EXTRA_ARGS="--parallel 4"` if needed. + +**vs Qwen3.5-35B-A3B**: 52% faster single-stream, 55% faster aggregate. The standard MoE architecture (no DeltaNet recurrent layers) batches more efficiently on Blackwell. It trades the 192K effective context of Qwen3.5 for raw speed.
+ ## How It Works Why llama.cpp and not SGLang or vLLM? For **consumer GPUs**, llama.cpp's MoE expert offloading (`--fit on`) is the only engine that can run a 35B-parameter MoE model on a single 16-24 GB card at full speed. SGLang and vLLM require the entire model to fit in VRAM. @@ -143,7 +177,7 @@ All settings can be overridden via environment variables: ## Multi-Agent Inference -The RTX 5090 profile is configured with `--parallel 4`, enabling 4 concurrent inference slots. This makes Foundry well-suited for multi-agent workflows where several AI agents share a single GPU. +The RTX 5090 profiles are configured with multiple concurrent inference slots: `--parallel 4` for Qwen3.5 and Hermes, `--parallel 3` for Qwen3-Coder. This makes Foundry well-suited for multi-agent workflows where several AI agents share a single GPU. ### Why MoE batching works @@ -151,26 +185,29 @@ Qwen3.5-35B-A3B uses a 256-expert MoE architecture with only 8 experts active pe ### Throughput scaling -| Active agents | Aggregate throughput | Per-agent speed | VRAM | -|---------------|---------------------|-----------------|------| -| 1 | 181 tok/s | 181 tok/s | 25.3 GB | -| 2 | 234 tok/s | ~117 tok/s each | 25.7 GB | -| 4 | 320 tok/s | ~80 tok/s each | 26.1 GB | +Measured on RTX 5090 with Qwen models (MoE): + +| Active agents | Qwen3.5-35B-A3B (4 slots) | Qwen3-Coder-30B-A3B (3 slots) | +|---------------|----------------------------|--------------------------------| +| 1 | 181 tok/s | 275 tok/s | +| 2 | 234 tok/s (117 each) | 405 tok/s (204 each) | +| 3 | — | 497 tok/s (168 each) | +| 4 | 320 tok/s (80 each) | — | -Single-agent speed is unaffected. The 4 slots only activate when there are concurrent requests. +Single-agent speed is unaffected. Concurrent slots only activate when there are simultaneous requests. 
### Multi-GPU scaling -With 2x RTX 5090, run two independent instances for 8 total concurrent slots and ~640 tok/s combined aggregate throughput: +With 2x RTX 5090, run two independent instances for double the concurrent slots and aggregate throughput: ```bash -# GPU 0: agents 1-4 +# GPU 0 docker run --gpus '"device=0"' -p 8080:8080 -v ~/.cache/foundry:/models \ - ghcr.io/infernet-org/foundry/qwen3.5-35b-a3b:latest + ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latest -# GPU 1: agents 5-8 +# GPU 1 docker run --gpus '"device=1"' -p 8081:8080 -v ~/.cache/foundry:/models \ - ghcr.io/infernet-org/foundry/qwen3.5-35b-a3b:latest + ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latest ``` ### Compatible frameworks @@ -187,6 +224,7 @@ docker compose up # Choose a different model FOUNDRY_MODEL=hermes-4.3-36b docker compose up +FOUNDRY_MODEL=qwen3-coder-30b-a3b docker compose up # With explicit profile FOUNDRY_PROFILE=rtx5090 docker compose up @@ -208,6 +246,7 @@ GF_ADMIN_PASSWORD=admin ```bash make build # Build the default model image (qwen3.5-35b-a3b) make build MODEL=hermes-4.3-36b # Build a different model +make build MODEL=qwen3-coder-30b-a3b # Build the coding-optimized model make run # Run with auto-detected GPU make test # Smoke test: start, wait for health, send one request make benchmark # Run benchmark against a running server @@ -332,12 +371,18 @@ foundry/ │ │ └── profiles/ │ │ ├── rtx5090.sh # 192K ctx, 4 slots, ~320 tok/s aggregate │ │ └── default.sh # 16K ctx, conservative settings -│ └── hermes-4.3-36b/ +│ ├── hermes-4.3-36b/ +│ │ ├── Dockerfile # Multi-stage: compiles llama.cpp for sm_89 + sm_120a +│ │ ├── entrypoint.sh # Copied from scripts/entrypoint.sh at build time +│ │ └── profiles/ +│ │ ├── rtx5090.sh # 32K ctx, 4 slots, ~132 tok/s aggregate +│ │ └── default.sh # 8K ctx, 24 GB minimum +│ └── qwen3-coder-30b-a3b/ │ ├── Dockerfile # Multi-stage: compiles llama.cpp for sm_89 + sm_120a │ ├── entrypoint.sh # Copied from scripts/entrypoint.sh at build 
time │ └── profiles/ -│ ├── rtx5090.sh # 32K ctx, 4 slots, ~132 tok/s aggregate -│ └── default.sh # 8K ctx, 24 GB minimum +│ ├── rtx5090.sh # 192K ctx, 3 slots, ~497 tok/s aggregate +│ └── default.sh # 32K ctx, conservative settings ├── scripts/ │ ├── entrypoint.sh # Shared entrypoint (GPU detect, profile load, model download) │ ├── benchmark.py # Generation speed, prompt processing, throughput diff --git a/models/qwen3-coder-30b-a3b/Dockerfile b/models/qwen3-coder-30b-a3b/Dockerfile new file mode 100644 index 0000000..0fcf326 --- /dev/null +++ b/models/qwen3-coder-30b-a3b/Dockerfile @@ -0,0 +1,93 @@ +# ============================================================================== +# Foundry Model Image: Qwen3-Coder-30B-A3B-Instruct +# ============================================================================== +# Multi-stage build for a minimal CUDA runtime. +# Compiles llama.cpp from source for sm_89 (Ada) and sm_120a (Blackwell), +# then copies only the binary and required libraries to a clean Ubuntu base. +# +# Weights are NOT baked in. They are downloaded on first run or mounted +# from the host at /models. +# ============================================================================== + +# ------------------------------------------------------------------------------ +# Stage 1: Builder +# ------------------------------------------------------------------------------ +FROM nvidia/cuda:12.9.1-devel-ubuntu24.04 AS builder + +RUN apt-get update && apt-get install -y git cmake g++ curl +RUN git clone --depth 1 -b b8183 https://github.com/ggml-org/llama.cpp.git /llama.cpp +WORKDIR /llama.cpp + +# Compile explicitly for Ada (sm_89) and Blackwell (sm_120a). +# GGML_BACKEND_DL=ON builds CUDA as a runtime-loaded plugin (dlopen), which +# avoids the libcuda.so.1 transitive link error during Docker builds where +# no real GPU driver is present. This matches the official llama.cpp Dockerfile. 
+RUN cmake -B build \ + -DGGML_NATIVE=OFF \ + -DGGML_CUDA=ON \ + -DGGML_BACKEND_DL=ON \ + -DGGML_CPU_ALL_VARIANTS=ON \ + -DCMAKE_CUDA_ARCHITECTURES="89;120a" \ + -DCMAKE_BUILD_TYPE=Release \ + -DLLAMA_CURL=OFF \ + -DLLAMA_BUILD_TESTS=OFF \ + -DLLAMA_BUILD_EXAMPLES=OFF \ + -DCMAKE_EXE_LINKER_FLAGS="-Wl,--allow-shlib-undefined" && \ + cmake --build build --config Release -j$(nproc) + +# ------------------------------------------------------------------------------ +# Stage 2: Minimal Runtime +# ------------------------------------------------------------------------------ +FROM ubuntu:24.04 + +# Install minimal runtime dependencies +RUN apt-get update && apt-get install -y --no-install-recommends \ + libgomp1 \ + python3 python3-pip curl \ + && pip3 install --break-system-packages --no-cache-dir "huggingface-hub>=0.28,<1" "hf_transfer>=0.1.6" \ + && rm -rf /var/lib/apt/lists/* + +# The NVIDIA runtime needs these env vars to mount the CUDA drivers correctly +ENV NVIDIA_VISIBLE_DEVICES=all +ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility + +# Model metadata +ENV FOUNDRY_MODEL_NAME="Qwen3-Coder-30B-A3B-Instruct" +ENV FOUNDRY_GGUF_REPO="unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF" +ENV FOUNDRY_GGUF_FILE="Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf" +ENV FOUNDRY_ARCH="moe" + +# Enable fast downloads +ENV HF_HUB_ENABLE_HF_TRANSFER="1" + +# Runtime defaults (can be overridden) +ENV FOUNDRY_PROFILE="auto" +ENV FOUNDRY_PORT="8080" +ENV FOUNDRY_CTX_LENGTH="" +ENV FOUNDRY_THREADS="" +ENV FOUNDRY_EXTRA_ARGS="" + +# Copy the compiled binary and all shared libraries from the build output. +# With GGML_BACKEND_DL=ON, backends (ggml-cuda, ggml-cpu-*) are .so modules +# loaded at runtime via dlopen. CMake places everything in build/bin/. +COPY --from=builder /llama.cpp/build/bin/ /app/ + +# Cherry-pick only the CUDA runtime libs that libggml-cuda.so actually needs. +# libcuda.so.1 is provided by the NVIDIA container runtime at launch. 
+COPY --from=builder /usr/local/cuda/lib64/libcudart.so.12 /app/ +COPY --from=builder /usr/local/cuda/lib64/libcublas.so.12 /app/ +COPY --from=builder /usr/local/cuda/lib64/libcublasLt.so.12 /app/ +ENV LD_LIBRARY_PATH="/app" + +# Copy profiles and shared entrypoint +COPY profiles/ /opt/foundry/profiles/ +COPY entrypoint.sh /opt/foundry/entrypoint.sh +RUN chmod +x /opt/foundry/entrypoint.sh + +# Model storage +RUN mkdir -p /models +VOLUME /models + +EXPOSE 8080 + +ENTRYPOINT ["/opt/foundry/entrypoint.sh"] diff --git a/models/qwen3-coder-30b-a3b/entrypoint.sh b/models/qwen3-coder-30b-a3b/entrypoint.sh new file mode 100755 index 0000000..9dff64d --- /dev/null +++ b/models/qwen3-coder-30b-a3b/entrypoint.sh @@ -0,0 +1,316 @@ +#!/bin/bash +# ============================================================================== +# Foundry Entrypoint (shared across all models) +# ============================================================================== +# 1. Detect GPU and load hardware profile +# 2. Download model if not present +# 3. Apply architecture-aware tuning (MoE vs Dense) +# 4. Launch llama-server with tuned parameters +# +# Model identity is set via Dockerfile ENV vars: +# FOUNDRY_MODEL_NAME -- display name (e.g. "Qwen3.5-35B-A3B") +# FOUNDRY_GGUF_REPO -- HuggingFace repo (e.g. "unsloth/Qwen3.5-35B-A3B-GGUF") +# FOUNDRY_GGUF_FILE -- GGUF filename +# FOUNDRY_ARCH -- architecture type: "moe" or "dense" +# +# Architecture-specific flags are applied automatically based on FOUNDRY_ARCH: +# +# MoE (e.g. Qwen3.5-35B-A3B): +# --fit on Expert offloading: spill inactive experts to CPU +# +# Dense (e.g. Hermes-4.3-36B): +# (no --fit) No experts to offload +# +# Model-specific quirks (e.g. --swa-full for hybrid attention, --cache-ram 0 +# for recurrent architectures) belong in PROFILE_EXTRA_ARGS, NOT in the arch +# tier -- they are not universal to the architecture class. 
+# +# Hardware-specific tuning (context, threads, KV quant, slots) is set by +# per-GPU profiles in /opt/foundry/profiles/*.sh. +# ============================================================================== + +set -euo pipefail + +FOUNDRY_DIR="/opt/foundry" +PROFILES_DIR="${FOUNDRY_DIR}/profiles" +MODELS_DIR="/models" + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +CYAN='\033[0;36m' +NC='\033[0m' # No Color + +log() { echo -e "${CYAN}[foundry]${NC} $*"; } +warn() { echo -e "${YELLOW}[foundry]${NC} $*" >&2; } +err() { echo -e "${RED}[foundry]${NC} $*" >&2; } + +# ============================================================================== +# GPU Detection +# ============================================================================== + +detect_gpu() { + local gpu_name + if ! command -v nvidia-smi &> /dev/null; then + warn "nvidia-smi not found, using default profile" + echo "default" + return + fi + + gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader,nounits 2>/dev/null | head -1 | xargs) + + if [ -z "$gpu_name" ]; then + warn "Could not detect GPU, using default profile" + echo "default" + return + fi + + # Log to stderr so it doesn't interfere with the captured profile name + log "Detected GPU: ${gpu_name}" >&2 + + # Map GPU name to profile + case "$gpu_name" in + *"5090"*) echo "rtx5090" ;; + *) + warn "Unknown or unsupported GPU '${gpu_name}', using default profile" + echo "default" + ;; + esac +} + +# ============================================================================== +# Profile Loading +# ============================================================================== + +load_profile() { + local profile_name="$1" + local profile_file="${PROFILES_DIR}/${profile_name}.sh" + + if [ ! -f "$profile_file" ]; then + warn "Profile '${profile_name}' not found, falling back to default" + profile_file="${PROFILES_DIR}/default.sh" + fi + + if [ ! 
-f "$profile_file" ]; then + err "No default profile found at ${profile_file}" + exit 1 + fi + + log "Loading profile: ${profile_name}" + # shellcheck source=profiles/default.sh + source "$profile_file" +} + +# ============================================================================== +# Model Download +# ============================================================================== + +download_model() { + local gguf_path="${MODELS_DIR}/${FOUNDRY_GGUF_FILE}" + + if [ -f "$gguf_path" ]; then + local size + size=$(du -h "$gguf_path" | cut -f1) + log "Model found: ${gguf_path} (${size})" + return 0 + fi + + log "Model not found at ${gguf_path}" + log "Downloading ${FOUNDRY_GGUF_FILE} from ${FOUNDRY_GGUF_REPO}..." + log "This is a one-time download (~20GB). Subsequent starts will be instant." + echo "" + + # Use python3 huggingface_hub to download (huggingface-cli may not be on PATH) + # Variables are passed via environment to avoid shell injection in inline Python + if python3 -c "import huggingface_hub" 2>/dev/null; then + FOUNDRY_GGUF_REPO="${FOUNDRY_GGUF_REPO}" \ + FOUNDRY_GGUF_FILE="${FOUNDRY_GGUF_FILE}" \ + FOUNDRY_MODELS_DIR="${MODELS_DIR}" \ + python3 -c " +import os +from huggingface_hub import hf_hub_download +token = os.environ.get('HF_TOKEN') or os.environ.get('HUGGING_FACE_HUB_TOKEN') +hf_hub_download( + repo_id=os.environ['FOUNDRY_GGUF_REPO'], + filename=os.environ['FOUNDRY_GGUF_FILE'], + local_dir=os.environ['FOUNDRY_MODELS_DIR'], + token=token +) +" + else + err "huggingface-hub not found. Please mount the GGUF at ${gguf_path}" + err "Or install huggingface-hub: pip install huggingface-hub" + exit 1 + fi + + if [ ! 
-f "$gguf_path" ]; then + err "Download failed: ${gguf_path} not found after download" + exit 1 + fi + + local size + size=$(du -h "$gguf_path" | cut -f1) + log "Download complete: ${gguf_path} (${size})" +} + +# ============================================================================== +# Build Launch Command +# ============================================================================== +# Flags are layered in three tiers: +# 1. Architecture defaults (FOUNDRY_ARCH) -- systematic, model-class level +# 2. Hardware profile (PROFILE_*) -- per-GPU tuning knobs +# 3. User overrides (FOUNDRY_EXTRA_ARGS) -- escape hatch, highest priority + +build_command() { + local gguf_path="${MODELS_DIR}/${FOUNDRY_GGUF_FILE}" + local arch="${FOUNDRY_ARCH:-dense}" + + # Use a bash array to safely handle arguments with spaces + local -a cmd=("/app/llama-server") + cmd+=("--model" "${gguf_path}") + cmd+=("--host" "0.0.0.0") + cmd+=("--port" "${FOUNDRY_PORT:-8080}") + + # --- Tier 1: Architecture-specific flags ---------------------------------- + # These are determined by the model class, not by the GPU or user preference. + + if [ "$arch" = "moe" ]; then + # MoE: enable expert offloading (spill inactive experts to CPU when VRAM + # is tight). On high-VRAM GPUs --fit keeps everything on GPU automatically. + cmd+=("--fit" "on") + fi + # Dense models: no --fit (no experts to offload). + # Model-specific flags (--swa-full, --cache-ram) go in PROFILE_EXTRA_ARGS. + + # --- Tier 2: Hardware profile tuning -------------------------------------- + # These come from the sourced profile .sh file and tune for the specific GPU. 
+ + # Context length (env override > profile > default) + local ctx="${FOUNDRY_CTX_LENGTH:-${PROFILE_CTX_LENGTH:-32768}}" + cmd+=("--ctx-size" "${ctx}") + + # Thread count (env override > profile > auto) + local threads="${FOUNDRY_THREADS:-${PROFILE_THREADS:-}}" + if [ -n "$threads" ]; then + cmd+=("--threads" "${threads}") + fi + + # Batch thread count (can be higher than decode threads for prompt processing) + local threads_batch="${PROFILE_THREADS_BATCH:-${threads}}" + if [ -n "$threads_batch" ]; then + cmd+=("--threads-batch" "${threads_batch}") + fi + + # Flash attention (new llama.cpp requires explicit on/off/auto value) + local fa="${PROFILE_FLASH_ATTN:-on}" + cmd+=("--flash-attn" "${fa}") + + # KV cache quantization + local ctk="${PROFILE_KV_TYPE_K:-q8_0}" + local ctv="${PROFILE_KV_TYPE_V:-q8_0}" + cmd+=("-ctk" "${ctk}" "-ctv" "${ctv}") + + # Memory mapping + if [ "${PROFILE_NO_MMAP:-true}" = "true" ]; then + cmd+=("--no-mmap") + fi + + # Jinja templates (for tool calling / chat templates) + if [ "${PROFILE_JINJA:-true}" = "true" ]; then + cmd+=("--jinja") + fi + + # Parallel slots for concurrent requests + local slots="${PROFILE_PARALLEL:-2}" + cmd+=("--parallel" "${slots}") + + # Thread priority for reduced scheduling latency + local prio="${PROFILE_PRIO:-0}" + if [ "$prio" != "0" ]; then + cmd+=("--prio" "${prio}") + fi + + # Strict CPU placement for cache locality + if [ "${PROFILE_CPU_STRICT:-0}" = "1" ]; then + cmd+=("--cpu-strict" "1") + fi + + # KV cache reuse for multi-turn chat (prefix sharing via KV shifting) + local cache_reuse="${PROFILE_CACHE_REUSE:-0}" + if [ "$cache_reuse" != "0" ]; then + cmd+=("--cache-reuse" "${cache_reuse}") + fi + + # Disable web UI for headless server deployments + if [ "${PROFILE_NO_WEBUI:-false}" = "true" ]; then + cmd+=("--no-webui") + fi + + # Prometheus-compatible metrics endpoint + if [ "${PROFILE_METRICS:-false}" = "true" ]; then + cmd+=("--metrics") + fi + + # Profile-specific extra args (split on spaces 
intentionally) + if [ -n "${PROFILE_EXTRA_ARGS:-}" ]; then + # shellcheck disable=SC2206 + cmd+=(${PROFILE_EXTRA_ARGS}) + fi + + # --- Tier 3: User overrides ----------------------------------------------- + if [ -n "${FOUNDRY_EXTRA_ARGS:-}" ]; then + # shellcheck disable=SC2206 + cmd+=(${FOUNDRY_EXTRA_ARGS}) + fi + + # Store the array globally so main() can exec it safely + FOUNDRY_CMD=("${cmd[@]}") +} + +# ============================================================================== +# Main +# ============================================================================== + +main() { + echo "" + echo -e "${GREEN}╔════════════════════════════════════════════╗${NC}" + echo -e "${GREEN}║ Foundry Inference ║${NC}" + echo -e "${GREEN}║ github.com/infernet-org/foundry ║${NC}" + echo -e "${GREEN}╚════════════════════════════════════════════╝${NC}" + echo "" + + log "Model: ${FOUNDRY_MODEL_NAME}" + log "Architecture: ${FOUNDRY_ARCH:-dense}" + + # 1. Determine profile + local profile + if [ "${FOUNDRY_PROFILE}" = "auto" ]; then + profile=$(detect_gpu) + else + profile="${FOUNDRY_PROFILE}" + fi + + # 2. Load profile + load_profile "$profile" + + # 3. Download model if needed + download_model + + # 4. Build launch command (sets FOUNDRY_CMD array directly, no subshell) + build_command + + echo "" + log "Launch command:" + echo -e "${CYAN} ${FOUNDRY_CMD[*]}${NC}" + echo "" + log "OpenAI-compatible API will be available at:" + echo -e "${GREEN} http://localhost:${FOUNDRY_PORT:-8080}/v1/chat/completions${NC}" + echo "" + + # 5. 
Launch (exec replaces shell process for proper signal handling) + # Use the array form to avoid word-splitting issues + exec "${FOUNDRY_CMD[@]}" +} + +main "$@" diff --git a/models/qwen3-coder-30b-a3b/profiles/default.sh b/models/qwen3-coder-30b-a3b/profiles/default.sh new file mode 100644 index 0000000..6b1880a --- /dev/null +++ b/models/qwen3-coder-30b-a3b/profiles/default.sh @@ -0,0 +1,25 @@ +# ============================================================================== +# Foundry Profile: Default (16GB+ VRAM) +# ============================================================================== +# Qwen3-Coder-30B-A3B-Instruct UD-Q4_K_XL (~17.7GB) +# +# Conservative profile for GPUs with 16-24GB VRAM. +# At 17.7GB model weight, this is the lightest model in the lineup and +# has the most headroom on 16GB cards with MoE expert offloading. +# ============================================================================== + +PROFILE_CTX_LENGTH=32768 # 32K context -- safe for 16GB+ cards +PROFILE_THREADS=8 # Conservative thread count +PROFILE_THREADS_BATCH=8 +PROFILE_FLASH_ATTN="on" +PROFILE_KV_TYPE_K="q4_0" # Aggressive KV quantization to save VRAM +PROFILE_KV_TYPE_V="q4_0" +PROFILE_NO_MMAP="true" +PROFILE_JINJA="true" # Tool calling support +PROFILE_PARALLEL=2 # 2 slots for smaller GPUs +PROFILE_PRIO=2 +PROFILE_CPU_STRICT=0 +PROFILE_CACHE_REUSE=256 +PROFILE_NO_WEBUI="true" +PROFILE_METRICS="true" +PROFILE_EXTRA_ARGS="--mlock" diff --git a/models/qwen3-coder-30b-a3b/profiles/rtx5090.sh b/models/qwen3-coder-30b-a3b/profiles/rtx5090.sh new file mode 100644 index 0000000..4b75c28 --- /dev/null +++ b/models/qwen3-coder-30b-a3b/profiles/rtx5090.sh @@ -0,0 +1,54 @@ +# ============================================================================== +# Foundry Profile: RTX 5090 (32GB) +# ============================================================================== +# Qwen3-Coder-30B-A3B-Instruct UD-Q4_K_XL (~17.7GB) +# +# Architecture: Qwen3 MoE (standard transformer, NOT 
hybrid DeltaNet) +# - Standard MoE with full KV cache (no recurrent layers) +# - 128 experts per MoE layer, top-8 active per token (~3B active) +# - Optimized for code generation and tool calling +# +# Why --parallel 3 (not 4): +# With 3 slots, --fit on allocates 64K context per slot (vs 48K with 4 slots). +# This is 33% more context per agent with nearly identical aggregate throughput: +# 3 slots: 275 tok/s single | 497 tok/s agg | 168 tok/s each | 64K/slot +# 4 slots: 274 tok/s single | 495 tok/s agg | 124 tok/s each | 48K/slot +# A 4th request queues only when 3 requests are already in flight, +# and per-agent speed at full load is 35% faster (168 vs 124 tok/s). +# +# VRAM budget (32,607 MiB total): +# Model weights: ~17.7 GB +# KV cache (192K): ~9.8 GB (3 slots x 64K, q8_0) +# Compute buffers: ~2.4 GB +# Free headroom: ~2.7 GB +# +# Key differences from Qwen3.5-35B-A3B profile: +# - No --swa-full (not a hybrid model, no sliding window attention) +# - No --cache-ram 0 (standard KV cache, prompt caching works normally) +# - --parallel 3 (vs 4 for Qwen3.5, which has smaller KV due to recurrent layers) +# - cache-reuse enabled (effective for coding workflows with repeated context) +# +# Benchmarked on RTX 5090 (2026-03-02, native sm_120a, BLACKWELL_NATIVE_FP4=1): +# Single-stream decode: ~275 tok/s (memory-bandwidth-bound) +# 3-concurrent aggregate: ~497 tok/s (+81% via MoE expert batching) +# 3-concurrent per-slot: ~168 tok/s each +# Prompt processing: ~345-1,038 tok/s (varies with batch position) +# ============================================================================== + +PROFILE_CTX_LENGTH=196608 # 192K total -- --fit on allocates 64K per slot with 3 slots +PROFILE_THREADS=16 # Physical cores (avoid hyperthreads for decode) +PROFILE_THREADS_BATCH=20 # Higher thread count for prompt processing +PROFILE_FLASH_ATTN="on" # Flash attention for long context perf +PROFILE_KV_TYPE_K="q8_0" # KV cache key quantization +PROFILE_KV_TYPE_V="q8_0" # KV cache value
quantization +PROFILE_NO_MMAP="true" # Avoid page faults, load model into RAM +PROFILE_JINJA="true" # Chat template / tool calling support +PROFILE_PARALLEL=3 # 3 slots: 64K/slot, 497 tok/s agg, 168 tok/s each + # (see "Why --parallel 3" above) +PROFILE_PRIO=2 # High thread priority for reduced scheduling latency +PROFILE_CPU_STRICT=1 # Strict CPU placement for cache locality +PROFILE_CACHE_REUSE=256 # KV cache reuse for multi-turn coding sessions +PROFILE_NO_WEBUI="true" # Headless: no web UI, reduce attack surface +PROFILE_METRICS="true" # Prometheus-compatible /metrics endpoint +# --mlock: pin model in RAM; -b/-ub 4096: large batch for fast prompt encode +PROFILE_EXTRA_ARGS="--mlock -b 4096 -ub 4096"
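
The slot-sizing figures quoted in the RTX 5090 profile comments can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: it assumes the total context (`PROFILE_CTX_LENGTH`) is split evenly across the `--parallel` slots, as the profile comments describe, and the helper names `per_slot_context` and `pct_gain` are hypothetical, not part of Foundry or llama.cpp. The throughput numbers are copied from the benchmark comments in the profile.

```python
# Illustrative check of the figures in profiles/rtx5090.sh.
# Assumption: the total context is divided evenly across slots.
# per_slot_context and pct_gain are hypothetical helper names.

TOTAL_CTX = 196608  # PROFILE_CTX_LENGTH (192K tokens)


def per_slot_context(total_ctx: int, slots: int) -> int:
    """Tokens of context each slot receives under an even split."""
    return total_ctx // slots


def pct_gain(new: float, old: float) -> int:
    """Relative gain of new over old, rounded to a whole percent."""
    return round((new / old - 1) * 100)


ctx_3 = per_slot_context(TOTAL_CTX, 3)  # 3-slot profile (the default here)
ctx_4 = per_slot_context(TOTAL_CTX, 4)  # 4-slot alternative

if __name__ == "__main__":
    print(f"3 slots: {ctx_3 // 1024}K per slot")
    print(f"4 slots: {ctx_4 // 1024}K per slot")
    print(f"context gain per slot: {pct_gain(ctx_3, ctx_4)}%")
    print(f"per-agent speed gain (168 vs 124 tok/s): {pct_gain(168, 124)}%")
    print(f"3-concurrent gain over single stream (497 vs 275 tok/s): {pct_gain(497, 275)}%")
```

Running it reproduces the 64K and 48K per-slot values and the 33%, 35%, and 81% figures quoted in the profile comments, which makes it a quick sanity check if the context length or slot count is changed.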