diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 01edd9b..dc91fe4 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -20,7 +20,7 @@ jobs:
 
     strategy:
       matrix:
-        model: [qwen3.5-35b-a3b, hermes-4.3-36b]
+        model: [qwen3.5-35b-a3b, qwen3-coder-30b-a3b, hermes-4.3-36b]
 
     steps:
       - name: Checkout
diff --git a/AGENTS.md b/AGENTS.md
index 792ac83..720efdf 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -51,6 +51,10 @@ No API key is required by default. If your client demands one, any non-empty str
         "qwen": {
           "id": "qwen3.5-35b-a3b",
           "name": "Qwen 3.5 35B A3B"
+        },
+        "qwen-coder": {
+          "id": "qwen3-coder-30b-a3b",
+          "name": "Qwen 3 Coder 30B A3B"
         }
       }
     }
@@ -68,7 +72,7 @@ API Key:  sk-local
 Model:    qwen3.5-35b-a3b
 ```
 
-Cursor uses streaming by default. Foundry supports SSE streaming natively. With 4 parallel slots, you can run Cursor's background indexing and active chat simultaneously without blocking.
+Cursor uses streaming by default. Foundry supports SSE streaming natively. With multiple parallel slots, you can run Cursor's background indexing and active chat simultaneously without blocking.
 
 ### Continue (VS Code / JetBrains)
 
@@ -115,7 +119,7 @@ Model ID: qwen3.5-35b-a3b
 
 ## Multi-Agent Frameworks
 
-Foundry's 4 parallel inference slots make it particularly suited for multi-agent workflows where multiple agents share a single model. Each slot processes requests independently with minimal throughput degradation.
+Foundry's parallel inference slots make it particularly suited for multi-agent workflows where multiple agents share a single model. Each slot processes requests independently with minimal throughput degradation.
 
 ### CrewAI
 
@@ -162,7 +166,7 @@ crew = Crew(
 result = crew.kickoff(inputs={"topic": "GPU inference optimization"})
 ```
 
-With 4 parallel slots, CrewAI can run 4 agents simultaneously at ~80 tok/s each (Qwen MoE) or ~16 tok/s each (Hermes Dense).
+CrewAI can run 3 agents simultaneously at ~168 tok/s each on Qwen3-Coder (MoE, 3 slots), or 4 agents at ~33 tok/s each on Hermes (dense, 4 slots).
 
 ### AutoGen
 
@@ -237,7 +241,7 @@ docker run -d -p 3000:8080 \
   ghcr.io/open-webui/open-webui:main
 ```
 
-Open WebUI supports multi-user chat with conversation history. Each user session uses one of Foundry's 4 inference slots.
+Open WebUI supports multi-user chat with conversation history. Each user session uses one of Foundry's inference slots.
 
 ### text-generation-webui (oobabooga)
 
@@ -315,13 +319,14 @@ console.log(response.choices[0].message.content);
 
 | Use case | Recommended model | Why |
 |----------|-------------------|-----|
-| **Coding agents** (OpenCode, Cursor, Aider) | Qwen3.5-35B-A3B | Fast decode (181 tok/s), 192K context for large codebases, good at code |
-| **Multi-agent orchestration** (CrewAI, AutoGen) | Qwen3.5-35B-A3B | 4-concurrent at 320 tok/s aggregate, MoE batching advantage |
+| **Coding agents** (OpenCode, Cursor, Aider) | Qwen3-Coder-30B-A3B | Fastest decode (275 tok/s), purpose-built for code, tool calling support |
+| **Multi-agent orchestration** (CrewAI, AutoGen) | Qwen3-Coder-30B-A3B | 3-concurrent at 497 tok/s aggregate, best MoE batching efficiency |
+| **General coding + long context** | Qwen3.5-35B-A3B | 192K effective context for large codebases, hybrid recurrent architecture |
 | **Reasoning-heavy tasks** | Hermes-4.3-36B | Thinking mode with `<think>` tags, stronger reasoning on hard problems |
-| **Tool use / function calling** | Hermes-4.3-36B | Trained specifically for structured tool calling with `<tool_call>` XML |
+| **Tool use / function calling** | Qwen3-Coder-30B-A3B or Hermes-4.3-36B | Both have strong tool calling; Coder decodes roughly 4x faster (~275 vs ~64 tok/s), Hermes is more reliable on complex schemas |
 | **Roleplay / creative writing** | Hermes-4.3-36B | NousResearch fine-tune optimized for personality and narrative |
 | **Long document Q&A** | Qwen3.5-35B-A3B | 192K context window, recurrent layers handle long sequences efficiently |
-| **16 GB VRAM GPUs** | Qwen3.5-35B-A3B | MoE expert offloading works on 16 GB; Hermes needs 24 GB minimum |
+| **16 GB VRAM GPUs** | Qwen3-Coder-30B-A3B | Smallest disk footprint (17.7 GB), MoE expert offloading works on 16 GB |
 
 ## Performance Considerations
 
@@ -331,10 +336,11 @@ Single-stream decode latency (time to generate one token):
 
 | Model | Latency per token | Tokens per second |
 |-------|-------------------|-------------------|
+| Qwen3-Coder-30B-A3B | ~3.6 ms | ~275 tok/s |
 | Qwen3.5-35B-A3B | ~5.5 ms | ~181 tok/s |
 | Hermes-4.3-36B | ~15.5 ms | ~64 tok/s |
 
-For interactive coding agents, Qwen delivers a visibly faster typing experience. For batch/background tasks where latency is less critical, Hermes' stronger reasoning may be worth the tradeoff.
+For interactive coding agents, Qwen3-Coder delivers the fastest typing experience. Qwen3.5 trades some speed for 192K effective context. For batch/background tasks where latency is less critical, Hermes' stronger reasoning may be worth the tradeoff.
 
 ### Prompt processing
 
@@ -342,13 +348,21 @@ Prompt processing (prefill) runs at ~1,163 tok/s for Qwen on RTX 5090. A 10K tok
 
 ### Concurrent agent scaling
 
+Qwen3-Coder-30B-A3B (fastest, 3 slots):
+```
+1 agent:  275 tok/s  (100% per-agent speed)
+2 agents: 405 tok/s  (~204 tok/s each, 74% per-agent)
+3 agents: 497 tok/s  (~168 tok/s each, 61% per-agent)
+```
+
+Qwen3.5-35B-A3B (4 slots):
 ```
 1 agent:  181 tok/s  (100% per-agent speed)
 2 agents: 234 tok/s  (~117 tok/s each, 65% per-agent)
 4 agents: 320 tok/s  (~80 tok/s each, 44% per-agent)
 ```
 
-If your workflow has >4 concurrent agents, requests queue until a slot is free. Consider multi-GPU routing (below) for higher concurrency.
+If your workflow has more concurrent agents than slots, requests queue until a slot is free. Consider multi-GPU routing (below) for higher concurrency.
 
 ### Context window usage
 
@@ -356,6 +370,7 @@ VRAM scales with context usage. The default RTX 5090 profiles are tuned for maxi
 
 | Model | Default context | VRAM at idle | VRAM at full context |
 |-------|----------------|--------------|---------------------|
+| Qwen3-Coder-30B-A3B | 192K | 25.0 GB | ~28.9 GB |
 | Qwen3.5-35B-A3B | 192K | 25.3 GB | ~26.1 GB |
 | Hermes-4.3-36B | 32K | 24.5 GB | ~27.8 GB |
 
@@ -370,7 +385,7 @@ docker run --gpus all -p 8080:8080 \
 
 ## Structured Output
 
-Both models support JSON mode for structured outputs:
+All three models support JSON mode for structured outputs:
 
 ```python
 response = client.chat.completions.create(
@@ -440,7 +455,7 @@ if tool_calls:
     print(f"Args: {tool_calls[0].function.arguments}")
 ```
 
-Both models support Jinja chat templates for tool calling. The entrypoint enables `--jinja` by default.
+All models support Jinja chat templates for tool calling. The entrypoint enables `--jinja` by default.
 
 ## Thinking / Reasoning Mode
 
@@ -483,7 +498,7 @@ The Docker Compose configuration includes TCP tuning (BBR congestion control, bu
 
 ## Multi-GPU Agent Routing
 
-For workloads requiring more than 4 concurrent agents, run multiple Foundry instances and load-balance across them.
+For workloads requiring more concurrent agents than slots, run multiple Foundry instances and load-balance across them.
 
 ### Simple round-robin with nginx
 
@@ -511,14 +526,15 @@ For deterministic routing (each agent always hits the same GPU):
 ```python
 import os
 
-# Route based on agent ID
+# Route based on agent ID (adjust slots_per_gpu to match your model's --parallel setting)
+slots_per_gpu = 3  # Qwen3-Coder default; use 4 for Qwen3.5/Hermes
 gpu_endpoints = [
-    "http://localhost:8080/v1",  # GPU 0: agents 0-3
-    "http://localhost:8081/v1",  # GPU 1: agents 4-7
+    "http://localhost:8080/v1",  # GPU 0
+    "http://localhost:8081/v1",  # GPU 1
 ]
 
 def get_client(agent_id: int) -> OpenAI:
-    endpoint = gpu_endpoints[agent_id // 4]
+    endpoint = gpu_endpoints[(agent_id // slots_per_gpu) % len(gpu_endpoints)]  # wrap around so extra agents queue instead of raising IndexError
     return OpenAI(base_url=endpoint, api_key="sk-local")
 ```
 
@@ -536,7 +552,7 @@ If you see `no devices with dedicated memory found` in the logs, the CUDA backen
 
 1. Check if all layers are on GPU: look for `offloaded N/N layers to GPU` in container logs
 2. Check VRAM: `nvidia-smi` -- if VRAM is full, reduce context with `FOUNDRY_CTX_LENGTH`
-3. Check if all 4 slots are occupied: `curl http://localhost:8080/metrics | grep slots`
+3. Check if all slots are occupied: `curl http://localhost:8080/metrics | grep slots`
 
 ### Connection refused
 
diff --git a/Makefile b/Makefile
index ab14819..6dc6406 100644
--- a/Makefile
+++ b/Makefile
@@ -12,8 +12,8 @@ MODELS_DIR ?= $(HOME)/.cache/foundry
 .PHONY: help build run run-profile test benchmark monitoring down push push-all clean clean-models download
 
 help: ## Show this help
-	@echo "Available models: qwen3.5-35b-a3b (default), hermes-4.3-36b"
-	@echo "Usage: make run MODEL=hermes-4.3-36b"
+	@echo "Available models: qwen3.5-35b-a3b (default), qwen3-coder-30b-a3b, hermes-4.3-36b"
+	@echo "Usage: make run MODEL=qwen3-coder-30b-a3b"
 	@echo ""
 	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | \
 		awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'
diff --git a/README.md b/README.md
index 5be0e9e..a22da10 100644
--- a/README.md
+++ b/README.md
@@ -89,6 +89,40 @@ Dense transformer. 36B total parameters, all active per token. ByteDance Seed-OS
 
 Dense models activate all parameters per token, making them compute-bound rather than memory-bandwidth-bound. Expect ~3x slower decode than equivalently-sized MoE models on the same hardware.
 
+### Qwen3-Coder-30B-A3B (MoE)
+
+Standard Mixture-of-Experts optimized for code generation. 30B total parameters, only 3B active per token.
+
+- 48 transformer layers, standard attention (GQA 32:4)
+- 128 experts per MoE layer, top-8 active per token
+- Quantization: UD-Q4_K_XL via [Unsloth](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) (Dynamic 2.0)
+- Disk: ~17.7 GB | Min VRAM: 16 GB (with expert offloading) | Max context: 262K native
+- Built-in tool calling support via `--jinja` chat template
+
+| GPU | VRAM | Context | Decode | 3-concurrent | VRAM used |
+|-----|------|---------|--------|--------------|-----------|
+| RTX 5090 | 32 GB | 64K/slot | ~275 tok/s | ~497 tok/s | 28.9 GB |
+| Other NVIDIA (16 GB+) | 16+ GB | 16K/slot | varies | varies | varies |
+
+<details>
+<summary>RTX 5090 detailed benchmark</summary>
+
+```
+SINGLE-STREAM DECODE:    ~275 tok/s
+3-CONCURRENT AGGREGATE:  ~497 tok/s  (+81% via MoE expert batching)
+3-CONCURRENT PER-SLOT:   ~168 tok/s  each
+PROMPT PROCESSING:       ~345-1,038 tok/s  (varies with batch position)
+VRAM USAGE:              28.9 GB / 32.6 GB (3.7 GB headroom)
+CONTEXT:                 64K per slot (3 slots, auto-fitted from 192K request)
+```
+
+Benchmarked 2026-03-02 with native sm_120a (Blackwell) compilation and `BLACKWELL_NATIVE_FP4=1` enabled.
+
+**Why 3 slots (not 4)?** With 3 slots, `--fit on` allocates 64K context per slot instead of 48K. Aggregate throughput is essentially identical (497 vs 495 tok/s), but per-agent speed under load is 35% faster (168 vs 124 tok/s). The 4th slot rarely matters for a single-GPU workstation. Override with `FOUNDRY_EXTRA_ARGS="--parallel 4"` if needed.
+
+**vs Qwen3.5-35B-A3B**: 52% faster single-stream, 55% faster aggregate. The standard MoE architecture (no DeltaNet recurrent layers) batches more efficiently on Blackwell. Trades the 192K effective context of Qwen3.5 for raw speed.
+</details>
+
 ## How It Works
 
 Why llama.cpp and not SGLang or vLLM? For **consumer GPUs**, llama.cpp's MoE expert offloading (`--fit on`) is the only engine that can run a 35B-parameter MoE model on a single 16-24 GB card at full speed. SGLang and vLLM require the entire model to fit in VRAM.
@@ -143,7 +177,7 @@ All settings can be overridden via environment variables:
 
 ## Multi-Agent Inference
 
-The RTX 5090 profile is configured with `--parallel 4`, enabling 4 concurrent inference slots. This makes Foundry well-suited for multi-agent workflows where several AI agents share a single GPU.
+The RTX 5090 profiles are configured with multiple concurrent inference slots: `--parallel 4` for Qwen3.5 and Hermes, `--parallel 3` for Qwen3-Coder. This makes Foundry well-suited for multi-agent workflows where several AI agents share a single GPU.
 
 ### Why MoE batching works
 
@@ -151,26 +185,29 @@ Qwen3.5-35B-A3B uses a 256-expert MoE architecture with only 8 experts active pe
 
 ### Throughput scaling
 
-| Active agents | Aggregate throughput | Per-agent speed | VRAM |
-|---------------|---------------------|-----------------|------|
-| 1 | 181 tok/s | 181 tok/s | 25.3 GB |
-| 2 | 234 tok/s | ~117 tok/s each | 25.7 GB |
-| 4 | 320 tok/s | ~80 tok/s each | 26.1 GB |
+Measured on RTX 5090 with Qwen models (MoE):
+
+| Active agents | Qwen3.5-35B-A3B (4 slots) | Qwen3-Coder-30B-A3B (3 slots) |
+|---------------|----------------------------|--------------------------------|
+| 1 | 181 tok/s | 275 tok/s |
+| 2 | 234 tok/s (117 each) | 405 tok/s (204 each) |
+| 3 | — | 497 tok/s (168 each) |
+| 4 | 320 tok/s (80 each) | — |
 
-Single-agent speed is unaffected. The 4 slots only activate when there are concurrent requests.
+Single-agent speed is unaffected. Concurrent slots only activate when there are simultaneous requests.
 
 ### Multi-GPU scaling
 
-With 2x RTX 5090, run two independent instances for 8 total concurrent slots and ~640 tok/s combined aggregate throughput:
+With 2x RTX 5090, run two independent instances to double both the number of concurrent slots and the aggregate throughput:
 
 ```bash
-# GPU 0: agents 1-4
+# GPU 0
 docker run --gpus '"device=0"' -p 8080:8080 -v ~/.cache/foundry:/models \
-  ghcr.io/infernet-org/foundry/qwen3.5-35b-a3b:latest
+  ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latest
 
-# GPU 1: agents 5-8
+# GPU 1
 docker run --gpus '"device=1"' -p 8081:8080 -v ~/.cache/foundry:/models \
-  ghcr.io/infernet-org/foundry/qwen3.5-35b-a3b:latest
+  ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latest
 ```
 
 ### Compatible frameworks
@@ -187,6 +224,7 @@ docker compose up
 
 # Choose a different model
 FOUNDRY_MODEL=hermes-4.3-36b docker compose up
+FOUNDRY_MODEL=qwen3-coder-30b-a3b docker compose up
 
 # With explicit profile
 FOUNDRY_PROFILE=rtx5090 docker compose up
@@ -208,6 +246,7 @@ GF_ADMIN_PASSWORD=admin
 ```bash
 make build                        # Build the default model image (qwen3.5-35b-a3b)
 make build MODEL=hermes-4.3-36b   # Build a different model
+make build MODEL=qwen3-coder-30b-a3b  # Build the coding-optimized model
 make run                          # Run with auto-detected GPU
 make test                         # Smoke test: start, wait for health, send one request
 make benchmark                    # Run benchmark against a running server
@@ -332,12 +371,18 @@ foundry/
 │   │   └── profiles/
 │   │       ├── rtx5090.sh           # 192K ctx, 4 slots, ~320 tok/s aggregate
 │   │       └── default.sh           # 16K ctx, conservative settings
-│   └── hermes-4.3-36b/
+│   ├── hermes-4.3-36b/
+│   │   ├── Dockerfile               # Multi-stage: compiles llama.cpp for sm_89 + sm_120a
+│   │   ├── entrypoint.sh            # Copied from scripts/entrypoint.sh at build time
+│   │   └── profiles/
+│   │       ├── rtx5090.sh           # 32K ctx, 4 slots, ~132 tok/s aggregate
+│   │       └── default.sh           # 8K ctx, 24 GB minimum
+│   └── qwen3-coder-30b-a3b/
 │       ├── Dockerfile               # Multi-stage: compiles llama.cpp for sm_89 + sm_120a
 │       ├── entrypoint.sh            # Copied from scripts/entrypoint.sh at build time
 │       └── profiles/
-│           ├── rtx5090.sh           # 32K ctx, 4 slots, ~132 tok/s aggregate
-│           └── default.sh           # 8K ctx, 24 GB minimum
+│           ├── rtx5090.sh           # 192K ctx, 3 slots, ~497 tok/s aggregate
+│           └── default.sh           # 32K ctx, conservative settings
 ├── scripts/
 │   ├── entrypoint.sh                # Shared entrypoint (GPU detect, profile load, model download)
 │   ├── benchmark.py                 # Generation speed, prompt processing, throughput
diff --git a/models/qwen3-coder-30b-a3b/Dockerfile b/models/qwen3-coder-30b-a3b/Dockerfile
new file mode 100644
index 0000000..0fcf326
--- /dev/null
+++ b/models/qwen3-coder-30b-a3b/Dockerfile
@@ -0,0 +1,93 @@
+# ==============================================================================
+# Foundry Model Image: Qwen3-Coder-30B-A3B-Instruct
+# ==============================================================================
+# Multi-stage build for a minimal CUDA runtime.
+# Compiles llama.cpp from source for sm_89 (Ada) and sm_120a (Blackwell),
+# then copies only the binary and required libraries to a clean Ubuntu base.
+#
+# Weights are NOT baked in. They are downloaded on first run or mounted
+# from the host at /models.
+# ==============================================================================
+
+# ------------------------------------------------------------------------------
+# Stage 1: Builder
+# ------------------------------------------------------------------------------
+FROM nvidia/cuda:12.9.1-devel-ubuntu24.04 AS builder
+
+RUN apt-get update && apt-get install -y git cmake g++ curl
+RUN git clone --depth 1 -b b8183 https://github.com/ggml-org/llama.cpp.git /llama.cpp
+WORKDIR /llama.cpp
+
+# Compile explicitly for Ada (sm_89) and Blackwell (sm_120a).
+# GGML_BACKEND_DL=ON builds CUDA as a runtime-loaded plugin (dlopen), which
+# avoids the libcuda.so.1 transitive link error during Docker builds where
+# no real GPU driver is present. This matches the official llama.cpp Dockerfile.
+RUN cmake -B build \
+    -DGGML_NATIVE=OFF \
+    -DGGML_CUDA=ON \
+    -DGGML_BACKEND_DL=ON \
+    -DGGML_CPU_ALL_VARIANTS=ON \
+    -DCMAKE_CUDA_ARCHITECTURES="89;120a" \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DLLAMA_CURL=OFF \
+    -DLLAMA_BUILD_TESTS=OFF \
+    -DLLAMA_BUILD_EXAMPLES=OFF \
+    -DCMAKE_EXE_LINKER_FLAGS="-Wl,--allow-shlib-undefined" && \
+    cmake --build build --config Release -j$(nproc)
+
+# ------------------------------------------------------------------------------
+# Stage 2: Minimal Runtime
+# ------------------------------------------------------------------------------
+FROM ubuntu:24.04
+
+# Install minimal runtime dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    libgomp1 \
+    python3 python3-pip curl \
+    && pip3 install --break-system-packages --no-cache-dir "huggingface-hub>=0.28,<1" "hf_transfer>=0.1.6" \
+    && rm -rf /var/lib/apt/lists/*
+
+# The NVIDIA runtime needs these env vars to mount the CUDA drivers correctly
+ENV NVIDIA_VISIBLE_DEVICES=all
+ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
+
+# Model metadata
+ENV FOUNDRY_MODEL_NAME="Qwen3-Coder-30B-A3B-Instruct"
+ENV FOUNDRY_GGUF_REPO="unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF"
+ENV FOUNDRY_GGUF_FILE="Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf"
+ENV FOUNDRY_ARCH="moe"
+
+# Enable fast downloads
+ENV HF_HUB_ENABLE_HF_TRANSFER="1"
+
+# Runtime defaults (can be overridden)
+ENV FOUNDRY_PROFILE="auto"
+ENV FOUNDRY_PORT="8080"
+ENV FOUNDRY_CTX_LENGTH=""
+ENV FOUNDRY_THREADS=""
+ENV FOUNDRY_EXTRA_ARGS=""
+
+# Copy the compiled binary and all shared libraries from the build output.
+# With GGML_BACKEND_DL=ON, backends (ggml-cuda, ggml-cpu-*) are .so modules
+# loaded at runtime via dlopen. CMake places everything in build/bin/.
+COPY --from=builder /llama.cpp/build/bin/ /app/
+
+# Cherry-pick only the CUDA runtime libs that libggml-cuda.so actually needs.
+# libcuda.so.1 is provided by the NVIDIA container runtime at launch.
+COPY --from=builder /usr/local/cuda/lib64/libcudart.so.12 /app/
+COPY --from=builder /usr/local/cuda/lib64/libcublas.so.12 /app/
+COPY --from=builder /usr/local/cuda/lib64/libcublasLt.so.12 /app/
+ENV LD_LIBRARY_PATH="/app"
+
+# Copy profiles and shared entrypoint
+COPY profiles/ /opt/foundry/profiles/
+COPY entrypoint.sh /opt/foundry/entrypoint.sh
+RUN chmod +x /opt/foundry/entrypoint.sh
+
+# Model storage
+RUN mkdir -p /models
+VOLUME /models
+
+EXPOSE 8080
+
+ENTRYPOINT ["/opt/foundry/entrypoint.sh"]
diff --git a/models/qwen3-coder-30b-a3b/entrypoint.sh b/models/qwen3-coder-30b-a3b/entrypoint.sh
new file mode 100755
index 0000000..9dff64d
--- /dev/null
+++ b/models/qwen3-coder-30b-a3b/entrypoint.sh
@@ -0,0 +1,316 @@
+#!/bin/bash
+# ==============================================================================
+# Foundry Entrypoint (shared across all models)
+# ==============================================================================
+# 1. Detect GPU and load hardware profile
+# 2. Download model if not present
+# 3. Apply architecture-aware tuning (MoE vs Dense)
+# 4. Launch llama-server with tuned parameters
+#
+# Model identity is set via Dockerfile ENV vars:
+#   FOUNDRY_MODEL_NAME   -- display name (e.g. "Qwen3.5-35B-A3B")
+#   FOUNDRY_GGUF_REPO    -- HuggingFace repo (e.g. "unsloth/Qwen3.5-35B-A3B-GGUF")
+#   FOUNDRY_GGUF_FILE    -- GGUF filename
+#   FOUNDRY_ARCH         -- architecture type: "moe" or "dense"
+#
+# Architecture-specific flags are applied automatically based on FOUNDRY_ARCH:
+#
+#   MoE (e.g. Qwen3.5-35B-A3B):
+#     --fit on         Expert offloading: spill inactive experts to CPU
+#
+#   Dense (e.g. Hermes-4.3-36B):
+#     (no --fit)       No experts to offload
+#
+# Model-specific quirks (e.g. --swa-full for hybrid attention, --cache-ram 0
+# for recurrent architectures) belong in PROFILE_EXTRA_ARGS, NOT in the arch
+# tier -- they are not universal to the architecture class.
+#
+# Hardware-specific tuning (context, threads, KV quant, slots) is set by
+# per-GPU profiles in /opt/foundry/profiles/*.sh.
+# ==============================================================================
+
+set -euo pipefail
+
+FOUNDRY_DIR="/opt/foundry"
+PROFILES_DIR="${FOUNDRY_DIR}/profiles"
+MODELS_DIR="/models"
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+CYAN='\033[0;36m'
+NC='\033[0m' # No Color
+
+log()  { echo -e "${CYAN}[foundry]${NC} $*"; }
+warn() { echo -e "${YELLOW}[foundry]${NC} $*" >&2; }
+err()  { echo -e "${RED}[foundry]${NC} $*" >&2; }
+
+# ==============================================================================
+# GPU Detection
+# ==============================================================================
+
+detect_gpu() {
+    local gpu_name
+    if ! command -v nvidia-smi &> /dev/null; then
+        warn "nvidia-smi not found, using default profile"
+        echo "default"
+        return
+    fi
+
+    gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader,nounits 2>/dev/null | head -1 | xargs)
+
+    if [ -z "$gpu_name" ]; then
+        warn "Could not detect GPU, using default profile"
+        echo "default"
+        return
+    fi
+
+    # Log to stderr so it doesn't interfere with the captured profile name
+    log "Detected GPU: ${gpu_name}" >&2
+
+    # Map GPU name to profile
+    case "$gpu_name" in
+        *"5090"*)       echo "rtx5090" ;;
+        *)
+            warn "Unknown or unsupported GPU '${gpu_name}', using default profile"
+            echo "default"
+            ;;
+    esac
+}
+
+# ==============================================================================
+# Profile Loading
+# ==============================================================================
+
+load_profile() {
+    local profile_name="$1"
+    local profile_file="${PROFILES_DIR}/${profile_name}.sh"
+
+    if [ ! -f "$profile_file" ]; then
+        warn "Profile '${profile_name}' not found, falling back to default"
+        profile_file="${PROFILES_DIR}/default.sh"
+    fi
+
+    if [ ! -f "$profile_file" ]; then
+        err "No default profile found at ${profile_file}"
+        exit 1
+    fi
+
+    log "Loading profile: ${profile_file}"
+    # shellcheck source=profiles/default.sh
+    source "$profile_file"
+}
+
+# ==============================================================================
+# Model Download
+# ==============================================================================
+
+download_model() {
+    local gguf_path="${MODELS_DIR}/${FOUNDRY_GGUF_FILE}"
+
+    if [ -f "$gguf_path" ]; then
+        local size
+        size=$(du -h "$gguf_path" | cut -f1)
+        log "Model found: ${gguf_path} (${size})"
+        return 0
+    fi
+
+    log "Model not found at ${gguf_path}"
+    log "Downloading ${FOUNDRY_GGUF_FILE} from ${FOUNDRY_GGUF_REPO}..."
+    log "This is a one-time download (~20GB). Subsequent starts will be instant."
+    echo ""
+
+    # Use python3 huggingface_hub to download (huggingface-cli may not be on PATH)
+    # Variables are passed via environment to avoid shell injection in inline Python
+    if python3 -c "import huggingface_hub" 2>/dev/null; then
+        FOUNDRY_GGUF_REPO="${FOUNDRY_GGUF_REPO}" \
+        FOUNDRY_GGUF_FILE="${FOUNDRY_GGUF_FILE}" \
+        FOUNDRY_MODELS_DIR="${MODELS_DIR}" \
+        python3 -c "
+import os
+from huggingface_hub import hf_hub_download
+token = os.environ.get('HF_TOKEN') or os.environ.get('HUGGING_FACE_HUB_TOKEN')
+hf_hub_download(
+    repo_id=os.environ['FOUNDRY_GGUF_REPO'],
+    filename=os.environ['FOUNDRY_GGUF_FILE'],
+    local_dir=os.environ['FOUNDRY_MODELS_DIR'],
+    token=token
+)
+"
+    else
+        err "huggingface-hub not found. Please mount the GGUF at ${gguf_path}"
+        err "Or install huggingface-hub: pip install huggingface-hub"
+        exit 1
+    fi
+
+    if [ ! -f "$gguf_path" ]; then
+        err "Download failed: ${gguf_path} not found after download"
+        exit 1
+    fi
+
+    local size
+    size=$(du -h "$gguf_path" | cut -f1)
+    log "Download complete: ${gguf_path} (${size})"
+}
+
+# ==============================================================================
+# Build Launch Command
+# ==============================================================================
+# Flags are layered in three tiers:
+#   1. Architecture defaults (FOUNDRY_ARCH)  -- systematic, model-class level
+#   2. Hardware profile (PROFILE_*)          -- per-GPU tuning knobs
+#   3. User overrides (FOUNDRY_EXTRA_ARGS)   -- escape hatch, highest priority
+
+build_command() {
+    local gguf_path="${MODELS_DIR}/${FOUNDRY_GGUF_FILE}"
+    local arch="${FOUNDRY_ARCH:-dense}"
+
+    # Use a bash array to safely handle arguments with spaces
+    local -a cmd=("/app/llama-server")
+    cmd+=("--model" "${gguf_path}")
+    cmd+=("--host" "0.0.0.0")
+    cmd+=("--port" "${FOUNDRY_PORT:-8080}")
+
+    # --- Tier 1: Architecture-specific flags ----------------------------------
+    # These are determined by the model class, not by the GPU or user preference.
+
+    if [ "$arch" = "moe" ]; then
+        # MoE: enable expert offloading (spill inactive experts to CPU when VRAM
+        # is tight). On high-VRAM GPUs --fit keeps everything on GPU automatically.
+        cmd+=("--fit" "on")
+    fi
+    # Dense models: no --fit (no experts to offload).
+    # Model-specific flags (--swa-full, --cache-ram) go in PROFILE_EXTRA_ARGS.
+
+    # --- Tier 2: Hardware profile tuning --------------------------------------
+    # These come from the sourced profile .sh file and tune for the specific GPU.
+
+    # Context length (env override > profile > default)
+    local ctx="${FOUNDRY_CTX_LENGTH:-${PROFILE_CTX_LENGTH:-32768}}"
+    cmd+=("--ctx-size" "${ctx}")
+
+    # Thread count (env override > profile > auto)
+    local threads="${FOUNDRY_THREADS:-${PROFILE_THREADS:-}}"
+    if [ -n "$threads" ]; then
+        cmd+=("--threads" "${threads}")
+    fi
+
+    # Batch thread count (can be higher than decode threads for prompt processing)
+    local threads_batch="${PROFILE_THREADS_BATCH:-${threads}}"
+    if [ -n "$threads_batch" ]; then
+        cmd+=("--threads-batch" "${threads_batch}")
+    fi
+
+    # Flash attention (new llama.cpp requires explicit on/off/auto value)
+    local fa="${PROFILE_FLASH_ATTN:-on}"
+    cmd+=("--flash-attn" "${fa}")
+
+    # KV cache quantization
+    local ctk="${PROFILE_KV_TYPE_K:-q8_0}"
+    local ctv="${PROFILE_KV_TYPE_V:-q8_0}"
+    cmd+=("-ctk" "${ctk}" "-ctv" "${ctv}")
+
+    # Memory mapping
+    if [ "${PROFILE_NO_MMAP:-true}" = "true" ]; then
+        cmd+=("--no-mmap")
+    fi
+
+    # Jinja templates (for tool calling / chat templates)
+    if [ "${PROFILE_JINJA:-true}" = "true" ]; then
+        cmd+=("--jinja")
+    fi
+
+    # Parallel slots for concurrent requests
+    local slots="${PROFILE_PARALLEL:-2}"
+    cmd+=("--parallel" "${slots}")
+
+    # Thread priority for reduced scheduling latency
+    local prio="${PROFILE_PRIO:-0}"
+    if [ "$prio" != "0" ]; then
+        cmd+=("--prio" "${prio}")
+    fi
+
+    # Strict CPU placement for cache locality
+    if [ "${PROFILE_CPU_STRICT:-0}" = "1" ]; then
+        cmd+=("--cpu-strict" "1")
+    fi
+
+    # KV cache reuse for multi-turn chat (prefix sharing via KV shifting)
+    local cache_reuse="${PROFILE_CACHE_REUSE:-0}"
+    if [ "$cache_reuse" != "0" ]; then
+        cmd+=("--cache-reuse" "${cache_reuse}")
+    fi
+
+    # Disable web UI for headless server deployments
+    if [ "${PROFILE_NO_WEBUI:-false}" = "true" ]; then
+        cmd+=("--no-webui")
+    fi
+
+    # Prometheus-compatible metrics endpoint
+    if [ "${PROFILE_METRICS:-false}" = "true" ]; then
+        cmd+=("--metrics")
+    fi
+
+    # Profile-specific extra args (split on spaces intentionally)
+    if [ -n "${PROFILE_EXTRA_ARGS:-}" ]; then
+        # shellcheck disable=SC2206
+        cmd+=(${PROFILE_EXTRA_ARGS})
+    fi
+
+    # --- Tier 3: User overrides -----------------------------------------------
+    if [ -n "${FOUNDRY_EXTRA_ARGS:-}" ]; then
+        # shellcheck disable=SC2206
+        cmd+=(${FOUNDRY_EXTRA_ARGS})
+    fi
+
+    # Store the array globally so main() can exec it safely
+    FOUNDRY_CMD=("${cmd[@]}")
+}
+
+# ==============================================================================
+# Main
+# ==============================================================================
+
+main() {
+    echo ""
+    echo -e "${GREEN}╔════════════════════════════════════════════╗${NC}"
+    echo -e "${GREEN}║            Foundry Inference               ║${NC}"
+    echo -e "${GREEN}║   github.com/infernet-org/foundry          ║${NC}"
+    echo -e "${GREEN}╚════════════════════════════════════════════╝${NC}"
+    echo ""
+
+    log "Model: ${FOUNDRY_MODEL_NAME}"
+    log "Architecture: ${FOUNDRY_ARCH:-dense}"
+
+    # 1. Determine profile
+    local profile
+    if [ "${FOUNDRY_PROFILE:-auto}" = "auto" ]; then
+        profile=$(detect_gpu)
+    else
+        profile="${FOUNDRY_PROFILE}"
+    fi
+
+    # 2. Load profile
+    load_profile "$profile"
+
+    # 3. Download model if needed
+    download_model
+
+    # 4. Build launch command (sets FOUNDRY_CMD array directly, no subshell)
+    build_command
+
+    echo ""
+    log "Launch command:"
+    echo -e "${CYAN}  ${FOUNDRY_CMD[*]}${NC}"
+    echo ""
+    log "OpenAI-compatible API will be available at:"
+    echo -e "${GREEN}  http://localhost:${FOUNDRY_PORT:-8080}/v1/chat/completions${NC}"
+    echo ""
+
+    # 5. Launch (exec replaces shell process for proper signal handling)
+    # Use the array form to avoid word-splitting issues
+    exec "${FOUNDRY_CMD[@]}"
+}
+
+main "$@"
diff --git a/models/qwen3-coder-30b-a3b/profiles/default.sh b/models/qwen3-coder-30b-a3b/profiles/default.sh
new file mode 100644
index 0000000..6b1880a
--- /dev/null
+++ b/models/qwen3-coder-30b-a3b/profiles/default.sh
@@ -0,0 +1,25 @@
+# ==============================================================================
+# Foundry Profile: Default (16GB+ VRAM)
+# ==============================================================================
+# Qwen3-Coder-30B-A3B-Instruct UD-Q4_K_XL (~17.7GB)
+#
+# Conservative profile for GPUs with 16-24GB VRAM.
+# At ~17.7GB of weights, this is the lightest model in the lineup and
+# leaves the most headroom on 16GB cards when MoE experts are offloaded.
+# ==============================================================================
+
+PROFILE_CTX_LENGTH=32768        # 32K context -- safe for 16GB+ cards
+PROFILE_THREADS=8               # Conservative thread count
+PROFILE_THREADS_BATCH=8
+PROFILE_FLASH_ATTN="on"
+PROFILE_KV_TYPE_K="q4_0"        # Aggressive KV quantization to save VRAM
+PROFILE_KV_TYPE_V="q4_0"
+PROFILE_NO_MMAP="true"
+PROFILE_JINJA="true"            # Tool calling support
+PROFILE_PARALLEL=2              # 2 slots for smaller GPUs
+PROFILE_PRIO=2
+PROFILE_CPU_STRICT=0
+PROFILE_CACHE_REUSE=256
+PROFILE_NO_WEBUI="true"
+PROFILE_METRICS="true"
+PROFILE_EXTRA_ARGS="--mlock"
diff --git a/models/qwen3-coder-30b-a3b/profiles/rtx5090.sh b/models/qwen3-coder-30b-a3b/profiles/rtx5090.sh
new file mode 100644
index 0000000..4b75c28
--- /dev/null
+++ b/models/qwen3-coder-30b-a3b/profiles/rtx5090.sh
@@ -0,0 +1,54 @@
+# ==============================================================================
+# Foundry Profile: RTX 5090 (32GB)
+# ==============================================================================
+# Qwen3-Coder-30B-A3B-Instruct UD-Q4_K_XL (~17.7GB)
+#
+# Architecture: Qwen3 MoE (standard transformer, NOT hybrid DeltaNet)
+#   - Standard MoE with full KV cache (no recurrent layers)
+#   - 128 experts per MoE layer, top-8 active per token (~3B active)
+#   - Optimized for code generation and tool calling
+#
+# Why --parallel 3 (not 4):
+#   With 3 slots, --fit on allocates 64K context per slot (vs 48K with 4 slots).
+#   This is 33% more context per agent with near-identical aggregate throughput:
+#     3 slots: 275 tok/s single | 497 tok/s agg | 168 tok/s each | 64K/slot
+#     4 slots: 274 tok/s single | 495 tok/s agg | 124 tok/s each | 48K/slot
+#   A request queues only when all 3 slots are already busy (4+ in flight),
+#   and per-agent speed under load is 35% faster (168 vs 124 tok/s).
+#
+# VRAM budget (32,607 MiB total):
+#   Model weights:    ~17.7 GB
+#   KV cache (192K):  ~9.8 GB (3 slots x 64K, q8_0)
+#   Compute buffers:  ~2.4 GB
+#   Free headroom:    ~2.7 GB
+#
+# Key differences from Qwen3.5-35B-A3B profile:
+#   - No --swa-full (not a hybrid model, no sliding window attention)
+#   - No --cache-ram 0 (standard KV cache, prompt caching works normally)
+#   - --parallel 3 (vs 4 for Qwen3.5, which has smaller KV due to recurrent layers)
+#   - cache-reuse enabled (effective for coding workflows with repeated context)
+#
+# Benchmarked on RTX 5090 (2026-03-02, native sm_120a, BLACKWELL_NATIVE_FP4=1):
+#   Single-stream decode:  ~275 tok/s  (memory-bandwidth-bound)
+#   3-concurrent aggregate: ~497 tok/s (+81% via MoE expert batching)
+#   3-concurrent per-slot:  ~168 tok/s each
+#   Prompt processing:    ~345-1,038 tok/s (varies with batch position)
+# ==============================================================================
+
+PROFILE_CTX_LENGTH=196608       # 192K total -- --fit on allocates 64K per slot with 3 slots
+PROFILE_THREADS=16              # Physical cores (avoid hyperthreads for decode)
+PROFILE_THREADS_BATCH=20        # Higher thread count for prompt processing
+PROFILE_FLASH_ATTN="on"         # Flash attention for long context perf
+PROFILE_KV_TYPE_K="q8_0"        # KV cache key quantization
+PROFILE_KV_TYPE_V="q8_0"        # KV cache value quantization
+PROFILE_NO_MMAP="true"          # Avoid page faults, load model into RAM
+PROFILE_JINJA="true"            # Chat template / tool calling support
+PROFILE_PARALLEL=3              # 3 slots: 64K/slot, 497 tok/s agg, 168 tok/s each
+                                # (see "Why --parallel 3" above)
+PROFILE_PRIO=2                  # High thread priority for reduced scheduling latency
+PROFILE_CPU_STRICT=1            # Strict CPU placement for cache locality
+PROFILE_CACHE_REUSE=256         # KV cache reuse for multi-turn coding sessions
+PROFILE_NO_WEBUI="true"         # Headless: no web UI, reduce attack surface
+PROFILE_METRICS="true"          # Prometheus-compatible /metrics endpoint
+# --mlock: pin model in RAM; -b/-ub 4096: large batch for fast prompt encode
+PROFILE_EXTRA_ARGS="--mlock -b 4096 -ub 4096"