infernet-org · aWN4Y25pa2EK · Mar 2, 2026 · Mar 2, 2026
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
@@ -20,7 +20,7 @@ jobs:
 
     strategy:
       matrix:
-        model: [qwen3.5-35b-a3b, hermes-4.3-36b]
+        model: [qwen3.5-35b-a3b, qwen3-coder-30b-a3b, hermes-4.3-36b]
 
     steps:
       - name: Checkout

diff --git a/AGENTS.md b/AGENTS.md
@@ -51,6 +51,10 @@ No API key is required by default. If your client demands one, any non-empty str
         "qwen": {
           "id": "qwen3.5-35b-a3b",
           "name": "Qwen 3.5 35B A3B"
+        },
+        "qwen-coder": {
+          "id": "qwen3-coder-30b-a3b",
+          "name": "Qwen 3 Coder 30B A3B"
         }
       }
     }
@@ -68,7 +72,7 @@ API Key:  sk-local
 Model:    qwen3.5-35b-a3b
 ```
 
-Cursor uses streaming by default. Foundry supports SSE streaming natively. With 4 parallel slots, you can run Cursor's background indexing and active chat simultaneously without blocking.
+Cursor uses streaming by default. Foundry supports SSE streaming natively. With multiple parallel slots, you can run Cursor's background indexing and active chat simultaneously without blocking.
 
 ### Continue (VS Code / JetBrains)
 
@@ -115,7 +119,7 @@ Model ID: qwen3.5-35b-a3b
 
 ## Multi-Agent Frameworks
 
-Foundry's 4 parallel inference slots make it particularly suited for multi-agent workflows where multiple agents share a single model. Each slot processes requests independently with minimal throughput degradation.
+Foundry's parallel inference slots make it particularly well suited to multi-agent workflows where multiple agents share a single model. Each slot processes requests independently: aggregate throughput rises with concurrency while per-agent speed drops (see "Concurrent agent scaling").
 
 ### CrewAI
 
@@ -162,7 +166,7 @@ crew = Crew(
 result = crew.kickoff(inputs={"topic": "GPU inference optimization"})
 ```
 
-With 4 parallel slots, CrewAI can run 4 agents simultaneously at ~80 tok/s each (Qwen MoE) or ~16 tok/s each (Hermes Dense).
+With Qwen3-Coder's 3 parallel slots, CrewAI can run 3 agents simultaneously at ~168 tok/s each. With Hermes (dense, 4 slots), it can run 4 agents at ~33 tok/s each.
 
 ### AutoGen
 
@@ -237,7 +241,7 @@ docker run -d -p 3000:8080 \
   ghcr.io/open-webui/open-webui:main
 ```
 
-Open WebUI supports multi-user chat with conversation history. Each user session uses one of Foundry's 4 inference slots.
+Open WebUI supports multi-user chat with conversation history. Each user session uses one of Foundry's inference slots.
 
 ### text-generation-webui (oobabooga)
 
@@ -315,13 +319,14 @@ console.log(response.choices[0].message.content);
 
 | Use case | Recommended model | Why |
 |----------|-------------------|-----|
-| **Coding agents** (OpenCode, Cursor, Aider) | Qwen3.5-35B-A3B | Fast decode (181 tok/s), 192K context for large codebases, good at code |
-| **Multi-agent orchestration** (CrewAI, AutoGen) | Qwen3.5-35B-A3B | 4-concurrent at 320 tok/s aggregate, MoE batching advantage |
+| **Coding agents** (OpenCode, Cursor, Aider) | Qwen3-Coder-30B-A3B | Fastest decode (275 tok/s), purpose-built for code, tool calling support |
+| **Multi-agent orchestration** (CrewAI, AutoGen) | Qwen3-Coder-30B-A3B | 3-concurrent at 497 tok/s aggregate, best MoE batching efficiency |
+| **General coding + long context** | Qwen3.5-35B-A3B | 192K effective context for large codebases, hybrid recurrent architecture |
 | **Reasoning-heavy tasks** | Hermes-4.3-36B | Thinking mode with `<think>` tags, stronger reasoning on hard problems |
-| **Tool use / function calling** | Hermes-4.3-36B | Trained specifically for structured tool calling with `<tool_call>` XML |
+| **Tool use / function calling** | Qwen3-Coder-30B-A3B or Hermes-4.3-36B | Both have strong tool calling; Coder is ~4x faster (~275 vs ~64 tok/s single-stream), Hermes more reliable on complex schemas |
 | **Roleplay / creative writing** | Hermes-4.3-36B | NousResearch fine-tune optimized for personality and narrative |
 | **Long document Q&A** | Qwen3.5-35B-A3B | 192K context window, recurrent layers handle long sequences efficiently |
-| **16 GB VRAM GPUs** | Qwen3.5-35B-A3B | MoE expert offloading works on 16 GB; Hermes needs 24 GB minimum |
+| **16 GB VRAM GPUs** | Qwen3-Coder-30B-A3B | Smallest disk footprint (17.7 GB), MoE expert offloading works on 16 GB |
 
 ## Performance Considerations
 
@@ -331,31 +336,41 @@ Single-stream decode latency (time to generate one token):
 
 | Model | Latency per token | Tokens per second |
 |-------|-------------------|-------------------|
+| Qwen3-Coder-30B-A3B | ~3.6 ms | ~275 tok/s |
 | Qwen3.5-35B-A3B | ~5.5 ms | ~181 tok/s |
 | Hermes-4.3-36B | ~15.5 ms | ~64 tok/s |
 
-For interactive coding agents, Qwen delivers a visibly faster typing experience. For batch/background tasks where latency is less critical, Hermes' stronger reasoning may be worth the tradeoff.
+For interactive coding agents, Qwen3-Coder delivers the fastest typing experience. Qwen3.5 trades some speed for 192K effective context. For batch/background tasks where latency is less critical, Hermes' stronger reasoning may be worth the tradeoff.
 
 ### Prompt processing
 
 Prompt processing (prefill) runs at ~1,163 tok/s for Qwen on RTX 5090. A 10K token prompt takes ~8.6 seconds to process. Keep system prompts concise to minimize time-to-first-token.
 
 ### Concurrent agent scaling
 
+Qwen3-Coder-30B-A3B (fastest, 3 slots):
+```
+1 agent:  275 tok/s  (100% per-agent speed)
+2 agents: 405 tok/s  (~204 tok/s each, 74% per-agent)
+3 agents: 497 tok/s  (~168 tok/s each, 61% per-agent)
+```
+
+Qwen3.5-35B-A3B (4 slots):
 ```
 1 agent:  181 tok/s  (100% per-agent speed)
 2 agents: 234 tok/s  (~117 tok/s each, 65% per-agent)
 4 agents: 320 tok/s  (~80 tok/s each, 44% per-agent)
 ```
 
-If your workflow has >4 concurrent agents, requests queue until a slot is free. Consider multi-GPU routing (below) for higher concurrency.
+If your workflow has more concurrent agents than slots, requests queue until a slot is free. Consider multi-GPU routing (below) for higher concurrency.
 
 ### Context window usage
 
 VRAM scales with context usage. The default RTX 5090 profiles are tuned for maximum context:
 
 | Model | Default context | VRAM at idle | VRAM at full context |
 |-------|----------------|--------------|---------------------|
+| Qwen3-Coder-30B-A3B | 192K | 25.0 GB | ~28.9 GB |
 | Qwen3.5-35B-A3B | 192K | 25.3 GB | ~26.1 GB |
 | Hermes-4.3-36B | 32K | 24.5 GB | ~27.8 GB |
 
@@ -370,7 +385,7 @@ docker run --gpus all -p 8080:8080 \
 
 ## Structured Output
 
-Both models support JSON mode for structured outputs:
+All three models support JSON mode for structured outputs:
 
 ```python
 response = client.chat.completions.create(
@@ -440,7 +455,7 @@ if tool_calls:
     print(f"Args: {tool_calls[0].function.arguments}")
 ```
 
-Both models support Jinja chat templates for tool calling. The entrypoint enables `--jinja` by default.
+All models support Jinja chat templates for tool calling. The entrypoint enables `--jinja` by default.
 
 ## Thinking / Reasoning Mode
 
@@ -483,7 +498,7 @@ The Docker Compose configuration includes TCP tuning (BBR congestion control, bu
 
 ## Multi-GPU Agent Routing
 
-For workloads requiring more than 4 concurrent agents, run multiple Foundry instances and load-balance across them.
+For workloads requiring more concurrent agents than slots, run multiple Foundry instances and load-balance across them.
 
 ### Simple round-robin with nginx
 
@@ -511,14 +526,15 @@ For deterministic routing (each agent always hits the same GPU):
 ```python
 import os
 
-# Route based on agent ID
+# Route based on agent ID (adjust slots_per_gpu to match your model's --parallel setting)
+slots_per_gpu = 3  # Qwen3-Coder default; use 4 for Qwen3.5/Hermes
 gpu_endpoints = [
-    "http://localhost:8080/v1",  # GPU 0: agents 0-3
-    "http://localhost:8081/v1",  # GPU 1: agents 4-7
+    "http://localhost:8080/v1",  # GPU 0
+    "http://localhost:8081/v1",  # GPU 1
 ]
 
 def get_client(agent_id: int) -> OpenAI:
-    endpoint = gpu_endpoints[agent_id // 4]
+    endpoint = gpu_endpoints[(agent_id // slots_per_gpu) % len(gpu_endpoints)]  # wrap instead of IndexError
     return OpenAI(base_url=endpoint, api_key="sk-local")
 ```
 
@@ -536,7 +552,7 @@ If you see `no devices with dedicated memory found` in the logs, the CUDA backen
 
 1. Check if all layers are on GPU: look for `offloaded N/N layers to GPU` in container logs
 2. Check VRAM: `nvidia-smi` -- if VRAM is full, reduce context with `FOUNDRY_CTX_LENGTH`
-3. Check if all 4 slots are occupied: `curl http://localhost:8080/metrics | grep slots`
+3. Check if all slots are occupied: `curl http://localhost:8080/metrics | grep slots`
 
 ### Connection refused
 
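Review note on the AGENTS.md routing snippet: with per-model slot counts (3 for Qwen3-Coder, 4 for Qwen3.5/Hermes), it may help to make the agent-to-GPU mapping explicit about what happens once agent IDs exceed total slot capacity. The sketch below is one possible shape, not part of this change. `endpoint_for` and its `wrap` flag are hypothetical names introduced for illustration, and the endpoint URLs are the ones used in the document's example.

```python
def endpoint_for(agent_id: int, endpoints: list[str],
                 slots_per_gpu: int, wrap: bool = True) -> str:
    """Map an agent ID to a GPU endpoint, assigning slots_per_gpu agents per GPU.

    Hypothetical helper: with wrap=True, agents beyond total capacity are
    assigned round-robin to existing GPUs (their requests queue for a free
    slot); with wrap=False, a ValueError is raised instead.
    """
    if agent_id < 0 or slots_per_gpu < 1 or not endpoints:
        raise ValueError("need agent_id >= 0, slots_per_gpu >= 1, and at least one endpoint")
    gpu_index = agent_id // slots_per_gpu
    if gpu_index >= len(endpoints):
        if not wrap:
            capacity = len(endpoints) * slots_per_gpu
            raise ValueError(f"agent {agent_id} exceeds capacity of {capacity} slots")
        gpu_index %= len(endpoints)
    return endpoints[gpu_index]


GPU_ENDPOINTS = [
    "http://localhost:8080/v1",  # GPU 0
    "http://localhost:8081/v1",  # GPU 1
]

if __name__ == "__main__":
    # Qwen3-Coder default of 3 slots per GPU: agents 0-2 -> GPU 0, 3-5 -> GPU 1,
    # agents 6 and 7 wrap back to GPU 0.
    for agent_id in range(8):
        print(agent_id, endpoint_for(agent_id, GPU_ENDPOINTS, slots_per_gpu=3))
```

The returned URL can be passed as `base_url` when constructing the client, as in the snippet under review. Whether to wrap or fail fast is a design choice: wrapping keeps extra agents running at the cost of queueing, while failing fast surfaces a misconfigured `slots_per_gpu` early.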

diff --git a/Makefile b/Makefile
@@ -12,8 +12,8 @@ MODELS_DIR ?= $(HOME)/.cache/foundry
 .PHONY: help build run run-profile test benchmark monitoring down push push-all clean clean-models download
 
 help: ## Show this help
-	@echo "Available models: qwen3.5-35b-a3b (default), hermes-4.3-36b"
-	@echo "Usage: make run MODEL=hermes-4.3-36b"
+	@echo "Available models: qwen3.5-35b-a3b (default), qwen3-coder-30b-a3b, hermes-4.3-36b"
+	@echo "Usage: make run MODEL=qwen3-coder-30b-a3b"
 	@echo ""
 	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | \
 		awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'

diff --git a/README.md b/README.md
@@ -89,6 +89,40 @@ Dense transformer. 36B total parameters, all active per token. ByteDance Seed-OS
 
 Dense models activate all parameters per token, making them compute-bound rather than memory-bandwidth-bound. Expect ~3x slower decode than equivalently-sized MoE models on the same hardware.
 
+### Qwen3-Coder-30B-A3B (MoE)
+
+Standard Mixture-of-Experts optimized for code generation. 30B total parameters, only 3B active per token.
+
+- 48 transformer layers, standard attention (GQA 32:4)
+- 128 experts per MoE layer, top-8 active per token
+- Quantization: UD-Q4_K_XL via [Unsloth](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) (Dynamic 2.0)
+- Disk: ~17.7 GB | Min VRAM: 16 GB (with expert offloading) | Max context: 262K native
+- Built-in tool calling support via `--jinja` chat template
+
+| GPU | VRAM | Context | Decode | 3-concurrent | VRAM used |
+|-----|------|---------|--------|--------------|-----------|
+| RTX 5090 | 32 GB | 64K/slot | ~275 tok/s | ~497 tok/s | 28.9 GB |
+| Other NVIDIA (16 GB+) | 16+ GB | 16K/slot | varies | varies | varies |
+
+<details>
+<summary>RTX 5090 detailed benchmark</summary>
+
+```
+SINGLE-STREAM DECODE:    ~275 tok/s
+3-CONCURRENT AGGREGATE:  ~497 tok/s  (+81% via MoE expert batching)
+3-CONCURRENT PER-SLOT:   ~168 tok/s  each
+PROMPT PROCESSING:       ~345-1,038 tok/s  (varies with batch position)
+VRAM USAGE:              28.9 GB / 32.6 GB (3.7 GB headroom)
+CONTEXT:                 64K per slot (3 slots, auto-fitted from 192K request)
+```
+
+Benchmarked 2026-03-02 with native sm_120a (Blackwell) compilation and `BLACKWELL_NATIVE_FP4=1` enabled.
+
+**Why 3 slots (not 4)?** With 3 slots, `--fit on` allocates 64K context per slot instead of 48K. Aggregate throughput is effectively identical (497 vs 495 tok/s), but per-agent speed under load is 35% faster (168 vs 124 tok/s). The 4th slot rarely matters for a single-GPU workstation. Override with `FOUNDRY_EXTRA_ARGS="--parallel 4"` if needed.
+
+**vs Qwen3.5-35B-A3B**: 52% faster single-stream, 55% faster aggregate. The standard MoE architecture (no DeltaNet recurrent layers) batches more efficiently on Blackwell. It trades the 192K effective context of Qwen3.5 for raw speed.
+</details>
+
 ## How It Works
 
 Why llama.cpp and not SGLang or vLLM? For **consumer GPUs**, llama.cpp's MoE expert offloading (`--fit on`) is the only engine that can run a 35B-parameter MoE model on a single 16-24 GB card at full speed. SGLang and vLLM require the entire model to fit in VRAM.
@@ -143,34 +177,37 @@ All settings can be overridden via environment variables:
 
 ## Multi-Agent Inference
 
-The RTX 5090 profile is configured with `--parallel 4`, enabling 4 concurrent inference slots. This makes Foundry well-suited for multi-agent workflows where several AI agents share a single GPU.
+The RTX 5090 profiles are configured with multiple concurrent inference slots: `--parallel 4` for Qwen3.5 and Hermes, `--parallel 3` for Qwen3-Coder. This makes Foundry well-suited for multi-agent workflows where several AI agents share a single GPU.
 
 ### Why MoE batching works
 
 Qwen3.5-35B-A3B uses a 256-expert MoE architecture with only 8 experts active per token. During single-stream decode, the GPU's tensor cores are largely idle -- the bottleneck is memory bandwidth, not compute. When multiple agents send concurrent requests, llama.cpp batches token generation across all active slots. Different tokens may route to different experts, and CUDA graphs capture the entire batched MoE operation, significantly improving GPU utilization.
 
 ### Throughput scaling
 
-| Active agents | Aggregate throughput | Per-agent speed | VRAM |
-|---------------|---------------------|-----------------|------|
-| 1 | 181 tok/s | 181 tok/s | 25.3 GB |
-| 2 | 234 tok/s | ~117 tok/s each | 25.7 GB |
-| 4 | 320 tok/s | ~80 tok/s each | 26.1 GB |
+Measured on RTX 5090 with Qwen models (MoE):
+
+| Active agents | Qwen3.5-35B-A3B (4 slots) | Qwen3-Coder-30B-A3B (3 slots) |
+|---------------|----------------------------|--------------------------------|
+| 1 | 181 tok/s | 275 tok/s |
+| 2 | 234 tok/s (117 each) | 405 tok/s (204 each) |
+| 3 | — | 497 tok/s (168 each) |
+| 4 | 320 tok/s (80 each) | — |
 
-Single-agent speed is unaffected. The 4 slots only activate when there are concurrent requests.
+Single-agent speed is unaffected. Concurrent slots only activate when there are simultaneous requests.
 
 ### Multi-GPU scaling
 
-With 2x RTX 5090, run two independent instances for 8 total concurrent slots and ~640 tok/s combined aggregate throughput:
+With 2x RTX 5090, run two independent instances to double the concurrent slots and roughly double the aggregate throughput:
 
 ```bash
-# GPU 0: agents 1-4
+# GPU 0
 docker run --gpus '"device=0"' -p 8080:8080 -v ~/.cache/foundry:/models \
-  ghcr.io/infernet-org/foundry/qwen3.5-35b-a3b:latest
+  ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latest
 
-# GPU 1: agents 5-8
+# GPU 1
 docker run --gpus '"device=1"' -p 8081:8080 -v ~/.cache/foundry:/models \
-  ghcr.io/infernet-org/foundry/qwen3.5-35b-a3b:latest
+  ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latest
 ```
 
 ### Compatible frameworks
@@ -187,6 +224,7 @@ docker compose up
 
 # Choose a different model
 FOUNDRY_MODEL=hermes-4.3-36b docker compose up
+FOUNDRY_MODEL=qwen3-coder-30b-a3b docker compose up
 
 # With explicit profile
 FOUNDRY_PROFILE=rtx5090 docker compose up
@@ -208,6 +246,7 @@ GF_ADMIN_PASSWORD=admin
 ```bash
 make build                        # Build the default model image (qwen3.5-35b-a3b)
 make build MODEL=hermes-4.3-36b   # Build a different model
+make build MODEL=qwen3-coder-30b-a3b  # Build the coding-optimized model
 make run                          # Run with auto-detected GPU
 make test                         # Smoke test: start, wait for health, send one request
 make benchmark                    # Run benchmark against a running server
@@ -332,12 +371,18 @@ foundry/
 │   │   └── profiles/
 │   │       ├── rtx5090.sh           # 192K ctx, 4 slots, ~320 tok/s aggregate
 │   │       └── default.sh           # 16K ctx, conservative settings
-│   └── hermes-4.3-36b/
+│   ├── hermes-4.3-36b/
+│   │   ├── Dockerfile               # Multi-stage: compiles llama.cpp for sm_89 + sm_120a
+│   │   ├── entrypoint.sh            # Copied from scripts/entrypoint.sh at build time
+│   │   └── profiles/
+│   │       ├── rtx5090.sh           # 32K ctx, 4 slots, ~132 tok/s aggregate
+│   │       └── default.sh           # 8K ctx, 24 GB minimum
+│   └── qwen3-coder-30b-a3b/
 │       ├── Dockerfile               # Multi-stage: compiles llama.cpp for sm_89 + sm_120a
 │       ├── entrypoint.sh            # Copied from scripts/entrypoint.sh at build time
 │       └── profiles/
-│           ├── rtx5090.sh           # 32K ctx, 4 slots, ~132 tok/s aggregate
-│           └── default.sh           # 8K ctx, 24 GB minimum
+│           ├── rtx5090.sh           # 192K ctx, 3 slots, ~497 tok/s aggregate
+│           └── default.sh           # 32K ctx, conservative settings
 ├── scripts/
 │   ├── entrypoint.sh                # Shared entrypoint (GPU detect, profile load, model download)
 │   ├── benchmark.py                 # Generation speed, prompt processing, throughput
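
Review note on the README benchmark claims: the percentages quoted for Qwen3-Coder ("+81% via MoE expert batching", "35% faster", "52% faster single-stream, 55% faster aggregate") can be recomputed from the tok/s figures in the same diff. The sketch below does only that arithmetic. `pct_gain` and `even_split` are hypothetical helper names, and the inputs are the numbers stated in the README changes, not new measurements.

```python
def pct_gain(new: float, old: float) -> int:
    """Relative improvement of new over old, as a whole-number percentage."""
    return round((new / old - 1) * 100)


def even_split(aggregate: float, agents: int) -> float:
    """Per-agent speed if aggregate throughput were shared evenly."""
    return aggregate / agents


# (claim in the README, new tok/s, old tok/s)
claims = [
    ("3-concurrent aggregate vs single-stream, Qwen3-Coder", 497, 275),
    ("per-agent speed, 3 slots vs 4 slots", 168, 124),
    ("single-stream, Qwen3-Coder vs Qwen3.5", 275, 181),
    ("aggregate, Qwen3-Coder vs Qwen3.5", 497, 320),
]

if __name__ == "__main__":
    for label, new, old in claims:
        print(f"{label}: +{pct_gain(new, old)}%")
    # An even split gives an estimate only; the README reports per-slot
    # figures (~204 and ~168 tok/s) that sit a few tok/s above this estimate.
    for agents, aggregate in [(2, 405), (3, 497)]:
        print(f"{agents} agents: ~{even_split(aggregate, agents):.1f} tok/s each (even split)")
```

The four recomputed gains come out to 81%, 35%, 52%, and 55%, matching the README text.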