Merged
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ jobs:

strategy:
matrix:
model: [qwen3.5-35b-a3b, hermes-4.3-36b]
model: [qwen3.5-35b-a3b, qwen3-coder-30b-a3b, hermes-4.3-36b]

steps:
- name: Checkout
52 changes: 34 additions & 18 deletions AGENTS.md
Expand Up @@ -51,6 +51,10 @@ No API key is required by default. If your client demands one, any non-empty str
"qwen": {
"id": "qwen3.5-35b-a3b",
"name": "Qwen 3.5 35B A3B"
},
"qwen-coder": {
"id": "qwen3-coder-30b-a3b",
"name": "Qwen 3 Coder 30B A3B"
}
}
}
Expand All @@ -68,7 +72,7 @@ API Key: sk-local
Model: qwen3.5-35b-a3b
```

Cursor uses streaming by default. Foundry supports SSE streaming natively. With 4 parallel slots, you can run Cursor's background indexing and active chat simultaneously without blocking.
Cursor uses streaming by default. Foundry supports SSE streaming natively. With multiple parallel slots, you can run Cursor's background indexing and active chat simultaneously without blocking.

### Continue (VS Code / JetBrains)

Expand Down Expand Up @@ -115,7 +119,7 @@ Model ID: qwen3.5-35b-a3b

## Multi-Agent Frameworks

Foundry's 4 parallel inference slots make it particularly suited for multi-agent workflows where multiple agents share a single model. Each slot processes requests independently with minimal throughput degradation.
Foundry's parallel inference slots make it particularly suited for multi-agent workflows where multiple agents share a single model. Each slot processes requests independently with minimal throughput degradation.

### CrewAI

Expand Down Expand Up @@ -162,7 +166,7 @@ crew = Crew(
result = crew.kickoff(inputs={"topic": "GPU inference optimization"})
```

With 4 parallel slots, CrewAI can run 4 agents simultaneously at ~80 tok/s each (Qwen MoE) or ~16 tok/s each (Hermes Dense).
With Qwen3-Coder MoE (3 parallel slots), CrewAI can run 3 agents simultaneously at ~168 tok/s each; with Hermes Dense (4 slots), it can run 4 agents at ~33 tok/s each.

### AutoGen

Expand Down Expand Up @@ -237,7 +241,7 @@ docker run -d -p 3000:8080 \
ghcr.io/open-webui/open-webui:main
```

Open WebUI supports multi-user chat with conversation history. Each user session uses one of Foundry's 4 inference slots.
Open WebUI supports multi-user chat with conversation history. Each user session uses one of Foundry's inference slots.

### text-generation-webui (oobabooga)

Expand Down Expand Up @@ -315,13 +319,14 @@ console.log(response.choices[0].message.content);

| Use case | Recommended model | Why |
|----------|-------------------|-----|
| **Coding agents** (OpenCode, Cursor, Aider) | Qwen3.5-35B-A3B | Fast decode (181 tok/s), 192K context for large codebases, good at code |
| **Multi-agent orchestration** (CrewAI, AutoGen) | Qwen3.5-35B-A3B | 4-concurrent at 320 tok/s aggregate, MoE batching advantage |
| **Coding agents** (OpenCode, Cursor, Aider) | Qwen3-Coder-30B-A3B | Fastest decode (275 tok/s), purpose-built for code, tool calling support |
| **Multi-agent orchestration** (CrewAI, AutoGen) | Qwen3-Coder-30B-A3B | 3-concurrent at 497 tok/s aggregate, best MoE batching efficiency |
| **General coding + long context** | Qwen3.5-35B-A3B | 192K effective context for large codebases, hybrid recurrent architecture |
| **Reasoning-heavy tasks** | Hermes-4.3-36B | Thinking mode with `<think>` tags, stronger reasoning on hard problems |
| **Tool use / function calling** | Hermes-4.3-36B | Trained specifically for structured tool calling with `<tool_call>` XML |
| **Tool use / function calling** | Qwen3-Coder-30B-A3B or Hermes-4.3-36B | Both have strong tool calling; Coder is ~4x faster (275 vs 64 tok/s), Hermes is more reliable on complex schemas |
| **Roleplay / creative writing** | Hermes-4.3-36B | NousResearch fine-tune optimized for personality and narrative |
| **Long document Q&A** | Qwen3.5-35B-A3B | 192K context window, recurrent layers handle long sequences efficiently |
| **16 GB VRAM GPUs** | Qwen3.5-35B-A3B | MoE expert offloading works on 16 GB; Hermes needs 24 GB minimum |
| **16 GB VRAM GPUs** | Qwen3-Coder-30B-A3B | Smallest disk footprint (17.7 GB), MoE expert offloading works on 16 GB |

## Performance Considerations

Expand All @@ -331,31 +336,41 @@ Single-stream decode latency (time to generate one token):

| Model | Latency per token | Tokens per second |
|-------|-------------------|-------------------|
| Qwen3-Coder-30B-A3B | ~3.6 ms | ~275 tok/s |
| Qwen3.5-35B-A3B | ~5.5 ms | ~181 tok/s |
| Hermes-4.3-36B | ~15.5 ms | ~64 tok/s |

For interactive coding agents, Qwen delivers a visibly faster typing experience. For batch/background tasks where latency is less critical, Hermes' stronger reasoning may be worth the tradeoff.
For interactive coding agents, Qwen3-Coder delivers the fastest typing experience. Qwen3.5 trades some speed for 192K effective context. For batch/background tasks where latency is less critical, Hermes' stronger reasoning may be worth the tradeoff.

### Prompt processing

Prompt processing (prefill) runs at ~1,163 tok/s for Qwen on RTX 5090. A 10K token prompt takes ~8.6 seconds to process. Keep system prompts concise to minimize time-to-first-token.
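
As a rough sanity check on the figure above, prefill time can be estimated as prompt length divided by prefill throughput. The sketch below is illustrative only: `estimate_prefill_seconds` is a hypothetical helper, and it assumes prefill throughput stays constant at the ~1,163 tok/s quoted above, which a real workload may not match.

```python
def estimate_prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float = 1163.0) -> float:
    """Approximate the prompt-processing share of time-to-first-token.

    The default rate is the ~1,163 tok/s figure quoted in this section;
    treating it as constant is a simplifying assumption.
    """
    if prompt_tokens < 0 or prefill_tok_per_s <= 0:
        raise ValueError("prompt_tokens must be >= 0 and prefill_tok_per_s must be > 0")
    return prompt_tokens / prefill_tok_per_s


# Compare a short system prompt with progressively larger contexts.
for n in (1_000, 10_000, 50_000):
    print(f"{n:>6} prompt tokens -> ~{estimate_prefill_seconds(n):.1f} s prefill")
```

Under that assumption the estimate grows linearly with prompt size, which is why trimming the system prompt pays off directly in time-to-first-token.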

### Concurrent agent scaling

Qwen3-Coder-30B-A3B (fastest, 3 slots):
```
1 agent: 275 tok/s (100% per-agent speed)
2 agents: 405 tok/s (~204 tok/s each, 74% per-agent)
3 agents: 497 tok/s (~168 tok/s each, 61% per-agent)
```

Qwen3.5-35B-A3B (4 slots):
```
1 agent: 181 tok/s (100% per-agent speed)
2 agents: 234 tok/s (~117 tok/s each, 65% per-agent)
4 agents: 320 tok/s (~80 tok/s each, 44% per-agent)
```

If your workflow has >4 concurrent agents, requests queue until a slot is free. Consider multi-GPU routing (below) for higher concurrency.
If your workflow has more concurrent agents than slots, requests queue until a slot is free. Consider multi-GPU routing (below) for higher concurrency.
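
If you would rather hold excess agents back on the client than let them queue on the server, one option is a semaphore sized to the slot count. This is a minimal sketch, not part of Foundry: `run_agents` and `SLOTS` are hypothetical names, the value 3 assumes the Qwen3-Coder profile, and `asyncio.sleep` stands in for a real chat completion call.

```python
import asyncio

SLOTS = 3  # assumption: matches the server's --parallel value (3 for Qwen3-Coder)


async def run_agents(num_agents: int, slots: int = SLOTS) -> int:
    """Run simulated agent requests, never more than `slots` at once.

    Returns the peak number of in-flight requests that was observed.
    """
    limiter = asyncio.Semaphore(slots)
    in_flight = 0
    peak = 0

    async def one_agent(agent_id: int) -> None:
        nonlocal in_flight, peak
        async with limiter:  # waits here while all slots are busy
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for a real request
            in_flight -= 1

    await asyncio.gather(*(one_agent(i) for i in range(num_agents)))
    return peak


peak = asyncio.run(run_agents(num_agents=8))
print(f"peak concurrent requests: {peak}")
```

The limiter only shapes client behaviour; the server still enforces its own slot count regardless.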

### Context window usage

VRAM scales with context usage. The default RTX 5090 profiles are tuned for maximum context:

| Model | Default context | VRAM at idle | VRAM at full context |
|-------|----------------|--------------|---------------------|
| Qwen3-Coder-30B-A3B | 192K | 25.0 GB | ~28.9 GB |
| Qwen3.5-35B-A3B | 192K | 25.3 GB | ~26.1 GB |
| Hermes-4.3-36B | 32K | 24.5 GB | ~27.8 GB |

Expand All @@ -370,7 +385,7 @@ docker run --gpus all -p 8080:8080 \

## Structured Output

Both models support JSON mode for structured outputs:
All three models support JSON mode for structured outputs:

```python
response = client.chat.completions.create(
Expand Down Expand Up @@ -440,7 +455,7 @@ if tool_calls:
print(f"Args: {tool_calls[0].function.arguments}")
```

Both models support Jinja chat templates for tool calling. The entrypoint enables `--jinja` by default.
All models support Jinja chat templates for tool calling. The entrypoint enables `--jinja` by default.

## Thinking / Reasoning Mode

Expand Down Expand Up @@ -483,7 +498,7 @@ The Docker Compose configuration includes TCP tuning (BBR congestion control, bu

## Multi-GPU Agent Routing

For workloads requiring more than 4 concurrent agents, run multiple Foundry instances and load-balance across them.
For workloads requiring more concurrent agents than slots, run multiple Foundry instances and load-balance across them.

### Simple round-robin with nginx

Expand Down Expand Up @@ -511,14 +526,15 @@ For deterministic routing (each agent always hits the same GPU):
```python
import os

# Route based on agent ID
# Route based on agent ID (adjust slots_per_gpu to match your model's --parallel setting)
slots_per_gpu = 3 # Qwen3-Coder default; use 4 for Qwen3.5/Hermes
gpu_endpoints = [
"http://localhost:8080/v1", # GPU 0: agents 0-3
"http://localhost:8081/v1", # GPU 1: agents 4-7
"http://localhost:8080/v1", # GPU 0
"http://localhost:8081/v1", # GPU 1
]

def get_client(agent_id: int) -> OpenAI:
endpoint = gpu_endpoints[agent_id // 4]
endpoint = gpu_endpoints[agent_id // slots_per_gpu]
return OpenAI(base_url=endpoint, api_key="sk-local")
```
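
Note that plain integer division raises `IndexError` once `agent_id` reaches `slots_per_gpu * len(gpu_endpoints)`. A possible variant, sketched here under the assumption that wrapping surplus agents back onto the first GPU is acceptable, adds a modulo step. `endpoint_for` is a hypothetical helper that returns only the URL, so it runs without the OpenAI client installed.

```python
SLOTS_PER_GPU = 3  # Qwen3-Coder default; use 4 for Qwen3.5/Hermes
GPU_ENDPOINTS = [
    "http://localhost:8080/v1",  # GPU 0
    "http://localhost:8081/v1",  # GPU 1
]


def endpoint_for(agent_id: int,
                 slots_per_gpu: int = SLOTS_PER_GPU,
                 endpoints: list[str] = GPU_ENDPOINTS) -> str:
    """Map an agent ID to an endpoint, wrapping around when IDs exceed total slots."""
    if agent_id < 0:
        raise ValueError("agent_id must be non-negative")
    gpu_index = (agent_id // slots_per_gpu) % len(endpoints)
    return endpoints[gpu_index]


for agent_id in range(8):
    print(agent_id, endpoint_for(agent_id))
```

Agents that wrap around share slots with earlier agents, so their requests queue as described under concurrent agent scaling.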

Expand All @@ -536,7 +552,7 @@ If you see `no devices with dedicated memory found` in the logs, the CUDA backen

1. Check if all layers are on GPU: look for `offloaded N/N layers to GPU` in container logs
2. Check VRAM: `nvidia-smi` -- if VRAM is full, reduce context with `FOUNDRY_CTX_LENGTH`
3. Check if all 4 slots are occupied: `curl http://localhost:8080/metrics | grep slots`
3. Check if all slots are occupied: `curl http://localhost:8080/metrics | grep slots`

### Connection refused

4 changes: 2 additions & 2 deletions Makefile
Expand Up @@ -12,8 +12,8 @@ MODELS_DIR ?= $(HOME)/.cache/foundry
.PHONY: help build run run-profile test benchmark monitoring down push push-all clean clean-models download

help: ## Show this help
@echo "Available models: qwen3.5-35b-a3b (default), hermes-4.3-36b"
@echo "Usage: make run MODEL=hermes-4.3-36b"
@echo "Available models: qwen3.5-35b-a3b (default), qwen3-coder-30b-a3b, hermes-4.3-36b"
@echo "Usage: make run MODEL=qwen3-coder-30b-a3b"
@echo ""
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | \
awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'
75 changes: 60 additions & 15 deletions README.md
Expand Up @@ -89,6 +89,40 @@ Dense transformer. 36B total parameters, all active per token. ByteDance Seed-OS

Dense models activate all parameters per token, making them compute-bound rather than memory-bandwidth-bound. Expect ~3x slower decode than equivalently-sized MoE models on the same hardware.

### Qwen3-Coder-30B-A3B (MoE)

Standard Mixture-of-Experts optimized for code generation. 30B total parameters, only 3B active per token.

- 48 transformer layers, standard attention (GQA 32:4)
- 128 experts per MoE layer, top-8 active per token
- Quantization: UD-Q4_K_XL via [Unsloth](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) (Dynamic 2.0)
- Disk: ~17.7 GB | Min VRAM: 16 GB (with expert offloading) | Max context: 262K native
- Built-in tool calling support via `--jinja` chat template

| GPU | VRAM | Context | Decode | 3-concurrent | VRAM used |
|-----|------|---------|--------|--------------|-----------|
| RTX 5090 | 32 GB | 64K/slot | ~275 tok/s | ~497 tok/s | 28.9 GB |
| Other NVIDIA (16 GB+) | 16+ GB | 16K/slot | varies | varies | varies |

<details>
<summary>RTX 5090 detailed benchmark</summary>

```
SINGLE-STREAM DECODE: ~275 tok/s
3-CONCURRENT AGGREGATE: ~497 tok/s (+81% via MoE expert batching)
3-CONCURRENT PER-SLOT: ~168 tok/s each
PROMPT PROCESSING: ~345-1,038 tok/s (varies with batch position)
VRAM USAGE: 28.9 GB / 32.6 GB (3.7 GB headroom)
CONTEXT: 64K per slot (3 slots, auto-fitted from 192K request)
```

Benchmarked 2026-03-02 with native sm_120a (Blackwell) compilation and `BLACKWELL_NATIVE_FP4=1` enabled.

**Why 3 slots (not 4)?** With 3 slots, `--fit on` allocates 64K context per slot instead of 48K. Aggregate throughput is essentially identical (497 vs 495 tok/s), but per-agent speed under full load is 35% faster (168 vs 124 tok/s). The 4th slot rarely matters for a single-GPU workstation. Override with `FOUNDRY_EXTRA_ARGS="--parallel 4"` if needed.

**vs Qwen3.5-35B-A3B**: 52% faster single-stream, 55% faster aggregate. The standard MoE architecture (no DeltaNet recurrent layers) batches more efficiently on Blackwell. Trades the 192K effective context of Qwen3.5 for raw speed.
</details>
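
The 64K and 48K per-slot figures in the benchmark notes are consistent with the requested context being divided evenly across slots. The following sketch reproduces that arithmetic; `context_per_slot` is a hypothetical helper, and reading "192K" as 192 × 1024 tokens is an assumption.

```python
def context_per_slot(total_ctx_tokens: int, parallel_slots: int) -> int:
    """Tokens of context available to each slot when the total is split evenly."""
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return total_ctx_tokens // parallel_slots


TOTAL_CTX = 192 * 1024  # assumption: "192K" means 192 * 1024 tokens

for slots in (3, 4):
    per_slot = context_per_slot(TOTAL_CTX, slots)
    print(f"--parallel {slots}: {per_slot // 1024}K tokens per slot")
```

Fewer slots therefore buy each agent a longer window, which is the tradeoff the 3-slot default makes.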

## How It Works

Why llama.cpp and not SGLang or vLLM? For **consumer GPUs**, llama.cpp is the only engine whose MoE expert offloading (`--fit on`) can run a 35B-parameter MoE model on a single 16-24 GB card at full speed. SGLang and vLLM require the entire model to fit in VRAM.
Expand Down Expand Up @@ -143,34 +177,37 @@ All settings can be overridden via environment variables:

## Multi-Agent Inference

The RTX 5090 profile is configured with `--parallel 4`, enabling 4 concurrent inference slots. This makes Foundry well-suited for multi-agent workflows where several AI agents share a single GPU.
The RTX 5090 profiles are configured with multiple concurrent inference slots: `--parallel 4` for Qwen3.5 and Hermes, `--parallel 3` for Qwen3-Coder. This makes Foundry well-suited for multi-agent workflows where several AI agents share a single GPU.

### Why MoE batching works

Qwen3.5-35B-A3B uses a 256-expert MoE architecture with only 8 experts active per token. During single-stream decode, the GPU's tensor cores are largely idle -- the bottleneck is memory bandwidth, not compute. When multiple agents send concurrent requests, llama.cpp batches token generation across all active slots. Different tokens may route to different experts, and CUDA graphs capture the entire batched MoE operation, significantly improving GPU utilization.

### Throughput scaling

| Active agents | Aggregate throughput | Per-agent speed | VRAM |
|---------------|---------------------|-----------------|------|
| 1 | 181 tok/s | 181 tok/s | 25.3 GB |
| 2 | 234 tok/s | ~117 tok/s each | 25.7 GB |
| 4 | 320 tok/s | ~80 tok/s each | 26.1 GB |
Measured on RTX 5090 with Qwen models (MoE):

| Active agents | Qwen3.5-35B-A3B (4 slots) | Qwen3-Coder-30B-A3B (3 slots) |
|---------------|----------------------------|--------------------------------|
| 1 | 181 tok/s | 275 tok/s |
| 2 | 234 tok/s (117 each) | 405 tok/s (204 each) |
| 3 | — | 497 tok/s (168 each) |
| 4 | 320 tok/s (80 each) | — |

Single-agent speed is unaffected. The 4 slots only activate when there are concurrent requests.
Single-agent speed is unaffected. Concurrent slots only activate when there are simultaneous requests.
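
To compare how much each model gains from batching, the aggregate figure at full load can be expressed relative to single-stream decode. This is only a sketch over the numbers in the table above; `batching_gain_percent` is a hypothetical helper.

```python
# Measured figures copied from the throughput table (tok/s).
SINGLE_STREAM = {"Qwen3.5-35B-A3B": 181, "Qwen3-Coder-30B-A3B": 275}
FULL_LOAD_AGGREGATE = {"Qwen3.5-35B-A3B": 320, "Qwen3-Coder-30B-A3B": 497}


def batching_gain_percent(single_tok_s: float, aggregate_tok_s: float) -> float:
    """Extra aggregate throughput at full load, relative to one stream, in percent."""
    if single_tok_s <= 0:
        raise ValueError("single_tok_s must be positive")
    return (aggregate_tok_s / single_tok_s - 1.0) * 100.0


for model, single in SINGLE_STREAM.items():
    gain = batching_gain_percent(single, FULL_LOAD_AGGREGATE[model])
    print(f"{model}: +{gain:.0f}% aggregate throughput with all slots busy")
```

Keep in mind the two models are measured at different slot counts (4 vs 3), so the percentages are not a like-for-like comparison.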

### Multi-GPU scaling

With 2x RTX 5090, run two independent instances for 8 total concurrent slots and ~640 tok/s combined aggregate throughput:
With 2x RTX 5090, run two independent instances for double the concurrent slots and aggregate throughput:

```bash
# GPU 0: agents 1-4
# GPU 0
docker run --gpus '"device=0"' -p 8080:8080 -v ~/.cache/foundry:/models \
ghcr.io/infernet-org/foundry/qwen3.5-35b-a3b:latest
ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latest

# GPU 1: agents 5-8
# GPU 1
docker run --gpus '"device=1"' -p 8081:8080 -v ~/.cache/foundry:/models \
ghcr.io/infernet-org/foundry/qwen3.5-35b-a3b:latest
ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latest
```

### Compatible frameworks
Expand All @@ -187,6 +224,7 @@ docker compose up

# Choose a different model
FOUNDRY_MODEL=hermes-4.3-36b docker compose up
FOUNDRY_MODEL=qwen3-coder-30b-a3b docker compose up

# With explicit profile
FOUNDRY_PROFILE=rtx5090 docker compose up
Expand All @@ -208,6 +246,7 @@ GF_ADMIN_PASSWORD=admin
```bash
make build # Build the default model image (qwen3.5-35b-a3b)
make build MODEL=hermes-4.3-36b # Build a different model
make build MODEL=qwen3-coder-30b-a3b # Build the coding-optimized model
make run # Run with auto-detected GPU
make test # Smoke test: start, wait for health, send one request
make benchmark # Run benchmark against a running server
Expand Down Expand Up @@ -332,12 +371,18 @@ foundry/
│ │ └── profiles/
│ │ ├── rtx5090.sh # 192K ctx, 4 slots, ~320 tok/s aggregate
│ │ └── default.sh # 16K ctx, conservative settings
│ └── hermes-4.3-36b/
│ ├── hermes-4.3-36b/
│ │ ├── Dockerfile # Multi-stage: compiles llama.cpp for sm_89 + sm_120a
│ │ ├── entrypoint.sh # Copied from scripts/entrypoint.sh at build time
│ │ └── profiles/
│ │ ├── rtx5090.sh # 32K ctx, 4 slots, ~132 tok/s aggregate
│ │ └── default.sh # 8K ctx, 24 GB minimum
│ └── qwen3-coder-30b-a3b/
│ ├── Dockerfile # Multi-stage: compiles llama.cpp for sm_89 + sm_120a
│ ├── entrypoint.sh # Copied from scripts/entrypoint.sh at build time
│ └── profiles/
│ ├── rtx5090.sh # 32K ctx, 4 slots, ~132 tok/s aggregate
│ └── default.sh # 8K ctx, 24 GB minimum
│ ├── rtx5090.sh # 192K ctx, 3 slots, ~497 tok/s aggregate
│ └── default.sh # 32K ctx, conservative settings
├── scripts/
│ ├── entrypoint.sh # Shared entrypoint (GPU detect, profile load, model download)
│ ├── benchmark.py # Generation speed, prompt processing, throughput