[Runtime] Add Qwen3-32B, Qwen3-Embedding-8B, and Qwen3Guard-Gen-8B vLLM runtimes by Juno13340 · Pull Request #628 · ome-projects/ome

Juno13340 · 2026-06-09T19:20:26Z

What this PR does

Adds OME configuration for serving three Qwen3 models on vLLM:

Adds the qwen3guard-gen-8b ClusterBaseModel pointing to hf://Qwen/Qwen3Guard-Gen-8B, and completes the qwen3-embedding-8b ClusterBaseModel metadata (architecture / format / framework / parameter size).
Adds the vllm-qwen3-32b ClusterServingRuntime with SMG router + vLLM settings for Qwen3ForCausalLM, 4-way tensor parallelism, 32K context, chunked prefill, Qwen3 reasoning parsing, and Hermes tool-call parsing.
Adds the vllm-qwen3-embedding-8b ClusterServingRuntime (embedding/pooling via --runner pooling, Qwen3ForCausalLM, TP=1).
Adds the vllm-qwen3guard-gen-8b ClusterServingRuntime (text-generation guard, Qwen3ForCausalLM, TP=1).
Registers the models and runtimes in the kustomizations.
Adds sample InferenceServices for the Qwen namespaces.

All runtimes use docker.io/vllm/vllm-openai:v0.20.0. Framework versions are pinned to each model's upstream config.json transformers_version (32B: 4.51.0, Embedding-8B: 4.51.2, Guard-Gen-8B: 4.51.1) to match the runtime selector's Equal comparison.

Why we need it

Enables serving for Qwen3-32B (chat), Qwen3-Embedding-8B (embeddings), and Qwen3Guard-Gen-8B (content moderation).

Fixes #

How to test

Validated each engine locally on an 8×A100 host via standalone docker run against vllm/vllm-openai:v0.20.0:

Qwen3-32B — TP=4, /v1/chat/completions returns expected output.
Qwen3-Embedding-8B — --runner pooling, /v1/embeddings returns embedding vectors.
Qwen3Guard-Gen-8B — TP=1, /v1/chat/completions returns expected output.

kubectl kustomize config/models and kubectl kustomize config/runtimes both build cleanly.

Checklist

Tests added/updated (if applicable)
Docs updated (if applicable)
make test passes locally

…-Gen-8B

- qwen3-32b: TP=2/2xH100 (was TP=4); add command [vllm, serve]; readinessProbe 90/60; router resource requests - lower startupProbe failureThreshold: 8B 150->60, 32b 150->100

feat: add vLLM runtimes for Qwen3-32B, Qwen3-Embedding-8B, Qwen3Guard…

8ab0d5b

…-Gen-8B

Juno13340 requested review from CatherineSue, XinyueZhang369 and slin1237 as code owners June 9, 2026 19:20

github-actions Bot added runtime Runtime configuration changes models Model configuration changes config Configuration changes labels Jun 9, 2026

Juno13340 changed the title ~~feat: add vLLM runtimes for Qwen3-32B, Qwen3-Embedding-8B, Qwen3Guard…~~ [Runtime] Add Qwen3-32B, Qwen3-Embedding-8B, and Qwen3Guard-Gen-8B vLLM runtimes Jun 9, 2026

YouNeedCryDear reviewed Jun 9, 2026

View reviewed changes

Comment thread config/runtimes/vllm/qwen3-32b-rt.yaml Outdated

Comment thread config/runtimes/vllm/qwen3-32b-rt.yaml

Comment thread config/runtimes/vllm/qwen3-embedding-8b-rt.yaml Outdated

Comment thread config/runtimes/vllm/qwen3guard-gen-8b-rt.yaml Outdated

Juno13340 added 2 commits June 9, 2026 18:18

Address review feedback on Qwen vLLM runtimes

b2545a0

- qwen3-32b: TP=2/2xH100 (was TP=4); add command [vllm, serve]; readinessProbe 90/60; router resource requests - lower startupProbe failureThreshold: 8B 150->60, 32b 150->100

Make Qwen3-8B vLLM runtime generic for guard and base models

5b68de2

YouNeedCryDear approved these changes Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Runtime] Add Qwen3-32B, Qwen3-Embedding-8B, and Qwen3Guard-Gen-8B vLLM runtimes#628

[Runtime] Add Qwen3-32B, Qwen3-Embedding-8B, and Qwen3Guard-Gen-8B vLLM runtimes#628
Juno13340 wants to merge 3 commits into
ome-projects:mainfrom
Juno13340:genhuang/model-import-qwen-32b-embeeding-guard-runtime

Juno13340 commented Jun 9, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Juno13340 commented Jun 9, 2026

What this PR does

Why we need it

How to test

Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants