Skip to content

[Runtime] Add Qwen3-32B, Qwen3-Embedding-8B, and Qwen3Guard-Gen-8B vLLM runtimes#628

Open
Juno13340 wants to merge 3 commits into
ome-projects:mainfrom
Juno13340:genhuang/model-import-qwen-32b-embeeding-guard-runtime
Open

[Runtime] Add Qwen3-32B, Qwen3-Embedding-8B, and Qwen3Guard-Gen-8B vLLM runtimes#628
Juno13340 wants to merge 3 commits into
ome-projects:mainfrom
Juno13340:genhuang/model-import-qwen-32b-embeeding-guard-runtime

Conversation

@Juno13340

Copy link
Copy Markdown

What this PR does

Adds OME configuration for serving three Qwen3 models on vLLM:

  • Adds the qwen3guard-gen-8b ClusterBaseModel pointing to hf://Qwen/Qwen3Guard-Gen-8B, and completes the qwen3-embedding-8b ClusterBaseModel metadata (architecture / format / framework / parameter size).

  • Adds the vllm-qwen3-32b ClusterServingRuntime with SMG router + vLLM settings for Qwen3ForCausalLM, 4-way tensor parallelism, 32K context, chunked prefill, Qwen3 reasoning parsing, and Hermes tool-call parsing.

  • Adds the vllm-qwen3-embedding-8b ClusterServingRuntime (embedding/pooling via --runner pooling, Qwen3ForCausalLM, TP=1).

  • Adds the vllm-qwen3guard-gen-8b ClusterServingRuntime (text-generation guard, Qwen3ForCausalLM, TP=1).

  • Registers the models and runtimes in the kustomizations.

  • Adds sample InferenceServices for the Qwen namespaces.

All runtimes use docker.io/vllm/vllm-openai:v0.20.0. Framework versions are pinned to each model's upstream config.json transformers_version (32B: 4.51.0, Embedding-8B: 4.51.2, Guard-Gen-8B: 4.51.1) to match the runtime selector's Equal comparison.

Why we need it

Enables serving for Qwen3-32B (chat), Qwen3-Embedding-8B (embeddings), and Qwen3Guard-Gen-8B (content moderation).

Fixes #

How to test

Validated each engine locally on an 8×A100 host via standalone docker run against vllm/vllm-openai:v0.20.0:

  • Qwen3-32B — TP=4, /v1/chat/completions returns expected output.
  • Qwen3-Embedding-8B--runner pooling, /v1/embeddings returns embedding vectors.
  • Qwen3Guard-Gen-8B — TP=1, /v1/chat/completions returns expected output.

kubectl kustomize config/models and kubectl kustomize config/runtimes both build cleanly.

Checklist

  • Tests added/updated (if applicable)
  • Docs updated (if applicable)
  • make test passes locally

@github-actions github-actions Bot added runtime Runtime configuration changes models Model configuration changes config Configuration changes labels Jun 9, 2026
@Juno13340 Juno13340 changed the title feat: add vLLM runtimes for Qwen3-32B, Qwen3-Embedding-8B, Qwen3Guard… [Runtime] Add Qwen3-32B, Qwen3-Embedding-8B, and Qwen3Guard-Gen-8B vLLM runtimes Jun 9, 2026
Comment thread config/runtimes/vllm/qwen3-32b-rt.yaml Outdated
Comment thread config/runtimes/vllm/qwen3-32b-rt.yaml
Comment thread config/runtimes/vllm/qwen3-embedding-8b-rt.yaml Outdated
Comment thread config/runtimes/vllm/qwen3guard-gen-8b-rt.yaml Outdated
Juno13340 added 2 commits June 9, 2026 18:18
- qwen3-32b: TP=2/2xH100 (was TP=4); add command [vllm, serve]; readinessProbe 90/60; router resource requests
- lower startupProbe failureThreshold: 8B 150->60, 32b 150->100
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

config Configuration changes models Model configuration changes runtime Runtime configuration changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants