[Runtime] Add Qwen3-32B, Qwen3-Embedding-8B, and Qwen3Guard-Gen-8B vLLM runtimes#628
Open
Juno13340 wants to merge 3 commits into
Open
Conversation
- qwen3-32b: TP=2/2xH100 (was TP=4); add command [vllm, serve]; readinessProbe 90/60; router resource requests - lower startupProbe failureThreshold: 8B 150->60, 32b 150->100
YouNeedCryDear
approved these changes
Jun 23, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Adds OME configuration for serving three Qwen3 models on vLLM:
Adds the
qwen3guard-gen-8bClusterBaseModel pointing tohf://Qwen/Qwen3Guard-Gen-8B, and completes theqwen3-embedding-8bClusterBaseModel metadata (architecture / format / framework / parameter size).Adds the
vllm-qwen3-32bClusterServingRuntime with SMG router + vLLM settings forQwen3ForCausalLM, 4-way tensor parallelism, 32K context, chunked prefill, Qwen3 reasoning parsing, and Hermes tool-call parsing.Adds the
vllm-qwen3-embedding-8bClusterServingRuntime (embedding/pooling via--runner pooling,Qwen3ForCausalLM, TP=1).Adds the
vllm-qwen3guard-gen-8bClusterServingRuntime (text-generation guard,Qwen3ForCausalLM, TP=1).Registers the models and runtimes in the kustomizations.
Adds sample InferenceServices for the Qwen namespaces.
All runtimes use
docker.io/vllm/vllm-openai:v0.20.0. Framework versions are pinned to each model's upstreamconfig.jsontransformers_version(32B: 4.51.0, Embedding-8B: 4.51.2, Guard-Gen-8B: 4.51.1) to match the runtime selector'sEqualcomparison.Why we need it
Enables serving for Qwen3-32B (chat), Qwen3-Embedding-8B (embeddings), and Qwen3Guard-Gen-8B (content moderation).
Fixes #
How to test
Validated each engine locally on an 8×A100 host via standalone
docker runagainstvllm/vllm-openai:v0.20.0:Qwen3-32B— TP=4,/v1/chat/completionsreturns expected output.Qwen3-Embedding-8B—--runner pooling,/v1/embeddingsreturns embedding vectors.Qwen3Guard-Gen-8B— TP=1,/v1/chat/completionsreturns expected output.kubectl kustomize config/modelsandkubectl kustomize config/runtimesboth build cleanly.Checklist
make testpasses locally