Add Kimi K2.5 generation vLLM runtime and K2.6/K2.7-Code models #633
Open
Juno13340 wants to merge 4 commits into
Force-pushed from ebb9ed0 to 8935008
YouNeedCryDear
requested changes
Jun 23, 2026
    # tensorParallelismOverride:
    #   tensorParallelSize: 8
    modelSizeRange:
      min: 150B
Collaborator
How is this size calculated? Isn't it a 1T parameter model?
Author
1T logically, yes, but OME matches on the safetensors element count, which ignores dtype. These are int4-packed, so it comes out ~150–300B, not 1T. Tried 900–1100B first and autoSelect failed for this exact reason.
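The reply above can be made concrete with a rough count. This is an illustrative sketch, not a measurement: it assumes the int4 weights are packed p = 8 per stored element (as with a 32-bit container), which the thread does not state, and it treats the full 1T figure as packed weights. With N_q logical packed weights and N_u elements stored unpacked (for example scales, embeddings, or any unquantized layers), the element count E seen by a dtype-agnostic counter would be:

```latex
E = \frac{N_q}{p} + N_u,
\qquad
\frac{N_q}{p} \approx \frac{10^{12}}{8} = 1.25 \times 10^{11} \quad (\approx 125\text{B})
```

Under these assumptions the packed weights alone contribute about 125B elements and the unpacked tensors add to that, which is in line with the ~150–300B figure in the reply and far below a 900–1100B range.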
…equests, use /health startup probe, remove dead comments
…purpose multimodal model convention
What this PR does
Adds OME config for the Kimi K2.5-generation (KimiK25ForConditionalGeneration) models:
- ClusterServingRuntime vllm-kimi-k25-single-node-8gpu (single node, 8 GPU, TP=8) that both models auto-select.
- ClusterBaseModel entries: kimi-k2-6 (moonshotai/Kimi-K2.6) and kimi-k2-7-code (moonshotai/Kimi-K2.7-Code).
- kustomization.yaml.
Why we need it
Enables serving the new Kimi K2.6 and K2.7-Code models on OME. Both share an identical vLLM serving config (same architecture, size range, args, images), so they are served by one generic runtime via autoSelect rather than duplicating per-model runtimes, consistent with how the Qwen3 guard/base runtimes were consolidated. The k25 naming keeps it distinct from the existing original-K2 runtimes, which use DeepseekV3ForCausalLM.
Fixes #
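For readers unfamiliar with how a shared runtime is matched, the selection-relevant part of such a runtime might look roughly like the sketch below. It is not copied from this PR: the name, architecture, and min: 150B come from the text and the reviewed diff, while max: 300B is inferred from the author's "~150–300B" reply, and the format name, version strings, and priority are assumptions modeled on other OME runtimes. The engine configuration (image, vLLM args, probes) is omitted.

```yaml
# Hypothetical sketch of the matching fields only; not the file from this PR.
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: vllm-kimi-k25-single-node-8gpu
spec:
  supportedModelFormats:
    - modelFormat:
        name: safetensors        # assumed format name
        version: "1.0.0"         # assumed version
      modelArchitecture: KimiK25ForConditionalGeneration
      autoSelect: true           # lets matching models pick this runtime
      priority: 1                # assumed
  modelSizeRange:
    min: 150B                    # from the reviewed diff
    max: 300B                    # assumed from the "~150–300B" reply
```

Because matching keys on architecture and size range rather than on a model name, both kimi-k2-6 and kimi-k2-7-code can resolve to the same runtime.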
How to test
N/A: config-only change. Validated with kubectl kustomize config/runtimes and kubectl kustomize config/models (both build cleanly). After applying, a Kimi K2.6/K2.7-Code model auto-selects vllm-kimi-k25-single-node-8gpu without naming it in the InferenceService.
Checklist
- kubectl kustomize builds locally for both config/runtimes and config/models
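
To illustrate the auto-selection described under "How to test", a minimal InferenceService might look like the sketch below. It is an assumption-laden example rather than a manifest from this PR: the resource name, namespace, and replica counts are hypothetical, and the field layout follows the OME v1beta1 InferenceService shape as commonly documented. The point is only that no runtime is named.

```yaml
# Hypothetical sketch; names, namespace, and replica counts are illustrative.
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: kimi-k2-6-demo           # hypothetical
  namespace: kimi-demo           # hypothetical
spec:
  model:
    name: kimi-k2-6              # ClusterBaseModel added by this PR
  # No runtime is specified here; the expectation is that
  # vllm-kimi-k25-single-node-8gpu is chosen via autoSelect.
  engine:
    minReplicas: 1
    maxReplicas: 1
```

Swapping model.name to kimi-k2-7-code should exercise the same runtime, since both models share the architecture and size range.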