Add Kimi K2.5 generation vLLM runtime and K2.6/K2.7-Code models #633

Open
Juno13340 wants to merge 4 commits into ome-projects:main from Juno13340:genhuang/kimi-k2-6-k2-7-runtimes

@Juno13340

What this PR does

Adds OME config for the Kimi K2.5-generation (KimiK25ForConditionalGeneration) models:

  • A single shared vLLM ClusterServingRuntime vllm-kimi-k25-single-node-8gpu (single node, 8 GPU, TP=8) that both models auto-select.
  • Two ClusterBaseModel entries: kimi-k2-6 (moonshotai/Kimi-K2.6) and kimi-k2-7-code (moonshotai/Kimi-K2.7-Code).
  • Registers all three in the respective kustomization.yaml.

Why we need it

Enables serving the new Kimi K2.6 and K2.7-Code models on OME. Both share an identical vLLM serving config (same architecture, size range, args, images), so they are served by one generic runtime via autoSelect rather than duplicating per-model runtimes — consistent with how the Qwen3 guard/base runtimes were consolidated. The k25 naming keeps it distinct from the existing original-K2 runtimes, which use DeepseekV3ForCausalLM.
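
As a rough illustration of why one shared runtime can serve both models, the matching idea can be sketched as below. This is a minimal sketch, not OME's actual selection code: the `Runtime` and `Model` classes, the `select_runtime` helper, the `kimi-k2-original-runtime` name, and all size figures are hypothetical and exist only for this example. It assumes selection keys on model architecture plus a size range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Runtime:
    # Hypothetical stand-in for a ClusterServingRuntime; not OME's real schema.
    name: str
    architecture: str
    min_size: float  # lower bound of the accepted size range
    max_size: float  # upper bound of the accepted size range
    auto_select: bool = True


@dataclass(frozen=True)
class Model:
    # Hypothetical stand-in for a ClusterBaseModel.
    name: str
    architecture: str
    size: float  # size value the matcher compares against the range


def select_runtime(model: Model, runtimes: list[Runtime]) -> Optional[Runtime]:
    """Return the first auto-selectable runtime whose architecture and size range fit."""
    for rt in runtimes:
        if not rt.auto_select:
            continue
        if rt.architecture != model.architecture:
            continue
        if rt.min_size <= model.size <= rt.max_size:
            return rt
    return None


runtimes = [
    # Illustrative entry for an original-K2 runtime; name and range are made up.
    Runtime("kimi-k2-original-runtime", "DeepseekV3ForCausalLM", 900e9, 1100e9),
    # Range upper bound is an assumption; the PR snippet only shows min: 150B.
    Runtime("vllm-kimi-k25-single-node-8gpu", "KimiK25ForConditionalGeneration", 150e9, 300e9),
]
models = [
    Model("kimi-k2-6", "KimiK25ForConditionalGeneration", 170e9),       # size is illustrative
    Model("kimi-k2-7-code", "KimiK25ForConditionalGeneration", 170e9),  # size is illustrative
]

selected = {m.name: select_runtime(m, runtimes).name for m in models}
print(selected)
```

Because both models share an architecture and fall in the same size range, they resolve to the same runtime, and the distinct architecture string keeps them away from the original-K2 entry.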

Fixes #

How to test

N/A — config-only change. Validated with kubectl kustomize config/runtimes and kubectl kustomize config/models (both build cleanly). After applying, a Kimi K2.6/K2.7-Code model auto-selects vllm-kimi-k25-single-node-8gpu without naming it in the InferenceService.

Checklist

  • Tests added/updated (if applicable) — N/A (config only)
  • Docs updated (if applicable) — N/A
  • kubectl kustomize builds locally for both config/runtimes and config/models

@github-actions (bot) added the runtime, models, and config labels on Jun 22, 2026
@Juno13340 force-pushed the genhuang/kimi-k2-6-k2-7-runtimes branch from ebb9ed0 to 8935008 on June 22, 2026 22:06
Comment thread config/models/moonshotai/Kimi-K2.6.yaml
Comment thread config/runtimes/vllm/moonshotai/kimi-k25-single-node-8gpu-rt.yaml Outdated
Comment thread config/runtimes/vllm/moonshotai/kimi-k25-single-node-8gpu-rt.yaml Outdated
Comment thread config/runtimes/vllm/moonshotai/kimi-k25-single-node-8gpu-rt.yaml Outdated
Comment thread config/runtimes/vllm/moonshotai/kimi-k25-tp8-rt.yaml
Comment thread config/runtimes/vllm/moonshotai/kimi-k25-single-node-8gpu-rt.yaml Outdated
# tensorParallelismOverride:
#   tensorParallelSize: 8
modelSizeRange:
  min: 150B

Collaborator

How is this size calculated? Isn't it a 1T parameter model?

Author

1T logically, yes, but OME matches on the safetensors element count, which ignores dtype. These are int4-packed, so it comes out ~150–300B, not 1T. Tried 900–1100B first and autoSelect failed for this exact reason.
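
The arithmetic behind that answer can be sketched as follows. This is a back-of-the-envelope example under stated assumptions, not a measurement of the actual checkpoints: it assumes eight 4-bit values are packed into each stored element and that 95% of the logical weights are quantized. Both figures, and the helper names `stored_elements` and `in_range`, are hypothetical.

```python
def stored_elements(logical_params: float,
                    quantized_fraction: float,
                    values_per_element: int = 8) -> float:
    """Element count on disk when part of the weights is bit-packed.

    Packed weights contribute 1/values_per_element of their logical count;
    the remaining weights are stored one element per parameter.
    """
    packed = logical_params * quantized_fraction / values_per_element
    unpacked = logical_params * (1.0 - quantized_fraction)
    return packed + unpacked


def in_range(count: float, lo: float, hi: float) -> bool:
    """True when count falls inside the inclusive range [lo, hi]."""
    return lo <= count <= hi


logical = 1.0e12  # roughly 1T logical parameters
count = stored_elements(logical, quantized_fraction=0.95)  # assumed fraction

print(f"stored elements: {count / 1e9:.2f}B")
print("fits 150B-300B:", in_range(count, 150e9, 300e9))
print("fits 900B-1100B:", in_range(count, 900e9, 1100e9))
```

Under these assumptions the stored element count lands inside the 150–300B window and well outside 900–1100B, which is consistent with the failed first attempt described above.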
