Update Kimi Linear AMD recipe YAML format by haic0 · Pull Request #550 · vllm-project/recipes

haic0 · 2026-06-15T07:34:26Z

Summary

Replaces the legacy Markdown AMD update in Update Kimi-Linear.md for AMD GPU #155 with the current YAML recipe format for Kimi-Linear-48B-A3B-Instruct.
Encodes AMD hardware support, ROCm installation guidance, launch commands, and runtime overrides in models/...yaml.

Test plan

node scripts/build-recipes-api.mjs
Parsed the updated YAML recipes and verified required top-level schema order against .claude/skills/add-recipe/SKILL.md.
Verified DCO sign-off as haic0.

Replaces #155.

Signed-off-by: haic0 <haic0@users.noreply.github.com>

vercel · 2026-06-15T07:34:31Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
vllm-recipes	Ready	Preview, Comment	Jun 15, 2026 7:44am

gemini-code-assist

Code Review

This pull request adds AMD ROCm hardware support, overrides, and setup/run guides for the Kimi-Linear-48B-A3B-Instruct model. The reviewer suggested adding the --port 8000 argument to the ROCm launch command to maintain consistency with other examples in the guide.

gemini-code-assist · 2026-06-15T07:38:04Z

+  export SAFETENSORS_FAST_GPU=1
+  vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
+    --tensor-parallel-size 8 \
+    --max-model-len 1048576 \
+    --no-enable-prefix-caching \
+    --trust-remote-code
+  ```


For consistency with the other launch snippets in this guide (such as the 4-GPU and 8-GPU examples), please explicitly specify --port 8000 in the ROCm launch command.

export SAFETENSORS_FAST_GPU=1 vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \ --port 8000 \ --tensor-parallel-size 8 \ --max-model-len 1048576 \ --no-enable-prefix-caching \ --trust-remote-code

Pangyh2001 · 2026-06-29T11:36:41Z

# Kimi-Linear-48B-A3B-Instruct vLLM Issues

## Summary

We are testing `Kimi-Linear-48B-A3B-Instruct` with vLLM `0.11.2`.

We observe three major issues:

1. Short-input requests can degenerate into repeated `!` tokens and timeout.
2. Long-context requests under high concurrency mostly timeout.
3. vLLM repeatedly logs xgrammar FSM errors during the failure period.

## Environment

- Model: `Kimi-Linear-48B-A3B-Instruct`
- Backend: vLLM `0.11.2`
- GPU: single NVIDIA H200
- Serving mode: OpenAI-compatible `/v1/chat/completions`
- Context length: `32768`
- dtype: `bfloat16`

## Launch Command

```bash
CUDA_VISIBLE_DEVICES=6 \
PYTHONPATH=/tmp/vllm_triton_allocator \
/tmp/vllm-0.11.2/bin/python -u -m vllm.entrypoints.cli.main serve \
  /upfs/models/Kimi-Linear-48B-A3B-Instruct/Kimi-Linear-48B-A3B-Instruct \
  --host 127.0.0.1 \
  --port 8003 \
  --served-model-name Kimi-Linear-48B-A3B-Instruct \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --trust-remote-code \
  --enforce-eager

Request Parameters

We call /v1/chat/completions with:

{
  "temperature": 0.1,
  "top_p": 0.4,
  "max_tokens": 8192,
  "stream": true,
  "response_format": {
    "type": "json_object"
  }
}

The prompt asks the model to return exactly one valid JSON object.

Issue 1: Short Input Degenerates Into `!` Loop

Test Setup

Input length: about 256 words
Requests: 32
Concurrency: 8
max_tokens: 8192
Timeout: 300 seconds

Result

completed: 20
total_timeout: 12
json_ok: 20
parse_failed: 12

Observed Failure Pattern

The failed requests did not fail because of a low output-token limit. They ran until the 300-second timeout and produced mostly repeated exclamation marks:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
...

Questions

Is this a known issue with Kimi-Linear-48B-A3B-Instruct under vLLM?
Could this be caused by tokenizer behavior, logits processing, or the model’s custom remote-code implementation?
Are there recommended sampling parameters, stop tokens, or serving flags to prevent this repeated ! loop?
Is response_format={"type":"json_object"} recommended for this model?

Issue 2: Long Context High-Concurrency Requests Mostly Timeout

Test Setup

We used a real long-context financial analysis prompt.

Requests: 192
Concurrency: 192
max_tokens: 8192
Timeout: 300 seconds
Same prompt repeated 192 times

Result

total requests: 192
completed with finish_reason=stop: 19
total_timeout: 173
json_ok: 29
parse_failed: 163
latency_min: 181.8s
latency_p50: 312.6s
latency_p95: 331.1s
latency_max: 334.8s

Many timeout requests returned zero characters. Some requests produced partial JSON and then degenerated into repeated !.

Example:

{"role": null, "evidence_assessment": {"main_positive...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

vLLM Runtime Metrics

During the test, vLLM showed heavy queueing:

Running: 26 reqs
Waiting: 147 reqs
GPU KV cache usage: 96.9%

Questions

Is concurrency 192 unrealistic for this model on a single H200 with long-context prompts?
What maximum concurrency would you recommend for this model on one H200?
Does this model require any special vLLM flags for stable high-concurrency serving?
Is the timeout mainly due to queueing/KV cache pressure, model generation degeneration, or both?

Issue 3: xgrammar FSM Errors

During the long-context high-concurrency test, vLLM repeatedly logged:

backend_xgrammar.py:158 Failed to advance FSM for request ... for tokens 0. Please file an issue.

This appeared many times around the timeout period.

Questions For vLLM Developers

Does response_format={"type":"json_object"} use xgrammar in vLLM 0.11.2?
What does Failed to advance FSM ... for tokens 0 usually mean?
Could xgrammar be interacting with token generation in a way that causes the repeated ! loop?
Is there a recommended workaround?
- Disable structured decoding?
- Use a different guided decoding backend?
- Avoid response_format and rely on prompt-only JSON formatting?
- Change tokenizer or guided decoding settings?

Low-Concurrency Observation

With the same long-context prompt but only 4 repeated requests at concurrency 2, all requests completed successfully:

requests: 4
completed: 4
finish_reason=stop: 4
json_ok: 4
latency range: about 51.8s to 156.5s

So vLLM 0.11.2 appears better than the previous deployment at low concurrency, but high concurrency remains unstable.

Main Questions

We would like to know whether this is more likely caused by:

The model checkpoint itself.
The custom Kimi model implementation.
Tokenizer behavior.
vLLM structured decoding / xgrammar.
Excessive concurrency on a single H200.
An interaction between these factors.

What We Need

We need the model to reliably produce one valid JSON object for long-context financial role-agent prompts.

Could you please provide guidance on:

The recommended vLLM version for Kimi-Linear-48B-A3B-Instruct.
The recommended launch command and serving flags.
Whether response_format={"type":"json_object"} is supported/recommended.
The recommended maximum concurrency on one H200.
How to avoid repeated ! generation loops.
How to handle or avoid the xgrammar FSM errors.

Update Kimi Linear AMD recipe YAML format

94d3a06

Signed-off-by: haic0 <haic0@users.noreply.github.com>

haic0 mentioned this pull request Jun 15, 2026

Update Kimi-Linear.md for AMD GPU #155

Closed

gemini-code-assist Bot reviewed Jun 15, 2026

View reviewed changes

vercel Bot deployed to Preview June 15, 2026 07:44 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update Kimi Linear AMD recipe YAML format#550

Update Kimi Linear AMD recipe YAML format#550
haic0 wants to merge 1 commit into
vllm-project:mainfrom
haic0:haic0/replace-kimi-linear-yaml

haic0 commented Jun 15, 2026

Uh oh!

vercel Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

Pangyh2001 commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

haic0 commented Jun 15, 2026

Summary

Test plan

Uh oh!

vercel Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Pangyh2001 commented Jun 29, 2026

Request Parameters

Issue 1: Short Input Degenerates Into ! Loop

Test Setup

Result

Observed Failure Pattern

Questions

Issue 2: Long Context High-Concurrency Requests Mostly Timeout

Test Setup

Result

vLLM Runtime Metrics

Questions

Issue 3: xgrammar FSM Errors

Questions For vLLM Developers

Low-Concurrency Observation

Main Questions

What We Need

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel Bot commented Jun 15, 2026 •

edited

Loading

Issue 1: Short Input Degenerates Into `!` Loop