Skip to content

Update Kimi Linear AMD recipe YAML format#550

Open
haic0 wants to merge 1 commit into
vllm-project:mainfrom
haic0:haic0/replace-kimi-linear-yaml
Open

Update Kimi Linear AMD recipe YAML format#550
haic0 wants to merge 1 commit into
vllm-project:mainfrom
haic0:haic0/replace-kimi-linear-yaml

Conversation

@haic0

@haic0 haic0 commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replaces the legacy Markdown AMD update in Update Kimi-Linear.md for AMD GPU #155 with the current YAML recipe format for Kimi-Linear-48B-A3B-Instruct.
  • Encodes AMD hardware support, ROCm installation guidance, launch commands, and runtime overrides in models/...yaml.

Test plan

  • node scripts/build-recipes-api.mjs
  • Parsed the updated YAML recipes and verified required top-level schema order against .claude/skills/add-recipe/SKILL.md.
  • Verified DCO sign-off as haic0.

Replaces #155.

Signed-off-by: haic0 <haic0@users.noreply.github.com>
@vercel

vercel Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
vllm-recipes Ready Ready Preview, Comment Jun 15, 2026 7:44am

Request Review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds AMD ROCm hardware support, overrides, and setup/run guides for the Kimi-Linear-48B-A3B-Instruct model. The reviewer suggested adding the --port 8000 argument to the ROCm launch command to maintain consistency with other examples in the guide.

Comment on lines +118 to +124
export SAFETENSORS_FAST_GPU=1
vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 1048576 \
--no-enable-prefix-caching \
--trust-remote-code
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with the other launch snippets in this guide (such as the 4-GPU and 8-GPU examples), please explicitly specify --port 8000 in the ROCm launch command.

  export SAFETENSORS_FAST_GPU=1
  vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
    --port 8000 \
    --tensor-parallel-size 8 \
    --max-model-len 1048576 \
    --no-enable-prefix-caching \
    --trust-remote-code

@Pangyh2001

Copy link
Copy Markdown
# Kimi-Linear-48B-A3B-Instruct vLLM Issues

## Summary

We are testing `Kimi-Linear-48B-A3B-Instruct` with vLLM `0.11.2`.

We observe three major issues:

1. Short-input requests can degenerate into repeated `!` tokens and timeout.
2. Long-context requests under high concurrency mostly timeout.
3. vLLM repeatedly logs xgrammar FSM errors during the failure period.

## Environment

- Model: `Kimi-Linear-48B-A3B-Instruct`
- Backend: vLLM `0.11.2`
- GPU: single NVIDIA H200
- Serving mode: OpenAI-compatible `/v1/chat/completions`
- Context length: `32768`
- dtype: `bfloat16`

## Launch Command

```bash
CUDA_VISIBLE_DEVICES=6 \
PYTHONPATH=/tmp/vllm_triton_allocator \
/tmp/vllm-0.11.2/bin/python -u -m vllm.entrypoints.cli.main serve \
  /upfs/models/Kimi-Linear-48B-A3B-Instruct/Kimi-Linear-48B-A3B-Instruct \
  --host 127.0.0.1 \
  --port 8003 \
  --served-model-name Kimi-Linear-48B-A3B-Instruct \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --trust-remote-code \
  --enforce-eager

Request Parameters

We call /v1/chat/completions with:

{
  "temperature": 0.1,
  "top_p": 0.4,
  "max_tokens": 8192,
  "stream": true,
  "response_format": {
    "type": "json_object"
  }
}

The prompt asks the model to return exactly one valid JSON object.

Issue 1: Short Input Degenerates Into ! Loop

Test Setup

  • Input length: about 256 words
  • Requests: 32
  • Concurrency: 8
  • max_tokens: 8192
  • Timeout: 300 seconds

Result

completed: 20
total_timeout: 12
json_ok: 20
parse_failed: 12

Observed Failure Pattern

The failed requests did not fail because of a low output-token limit. They ran until the 300-second timeout and produced mostly repeated exclamation marks:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
...

Questions

  • Is this a known issue with Kimi-Linear-48B-A3B-Instruct under vLLM?
  • Could this be caused by tokenizer behavior, logits processing, or the model’s custom remote-code implementation?
  • Are there recommended sampling parameters, stop tokens, or serving flags to prevent this repeated ! loop?
  • Is response_format={"type":"json_object"} recommended for this model?

Issue 2: Long Context High-Concurrency Requests Mostly Timeout

Test Setup

We used a real long-context financial analysis prompt.

  • Requests: 192
  • Concurrency: 192
  • max_tokens: 8192
  • Timeout: 300 seconds
  • Same prompt repeated 192 times

Result

total requests: 192
completed with finish_reason=stop: 19
total_timeout: 173
json_ok: 29
parse_failed: 163
latency_min: 181.8s
latency_p50: 312.6s
latency_p95: 331.1s
latency_max: 334.8s

Many timeout requests returned zero characters. Some requests produced partial JSON and then degenerated into repeated !.

Example:

{"role": null, "evidence_assessment": {"main_positive...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

vLLM Runtime Metrics

During the test, vLLM showed heavy queueing:

Running: 26 reqs
Waiting: 147 reqs
GPU KV cache usage: 96.9%

Questions

  • Is concurrency 192 unrealistic for this model on a single H200 with long-context prompts?
  • What maximum concurrency would you recommend for this model on one H200?
  • Does this model require any special vLLM flags for stable high-concurrency serving?
  • Is the timeout mainly due to queueing/KV cache pressure, model generation degeneration, or both?

Issue 3: xgrammar FSM Errors

During the long-context high-concurrency test, vLLM repeatedly logged:

backend_xgrammar.py:158 Failed to advance FSM for request ... for tokens 0. Please file an issue.

This appeared many times around the timeout period.

Questions For vLLM Developers

  • Does response_format={"type":"json_object"} use xgrammar in vLLM 0.11.2?
  • What does Failed to advance FSM ... for tokens 0 usually mean?
  • Could xgrammar be interacting with token generation in a way that causes the repeated ! loop?
  • Is there a recommended workaround?
    • Disable structured decoding?
    • Use a different guided decoding backend?
    • Avoid response_format and rely on prompt-only JSON formatting?
    • Change tokenizer or guided decoding settings?

Low-Concurrency Observation

With the same long-context prompt but only 4 repeated requests at concurrency 2, all requests completed successfully:

requests: 4
completed: 4
finish_reason=stop: 4
json_ok: 4
latency range: about 51.8s to 156.5s

So vLLM 0.11.2 appears better than the previous deployment at low concurrency, but high concurrency remains unstable.

Main Questions

We would like to know whether this is more likely caused by:

  1. The model checkpoint itself.
  2. The custom Kimi model implementation.
  3. Tokenizer behavior.
  4. vLLM structured decoding / xgrammar.
  5. Excessive concurrency on a single H200.
  6. An interaction between these factors.

What We Need

We need the model to reliably produce one valid JSON object for long-context financial role-agent prompts.

Could you please provide guidance on:

  1. The recommended vLLM version for Kimi-Linear-48B-A3B-Instruct.
  2. The recommended launch command and serving flags.
  3. Whether response_format={"type":"json_object"} is supported/recommended.
  4. The recommended maximum concurrency on one H200.
  5. How to avoid repeated ! generation loops.
  6. How to handle or avoid the xgrammar FSM errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants