Update Kimi Linear AMD recipe YAML format#550
Conversation
Signed-off-by: haic0 <haic0@users.noreply.github.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Code Review
This pull request adds AMD ROCm hardware support, overrides, and setup/run guides for the Kimi-Linear-48B-A3B-Instruct model. The reviewer suggested adding the --port 8000 argument to the ROCm launch command to maintain consistency with other examples in the guide.
| export SAFETENSORS_FAST_GPU=1 | ||
| vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \ | ||
| --tensor-parallel-size 8 \ | ||
| --max-model-len 1048576 \ | ||
| --no-enable-prefix-caching \ | ||
| --trust-remote-code | ||
| ``` |
There was a problem hiding this comment.
For consistency with the other launch snippets in this guide (such as the 4-GPU and 8-GPU examples), please explicitly specify --port 8000 in the ROCm launch command.
export SAFETENSORS_FAST_GPU=1
vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 1048576 \
--no-enable-prefix-caching \
--trust-remote-code
# Kimi-Linear-48B-A3B-Instruct vLLM Issues
## Summary
We are testing `Kimi-Linear-48B-A3B-Instruct` with vLLM `0.11.2`.
We observe three major issues:
1. Short-input requests can degenerate into repeated `!` tokens and timeout.
2. Long-context requests under high concurrency mostly timeout.
3. vLLM repeatedly logs xgrammar FSM errors during the failure period.
## Environment
- Model: `Kimi-Linear-48B-A3B-Instruct`
- Backend: vLLM `0.11.2`
- GPU: single NVIDIA H200
- Serving mode: OpenAI-compatible `/v1/chat/completions`
- Context length: `32768`
- dtype: `bfloat16`
## Launch Command
```bash
CUDA_VISIBLE_DEVICES=6 \
PYTHONPATH=/tmp/vllm_triton_allocator \
/tmp/vllm-0.11.2/bin/python -u -m vllm.entrypoints.cli.main serve \
/upfs/models/Kimi-Linear-48B-A3B-Instruct/Kimi-Linear-48B-A3B-Instruct \
--host 127.0.0.1 \
--port 8003 \
--served-model-name Kimi-Linear-48B-A3B-Instruct \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--trust-remote-code \
--enforce-eagerRequest ParametersWe call {
"temperature": 0.1,
"top_p": 0.4,
"max_tokens": 8192,
"stream": true,
"response_format": {
"type": "json_object"
}
}The prompt asks the model to return exactly one valid JSON object. Issue 1: Short Input Degenerates Into
|
Summary
models/...yaml.Test plan
node scripts/build-recipes-api.mjs.claude/skills/add-recipe/SKILL.md.haic0.Replaces #155.