Update Trinity Large Thinking ROCm command #593
Add the ROCm env, trust-remote-code, TP=8, and max-model-len 32768 launch settings for Trinity Large Thinking.

Signed-off-by: haic0 <haichzha@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Code Review
This pull request updates models/arcee-ai/Trinity-Large-Thinking.yaml: it adds --trust-remote-code and --max-model-len 32768 to the base arguments, introduces AMD hardware overrides, and sets a tensor parallel size of 8 in the single-node strategy overrides. The guide's launch commands and descriptions are updated to match.

The review feedback suggests keeping the launch commands and documentation generic by removing the AMD-specific environment variable and references, because the platform injects AMD-specific environment variables automatically and the configuration is also compatible with NVIDIA hardware.
VLLM_ROCM_USE_AITER=1 vllm serve arcee-ai/Trinity-Large-Thinking \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --max-model-len 32768
Since VLLM_ROCM_USE_AITER: "1" is already defined under hardware_overrides.amd.extra_env, the deployment platform will automatically inject this environment variable when running on AMD hardware. Hardcoding it in the generic launch command can be confusing for NVIDIA users (especially those deploying the nvfp4 variant on Blackwell GPUs). It is cleaner to keep the launch command generic.
vllm serve arcee-ai/Trinity-Large-Thinking \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --max-model-len 32768

VLLM_ROCM_USE_AITER=1 vllm serve arcee-ai/Trinity-Large-Thinking \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
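
The reviewer's point about keeping the command generic can be sketched as a small POSIX shell wrapper: the vllm serve arguments stay the same on every platform, and the ROCm-only variable is prepended only on an AMD host. This is an illustration under assumptions, not how the deployment platform works; per the review, the platform reads hardware_overrides.amd.extra_env from the recipe YAML. The function names, the KFD_PATH override, and the use of /dev/kfd as the AMD check are hypothetical choices made for this sketch, and it prints the command instead of running it.

```shell
# Sketch only: prints the launch command, does not start a server.
# Assumption: the presence of /dev/kfd (the ROCm kernel driver device node)
# is used as a stand-in for "this is an AMD host".
# KFD_PATH is a hypothetical override so the check can be exercised anywhere.

is_amd_host() {
  [ -e "${KFD_PATH:-/dev/kfd}" ]
}

build_launch_command() {
  # Generic part: identical on AMD and NVIDIA.
  cmd="vllm serve arcee-ai/Trinity-Large-Thinking --trust-remote-code --tensor-parallel-size 8 --max-model-len 32768"
  # Platform-specific part: prepend the ROCm variable only on AMD hosts.
  if is_amd_host; then
    cmd="VLLM_ROCM_USE_AITER=1 ${cmd}"
  fi
  printf '%s\n' "$cmd"
}

build_launch_command
```

With this split, the documented command needs no AMD prefix, and NVIDIA users copy the same arguments unchanged.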
- `--enable-auto-tool-choice` lets the model decide when to call tools.
- `--tool-call-parser qwen3_coder` converts tool calls into OpenAI-style `tool_calls`.
- `--dtype bfloat16` matches the recommended serving dtype.
- `--max-model-len 32768` keeps the KV cache practical for the TP=8 AMD launch.
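
For reference, the flags listed above can be combined with the ones from the launch hunks into a single command line. The sketch below uses only flags that appear in this thread; the guide's actual command may order them differently or include others, and serve_args is a hypothetical helper name. It prints the command and does not start a server.

```shell
# Sketch only: collect the flags named in this thread, one per line.
serve_args() {
  printf '%s\n' \
    "--trust-remote-code" \
    "--tensor-parallel-size 8" \
    "--max-model-len 32768" \
    "--dtype bfloat16" \
    "--enable-auto-tool-choice" \
    "--tool-call-parser qwen3_coder"
}

# Join the one-per-line flags into a single printable command line.
echo "vllm serve arcee-ai/Trinity-Large-Thinking $(serve_args | tr '\n' ' ')"
```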
Summary
Test plan
node scripts/build-recipes-api.mjs on the complete validated recipe update set.

Made with Cursor