Update Trinity Large Thinking ROCm command #593

Open
haic0 wants to merge 1 commit into vllm-project:main from haic0:haic0/update-trinity-large-thinking-rocm-command

haic0 commented Jun 29, 2026

Contributor

Summary

  • Add the ROCm env, trust-remote-code, TP=8, and max-model-len 32768 launch settings for Trinity Large Thinking.
  • Align the recipe launch guidance with the provided vLLM serve command.

Test plan

  • Ran node scripts/build-recipes-api.mjs on the complete validated recipe update set.

Made with Cursor

Add the ROCm env, trust-remote-code, TP=8, and max-model-len 32768 launch settings for Trinity Large Thinking.

Signed-off-by: haic0 <haichzha@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
vercel Bot commented Jun 29, 2026

Contributor

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| vllm-recipes | Ready | Preview, Comment | Jun 29, 2026 1:49pm |

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the configuration for models/arcee-ai/Trinity-Large-Thinking.yaml by adding --trust-remote-code and --max-model-len 32768 to the base arguments, introducing AMD hardware overrides, and specifying a tensor parallel size of 8 for single-node strategy overrides. The guide's launch commands and descriptions were also updated to reflect these changes. The reviewer feedback suggests keeping the launch commands and documentation generic by removing AMD-specific environment variables and references, as the platform automatically handles AMD-specific environment variables and the configuration is also compatible with NVIDIA hardware.

Comment on lines +100 to +103
VLLM_ROCM_USE_AITER=1 vllm serve arcee-ai/Trinity-Large-Thinking \
--trust-remote-code \
--tensor-parallel-size 8 \
--max-model-len 32768


medium

Since VLLM_ROCM_USE_AITER: "1" is already defined under hardware_overrides.amd.extra_env, the deployment platform will automatically inject this environment variable when running on AMD hardware. Hardcoding it in the generic launch command can be confusing for NVIDIA users (especially those deploying the nvfp4 variant on Blackwell GPUs). It is cleaner to keep the launch command generic.

  vllm serve arcee-ai/Trinity-Large-Thinking \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --max-model-len 32768
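
For anyone launching by hand rather than through the recipe platform, one way to keep a single generic command while still applying the ROCm setting is to add the variable only when the target is AMD. This is a minimal sketch, not something from the PR: the `launch_cmd` helper and its `amd`/other platform argument are hypothetical, and the function only prints the command instead of running it.

```shell
# Hypothetical helper (not part of the recipe repo): prints the launch command,
# prepending the ROCm-only variable only when the caller passes "amd".
launch_cmd() {
  platform="${1:-generic}"
  prefix=""
  if [ "$platform" = "amd" ]; then
    prefix="VLLM_ROCM_USE_AITER=1 "
  fi
  # The flags are the ones proposed in this PR; nothing else is added.
  printf '%svllm serve arcee-ai/Trinity-Large-Thinking --trust-remote-code --tensor-parallel-size 8 --max-model-len 32768\n' "$prefix"
}

launch_cmd amd      # command with the VLLM_ROCM_USE_AITER=1 prefix
launch_cmd nvidia   # same command without the prefix
```

The flags are identical on both paths, so only the environment differs between platforms, which is the same split the review asks the guide to make.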

Comment on lines +109 to +112
VLLM_ROCM_USE_AITER=1 vllm serve arcee-ai/Trinity-Large-Thinking \
--trust-remote-code \
--tensor-parallel-size 8 \
--max-model-len 32768 \


medium

Similarly, we should remove the AMD-specific VLLM_ROCM_USE_AITER=1 prefix from the optional parser flags launch command to keep it generic and avoid confusion for NVIDIA users.

  vllm serve arcee-ai/Trinity-Large-Thinking \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \

- `--enable-auto-tool-choice` lets the model decide when to call tools.
- `--tool-call-parser qwen3_coder` converts tool calls into OpenAI-style `tool_calls`.
- `--dtype bfloat16` matches the recommended serving dtype.
- `--max-model-len 32768` keeps the KV cache practical for the TP=8 AMD launch.


medium

Since this recipe is also compatible with NVIDIA hardware (such as the nvfp4 variant), the KV cache limitation is practical for any TP=8 launch, not just AMD. We should make this description more general.

  - `--max-model-len 32768` keeps the KV cache practical for the TP=8 launch.
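
To see why the context cap matters regardless of vendor, a rough per-GPU KV cache estimate for one full-length sequence can be worked out with shell arithmetic. This is only a sketch: the layer count, KV head count, and head dimension below are hypothetical placeholders, not Trinity Large Thinking's real configuration, and the formula assumes standard grouped-query attention with KV heads split evenly across tensor-parallel ranks.

```shell
# Rough KV cache estimate in MiB per GPU for ONE sequence at full context.
# Arguments: layers kv_heads head_dim dtype_bytes max_len tp
# The leading factor of 2 covers keys plus values.
kv_cache_mib_per_gpu() {
  layers=$1; kv_heads=$2; head_dim=$3; dtype_bytes=$4; max_len=$5; tp=$6
  total_bytes=$(( 2 * layers * kv_heads * head_dim * dtype_bytes * max_len ))
  echo $(( total_bytes / tp / 1048576 ))
}

# Hypothetical model shape (NOT the real Trinity config): 60 layers,
# 8 KV heads, head_dim 128, bfloat16 (2 bytes), TP=8.
kv_cache_mib_per_gpu 60 8 128 2 32768 8    # prints 960
kv_cache_mib_per_gpu 60 8 128 2 262144 8   # prints 7680
```

The estimate scales linearly with `--max-model-len`, so capping the context bounds the worst-case memory per sequence on any TP=8 deployment, AMD or NVIDIA.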
