32 changes: 25 additions & 7 deletions models/arcee-ai/Trinity-Large-Thinking.yaml
@@ -17,8 +17,9 @@ model:
   active_parameters: "13B"
   context_length: 262144
   base_args:
-    - "--dtype"
-    - "bfloat16"
+    - "--trust-remote-code"
+    - "--max-model-len"
+    - "32768"
   base_env: {}

 features:
@@ -59,8 +60,14 @@ compatible_strategies:
   - multi_node_tep
   - multi_node_dep

-hardware_overrides: {}
-strategy_overrides: {}
+hardware_overrides:
+  amd:
+    extra_env:
+      VLLM_ROCM_USE_AITER: "1"
+
+strategy_overrides:
+  single_node_tp:
+    tp: 8

 guide: |
   ## Overview
@@ -90,8 +97,19 @@ guide: |
   ## Launch command

   ```bash
-  vllm serve arcee-ai/Trinity-Large-Thinking \
-    --dtype bfloat16 \
+  VLLM_ROCM_USE_AITER=1 vllm serve arcee-ai/Trinity-Large-Thinking \
+    --trust-remote-code \
+    --tensor-parallel-size 8 \
+    --max-model-len 32768
Comment on lines +100 to +103

Contributor

medium

Since VLLM_ROCM_USE_AITER: "1" is already defined under hardware_overrides.amd.extra_env, the deployment platform will automatically inject this environment variable when running on AMD hardware. Hardcoding it in the generic launch command can be confusing for NVIDIA users (especially those deploying the nvfp4 variant on Blackwell GPUs). It is cleaner to keep the launch command generic.

  vllm serve arcee-ai/Trinity-Large-Thinking \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --max-model-len 32768

+  ```
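
The injection described in the review comment above can be pictured with a minimal sketch. It assumes a hypothetical wrapper function, `build_launch_cmd`, which is an illustrative name and not part of the recipe tooling or of vLLM. The AMD-only variable is prepended based on the target vendor, mirroring `hardware_overrides.amd.extra_env`, so the documented command itself can stay generic.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not the platform's real mechanism):
# prints the launch command for a given hardware vendor, e.g. "amd" or "nvidia".
build_launch_cmd() {
  local vendor="$1"
  local env_prefix=""

  # Mirrors hardware_overrides.amd.extra_env: only AMD gets the AITER variable.
  if [ "$vendor" = "amd" ]; then
    env_prefix="VLLM_ROCM_USE_AITER=1 "
  fi

  # Generic flags shared by every vendor, taken from the launch command in this diff.
  printf '%svllm serve arcee-ai/Trinity-Large-Thinking --trust-remote-code --tensor-parallel-size 8 --max-model-len 32768\n' "$env_prefix"
}

build_launch_cmd amd
build_launch_cmd nvidia
```

Only the `amd` call prints the `VLLM_ROCM_USE_AITER=1` prefix; any other vendor gets the plain `vllm serve` command.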

+  Optional parser flags:
+
+  ```bash
+  VLLM_ROCM_USE_AITER=1 vllm serve arcee-ai/Trinity-Large-Thinking \
+    --trust-remote-code \
+    --tensor-parallel-size 8 \
+    --max-model-len 32768 \
Comment on lines +109 to +112

Contributor

medium

Similarly, we should remove the AMD-specific VLLM_ROCM_USE_AITER=1 prefix from the optional parser flags launch command to keep it generic and avoid confusion for NVIDIA users.

  vllm serve arcee-ai/Trinity-Large-Thinking \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \

     --reasoning-parser deepseek_r1 \
     --enable-auto-tool-choice \
     --tool-call-parser qwen3_coder
@@ -101,7 +119,7 @@ guide: |
   - `--reasoning-parser deepseek_r1` extracts `<think>...</think>` into `message.reasoning`.
   - `--enable-auto-tool-choice` lets the model decide when to call tools.
   - `--tool-call-parser qwen3_coder` converts tool calls into OpenAI-style `tool_calls`.
-  - `--dtype bfloat16` matches the recommended serving dtype.
+  - `--max-model-len 32768` keeps the KV cache practical for the TP=8 AMD launch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since this recipe is also compatible with NVIDIA hardware (such as the nvfp4 variant), the KV cache limitation is practical for any TP=8 launch, not just AMD. We should make this description more general.

  - `--max-model-len 32768` keeps the KV cache practical for the TP=8 launch.


   Add parallelism flags (`--tensor-parallel-size`, `--data-parallel-size`, or
   `--enable-expert-parallel`) for your hardware. Lower `--max-model-len` if you don't