90 changes: 88 additions & 2 deletions examples/models/core/qwen/README.md
@@ -929,11 +929,97 @@ Dynamo supports TensorRT LLM as one of its inference engines. For details on how

## Qwen3-Next

Below are the steps to run the Qwen3-Next model. First, create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM server and populate it with the following recommended performance settings. Note that KV cache block reuse must be disabled by setting `kv_cache_config.enable_block_reuse: false`.

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
enable_padding: true
max_batch_size: 720
moe_config:
backend: TRTLLM
stream_interval: 20
num_postprocess_workers: 4
kv_cache_config:
enable_block_reuse: false
EOF
```
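
To double-check that the file was written as intended (in particular that block reuse is disabled), you can load it back with a short Python snippet. This is a minimal sanity-check sketch, not part of the documented workflow; it assumes PyYAML is available in your environment.

```python
# Quick sanity check for /tmp/config.yml (assumes PyYAML is installed).
import yaml

with open("/tmp/config.yml") as f:
    cfg = yaml.safe_load(f)

# Qwen3-Next requires KV cache block reuse to be disabled.
assert cfg["kv_cache_config"]["enable_block_reuse"] is False
print("cuda_graph max_batch_size:", cfg["cuda_graph_config"]["max_batch_size"])
```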

Below is an example command to launch the TRT-LLM server with the Qwen3-Next model from within the container. Note that only the PyTorch backend is currently supported.

```shell
trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 1 \
--max_num_tokens 4096 \
--kv_cache_free_gpu_memory_fraction 0.6 \
--tp_size 4 \
--ep_size 4 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
```
Comment on lines +951 to +965

⚠️ Potential issue | 🟡 Minor

Clarify the max_batch_size inconsistency.

The trtllm-serve command sets --max_batch_size 1 (line 958), but the YAML configuration file sets max_batch_size: 720 in the cuda_graph_config (line 941). This could cause confusion about which value takes precedence or whether they serve different purposes.

Consider adding a comment explaining this discrepancy, or aligning the values if they should be consistent. For example:

+# Note: --max_batch_size controls the runtime batch limit, while cuda_graph_config.max_batch_size
+# defines the CUDA graph batch size configuration for performance optimization.
 trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking \

After the TRT-LLM server is set up and shows `Application startup complete`, you can send requests to the server.

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-Next-80B-A3B-Thinking",
"messages": [
{
"role": "user",
"content": "Where is New York?"
}
],
"max_tokens": 1024,
"top_p": 1.0
}' -w "\n"
```
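
If you prefer a Python client, the same request can be sent through any OpenAI-compatible client library, since `trtllm-serve` exposes an OpenAI-style API. The snippet below is a minimal sketch that assumes the `openai` Python package (v1+) is installed; the API key is a placeholder because the local server does not require authentication by default.

```python
# Minimal sketch: send the same chat completion via the OpenAI-compatible endpoint.
# Assumes the `openai` Python package (v1+); the API key is a dummy value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Thinking",
    messages=[{"role": "user", "content": "Where is New York?"}],
    max_tokens=1024,
    top_p=1.0,
)
print(response.choices[0].message.content)
```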


To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

```shell
cat <<'EOF' > bench.sh
#!/usr/bin/env bash
set -euo pipefail

concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/qwen3_output

for concurrency in ${concurrency_list}; do
num_prompts=$((concurrency * multi_round))
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--model Qwen/Qwen3-Next-80B-A3B-Thinking \
--backend openai \
--dataset-name "random" \
--random-input-len ${isl} \
--random-output-len ${osl} \
--random-prefix-len 0 \
--random-ids \
--num-prompts ${num_prompts} \
--max-concurrency ${concurrency} \
--ignore-eos \
--tokenize-on-client \
--percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh
```
Comment on lines +985 to +1016

⚠️ Potential issue | 🟡 Minor

Remove or utilize the unused result_dir variable.

Line 996 defines result_dir=/tmp/qwen3_output, but this variable is never used in the benchmark_serving.py invocation. Either remove this variable or add output redirection to use it.

Option 1 - Remove the unused variable:

 concurrency_list="1 2 4 8 16 32 64 128 256"
 multi_round=5
 isl=1024
 osl=1024
-result_dir=/tmp/qwen3_output

Option 2 - Use the variable for output:

     python -m tensorrt_llm.serve.scripts.benchmark_serving \
         --model Qwen/Qwen3-Next-80B-A3B-Thinking \
         --backend openai \
+        --save-result \
+        --result-dir ${result_dir} \
         --dataset-name "random" \

In addition, below is the command to run the Qwen3-Next model using the `quickstart_advanced.py` file.

```bash
mpirun -n 1 --allow-run-as-root --oversubscribe python3 examples/llm-api/quickstart_advanced.py --model_dir /Qwen3-Next-80B-A3B-Thinking --kv_cache_fraction 0.6 --disable_kv_cache_reuse --max_batch_size 1 --tp_size 4
```
Comment on lines +1019 to 1023

⚠️ Potential issue | 🟡 Minor

Use a placeholder for the model directory path.

Line 1022 uses an absolute path /Qwen3-Next-80B-A3B-Thinking which is inconsistent with the rest of the documentation that uses placeholders like <YOUR_MODEL_DIR> (line 614, 622) or relative paths like Qwen/Qwen3-Next-80B-A3B-Thinking (line 954).

Apply this diff to maintain consistency:

-mpirun -n 1 --allow-run-as-root --oversubscribe python3 examples/llm-api/quickstart_advanced.py --model_dir /Qwen3-Next-80B-A3B-Thinking --kv_cache_fraction 0.6 --disable_kv_cache_reuse --max_batch_size 1 --tp_size 4
+mpirun -n 1 --allow-run-as-root --oversubscribe python3 examples/llm-api/quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --kv_cache_fraction 0.6 --disable_kv_cache_reuse --max_batch_size 1 --tp_size 4

Additionally, consider documenting why mpirun -n 1 is used for a single-process execution, or remove it if not necessary.



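As an alternative to the command-line examples above, the same model can be driven from Python through the TensorRT-LLM LLM API. The snippet below is a minimal sketch assuming the `tensorrt_llm.LLM` and `KvCacheConfig` interfaces; argument names can differ between TensorRT-LLM releases, so check your installed version rather than treating this as a drop-in replacement for `quickstart_advanced.py`. The KV cache settings mirror the `--kv_cache_fraction 0.6` and `--disable_kv_cache_reuse` flags used above.

```python
# Illustrative sketch of running Qwen3-Next through the TensorRT-LLM Python LLM API.
# Assumes the tensorrt_llm.LLM / KvCacheConfig interfaces; verify argument names
# against your installed TensorRT-LLM release before relying on this.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=False,       # mirrors --disable_kv_cache_reuse
    free_gpu_memory_fraction=0.6,   # mirrors --kv_cache_fraction 0.6
)

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Thinking",  # or a local checkpoint directory
    tensor_parallel_size=4,                    # mirrors --tp_size 4
    kv_cache_config=kv_cache_config,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Where is New York?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```
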
## Notes and Troubleshooting