[doc] Add Qwen3 Next Guide to Core README #8101

@@ -929,11 +929,97 @@ Dynamo supports TensorRT LLM as one of its inference engines. For details on how

## Qwen3-Next

Below are the commands to run the Qwen3-Next model.
We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM server and populate it with the following recommended performance settings. Note that KV cache block reuse must be disabled by setting `enable_block_reuse: false` under `kv_cache_config`.

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 720
moe_config:
  backend: TRTLLM
stream_interval: 20
num_postprocess_workers: 4
kv_cache_config:
  enable_block_reuse: false
EOF
```
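
If you want to sanity-check the file before launching the server, you can try parsing it. This is an optional convenience step and assumes PyYAML is available in the environment (it typically ships alongside TensorRT-LLM, but that is an assumption here):

```shell
# Optional: confirm the YAML is well-formed (assumes the pyyaml package is installed).
python3 -c "import yaml; print(yaml.safe_load(open('/tmp/config.yml')))"
```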

Below is an example command to launch the TRT-LLM server with the Qwen3-Next model from within the container. Note that only the PyTorch backend is currently supported.

```shell
trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 1 \
  --max_num_tokens 4096 \
  --kv_cache_free_gpu_memory_fraction 0.6 \
  --tp_size 4 \
  --ep_size 4 \
  --trust_remote_code \
  --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```
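
Loading the model can take a while. The loop below is an optional convenience sketch that polls the server until it responds; it assumes the server exposes a `/health` endpoint on the same port, which is the usual behavior of `trtllm-serve` but is stated here as an assumption:

```shell
# Poll the (assumed) /health endpoint until the server answers, then continue.
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "Waiting for trtllm-serve to come up..."
  sleep 10
done
echo "Server is ready."
```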

After the TRT-LLM server is up and logs `Application startup complete`, you can send requests to it.

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-Next-80B-A3B-Thinking",
  "messages": [
    {
      "role": "user",
      "content": "Where is New York?"
    }
  ],
  "max_tokens": 1024,
  "top_p": 1.0
}' -w "\n"
```
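
The response follows the OpenAI chat-completions schema, so the generated text sits under `choices[0].message.content`. If `jq` happens to be available in your environment (an assumption, it is not required by anything above), you can print just the answer:

```shell
# Same request as above, piping the JSON response through jq to keep only the text.
curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-Next-80B-A3B-Thinking",
  "messages": [{"role": "user", "content": "Where is New York?"}],
  "max_tokens": 1024,
  "top_p": 1.0
}' | jq -r '.choices[0].message.content'
```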

To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper script `bench.sh`.

```shell
cat <<'EOF' > bench.sh
#!/usr/bin/env bash
set -euo pipefail

concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/qwen3_output

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model Qwen/Qwen3-Next-80B-A3B-Thinking \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --random-ids \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --tokenize-on-client \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh
```
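
Once the wrapper exists, run it against the running server. Teeing the output to a log file is optional; the path used here is just an illustrative choice:

```shell
./bench.sh 2>&1 | tee /tmp/qwen3_benchmark.log
```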

**Review comment on lines +985 to +1016:** Remove or utilize the unused `result_dir`. Line 996 defines `result_dir=/tmp/qwen3_output`, but the variable is never used.

Option 1 - Remove the unused variable:

```diff
 concurrency_list="1 2 4 8 16 32 64 128 256"
 multi_round=5
 isl=1024
 osl=1024
-result_dir=/tmp/qwen3_output
```

Option 2 - Use the variable for output:

```diff
 python -m tensorrt_llm.serve.scripts.benchmark_serving \
     --model Qwen/Qwen3-Next-80B-A3B-Thinking \
     --backend openai \
+    --save-result \
+    --result-dir ${result_dir} \
     --dataset-name "random" \
```

In addition, below is the command to run the Qwen3-Next model using the `quickstart_advanced.py` example script.

```bash
mpirun -n 1 --allow-run-as-root --oversubscribe python3 examples/llm-api/quickstart_advanced.py --model_dir /Qwen3-Next-80B-A3B-Thinking --kv_cache_fraction 0.6 --disable_kv_cache_reuse --max_batch_size 1 --tp_size 4
```

**Review comment on lines +1019 to 1023:** Use a placeholder for the model directory path. Line 1022 uses the hard-coded absolute path `/Qwen3-Next-80B-A3B-Thinking`. Apply this diff to maintain consistency:

```diff
-mpirun -n 1 --allow-run-as-root --oversubscribe python3 examples/llm-api/quickstart_advanced.py --model_dir /Qwen3-Next-80B-A3B-Thinking --kv_cache_fraction 0.6 --disable_kv_cache_reuse --max_batch_size 1 --tp_size 4
+mpirun -n 1 --allow-run-as-root --oversubscribe python3 examples/llm-api/quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --kv_cache_fraction 0.6 --disable_kv_cache_reuse --max_batch_size 1 --tp_size 4
```

## Notes and Troubleshooting

**Review comment:** Clarify the `max_batch_size` inconsistency. The `trtllm-serve` command sets `--max_batch_size 1` (line 958), but the YAML configuration file sets `max_batch_size: 720` in the `cuda_graph_config` (line 941). This could cause confusion about which value takes precedence or whether they serve different purposes. Consider adding a comment explaining this discrepancy, or aligning the values if they should be consistent.