fix(google/gemma-4-26b-a4b-it): add --max-num-batched-tokens to single-GPU command #572
…e-GPU command Co-authored-by: muhammadfawaz1 <135441198+muhammadfawaz1@users.noreply.github.com> Signed-off-by: mahadrehmann <mahadrehman04@gmail.com>
Code Review
This pull request updates the deployment guide for the Google Gemma-4-26B-A4B-it model by adding the --max-num-batched-tokens 4096 flag to the basic vllm serve command. The reviewer points out that other deployment commands in the same file, as well as in the corresponding markdown documentation, should also be updated with this flag to prevent users from encountering the same error across different setup configurations.
Pull request overview
This PR updates the Gemma 4 26B MoE single-GPU (A100/H100, BF16) launch command to include an explicit --max-num-batched-tokens value, preventing a vLLM runtime error when the multimodal encoder token budget exceeds the default batch token limit.
Changes:
- Added `--max-num-batched-tokens 4096` to the “26B MoE on 1x A100/H100 (BF16)” `vllm serve` command in the Gemma 4 26B A4B IT recipe.
…ingle-GPU commands Co-authored-by: muhammadfawaz1 <135441198+muhammadfawaz1@users.noreply.github.com> Signed-off-by: mahadrehmann <mahadrehman04@gmail.com>
Friendly ping: all checks are passing and there are no conflicts. Would appreciate a review when you get a chance. @Isotr0py @ywang96 @jeejeelee
Summary
Fixes #441
The single-GPU BF16 command for `gemma-4-26B-A4B-it` was missing `--max-num-batched-tokens`, causing this error on 1× A100/H100:

ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496)
is larger than max_num_batched_tokens (2048).
Please increase max_num_batched_tokens.

This happens because the model's multimodal token budget (2496) exceeds vLLM's default `max_num_batched_tokens` of 2048. Added `--max-num-batched-tokens 4096` to all single-GPU command blocks to clear this threshold.
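
The condition behind the error can be sketched in a few lines of Python. This is an illustrative re-creation based only on the wording of the error message, not vLLM's actual implementation; the function name `check_mm_budget` and the `chunked_mm_input` parameter are hypothetical.

```python
def check_mm_budget(
    max_tokens_per_mm_item: int,
    max_num_batched_tokens: int,
    chunked_mm_input: bool = False,
) -> None:
    """Raise if one multimodal item cannot fit into a single batch.

    Hypothetical helper: mirrors the comparison described in the error
    message above. Assumes (from the message wording) that the check only
    applies when chunked multimodal input is disabled.
    """
    if not chunked_mm_input and max_tokens_per_mm_item > max_num_batched_tokens:
        raise ValueError(
            "Chunked MM input disabled but max_tokens_per_mm_item "
            f"({max_tokens_per_mm_item}) is larger than "
            f"max_num_batched_tokens ({max_num_batched_tokens}). "
            "Please increase max_num_batched_tokens."
        )


# Values taken from the PR description.
check_mm_budget(2496, 4096)  # passes: 2496 <= 4096

try:
    check_mm_budget(2496, 2048)  # fails: 2496 > 2048
except ValueError as err:
    print(err)
```

Under these assumptions, any limit of at least 2496 would satisfy the check; 4096 is the value chosen in this PR.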
Changes
- `models/Google/gemma-4-26B-A4B-it.yaml`: added `--max-num-batched-tokens 4096` to:
  - `### 26B MoE on 1x A100/H100 (BF16)` command block
  - `### Full-Featured Server Launch` command block
  - `### Docker (NVIDIA)` command block

Co-authored-by: @muhammadfawaz1
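
Since the review noted that more than one command block needed the flag, a small scan can help confirm none were missed. The following is a minimal sketch, not part of this PR: the helper `commands_missing_flag` is hypothetical, the sample text is made up for illustration, and it only matches literal `vllm serve` invocations, so a Docker block that passes arguments directly to an image would need a different pattern.

```python
import re

FLAG = "--max-num-batched-tokens"


def commands_missing_flag(text: str, flag: str = FLAG) -> list[str]:
    """Return every `vllm serve` command in `text` that lacks `flag`.

    Hypothetical helper for illustration. Backslash line continuations
    are joined first so a multi-line command is checked as one unit.
    """
    joined = re.sub(r"\\\s*\n\s*", " ", text)
    missing = []
    for line in joined.splitlines():
        cmd = line.strip()
        if "vllm serve" in cmd and flag not in cmd:
            missing.append(cmd)
    return missing


# Made-up sample: the first command lacks the flag, the second has it.
sample = (
    "vllm serve google/gemma-4-26B-A4B-it \\\n"
    "  --max-model-len 8192\n"
    "vllm serve google/gemma-4-26B-A4B-it \\\n"
    "  --max-num-batched-tokens 4096\n"
)

for cmd in commands_missing_flag(sample):
    print("missing flag:", cmd)
```

To check the real recipe, the file contents could be read with `pathlib.Path(...).read_text()` and passed to the same function.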