
fix(google/gemma-4-26b-a4b-it): add --max-num-batched-tokens to single-GPU command#572

Open
mahadrehmann wants to merge 2 commits into vllm-project:main from mahadrehmann:fix/gemma4-26b-max-batched-tokens

fix(google/gemma-4-26b-a4b-it): add --max-num-batched-tokens to single-GPU command#572
mahadrehmann wants to merge 2 commits into
vllm-project:mainfrom
mahadrehmann:fix/gemma4-26b-max-batched-tokens

mahadrehmann commented Jun 22, 2026

Summary

Fixes #441

The single-GPU BF16 command for gemma-4-26B-A4B-it was missing
--max-num-batched-tokens, causing this error on 1× A100/H100:

ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496)
is larger than max_num_batched_tokens (2048).
Please increase max_num_batched_tokens.

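This happens because the model's per-item multimodal token budget (2496) exceeds
vLLM's default max_num_batched_tokens of 2048. With chunked multimodal input
disabled, each multimodal item must fit within a single batch, so
max_num_batched_tokens must be at least 2496. Added
--max-num-batched-tokens 4096 to all single-GPU command blocks to clear
this threshold.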
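The failing condition can be sketched in Python as a simple pre-flight check. This is a minimal illustration that assumes vLLM compares the two values exactly as the error message describes. `check_mm_budget` and its parameters are hypothetical names, not vLLM's internal API.

```python
def check_mm_budget(
    max_tokens_per_mm_item: int,
    max_num_batched_tokens: int,
    chunked_mm_input: bool = False,
) -> None:
    """Hypothetical sketch of the startup check implied by the error message.

    When chunked multimodal input is disabled, one multimodal item's tokens
    must fit into a single batch, so the per-item budget may not exceed the
    per-batch token limit.
    """
    if not chunked_mm_input and max_tokens_per_mm_item > max_num_batched_tokens:
        raise ValueError(
            f"Chunked MM input disabled but max_tokens_per_mm_item "
            f"({max_tokens_per_mm_item}) is larger than max_num_batched_tokens "
            f"({max_num_batched_tokens}). Please increase max_num_batched_tokens."
        )


# Values from the error above: 2048 is rejected, 4096 passes.
for batched in (2048, 4096):
    try:
        check_mm_budget(max_tokens_per_mm_item=2496, max_num_batched_tokens=batched)
        print(f"{batched}: ok")
    except ValueError as err:
        print(f"{batched}: {err}")
```

Any value of at least 2496 would satisfy this check. 4096 simply leaves headroom above the per-item budget.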

Changes

  • models/Google/gemma-4-26B-A4B-it.yaml: added --max-num-batched-tokens 4096 to:
    • ### 26B MoE on 1x A100/H100 (BF16) command block
    • ### Full-Featured Server Launch command block
    • ### Docker (NVIDIA) command block

Co-authored-by: @muhammadfawaz1

…e-GPU command

Co-authored-by: muhammadfawaz1 <135441198+muhammadfawaz1@users.noreply.github.com>
Signed-off-by: mahadrehmann <mahadrehman04@gmail.com>
Copilot AI review requested due to automatic review settings June 22, 2026 11:30
vercel Bot commented Jun 22, 2026

The latest updates on your projects.

Project: vllm-recipes | Deployment: Ready | Actions: Preview, Comment | Updated (UTC): Jun 22, 2026 11:38am


gemini-code-assist Bot left a comment


Code Review

This pull request updates the deployment guide for the Google Gemma-4-26B-A4B-it model by adding the --max-num-batched-tokens 4096 flag to the basic vllm serve command. The reviewer points out that other deployment commands in the same file, as well as in the corresponding markdown documentation, should also be updated with this flag to prevent users from encountering the same error across different setup configurations.

Comment thread models/Google/gemma-4-26B-A4B-it.yaml

Copilot AI left a comment


Pull request overview

This PR updates the Gemma 4 26B MoE single-GPU (A100/H100, BF16) launch command to include an explicit --max-num-batched-tokens value, preventing a vLLM runtime error when the multimodal encoder token budget exceeds the default batch token limit.

Changes:

  • Added --max-num-batched-tokens 4096 to the “26B MoE on 1x A100/H100 (BF16)” vllm serve command in the Gemma 4 26B A4B IT recipe.


…ingle-GPU commands

Co-authored-by: muhammadfawaz1 <135441198+muhammadfawaz1@users.noreply.github.com>
Signed-off-by: mahadrehmann <mahadrehman04@gmail.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread models/Google/gemma-4-26B-A4B-it.yaml
Comment thread models/Google/gemma-4-26B-A4B-it.yaml
mahadrehmann (Author) commented

Friendly ping: all checks are passing and there are no conflicts. I would appreciate a review when you get a chance. @Isotr0py @ywang96 @jeejeelee


Labels: None yet

Projects: None yet

Development

Successfully merging this pull request may close these issues.

Gemma4: max_num_batched_tokens error for 26B MoE on 1× A100/H100 (BF16) command

2 participants