[ROCm] update MiniMax-M3 bf16 recipe on docker image and fp8_per_channel quantization notes #598
…nel quantization notes Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Code Review
This pull request updates the MiniMax-M3 model configuration to use the nightly AMD ROCm Docker image and adds a new documentation section for running the model with TP8 and per-channel FP8 quantization. The review feedback suggests improving the markdown formatting by splitting a multi-line inline code block into separate blocks and refining the sentence structure, capitalization, and trailing spaces in the quantization description.
Add the vision-encoder flags (`--mm-encoder-tp-mode data
--mm-encoder-attn-backend ROCM_AITER_FA`) for multimodal serving.
The inline code block (using backticks) is split across two lines. This can cause markdown parsers to render the newline and leading spaces literally inside the code block, making it difficult to read and copy. It is better to wrap each flag in its own inline code block.
Add the vision-encoder flags (`--mm-encoder-tp-mode data` and
`--mm-encoder-attn-backend ROCM_AITER_FA`) for multimodal serving.

Online **per-channel FP8 (PTPC)** quantization of the BF16 checkpoint halves the
weight footprint (≈100 → **≈50 GiB/GPU**, ~1.75× more KV cache) and lifts
batched throughput with **gsm8k unchanged from BF16** (lossless);
These env vars and the flag are for the **BF16 checkpoint only**.
The sentence structure here is slightly awkward due to the semicolon at the end of line 284 followed by a capitalized sentence on line 285. Additionally, there is a trailing space on line 284, and "gsm8k" should be capitalized as "GSM8K" to match the style used elsewhere in the repository.
Online **per-channel FP8 (PTPC)** quantization of the BF16 checkpoint halves the
weight footprint (≈100 → **≈50 GiB/GPU**, ~1.75× more KV cache) and lifts
batched throughput with **GSM8K unchanged from BF16** (lossless).
These env vars and the flag are for the **BF16 checkpoint only**.
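
As a rough sanity check on the figures in this paragraph, the sketch below works out what KV-cache budget the "~1.75× more KV cache" claim would imply. It assumes, which the PR does not state, that every GiB freed by halving the weights goes to the KV cache and that nothing else in the per-GPU memory budget changes. The function name is hypothetical.

```python
# Back-of-the-envelope check of the "~1.75x more KV cache" figure.
# Assumption (not stated in the PR): all memory freed by halving the
# weights becomes KV-cache space, and the rest of the budget is unchanged.

def implied_kv_budget(weights_before_gib, weights_after_gib, kv_ratio):
    """Return (kv_before, kv_after) in GiB implied by the stated ratio."""
    freed = weights_before_gib - weights_after_gib
    # kv_after = kv_before + freed  and  kv_after = kv_ratio * kv_before,
    # so kv_before = freed / (kv_ratio - 1).
    kv_before = freed / (kv_ratio - 1.0)
    return kv_before, kv_before + freed

kv_bf16, kv_fp8 = implied_kv_budget(100.0, 50.0, 1.75)
print(f"implied KV cache with BF16 weights: {kv_bf16:.1f} GiB/GPU")
print(f"implied KV cache with FP8 weights:  {kv_fp8:.1f} GiB/GPU")
```

Under that assumption the stated ratio corresponds to roughly 67 GiB/GPU of KV cache before quantization and roughly 117 GiB/GPU after. These are derived values, not measurements from the PR.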
For the default MiniMax-M3 bf16 model:
(1) use the nightly docker image
(2) add a section on fp8_per_channel quantization support and the extra env vars for performance.
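
For readers unfamiliar with the term, the following is a minimal pure-Python sketch of what per-channel weight scaling means in general. It is not the code path the recipe uses: the function names are hypothetical, no real FP8 rounding is performed, and the range constant depends on the FP8 variant (448 for OCP E4M3FN, 240 for E4M3FNUZ), so it is left as a parameter.

```python
# Illustrative only: per-channel scaling toward an FP8 E4M3 range.
# No actual FP8 cast happens here; the sketch only shows that each output
# channel (row) gets its own scale instead of one scale for the whole tensor.

FP8_E4M3FN_MAX = 448.0  # OCP E4M3FN; the E4M3FNUZ variant tops out at 240.0

def per_channel_scales(weight, fp8_max=FP8_E4M3FN_MAX):
    """Return one scale per row so that max(|row|) maps to fp8_max."""
    scales = []
    for row in weight:
        amax = max(abs(v) for v in row)
        # Guard against all-zero rows to avoid dividing by zero later.
        scales.append(amax / fp8_max if amax > 0 else 1.0)
    return scales

def scale_rows(weight, scales):
    """Divide each row by its scale; a real kernel would then cast to FP8."""
    return [[v / s for v in row] for row, s in zip(weight, scales)]

# Toy 2x3 weight matrix whose rows have very different magnitudes.
weight = [[0.02, -0.5, 0.1], [12.0, -3.0, 6.0]]
scales = per_channel_scales(weight)
scaled = scale_rows(weight, scales)
print(scales)
```

Because each row is scaled independently, the small-magnitude first row uses the full target range just as the large-magnitude second row does, which is the usual motivation for per-channel over per-tensor scaling.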