[ROCm] Add MI355X-only MiniMax-M3 MXFP4 variant #580
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
2d841ac to 190b98e
Code Review
This pull request introduces support for variant-level hardware allowlists (e.g., supported_hardware) and hardware-specific overrides (such as docker_image, extra_args, and extra_env). It also adds a new mxfp4 variant for the MiniMax-M3 model targeting AMD Instinct MI355X hardware. The review feedback highlights a potential runtime crash when handling invalid hardware IDs in the URL query parameters, and suggests improving consistency in the UI by using the variant's label instead of just its precision when displaying disabled hardware reasons.
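As a rough illustration of how per-hardware overrides of this kind could be resolved, here is a minimal sketch. It is not the PR's implementation: the `hardware_overrides` field name, the `resolveVariantForHardware` helper, and the merge rules (image replaced, args appended, env merged) are all assumptions made for the example. Only the keys `docker_image`, `extra_args`, and `extra_env` come from the summary above.

```javascript
// Hypothetical schema: per-hardware overrides live under variant.hardware_overrides[hwId].
// The field name and the merge rules below are assumptions, not taken from the PR.
function resolveVariantForHardware(variant, hwId) {
  const override = variant.hardware_overrides?.[hwId] ?? {};
  return {
    ...variant,
    // An override image replaces the variant-level image when present.
    docker_image: override.docker_image ?? variant.docker_image,
    // Override args are appended after the variant-level args.
    extra_args: [...(variant.extra_args ?? []), ...(override.extra_args ?? [])],
    // Override env entries win over variant-level entries with the same key.
    extra_env: { ...(variant.extra_env ?? {}), ...(override.extra_env ?? {}) },
  };
}
```

With these rules, a hardware ID that has no override entry simply gets the variant-level values back unchanged.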
const requestedHwAllowed = requestedHwId
  && isPrecisionCompatible(requestedHwProfile, requestedVariant)
  && isHardwareSupported(recipe, requestedHwId)
  && isVariantHardwareSupported(requestedVariant, requestedHwId);
If requestedHwId is not a valid hardware profile ID (for example, one passed via URL query parameters), taxonomy.hardware_profiles?.[requestedHwId] is undefined. Because requestedHwId is still truthy, requestedHwAllowed goes on to evaluate isPrecisionCompatible(requestedHwProfile, requestedVariant). If the variant has a precision constraint, matchesConstraint then tries to read profile.brand on undefined. If the profile instead falls back to {}, nothing crashes, but the check returns true when the constraint is undefined, and hwId ends up set to an invalid ID. Either way the result can be a runtime crash or broken UI state. We should explicitly verify that requestedHwId exists in taxonomy.hardware_profiles before allowing it.
- const requestedHwAllowed = requestedHwId
-   && isPrecisionCompatible(requestedHwProfile, requestedVariant)
-   && isHardwareSupported(recipe, requestedHwId)
-   && isVariantHardwareSupported(requestedVariant, requestedHwId);
+ const requestedHwAllowed = requestedHwId
+   && !!taxonomy.hardware_profiles?.[requestedHwId]
+   && isPrecisionCompatible(requestedHwProfile, requestedVariant)
+   && isHardwareSupported(recipe, requestedHwId)
+   && isVariantHardwareSupported(requestedVariant, requestedHwId);
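The failure mode and the guard can be shown in isolation. The sketch below is a simplified stand-in rather than the component's code: the sample taxonomy, the `isKnownHardware` and `isRequestAllowed` helpers, and this reduced `matchesConstraint` are assumptions made for illustration. It uses an own-property check instead of a truthiness check so that IDs such as `toString` cannot match inherited properties.

```javascript
// Hypothetical taxonomy; only the shape (hardware_profiles keyed by ID) follows the review comment.
const taxonomy = {
  hardware_profiles: {
    mi355x: { brand: "AMD", display_name: "MI355X" },
  },
};

// Simplified stand-in for the real constraint check: it reads profile.brand,
// so calling it with an undefined profile and a constraint throws a TypeError.
function matchesConstraint(profile, constraint) {
  if (!constraint) return true;
  return profile.brand === constraint.brand;
}

// Hypothetical helper: true only when hwId is an own key of hardware_profiles.
function isKnownHardware(tax, hwId) {
  const profiles = tax.hardware_profiles ?? {};
  return Boolean(hwId) && Object.prototype.hasOwnProperty.call(profiles, hwId);
}

// The existence check short-circuits before the profile is ever dereferenced.
function isRequestAllowed(tax, hwId, constraint) {
  return isKnownHardware(tax, hwId)
    && matchesConstraint(tax.hardware_profiles[hwId], constraint);
}
```

An unknown ID from the query string is rejected up front, so the constraint check only ever sees a real profile.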
const reason = !variantHardwareOk
  ? `${currentVariant.precision?.toUpperCase()} is only supported on ${(currentVariant.supported_hardware || []).map((hw) => taxonomy.hardware_profiles?.[hw]?.display_name || hw).join(", ")}`
  : !precisionOk
  ? `${currentVariant.precision?.toUpperCase()} requires NVIDIA Blackwell`
For consistency with other parts of the component (such as line 1516) and to support custom variant labels correctly, we should use (currentVariant.label || currentVariant.precision) instead of just currentVariant.precision when rendering the error message.
- const reason = !variantHardwareOk
-   ? `${currentVariant.precision?.toUpperCase()} is only supported on ${(currentVariant.supported_hardware || []).map((hw) => taxonomy.hardware_profiles?.[hw]?.display_name || hw).join(", ")}`
-   : !precisionOk
-   ? `${currentVariant.precision?.toUpperCase()} requires NVIDIA Blackwell`
+ const reason = !variantHardwareOk
+   ? `${(currentVariant.label || currentVariant.precision)?.toUpperCase()} is only supported on ${(currentVariant.supported_hardware || []).map((hw) => taxonomy.hardware_profiles?.[hw]?.display_name || hw).join(", ")}`
+   : !precisionOk
+   ? `${(currentVariant.label || currentVariant.precision)?.toUpperCase()} requires NVIDIA Blackwell`
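To make the label fallback easier to reason about, the expression can be pulled into a small function. This is a sketch only: `variantDisplayName` and `hardwareOnlyReason` are hypothetical helper names, and the profile map is passed in directly rather than read from the component's `taxonomy`.

```javascript
// Hypothetical helper mirroring the suggested expression:
// prefer the variant's label, fall back to its precision, then uppercase.
function variantDisplayName(variant) {
  return (variant.label || variant.precision)?.toUpperCase();
}

// Builds the "only supported on" message; unknown hardware IDs fall back to the raw ID.
function hardwareOnlyReason(variant, hardwareProfiles) {
  const names = (variant.supported_hardware || [])
    .map((hw) => hardwareProfiles?.[hw]?.display_name || hw)
    .join(", ");
  return `${variantDisplayName(variant)} is only supported on ${names}`;
}

const profiles = { mi355x: { display_name: "MI355X" } };

console.log(hardwareOnlyReason({ precision: "mxfp4", supported_hardware: ["mi355x"] }, profiles));
// prints "MXFP4 is only supported on MI355X"
```

Note that the suggested expression uppercases a custom label as well, so a label with mixed case would be shown in all capitals.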
Suggest updating the PR subject: replace [codex] with [AMD] or [ROCm].
Summary
- amd/MiniMax-M3-MXFP4 as an MXFP4 variant of the existing MiniMaxAI/MiniMax-M3 recipe
- nightly image and the validated TP8/encoder settings on MI355X
Why
The AMD Quark MXFP4 checkpoint is currently supported only on MI355X. Treating MXFP4 as a generally selectable precision produced invalid commands for NVIDIA and older AMD hardware.
Local verification
Based on vllm-project/vllm#45794
GSM8K accuracy and performance verified: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/28195297568/job/83520506068?pr=1935
SemiAnalysisAI/InferenceX#1935

User impact
Selecting MXFP4 now selects MI355X automatically. On every other hardware profile, the MXFP4 pill is disabled with an MI355X-only explanation. Generated API data for the promoted MXFP4 checkpoint exposes only MI355X.
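The selection behavior described above can be sketched as follows. The function name `isVariantHardwareSupported` appears in the reviewed code, but its body here is a guess. The rule that a variant without `supported_hardware` is unrestricted and the `resolveHardwareForVariant` helper are assumptions for illustration.

```javascript
// Assumed semantics: a missing or empty supported_hardware list means "no restriction".
function isVariantHardwareSupported(variant, hwId) {
  const allowlist = variant?.supported_hardware;
  if (!Array.isArray(allowlist) || allowlist.length === 0) return true;
  return allowlist.includes(hwId);
}

// Hypothetical helper: keep the current hardware when it is allowed,
// otherwise switch to the first allowlisted hardware ID.
function resolveHardwareForVariant(variant, currentHwId) {
  if (isVariantHardwareSupported(variant, currentHwId)) return currentHwId;
  return variant.supported_hardware[0];
}

const mxfp4 = { precision: "mxfp4", supported_hardware: ["mi355x"] };

console.log(resolveHardwareForVariant(mxfp4, "h100"));
// prints "mi355x"
```

The same predicate can drive the disabled state of the pill on other hardware profiles, which keeps the auto-selection and the explanation text consistent with each other.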
Validation
- node scripts/build-recipes-api.mjs: 142 models, 116 promoted variants
- node --check src/lib/command-synthesis.js
- node --check scripts/build-recipes-api.mjs
- mi355x