
Conversation

@mtake mtake (Contributor) commented Nov 18, 2025

Granite 4 models are MoE models: their router parameters should be frozen during training, and their auxiliary (load-balancing) losses should be accumulated into the total loss. This PR applies the existing GPT-OSS code for freezing router parameters and accumulating auxiliary losses to Granite 4 models as well.
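
As a rough, model-agnostic sketch of these two ideas (not the code in this PR; the "router" name match and the `aux_loss` attribute are assumptions chosen for illustration):

```python
import torch

def freeze_router_parameters(model: torch.nn.Module) -> bool:
    """Disable gradients for any parameter whose name suggests it is an MoE router."""
    frozen_any = False
    for name, param in model.named_parameters():
        if "router" in name:  # illustrative name pattern, not the PR's exact match
            param.requires_grad = False
            frozen_any = True
    return frozen_any  # True only if at least one router parameter was frozen

def loss_with_aux(main_loss: torch.Tensor, output) -> torch.Tensor:
    """Fold the MoE auxiliary (load-balancing) loss into the main loss when present."""
    aux = getattr(output, "aux_loss", None)
    return main_loss + aux.float() if aux is not None else main_loss
```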

Summary by CodeRabbit

  • New Features

    • Added support for granitemoehybrid MoE models in training.
  • Refactor

    • Auxiliary losses are now included whenever present, simplifying loss handling.
    • Router-freeze logic broadened to cover more MoE architectures and now reports frozen only if router params were actually changed.
  • Documentation

    • Updated router-freeze wording to reference MoE models.

@coderabbitai coderabbitai bot commented Nov 18, 2025

Walkthrough

Extended MoE handling: added detection for granitemoehybrid models, broadened router-freeze and auxiliary-loss paths to include MoE models, removed a GPT-OSS-only gate so auxiliary loss is applied whenever present, and introduced a new is_known_model utility (duplicated in file).

Changes

  • Auxiliary loss computation (src/instructlab/training/batch_loss_manager.py): Removed the is_gpt_oss gate in _compute_average_loss; accumulated aux loss is now added to the total batch loss whenever present (after reduction).
  • Model type detection utility (src/instructlab/training/gpt_oss_utils_correct.py): Added a public is_known_model(model_path_or_config, known_model_type) function; the same function currently appears twice (duplicate declarations) in the file.
  • Model initialization and loss handling (src/instructlab/training/model.py): Imported is_known_model; added a boolean is_granitemoehybrid derived from the model config; expanded MoE aux-loss gating to include is_granitemoehybrid; removed the prior GPT-OSS-only aux-loss gate so aux loss is considered whenever it is non-None.
  • Router parameter freezing (src/instructlab/training/utils.py): Updated the freeze_router_params() docstring and logs to reference MoE models; changed the return semantics to True only if at least one router parameter was frozen.
  • Training main flow (src/instructlab/training/main_ds.py): Expanded the router-freeze condition from is_gpt_oss to is_gpt_oss or is_granitemoehybrid; now uses the freeze_router_params return value to conditionally log and set the FSDP original-params flag.

Sequence Diagram(s)

sequenceDiagram
    participant Main as main_ds.py
    participant Model as model.py
    participant Utils as utils.py
    participant Loss as batch_loss_manager.py

    Main->>Model: Initialize(model_path)
    Model->>Model: detect is_gpt_oss / is_granitemoehybrid (is_known_model)

    Main->>Utils: freeze_router_params(model)
    alt Router params frozen
        Utils-->>Main: return True
        Main->>Main: set fsdp_use_orig_params / log frozen router
    else No router params frozen
        Utils-->>Main: return False
    end

    Main->>Model: forward batch
    Model-->>Loss: pass outputs (may include aux_loss)

    Loss->>Loss: _compute_average_loss()
    alt accumulated_aux_loss not None
        Loss->>Loss: add accumulated_aux_loss to total_batch_loss (no model flag gate)
    end
    Loss-->>Main: return reduced loss

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas to inspect:
    • Consolidate or remove the duplicate is_known_model declaration in gpt_oss_utils_correct.py.
    • Verify auxiliary-loss semantics after removing the is_gpt_oss gate (ensure no unintended double-counting or missing reductions).
    • Confirm callers of freeze_router_params() handle the new boolean semantics correctly (not relying on old unconditional True).
    • Check interactions between model.py gating and batch_loss_manager.py unconditional aux-loss inclusion.

Poem

🐰 I sniffed the code and hopped around the tree,

Aux loss freed from gates, now flowing free.
Granite MoE and GPT both take a bow,
Routers frozen only when they truly bow.
Hooray — a rabbit's patch, light-footed and merry!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 77.78%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'Handle granite 4 as MoE models in training' directly and clearly describes the main change: extending MoE model handling to granite 4 models, which is the core objective across all modified files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c97c9ba and 34b7cd6.

📒 Files selected for processing (1)
  • src/instructlab/training/gpt_oss_utils_correct.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: pylint
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/gpt_oss_utils_correct.py (1)

401-402: LGTM: Clean delegation to shared utility.

The refactoring of is_gpt_oss to delegate to is_known_model is appropriate and maintains backward compatibility while enabling code reuse.
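
In sketch form, the delegation described here might look as follows; this assumes the generalized is_known_model helper from the same file (sketched later in this thread) and "gpt_oss" as the Hugging Face model_type string:

```python
def is_gpt_oss(model_path_or_config) -> bool:
    # Delegates model-type detection to the shared helper.
    return is_known_model(model_path_or_config, "gpt_oss")
```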


@mergify mergify bot added the ci-failure label Nov 18, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/instructlab/training/main_ds.py (1)

349-349: Consider clarifying or removing the "NOTE is this guard needed?" comment.

The guard if m.is_gpt_oss or m.is_granitemoehybrid: appears necessary to ensure that router parameter freezing and fsdp_should_use_orig_params configuration only apply to MoE models. Without this guard, non-MoE models would unnecessarily go through the router freezing logic, and freeze_router_params would always return False for them (since they have no router parameters).

If the guard is intentional and necessary, consider removing the "NOTE is this guard needed?" comment to avoid confusion. If there's genuine uncertainty about whether this guard is required, please clarify the intended behavior.
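
For illustration, a hedged sketch of the caller-side pattern being discussed, wrapped in a standalone function; the attribute names and freeze_router_params follow what this review quotes, while the wrapper function itself is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

def configure_router_freeze(m, freeze_router_params) -> bool:
    """Return the value to use for FSDP's use_orig_params flag."""
    fsdp_should_use_orig_params = False
    # Only MoE models (GPT-OSS, Granite 4 hybrid MoE) enter the freeze path.
    if m.is_gpt_oss or m.is_granitemoehybrid:
        # freeze_router_params now returns True only if something was actually frozen.
        if freeze_router_params(m):
            logger.info("Router parameters frozen for MoE model")
            fsdp_should_use_orig_params = True
    return fsdp_should_use_orig_params
```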

src/instructlab/training/model.py (1)

421-428: Consider clarifying the guard necessity.

The guard if (self.is_gpt_oss or self.is_granitemoehybrid) before checking for output.aux_loss appears to be an optimization to avoid unnecessary hasattr checks on non-MoE models. However, the guard may be redundant since the subsequent checks (hasattr(output, "aux_loss") and output.aux_loss is not None) would safely handle non-MoE models anyway.

The "NOTE is this guard needed?" comment suggests uncertainty. Consider either:

  1. Removing the guard if the hasattr and is not None checks are sufficient
  2. Removing the comment if the guard provides meaningful performance benefits or code clarity
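
For concreteness, a minimal sketch of the gating in question (illustrative only, not the repository code); as noted, the hasattr and is-not-None checks would already handle non-MoE outputs safely on their own:

```python
import torch

def apply_aux_loss(loss: torch.Tensor, output, is_moe_model: bool) -> torch.Tensor:
    """Add the auxiliary loss to the main loss if the model output carries one."""
    # The outer flag mirrors `self.is_gpt_oss or self.is_granitemoehybrid`.
    if is_moe_model and hasattr(output, "aux_loss") and output.aux_loss is not None:
        loss = loss + output.aux_loss.float()
    return loss
```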
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 637afae and 161ae80.

📒 Files selected for processing (5)
  • src/instructlab/training/batch_loss_manager.py (1 hunks)
  • src/instructlab/training/gpt_oss_utils_correct.py (1 hunks)
  • src/instructlab/training/main_ds.py (1 hunks)
  • src/instructlab/training/model.py (4 hunks)
  • src/instructlab/training/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/instructlab/training/model.py (1)
src/instructlab/training/gpt_oss_utils_correct.py (2)
  • is_gpt_oss (397-411)
  • is_known_model (414-429)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/gpt_oss_utils_correct.py (1)
  • is_gpt_oss (397-411)
src/instructlab/training/utils.py (1)
  • freeze_router_params (903-926)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (6)
src/instructlab/training/gpt_oss_utils_correct.py (1)

414-430: LGTM! Well-structured utility function.

The is_known_model function provides a clean generalization of the is_gpt_oss pattern, enabling support for multiple model types including granitemoehybrid. The implementation properly handles both string and list inputs for known_model_type, and follows the same validation pattern as is_gpt_oss.
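
Based only on the behaviour described here (a model path or config plus a single type name or a list of names), such a utility might look roughly like the sketch below; the actual implementation in gpt_oss_utils_correct.py may validate its inputs differently:

```python
from transformers import AutoConfig, PretrainedConfig

def is_known_model(model_path_or_config, known_model_type) -> bool:
    """Return True if the model's Hugging Face model_type matches one of the given types."""
    if isinstance(known_model_type, str):
        known_model_type = [known_model_type]
    if isinstance(model_path_or_config, PretrainedConfig):
        config = model_path_or_config
    else:
        config = AutoConfig.from_pretrained(model_path_or_config)
    return getattr(config, "model_type", None) in known_model_type
```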

src/instructlab/training/utils.py (1)

903-927: LGTM! Improved return semantics for MoE models.

The updated freeze_router_params function now correctly returns True only when router parameters were actually frozen, rather than always returning True. The docstring and log messages have been appropriately generalized from "GPT-OSS" to "MoE models," aligning with the PR's objective to support granitemoehybrid MoE models.

The caller in main_ds.py (lines 350-355) properly handles the new return value.

src/instructlab/training/batch_loss_manager.py (1)

177-178: LGTM! Generalized auxiliary loss handling.

Removing the is_gpt_oss gate and applying auxiliary loss whenever accumulated_aux_loss is not None correctly generalizes the logic to support both GPT-OSS and granitemoehybrid MoE models. This change aligns with the broader PR objective to extend MoE support beyond GPT-OSS.
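
A simplified sketch of the described behaviour in _compute_average_loss; the denominator argument is a placeholder for whatever reduction the manager actually applies:

```python
import torch

def compute_average_loss(total_batch_loss: torch.Tensor,
                         accumulated_aux_loss,
                         denominator: int) -> torch.Tensor:
    """Add the accumulated aux loss whenever present, then reduce to an average."""
    # Previously this addition was gated on is_gpt_oss; now presence alone decides.
    if accumulated_aux_loss is not None:
        total_batch_loss = total_batch_loss + accumulated_aux_loss
    return total_batch_loss / max(denominator, 1)
```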

src/instructlab/training/main_ds.py (1)

349-355: LGTM! Correctly extends MoE support to granitemoehybrid.

The gate expansion to include is_granitemoehybrid properly extends router parameter freezing to support granite 4 MoE models. The code correctly captures and uses the return value from freeze_router_params to conditionally set fsdp_should_use_orig_params, which aligns with the updated return semantics in utils.py.

src/instructlab/training/model.py (2)

46-46: LGTM! Proper initialization of granitemoehybrid detection.

The import of is_known_model and initialization of self.is_granitemoehybrid using is_known_model(model_path, "granitemoehybrid") correctly follows the established pattern used for is_gpt_oss. This enables proper detection of granite 4 MoE models throughout the training pipeline.

Also applies to: 68-68
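
In sketch form, the detection pattern described here; the class name and constructor are placeholders, while the attribute names and the is_known_model call follow this review (import path inferred from the file layout shown above):

```python
from instructlab.training.gpt_oss_utils_correct import is_gpt_oss, is_known_model

class ModelWrapper:  # placeholder name, not the repository's actual class
    def __init__(self, model_path: str):
        self.is_gpt_oss = is_gpt_oss(model_path)
        self.is_granitemoehybrid = is_known_model(model_path, "granitemoehybrid")
```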


433-434: LGTM! Simplified auxiliary loss application.

Removing the is_gpt_oss check and applying auxiliary loss whenever aux_loss is not None correctly generalizes the logic for all MoE models. This change is consistent with the similar update in batch_loss_manager.py (line 177) and aligns with the PR's objective to support granitemoehybrid MoE models.

@mtake mtake (Contributor, Author) commented Nov 18, 2025

The first failure is

error: the configured Python interpreter version (3.14) is newer than PyO3's maximum supported version (3.13)

The second through fifth failures are due to missing EC2 credentials.
None of them are related to this PR. @Maxusmusti @RobotSail

@RobotSail RobotSail (Member) commented:

Hi @mtake, that's fine. I have a fix for this issue in PR #670.

@RobotSail RobotSail (Member) commented:

@mergify rebase

@mergify mergify bot (Contributor) commented Nov 18, 2025

rebase

✅ Branch has been successfully rebased

@RobotSail RobotSail force-pushed the granitemoehybrid-support branch from e8f8922 to c97c9ba on November 18, 2025 at 20:05

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 161ae80 and c97c9ba.

📒 Files selected for processing (5)
  • src/instructlab/training/batch_loss_manager.py (1 hunks)
  • src/instructlab/training/gpt_oss_utils_correct.py (1 hunks)
  • src/instructlab/training/main_ds.py (1 hunks)
  • src/instructlab/training/model.py (4 hunks)
  • src/instructlab/training/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/instructlab/training/model.py (1)
src/instructlab/training/gpt_oss_utils_correct.py (2)
  • is_gpt_oss (397-411)
  • is_known_model (414-433)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/gpt_oss_utils_correct.py (1)
  • is_gpt_oss (397-411)
src/instructlab/training/utils.py (1)
  • freeze_router_params (903-926)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (7)
src/instructlab/training/utils.py (1)

903-927: LGTM! Improved return semantics and broader MoE support.

The updated docstring and return semantics accurately reflect the broader MoE context. Returning True only when router parameters are actually frozen (rather than unconditionally) is more precise and aligns well with the updated usage in src/instructlab/training/main_ds.py (line 351).

src/instructlab/training/model.py (4)

46-46: LGTM! Proper import of the new utility.

The import of is_known_model is appropriate for detecting granitemoehybrid models.


68-68: LGTM! New attribute for granitemoehybrid detection.

The new is_granitemoehybrid attribute follows the same pattern as is_gpt_oss and enables proper MoE model detection.


421-428: LGTM! Expanded MoE auxiliary loss gating.

The condition correctly includes both GPT-OSS and granitemoehybrid models when checking for auxiliary loss presence.


433-434: LGTM! Unconditional auxiliary loss application.

The removal of the GPT-OSS-specific gate aligns with the broader MoE support. The auxiliary loss is now applied whenever present, which is appropriate for multiple MoE model types.

src/instructlab/training/batch_loss_manager.py (1)

177-178: Auxiliary loss compatibility verified—change is correct.

Both GPT-OSS and granitemoehybrid models extract aux_loss through the identical path in model.py (lines 421–427), converting it via output.aux_loss.float(). The unconditional check at lines 177–178 is valid because both model types produce aux_loss in the same format, and accumulated_aux_loss remains 0.0 when no auxiliary loss exists. No compatibility issues found.

Note: The comment at model.py:420 referencing only GPT-OSS is outdated and should be updated to reflect that both MoE model types now produce auxiliary loss.

src/instructlab/training/main_ds.py (1)

349-355: Code changes verified and approved.

The model_type string "granitemoehybrid" is confirmed as correct for IBM Granite 4 MoE models in HuggingFace transformers. The logic correctly extends MoE router parameter freezing to granitemoehybrid models and appropriately uses the return value from freeze_router_params.

@RobotSail RobotSail (Member) left a comment


LGTM

@mergify mergify bot added the one-approval label Nov 19, 2025
