
Conversation

@mtake mtake (Contributor) commented Nov 18, 2025

Granite 4 models are MoE models: their router parameters should be frozen during training, and their auxiliary (load-balancing) losses should be accumulated into the total loss. This PR applies the existing GPT-OSS code for freezing router parameters and accumulating auxiliary losses to Granite 4 models as well.
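
As a rough, model-agnostic sketch of these two ideas (not the code in this PR; the "router" name match and the `aux_loss` attribute are assumptions chosen for illustration):

```python
import torch

def freeze_router_parameters(model: torch.nn.Module) -> bool:
    """Disable gradients for any parameter whose name suggests it is an MoE router."""
    frozen_any = False
    for name, param in model.named_parameters():
        if "router" in name:  # illustrative name pattern, not the PR's exact match
            param.requires_grad = False
            frozen_any = True
    return frozen_any  # True only if at least one router parameter was frozen

def loss_with_aux(main_loss: torch.Tensor, output) -> torch.Tensor:
    """Fold the MoE auxiliary (load-balancing) loss into the main loss when present."""
    aux = getattr(output, "aux_loss", None)
    return main_loss + aux.float() if aux is not None else main_loss
```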

Summary by CodeRabbit

  • New Features

    • Added support for granitemoehybrid MoE models in training.
  • Refactor

    • Auxiliary losses are now included whenever present, simplifying loss handling.
    • Router-freeze logic broadened to cover more MoE architectures and now reports frozen only if router params were actually changed.
  • Documentation

    • Updated router-freeze wording to reference MoE models.

@coderabbitai coderabbitai bot commented Nov 18, 2025

Walkthrough

Extended MoE handling: added detection for granitemoehybrid models, broadened router-freeze and auxiliary-loss paths to include MoE models, removed a GPT-OSS-only gate so auxiliary loss is applied whenever present, and introduced a new is_known_model utility (duplicated in file).

Changes

  • Auxiliary loss computation (src/instructlab/training/batch_loss_manager.py): Removed the is_gpt_oss gate in _compute_average_loss; accumulated aux loss is now added to the total batch loss whenever present (after reduction).
  • Model type detection utility (src/instructlab/training/gpt_oss_utils_correct.py): Added a public is_known_model(model_path_or_config, known_model_type) function; the same function currently appears twice (duplicate declarations) in the file.
  • Model initialization and loss handling (src/instructlab/training/model.py): Imported is_known_model; added a boolean is_granitemoehybrid derived from the model config; expanded MoE aux-loss gating to include is_granitemoehybrid; removed the prior GPT-OSS-only aux-loss gate so aux loss is considered whenever it is non-None.
  • Router parameter freezing (src/instructlab/training/utils.py): Updated the freeze_router_params() docstring and logs to reference MoE models; changed the return semantics to True only if at least one router parameter was frozen.
  • Training main flow (src/instructlab/training/main_ds.py): Expanded the router-freeze condition from is_gpt_oss to is_gpt_oss or is_granitemoehybrid; now uses the freeze_router_params return value to conditionally log and set the FSDP original-params flag.

Sequence Diagram(s)

sequenceDiagram
    participant Main as main_ds.py
    participant Model as model.py
    participant Utils as utils.py
    participant Loss as batch_loss_manager.py

    Main->>Model: Initialize(model_path)
    Model->>Model: detect is_gpt_oss / is_granitemoehybrid (is_known_model)

    Main->>Utils: freeze_router_params(model)
    alt Router params frozen
        Utils-->>Main: return True
        Main->>Main: set fsdp_use_orig_params / log frozen router
    else No router params frozen
        Utils-->>Main: return False
    end

    Main->>Model: forward batch
    Model-->>Loss: pass outputs (may include aux_loss)

    Loss->>Loss: _compute_average_loss()
    alt accumulated_aux_loss not None
        Loss->>Loss: add accumulated_aux_loss to total_batch_loss (no model flag gate)
    end
    Loss-->>Main: return reduced loss

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas to inspect:
    • Consolidate or remove the duplicate is_known_model declaration in gpt_oss_utils_correct.py.
    • Verify auxiliary-loss semantics after removing the is_gpt_oss gate (ensure no unintended double-counting or missing reductions).
    • Confirm callers of freeze_router_params() handle the new boolean semantics correctly (not relying on old unconditional True).
    • Check interactions between model.py gating and batch_loss_manager.py unconditional aux-loss inclusion.

Poem

🐰 I sniffed the code and hopped around the tree,

Aux loss freed from gates, now flowing free.
Granite MoE and GPT both take a bow,
Routers frozen only when they truly bow.
Hooray — a rabbit's patch, light-footed and merry!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 77.78%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'Handle granite 4 as MoE models in training' directly and clearly describes the main change: extending MoE model handling to granite 4 models, which is the core objective across all modified files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c97c9ba and 34b7cd6.

📒 Files selected for processing (1)
  • src/instructlab/training/gpt_oss_utils_correct.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: pylint
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/gpt_oss_utils_correct.py (1)

401-402: LGTM: Clean delegation to shared utility.

The refactoring of is_gpt_oss to delegate to is_known_model is appropriate and maintains backward compatibility while enabling code reuse.
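
In sketch form, the delegation described here might look as follows; this assumes the generalized is_known_model helper from the same file (sketched later in this thread) and "gpt_oss" as the Hugging Face model_type string:

```python
def is_gpt_oss(model_path_or_config) -> bool:
    # Delegates model-type detection to the shared helper.
    return is_known_model(model_path_or_config, "gpt_oss")
```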


@mergify mergify bot added the ci-failure label Nov 18, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/instructlab/training/main_ds.py (1)

349-349: Consider clarifying or removing the "NOTE is this guard needed?" comment.

The guard if m.is_gpt_oss or m.is_granitemoehybrid: appears necessary to ensure that router parameter freezing and fsdp_should_use_orig_params configuration only apply to MoE models. Without this guard, non-MoE models would unnecessarily go through the router freezing logic, and freeze_router_params would always return False for them (since they have no router parameters).

If the guard is intentional and necessary, consider removing the "NOTE is this guard needed?" comment to avoid confusion. If there's genuine uncertainty about whether this guard is required, please clarify the intended behavior.
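
For illustration, a hedged sketch of the caller-side pattern being discussed, wrapped in a standalone function; the attribute names and freeze_router_params follow what this review quotes, while the wrapper function itself is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

def configure_router_freeze(m, freeze_router_params) -> bool:
    """Return the value to use for FSDP's use_orig_params flag."""
    fsdp_should_use_orig_params = False
    # Only MoE models (GPT-OSS, Granite 4 hybrid MoE) enter the freeze path.
    if m.is_gpt_oss or m.is_granitemoehybrid:
        # freeze_router_params now returns True only if something was actually frozen.
        if freeze_router_params(m):
            logger.info("Router parameters frozen for MoE model")
            fsdp_should_use_orig_params = True
    return fsdp_should_use_orig_params
```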

src/instructlab/training/model.py (1)

421-428: Consider clarifying the guard necessity.

The guard if (self.is_gpt_oss or self.is_granitemoehybrid) before checking for output.aux_loss appears to be an optimization to avoid unnecessary hasattr checks on non-MoE models. However, the guard may be redundant since the subsequent checks (hasattr(output, "aux_loss") and output.aux_loss is not None) would safely handle non-MoE models anyway.

The "NOTE is this guard needed?" comment suggests uncertainty. Consider either:

  1. Removing the guard if the hasattr and is not None checks are sufficient
  2. Removing the comment if the guard provides meaningful performance benefits or code clarity
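
For concreteness, a minimal sketch of the gating in question (illustrative only, not the repository code); as noted, the hasattr and is-not-None checks would already handle non-MoE outputs safely on their own:

```python
import torch

def apply_aux_loss(loss: torch.Tensor, output, is_moe_model: bool) -> torch.Tensor:
    """Add the auxiliary loss to the main loss if the model output carries one."""
    # The outer flag mirrors `self.is_gpt_oss or self.is_granitemoehybrid`.
    if is_moe_model and hasattr(output, "aux_loss") and output.aux_loss is not None:
        loss = loss + output.aux_loss.float()
    return loss
```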
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 637afae and 161ae80.

📒 Files selected for processing (5)
  • src/instructlab/training/batch_loss_manager.py (1 hunks)
  • src/instructlab/training/gpt_oss_utils_correct.py (1 hunks)
  • src/instructlab/training/main_ds.py (1 hunks)
  • src/instructlab/training/model.py (4 hunks)
  • src/instructlab/training/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/instructlab/training/model.py (1)
src/instructlab/training/gpt_oss_utils_correct.py (2)
  • is_gpt_oss (397-411)
  • is_known_model (414-429)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/gpt_oss_utils_correct.py (1)
  • is_gpt_oss (397-411)
src/instructlab/training/utils.py (1)
  • freeze_router_params (903-926)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (6)
src/instructlab/training/gpt_oss_utils_correct.py (1)

414-430: LGTM! Well-structured utility function.

The is_known_model function provides a clean generalization of the is_gpt_oss pattern, enabling support for multiple model types including granitemoehybrid. The implementation properly handles both string and list inputs for known_model_type, and follows the same validation pattern as is_gpt_oss.
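
Based only on the behaviour described here (a model path or config plus a single type name or a list of names), such a utility might look roughly like the sketch below; the actual implementation in gpt_oss_utils_correct.py may validate its inputs differently:

```python
from transformers import AutoConfig, PretrainedConfig

def is_known_model(model_path_or_config, known_model_type) -> bool:
    """Return True if the model's Hugging Face model_type matches one of the given types."""
    if isinstance(known_model_type, str):
        known_model_type = [known_model_type]
    if isinstance(model_path_or_config, PretrainedConfig):
        config = model_path_or_config
    else:
        config = AutoConfig.from_pretrained(model_path_or_config)
    return getattr(config, "model_type", None) in known_model_type
```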

src/instructlab/training/utils.py (1)

903-927: LGTM! Improved return semantics for MoE models.

The updated freeze_router_params function now correctly returns True only when router parameters were actually frozen, rather than always returning True. The docstring and log messages have been appropriately generalized from "GPT-OSS" to "MoE models," aligning with the PR's objective to support granitemoehybrid MoE models.

The caller in main_ds.py (lines 350-355) properly handles the new return value.

src/instructlab/training/batch_loss_manager.py (1)

177-178: LGTM! Generalized auxiliary loss handling.

Removing the is_gpt_oss gate and applying auxiliary loss whenever accumulated_aux_loss is not None correctly generalizes the logic to support both GPT-OSS and granitemoehybrid MoE models. This change aligns with the broader PR objective to extend MoE support beyond GPT-OSS.
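
A simplified sketch of the described behaviour in _compute_average_loss; the denominator argument is a placeholder for whatever reduction the manager actually applies:

```python
import torch

def compute_average_loss(total_batch_loss: torch.Tensor,
                         accumulated_aux_loss,
                         denominator: int) -> torch.Tensor:
    """Add the accumulated aux loss whenever present, then reduce to an average."""
    # Previously this addition was gated on is_gpt_oss; now presence alone decides.
    if accumulated_aux_loss is not None:
        total_batch_loss = total_batch_loss + accumulated_aux_loss
    return total_batch_loss / max(denominator, 1)
```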

src/instructlab/training/main_ds.py (1)

349-355: LGTM! Correctly extends MoE support to granitemoehybrid.

The gate expansion to include is_granitemoehybrid properly extends router parameter freezing to support granite 4 MoE models. The code correctly captures and uses the return value from freeze_router_params to conditionally set fsdp_should_use_orig_params, which aligns with the updated return semantics in utils.py.

src/instructlab/training/model.py (2)

46-46: LGTM! Proper initialization of granitemoehybrid detection.

The import of is_known_model and initialization of self.is_granitemoehybrid using is_known_model(model_path, "granitemoehybrid") correctly follows the established pattern used for is_gpt_oss. This enables proper detection of granite 4 MoE models throughout the training pipeline.

Also applies to: 68-68
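
In sketch form, the detection pattern described here; the class name and constructor are placeholders, while the attribute names and the is_known_model call follow this review (import path inferred from the file layout shown above):

```python
from instructlab.training.gpt_oss_utils_correct import is_gpt_oss, is_known_model

class ModelWrapper:  # placeholder name, not the repository's actual class
    def __init__(self, model_path: str):
        self.is_gpt_oss = is_gpt_oss(model_path)
        self.is_granitemoehybrid = is_known_model(model_path, "granitemoehybrid")
```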


433-434: LGTM! Simplified auxiliary loss application.

Removing the is_gpt_oss check and applying auxiliary loss whenever aux_loss is not None correctly generalizes the logic for all MoE models. This change is consistent with the similar update in batch_loss_manager.py (line 177) and aligns with the PR's objective to support granitemoehybrid MoE models.

@mtake mtake (Contributor, Author) commented Nov 18, 2025

The first failure is

error: the configured Python interpreter version (3.14) is newer than PyO3's maximum supported version (3.13)

The second through fifth failures are due to missing EC2 credentials.
None of them are related to this PR. @Maxusmusti @RobotSail

@RobotSail RobotSail (Member) commented:

Hi @mtake, that's fine. I have a fix for this issue in PR #670.

@RobotSail RobotSail (Member) commented:

@mergify rebase

@mergify mergify bot (Contributor) commented Nov 18, 2025

rebase

✅ Branch has been successfully rebased

@RobotSail RobotSail force-pushed the granitemoehybrid-support branch from e8f8922 to c97c9ba on November 18, 2025 at 20:05

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 161ae80 and c97c9ba.

📒 Files selected for processing (5)
  • src/instructlab/training/batch_loss_manager.py (1 hunks)
  • src/instructlab/training/gpt_oss_utils_correct.py (1 hunks)
  • src/instructlab/training/main_ds.py (1 hunks)
  • src/instructlab/training/model.py (4 hunks)
  • src/instructlab/training/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/instructlab/training/model.py (1)
src/instructlab/training/gpt_oss_utils_correct.py (2)
  • is_gpt_oss (397-411)
  • is_known_model (414-433)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/gpt_oss_utils_correct.py (1)
  • is_gpt_oss (397-411)
src/instructlab/training/utils.py (1)
  • freeze_router_params (903-926)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (7)
src/instructlab/training/utils.py (1)

903-927: LGTM! Improved return semantics and broader MoE support.

The updated docstring and return semantics accurately reflect the broader MoE context. Returning True only when router parameters are actually frozen (rather than unconditionally) is more precise and aligns well with the updated usage in src/instructlab/training/main_ds.py (line 351).

src/instructlab/training/model.py (4)

46-46: LGTM! Proper import of the new utility.

The import of is_known_model is appropriate for detecting granitemoehybrid models.


68-68: LGTM! New attribute for granitemoehybrid detection.

The new is_granitemoehybrid attribute follows the same pattern as is_gpt_oss and enables proper MoE model detection.


421-428: LGTM! Expanded MoE auxiliary loss gating.

The condition correctly includes both GPT-OSS and granitemoehybrid models when checking for auxiliary loss presence.


433-434: LGTM! Unconditional auxiliary loss application.

The removal of the GPT-OSS-specific gate aligns with the broader MoE support. The auxiliary loss is now applied whenever present, which is appropriate for multiple MoE model types.

src/instructlab/training/batch_loss_manager.py (1)

177-178: Auxiliary loss compatibility verified—change is correct.

Both GPT-OSS and granitemoehybrid models extract aux_loss through the identical path in model.py (lines 421–427), converting it via output.aux_loss.float(). The unconditional check at lines 177–178 is valid because both model types produce aux_loss in the same format, and accumulated_aux_loss remains 0.0 when no auxiliary loss exists. No compatibility issues found.

Note: The comment at model.py:420 referencing only GPT-OSS is outdated and should be updated to reflect that both MoE model types now produce auxiliary loss.

src/instructlab/training/main_ds.py (1)

349-355: Code changes verified and approved.

The model_type string "granitemoehybrid" is confirmed as correct for IBM Granite 4 MoE models in HuggingFace transformers. The logic correctly extends MoE router parameter freezing to granitemoehybrid models and appropriately uses the return value from freeze_router_params.

@RobotSail RobotSail (Member) left a comment


LGTM

@mergify mergify bot added the one-approval label Nov 19, 2025
