Add routing replay for megatron runner #774

Open
fzyzcjy wants to merge 10 commits into main from ac8452/1/8

Conversation

@fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented Mar 20, 2026

No description provided.

fzyzcjy added 9 commits March 20, 2026 18:47
Add dumper CLI arguments (--dumper-enable, --dumper-dir, per-phase config),
dumper_utils.py for SGLang/Megatron dumper integration, model.py hooks
for forward-only and forward-backward phases, rollout env var plumbing,
source patcher wiring in training actors, and basic e2e test.
Add _maybe_apply_dumper_overrides to disable heartbeats, force single
rollout, and disable eval/save when --dumper-enable is set.
Add conftest_dumper.py with shared source patcher YAML configs and
comparator helpers. Expand test_dumper.py with full MoE parallelism
coverage, field verification, and cross-framework (SGLang vs Megatron)
activation comparison. Update dumper_utils.py to nest engine dumps under
engines/ subdirectory.
Add DataclassArgparseBridge for dataclass-to-argparse conversion,
refactor typer_utils with dataclass_cli decorator, and add
parallel_utils for running parallel CLI commands. Include tests.
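A dataclass-to-argparse bridge of this kind can be sketched as follows. This is a minimal illustration of the idea, not the actual DataclassArgparseBridge API; the `RunArgs` fields and the `dataclass_to_parser` helper name are hypothetical.

```python
import argparse
import dataclasses

@dataclasses.dataclass
class RunArgs:
    # Illustrative fields only; the real RunArgs lives in args.py.
    cp: int = 1
    replay_file: str = ""

def dataclass_to_parser(cls) -> argparse.ArgumentParser:
    # One CLI flag per dataclass field; type and default are taken
    # from the field's default value.
    parser = argparse.ArgumentParser()
    for f in dataclasses.fields(cls):
        parser.add_argument(
            f"--{f.name.replace('_', '-')}",
            type=type(f.default),
            default=f.default,
        )
    return parser

# argparse converts --replay-file back to the dest name replay_file,
# so the namespace round-trips cleanly into the dataclass.
ns = dataclass_to_parser(RunArgs).parse_args(["--cp", "4"])
run_args = RunArgs(**vars(ns))
print(run_args.cp)  # 4
```

The round trip (dataclass → parser → namespace → dataclass) is what lets the same typed definition serve both the CLI and the worker side.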
Add run_megatron worker batch.py for input batch construction and
cross-entropy loss, script_args.py for CLI-to-worker argument passing
via DataclassArgparseBridge, and args.py with CommonRunArgs/RunArgs
dataclass definitions. Include tests.
Add context parallelism (CP) zigzag slicing to batch preparation with
CP-aware next-token labels using position_ids-based gathering. Add cp
field to RunArgs. Include comprehensive CP tests.
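The zigzag scheme commonly used for context parallelism can be sketched in plain Python. This assumes the standard layout where the sequence is split into 2*cp_size chunks and rank r holds chunks r and (2*cp_size - 1 - r); the function name is illustrative, not the actual implementation.

```python
def cp_zigzag_slice(tokens: list, cp_size: int, cp_rank: int) -> list:
    # Split the sequence into 2*cp_size equal chunks; rank r takes chunks
    # r and (2*cp_size - 1 - r). Pairing an early chunk with a late one
    # balances causal-attention cost across ranks.
    n = 2 * cp_size
    size = len(tokens) // n
    chunks = [tokens[i * size:(i + 1) * size] for i in range(n)]
    return chunks[cp_rank] + chunks[n - 1 - cp_rank]

print(cp_zigzag_slice(list(range(8)), cp_size=2, cp_rank=0))  # [0, 1, 6, 7]
print(cp_zigzag_slice(list(range(8)), cp_size=2, cp_rank=1))  # [2, 3, 4, 5]
```

Because each rank holds non-contiguous chunks, next-token labels cannot come from a local shift, which is why the batch code gathers them by position_ids instead.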
Add path_utils.py for Megatron path and model script resolution,
prompt_utils.py for token ID generation (math/file/text modes),
and parallel_utils.py for parallel config parsing. Include tests.
Add worker/main.py for standalone Megatron forward/backward via torchrun,
worker_executor.py for building torchrun commands, run.py CLI command,
and package entry points (__main__.py, __init__.py). Include tests.
Add worker/replay.py for routing replay recording and loading with CP
zigzag and SP slicing support. Wire replay into worker main.py and CLI
(run.py, worker_executor.py, args.py). Include tests.
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces MoE routing replay support to the megatron runner, enabling the reproduction of expert routing decisions during inference. This is achieved by adding new command-line arguments, integrating the replay functionality into the worker executor, and providing comprehensive tests to ensure correct behavior.

Highlights

  • MoE Routing Replay: Added support for MoE routing replay to reproduce expert routing decisions during inference.
  • Integration: Integrated routing replay functionality into the run command and worker executor.
  • Testing: Extended existing tests and added new tests to validate the routing replay implementation.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces routing replay support for the Megatron runner, a valuable feature for debugging and reproducing expert routing decisions. The implementation is well-structured, with the core logic neatly encapsulated in a new replay.py module and integrated into the existing CLI and worker scripts. The changes are also well-tested. I have a couple of suggestions for improvement regarding code simplification and a potential security consideration.

Comment on lines +93 to +99
initialized: bool = mpu.is_initialized()
return _ParallelRanks(
cp_size=mpu.get_context_parallel_world_size() if initialized else 1,
cp_rank=mpu.get_context_parallel_rank() if initialized else 0,
tp_size=mpu.get_tensor_model_parallel_world_size() if initialized else 1,
tp_rank=mpu.get_tensor_model_parallel_rank() if initialized else 0,
)

medium

This function can be simplified. The mpu.get_* functions already handle the case where mpu is not initialized by returning default values (1 for sizes, 0 for ranks), so the initialized check is redundant.

Additionally, for consistency and style, you could move from megatron.core import mpu to the top of the file.

    return _ParallelRanks(
        cp_size=mpu.get_context_parallel_world_size(),
        cp_rank=mpu.get_context_parallel_rank(),
        tp_size=mpu.get_tensor_model_parallel_world_size(),
        tp_rank=mpu.get_tensor_model_parallel_rank(),
    )

sequence_parallel: bool,
) -> None:
"""Load replay from rank 0's file with CP zigzag slicing and SP slicing."""
saved_replays: list[list[torch.Tensor]] = torch.load(replay_file, weights_only=False)

security-medium medium

Using torch.load with weights_only=False can execute arbitrary code from a malicious file. For a debugging utility, this might be an acceptable risk if the replay file is trusted. However, it's worth noting this potential security vulnerability.
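The risk comes from pickle, which torch.load uses under the hood when weights_only=False: unpickling can invoke arbitrary callables. A stdlib-only sketch of the attack shape (the Evil class is illustrative):

```python
import contextlib
import io
import pickle

class Evil:
    # pickle calls __reduce__ during load, so unpickling executes whatever
    # callable it returns -- here a harmless print, but it could be os.system.
    def __reduce__(self):
        return (print, ("arbitrary code ran during unpickling",))

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    pickle.loads(pickle.dumps(Evil()))  # "loading" runs the payload
print(buf.getvalue().strip())  # arbitrary code ran during unpickling
```

Since the replay payload here is only nested lists of tensors, torch.load(replay_file, weights_only=True) should deserialize it through PyTorch's restricted unpickler and close off this attack class.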

Co-authored-by: Yueming Yuan <112649537+yueming-yuan@users.noreply.github.com>
@fzyzcjy fzyzcjy changed the base branch from ac8452/1/7 to main March 20, 2026 13:58
@fzyzcjy fzyzcjy requested a review from yushengsu-thu as a code owner March 20, 2026 13:58