Megatron --debug-rollout-only still enters actor train path and crashes on missing parallel_state #751

@mvillmow

Miles rollout-only profiling bug in Megatron backend

Summary

qwen3-30B-A3B-05-rollout-nsys.sh successfully runs rollout generation and produces an
Nsight Systems report, but the job still exits with a failure afterward when using the
Megatron backend.

The generated report is valid, but the overall stage is marked failed because Miles still
falls into the actor training path in --debug-rollout-only mode.

Reproduction

From the RL360 repo root:

./scripts/profiling/qwen3-30B-A3B-05-rollout-nsys.sh

Observed in the March 18, 2026 run:

  • rollout Nsight Systems capture completed successfully
  • report was written to:
    /workspace/logs/execution/qwen3-30B-A3B/20260318-193009/nsys/rollout_only.nsys-rep
  • the job then failed with:
    AttributeError: 'MegatronTrainRayActor' object has no attribute 'parallel_state'

Why this appears to be a Miles bug

In the Megatron backend:

  • MegatronTrainRayActor.init() returns early when args.debug_rollout_only is set
  • that means self.parallel_state is never initialized
  • later, MegatronTrainRayActor.train() still executes preprocessing that accesses
    self.parallel_state
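
The pattern described in the bullets above can be reproduced in isolation. The sketch below is not the Miles code: SketchActor is a hypothetical stand-in for MegatronTrainRayActor, and the bodies of init() and train() are reduced to the early return and the unconditional attribute read.

```python
from types import SimpleNamespace


class SketchActor:
    """Hypothetical stand-in for MegatronTrainRayActor; not the Miles code."""

    def init(self, args):
        self.args = args
        if args.debug_rollout_only:
            # Early return: parallel_state is never assigned on this path.
            return
        # Placeholder for the real parallel-state setup.
        self.parallel_state = object()

    def train(self, rollout_data_ref):
        # Mirrors the failing access: reads self.parallel_state unconditionally.
        return (rollout_data_ref, self.parallel_state)


actor = SketchActor()
actor.init(SimpleNamespace(debug_rollout_only=True))

try:
    actor.train(rollout_data_ref=None)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

With debug_rollout_only set, the call to train() raises the same kind of AttributeError as in the reported run, only with the stand-in class name.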

Relevant code paths in this repo mirror:

  • miles/miles/backends/megatron_utils/actor.py
  • miles/train.py

The failure happens after rollout data has already been generated, which is why:

  • qwen3-30B-A3B-01-e2e-nsys.sh passes
  • qwen3-30B-A3B-05-rollout-nsys.sh produces a usable .nsys-rep
  • but the rollout-only stage still exits non-zero

Error snippet

ray.exceptions.RayTaskError(AttributeError): ray::MegatronTrainRayActor.train()
...
  File "/root/miles/miles/backends/megatron_utils/actor.py", line 291, in train
    rollout_data = get_rollout_data(self.args, rollout_data_ref, self.parallel_state)
AttributeError: 'MegatronTrainRayActor' object has no attribute 'parallel_state'

Current workaround

  • treat the generated rollout_only.nsys-rep as valid profiling output
  • ignore the stage failure for now
  • prefer e2e or train_only stages when you need a clean green run
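
If the workaround needs to be applied in automation, one option is to accept a non-zero exit only when the report exists and the log contains this specific error. The helper below is a sketch under that assumption: rollout_stage_usable is a hypothetical name, and a non-empty file is only a rough proxy for a valid .nsys-rep.

```python
import tempfile
from pathlib import Path

# Error text taken from the traceback in this issue.
KNOWN_ERROR = "'MegatronTrainRayActor' object has no attribute 'parallel_state'"


def rollout_stage_usable(returncode: int, report: Path, log_text: str) -> bool:
    """Return True if the stage passed, or failed only with the known error
    after a non-empty report file was written."""
    if returncode == 0:
        return True
    report_written = report.is_file() and report.stat().st_size > 0
    return report_written and KNOWN_ERROR in log_text


# Demonstration with a placeholder file instead of a real capture.
with tempfile.TemporaryDirectory() as tmp:
    fake_report = Path(tmp) / "rollout_only.nsys-rep"
    fake_report.write_bytes(b"placeholder")
    log = "AttributeError: " + KNOWN_ERROR

    known_failure_ok = rollout_stage_usable(1, fake_report, log)
    other_failure_ok = rollout_stage_usable(1, fake_report, "some other error")
    missing_report_ok = rollout_stage_usable(1, Path(tmp) / "missing.nsys-rep", log)
```

Matching on the exact error string keeps unrelated failures from being masked while the upstream bug is open.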

Suggested upstream fix

Guard the Megatron actor training path for debug_rollout_only before any access to
self.parallel_state, or ensure parallel_state is initialized even in rollout-only mode.
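
Both options can be sketched together on a simplified class. GuardedActor, its method bodies, and the return values are hypothetical; the real change would go in MegatronTrainRayActor in miles/miles/backends/megatron_utils/actor.py, whose full signatures are not shown in this report.

```python
from types import SimpleNamespace


class GuardedActor:
    """Hypothetical illustration of the two suggested fixes; not the Miles code."""

    def init(self, args):
        self.args = args
        # Option B: always define the attribute, even in rollout-only mode.
        self.parallel_state = None
        if args.debug_rollout_only:
            return
        # Placeholder for the real parallel-state setup.
        self.parallel_state = object()

    def train(self, rollout_id, rollout_data_ref):
        # Option A: return before any code that touches parallel_state.
        if self.args.debug_rollout_only:
            return None
        return ("trained", rollout_id, self.parallel_state is not None)


rollout_only = GuardedActor()
rollout_only.init(SimpleNamespace(debug_rollout_only=True))
skipped = rollout_only.train(0, rollout_data_ref=None)

full = GuardedActor()
full.init(SimpleNamespace(debug_rollout_only=False))
trained = full.train(0, rollout_data_ref=None)
```

Option A alone is enough to avoid the crash. Option B additionally turns any remaining unguarded access into a None-handling problem rather than a missing-attribute error, which may or may not be preferable upstream.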
