Miles rollout-only profiling bug in Megatron backend
Summary
qwen3-30B-A3B-05-rollout-nsys.sh successfully runs rollout generation and produces an
Nsight Systems report, but the job still exits with a failure afterward when using the
Megatron backend.
The generated report is valid, but the overall stage is marked failed because Miles still
falls into the actor training path in --debug-rollout-only mode.
Reproduction
From the RL360 repo root:
./scripts/profiling/qwen3-30B-A3B-05-rollout-nsys.sh
Observed in the March 18, 2026 run:
- rollout Nsight Systems capture completed successfully
- report was written to:
/workspace/logs/execution/qwen3-30B-A3B/20260318-193009/nsys/rollout_only.nsys-rep
- the job then failed with:
AttributeError: 'MegatronTrainRayActor' object has no attribute 'parallel_state'
Why this appears to be a Miles bug
In the Megatron backend:
- MegatronTrainRayActor.init() returns early when args.debug_rollout_only is set
- that means self.parallel_state is never initialized
- later, MegatronTrainRayActor.train() still executes preprocessing that accesses self.parallel_state
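The failure pattern described above can be reproduced in isolation. The following is a minimal sketch, not the Miles source: ActorSketch is a hypothetical stand-in for MegatronTrainRayActor, and the bodies of init() and train() are reduced to the one attribute that matters here.

```python
class ActorSketch:
    """Hypothetical stand-in for MegatronTrainRayActor (not the real class)."""

    def init(self, debug_rollout_only: bool) -> None:
        self.debug_rollout_only = debug_rollout_only
        if debug_rollout_only:
            # Early return: parallel_state is never assigned on this path.
            return
        self.parallel_state = object()  # placeholder for the real state

    def train(self) -> str:
        # Preprocessing reads the attribute unconditionally.
        _ = self.parallel_state
        return "trained"


def reproduce() -> str:
    """Return the error message raised in rollout-only mode."""
    actor = ActorSketch()
    actor.init(debug_rollout_only=True)
    try:
        actor.train()
    except AttributeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(reproduce())
```

The resulting AttributeError has the same shape as the one in the Ray traceback: the attribute is missing on the instance, not set to None.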
Relevant code paths in this repo mirror:
- miles/miles/backends/megatron_utils/actor.py
- miles/train.py
The failure happens after rollout data has already been generated, which is why:
- qwen3-30B-A3B-01-e2e-nsys.sh passes
- qwen3-30B-A3B-05-rollout-nsys.sh produces a usable .nsys-rep
- but the rollout-only stage still exits non-zero
Error snippet
ray.exceptions.RayTaskError(AttributeError): ray::MegatronTrainRayActor.train()
...
File "/root/miles/miles/backends/megatron_utils/actor.py", line 291, in train
rollout_data = get_rollout_data(self.args, rollout_data_ref, self.parallel_state)
AttributeError: 'MegatronTrainRayActor' object has no attribute 'parallel_state'
Current workaround
- treat the generated rollout_only.nsys-rep as valid profiling output
- ignore the stage failure for now
- prefer e2e or train_only stages when you need a clean green run
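If the stage result is consumed by automation, the workaround can be made explicit rather than ignoring every failure. The helper below is a hypothetical sketch and not part of RL360 or Miles: classify_rollout_stage and its return labels are invented for illustration. It tolerates a non-zero exit only when the report exists, is non-empty, and the log contains the known AttributeError.

```python
from pathlib import Path

# Error text taken from the observed traceback.
KNOWN_ERROR = "'MegatronTrainRayActor' object has no attribute 'parallel_state'"


def classify_rollout_stage(returncode: int, report: Path, log_text: str) -> str:
    """Classify a rollout-only profiling run (hypothetical helper).

    Returns "passed", "known_failure", or "failed".
    """
    report_ok = report.is_file() and report.stat().st_size > 0
    if returncode == 0:
        # A clean exit still needs a usable report to count as a pass.
        return "passed" if report_ok else "failed"
    if report_ok and KNOWN_ERROR in log_text:
        # Non-zero exit caused by the known bug, with a usable report.
        return "known_failure"
    return "failed"
```

Any other failure, or a missing report, is still reported as failed, so the workaround does not mask unrelated regressions.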
Suggested upstream fix
Guard the Megatron actor training path for debug_rollout_only before any access to
self.parallel_state, or ensure parallel_state is initialized even in rollout-only mode.
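Both options can be sketched together. This is an illustration under stated assumptions, not a patch against the Miles source: GuardedActorSketch is a hypothetical stand-in for MegatronTrainRayActor, and the real init() and train() do considerably more work than shown here.

```python
from typing import Any, Optional


class GuardedActorSketch:
    """Hypothetical stand-in for MegatronTrainRayActor with both fixes applied."""

    def init(self, debug_rollout_only: bool) -> None:
        self.debug_rollout_only = debug_rollout_only
        # Option 2: always define the attribute, even on the early-return path.
        self.parallel_state: Optional[Any] = None
        if debug_rollout_only:
            return
        self.parallel_state = object()  # placeholder for the real state

    def train(self, rollout_data_ref: Any) -> Optional[str]:
        # Option 1: guard before any access to self.parallel_state.
        if self.debug_rollout_only:
            return None
        assert self.parallel_state is not None
        return "trained"


if __name__ == "__main__":
    actor = GuardedActorSketch()
    actor.init(debug_rollout_only=True)
    print(actor.train(rollout_data_ref=None))
```

Either option alone avoids the AttributeError. The guard in train() is the more direct of the two, because it skips training work that rollout-only mode is not meant to perform.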