Megatron --debug-rollout-only still enters actor train path and crashes on missing parallel_state #751

# Miles rollout-only profiling bug in Megatron backend

## Summary

With the Megatron backend, `qwen3-30B-A3B-05-rollout-nsys.sh` completes rollout generation
and writes an Nsight Systems report, but the job then exits non-zero.

The report itself is valid. The stage is marked failed because Miles still enters the
actor training path in `--debug-rollout-only` mode.

## Reproduction

From the RL360 repo root:

```bash
./scripts/profiling/qwen3-30B-A3B-05-rollout-nsys.sh
```

Observed in the March 18, 2026 run:

- rollout Nsight Systems capture completed successfully
- report was written to:

```text
/workspace/logs/execution/qwen3-30B-A3B/20260318-193009/nsys/rollout_only.nsys-rep
```

- the job then failed with:

```text
AttributeError: 'MegatronTrainRayActor' object has no attribute 'parallel_state'
```

## Why this appears to be a Miles bug

In the Megatron backend:

- `MegatronTrainRayActor.init()` returns early when `args.debug_rollout_only` is set
- that means `self.parallel_state` is never initialized
- later, `MegatronTrainRayActor.train()` still executes preprocessing that accesses
  `self.parallel_state`
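
The sequence above can be reproduced in isolation with a minimal sketch. `SketchActor` is a hypothetical stand-in written for this issue, not the Miles class; only the attribute name `parallel_state` and the flag name `debug_rollout_only` are taken from the real code, and the method bodies are assumptions.

```python
from types import SimpleNamespace


class SketchActor:
    """Hypothetical stand-in for MegatronTrainRayActor; not the Miles code."""

    def init(self, args):
        self.args = args
        if args.debug_rollout_only:
            # Early return: nothing below this line runs in rollout-only mode,
            # so self.parallel_state is never assigned.
            return
        self.parallel_state = object()  # placeholder for the real parallel state

    def train(self, rollout_data_ref):
        # Preprocessing reads self.parallel_state unconditionally.
        return (rollout_data_ref, self.parallel_state)


actor = SketchActor()
actor.init(SimpleNamespace(debug_rollout_only=True))

try:
    actor.train(rollout_data_ref=None)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

Running this raises the same kind of `AttributeError` as the traceback in this issue, which suggests the crash comes from the init/train mismatch rather than from the rollout data itself.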

Relevant code paths in this repo mirror:

- `miles/miles/backends/megatron_utils/actor.py`
- `miles/train.py`

The failure happens after rollout data has already been generated, which matches the observed behavior:

- `qwen3-30B-A3B-01-e2e-nsys.sh` passes
- `qwen3-30B-A3B-05-rollout-nsys.sh` produces a usable `.nsys-rep`
- but the rollout-only stage still exits non-zero

## Error snippet

```text
ray.exceptions.RayTaskError(AttributeError): ray::MegatronTrainRayActor.train()
...
  File "/root/miles/miles/backends/megatron_utils/actor.py", line 291, in train
    rollout_data = get_rollout_data(self.args, rollout_data_ref, self.parallel_state)
AttributeError: 'MegatronTrainRayActor' object has no attribute 'parallel_state'
```

## Current workaround

- treat the generated `rollout_only.nsys-rep` as valid profiling output
- ignore the stage failure for now
- prefer `e2e` or `train_only` stages when you need a clean green run
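
For wrappers that need to apply this workaround automatically, one possible check is sketched below. The function name, its arguments, and the idea of matching on the log text are assumptions made for illustration; nothing like this exists in the repo today. It accepts a failed stage only when the report file is present and non-empty and the log contains this specific error, so unrelated failures are not masked.

```python
import os
import tempfile

# Error signature copied from the traceback in this issue.
KNOWN_ERROR = "'MegatronTrainRayActor' object has no attribute 'parallel_state'"


def rollout_stage_is_acceptable(report_path, log_text):
    """Hypothetical helper: True only for the known failure with a usable report."""
    report_ok = os.path.isfile(report_path) and os.path.getsize(report_path) > 0
    return report_ok and KNOWN_ERROR in log_text


# Small self-contained demonstration using a temporary file as the report.
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_report = os.path.join(tmp_dir, "rollout_only.nsys-rep")
    with open(fake_report, "wb") as handle:
        handle.write(b"placeholder bytes")

    known_failure = rollout_stage_is_acceptable(
        fake_report, "AttributeError: " + KNOWN_ERROR
    )
    other_failure = rollout_stage_is_acceptable(fake_report, "CUDA out of memory")
    missing_report = rollout_stage_is_acceptable(
        os.path.join(tmp_dir, "missing.nsys-rep"), "AttributeError: " + KNOWN_ERROR
    )

print(known_failure, other_failure, missing_report)
```

This only checks that the file exists and is non-empty. It does not verify that Nsight Systems can open the report.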

## Suggested upstream fix

Guard the Megatron actor training path for `debug_rollout_only` before any access to
`self.parallel_state`, or ensure `parallel_state` is initialized even in rollout-only mode.
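
Both options can be sketched on a simplified actor. `GuardedActor` is a hypothetical stand-in, not the Miles class, and the placement of the guard and the `None` default are assumptions about what an upstream patch might look like.

```python
from types import SimpleNamespace


class GuardedActor:
    """Hypothetical stand-in for MegatronTrainRayActor; not the Miles code."""

    def init(self, args):
        self.args = args
        # Option B: always define the attribute so later reads cannot raise
        # AttributeError, even when init returns early.
        self.parallel_state = None
        if args.debug_rollout_only:
            return
        self.parallel_state = object()  # placeholder for the real parallel state

    def train(self, rollout_data_ref):
        # Option A: skip the training path entirely in rollout-only mode,
        # before anything touches self.parallel_state.
        if self.args.debug_rollout_only:
            return None
        return (rollout_data_ref, self.parallel_state)


rollout_only = GuardedActor()
rollout_only.init(SimpleNamespace(debug_rollout_only=True))
rollout_only_result = rollout_only.train(rollout_data_ref="ref")

full_run = GuardedActor()
full_run.init(SimpleNamespace(debug_rollout_only=False))
full_run_result = full_run.train(rollout_data_ref="ref")

print(rollout_only_result, full_run_result[0])
```

In this sketch, Option B on its own would only replace the `AttributeError` with whatever happens when preprocessing receives `None`, so the guard in `train()` (Option A) is likely the more important of the two.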

