Description
The MoE R3 training run is spiky when `--expert-model-parallel-size != --actor-num-gpus-per-node`.

Additionally, there is an R3-related warning:
```
B200-88-gqioom8y  (MegatronTrainRayActor pid=64295) /root/miles/miles/backends/training_utils/data.py:82: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:206.)
B200-88-gqioom8y  (MegatronTrainRayActor pid=64295) rollout_data["rollout_routed_experts"] = [torch.from_numpy(r) for r in rollout_data["rollout_routed_experts"]]
B200-88-y11hadgz  [2026-02-14 08:24:15] reloadable_process_group.py:152 - Reloading 27 process groups in pid 63472
B200-88-y11hadgz  [2026-02-14 08:24:15] memory_utils.py:41 - [Rank 0] Memory-Usage after wake_up model: {'gpu': '0', 'total_GB': 178.35, 'free_GB': 111.69, 'used_GB': 66.66, 'allocated_GB': 56.94, 'reserved_GB': 58.64}
B200-88-y11hadgz  [2026-02-14 08:24:15] timer.py:32 - Timer wake_up end (elapsed: 7.4s)
B200-88-y11hadgz  [2026-02-14 08:24:15] timer.py:24 - Timer data_preprocess start
B200-88-y11hadgz  /root/miles/miles/backends/training_utils/data.py:82: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:206.)
```
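The warning is raised because `torch.from_numpy` is called on read-only NumPy arrays, so the resulting tensors would alias non-writable memory. A minimal sketch of the fix the warning message itself suggests, using a simplified stand-in for the `rollout_routed_experts` arrays (the shapes and dtype here are illustrative, not taken from the codebase):

```python
import numpy as np

# Simulate what data.py:82 receives: arrays that arrive read-only,
# e.g. from shared memory or a zero-copy deserialization path.
rollout_routed_experts = []
for _ in range(2):
    r = np.arange(6, dtype=np.int64).reshape(2, 3)
    r.setflags(write=False)  # non-writable, like the arrays in the warning
    rollout_routed_experts.append(r)

# torch.from_numpy(r) on such an array triggers the UserWarning.
# Copying first hands PyTorch a writable buffer, so
# torch.from_numpy(r.copy()) would be warning-free:
writable = [r.copy() for r in rollout_routed_experts]
```

The cost is one extra copy per array; if the arrays are large and never written to in training, an alternative is to keep them read-only and suppress the warning, but copying is the safe default given the "undefined behavior" caveat in the message.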