R3 is buggy when --expert-model-parallel-size != --actor-num-gpus-per-node #599

@zianglih

Description

@HumansAnd

The MoE R3 training run is spiky when --expert-model-parallel-size != --actor-num-gpus-per-node.
Image
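
Until the root cause is known, a launch script could at least flag the configuration that triggers the spikes. The following is only a sketch of such a pre-flight check: the function name `check_r3_parallel_config` is hypothetical and not part of miles, and it assumes the two flags are passed as plain integers on the command line. It does not fix anything, it only warns when the two values differ.

```python
import argparse
import warnings


def check_r3_parallel_config(argv):
    """Return True if the two flags match (or are absent), False otherwise.

    Hypothetical helper, not part of miles: it only inspects the two
    flags named in this issue and ignores every other argument.
    """
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--expert-model-parallel-size", type=int, default=None)
    parser.add_argument("--actor-num-gpus-per-node", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)

    ep_size = args.expert_model_parallel_size
    gpus_per_node = args.actor_num_gpus_per_node

    # If either flag is missing we cannot compare, so do not complain.
    if ep_size is None or gpus_per_node is None:
        return True

    if ep_size != gpus_per_node:
        warnings.warn(
            f"--expert-model-parallel-size ({ep_size}) != "
            f"--actor-num-gpus-per-node ({gpus_per_node}): "
            "R3 training has been reported as spiky in this setup."
        )
        return False
    return True
```

Calling `check_r3_parallel_config(sys.argv[1:])` before starting the run would surface the mismatch early without changing any training behavior.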

Additionally, there is an R3-related warning:

B200-88-gqioom8y
(MegatronTrainRayActor pid=64295) /root/miles/miles/backends/training_utils/data.py:82: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:206.)
B200-88-gqioom8y
(MegatronTrainRayActor pid=64295)   rollout_data["rollout_routed_experts"] = [torch.from_numpy(r) for r in rollout_data["rollout_routed_experts"]]
B200-88-y11hadgz
[2026-02-14 08:24:15] reloadable_process_group.py:152 - Reloading 27 process groups in pid 63472
B200-88-y11hadgz
[2026-02-14 08:24:15] memory_utils.py:41 - [Rank 0] Memory-Usage after wake_up model: {'gpu': '0', 'total_GB': 178.35, 'free_GB': 111.69, 'used_GB': 66.66, 'allocated_GB': 56.94, 'reserved_GB': 58.64}
B200-88-y11hadgz
[2026-02-14 08:24:15] timer.py:32 - Timer wake_up end (elapsed: 7.4s)
B200-88-y11hadgz
[2026-02-14 08:24:15] timer.py:24 - Timer data_preprocess start
B200-88-y11hadgz
/root/miles/miles/backends/training_utils/data.py:82: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:206.)
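
The warning itself can probably be avoided by handing `torch.from_numpy` a writable array. The sketch below shows one way to do that, assuming the arrays in `rollout_data["rollout_routed_experts"]` arrive read-only (for example because they were built from a bytes buffer). `ensure_writable` and `routed_experts_to_tensors` are hypothetical names, not existing miles functions, and whether the read-only arrays are related to the spiky training is not established here.

```python
import numpy as np


def ensure_writable(arr: np.ndarray) -> np.ndarray:
    """Return `arr` unchanged if it is writable, otherwise a writable copy."""
    return arr if arr.flags.writeable else arr.copy()


def routed_experts_to_tensors(arrays):
    """Convert a list of NumPy arrays to tensors without the UserWarning.

    Mirrors the list comprehension at data.py:82 from the log above,
    with a copy inserted only for read-only inputs.
    """
    import torch  # imported lazily so ensure_writable works without torch

    return [torch.from_numpy(ensure_writable(a)) for a in arrays]


# A read-only array, simulated here with np.frombuffer on immutable bytes.
read_only = np.frombuffer(bytes(8), dtype=np.int32)
writable = ensure_writable(read_only)
```

The copy costs extra memory only for arrays that are actually read-only; arrays that are already writable are passed through as-is.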
