Description
The MoE R3 training run is spiky when `--expert-model-parallel-size != --actor-num-gpus-per-node`.

Additionally, there is an R3-related warning:
```
B200-88-gqioom8y  (MegatronTrainRayActor pid=64295) /root/miles/miles/backends/training_utils/data.py:82: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:206.)
B200-88-gqioom8y  (MegatronTrainRayActor pid=64295) rollout_data["rollout_routed_experts"] = [torch.from_numpy(r) for r in rollout_data["rollout_routed_experts"]]
B200-88-y11hadgz  [2026-02-14 08:24:15] reloadable_process_group.py:152 - Reloading 27 process groups in pid 63472
B200-88-y11hadgz  [2026-02-14 08:24:15] memory_utils.py:41 - [Rank 0] Memory-Usage after wake_up model: {'gpu': '0', 'total_GB': 178.35, 'free_GB': 111.69, 'used_GB': 66.66, 'allocated_GB': 56.94, 'reserved_GB': 58.64}
B200-88-y11hadgz  [2026-02-14 08:24:15] timer.py:32 - Timer wake_up end (elapsed: 7.4s)
B200-88-y11hadgz  [2026-02-14 08:24:15] timer.py:24 - Timer data_preprocess start
B200-88-y11hadgz  /root/miles/miles/backends/training_utils/data.py:82: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:206.)
```
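The warning is raised because `torch.from_numpy` is called on read-only NumPy arrays, so the resulting tensors would alias non-writable memory. A minimal sketch of the fix the warning message itself suggests, using a simplified stand-in for the `rollout_routed_experts` arrays (the shapes and dtype here are illustrative, not taken from the codebase):

```python
import numpy as np

# Simulate what data.py:82 receives: arrays that arrive read-only,
# e.g. from shared memory or a zero-copy deserialization path.
rollout_routed_experts = []
for _ in range(2):
    r = np.arange(6, dtype=np.int64).reshape(2, 3)
    r.setflags(write=False)  # non-writable, like the arrays in the warning
    rollout_routed_experts.append(r)

# torch.from_numpy(r) on such an array triggers the UserWarning.
# Copying first hands PyTorch a writable buffer, so
# torch.from_numpy(r.copy()) would be warning-free:
writable = [r.copy() for r in rollout_routed_experts]
```

The cost is one extra copy per array; if the arrays are large and never written to in training, an alternative is to keep them read-only and suppress the warning, but copying is the safe default given the "undefined behavior" caveat in the message.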