[QUESTION] Throughput low for 70B training #1086
-
Can you try PP2 and DP8? I presume you are also using …
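For concreteness, a minimal sketch of that layout on 128 H100s (16 nodes × 8 GPUs), assuming the standard Megatron-LM `pretrain_gpt.py` entry point; data-parallel size is not a flag, it falls out of the world size:

```bash
# DP is derived: data_parallel_size = world_size / (TP * PP) = 128 / (8 * 2) = 8
torchrun --nnodes 16 --nproc_per_node 8 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 2
    # ...plus the same model, data, and optimizer arguments as your current run
```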
-
I get OOM when using PP2 and TP8. To avoid OOM with PP8 and TP8, I used `--recompute-granularity full`, `--recompute-method uniform`, `--recompute-num-layers 5`, and `--distribute-saved-activations`. May I check whether this is similar to how you implemented it? Also, when using `--tp-comm-overlap` together with `--sequence-parallel`, I get the error `[dgx-16:4064993:0] cl_basic_team.c:132 CL_BASIC_ERROR no tl teams were created`, followed by `[dgx-16:4064993:0:4064993] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))`.
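For reference, a sketch of the recomputation settings described above, using the standard Megatron-LM argument names; the `--recompute-num-layers` value is the one from this particular run and likely needs tuning per setup:

```bash
# Activation recomputation used here to avoid OOM with TP8 + PP8
RECOMPUTE_ARGS=(
    --recompute-granularity full
    --recompute-method uniform
    --recompute-num-layers 5
    --distribute-saved-activations
)
# Appended to the usual launch, e.g.:
#   torchrun ... pretrain_gpt.py <other args> "${RECOMPUTE_ARGS[@]}"
# The segfault above appeared when additionally passing:
#   --tp-comm-overlap --sequence-parallel
```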
-
Hi Megatron team,
Could I check if the following is expected:
I saw that in your README the reported throughput for the 70B model with TP8, PP2, and DP48 on 768 GPUs was 420.5 TFLOP/s/GPU.
However, running the 70B model with TP8, PP8, and DP2 on 128 GPUs (H100s) with activation checkpointing, I only get 296.5 TFLOP/s/GPU.
May I check which optimizations you used to reach the throughput reported in the README table? I have also used `--overlap-grad-reduce` and `--overlap-param-gather`.
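For context, a rough sketch of the kind of launch this refers to (TP8 × PP8 on 128 GPUs, so DP = 2). The `--use-distributed-optimizer` and `--bf16` flags are assumptions on my part and were not stated above; to my understanding, `--overlap-param-gather` only takes effect when the distributed optimizer is enabled:

```bash
# 70B on 128 H100s: TP8 x PP8 -> DP = 128 / (8 * 8) = 2
torchrun --nnodes 16 --nproc_per_node 8 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --bf16
    # ...plus model/data arguments and the recomputation flags from the reply above
```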
Thank you!