[QUESTION] Throughput low for 70B training #1086
-
Can you try PP2 and DP8? I presume you are also using …
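For concreteness, a minimal sketch of that layout on 128 H100s (16 nodes × 8 GPUs), assuming the standard Megatron-LM `pretrain_gpt.py` entry point; data-parallel size is not a flag, it falls out of the world size:

```bash
# DP is derived: data_parallel_size = world_size / (TP * PP) = 128 / (8 * 2) = 8
torchrun --nnodes 16 --nproc_per_node 8 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 2
    # ...plus the same model, data, and optimizer arguments as your current run
```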
-
I get OOM when using PP2 and TP8. To avoid OOM with PP8 and TP8, I used `--recompute-granularity full`, `--recompute-method uniform`, `--recompute-num-layers 5`, and `--distribute-saved-activations`. May I check whether this is similar to how you implemented it? Also, when using `--tp-comm-overlap` together with `--sequence-parallel`, I get the error `[dgx-16:4064993:0] cl_basic_team.c:132 CL_BASIC_ERROR no tl teams were created`, followed by `[dgx-16:4064993:0:4064993] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))`.
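For reference, a sketch of the recomputation settings described above, using the standard Megatron-LM argument names; the `--recompute-num-layers` value is the one from this particular run and likely needs tuning per setup:

```bash
# Activation recomputation used here to avoid OOM with TP8 + PP8
RECOMPUTE_ARGS=(
    --recompute-granularity full
    --recompute-method uniform
    --recompute-num-layers 5
    --distribute-saved-activations
)
# Appended to the usual launch, e.g.:
#   torchrun ... pretrain_gpt.py <other args> "${RECOMPUTE_ARGS[@]}"
# The segfault above appeared when additionally passing:
#   --tp-comm-overlap --sequence-parallel
```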
-
Hi Megatron team,
Could I check if the following is expected:
I saw that in your README the reported throughput for the 70B model with TP8, PP2, and DP48 on 768 GPUs was 420.5 TFLOP/s/GPU.
However, running the 70B model with TP8, PP8, and DP2 on 128 GPUs (H100s) with activation checkpointing, I only get 296.5 TFLOP/s/GPU.
May I check which optimizations you used to reach the throughput reported in the README table? I have also used `--overlap-grad-reduce` and `--overlap-param-gather`.
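For context, a rough sketch of the kind of launch this refers to (TP8 × PP8 on 128 GPUs, so DP = 2). The `--use-distributed-optimizer` and `--bf16` flags are assumptions on my part and were not stated above; to my understanding, `--overlap-param-gather` only takes effect when the distributed optimizer is enabled:

```bash
# 70B on 128 H100s: TP8 x PP8 -> DP = 128 / (8 * 8) = 2
torchrun --nnodes 16 --nproc_per_node 8 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --bf16
    # ...plus model/data arguments and the recomputation flags from the reply above
```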
Thank you!