Could you provide scripts for training larger models such as Llama-3.1-70B-Instruct?
It seems that data parallelism alone (e.g., ZeRO) should be enough on 8 GPUs, since only the AttnGates are trained, but I am not sure whether my calculation is right. Did you use model parallelism during training?
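For reference, here is the rough back-of-the-envelope estimate I have in mind, sketched in Python. The gate-parameter ratio (~0.5% of the base model) is just my guess, not a number from the repo, and I assume bf16 frozen weights with ZeRO-3 sharding everything across the 8 GPUs:

```python
# Rough per-GPU memory estimate for gate-only training of Llama-3.1-70B
# on 8 GPUs with ZeRO-3. All numbers are assumptions, not measurements.

NUM_GPUS = 8
PARAMS_FROZEN = 70e9          # frozen base-model parameters
BYTES_BF16 = 2                # bf16 storage for the frozen weights

# Assume the trainable AttnGates are small, e.g. ~0.5% of the model size.
PARAMS_TRAINABLE = 0.005 * PARAMS_FROZEN

# Frozen weights: only the bf16 copy is needed (no grads, no optimizer states).
frozen_bytes = PARAMS_FROZEN * BYTES_BF16

# Trainable params with Adam: bf16 weights + fp32 master copy + grads
# + two fp32 optimizer moments, roughly 16 bytes per parameter.
trainable_bytes = PARAMS_TRAINABLE * 16

# ZeRO-3 shards both across all GPUs.
per_gpu_gib = (frozen_bytes + trainable_bytes) / NUM_GPUS / 2**30
print(f"~{per_gpu_gib:.1f} GiB per GPU before activations")
# ~17 GiB per GPU for weights/optimizer; activations come on top, so
# 80 GB cards should fit, while 40 GB cards would likely need offload.
```

If this estimate is roughly right, ZeRO-3 data parallelism alone looks sufficient, which is why I am asking whether model parallelism was actually needed in your setup.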