TPU multi-worker(pod) training #146

Open
DimensionSTP opened this issue Feb 2, 2025 · 0 comments
Comments

@DimensionSTP

Hi, I’m currently using a TPU v4-64 pod, and I ran into an issue when trying to run multi-worker Llama training with the example provided for TPU v5-8. Each worker seems to train independently instead of syncing during training. Could you provide an example specifically for TPU pod multi-worker training (e.g., TPU v4-64) where the entire pod is used as a single unit?
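
For what it’s worth, when each worker appears to train on its own it usually means the processes never joined a single JAX process group. Below is a minimal multi-host sanity check, assuming a plain JAX setup rather than this repo’s own launcher; the script name and gcloud command are only illustrative:

```python
# Minimal multi-host sanity check (a sketch, not this repo's launcher).
# Run the same script on every TPU VM worker, for example:
#   gcloud compute tpus tpu-vm ssh <tpu-name> --worker=all --command="python check_pod.py"
import jax
import jax.numpy as jnp

# On Cloud TPU pods this can usually be called with no arguments; it joins all
# workers into one JAX process group so collectives span the whole pod.
jax.distributed.initialize()

print(f"process {jax.process_index()}/{jax.process_count()}: "
      f"{jax.local_device_count()} local devices, "
      f"{jax.device_count()} global devices")

# psum over the global device axis: every worker should see the global device
# count (e.g. 32 on a v4-64) if the pod is really acting as a single unit.
ones = jnp.ones((jax.local_device_count(),))
summed = jax.pmap(lambda x: jax.lax.psum(x, axis_name="i"), axis_name="i")(ones)
print(f"process {jax.process_index()} psum result: {summed[0]}")
```

If the psum result equals only the local device count on each worker, the processes are not synchronized and each host is effectively running its own copy of the training loop.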
