TPU multi-worker(pod) training #146

Open
DimensionSTP opened this issue Feb 2, 2025 · 0 comments
Comments

@DimensionSTP

Hi, I’m currently using a TPU v4-64 pod, and I ran into an issue when trying to run multi-worker Llama training with the example provided for TPU v5-8. Each worker seems to train independently instead of syncing during training. Could you provide an example specifically for TPU pod multi-worker training (e.g., TPU v4-64) where the entire pod is used as a single unit?
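
For what it’s worth, when each worker appears to train on its own it usually means the processes never joined a single JAX process group. Below is a minimal multi-host sanity check, assuming a plain JAX setup rather than this repo’s own launcher; the script name and gcloud command are only illustrative:

```python
# Minimal multi-host sanity check (a sketch, not this repo's launcher).
# Run the same script on every TPU VM worker, for example:
#   gcloud compute tpus tpu-vm ssh <tpu-name> --worker=all --command="python check_pod.py"
import jax
import jax.numpy as jnp

# On Cloud TPU pods this can usually be called with no arguments; it joins all
# workers into one JAX process group so collectives span the whole pod.
jax.distributed.initialize()

print(f"process {jax.process_index()}/{jax.process_count()}: "
      f"{jax.local_device_count()} local devices, "
      f"{jax.device_count()} global devices")

# psum over the global device axis: every worker should see the global device
# count (e.g. 32 on a v4-64) if the pod is really acting as a single unit.
ones = jnp.ones((jax.local_device_count(),))
summed = jax.pmap(lambda x: jax.lax.psum(x, axis_name="i"), axis_name="i")(ones)
print(f"process {jax.process_index()} psum result: {summed[0]}")
```

If the psum result equals only the local device count on each worker, the processes are not synchronized and each host is effectively running its own copy of the training loop.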
