Apparent difference in convergence between Aurora and Polaris training #8

Open
5 tasks done
rickybalin opened this issue Jan 31, 2025 · 0 comments

rickybalin commented Jan 31, 2025

Using the same trajectory data from a p=3 simulation of the backward-facing step, we are observing differences in GNN model convergence between recent Aurora runs and older Polaris runs: the Polaris runs converge further. The difference becomes apparent after a few thousand training iterations.
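
To pin down where "a few thousand iterations" actually falls, one option is to compare the per-iteration training loss logged on each machine. The sketch below is only an illustration: the function name `first_divergence`, the relative tolerance, and the `patience` window are assumptions, not part of the GNN code, and it presumes both runs log one loss value per iteration in the same order.

```python
from typing import Optional, Sequence


def first_divergence(
    loss_a: Sequence[float],
    loss_b: Sequence[float],
    rel_tol: float = 0.05,
    patience: int = 3,
) -> Optional[int]:
    """Return the first iteration at which two loss curves start to differ.

    A divergence is reported only when the relative difference exceeds
    `rel_tol` for `patience` consecutive iterations, so that a single noisy
    step is not flagged. Returns None if the curves never diverge over the
    overlapping range.
    """
    streak = 0
    for i, (a, b) in enumerate(zip(loss_a, loss_b)):
        # Normalize by the larger magnitude; the floor avoids division by zero.
        denom = max(abs(a), abs(b), 1e-30)
        if abs(a - b) / denom > rel_tol:
            streak += 1
            if streak == patience:
                # Report the start of the consecutive run, not its end.
                return i - patience + 1
        else:
            streak = 0
    return None
```

Calling this with the Aurora and Polaris loss histories would give a concrete iteration index to quote when rerunning the checks, and the same tolerance can be reused to decide whether a given change closes the gap.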

GNN model parameters used:

n_messagePassing_layers=8
n_mlp_hidden_layers=2
hidden_channels=256
seed=64
phase1_steps=1000
phase2_steps=15000
phase3_steps=0
lr_phase12=0.001
lr_phase23=0.0000003
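
Before comparing convergence, it may help to confirm that the Aurora and Polaris runs really used identical hyperparameters. The sketch below records the values listed above in a frozen dataclass and reports any fields that differ between two runs. The field names and defaults are copied from this issue; the class `GNNRunConfig` and the helper `config_diff` are hypothetical and do not exist in the GNN code.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class GNNRunConfig:
    # Field names and default values are taken from the issue text.
    n_messagePassing_layers: int = 8
    n_mlp_hidden_layers: int = 2
    hidden_channels: int = 256
    seed: int = 64
    phase1_steps: int = 1000
    phase2_steps: int = 15000
    phase3_steps: int = 0
    lr_phase12: float = 0.001
    lr_phase23: float = 0.0000003


def config_diff(a: GNNRunConfig, b: GNNRunConfig) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}
```

An empty dictionary from `config_diff` means the two runs share all listed settings, which rules out a configuration mismatch before looking at platform-specific causes.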

Things to check:

  • Rerun on Polaris with recent changes to GNN code
  • Run on Aurora with all_to_all halo exchange
  • Run on Aurora without IPEX module
  • Run on Aurora with fix for the layer norm calls in the MLPs
  • Run on Aurora with PT 2.5

rickybalin self-assigned this Jan 31, 2025
rickybalin added the "bug" label (Something isn't working) Jan 31, 2025