I saw an extremely large gradient norm during MoE training. After I replaced them with the native PyTorch version, the problem went away.
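
In case it helps anyone compare the two setups, here is a minimal sketch of how the global gradient norm could be logged after `backward()`. The helper name `global_grad_norm` and the toy `nn.Linear` model are assumptions for illustration only; they are not the code from my training run, and the real MoE model would take the toy model's place.

```python
import torch
from torch import nn


def global_grad_norm(module: nn.Module) -> float:
    """Return the L2 norm taken over all parameter gradients of `module`."""
    grads = [
        p.grad.detach().flatten()
        for p in module.parameters()
        if p.grad is not None
    ]
    if not grads:
        return 0.0
    return torch.cat(grads).norm(2).item()


torch.manual_seed(0)
toy = nn.Linear(4, 2)  # hypothetical stand-in for the real MoE model
loss = toy(torch.randn(8, 4)).pow(2).mean()
loss.backward()

norm = global_grad_norm(toy)
print(f"grad norm: {norm:.4f}")
```

Logging this value every step for both variants (the original one and the native PyTorch one) should show at which step the norms start to diverge.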