Multi-node multi-GPU training often times out. After a timeout, how can training automatically resume from an already saved model? #5027
Unanswered
jiejie1993
asked this question in
Community | Q&A
Replies: 1 comment
-
Any update?
0 replies
-
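During multi-node multi-GPU training, an NCCL timeout occurs. torch provides --max-restarts to restart training, but how do I automatically load the latest saved model? Using --load-checkpoint requires the saved model to be present on every node, yet during training the model is only saved on the master node, and copying it to all nodes by hand means the restart cannot be automatic. Is there a way to automatically restart interrupted training and resume from the latest saved model?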
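To make the goal concrete, here is a minimal sketch of the "pick the newest checkpoint at startup" step. It rests on assumptions that are not confirmed anywhere in this thread: that checkpoints are written under a directory every node can read (for example a shared filesystem mount), and that they follow a `step_<N>` naming scheme. The names `find_latest_checkpoint`, `resolve_resume_path`, and `ckpt_root` are hypothetical helpers for illustration, not part of any library.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: each checkpoint is a file or directory called "step_<N>".
_CKPT_PATTERN = re.compile(r"^step_(\d+)$")


def find_latest_checkpoint(ckpt_root: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if none exists."""
    root = Path(ckpt_root)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for entry in root.iterdir():
        match = _CKPT_PATTERN.match(entry.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), entry
    return best_path


def resolve_resume_path(ckpt_root: str, explicit: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly passed checkpoint path; otherwise use the newest one found."""
    if explicit:
        return explicit
    latest = find_latest_checkpoint(ckpt_root)
    return str(latest) if latest is not None else None
```

If the training script called something like `resolve_resume_path` at startup and loaded the result when it is not `None`, each restart triggered by --max-restarts would pick up the newest checkpoint without manual copying, because the launcher re-runs the script from the beginning on every restart. This only helps when all nodes see the same checkpoint directory. It also does not guard against a checkpoint that was only partially written when the timeout hit.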