Commit 94cc621
Rank local checkpointing in DCP internal without collectives (#989)
Summary:
Pull Request resolved: #989
### Context
DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing (XLFormers-style checkpointing), which saves and loads the checkpoint without any collectives. The trade-off for now is that deduplication and resharding are not supported; support for these will be added soon.
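To illustrate the idea, here is a minimal standard-library sketch of rank-local save and load. It is not the DCP implementation from this PR: `save_rank_local`, `load_rank_local`, and the `rank_<N>/` directory layout are hypothetical names chosen for the example, and plain Python objects stand in for tensors. The point is only that each rank writes and reads its own data and metadata, so no gather or broadcast is needed.

```python
import json
import pickle
from pathlib import Path


def save_rank_local(state: dict, ckpt_dir: str, rank: int) -> Path:
    """Write this rank's state and its own metadata; no cross-rank communication."""
    rank_dir = Path(ckpt_dir) / f"rank_{rank}"  # hypothetical layout
    rank_dir.mkdir(parents=True, exist_ok=True)
    # Payload: everything this rank holds, including values that other ranks
    # may also hold (no deduplication, because ranks never compare contents).
    with open(rank_dir / "data.pkl", "wb") as f:
        pickle.dump(state, f)
    # Metadata is per rank too, so building it requires no collective.
    with open(rank_dir / "metadata.json", "w") as f:
        json.dump({"rank": rank, "keys": sorted(state)}, f)
    return rank_dir


def load_rank_local(ckpt_dir: str, rank: int) -> dict:
    """Read back exactly what this rank saved."""
    rank_dir = Path(ckpt_dir) / f"rank_{rank}"
    if not rank_dir.is_dir():
        # A different rank count or layout would require resharding,
        # which this sketch (like the trade-off described above) omits.
        raise FileNotFoundError(f"no rank-local checkpoint for rank {rank}")
    with open(rank_dir / "data.pkl", "rb") as f:
        return pickle.load(f)
```

In this sketch a job restarted with the same number of ranks can restore each rank independently, while loading with a rank that never saved fails immediately, which mirrors the missing resharding support.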
Differential Revision: D723903261
parent c1dfeb7, commit 94cc621
1 file changed: +6 -2
Original line 994 is replaced by new lines 994-996, and original line 1002 is replaced by new lines 1004-1006.