Fix TemporalAsyncCaller pin_memory lifetime in async checkpointing #2288
lvdunlin wants to merge 3 commits into NVIDIA:main from
This code path is going to be removed from Mcore and moved to https://github.com/NVIDIA/nvidia-resiliency-ext.
Thank you for pointing that out and for the link to the new repository.
Have you concretely validated the saved tensors with this option enabled?
@sbak5 We discovered this issue while using Megatron for LLM training: the saved checkpoints contained corrupted data. Since applying the fix in this PR, we have not found any corrupted data.
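For reference, one way to make this kind of validation concrete is to fingerprint each tensor's raw bytes at save time and compare against the reloaded checkpoint. The snippet below is only a minimal sketch, not the validation that was actually run for this PR. The helper names `fingerprint` and `find_corrupted` are hypothetical, and it assumes each tensor has already been converted to raw bytes on the host (for example via `tensor.cpu().numpy().tobytes()`).

```python
import hashlib
from typing import Dict, List, Mapping


def fingerprint(state: Mapping[str, bytes]) -> Dict[str, str]:
    """Return a SHA-256 hex digest for every entry of a name -> raw-bytes mapping.

    Hypothetical helper: the caller is assumed to have serialized each
    tensor to bytes before calling this.
    """
    return {name: hashlib.sha256(buf).hexdigest() for name, buf in state.items()}


def find_corrupted(before: Mapping[str, str], after: Mapping[str, str]) -> List[str]:
    """List the names whose digest changed, or that are missing, after reload."""
    return sorted(name for name in before if after.get(name) != before[name])


if __name__ == "__main__":
    # Stand-in data: in practice `saved` would be captured right before the
    # async save is scheduled and `loaded` read back from the checkpoint.
    saved = {"layer.weight": b"\x00\x01\x02", "layer.bias": b"\x03"}
    loaded = {"layer.weight": b"\x00\xff\x02", "layer.bias": b"\x03"}
    print(find_corrupted(fingerprint(saved), fingerprint(loaded)))
```

Taking the "before" fingerprints on the training side, before the asynchronous save is handed off, means a mismatch points at the save path rather than at later parameter updates.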
/ok to test bdb6bf2
@dimapihtar, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
/ok to test 89ae0f4
/ok to test 518692f
What does this PR do?
Summary
Technical
Background
Fix
Impact & Risk
Validation
Affected file: megatron/core/dist_checkpointing/strategies/async_utils.py (TemporalAsyncCaller).
Pre-checks
Core 0.8)