
Fix TemporalAsyncCaller pin_memory lifetime in async checkpointing#2288

Open
lvdunlin wants to merge 3 commits into NVIDIA:main from lvdunlin:fix-async-ckpt

Conversation

@lvdunlin

What does this PR do?

Summary

  • Fix the TemporalAsyncCaller D2H pin_memory lifetime in async checkpointing to prevent dirty data in the forked writer.

Background

  • Async checkpointing stages per-tensor non-blocking D2H copies using pin_memory.
  • Pinned memory allocated via pin_memory is not copy-on-write under fork; temporary pinned tensors in the main process can be reused by PyTorch, so the forked child process may read buffers that were modified after the fork.
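
The hazard can be sketched without CUDA. In the failure mode, pinned staging pages are visible to both the parent and the forked writer, so a buffer the parent recycles is also mutated under the child. The sketch below is an analogy only (not the Megatron-LM code): multiprocessing shared memory stands in for pinned pages.

```python
# CPU-only analogy: mp.Array gives parent and child a view of the SAME
# pages, mimicking how pinned staging buffers behave under fork in the
# failure mode (no copy-on-write isolation for the staged data).
import multiprocessing as mp

ctx = mp.get_context("fork")
buf = ctx.Array("d", [1.0, 1.0, 1.0])   # stands in for a pinned D2H buffer
overwritten = ctx.Event()               # set once the parent reuses the buffer
saw_dirty = ctx.Value("i", 0)           # did the writer read modified data?

def writer(buf, overwritten, saw_dirty):
    # Forked checkpoint writer: reads the staged tensor some time later.
    overwritten.wait()
    saw_dirty.value = int(buf[0] != 1.0)

p = ctx.Process(target=writer, args=(buf, overwritten, saw_dirty))
p.start()
buf[0] = -999.0   # parent recycles the staging buffer for the next tensor
overwritten.set()
p.join()
print("writer saw dirty data:", bool(saw_dirty.value))  # → True
```

Retaining the staged buffer until the writer joins, as this PR does for the real pinned tensors, closes exactly this window.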

Fix

  • Retain the D2H pinned tensors at the TemporalAsyncCaller level for the entire checkpoint execution window, releasing them only after the forked process joins. This prevents buffer reuse during the write.
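
A minimal sketch of the retention pattern follows; StagedBufferCaller and its methods are illustrative names, not the actual TemporalAsyncCaller API. The caller owns every staged buffer for the full checkpoint window and releases them only after the forked writer joins.

```python
# Hypothetical sketch of the retention fix; names are illustrative,
# not the Megatron-LM implementation.
import multiprocessing as mp

class StagedBufferCaller:
    def __init__(self):
        self._retained = []   # staged buffers kept alive for the full window
        self._proc = None

    def stage(self, buf):
        # Hold a reference so the allocator cannot recycle this buffer
        # while the forked writer may still be reading it.
        self._retained.append(buf)
        return buf

    def launch(self, target, args=()):
        self._proc = mp.get_context("fork").Process(target=target, args=args)
        self._proc.start()

    def join(self):
        if self._proc is not None:
            self._proc.join()
            self._proc = None
        # Only now is it safe to let the staged buffers be reused.
        self._retained.clear()
```

The key design point is that release is tied to join(), not to the end of the staging loop, so nothing the writer may still read can be recycled early.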

Impact & Risk

  • No API changes; only affects the async checkpointing path.
  • Longer retention temporarily increases pinned memory usage, bounded by checkpoint bucket sizes.

Validation

  • With async checkpointing enabled, generate checkpoints and verify their integrity (no dirty data), confirming pre-/post-fix consistency.
  • Affected file: megatron/core/dist_checkpointing/strategies/async_utils.py (TemporalAsyncCaller).
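
One way to script such an integrity check, with framework details elided (plain bytes and file I/O stand in here for the state dicts and dist-checkpointing save/load calls): hash the state before the async save begins and compare after the writer's output is reloaded.

```python
# Illustrative integrity check; plain bytes and file I/O stand in for
# the framework's state dicts and dist-checkpointing save/load calls.
import hashlib
import os
import tempfile

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

state = os.urandom(1 << 20)        # stand-in for a sharded state dict
expected = digest(state)           # recorded before the async save begins

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(state)                 # what the forked writer should persist
    path = f.name

with open(path, "rb") as f:
    loaded = f.read()
os.unlink(path)

print("checkpoint intact:", digest(loaded) == expected)  # → True
```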

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

@lvdunlin lvdunlin requested review from a team as code owners November 18, 2025 07:15
@copy-pr-bot

copy-pr-bot bot commented Nov 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@dimapihtar dimapihtar left a comment


LGTM. Thank you!

@sbak5
Contributor

sbak5 commented Nov 26, 2025

This code path is going to be removed in Mcore and move to https://github.com/NVIDIA/nvidia-resiliency-ext.

@lvdunlin
Author

This code path is going to be removed in Mcore and move to https://github.com/NVIDIA/nvidia-resiliency-ext.

Thank you for pointing that out and for the link to the new repository.
I understand the code is being migrated to nvidia-resiliency-ext. To ensure the fix is available in the current codebase, I will proceed with this PR for Mcore, and I will also prepare a corresponding PR for the nvidia-resiliency-ext repository once the changes here are finalized.
I would greatly appreciate it if you could review the current changes in this PR. Your feedback will help ensure the fix is correct before I port it to the new location.
Thank you for your time and guidance.

@sbak5
Contributor

sbak5 commented Dec 11, 2025

I'm wondering if you have done a concrete validation of tensors with this option enabled.
It looks good to me at this PR but any validation with a real model would be highly appreciated.

@lvdunlin
Author

lvdunlin commented Dec 11, 2025

I'm wondering if you have done a concrete validation of tensors with this option enabled. It looks good to me at this PR but any validation with a real model would be highly appreciated.

@sbak5 This issue was discovered when we were using Megatron for LLM training, and the saved checkpoints contained corrupted data. After the fix in this PR, no corrupted data has been found since.

@chtruong814 chtruong814 added the needs-follow-up (Issue needs follow-up) label Jan 11, 2026
@dimapihtar dimapihtar added the Run functional tests and Final Review (PR is in the "final review" stage) labels Jan 29, 2026
@dimapihtar
Contributor

/ok to test bdb6bf2

@copy-pr-bot

copy-pr-bot bot commented Jan 29, 2026

/ok to test bdb6bf2

@dimapihtar, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@dimapihtar dimapihtar self-assigned this Jan 29, 2026
@dimapihtar
Contributor

/ok to test 89ae0f4

@chtruong814 chtruong814 added and then removed the needs-follow-up (Issue needs follow-up) label Feb 4, 2026
@dimapihtar
Copy link
Contributor

/ok to test 518692f

Labels

community-request · Final Review (PR is in the "final review" stage) · needs-follow-up (Issue needs follow-up) · Run functional tests


6 participants