
Fix TemporalAsyncCaller pin_memory lifetime in async checkpointing#2288

Open
lvdunlin wants to merge 3 commits into NVIDIA:main from lvdunlin:fix-async-ckpt

Conversation

@lvdunlin

What does this PR do?

Summary

  • Fix the TemporalAsyncCaller D2H pin_memory lifetime in async checkpointing to prevent dirty data in the forked writer.

Background

  • Async checkpointing stages per-tensor non-blocking D2H copies using pin_memory.
  • Pinned memory allocated via pin_memory is not copy-on-write under fork; temporary pinned tensors in the main process can be reused by PyTorch, so the forked child process may read buffers that were modified after the fork.
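
The hazard can be sketched without CUDA. In the failure mode, pinned staging pages are visible to both the parent and the forked writer, so a buffer the parent recycles is also mutated under the child. The sketch below is an analogy only (not the Megatron-LM code): multiprocessing shared memory stands in for pinned pages.

```python
# CPU-only analogy: mp.Array gives parent and child a view of the SAME
# pages, mimicking how pinned staging buffers behave under fork in the
# failure mode (no copy-on-write isolation for the staged data).
import multiprocessing as mp

ctx = mp.get_context("fork")
buf = ctx.Array("d", [1.0, 1.0, 1.0])   # stands in for a pinned D2H buffer
overwritten = ctx.Event()               # set once the parent reuses the buffer
saw_dirty = ctx.Value("i", 0)           # did the writer read modified data?

def writer(buf, overwritten, saw_dirty):
    # Forked checkpoint writer: reads the staged tensor some time later.
    overwritten.wait()
    saw_dirty.value = int(buf[0] != 1.0)

p = ctx.Process(target=writer, args=(buf, overwritten, saw_dirty))
p.start()
buf[0] = -999.0   # parent recycles the staging buffer for the next tensor
overwritten.set()
p.join()
print("writer saw dirty data:", bool(saw_dirty.value))  # → True
```

Retaining the staged buffer until the writer joins, as this PR does for the real pinned tensors, closes exactly this window.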

Fix

  • Retain the D2H pinned tensors at the TemporalAsyncCaller level for the entire checkpoint execution window, releasing them only after the forked process joins. This prevents buffer reuse during the write.
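
A minimal sketch of the retention pattern follows; StagedBufferCaller and its methods are illustrative names, not the actual TemporalAsyncCaller API. The caller owns every staged buffer for the full checkpoint window and releases them only after the forked writer joins.

```python
# Hypothetical sketch of the retention fix; names are illustrative,
# not the Megatron-LM implementation.
import multiprocessing as mp

class StagedBufferCaller:
    def __init__(self):
        self._retained = []   # staged buffers kept alive for the full window
        self._proc = None

    def stage(self, buf):
        # Hold a reference so the allocator cannot recycle this buffer
        # while the forked writer may still be reading it.
        self._retained.append(buf)
        return buf

    def launch(self, target, args=()):
        self._proc = mp.get_context("fork").Process(target=target, args=args)
        self._proc.start()

    def join(self):
        if self._proc is not None:
            self._proc.join()
            self._proc = None
        # Only now is it safe to let the staged buffers be reused.
        self._retained.clear()
```

The key design point is that release is tied to join(), not to the end of the staging loop, so nothing the writer may still read can be recycled early.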

Impact & Risk

  • No API changes; only affects the async checkpointing path.
  • Longer retention temporarily increases pinned memory usage, bounded by checkpoint bucket sizes.

Validation

  • With async checkpointing enabled, generate checkpoints and verify their integrity (no dirty data), confirming pre-/post-fix consistency.
  • Affected file: megatron/core/dist_checkpointing/strategies/async_utils.py (TemporalAsyncCaller).
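
One way to script such an integrity check, with framework details elided (plain bytes and file I/O stand in here for the state dicts and dist-checkpointing save/load calls): hash the state before the async save begins and compare after the writer's output is reloaded.

```python
# Illustrative integrity check; plain bytes and file I/O stand in for
# the framework's state dicts and dist-checkpointing save/load calls.
import hashlib
import os
import tempfile

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

state = os.urandom(1 << 20)        # stand-in for a sharded state dict
expected = digest(state)           # recorded before the async save begins

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(state)                 # what the forked writer should persist
    path = f.name

with open(path, "rb") as f:
    loaded = f.read()
os.unlink(path)

print("checkpoint intact:", digest(loaded) == expected)  # → True
```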

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

@lvdunlin lvdunlin requested review from a team as code owners November 18, 2025 07:15
@copy-pr-bot

copy-pr-bot bot commented Nov 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@dimapihtar dimapihtar left a comment


LGTM. Thank you!

@sbak5
Contributor

sbak5 commented Nov 26, 2025

This code path is going to be removed in Mcore and move to https://github.com/NVIDIA/nvidia-resiliency-ext.

@lvdunlin
Author

This code path is going to be removed in Mcore and move to https://github.com/NVIDIA/nvidia-resiliency-ext.

Thank you for pointing that out and for the link to the new repository.
I understand the code is being migrated to nvidia-resiliency-ext. To ensure the fix is available in the current codebase, I will proceed with this PR for Mcore, and I will also prepare a corresponding PR for the nvidia-resiliency-ext repository once the changes here are finalized.
I would greatly appreciate it if you could review the current changes in this PR. Your feedback will help ensure the fix is correct before I port it to the new location.
Thank you for your time and guidance.

@sbak5
Contributor

sbak5 commented Dec 11, 2025

I'm wondering if you have done a concrete validation of tensors with this option enabled.
It looks good to me at this PR but any validation with a real model would be highly appreciated.

@lvdunlin
Author

lvdunlin commented Dec 11, 2025

I'm wondering if you have done a concrete validation of tensors with this option enabled. It looks good to me at this PR but any validation with a real model would be highly appreciated.

@sbak5 This issue was discovered when we were using Megatron for LLM training, and the saved checkpoints contained corrupted data. After the fix in this PR, no corrupted data has been found since.

@chtruong814 chtruong814 added the needs-follow-up (Issue needs follow-up) label Jan 11, 2026
@dimapihtar dimapihtar added the Run functional tests and Final Review (PR is in the "final review" stage) labels Jan 29, 2026
@dimapihtar
Contributor

/ok to test bdb6bf2

@copy-pr-bot

copy-pr-bot bot commented Jan 29, 2026

/ok to test bdb6bf2

@dimapihtar, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@dimapihtar dimapihtar self-assigned this Jan 29, 2026
@dimapihtar
Contributor

/ok to test 89ae0f4

@chtruong814 chtruong814 added and then removed the needs-follow-up (Issue needs follow-up) label Feb 4, 2026
@dimapihtar
Copy link
Contributor

/ok to test 518692f

Labels

community-request · Final Review (PR is in the "final review" stage) · needs-follow-up (Issue needs follow-up) · Run functional tests


6 participants