Adding NVRx as a dependency and keeping the current code base optionally #3899
Open
dimapihtar wants to merge 53 commits into NVIDIA:main from sbak5:sbak/ckpt_migrate
+1,581 −601
53 commits
a760b5e Handle quantized CUDA tensors in async checkpoint writer (sbak5)
1587314 Lint applied (sbak5)
a1aad02 Import resiliency-ext async checkpointing (sbak5)
65bb819 Revert "Handle quantized CUDA tensors in async checkpoint writer" (dimapihtar)
814f2fb keep both nvrx & mcore async save strategies (dimapihtar)
9fc5757 rename variable (dimapihtar)
c2559d9 refactor get_async_strategy (dimapihtar)
1b9ad87 pass async strategy to save/load strategy (dimapihtar)
55b6a00 Merge branch 'main' into sbak/ckpt_migrate (dimapihtar)
4fa312f fix imports (dimapihtar)
cb04e5a properly pass async_strategy to async_save (dimapihtar)
247027f properly pass async_strategy (dimapihtar)
c3673fd properly pass async_strategy load (dimapihtar)
cc6fd79 remove extra code (dimapihtar)
9c69546 fix code style (dimapihtar)
c5adc81 add deprecation warning (dimapihtar)
79ed36e fix style (dimapihtar)
33e2872 fix code style (dimapihtar)
ac936ed set mcore async-strategy for some func tests (dimapihtar)
1063aef update nvrx version (dimapihtar)
3a7af3e add unit tests (dimapihtar)
bfeb20a move warning (dimapihtar)
6ba7a29 revert changes (dimapihtar)
bf7f792 fix bug (dimapihtar)
0f3779d update unit tests (dimapihtar)
0b71e78 revert changes (dimapihtar)
755483c Revert "revert changes" (dimapihtar)
70ce1f5 Revert "update nvrx version" (dimapihtar)
d18bec2 Merge branch 'main' into sbak/ckpt_migrate (dimapihtar)
c3dabfb update nvrx version (dimapihtar)
51c5501 fix style (dimapihtar)
d2cfee3 fix unit tests (dimapihtar)
c5c86f5 fix unit tests (dimapihtar)
0f81f06 avoid async_strategy param at serialization.load() (dimapihtar)
5ceef34 fix unit test (dimapihtar)
57c422c fix unit test (dimapihtar)
9f690f0 move warning (dimapihtar)
9a36493 fix unit test (dimapihtar)
48c0d95 fix unit test (dimapihtar)
67703f2 Revert "fix unit test" (dimapihtar)
c8ed36c disable async_save (dimapihtar)
8030764 fix unit test (dimapihtar)
a0591c8 fix typo (dimapihtar)
64f4280 Merge branch 'main' into sbak/ckpt_migrate (dimapihtar)
1f4979e disable async_save (dimapihtar)
ddc37b5 fix unit test (dimapihtar)
2c85bea fix warning (dimapihtar)
ffe53e1 fix unit test (dimapihtar)
6833d87 Revert "update nvrx version" (dimapihtar)
e26b2d2 Merge branch 'main' into sbak/ckpt_migrate (dimapihtar)
2a92e71 update nvrx version (dimapihtar)
47e190d update uv.lock (dimapihtar)
c56a3eb Fix issue setting up `CachedFileSystemReader` and incorrect use of ck… (sbak5)
Let's remove any load related changes. We don't have anything yet for loading.
@sbak5 load calls `_get_filesystem_reader` (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/dist_checkpointing/strategies/torch.py#L807), which uses `CachedMetadataFileSystemReader` (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/dist_checkpointing/strategies/torch.py#L766), so we need to import it properly with respect to `async_strategy`.
hmm but I guess we can take it from common state dict
let me see
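
For illustration only, here is a minimal sketch of what "importing the reader with respect to `async_strategy`" could look like. It is not the code from this PR. The strategy names `"mcore"` and `"nvrx"` are taken from the commit messages above; the placeholder classes, the loader functions, and `get_reader_cls` are all hypothetical, and the real module paths are deliberately left out.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the reader class each backend would provide.
# In real code these would be imported from the respective packages.
class McoreReader:
    backend = "mcore"


class NvrxReader:
    backend = "nvrx"


def _load_mcore_reader() -> type:
    # A real implementation would perform the mcore import here,
    # so the import only happens when this strategy is selected.
    return McoreReader


def _load_nvrx_reader() -> type:
    # A real implementation would perform the nvrx import here,
    # keeping nvrx an optional dependency.
    return NvrxReader


# Assumed mapping from async_strategy value to a lazy loader.
_READER_LOADERS: Dict[str, Callable[[], type]] = {
    "mcore": _load_mcore_reader,
    "nvrx": _load_nvrx_reader,
}


def get_reader_cls(async_strategy: str) -> type:
    """Return the reader class matching the given async strategy."""
    try:
        loader = _READER_LOADERS[async_strategy]
    except KeyError:
        raise ValueError(f"Unknown async_strategy: {async_strategy!r}") from None
    return loader()
```

Deferring the import to the loader means a missing optional package only fails when that strategy is actually requested. If the strategy can instead be read from the common state dict, as suggested above, the caller would not need to pass it explicitly.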