Adding NVRx as a dependency and keeping the current code base optionally #3899
dimapihtar wants to merge 51 commits into NVIDIA:main from
Conversation
Fix dequantize check to use `"dequantize" in type(ten).__dict__` instead of `hasattr(ten, "dequantize")`. The latter returns True for all torch.Tensor objects since dequantize is defined on the base class, causing RuntimeError on non-quantized tensors. The new check only matches TE subclasses (e.g. Float8Tensor, MXFP8Tensor) that define their own dequantize override.
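A minimal self-contained sketch of why the two checks differ. The class names below are plain-Python stand-ins, not the real torch/TE types; the real situation is analogous presumably because `torch.Tensor` inherits `dequantize` from its C-level base rather than defining it in its own class `__dict__`, while TE subclasses define their own override:

```python
class TensorBase:
    """Stand-in for the base class that defines dequantize for everyone."""
    def dequantize(self):
        raise RuntimeError("not a quantized tensor")

class Tensor(TensorBase):
    """Stand-in for torch.Tensor: inherits dequantize, no override of its own."""

class Float8LikeTensor(Tensor):
    """Stand-in for a TE tensor (e.g. Float8Tensor) with its own override."""
    def dequantize(self):
        return "dequantized"

plain, quant = Tensor(), Float8LikeTensor()

# hasattr is True for both, because the method is inherited from the base:
assert hasattr(plain, "dequantize") and hasattr(quant, "dequantize")

# Checking the object's own class __dict__ only matches the real override:
assert "dequantize" not in type(plain).__dict__
assert "dequantize" in type(quant).__dict__
```

So the `hasattr` version would route every plain tensor into the dequantize path (and raise), while the `__dict__` version only fires for types that actually override the method.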
This reverts commit a760b5e.
Signed-off-by: dimapihtar <dpihtar@gmail.com>
/claude review
/ok to test cc6fd79
/ok to test d2cfee3
https://github.com/NVIDIA/Megatron-LM/actions/runs/23266155298/job/67649083947?pr=3899
/ok to test c5c86f5
```python
@debug_time("FullyParallelLoadStrategyWrapper.load", logger)
def load(self, sharded_state_dict: ShardedStateDict, checkpoint_dir: Path) -> StateDict:
def load(
```
Let's remove any load-related changes. We don't have anything yet for loading.
@sbak5 `load` calls `_get_filesystem_reader`: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/dist_checkpointing/strategies/torch.py#L807
which uses `CachedMetadataFileSystemReader`: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/dist_checkpointing/strategies/torch.py#L766, so we need to import it properly with respect to the async_strategy.
Hmm, but I guess we can take it from the common state dict. Let me see.
This reverts commit 48c0d95.
/ok to test ddc37b5
/ok to test 2c85bea
/ok to test ffe53e1
This reverts commit c3dabfb.
What does this PR do ?
Adds an `async_strategy` param to make it configurable from the user perspective (`--async-strategy mcore`).

Contribution process
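A minimal sketch of how such a flag might be wired up. The argparse wiring below is an assumption based on the PR description, not the actual Megatron-LM argument parser, and the `nvrx` choice is purely illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical wiring of the new option; the choices are illustrative.
parser.add_argument(
    "--async-strategy",
    type=str,
    default="mcore",
    choices=["mcore", "nvrx"],
    help="Which async checkpoint save strategy to use.",
)

args = parser.parse_args(["--async-strategy", "mcore"])
assert args.async_strategy == "mcore"
```

Keeping the strategy behind a single string-valued flag lets the existing code path remain the default while NVRx stays an optional dependency.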
Pre-checks
Code review
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
Expert reviewers are assigned based on .github/CODEOWNERS. Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.
For PRs outside megatron/core, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into the `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.