
V2.5based dev #1

Draft
fanshiqing wants to merge 39 commits into base_v2.5 from v2.5based_dev

Conversation

@fanshiqing
Owner

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

janekb04 and others added 30 commits June 9, 2025 11:54
…VIDIA#1856)

Use public API instead of removed private function
* replaced use of _load_state_dict_into_model with model.load_state_dict because the private function _load_state_dict_into_model was removed in huggingface/transformers#36335

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
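For context, a minimal sketch of the swap described above, using a stand-in torch module rather than an actual transformers checkpoint (names and the `strict=False` choice here are illustrative, not the exact call site in this change):

```python
import torch

# Stand-in model; the real code path loads a Hugging Face transformers model.
model = torch.nn.Linear(4, 4)
state_dict = {k: v.clone() for k, v in model.state_dict().items()}

# Before (removed in huggingface/transformers#36335):
#   _load_state_dict_into_model(model, state_dict, start_prefix="")
# After: the public torch.nn.Module API; strict=False relaxes key matching
# if the checkpoint and model do not line up exactly.
result = model.load_state_dict(state_dict, strict=False)
print(result.missing_keys, result.unexpected_keys)
```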
* Manage deps and add einops

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update build.yml

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Support MLA (qk_dim != v_dim) for AttnFuncWithCPAndKVP2P

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* add UT for MLA CP

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine the code

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine the code

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>
* Initial basic setup

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rm setup reqs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* build-isolation support

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rm not needed funcs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix workflows

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix wheel

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix invalid wheel

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix JAX build in baremetal env

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update install inst in readme

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update build.yml

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* docstring fix

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Fix for loading old ckpt formats

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
typo fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Implemented GroupedDense and TestGroupedDense for BF16, FP16, and FP8 
* Fix GroupedGemmFFI cuBLAS workspace alignment bug

Signed-off-by: Hua Huang <huah@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Added double buffering support initial commit

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>

* Fixed bugs

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

* Make only one double buffer creation

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

* Fixed bug

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

* Fixed typo

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

* Fixed flag setting

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Merge conflict

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <anandaraj@wisc.edu>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Revert "[JAX] GroupedDense v.2 without dynamic shape (NVIDIA#1721)"

This reverts commit 5d01ef2.

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Implemented GroupedDense and TestGroupedDense for BF16, FP16, and FP8 
* Fix GroupedGemmFFI cuBLAS workspace alignment bug

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
…#1864)

* Support L2Norm basic op

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Add L2Norm module wrapper

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Expose qk_norm to MHA and transformer layer

Signed-off-by: Evgeny <etsykunov@nvidia.com>
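For illustration, a minimal sketch of query/key L2 normalization over the head dimension (an assumption-level illustration, not the exact L2Norm op added here):

```python
import torch
import torch.nn.functional as F

# Toy shapes: [batch, seq, heads, head_dim]
q = torch.randn(2, 8, 4, 16)
k = torch.randn(2, 8, 4, 16)

# L2-normalize each head vector to unit norm (eps avoids division by zero).
q_normed = F.normalize(q, p=2, dim=-1, eps=1e-6)
k_normed = F.normalize(k, p=2, dim=-1, eps=1e-6)

print(q_normed.norm(dim=-1).mean())  # ~1.0
```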

* Move tests into separate file

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix pass

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Add license

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Remove  module

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Resolve comments

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…t test (NVIDIA#1873)

Distinguish the reasons why fp8 is not supported and mxfp8 is not supported

Signed-off-by: Hua Huang <huah@nvidia.com>
* fixes for jittable grouped_quantize

* fixes for jittable grouped_gemm

* fix contracting_dim for wgrad gemm

* exclude jitted grouped_gemm from the unit test as it does not work with cudaGraph

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add support for overlapping wgrad NCCL AG with dgrad GEMM

Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
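A rough sketch of the overlap pattern (illustrative only; the real path uses NCCL all-gather and Userbuffers on GPU, approximated here with a single-process gloo group so it runs anywhere):

```python
import os
import torch
import torch.distributed as dist

# Single-process setup purely so the sketch runs; real use is multi-GPU NCCL.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

x_local = torch.randn(128, 256)   # local activation shard needed for wgrad
grad_out = torch.randn(128, 512)  # incoming gradient
weight = torch.randn(512, 256)    # layer weight

# 1) Kick off the all-gather asynchronously.
gathered = [torch.empty_like(x_local) for _ in range(dist.get_world_size())]
handle = dist.all_gather(gathered, x_local, async_op=True)

# 2) Overlap: compute the dgrad GEMM while the all-gather is in flight.
dgrad = grad_out @ weight  # [128, 256]

# 3) Wait for the gathered activations, then compute the wgrad GEMM.
handle.wait()
x_full = torch.cat(gathered, dim=0)
wgrad = grad_out.t() @ x_full  # [512, 256] (world_size=1, so shapes match)

print(dgrad.shape, wgrad.shape)
dist.destroy_process_group()
```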

* Remove unused wait on memcpy API from UB

Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>

* Add better commenting to MXFP8 overlap

Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>

---------

Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
Co-authored-by: dastokes <dastokes@dastokes-dvt-01.nvidia.com>
…nit__` (NVIDIA#1870)

* Flatten basic op params during fuser init

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 949abe97070721b1da5117903067608250f5fb61)

* Add caching for is_non_tn_fp8_gemm_supported

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit fd830ae24ffbd2d0727010b1a8a119ca72f61ce5)
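An illustrative caching pattern for such a capability probe; the predicate below is a hypothetical placeholder, not TE's actual rule:

```python
import functools
import torch

@functools.lru_cache(maxsize=None)
def _is_non_tn_fp8_gemm_supported_cached() -> bool:
    # Hypothetical capability probe: cache the result so the device query
    # is not repeated on every forward call.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 10  # illustrative threshold only

print(_is_non_tn_fp8_gemm_supported_cached())
print(_is_non_tn_fp8_gemm_supported_cached())  # second call hits the cache
```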

* Pass fuser to _OperationFuserAutogradFunction.forward and move computation to __init__

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit fd808991993958b670726896254b82fcb967fa07)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Pass basic_op_kwargs and is_grad_enabled as parameters rather than in fuser

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
… column-wise usage (NVIDIA#1847)

* Do not initialize quantized weights with column-wise usage in inference mode

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in test

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use no-grad mode instead of inference mode in tests

Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…VIDIA#1858)

* Add FP8 current scaling to te.Sequential tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Helper function for test/ref tensors does not produce quantized tensor by default

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add FP8 current scaling to distributed te.Sequential tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add FP8 current scaling to Userbuffers te.Sequential tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug MXFP8 tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Added support of FP4 data type

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring to BitsNum in progress

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed compilation errors. All C++ tests passed

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed a typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added FP4 guard to TMA tensor descriptor data type

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed errors in JAX C++ extensions

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed dummy NVFP4 C++ test file

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Make pytorch changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Refactored the code per the review notes. Fixed JAX build error.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed unnecessary static casts

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Typo fix

Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>

* Pass correct num bits to create_2D_tensor_map; fixes CI

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* inline funcs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* add support for head dim > 128

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* remove debugging

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* raise tols slightly to tolerate 1/2048 mismatches

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* fix is_training for test_te_layer

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* add bprop support for blackwell

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor tweak for format

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* fix backend selection results

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* bump sm100 to sm100+

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* add sq=1 test for MLA

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* enable sq=1 for bprop

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* minor tweak in comments

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* fix head_dim logic and remove pytest skip

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* add FE fix for d>128

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* update FE again to take in small fixes

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* add cuDNN version info in L0 tests

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* increase tols for Unfused + large dim

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* Revert "add cuDNN version info in L0 tests"

This reverts commit 3e1b426.

Signed-off-by: Charlene Yang <charleney@nvidia.com>

* fix tols for Unfused

Signed-off-by: Charlene Yang <charleney@nvidia.com>

---------

Signed-off-by: Charlene Yang <charleney@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…A#1851)

* Add support for Fused Attn MLA head_dim_qk != head_dim_v
	Modify is_fused_attn_kernel_available() to accept different head_dims for qk and v
	Modify FusedAttnHelper to accept different head_dims for qk and v and modify assert dims checks in parse_qkv_aval()
	Modify FusedAttnFwdPrimitive and FusedAttnBwdPrimitive to accept different head_dims for qk and v
	Modify Fused Attn related cpp and csrc extension API calls to accept different head_dims for qk and v
	Modify DotProductAttention call() to extract head dims separately for qk and v
	Modify the FusedAttn tests to accommodate the FusedAttn API changes
	Add test case for head_dim_qk != head_dim_v (failing)
	Modify the baseline JAX appropriately to reshape the output vector based on v dims and not q dims

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
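A minimal sketch (illustrative shapes only) of why the output must be shaped by the v head dim rather than the q head dim:

```python
import torch

b, h, s, d_qk, d_v = 2, 4, 8, 192, 128  # MLA-style: qk head dim != v head dim
q = torch.randn(b, h, s, d_qk)
k = torch.randn(b, h, s, d_qk)
v = torch.randn(b, h, s, d_v)

scores = (q @ k.transpose(-2, -1)) / d_qk ** 0.5  # [b, h, s, s]
probs = scores.softmax(dim=-1)
out = probs @ v                                   # [b, h, s, d_v]

# The attention output inherits the *v* head dim, so output buffers and
# reshapes must use d_v (128), not d_qk (192).
assert out.shape == (b, h, s, d_v)
```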

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix context dims in general DPA in test_fused_attn

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Fix output tensor dim by using the v head dim rather than the q head dim
Add test cases for jax fused attn where head_dim_qk != head_dim_v across a combination of data types and attention types

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Modify the fused attn jax unit test case for head dim qk != head dim v

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Use new FusedAttnRunner function signature for separate hidden dim for qk and v in Fused Attn distributed tests
Code clean up

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Fix usage of is_fused_attn signature in distributed tests

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Remove unnecessary assert

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

---------

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
…VIDIA#1871)

* Support MXFP8 and handle empty matrices

Signed-off-by: Hua Huang <huah@nvidia.com>

---------

Signed-off-by: Hua Huang <huah@nvidia.com>
* Fix an issue when mcore uses the TE fused cross-entropy (CE) implementation

Signed-off-by: lit <lit@nvidia.com>

* simplify unit test code

Signed-off-by: lit <lit@nvidia.com>

* Update tests/pytorch/test_parallel_cross_entropy.py

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------

Signed-off-by: lit <lit@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* include previously accidentally excluded tests

* Execute run_test_multiprocessing_encoder in a nested bash shell and propagate the inner shell's exit code

* Adapt run_test_multiprocessing to handle segfault

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…1844)

* TensorUsage + FP8 GEMM with all layouts handling on BW

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>


---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…IA#1831)

* Use FP16 tols for tests with TF32

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use uniform init instead of constant init

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Revert constant init test, but reduce value

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Fix cppunittest test.sh for editable installs

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update tests/cpp/CMakeLists.txt

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…1793)

* finish python ref impl for bulk alloc

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* c++ bulk alloc worked, still draft version

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* clean up

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* resolve rebase conflict

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add license

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* use shared_ptr to auto manage reference count

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* attempt to fix misc training error

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* attempt to handle case where experts get zero tokens

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* updated with fused C++ function calls

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* clean up

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* experiment with reducing py object construction time

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fix seg fault bug in inference mode

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fix lint

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fuse torch split into bulk alloc

Signed-off-by: zhongboz <zhongboz@nvidia.com>
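A minimal sketch of the bulk-alloc-plus-split idea (illustrative only; the real implementation performs this in fused C++ calls):

```python
import torch

# Per-expert token counts for a grouped/MoE linear (illustrative numbers;
# a zero-token expert simply yields a 0-row view).
tokens_per_expert = [5, 0, 3, 7]
hidden = 16

# One bulk allocation for all experts...
bulk = torch.empty(sum(tokens_per_expert), hidden)

# ...then torch.split returns views into that buffer (no extra allocations
# or copies), which is the effect of fusing the split into the bulk alloc.
per_expert = torch.split(bulk, tokens_per_expert, dim=0)

for i, t in enumerate(per_expert):
    print(i, tuple(t.shape))

# The first non-empty chunk starts at the bulk buffer's base pointer.
assert per_expert[0].data_ptr() == bulk.data_ptr()
```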

* clean up

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* rebase to latest main

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fix unit test failure

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fix lint error

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* refactor create_tensor to use get_scale_shape

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* refactor quantize to call quantize_cpp

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Implement separate functions for multi-tensor quantize and split + multi-tensor quantize

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update grouped linear module with fused split+quantize func

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Move multi-tensor quantize func to cast.cpp

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Do not expose quantizer helper function externally

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert cuDNN frontend commit

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix corner cases with zero tokens

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* add comments

Signed-off-by: zhongboz <zhongboz@nvidia.com>

---------

Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
* [PyTorch|common] Implement unpadding kernel for FP8

1. Add multi-tensor unpadding kernel
2. Replace split+cat with unpadding kernel in Fp8Padding and Fp8Unpadding
3. Add unpadding with padding unit tests

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
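For reference, a sketch of the split+cat path that the fused unpadding kernel replaces (illustrative shapes and padding sizes):

```python
import torch

# Drop the per-expert padding rows that Fp8Padding added for alignment.
actual_tokens = [5, 3]   # valid rows per expert
padded_tokens = [8, 8]   # rows after padding to a multiple (e.g. of 8)
hidden = 4

padded = torch.arange(sum(padded_tokens) * hidden, dtype=torch.float32).view(-1, hidden)

# split + cat reference path (what the kernel replaces with one fused launch):
chunks = torch.split(padded, padded_tokens, dim=0)
unpadded = torch.cat([c[:n] for c, n in zip(chunks, actual_tokens)], dim=0)

assert unpadded.shape == (sum(actual_tokens), hidden)
print(unpadded.shape)
```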

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add license

Signed-off-by: Xin Yao <xiny@nvidia.com>

* Update padding.cu

Signed-off-by: Xin Yao <xiny@nvidia.com>

---------

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
pggPL and others added 5 commits June 27, 2025 00:29
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…VIDIA#1843)

* fixed the bug

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* test change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…#1898)

Use keyword args for jit in_shardings and out_shardings

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
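A minimal sketch of the keyword-argument usage (single-device mesh so it runs anywhere; not the exact call site in this change):

```python
from functools import partial

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Single-device mesh purely so the sketch runs on any machine.
mesh = Mesh(jax.devices()[:1], axis_names=("dp",))
sharding = NamedSharding(mesh, P("dp"))

# Pass the shardings as keyword args rather than positionally.
@partial(jax.jit, in_shardings=sharding, out_shardings=sharding)
def double(x):
    return 2.0 * x

x = jnp.arange(8.0)
print(double(x))
```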
* skip kv cache for sm89, cudnn < 9.12

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix test_numerics

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix: (1) UT ignores MLA; (2) bshd format runtime error. Disable FP8 MLA attention + CP due to a correctness problem

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* only disable FP8 CP for MLA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
fanshiqing force-pushed the v2.5based_dev branch 2 times, most recently from 7f0abd7 to 9595f51 on July 30, 2025 at 07:29
…ted patterns;

   - ag->fc2_wgrad
   - ag->fc1_wgrad
   - fc1_dgrad->rs
   - ag->proj_wgrad
   - ag->qkv_wgrad
   - qkv_dgrad->rs
