…VIDIA#1856) Use public API instead of removed private function

* Replaced use of `_load_state_dict_into_model` with `model.load_state_dict`, because the private function `_load_state_dict_into_model` was removed in huggingface/transformers#36335.

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
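For context, a minimal sketch of the migration (a plain torch module stands in for a HF model; the removed private call shown in the comment is reconstructed from the commit message, not copied from the patch):

```python
import torch

# Stand-in module; the real change applies to HF transformers checkpoints.
model = torch.nn.Linear(8, 8)
state_dict = model.state_dict()

# Before (removed upstream in huggingface/transformers#36335):
#   from transformers.modeling_utils import _load_state_dict_into_model
#   _load_state_dict_into_model(model, state_dict, "")
# After: the public torch.nn.Module API does the same job.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
assert not missing and not unexpected
```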
* Manage deps and add einops
* Update build.yml

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Support MLA (qk_dim != v_dim) for AttnFuncWithCPAndKVP2P
* Add a unit test for MLA with context parallelism
* Refine the code (plus pre-commit.ci auto-fixes)

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>
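For reference, a plain-PyTorch sketch (not the fused context-parallel kernel) of why differing qk and v head dims are valid: the attention probabilities depend only on q and k, and the output inherits v's head dim. All sizes below are illustrative:

```python
import torch
import torch.nn.functional as F

b, h, sq, skv, d_qk, d_v = 2, 4, 16, 16, 192, 128  # illustrative MLA-style dims
q = torch.randn(b, h, sq, d_qk)
k = torch.randn(b, h, skv, d_qk)
v = torch.randn(b, h, skv, d_v)

probs = F.softmax(q @ k.transpose(-2, -1) / d_qk**0.5, dim=-1)  # [b, h, sq, skv]
out = probs @ v                                                 # [b, h, sq, d_v]
assert out.shape == (b, h, sq, d_v)
```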
* Initial basic setup
* Remove setup requirements
* Add build-isolation support
* Remove unneeded functions
* Fix workflows
* Fix wheel, and fix an invalid wheel
* Fix the JAX build in a bare-metal environment
* Update the install instructions in the README
* Update build.yml
* Docstring fix and miscellaneous fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Typo fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Implemented GroupedDense and TestGroupedDense for BF16, FP16, and FP8
* Fixed a GroupedGemmFFI cuBLAS workspace alignment bug

Signed-off-by: Hua Huang <huah@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
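The alignment-fix pattern, sketched as plain arithmetic and assuming the workspace pointer must be rounded up to a 256-byte boundary (the alignment value and the address below are illustrative, not taken from the patch):

```python
ALIGN = 256  # assumed cuBLAS workspace alignment requirement

def align_up(ptr: int, align: int = ALIGN) -> int:
    """Round a raw address up to the next multiple of `align`."""
    return (ptr + align - 1) & ~(align - 1)

raw = 0x7F00_0000_0123  # hypothetical unaligned workspace address
assert align_up(raw) % ALIGN == 0
assert align_up(raw) - raw < ALIGN
```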
* Added initial double-buffering support
* Create only one double buffer
* Fixed bugs, a typo, and flag setting
* Resolved a merge conflict; miscellaneous and lint fixes (plus pre-commit.ci auto-fixes)

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <anandaraj@wisc.edu>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Revert "[JAX] GroupedDense v.2 without dynamic shape (NVIDIA#1721)" This reverts commit 5d01ef2. Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Implemented GroupedDense and TestGroupedDense for BF16, FP16, and FP8
* Fixed a GroupedGemmFFI cuBLAS workspace alignment bug

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
…#1864)

* Support L2Norm basic op
* Add L2Norm module wrapper
* Expose qk_norm to MHA and transformer layer
* Move tests into a separate file
* Fix pass, add license, remove module, resolve review comments (plus pre-commit.ci auto-fixes)

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
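A reference sketch of the L2 normalization that a qk-norm applies to queries/keys over the head dimension (the eps value here is an assumption, not the default from the patch):

```python
import torch

def l2norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize over the last (head) dimension: x / sqrt(sum(x^2) + eps).
    return x / torch.sqrt(torch.sum(x * x, dim=-1, keepdim=True) + eps)

q = torch.randn(2, 4, 16, 64)  # [batch, heads, seq, head_dim], illustrative
assert torch.allclose(l2norm(q).norm(dim=-1), torch.ones(2, 4, 16), atol=1e-3)
```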
…t test (NVIDIA#1873)

Distinguish the reasons why FP8 is not supported from the reasons why MXFP8 is not supported.

Signed-off-by: Hua Huang <huah@nvidia.com>
* Fixes for jittable grouped_quantize
* Fixes for jittable grouped_gemm
* Fix contracting_dim for the wgrad GEMM
* Exclude jitted grouped_gemm from the unit test, as it does not work with CUDA graphs

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add support for overlapping the wgrad NCCL all-gather with the dgrad GEMM
* Remove the unused wait-on-memcpy API from Userbuffers
* Add better comments to the MXFP8 overlap

Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
Co-authored-by: dastokes <dastokes@dastokes-dvt-01.nvidia.com>
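A hedged single-GPU sketch of the overlap idea using CUDA streams (a `clone()` stands in for the NCCL all-gather; this is not the Userbuffers implementation):

```python
import torch

if torch.cuda.is_available():
    comm_stream = torch.cuda.Stream()
    grad_out = torch.randn(2048, 2048, device="cuda")
    weight = torch.randn(2048, 2048, device="cuda")
    inp_shard = torch.randn(2048, 2048, device="cuda")

    # Side stream waits for the inputs, then runs the "all-gather" while the
    # dgrad GEMM proceeds on the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        # Stand-in for dist.all_gather_into_tensor(...) gathering the input.
        gathered_inp = inp_shard.clone()

    dgrad = grad_out @ weight                # overlaps with the "all-gather"
    torch.cuda.current_stream().wait_stream(comm_stream)
    wgrad = grad_out.t() @ gathered_inp      # consumes the gathered tensor
```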
…nit__` (NVIDIA#1870)

* Flatten basic op params during fuser init (cherry picked from commit 949abe97070721b1da5117903067608250f5fb61)
* Add caching for is_non_tn_fp8_gemm_supported (cherry picked from commit fd830ae24ffbd2d0727010b1a8a119ca72f61ce5)
* Pass the fuser to _OperationFuserAutogradFunction.forward and move computation to __init__ (cherry picked from commit fd808991993958b670726896254b82fcb967fa07)
* Pass basic_op_kwargs and is_grad_enabled as parameters rather than in the fuser (plus pre-commit.ci auto-fixes)

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
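A sketch of the caching idea using `functools.lru_cache`; only the function name comes from the commit message, the body is a stand-in, and the real code may cache differently:

```python
import functools

@functools.lru_cache(maxsize=None)
def is_non_tn_fp8_gemm_supported() -> bool:
    # Stand-in for the real device-capability query; the result is stable for
    # the life of the process, so computing it once is enough.
    return False

is_non_tn_fp8_gemm_supported()  # computed once
is_non_tn_fp8_gemm_supported()  # served from the cache
```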
… column-wise usage (NVIDIA#1847)

* Do not initialize quantized weights with column-wise usage in inference mode
* Fix a bug in the test
* Use no-grad mode instead of inference mode in the tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…VIDIA#1858)

* Add FP8 current scaling to te.Sequential tests
* Helper function for test/reference tensors no longer produces a quantized tensor by default
* Add FP8 current scaling to distributed te.Sequential tests
* Add FP8 current scaling to Userbuffers te.Sequential tests
* Debug MXFP8 tests (plus pre-commit.ci auto-fixes)

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Added support for the FP4 data type
* Refactored to BitsNum; fixed compilation errors (all C++ tests pass)
* Added an FP4 guard to the TMA tensor descriptor data type
* Fixed errors in the JAX C++ extensions; removed the dummy NVFP4 C++ test file
* Made the corresponding PyTorch changes
* Refactored per review notes; fixed a JAX build error; removed unnecessary static casts
* Pass the correct number of bits to create_2D_tensor_map (fixes CI); inline functions; typo fixes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
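For orientation, an illustrative decode table for a 4-bit float, assuming the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit) commonly associated with NVFP4; this is a sketch, not TE source code:

```python
# Magnitudes for nibbles 0..7 under E2M1: two subnormals, then (1 + m/2) * 2^(e-1).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble: int) -> float:
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0b0111]

assert decode_e2m1(0b0111) == 6.0 and decode_e2m1(0b1001) == -0.5
```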
* Add support for head dim > 128; remove debugging code
* Raise tolerances slightly to tolerate 1/2048 mismatches; later increase and fix tolerances for the unfused backend with large dims
* Fix is_training for test_te_layer
* Add bprop support for Blackwell; bump sm100 to sm100+
* Fix the backend selection results and the head_dim logic; remove a pytest skip
* Add an sq=1 test for MLA and enable sq=1 for bprop
* Pull in the cuDNN frontend fix for d > 128, then update it again for other small fixes
* Add, then revert, cuDNN version info in L0 tests (reverts commit 3e1b426)
* Minor format and comment tweaks

Signed-off-by: Charlene Yang <charleney@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…A#1851)

* Add support for fused-attention MLA with head_dim_qk != head_dim_v:
  - Modify is_fused_attn_kernel_available() to accept different head dims for qk and v
  - Modify FusedAttnHelper to accept different head dims for qk and v, and adjust the dim checks in parse_qkv_aval()
  - Modify FusedAttnFwdPrimitive and FusedAttnBwdPrimitive to accept different head dims for qk and v
  - Modify the fused-attention C++/csrc extension API calls to accept different head dims for qk and v
  - Modify DotProductAttention's call() to extract head dims separately for qk and v
  - Update the fused-attention tests for the API changes, add an (initially failing) test case for head_dim_qk != head_dim_v, and reshape the JAX baseline output based on v dims rather than q dims
* Fix context dims in general DPA in test_fused_attn
* Fix the output tensor dim to use the v head dim rather than the q head dim; add test cases where head_dim_qk != head_dim_v across data types and attention types
* Modify the JAX fused-attention unit test for head_dim_qk != head_dim_v
* Use the new FusedAttnRunner signature (separate hidden dims for qk and v) in the distributed fused-attention tests; clean up code
* Fix usage of the is_fused_attn_kernel_available signature in the distributed tests; remove an unnecessary assert

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
…VIDIA#1871)

* Support MXFP8 and handle empty matrices

Signed-off-by: Hua Huang <huah@nvidia.com>
* Fix an issue when Megatron Core uses the TE fused cross-entropy implementation
* Simplify the unit-test code
* Update tests/pytorch/test_parallel_cross_entropy.py

Signed-off-by: lit <lit@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Include previously (accidentally) excluded tests
* Execute run_test_multiprocessing_encoder in a nested bash shell and propagate the inner shell's exit code
* Adapt run_test_multiprocessing to handle segfaults

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…1844)

* TensorUsage, plus FP8 GEMM handling for all layouts on Blackwell

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…IA#1831)

* Use FP16 tolerances for tests with TF32
* Use uniform init instead of constant init
* Revert the constant-init test change, but reduce the value

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
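The rationale, sketched: TF32 keeps about 10 mantissa bits, the same as FP16, so TF32 matmul results are compared with FP16-level tolerances. The values below are illustrative, not the ones used in the tests:

```python
import torch

fp16_tols = dict(rtol=1e-3, atol=1e-3)  # assumed FP16-level tolerances
out = torch.randn(32, 32)
# Passes: the perturbation is within FP16-level tolerance.
torch.testing.assert_close(out, out + 5e-4, **fp16_tols)
```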
* Fix the cppunittest test.sh for editable installs
* Update tests/cpp/CMakeLists.txt
* Miscellaneous fixes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…1793)

* Finish the Python reference implementation for bulk allocation
* Get the C++ bulk allocation working (draft version), then clean up and add a license
* Use shared_ptr to automatically manage reference counts
* Fix miscellaneous training errors and handle the case where an expert gets zero tokens
* Use fused C++ function calls; experiment with reducing Python object construction time
* Fix a segfault in inference mode; fuse torch.split into the bulk allocation
* Rebase onto latest main; fix unit-test failures and lint errors
* Refactor create_tensor to use get_scale_shape; refactor quantize to call quantize_cpp
* Implement separate functions for multi-tensor quantize and split + multi-tensor quantize
* Update the grouped linear module with the fused split+quantize function
* Move the multi-tensor quantize function to cast.cpp; do not expose the quantizer helper function externally
* Fix linter warnings; revert the cuDNN frontend commit
* Fix corner cases with zero tokens; add comments

Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
* [PyTorch|common] Implement an unpadding kernel for FP8:
  1. Add a multi-tensor unpadding kernel
  2. Replace split+cat with the unpadding kernel in Fp8Padding and Fp8Unpadding
  3. Add unpadding-with-padding unit tests
* Add license; update padding.cu

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
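A sketch of the split+cat reference behaviour the fused kernel replaces (row counts are illustrative; this is not the kernel itself):

```python
import torch

actual_rows = [3, 5, 2]   # valid tokens per expert (illustrative)
padded_rows = [4, 8, 4]   # per-expert segments padded for FP8 alignment
x = torch.randn(sum(padded_rows), 16)

# Reference path: split into per-expert segments, drop pad rows, concatenate.
segments = torch.split(x, padded_rows, dim=0)
unpadded = torch.cat([seg[:n] for seg, n in zip(segments, actual_rows)], dim=0)
assert unpadded.shape[0] == sum(actual_rows)
```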
* Fixes (plus pre-commit.ci auto-fixes)

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…VIDIA#1843)

* Fixed the bug; lint fixes and a test change (plus pre-commit.ci auto-fixes)

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…#1898)

Use keyword arguments for jit's in_shardings and out_shardings.

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
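A minimal sketch of the keyword usage, assuming the motivation is that newer JAX releases make jit's arguments after `fun` keyword-only (passing `None` shardings here simply lets JAX choose):

```python
import jax
import jax.numpy as jnp

f = jax.jit(lambda x: x * 2, in_shardings=None, out_shardings=None)
print(f(jnp.arange(4)))  # [0 2 4 6]
```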
* Skip KV cache tests for sm89 with cuDNN < 9.12
* Fix test_numerics

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix: (1) the unit test ignores MLA; (2) a runtime error with the bshd format. Disable FP8 MLA attention + CP due to a correctness problem
* Only disable FP8 CP for MLA

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
…ted patterns;

- ag->fc2_wgrad
- ag->fc1_wgrad
- fc1_dgrad->rs
- ag->proj_wgrad
- ag->qkv_wgrad
- qkv_dgrad->rs