
sync : llama.cpp #1122

Merged: 29 commits merged into master from sync-llama.cpp-25-02-25 on Feb 25, 2025

Conversation

ggerganov (Member)

No description provided.

IMbackK and others added 29 commits February 25, 2025 11:44
* ggml : 2x speed for WASM by optimizing SIMD (a minimal sketch of the style follows this commit)

* fix bad merging

* rm trailing spaces

* rm redundant clamp

* better quantize_row_q8_K

Co-authored-by: camel-cdr <[email protected]>

* remove memset that causes buffer overflow
Co-authored-by: camel-cdr <[email protected]>

---------

Co-authored-by: camel-cdr <[email protected]>
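
For context, a minimal sketch of the WASM SIMD style this commit applies, written against the standard `wasm_simd128.h` intrinsics (clang/emscripten, `-msimd128`). The function below is illustrative only; the actual patch optimizes ggml's quantized kernels, not this exact code.

```c
#include <wasm_simd128.h>

// Illustrative f32 dot product in the WASM SIMD style.
float dot_f32_wasm(const float *a, const float *b, int n) {
    v128_t acc = wasm_f32x4_splat(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        v128_t va = wasm_v128_load(a + i);
        v128_t vb = wasm_v128_load(b + i);
        acc = wasm_f32x4_add(acc, wasm_f32x4_mul(va, vb));
    }
    // horizontal sum of the four lanes
    float sum = wasm_f32x4_extract_lane(acc, 0)
              + wasm_f32x4_extract_lane(acc, 1)
              + wasm_f32x4_extract_lane(acc, 2)
              + wasm_f32x4_extract_lane(acc, 3);
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
}
```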
* ggml-cpu : add chunking support to mul_mat_id

* allocate chunk counter in wdata; parallelize src1 quantization by column to allow parallelization even when there is only one row (a sketch of the chunk-counter pattern follows this commit)

* disable for arm

* cleanup

* better way to disable for arm

* fix uninitialized counter when using 1 thread only

* revert test-backend-ops changes
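
A minimal sketch of the chunked work distribution described above, assuming a shared atomic counter placed in scratch memory (wdata in ggml). All names here are illustrative, not ggml's actual API.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int current_chunk;  // shared counter, lives in scratch (wdata)
    int        n_chunks;
} chunk_state;

// Each thread claims the next unprocessed chunk until none remain.
// Note: the counter must be initialized before any worker runs --
// that is the "uninitialized counter when using 1 thread" fix above.
static void worker(chunk_state *st, int chunk_size, int n_rows,
                   void (*process_rows)(int start, int end)) {
    for (;;) {
        const int chunk = atomic_fetch_add(&st->current_chunk, 1);
        if (chunk >= st->n_chunks) {
            break;  // all chunks claimed
        }
        const int start = chunk * chunk_size;
        const int end   = start + chunk_size < n_rows
                        ? start + chunk_size : n_rows;
        process_rows(start, end);
    }
}
```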
* musa: Update MUSA SDK version to rc3.1.1

Signed-off-by: Xiaodong Ye <[email protected]>

* musa: Remove workaround in PR #10042

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
* mm subgroup size

* upload vulkan x86 builds
* Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX

* Optimize mul_sum_i8_pairs_float for LoongArch ASX

* Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX
* opencl: fix `ROPE`

* opencl: fix `SOFT_MAX`

* Add fp16 variant

* opencl: enforce subgroup size for `soft_max`
* vulkan: initial support for IQ1_S and IQ1_M quantizations

* vulkan: define MMV kernels for IQ1 quantizations

* devops: increase timeout of Vulkan tests again

* vulkan: simplify ifdef for init_iq_shmem
* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci
* vulkan: support memset_tensor

* vulkan: support GGML_OP_SUM

* vulkan: implement GGML_OP_ARGMAX

* vulkan: implement GGML_OP_SUB

* vulkan: implement GGML_OP_COUNT_EQUAL

* vulkan: implement GGML_OP_OPT_STEP_ADAMW

* vulkan: fix check_results RWKV_WKV6 crash and memory leaks

* vulkan: implement GGML_OP_REPEAT_BACK

* tests: remove invalid test-backend-ops REPEAT_BACK tests

* vulkan: fix COUNT_EQUAL memset using a fillBuffer command
* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <[email protected]>
…11917)

* Added SVE implementation for the Q3_K kernel in ggml-cpu-quants.c (a minimal SVE sketch follows this commit)

* Improved formatting of code in ggml-cpu-quants.c

* style : minor fixes

* style : less whitespaces

* style : ptr spacing

---------

Co-authored-by: vithulep <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
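
As background, a minimal SVE sketch using ACLE intrinsics (compile with `-march=armv8-a+sve`), showing only the predicated-loop style. The actual Q3_K kernel additionally unpacks 3-bit weights and applies per-block scales, so treat this as an assumption-laden illustration.

```c
#include <arm_sve.h>
#include <stdint.h>

// Illustrative int8 dot product with SVE; inactive (tail) lanes load
// as zero under the predicate, so they contribute nothing to the sum.
int32_t dot_s8_sve(const int8_t *a, const int8_t *b, int n) {
    svint32_t acc = svdup_s32(0);
    int i = 0;
    while (i < n) {
        const svbool_t pg = svwhilelt_b8((int32_t) i, (int32_t) n);
        const svint8_t va = svld1_s8(pg, a + i);
        const svint8_t vb = svld1_s8(pg, b + i);
        acc = svdot_s32(acc, va, vb);  // 4-way int8 dot into int32 lanes
        i += (int) svcntb();           // vector length in bytes
    }
    return (int32_t) svaddv_s32(svptrue_b32(), acc);
}
```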
* ggml-cpu: Add CPU backend support for KleidiAI library

* Add environment variable GGML_KLEIDIAI_SME

* Add support for multithread LHS conversion

* Switch kernel selection order to dotprod and i8mm

* updates for review comments

* More updates for review comments

* Reorganize and rename KleidiAI files

* Move ggml-cpu-traits.h to source file

* Update cmake for SME build and add alignment for SME

* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
* MUSA: support ARM64 and enable __dp4a etc.

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <[email protected]>
* CUDA: correct the lowest Maxwell supported by CUDA 12

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add SIMD for s390x using vector intrinsics (a minimal sketch of this style follows this commit message)

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <[email protected]>
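
For readers unfamiliar with the s390x vector extension, a minimal sketch in the spirit of `ggml_vec_dot_f32` using `vecintrin.h` (GCC/Clang with `-mzvector`; fp32 vector FMA assumes VXE, i.e. z14 or later). This is an illustration, not the exact ggml code.

```c
#include <vecintrin.h>

// Illustrative f32 dot product with z/Architecture vector intrinsics.
float vec_dot_f32_s390x(const float *a, const float *b, int n) {
    __vector float acc = vec_splats(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        const __vector float va = vec_xl(0, a + i);
        const __vector float vb = vec_xl(0, b + i);
        acc = vec_madd(va, vb, acc);  // fused multiply-add (VXE for fp32)
    }
    float sum = acc[0] + acc[1] + acc[2] + acc[3];  // horizontal sum
    for (; i < n; ++i) sum += a[i] * b[i];          // scalar tail
    return sum;
}
```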

* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <[email protected]>

* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <[email protected]>

* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <[email protected]>

* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <[email protected]>

* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <[email protected]>

* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <[email protected]>

* ggml: remove test.py

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add float, double, and long vector data types

Signed-off-by: Aaron Teo <[email protected]>

* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <[email protected]>

* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add unroll and prefetch to Q8_0 (a sketch of the pattern follows this commit)

Throughput increases of 3.86% for prompt processing and 32.22% for token generation.

Signed-off-by: Aaron Teo <[email protected]>
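
A minimal sketch of the unroll-and-prefetch pattern behind that speedup, over generic block data; the real change is inside the Q8_0 dot-product kernel, and the prefetch distance there is tuned empirically.

```c
// Illustrative 4-way unrolled summation with software prefetch.
// __builtin_prefetch is a GCC/Clang builtin: (addr, rw = 0 for read, locality).
static float sum_blocks(const float *x, int n_blocks, int block_size) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int ib = 0;
    for (; ib + 4 <= n_blocks; ib += 4) {
        // prefetch the next group of blocks to hide memory latency
        __builtin_prefetch(x + (ib + 4) * block_size, 0, 1);
        const float *p = x + ib * block_size;
        for (int j = 0; j < block_size; ++j) {
            // independent accumulators let the CPU overlap the adds
            s0 += p[0 * block_size + j];
            s1 += p[1 * block_size + j];
            s2 += p[2 * block_size + j];
            s3 += p[3 * block_size + j];
        }
    }
    float sum = s0 + s1 + s2 + s3;
    for (; ib < n_blocks; ++ib)  // leftover blocks, scalar
        for (int j = 0; j < block_size; ++j)
            sum += x[ib * block_size + j];
    return sum;
}
```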

* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <[email protected]>

* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <[email protected]>

* ggml: Q5_K y0 wrong signedness

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <[email protected]>

* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <[email protected]>

* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <[email protected]>

* ggml: change ggml_aligned_malloc alignment to 256 (see the sketch after this commit)

256 bytes is the cache line size on s390x platforms

Signed-off-by: Aaron Teo <[email protected]>
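
A minimal sketch of 256-byte-aligned allocation under the assumption stated above (256-byte cache lines on s390x). It uses POSIX `posix_memalign`; ggml's actual `ggml_aligned_malloc` has its own platform handling.

```c
#define _POSIX_C_SOURCE 200112L  // for posix_memalign
#include <stdlib.h>

// Illustrative cache-line-aligned allocation for s390x.
// The returned pointer is released with the ordinary free().
void *aligned_malloc_256(size_t size) {
    void *ptr = NULL;
    if (posix_memalign(&ptr, 256, size) != 0) {
        return NULL;  // allocation failed
    }
    return ptr;
}
```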

* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <[email protected]>

* ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <[email protected]>

* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Jinyang He <[email protected]>
Co-authored-by: junchao-zhao <[email protected]>
* optimize performance by reordering for Intel GPU

* detect hw type, save the opt feature, and print it

* correct name

* optimize the graph once at compute time, record the opt status in tensor->extra, and make CI pass

* add env variable GGML_SYCL_DISABLE_OPT for debugging (a usage sketch follows this commit)

* use syclex::architecture to replace the custom hw define; update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* move getrows functions to separate files

* fix global variables

---------

Co-authored-by: arthw <[email protected]>
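
A minimal sketch of the environment-variable kill switch pattern that GGML_SYCL_DISABLE_OPT follows, with the read cached after first use. The helper name and exact semantics (any value other than "0" disables) are assumptions, not the backend's actual code.

```c
#include <stdlib.h>
#include <string.h>

// Illustrative: gate an optimized path behind an env-var kill switch.
// Usage: if (!sycl_opt_disabled()) { run reordered path } else { fallback }
static int sycl_opt_disabled(void) {
    static int cached = -1;  // -1 = not read yet
    if (cached < 0) {
        const char *v = getenv("GGML_SYCL_DISABLE_OPT");
        cached = (v != NULL && strcmp(v, "0") != 0) ? 1 : 0;
    }
    return cached;
}
```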
* opencl: fix small shape gemv, remove unused extensions

* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size

* opencl: fix for token length < 4

* opencl: use wave size of 64 for all Adreno GPUs

---------

Co-authored-by: Shawn Gu <[email protected]>
Co-authored-by: Skyler Szot <[email protected]>
metal: use dequantize_q templates

---------

Co-authored-by: Georgi Gerganov <[email protected]>
ggerganov merged commit c21d976 into master on Feb 25, 2025 (9 checks passed).
ggerganov deleted the sync-llama.cpp-25-02-25 branch on February 25, 2025 at 11:33.