cuda/cpu: Increase support for fp16 unary operations #1125

Merged: 7 commits, Feb 28, 2025

Conversation

cmdr2 (Collaborator) commented on Feb 26, 2025:

// apologies for adding even more lines to ggml-cpu.c, I will templatize these in my next PR (to shrink the code again). I kept them as simple copies for easier readability during review, and to keep the PR focused on one thing.

This PR increases support for pure float16 operations (in CUDA and CPU) for the following unary operations: abs, sgn, neg, step, tanh, gelu, silu, silu_back, gelu_quick, relu, sigmoid, hardsigmoid, exp, hardswish, leaky_relu, sqr, sqrt, sin, cos, log, clamp.

Sorry for the large PR, but it's actually just three simple things (repeated several times):

  1. cpu: copy the fp32 version and use the corresponding fp16 operation (a minimal sketch follows this list)
  2. cuda: templatize the unary kernels to work with float or half
  3. cuda: add kernels for abs, sgn and log
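
As an illustration of item 1, here is a minimal sketch of what such an fp16 copy looks like (the op and the exact function name are illustrative, not a verbatim excerpt from the PR): each element is widened to fp32, the fp32 math is applied, and the result is narrowed back to fp16.

inline static void ggml_vec_sqr_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        const float v = GGML_FP16_TO_FP32(x[i]); // widen to fp32
        y[i] = GGML_FP32_TO_FP16(v*v);           // apply the op, narrow back to fp16
    }
}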

Also: Added test cases for these operations in test-backend-ops, and updated supports_op for Metal (to avoid CI failure).

test-backend-ops passes.

Thanks!
PS: unary.cu could also be templatized further, since most of the functions are duplicates.

Comment on lines +1435 to +1439
inline static void ggml_vec_neg_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    for (int i = 0; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(-GGML_FP16_TO_FP32(x[i]));
    }
}
slaren (Member) commented:

If you are converting entire vectors to F16, there may be a significant performance advantage in using ggml_fp32_to_fp16_row, since it can use F16C to convert multiple values in a single cycle. The inverse (ggml_fp16_to_fp32_row) could also be accelerated with F16C, but it does not have such an implementation at the moment. However, the F16C implementation of these functions would need to be moved to the CPU backend; currently it is in ggml.c, which normally isn't compiled with support for these instructions (a leftover from when the CPU backend was split into a separate target).

Not very important right now, just something to consider. When the code is ported to C++, it should be possible to abstract these conversions, and reuse the same code for all F16 ops.
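
For context, a rough sketch of what an F16C-accelerated fp16-to-fp32 row conversion could look like (illustrative only; the function name is hypothetical and this is not how ggml currently implements it):

#include <immintrin.h> // F16C intrinsics: needs -mf16c (GCC/Clang) or /arch:AVX2 (MSVC)
#include <stdint.h>

static void fp16_to_fp32_row_f16c(const uint16_t * x, float * y, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        // load 8 packed fp16 values and widen all of them to fp32 with one vcvtph2ps
        const __m128i h = _mm_loadu_si128((const __m128i *)(x + i));
        _mm256_storeu_ps(y + i, _mm256_cvtph_ps(h));
    }
    for (; i < n; ++i) {
        y[i] = _cvtsh_ss(x[i]); // scalar tail
    }
}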

cmdr2 (Collaborator, Author) replied on Feb 27, 2025:

Good point, thanks! Strangely enough, the row-based version seems to be 10-15% slower than converting element-by-element.

Maybe MSVC isn't enabling F16C? Not sure. I'm using cmake -B build; cmake --build build --config Release on Windows with the MSVC command line. CMake says: Adding CPU backend variant ggml-cpu: /arch:AVX512 GGML_AVX512

I'm using a 1 GB fp16 tensor, and it takes 1088ms without row optimization, and 1330ms with row optimization.

I used this for vec_neg_f16:

inline static void ggml_vec_neg_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
    float * y_f32 = (float *)malloc(sizeof(float) * n);
    for (int i = 0; i < n; ++i) {
        y_f32[i] = -GGML_FP16_TO_FP32(x[i]);
    }
    ggml_fp32_to_fp16_row(y_f32, y, n);
    free(y_f32); // release the temporary fp32 buffer
}

I'm happy to continue optimizing it in this PR, if you wish. Or we can pursue the optimization in another PR, after using templates, so that all the functions would be optimized in one shot, like you said.

I'm okay with either, please let me know. Thanks!

slaren (Member) replied:

Did you move the function to the CPU backend? Otherwise it will be built without the /arch flag.

Comment on lines +3 to +12
template <class T>
static __global__ void op_abs(const T * x, T * dst, const int k) {
    const int i = blockDim.x*blockIdx.x + threadIdx.x;

    if (i >= k) {
        return;
    }

    dst[i] = fabsf(x[i]);
}
slaren (Member) commented:

Not super important right now, but it would be nice to generalize all of these functions to a single kernel that takes the unary function to perform as a parameter. This could be done in a similar way as it is done in binbcast.cu.
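
As a rough sketch of that suggestion (illustrative only; binbcast.cu uses its own, more elaborate machinery), a single kernel could be templated on the element type and on the device function to apply:

#include <cuda_fp16.h>

template <class T, float (*op)(float)>
static __global__ void op_unary(const T * x, T * dst, const int k) {
    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    // compute in fp32 and convert back to T (float or half) on assignment
    dst[i] = (T)op((float)x[i]);
}

static __device__ __forceinline__ float op_abs_impl(float x) { return fabsf(x); }

// hypothetical launch: op_unary<half, op_abs_impl><<<num_blocks, block_size, 0, stream>>>(x, dst, k);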

cmdr2 (Collaborator, Author) replied:

Yes, I'm working on that PR next, using binbcast.cu as the template :)

For the unary operators in the CPU backend as well as in unary.cu.

I kept these functions separate for now to keep this a single-focus PR, and to do the refactoring in a second PR.

cmdr2 (Collaborator, Author) commented on Feb 27, 2025:

@slaren Thanks for your comments! I've replied to them, please let me know if you'd like any changes.

ggerganov (Member) left a comment:

You can fix the Metal backend with this patch:

diff --git a/src/ggml-metal/ggml-metal.m b/src/ggml-metal/ggml-metal.m
index 63e944e..ff88d2c 100644
--- a/src/ggml-metal/ggml-metal.m
+++ b/src/ggml-metal/ggml-metal.m
@@ -1210,11 +1210,13 @@ static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_contex
         case GGML_OP_TRANSPOSE:
         case GGML_OP_PERMUTE:
         case GGML_OP_CONCAT:
+            return true;
         case GGML_OP_ADD:
         case GGML_OP_SUB:
-        case GGML_OP_ACC:
         case GGML_OP_MUL:
         case GGML_OP_DIV:
+            return op->src[0]->type == GGML_TYPE_F32;
+        case GGML_OP_ACC:
         case GGML_OP_REPEAT:
         case GGML_OP_SCALE:
         case GGML_OP_CONV_TRANSPOSE_1D:
@@ -1225,8 +1227,9 @@ static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_contex
         case GGML_OP_SQRT:
         case GGML_OP_SIN:
         case GGML_OP_COS:
-        case GGML_OP_LOG:
             return ggml_is_contiguous(op->src[0]) && op->src[0]->type == GGML_TYPE_F32;
+        case GGML_OP_LOG:
+           return false; // TODO: implement
         case GGML_OP_SUM_ROWS:
         case GGML_OP_SOFT_MAX:
         case GGML_OP_GROUP_NORM:
@@ -1256,11 +1259,12 @@ static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_contex
         case GGML_OP_UPSCALE:
         case GGML_OP_PAD:
         case GGML_OP_PAD_REFLECT_1D:
-        case GGML_OP_ARANGE:
         case GGML_OP_TIMESTEP_EMBEDDING:
         case GGML_OP_ARGSORT:
         case GGML_OP_LEAKY_RELU:
             return op->src[0]->type == GGML_TYPE_F32;
+        case GGML_OP_ARANGE:
+            return true;
         case GGML_OP_FLASH_ATTN_EXT:
             if (op->src[1]->type != op->src[2]->type) {
                 return false;

cmdr2 (Collaborator, Author) commented on Feb 27, 2025:

You can fix the Metal backend with this patch

@ggerganov Thanks, I appreciate the patch and help! :) I've pushed it to the PR's branch.

Sorry, yeah, it's a bit embarrassing for me to commit changes for Metal blindly :) In the future, I'll try to get hold of a cloud runner to test on, if possible. My ancient MacBook Pro (early 2013) fails to compile ggml with Metal, since it's stuck at macOS 10.15 (Catalina).

cmdr2 (Collaborator, Author) commented on Feb 27, 2025:

Also, it looks like the 'test-macos-metal' CI on PRs is only running on BLAS, instead of Metal. For example, this is the CI run for my latest commit in this PR: https://github.com/ggml-org/ggml/actions/runs/13564669680/job/37914974037?pr=1125

Testing 2 devices
Backend 1/2: BLAS
...
Backend 2/2: CPU
...

ggerganov (Member) replied:

Also, it looks like the 'test-macos-metal' CI on PRs is only running on BLAS, instead of Metal.

Yes, the GitHub runners don't have Metal - this CI workflow is incorrectly named.

Now that you are a collaborator on the project, you can push branches in this repo and they will run the ggml-ci, which covers a wide range of hardware, including Metal on Apple Silicon.

cmdr2 (Collaborator, Author) commented on Feb 27, 2025:

Thanks! That's very helpful, and I appreciate the trust. I won't misuse it.

ggerganov merged commit 0d1ea2e into ggml-org:master on Feb 28, 2025
3 checks passed