sync : ggml #12104
Conversation
… backend (ggml/1121)

* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend
* Add fp16 support for add/sub/mul/div on the CPU backend
* Add test cases for fp16 add/sub/mul/div
It is used by the Whisper talk-llama example.

Co-authored-by: Petter Reinholdtsen <[email protected]>
* Add small comment re: VSX to readme

Co-authored-by: midnight <[email protected]>
* whisper : support GGML_BACKEND_DL
* fix DTW crash
* whisper.objc : fix build - add ggml-cpp.h

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* Support fp16 unary operations in the CUDA backend
* cpu: increase fp16 support for unary operators in the CPU backend
* cuda: increase fp16 support for unary operators in the CUDA backend
* Add test cases for fp16 unary operators
* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing
* metal: fix PR comments for unary op support after fp16 unary tests
ggml-ci
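For context, a minimal host-side sketch of exercising the fp16 elementwise paths the commits above add, via ggml's public C API (editorial; assumes the ggml.h API at the time of this sync, and that `ggml_set_f32` converts the scalar to the tensor's type when filling):

```c
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // two small F16 tensors - the new code paths let add/sub/mul/div
    // operate on them directly instead of requiring F32 inputs
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, 10);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, 10);
    ggml_set_f32(a, 1.5f);
    ggml_set_f32(b, 0.5f);

    struct ggml_tensor * c = ggml_add(ctx, a, b); // F16 + F16 -> F16

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    ggml_free(ctx);
    return 0;
}
```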
@cmdr2 The CUDA builds are failing with the following error after the recent changes (ggml-org/ggml#1125): https://github.com/ggml-org/llama.cpp/actions/runs/13583176617/job/37972693333?pr=12104#step:7:140

Any suggestions on how to fix it?
Taking a look.
Hi, some notes:
SILU_BACK results (for {10, 1, 1, 1} in the test):
Different enough to cross the threshold. If this reasoning makes sense, I could remove the …

Thanks
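For reference, a standard statement (editorial, not quoted from the thread) of what SILU_BACK differentiates; the `x`-scaled sigmoid term is a plausible reason a few fp16-rounded elements in even a {10, 1, 1, 1} tensor can push the error past a tight threshold:

$$\operatorname{silu}(x) = x\,\sigma(x), \qquad \operatorname{silu}'(x) = \sigma(x)\bigl(1 + x\,(1 - \sigma(x))\bigr)$$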
Thanks, sounds good.
…s_op (ggml/1129)

ggml-ci
ggml-ci
(force-pushed from e46c9f8 to 8ffa8be)
Update: The CUDA 11.7 failures are due to the lack of operator overloading for `half` in that toolkit, and it fails only when compiling for arch 50. Arch 60 onwards compiles and works fine (even with CUDA 11.7).

* CUDA 12.1 doc - no operator overloading
* CUDA 12.2 doc - overloaded operators

I'm still working on this.
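A minimal sketch of the failure mode described above (illustrative kernels, not from the PR; older toolkits only define the `half` arithmetic operators for device code at compute capability 5.3+, while the conversion intrinsics work everywhere):

```cuda
#include <cuda_fp16.h>

// Fails to compile with CUDA 11.7 when targeting sm_50: that toolkit has no
// operator+ for half below __CUDA_ARCH__ 530.
__global__ void add_f16_bad(const half * x, const half * y, half * dst, int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < k) {
        dst[i] = x[i] + y[i]; // error on sm_50 / CUDA 11.7
    }
}

// Portable alternative: round-trip through float. Conversions are available
// on every arch, and the kernel stays memory bound either way.
__global__ void add_f16_ok(const half * x, const half * y, half * dst, int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < k) {
        dst[i] = __float2half(__half2float(x[i]) + __half2float(y[i]));
    }
}
```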
tl;dr - Since this is blocking ggml's sync with other repos, maybe we can put the new FP16 function calls behind a compile-time flag?

Short version: I don't like this solution tbh (I'd like always-on fp16 support), and would prefer a better solution, if available. A better solution might be to replicate the … Since this is blocking ggml's sync with other repos, maybe we can put the new FP16 lines behind the flag.

Details: E.g., using CUDA 11.7's …
And even if I use …

So I suppose we shouldn't compile the half version for arch 50 anyway. Unfortunately it looks like …
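For context, the conventional arch guard for native half arithmetic looks like the sketch below (editorial; per the comments above this alone did not unblock the CUDA 11.7 / sm_50 build, which is what motivates the float-cast approach that follows):

```cuda
#include <cuda_fp16.h>

// Only compile the native-half path where hardware supports it (compute
// capability 5.3+); otherwise emulate through fp32.
__global__ void add_f16(const half * x, const half * y, half * dst, int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
    dst[i] = __hadd(x[i], y[i]); // native half add
#else
    dst[i] = __float2half(__half2float(x[i]) + __half2float(y[i]));
#endif
}
```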
It should work if you cast everything to float manually. You are not going to get any extra performance from using F16 math with one element at a time anyway.
@slaren Are you referring to the binbcast.cu approach (whose kernel takes float args), which still does fp16 operations without consuming extra VRAM? With binbcast, fp16-fp16 addition is about 30-35% faster for me than fp32-fp32. Same for clamp. I just tested clamp (the latest implementation) with a 1 GB tensor, and it takes 8 ms with fp16 and 13 ms with fp32, on my 3060 12 GB.
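A back-of-the-envelope check on those numbers (editorial, with assumptions: "a 1 GB tensor" means the same element count in both dtypes, and clamp reads and writes each element exactly once):

$$t \approx \frac{\text{bytes moved}}{\text{bandwidth}} \quad\Rightarrow\quad \frac{t_{\mathrm{f32}}}{t_{\mathrm{f16}}} \approx \frac{2 \cdot 4n}{2 \cdot 2n} = 2$$

So a perfectly memory-bound clamp would be about 2x faster in fp16; the measured 13 ms vs 8 ms (about 1.6x) points the same way without reaching the ideal ratio.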
What I mean is to change the implementation of clamp to something like this:

`dst[i] = (T)fminf(fmaxf((float)x[i], (float)min), (float)max);`

You get better performance with F16 tensors since this kernel is entirely memory bound, but the math itself is not any faster with F16, unless you are computing multiple values at a time (with …).
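Expanded into a full kernel, the suggestion might look like this sketch (close to the pattern ggml-cuda's clamp.cu uses, but simplified here; one template instantiated for both `float` and `half`):

```cuda
#include <cuda_fp16.h>

// Arithmetic is always fp32, so this compiles on every arch and toolkit,
// while F16 tensors still move half the bytes through memory.
template <typename T>
static __global__ void clamp_kernel(const T * x, T * dst,
                                    const float min, const float max, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = (T)fminf(fmaxf((float)x[i], min), max);
}
```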
@slaren Would you suggest just casting to float in all those unary operators as well? I'll have to try that, not sure.
Yes, that's probably a good idea. All of the unary operators are almost certainly memory bound, so I don't think it would even be worth writing specific versions for F16.
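Applied to the unary operators, the same idea could look like this (a sketch; `op_silu` and `unary_kernel` are illustrative names, not ggml's actual ones):

```cuda
#include <cuda_fp16.h>

// Illustrative functor: silu computed in fp32 regardless of storage type.
struct op_silu {
    __device__ float operator()(float x) const {
        return x / (1.0f + expf(-x));
    }
};

// One kernel template serving both F32 and F16 tensors: cast in, compute
// in fp32, cast out. Still memory bound, so F16 wins on bandwidth alone.
template <typename T, typename Op>
static __global__ void unary_kernel(const T * x, T * dst, const int k, Op op) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < k) {
        dst[i] = (T)op((float)x[i]);
    }
}
```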
@slaren Thanks, I'm giving that a try now.

@ggerganov In the meantime, I've also pushed a "plan B" to my fp16-fix branch, which puts the new FP16 unary code paths behind a compile-time flag. I've tested that …

This is simply a "plan B" if the build needs to be unblocked urgently. I'd definitely prefer a better solution that doesn't need to do this. I'm looking at @slaren's suggestion now.
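The "plan B" shape, sketched with a hypothetical flag name (the actual name and polarity live in the fp16-fix branch referenced above; `GGML_CUDA_FP16_UNARY` here is a placeholder):

```cuda
// GGML_CUDA_FP16_UNARY is a hypothetical placeholder, not the real flag.
// Gating support reporting like this lets the backend fall back to the
// pre-existing fp32-only behaviour when the flag is not defined.
static bool supports_f16_unary(void) {
#ifdef GGML_CUDA_FP16_UNARY
    return true;   // new fp16 unary code paths compiled in
#else
    return false;  // fp16 unary paths compiled out
#endif
}
```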
No worries, there is nothing urgent. slaren's suggestion should work.
Thanks. As a thought, would it be possible to use the same runners on ggml's branches too? It would help catch problems earlier and make syncs less likely to bring surprises.
Submitted a PR with the suggested change, thanks - ggml-org/ggml#1130