Can't add two float16 tensors on CUDA? #1117
Comments
ggml (CUDA) operators tend to be added as they are needed. The addition of two FP16 tensors has so far not seen any use, so it is not implemented. PRs welcome.
Thanks, fair enough. If this is okay to add, I'd be happy to take a stab at this. I'm not sure if this is a strong-enough justification, but:
Before introducing quantization to readers, fp16 is the obvious first way to reduce a model's size for inference. Thanks!
Thanks! Sorry about the long reply.

**Part 1**

I agree that

**Part 2**

I actually got fp16 ops working in ggml for CUDA (for the binbcast ops - add/sub/mul/div), and tested that it doesn't consume more VRAM than necessary and is about 35% faster than fp32. The change was fairly small. Tensor additions that took 4 GB for fp32 took 2 GB with fp16, and CUDA's peak VRAM usage didn't exceed the expected amount (using

General change in binbcast.cu (I could make my change cleaner):

```cpp
GGML_ASSERT(src1->type == GGML_TYPE_F32 || src1->type == GGML_TYPE_F16);
...
} else if (src0->type == GGML_TYPE_F16 && src1->type == GGML_TYPE_F16 && dst->type == GGML_TYPE_F16) {
    op()(src0, src1, dst, (const half *) src0_dd, (const half *) src1_dd, (half *) dst_dd, stream);
}
```

The next challenge was with getting

**Part 3**

My real motivation here (and behind learning ggml) is to have Stable Diffusion inference working at competitive speeds. I'm the maintainer of Easy Diffusion, and have been seriously considering stable-diffusion.cpp as our new backend. But its performance is half that of vanilla diffusers (pre-Flux models), so I dug further into it.

I saw that a lot of operations in sd.cpp are done in fp32, even if the weights are fp16, right from the initial latent. If I run a diffusers pipeline at fp32, its performance is similar to sd.cpp. If I run diffusers at fp16, the performance (it/sec) is double. In sd.cpp, there's no real perf difference between fp16 and fp32.

I admit there's an assumption here - that fp16 is the reason for the poor performance. It might also be due to implementation differences. But it just seems odd that the speed of diffusers doubles with fp16, while its fp32 perf is the same as sd.cpp. A vast number of the popular SD models are fp16.

My main blocker right now is performance. With sd.cpp it's half of what we get right now. I'm definitely happy to work on increasing fp16 support, it's fun to hack on ggml code. I found ggml really simple and intuitive, especially after the new backend API. :)

Thanks!
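For context, here is a minimal, self-contained CUDA sketch of an elementwise half-precision add. It is not the ggml kernel, just an illustration of the kind of work an F16 binbcast branch dispatches; the function name and sizes are made up, and native half arithmetic needs compute capability 5.3 or higher (so compile with e.g. `nvcc -arch=sm_60 add_f16.cu`; the RTX 3060 mentioned in this thread is 8.6):

```cuda
// Illustrative sketch only: elementwise FP16 add, not the ggml/binbcast kernel.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void add_f16(const half * a, const half * b, half * c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = __hadd(a[i], b[i]);   // native half add (requires SM >= 5.3)
    }
}

int main() {
    const int n = 1 << 20;
    half *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(half));
    cudaMallocManaged(&b, n * sizeof(half));
    cudaMallocManaged(&c, n * sizeof(half));
    for (int i = 0; i < n; ++i) {
        a[i] = __float2half(1.5f);
        b[i] = __float2half(2.25f);
    }
    add_f16<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", __half2float(c[0]));   // expect 3.75
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Since each element is 2 bytes instead of 4, the same tensors need half the VRAM and the kernel moves roughly half the data, which is consistent with the VRAM numbers above and leaves room for a meaningful speedup on a bandwidth-bound op.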
If your goal is to optimize performance, use NVIDIA Nsight Systems to determine how much runtime each kernel takes up, because that imposes a hard limit on how much performance can be gained from optimization. The performance bottlenecks for neural networks are usually either matrix multiplications or convolutions. IIRC Stable Diffusion uses convolutional layers, which are as of right now poorly supported in ggml. Instead of a dedicated convolution operator they are instead converted to matrix multiplications using
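To make the conv-to-matmul point concrete, below is a small, generic im2col-style sketch (plain C-style code, not ggml's implementation; the function name and layout are made up for illustration). It unrolls each receptive field of the input into a column so that a k×k convolution becomes a single matrix multiplication between the flattened kernel and the unrolled input:

```cpp
// Sketch of im2col for a single-channel 2D input, stride 1, no padding.
// Afterwards, a k*k convolution is one (1 x k*k) * (k*k x out_h*out_w) matmul.
#include <stdio.h>

// src: h x w input; cols: (k*k) x (out_h*out_w) output; both row-major
static void im2col(const float * src, float * cols, int h, int w, int k) {
    const int out_h = h - k + 1;
    const int out_w = w - k + 1;
    for (int ky = 0; ky < k; ++ky) {
        for (int kx = 0; kx < k; ++kx) {
            for (int oy = 0; oy < out_h; ++oy) {
                for (int ox = 0; ox < out_w; ++ox) {
                    const int row = ky * k + kx;       // which kernel tap
                    const int col = oy * out_w + ox;   // which output pixel
                    cols[row * (out_h * out_w) + col] = src[(oy + ky) * w + (ox + kx)];
                }
            }
        }
    }
}

int main(void) {
    const float src[3 * 3] = {1, 2, 3,
                              4, 5, 6,
                              7, 8, 9};
    float cols[4 * 4];   // (2*2 kernel taps) x (2*2 output pixels)
    im2col(src, cols, 3, 3, 2);
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            printf("%4.0f", cols[r * 4 + c]);
        }
        printf("\n");
    }
    return 0;
}
```

The cost of this approach is the extra memory traffic from materializing the unrolled matrix, which is part of why a dedicated convolution kernel can be faster than conv-as-matmul.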
Also take a look at #971.
Hi, I'm new to ggml, so apologies if I'm missing something obvious.
I wrote a simple program to add two float32 tensors in ggml using CUDA, and that works fine.
But when I changed the two tensor types to `GGML_TYPE_F16` and tried to add them, I got a GGML assertion error:

```
ggml-cuda\binbcast.cu:297: GGML_ASSERT(src1->type == GGML_TYPE_F32) failed
```
Key snippets (and I've included the complete program at the bottom):
I'm sending float16 data, but that doesn't seem to matter.
I have an NVIDIA 3060 12 GB, with compute capability 8.6. PyTorch works just fine in float16 for me.
Digging into the code, it looks like a lot of operations enforce F32 for the second tensor (add, sub, mul, div, etc.).
Am I missing something, and if not, why can't we add two float16 tensors using ggml?
Thanks for your help! :)
Complete program
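For readers who want to reproduce the assertion, here is a minimal sketch of the kind of program described above, written assuming the ggml backend API (ggml_backend_cuda_init, ggml_backend_alloc_ctx_tensors, etc.). It is not the actual program attached to the issue; tensor sizes and values are made up. With both inputs as GGML_TYPE_F16 it hits the binbcast.cu assert on the CUDA backend, while switching both types to GGML_TYPE_F32 runs to completion:

```cpp
// Illustrative repro sketch, not the program from this issue.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"
#include <cstdio>
#include <vector>

int main() {
    const int64_t n = 8;

    // the context only holds tensor/graph metadata; data lives in the backend buffer
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    ggml_backend_t backend = ggml_backend_cuda_init(0);   // CUDA device 0
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // upload fp16 input data
    std::vector<ggml_fp16_t> ha(n), hb(n);
    for (int64_t i = 0; i < n; ++i) {
        ha[i] = ggml_fp32_to_fp16(1.0f);
        hb[i] = ggml_fp32_to_fp16(2.0f);
    }
    ggml_backend_tensor_set(a, ha.data(), 0, ggml_nbytes(a));
    ggml_backend_tensor_set(b, hb.data(), 0, ggml_nbytes(b));

    // with F16 inputs, this is where the binbcast.cu assert fires
    ggml_backend_graph_compute(backend, gf);

    std::vector<ggml_fp16_t> hc(n);
    ggml_backend_tensor_get(c, hc.data(), 0, ggml_nbytes(c));
    printf("c[0] = %f\n", ggml_fp16_to_fp32(hc[0]));       // expect 3.0

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```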