Fix embeddings with quantized models #601

Merged (1 commit) on Mar 1, 2025
Conversation

stduhpf (Contributor) commented on Feb 22, 2025

fixes #600

vmobilis (Contributor) commented on Feb 22, 2025

@stduhpf, will that fix (ggml-org/ggml#1118) still be useful after this patch, or does it become unnecessary?

Update: @stduhpf, please ignore this strange question, I've already checked it myself. 😅
Yes, #1118 is not needed anymore; the tensors are always GGML_TYPE_F32 now.

What's more, my patch is probably wrong, because it produces an oversaturated image compared with your result, which looks more natural.

vmobilis (Contributor) commented:

@stduhpf, on second thought, disabling quantization for the conditioner has a disadvantage: CLIP will consume more memory – depending on the quantization, significantly more.
So maybe this should be at least optional.

But if it doesn't complicate things too much, @stduhpf, you've brought up a good idea.
Considering that stable-diffusion.cpp is able to quantize parts of the model on the fly, setting the quantization separately for the conditioner and the diffuser would produce 24² = 576 variants for the same prompt and sampler.
For example:

- clip f32, diffuser f32
- clip f32, diffuser q4_1
- clip q4_1, diffuser q4_1

stduhpf (Contributor, Author) commented on Feb 24, 2025

> @stduhpf, on second thought, disabling quantization for the conditioner has a disadvantage: CLIP will consume more memory – depending on the quantization, significantly more. So maybe this should be at least optional.

Well, with this PR it's not the whole CLIP model being forced to f32, "just" the token_embedding.weight tensor, which is about 30% of CLIP-L's total params, but only 10% of CLIP-G's.

There would be about a hundred megabytes of compute buffer to save with proper support for quantized concat(), compared to forcing f32. But then, the custom embeddings would get quantized too, which might cause significant quality loss... (maybe this could explain why your result is oversaturated?)
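For context, here is a minimal sketch (my own illustration, not the actual PR diff) of the kind of per-tensor type selection this implies, assuming the storage type is decided from the tensor name during on-the-fly quantization; `choose_tensor_type` and its signature are hypothetical:

```cpp
// Minimal sketch, not the actual PR code: keep the token embedding table in F32
// when choosing the storage type for each tensor during on-the-fly quantization.
// choose_tensor_type() is a hypothetical helper; wtype is the user-requested type.
#include <string>
#include "ggml.h"

static enum ggml_type choose_tensor_type(const std::string & name, enum ggml_type wtype) {
    // The embedding table is later combined with custom (textual-inversion)
    // embeddings, which only works on float data, so it stays in F32.
    if (name.find("token_embedding.weight") != std::string::npos) {
        return GGML_TYPE_F32;
    }
    return wtype; // every other tensor follows the requested quantization
}
```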

Edit: I just tried your ggml patch; I still get black images when token_embedding.weight is quantized to f16, both on Vulkan and CPU. And Vulkan just crashes when it's quantized to non-float quants (CPU output is also black)... Not sure what I'm doing wrong.

vmobilis (Contributor) commented:

@stduhpf, maybe it's because I'm building for ARM? If you want, here is the version I'm currently building from; does it produce black images too?

I didn't sync SD_TYPE_COUNT and GGML_TYPE_COUNT, so the weight type must always be specified. It includes your patch, and there is a "--keep-quantization" option to temporarily enable quantization.

The build options are in the file "build_a55.sh"; it makes a clean build. But note the CPU options; they're tuned for ARM.
sdcpp.tar.xz.zip

stduhpf (Contributor, Author) commented on Feb 25, 2025

I was getting normal-looking images with your version of the code, but that was just because of this line: `bool keep_quant = false;` (clip.hpp:542). Setting it to true, I got black images again.
I don't have a decent ARM system to test this on, and I'm not compiling with BLAS either. Maybe that could be the reason.

vmobilis (Contributor) commented on Feb 25, 2025

@stduhpf, setting keep_quant to false turns the patch on. I see then that it doesn't work; maybe it's just my luck that it works for me... A friend even tried to run it on a 3 GB phone; it barely runs an SD1.5 model with q4_0 quantization, closing Google services due to lack of memory. 🙃

I tried without BLAS; it works too (and I'm building on Termux; any phone with Android 7+ and 6+ GB of RAM will be sufficient).
I did not try with Vulkan because it crashes earlier with non-f32 quantization, when applying LoRA.
Anyway, without a working concat(), arbitrary quantization won't work.
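
To make the concat() point concrete, here is a rough, hypothetical sketch of the embedding step (the names, shapes, and dim argument are illustrative, not the actual stable-diffusion.cpp graph code): the custom embeddings are appended to the token-embedding table before the row lookup, and it is this ggml_concat that has no quantized implementation, which is why the table is kept in F32.

```cpp
// Hypothetical sketch, not the actual stable-diffusion.cpp code: custom
// embeddings extend the token-embedding table before the per-token lookup.
#include "ggml.h"

static struct ggml_tensor * build_token_embeddings(
        struct ggml_context * ctx,
        struct ggml_tensor  * token_embed,   // [hidden, vocab], forced to F32 by this PR
        struct ggml_tensor  * custom_embed,  // [hidden, n_custom], custom embeddings (F32)
        struct ggml_tensor  * token_ids) {   // [n_tokens], I32 prompt token ids
    // Append the custom vectors as extra vocabulary rows (concat along dim 1).
    // This is the step that has no quantized implementation, so both inputs
    // must be plain float tensors.
    struct ggml_tensor * table = ggml_concat(ctx, token_embed, custom_embed, 1);
    // Look up the embedding row for each prompt token.
    return ggml_get_rows(ctx, table, token_ids);
}
```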

Thanks for your work!

leejet merged commit fbd42b6 into leejet:master on Mar 1, 2025
9 checks passed
leejet (Owner) commented on Mar 1, 2025

Thank you for your contribution.

Merging this pull request may close: Vulkan embeddings and lora issue