Fix embeddings with quantized models #601

Merged (1 commit) on Mar 1, 2025
Conversation

stduhpf (Contributor) commented on Feb 22, 2025

fixes #600

vmobilis (Contributor) commented on Feb 22, 2025

@stduhpf, will that fix (ggml-org/ggml#1118) still be useful after this patch, or does it become unnecessary?

Update: @stduhpf, please ignore this strange question, I've already checked it myself. 😅
Yes, #1118 is not needed anymore; the tensors are always GGML_TYPE_F32 now.

What's more, my patch is probably wrong, because it produces an oversaturated image compared with your result, which looks more natural.

vmobilis (Contributor) commented:

@stduhpf, on second thought, disabling quantization for the conditioner has a disadvantage: CLIP will consume more memory – depending on the quantization, significantly more.
So maybe this should be at least optional.

But if it doesn't complicate things too much, @stduhpf, you've brought up a good idea.
Considering that stable-diffusion.cpp is able to quantize parts of the model on the fly, setting the quantization separately for the conditioner and the diffuser would produce 24² = 576 variants for the same prompt and sampler.
For example:

- clip f32, diffuser f32
- clip f32, diffuser q4_1
- clip q4_1, diffuser q4_1

stduhpf (Contributor, Author) commented on Feb 24, 2025

> @stduhpf, on second thought, disabling quantization for the conditioner has a disadvantage: CLIP will consume more memory – depending on the quantization, significantly more. So maybe this should be at least optional.

Well, with this PR it's not the whole CLIP model being forced to f32, "just" the token_embedding.weight tensor, which is about 30% of CLIP-L's total params, but only 10% of CLIP-G's.

There would be about a hundred megabytes of compute buffer to save with proper support for quantized concat(), compared to forcing f32. But then, the custom embeddings would get quantized too, which might cause significant quality loss... (maybe this could explain why your result is oversaturated?)
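For context, here is a minimal sketch (my own illustration, not the actual PR diff) of the kind of per-tensor type selection this implies, assuming the storage type is decided from the tensor name during on-the-fly quantization; `choose_tensor_type` and its signature are hypothetical:

```cpp
// Minimal sketch, not the actual PR code: keep the token embedding table in F32
// when choosing the storage type for each tensor during on-the-fly quantization.
// choose_tensor_type() is a hypothetical helper; wtype is the user-requested type.
#include <string>
#include "ggml.h"

static enum ggml_type choose_tensor_type(const std::string & name, enum ggml_type wtype) {
    // The embedding table is later combined with custom (textual-inversion)
    // embeddings, which only works on float data, so it stays in F32.
    if (name.find("token_embedding.weight") != std::string::npos) {
        return GGML_TYPE_F32;
    }
    return wtype; // every other tensor follows the requested quantization
}
```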

Edit: I just tried your ggml patch; I still get black images when token_embedding.weight is quantized to f16, both on Vulkan and CPU. And Vulkan just crashes when it's quantized to non-float quants (CPU output is also black)... Not sure what I'm doing wrong.

vmobilis (Contributor) commented:

@stduhpf, maybe it's because I'm building for ARM? If you want, here is the version I'm currently building from; does it produce black images too?

I didn't sync SD_TYPE_COUNT and GGML_TYPE_COUNT, so the weight type must always be specified. It includes your patch, and there is a "--keep-quantization" option to temporarily enable quantization.

The build options are in the file "build_a55.sh"; it makes a clean build. But note the CPU options; they're tuned for ARM.
sdcpp.tar.xz.zip

stduhpf (Contributor, Author) commented on Feb 25, 2025

I was getting normal-looking images with your version of the code, but that was just because of this line: `bool keep_quant = false;` (clip.hpp:542). Setting it to true, I got black images again.
I don't have a decent ARM system to test this on, and I'm not compiling with BLAS either. Maybe that could be the reason.

vmobilis (Contributor) commented on Feb 25, 2025

@stduhpf, setting keep_quant to false turns the patch on. I see then that it doesn't work; maybe it's just my luck that it works for me... A friend even tried to run it on a 3 GB phone; it barely runs an SD1.5 model with q4_0 quantization, closing Google services due to lack of memory. 🙃

I tried without BLAS; it works too (and I'm building on Termux; any phone with Android 7+ and 6+ GB of RAM will be sufficient).
I did not try with Vulkan because it crashes earlier with non-f32 quantization, when applying LoRA.
Anyway, without a working concat(), arbitrary quantization won't work.
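
To make the concat() point concrete, here is a rough, hypothetical sketch of the embedding step (the names, shapes, and dim argument are illustrative, not the actual stable-diffusion.cpp graph code): the custom embeddings are appended to the token-embedding table before the row lookup, and it is this ggml_concat that has no quantized implementation, which is why the table is kept in F32.

```cpp
// Hypothetical sketch, not the actual stable-diffusion.cpp code: custom
// embeddings extend the token-embedding table before the per-token lookup.
#include "ggml.h"

static struct ggml_tensor * build_token_embeddings(
        struct ggml_context * ctx,
        struct ggml_tensor  * token_embed,   // [hidden, vocab], forced to F32 by this PR
        struct ggml_tensor  * custom_embed,  // [hidden, n_custom], custom embeddings (F32)
        struct ggml_tensor  * token_ids) {   // [n_tokens], I32 prompt token ids
    // Append the custom vectors as extra vocabulary rows (concat along dim 1).
    // This is the step that has no quantized implementation, so both inputs
    // must be plain float tensors.
    struct ggml_tensor * table = ggml_concat(ctx, token_embed, custom_embed, 1);
    // Look up the embedding row for each prompt token.
    return ggml_get_rows(ctx, table, token_ids);
}
```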

Thanks for your work!

leejet merged commit fbd42b6 into leejet:master on Mar 1, 2025
9 checks passed
leejet (Owner) commented on Mar 1, 2025

Thank you for your contribution.

Merging this pull request may close: Vulkan embeddings and lora issue