[draft] Attempt to implement the Flux Klein bonsai model#1580
Juste-Leo2 wants to merge 2 commits into
- Add FLUX.2 tensor name mappings for shared modulation, fused single-block QKV+MLP, and separate double-block Q/K/V
- Add SelfAttention fused_qkv option for separate Q/K/V weights
- Add sharded safetensors loader for LLM text encoder
- Add image-studio workspace
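As a rough illustration of what a sharded safetensors loader has to resolve first, the sketch below assumes the Hugging Face sharded-checkpoint convention, in which a `model.safetensors.index.json` file carries a `weight_map` from tensor name to shard file. It is written in Python rather than taken from the PR's C++ code, and the tensor and file names are made up for the example.

```python
import json
from collections import defaultdict


def group_tensors_by_shard(index_text):
    """Map each shard filename to the sorted tensor names stored in it."""
    weight_map = json.loads(index_text)["weight_map"]
    shards = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in shards.items()}


# Illustrative index contents; these tensor and file names are hypothetical.
example_index = json.dumps({
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
        "model.norm.weight": "model-00002-of-00002.safetensors",
    },
})

shards = group_tensors_by_shard(example_index)
for shard_file, names in shards.items():
    print(shard_file, len(names))
```

Grouping by shard first means each shard file only needs to be opened once while its tensors are read.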
…Bonsai) VAE config has 'force_upcast: true', which diffusers uses to cast the entire VAE to float32 before decode. C++ has no equivalent, causing yellow/chroma noise in output. Workaround: set expected_type = GGML_TYPE_F32 on all first_stage_model tensor entries before ae.init().
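The workaround described in this commit message can be sketched, under stated assumptions, as a type override over a tensor table. The Python below is only a model of the idea: the dict stands in for the loader's tensor entries, the string constants stand in for ggml types, and the tensor names other than the `first_stage_model` prefix are hypothetical. It is not the sd.cpp API.

```python
# Stand-ins for ggml tensor types; hypothetical, not real ggml constants.
F16 = "f16"
F32 = "f32"


def force_vae_f32(expected_types, prefix="first_stage_model."):
    """Return a copy in which every tensor under `prefix` is marked F32."""
    return {
        name: (F32 if name.startswith(prefix) else dtype)
        for name, dtype in expected_types.items()
    }


# Hypothetical tensor table: name -> expected storage type.
tensor_types = {
    "first_stage_model.decoder.conv_in.weight": F16,
    "first_stage_model.decoder.conv_out.weight": F16,
    "model.diffusion_model.img_in.weight": F16,
}

patched = force_vae_f32(tensor_types)
for name, dtype in patched.items():
    print(name, dtype)
```

Only the VAE entries change, so the diffusion model keeps its original precision and memory footprint.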
I picked the wrong thread entirely; I've closed it.
I suggest you run my script on the model first: https://huggingface.co/Green-Sky/bonsai-image-binary-4B-GGUF/blob/main/f2_from_diffusers.py Literally the only thing that needs implementing is the ternary quant in ggml, and this is not the right place to do that.
Thanks so much for the advice! I'll look into that. My original plan was to build the engine specifically for this model, fully optimized. Yeah, that was a misclick on my part; it's in the wrong place :)
I had the chance to test your model, and it works great :) So I don't really understand the noisy images I was getting with the dequantized versions. Maybe I didn't handle them correctly. Thanks again @Green-Sky for implementing this GGUF; it's going to help me make progress on a potential optimization.
You really should take another look at the Python script I have in the HF repo :)
I ran some more tests based on the code. I adapted the C code and it worked (I should have done that from the start—thanks for insisting 😅). The issue was with the |
sd.cpp generally uses ComfyUI-style safetensors rather than diffusers-format safetensors; this is not unique to bonsai-image/flux2. So I recommend the Python conversion route if you want to avoid diverging too far from upstream and want your project to survive beyond the prototype stage :) Edit: regarding optimizations like the megakernels they used in the paper, you should look at op fusing (in ggml backends).
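One concrete difference a conversion step may need to bridge is separate Q/K/V weights versus a single fused QKV weight, both of which the commit message mentions. The sketch below assumes the fused layout simply stacks Q, then K, then V along the output (row) dimension and that all three have the same number of rows. That layout is an assumption for illustration, not something confirmed for this model or for the linked script.

```python
def fuse_qkv(q_rows, k_rows, v_rows):
    """Stack Q, K and V row blocks into one fused weight (rows = output features)."""
    if not (len(q_rows) == len(k_rows) == len(v_rows)):
        raise ValueError("Q, K and V must have the same number of rows")
    return [list(row) for row in q_rows + k_rows + v_rows]


def split_qkv(fused_rows):
    """Inverse of fuse_qkv: cut the fused weight back into equal thirds."""
    n, remainder = divmod(len(fused_rows), 3)
    if remainder:
        raise ValueError("fused row count must be divisible by 3")
    return fused_rows[:n], fused_rows[n:2 * n], fused_rows[2 * n:]


# Tiny 2x2 weights, purely illustrative values.
q = [[1.0, 2.0], [3.0, 4.0]]
k = [[5.0, 6.0], [7.0, 8.0]]
v = [[9.0, 10.0], [11.0, 12.0]]

fused = fuse_qkv(q, k, v)
print(len(fused), "rows in fused weight")
```

Checking that `split_qkv(fuse_qkv(q, k, v))` round-trips is a cheap way to catch an ordering mistake before it shows up as image noise.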
It's definitely worth getting used to ComfyUI. I'll grab the Python script you wrote. Thanks again!
This draft adds the necessary mappings to make the dequantized 4B model work.
Opencode was used, but I am noticing significant differences during inference that I haven't been able to patch using DeepSeek Flash on Opencode.
Here are some avenues to explore:
I suspect the problem lies either in the denoising process or in the VAE conversion. However, despite comparing it with the Python implementation, I haven't been able to pinpoint the issue yet. Any help would be greatly appreciated!
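One way to narrow this down is to dump intermediate values from both implementations at matching stages and find the first stage where they diverge. The sketch below assumes such dumps are already available as flat lists of floats. The stage names, values, and tolerance are hypothetical, and nothing here reads real model outputs.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergence(reference, candidate, tolerance=1e-3):
    """Return (stage, diff) for the first stage exceeding tolerance, else None."""
    for stage, ref_values in reference.items():
        diff = max_abs_diff(ref_values, candidate[stage])
        if diff > tolerance:
            return stage, diff
    return None


# Hypothetical dumps, listed in pipeline order.
reference = {
    "text_encoder": [0.10, 0.20],
    "denoiser_step_0": [1.00, -1.00],
    "vae_decode": [0.50, 0.25],
}
candidate = {
    "text_encoder": [0.10, 0.20],
    "denoiser_step_0": [1.00, -1.00],
    "vae_decode": [0.50, 0.75],
}

result = first_divergence(reference, candidate)
print(result)
```

If the denoiser stages match and only the final stage differs, the VAE path becomes the more likely suspect; a mismatch at an early denoising step points the other way.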
Here is an inference example:
Here is the result using the Python reference:
Here is the result using the fork:
