Commit d70feb4

feat: PuLID-Flux identity-injection support
Author: Mark Caldwell
Adds PuLID-Flux identity injection to the Flux denoise path: a pulid.hpp module, the id-embedding threaded through flux.hpp and stable-diffusion.cpp, CLI flags in examples/common, and scripts/pulid_extract_id.py to produce the embedding. The id-embedding is stored as a gguf container (a single fp16 tensor) and loaded through the same gguf_init_from_file path as the pulid_ca weights, so there's no bespoke binary header.
1 parent 7948df8 commit d70feb4

9 files changed

Lines changed: 797 additions & 17 deletions


docs/pulid.md

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
1+
# PuLID-Flux face-identity preservation
2+
3+
stable-diffusion.cpp supports the [PuLID-Flux](https://github.com/ToTheBeginning/PuLID)
4+
identity-injection technique on top of Flux.1 (schnell or dev) models.
5+
Given a single source portrait, PuLID-Flux produces new generations that
6+
preserve the source person's face across arbitrary scenes, poses, and
7+
prompts.
8+
9+
Unlike PhotoMaker (which extracts the identity inside the inference
10+
process from a directory of images), PuLID-Flux's identity extractor is
11+
a heavy stack (insightface ArcFace + EVA-CLIP-L + IDFormer encoder) that
12+
is impractical to port to C++/ggml. To keep this implementation small and
13+
cross-vendor, **stable-diffusion.cpp consumes a precomputed identity
14+
embedding** produced by an external Python tool that runs once per source
15+
portrait. Everything downstream of that one-shot extraction is C++ and
16+
is expected to run on any ggml backend (Vulkan, CUDA, Metal, ROCm, CPU).
17+
18+
## Architecture summary
19+
20+
The PuLID-Flux contribution to the Flux denoise loop is a stack of 20
21+
small cross-attention modules (`PerceiverAttentionCA`) inserted between
22+
the Flux transformer blocks:
23+
24+
- After every 2nd of the 19 double-stream blocks (10 hook points)
25+
- After every 4th of the 38 single-stream blocks (10 hook points)
26+
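The hook-point counts above can be checked with a few lines of Python. This is a minimal sketch, assuming a hook fires on the 0-based block index `i` whenever `i % interval == 0` (the convention under which 19 and 38 blocks each yield 10 hooks). `hook_indices` is a hypothetical helper for illustration, not a function in this repository.

```python
def hook_indices(num_blocks, interval):
    """Return 0-based indices of blocks that are followed by a PuLID hook.

    Assumption: a hook fires when index % interval == 0.
    """
    return [i for i in range(num_blocks) if i % interval == 0]


double_hooks = hook_indices(19, 2)  # double-stream blocks, every 2nd
single_hooks = hook_indices(38, 4)  # single-stream blocks, every 4th
total_hooks = len(double_hooks) + len(single_hooks)

print(len(double_hooks), len(single_hooks), total_hooks)
```

Under this assumption both lists have 10 entries, which matches the 20 `PerceiverAttentionCA` modules in the stack.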
27+
Each cross-attention layer takes the current image tokens as query, the
28+
32-token / 2048-dim identity embedding as key+value, and adds its output
29+
(scaled by `id_weight`, typically 1.0) back to the image tokens.
30+
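The residual update described above can be illustrated with a toy example. This is a simplified sketch, not the actual `PerceiverAttentionCA` module: it assumes a single attention head, omits the learned projections and normalization, and assumes image and identity tokens share one dimension. The names `cross_attention` and `inject_identity` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, id_tokens):
    """Toy attention: image tokens are queries, identity tokens are keys and values."""
    dim = len(id_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in image_tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in id_tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, id_tokens)) for d in range(dim)]
        )
    return attended


def inject_identity(image_tokens, id_tokens, id_weight=1.0):
    """Add the attention output, scaled by id_weight, back onto the image tokens."""
    attended = cross_attention(image_tokens, id_tokens)
    return [
        [x + id_weight * a for x, a in zip(token, att)]
        for token, att in zip(image_tokens, attended)
    ]


image = [[1.0, 2.0], [0.0, -1.0]]
identity = [[0.5, -0.5]]
unchanged = inject_identity(image, identity, id_weight=0.0)
shifted = inject_identity(image, identity, id_weight=1.0)
```

With `id_weight=0.0` the image tokens come back unchanged, which is the property the zero-weight falsification test relies on.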
31+
## Required weights
32+
33+
Three inputs are required:
34+
35+
1. **Flux base** (transformer + VAE + clip_l + t5xxl) -- exactly as
36+
[docs/flux.md](flux.md) describes.
37+
2. **PuLID weights** -- download from
38+
[guozinan/PuLID](https://huggingface.co/guozinan/PuLID):
39+
- `pulid_flux_v0.9.0.safetensors` or `pulid_flux_v0.9.1.safetensors`
40+
(recommended; this implementation is verified against v0.9.1)
41+
- **v1.1 (`pulid_v1.1.safetensors`) is NOT yet supported** -- it uses
42+
renamed keys (`id_adapter_attn_layers.*` instead of `pulid_ca.*`)
43+
and possibly different module structure. Support is left to a future PR.
44+
3. **Identity embedding (.pulidembd)** -- produced by the precompute
45+
tool below.
46+
47+
## Precompute the identity embedding
48+
49+
The precompute tool runs the PyTorch identity-extraction stack on a
50+
single portrait image and writes the resulting `(32, 2048)` embedding
51+
to a `.pulidembd` binary file (about 131 KB). Run it once per source
52+
person; the same file is reused for any number of generations.
53+
54+
A reference Python script is provided at
55+
[`scripts/pulid_extract_id.py`](../scripts/pulid_extract_id.py). It
56+
requires:
57+
- A working CUDA / CPU PyTorch + diffusers stack
58+
- `insightface`, `facexlib`, `eva-clip`, `torchvision`
59+
- The PuLID weights file (same one stable-diffusion.cpp will load below)
60+
- The ToTheBeginning/PuLID repo's `pulid/pipeline_flux.py` (and its
61+
dependencies under `pulid/` and `flux/`) -- recommended to vendor
62+
rather than pip-install due to upstream packaging quirks
63+
64+
Run it as:
65+
66+
```
67+
python pulid_extract_id.py \
68+
--portrait /path/to/source-photo.jpg \
69+
--pulid-weights /path/to/pulid_flux_v0.9.1.safetensors \
70+
--out /path/to/source.pulidembd
71+
```
72+
73+
## Format (gguf)
74+
75+
The embedding is a standard **gguf** container holding a single tensor:
76+
77+
```
78+
tensor name : "pulid_id"
79+
shape : [token_dim, num_tokens] (ggml order; typically [2048, 32])
80+
type : F16 (also accepts F32 / BF16)
81+
metadata : general.architecture = "pulid", pulid.version = 1
82+
```
83+
84+
stable-diffusion.cpp loads it with the normal gguf reader
85+
(`gguf_init_from_file`) and converts to fp32 at load time -- no bespoke
86+
parser. Total file size for the typical (32, 2048, fp16) case is ~131 KB.
87+
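A quick sanity check on a `.pulidembd` file can be done without any gguf library. The sketch below reads only the fixed-size GGUF header (magic, version, tensor count, metadata key/value count) and computes the tensor payload size for the typical shape. `read_gguf_header` and `payload_bytes` are hypothetical helpers. The sketch assumes GGUF version 2 or later, where the two counts are 64-bit little-endian integers.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF v2+ header."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a gguf file: magic={magic!r}")
    (version,) = struct.unpack("<I", stream.read(4))
    if version < 2:
        raise ValueError("this sketch only handles GGUF v2 and later")
    tensor_count, kv_count = struct.unpack("<QQ", stream.read(16))
    return version, tensor_count, kv_count


def payload_bytes(num_tokens=32, token_dim=2048, bytes_per_element=2):
    """Size of the raw tensor data; the file is slightly larger due to header and metadata."""
    return num_tokens * token_dim * bytes_per_element


# Synthetic header for demonstration: version 3, one tensor, two metadata entries.
fake_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 1, 2)
parsed = read_gguf_header(io.BytesIO(fake_header))
print(parsed, payload_bytes())
```

For a real file, open it in binary mode and pass the file object to `read_gguf_header`; per the format above, the tensor count should be 1. The fp16 payload for a (32, 2048) embedding is 131072 bytes, consistent with the ~131 KB figure.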
88+
## Command-line usage
89+
90+
```
91+
.\bin\Release\sd-cli.exe \
92+
--diffusion-model models\flux1-schnell-Q4_K_S.gguf \
93+
--vae models\ae.safetensors \
94+
--clip_l models\clip_l.safetensors \
95+
--t5xxl models\t5xxl_fp16.safetensors \
96+
--pulid-weights models\pulid_flux_v0.9.1.safetensors \
97+
--pulid-id-embedding source.pulidembd \
98+
--pulid-id-weight 1.0 \
99+
-p "candid photograph of a young woman on a beach at sunset" \
100+
--cfg-scale 1.0 --sampling-method euler --steps 4 -W 512 -H 512 \
101+
--seed 42 --clip-on-cpu \
102+
-o out.png
103+
```
104+
105+
For Flux Dev (instead of Schnell), add `--guidance 3.5` and `--steps 20`.
106+
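When comparing several identity strengths, it can help to assemble the argument list programmatically. The sketch below uses only the flags shown in the example command; the model paths are placeholders copied from that example, and `build_sd_cli_args` is a hypothetical helper, not part of the repository.

```python
def build_sd_cli_args(embedding, id_weight=1.0, dev=False, out="out.png",
                      diffusion_model="models/flux1-schnell-Q4_K_S.gguf",
                      prompt="candid photograph of a young woman on a beach at sunset"):
    """Assemble sd-cli arguments mirroring the example command above."""
    args = [
        "--diffusion-model", diffusion_model,
        "--vae", "models/ae.safetensors",
        "--clip_l", "models/clip_l.safetensors",
        "--t5xxl", "models/t5xxl_fp16.safetensors",
        "--pulid-weights", "models/pulid_flux_v0.9.1.safetensors",
        "--pulid-id-embedding", embedding,
        "--pulid-id-weight", str(id_weight),
        "-p", prompt,
        "--cfg-scale", "1.0",
        "--sampling-method", "euler",
        "--steps", "20" if dev else "4",  # Dev uses 20 steps, Schnell 4
        "-W", "512", "-H", "512",
        "--seed", "42",
        "--clip-on-cpu",
        "-o", out,
    ]
    if dev:
        args += ["--guidance", "3.5"]  # only added for Flux Dev
    return args


# One argument list per identity strength in the typical 0.7-1.2 range.
sweep = {w: build_sd_cli_args("source.pulidembd", id_weight=w, out=f"out_w{w}.png")
         for w in (0.7, 1.0, 1.2)}
```

Each list can then be passed to `subprocess.run([path_to_sd_cli, *args], check=True)`, where `path_to_sd_cli` is wherever the binary lives on your system.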
107+
## Flags
108+
109+
| Flag | Purpose |
110+
|----------------------------|-------------------------------------------------------------------|
111+
| `--pulid-weights <path>` | Path to `pulid_flux_v0.9.x.safetensors`. Loaded with the model. |
112+
| `--pulid-id-embedding <p>` | Path to a `.pulidembd` binary produced by the precompute tool. |
113+
| `--pulid-id-weight <f>` | Identity-injection strength. Typical 0.7-1.2; default 1.0. |
114+
115+
PuLID is active only when both `--pulid-weights` and
116+
`--pulid-id-embedding` are set; `--pulid-id-weight` defaults to 1.0.
117+
`--pulid-weights` alone loads the weights but disables injection at runtime.
118+
`--pulid-id-weight 0` zeros out the contribution (useful for falsification
119+
testing: output should be byte-identical to a no-PuLID run with the same seed).
120+
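The flag interactions described above can be summarized as a small decision function. This is an illustrative sketch of the documented behavior, not the actual C++ logic. `pulid_state` is a hypothetical name, and treating an embedding supplied without weights as "off" is an assumption, since the text does not cover that case.

```python
def pulid_state(weights_path=None, embedding_path=None, id_weight=1.0):
    """Classify a flag combination.

    Returns one of "off", "weights-loaded-only", "zero-strength", "active".
    """
    if not weights_path:
        # Assumption: without the cross-attention weights nothing can be injected.
        return "off"
    if not embedding_path:
        # Weights are loaded with the model, but injection is disabled at runtime.
        return "weights-loaded-only"
    if id_weight == 0.0:
        # Injection is wired up but contributes nothing to the output.
        return "zero-strength"
    return "active"


print(pulid_state("pulid_flux_v0.9.1.safetensors", "source.pulidembd"))
```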
121+
## Memory budget
122+
123+
At 512x512, 4 steps (Schnell), the 20 cross-attention layers add roughly
124+
10% to denoise time and almost nothing to peak VRAM. Tested on a 12 GB
125+
consumer card alongside Flux Schnell Q4 GGUF + CPU-offloaded clip_l and
126+
t5xxl + GPU-resident VAE.
127+
128+
At 1024x1024 with Flux Dev Q4 + 20 steps + PuLID, the VAE decode compute
129+
buffer doesn't fit on a 12 GB card even with `--vae-on-cpu`. Workaround:
130+
explicitly route VAE to the CPU backend instead of the offload flag:
131+
132+
```
133+
--backend "diffusion=vulkan0,vae=cpu"
134+
```
135+
136+
The `--vae-on-cpu` flag offloads VAE weights but leaves the compute graph
137+
on the default backend; this is existing stable-diffusion.cpp behavior,
138+
not a PuLID-specific issue. Documented here because anyone running PuLID
139+
at 1024 will hit it.
140+
141+
## Backend selection
142+
143+
The standard `--backend` flag works as documented. Common patterns:
144+
145+
```
146+
# AMD Vulkan
147+
--backend "diffusion=vulkan0,vae=cpu"
148+
149+
# NVIDIA Vulkan
150+
--backend "diffusion=vulkan1,vae=cpu"
151+
152+
# CUDA
153+
--backend "diffusion=cuda0,vae=cpu"
154+
```
155+
156+
The PuLID cross-attention layers run on the same backend as the main
157+
diffusion model. They have not yet been independently profiled on every
158+
backend; only Vulkan and CPU have been tested by the original contributor.
159+
160+
## Verification
161+
162+
A three-way SHA-256 check is the recommended sanity test when bringing up
163+
a new combination of model + backend + hardware:
164+
165+
| Run | Expected hash relation |
166+
|----------------------------------------------|------------------------------------|
167+
| A: no `--pulid-*` flags | baseline |
168+
| B: PuLID flags, `--pulid-id-weight 0.0` | **byte-identical to A** |
169+
| C: PuLID flags, `--pulid-id-weight 1.0` | **different from A,B**, preserves source identity |
170+
171+
If B differs from A, the injection is allocating or computing something
172+
even at zero weight -- likely a bug.
173+
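The three-way check is easy to script. The sketch below hashes the three output files and reports any relation that violates the table above; `sha256_of` and `check_three_way` are hypothetical helper names. If the output format embeds run metadata that differs between runs, hashing decoded pixel data instead of file bytes may be the more robust comparison.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_three_way(hash_a, hash_b, hash_c):
    """Return a list of problems; an empty list means the hashes relate as expected."""
    problems = []
    if hash_b != hash_a:
        problems.append("B differs from A: injection has an effect at id weight 0")
    if hash_c == hash_a:
        problems.append("C equals A: injection has no effect at id weight 1")
    return problems
```

A typical call would be `check_three_way(sha256_of("a.png"), sha256_of("b.png"), sha256_of("c.png"))`, with the three file names standing in for the outputs of runs A, B, and C. Matching hashes cannot confirm that C preserves the source identity; that still needs a visual check.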
174+
## Limitations / not yet supported
175+
176+
- **`--skip-layers` (skip-layer-guidance / SLG) combined with PuLID** is not
177+
supported. The `pulid_ca` index advances per non-skipped block, so a
178+
skipped block silently misaligns the cross-attention weight assignment
179+
vs. the trained intervals. The reference PyTorch implementation does
180+
not have SLG either, so there is no well-defined behavior to emulate.
181+
Use either feature alone.
182+
- **PuLID v1.1 weights** (`pulid_v1.1.safetensors`, renamed key layout).
183+
- **Multiple ID images.** The reference PyTorch implementation can fuse
184+
several portraits into one embedding for stronger identity. This
185+
implementation accepts a single embedding produced from one or more
186+
images by the external precompute tool.
187+
- **Negative-prompt branch of CFG.** PuLID only injects on the positive
188+
conditioning path in the published reference, and the implementation
189+
here follows that. Flux's distilled guidance doesn't run a separate
190+
uncond branch in normal use, so this matters only for `--true-cfg`
191+
workflows that aren't standard for Flux.
192+
- **Backends other than Vulkan and CPU** are untested by the original
193+
contributor. The implementation is pure-ggml and should work on CUDA,
194+
ROCm, and Metal, but verification by users on those backends is
195+
welcomed.
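
The `--skip-layers` misalignment from the first bullet can be made concrete with a toy model of the module counter. This sketch assumes the counter advances only when a hooked block actually runs and that hooks sit on 0-based indices divisible by the interval. `ca_assignment` is a hypothetical helper and does not mirror the C++ code line for line.

```python
def ca_assignment(num_blocks, interval, skipped=()):
    """Map block index -> index of the pulid_ca module applied after that block.

    Assumption: a skipped block runs neither the block nor its hook, so the
    module counter does not advance for it.
    """
    mapping = {}
    ca_idx = 0
    for i in range(num_blocks):
        if i in skipped:
            continue
        if i % interval == 0:
            mapping[i] = ca_idx
            ca_idx += 1
    return mapping


baseline = ca_assignment(19, 2)
with_skip = ca_assignment(19, 2, skipped={2})

# Blocks whose module index changed because block 2 was skipped.
shifted = [i for i in with_skip if with_skip[i] != baseline[i]]
print(shifted)
```

In this model, skipping block 2 shifts every later hook down by one module, so block 4 receives the weights trained for block 2's position, and so on.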

examples/common/common.cpp

Lines changed: 19 additions & 0 deletions
@@ -404,6 +404,10 @@ ArgOptions SDContextParams::get_options() {
404404
"--photo-maker",
405405
"path to PHOTOMAKER model",
406406
&photo_maker_path},
407+
{"",
408+
"--pulid-weights",
409+
"path to PuLID flux weights (e.g. pulid_flux_v0.9.1.safetensors). Identity is injected during the denoise loop when paired with --pulid-id-embedding.",
410+
&pulid_weights_path},
407411
{"",
408412
"--upscale-model",
409413
"path to esrgan model.",
@@ -772,6 +776,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f
772776
embedding_vec.data(),
773777
static_cast<uint32_t>(embedding_vec.size()),
774778
photo_maker_path.c_str(),
779+
pulid_weights_path.c_str(),
775780
tensor_type_rules.c_str(),
776781
vae_decode_only,
777782
free_params_immediately,
@@ -852,6 +857,10 @@ ArgOptions SDGenerationParams::get_options() {
852857
"--pm-id-embed-path",
853858
"path to PHOTOMAKER v2 id embed",
854859
&pm_id_embed_path},
860+
{"",
861+
"--pulid-id-embedding",
862+
"path to a .pulidembd binary produced by pulid_extract_id.py. Carries a (32, 2048) identity embedding extracted from a source portrait. Pair with --pulid-weights on the context.",
863+
&pulid_id_embedding_path},
855864
{"",
856865
"--hires-upscaler",
857866
"highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent (nearest-exact), "
@@ -1002,6 +1011,10 @@ ArgOptions SDGenerationParams::get_options() {
10021011
"--pm-style-strength",
10031012
"",
10041013
&pm_style_strength},
1014+
{"",
1015+
"--pulid-id-weight",
1016+
"strength of PuLID identity injection (default: 1.0). 0.7-1.2 are typical; lower lets the prompt override the face more, higher tightens identity match.",
1017+
&pulid_id_weight},
10051018
{"",
10061019
"--control-strength",
10071020
"strength to apply Control Net (default: 0.9). 1.0 corresponds to full destruction of information in init image",
@@ -2234,6 +2247,11 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
22342247
pm_style_strength,
22352248
};
22362249

2250+
sd_pulid_params_t pulid_params = {
2251+
pulid_id_embedding_path.empty() ? nullptr : pulid_id_embedding_path.c_str(),
2252+
pulid_id_weight,
2253+
};
2254+
22372255
params.loras = lora_vec.empty() ? nullptr : lora_vec.data();
22382256
params.lora_count = static_cast<uint32_t>(lora_vec.size());
22392257
params.prompt = prompt.c_str();
@@ -2254,6 +2272,7 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
22542272
params.control_image = control_image.get();
22552273
params.control_strength = control_strength;
22562274
params.pm_params = pm_params;
2275+
params.pulid_params = pulid_params;
22572276
params.vae_tiling_params = vae_tiling_params;
22582277
params.cache = cache_params;
22592278

examples/common/common.h

Lines changed: 11 additions & 0 deletions
@@ -101,6 +101,11 @@ struct SDContextParams {
101101
std::string control_net_path;
102102
std::string embedding_dir;
103103
std::string photo_maker_path;
104+
// PuLID-Flux identity-preservation context path: the safetensors blob
105+
// carrying the PerceiverAttentionCA cross-attention weights. Loaded
106+
// once with the model. Per-generation pulid_id_embedding_path lives in
107+
// SDGenerationParams below.
108+
std::string pulid_weights_path;
104109
sd_type_t wtype = SD_TYPE_COUNT;
105110
std::string tensor_type_rules;
106111
std::string lora_model_dir = ".";
@@ -197,6 +202,12 @@ struct SDGenerationParams {
197202
std::string pm_id_embed_path;
198203
float pm_style_strength = 20.f;
199204

205+
// PuLID-Flux: per-generation identity embedding (gguf file produced by
206+
// scripts/pulid_extract_id.py). Format documented in
207+
// include/stable-diffusion.h sd_pulid_params_t.
208+
std::string pulid_id_embedding_path;
209+
float pulid_id_weight = 1.0f;
210+
200211
int upscale_repeats = 1;
201212
int upscale_tile_size = 128;
202213

include/stable-diffusion.h

Lines changed: 30 additions & 0 deletions
@@ -194,6 +194,16 @@ typedef struct {
194194
const sd_embedding_t* embeddings;
195195
uint32_t embedding_count;
196196
const char* photo_maker_path;
197+
/**
198+
* Path to pulid_flux_v0.9.1.safetensors (the PuLID identity-injection
199+
* cross-attention weights). When set together with sd_img_gen_params_t.
200+
* pulid_params.id_embedding_path, the Flux diffusion model performs PuLID
201+
* cross-attention injection during the denoise loop. Loaded once with
202+
* the model; the embedding is per-generation. Currently only meaningful
203+
* for Flux (depth=19 double, 38 single blocks); silently ignored for
204+
* other model versions.
205+
*/
206+
const char* pulid_weights_path;
197207
const char* tensor_type_rules;
198208
bool vae_decode_only;
199209
bool free_params_immediately;
@@ -275,6 +285,25 @@ typedef struct {
275285
float style_strength;
276286
} sd_pm_params_t; // photo maker
277287

288+
/**
289+
* PuLID-Flux identity preservation params.
290+
*
291+
* Unlike PhotoMaker (which extracts the ID embedding inside the inference
292+
* process from a directory of images), PuLID's ID extraction is a heavy
293+
* Python-only stack (insightface ArcFace + EVA-CLIP-L + IDFormer). To stay
294+
* cross-vendor in C++/Vulkan, sd.cpp consumes a precomputed binary file
295+
* produced by an external tool (scripts/pulid_extract_id.py in this
296+
* repository).
297+
*
298+
* Format: a gguf container with a single tensor "pulid_id" of shape
299+
* [token_dim, num_tokens] (ggml order; typically [2048, 32]) in F16/F32/BF16.
300+
* Loaded with the standard gguf reader; see docs/pulid.md.
301+
*/
302+
typedef struct {
303+
const char* id_embedding_path; // path to .pulidembd file produced by pulid_extract_id.py
304+
float id_weight; // strength of the ID injection; typical 0.7-1.2, default 1.0
305+
} sd_pulid_params_t;
306+
278307
enum sd_cache_mode_t {
279308
SD_CACHE_DISABLED = 0,
280309
SD_CACHE_EASYCACHE,
@@ -367,6 +396,7 @@ typedef struct {
367396
sd_image_t control_image;
368397
float control_strength;
369398
sd_pm_params_t pm_params;
399+
sd_pulid_params_t pulid_params;
370400
sd_tiling_params_t vae_tiling_params;
371401
sd_cache_params_t cache;
372402
sd_hires_params_t hires;
