
Commit 11ab16f

Author: Mark Caldwell
refactor: adapt PuLID-Flux to the generation-extension mechanism
Move PuLID-Flux's runtime onto the GenerationExtension framework (the same mechanism PhotoMaker uses), instead of threading an id-embedding tensor through the sampler signature.

PhotoMaker fits the existing conditioning-only extension hooks because it only modifies the conditioning. PuLID does not: it injects identity via cross-attention *inside* the Flux denoise forward (the pulid_ca blocks), fed a precomputed id-embedding. So this adds one general hook to the interface:

    virtual void before_diffusion(DiffusionParams& params, int step) const {}

Called for each enabled extension after the per-step DiffusionParams (and its version-specific `extra`) is built and before diffusion_model->compute(). It is the diffusion-forward analog of before_condition: any extension, any architecture, can set/override fields on `params` (typically the matching `extra` variant). Mutates `params` only, never the extension.

PuLID's runtime now lives in src/extensions/pulid_extension.cpp:

- init(): enable when pulid_weights_path is set
- prepare_condition(): load the per-generation gguf id-embedding (does not touch the conditioning)
- before_diffusion(): hand the embedding + weight to FluxDiffusionExtra, which flux.hpp reads to drive the pulid_ca cross-attention

The pulid_ca.* weight merge stays in the context ctor: those weights are part of the Flux diffusion model (its pulid_ca blocks are constructed when the tensor map contains pulid_ca.* keys), so they must be in the map before the model is built -- which is before any extension init() runs. Everything else PuLID-related is in the extension; the per-sample id-embedding threading is gone.

pulid_params is carried in GenerationExtensionConditionContext alongside the existing pm_params, following the same idiom.
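The hook described above can be illustrated with a small mock. This is only a sketch: every type below (`DiffusionParams`, `FluxDiffusionExtra`, `OtherDiffusionExtra`, `GenerationExtension`, `PulidExtensionSketch`) is a hypothetical stand-in written for this example, with assumed field names, and is not the repository's definition. Only the `before_diffusion` signature is taken from the commit message.

```cpp
#include <cassert>
#include <memory>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the per-step parameter types (assumed fields).
struct OtherDiffusionExtra {};
struct FluxDiffusionExtra {
    const std::vector<float>* pulid_id = nullptr;  // precomputed id-embedding
    float pulid_id_weight              = 0.0f;     // injection strength
};
struct DiffusionParams {
    std::variant<OtherDiffusionExtra, FluxDiffusionExtra> extra;
};

// Base interface: the default hook is a no-op, so existing extensions
// need no changes.
struct GenerationExtension {
    virtual ~GenerationExtension() = default;
    virtual void before_diffusion(DiffusionParams& /*params*/, int /*step*/) const {}
};

// Sketch of a PuLID-style extension: it mutates `params` only, never itself.
struct PulidExtensionSketch : GenerationExtension {
    std::vector<float> id_embedding;
    float id_weight = 1.0f;

    void before_diffusion(DiffusionParams& params, int /*step*/) const override {
        // Only act when the per-step extra is the Flux variant.
        if (auto* flux = std::get_if<FluxDiffusionExtra>(&params.extra)) {
            flux->pulid_id        = &id_embedding;
            flux->pulid_id_weight = id_weight;
        }
    }
};

// Called once per step, after `params` is built and before the model runs.
void run_before_diffusion_hooks(
    const std::vector<std::unique_ptr<GenerationExtension>>& extensions,
    DiffusionParams& params, int step) {
    for (const auto& ext : extensions) {
        assert(ext != nullptr);
        ext->before_diffusion(params, step);
    }
}
```

Because the default hook does nothing and the override checks which `extra` alternative is active, a non-Flux architecture passes through untouched in this mock.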
1 parent bb90bfa commit 11ab16f

11 files changed

Lines changed: 878 additions & 13 deletions


docs/pulid.md

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
# PuLID-Flux face-identity preservation

stable-diffusion.cpp supports the [PuLID-Flux](https://github.com/ToTheBeginning/PuLID)
identity-injection technique on top of Flux.1 (schnell or dev) models.
Given a single source portrait, PuLID-Flux produces new generations that
preserve the source person's face across arbitrary scenes, poses, and
prompts.

Unlike PhotoMaker (which extracts the identity inside the inference
process from a directory of images), PuLID-Flux's identity extractor is
a heavy stack (insightface ArcFace + EVA-CLIP-L + IDFormer encoder) that
is impractical to port to C++/ggml. To keep this implementation small and
cross-vendor, **stable-diffusion.cpp consumes a precomputed identity
embedding** produced by an external Python tool that runs once per source
portrait. Everything downstream of that one-shot extraction is C++ and
is designed to run on any ggml backend (Vulkan, CUDA, Metal, ROCm, CPU).

## Architecture summary

PuLID-Flux adds a stack of 20 small cross-attention modules
(`PerceiverAttentionCA`) to the Flux denoise loop, inserted between the
Flux transformer blocks:

- After every 2nd of the 19 double-stream blocks (10 hook points)
- After every 4th of the 38 single-stream blocks (10 hook points)

Each cross-attention layer takes the current image tokens as query, the
32-token / 2048-dim identity embedding as key+value, and adds its output
(scaled by `id_weight`, typically 1.0) back to the image tokens.
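
The hook-point counts above can be checked with a few lines of arithmetic. The sketch below assumes a hook fires after block `i` whenever `i % interval == 0` (counting from block 0), which is the only phase that yields 10 hook points for both 19 blocks at interval 2 and 38 blocks at interval 4. The helper names are hypothetical and not part of stable-diffusion.cpp.

```cpp
#include <cassert>
#include <vector>

// Number of hook points when a hook follows every block whose index is a
// multiple of `interval` (assumed phase: blocks 0, interval, 2*interval, ...).
constexpr int pulid_hook_count(int num_blocks, int interval) {
    return (num_blocks + interval - 1) / interval;  // ceiling division
}

// The block indices that are followed by a cross-attention module.
std::vector<int> pulid_hook_indices(int num_blocks, int interval) {
    assert(interval > 0);
    std::vector<int> hooks;
    for (int i = 0; i < num_blocks; i += interval) {
        hooks.push_back(i);
    }
    return hooks;
}

// Flux layout from the text: 19 double-stream and 38 single-stream blocks.
static_assert(pulid_hook_count(19, 2) == 10, "double-stream hook points");
static_assert(pulid_hook_count(38, 4) == 10, "single-stream hook points");
static_assert(pulid_hook_count(19, 2) + pulid_hook_count(38, 4) == 20,
              "total cross-attention modules");
```

Under this assumption the last hooks sit after double-stream block 18 and single-stream block 36.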

## Required weights

Three inputs are required: the standard Flux weight set plus two
PuLID-specific files.

1. **Flux base** (transformer + VAE + clip_l + t5xxl) -- exactly as
   [docs/flux.md](flux.md) describes.
2. **PuLID weights** -- download from
   [guozinan/PuLID](https://huggingface.co/guozinan/PuLID):
   - `pulid_flux_v0.9.0.safetensors` or `pulid_flux_v0.9.1.safetensors`
     (v0.9.1 recommended; this implementation is verified against it)
   - **v1.1 (`pulid_v1.1.safetensors`) is NOT yet supported** -- it uses
     renamed keys (`id_adapter_attn_layers.*` instead of `pulid_ca.*`)
     and possibly a different module structure. Support is left to a
     future PR.
3. **Identity embedding (.pulidembd)** -- produced by the precompute
   tool below.

## Precompute the identity embedding

The precompute tool runs the PyTorch identity-extraction stack on a
single portrait image and writes the resulting `(32, 2048)` embedding
to a `.pulidembd` binary file (about 131 KB). Run it once per source
person; the same file is reused for any number of generations.

A reference Python script is provided at
[`scripts/pulid_extract_id.py`](../scripts/pulid_extract_id.py). It
requires:
- A working PyTorch + diffusers stack (CUDA or CPU)
- `insightface`, `facexlib`, `eva-clip`, `torchvision`
- The PuLID weights file (same one stable-diffusion.cpp will load below)
- The ToTheBeginning/PuLID repo's `pulid/pipeline_flux.py` (and its
  dependencies under `pulid/` and `flux/`) -- vendoring these is
  recommended over pip-installing, due to upstream packaging quirks

Run it as:

```
python pulid_extract_id.py \
    --portrait /path/to/source-photo.jpg \
    --pulid-weights /path/to/pulid_flux_v0.9.1.safetensors \
    --out /path/to/source.pulidembd
```

## Format (gguf)

The embedding is a standard **gguf** container holding a single tensor:

```
tensor name : "pulid_id"
shape       : [token_dim, num_tokens]   (ggml order; typically [2048, 32])
type        : F16 (also accepts F32 / BF16)
metadata    : general.architecture = "pulid", pulid.version = 1
```

stable-diffusion.cpp loads it with the normal gguf reader
(`gguf_init_from_file`) and converts it to fp32 at load time, so no
bespoke parser is needed. Total file size for the typical
(32, 2048, fp16) case is ~131 KB.
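
The ~131 KB figure follows directly from the tensor shape. As a minimal sketch (the helper names are hypothetical, and the small gguf header and metadata overhead is ignored), the payload size and the flat index of one element work out as follows:

```cpp
#include <cstddef>

// Raw tensor payload in bytes, excluding the gguf header and metadata.
constexpr std::size_t pulid_payload_bytes(std::size_t num_tokens,
                                          std::size_t token_dim,
                                          std::size_t bytes_per_element) {
    return num_tokens * token_dim * bytes_per_element;
}

// ggml lists the fastest-varying dimension first, so a tensor of ggml shape
// [token_dim, num_tokens] stores element (token, dim) at this flat index.
// That is the same memory layout as a row-major (num_tokens, token_dim) array.
constexpr std::size_t pulid_flat_index(std::size_t token, std::size_t dim,
                                       std::size_t token_dim) {
    return token * token_dim + dim;
}

static_assert(pulid_payload_bytes(32, 2048, 2) == 131072,
              "F16 payload: 131072 bytes, i.e. about 131 KB");
static_assert(pulid_payload_bytes(32, 2048, 4) == 262144,
              "F32 size after conversion at load time");
static_assert(pulid_flat_index(31, 2047, 2048) == 32 * 2048 - 1,
              "last element of the typical tensor");
```

The index helper also shows why the `(32, 2048)` shape quoted for the precompute tool and the `[2048, 32]` ggml shape describe the same bytes.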

## Command-line usage

Example invocation (Windows PowerShell; the trailing backticks are line
continuations):

```
.\bin\Release\sd-cli.exe `
    --diffusion-model models\flux1-schnell-Q4_K_S.gguf `
    --vae models\ae.safetensors `
    --clip_l models\clip_l.safetensors `
    --t5xxl models\t5xxl_fp16.safetensors `
    --pulid-weights models\pulid_flux_v0.9.1.safetensors `
    --pulid-id-embedding source.pulidembd `
    --pulid-id-weight 1.0 `
    -p "candid photograph of a young woman on a beach at sunset" `
    --cfg-scale 1.0 --sampling-method euler --steps 4 -W 512 -H 512 `
    --seed 42 --clip-on-cpu `
    -o out.png
```

For Flux Dev (instead of Schnell), add `--guidance 3.5` and change
`--steps 4` to `--steps 20`.

## Flags

| Flag                          | Purpose                                                                |
|-------------------------------|------------------------------------------------------------------------|
| `--pulid-weights <path>`      | Path to `pulid_flux_v0.9.x.safetensors`. Loaded with the model.        |
| `--pulid-id-embedding <path>` | Path to a `.pulidembd` file (gguf) produced by the precompute tool.    |
| `--pulid-id-weight <f>`       | Identity-injection strength. Typical 0.7-1.2; default 1.0.             |

`--pulid-weights` and `--pulid-id-embedding` must both be set to activate
PuLID; `--pulid-id-weight` is optional and defaults to 1.0. Setting only
`--pulid-weights` (no embedding) loads the weights but disables injection
at runtime. Setting `--pulid-id-weight 0` zeros out the contribution
(useful for falsification testing: outputs should be byte-identical to
a no-PuLID run with the same seed).
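
The activation rule can be written down as a small predicate. This is a sketch with hypothetical names (`PulidPaths`, `is_set`, `pulid_injection_active`), not the code in `pulid_extension.cpp`. It assumes an unset path is represented as `nullptr` or an empty string.

```cpp
// Hypothetical mirror of the two path options; nullptr or "" means "not set".
struct PulidPaths {
    const char* weights_path;       // --pulid-weights
    const char* id_embedding_path;  // --pulid-id-embedding
};

constexpr bool is_set(const char* path) {
    return path != nullptr && path[0] != '\0';
}

// Injection runs only when both the weights and the embedding are supplied.
// The id weight does not take part: it has a default and only scales the
// contribution once injection is active.
constexpr bool pulid_injection_active(const PulidPaths& p) {
    return is_set(p.weights_path) && is_set(p.id_embedding_path);
}

static_assert(pulid_injection_active(PulidPaths{"pulid.safetensors", "source.pulidembd"}),
              "both paths set: injection active");
static_assert(!pulid_injection_active(PulidPaths{"pulid.safetensors", nullptr}),
              "weights only: loaded but not injected");
static_assert(!pulid_injection_active(PulidPaths{nullptr, "source.pulidembd"}),
              "embedding only: nothing to inject with");
```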

## Memory budget

At 512x512, 4 steps (Schnell), the 20 cross-attention layers add roughly
10% to denoise time and almost nothing to peak VRAM. This was tested on a
12 GB consumer card with Flux Schnell Q4 GGUF, CPU-offloaded clip_l and
t5xxl, and a GPU-resident VAE.

At 1024x1024 with Flux Dev Q4 + 20 steps + PuLID, the VAE decode compute
buffer doesn't fit on a 12 GB card even with `--vae-on-cpu`. Workaround:
explicitly route the VAE to the CPU backend instead of using the offload
flag:

```
--backend "diffusion=vulkan0,vae=cpu"
```

The `--vae-on-cpu` flag offloads VAE weights but leaves the compute graph
on the default backend; this is existing stable-diffusion.cpp behavior,
not a PuLID-specific issue. It is documented here because anyone running
PuLID at 1024 will hit it.

## Backend selection

The standard `--backend` flag works as documented. Common patterns:

```
# AMD Vulkan
--backend "diffusion=vulkan0,vae=cpu"

# NVIDIA Vulkan
--backend "diffusion=vulkan1,vae=cpu"

# CUDA
--backend "diffusion=cuda0,vae=cpu"
```

The PuLID cross-attention layers run on the same backend as the main
diffusion model. They have not yet been independently profiled on every
backend; only Vulkan and CPU have been tested by the original contributor.

## Verification

A three-way SHA-256 check is the recommended sanity test when bringing up
a new combination of model + backend + hardware:

| Run                                          | Expected hash relation                             |
|----------------------------------------------|----------------------------------------------------|
| A: no `--pulid-*` flags                      | baseline                                           |
| B: PuLID flags, `--pulid-id-weight 0.0`      | **byte-identical to A**                            |
| C: PuLID flags, `--pulid-id-weight 1.0`      | **different from A and B**, preserves source identity |

If A and B differ, the injection is allocating or computing something
even at zero weight -- likely a bug. If A and C are identical, the
injection is not taking effect.
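
The decision logic of the three-way check can be sketched as follows. Note the swap: instead of SHA-256 hashes, this example compares the output files byte for byte, which answers the same "identical or not" question without a hashing library. All names here are hypothetical helpers written for this example.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// True only if both files can be opened and hold exactly the same bytes.
bool files_identical(const std::string& path_a, const std::string& path_b) {
    std::ifstream a(path_a, std::ios::binary);
    std::ifstream b(path_b, std::ios::binary);
    if (!a || !b) {
        return false;
    }
    const std::vector<char> bytes_a((std::istreambuf_iterator<char>(a)),
                                    std::istreambuf_iterator<char>());
    const std::vector<char> bytes_b((std::istreambuf_iterator<char>(b)),
                                    std::istreambuf_iterator<char>());
    return bytes_a == bytes_b;
}

enum class PulidCheck {
    Pass,            // A == B and A != C
    ZeroWeightLeak,  // A != B: something changes output even at weight 0
    NoEffect         // A == B == C: injection is not taking effect
};

// a_equals_b: run A (no PuLID flags) vs run B (--pulid-id-weight 0.0)
// a_equals_c: run A vs run C (--pulid-id-weight 1.0)
constexpr PulidCheck classify_three_way(bool a_equals_b, bool a_equals_c) {
    return !a_equals_b ? PulidCheck::ZeroWeightLeak
                       : (a_equals_c ? PulidCheck::NoEffect : PulidCheck::Pass);
}
```

A caller would feed it `files_identical("a.png", "b.png")` and `files_identical("a.png", "c.png")` for three runs that share the same seed. Passing this check shows only that C differs from the baseline; whether C preserves the source identity still needs a visual inspection.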

## Limitations / not yet supported

- **`--skip-layers` (skip-layer-guidance / SLG) combined with PuLID** is not
  supported. The `pulid_ca` index advances per non-skipped block, so a
  skipped block silently misaligns the cross-attention weight assignment
  vs. the trained intervals. The reference PyTorch implementation does
  not have SLG either, so there is no well-defined behavior to emulate.
  Use either feature alone.
- **PuLID v1.1 weights** (`pulid_v1.1.safetensors`, renamed key layout).
- **Multiple ID images.** The reference PyTorch implementation can fuse
  several portraits into one embedding for stronger identity. This
  implementation accepts a single embedding produced from one or more
  images by the external precompute tool.
- **Negative-prompt branch of CFG.** PuLID only injects on the positive
  conditioning path in the published reference, and the implementation
  here follows that. Flux's distilled guidance doesn't run a separate
  uncond branch in normal use, so this matters only for `--true-cfg`
  workflows that aren't standard for Flux.
- **Backends other than Vulkan and CPU** are untested by the original
  contributor. The implementation is pure-ggml and should work on CUDA,
  ROCm, and Metal, but verification by users on those backends is
  welcomed.

examples/common/common.cpp

Lines changed: 19 additions & 0 deletions
@@ -415,6 +415,10 @@ ArgOptions SDContextParams::get_options() {
          "--photo-maker",
          "path to PHOTOMAKER model",
          &photo_maker_path},
+        {"",
+         "--pulid-weights",
+         "path to PuLID flux weights (e.g. pulid_flux_v0.9.1.safetensors). Identity is injected during the denoise loop when paired with --pulid-id-embedding.",
+         &pulid_weights_path},
         {"",
          "--upscale-model",
          "path to esrgan model.",
@@ -812,6 +816,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool taesd_preview) {
     sd_ctx_params.embeddings = embedding_vec.data();
     sd_ctx_params.embedding_count = static_cast<uint32_t>(embedding_vec.size());
     sd_ctx_params.photo_maker_path = photo_maker_path.c_str();
+    sd_ctx_params.pulid_weights_path = pulid_weights_path.c_str();
     sd_ctx_params.tensor_type_rules = tensor_type_rules.c_str();
     sd_ctx_params.n_threads = n_threads;
     sd_ctx_params.wtype = wtype;
@@ -887,6 +892,10 @@ ArgOptions SDGenerationParams::get_options() {
          "--pm-id-embed-path",
          "path to PHOTOMAKER v2 id embed",
          &pm_id_embed_path},
+        {"",
+         "--pulid-id-embedding",
+         "path to a .pulidembd binary produced by pulid_extract_id.py. Carries a (32, 2048) identity embedding extracted from a source portrait. Pair with --pulid-weights on the context.",
+         &pulid_id_embedding_path},
         {"",
          "--hires-upscaler",
          "highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent (nearest-exact), "
@@ -1037,6 +1046,10 @@ ArgOptions SDGenerationParams::get_options() {
          "--pm-style-strength",
          "",
          &pm_style_strength},
+        {"",
+         "--pulid-id-weight",
+         "strength of PuLID identity injection (default: 1.0). 0.7-1.2 are typical; lower lets the prompt override the face more, higher tightens identity match.",
+         &pulid_id_weight},
         {"",
          "--control-strength",
          "strength to apply Control Net (default: 0.9). 1.0 corresponds to full destruction of information in init image",
@@ -2269,6 +2282,11 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
         pm_style_strength,
     };

+    sd_pulid_params_t pulid_params = {
+        pulid_id_embedding_path.empty() ? nullptr : pulid_id_embedding_path.c_str(),
+        pulid_id_weight,
+    };
+
     params.loras = lora_vec.empty() ? nullptr : lora_vec.data();
     params.lora_count = static_cast<uint32_t>(lora_vec.size());
     params.prompt = prompt.c_str();
@@ -2289,6 +2307,7 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
     params.control_image = control_image.get();
     params.control_strength = control_strength;
     params.pm_params = pm_params;
+    params.pulid_params = pulid_params;
     params.vae_tiling_params = vae_tiling_params;
     params.cache = cache_params;

examples/common/common.h

Lines changed: 11 additions & 0 deletions
@@ -133,6 +133,11 @@ struct SDContextParams {
     std::string control_net_path;
     std::string embedding_dir;
     std::string photo_maker_path;
+    // PuLID-Flux identity-preservation context path: the safetensors blob
+    // carrying the PerceiverAttentionCA cross-attention weights. Loaded
+    // once with the model. Per-generation pulid_id_embedding_path lives in
+    // SDGenerationParams below.
+    std::string pulid_weights_path;
     sd_type_t wtype = SD_TYPE_COUNT;
     std::string tensor_type_rules;
     std::string lora_model_dir = ".";
@@ -234,6 +239,12 @@ struct SDGenerationParams {
     std::string pm_id_embed_path;
     float pm_style_strength = 20.f;

+    // PuLID-Flux: per-generation identity embedding (binary file produced by
+    // runtime-scripts/pulid_extract_id.py). Format documented in
+    // include/stable-diffusion.h sd_pulid_params_t.
+    std::string pulid_id_embedding_path;
+    float pulid_id_weight = 1.0f;
+
     int upscale_repeats = 1;
     int upscale_tile_size = 128;

include/stable-diffusion.h

Lines changed: 30 additions & 0 deletions
@@ -195,6 +195,16 @@ typedef struct {
     const sd_embedding_t* embeddings;
     uint32_t embedding_count;
     const char* photo_maker_path;
+    /**
+     * Path to pulid_flux_v0.9.1.safetensors (the PuLID identity-injection
+     * cross-attention weights). When set together with sd_img_gen_params_t.
+     * pulid_params.id_embedding_path, the Flux diffusion model performs PuLID
+     * cross-attention injection during the denoise loop. Loaded once with
+     * the model; the embedding is per-generation. Currently only meaningful
+     * for Flux (depth=19 double, 38 single blocks); silently ignored for
+     * other model versions.
+     */
+    const char* pulid_weights_path;
     const char* tensor_type_rules;
     int n_threads;
     enum sd_type_t wtype;
@@ -272,6 +282,25 @@ typedef struct {
     float style_strength;
 } sd_pm_params_t;  // photo maker

+/**
+ * PuLID-Flux identity preservation params.
+ *
+ * Unlike PhotoMaker (which extracts the ID embedding inside the inference
+ * process from a directory of images), PuLID's ID extraction is a heavy
+ * Python-only stack (insightface ArcFace + EVA-CLIP-L + IDFormer). To stay
+ * cross-vendor in C++/Vulkan, sd.cpp consumes a precomputed binary file
+ * produced by an external tool (runtime-scripts/pulid_extract_id.py in the
+ * Cloudhands client tree).
+ *
+ * Format: a gguf container with a single tensor "pulid_id" of shape
+ * [token_dim, num_tokens] (ggml order; typically [2048, 32]) in F16/F32/BF16.
+ * Loaded with the standard gguf reader; see docs/pulid.md.
+ */
+typedef struct {
+    const char* id_embedding_path;  // path to .pulidembd file produced by pulid_extract_id.py
+    float id_weight;                // strength of the ID injection; typical 0.7-1.2, default 1.0
+} sd_pulid_params_t;
+
 enum sd_cache_mode_t {
     SD_CACHE_DISABLED = 0,
     SD_CACHE_EASYCACHE,
@@ -364,6 +393,7 @@ typedef struct {
     sd_image_t control_image;
     float control_strength;
     sd_pm_params_t pm_params;
+    sd_pulid_params_t pulid_params;
     sd_tiling_params_t vae_tiling_params;
     sd_cache_params_t cache;
     sd_hires_params_t hires;
