| 1 | +# PuLID-Flux face-identity preservation |
| 2 | + |
| 3 | +stable-diffusion.cpp supports the [PuLID-Flux](https://github.com/ToTheBeginning/PuLID) |
| 4 | +identity-injection technique on top of Flux.1 (schnell or dev) models. |
| 5 | +Given a single source portrait, PuLID-Flux produces new generations that |
| 6 | +preserve the source person's face across arbitrary scenes, poses, and |
| 7 | +prompts. |
| 8 | + |
| 9 | +Unlike PhotoMaker (which extracts the identity inside the inference |
| 10 | +process from a directory of images), PuLID-Flux's identity extractor is |
| 11 | +a heavy stack (insightface ArcFace + EVA-CLIP-L + IDFormer encoder) that |
| 12 | +is impractical to port to C++/ggml. To keep this implementation small and |
| 13 | +cross-vendor, **stable-diffusion.cpp consumes a precomputed identity |
| 14 | +embedding** produced by an external Python tool that runs once per source |
| 15 | +portrait. Everything downstream of that one-shot extraction is C++ and |
| 16 | +runs on any ggml backend (Vulkan and CPU tested; CUDA, Metal, ROCm untested).
| 17 | + |
| 18 | +## Architecture summary |
| 19 | + |
| 20 | +The PuLID-Flux contribution to the Flux denoise loop is a stack of 20 |
| 21 | +small cross-attention modules (`PerceiverAttentionCA`) inserted between |
| 22 | +the Flux transformer blocks: |
| 23 | + |
| 24 | +- After every 2nd of the 19 double-stream blocks, starting with the first (10 hook points)
| 25 | +- After every 4th of the 38 single-stream blocks, starting with the first (10 hook points)
| 26 | + |
| 27 | +Each cross-attention layer takes the current image tokens as query, the |
| 28 | +32-token / 2048-dim identity embedding as key+value, and adds its output |
| 29 | +(scaled by `id_weight`, typically 1.0) back to the image tokens. |
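
The hook schedule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a hook fires after block index `i` whenever `i % interval == 0` (the reading that yields 10 hook points per stream); the function name `hook_points` is hypothetical and is not part of stable-diffusion.cpp.

```python
def hook_points(num_blocks: int, interval: int) -> list:
    """Block indices after which a PuLID cross-attention module runs.

    Assumption: a hook fires after block i when i % interval == 0,
    so the first block (index 0) is always a hook point.
    """
    return [i for i in range(num_blocks) if i % interval == 0]


double_hooks = hook_points(19, 2)  # double-stream blocks, every 2nd
single_hooks = hook_points(38, 4)  # single-stream blocks, every 4th
total_modules = len(double_hooks) + len(single_hooks)

print("double-stream hooks:", double_hooks)
print("single-stream hooks:", single_hooks)
print("cross-attention modules:", total_modules)
```

Under that assumption both streams contribute 10 hook points, which matches the 20 `PerceiverAttentionCA` modules stated above.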
| 30 | + |
| 31 | +## Required weights |
| 32 | + |
| 33 | +Three inputs are required (the standard Flux weight set plus two PuLID-specific files):
| 34 | + |
| 35 | +1. **Flux base** (transformer + VAE + clip_l + t5xxl) -- exactly as |
| 36 | + [docs/flux.md](flux.md) describes. |
| 37 | +2. **PuLID weights** -- download from |
| 38 | + [guozinan/PuLID](https://huggingface.co/guozinan/PuLID): |
| 39 | + - `pulid_flux_v0.9.0.safetensors` or `pulid_flux_v0.9.1.safetensors` |
| 40 | + (recommended; this implementation is verified against v0.9.1) |
| 41 | + - **v1.1 (`pulid_v1.1.safetensors`) is NOT yet supported** -- it uses |
| 42 | + renamed keys (`id_adapter_attn_layers.*` instead of `pulid_ca.*`) |
| 43 | + and possibly different module structure. Future PR. |
| 44 | +3. **Identity embedding (.pulidembd)** -- produced by the precompute |
| 45 | + tool below. |
| 46 | + |
| 47 | +## Precompute the identity embedding |
| 48 | + |
| 49 | +The precompute tool runs the PyTorch identity-extraction stack on a |
| 50 | +single portrait image and writes the resulting `(32, 2048)` embedding |
| 51 | +to a `.pulidembd` binary file (about 131 KB). Run it once per source |
| 52 | +person; the same file is reused for any number of generations. |
| 53 | + |
| 54 | +A reference Python script is provided at
| 55 | +[`scripts/pulid_extract_id.py`](../scripts/pulid_extract_id.py). It
| 56 | +requires:
| 57 | +- A working PyTorch + diffusers stack (CUDA or CPU)
| 58 | +- `insightface`, `facexlib`, `eva-clip`, `torchvision` |
| 59 | +- The PuLID weights file (same one stable-diffusion.cpp will load below) |
| 60 | +- The ToTheBeginning/PuLID repo's `pulid/pipeline_flux.py` (and its |
| 61 | + dependencies under `pulid/` and `flux/`) -- recommended to vendor |
| 62 | + rather than pip-install due to upstream packaging quirks |
| 63 | + |
| 64 | +Run it as: |
| 65 | + |
| 66 | +``` |
| 67 | +python pulid_extract_id.py \ |
| 68 | + --portrait /path/to/source-photo.jpg \ |
| 69 | + --pulid-weights /path/to/pulid_flux_v0.9.1.safetensors \ |
| 70 | + --out /path/to/source.pulidembd |
| 71 | +``` |
| 72 | + |
| 73 | +## Format (gguf) |
| 74 | + |
| 75 | +The embedding is a standard **gguf** container holding a single tensor: |
| 76 | + |
| 77 | +``` |
| 78 | +tensor name : "pulid_id" |
| 79 | +shape : [token_dim, num_tokens] (ggml order; typically [2048, 32]) |
| 80 | +type : F16 (also accepts F32 / BF16) |
| 81 | +metadata : general.architecture = "pulid", pulid.version = 1 |
| 82 | +``` |
| 83 | + |
| 84 | +stable-diffusion.cpp loads it with the normal gguf reader |
| 85 | +(`gguf_init_from_file`) and converts to fp32 at load time -- no bespoke |
| 86 | +parser. Total file size for the typical (32, 2048, fp16) case is ~131 KB. |
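
The quoted file size follows directly from the tensor shape. Below is a quick check of the arithmetic; it counts only the tensor payload and assumes the gguf header and metadata add a small overhead on top, which is not quantified here. The helper name `payload_bytes` is illustrative.

```python
NUM_TOKENS = 32
TOKEN_DIM = 2048
BYTES_PER_ELEMENT = {"F16": 2, "BF16": 2, "F32": 4}


def payload_bytes(dtype: str) -> int:
    """Size of the raw `pulid_id` tensor data, excluding gguf header/metadata."""
    return NUM_TOKENS * TOKEN_DIM * BYTES_PER_ELEMENT[dtype]


for dtype in ("F16", "F32"):
    size = payload_bytes(dtype)
    print(f"{dtype}: {size} bytes (~{size / 1000:.0f} KB)")
# F16: 131072 bytes (~131 KB)
# F32: 262144 bytes (~262 KB)
```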
| 87 | + |
| 88 | +## Command-line usage |
| 89 | + |
| 90 | +``` |
| 91 | +.\bin\Release\sd-cli.exe `
| 92 | + --diffusion-model models\flux1-schnell-Q4_K_S.gguf `
| 93 | + --vae models\ae.safetensors `
| 94 | + --clip_l models\clip_l.safetensors `
| 95 | + --t5xxl models\t5xxl_fp16.safetensors `
| 96 | + --pulid-weights models\pulid_flux_v0.9.1.safetensors `
| 97 | + --pulid-id-embedding source.pulidembd `
| 98 | + --pulid-id-weight 1.0 `
| 99 | + -p "candid photograph of a young woman on a beach at sunset" `
| 100 | + --cfg-scale 1.0 --sampling-method euler --steps 4 -W 512 -H 512 `
| 101 | + --seed 42 --clip-on-cpu `
| 102 | + -o out.png
| 103 | +``` |
| 104 | + |
| 105 | +For Flux Dev (instead of Schnell), add `--guidance 3.5` and change `--steps 4` to `--steps 20`.
| 106 | + |
| 107 | +## Flags |
| 108 | + |
| 109 | +| Flag | Purpose | |
| 110 | +|----------------------------|-------------------------------------------------------------------| |
| 111 | +| `--pulid-weights <path>` | Path to `pulid_flux_v0.9.x.safetensors`. Loaded with the model. | |
| 112 | +| `--pulid-id-embedding <p>` | Path to a `.pulidembd` binary produced by the precompute tool. | |
| 113 | +| `--pulid-id-weight <f>` | Identity-injection strength. Typical 0.7-1.2; default 1.0. | |
| 114 | + |
| 115 | +`--pulid-weights` and `--pulid-id-embedding` must both be set to activate
| 116 | +PuLID; `--pulid-id-weight` falls back to its default of 1.0 when omitted.
| 117 | +Setting only `--pulid-weights` (no embedding) loads the weights but disables
| 118 | +injection at runtime. Setting `--pulid-id-weight 0` zeros out the contribution
| 119 | +(useful for falsification testing: outputs should be byte-identical to a no-PuLID run with the same seed).
| 120 | + |
| 121 | +## Memory budget |
| 122 | + |
| 123 | +At 512x512, 4 steps (Schnell), the 20 cross-attention layers add roughly |
| 124 | +10% to denoise time and almost nothing to peak VRAM. Tested on a 12 GB |
| 125 | +consumer card alongside Flux Schnell Q4 GGUF + CPU-offloaded clip_l and |
| 126 | +t5xxl + GPU-resident VAE. |
| 127 | + |
| 128 | +At 1024x1024 with Flux Dev Q4 + 20 steps + PuLID, the VAE decode compute |
| 129 | +buffer doesn't fit on a 12 GB card even with `--vae-on-cpu`. Workaround: |
| 130 | +explicitly route the VAE to the CPU backend instead of using the offload flag:
| 131 | + |
| 132 | +``` |
| 133 | +--backend "diffusion=vulkan0,vae=cpu" |
| 134 | +``` |
| 135 | + |
| 136 | +The `--vae-on-cpu` flag offloads VAE weights but leaves the compute graph |
| 137 | +on the default backend; this is existing stable-diffusion.cpp behavior, |
| 138 | +not a PuLID-specific issue. Documented here because anyone running PuLID |
| 139 | +at 1024 will hit it. |
| 140 | + |
| 141 | +## Backend selection |
| 142 | + |
| 143 | +The standard `--backend` flag works as documented. Common patterns (device indices such as `vulkan0` depend on the enumeration order on your machine):
| 144 | + |
| 145 | +``` |
| 146 | +# AMD Vulkan |
| 147 | +--backend "diffusion=vulkan0,vae=cpu" |
| 148 | +
|
| 149 | +# NVIDIA Vulkan |
| 150 | +--backend "diffusion=vulkan1,vae=cpu" |
| 151 | +
|
| 152 | +# CUDA |
| 153 | +--backend "diffusion=cuda0,vae=cpu" |
| 154 | +``` |
| 155 | + |
| 156 | +The PuLID cross-attention layers run on the same backend as the main |
| 157 | +diffusion model. They have not yet been independently profiled on every |
| 158 | +backend; only Vulkan and CPU have been tested by the original contributor. |
| 159 | + |
| 160 | +## Verification |
| 161 | + |
| 162 | +A three-way SHA-256 check is the recommended sanity test when bringing up |
| 163 | +a new combination of model + backend + hardware: |
| 164 | + |
| 165 | +| Run                                     | Expected hash relation                                |
| 166 | +|-----------------------------------------|-------------------------------------------------------|
| 167 | +| A: no `--pulid-*` flags                 | baseline                                              |
| 168 | +| B: PuLID flags, `--pulid-id-weight 0.0` | **byte-identical to A**                               |
| 169 | +| C: PuLID flags, `--pulid-id-weight 1.0` | **different from A and B**, preserves source identity |
| 170 | + |
| 171 | +If A and B differ, the injection is allocating or computing something even
| 172 | +at zero weight -- likely a bug. If A and C match, the injection is not taking effect.
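
The comparison can be automated with a short script. The following is a sketch only: it assumes the three runs have already been written to disk with the same seed and settings, and the function names (`sha256_file`, `three_way_check`) are hypothetical helpers, not part of stable-diffusion.cpp.

```python
import hashlib


def sha256_file(path) -> str:
    """Hex SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def three_way_check(run_a, run_b, run_c) -> list:
    """Compare outputs of runs A, B, C; return a list of problems (empty = pass)."""
    a, b, c = (sha256_file(p) for p in (run_a, run_b, run_c))
    problems = []
    if a != b:
        # Weight 0 should make the injection a no-op.
        problems.append("B differs from A: injection is not a no-op at weight 0")
    if a == c:
        # Weight 1.0 should change the image.
        problems.append("C equals A: PuLID had no effect at weight 1.0")
    return problems
```

For example, `three_way_check("a.png", "b.png", "c.png")` (file names chosen for illustration) returns an empty list when the hashes relate as in the table. Whether run C actually preserves the source identity still has to be judged by eye; a hash comparison cannot confirm that.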
| 173 | + |
| 174 | +## Limitations / not yet supported |
| 175 | + |
| 176 | +- **`--skip-layers` (skip-layer-guidance / SLG) combined with PuLID** is not |
| 177 | + supported. The `pulid_ca` index advances per non-skipped block, so a |
| 178 | + skipped block silently misaligns the cross-attention weight assignment |
| 179 | + vs. the trained intervals. The reference PyTorch implementation does |
| 180 | + not have SLG either, so there is no well-defined behavior to emulate. |
| 181 | + Use either feature alone. |
| 182 | +- **PuLID v1.1 weights** (`pulid_v1.1.safetensors`, renamed key layout). |
| 183 | +- **Multiple ID images.** The reference PyTorch implementation can fuse
| 184 | + several portraits into one embedding for stronger identity. This
| 185 | + implementation accepts only a single embedding, so any fusion of
| 186 | + several images has to happen in the external precompute tool.
| 187 | +- **Negative-prompt branch of CFG.** PuLID only injects on the positive |
| 188 | + conditioning path in the published reference, and the implementation |
| 189 | + here follows that. Flux's distilled guidance doesn't run a separate |
| 190 | + uncond branch in normal use, so this matters only for `--true-cfg` |
| 191 | + workflows that aren't standard for Flux. |
| 192 | +- **Backends other than Vulkan and CPU** are untested by the original |
| 193 | + contributor. The implementation is pure-ggml and should work on CUDA, |
| 194 | + ROCm, and Metal, but verification by users on those backends is |
| 195 | + welcomed. |