Add tensor splitting (row + tensor), lazy loading and autofitting logic#1470
Add tensor splitting (row + tensor), lazy loading and autofitting logic#1470pwilkin wants to merge 4 commits into
Conversation
| "whether to memory-map model", | ||
| true, &enable_mmap}, | ||
| {"", | ||
| "--control-net-cpu", |
There was a problem hiding this comment.
As mentioned in #1184 (comment) , we should keep old flags as aliases.
Maybe they could be hidden from the normal help to avoid too much clutter, and be shown only with -v/--verbose?
| "--auto-fit", | ||
| "automatically pick DiT/VAE/Conditioner device placements based on " | ||
| "free GPU memory (default ON)", | ||
| true, &auto_fit}, |
There was a problem hiding this comment.
Would this include Vulkan backends with a main dGPU and an iGPU? If so, auto-fit by default may not be a good idea.
There was a problem hiding this comment.
To be honest, I'd still have to check how well this works on Vulkan (I don't think I have a setup with an iGPU to test though), but the idea is the algorithm is supposed to take those types of quirks into account. For now I only tested on my CUDA setup (also have to fix row-split to be supported on Vulkan if possible).
There was a problem hiding this comment.
I do have a Vulkan SDK setup FWIW and I've build llama.cpp with Vulkan already, so I'll setup the Vulkan build and test next.
| cpu_fallback_backend = pending->cpu_fallback; | ||
| row_split_ratios = pending->tensor_split_ratios; | ||
| row_main_device = pending->main_device; | ||
| #ifdef SD_USE_CUDA |
There was a problem hiding this comment.
We have just removed all these ifdefs. Please switch to sd_backend_is.
| // killer doesn't touch them, but they ARE counted against | ||
| // overcommit and can starve other allocations on tight-RAM | ||
| // systems). posix_fadvise(POSIX_FADV_DONTNEED) is the documented | ||
| // way to evict pagecache for a specific fd's pages. |
There was a problem hiding this comment.
I could add this to #1414 . In fact, maybe that PR could even simplify the lazy-loading? I could adapt it if needed (or at least so we don't step in each other's toes).
| return STRINGIZE(SDCPP_BUILD_VERSION); | ||
| } | ||
|
|
||
| void sd_list_devices(void) { |
There was a problem hiding this comment.
This is the wrong place to do this: version.cpp is supposed to only bundle binary version information, so that it can be easily replaced if the library is built without it. I'd move it to util.cpp (there is also an overlap with a log over there)
| if(SD_CUDA) | ||
| message("-- Use CUDA as backend stable-diffusion") | ||
| set(GGML_CUDA ON) | ||
| add_definitions(-DSD_USE_CUDA) |
There was a problem hiding this comment.
| const std::vector<int64_t>& share_bytes, | ||
| std::vector<ggml_backend_t>& out_extra_backends, | ||
| MultiBackendSpec& out_spec) -> bool { | ||
| #ifdef SD_USE_CUDA |
There was a problem hiding this comment.
|
Yeah, that's a problem with CC, it tramples code style consistency. I'll refactor. |
Two changes from wbruna's review:
1. Replace `#ifdef SD_USE_CUDA` blocks with runtime backend dispatch.
- Add `ggml_backend_split_buffer_type(backend, ...)` helper in
`ggml_extend_backend.hpp` that looks up `ggml_backend_split_buffer_type`
via `reg_get_proc_address`. Both CUDA and SYCL publish this proc, so
row-split is no longer compile-time gated to CUDA.
- Drop `#include "ggml-cuda.h"` and the `#ifdef SD_USE_CUDA` blocks in
`ggml_extend.hpp` (constructor + `alloc_params_buffer_row_split`).
- In `stable-diffusion.cpp::prepare_row_split_spec`, derive the backend
registry from the device-name prefix (CUDA0 -> reg "CUDA", SYCL1 ->
reg "SYCL") instead of calling `ggml_backend_cuda_get_device_count`.
- Drop `add_definitions(-DSD_USE_CUDA)` from CMakeLists.txt; the macro
is no longer referenced.
2. Restore the removed CPU-placement flags as soft-deprecated aliases
(matching the existing `--qwen2vl` / `--qwen2vl_vision` deprecation
pattern). Each alias sets the new `--*-backend-device` to "CPU" and
disables auto-fit so the placement is honored verbatim:
- `--clip-on-cpu` -> `--clip-backend-device CPU`
- `--vae-on-cpu` -> `--vae-backend-device CPU`
- `--control-net-cpu` -> `--control-net-backend-device CPU`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| "--upscale-model", | ||
| "path to esrgan model.", | ||
| &esrgan_path}, | ||
| {"", |
There was a problem hiding this comment.
I’d prefer not to add one CLI/API field per model component. This may become hard to maintain as more components are added.
Could we use a generic placement syntax instead, e.g.:
--backend 'clip=CPU,vae=CUDA0,diffusion=Vulkan0'
and parse it in new_sd_ctx into a map from component name to backend/device?
That should scale better, keep the API surface smaller, and make future components easier to support without adding new flags every time.
| "--chroma-t5-mask-pad", | ||
| "t5 mask pad size of chroma", | ||
| &chroma_t5_mask_pad}, | ||
| {"", |
There was a problem hiding this comment.
Similar concern here: this introduces multiple --fit-* flags per component (dit/vae/cond).
Could we consider a generic form instead, for example:
--fit-compute-reserve dit=...,vae=...,cond=...
and parse it into a component -> value map?
This would keep the CLI/API more scalable and consistent, and avoid growing a large number of per-component flags as new modules are added.
|
Just wanna say I'm sorry I've been slacking on this, going to try to rebase this on top of @wbruna 's optimizations with mmap some time this week. |
Add an auto-fit planner that picks DiT / VAE / Conditioner device placements from free GPU memory, treating each component as atomic (no intra-tensor row split — equivalent to llama.cpp's LLAMA_SPLIT_MODE_LAYER at component granularity, so views never land on a split buffer and no ggml patch is needed). Also adopt the PR leejet#1184 CLI conventions: - new: --main-backend-device, --diffusion-backend-device, --clip-backend-device, --vae-backend-device, --control-net-backend-device, --tae-backend-device, --upscaler-backend-device, --photomaker-backend-device, --vision-backend-device, --list-devices - removed: --clip-on-cpu, --vae-on-cpu, --control-net-cpu (and the matching keep_*_on_cpu fields on sd_ctx_params_t) Auto-fit knobs: --auto-fit / --no-auto-fit, --no-multi-gpu, --fit-target, --fit-compute-reserve-{dit,vae,cond}, --fit-dry-run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@leejet @wbruna Okay, so I finally managed to get this mess resolved and integrated with the solutions in master. This was tougher than expected, fortunately Anthropic releasing Fable helped ;) got the entire pipeline working again with the entire LTX-2 fully on GPU on my 26GB VRAM setup (Q8 Gemma, Q6 LTX-2.5). Managed to make it work this time without any hacks in the GGML code, so that's an extra plus. |
|
Would you mind posting an example command line of this working? Monitoring with 'nvtop', it never uses more than one GPU for DIT. It also crashes out at the end. I manually used this backend spec in master prior to this patch set and it works. I can see CUDA0 and CUDA1 being used at different stages, but never at the same time just as before. The new error that doesnt happen in master: |
|
@OtherCrashOverride could you please post your command that triggers this crash? I'm wondering if this is a real problem with IM2COL or some variant of OOM. |
leejet
left a comment
There was a problem hiding this comment.
Thanks a lot for the contribution and for continuing to rebase this on top of the latest master.
Since master has recently moved to the new weight / model management module, there are now quite a few related branches and overlapping changes. I tried doing a merge locally to make the diff easier to compare against the current codebase. However, I have not tested that merge, so I am not fully sure whether it introduces any issue. If you run into problems caused by that merge, feel free to revert the related merge commits.
At the moment, this PR is still very large and touches several major areas at once, which makes it difficult for me to review and merge safely. Also, since the current ModelManager implementation already supports lazy loading, I think the duplicated lazy-loading parts in this PR can be removed or greatly simplified.
I would suggest splitting the remaining functionality into three smaller PRs:
PR 1: ModelManager-based layer split + ggml runner scheduling
This PR should focus on implementing layer split and ggml runner scheduling on top of the existing ModelManager.
For backend selection, I would prefer using the generic --backend syntax instead of adding per-component CLI/API fields. For example:
--backend "dit=cuda0&cuda1"This means the DiT component should use both cuda0 and cuda1.
This first PR should only support explicit/manual placement. No auto-fit logic is needed here. The goal is to make the lower-level multi-device execution mechanism easy to review and test first.
PR 2: split buffer support on top of PR 1
After PR 1 is merged, a second PR can add split buffer support. I think this should be built around the current backend / model manager design instead of adding another parallel allocation path.
A possible implementation direction:
-
Create split buffer type at the backend layer
Around
src/core/ggml_extend_backend.cpp:257, add a helper toSDBackendManager:-
get the device from
ggml_backend_get_device(backend) -
get the registry from
ggml_backend_dev_backend_reg(dev) -
for CUDA, get the proc with:
ggml_backend_reg_get_proc_address(reg, "ggml_backend_split_buffer_type")
-
find the current device index inside that registry and use it as
main_device -
call:
fn(main_device, tensor_split.data())
I would start with CUDA only. SYCL appears to expose a different proc signature, so I would avoid trying to generalize this in the first version.
-
-
Let
ModelManagerown the split buffer typeIn
src/model_manager.h, add something like:void set_split_buffer_type(ggml_backend_t backend, ggml_backend_buffer_type_t split_buft);
Internally this can be stored as:
std::map<ggml_backend_t, ggml_backend_buffer_type_t>
-
Update buffer type selection
The current entry point seems to be around
src/model_manager.cpp:680.Current behavior is roughly:
- if params backend != compute backend, use the compute device host buffer
- otherwise use the params backend default buffer
I think this should become:
- if the tensor is eligible for split and the target is the compute backend, return the split buffer type
- otherwise keep the existing behavior
-
Handle allocation during params-backend CPU / streaming load
Around
src/model_manager.cpp:228, the code currently uses:ggml_backend_alloc_ctx_tensors(..., compute_backend)which always uses the default buffer type.
If we want split buffer to work when
--params-backend cpuis used, or during streaming load, tensors should probably be grouped by buffer type. The split group can then be allocated with:ggml_backend_alloc_ctx_tensors_from_buft(staging_ctx, split_buft)while the rest keeps the default path.
-
LoRA handling
src/model_manager.cpp:319currently applies LoRA by buildingadd_inplace/cpygraphs over model tensors. Split tensors participating in non-matmul ops may cause issues.For the MVP, I suggest either disabling this path when split buffer is enabled, or forcing LoRA to use runtime apply mode instead of modifying split model tensors directly.
PR 3: auto-fit / automatic placement
I also think the auto-fit logic should be split into a separate PR.
Auto-fit is a policy layer built on top of the lower-level mechanisms, while layer split, runner scheduling, and split buffer support are mechanism-level changes. Keeping auto-fit separate would make the core multi-device implementation easier to review and test first, and would also make it easier to evaluate the fitting heuristics independently later.
For this PR, I would prefer not to introduce more auto-fit-specific options such as --fit-compute-reserve / --fit-target. Instead, it should reuse a more general mechanism such as --max-vram, so the user can specify the available memory budget and the auto-fit logic can derive a placement plan from that.
Overall, I think the core ideas are useful, but I would like to reduce the scope first. A smaller first PR for manual layer split / scheduler integration, followed by split-buffer support, and finally a separate auto-fit PR, would make this much easier to review, test, and merge safely.
|
Yup, feels good. Will divide it into separate PRs and submit separately, with lazy loading I should be able to get a low enough quant to test everything on my machine :) |
Add logic + commandline arguments to handle automatic fitting, lazy loading (only load DiT after conditioner finishes) and row + tensor splits (row splits are actually, unlike in core llama.cpp, recommended here since they save much more VRAM). Also does automatic calculation of VAE tiling.
Requires a GGML patch (attached in
ggml.patch).Allowed me to run @leejet 's branch of LTX-2 on my 16 GB 5060 + 10 GB 3080 fully on VRAM with Gemma 3 12B Q8_K_XL + LTX-2 DiT Q6_K.
Supersedes #1184 (besides the RPC support).