perf: --eager-load-params for fast steady-state streaming #1646
fszontagh wants to merge 2 commits into
…arams # Conflicts: # examples/common/common.cpp # src/stable-diffusion.cpp
Closing this as obsolete. The original issue this PR tried to address no longer applies to the current default behavior. Params storage is kept after first use in the default params backend, so later runs can reuse the loaded weights. Storage is only released for tensors with So I don't think we need a separate
@leejet thanks for taking a look. After the merge with #1644 and the later refactors I tested again, and you're correct that storage is kept after first use, but the first use is still lazy. With Per-call probe inside
So storage is kept (great), but the cold disk-read cost is paid in the first sampling step. For users on With
If you'd prefer a different shape (e.g. always-on when `stream_layers && params_backend is cpu`, or an `--params-backend cpu-eager` token instead of a new flag), happy to rework. Or if you think paying the lazy first-load is the right default and users should accept it for the smaller startup RAM, I'll close.
Summary
After #1644 centralized weight staging, params are loaded from disk to the params backend lazily on the first `prepare_params` call. For multi-segment streaming on a large model this means the first sampling step pays the entire disk-read cost (8-15 seconds per segment on Z-Image bf16), and batch images re-pay it whenever `runner_done()` releases the params storage.

This PR adds a `sd_ctx_params_t::eager_load_params` flag (CLI: `--eager-load-params`) that loads every registered tensor into the params backend right after metadata validation. Default off, so the lazy behavior is preserved for users who want lower peak host RAM at model-load time.

Numbers
RTX 3060 12 GB, `--offload-to-cpu --stream-layers --max-vram -1`:

[Table comparing `generate_image` timings with and without `--eager-load-params`; values not preserved.]

For long-lived processes (servers, batch generation) the eager path also reduces total wallclock because images 2..N reuse the warm pinned-host cache instead of re-reading the model from disk.
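
To illustrate the trade-off described in the summary, here is a minimal, self-contained sketch of lazy versus eager loading. It is not the code in this PR: `ToyParamStore`, `prepare`, `load_all_eagerly`, and `first_step_disk_reads` are hypothetical names, and "disk reads" are simulated with a counter rather than real I/O. The sketch only assumes that a tensor stays resident once loaded, as discussed in the conversation above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a params backend; not the stable-diffusion.cpp API.
class ToyParamStore {
public:
    // Register a tensor by name and size; nothing is read yet.
    void register_tensor(const std::string& name, std::size_t n_bytes) {
        sizes_[name] = n_bytes;
    }

    // Lazy path: read a tensor from "disk" only if it is not resident yet.
    const std::vector<char>& prepare(const std::string& name) {
        auto it = resident_.find(name);
        if (it == resident_.end()) {
            ++disk_reads_;  // simulated cold read
            it = resident_.emplace(name, std::vector<char>(sizes_.at(name))).first;
        }
        return it->second;
    }

    // Eager path: touch every registered tensor once, up front.
    void load_all_eagerly() {
        for (const auto& kv : sizes_) {
            prepare(kv.first);
        }
    }

    std::size_t disk_reads() const { return disk_reads_; }

private:
    std::unordered_map<std::string, std::size_t> sizes_;
    std::unordered_map<std::string, std::vector<char>> resident_;
    std::size_t disk_reads_ = 0;
};

// Returns how many simulated disk reads happen during the first sampling step.
std::size_t first_step_disk_reads(bool eager) {
    ToyParamStore store;
    const std::vector<std::string> names = {"seg0.w", "seg1.w", "seg2.w"};
    for (const auto& n : names) {
        store.register_tensor(n, 16);
    }
    if (eager) {
        store.load_all_eagerly();  // cost is paid at load time instead
    }
    const std::size_t before = store.disk_reads();
    for (const auto& n : names) {
        store.prepare(n);  // stands in for the first sampling step
    }
    return store.disk_reads() - before;
}
```

In this toy model the total number of reads is the same on both paths; the eager variant only moves them out of the first sampling step and into load time, at the price of holding every tensor in host memory from the start.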
Implementation
- `ModelManager::load_all_params_eagerly()` collects all registered states and calls the existing `load_tensors_to_params_backend`.
- `sd_ctx_params_t::eager_load_params`, `init`, and `to_str`.
- `examples/common`.

Checklist