perf: --eager-load-params for fast steady-state streaming by fszontagh · Pull Request #1646 · leejet/stable-diffusion.cpp

fszontagh · 2026-06-13T08:19:31Z

Summary

After #1644 centralized weight staging, params are loaded from disk to the params backend lazily on the first prepare_params call. For multi-segment streaming on a large model this means the first sampling step pays the entire disk-read cost (8-15 seconds per segment on Z-Image bf16), and batch images re-pay it whenever runner_done() releases the params storage.

This PR adds a sd_ctx_params_t::eager_load_params flag (CLI: --eager-load-params) that loads every registered tensor into the params backend right after metadata validation. Default off, so the lazy behavior is preserved for users who want lower peak host RAM at model-load time.

Numbers

RTX 3060 12 GB, --offload-to-cpu --stream-layers --max-vram -1:

| Workload | Default (lazy) | `--eager-load-params` |
| --- | --- | --- |
| SDXL bf16 1152x896 batch=2 8 steps `generate_image` | 21 s | 17 s |
| Z-Image bf16 1024x688 batch=2 9 steps `generate_image` | 359 s | 58 s |

For long-lived processes (servers, batch generation) the eager path also reduces total wall-clock time, because images 2..N reuse the warm pinned-host cache instead of re-reading the model from disk.

Implementation

ModelManager::load_all_params_eagerly() collects all registered states and calls the existing load_tensors_to_params_backend.
Plumbed through sd_ctx_params_t::eager_load_params, init, and to_str.
CLI flag added in examples/common.
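The lazy/eager split described above can be sketched as follows. This is a minimal illustration, not the project's code: `TensorState`, `ParamStore`, and `mark_loaded` are hypothetical stand-ins, and the only thing borrowed from the PR is the idea that the eager entry point collects every registered state and hands it to the same loader the lazy path already uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one registered tensor (not the real type).
struct TensorState {
    std::string name;
    bool loaded;
};

// Hypothetical loader: stands in for the disk read into the params backend.
void mark_loaded(std::vector<TensorState*>& pending) {
    for (TensorState* t : pending) t->loaded = true;
}

// Hypothetical manager mimicking the lazy and eager entry points.
class ParamStore {
public:
    using Loader = std::function<void(std::vector<TensorState*>&)>;

    explicit ParamStore(Loader loader) : loader_(std::move(loader)) {}

    void register_tensor(const std::string& name) {
        states_.push_back(TensorState{name, false});
    }

    // Lazy path: load only the requested tensors that are not resident yet.
    // Returns how many tensors had to be loaded by this call.
    std::size_t prepare_params(const std::vector<std::string>& names) {
        std::vector<TensorState*> pending;
        for (TensorState& s : states_) {
            bool wanted = std::find(names.begin(), names.end(), s.name) != names.end();
            if (wanted && !s.loaded) pending.push_back(&s);
        }
        return load_pending(pending);
    }

    // Eager path: collect every registered state and reuse the same loader.
    std::size_t load_all_params_eagerly() {
        std::vector<TensorState*> pending;
        for (TensorState& s : states_) {
            if (!s.loaded) pending.push_back(&s);
        }
        return load_pending(pending);
    }

private:
    std::size_t load_pending(std::vector<TensorState*>& pending) {
        if (!pending.empty()) loader_(pending);
        return pending.size();
    }

    Loader loader_;
    std::vector<TensorState> states_;
};
```

In this sketch, calling `load_all_params_eagerly()` once after registration makes every later `prepare_params` call a no-op, which is the behavior the flag is meant to give at the cost of higher host RAM at load time.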

Checklist

I have read and confirmed this PR follows the contribution guidelines.


leejet · 2026-06-14T11:51:29Z

Closing this as obsolete.

The original issue this PR tried to address no longer applies to the current default behavior. Params storage is kept after first use in the default params backend, so later runs can reuse the loaded weights. Storage is only released for tensors with Disk residency, which corresponds to --params-backend disk, and that reload/release behavior is expected for that mode.

So I don't think we need a separate --eager-load-params option anymore. Thanks for the contribution!

fszontagh · 2026-06-14T17:30:00Z

@leejet thanks for taking a look. After the merge with #1644 and the later refactors I tested again and you're correct that storage is kept after first use - release_params_storage_blocks only releases blocks where residency_mode == Disk, so batch image 2 reuses what image 1 already loaded.
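A rough sketch of the release rule described above, under stated assumptions: `ResidencyMode::Resident`, `StorageBlock`, and `release_disk_blocks_sketch` are hypothetical names for illustration, and only the `residency_mode == Disk` condition comes from the discussion. The real `release_params_storage_blocks` may differ in structure.

```cpp
#include <cstddef>
#include <vector>

// Only Disk is named in the discussion; Resident is an assumed placeholder
// for "any other residency mode".
enum class ResidencyMode { Resident, Disk };

// Hypothetical storage block with just the fields this sketch needs.
struct StorageBlock {
    ResidencyMode residency_mode;
    bool allocated;
};

// Frees only blocks whose residency is Disk; everything else stays warm.
// Returns the number of blocks released.
std::size_t release_disk_blocks_sketch(std::vector<StorageBlock>& blocks) {
    std::size_t released = 0;
    for (StorageBlock& b : blocks) {
        if (b.allocated && b.residency_mode == ResidencyMode::Disk) {
            b.allocated = false;
            ++released;
        }
    }
    return released;
}
```

With a filter like this, blocks in the default params backend survive the end of a run, which is why batch image 2 sees no reload.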

But the first use is still lazy. With --stream-layers, every merged segment's first prepare_params call triggers load_tensors_to_params_backend for tensors it hasn't touched yet, and that work lands inside the first sampling step.

Per-call probe inside prepare_params on the current merged branch, Z-Image bf16 1024x688 batch=2 2 steps, RTX 3060 12 GB:

| Stage | Calls | Load time per call | Total |
| --- | --- | --- | --- |
| image 1, step 1 (cold loads) | ~36 | one ~30 s, one ~40 s, ~32 small ~3 s each | ~180 s |
| image 1, step 2+ | all | 0 ms | 0 |
| image 2, entirely | all | 0 ms | 0 |

So storage is kept (great), but the cold disk-read cost is paid in the first sampling step. For users on --stream-layers (the perf-sensitive path that motivated #1612, #1601, #1611) this is the slow first step that's hard to avoid without a hook to load up front.
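To make the cost attribution concrete, here is a small simulation sketch. It assumes a simplified model that is not taken from the codebase: each segment has a one-time load cost paid on first touch and a fixed compute cost per step. `CostProfile` and `simulate_step_costs` are hypothetical names, and the numbers used with it are arbitrary units rather than measurements.

```cpp
#include <cstddef>
#include <vector>

struct CostProfile {
    double model_load = 0.0;         // cost paid before sampling starts
    std::vector<double> step_costs;  // cost paid inside each sampling step
};

// segment_load_cost: assumed one-time disk-read cost per segment.
// compute_cost: assumed per-segment compute cost in every step.
CostProfile simulate_step_costs(bool eager,
                                const std::vector<double>& segment_load_cost,
                                double compute_cost,
                                std::size_t steps) {
    CostProfile profile;
    std::vector<bool> resident(segment_load_cost.size(), false);

    if (eager) {
        // Eager: every segment is read up front, at model load.
        for (std::size_t i = 0; i < resident.size(); ++i) {
            profile.model_load += segment_load_cost[i];
            resident[i] = true;
        }
    }

    for (std::size_t s = 0; s < steps; ++s) {
        double cost = 0.0;
        for (std::size_t i = 0; i < resident.size(); ++i) {
            if (!resident[i]) {
                // Lazy first touch: the cold load lands inside this step.
                cost += segment_load_cost[i];
                resident[i] = true;
            }
            cost += compute_cost;
        }
        profile.step_costs.push_back(cost);
    }
    return profile;
}
```

Under this model the total work is identical in both modes; only where the disk reads are charged changes. In lazy mode they all show up in step 1, while in eager mode they move to model load and every step costs the same.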

With --eager-load-params the same disk reads happen at model load instead, and step 1 becomes the same speed as step 2+:

| Workload | Default (lazy) | `--eager-load-params` |
| --- | --- | --- |
| Z-Image bf16 1024x688 batch=2 9 steps `generate_image` | 244 s | 63 s |
| SDXL bf16 1152x896 batch=2 8 steps `generate_image` | 21 s | 17 s |

If you'd prefer a different shape (e.g. always-on when `stream_layers && params_backend is cpu`, or an `--params-backend cpu-eager` token instead of a new flag), happy to rework. Or if you think paying the lazy first-load is the right default and users should accept it for the smaller startup RAM, I'll close.

fszontagh added 2 commits June 13, 2026 09:41

perf: --eager-load-params for fast steady-state streaming

466698a

Merge remote-tracking branch 'upstream/master' into perf/eager-load-p…

81e16ca

…arams # Conflicts: # examples/common/common.cpp # src/stable-diffusion.cpp

leejet closed this Jun 14, 2026

fszontagh wants to merge 2 commits into leejet:master from fszontagh:perf/eager-load-params
