Description
Is your feature request related to a problem? Please describe.
First off, thanks for the great codebase and for providing so many resources! I just wanted to share an improvement I made for myself, in case you'd like to include it for all samplers. I'm using the FlowMatchEulerDiscreteScheduler, and after profiling I noticed that it is unexpectedly slowing down my training. I'll describe the issue and proposed solution here rather than making a PR, since this would touch a lot of code and perhaps someone on the diffusers team would like to implement it.
Describe the solution you'd like.
This line in particular is very slow because it is a Python for loop:

```python
step_indices = [self.index_for_timestep(t, schedule_timesteps) for t in timestep]
```

and `self.index_for_timestep()` in turn calls `nonzero()`, which is slow.
Describe alternatives you've considered.
I've changed the code as follows:
```python
# huggingface code
def index_for_timestep(self, timestep, schedule_timesteps=None):
    if schedule_timesteps is None:
        schedule_timesteps = self.timesteps

    indices = (schedule_timesteps == timestep).nonzero()

    # The sigma index that is taken for the **very** first `step`
    # is always the second index (or the last index if there is only 1)
    # This way we can ensure we don't accidentally skip a sigma in
    # case we start in the middle of the denoising schedule (e.g. for image-to-image)
    pos = 1 if len(indices) > 1 else 0

    return indices[pos].item()
```
changed to =>
```python
# my code
def index_for_timestep(self, timestep, schedule_timesteps=None):
    if schedule_timesteps is None:
        schedule_timesteps = self.timesteps

    num_steps = len(schedule_timesteps)
    start = schedule_timesteps[0].item()
    end = schedule_timesteps[-1].item()
    indices = torch.round(((timestep - start) / (end - start)) * (num_steps - 1)).long()
    return indices
```
and
```python
# huggingface code
# self.begin_index is None when scheduler is used for training, or pipeline does not implement set_begin_index
if self.begin_index is None:
    step_indices = [self.index_for_timestep(t, schedule_timesteps) for t in timestep]
```
changed to =>
```python
# my code
# self.begin_index is None when scheduler is used for training, or pipeline does not implement set_begin_index
if self.begin_index is None:
    step_indices = self.index_for_timestep(timestep, schedule_timesteps)
```
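For illustration, here is a quick standalone sanity check of the idea, assuming the schedule's timesteps are evenly spaced (which is the assumption the rounding trick relies on); none of this is diffusers code:

```python
# Standalone sketch: compare the loop-based nonzero() lookup with the
# vectorized rounding lookup on an evenly spaced, descending schedule.
import torch

schedule_timesteps = torch.linspace(1000.0, 1.0, steps=50)   # evenly spaced, descending
timestep = schedule_timesteps[torch.randint(0, 50, (8,))]    # a batch of sampled timesteps

# original behavior: one nonzero() lookup per batch element
loop_indices = torch.tensor(
    [(schedule_timesteps == t).nonzero()[0].item() for t in timestep]
)

# proposed behavior: a single vectorized computation
start, end = schedule_timesteps[0].item(), schedule_timesteps[-1].item()
vec_indices = torch.round(
    (timestep - start) / (end - start) * (len(schedule_timesteps) - 1)
).long()

assert torch.equal(loop_indices, vec_indices)
```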
Additional context.
Just wanted to bring this modification to your attention since it could be a training speedup for folks 🙂 especially when someone has a large batch size > 1 and this for loop is running per-element nonzero search operations. Some other small changes might be necessary to ensure compatibility with the function changes, but I suspect it could help everyone. Thanks for the consideration!
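To make the claim a bit more concrete, here is a rough standalone micro-benchmark sketch (absolute numbers will vary with hardware and whether the tensors live on CPU or GPU; it is only meant to show the shape of the cost):

```python
# Rough micro-benchmark sketch: per-element nonzero() loop vs. vectorized rounding.
import time
import torch

schedule_timesteps = torch.linspace(1000.0, 1.0, steps=1000)
timestep = schedule_timesteps[torch.randint(0, 1000, (4096,))]  # large "batch"

t0 = time.perf_counter()
loop_indices = [(schedule_timesteps == t).nonzero()[0].item() for t in timestep]
t1 = time.perf_counter()

start, end = schedule_timesteps[0].item(), schedule_timesteps[-1].item()
vec_indices = torch.round(
    (timestep - start) / (end - start) * (len(schedule_timesteps) - 1)
).long()
t2 = time.perf_counter()

print(f"loop + nonzero: {t1 - t0:.4f}s, vectorized: {t2 - t1:.6f}s")
```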
Activity
asomoza commented on Sep 11, 2024
cc: @yiyixuxu
yiyixuxu commented on Sep 12, 2024
hey @ethanweber, thanks for the issue!
yes, we are aware that this `nonzero()` operation is a surprisingly expensive one; that's why we added the `step_index` counter, to avoid using it as much as possible during inference. Indeed, it will still slow down training, but I think you can simply not use this method; in the diffusers example scripts we just do this:
diffusers/examples/dreambooth/train_dreambooth_lora_flux.py
Line 1647 in 5e1427a
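(Paraphrasing the idea rather than the exact code at that line, with placeholder names: during training you can sample the step indices directly and index into the sigmas/timesteps, so no timestep-to-index lookup is ever needed.)

```python
# Sketch of "just don't call index_for_timestep during training"
# (placeholder names; paraphrased, not the exact permalinked code).
import torch

num_train_timesteps = 1000
# stand-ins for scheduler.sigmas / scheduler.timesteps
sigmas = torch.linspace(1.0, 1.0 / num_train_timesteps, num_train_timesteps)
timesteps = sigmas * num_train_timesteps

batch_size = 4
# sample the *indices* first ...
indices = torch.randint(0, num_train_timesteps, (batch_size,))
# ... then read both the timestep and the sigma off directly, no nonzero() needed
t = timesteps[indices]
sigma = sigmas[indices].view(batch_size, 1, 1, 1)  # broadcast over latent dims
```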
I think the solution you provided assumes that the sigmas are always evenly spaced, no? That is the case for most flow-match models, and it makes sense, but SD3 builds its steps a little differently, so I think it might not work there (when the shift value is not 1). Like here, you can see it appends a zero at the end, and the last step might have a different size:
diffusers/src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py
Line 208 in 5e1427a
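(As a hedged alternative sketch, not diffusers code: the lookup can stay vectorized without assuming even spacing by using an exact-match comparison over the whole schedule. Note the caveats in the comments.)

```python
# Sketch (not diffusers code): a vectorized exact-match lookup that does not
# assume evenly spaced timesteps, so shifted schedules are fine too.
# Caveats: assumes every timestep is present in the schedule, and returns the
# *first* matching index rather than the second when there are duplicates.
import torch

def index_for_timestep_vectorized(timestep, schedule_timesteps):
    # (batch, num_steps) boolean mask of exact matches; argmax picks the
    # first True along each row
    matches = schedule_timesteps.unsqueeze(0) == timestep.unsqueeze(1)
    return matches.to(torch.int64).argmax(dim=1)
```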
dianyo commented on Sep 12, 2024
Hi @yiyixuxu,
I've encountered a similar issue before. I was thinking an elegant solution might be to maintain two copies of `timesteps`, one on CPU (a Python list) and one on GPU (a PyTorch tensor). Something like `index_for_timestep` should use the CPU version to avoid the slow `nonzero()`. I do see this would change a lot of code in the schedulers and pipelines, but I still think it's worth it. If you're okay with the idea, I'm glad to create a PR for the implementation.
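(A rough sketch of that idea, with hypothetical attribute and method names, just to show how a CPU-side list would avoid both the `nonzero()` call and the device sync:)

```python
# Sketch of the "keep a CPU copy of timesteps" idea
# (hypothetical attribute/method names, not an actual diffusers API).
import torch

class TimestepLookupSketch:
    def set_timesteps(self, timesteps: torch.Tensor):
        self.timesteps = timesteps                   # tensor (possibly on GPU), used for math
        self._timesteps_cpu = timesteps.tolist()     # plain Python list, used for lookups

    def index_for_timestep(self, timestep) -> int:
        # list.index() is a pure-CPU search: no nonzero(), no host/device sync
        return self._timesteps_cpu.index(float(timestep))
```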
yiyixuxu commented on Sep 13, 2024
@dianyo would you be able to explain the context of your issue?
@ethanweber mentioned he has this problem during training; is it the same for you?
note that `index_for_timestep` is only there for backward compatibility reasons, or for training in some cases; we do not use it anymore with the `set_begin_index` feature.
github-actions commented on Oct 12, 2024
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
sayakpaul commented on Nov 1, 2024
Closing due to inactivity.
a-r-r-o-w commented on Nov 1, 2024
I think this is still an issue we will be looking at when removing CUDA syncs across our pipelines and schedulers. Good to keep open for now for tracking.
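(For context, a tiny standalone illustration of the syncs in question, assuming a CUDA device is available:)

```python
# Standalone sketch: nonzero() and .item() make the host wait for the GPU,
# which is the kind of sync being referred to here.
import torch

if torch.cuda.is_available():
    x = torch.randn(1_000_000, device="cuda")
    idx = (x > 0).nonzero()   # output size is data-dependent -> host/device sync
    first = idx[0].item()     # copies a scalar back to the CPU -> another sync
    print(first)
```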
sayakpaul commented on Nov 1, 2024
Oops my bad. Related:
#9475
a-r-r-o-w commented on Nov 20, 2024
A gentle ping every few days to keep the stale bot away
yiyixuxu commented on Nov 20, 2024
we should just set the begin index for the flow matching pipelines (SD3, Flux, etc.), like this:
diffusers/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3_img2img.py
Line 625 in 99c0483
but not yet for text-to-image.
Opening this up to the community now.
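(For reference, a rough sketch of the pattern at the permalink above, paraphrased with placeholder names rather than the exact code:)

```python
# Sketch of the set_begin_index pattern used by img2img-style pipelines
# (placeholder names; paraphrased, not the exact permalinked code).
def get_timesteps(scheduler, num_inference_steps, strength):
    # keep only the tail of the schedule when starting from a partially noised image
    init_timestep = min(num_inference_steps * strength, num_inference_steps)
    t_start = int(max(num_inference_steps - init_timestep, 0))
    timesteps = scheduler.timesteps[t_start * scheduler.order :]

    # tell the scheduler where it starts, so index_for_timestep() is never needed
    if hasattr(scheduler, "set_begin_index"):
        scheduler.set_begin_index(t_start * scheduler.order)

    return timesteps, num_inference_steps - t_start
```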
github-actions commented on Dec 14, 2024
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.