
Suggestion for speeding up index_for_timestep by removing sequential nonzero() calls in samplers #9417

Open
@ethanweber

Description


Is your feature request related to a problem? Please describe.
First off, thanks for the great codebase and for providing so many resources! I just wanted to share an improvement I made for myself, in case you'd like to include it for all samplers. I'm using the FlowMatchEulerDiscreteScheduler, and after profiling, I noticed that it was unexpectedly slowing down my training speed. I'll describe the issue and the proposed solution here rather than making a PR, since this would touch a lot of code and perhaps someone on the diffusers team would like to implement it.

Describe the solution you'd like.
This line in particular is very slow, because it is a Python for loop over the batch and each self.index_for_timestep() call uses nonzero(), which is an expensive operation:

step_indices = [self.index_for_timestep(t, schedule_timesteps) for t in timestep]

Describe alternatives you've considered.
I've changed the code as follows:

# huggingface code
def index_for_timestep(self, timestep, schedule_timesteps=None):
    if schedule_timesteps is None:
        schedule_timesteps = self.timesteps

    indices = (schedule_timesteps == timestep).nonzero()

    # The sigma index that is taken for the **very** first `step`
    # is always the second index (or the last index if there is only 1)
    # This way we can ensure we don't accidentally skip a sigma in
    # case we start in the middle of the denoising schedule (e.g. for image-to-image)
    pos = 1 if len(indices) > 1 else 0

    return indices[pos].item()

changed to =>

# my code
def index_for_timestep(self, timestep, schedule_timesteps=None):
    if schedule_timesteps is None:
        schedule_timesteps = self.timesteps

    # Map the timestep(s) linearly onto [0, num_steps - 1] in one vectorized op.
    # Note: this assumes schedule_timesteps are evenly spaced between start and end.
    num_steps = len(schedule_timesteps)
    start = schedule_timesteps[0].item()
    end = schedule_timesteps[-1].item()
    indices = torch.round(((timestep - start) / (end - start)) * (num_steps - 1)).long()

    return indices

and

# huggingface code
# self.begin_index is None when scheduler is used for training, or pipeline does not implement set_begin_index
if self.begin_index is None:
    step_indices = [self.index_for_timestep(t, schedule_timesteps) for t in timestep]

changed to =>

# my code
# self.begin_index is None when scheduler is used for training, or pipeline does not implement set_begin_index
if self.begin_index is None:
    step_indices = self.index_for_timestep(timestep, schedule_timesteps)
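For a rough sense of the difference, here is a small timing sketch (illustrative only, not a rigorous benchmark; it runs on CPU tensors and simplifies the original pos handling to the first match). On GPU the gap is typically larger, because each nonzero() call forces a device synchronization:

import time

import torch
from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler()
scheduler.set_timesteps(1000)
schedule_timesteps = scheduler.timesteps
# a batch of 64 timesteps drawn from the schedule
timestep = schedule_timesteps[torch.randint(0, len(schedule_timesteps), (64,))]

def loop_lookup():
    # per-element nonzero(), as in the current index_for_timestep loop
    return [(schedule_timesteps == t).nonzero()[0].item() for t in timestep]

def vectorized_lookup():
    # single rounding op over the whole batch (assumes evenly spaced timesteps)
    start, end = schedule_timesteps[0], schedule_timesteps[-1]
    num_steps = len(schedule_timesteps)
    return torch.round((timestep - start) / (end - start) * (num_steps - 1)).long()

for fn in (loop_lookup, vectorized_lookup):
    t0 = time.perf_counter()
    for _ in range(100):
        fn()
    print(fn.__name__, f"{time.perf_counter() - t0:.3f}s")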

Additional context.
Just wanted to bring this modification to your attention since it could be a training speedup for folks. 🙂 Especially when someone has a batch size > 1, this for loop runs a nonzero() search per batch element. Some other small changes might be necessary to ensure compatibility with the function changes, but I suspect it could help everyone. Thanks for the consideration!

Activity

asomoza (Member) commented on Sep 11, 2024
yiyixuxu (Collaborator) commented on Sep 12, 2024

hey @ethanweber, thanks for the issue!
Yes, we are aware that this nonzero() operation is a surprisingly expensive one; that's why we added the step_index counter, to avoid using it as much as possible during inference.

Indeed, it will still slow down training, but I think you can simply not use this method; in the diffusers example scripts we just do this:

noisy_model_input = (1.0 - sigmas) * model_input + sigmas * noise
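Concretely, if the training loop samples the schedule indices directly, the sigmas (and timesteps) can be gathered by index and no timestep-to-index lookup is needed at all. A minimal sketch of that pattern (scheduler, model_input and batch_size are assumed to exist; variable names are illustrative, not the exact example script):

import torch

# assumes scheduler.timesteps[i] and scheduler.sigmas[i] describe the same step
indices = torch.randint(0, len(scheduler.timesteps), (batch_size,), device=model_input.device)
sigmas = scheduler.sigmas.to(model_input.device)[indices].view(-1, 1, 1, 1)
timesteps = scheduler.timesteps.to(model_input.device)[indices]

noise = torch.randn_like(model_input)
noisy_model_input = (1.0 - sigmas) * model_input + sigmas * noise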

I think the solution you provided assumes that the sigmas are always evenly spaced, no? That is the case for most flow-match models, and it makes sense, but SD3 constructs its steps a little differently, so I think it might not work there (when the shift value is not 1). For example, here you can see a zero appended at the end, so the last step might have a different size:

self.sigmas = torch.cat([sigmas, torch.zeros(1, device=sigmas.device)])
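If the exact lookup has to stay (so that unevenly spaced schedules like SD3's still work), it can still be vectorized without a Python loop. A rough sketch of one way to do it (an illustration, not the scheduler's API): a single broadcasted comparison replaces the per-element nonzero() calls:

import torch

def indices_for_timesteps(timestep, schedule_timesteps):
    # boolean matrix of exact matches, shape (batch, num_steps)
    matches = timestep[:, None] == schedule_timesteps[None, :]
    counts = matches.sum(dim=1)
    first = matches.int().argmax(dim=1)
    # mimic the original pos handling: prefer the second occurrence
    # when a timestep appears more than once in the schedule
    second = (matches.int().cumsum(dim=1) == 2).int().argmax(dim=1)
    return torch.where(counts > 1, second, first)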

dianyo (Contributor) commented on Sep 12, 2024

Hi @yiyixuxu,

I've encountered a similar issue before. I was thinking an elegant solution might be to maintain two copies of the timesteps, one on the CPU (a Python list) and one on the GPU (a PyTorch tensor). Something like index_for_timestep would then use the CPU version to avoid the slow nonzero(); a rough sketch follows below.

I do see this approach would change a lot of code in the schedulers and pipelines, but I still think it's worth it. If you're okay with the idea, I'm glad to create a PR for the implementation.
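A rough sketch of that idea (hypothetical helper names, not an existing diffusers API); the trade-off is that the incoming timestep needs to already be a Python number, otherwise the .item() conversion reintroduces a device sync:

def set_cpu_timesteps(scheduler):
    # keep a plain Python list alongside the tensor copy of the schedule
    scheduler._timesteps_list = scheduler.timesteps.tolist()

def index_for_timestep_cpu(scheduler, timestep):
    # timestep is expected to be a Python float/int (call .item() beforehand if needed)
    return scheduler._timesteps_list.index(float(timestep))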

yiyixuxu (Collaborator) commented on Sep 13, 2024

@dianyo would you be able to explain the context of your issue?
@ethanweber mentioned he has this problem during training; is it the same for you?

Note that index_for_timestep is only there for backward-compatibility reasons, or for training in some cases; we do not use it anymore now that we have the set_begin_index feature. A simplified sketch of that counter pattern is below.
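For reference, the counter pattern looks roughly like this (a simplified sketch of the idea, not the exact scheduler code):

class CounterBasedScheduler:
    def __init__(self, timesteps):
        self.timesteps = timesteps
        self._step_index = None
        self._begin_index = None

    def set_begin_index(self, begin_index=0):
        # pipelines call this once up front, so no per-step lookup is needed
        self._begin_index = begin_index

    def _init_step_index(self, timestep):
        if self._begin_index is None:
            # fallback: search for the timestep (the slow path discussed above)
            self._step_index = int((self.timesteps == timestep).nonzero()[0].item())
        else:
            self._step_index = self._begin_index

    def step(self, timestep):
        if self._step_index is None:
            self._init_step_index(timestep)
        sigma_index = self._step_index
        # advance the counter instead of searching again on the next call
        self._step_index += 1
        return sigma_index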

github-actions (Contributor) commented on Oct 12, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sayakpaul (Member) commented on Nov 1, 2024

Closing due to inactivity.

a-r-r-o-w (Member) commented on Nov 1, 2024

I think this is still an issue we will be looking at when removing CUDA syncs across our pipelines and schedulers. Good to keep open for now for tracking.

The stale label (Issues that haven't received updates) was removed on Nov 1, 2024.
sayakpaul (Member) commented on Nov 1, 2024

Oops my bad. Related:
#9475

The performance label (Anything related to performance improvements, profiling and benchmarking) was added on Nov 20, 2024.
a-r-r-o-w (Member) commented on Nov 20, 2024

A gentle ping every few days to keep the stale bot away

yiyixuxu (Collaborator) commented on Nov 20, 2024

we should just set the begin index for flow-matching pipelines (SD3, Flux, etc.), like this:

self.scheduler.set_begin_index(t_start * self.scheduler.order)

but for text-to-image

opening this up to the community now

github-actions (Contributor) commented on Dec 14, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

The stale label was added and removed on Feb 8, 2025.

Metadata

Labels: contributions-welcome, help wanted (Extra attention is needed), performance (Anything related to performance improvements, profiling and benchmarking), wip


Participants: @asomoza, @dianyo, @yiyixuxu, @ethanweber, @sayakpaul
