metal : increase GPU duty-cycle during inference #9507

@ggerganov

Description

Apparently there is significant GPU downtime between the Metal compute encoders within a single ggml_metal_graph_compute() call:

[Instruments trace showing gaps between consecutive Metal compute encoders]

See #6506 for instructions on how to generate a trace like the one in the picture.

My expectation was that enqueuing the command buffers in parallel would make them execute without any downtime. The goal of this issue is to understand where this overhead comes from and if there is a way to avoid it.

Obviously, using a single command buffer would avoid all of the GPU downtime, but it is much slower to construct on a single thread. Ideally, we want to keep encoding multiple command buffers in parallel, but without the gaps in between during execution.
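For context, the parallel-encoding pattern described above looks roughly like this in Objective-C. This is a minimal sketch, not the actual ggml-metal.m code: N_CB, the function name, and the encoding callback are illustrative, and running it requires a Metal device.

```objc
#import <Metal/Metal.h>

#define N_CB 8  // number of command buffers encoded in parallel (illustrative)

void encode_graph_parallel(id<MTLDevice> device) {
    id<MTLCommandQueue> queue = [device newCommandQueue];

    id<MTLCommandBuffer> cbs[N_CB];

    // enqueue all command buffers up-front: this fixes their execution
    // order on the queue before any encoding happens
    for (int i = 0; i < N_CB; ++i) {
        cbs[i] = [queue commandBuffer];
        [cbs[i] enqueue];
    }

    // encode each command buffer's slice of the graph on its own thread,
    // each with its own compute encoder
    dispatch_apply(N_CB, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^(size_t i) {
        id<MTLComputeCommandEncoder> enc = [cbs[i] computeCommandEncoder];

        // ... encode this thread's range of graph nodes here ...

        [enc endEncoding];
        // committing makes this buffer eligible to run as soon as the
        // previously-enqueued buffers ahead of it have finished
        [cbs[i] commit];
    });

    // waiting on the last buffer implies the earlier ones have completed
    [cbs[N_CB - 1] waitUntilCompleted];
}
```

The gaps in question appear at the boundaries between these command buffers during execution, even though enqueuing them up-front fixes their order and lets the GPU start early.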

Metadata

Labels

Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)), help wanted (needs help from the community), performance (speed-related topics)
