
metal : increase GPU duty-cycle during inference #9507

Open
ggerganov opened this issue Sep 16, 2024 · 0 comments
Labels: Apple Metal, help wanted, performance

ggerganov commented Sep 16, 2024

Apparently, there is significant GPU downtime between Metal compute encoders within a single `ggml_metal_graph_compute()` call:

[Metal System Trace screenshot showing idle gaps on the GPU timeline between consecutive compute encoders]

See #6506 for instructions on how to generate the trace shown in the picture.

My expectation was that enqueuing the command buffers in parallel would make them execute back-to-back without any downtime. The goal of this issue is to understand where this overhead comes from and whether there is a way to avoid it.

Obviously, using a single command buffer would avoid all the GPU downtime, but constructing it on a single thread is much slower. Ideally, we want to keep encoding multiple command buffers in parallel, but without the gaps in between during execution.
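For context, the parallel encoding pattern described above looks roughly like the following Objective-C sketch. This is a simplified illustration of the multi-command-buffer approach, not the exact `ggml-metal` code; names such as `n_cb` and the encoding loop body are illustrative assumptions:

```objc
// Sketch: enqueue n_cb command buffers up front to fix their execution
// order, then encode each one on a separate thread.
id<MTLCommandQueue> queue = [device newCommandQueue];
const int n_cb = 4; // number of command buffers / encoder threads (illustrative)

NSMutableArray<id<MTLCommandBuffer>> *cbs = [NSMutableArray array];
for (int i = 0; i < n_cb; ++i) {
    id<MTLCommandBuffer> cb = [queue commandBuffer];
    [cb enqueue]; // reserves this buffer's place in the execution order
    [cbs addObject:cb];
}

// Encode each buffer's slice of the graph concurrently.
dispatch_apply(n_cb, dispatch_get_global_queue(QOS_CLASS_DEFAULT, 0), ^(size_t i) {
    id<MTLCommandBuffer>         cb  = cbs[i];
    id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder];
    // ... encode the ops assigned to command buffer i ...
    [enc endEncoding];
    [cb commit]; // buffers still execute in the enqueued order
});
```

Because each `commit` creates a separate command buffer boundary on the GPU, the downtime in the trace presumably accumulates at these boundaries even though the buffers were enqueued ahead of time.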

ggerganov added the help wanted, performance, and Apple Metal labels on Sep 16, 2024