
metal : reduce command encoding overhead #9698

Merged: 2 commits merged into master from gg/perf-metal on Oct 1, 2024
Conversation

ggerganov (Member) commented Sep 30, 2024

fix #9507

Submit the first 128 nodes from the main thread and, while they are being processed, enqueue and submit the rest of the command buffers.
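
For illustration, here is a minimal sketch of the overlap idea (all names in it, `compute_graph`, `encode_nodes`, `NODES_PER_CB`, are hypothetical and not the actual ggml-metal code): the first chunk of nodes is encoded and committed right away so the GPU can start executing, and the remaining command buffers are encoded and committed while that first chunk is already in flight. The actual change drives the later buffers through an `encode_async` block (see the review discussion below); a plain loop keeps the sketch short.

```objc
// Hedged sketch, not the real ggml-metal implementation: split the graph into
// chunks of NODES_PER_CB nodes, one Metal command buffer per chunk, and let
// the GPU start on the first chunk while the CPU keeps encoding the rest.
#import <Metal/Metal.h>

enum { NODES_PER_CB = 128 };

// hypothetical helper: encode graph nodes [start, end) into one command buffer
static void encode_nodes(id<MTLCommandBuffer> cb, int start, int end) {
    id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder];
    // ... bind pipelines/buffers and dispatch one kernel per node here ...
    (void) start; (void) end;
    [enc endEncoding];
}

static void compute_graph(id<MTLCommandQueue> queue, int n_nodes) {
    if (n_nodes <= 0) {
        return;
    }

    const int n_cb = (n_nodes + NODES_PER_CB - 1) / NODES_PER_CB;

    // enqueue every command buffer up front: [cb enqueue] reserves its slot in
    // the queue, so the buffers execute in graph order regardless of when each
    // one is actually committed
    NSMutableArray<id<MTLCommandBuffer>> * cbs = [NSMutableArray arrayWithCapacity:n_cb];
    for (int i = 0; i < n_cb; ++i) {
        id<MTLCommandBuffer> cb = [queue commandBuffer];
        [cb enqueue];
        [cbs addObject:cb];
    }

    // encode + commit the first chunk immediately so the GPU can start working
    encode_nodes(cbs[0], 0, n_nodes < NODES_PER_CB ? n_nodes : NODES_PER_CB);
    [cbs[0] commit];

    // while the GPU processes the first chunk, encode and commit the rest
    for (int i = 1; i < n_cb; ++i) {
        const int start = i*NODES_PER_CB;
        const int end   = start + NODES_PER_CB < n_nodes ? start + NODES_PER_CB : n_nodes;
        encode_nodes(cbs[i], start, end);
        [cbs[i] commit];
    }

    [[cbs lastObject] waitUntilCompleted];
}
```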

API Changes

  • remove ggml_backend_metal_set_n_cb

Benches

```sh
./scripts/compare-commits.sh master gg/perf-metal \
    -m ./models/tinyllama-1b/ggml-model-q4_0.gguf \
    -m ./models/tinyllama-1b/ggml-model-q8_0.gguf \
    -m ./models/tinyllama-1b/ggml-model-f16.gguf -r 10
```

| CPU | Model | Model Size [GiB] | Test | t/s master | t/s gg/perf-metal | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
|  | llama 1B F16 | 2.05 | pp512 | 7576.39 | 7588.18 | 1.00 |
|  | llama 1B F16 | 2.05 | tg128 | 148.00 | 153.89 | 1.04 |
|  | llama 1B Q4_0 | 0.59 | pp512 | 6797.19 | 6821.33 | 1.00 |
|  | llama 1B Q4_0 | 0.59 | tg128 | 233.30 | 245.85 | 1.05 |
|  | llama 1B Q8_0 | 1.09 | pp512 | 6861.19 | 6905.65 | 1.01 |
|  | llama 1B Q8_0 | 1.09 | tg128 | 199.71 | 211.15 | 1.06 |
```sh
./scripts/compare-commits.sh master gg/perf-metal \
    -m ./models/llama-3.2-1b-instruct/ggml-model-q4_0.gguf \
    -m ./models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf \
    -m ./models/llama-3.2-1b-instruct/ggml-model-f16.gguf \
    -m ./models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf \
    -m ./models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf \
    -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf -r 10 -fa 1
```

| CPU | Model | Model Size [GiB] | Test | t/s master | t/s gg/perf-metal | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | llama 1B F16 | 2.79 | pp512 | 2026.97 | 2030.72 | 1.00 |
| M1 Pro | llama 1B F16 | 2.79 | tg128 | 59.93 | 60.95 | 1.02 |
| M1 Pro | llama 3B F16 | 6.72 | pp512 | 720.95 | 721.32 | 1.00 |
| M1 Pro | llama 3B F16 | 6.72 | tg128 | 25.02 | 24.94 | 1.00 |
| M1 Pro | llama 1B Q4_0 | 0.91 | pp512 | 1805.67 | 1820.12 | 1.01 |
| M1 Pro | llama 1B Q4_0 | 0.91 | tg128 | 129.72 | 134.97 | 1.04 |
| M1 Pro | llama 3B Q4_0 | 2.08 | pp512 | 639.57 | 640.64 | 1.00 |
| M1 Pro | llama 3B Q4_0 | 2.08 | tg128 | 65.16 | 66.31 | 1.02 |
| M1 Pro | llama 1B Q8_0 | 1.48 | pp512 | 1836.40 | 1838.10 | 1.00 |
| M1 Pro | llama 1B Q8_0 | 1.48 | tg128 | 98.69 | 101.36 | 1.03 |
| M1 Pro | llama 3B Q8_0 | 3.57 | pp512 | 648.21 | 648.73 | 1.00 |
| M1 Pro | llama 3B Q8_0 | 3.57 | tg128 | 43.36 | 44.12 | 1.02 |

| CPU | Model | Model Size [GiB] | Test | t/s master | t/s gg/perf-metal | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | llama 1B F16 | 2.79 | pp512 | 8553.00 | 8561.62 | 1.00 |
| M2 Ultra | llama 1B F16 | 2.79 | tg128 | 152.10 | 158.84 | 1.04 |
| M2 Ultra | llama 3B F16 | 6.72 | pp512 | 3183.39 | 3190.32 | 1.00 |
| M2 Ultra | llama 3B F16 | 6.72 | tg128 | 70.81 | 72.14 | 1.02 |
| M2 Ultra | llama 1B Q4_0 | 0.91 | pp512 | 7751.11 | 7742.41 | 1.00 |
| M2 Ultra | llama 1B Q4_0 | 0.91 | tg128 | 255.28 | 269.99 | 1.06 |
| M2 Ultra | llama 3B Q4_0 | 2.08 | pp512 | 2857.36 | 2857.06 | 1.00 |
| M2 Ultra | llama 3B Q4_0 | 2.08 | tg128 | 148.58 | 153.84 | 1.04 |
| M2 Ultra | llama 1B Q8_0 | 1.48 | pp512 | 7728.96 | 7738.59 | 1.00 |
| M2 Ultra | llama 1B Q8_0 | 1.48 | tg128 | 218.00 | 230.20 | 1.06 |
| M2 Ultra | llama 3B Q8_0 | 3.57 | pp512 | 2880.60 | 2881.06 | 1.00 |
| M2 Ultra | llama 3B Q8_0 | 3.57 | tg128 | 113.95 | 117.39 | 1.03 |

github-actions bot added the examples, ggml, and Apple Metal labels on Sep 30, 2024
ggerganov changed the title from "examples : add basic metal perf tool [no ci]" to "metal : reduce command encoding overhead" on Sep 30, 2024
A project member commented:

It might make more sense to move this example to ggml instead, since it does nothing specific to llama.cpp. Also applies to the benchmark-matmult example, although that one could probably be removed entirely, since test-backend-ops can now measure mat mult FLOPs.

ggerganov marked this pull request as ready for review on October 1, 2024 08:02
Comment on lines +3053 to +3055
```objc
// TODO: how to avoid this allocation? I tried initializing it in ggml_backend_metal_set_n_cb but it crashes.
ctx->encode_async = ^(size_t iter) {
    const int cb_idx = iter;
```
ggerganov (Member, Author) commented on Oct 1, 2024

This callback should not be created each time here. Instead, it should be created once in ggml_backend_metal_set_n_cb(). But for some reason, when I do it like that, we crash on the first compute. I'm missing some understanding of how Obj-C lifetime works - hopefully someone will figure this out in the future and fix it. For now, we keep creating the callback on each compute.
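
For context, here is a standalone sketch of the Obj-C block-lifetime rule that is a plausible culprit (the names `demo_ctx`, `demo_set_cb`, and `encode_cb_t` are hypothetical, and this is a guess rather than a confirmed diagnosis of the crash above): a block literal starts out on the stack of the function that creates it, so storing it for later compute calls requires copying it to the heap, implicitly via a strong assignment under ARC or explicitly with `Block_copy` otherwise. Invoking a stale stack block on a later call is undefined behaviour and typically crashes.

```objc
// Hypothetical sketch (not the actual ggml-metal code): a block stored for
// later use must be copied off the stack, otherwise calling it later crashes.
// Compile as plain C / MRC (no -fobjc-arc): clang -fblocks demo.m
#include <stdio.h>
#include <Block.h>   // Block_copy / Block_release

typedef void (^encode_cb_t)(size_t iter);

struct demo_ctx {
    encode_cb_t encode_async;   // stored now, invoked on a later compute call
};

static void demo_set_cb(struct demo_ctx * ctx, int n_cb) {
    encode_cb_t blk = ^(size_t iter) {
        // captures n_cb by value; the block literal itself lives on the stack
        printf("encoding command buffer %zu of %d\n", iter, n_cb);
    };

    // Without this copy the stack block dies when demo_set_cb returns, and a
    // later ctx->encode_async(...) call is undefined behaviour (a crash much
    // like the one described above). Under ARC a strong assignment performs
    // the copy implicitly; in plain C / MRC it has to be explicit:
    ctx->encode_async = Block_copy(blk);
}

int main(void) {
    struct demo_ctx ctx = {0};
    demo_set_cb(&ctx, 4);
    ctx.encode_async(0);              // safe: the block was copied to the heap
    Block_release(ctx.encode_async);
    return 0;
}
```

Whether this matches what actually happens in ggml-metal.m would need verification; it is simply the standard pitfall that fits the symptom (fine when the block is recreated on every compute, crashing when it is created once and reused).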

ggerganov merged commit cad341d into master on Oct 1, 2024
54 checks passed
ggerganov deleted the gg/perf-metal branch on October 1, 2024 13:00
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* metal : reduce command encoding overhead

ggml-ci

* metal : add comments
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* metal : reduce command encoding overhead

ggml-ci

* metal : add comments
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* metal : reduce command encoding overhead

ggml-ci

* metal : add comments
Development

Successfully merging this pull request may close these issues.

metal : increase GPU duty-cycle during inference