
metal : reduce command encoding overhead #9698

Merged: 2 commits merged into master from gg/perf-metal on Oct 1, 2024
Conversation

ggerganov (Member) commented Sep 30, 2024

fix #9507

Submit the first 128 nodes from the main thread and, while they are being processed, enqueue and submit the rest of the command buffers.
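
For illustration, here is a minimal sketch of the overlap idea (all names in it, `compute_graph`, `encode_nodes`, `NODES_PER_CB`, are hypothetical and not the actual ggml-metal code): the first chunk of nodes is encoded and committed right away so the GPU can start executing, and the remaining command buffers are encoded and committed while that first chunk is already in flight. The actual change drives the later buffers through an `encode_async` block (see the review discussion below); a plain loop keeps the sketch short.

```objc
// Hedged sketch, not the real ggml-metal implementation: split the graph into
// chunks of NODES_PER_CB nodes, one Metal command buffer per chunk, and let
// the GPU start on the first chunk while the CPU keeps encoding the rest.
#import <Metal/Metal.h>

enum { NODES_PER_CB = 128 };

// hypothetical helper: encode graph nodes [start, end) into one command buffer
static void encode_nodes(id<MTLCommandBuffer> cb, int start, int end) {
    id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder];
    // ... bind pipelines/buffers and dispatch one kernel per node here ...
    (void) start; (void) end;
    [enc endEncoding];
}

static void compute_graph(id<MTLCommandQueue> queue, int n_nodes) {
    if (n_nodes <= 0) {
        return;
    }

    const int n_cb = (n_nodes + NODES_PER_CB - 1) / NODES_PER_CB;

    // enqueue every command buffer up front: [cb enqueue] reserves its slot in
    // the queue, so the buffers execute in graph order regardless of when each
    // one is actually committed
    NSMutableArray<id<MTLCommandBuffer>> * cbs = [NSMutableArray arrayWithCapacity:n_cb];
    for (int i = 0; i < n_cb; ++i) {
        id<MTLCommandBuffer> cb = [queue commandBuffer];
        [cb enqueue];
        [cbs addObject:cb];
    }

    // encode + commit the first chunk immediately so the GPU can start working
    encode_nodes(cbs[0], 0, n_nodes < NODES_PER_CB ? n_nodes : NODES_PER_CB);
    [cbs[0] commit];

    // while the GPU processes the first chunk, encode and commit the rest
    for (int i = 1; i < n_cb; ++i) {
        const int start = i*NODES_PER_CB;
        const int end   = start + NODES_PER_CB < n_nodes ? start + NODES_PER_CB : n_nodes;
        encode_nodes(cbs[i], start, end);
        [cbs[i] commit];
    }

    [[cbs lastObject] waitUntilCompleted];
}
```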

API Changes

  • remove ggml_backend_metal_set_n_cb

Benches

```sh
./scripts/compare-commits.sh master gg/perf-metal \
    -m ./models/tinyllama-1b/ggml-model-q4_0.gguf \
    -m ./models/tinyllama-1b/ggml-model-q8_0.gguf \
    -m ./models/tinyllama-1b/ggml-model-f16.gguf -r 10
```

| CPU | Model | Model Size [GiB] | Test | t/s master | t/s gg/perf-metal | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
|  | llama 1B F16 | 2.05 | pp512 | 7576.39 | 7588.18 | 1.00 |
|  | llama 1B F16 | 2.05 | tg128 | 148.00 | 153.89 | 1.04 |
|  | llama 1B Q4_0 | 0.59 | pp512 | 6797.19 | 6821.33 | 1.00 |
|  | llama 1B Q4_0 | 0.59 | tg128 | 233.30 | 245.85 | 1.05 |
|  | llama 1B Q8_0 | 1.09 | pp512 | 6861.19 | 6905.65 | 1.01 |
|  | llama 1B Q8_0 | 1.09 | tg128 | 199.71 | 211.15 | 1.06 |
```sh
./scripts/compare-commits.sh master gg/perf-metal \
    -m ./models/llama-3.2-1b-instruct/ggml-model-q4_0.gguf \
    -m ./models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf \
    -m ./models/llama-3.2-1b-instruct/ggml-model-f16.gguf \
    -m ./models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf \
    -m ./models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf \
    -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf -r 10 -fa 1
```

| CPU | Model | Model Size [GiB] | Test | t/s master | t/s gg/perf-metal | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | llama 1B F16 | 2.79 | pp512 | 2026.97 | 2030.72 | 1.00 |
| M1 Pro | llama 1B F16 | 2.79 | tg128 | 59.93 | 60.95 | 1.02 |
| M1 Pro | llama 3B F16 | 6.72 | pp512 | 720.95 | 721.32 | 1.00 |
| M1 Pro | llama 3B F16 | 6.72 | tg128 | 25.02 | 24.94 | 1.00 |
| M1 Pro | llama 1B Q4_0 | 0.91 | pp512 | 1805.67 | 1820.12 | 1.01 |
| M1 Pro | llama 1B Q4_0 | 0.91 | tg128 | 129.72 | 134.97 | 1.04 |
| M1 Pro | llama 3B Q4_0 | 2.08 | pp512 | 639.57 | 640.64 | 1.00 |
| M1 Pro | llama 3B Q4_0 | 2.08 | tg128 | 65.16 | 66.31 | 1.02 |
| M1 Pro | llama 1B Q8_0 | 1.48 | pp512 | 1836.40 | 1838.10 | 1.00 |
| M1 Pro | llama 1B Q8_0 | 1.48 | tg128 | 98.69 | 101.36 | 1.03 |
| M1 Pro | llama 3B Q8_0 | 3.57 | pp512 | 648.21 | 648.73 | 1.00 |
| M1 Pro | llama 3B Q8_0 | 3.57 | tg128 | 43.36 | 44.12 | 1.02 |

| CPU | Model | Model Size [GiB] | Test | t/s master | t/s gg/perf-metal | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | llama 1B F16 | 2.79 | pp512 | 8553.00 | 8561.62 | 1.00 |
| M2 Ultra | llama 1B F16 | 2.79 | tg128 | 152.10 | 158.84 | 1.04 |
| M2 Ultra | llama 3B F16 | 6.72 | pp512 | 3183.39 | 3190.32 | 1.00 |
| M2 Ultra | llama 3B F16 | 6.72 | tg128 | 70.81 | 72.14 | 1.02 |
| M2 Ultra | llama 1B Q4_0 | 0.91 | pp512 | 7751.11 | 7742.41 | 1.00 |
| M2 Ultra | llama 1B Q4_0 | 0.91 | tg128 | 255.28 | 269.99 | 1.06 |
| M2 Ultra | llama 3B Q4_0 | 2.08 | pp512 | 2857.36 | 2857.06 | 1.00 |
| M2 Ultra | llama 3B Q4_0 | 2.08 | tg128 | 148.58 | 153.84 | 1.04 |
| M2 Ultra | llama 1B Q8_0 | 1.48 | pp512 | 7728.96 | 7738.59 | 1.00 |
| M2 Ultra | llama 1B Q8_0 | 1.48 | tg128 | 218.00 | 230.20 | 1.06 |
| M2 Ultra | llama 3B Q8_0 | 3.57 | pp512 | 2880.60 | 2881.06 | 1.00 |
| M2 Ultra | llama 3B Q8_0 | 3.57 | tg128 | 113.95 | 117.39 | 1.03 |

github-actions bot added the examples, ggml, and Apple Metal labels on Sep 30, 2024
ggerganov changed the title from "examples : add basic metal perf tool [no ci]" to "metal : reduce command encoding overhead" on Sep 30, 2024
A project member commented:

It might make more sense to move this example to ggml instead, since it does nothing specific to llama.cpp. Also applies to the benchmark-matmult example, although that one could probably be removed entirely, since test-backend-ops can now measure mat mult FLOPs.

ggerganov marked this pull request as ready for review on October 1, 2024 08:02
Comment on lines +3053 to +3055
```objc
// TODO: how to avoid this allocation? I tried initializing it in ggml_backend_metal_set_n_cb but it crashes.
ctx->encode_async = ^(size_t iter) {
    const int cb_idx = iter;
```
ggerganov (Member, Author) commented on Oct 1, 2024

This callback should not be created each time here. Instead, it should be created once in ggml_backend_metal_set_n_cb(). But for some reason, when I do it like that, we crash on the first compute. I'm missing some understanding of how Obj-C lifetime works - hopefully someone will figure this out in the future and fix it. For now, we keep creating the callback on each compute.
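
For context, here is a standalone sketch of the Obj-C block-lifetime rule that is a plausible culprit (the names `demo_ctx`, `demo_set_cb`, and `encode_cb_t` are hypothetical, and this is a guess rather than a confirmed diagnosis of the crash above): a block literal starts out on the stack of the function that creates it, so storing it for later compute calls requires copying it to the heap, implicitly via a strong assignment under ARC or explicitly with `Block_copy` otherwise. Invoking a stale stack block on a later call is undefined behaviour and typically crashes.

```objc
// Hypothetical sketch (not the actual ggml-metal code): a block stored for
// later use must be copied off the stack, otherwise calling it later crashes.
// Compile as plain C / MRC (no -fobjc-arc): clang -fblocks demo.m
#include <stdio.h>
#include <Block.h>   // Block_copy / Block_release

typedef void (^encode_cb_t)(size_t iter);

struct demo_ctx {
    encode_cb_t encode_async;   // stored now, invoked on a later compute call
};

static void demo_set_cb(struct demo_ctx * ctx, int n_cb) {
    encode_cb_t blk = ^(size_t iter) {
        // captures n_cb by value; the block literal itself lives on the stack
        printf("encoding command buffer %zu of %d\n", iter, n_cb);
    };

    // Without this copy the stack block dies when demo_set_cb returns, and a
    // later ctx->encode_async(...) call is undefined behaviour (a crash much
    // like the one described above). Under ARC a strong assignment performs
    // the copy implicitly; in plain C / MRC it has to be explicit:
    ctx->encode_async = Block_copy(blk);
}

int main(void) {
    struct demo_ctx ctx = {0};
    demo_set_cb(&ctx, 4);
    ctx.encode_async(0);              // safe: the block was copied to the heap
    Block_release(ctx.encode_async);
    return 0;
}
```

Whether this matches what actually happens in ggml-metal.m would need verification; it is simply the standard pitfall that fits the symptom (fine when the block is recreated on every compute, crashing when it is created once and reused).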

ggerganov merged commit cad341d into master on Oct 1, 2024
54 checks passed
ggerganov deleted the gg/perf-metal branch on October 1, 2024 13:00
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* metal : reduce command encoding overhead

ggml-ci

* metal : add comments
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* metal : reduce command encoding overhead

ggml-ci

* metal : add comments
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* metal : reduce command encoding overhead

ggml-ci

* metal : add comments
Development

Successfully merging this pull request may close these issues.

metal : increase GPU duty-cycle during inference