Benchmark HF optimum-executorch #11450
base: main
Conversation
@huydhn Okay, it turns out that I need to run install with
Another issue. @huydhn Why isn't the artifact uploaded, even though it's zipped and moved to the dest dir successfully?
"hf_xnnpack_custom_spda_kv_cache_8da4w",
"et_xnnpack_custom_spda_kv_cache_8da4w",
What's the difference between these two?
-X \
--xnnpack-extended-ops \
-qmode 8da4w -G 32 -E 8,0 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
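For context, the `--metadata` flag above passes a JSON blob of special-token IDs (BOS and EOS) to the exported model. A minimal sketch of parsing that blob into usable values; `parse_llm_metadata` is a hypothetical helper for illustration, not code from the llama runner:

```python
import json

def parse_llm_metadata(raw: str) -> dict:
    """Parse the --metadata JSON blob into bos/eos token IDs.

    Hypothetical helper: the real runner reads these keys from the
    exported program's metadata, not from a raw string like this.
    """
    meta = json.loads(raw)
    return {
        "bos_id": meta["get_bos_id"],
        # Multiple EOS ids are allowed (e.g. end-of-turn and end-of-text).
        "eos_ids": set(meta["get_eos_ids"]),
    }

meta = parse_llm_metadata('{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}')
print(meta["bos_id"])  # 128000
```

Generation would stop as soon as a sampled token is in `meta["eos_ids"]`.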
Are these for llama_3_2?
Benchmark LLMs from optimum-executorch. With all the work recently happening in optimum-executorch, we are able to boost the out-of-the-box performance. Putting these models on the benchmark infra lets us gather perf numbers and understand the remaining perf gaps against the in-house models generated via export_llama. We are able to do an apples-to-apples comparison for the CPU backend by introducing quantization, custom SPDA, and a custom KV cache to native Hugging Face models in optimum-executorch: hf_xnnpack_custom_spda_kv_cache_8da4w represents the recipe used by optimum-et; et_xnnpack_custom_spda_kv_cache_8da4w is the counterpart for etLLM. Here are the benchmark jobs in our infra:

Note there may be failures when running optimum-et models on-device due to the lack of support for HF tokenizers in the llama runner. I will stop packing tokenizer.json into the .zip shortly so that the benchmark apps will treat optimum-et LLMs as non-GenAI models.