
[Bug] mlx_metal_kernel_opt: subprocess benchmarks ignore evolved kernels; correctness uses wrong dtype; docs/tests assume wrong head config #372

@lanmogu98

Description

Three independent issues make the example’s “best program” and benchmark results unreliable:

  • Bug 1 (benchmark validity): qwen3_benchmark_suite.py runs mlx_lm.generate via subprocess.run(...), so monkey-patching done in the parent process (e.g. mlx_lm.models.qwen3.Attention = CustomGQAAttention) never reaches the child process. The "custom" benchmarks therefore run the vanilla kernel (see the sketch after this list).
  • Bug 2 (dtype coverage): evaluator.py checks correctness with mx.random.normal(...) (which defaults to float32), but the default model is mlx-community/Qwen3-0.6B-bf16 (bfloat16). Metal kernels are templated/compiled differently for float than for bfloat16, so a kernel can pass correctness yet fail during real inference.
  • Bug 3 (architecture mismatch): the docs/prompt/MockArgs assume 40:8 heads and hidden_size=5120, but mlx-community/Qwen3-0.6B-bf16 reports 16:8 (a 2:1 ratio) at runtime. Evolution is therefore guided toward, and tested on, different shapes than the benchmarked model.
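
A minimal sketch of Bug 1's mechanism (CustomGQAAttention is a stand-in for the evolved attention class; the rest is illustrative): subprocess.run starts a fresh interpreter, which re-imports mlx_lm from disk, so a class swapped into the parent's module cache is never seen by the child:

import subprocess
import sys

import mlx_lm.models.qwen3 as qwen3

class CustomGQAAttention(qwen3.Attention):  # stand-in for the evolved kernel wrapper
    pass

# The patch only lives in this interpreter's module cache.
qwen3.Attention = CustomGQAAttention

# The child re-imports mlx_lm from disk and sees the vanilla class again.
child = subprocess.run(
    [sys.executable, "-c",
     "import mlx_lm.models.qwen3 as q; print(q.Attention.__name__)"],
    capture_output=True,
    text=True,
)
print(child.stdout.strip())  # prints "Attention", not "CustomGQAAttention"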

Evidence / Repro

  1. --mode compare completes and reports "optimized" benchmark results:
python openevolve/examples/mlx_metal_kernel_opt/run_benchmarks.py \
  --mode compare \
  --model mlx-community/Qwen3-0.6B-bf16
  2. When the same best_program.py hook is actually applied in-process before loading the model, real inference fails to compile under bfloat16:
cd openevolve/examples/mlx_metal_kernel_opt
python - <<'PY'
import importlib.util
import mlx_lm
from mlx_lm.generate import generate

# Load best_program.py as a module and grab its monkey-patching hook.
spec = importlib.util.spec_from_file_location("best_program", "best_program.py")
m = importlib.util.module_from_spec(spec)
spec.loader.exec_module(m)

apply_hook, remove_hook = m.create_metal_qwen3_optimization_hook()

# Apply the hook in *this* process, so mlx_lm.load() builds the model with the
# patched Attention.
orig = apply_hook()
try:
    model, tokenizer = mlx_lm.load("mlx-community/Qwen3-0.6B-bf16")
    generate(model, tokenizer, prompt="Hello", max_tokens=5)
finally:
    remove_hook(orig)
PY

Observed:

  • The compare benchmark does not crash and prints "optimized" results.
  • The in-process run prints the hook being applied, reports Architecture: 16:8 heads (2:1 ratio), and then fails with Metal compilation errors under bfloat16 (e.g. no matching function for call to 'dot' with vec<bfloat,...>).
  • If the compare benchmarks were actually exercising the evolved kernel in the subprocess, they should fail the same way; the fact that they do not strongly suggests the subprocess benchmarks run the vanilla kernel.

Impact

  • Reported speedups can be noise (the evolved kernel is not executed in the subprocess benchmarks).
  • "Best" programs may not run under the actual model dtype (bfloat16).
  • Evolution may be guided by, and optimize for, the wrong head ratio and hidden size.

Suggested fixes

  • Bug 1: avoid the subprocess, or have the subprocess run a wrapper entrypoint that applies the patch inside the child process (like mlx_lm_generate_with_hook.py; see the first sketch below this list).
  • Bug 2: run the correctness check in bfloat16 (or add a 1-token mlx_lm.generate smoke test under the target model/dtype as a correctness gate); see the second sketch below.
  • Bug 3: align the docs/prompt/MockArgs with the chosen default model, or switch the default model to one that actually matches the 40:8 assumptions; ensure the evaluator and benchmarks use the same model config.
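
For Bug 1, a hedged sketch of a child-process entrypoint (the file name mlx_lm_generate_with_hook.py and the hook factory create_metal_qwen3_optimization_hook come from this example; the CLI flags are illustrative). The benchmark suite would subprocess this script instead of mlx_lm.generate, so the patch is applied in the process that actually runs inference:

# mlx_lm_generate_with_hook.py -- sketch only; flag names are illustrative
import argparse
import importlib.util

import mlx_lm
from mlx_lm.generate import generate


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--prompt", default="Hello")
    parser.add_argument("--max-tokens", type=int, default=128)
    parser.add_argument("--program", default="best_program.py")
    args = parser.parse_args()

    # Load the evolved program and apply its hook in *this* process,
    # so mlx_lm.load() builds the model with the patched Attention.
    spec = importlib.util.spec_from_file_location("best_program", args.program)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    apply_hook, remove_hook = module.create_metal_qwen3_optimization_hook()

    original = apply_hook()
    try:
        model, tokenizer = mlx_lm.load(args.model)
        print(generate(model, tokenizer, prompt=args.prompt,
                       max_tokens=args.max_tokens))
    finally:
        remove_hook(original)


if __name__ == "__main__":
    main()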
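
For Bug 2, a minimal sketch of a dtype-matched correctness gate. The 16:8 head ratio follows the model's runtime report; the other dimensions are illustrative, and candidate_attention is a placeholder for the evolved kernel that here simply defers to MLX's fused attention so the snippet runs as-is:

import mlx.core as mx

# 16 query heads : 8 KV heads (from the model's runtime report);
# batch, sequence length, and head_dim are illustrative.
B, L, n_heads, n_kv_heads, head_dim = 1, 64, 16, 8, 128
scale = head_dim ** -0.5


def make_qkv(dtype):
    q = mx.random.normal((B, n_heads, L, head_dim)).astype(dtype)
    k = mx.random.normal((B, n_kv_heads, L, head_dim)).astype(dtype)
    v = mx.random.normal((B, n_kv_heads, L, head_dim)).astype(dtype)
    return q, k, v


def candidate_attention(q, k, v):
    # Placeholder for the evolved kernel under test; defers to MLX's fused
    # attention so this sketch runs unmodified.
    return mx.fast.scaled_dot_product_attention(q, k, v, scale=scale)


# Exercise correctness in the model's dtype (bfloat16), not the float32
# default of mx.random.normal.
q, k, v = make_qkv(mx.bfloat16)
out = candidate_attention(q, k, v)

# Build the reference in float32 and compare in float32, so bf16 rounding
# does not dominate the tolerance.
ref = mx.fast.scaled_dot_product_attention(
    q.astype(mx.float32), k.astype(mx.float32), v.astype(mx.float32), scale=scale
)
max_err = float(mx.max(mx.abs(out.astype(mx.float32) - ref)))
assert max_err < 1e-2, f"bfloat16 correctness gate failed: max_err={max_err}"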

Environment

  • macOS Apple Silicon
  • model: mlx-community/Qwen3-0.6B-bf16
