Description
Three independent issues make the example’s “best program” and benchmark results unreliable:
- Bug 1 (benchmark validity): `qwen3_benchmark_suite.py` runs `mlx_lm.generate` via `subprocess.run(...)`, so parent-process monkey-patching (e.g. `mlx_lm.models.qwen3.Attention = CustomGQAAttention`) does not propagate into the child process. The "custom" benchmarks effectively run the vanilla kernel (see the stand-in sketch after this list).
- Bug 2 (dtype coverage): the `evaluator.py` correctness check uses `mx.random.normal(...)` (default `float32`), but the default model is `mlx-community/Qwen3-0.6B-bf16` (`bfloat16`). Metal kernels are templated/compiled differently for `float` vs `bfloat16`, so a kernel can pass correctness yet fail in real inference.
- Bug 3 (architecture mismatch): the docs/prompt/MockArgs assume 40:8 heads and `hidden_size=5120`, but `mlx-community/Qwen3-0.6B-bf16` reports 16:8 (2:1) at runtime. This misguides evolution and tests different shapes than the benchmarked model.
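For context on Bug 1, the snippet below is a minimal, self-contained sketch of the underlying mechanism, using `math.sqrt` as a stand-in for `mlx_lm.models.qwen3.Attention`: a monkey-patch applied in the parent process never reaches a child started via `subprocess.run`, because the child is a fresh interpreter that re-imports the original module from disk.

```python
import subprocess
import sys
import math

# Monkey-patch a module attribute in the parent process only.
math.sqrt = lambda x: -1.0
print("parent sees:", math.sqrt(4))  # -1.0 (patched)

# The child process re-imports math from scratch and sees the original
# function, exactly like the benchmark subprocess re-imports the vanilla
# Qwen3 attention instead of CustomGQAAttention.
child = subprocess.run(
    [sys.executable, "-c", "import math; print('child sees:', math.sqrt(4))"],
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout, end="")  # child sees: 2.0 (unpatched)
```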
Evidence / Repro
- `--mode compare` completes and reports "optimized" benchmark results:

```bash
python openevolve/examples/mlx_metal_kernel_opt/run_benchmarks.py \
  --mode compare \
  --model mlx-community/Qwen3-0.6B-bf16
```

- When the same `best_program.py` hook is actually applied in-process before loading the model, real inference fails to compile under bfloat16:
```bash
cd openevolve/examples/mlx_metal_kernel_opt
python - <<'PY'
import importlib.util
import mlx_lm
from mlx_lm.generate import generate

# Load best_program.py and apply its attention monkey-patch in this process.
spec = importlib.util.spec_from_file_location("best_program", "best_program.py")
m = importlib.util.module_from_spec(spec)
spec.loader.exec_module(m)
apply_hook, remove_hook = m.create_metal_qwen3_optimization_hook()
orig = apply_hook()
try:
    # Real inference with the evolved kernel active, under the model's bfloat16 dtype.
    model, tokenizer = mlx_lm.load("mlx-community/Qwen3-0.6B-bf16")
    generate(model, tokenizer, prompt="Hello", max_tokens=5)
finally:
    remove_hook(orig)
PY
```

Observed:
- The compare benchmark does not crash and prints "optimized" results.
- The in-process run prints the hook being applied, reports `Architecture: 16:8 heads (2:1 ratio)`, then fails with Metal compilation errors under bfloat16 (e.g. `no matching function for call to 'dot'` with `vec<bfloat, ...>`).
- If the compare benchmarks were actually exercising the evolved kernel in the subprocess, they should fail the same way; the fact that they do not strongly suggests the subprocess benchmarks are running the vanilla kernel.
Impact
- Reported speedups can be noise (the evolved kernel is not executed in the subprocess benchmarks).
- "Best" programs may not run under the actual model dtype (`bfloat16`).
- Evolution may optimize for, or be guided by, the wrong head ratio and hidden size.
Suggested fixes
- Bug 1: avoid the subprocess, or run it via a wrapper entrypoint that applies the patch inside the child process (like `mlx_lm_generate_with_hook.py`).
- Bug 2: run the correctness check in `bfloat16` (or add a 1-token `mlx_lm.generate` smoke test under the target model/dtype as a correctness gate); a sketch follows this list.
- Bug 3: align docs/prompt/MockArgs with the chosen default model, or change the default model to one that actually matches the 40:8 assumptions; ensure the evaluator and benchmarks use the same model config.
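To illustrate the Bug 2 fix, here is a minimal sketch (not the example's actual evaluator code) of a dtype-aware correctness check. The shapes follow the 16:8 head layout the model reports at runtime, `head_dim` is an assumed default, and `candidate_attention` / `reference_attention` are hypothetical placeholders for whatever callables the evaluator already compares; the point is only that inputs are cast to the target dtype so the Metal kernel is compiled for the same types it will see during real inference.

```python
import mlx.core as mx

def correctness_check(candidate_attention, reference_attention,
                      batch=1, seq_len=128, n_heads=16, n_kv_heads=8,
                      head_dim=128, dtype=mx.bfloat16, atol=1e-2):
    """Compare a candidate kernel against a reference in the model's dtype.

    candidate_attention / reference_attention are hypothetical placeholders
    for the callables the evaluator already uses.
    """
    q = mx.random.normal((batch, n_heads, seq_len, head_dim)).astype(dtype)
    k = mx.random.normal((batch, n_kv_heads, seq_len, head_dim)).astype(dtype)
    v = mx.random.normal((batch, n_kv_heads, seq_len, head_dim)).astype(dtype)

    expected = reference_attention(q, k, v)
    actual = candidate_attention(q, k, v)  # forces bfloat16 kernel compilation
    mx.eval(expected, actual)

    # Compare in float32 so bf16 rounding in the check itself doesn't mask errors.
    diff = mx.abs(expected.astype(mx.float32) - actual.astype(mx.float32))
    return float(mx.max(diff)) <= atol
```

A one-token `mlx_lm.generate` smoke test against the real model, as suggested above, would additionally catch failures that only appear with the actual weights and cache layout.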
Environment
- macOS (Apple Silicon)
- model: `mlx-community/Qwen3-0.6B-bf16`