
[Bug] mlx_metal_kernel_opt: subprocess benchmarks ignore evolved kernels; correctness uses wrong dtype; docs/tests assume wrong head config #372

@lanmogu98

Description

Three independent issues make the example’s “best program” and benchmark results unreliable:

  • Bug 1 (benchmark validity): qwen3_benchmark_suite.py runs mlx_lm.generate via subprocess.run(...), so monkey-patching done in the parent process (e.g. mlx_lm.models.qwen3.Attention = CustomGQAAttention) never reaches the child process. The "custom" benchmarks therefore run the vanilla kernel (see the sketch after this list).
  • Bug 2 (dtype coverage): evaluator.py checks correctness with mx.random.normal(...) (which defaults to float32), but the default model is mlx-community/Qwen3-0.6B-bf16 (bfloat16). Metal kernels are templated/compiled differently for float than for bfloat16, so a kernel can pass correctness yet fail during real inference.
  • Bug 3 (architecture mismatch): the docs/prompt/MockArgs assume 40:8 heads and hidden_size=5120, but mlx-community/Qwen3-0.6B-bf16 reports 16:8 (a 2:1 ratio) at runtime. Evolution is therefore guided toward, and tested on, different shapes than the benchmarked model.
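
A minimal sketch of Bug 1's mechanism (CustomGQAAttention is a stand-in for the evolved attention class; the rest is illustrative): subprocess.run starts a fresh interpreter, which re-imports mlx_lm from disk, so a class swapped into the parent's module cache is never seen by the child:

import subprocess
import sys

import mlx_lm.models.qwen3 as qwen3

class CustomGQAAttention(qwen3.Attention):  # stand-in for the evolved kernel wrapper
    pass

# The patch only lives in this interpreter's module cache.
qwen3.Attention = CustomGQAAttention

# The child re-imports mlx_lm from disk and sees the vanilla class again.
child = subprocess.run(
    [sys.executable, "-c",
     "import mlx_lm.models.qwen3 as q; print(q.Attention.__name__)"],
    capture_output=True,
    text=True,
)
print(child.stdout.strip())  # prints "Attention", not "CustomGQAAttention"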

Evidence / Repro

  1. --mode compare completes and reports "optimized" benchmark results:
python openevolve/examples/mlx_metal_kernel_opt/run_benchmarks.py \
  --mode compare \
  --model mlx-community/Qwen3-0.6B-bf16
  2. When the same best_program.py hook is actually applied in-process before loading the model, real inference fails to compile under bfloat16:
cd openevolve/examples/mlx_metal_kernel_opt
python - <<'PY'
import importlib.util
import mlx_lm
from mlx_lm.generate import generate

# Load best_program.py as a module and grab its monkey-patching hook.
spec = importlib.util.spec_from_file_location("best_program", "best_program.py")
m = importlib.util.module_from_spec(spec)
spec.loader.exec_module(m)

apply_hook, remove_hook = m.create_metal_qwen3_optimization_hook()

# Apply the hook in *this* process, so mlx_lm.load() builds the model with the
# patched Attention.
orig = apply_hook()
try:
    model, tokenizer = mlx_lm.load("mlx-community/Qwen3-0.6B-bf16")
    generate(model, tokenizer, prompt="Hello", max_tokens=5)
finally:
    remove_hook(orig)
PY

Observed:

  • The compare benchmark does not crash and prints "optimized" results.
  • The in-process run prints the hook being applied, reports Architecture: 16:8 heads (2:1 ratio), and then fails with Metal compilation errors under bfloat16 (e.g. no matching function for call to 'dot' with vec<bfloat,...>).
  • If the compare benchmarks were actually exercising the evolved kernel in the subprocess, they should fail the same way; the fact that they do not strongly suggests the subprocess benchmarks run the vanilla kernel.

Impact

  • Reported speedups can be noise (the evolved kernel is not executed in the subprocess benchmarks).
  • "Best" programs may not run under the actual model dtype (bfloat16).
  • Evolution may be guided by, and optimize for, the wrong head ratio and hidden size.

Suggested fixes

  • Bug 1: avoid the subprocess, or have the subprocess run a wrapper entrypoint that applies the patch inside the child process (like mlx_lm_generate_with_hook.py; see the first sketch below this list).
  • Bug 2: run the correctness check in bfloat16 (or add a 1-token mlx_lm.generate smoke test under the target model/dtype as a correctness gate); see the second sketch below.
  • Bug 3: align the docs/prompt/MockArgs with the chosen default model, or switch the default model to one that actually matches the 40:8 assumptions; ensure the evaluator and benchmarks use the same model config.
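
For Bug 1, a hedged sketch of a child-process entrypoint (the file name mlx_lm_generate_with_hook.py and the hook factory create_metal_qwen3_optimization_hook come from this example; the CLI flags are illustrative). The benchmark suite would subprocess this script instead of mlx_lm.generate, so the patch is applied in the process that actually runs inference:

# mlx_lm_generate_with_hook.py -- sketch only; flag names are illustrative
import argparse
import importlib.util

import mlx_lm
from mlx_lm.generate import generate


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--prompt", default="Hello")
    parser.add_argument("--max-tokens", type=int, default=128)
    parser.add_argument("--program", default="best_program.py")
    args = parser.parse_args()

    # Load the evolved program and apply its hook in *this* process,
    # so mlx_lm.load() builds the model with the patched Attention.
    spec = importlib.util.spec_from_file_location("best_program", args.program)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    apply_hook, remove_hook = module.create_metal_qwen3_optimization_hook()

    original = apply_hook()
    try:
        model, tokenizer = mlx_lm.load(args.model)
        print(generate(model, tokenizer, prompt=args.prompt,
                       max_tokens=args.max_tokens))
    finally:
        remove_hook(original)


if __name__ == "__main__":
    main()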
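
For Bug 2, a minimal sketch of a dtype-matched correctness gate. The 16:8 head ratio follows the model's runtime report; the other dimensions are illustrative, and candidate_attention is a placeholder for the evolved kernel that here simply defers to MLX's fused attention so the snippet runs as-is:

import mlx.core as mx

# 16 query heads : 8 KV heads (from the model's runtime report);
# batch, sequence length, and head_dim are illustrative.
B, L, n_heads, n_kv_heads, head_dim = 1, 64, 16, 8, 128
scale = head_dim ** -0.5


def make_qkv(dtype):
    q = mx.random.normal((B, n_heads, L, head_dim)).astype(dtype)
    k = mx.random.normal((B, n_kv_heads, L, head_dim)).astype(dtype)
    v = mx.random.normal((B, n_kv_heads, L, head_dim)).astype(dtype)
    return q, k, v


def candidate_attention(q, k, v):
    # Placeholder for the evolved kernel under test; defers to MLX's fused
    # attention so this sketch runs unmodified.
    return mx.fast.scaled_dot_product_attention(q, k, v, scale=scale)


# Exercise correctness in the model's dtype (bfloat16), not the float32
# default of mx.random.normal.
q, k, v = make_qkv(mx.bfloat16)
out = candidate_attention(q, k, v)

# Build the reference in float32 and compare in float32, so bf16 rounding
# does not dominate the tolerance.
ref = mx.fast.scaled_dot_product_attention(
    q.astype(mx.float32), k.astype(mx.float32), v.astype(mx.float32), scale=scale
)
max_err = float(mx.max(mx.abs(out.astype(mx.float32) - ref)))
assert max_err < 1e-2, f"bfloat16 correctness gate failed: max_err={max_err}"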

Environment

  • macOS Apple Silicon
  • model: mlx-community/Qwen3-0.6B-bf16
