Feature request: BitNet 1.58-bit ternary inference on ROCm (gfx1151) #2

@stampby

Description

@stampby

Request

Enable BitNet 1.58-bit (ternary weight) inference on ROCm builds. The architecture is listed as supported in the README, but the pre-built gfx1151 binary fails to run BitNet models: they load, then crash at inference.

What We Tested

Binary: b1004-tech-preview (gfx1151)
Model: mlx-community/bitnet-b1.58-2B-4T
Result: warmup failed — model loads but inference crashes

Also tested mlx-community/Falcon-E-3B-Instruct-1.58bit — same result.

Why This Matters

BitNet 1.58-bit uses ternary weights {-1, 0, +1}. On unified-memory hardware like Strix Halo (128GB), ternary models would:

  • Dramatically reduce memory bandwidth pressure — the main bottleneck on unified memory
  • Enable much larger effective model sizes — a 70B 1-bit model fits where a 70B FP16 cannot
  • Pair with MLX's speed advantage — MLX is already 29-85% faster than vLLM on this hardware, and 1.58-bit weights would compound that
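To make the memory argument concrete, here is a back-of-the-envelope sketch (illustrative only; it counts weight storage at the information-theoretic ~1.58 bits per ternary value and ignores activations, KV cache, and packing overhead):

```python
import math

BITS_PER_TERNARY = math.log2(3)  # ~1.585 bits of information per {-1, 0, +1} weight

def model_gib(params_billions, bits_per_weight):
    """Approximate weight-storage size in GiB (weights only)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 2**30

print(f"70B FP16:    {model_gib(70, 16):.1f} GiB")              # ~130 GiB: exceeds 128GB unified memory
print(f"70B ternary: {model_gib(70, BITS_PER_TERNARY):.1f} GiB")  # ~13 GiB: fits with room to spare
```

Real deployments pack ternary values into byte-aligned formats (e.g. 2 bits each, ~17.5 GiB for 70B), but the order-of-magnitude gap over FP16 holds either way.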

Microsoft open-sourced BitNet. PrismML has Bonsai 1-bit MLX models (Apple Silicon). The architecture support is in the codebase — the ternary matmul kernels just need the ROCm/HIP path.
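For anyone picking this up: the reason ternary matmul kernels are worth a dedicated HIP path is that with weights restricted to {-1, 0, +1}, the inner product needs no multiplies at all — it reduces to adding and subtracting activations. A minimal NumPy reference (not the codebase's actual kernel; function and variable names here are mine) that a HIP kernel could be validated against:

```python
import numpy as np

def ternary_matmul(x, w_ternary, scale=1.0):
    """x @ w for ternary w, expressed without multiplies:
    sum the activations where w == +1, subtract those where w == -1."""
    pos = x @ (w_ternary == 1).astype(x.dtype)    # additions only
    neg = x @ (w_ternary == -1).astype(x.dtype)   # subtractions only
    return (pos - neg) * scale

# Sanity check against a dense matmul
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w = rng.integers(-1, 2, size=(8, 16)).astype(np.float32)
assert np.allclose(ternary_matmul(x, w), x @ w, atol=1e-5)
```

A production kernel would additionally unpack 2-bit-packed weights on the fly, which is where the bandwidth savings actually materialize.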

Available Models

mlx-community/bitnet-b1.58-2B-4T           — Microsoft official
mlx-community/Falcon-E-3B-Instruct-1.58bit — Falcon extreme quant
prism-ml/Bonsai-8B-mlx-1bit                — PrismML (Apple Silicon)
prism-ml/Bonsai-4B-mlx-1bit
prism-ml/Bonsai-1.7B-mlx-1bit

Test Environment

Hardware: AMD Strix Halo, 128GB unified, gfx1151
Binary:   b1004-tech-preview
OS:       CachyOS (Arch), kernel 7.0.0-1-mainline

Context

We have comprehensive MLX benchmarks on this hardware — 6 models passing at 21-151 tok/s (4-bit). Full results: https://github.com/stampby/bleeding-edge

Happy to test 1-bit builds when available.
