
feat: GigaAM v3 CTC inference on Apple Silicon via MLX#61

Closed
misteral wants to merge 20 commits into salute-developers:main from misteral:feat/mlx-apple-silicon

Conversation

@misteral

Summary

Native MLX inference for GigaAM v3 CTC on Apple Silicon — 139x realtime on M4.

What's included

| File | Description |
| --- | --- |
| mlx_convert/gigaam_mlx.py | Full MLX model: Conformer encoder (16 layers, 768d, RoPE), CTC head, mel spectrogram, streaming |
| mlx_convert/convert_gigaam_to_mlx.py | PyTorch → MLX conversion (safetensors + config.json) |
| mlx_convert/gigaam-cli | Single-file transcription CLI |
| mlx_convert/gigaam-stream | Real-time streaming (live mic + file) |
| mlx_convert/gigaam-transcribe | Shell wrapper |
| mlx_convert/README.md | Documentation: Python API, CLI, benchmarks, mlx-audio integration |

Architecture

Audio (16kHz) → Log-Mel (64 bins) → Conv1d Subsampling (4x)
  → 16× Conformer (RoPE MHSA + GLU Conv + SiLU FFN)
  → CTC Head → Greedy Decode

Key details:

  • RoPE applied before Q/K/V projections (non-standard, matching original)
  • Mel filterbank saved from PyTorch (exact match, no recomputation drift)
  • All Conv1d weights transposed: [out, in, K] → [out, K, in] for MLX
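
The Conv1d weight transpose above is a single axis permutation (PyTorch stores [out_channels, in_channels, kernel]; MLX expects [out_channels, kernel, in_channels]). A minimal sketch with illustrative shapes, not the exact GigaAM layer sizes:

```python
import numpy as np

# PyTorch Conv1d weight: [out_channels, in_channels, kernel]
w_torch = np.zeros((768, 64, 3))

# MLX Conv1d weight: [out_channels, kernel, in_channels]
w_mlx = np.transpose(w_torch, (0, 2, 1))
print(w_mlx.shape)  # (768, 3, 64)
```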

Performance (Apple M4)

| Metric | Value |
| --- | --- |
| Batch (11s audio) | 81ms (139x realtime) |
| Streaming (1s step) | 57ms/step |
| Model size (fp16) | 421 MB |

Python API

from gigaam_mlx import load_model, load_audio

model = load_model("./gigaam-v3-ctc-mlx")
text = model.transcribe(load_audio("audio.wav"))

# Streaming
for r in model.stream_generate(load_audio("audio.wav")):
    print(r.cumulative_text)

mlx-audio compatibility

StreamingResult follows the same contract as mlx-audio's Parakeet/Whisper streaming. The README includes an integration guide for adding GigaAM as an mlx-audio STT model.
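
A minimal sketch of what such a streaming-result contract could look like; only cumulative_text is confirmed by the API example above, and the other field names are assumptions about an mlx-audio-style interface:

```python
from dataclasses import dataclass

@dataclass
class StreamingResult:
    text: str               # text newly decoded in this step (assumed field)
    cumulative_text: str    # full transcript so far (used in the API example)
    finished: bool = False  # True once the stream is exhausted (assumed field)

r = StreamingResult(text="мир", cumulative_text="привет мир")
print(r.cumulative_text)
```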

Testing

Tested on Apple M4 with various Russian speech samples. Output matches PyTorch reference (character-level exact match on short utterances, minor CTC boundary differences on longer audio).

georgygospodinov and others added 20 commits May 30, 2024 18:31
Fix pyannote model loading (conflicts with new torch) and workflow disk memory
…al_cache

No need to repass HF_TOKEN, explicit check for a local copy
Add native MLX (Apple Silicon) inference for GigaAM v3 CTC model:

- Full Conformer encoder (16 layers, 768d) with RoPE attention
- Conv1d subsampling, GLU convolution module, SiLU FFN
- CTC greedy decoding with proper blank/repeat collapsing
- Log-mel spectrogram computed in MLX (exact match to PyTorch)
- PyTorch → MLX weight conversion script (safetensors + config.json)
- Streaming transcription (growing buffer, live mic + file)
- CLI tools: gigaam-cli, gigaam-stream, gigaam-transcribe
- Python API: load_model, transcribe, stream_generate, stream_live
- mlx-audio ecosystem compatible (StreamingResult contract)

Performance on Apple M4:
- 139x realtime (11s audio in 81ms)
- 57ms/step streaming latency
- fp16 weights: 421 MB

New files in mlx_convert/:
  gigaam_mlx.py              — MLX model + inference + streaming
  convert_gigaam_to_mlx.py   — PyTorch → MLX conversion
  gigaam-cli                 — single-file transcription CLI
  gigaam-stream              — real-time streaming CLI
  gigaam-transcribe          — shell wrapper
  README.md                  — documentation, API, benchmarks
- Implement the RNNT decoder (Embedding + LSTM) matching the PyTorch layout
- Implement the joint network (Linear layers + ReLU)
- Abstract model loading and decoding to automatically use RNNT or CTC based on config
- Update conversion script to map PyTorch's separate LSTM gate weights/biases to MLX's grouped format
- Update GigaAMConfig to parse RNNT settings
- Update README with RNNT benchmarks (48x realtime on M4 vs 139x for CTC)
…rking

Add complete MLX conversion pipeline and inference tools for GigaAM v3 RNNT model targeting Apple Silicon:

- Add pre-converted GigaAM v3 RNNT MLX model artifacts (config, weights in safetensors format) with comprehensive documentation of architecture and performance metrics (48× realtime on M4)
- Add weight inspection utility to analyze PyTorch checkpoint structure and enable accurate parameter mapping during conversion
- Add comprehensive test suite covering MLX model inference, fp32 variant testing, and PyTorch baseline comparison
- Add comparative benchmarking tool (compare_all.py) for side-by-side evaluation of Whisper CPP, GigaAM PyTorch (CPU), and GigaAM MLX implementations
- Add RNNT architecture patch script to support joint network and LSTM decoder components
- Add dependency lock file (uv.lock) for reproducible environment management

This enables efficient speech recognition on Apple Silicon with ~9% lower WER than the CTC variant, thanks to autoregressive joint language modeling.
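
The LSTM gate-weight regrouping described in the commits above can be sketched as follows. The separate-gate storage and the grouped target layout are assumptions for illustration; only PyTorch's input/forget/cell/output (i, f, g, o) gate ordering is standard:

```python
import numpy as np

# Hypothetical separately stored gate matrices, each [hidden, input].
# Each gate is filled with a distinct constant so the packing order is visible.
hidden, inp = 320, 768
gates = {name: np.full((hidden, inp), k, dtype=np.float32)
         for k, name in enumerate("ifgo")}

# Pack into one grouped [4*hidden, input] matrix in i, f, g, o order,
# matching PyTorch's nn.LSTM gate layout.
w_grouped = np.concatenate([gates[g] for g in "ifgo"], axis=0)
print(w_grouped.shape)  # (1280, 768)
```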
@misteral
Author

Superseded by #62 — clean rebased version with only MLX-related changes.

@misteral misteral closed this Mar 15, 2026
