feat: add full MPS (Apple Silicon) support #288

gruckion · 2025-12-11T22:11:35Z

Summary

This PR enables true MPS GPU acceleration for macOS Apple Silicon users, providing a 44x speedup over CPU fallback.

Performance Benchmarks

Metric	MPS (Apple Silicon GPU)	CPU	Improvement
Generation Time	39.9s	1771.6s (~29.5 min)	44.4x faster
Realtime Factor	0.154x	0.017x	9x better
Tokens/sec	13.2	1.5	8.8x faster

Tested on Apple Silicon with float32 precision, use_torch_compile=False

Background

PR #167 added example/simple-mac.py as a Mac support solution, but it was actually a CPU fallback workaround that avoided MPS issues rather than fixing them. This PR addresses the root causes to enable true MPS GPU acceleration.

Changes

1. Guard CUDAGraph Call (Critical)

File: dia/model.py:701-702

# Before (crashes/warns on MPS)
torch.compiler.cudagraph_mark_step_begin()

# After
if self.device.type == "cuda":
    torch.compiler.cudagraph_mark_step_begin()

Why: cudagraph_mark_step_begin() is CUDA-specific and has no MPS equivalent. Called ~860 times per generation in the autoregressive loop.

2. Add Device Check to torch.compile (Critical)

File: dia/model.py:657-667

# Before (fails on MPS with "Device mps not supported")
if use_torch_compile and not hasattr(self, "_compiled"):
    self._prepare_generation = torch.compile(...)
    self._decoder_step = torch.compile(..., mode="max-autotune")
    self._compiled = True

# After
if use_torch_compile and not hasattr(self, "_compiled"):
    if self.device.type != "cuda":
        warnings.warn(
            f"torch.compile with max-autotune is only supported on CUDA devices. "
            f"Current device: {self.device.type}. Skipping compilation."
        )
    else:
        # ... compile as before

Why: mode="max-autotune" requires Triton which is CUDA/ROCm only. MPS torch.compile is still experimental (PyTorch #150121).

3. Fix CLI Device Auto-Detection

File: cli.py:12-18, 82-83

# Before (skips MPS, falls back to CPU)
default="cuda" if torch.cuda.is_available() else "cpu"

# After
def _get_default_device_str():
    if torch.cuda.is_available():
        return "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

default=_get_default_device_str()

Why: Mac users running python cli.py "text" --output out.wav now automatically get MPS acceleration instead of slow CPU fallback.

4. Fix simple-mac.py dtype

File: example/simple-mac.py:4-6

# Before (potential precision issues)
model = Dia.from_pretrained("nari-labs/Dia-1.6B-0626", compute_dtype="float16")

# After
# Use float32 for better stability on MPS (Apple Silicon)
# float16 may have precision issues on MPS and provides minimal performance benefit
model = Dia.from_pretrained("nari-labs/Dia-1.6B-0626", compute_dtype="float32")

Why: Matches app.py recommendation. MPS float16 has documented precision bugs and provides minimal speedup on Apple Silicon (no dedicated Tensor Cores).

5. Guard Triton Config in Benchmark

File: example/benchmark.py:9-13

# Before (AttributeError on macOS)
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.triton.unique_kernel_names = True
torch._inductor.config.fx_graph_cache = True

# After
if sys.platform in ("linux", "win32"):
    torch._inductor.config.coordinate_descent_tuning = True
    torch._inductor.config.triton.unique_kernel_names = True
    torch._inductor.config.fx_graph_cache = True

Why: Triton is only available on Linux/Windows per pyproject.toml. These settings fail on macOS.

6. Add MPS Seed Management

Files: cli.py:30-32, app.py:66-68

# Added to set_seed()
if hasattr(torch, "mps") and torch.backends.mps.is_available():
    torch.mps.manual_seed(seed)

Why: Explicit MPS RNG seeding for reproducibility (though MPS reproducibility has known limitations).

7. Add ARM64 Linux Dockerfile

File: docker/Dockerfile.arm (new)

Adds support for ARM64 Linux servers (AWS Graviton, Ampere Altra, etc.) with CPU-only inference.

Note: Docker cannot access MPS on macOS (runs in Linux VM). macOS users must install natively for GPU acceleration.

Test Plan

python example/simple-mac.py runs without errors on MPS
python cli.py "[S1] Hello world." --output test.wav auto-detects MPS
Audio generation completes successfully on MPS
No CUDA-specific warnings/errors on MPS device
Benchmark confirms 44x speedup over CPU

References

PyTorch MPS Backend
torch.compile MPS Status (#150121)
Apple Metal PyTorch Guide
README TODO: "Docker support for ARM architecture and MacOS"

🤖 Generated with Claude Code

This commit enables true MPS GPU acceleration for macOS users: - Guard cudagraph_mark_step_begin() with CUDA device check (dia/model.py) - Skip torch.compile on non-CUDA devices with warning (dia/model.py) - Add MPS to CLI device auto-detection (cli.py) - Fix simple-mac.py to use float32 for MPS stability (example/simple-mac.py) - Guard Triton config with platform check (example/benchmark.py) - Add MPS seed management to set_seed() functions (cli.py, app.py) - Add ARM64 Linux Dockerfile for Graviton/ARM servers (docker/Dockerfile.arm) Tested on Apple Silicon with ~0.16x realtime performance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

gruckion and others added 2 commits December 11, 2025 18:31

chore: research

f6b7b65

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add full MPS (Apple Silicon) support #288

feat: add full MPS (Apple Silicon) support #288

Uh oh!

gruckion commented Dec 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

feat: add full MPS (Apple Silicon) support #288

Are you sure you want to change the base?

feat: add full MPS (Apple Silicon) support #288

Uh oh!

Conversation

gruckion commented Dec 11, 2025

Summary

Performance Benchmarks

Background

Changes

1. Guard CUDAGraph Call (Critical)

2. Add Device Check to torch.compile (Critical)

3. Fix CLI Device Auto-Detection

4. Fix simple-mac.py dtype

5. Guard Triton Config in Benchmark

6. Add MPS Seed Management

7. Add ARM64 Linux Dockerfile

Test Plan

References

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant