A demonstration of integrating Mac-native MLX models with the Strands agent framework, enabling local LLM inference using Apple Silicon GPU acceleration.
This project implements a custom MacNativeModel adapter that allows the Strands agent framework to use locally-running MLX models. Instead of relying on cloud-based APIs, this enables private, offline AI inference directly on your Mac's GPU.
- Local GPU-accelerated inference using MLX (see the sketch after this list)
- Integration with Strands agent framework
- Support for quantized models (4-bit) for efficient memory usage
- Async/await support for non-blocking inference
- Compatible with Llama 3.2 and other MLX-compatible models
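
Before wiring the model into an agent, it can be useful to confirm that MLX inference runs on your machine at all. The snippet below is a minimal sketch that calls the `mlx-lm` API directly, outside the Strands framework; the prompt and `max_tokens` value are only illustrative.

```python
# Sanity check: load the 4-bit model and generate text with mlx-lm directly.
# This bypasses the Strands agent entirely and only verifies local inference.
from mlx_lm import load, generate

# Downloads the model from the Hugging Face Hub on first use, then loads it via MLX.
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Format an example prompt with the model's chat template.
messages = [{"role": "user", "content": "What is Apple Silicon?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Generate a short completion (max_tokens is an arbitrary placeholder).
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```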
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.13+
- uv (Python package manager)
- Clone the repository:
```bash
git clone <repository-url>
cd strands
```

- Install dependencies using uv:

```bash
uv sync
```

- Run the example:

```bash
uv run python main.py
```

The script will (see the sketch after this list):
- Load the Llama-3.2-3B-Instruct-4bit model onto your Mac's GPU
- Create a Strands agent with the local model
- Execute a test prompt
- Display the AI response
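
Roughly, the flow in `main.py` looks like the sketch below. The import path for `MacNativeModel` and the test prompt are assumptions made for illustration; check `main.py` in this repository for the authoritative version.

```python
# Rough sketch of the main.py flow; not a copy of the repository's code.
from strands import Agent

# Hypothetical import path -- in this repo the adapter may be defined in main.py itself.
from mac_native_model import MacNativeModel

# Load the 4-bit Llama 3.2 model onto the Mac's GPU via MLX.
native_model = MacNativeModel(model_id="mlx-community/Llama-3.2-3B-Instruct-4bit")

# Build a Strands agent backed by the local model.
agent = Agent(model=native_model)

# Run a test prompt and print the response.
result = agent("Explain in two sentences what MLX is.")
print(result)
```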
The `MacNativeModel` class implements the Strands `Model` interface with the following key methods (a simplified sketch follows the list):

- `__init__`: Loads the MLX model and tokenizer
- `stream`: Handles inference with proper event streaming for Strands

It also provides:

- Message formatting compatible with Strands' content block structure
- Async thread execution to prevent blocking the event loop
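
A heavily simplified sketch of the adapter is shown below. This is not the repository's implementation: the real class implements the full Strands `Model` interface (which may require additional methods and configuration handling), the streaming-event shapes are assumptions that should be checked against the Strands SDK, and the generation parameters are placeholders.

```python
# Simplified sketch of a MacNativeModel-style adapter; not the repository's code.
import asyncio

from mlx_lm import load, generate


class MacNativeModel:
    def __init__(self, model_id: str):
        # Load the MLX model and tokenizer onto the Apple Silicon GPU.
        self.model, self.tokenizer = load(model_id)

    async def stream(self, messages, tool_specs=None, system_prompt=None, **kwargs):
        # Flatten Strands-style content blocks ({"text": ...}) into plain chat messages.
        # Assumes text-only content; tool use and other block types are ignored here.
        chat = [
            {"role": m["role"], "content": "".join(b.get("text", "") for b in m["content"])}
            for m in messages
        ]
        if system_prompt:
            chat = [{"role": "system", "content": system_prompt}] + chat
        prompt = self.tokenizer.apply_chat_template(
            chat, add_generation_prompt=True, tokenize=False
        )

        # Run the blocking MLX generation in a worker thread so the event loop stays free.
        text = await asyncio.to_thread(
            generate, self.model, self.tokenizer, prompt=prompt, max_tokens=512
        )

        # Yield events in the shape Strands expects (assumed; verify against the SDK docs).
        yield {"messageStart": {"role": "assistant"}}
        yield {"contentBlockDelta": {"delta": {"text": text}}}
        yield {"contentBlockStop": {}}
        yield {"messageStop": {"stopReason": "end_turn"}}
```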
- `strands-agents`: Core agent framework
- `mlx-lm`: Apple MLX for local model inference
- `apple-foundation-models`: Apple's ML foundation tools
Default model: `mlx-community/Llama-3.2-3B-Instruct-4bit`

To use a different model, modify the `model_path` in `main.py`:
```python
model_path = "mlx-community/your-model-name"
native_model = MacNativeModel(model_id=model_path)
```