1 change: 1 addition & 0 deletions src/config/navigation.yml
@@ -213,6 +213,7 @@ sidebar:
- docs/community/model-providers/clova-studio
- docs/community/model-providers/fireworksai
- docs/community/model-providers/nebius-token-factory
+ - docs/community/model-providers/neuron
- docs/community/model-providers/nvidia-nim
- docs/community/model-providers/sglang
- docs/community/model-providers/vllm
153 changes: 153 additions & 0 deletions src/content/docs/community/model-providers/neuron.mdx
@@ -0,0 +1,153 @@
---
title: AWS Neuron
community: true
description: vLLM on AWS Neuron hardware (Trainium/Inferentia)
integrationType: model-provider
languages: Python
project:
pypi: https://pypi.org/project/strands-neuron/
maintainer: msenkfor
---

:::note[Community Contribution]
This is a community-maintained package that is not owned or supported by the Strands team. Validate and review
the package before using it in your project.

Have your own integration? [We'd love to add it here too!](https://github.com/strands-agents/docs/issues/new?assignees=&labels=enhancement&projects=&template=content_addition.yml&title=%5BContent+Addition%5D%3A+)
:::

:::note[Language Support]
This provider is only supported in Python.
:::

[strands-neuron](https://pypi.org/project/strands-neuron/) is a Strands Agents SDK model provider for [vLLM on AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/vllm/index.html). It connects to vLLM servers running on [AWS AI Chips](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/trainium.html) (Trainium and Inferentia) via an OpenAI-compatible API, enabling high-performance LLM inference on AWS Neuron hardware.

**Features:**

- **OpenAI-Compatible API**: Works with any OpenAI-compatible vLLM server
- **Full Streaming Support**: Async generators for real-time token streaming
- **Tool/Function Calling**: Native support for function calling and tool use
- **Structured Output**: Generate structured data via tool calls
- **Neuron-Optimized**: Designed for AWS Neuron hardware acceleration
- **Flexible Configuration**: Extensive configuration options for model behavior

:::caution[Parallel Tool Calling Support]
Tool calling support depends on the underlying model:

- **Llama 3.1 models**: Only support single tool calls at once
- **Llama 4 models**: Support parallel tool calls
- **Other models with parallel support**: Granite 3.1, xLAM, Pythonic parser models

If you encounter `"This model only supports single tool-calls at once!"`, this is a model limitation, not a configuration issue. Workarounds: use a model that supports parallel tool calls (Llama 4, Granite 3.1, xLAM), design agents to use one tool at a time, or use `structured_output()`, which requires only a single tool call.
:::

## Installation

Install strands-neuron along with the Strands Agents SDK:

```bash
pip install strands-neuron strands-agents
```

## Requirements

- AWS EC2 instance with Neuron hardware (inf2, trn1, trn2, or trn3)
- AWS Neuron Deep Learning AMI (DLAMI) for Ubuntu 22.04
- Running vLLM Neuron server accessible via HTTP

## Usage

### Start the vLLM Neuron Server

Set up and start your vLLM Neuron server on your AWS Neuron instance. The server should expose an OpenAI-compatible endpoint (default: `http://localhost:8080/v1`).

For tool calling support, start vLLM with the appropriate flags:

```bash
vllm serve <MODEL_ID> \
    --host 0.0.0.0 \
    --port 8080 \
    --enable-auto-tool-choice \
    --tool-call-parser <PARSER>  # e.g., llama3_json, mistral, etc.
```

### Basic Agent

```python
from strands import Agent
from strands_neuron import NeuronModel

model = NeuronModel(
config={
"model_id": "mistralai/Mistral-7B-Instruct-v0.3",
"base_url": "http://localhost:8080/v1",
"api_key": "EMPTY", # Not required for local servers
}
)

agent = Agent(
system_prompt="You are a helpful assistant.",
model=model,
)

response = agent("What is machine learning?")
print(response)
```
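
### Tool Calling

Tools are passed to the agent as usual, and the provider forwards them through the OpenAI-compatible tool-calling API. A minimal sketch, assuming the vLLM server was started with `--enable-auto-tool-choice` and a matching `--tool-call-parser` (the `current_time` tool and the Llama model ID are illustrative, not part of strands-neuron):

```python
from datetime import datetime, timezone

from strands import Agent, tool
from strands_neuron import NeuronModel

@tool
def current_time() -> str:
    """Return the current UTC time in ISO 8601 format."""
    return datetime.now(timezone.utc).isoformat()

model = NeuronModel(
    config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
        "base_url": "http://localhost:8080/v1",
        "support_tool_choice_auto": True,  # requires the tool-calling server flags above
    }
)

agent = Agent(model=model, tools=[current_time])
response = agent("What time is it right now in UTC?")
print(response)
```

Keep the caveat above in mind: a Llama 3.1 model will only issue one tool call per turn.

### Streaming

Streaming surfaces through the SDK's standard async iterator. A short sketch using `stream_async`, where text chunks arrive under the `"data"` event key per the Strands Agents streaming convention:

```python
import asyncio

from strands import Agent
from strands_neuron import NeuronModel

model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
    }
)
agent = Agent(model=model)

async def main():
    # Print text chunks incrementally as the server streams tokens.
    async for event in agent.stream_async("Explain AWS Inferentia in one paragraph."):
        if "data" in event:
            print(event["data"], end="", flush=True)

asyncio.run(main())
```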

## Configuration

The `NeuronModel` accepts a `config` dictionary with the following parameters:

| Parameter | Description | Example | Required |
| --------- | ----------- | ------- | -------- |
| `model_id` | Model identifier | `"mistralai/Mistral-7B-Instruct-v0.3"` | Yes |
| `base_url` | Base URL for the OpenAI-compatible API | `"http://localhost:8080/v1"` | No (default: `"http://localhost:8080/v1"`) |
| `api_key` | API key for authentication | `"EMPTY"` | No (default: `"EMPTY"`) |
| `support_tool_choice_auto` | Set `True` if vLLM has `--enable-auto-tool-choice` and `--tool-call-parser` flags | `True` | No (default: `False`) |
| `temperature` | Sampling temperature (0.0 to 2.0) | `0.7` | No |
| `top_p` | Nucleus sampling parameter | `0.9` | No |
| `max_completion_tokens` | Maximum tokens to generate | `1000` | No |
| `stop` | Sequences that stop generation | `["\n\n"]` | No |
| `frequency_penalty` | Penalize tokens based on frequency (-2.0 to 2.0) | `0.0` | No |
| `presence_penalty` | Penalize tokens based on presence (-2.0 to 2.0) | `0.0` | No |
| `additional_args` | Additional arguments passed to the API request | `{}` | No |

### Example Configuration

```python
model = NeuronModel(
config={
"model_id": "mistralai/Mistral-7B-Instruct-v0.3",
"base_url": "http://localhost:8080/v1",
"api_key": "EMPTY",
"temperature": 0.7,
"top_p": 0.9,
"max_completion_tokens": 1000,
"support_tool_choice_auto": True,
}
)
```
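
The `additional_args` entry is forwarded on the API request, so OpenAI-compatible options without a dedicated config key can still be set. A small sketch, assuming the server honors the standard `seed` sampling parameter:

```python
model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        # Passed through on the chat-completions request; "seed" is a standard
        # OpenAI-compatible option, assumed to be supported by the server.
        "additional_args": {"seed": 42},
    }
)
```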

## Troubleshooting

### Connection errors to vLLM server

Ensure your vLLM Neuron server is running and accessible:

```bash
curl http://localhost:8080/health
```

### Model only supports single tool calls

If you see `"This model only supports single tool-calls at once!"`, this is a model-level constraint. Switch to a model that supports parallel tool calls (Llama 4, Granite 3.1, xLAM), or use `structured_output()` for single-tool workflows.
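
A minimal sketch of the `structured_output()` workaround using a Pydantic model (the `CityInfo` schema is illustrative; since structured output is implemented via a tool call, the server still needs the tool-calling flags enabled):

```python
from pydantic import BaseModel

from strands import Agent
from strands_neuron import NeuronModel

class CityInfo(BaseModel):
    """Illustrative output schema."""
    name: str
    country: str
    population: int

model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "support_tool_choice_auto": True,
    }
)
agent = Agent(model=model)

# Issues a single tool call, so single-tool-call models work fine here.
result = agent.structured_output(CityInfo, "Tell me about Tokyo.")
print(result.name, result.country, result.population)
```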

### Tool calling not working

Ensure the vLLM server was started with `--enable-auto-tool-choice` and `--tool-call-parser` flags, and set `"support_tool_choice_auto": True` in the model config.

## References

- [strands-neuron on PyPI](https://pypi.org/project/strands-neuron/)
- [AWS Neuron Documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/)
- [NxD Inference vLLM Integration](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/vllm/index.html)