1 change: 1 addition & 0 deletions src/config/navigation.yml
@@ -213,6 +213,7 @@ sidebar:
- docs/community/model-providers/clova-studio
- docs/community/model-providers/fireworksai
- docs/community/model-providers/nebius-token-factory
+ - docs/community/model-providers/neuron
- docs/community/model-providers/nvidia-nim
- docs/community/model-providers/sglang
- docs/community/model-providers/vllm
153 changes: 153 additions & 0 deletions src/content/docs/community/model-providers/neuron.mdx
@@ -0,0 +1,153 @@
---
title: AWS Neuron
community: true
description: vLLM on AWS Neuron hardware (Trainium/Inferentia)
integrationType: model-provider
languages: Python
project:
pypi: https://pypi.org/project/strands-neuron/
maintainer: msenkfor
---

:::note[Community Contribution]
This is a community-maintained package that is not owned or supported by the Strands team. Validate and review
the package before using it in your project.

Have your own integration? [We'd love to add it here too!](https://github.com/strands-agents/docs/issues/new?assignees=&labels=enhancement&projects=&template=content_addition.yml&title=%5BContent+Addition%5D%3A+)
:::

:::note[Language Support]
This provider is only supported in Python.
:::

[strands-neuron](https://pypi.org/project/strands-neuron/) is a Strands Agents SDK model provider for [vLLM on AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/vllm/index.html). It connects to vLLM servers running on [AWS AI Chips](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/trainium.html) (Trainium and Inferentia) via an OpenAI-compatible API, enabling high-performance LLM inference on AWS Neuron hardware.

**Features:**

- **OpenAI-Compatible API**: Works with any OpenAI-compatible vLLM server
- **Full Streaming Support**: Async generators for real-time token streaming
- **Tool/Function Calling**: Native support for function calling and tool use
- **Structured Output**: Generate structured data via tool calls
- **Neuron-Optimized**: Designed for AWS Neuron hardware acceleration
- **Flexible Configuration**: Extensive configuration options for model behavior

:::caution[Parallel Tool Calling Support]
Tool calling support depends on the underlying model:

- **Llama 3.1 models**: Only support single tool calls at once
- **Llama 4 models**: Support parallel tool calls
- **Other models with parallel support**: Granite 3.1, xLAM, Pythonic parser models

If you encounter `"This model only supports single tool-calls at once!"`, this is a model limitation, not a configuration issue. Workarounds: use a model that supports parallel tool calls (Llama 4, Granite 3.1, xLAM), design agents to use one tool at a time, or use `structured_output()`, which requires only a single tool call.
:::

## Installation

Install strands-neuron along with the Strands Agents SDK:

```bash
pip install strands-neuron strands-agents
```

## Requirements

- AWS EC2 instance with Neuron hardware (inf2, trn1, trn2, or trn3)
- AWS Neuron Deep Learning AMI (DLAMI) for Ubuntu 22.04
- Running vLLM Neuron server accessible via HTTP

## Usage

### Start the vLLM Neuron Server

Set up and start your vLLM Neuron server on your AWS Neuron instance. The server should expose an OpenAI-compatible endpoint (default: `http://localhost:8080/v1`).

For tool calling support, start vLLM with the appropriate flags:

```bash
vllm serve <MODEL_ID> \
    --host 0.0.0.0 \
    --port 8080 \
    --enable-auto-tool-choice \
    --tool-call-parser <PARSER>  # e.g., llama3_json, mistral, etc.
```

### Basic Agent

```python
from strands import Agent
from strands_neuron import NeuronModel

model = NeuronModel(
config={
"model_id": "mistralai/Mistral-7B-Instruct-v0.3",
"base_url": "http://localhost:8080/v1",
"api_key": "EMPTY", # Not required for local servers
}
)

agent = Agent(
system_prompt="You are a helpful assistant.",
model=model,
)

response = agent("What is machine learning?")
print(response)
```
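
### Tool Calling

Tools are passed to the agent as usual, and the provider forwards them through the OpenAI-compatible tool-calling API. A minimal sketch, assuming the vLLM server was started with `--enable-auto-tool-choice` and a matching `--tool-call-parser` (the `current_time` tool and the Llama model ID are illustrative, not part of strands-neuron):

```python
from datetime import datetime, timezone

from strands import Agent, tool
from strands_neuron import NeuronModel

@tool
def current_time() -> str:
    """Return the current UTC time in ISO 8601 format."""
    return datetime.now(timezone.utc).isoformat()

model = NeuronModel(
    config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
        "base_url": "http://localhost:8080/v1",
        "support_tool_choice_auto": True,  # requires the tool-calling server flags above
    }
)

agent = Agent(model=model, tools=[current_time])
response = agent("What time is it right now in UTC?")
print(response)
```

Keep the caveat above in mind: a Llama 3.1 model will only issue one tool call per turn.

### Streaming

Streaming surfaces through the SDK's standard async iterator. A short sketch using `stream_async`, where text chunks arrive under the `"data"` event key per the Strands Agents streaming convention:

```python
import asyncio

from strands import Agent
from strands_neuron import NeuronModel

model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
    }
)
agent = Agent(model=model)

async def main():
    # Print text chunks incrementally as the server streams tokens.
    async for event in agent.stream_async("Explain AWS Inferentia in one paragraph."):
        if "data" in event:
            print(event["data"], end="", flush=True)

asyncio.run(main())
```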

## Configuration

The `NeuronModel` accepts a `config` dictionary with the following parameters:

| Parameter | Description | Example | Required |
| --------- | ----------- | ------- | -------- |
| `model_id` | Model identifier | `"mistralai/Mistral-7B-Instruct-v0.3"` | Yes |
| `base_url` | Base URL for the OpenAI-compatible API | `"http://localhost:8080/v1"` | No (default: `"http://localhost:8080/v1"`) |
| `api_key` | API key for authentication | `"EMPTY"` | No (default: `"EMPTY"`) |
| `support_tool_choice_auto` | Set `True` if vLLM has `--enable-auto-tool-choice` and `--tool-call-parser` flags | `True` | No (default: `False`) |
| `temperature` | Sampling temperature (0.0 to 2.0) | `0.7` | No |
| `top_p` | Nucleus sampling parameter | `0.9` | No |
| `max_completion_tokens` | Maximum tokens to generate | `1000` | No |
| `stop` | Sequences that stop generation | `["\n\n"]` | No |
| `frequency_penalty` | Penalize tokens based on frequency (-2.0 to 2.0) | `0.0` | No |
| `presence_penalty` | Penalize tokens based on presence (-2.0 to 2.0) | `0.0` | No |
| `additional_args` | Additional arguments passed to the API request | `{}` | No |

### Example Configuration

```python
model = NeuronModel(
config={
"model_id": "mistralai/Mistral-7B-Instruct-v0.3",
"base_url": "http://localhost:8080/v1",
"api_key": "EMPTY",
"temperature": 0.7,
"top_p": 0.9,
"max_completion_tokens": 1000,
"support_tool_choice_auto": True,
}
)
```
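
The `additional_args` entry is forwarded on the API request, so OpenAI-compatible options without a dedicated config key can still be set. A small sketch, assuming the server honors the standard `seed` sampling parameter:

```python
model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        # Passed through on the chat-completions request; "seed" is a standard
        # OpenAI-compatible option, assumed to be supported by the server.
        "additional_args": {"seed": 42},
    }
)
```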

## Troubleshooting

### Connection errors to vLLM server

Ensure your vLLM Neuron server is running and accessible:

```bash
curl http://localhost:8080/health
```

### Model only supports single tool calls

If you see `"This model only supports single tool-calls at once!"`, this is a model-level constraint. Switch to a model that supports parallel tool calls (Llama 4, Granite 3.1, xLAM), or use `structured_output()` for single-tool workflows.
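
A minimal sketch of the `structured_output()` workaround using a Pydantic model (the `CityInfo` schema is illustrative; since structured output is implemented via a tool call, the server still needs the tool-calling flags enabled):

```python
from pydantic import BaseModel

from strands import Agent
from strands_neuron import NeuronModel

class CityInfo(BaseModel):
    """Illustrative output schema."""
    name: str
    country: str
    population: int

model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "support_tool_choice_auto": True,
    }
)
agent = Agent(model=model)

# Issues a single tool call, so single-tool-call models work fine here.
result = agent.structured_output(CityInfo, "Tell me about Tokyo.")
print(result.name, result.country, result.population)
```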

### Tool calling not working

Ensure the vLLM server was started with `--enable-auto-tool-choice` and `--tool-call-parser` flags, and set `"support_tool_choice_auto": True` in the model config.

## References

- [strands-neuron on PyPI](https://pypi.org/project/strands-neuron/)
- [AWS Neuron Documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/)
- [NxD Inference vLLM Integration](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/vllm/index.html)