# strands-neuron

vLLM on AWS Neuron infrastructure provider for the AWS Strands Agents SDK.

This package provides a model provider implementation that connects to vLLM servers running on AWS AI chips, enabling high-performance LLM inference through OpenAI-compatible APIs.

## Features
- 🚀 OpenAI-compatible API - Works with any OpenAI-compatible vLLM server
- 📡 Full streaming support - Async generators for real-time token streaming
- 🛠️ Tool/function calling - Native support for function calling and tool use
- 📊 Structured output - Generate structured data via tool calls
- ⚡ Neuron-optimized - Designed for AWS Neuron hardware acceleration
- 🔧 Flexible configuration - Extensive configuration options for model behavior
## Tool Calling Limitations

Tool calling support depends on the underlying model:

- Llama 3.1 models: support only a single tool call at a time (e.g., mistralai/Mistral-7B-Instruct-v0.3)
- Llama 4 models: support parallel tool calls
- Other models with parallel support: Granite 3.1, xLAM, and Pythonic-parser models
If you encounter "This model only supports single tool-calls at once!" errors, this is a model limitation, not a configuration issue; the vLLM server is correctly configured with the `--enable-auto-tool-choice` and `--tool-call-parser` flags in the Dockerfile.
Workarounds:

- Use a model that supports parallel tool calls (e.g., Llama 4, Granite 3.1, xLAM)
- Design agents to use only one tool at a time
- Use `structured_output()`, which requires only a single tool call (works well with Llama 3.1)
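With a single-call model, each assistant turn carries at most one entry in the OpenAI-style `tool_calls` array. The following is a minimal, package-independent sketch of handling that case; the `get_weather` tool and the message contents are illustrative, not part of strands-neuron:

```python
import json

# A typical (abbreviated) assistant message from an OpenAI-compatible server
# when the model requests one tool call; the shape follows the OpenAI chat
# completions format.
message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": json.dumps({"city": "Seattle"}),
            },
        }
    ],
}

def dispatch_single_tool_call(message, tools):
    """Run the one tool call that a single-call model is allowed to make."""
    calls = message.get("tool_calls") or []
    if len(calls) != 1:
        raise ValueError("expected exactly one tool call")
    fn = calls[0]["function"]
    return tools[fn["name"]](**json.loads(fn["arguments"]))

result = dispatch_single_tool_call(
    message, {"get_weather": lambda city: f"sunny in {city}"}
)
print(result)  # sunny in Seattle
```

An agent designed around this pattern never needs parallel calls, which is why the single-call restriction is workable on Llama 3.1-class models.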
## Installation

First, clone the repository and create a virtual environment:

```bash
git clone <repository-url>
cd strands-neuron
python3 -m venv .venv
source .venv/bin/activate
```

Install the Strands Agents SDK:

```bash
pip install strands-agents strands-agents-tools
```

Then install the package:

```bash
pip install strands-neuron
```

For development (includes testing and linting tools):

```bash
pip install -e ".[dev]"
```

## Prerequisites

- AWS EC2 instance with Neuron hardware (e.g., inf2, trn1, trn2, or trn3)
- AWS Neuron Deep Learning AMI (DLAMI) for Ubuntu 22.04
- Python 3.10 or higher
- Running vLLM Neuron server (see infrastructure setup)

See the infrastructure README for detailed setup instructions.
## Quick Start

First, set up and start your vLLM Neuron server following the infrastructure README. The server should be accessible at `http://localhost:8080/v1` (or your configured endpoint).
```python
from strands import Agent
from strands_neuron import NeuronModel

# Initialize the model
model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
        "api_key": "EMPTY",  # Not required for local servers
        # "support_tool_choice_auto": True,  # Uncomment if vLLM has the --enable-auto-tool-choice flag
    }
)

# Create an agent
agent = Agent(
    system_prompt="You are a helpful assistant.",
    model=model,
)

# Use the agent
response = agent("What is machine learning?")
print(response)
```

### Streaming

```python
import asyncio

from strands_neuron import NeuronModel

async def stream_example():
    model = NeuronModel(
        config={
            "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
            "base_url": "http://localhost:8080/v1",
            "api_key": "EMPTY",
        }
    )

    messages = [{"role": "user", "content": [{"text": "Explain Python"}]}]

    async for event in model.stream(messages, system_prompt="You are a coding assistant."):
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"].get("delta", {})
            if "text" in delta:
                print(delta["text"], end="", flush=True)

asyncio.run(stream_example())
```

## Configuration

The `NeuronModel` accepts a configuration dictionary with the following options:
- `model_id` (str): The model identifier (e.g., `"mistralai/Mistral-7B-Instruct-v0.3"`)
- `base_url` (str): Base URL for the OpenAI-compatible API (default: `"http://localhost:8080/v1"`)
- `api_key` (str): API key for authentication (default: `"EMPTY"`)
- `temperature` (float): Sampling temperature (0.0 to 2.0)
- `top_p` (float): Nucleus sampling parameter
- `max_completion_tokens` (int): Maximum number of tokens to generate
- `stop` (str | List[str]): Sequences that stop generation
- `stop_sequences` (List[str]): Alternative to `stop` for backwards compatibility
- `frequency_penalty` (float): Penalize tokens based on frequency (-2.0 to 2.0)
- `presence_penalty` (float): Penalize tokens based on presence (-2.0 to 2.0)
- `n` (int): Number of completions to generate
- `logprobs` (bool): Return log probabilities
- `top_logprobs` (int): Number of top log probabilities to return
- `support_tool_choice_auto` (bool): Set to `True` if your vLLM server is launched with the `--enable-auto-tool-choice` and `--tool-call-parser` flags (default: `False`)
- `additional_args` (Dict[str, Any]): Additional arguments passed through to the API request
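To illustrate how these options relate to the wire format, here is a rough, package-independent sketch of merging client config and `additional_args` into an OpenAI-style request body (the package's actual `format_request` may differ in detail):

```python
def build_request_body(config, messages):
    """Illustrative merge of NeuronModel-style config into a chat request."""
    body = {
        "model": config["model_id"],
        "messages": messages,
        "stream": True,
    }
    # Copy over whichever optional sampling parameters are set.
    for key in ("temperature", "top_p", "max_completion_tokens",
                "stop", "frequency_penalty", "presence_penalty", "n"):
        if key in config:
            body[key] = config[key]
    # additional_args is forwarded verbatim, so it can carry any parameter
    # the server accepts that is not covered above.
    body.update(config.get("additional_args", {}))
    return body

body = build_request_body(
    {
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "temperature": 0.7,
        "additional_args": {"seed": 42},
    },
    [{"role": "user", "content": "Hi"}],
)
print(body["model"], body["temperature"], body["seed"])
```

The key point is that `additional_args` is a passthrough: anything your vLLM build accepts in a chat-completions request can be supplied there without a dedicated config field.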
A fuller configuration example:

```python
model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
        "api_key": "EMPTY",
        "temperature": 0.7,
        "top_p": 0.9,
        "max_completion_tokens": 1000,
        "stop_sequences": ["\n\n"],
        # Note: tensor_parallel_size and enable_prefix_caching are vLLM
        # server launch options; set them when starting the server, not here.
    }
)
```

## Examples

This package includes several example implementations:
Demonstrates structured output extraction using Pydantic models:

```bash
python examples/person_example.py
```

Demonstrates using NeuronModel with tools to create a weather assistant:

```bash
python examples/weather_example.py
```

Shows various streaming patterns:

```bash
python examples/stream_example.py
```

Demonstrates Model Context Protocol (MCP) integration:

```bash
cd examples/mcp
python mcp-server.py   # In one terminal
python mcp-example.py  # In another terminal
```

See the MCP example README for detailed instructions.
## API Reference

### NeuronModel

The main model class that implements the Strands Model interface.

Methods:

- `stream(messages, tool_specs=None, system_prompt=None, **kwargs)`: Stream responses as an async generator
- `structured_output(output_model, prompt, system_prompt=None, **kwargs)`: Generate structured output
- `format_request(messages, tool_specs=None, system_prompt=None, stream=True)`: Format a request for the API
- `update_config(**config)`: Update the model configuration
- `get_config()`: Get the current configuration
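`stream()` yields Strands-style events like the ones handled in the streaming example above. As a small usage sketch (a helper written for this README, not part of the package API), the text deltas of a finished event sequence can be folded into one string:

```python
def collect_text(events):
    """Accumulate text deltas from a sequence of Strands stream events."""
    parts = []
    for event in events:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            parts.append(delta["text"])
    return "".join(parts)

# Event shapes mirror those consumed in the streaming example.
events = [
    {"contentBlockDelta": {"delta": {"text": "Hello, "}}},
    {"contentBlockDelta": {"delta": {"text": "world"}}},
    {"messageStop": {"stopReason": "end_turn"}},
]
print(collect_text(events))  # Hello, world
```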
## Development

```bash
# Clone the repository
git clone <repository-url>
cd strands-neuron

# Install in development mode
pip install strands-agents strands-agents-tools pytest
pip install -e ".[dev]"
```

### Running Tests

```bash
# Run all tests
pytest

# Run unit tests only
pytest tests/unit

# Run integration tests only
pytest tests/integration
```

### Code Quality

This project uses:
- Ruff for linting
- Black for code formatting
- mypy for type checking
```bash
# Format code
black src tests

# Lint
ruff check src tests

# Type check
mypy src
```

## Infrastructure

For information on setting up and deploying the vLLM Neuron server, see the infrastructure README.
## License

Apache-2.0 License - see the LICENSE file for details.

## Changelog

See CHANGELOG.md for a list of changes and version history.
