strands-neuron

vLLM on AWS Neuron infrastructure provider for AWS Strands Agents SDK.

This package provides a model provider implementation that connects to vLLM servers running on AWS AI Chips, enabling high-performance LLM inference with OpenAI-compatible APIs.

Features

  • 🚀 OpenAI-compatible API - Works with any OpenAI-compatible vLLM server
  • 📡 Full streaming support - Async generators for real-time token streaming
  • 🛠️ Tool/function calling - Native support for function calling and tool use
  • 📊 Structured output - Generate structured data via tool calls
  • ⚡ Neuron-optimized - Designed for AWS Neuron hardware acceleration
  • 🔧 Flexible configuration - Extensive configuration options for model behavior

⚠️ Parallel Tool Calling Support

Tool calling support depends on the underlying model:

  • Llama 3.1 models: support only a single tool call per response (the same limitation applies to mistralai/Mistral-7B-Instruct-v0.3)
  • Llama 4 models: support parallel tool calls
  • Other models with parallel support: Granite 3.1, xLAM, and models using vLLM's pythonic tool-call parser

If you encounter "This model only supports single tool-calls at once!" errors, this is a model limitation, not a configuration issue. The vLLM server is correctly configured with --enable-auto-tool-choice and --tool-call-parser flags in the Dockerfile.

Workarounds:

  1. Use a model that supports parallel tool calls (e.g., Llama 4, Granite 3.1, xLAM)
  2. Design agents to only use one tool at a time
  3. Use structured_output(), which requires only a single tool call and therefore works with Llama 3.1
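As a sketch of the structured_output() workaround, assuming a vLLM server is running at the default endpoint. The PersonInfo schema and extract_person helper below are hypothetical illustrations, and the exact call/await semantics of structured_output depend on the Strands SDK version:

```python
from pydantic import BaseModel

# Hypothetical schema for illustration; any Pydantic model works here.
class PersonInfo(BaseModel):
    name: str
    age: int

def extract_person(text: str):
    # Imported lazily so the schema above is usable without a running server.
    from strands_neuron import NeuronModel

    model = NeuronModel(
        config={
            "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
            "base_url": "http://localhost:8080/v1",
            "api_key": "EMPTY",
        }
    )
    messages = [{"role": "user", "content": [{"text": text}]}]
    # structured_output() drives generation through a single tool call,
    # so it also works with models limited to one tool call at a time.
    return model.structured_output(PersonInfo, messages)
```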

Installation

First, clone the repository and create a virtual environment:

git clone <repository-url>
cd strands-neuron
python3 -m venv .venv
source .venv/bin/activate

Install the Strands Agents SDK:

pip install strands-agents strands-agents-tools

Then install the package:

pip install strands-neuron

For development (includes testing and linting tools):

pip install -e ".[dev]"

Prerequisites

Hardware Requirements

  • AWS EC2 instance with Neuron hardware (e.g., inf2, trn1, trn2, or trn3)
  • AWS Neuron Deep Learning AMI (DLAMI) for Ubuntu 22.04

See the infrastructure README for detailed setup instructions.

Software Requirements

  • Strands Agents SDK (strands-agents, strands-agents-tools)
  • A vLLM server with Neuron support (see the infrastructure README)

Quick Start

1. Start the vLLM Neuron Server

First, set up and start your vLLM Neuron server following the infrastructure README.

The server should be accessible at http://localhost:8080/v1 (or your configured endpoint).

2. Use NeuronModel in Your Code

from strands import Agent
from strands_neuron import NeuronModel

# Initialize the model
model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
        "api_key": "EMPTY",  # Not required for local servers
        # "support_tool_choice_auto": True,  # Uncomment if vLLM has --enable-auto-tool-choice flag
    }
)

# Create an agent
agent = Agent(
    system_prompt="You are a helpful assistant.",
    model=model,
)

# Use the agent
response = agent("What is machine learning?")
print(response)

3. Streaming Example

import asyncio
from strands_neuron import NeuronModel

async def stream_example():
    model = NeuronModel(
        config={
            "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
            "base_url": "http://localhost:8080/v1",
            "api_key": "EMPTY",
        }
    )
    
    messages = [{"role": "user", "content": [{"text": "Explain Python"}]}]
    
    async for event in model.stream(messages, system_prompt="You are a coding assistant."):
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"].get("delta", {})
            if "text" in delta:
                print(delta["text"], end="", flush=True)

asyncio.run(stream_example())

Configuration

The NeuronModel accepts a configuration dictionary with the following options:

Required

  • model_id (str): The model identifier (e.g., "mistralai/Mistral-7B-Instruct-v0.3")

Optional

API Configuration

  • base_url (str): Base URL for the OpenAI-compatible API (default: "http://localhost:8080/v1")
  • api_key (str): API key for authentication (default: "EMPTY")

Generation Parameters

  • temperature (float): Sampling temperature (0.0 to 2.0)
  • top_p (float): Nucleus sampling parameter
  • max_completion_tokens (int): Maximum tokens to generate
  • stop (str | List[str]): Sequences that stop generation
  • stop_sequences (List[str]): Alternative to stop for backwards compatibility
  • frequency_penalty (float): Penalize tokens based on frequency (-2.0 to 2.0)
  • presence_penalty (float): Penalize tokens based on presence (-2.0 to 2.0)
  • n (int): Number of completions to generate
  • logprobs (bool): Return log probabilities
  • top_logprobs (int): Number of top log probabilities to return

vLLM Server Capabilities

  • support_tool_choice_auto (bool): Set to True if your vLLM server has --enable-auto-tool-choice and --tool-call-parser flags enabled (default: False)

Advanced Options

  • additional_args (Dict[str, Any]): Additional arguments passed to the API request
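As a sketch, keys placed in additional_args are forwarded in the request body rather than validated by this package; the vLLM-specific fields below (min_tokens, repetition_penalty) are assumptions about the server's OpenAI-compatible extensions, and the make_model helper is illustrative:

```python
# The extra keys below are assumed vLLM request extensions; they are passed
# through to the OpenAI-compatible endpoint unchanged, not validated here.
config = {
    "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
    "base_url": "http://localhost:8080/v1",
    "api_key": "EMPTY",
    "additional_args": {
        "min_tokens": 16,           # assumed vLLM sampling extension
        "repetition_penalty": 1.1,  # assumed vLLM sampling extension
    },
}

def make_model(cfg: dict):
    # Lazy import so the config sketch is usable without a running server.
    from strands_neuron import NeuronModel
    return NeuronModel(config=cfg)
```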

Example Configuration

model = NeuronModel(
    config={
        "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
        "base_url": "http://localhost:8080/v1",
        "api_key": "EMPTY",
        "temperature": 0.7,
        "top_p": 0.9,
        "max_completion_tokens": 1000,
        "stop_sequences": ["\n\n"],
    }
)

Note: server-side options such as tensor_parallel_size and enable_prefix_caching are set when launching the vLLM server, not in the client configuration.

Examples

This package includes several example implementations:

Person Info Example (Structured Output)

Demonstrates structured output extraction using Pydantic models:

python examples/person_example.py

Weather Agent Example

Demonstrates using NeuronModel with tools to create a weather assistant:

python examples/weather_example.py

Streaming Examples

Shows various streaming patterns:

python examples/stream_example.py

MCP Integration

Demonstrates Model Context Protocol (MCP) integration:

cd examples/mcp
python mcp-server.py  # In one terminal
python mcp-example.py  # In another terminal

See the MCP example README for detailed instructions.

API Reference

NeuronModel

The main model class that implements the Strands Model interface.

Methods

  • stream(messages, tool_specs=None, system_prompt=None, **kwargs): Stream responses as async generator
  • structured_output(output_model, prompt, system_prompt=None, **kwargs): Generate structured output
  • format_request(messages, tool_specs=None, system_prompt=None, stream=True): Format request for API
  • update_config(**config): Update model configuration
  • get_config(): Get current configuration
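A minimal sketch of adjusting generation settings between calls, assuming update_config merges keyword arguments into the current configuration as the Strands Model interface suggests (tune_temperature is a hypothetical helper):

```python
def tune_temperature(model, temperature: float) -> dict:
    """Adjust the sampling temperature on an existing model and return the
    resulting configuration, rather than constructing a new NeuronModel."""
    model.update_config(temperature=temperature)
    return model.get_config()
```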

Development

Setup

# Clone the repository
git clone <repository-url>
cd strands-neuron

# Install in development mode
pip install strands-agents strands-agents-tools pytest
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run unit tests only
pytest tests/unit

# Run integration tests only
pytest tests/integration

Code Quality

This project uses:

  • Ruff for linting
  • Black for code formatting
  • mypy for type checking

# Format code
black src tests

# Lint
ruff check src tests

# Type check
mypy src

Infrastructure

For information on setting up and deploying the vLLM Neuron server, see the infrastructure README.

License

Apache-2.0 License - see LICENSE file for details.

Changelog

See CHANGELOG.md for a list of changes and version history.
