# Using Chutes AI as your LLM provider for BaseAgent
Chutes AI provides access to advanced language models through a simple API. BaseAgent supports Chutes as a first-class provider, with the Kimi K2.5-TEE model and its thinking capabilities available out of the box.
| Feature | Value |
|---|---|
| API Base URL | https://llm.chutes.ai/v1 |
| Default Model | moonshotai/Kimi-K2.5-TEE |
| Model Parameters | 1T total, 32B activated |
| Context Window | 256K tokens |
| Thinking Mode | Enabled by default |
To get an API token:

1. Visit chutes.ai
2. Create an account or sign in
3. Navigate to API settings
4. Generate an API token
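Before running the agent, it can help to confirm the token is actually visible to your process. A minimal sketch (the helper name is illustrative, not part of BaseAgent):

```python
import os

def chutes_token_present() -> bool:
    """Return True if CHUTES_API_TOKEN is set to a non-empty value."""
    return bool(os.environ.get("CHUTES_API_TOKEN", "").strip())
```

A falsy result usually means the `export` below was skipped or run in a different shell.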
Then export the token and (optionally) the provider settings:

```bash
# Required: API token
export CHUTES_API_TOKEN="your-token-from-chutes.ai"

# Optional: Explicitly set provider and model
export LLM_PROVIDER="chutes"
export LLM_MODEL="moonshotai/Kimi-K2.5-TEE"
```

Run the agent as usual:

```bash
python3 agent.py --instruction "Your task description"
```

Each request flows from the agent through the LiteLLM client to the Chutes API:

```mermaid
sequenceDiagram
    participant Agent as BaseAgent
    participant Client as LiteLLM Client
    participant Chutes as Chutes API
    Agent->>Client: Initialize with CHUTES_API_TOKEN
    Client->>Client: Configure litellm
    loop Each Request
        Agent->>Client: chat(messages, tools)
        Client->>Chutes: POST /v1/chat/completions
        Note over Client,Chutes: Authorization: Bearer $CHUTES_API_TOKEN
        Chutes-->>Client: Response with tokens
        Client-->>Agent: LLMResponse
    end
```
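The exchange above reduces to a single OpenAI-compatible POST. A minimal sketch of assembling that request with only the standard library (the endpoint and model come from the quick-reference table; actually sending the request requires a valid token, so the network call is left commented out):

```python
import json
import os

CHUTES_BASE_URL = "https://llm.chutes.ai/v1"

def build_chat_request(messages: list[dict],
                       model: str = "moonshotai/Kimi-K2.5-TEE") -> tuple[str, dict, bytes]:
    """Assemble the URL, headers, and JSON body for a Chutes chat completion."""
    url = f"{CHUTES_BASE_URL}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.environ.get('CHUTES_API_TOKEN', '')}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, headers, body

# To actually send it (requires a token):
# import urllib.request
# url, headers, body = build_chat_request([{"role": "user", "content": "Hello!"}])
# req = urllib.request.Request(url, data=body, headers=headers)
# print(urllib.request.urlopen(req).read())
```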
The `moonshotai/Kimi-K2.5-TEE` model offers:
- Total Parameters: 1 Trillion (1T)
- Activated Parameters: 32 Billion (32B)
- Architecture: Mixture of Experts (MoE)
- Context Length: 256,000 tokens
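As a quick sanity check on the MoE numbers above, only a small fraction of the total parameters is active for any given token:

```python
# MoE: 32B of the 1T parameters are activated per token
total_params = 1_000_000_000_000   # 1T
active_params = 32_000_000_000     # 32B

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per token")  # → 3.2%
```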
Kimi K2.5-TEE supports a "thinking mode" where the model shows its reasoning process:
```mermaid
sequenceDiagram
    participant User
    participant Model as Kimi K2.5-TEE
    participant Response
    User->>Model: Complex task instruction
    rect rgb(230, 240, 255)
        Note over Model: Thinking Mode Active
        Model->>Model: Analyze problem
        Model->>Model: Consider approaches
        Model->>Model: Evaluate options
    end
    Model->>Response: <think>Reasoning process...</think>
    Model->>Response: Final answer/action
```

The two modes use different sampling parameters:
| Mode | Temperature | Top-p | Description |
|---|---|---|---|
| Thinking | 1.0 | 0.95 | More exploratory reasoning |
| Instant | 0.6 | 0.95 | Faster, more deterministic |
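The table above can be captured in a small helper; a sketch (the values come directly from the table, but the function itself is illustrative, not part of BaseAgent):

```python
# Sampling parameters per mode, per the table above
SAMPLING_MODES = {
    "thinking": {"temperature": 1.0, "top_p": 0.95},  # more exploratory
    "instant": {"temperature": 0.6, "top_p": 0.95},   # faster, more deterministic
}

def sampling_params(mode: str) -> dict:
    """Return a copy of the sampling parameters for 'thinking' or 'instant'."""
    return dict(SAMPLING_MODES[mode])
```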
The default configuration targets Chutes:

```python
# src/config/defaults.py
import os

CONFIG = {
    "model": os.environ.get("LLM_MODEL", "moonshotai/Kimi-K2.5-TEE"),
    "provider": "chutes",
    "temperature": 1.0,  # For thinking mode
    "max_tokens": 16384,
}
```

The relevant environment variables:

| Variable | Required | Default | Description |
|---|---|---|---|
| `CHUTES_API_TOKEN` | Yes | - | API token from chutes.ai |
| `LLM_PROVIDER` | No | `openrouter` | Set to `chutes` |
| `LLM_MODEL` | No | `moonshotai/Kimi-K2.5-TEE` | Model identifier |
| `LLM_COST_LIMIT` | No | `10.0` | Max cost in USD |
When thinking mode is enabled, responses include `<think>` tags:
```
<think>
The user wants to create a file with specific content.
I should:
1. Check if the file already exists
2. Create the file with the requested content
3. Verify the file was created correctly
</think>

I'll create the file for you now.
```

BaseAgent can be configured to:
- Parse and strip the thinking tags (show only final answer)
- Keep the thinking content (useful for debugging)
- Log thinking to stderr while showing final answer
A regex is enough to separate the two parts:

```python
import re

def parse_thinking(response_text: str) -> tuple[str, str]:
    """Extract thinking and final response."""
    think_pattern = r'<think>(.*?)</think>'
    match = re.search(think_pattern, response_text, re.DOTALL)
    if match:
        thinking = match.group(1).strip()
        final = re.sub(think_pattern, '', response_text, flags=re.DOTALL).strip()
        return thinking, final
    return "", response_text
```

The Chutes API follows the OpenAI-compatible format:
```bash
curl -X POST https://llm.chutes.ai/v1/chat/completions \
  -H "Authorization: Bearer $CHUTES_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5-TEE",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 1024,
    "temperature": 1.0,
    "top_p": 0.95
  }'
```

If Chutes is unavailable, BaseAgent can fall back to OpenRouter:
```mermaid
flowchart TB
    Start[API Request] --> Check{Chutes Available?}
    Check -->|Yes| Chutes[Send to Chutes API]
    Chutes --> Success{Success?}
    Success -->|Yes| Done[Return Response]
    Success -->|No| Retry{Retry Count < 3?}
    Retry -->|Yes| Chutes
    Retry -->|No| Fallback[Use OpenRouter]
    Check -->|No| Fallback
    Fallback --> Done
```
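The flowchart above amounts to a retry-then-fallback loop. A minimal sketch (the `send_*` callables stand in for the real provider clients; this is an illustration, not BaseAgent's actual code):

```python
from typing import Callable

def request_with_fallback(
    send_chutes: Callable[[], str],
    send_openrouter: Callable[[], str],
    max_retries: int = 3,
) -> str:
    """Try Chutes up to max_retries times, then fall back to OpenRouter."""
    for _ in range(max_retries):
        try:
            return send_chutes()
        except Exception:
            continue  # retry Chutes
    return send_openrouter()  # fallback after exhausting retries
```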
To enable the fallback, configure both providers:

```bash
# Primary: Chutes
export CHUTES_API_TOKEN="..."
export LLM_PROVIDER="chutes"

# Fallback: OpenRouter
export OPENROUTER_API_KEY="..."
```

You can also switch providers explicitly:

```bash
# Switch to OpenRouter
export LLM_PROVIDER="openrouter"
export LLM_MODEL="openrouter/anthropic/claude-sonnet-4-20250514"

# Switch back to Chutes
export LLM_PROVIDER="chutes"
export LLM_MODEL="moonshotai/Kimi-K2.5-TEE"
```

Costs depend on the model:

| Metric | Cost |
|---|---|
| Input tokens | Varies by model |
| Output tokens | Varies by model |
| Cached input | Reduced rate |
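Since rates vary by model, a per-request cost estimate has to be parameterized. A sketch with hypothetical rates (the dollar figures below are placeholders, not Chutes pricing):

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_m: float,   # $ per 1M input tokens (hypothetical)
    output_rate_per_m: float,  # $ per 1M output tokens (hypothetical)
) -> float:
    """Estimate request cost in USD from token counts and per-million rates."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# With placeholder rates of $0.50/M input and $1.50/M output:
# estimate_cost(10_000, 2_000, 0.50, 1.50) → 0.008
```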
```bash
# Set cost limit
export LLM_COST_LIMIT="5.0"  # Max $5.00 per session
```

BaseAgent tracks costs and will abort if the limit is exceeded:

```python
# In src/llm/client.py
if self._total_cost >= self.cost_limit:
    raise CostLimitExceeded(
        f"Cost limit exceeded: ${self._total_cost:.4f}",
        used=self._total_cost,
        limit=self.cost_limit,
    )
```

Common errors and fixes:

`LLMError: authentication_error`
Solution: Verify your token is correct and exported:

```bash
echo $CHUTES_API_TOKEN  # Should show your token
export CHUTES_API_TOKEN="correct-token"
```

`LLMError: rate_limit`
Solution: BaseAgent automatically retries with exponential backoff. You can also:
- Wait a few minutes before retrying
- Reduce request frequency
- Check your API plan limits
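The automatic retry schedule can be pictured as exponential backoff; a minimal sketch (the delay values are illustrative, and BaseAgent's actual schedule may differ):

```python
def backoff_delays(base: float = 1.0, factor: float = 2.0, retries: int = 5) -> list[float]:
    """Delays for successive retries: base, base*factor, base*factor^2, ..."""
    return [base * factor ** i for i in range(retries)]

# backoff_delays() → [1.0, 2.0, 4.0, 8.0, 16.0]
```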
`LLMError: Model 'xyz' not found`

Solution: Use the correct model identifier:

```bash
export LLM_MODEL="moonshotai/Kimi-K2.5-TEE"
```

`LLMError: timeout`
Solution: BaseAgent retries automatically. If persistent:
- Check your internet connection
- Verify Chutes API status
- Consider using OpenRouter as fallback
BaseAgent uses LiteLLM for provider abstraction:
```python
# src/llm/client.py
import os

import litellm

# For Chutes, configure the base URL
litellm.api_base = "https://llm.chutes.ai/v1"

# Make a request
response = litellm.completion(
    model="moonshotai/Kimi-K2.5-TEE",
    messages=messages,
    api_key=os.environ.get("CHUTES_API_TOKEN"),
)
```

Best practices:

- Enable thinking mode for complex reasoning tasks
- Use appropriate temperature (1.0 for exploration, 0.6 for precision)
- Leverage the 256K context for large codebases
- Monitor costs with `LLM_COST_LIMIT`
- Set up fallback to OpenRouter
- Handle rate limits gracefully (automatic in BaseAgent)
- Log responses for debugging complex tasks
- Enable prompt caching (can cut repeated-input costs by up to 90%)
- Use context management to avoid token waste
- Set reasonable cost limits for testing
See also:

- Configuration Reference - All settings explained
- Best Practices - Optimization tips
- Usage Guide - Command-line options