Plexus includes an automatic token estimation feature for providers that don't return usage data in their API responses. This is particularly useful for free-tier models on platforms like OpenRouter, where usage tracking is essential but not natively provided.
When a provider doesn't return token counts (e.g., some OpenRouter free models), Plexus can automatically:
- Reconstruct the full response content from the streaming output
- Estimate input and output token counts using a character-based heuristic algorithm
- Store the estimated counts in the usage database with a flag indicating they're estimates
- Clean up temporary data without persisting debug logs
Add estimateTokens: true to any provider that needs token estimation:
providers:
  openrouter-free:
    api_base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    estimateTokens: true # Enable token estimation
    models:
      meta-llama/llama-3.2-3b-instruct:free:
        pricing:
          source: simple
          input: 0
          output: 0
      google/gemma-2-9b-it:free:
        pricing:
          source: simple
          input: 0
          output: 0

  openrouter-paid:
    api_base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    # Don't enable estimation - paid models return actual usage
    models:
      - anthropic/claude-3.5-sonnet

Alternatively, enable estimation from the dashboard:
- Navigate to Providers in the dashboard
- Click on the provider you want to configure
- Scroll to Advanced Configuration
- Toggle "Estimate Tokens" to ON
- Click Save Provider
When a request is made to a provider with estimateTokens: true:
Client Request → Plexus → Provider (no usage data returned)
↓
Enable ephemeral debug capture
↓
Stream response to client
↓
Reconstruct full response
↓
Estimate tokens from content
↓
Store usage with estimated flag
↓
Discard debug data
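
The "reconstruct full response" step in the flow above can be pictured as a small collector that records the text of each chunk while the chunk itself is forwarded to the client unchanged. This is a minimal sketch, not Plexus's actual implementation: the `StreamChunk` shape (loosely modeled on OpenAI-style streaming deltas) and the `ResponseCollector` class are hypothetical names introduced for illustration.

```typescript
// Hypothetical chunk shape, loosely modeled on OpenAI-style streaming deltas.
interface StreamChunk {
  choices: { delta: { content?: string } }[];
}

// Collects streamed text so it can be estimated once the stream ends.
class ResponseCollector {
  private parts: string[] = [];

  // Record the text carried by one chunk; chunks without content are ignored.
  add(chunk: StreamChunk): void {
    for (const choice of chunk.choices) {
      if (choice.delta.content) this.parts.push(choice.delta.content);
    }
  }

  // Return the full response text and release the buffered pieces.
  finish(): string {
    const text = this.parts.join("");
    this.parts = [];
    return text;
  }
}

const collector = new ResponseCollector();
const chunks: StreamChunk[] = [
  { choices: [{ delta: { content: "Hello" } }] },
  { choices: [{ delta: {} }] },
  { choices: [{ delta: { content: ", world" } }] },
];
for (const chunk of chunks) collector.add(chunk);
const fullText = collector.finish();
console.log(fullText); // "Hello, world"
```

Clearing the buffer in `finish()` mirrors the "discard debug data" step: nothing is kept once the text has been handed to the estimator.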
The estimation algorithm analyzes text content and adjusts for various patterns:
// Baseline: ~3.8 characters per token
let baseTokens = text.length / 3.8;
// Adjustments:
// - More whitespace → fewer tokens
// - Code patterns → different token density
// - JSON/structured data → overhead for structure
// - URLs → consolidated tokens

Example Estimates:
| Content Type | Characters | Estimated Tokens | Actual Tokens | Accuracy |
|---|---|---|---|---|
| Plain English | 1,000 | 263 | 270 | 97% |
| Code (Python) | 1,000 | 280 | 295 | 95% |
| JSON Data | 1,000 | 240 | 255 | 94% |
| Mixed Content | 1,000 | 265 | 280 | 95% |
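
To make the heuristic concrete, here is a minimal sketch of a character-based estimator. Only the 3.8 characters-per-token baseline comes from the description above; the whitespace threshold and scaling factor are hypothetical values chosen for illustration, and the real logic in `estimate-tokens.ts` may differ.

```typescript
const CHARS_PER_TOKEN = 3.8; // baseline from the description above

// Assumed values for illustration only; not taken from Plexus.
const WHITESPACE_RATIO_THRESHOLD = 0.3;
const WHITESPACE_FACTOR = 0.9;

function estimateTokens(text: string): number {
  if (text.length === 0) return 0;

  let tokens = text.length / CHARS_PER_TOKEN;

  // "More whitespace → fewer tokens": scale down whitespace-heavy text.
  const whitespaceCount = text.length - text.replace(/\s/g, "").length;
  if (whitespaceCount / text.length > WHITESPACE_RATIO_THRESHOLD) {
    tokens *= WHITESPACE_FACTOR;
  }

  // Any non-empty text counts as at least one token.
  return Math.max(1, Math.round(tokens));
}

console.log(estimateTokens("a".repeat(1000))); // 263
```

With no adjustments triggered, 1,000 characters yield 263 tokens, matching the plain-English row in the table.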
Usage records include a tokens_estimated field to distinguish estimated from actual counts:
-- Table structure
CREATE TABLE request_usage (
request_id TEXT PRIMARY KEY,
provider TEXT,
tokens_input INTEGER,
tokens_output INTEGER,
tokens_reasoning INTEGER,
tokens_estimated INTEGER NOT NULL DEFAULT 0, -- 0 = actual, 1 = estimated
-- ... other fields
);

- ✅ Free-tier models that don't return usage data
- ✅ Cost tracking for budget and analytics
- ✅ Usage monitoring and trending analysis
- ✅ Capacity planning decisions
- ✅ Rate limiting approximations

- ❌ Provider returns actual usage data (adds unnecessary overhead)
- ❌ Precise billing required (use actual counts)
- ❌ Strict quota enforcement needed (use actual counts)
- ❌ Performance is critical (estimation requires buffering)

providers:
  openrouter-free:
    api_base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    estimateTokens: true
    models:
      meta-llama/llama-3.2-3b-instruct:free:
        pricing:
          source: simple
          input: 0 # Free model, but track usage
          output: 0

models:
  free-llm:
    targets:
      - provider: openrouter-free
        model: meta-llama/llama-3.2-3b-instruct:free

Result: All requests to free-llm will have estimated token counts in the usage logs, enabling cost tracking and usage analytics even though the provider doesn't return usage data.

providers:
  # Paid provider with actual usage data
  openai:
    api_base_url: https://api.openai.com/v1
    api_key: ${OPENAI_API_KEY}
    # No estimateTokens - uses actual data
    models:
      - gpt-4o
      - gpt-4o-mini

  # Free provider without usage data
  free-provider:
    api_base_url: https://api.example.com/v1
    api_key: ${FREE_API_KEY}
    estimateTokens: true # Enable estimation
    models:
      - free-model-a
      - free-model-b

models:
  smart-model:
    selector: cost
    targets:
      - provider: openai
        model: gpt-4o-mini
      - provider: free-provider
        model: free-model-a

Result: Requests to OpenAI models use actual token counts, while requests to the free provider use estimated counts. Both are tracked consistently in the usage database.

Use estimation to validate usage patterns before committing to paid tiers:
providers:
  test-provider:
    api_base_url: https://api.test.com/v1
    api_key: ${TEST_KEY}
    estimateTokens: true
    models:
      - test-model

models:
  validation-model:
    targets:
      - provider: test-provider
        model: test-model

Query usage to understand patterns:
-- Analyze estimated usage per day
SELECT
  date(date) AS day,
  COUNT(*) AS requests,
  AVG(tokens_input) AS avg_input,
  AVG(tokens_output) AS avg_output,
  SUM(tokens_input + tokens_output) AS total_tokens
FROM request_usage
WHERE provider = 'test-provider' AND tokens_estimated = 1
GROUP BY day
ORDER BY day DESC;

The Admin Dashboard automatically displays estimated vs. actual token counts:
- Usage Logs: Shows token counts with an indicator for estimated data
- Cost Tracking: Includes estimated costs based on pricing configuration
- Provider Stats: Aggregates both actual and estimated usage
Find all requests with estimated tokens:
SELECT request_id, provider, tokens_input, tokens_output, tokens_estimated
FROM request_usage
WHERE tokens_estimated = 1
ORDER BY date DESC
LIMIT 100;

Compare estimated vs. actual by provider:
SELECT
provider,
CASE WHEN tokens_estimated = 1 THEN 'Estimated' ELSE 'Actual' END as source,
COUNT(*) as count,
AVG(tokens_input) as avg_input,
AVG(tokens_output) as avg_output
FROM request_usage
GROUP BY provider, tokens_estimated
ORDER BY provider, tokens_estimated;

Calculate total estimated costs:
SELECT
provider,
SUM(cost_total) as total_cost,
COUNT(*) as requests
FROM request_usage
WHERE tokens_estimated = 1
GROUP BY provider;

Plexus logs token estimation events at info level:
[2024-01-15 10:23:45] [INFO] Estimated tokens for request abc-123: input=1234, output=5678, reasoning=0
[2024-01-15 10:23:46] [INFO] Estimated tokens for request def-456: input=890, output=2345, reasoning=0
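
If you want to pull these figures out of the log stream, for example to spot-check them against the database, a small parser is enough. The sketch below assumes the exact line format shown in the two sample lines above; `parseEstimationLine` and `EstimationLogEntry` are hypothetical names, not part of Plexus.

```typescript
interface EstimationLogEntry {
  requestId: string;
  input: number;
  output: number;
  reasoning: number;
}

// Matches the message portion of the sample log lines above.
const ESTIMATION_LINE =
  /Estimated tokens for request (\S+): input=(\d+), output=(\d+), reasoning=(\d+)/;

// Returns the parsed entry, or null when the line is not an estimation event.
function parseEstimationLine(line: string): EstimationLogEntry | null {
  const match = ESTIMATION_LINE.exec(line);
  if (!match) return null;
  return {
    requestId: String(match[1]),
    input: Number(match[2]),
    output: Number(match[3]),
    reasoning: Number(match[4]),
  };
}

const entry = parseEstimationLine(
  "[2024-01-15 10:23:45] [INFO] Estimated tokens for request abc-123: input=1234, output=5678, reasoning=0",
);
console.log(entry?.requestId, entry?.input, entry?.output); // abc-123 1234 5678
```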
Enable debug logging for detailed estimation information:
LOG_LEVEL=debug bun run start

Token estimation requires buffering the response stream for reconstruction:
- Small responses (< 10KB): Negligible impact
- Medium responses (10-100KB): ~1-2ms overhead
- Large responses (> 100KB): ~5-10ms overhead
Memory is released immediately after estimation.
Estimation adds minimal latency:
| Operation | Time |
|---|---|
| Response reconstruction | ~0.5ms |
| Token estimation | ~1ms |
| Database write | ~2ms |
| Total overhead | ~3.5ms |
For comparison, typical LLM response times are 500-5000ms, making the overhead less than 1% of total request time.
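
The "less than 1%" figure follows directly from the table. As a quick worked check, the sketch below sums the three per-request costs and expresses them as a share of the fastest and slowest typical response times quoted above; the variable names are illustrative only.

```typescript
// Per-request overhead figures from the table above, in milliseconds.
const overheadMs = { reconstruction: 0.5, estimation: 1, databaseWrite: 2 };
const totalOverheadMs =
  overheadMs.reconstruction + overheadMs.estimation + overheadMs.databaseWrite; // 3.5

// Overhead as a percentage of the total LLM response time.
function overheadPercent(responseMs: number): number {
  return (totalOverheadMs / responseMs) * 100;
}

for (const responseMs of [500, 5000]) {
  console.log(`${responseMs}ms response: ${overheadPercent(responseMs).toFixed(2)}% overhead`);
}
```

Even for a fast 500ms response the overhead is roughly 0.7%, and it shrinks to about 0.07% for a 5000ms response.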
Token estimation scales linearly with response size and doesn't block other requests. The system can handle thousands of concurrent estimations without performance degradation.
- Plain text: ±10% of actual token count
- Code: ±15% of actual token count
- Mixed content: ±15% of actual token count
- JSON/structured data: ±12% of actual token count
- Model-specific tokenization: Different models use different tokenizers. Estimates are based on average patterns and may vary per model.
- Language differences: Non-English text may have different token densities. The algorithm is optimized for English.
- Special tokens: System messages, tool definitions, and special tokens may be counted differently than content tokens.
- Reasoning tokens: Extended thinking tokens (e.g., o1/o3 models) are estimated separately but may have higher variance.

Estimates tend to be less accurate for:
- Very short responses (< 50 tokens)
- Heavy use of special characters or emojis
- Non-Latin scripts (Chinese, Arabic, etc.)
- Binary or encoded data in responses
Problem: Usage logs show 0 tokens for requests to providers with estimateTokens: true.
Solutions:
- Check logs for estimation errors
- Verify provider is configured correctly
- Ensure responses are being streamed (not passthrough)
- Check if provider actually returns usage data (estimation disabled if data present)
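
The last check reflects a precedence rule: when the provider does report usage, those numbers are stored and no estimate is made. A minimal sketch of that decision follows, assuming an OpenAI-style usage object (`prompt_tokens`, `completion_tokens`); the `resolveUsage` helper and both interfaces are hypothetical, with field names borrowed from the `request_usage` columns.

```typescript
// Hypothetical shape of provider-reported usage (OpenAI-style field names).
interface ProviderUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Mirrors the token columns of the request_usage table.
interface UsageRecord {
  tokens_input: number;
  tokens_output: number;
  tokens_estimated: 0 | 1; // 0 = actual, 1 = estimated
}

// Prefer provider-reported counts; fall back to the estimator only when absent.
function resolveUsage(
  reported: ProviderUsage | undefined,
  estimate: () => { input: number; output: number },
): UsageRecord {
  if (reported) {
    return {
      tokens_input: reported.prompt_tokens,
      tokens_output: reported.completion_tokens,
      tokens_estimated: 0,
    };
  }
  const { input, output } = estimate();
  return { tokens_input: input, tokens_output: output, tokens_estimated: 1 };
}

const fromProvider = resolveUsage({ prompt_tokens: 12, completion_tokens: 30 }, () => ({ input: 0, output: 0 }));
const fromEstimate = resolveUsage(undefined, () => ({ input: 10, output: 28 }));
console.log(fromProvider.tokens_estimated, fromEstimate.tokens_estimated); // 0 1
```

Passing the estimator as a callback means the estimation work is skipped entirely whenever real usage data is available.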
Problem: Estimated tokens differ significantly from expected values.
Solutions:
- Validate against known token counts from the same model
- Check content type (code vs. text has different densities)
- Review estimation logs for patterns
- Consider if the model uses a non-standard tokenizer
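
Validating against known token counts comes down to computing the relative error of each estimate and comparing it with the tolerance expected for that content type. The sketch below uses two rows from the example table and the ±10% and ±15% bounds quoted in this document; `relativeError` and `withinTolerance` are hypothetical helper names.

```typescript
// Relative error of an estimate against a known actual token count.
function relativeError(estimated: number, actual: number): number {
  if (actual <= 0) throw new Error("actual token count must be positive");
  return Math.abs(estimated - actual) / actual;
}

// True when the estimate falls inside the expected tolerance (0.10 = ±10%).
function withinTolerance(estimated: number, actual: number, tolerance: number): boolean {
  return relativeError(estimated, actual) <= tolerance;
}

// Sample values from the example estimates table.
const samples = [
  { label: "Plain English", estimated: 263, actual: 270, tolerance: 0.10 },
  { label: "Code (Python)", estimated: 280, actual: 295, tolerance: 0.15 },
];

for (const s of samples) {
  const errorPct = (relativeError(s.estimated, s.actual) * 100).toFixed(1);
  console.log(`${s.label}: ${errorPct}% error, ok=${withinTolerance(s.estimated, s.actual, s.tolerance)}`);
}
```

If a model consistently lands outside the tolerance, that is a hint it uses a tokenizer the heuristic fits poorly.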
Problem: Requests with estimation are slower than expected.
Solutions:
- Check response sizes (large responses take longer to process)
- Monitor system resources (CPU/memory)
- Verify database write performance
- Consider disabling estimation for high-traffic endpoints
- Update configuration to add estimateTokens: true
- Restart Plexus to apply changes
- Monitor logs for estimation activity
- Query database to verify estimated counts are being stored
Test estimation on a subset of providers first:
providers:
  # Test provider with estimation
  test-provider:
    estimateTokens: true
    # ... config ...

  # Production providers without estimation (initially)
  prod-provider:
    estimateTokens: false
    # ... config ...

After validating accuracy and performance, enable for production providers.

To disable estimation:
- Set estimateTokens: false in provider configuration
- Restart Plexus
- Historical estimated records remain in the database with tokens_estimated = 1
Potential improvements being considered:
- Model-specific tokenizers: Use actual tokenizer libraries for more accurate counts
- Caching: Cache common text patterns to improve estimation speed
- Machine learning: Train models on actual usage data to improve estimation accuracy
- Provider hints: Allow providers to specify expected token density
- Batch estimation: Estimate multiple requests in parallel
A: No. Responses are streamed to clients in real-time. Estimation happens in parallel and doesn't block or delay the response.
A: Yes. Estimation works for both streaming and non-streaming responses.
A: Plexus automatically detects actual usage data and disables estimation for that request. The system prioritizes actual data over estimates.
A: Not currently. Estimation is configured at the provider level. If you need mixed behavior, create separate provider configurations.
A: Tool definitions and responses are included in the estimation. The algorithm accounts for JSON structure overhead.
A: Estimation only counts text tokens. Image tokens (if supported by the model) are not estimated and will show as 0 unless the provider returns actual counts.
A: The algorithm is built-in and not configurable. If you need custom logic, you can modify packages/backend/src/utils/estimate-tokens.ts and rebuild.
For issues or questions:
- GitHub Issues: https://github.com/mcowger/plexus/issues
- Documentation: https://github.com/mcowger/plexus/tree/main/docs
- Configuration Reference: CONFIGURATION.md
