Token Estimation

Plexus includes an automatic token estimation feature for providers that don't return usage data in their API responses. This is particularly useful for free-tier models on platforms like OpenRouter, where usage tracking is essential but not natively provided.

Overview

When a provider doesn't return token counts (e.g., some OpenRouter free models), Plexus can automatically:

Reconstruct the full response content from the streaming output
Estimate input and output token counts using a character-based heuristic algorithm
Store the estimated counts in the usage database with a flag indicating they're estimates
Clean up temporary data without persisting debug logs

Configuration

Enable via YAML Configuration

Add estimateTokens: true to any provider that needs token estimation:

providers:
  openrouter-free:
    api_base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    estimateTokens: true  # Enable token estimation
    models:
      meta-llama/llama-3.2-3b-instruct:free:
        pricing:
          source: simple
          input: 0
          output: 0
      google/gemma-2-9b-it:free:
        pricing:
          source: simple
          input: 0
          output: 0

  openrouter-paid:
    api_base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    # Don't enable estimation - paid models return actual usage
    models:
      - anthropic/claude-3.5-sonnet

Enable via Admin Dashboard

Navigate to Providers in the dashboard
Click on the provider you want to configure
Scroll to Advanced Configuration
Toggle "Estimate Tokens" to ON
Click Save Provider

How It Works

1. Request Processing

When a request is made to a provider with estimateTokens: true:

Client Request → Plexus → Provider (no usage data returned)
                   ↓
            Enable ephemeral debug capture
                   ↓
            Stream response to client
                   ↓
            Reconstruct full response
                   ↓
            Estimate tokens from content
                   ↓
            Store usage with estimated flag
                   ↓
            Discard debug data

2. Token Estimation Algorithm

The estimation algorithm analyzes text content and adjusts for various patterns:

// Baseline: ~3.8 characters per token
let baseTokens = text.length / 3.8;

// Adjustments:
// - More whitespace → fewer tokens
// - Code patterns → different token density
// - JSON/structured data → overhead for structure
// - URLs → consolidated tokens

Example Estimates:

Content Type	Characters	Estimated Tokens	Actual Tokens	Accuracy
Plain English	1,000	263	270	97%
Code (Python)	1,000	280	295	95%
JSON Data	1,000	240	255	94%
Mixed Content	1,000	265	280	95%

3. Database Storage

Usage records include a tokens_estimated field to distinguish estimated from actual counts:

-- Table structure
CREATE TABLE request_usage (
  request_id TEXT PRIMARY KEY,
  provider TEXT,
  tokens_input INTEGER,
  tokens_output INTEGER,
  tokens_reasoning INTEGER,
  tokens_estimated INTEGER NOT NULL DEFAULT 0,  -- 0 = actual, 1 = estimated
  -- ... other fields
);

Use Cases

Use Token Estimation When:

✅ Free-tier models that don't return usage data ✅ Cost tracking for budget and analytics ✅ Usage monitoring and trending analysis ✅ Capacity planning decisions ✅ Rate limiting approximations

Don't Use Token Estimation When:

❌ Provider returns actual usage data (adds unnecessary overhead) ❌ Precise billing required (use actual counts) ❌ Strict quota enforcement needed (use actual counts) ❌ Performance is critical (estimation requires buffering)

Examples

Example 1: OpenRouter Free Models

providers:
  openrouter-free:
    api_base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    estimateTokens: true
    models:
      meta-llama/llama-3.2-3b-instruct:free:
        pricing:
          source: simple
          input: 0  # Free model, but track usage
          output: 0

models:
  free-llm:
    targets:
      - provider: openrouter-free
        model: meta-llama/llama-3.2-3b-instruct:free

Result: All requests to free-llm will have estimated token counts in the usage logs, enabling cost tracking and usage analytics even though the provider doesn't return usage data.

Example 2: Hybrid Configuration

providers:
  # Paid provider with actual usage data
  openai:
    api_base_url: https://api.openai.com/v1
    api_key: ${OPENAI_API_KEY}
    # No estimateTokens - uses actual data
    models:
      - gpt-4o
      - gpt-4o-mini

  # Free provider without usage data
  free-provider:
    api_base_url: https://api.example.com/v1
    api_key: ${FREE_API_KEY}
    estimateTokens: true  # Enable estimation
    models:
      - free-model-a
      - free-model-b

models:
  smart-model:
    selector: cost
    targets:
      - provider: openai
        model: gpt-4o-mini
      - provider: free-provider
        model: free-model-a

Result: Requests to OpenAI models use actual token counts, while requests to the free provider use estimated counts. Both are tracked consistently in the usage database.

Example 3: Testing and Validation

Use estimation to validate usage patterns before committing to paid tiers:

providers:
  test-provider:
    api_base_url: https://api.test.com/v1
    api_key: ${TEST_KEY}
    estimateTokens: true
    models:
      - test-model

models:
  validation-model:
    targets:
      - provider: test-provider
        model: test-model

Query usage to understand patterns:

-- Analyze estimated usage
SELECT 
  date,
  COUNT(*) as requests,
  AVG(tokens_input) as avg_input,
  AVG(tokens_output) as avg_output,
  SUM(tokens_input + tokens_output) as total_tokens
FROM request_usage
WHERE provider = 'test-provider' AND tokens_estimated = 1
GROUP BY date(date)
ORDER BY date DESC;

Monitoring and Analytics

Dashboard Integration

The Admin Dashboard automatically displays estimated vs. actual token counts:

Usage Logs: Shows token counts with an indicator for estimated data
Cost Tracking: Includes estimated costs based on pricing configuration
Provider Stats: Aggregates both actual and estimated usage

Database Queries

Find all requests with estimated tokens:

SELECT request_id, provider, tokens_input, tokens_output, tokens_estimated
FROM request_usage
WHERE tokens_estimated = 1
ORDER BY date DESC
LIMIT 100;

Compare estimated vs. actual by provider:

SELECT 
  provider,
  CASE WHEN tokens_estimated = 1 THEN 'Estimated' ELSE 'Actual' END as source,
  COUNT(*) as count,
  AVG(tokens_input) as avg_input,
  AVG(tokens_output) as avg_output
FROM request_usage
GROUP BY provider, tokens_estimated
ORDER BY provider, tokens_estimated;

Calculate total estimated costs:

SELECT 
  provider,
  SUM(cost_total) as total_cost,
  COUNT(*) as requests
FROM request_usage
WHERE tokens_estimated = 1
GROUP BY provider;

Logging

Plexus logs token estimation events at info level:

[2024-01-15 10:23:45] [INFO] Estimated tokens for request abc-123: input=1234, output=5678, reasoning=0
[2024-01-15 10:23:46] [INFO] Estimated tokens for request def-456: input=890, output=2345, reasoning=0

Enable debug logging for detailed estimation information:

LOG_LEVEL=debug bun run start

Performance Considerations

Memory Usage

Token estimation requires buffering the response stream for reconstruction:

Small responses (< 10KB): Negligible impact
Medium responses (10-100KB): ~1-2ms overhead
Large responses (> 100KB): ~5-10ms overhead

Memory is released immediately after estimation.

Throughput Impact

Estimation adds minimal latency:

Operation	Time
Response reconstruction	~0.5ms
Token estimation	~1ms
Database write	~2ms
Total overhead	~3.5ms

For comparison, typical LLM response times are 500-5000ms, making the overhead less than 1% of total request time.

Scaling

Token estimation scales linearly with response size and doesn't block other requests. The system can handle thousands of concurrent estimations without performance degradation.

Accuracy and Limitations

Expected Accuracy

Plain text: ±10% of actual token count
Code: ±15% of actual token count
Mixed content: ±15% of actual token count
JSON/structured data: ±12% of actual token count

Known Limitations

Model-specific tokenization: Different models use different tokenizers. Estimates are based on average patterns and may vary per model.
Language differences: Non-English text may have different token densities. The algorithm is optimized for English.
Special tokens: System messages, tool definitions, and special tokens may be counted differently than content tokens.
Reasoning tokens: Extended thinking tokens (e.g., o1/o3 models) are estimated separately but may have higher variance.

When Estimates May Be Less Accurate

Very short responses (< 50 tokens)
Heavy use of special characters or emojis
Non-Latin scripts (Chinese, Arabic, etc.)
Binary or encoded data in responses

Troubleshooting

Estimation Not Working

Problem: Usage logs show 0 tokens for requests to providers with estimateTokens: true.

Solutions:

Check logs for estimation errors
Verify provider is configured correctly
Ensure responses are being streamed (not passthrough)
Check if provider actually returns usage data (estimation disabled if data present)

Inaccurate Estimates

Problem: Estimated tokens differ significantly from expected values.

Solutions:

Validate against known token counts from the same model
Check content type (code vs. text has different densities)
Review estimation logs for patterns
Consider if the model uses a non-standard tokenizer

Performance Issues

Problem: Requests with estimation are slower than expected.

Solutions:

Check response sizes (large responses take longer to process)
Monitor system resources (CPU/memory)
Verify database write performance
Consider disabling estimation for high-traffic endpoints

Migration and Rollout

Enabling Estimation for Existing Providers

Update configuration to add estimateTokens: true
Restart Plexus to apply changes
Monitor logs for estimation activity
Query database to verify estimated counts are being stored

Gradual Rollout

Test estimation on a subset of providers first:

providers:
  # Test provider with estimation
  test-provider:
    estimateTokens: true
    # ... config ...

  # Production providers without estimation (initially)
  prod-provider:
    estimateTokens: false
    # ... config ...

After validating accuracy and performance, enable for production providers.

Rollback

To disable estimation:

Set estimateTokens: false in provider configuration
Restart Plexus
Historical estimated records remain in the database with tokens_estimated = 1

Future Enhancements

Potential improvements being considered:

Model-specific tokenizers: Use actual tokenizer libraries for more accurate counts
Caching: Cache common text patterns to improve estimation speed
Machine learning: Train models on actual usage data to improve estimation accuracy
Provider hints: Allow providers to specify expected token density
Batch estimation: Estimate multiple requests in parallel

FAQ

Q: Does estimation affect response streaming?

A: No. Responses are streamed to clients in real-time. Estimation happens in parallel and doesn't block or delay the response.

Q: Can I use estimation with non-streaming requests?

A: Yes. Estimation works for both streaming and non-streaming responses.

Q: What happens if a provider starts returning usage data?

A: Plexus automatically detects actual usage data and disables estimation for that request. The system prioritizes actual data over estimates.

Q: Can I disable estimation for specific models within a provider?

A: Not currently. Estimation is configured at the provider level. If you need mixed behavior, create separate provider configurations.

Q: How does estimation handle tool/function calling?

A: Tool definitions and responses are included in the estimation. The algorithm accounts for JSON structure overhead.

Q: Does estimation work with image inputs?

A: Estimation only counts text tokens. Image tokens (if supported by the model) are not estimated and will show as 0 unless the provider returns actual counts.

Q: Can I adjust the estimation algorithm?

A: The algorithm is built-in and not configurable. If you need custom logic, you can modify packages/backend/src/utils/estimate-tokens.ts and rebuild.

Support

For issues or questions:

GitHub Issues: https://github.com/mcowger/plexus/issues
Documentation: https://github.com/mcowger/plexus/tree/main/docs
Configuration Reference: CONFIGURATION.md

FilesExpand file tree

TOKEN_ESTIMATION.md

Latest commit

History

TOKEN_ESTIMATION.md

File metadata and controls

Token Estimation

Overview

Configuration

Enable via YAML Configuration

Enable via Admin Dashboard

How It Works

1. Request Processing

2. Token Estimation Algorithm

3. Database Storage

Use Cases

Use Token Estimation When:

Don't Use Token Estimation When:

Examples

Example 1: OpenRouter Free Models

Example 2: Hybrid Configuration

Example 3: Testing and Validation

Monitoring and Analytics

Dashboard Integration

Database Queries

Logging

Performance Considerations

Memory Usage

Throughput Impact

Scaling

Accuracy and Limitations

Expected Accuracy

Known Limitations

When Estimates May Be Less Accurate

Troubleshooting

Estimation Not Working

Inaccurate Estimates

Performance Issues

Migration and Rollout

Enabling Estimation for Existing Providers

Gradual Rollout

Rollback

Future Enhancements

FAQ

Q: Does estimation affect response streaming?

Q: Can I use estimation with non-streaming requests?

Q: What happens if a provider starts returning usage data?

Q: Can I disable estimation for specific models within a provider?

Q: How does estimation handle tool/function calling?

Q: Does estimation work with image inputs?

Q: Can I adjust the estimation algorithm?

Support

See Also