Skip to content

Latest commit

 

History

History
468 lines (343 loc) · 13.4 KB

File metadata and controls

468 lines (343 loc) · 13.4 KB

Token Estimation

Plexus includes an automatic token estimation feature for providers that don't return usage data in their API responses. This is particularly useful for free-tier models on platforms like OpenRouter, where usage tracking is essential but not natively provided.

Overview

When a provider doesn't return token counts (e.g., some OpenRouter free models), Plexus can automatically:

  1. Reconstruct the full response content from the streaming output
  2. Estimate input and output token counts using a character-based heuristic algorithm
  3. Store the estimated counts in the usage database with a flag indicating they're estimates
  4. Clean up temporary data without persisting debug logs

Configuration

Enable via YAML Configuration

Add estimateTokens: true to any provider that needs token estimation:

providers:
  openrouter-free:
    api_base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    estimateTokens: true  # Enable token estimation
    models:
      meta-llama/llama-3.2-3b-instruct:free:
        pricing:
          source: simple
          input: 0
          output: 0
      google/gemma-2-9b-it:free:
        pricing:
          source: simple
          input: 0
          output: 0

  openrouter-paid:
    api_base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    # Don't enable estimation - paid models return actual usage
    models:
      - anthropic/claude-3.5-sonnet

Enable via Admin Dashboard

  1. Navigate to Providers in the dashboard
  2. Click on the provider you want to configure
  3. Scroll to Advanced Configuration
  4. Toggle "Estimate Tokens" to ON
  5. Click Save Provider

Token Estimation Toggle

How It Works

1. Request Processing

When a request is made to a provider with estimateTokens: true:

Client Request → Plexus → Provider (no usage data returned)
                   ↓
            Enable ephemeral debug capture
                   ↓
            Stream response to client
                   ↓
            Reconstruct full response
                   ↓
            Estimate tokens from content
                   ↓
            Store usage with estimated flag
                   ↓
            Discard debug data

2. Token Estimation Algorithm

The estimation algorithm analyzes text content and adjusts for various patterns:

// Baseline: ~3.8 characters per token
let baseTokens = text.length / 3.8;

// Adjustments:
// - More whitespace → fewer tokens
// - Code patterns → different token density
// - JSON/structured data → overhead for structure
// - URLs → consolidated tokens

Example Estimates:

Content Type Characters Estimated Tokens Actual Tokens Accuracy
Plain English 1,000 263 270 97%
Code (Python) 1,000 280 295 95%
JSON Data 1,000 240 255 94%
Mixed Content 1,000 265 280 95%

3. Database Storage

Usage records include a tokens_estimated field to distinguish estimated from actual counts:

-- Table structure
CREATE TABLE request_usage (
  request_id TEXT PRIMARY KEY,
  provider TEXT,
  tokens_input INTEGER,
  tokens_output INTEGER,
  tokens_reasoning INTEGER,
  tokens_estimated INTEGER NOT NULL DEFAULT 0,  -- 0 = actual, 1 = estimated
  -- ... other fields
);

Use Cases

Use Token Estimation When:

Free-tier models that don't return usage data ✅ Cost tracking for budget and analytics ✅ Usage monitoring and trending analysis ✅ Capacity planning decisions ✅ Rate limiting approximations

Don't Use Token Estimation When:

Provider returns actual usage data (adds unnecessary overhead) ❌ Precise billing required (use actual counts) ❌ Strict quota enforcement needed (use actual counts) ❌ Performance is critical (estimation requires buffering)

Examples

Example 1: OpenRouter Free Models

providers:
  openrouter-free:
    api_base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    estimateTokens: true
    models:
      meta-llama/llama-3.2-3b-instruct:free:
        pricing:
          source: simple
          input: 0  # Free model, but track usage
          output: 0

models:
  free-llm:
    targets:
      - provider: openrouter-free
        model: meta-llama/llama-3.2-3b-instruct:free

Result: All requests to free-llm will have estimated token counts in the usage logs, enabling cost tracking and usage analytics even though the provider doesn't return usage data.

Example 2: Hybrid Configuration

providers:
  # Paid provider with actual usage data
  openai:
    api_base_url: https://api.openai.com/v1
    api_key: ${OPENAI_API_KEY}
    # No estimateTokens - uses actual data
    models:
      - gpt-4o
      - gpt-4o-mini

  # Free provider without usage data
  free-provider:
    api_base_url: https://api.example.com/v1
    api_key: ${FREE_API_KEY}
    estimateTokens: true  # Enable estimation
    models:
      - free-model-a
      - free-model-b

models:
  smart-model:
    selector: cost
    targets:
      - provider: openai
        model: gpt-4o-mini
      - provider: free-provider
        model: free-model-a

Result: Requests to OpenAI models use actual token counts, while requests to the free provider use estimated counts. Both are tracked consistently in the usage database.

Example 3: Testing and Validation

Use estimation to validate usage patterns before committing to paid tiers:

providers:
  test-provider:
    api_base_url: https://api.test.com/v1
    api_key: ${TEST_KEY}
    estimateTokens: true
    models:
      - test-model

models:
  validation-model:
    targets:
      - provider: test-provider
        model: test-model

Query usage to understand patterns:

-- Analyze estimated usage
SELECT 
  date,
  COUNT(*) as requests,
  AVG(tokens_input) as avg_input,
  AVG(tokens_output) as avg_output,
  SUM(tokens_input + tokens_output) as total_tokens
FROM request_usage
WHERE provider = 'test-provider' AND tokens_estimated = 1
GROUP BY date(date)
ORDER BY date DESC;

Monitoring and Analytics

Dashboard Integration

The Admin Dashboard automatically displays estimated vs. actual token counts:

  • Usage Logs: Shows token counts with an indicator for estimated data
  • Cost Tracking: Includes estimated costs based on pricing configuration
  • Provider Stats: Aggregates both actual and estimated usage

Database Queries

Find all requests with estimated tokens:

SELECT request_id, provider, tokens_input, tokens_output, tokens_estimated
FROM request_usage
WHERE tokens_estimated = 1
ORDER BY date DESC
LIMIT 100;

Compare estimated vs. actual by provider:

SELECT 
  provider,
  CASE WHEN tokens_estimated = 1 THEN 'Estimated' ELSE 'Actual' END as source,
  COUNT(*) as count,
  AVG(tokens_input) as avg_input,
  AVG(tokens_output) as avg_output
FROM request_usage
GROUP BY provider, tokens_estimated
ORDER BY provider, tokens_estimated;

Calculate total estimated costs:

SELECT 
  provider,
  SUM(cost_total) as total_cost,
  COUNT(*) as requests
FROM request_usage
WHERE tokens_estimated = 1
GROUP BY provider;

Logging

Plexus logs token estimation events at info level:

[2024-01-15 10:23:45] [INFO] Estimated tokens for request abc-123: input=1234, output=5678, reasoning=0
[2024-01-15 10:23:46] [INFO] Estimated tokens for request def-456: input=890, output=2345, reasoning=0

Enable debug logging for detailed estimation information:

LOG_LEVEL=debug bun run start

Performance Considerations

Memory Usage

Token estimation requires buffering the response stream for reconstruction:

  • Small responses (< 10KB): Negligible impact
  • Medium responses (10-100KB): ~1-2ms overhead
  • Large responses (> 100KB): ~5-10ms overhead

Memory is released immediately after estimation.

Throughput Impact

Estimation adds minimal latency:

Operation Time
Response reconstruction ~0.5ms
Token estimation ~1ms
Database write ~2ms
Total overhead ~3.5ms

For comparison, typical LLM response times are 500-5000ms, making the overhead less than 1% of total request time.

Scaling

Token estimation scales linearly with response size and doesn't block other requests. The system can handle thousands of concurrent estimations without performance degradation.

Accuracy and Limitations

Expected Accuracy

  • Plain text: ±10% of actual token count
  • Code: ±15% of actual token count
  • Mixed content: ±15% of actual token count
  • JSON/structured data: ±12% of actual token count

Known Limitations

  1. Model-specific tokenization: Different models use different tokenizers. Estimates are based on average patterns and may vary per model.

  2. Language differences: Non-English text may have different token densities. The algorithm is optimized for English.

  3. Special tokens: System messages, tool definitions, and special tokens may be counted differently than content tokens.

  4. Reasoning tokens: Extended thinking tokens (e.g., o1/o3 models) are estimated separately but may have higher variance.

When Estimates May Be Less Accurate

  • Very short responses (< 50 tokens)
  • Heavy use of special characters or emojis
  • Non-Latin scripts (Chinese, Arabic, etc.)
  • Binary or encoded data in responses

Troubleshooting

Estimation Not Working

Problem: Usage logs show 0 tokens for requests to providers with estimateTokens: true.

Solutions:

  1. Check logs for estimation errors
  2. Verify provider is configured correctly
  3. Ensure responses are being streamed (not passthrough)
  4. Check if provider actually returns usage data (estimation disabled if data present)

Inaccurate Estimates

Problem: Estimated tokens differ significantly from expected values.

Solutions:

  1. Validate against known token counts from the same model
  2. Check content type (code vs. text has different densities)
  3. Review estimation logs for patterns
  4. Consider if the model uses a non-standard tokenizer

Performance Issues

Problem: Requests with estimation are slower than expected.

Solutions:

  1. Check response sizes (large responses take longer to process)
  2. Monitor system resources (CPU/memory)
  3. Verify database write performance
  4. Consider disabling estimation for high-traffic endpoints

Migration and Rollout

Enabling Estimation for Existing Providers

  1. Update configuration to add estimateTokens: true
  2. Restart Plexus to apply changes
  3. Monitor logs for estimation activity
  4. Query database to verify estimated counts are being stored

Gradual Rollout

Test estimation on a subset of providers first:

providers:
  # Test provider with estimation
  test-provider:
    estimateTokens: true
    # ... config ...

  # Production providers without estimation (initially)
  prod-provider:
    estimateTokens: false
    # ... config ...

After validating accuracy and performance, enable for production providers.

Rollback

To disable estimation:

  1. Set estimateTokens: false in provider configuration
  2. Restart Plexus
  3. Historical estimated records remain in the database with tokens_estimated = 1

Future Enhancements

Potential improvements being considered:

  • Model-specific tokenizers: Use actual tokenizer libraries for more accurate counts
  • Caching: Cache common text patterns to improve estimation speed
  • Machine learning: Train models on actual usage data to improve estimation accuracy
  • Provider hints: Allow providers to specify expected token density
  • Batch estimation: Estimate multiple requests in parallel

FAQ

Q: Does estimation affect response streaming?

A: No. Responses are streamed to clients in real-time. Estimation happens in parallel and doesn't block or delay the response.

Q: Can I use estimation with non-streaming requests?

A: Yes. Estimation works for both streaming and non-streaming responses.

Q: What happens if a provider starts returning usage data?

A: Plexus automatically detects actual usage data and disables estimation for that request. The system prioritizes actual data over estimates.

Q: Can I disable estimation for specific models within a provider?

A: Not currently. Estimation is configured at the provider level. If you need mixed behavior, create separate provider configurations.

Q: How does estimation handle tool/function calling?

A: Tool definitions and responses are included in the estimation. The algorithm accounts for JSON structure overhead.

Q: Does estimation work with image inputs?

A: Estimation only counts text tokens. Image tokens (if supported by the model) are not estimated and will show as 0 unless the provider returns actual counts.

Q: Can I adjust the estimation algorithm?

A: The algorithm is built-in and not configurable. If you need custom logic, you can modify packages/backend/src/utils/estimate-tokens.ts and rebuild.

Support

For issues or questions:

See Also