Complete collection of examples demonstrating cascadeflow from basics to production deployment.
```bash
# 1. Install cascadeflow
pip install cascadeflow[all]

# 2. Set your API key
export OPENAI_API_KEY="sk-..."

# 3. Run your first example
python examples/basic_usage.py
```

That's it! You'll see cascading in action with cost savings.
| Example | What It Does | Complexity | Time | Best For |
|---|---|---|---|---|
| basic_usage.py | Learn cascading basics | ⭐ Easy | 5 min | First-time users |
| streaming_text.py | Real-time streaming | ⭐⭐ Medium | 10 min | Interactive apps |
| tool_execution.py | Function calling | ⭐⭐ Medium | 15 min | Agent builders |
| agentic_multi_agent.py | Tool loops + multi-agent | ⭐⭐⭐ Advanced | 20 min | Agentic apps |
| cost_tracking.py | Budget management | ⭐⭐ Medium | 15 min | Cost optimization |
| multi_provider.py | Mix AI providers | ⭐⭐ Medium | 10 min | Multi-cloud |
| reasoning_models.py | o1, o3, Claude 3.7, DeepSeek-R1 | ⭐⭐ Medium | 10 min | Complex reasoning |
| gateway_client_openai.py | Drop-in gateway (OpenAI SDK) | ⭐ Easy | 2 min | Existing OpenAI apps |
| gateway_client_anthropic.py | Drop-in gateway (Anthropic format) | ⭐ Easy | 2 min | Existing Anthropic apps |
| proxy_service_basic.py | Build your own proxy (router + handler) | ⭐⭐ Medium | 10 min | Custom gateways |
| fastapi_integration.py | REST API server | ⭐⭐⭐ Advanced | 20 min | Production APIs |
| production_patterns.py | Enterprise patterns | ⭐⭐⭐ Advanced | 30 min | Production deployment |
| edge_device.py | Edge AI (Jetson/Pi) | ⭐⭐⭐ Advanced | 20 min | Edge deployment |
💡 Tip: Start with basic_usage.py, then explore based on your use case!
I want to...

- Stream responses? → streaming_text.py, streaming_tools.py
- Use tools/functions? → tool_execution.py, agentic_multi_agent.py, streaming_tools.py
- Track costs? → cost_tracking.py, user_budget_tracking.py, integrations/litellm_cost_tracking.py
- Enforce budgets? → enforcement/basic_enforcement.py, enforcement/stripe_integration.py
- Use multiple providers? → multi_provider.py, integrations/litellm_providers.py
- Access DeepSeek/Gemini/Azure? → integrations/litellm_providers.py
- Deploy to production? → production_patterns.py, fastapi_integration.py
- Monitor in production? → integrations/opentelemetry_grafana.py
- Run locally/edge? → edge_device.py, integrations/local_providers_setup.py, vllm_example.py, multi_instance_ollama.py, multi_instance_vllm.py
- Integrate an existing OpenAI/Anthropic app quickly? → gateway_client_openai.py, gateway_client_anthropic.py
- Build a custom gateway/proxy? → proxy_service_basic.py
- Use reasoning models? → reasoning_models.py
- Manage user budgets? → user_budget_tracking.py, profile_database_integration.py
- Integrate with Stripe? → enforcement/stripe_integration.py
- Add safety guardrails? → guardrails_usage.py
- Customize routing? → custom_cascade.py, multi_step_cascade.py
- Validate quality? → custom_validation.py, semantic_quality_domain_detection.py
- 🌟 Core Examples - Basic usage, streaming, tools
- 💰 Cost Management - Budgets and tracking
- 🏭 Production - Deployment patterns
- 🔌 Integrations - LiteLLM, Paygentic, OpenTelemetry, local providers
- 🛡️ Enforcement - Budget enforcement and Stripe
- ⚡ Advanced - Custom routing and validation
- 🌐 Edge - Edge device deployment and multi-instance configurations
Perfect for learning cascadeflow basics. Start with these!
File: basic_usage.py
Time: 5 minutes
What you'll learn:
- How cascading works (cheap model → expensive model)
- Automatic quality-based routing
- Cost tracking and savings
- When drafts are accepted vs rejected
Run it:

```bash
export OPENAI_API_KEY="sk-..."
python examples/basic_usage.py
```

Expected output:
Query 1/8: What color is the sky?
💚 Model: gpt-4o-mini only
💰 Cost: $0.000014
✅ Draft Accepted
Query 6/8: Explain quantum entanglement...
💚💛 Models: gpt-4o-mini + gpt-4o
💰 Cost: $0.005006
❌ Draft Rejected
💰 TOTAL SAVINGS: 45% reduction
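The accept/reject economics shown above can be sketched as a toy simulation. This is plain Python illustrating the concept, not the cascadeflow API; the `cascade` function, validator, and per-query prices are all illustrative:

```python
# Toy simulation of the cascade decision basic_usage.py demonstrates.
# Model names and prices are illustrative, not cascadeflow's actual API.

CHEAP_COST_PER_QUERY = 0.000014   # e.g. a gpt-4o-mini-class draft model
EXPENSIVE_COST_PER_QUERY = 0.005  # e.g. a gpt-4o-class verifier model

def cascade(query: str, draft_is_good) -> dict:
    """Try the cheap model first; escalate only if the draft fails validation."""
    draft = f"draft answer to: {query}"
    if draft_is_good(draft):
        # Draft accepted: only the cheap model was paid for.
        return {"answer": draft, "cost": CHEAP_COST_PER_QUERY, "cascaded": False}
    # Draft rejected: both models ran, quality is ensured.
    final = f"verified answer to: {query}"
    return {
        "answer": final,
        "cost": CHEAP_COST_PER_QUERY + EXPENSIVE_COST_PER_QUERY,
        "cascaded": True,
    }

# A trivial validator: accept drafts for short, simple queries.
result = cascade("What color is the sky?", lambda d: len(d) < 80)
print(result["cascaded"], result["cost"])  # False 1.4e-05
```

Every accepted draft pays only the cheap-model price, which is where the aggregate savings come from.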
File: streaming_text.py
Time: 10 minutes
What you'll learn:
- Real-time text streaming
- See cascade decisions in action
- Visual feedback with colors
- Performance metrics
Key concept: Watch the cascade happen in real-time!
File: tool_execution.py
Time: 15 minutes
What you'll learn:
- Function calling with tools
- Tool execution workflow
- Multi-turn conversations
- Error handling
Important: This shows actual tool EXECUTION, not just streaming tool calls.
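The execution step can be pictured as a registry of callables dispatched by name with JSON arguments. This is a minimal sketch of the concept only; the `TOOLS` dict and `execute_tool_call` helper are hypothetical, and the real ToolExecutor API may differ:

```python
import json

# Toy tool registry and dispatcher (illustrative; not cascadeflow's ToolExecutor API).
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "add": lambda a, b: a + b,
}

def execute_tool_call(name: str, arguments_json: str):
    """Execute a model-requested tool call: look up the function, parse JSON args."""
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    args = json.loads(arguments_json)
    return TOOLS[name](**args)

print(execute_tool_call("add", '{"a": 2, "b": 3}'))         # 5
print(execute_tool_call("get_weather", '{"city": "Oslo"}'))  # Sunny in Oslo
```

In a multi-turn conversation, the returned value is fed back to the model as a tool result message before the next turn.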
File: multi_provider.py
Time: 10 minutes
What you'll learn:
- Mix models from different providers
- OpenAI + Anthropic + Groq
- Provider-specific optimizations
- Cross-provider cost comparison
Example setup:

```python
agent = CascadeAgent(models=[
    ModelConfig(name="llama-3.1-8b", provider="groq", cost=0.00005),          # Fast & cheap
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),              # Quality
    ModelConfig(name="claude-3-5-sonnet", provider="anthropic", cost=0.003),  # Reasoning
])
```

File: reasoning_models.py
Time: 10 minutes
What you'll learn:
- Use o1, o3-mini, Claude 3.7, DeepSeek-R1
- Extended thinking mode
- Chain-of-thought reasoning
- Auto-detection of reasoning capabilities
Supported models:
- OpenAI: o1, o1-mini, o3-mini
- Anthropic: claude-3-7-sonnet
- Ollama: deepseek-r1 (free local)
- vLLM: deepseek-r1 (self-hosted)
File: cost_tracking.py
Time: 15 minutes
What you'll learn:
- Real-time cost monitoring
- Budget limits and warnings
- Per-model cost breakdown
- Cost optimization insights
Features:
- Budget alerts at 80% threshold
- Per-provider analysis
- Query-level cost tracking
- Savings calculation
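The 80% alert threshold works roughly like the toy tracker below. This `BudgetTracker` class is illustrative only; cascadeflow's real CostTracker has its own API (shown later in this README):

```python
class BudgetTracker:
    """Toy budget tracker with an 80% warning threshold.
    Illustrative sketch only; not cascadeflow's CostTracker API."""

    def __init__(self, limit: float, warn_at: float = 0.8):
        self.limit = limit
        self.warn_at = warn_at
        self.spent = 0.0

    def add(self, cost: float) -> str:
        """Record a query's cost and report the resulting budget state."""
        self.spent += cost
        if self.spent >= self.limit:
            return "BLOCK"
        if self.spent >= self.limit * self.warn_at:
            return "WARN"
        return "OK"

tracker = BudgetTracker(limit=1.0)
print(tracker.add(0.50))  # OK
print(tracker.add(0.35))  # WARN  (0.85 >= 0.80 threshold)
print(tracker.add(0.20))  # BLOCK (1.05 >= 1.00 limit)
```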
Learn how to use tools and functions with cascadeflow.
File: tool_execution.py
Complete tool workflow with ToolExecutor - actual execution, not just detection.
File: streaming_tools.py
Watch tool calls form in real-time as JSON arrives.
File: agentic_multi_agent.py
Shows a complete agentic workflow:
- Tool loop with automatic tool execution (tool_executor, max_steps)
- Multi-agent delegation as a tool (delegate_to_researcher)
Key difference:
- tool_execution.py = Complete workflow (detection + execution)
- streaming_tools.py = Just streaming detection
Track costs, manage budgets, and optimize spending.
File: cost_tracking.py
Real-time cost monitoring with budget limits.
File: user_budget_tracking.py
Per-user budget enforcement and tracking.
File: user_profile_usage.py
User-specific routing and tier management.
File: profile_database_integration.py
Integrate user profiles with databases.
Use cases:
- SaaS applications with user tiers
- Multi-tenant systems
- Budget-aware routing
- Cost allocation by user
Deploy cascadeflow to production with enterprise patterns.
File: production_patterns.py
Time: 30 minutes
What you'll learn:
- Error handling & retries
- Rate limiting
- Circuit breakers
- Caching
- Monitoring & alerting
Enterprise features:
- Exponential backoff
- Request throttling
- Failure detection
- Response caching
- Health checks
File: fastapi_integration.py
Time: 20 minutes
REST API deployment with Server-Sent Events (SSE).
Endpoints:
- POST /api/query - Non-streaming
- GET /api/query/stream - SSE streaming
- GET /health - Health check
- GET /api/stats - Statistics
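The streaming endpoint emits frames in the standard Server-Sent Events wire format (`event:`/`data:` lines terminated by a blank line). The `sse_event` helper below is a hypothetical sketch of that framing, not code from the example:

```python
import json
from typing import Optional

def sse_event(data: dict, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events frame, as an SSE streaming endpoint would emit.
    Hypothetical helper for illustration; fastapi_integration.py has its own code."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"

print(repr(sse_event({"token": "Hi"}, event="delta")))
# 'event: delta\ndata: {"token": "Hi"}\n\n'
```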
File: batch_processing.py
Process multiple queries efficiently.
File: rate_limiting_usage.py
Request throttling and queuing.
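Request throttling is commonly built on a token bucket: tokens refill at a fixed rate and each request spends one. The class below is a minimal deterministic sketch of that idea (timestamps passed in explicitly), not cascadeflow's actual implementation:

```python
class TokenBucket:
    """Minimal token-bucket throttle. Illustrative sketch only;
    see rate_limiting_usage.py for cascadeflow's real approach."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=5.0, capacity=2)
# Two immediate requests pass, the third is throttled, and a second later
# the bucket has refilled.
print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)])  # [True, True, False, True]
```

In production you would call `allow(time.monotonic())` and queue or reject requests that return False.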
File: guardrails_usage.py
Safety and content filtering.
Access 10+ providers with accurate cost tracking, production monitoring, and local inference.
File: integrations/litellm_providers.py
Time: 15 minutes
What you'll learn:
- Access DeepSeek, Google Gemini, Azure OpenAI, and 7 more providers
- Calculate accurate costs for 100+ models
- Compare costs across providers
- Integrate with CascadeAgent
8 Complete Examples:
- List all supported providers
- Cost calculation comparison
- Model pricing details
- Cost comparison across use cases
- Provider information lookup
- Convenience functions
- API key status check
- Real-world CascadeAgent integration
Cost Savings:
- DeepSeek: 95% cheaper than GPT-4o for code ($0.00028 vs $0.0075)
- Gemini Flash: 98% cheaper for simple tasks ($0.000225 vs $0.0075)
- Annual impact: Save $21,000-$28,500 per year
Quick Example:

```python
from cascadeflow.integrations.litellm import calculate_cost

cost = calculate_cost(
    model="deepseek/deepseek-coder",
    input_tokens=1000,
    output_tokens=500,
)
print(f"Cost: ${cost:.6f}")  # $0.000280 vs $0.007500 for GPT-4o
```

File: integrations/litellm_cost_tracking.py
Cost tracking with LiteLLM integration and provider validation.
File: integrations/paygentic_usage.py
Time: 10 minutes
Opt-in usage metering with Paygentic for production billing workflows.
Python proxy flows support delivery_mode="background" by default, with optional sync or durable_outbox.
File: integrations/local_providers_setup.py
Time: 15 minutes
Complete guide for Ollama and vLLM setup (local, network, remote scenarios).
File: integrations/opentelemetry_grafana.py
Time: 20 minutes
Production observability with OpenTelemetry, Prometheus, and Grafana.
Features:
- Cost metrics export
- Token usage tracking
- Latency histograms
- User-level analytics
File: integrations/test_all_providers.py
Validate API keys and test all 10 providers.
Documentation: 📖 integrations/README.md
Implement budget enforcement and cost controls for production SaaS applications.
File: enforcement/basic_enforcement.py
Time: 10 minutes
What you'll learn:
- Configure budget limits per tier
- Use built-in enforcement callbacks
- Create custom callbacks
- Handle enforcement actions (ALLOW, WARN, BLOCK, DEGRADE)
Built-in Callbacks:
- strict_budget_enforcement - Block at 100%, warn at 80%
- graceful_degradation - Degrade to cheaper models at 90%
- tier_based_enforcement - Different policies per tier
Quick Example:

```python
from cascadeflow.telemetry import (
    BudgetConfig,
    CostTracker,
    EnforcementAction,
    EnforcementCallbacks,
    strict_budget_enforcement,
)

# Configure budgets
tracker = CostTracker(
    user_budgets={
        "free": BudgetConfig(daily=0.10),
        "pro": BudgetConfig(daily=1.0),
    }
)

# Set up enforcement
callbacks = EnforcementCallbacks()
callbacks.register(strict_budget_enforcement)

# Check before processing
action = callbacks.check(context)
if action == EnforcementAction.BLOCK:
    return {"error": "Budget exceeded. Please upgrade."}
```

File: enforcement/stripe_integration.py
Time: 15 minutes
Real-world template for integrating with Stripe subscriptions.
Features:
- Map Stripe tiers to budgets
- Subscription-based enforcement
- Upgrade flow handling
- Tier-specific policies
Documentation: 📖 enforcement/README.md
Custom routing, validation, and specialized deployments.
File: custom_cascade.py
Domain-specific routing, time-based routing, budget-aware cascades.
File: custom_validation.py
Build custom quality validators for specific domains.
File: multi_step_cascade.py
Complex multi-stage cascades.
File: semantic_quality_domain_detection.py
ML-based domain and quality detection.
File: cost_forecasting_anomaly_detection.py
Predict costs and detect unusual spending.
File: vllm_example.py
Self-hosted inference with vLLM.
Run cascadeflow on edge devices with local inference and multi-instance configurations.
File: edge_device.py
Time: 20 minutes
What you'll learn:
- Local inference with vLLM on Jetson/Raspberry Pi
- Automatic cascade to cloud for complex queries
- Zero-cost local processing
- Privacy-first architecture
Hardware:
- Nvidia Jetson (Thor, Orin, Xavier)
- Raspberry Pi 5
- 8GB+ RAM recommended
Use cases:
- Smart factories
- Healthcare devices (HIPAA)
- Retail kiosks
- Autonomous robots
- IoT gateways
Cost savings: 70% + privacy + lower latency
File: multi_instance_ollama.py
Time: 15 minutes
What you'll learn:
- Run draft and verifier models on separate Ollama instances
- Multi-GPU configuration with Docker Compose
- Health checks and instance validation
- GPU resource isolation for optimal performance
Use cases:
- Multi-GPU systems (draft on GPU 0, verifier on GPU 1)
- Distributed inference across network
- Load balancing between instances
- Better fault isolation
Setup: See Docker Compose guide
File: multi_instance_vllm.py
Time: 15 minutes
What you'll learn:
- Run draft and verifier models on separate vLLM instances
- High-performance inference with PagedAttention
- Kubernetes pod configuration
- Production-scale deployments
Use cases:
- GPU 0: Fast 7B model (200+ tokens/sec)
- GPU 1: Powerful 70B model (50+ tokens/sec)
- Kubernetes StatefulSets
- Load-balanced inference clusters
Performance: 10-24x faster than standard serving
- ✅ Run basic_usage.py - Understand core concepts
- ✅ Read the code comments - Learn patterns
- ✅ Try different queries - See routing decisions
Key concepts:
- Cascading = cheap model first, escalate if needed
- Draft accepted = money saved ✅
- Draft rejected = quality ensured ✅
- ✅ Run streaming_text.py - See streaming
- ✅ Run tool_execution.py - Learn tool usage
- ✅ Read Streaming Guide
Key concepts:
- Streaming requires 2+ models
- Event-based architecture
- Tool execution workflow
- ✅ Run cost_tracking.py - Learn budget tracking
- ✅ Run user_budget_tracking.py - Per-user budgets
- ✅ Read Cost Tracking Guide
Key concepts:
- Budget alerts at 80%
- Per-model breakdown
- Cost optimization
- ✅ Run production_patterns.py - Enterprise patterns
- ✅ Run fastapi_integration.py - API deployment
- ✅ Read Production Guide
Key concepts:
- Error handling
- Rate limiting
- Monitoring
- ✅ Run custom_cascade.py - Custom routing
- ✅ Run custom_validation.py - Custom validators
- ✅ Modify for your use case
```bash
# Install with all dependencies
pip install cascadeflow[all]

# Or install specific providers
pip install cascadeflow[openai]
pip install cascadeflow[anthropic]
pip install cascadeflow[groq]
```

```bash
# OpenAI (most examples)
export OPENAI_API_KEY="sk-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

# Groq (free, fast)
export GROQ_API_KEY="gsk_..."

# Together AI
export TOGETHER_API_KEY="..."

# Hugging Face
export HF_TOKEN="hf_..."
```

```bash
# From repository root
python examples/basic_usage.py
python examples/streaming_text.py
python examples/cost_tracking.py

# With custom config
python examples/multi_provider.py
```

API key errors
```bash
# Check if set
echo $OPENAI_API_KEY

# Set it
export OPENAI_API_KEY="sk-..."

# Windows
set OPENAI_API_KEY=sk-...
```

Import errors

```bash
# Install all dependencies
pip install cascadeflow[all]

# Or specific providers
pip install cascadeflow[openai]
```

Examples run but show errors

```bash
# Check Python version (3.9+ required)
python --version

# Reinstall
pip install --upgrade cascadeflow[all]
```

Streaming shows garbled output

Terminal may not support ANSI colors:

```bash
# Disable colors
TERM=dumb python examples/streaming_text.py
```

Begin with basic_usage.py before advanced examples.
All examples are heavily commented. Read through to understand patterns.
Streaming vs Execution:
- streaming_tools.py = Watch tool calls form
- tool_execution.py = Actually execute tools
- Why separate? Gives you control over validation

Cost Tracking:
- Extract from result.metadata
- Use safe extraction: getattr() and .get()
- Track with CostTracker for budgets
Quality Validation:
- Draft accepted = cheap model only (saves money!)
- Draft rejected = both models called (ensures quality)
- Adjust thresholds based on use case
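A custom validator ultimately boils down to a function that accepts or rejects a draft. The toy sketch below shows the shape of such a check (the `validate_draft` function is illustrative only; see custom_validation.py for real validators):

```python
def validate_draft(query: str, draft: str, min_words: int = 5) -> bool:
    """Toy quality validator: reject drafts that are too short or merely
    echo the query. Illustrative sketch, not cascadeflow's validator API."""
    words = draft.split()
    if len(words) < min_words:
        return False  # too short to be a useful answer
    if draft.strip().lower() == query.strip().lower():
        return False  # the draft just parrots the question
    return True

print(validate_draft("What is 2+2?", "4"))  # False (too short)
print(validate_draft("Explain DNS", "DNS resolves names to IP addresses."))  # True
```

Loosening `min_words` accepts more drafts (cheaper); tightening it escalates more queries (higher quality), which is the threshold trade-off described above.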
```python
result = await agent.run(query)

# Safe extraction
total_cost = getattr(result, 'total_cost', 0)
model_used = getattr(result, 'model_used', 'unknown')
cascaded = result.metadata.get('cascaded', False)
```

```python
from cascadeflow.telemetry import CostTracker

tracker = CostTracker(budget_limit=1.0, warn_threshold=0.8)

# Run queries
result = await agent.run(query)

# Track costs
tracker.add_cost(
    model=result.model_used,
    provider="openai",
    tokens=result.metadata.get('total_tokens', 0),
    cost=result.total_cost,
)

# View summary
tracker.print_summary()
```

- Quick Start - 5-minute introduction
- Providers Guide - Configure AI providers
- Streaming Guide - Real-time responses
- Tools Guide - Function calling
- Cost Tracking - Budget management
- Production Guide - Enterprise deployment
- Performance Guide - Optimization
- Custom Cascade - Custom routing
- Custom Validation - Quality control
- Edge Devices - Jetson/Pi deployment
- Browser Cascading - Edge/browser deployment
- FastAPI Integration - REST API
- n8n Integration - No-code automation
Have a great use case? Contribute an example!
"""
Your Example - Brief Description
What it demonstrates:
- Feature 1
- Feature 2
Requirements:
- Dependency 1
Setup:
pip install cascadeflow[all]
export API_KEY="..."
python examples/your_example.py
Expected Results:
Description of output
"""
import asyncio
from cascadeflow import CascadeAgent, ModelConfig
async def main():
print("=" * 80)
print("YOUR EXAMPLE TITLE")
print("=" * 80)
# Your code here
print("\nKEY TAKEAWAYS:")
print("- Takeaway 1")
print("- Takeaway 2")
if __name__ == "__main__":
asyncio.run(main())See CONTRIBUTING.md for guidelines.
- 📖 Complete Guides
- 🌊 Streaming Guide
- 🛠️ Tools Guide
- 💰 Cost Tracking Guide
- 🏭 Production Guide
- 💬 GitHub Discussions - Ask questions
- 🐛 GitHub Issues - Report bugs
- 💡 Use the "question" label for general questions
Core (6): Basic usage, streaming text, tool execution, multi-provider, reasoning models, cost tracking
Cost Management (4): Cost tracking, user budgets, user profiles, database integration
Production (5): Production patterns, FastAPI, batch processing, rate limiting, guardrails
Integrations (6): LiteLLM providers, cost tracking, Paygentic billing, local setup, OpenTelemetry, provider testing
Enforcement (2): Basic enforcement, Stripe integration
Advanced (6): Custom cascade, custom validation, multi-step, semantic detection, forecasting, vLLM
Edge (1): Edge device deployment
- ✅ 30 examples (~6,000+ lines of code)
- ✅ 3 specialized READMEs (integrations, enforcement, main)
- ✅ 10+ comprehensive guides (~10,000 lines of docs)
- ✅ ~16,000+ lines total of professional documentation
- ✅ 100% feature coverage
Essential Concepts:
- ✅ Draft accepted = money saved
- ✅ Draft rejected = quality ensured
- ✅ Streaming requires 2+ models
- ✅ Use universal tool format
- ✅ Extract costs from result.metadata
- ✅ Track budgets with CostTracker
Production Ready:
- ✅ Error handling
- ✅ Rate limiting
- ✅ Monitoring
- ✅ Budget management
- ✅ API deployment
💰 Save 40-85% on AI costs with intelligent cascading! 🚀
View All Documentation • Python Examples • TypeScript Examples • GitHub Discussions