lemony-ai/cascadeflow

Smart AI model cascading for cost optimization


Cost savings benchmarks: 69% (MT-Bench), 93% (GSM8K), 52% (MMLU), 80% (TruthfulQA), while retaining 96% of GPT-5 quality.


Python • TypeScript • LangChain • n8n • Vercel AI • OpenClaw • 📖 Docs • 💡 Examples


Stop Bleeding Money on AI Calls. Cut Costs 30-65% in 3 Lines of Code.

40-70% of text prompts and 20-60% of agent calls don't need expensive flagship models. You're overpaying every single day.

cascadeflow fixes this with intelligent model cascading, available in Python and TypeScript.

pip install cascadeflow
npm install @cascadeflow/core

Why cascadeflow?

cascadeflow is an intelligent AI model cascading library that dynamically selects the optimal model for each query or tool call through speculative execution. It builds on research showing that 40-70% of queries don't require slow, expensive flagship models, and that domain-specific smaller models often outperform large general-purpose models on specialized tasks. Queries that do need advanced reasoning are escalated to flagship models automatically.

Use Cases

Use cascadeflow for:

  • Cost Optimization. Reduce API costs by 40-85% through intelligent model cascading and speculative execution with automatic per-query cost tracking.
  • Cost Control and Transparency. Built-in telemetry for query, model, and provider-level cost tracking with configurable budget limits and programmable spending caps.
  • Low Latency & Speed Optimization. Sub-2ms framework overhead with fast provider routing (Groq sub-50ms). Cascade simple queries to fast models while reserving expensive models for complex reasoning, achieving 2-10x latency reduction overall. Use the speed_optimized preset for this.
  • Multi-Provider Flexibility. Unified API across OpenAI, Anthropic, Groq, Ollama, vLLM, Together, and Hugging Face, plus 17+ providers via the Vercel AI SDK with automatic provider detection and zero vendor lock-in. Optional LiteLLM integration for 100+ additional providers, plus LangChain integration for LCEL chains and tools.
  • Edge & Local-Hosted AI Deployment. Get the best of both worlds: handle most queries with local models (vLLM, Ollama), then automatically escalate complex queries to cloud providers only when needed.
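
The budget limits and spending caps mentioned in the list above can be pictured with a small sketch. This is not cascadeflow's API: `BudgetTracker`, `BudgetExceeded`, and `record` are hypothetical names chosen for illustration, and the only assumption is that each call reports its cost in USD.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a call would push spending past the cap (hypothetical)."""


@dataclass
class BudgetTracker:
    """Illustrative per-user spending cap; not part of cascadeflow."""
    cap_usd: float
    spent_usd: dict = field(default_factory=dict)

    def record(self, user: str, cost_usd: float) -> float:
        """Add a call's cost for `user`; refuse it if the cap would be exceeded."""
        new_total = self.spent_usd.get(user, 0.0) + cost_usd
        if new_total > self.cap_usd:
            raise BudgetExceeded(f"{user} would exceed ${self.cap_usd:.2f}")
        self.spent_usd[user] = new_total
        return self.cap_usd - new_total  # remaining budget


tracker = BudgetTracker(cap_usd=1.00)
remaining = tracker.record("alice", 0.25)
print(f"alice has ${remaining:.2f} left")
```

A rejected call leaves the recorded total unchanged, so the caller can fall back to a cheaper model or return an error to the user.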

ℹ️ Note: SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. Research paper


How cascadeflow Works

cascadeflow uses speculative execution with quality validation:

  1. Speculatively executes small, fast models first - optimistic execution ($0.15-0.30/1M tokens)
  2. Validates quality of responses using configurable thresholds (completeness, confidence, correctness)
  3. Dynamically escalates to larger models only when quality validation fails ($1.25-3.00/1M tokens)
  4. Learns patterns to optimize future cascading decisions and domain-specific routing
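
Steps 1-3 above can be sketched in a few lines. This is a minimal illustration with stub models, not cascadeflow's implementation: `Tier`, `looks_complete`, and `cascade` are hypothetical names, the quality check is a toy word-count rule, and the per-call costs are made up. Step 4 (learning) is not covered.

```python
from typing import Callable, List, NamedTuple, Tuple


class Tier(NamedTuple):
    name: str
    generate: Callable[[str], str]  # stand-in for a real model call
    cost_per_call: float            # illustrative flat cost in USD


def looks_complete(answer: str, min_words: int = 3) -> bool:
    """Toy quality check: accept answers with at least `min_words` words."""
    return len(answer.split()) >= min_words


def cascade(query: str, tiers: List[Tier]) -> Tuple[str, str, float]:
    """Try tiers cheapest-first; return (answer, tier name, total cost)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    total_cost = 0.0
    answer = ""
    for tier in tiers:
        answer = tier.generate(query)
        total_cost += tier.cost_per_call  # a rejected draft is still paid for
        if looks_complete(answer):
            return answer, tier.name, total_cost
    # No tier passed validation: keep the last (strongest) tier's answer.
    return answer, tiers[-1].name, total_cost


# Stub "models" so the sketch runs without API keys.
small = Tier(
    "small",
    lambda q: "Paris is the capital of France." if "capital" in q else "Unsure.",
    0.0002,
)
large = Tier("large", lambda q: "A detailed, carefully reasoned answer.", 0.0030)

easy = cascade("What is the capital of France?", [small, large])
hard = cascade("Prove that sqrt(2) is irrational.", [small, large])
print(easy[1], hard[1])
```

The easy query stops at the small tier; the hard one pays for both the rejected draft and the escalation, which is why the acceptance rate of the draft model drives the overall savings.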

Zero configuration. Works with YOUR existing models (>17 providers currently supported).

In practice, 60-70% of queries are handled by small, efficient models (8-20x cost difference) without requiring escalation.

Result: 40-85% cost reduction, 2-10x faster responses, zero quality loss.
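
A back-of-the-envelope model shows how the acceptance rate and the price gap combine. The sketch assumes that every query is drafted first, that rejected drafts are re-run on the flagship model, and that both models process the same number of tokens. The prices ($0.30 and $3.00 per 1M tokens) and the 65% acceptance rate are picked from the ranges quoted above purely for illustration; `expected_cost` and `savings_vs_flagship` are hypothetical helper names.

```python
def expected_cost(draft_price: float, flagship_price: float, accept_rate: float) -> float:
    """Expected price per 1M tokens when every query is drafted first and
    only rejected drafts are re-run on the flagship model."""
    escalation_rate = 1.0 - accept_rate
    return draft_price + escalation_rate * flagship_price


def savings_vs_flagship(draft_price: float, flagship_price: float, accept_rate: float) -> float:
    """Fractional saving compared with sending everything to the flagship."""
    blended = expected_cost(draft_price, flagship_price, accept_rate)
    return 1.0 - blended / flagship_price


blended = expected_cost(0.30, 3.00, 0.65)       # 0.30 + 0.35 * 3.00
saving = savings_vs_flagship(0.30, 3.00, 0.65)
print(f"blended: ${blended:.2f}/1M tokens, saving: {saving:.0%}")
# prints: blended: $1.35/1M tokens, saving: 55%
```

Under these assumptions the saving grows with both the acceptance rate and the price ratio between the two models, so real results depend on the workload and the models chosen.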

┌─────────────────────────────────────────────────────────────┐
│                      cascadeflow Stack                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascade Agent                                        │  │
│  │                                                       │  │
│  │  Orchestrates the entire cascade execution            │  │
│  │  • Query routing & model selection                    │  │
│  │  • Drafter -> Verifier coordination                   │  │
│  │  • Cost tracking & telemetry                          │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Domain Pipeline                                      │  │
│  │                                                       │  │
│  │  Automatic domain classification                      │  │
│  │  • Rule-based detection (CODE, MATH, DATA, etc.)      │  │
│  │  • Optional ML semantic classification                │  │
│  │  • Domain-optimized pipelines & model selection       │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Quality Validation Engine                            │  │
│  │                                                       │  │
│  │  Multi-dimensional quality checks                     │  │
│  │  • Length validation (too short/verbose)              │  │
│  │  • Confidence scoring (logprobs analysis)             │  │
│  │  • Format validation (JSON, structured output)        │  │
│  │  • Semantic alignment (intent matching)               │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascading Engine (<2ms overhead)                     │  │
│  │                                                       │  │
│  │  Smart model escalation strategy                      │  │
│  │  • Try cheap models first (speculative execution)     │  │
│  │  • Validate quality instantly                         │  │
│  │  • Escalate only when needed                          │  │
│  │  • Automatic retry & fallback                         │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Provider Abstraction Layer                           │  │
│  │                                                       │  │
│  │  Unified interface for >17 providers                  │  │
│  │  • OpenAI • Anthropic • Groq • Ollama                 │  │
│  │  • Together • vLLM • HuggingFace • LiteLLM            │  │
│  │  • Vercel AI SDK (17+ additional providers)           │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
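
The checks listed in the Quality Validation Engine box can be illustrated with a short sketch. It is not the library's code: the function names are hypothetical, the word limits are arbitrary, and the confidence score uses one common definition (the geometric mean of token probabilities, computed from logprobs). The 0.40 default mirrors the confidence threshold that appears in the TypeScript quality configuration in this README.

```python
import json
import math
from typing import List


def length_ok(text: str, min_words: int = 3, max_words: int = 400) -> bool:
    """Length validation: reject answers that are too short or too verbose."""
    return min_words <= len(text.split()) <= max_words


def json_ok(text: str) -> bool:
    """Format validation: the answer must parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def confidence(token_logprobs: List[float]) -> float:
    """Confidence score: exp(mean(logprob)), the geometric mean of the
    token probabilities. Returns 0.0 for an empty list."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def passes(text: str, token_logprobs: List[float],
           threshold: float = 0.40, require_json: bool = False) -> bool:
    """Combine the individual checks; every enabled check must succeed."""
    if not length_ok(text):
        return False
    if confidence(token_logprobs) < threshold:
        return False
    return json_ok(text) if require_json else True


accepted = passes("The answer is forty two.", [-0.1, -0.2])
rejected = passes("Maybe.", [-2.0])
print(accepted, rejected)
```

A draft that fails any check would be the trigger for escalation to the next model in the cascade.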

Quick Start

Drop-In Gateway (Existing Apps)

If you already have an app using the OpenAI or Anthropic APIs and want the fastest integration, run the gateway and point your existing client at it:

python -m cascadeflow.server --mode auto --port 8084

Docs: docs/guides/gateway.md
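
To make "point your existing client at it" concrete, here is a sketch that builds an OpenAI-style chat request aimed at the local gateway, using only the standard library. The `/v1/chat/completions` path follows the OpenAI API convention and is an assumption about the gateway, as is the placeholder model name; check docs/guides/gateway.md for the actual routes. The request is built but not sent, so the snippet runs without a gateway.

```python
import json
import urllib.request

# Assumption: the gateway started above listens on localhost:8084 and accepts
# OpenAI-style chat completion requests at this path.
GATEWAY_URL = "http://localhost:8084/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "gpt-4o-mini") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat request for the gateway."""
    payload = {
        "model": model,  # placeholder; the gateway may pick the model itself
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("What's the capital of France?")
print(request.get_method(), request.full_url)

# To send it (requires the gateway to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

With the official OpenAI Python SDK, the equivalent step is usually just passing a `base_url` that points at the gateway when constructing the client.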

Python

pip install cascadeflow[all]

import asyncio

from cascadeflow import CascadeAgent, ModelConfig

# Define your cascade - try cheap model first, escalate if needed
agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),  # Draft model (~$0.375/1M tokens)
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),         # Verifier model (~$5.62/1M tokens)
])


async def main():
    # Run query - automatically routes to optimal model
    result = await agent.run("What's the capital of France?")

    print(f"Answer: {result.content}")
    print(f"Model used: {result.model_used}")
    print(f"Cost: ${result.total_cost:.6f}")


# agent.run is awaited, so drive it with an event loop when running as a script
asyncio.run(main())

💡 Optional: Use ML-based Semantic Quality Validation

For advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.

Step 1: Install the optional ML package:

pip install cascadeflow[semantic]  # Adds semantic similarity via FastEmbed (~80MB model)

Step 2: Use semantic quality validation:

from cascadeflow.quality.semantic import SemanticQualityChecker

# Initialize semantic checker (downloads model on first use)
checker = SemanticQualityChecker(
    similarity_threshold=0.5,  # Minimum similarity score (0-1)
    toxicity_threshold=0.7     # Maximum toxicity score (0-1)
)

# Validate query-response alignment
query = "Explain Python decorators"
response = "Decorators are a way to modify functions using @syntax..."

result = checker.validate(query, response, check_toxicity=True)

print(f"Similarity: {result.similarity:.2%}")
print(f"Passed: {result.passed}")
print(f"Toxic: {result.is_toxic}")

What you get:

  • 🎯 Semantic similarity scoring (query ↔ response alignment)
  • 🛡️ Optional toxicity detection
  • 🔄 Automatic model download and caching
  • 🚀 Fast inference (~100ms per check)

Full example: See semantic_quality_domain_detection.py
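
As a rough picture of what a similarity threshold means, the sketch below compares toy embedding vectors with cosine similarity, a common metric for embedding models. Whether cascadeflow uses exactly this metric is an assumption; the vectors are made up, and `cosine_similarity` and `is_aligned` are hypothetical helpers. The 0.5 default matches `similarity_threshold` in the example above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_aligned(query_vec: Sequence[float], response_vec: Sequence[float],
               similarity_threshold: float = 0.5) -> bool:
    """Treat a response as on-topic when its similarity clears the threshold."""
    return cosine_similarity(query_vec, response_vec) >= similarity_threshold


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
on_topic = [0.9, 0.1, 0.8]
off_topic = [0.0, 1.0, 0.0]

print(is_aligned(query_vec, on_topic), is_aligned(query_vec, off_topic))
```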

⚠️ GPT-5 Note: GPT-5 streaming requires organization verification; non-streaming works for all users. Verify here if needed (~15 min). Basic cascadeflow examples work without verification, and GPT-5 is only called when needed (typically 20-30% of requests).

📖 Learn more: Python Documentation | Quickstart Guide | Providers Guide

TypeScript

npm install @cascadeflow/core

import { CascadeAgent, ModelConfig } from '@cascadeflow/core';

// Same API as Python!
const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
});

const result = await agent.run('What is TypeScript?');
console.log(`Model: ${result.modelUsed}`);
console.log(`Cost: $${result.totalCost}`);
console.log(`Saved: ${result.savingsPercentage}%`);

💡 Optional: ML-based Semantic Quality Validation

For advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.

Step 1: Install the optional ML packages:

npm install @cascadeflow/ml @xenova/transformers

Step 2: Enable semantic validation in your cascade:

import { CascadeAgent, SemanticQualityChecker } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
  quality: {
    threshold: 0.40,                    // Traditional confidence threshold
    requireMinimumTokens: 5,            // Minimum response length
    useSemanticValidation: true,        // Enable ML validation
    semanticThreshold: 0.5,             // 50% minimum similarity
  },
});

// Responses now validated for semantic alignment
const result = await agent.run('Explain TypeScript generics');

Step 3: Or use semantic validation directly:

import { SemanticQualityChecker } from '@cascadeflow/core';

const checker = new SemanticQualityChecker();

if (await checker.isAvailable()) {
  const result = await checker.checkSimilarity(
    'What is TypeScript?',
    'TypeScript is a typed superset of JavaScript.'
  );

  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);
  console.log(`Passed: ${result.passed}`);
}

What you get:

  • 🎯 Query-response semantic alignment detection
  • 🚫 Off-topic response filtering
  • 📦 BGE-small-en-v1.5 embeddings (~40MB, auto-downloads)
  • ⚡ Fast CPU inference (~50-100ms with caching)
  • 🔄 Request-scoped caching (50% latency reduction)
  • 🌐 Works in Node.js, Browser, and Edge Functions

Example: semantic-quality.ts

📖 Learn more: TypeScript Documentation | Quickstart Guide | Node.js Examples | Browser/Edge Guide

🔄 Migration Example

Migrate in about 5 minutes from a direct provider integration to cascading, gaining cost savings plus full cost control and transparency.

Before (Standard Approach)

Cost: $0.000113, Latency: 850ms

# Using expensive model for everything
result = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

After (With cascadeflow)

Cost: $0.000007, Latency: 234ms

agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),
])

result = await agent.run("What's 2+2?")

🔥 Saved: $0.000106 (94% reduction), 3.6x faster
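
The summary figures follow directly from the before/after numbers quoted in this example; the arithmetic can be checked in a few lines (variable names are illustrative):

```python
# Figures quoted in the migration example (USD per query, milliseconds).
before_cost, after_cost = 0.000113, 0.000007
before_latency_ms, after_latency_ms = 850, 234

saved = before_cost - after_cost                 # absolute saving per query
reduction = saved / before_cost                  # relative cost reduction
speedup = before_latency_ms / after_latency_ms   # latency ratio

print(f"Saved: ${saved:.6f} ({reduction:.0%} reduction), {speedup:.1f}x faster")
# prints: Saved: $0.000106 (94% reduction), 3.6x faster
```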

📊 Learn more: Cost Tracking Guide | Production Best Practices | Performance Optimization


n8n Integration

Use cascadeflow in n8n workflows for no-code AI automation with automatic cost optimization!

Installation

  1. Open n8n
  2. Go to Settings → Community Nodes
  3. Search for: @cascadeflow/n8n-nodes-cascadeflow
  4. Click Install

Two Nodes

Node | Type | Use case
CascadeFlow (Model) | Language Model sub-node | Drop-in for any Chain/LLM node
CascadeFlow Agent | Standalone agent (main in/out) | Tool calling, memory, multi-step reasoning

Quick Start (Model):

  1. Add two AI Chat Model nodes (cheap drafter + powerful verifier)
  2. Add CascadeFlow (Model) and connect both models
  3. Connect to Basic LLM Chain or Chain node
  4. Check Logs tab on the Chain node to see cascade decisions

Quick Start (Agent):

  1. Add a Chat Trigger node
  2. Add CascadeFlow Agent and connect it to the trigger
  3. Connect Drafter, Verifier, optional Memory and Tools
  4. Check the Agent Output tab for cascade metadata and trace

Result: 40-85% cost savings in your n8n workflows!

Features:

  • Works with any AI Chat Model node (OpenAI, Anthropic, Ollama, Azure, etc.)
  • Mix providers (e.g., Ollama drafter + GPT-4o verifier)
  • Agent node: tool calling, memory, per-tool routing, tool call validation
  • 16-domain cascading for specialized model routing
  • Real-time flow visualization in Logs/Output tabs

🔌 Learn more: n8n Integration Guide | n8n Documentation


LangChain Integration

Use cascadeflow with LangChain for intelligent model cascading with full LCEL, streaming, and tools support!

Installation

TypeScript

npm install @cascadeflow/langchain @langchain/core @langchain/openai @langchain/anthropic

Python

pip install cascadeflow langchain-openai langchain-anthropic

Quick Start

TypeScript - Drop-in replacement for any LangChain chat model

import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';
import { withCascade } from '@cascadeflow/langchain';

const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-4o-mini' }),      // $0.15/$0.60 per 1M tokens
  verifier: new ChatAnthropic({ model: 'claude-sonnet-4-5' }),  // $3/$15 per 1M tokens
  qualityThreshold: 0.8, // 80% queries use drafter
});

// Use like any LangChain chat model
const result = await cascade.invoke('Explain quantum computing');

// Optional: Enable LangSmith tracing (see https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

// Or with LCEL chains (the prompt below is only an example)
const prompt = ChatPromptTemplate.fromTemplate('Explain {topic} in one paragraph');
const chain = prompt.pipe(cascade).pipe(new StringOutputParser());

Python - Drop-in replacement for any LangChain chat model

import asyncio

from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from cascadeflow.integrations.langchain import CascadeFlow

cascade = CascadeFlow(
    drafter=ChatOpenAI(model="gpt-4o-mini"),      # $0.15/$0.60 per 1M tokens
    verifier=ChatAnthropic(model="claude-sonnet-4-5"),  # $3/$15 per 1M tokens
    quality_threshold=0.8,  # 80% queries use drafter
)

# Optional: Enable LangSmith tracing (see https://smith.langchain.com)
# Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true


async def main():
    # Use like any LangChain chat model
    result = await cascade.ainvoke("Explain quantum computing")

    # Or with LCEL chains (the prompt below is only an example)
    prompt = ChatPromptTemplate.from_template("Explain {topic} in one paragraph")
    chain = prompt | cascade | StrOutputParser()


asyncio.run(main())

💡 Optional: Cost Tracking with Callbacks (Python)

Track costs, tokens, and cascade decisions with LangChain-compatible callbacks:

from cascadeflow.integrations.langchain.langchain_callbacks import get_cascade_callback

# Track costs similar to get_openai_callback()
with get_cascade_callback() as cb:
    response = await cascade.ainvoke("What is Python?")

    print(f"Total cost: ${cb.total_cost:.6f}")
    print(f"Drafter cost: ${cb.drafter_cost:.6f}")
    print(f"Verifier cost: ${cb.verifier_cost:.6f}")
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Successful requests: {cb.successful_requests}")

Features:

  • 🎯 Compatible with get_openai_callback() pattern
  • 💰 Separate drafter/verifier cost tracking
  • 📊 Token usage (including streaming)
  • 🔄 Works with LangSmith tracing
  • ⚡ Near-zero overhead

Full example: See langchain_cost_tracking.py

💡 Optional: Model Discovery & Analysis Helpers (TypeScript)

For discovering optimal cascade pairs from your existing LangChain models, use the built-in discovery helpers:

import {
  discoverCascadePairs,
  findBestCascadePair,
  analyzeModel,
  validateCascadePair
} from '@cascadeflow/langchain';

// Your existing LangChain models (configured with YOUR API keys)
const myModels = [
  new ChatOpenAI({ model: 'gpt-3.5-turbo' }),
  new ChatOpenAI({ model: 'gpt-4o-mini' }),
  new ChatOpenAI({ model: 'gpt-4o' }),
  new ChatAnthropic({ model: 'claude-3-haiku' }),
  // ... any LangChain chat models
];

// Quick: Find best cascade pair
const best = findBestCascadePair(myModels);
console.log(`Best pair: ${best.analysis.drafterModel} → ${best.analysis.verifierModel}`);
console.log(`Estimated savings: ${best.estimatedSavings}%`);

// Use it immediately
const cascade = withCascade({
  drafter: best.drafter,
  verifier: best.verifier,
});

// Advanced: Discover all valid pairs
const pairs = discoverCascadePairs(myModels, {
  minSavings: 50,              // Only pairs with ≥50% savings
  requireSameProvider: false,  // Allow cross-provider cascades
});

// Validate specific pair
const validation = validateCascadePair(best.drafter, best.verifier);
console.log(`Valid: ${validation.valid}`);
console.log(`Warnings: ${validation.warnings}`);

What you get:

  • 🔍 Automatic discovery of optimal cascade pairs from YOUR models
  • 💰 Estimated cost savings calculations
  • ⚠️ Validation warnings for misconfigured pairs
  • 📊 Model tier analysis (drafter vs verifier candidates)

Full example: See model-discovery.ts
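
The idea behind pair discovery can be sketched independently of the helpers above: enumerate ordered (drafter, verifier) pairs, estimate savings for each, and keep those above a minimum. The estimate below is an assumption made for illustration (drafter always paid, a fixed 65% of drafts accepted) and is not the formula used by `discoverCascadePairs`; the model names and prices are placeholders.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Placeholder blended prices in USD per 1M tokens; replace with your own.
PRICES: Dict[str, float] = {"small": 0.30, "medium": 1.25, "large": 3.00}


def estimated_savings(drafter_price: float, verifier_price: float,
                      accept_rate: float = 0.65) -> float:
    """Percent saved versus always calling the verifier, assuming the drafter
    is always paid for and `accept_rate` of its drafts are accepted."""
    blended = drafter_price + (1.0 - accept_rate) * verifier_price
    return 100.0 * (1.0 - blended / verifier_price)


def discover_pairs(prices: Dict[str, float],
                   min_savings: float = 50.0) -> List[Tuple[str, str, float]]:
    """Return (drafter, verifier, savings %) for pairs clearing `min_savings`,
    best first. Only pairs with a cheaper drafter are considered."""
    candidates = [
        (d, v, estimated_savings(prices[d], prices[v]))
        for d, v in permutations(prices, 2)
        if prices[d] < prices[v]
    ]
    good = [pair for pair in candidates if pair[2] >= min_savings]
    return sorted(good, key=lambda pair: pair[2], reverse=True)


pairs = discover_pairs(PRICES)
best = pairs[0] if pairs else None
print(best)
```

With these placeholder prices only the small-to-large pair clears the 50% bar, which shows why a large price gap between drafter and verifier matters more than the drafter being merely cheaper.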

Features:

  • ✅ Full LCEL support (pipes, sequences, batch)
  • ✅ Streaming with pre-routing
  • ✅ Tool calling and structured output
  • ✅ LangSmith cost tracking metadata
  • ✅ Cost tracking callbacks (Python)
  • ✅ Works with all LangChain features

🦜 Learn more: LangChain Integration Guide | TypeScript Package | Python Examples


Resources

Examples

Python Examples:

Basic Examples - Get started quickly

Example | Description | Link
Basic Usage | Simple cascade setup with OpenAI models | View
Preset Usage | Use built-in presets for quick setup | View
Multi-Provider | Mix multiple AI providers in one cascade | View
Reasoning Models | Use reasoning models (o1/o3, Claude Sonnet 4, DeepSeek-R1) | View
Tool Execution | Function calling and tool usage | View
Streaming Text | Stream responses from cascade agents | View
Cost Tracking | Track and analyze costs across queries | View
Agentic Multi-Agent | Multi-turn tool loops & agent-as-a-tool delegation | View

Advanced Examples - Production & customization

Example | Description | Link
Production Patterns | Best practices for production deployments | View
FastAPI Integration | Integrate cascades with FastAPI | View
Streaming Tools | Stream tool calls and responses | View
Batch Processing | Process multiple queries efficiently | View
Multi-Step Cascade | Build complex multi-step cascades | View
Edge Device | Run cascades on edge devices with local models | View
vLLM Example | Use vLLM for local model deployment | View
Multi-Instance Ollama | Run draft/verifier on separate Ollama instances | View
Multi-Instance vLLM | Run draft/verifier on separate vLLM instances | View
Custom Cascade | Build custom cascade strategies | View
Custom Validation | Implement custom quality validators | View
User Budget Tracking | Per-user budget enforcement and tracking | View
User Profile Usage | User-specific routing and configurations | View
Rate Limiting | Implement rate limiting for cascades | View
Guardrails | Add safety and content guardrails | View
Cost Forecasting | Forecast costs and detect anomalies | View
Semantic Quality Detection | ML-based domain and quality detection | View
Profile Database Integration | Integrate user profiles with databases | View
LangChain Basic | Simple LangChain cascade setup | View
LangChain Streaming | Stream responses with LangChain | View
LangChain Model Discovery | Discover and analyze LangChain models | View
LangChain LangSmith | Cost tracking with LangSmith integration | View
LangChain Cost Tracking | Track costs with callback handlers | View
LangChain LCEL Pipeline | LCEL chains with cascade routing | View
LangGraph Multi-Agent | LangGraph multi-agent orchestration | View

TypeScript Examples:

Basic Examples - Get started quickly

Example | Description | Link
Basic Usage | Simple cascade setup (Node.js) | View
Tool Calling | Function calling with tools (Node.js) | View
Multi-Provider | Mix providers in TypeScript (Node.js) | View
Reasoning Models | Use reasoning models (o1/o3, Claude Sonnet 4, DeepSeek-R1) | View
Cost Tracking | Track and analyze costs across queries | View
Semantic Quality | ML-based semantic validation with embeddings | View
Streaming | Stream responses in TypeScript | View
Tool Execution | Tool execution engine and result handling | View
Streaming Tools | Stream tool calls with event detection | View
Agentic Multi-Agent | Multi-turn tool loops & multi-agent orchestration | View

Advanced Examples - Production, edge & LangChain

Example | Description | Link
Production Patterns | Production best practices (Node.js) | View
Multi-Instance Ollama | Run draft/verifier on separate Ollama instances | View
Multi-Instance vLLM | Run draft/verifier on separate vLLM instances | View
Browser/Edge | Vercel Edge runtime example | View
LangChain Basic | Simple LangChain cascade setup | View
LangChain Cross-Provider | Haiku → GPT-5 with PreRouter | View
LangChain LangSmith | Cost tracking with LangSmith | View
LangChain Cost Tracking | Compare cascadeflow vs LangSmith cost tracking | View
LangGraph Multi-Agent | LangGraph multi-agent orchestration | View
LangChain Tool Risk Gating | Tool routing based on risk and complexity | View

📂 View All Python Examples → | View All TypeScript Examples →

Documentation

Getting Started - Core concepts and basics

Guide | Description | Link
Quickstart | Get started with cascadeflow in 5 minutes | Read
Providers Guide | Configure and use different AI providers | Read
Presets Guide | Using and creating custom presets | Read
Streaming Guide | Stream responses from cascade agents | Read
Tools Guide | Function calling and tool usage | Read
Cost Tracking | Track and analyze API costs | Read
Agentic Patterns (Python) | Tool loops, multi-agent, agent-as-a-tool delegation | Read
Agentic Patterns (TypeScript) | Tool loops, multi-agent orchestration | Read
Gateway Guide | Drop-in OpenAI/Anthropic-compatible server | Read

Advanced Topics - Production, customization & integrations

Guide | Description | Link
Production Guide | Best practices for production deployments | Read
Performance Guide | Optimize cascade performance and latency | Read
Custom Cascade | Build custom cascade strategies | Read
Custom Validation | Implement custom quality validators | Read
Edge Device | Deploy cascades on edge devices | Read
Browser Cascading | Run cascades in the browser/edge | Read
FastAPI Integration | Integrate with FastAPI applications | Read
LangChain Integration | Use cascadeflow with LangChain | Read
n8n Integration | Use cascadeflow in n8n workflows | Read
Local Providers | Ollama & vLLM self-hosted deployment | Read
OpenClaw Provider | OpenClaw custom provider setup | Read
Enterprise Guide | Enterprise deployments and configuration | Read
Quick Integration | Integrate cascadeflow fast | Read
User Budget Tracking | Per-user cost limits and budgets | Read
Proxy Routing | Provider-aware proxy routing | Read

📚 View All Documentation →


Features

Feature | Benefit
🎯 Speculative Cascading | Tries cheap models first, escalates intelligently
💰 40-85% Cost Savings | Research-backed, proven in production
⚡ 2-10x Faster | Small models respond in <50ms vs 500-2000ms
⚡ Low Latency | Sub-2ms framework overhead, negligible performance impact
🔄 Mix Any Providers | OpenAI, Anthropic, Groq, Ollama, vLLM, Together + LiteLLM (optional) + LangChain integration
👤 User Profile System | Per-user budgets, tier-aware routing, enforcement callbacks
✅ Quality Validation | Automatic checks + semantic similarity (optional ML, ~80MB, CPU)
🎨 Cascading Policies | Domain-specific pipelines, multi-step validation strategies
🧠 Domain Understanding | 15 domains auto-detected (code, medical, legal, finance, math, etc.), routes to specialists
🤖 Drafter/Validator Pattern | 20-60% savings for agent/tool systems
🔧 Tool Calling Support | Universal format, works across all providers
📊 Cost Tracking | Built-in analytics + OpenTelemetry export (vendor-neutral)
🚀 3-Line Integration | Zero architecture changes needed
🔁 Agent Loops | Multi-turn tool execution with automatic tool call, result, re-prompt cycles
📋 Message & Tool Call Lists | Full conversation history with tool_calls and tool_call_id preservation across turns
🪝 Hooks & Callbacks | Telemetry callbacks, cost events, and streaming hooks for observability
🏭 Production Ready | Streaming, batch processing, tool handling, reasoning model support, caching, error recovery, anomaly detection
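
To illustrate the rule-based side of domain detection and routing to specialists, here is a minimal keyword-matching sketch. The rules, the three domains shown, and the specialist model names are all hypothetical; cascadeflow's own rule set and its optional ML classifier are not reproduced here.

```python
import re
from typing import Dict, List

# Hypothetical keyword rules; a real rule set would be much larger.
DOMAIN_RULES: Dict[str, List[str]] = {
    "CODE": [r"\bdef\b", r"\bfunction\b", r"\bpython\b", r"\btypescript\b", r"\bbug\b"],
    "MATH": [r"\bsolve\b", r"\bintegral\b", r"\bequation\b", r"\d+\s*[-+*/^]\s*\d+"],
    "DATA": [r"\bcsv\b", r"\bsql\b", r"\bdataframe\b", r"\bpandas\b"],
}

# Hypothetical specialist model names, one per domain.
ROUTES: Dict[str, str] = {
    "CODE": "code-specialist",
    "MATH": "math-specialist",
    "DATA": "data-specialist",
}


def detect_domain(query: str, default: str = "GENERAL") -> str:
    """Return the domain whose rules match most often, or `default` if none do."""
    text = query.lower()
    scores = {
        domain: sum(1 for pattern in patterns if re.search(pattern, text))
        for domain, patterns in DOMAIN_RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def pick_model(query: str, fallback: str = "general-small") -> str:
    """Route a query to its domain specialist, or to a general fallback."""
    return ROUTES.get(detect_domain(query), fallback)


print(detect_domain("Fix the bug in this Python function"))
print(pick_model("Tell me a joke"))
```

Keyword rules are cheap enough to run on every query, which is why a rule-based pass is a natural first stage before any embedding-based classifier.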

License

MIT © see LICENSE file.

Free for commercial use. Attribution appreciated but not required.


Contributing

We ❤️ contributions!

📝 Contributing Guide - Python & TypeScript development setup


Recently Shipped

  • ✅ Agent Loops & Multi-Agent - Multi-turn tool execution, agent-as-a-tool delegation, LangGraph orchestration
  • ✅ Tool Execution Engine - Automatic tool call routing, parallel execution, risk gating
  • ✅ Hooks & Callbacks - Telemetry callbacks, cost events, streaming hooks for observability
  • ✅ Vercel AI SDK Integration - 17+ additional providers with automatic provider detection
  • ✅ OpenClaw Provider - Custom provider for OpenClaw deployments
  • ✅ Gateway Server - Drop-in OpenAI/Anthropic-compatible proxy endpoint
  • ✅ User Tier Management - Cost controls and limits per user tier with advanced routing
  • ✅ Semantic Quality Validators - Lightweight local quality scoring via FastEmbed
  • ✅ Code Complexity Detection - Dynamic cascading based on task complexity analysis
  • ✅ Domain Aware Cascading - ML-based semantic domain detection with per-domain routing
  • ✅ Benchmark Reports - Automated benchmarking (MMLU, GSM8K, MT-Bench, HumanEval, TruthfulQA)
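
The agent loop described in the first item (tool call, result, re-prompt) can be sketched with a stub model. This is a generic illustration rather than cascadeflow's engine: `fake_model`, `run_agent`, and the `add` tool are hypothetical, and the message shape is a simplified version of the OpenAI-style format in which `tool_call_id` links each tool result to the call that requested it.

```python
import json
from typing import Callable, Dict, List

# Tools the agent may call; `add` stands in for real tools.
TOOLS: Dict[str, Callable[..., object]] = {"add": lambda a, b: a + b}


def fake_model(messages: List[dict]) -> dict:
    """Stub model: requests one tool call, then answers using its result."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"role": "assistant", "content": f"The result is {last['content']}."}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {"id": "call_1", "name": "add", "arguments": json.dumps({"a": 2, "b": 3})}
        ],
    }


def run_agent(prompt: str, model: Callable[[List[dict]], dict] = fake_model,
              max_turns: int = 5) -> List[dict]:
    """Loop: call the model, execute requested tools, append results, re-prompt."""
    messages: List[dict] = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls")
        if not tool_calls:
            break  # no tool requested: this is the final answer
        for call in tool_calls:
            result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            # tool_call_id ties the result back to the call that produced it
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
            )
    return messages


history = run_agent("What is 2 + 3?")
print(history[-1]["content"])
```

The `max_turns` bound keeps a misbehaving model from looping forever, and the full `history` list is what would be passed on to the next turn.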

Support


Citation

If you use cascadeflow in your research or project, please cite:

@software{cascadeflow2025,
  author = {Lemony Inc., Sascha Buehrle and Contributors},
  title = {cascadeflow: Smart AI model cascading for cost optimization},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/lemony-ai/cascadeflow}
}

Ready to cut your AI costs by 40-85%?

pip install cascadeflow
npm install @cascadeflow/core

Read the Docs • View Python Examples • View TypeScript Examples • Join Discussions


About

Built with ❤️ by Lemony Inc. and the cascadeflow Community

One cascade. Hundreds of specialists.

New York | Zurich

⭐ Star us on GitHub if cascadeflow helps you save money!