feat: merge agent-intelligence v2 integration train into main #164
Merged
saschabuehrle merged 49 commits into main on Mar 5, 2026
Conversation
Replace the instrument.py scaffold with a full implementation that patches openai.resources.chat.completions.Completions.create (sync) and AsyncCompletions.create (async) for harness observe/enforce modes.

Key capabilities:
- Class-level patching of the sync and async create methods
- Streaming wrappers (_InstrumentedStream, _InstrumentedAsyncStream) that capture usage metrics after all chunks are consumed
- Cost estimation from a built-in pricing table
- Energy estimation using deterministic model coefficients
- Tool call counting in both responses and streaming chunks
- Budget-remaining tracking within scoped runs
- Idempotent patching with a clean unpatch/reset path

Context tracking per call:
- cost, step_count, latency_used_ms, energy_used, tool_calls
- budget_remaining auto-updated when budget_max is set
- model_used and decision trace via ctx.record()

Added step_count, latency_used_ms, energy_used fields to HarnessRunContext in api.py. Hooked patch_openai into init() and unpatch_openai into reset().

39 new tests covering: patch lifecycle, sync/async wrappers, sync/async stream wrappers, cost/energy estimation, nested run isolation, and edge cases (no usage, no choices, missing chunks). All 63 harness tests pass (39 instrument + 24 api).
…m usage injection
- init(mode="off") now calls unpatch_openai() if previously patched
- Trace records actual mode (observe/enforce) instead of always "observe"
- Enforce mode raises BudgetExceededError pre-call when budget exhausted
- Auto-inject stream_options.include_usage=True for streaming requests
- Add pytest.importorskip("openai") for graceful skip when not installed
- 10 new tests covering all four fixes (73 total pass)
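The stream_options auto-injection fix can be sketched like this. `prepare_kwargs` is a hypothetical helper name; the `stream` / `stream_options.include_usage` parameter names follow the OpenAI Chat Completions API:

```python
# Sketch: auto-inject stream_options.include_usage=True for streaming
# requests so usage metrics arrive in the final chunk (illustrative).
def prepare_kwargs(kwargs: dict) -> dict:
    if kwargs.get("stream"):
        opts = dict(kwargs.get("stream_options") or {})
        # setdefault: respect a value the caller set explicitly
        opts.setdefault("include_usage", True)
        kwargs["stream_options"] = opts
    return kwargs
```

Non-streaming requests pass through untouched, and an explicit caller-provided `include_usage` is preserved rather than overwritten.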
Implements the cascadeflow.integrations.crewai module, which hooks into CrewAI's native llm_hooks system (v1.5+) to feed cost, latency, energy, and step metrics into harness run contexts.
- before_llm_call: budget gate in enforce mode, latency tracking
- after_llm_call: token estimation, cost/energy/step accounting
- enable()/disable() lifecycle with fail_open and budget_gate config
- 37 tests covering hooks, estimation, enable/disable, and edge cases
- Fixed __init__.py import ordering (CREWAI_AVAILABLE before __all__)
- Add crewai extra to pyproject.toml (pip install cascadeflow[crewai])
- Handle dict messages in _extract_message_content (CrewAI passes
{"role": "...", "content": "..."} not objects with .content attr)
- Move budget gate check before start time recording so blocked calls
don't leak entries in _call_start_times
- Fix unused imports (field, TYPE_CHECKING, Callable) and import order
- Fix docstring referencing nonexistent cost_model_override
- Replace yield with return in test fixture (PT022)
- Add 7 new tests: dict/object message extraction, blocked call leak
- Use time.monotonic() for the duration_ms calculation instead of a wall-clock delta (avoids NTP/suspend clock jumps)
- Extract sanitize constants (_MAX_ACTION_LEN, _MAX_REASON_LEN, _MAX_MODEL_LEN)
- Log a warning when record() receives an empty action (was silently defaulting)
- Cache the CallbackEvent import in _emit_harness_decision for hot-path performance
- Add tests: no-callback-manager noop, empty-action warning, duration field
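The monotonic-clock change is a small but meaningful correctness fix: `time.monotonic()` cannot go backwards, so durations survive NTP adjustments and system suspend. A minimal illustration:

```python
# Sketch: measure call duration with a monotonic clock (illustrative).
# time.time() can jump (NTP sync, DST, suspend); time.monotonic() cannot.
import time

start = time.monotonic()
time.sleep(0.01)  # stand-in for the instrumented LLM call
duration_ms = (time.monotonic() - start) * 1000.0
```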
Add 5 new benchmark modules and 15 unit tests that enable third-party reproducibility and automated V2 readiness checks:
- repro.py: environment fingerprint (git SHA, packages, platform)
- baseline.py: save/load baselines, delta comparison, Go/No-Go gates
- harness_overhead.py: decision-path p95 measurement (<5ms gate)
- observe_validation.py: observe-mode zero-change proof (6 cases)
- artifact.py: JSON artifact bundler + REPRODUCE.md generation

Extends run_all.py with --baseline, --harness-mode, and --with-repro flags.
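A minimal sketch of what an environment fingerprint like repro.py's might collect. The field names here are assumptions, and the git lookup degrades gracefully when run outside a repository:

```python
# Sketch: reproducibility fingerprint (git SHA, interpreter, platform).
# Field names are illustrative, not the actual repro.py schema.
import platform
import subprocess
import sys

def fingerprint() -> dict:
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except Exception:
        sha = None  # not a git checkout, or git unavailable
    return {
        "git_sha": sha,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
```

Bundling such a fingerprint next to benchmark results is what lets a third party tell whether a delta comes from the code or from the environment.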
Add CascadeFlowADKPlugin(BasePlugin), which intercepts all LLM calls across ADK Runner agents for budget enforcement, cost/latency/energy tracking, tool call counting, and trace recording.

New files:
- cascadeflow/harness/pricing.py — shared pricing table with Gemini models
- cascadeflow/integrations/google_adk.py — plugin + enable/disable API
- tests/test_google_adk_integration.py — 49 tests
- docs/guides/google_adk_integration.md
- examples/integrations/google_adk_harness.py

Modified:
- cascadeflow/integrations/__init__.py — register the integration
- pyproject.toml — add the google-adk optional extra
- Remove the harness `agent` export from the top-level cascadeflow namespace to avoid shadowing the cascadeflow.agent module (it broke dotted-path patches in test_agent.py and test_agent_p0_tool_loop.py)
- Use an id(callback_context) fallback in the ADK plugin's _callback_key() when invocation_id and agent_name are both empty, preventing state-map collisions under concurrency
- Add 4 tests for the callback-key collision scenario
- Update test_harness_api to import agent from cascadeflow.harness
1. HIGH: off mode is now respected — before/after callbacks return early when ctx.mode == "off", preventing metric tracking in off mode
2. HIGH: versioned Gemini model IDs now resolve correctly — added _resolve_pricing_key() with suffix stripping (-preview-XX-XX, -YYYYMMDD, -latest, -exp-N) and longest-prefix fallback matching
3. MEDIUM: callback-key collision fixed — switched from an (invocation_id, agent_name) tuple to an id(callback_context) int key, guaranteeing uniqueness even for concurrent calls with the same IDs
4. MEDIUM: fail_open tests now patch the correct symbol (cascadeflow.integrations.google_adk.get_current_run instead of cascadeflow.harness.api.get_current_run)
5. MEDIUM: the budget error response no longer leaks spend/limit numbers — the user-facing message is generic, and the exact figures are logged at warning level

Added 13 new tests: off-mode behavior (2), versioned model pricing (7), callback-key collision (4). Total: 62 ADK tests pass. Full suite: 1097 passed, 69 skipped, 0 failures.
Add 29 tests covering the Anthropic Python SDK monkey-patching that was introduced in v2.1. Tests cover usage extraction, tool call counting, sync/async wrapper behavior, budget enforcement in enforce mode, stream passthrough, cost/energy/latency tracking, and init/reset lifecycle.
Port the Python harness decision engine to TypeScript and wire it into the n8n Agent node. It tracks 5 dimensions (cost, latency, energy, tool calls, quality) across every LLM call. Observe mode is on by default; enforce mode stops the agent loop when limits are hit.
- Add nodes/harness/ with pricing (18 models, fuzzy resolution), HarnessRunContext (7-step decision cascade, compliance allowlists, KPI-weighted scoring), and 43 tests
- Replace the hardcoded estimatesPerMillion in CascadeChatModel with the shared harness/pricing.ts (broader model coverage + suffix stripping)
- Add harness UI parameters to the Agent node (mode, budget, tool cap, latency cap, energy cap, compliance, KPI weights)
- Wire pre-call checks and tool-call counting into the agent executor loop
- Add a harness summary to the Agent output JSON
Fix CI Python Code Quality check — these files drifted from Black formatting after recent merges into the integration branch.
Close V2 Go/No-Go gaps:
- Add a harness section to langchain_integration.md documenting HarnessAwareCascadeFlowCallbackHandler and get_harness_callback
- Create the langchain_harness.py example (matches the CrewAI/OpenAI Agents/ADK pattern)
- Create llms.txt at the repo root for LLM-readable project discovery
- Update the V2 workboard: all feature branches merged, Go/No-Go checklist updated
…model fix
- Add _validate_harness_params() to init() and run() — rejects negative budget/tool_calls/latency/energy values and invalid compliance strings
- Add trace rotation (MAX_TRACE_ENTRIES=1000) in both Python and TypeScript to prevent unbounded memory growth in long-running agents
- Add sanitizeNumericParam() in the n8n harness.ts — coerces NaN/Infinity/negative config values to null
- Remove the phantom gpt-5-nano from llms.txt (it is not in any pricing table)
- Document the HarnessRunContext thread-safety limitation in its docstring
- Add 10 new tests covering validation, compliance, and trace rotation
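A Python sketch of the sanitizeNumericParam behaviour described above. The shipped version is TypeScript in the n8n harness.ts; this is an illustrative equivalent, with `None` standing in for null:

```python
# Sketch: collapse invalid numeric config values (NaN, infinity,
# negatives, non-numbers) to None so they never reach the harness gates.
import math

def sanitize_numeric_param(value):
    try:
        v = float(value)
    except (TypeError, ValueError):
        return None
    if math.isnan(v) or math.isinf(v) or v < 0:
        return None
    return v
```

Treating a bad value as "no limit set" rather than raising keeps a misconfigured workflow running in observe mode instead of failing at startup.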
…intlify docs site

Phase 0 — GitHub refresh:
- pyproject.toml: update description, keywords, and classifier to Production/Stable
- __init__.py: replace the emoji docstring with a harness-API-focused one
- llms.txt: expand from 88 to 214 lines (HarnessConfig, pricing, energy, integrations)
- README.md: new H1, comparison table, Harness API section, 6 new feature rows
- docs/README.md: Mintlify banner, add LangChain to the integrations list

Phase 1 — Mintlify docs site (docs-site/):
- docs.json config (palm theme, 5 tabs, full navigation)
- 36 MDX pages: Get Started (4), Harness (8), Integrations (7), API Reference (8), Examples (6), index + changelog + contributing
- Logo assets copied from .github/assets/
Social-style shields.io badges intermittently render as "invalid" due to GitHub API rate limiting; the flat style is more reliable.
Summary
Validation
Python quality gates
JS/TS quality gates
Python test suites
E2E checks (live APIs)
Notes