
feat(integrations): add Google ADK harness plugin#165

Closed
saschabuehrle wants to merge 15 commits into `main` from `feat/v2-google-adk-integration`

Conversation


@saschabuehrle commented Mar 4, 2026

Summary

  • Add CascadeFlowADKPlugin(BasePlugin) integration for Google ADK (Agent Development Kit) — intercepts all LLM calls across all agents in a Runner for budget enforcement, cost/latency/energy tracking, tool call counting, and trace recording
  • Add shared cascadeflow/harness/pricing.py with Gemini model pricing (2.5-flash, 2.5-pro, 2.0-flash, 1.5-flash, 1.5-pro) alongside existing OpenAI/Anthropic models
  • Fix import regression: remove harness agent from top-level cascadeflow namespace to avoid shadowing the cascadeflow.agent module
  • Fix callback-key collision: use id(callback_context) fallback when invocation_id and agent_name are both empty

New files

| File | Description |
| --- | --- |
| `cascadeflow/integrations/google_adk.py` | Plugin with `before_model_callback` (budget gate), `after_model_callback` (metrics), `on_model_error_callback` (trace), `enable()`/`disable()` API |
| `cascadeflow/harness/pricing.py` | Shared pricing table, `estimate_cost()`, `estimate_energy()`, energy coefficients |
| `tests/test_google_adk_integration.py` | 53 tests — fake ADK types, callback lifecycle, budget enforcement, collision safety |
| `docs/guides/google_adk_integration.md` | Integration guide with quickstart |
| `examples/integrations/google_adk_harness.py` | Runnable example |

Modified files

| File | Change |
| --- | --- |
| `cascadeflow/integrations/__init__.py` | Register Google ADK in capabilities + exports |
| `cascadeflow/__init__.py` | Add harness exports; remove `agent` re-export to fix module shadowing |
| `pyproject.toml` | Add `google-adk` optional extra (`python_version >= '3.10'`) |
| `tests/test_harness_api.py` | Update export test to import `agent` from submodule |

Design decisions

  • Plugin-per-Runner — Unlike CrewAI (global hooks), ADK plugins are registered per-Runner. enable() returns the instance; user passes it to Runner(plugins=[plugin]).
  • No tool gating — ADK's tools_dict is part of agent definition, not per-call. Budget gate via before_model_callback is sufficient.
  • Stream passthrough — Consistent with current Anthropic integration approach.
  • Conditional import — `find_spec("google.adk")` guard; the module works without google-adk installed.
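The conditional-import guard can be sketched with `importlib.util.find_spec`. Checking the parent `google` package first avoids a `ModuleNotFoundError` when it is absent; the `google.adk.plugins` import path is an assumption for illustration:

```python
from importlib.util import find_spec

# Guard against google (or google.adk) not being installed at all.
ADK_AVAILABLE = (
    find_spec("google") is not None and find_spec("google.adk") is not None
)

if ADK_AVAILABLE:
    from google.adk.plugins import BasePlugin  # import path is an assumption
else:
    class BasePlugin:  # minimal stand-in so the module still imports
        """Placeholder used when google-adk is not installed."""
```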

Test plan

  • pytest tests/test_google_adk_integration.py -v — 53 passed
  • pytest tests/test_agent.py tests/test_agent_p0_tool_loop.py -v — 23 passed, 6 skipped (regression fixed)
  • pytest tests/test_harness_api.py -v — all passed
  • Full suite: pytest tests/ --ignore=tests/_archive_development_tests -x — 1088 passed, 69 skipped, 0 failures
  • ruff check — clean

saschabuehrle added a commit that referenced this pull request Mar 5, 2026
* Add core harness API scaffold with context-scoped runtime

* Harden harness core scaffolding and complete API test coverage

* feat(harness): implement OpenAI Python client auto-instrumentation

Replace the instrument.py scaffold with a full implementation that patches
openai.resources.chat.completions.Completions.create (sync) and
AsyncCompletions.create (async) for harness observe/enforce modes.

Key capabilities:
- Class-level patching of sync and async create methods
- Streaming wrappers (_InstrumentedStream, _InstrumentedAsyncStream)
  that capture usage metrics after all chunks are consumed
- Cost estimation from a built-in pricing table
- Energy estimation using deterministic model coefficients
- Tool call counting in both response and streaming chunks
- Budget remaining tracking within scoped runs
- Idempotent patching with clean unpatch/reset path

Context tracking per call:
- cost, step_count, latency_used_ms, energy_used, tool_calls
- budget_remaining auto-updated when budget_max is set
- model_used and decision trace via ctx.record()

Added step_count, latency_used_ms, energy_used fields to
HarnessRunContext in api.py. Hooked patch_openai into init()
and unpatch_openai into reset().

39 new tests covering: patch lifecycle, sync/async wrappers,
sync/async stream wrappers, cost/energy estimation, nested run
isolation, and edge cases (no usage, no choices, missing chunks).

All 63 harness tests pass (39 instrument + 24 api).

* fix: address PR review — off-mode unpatch, enforce budget gate, stream usage injection

- init(mode="off") now calls unpatch_openai() if previously patched
- Trace records actual mode (observe/enforce) instead of always "observe"
- Enforce mode raises BudgetExceededError pre-call when budget exhausted
- Auto-inject stream_options.include_usage=True for streaming requests
- Add pytest.importorskip("openai") for graceful skip when not installed
- 10 new tests covering all four fixes (73 total pass)

* Add OpenAI Agents SDK harness integration (opt-in)

* fix(openai-agents): align SDK interface and enforce-safe errors

* Add CrewAI harness integration with before/after LLM-call hooks

Implements cascadeflow.integrations.crewai module that hooks into
CrewAI's native llm_hooks system (v1.5+) to feed cost, latency,
energy, and step metrics into harness run contexts.

- before_llm_call: budget gate in enforce mode, latency tracking
- after_llm_call: token estimation, cost/energy/step accounting
- enable()/disable() lifecycle with fail_open and budget_gate config
- 37 tests covering hooks, estimation, enable/disable, and edge cases
- Fixed __init__.py import ordering (CREWAI_AVAILABLE before __all__)

* fix: address PR review — dict messages, start time leak, lint, extras

- Add crewai extra to pyproject.toml (pip install cascadeflow[crewai])
- Handle dict messages in _extract_message_content (CrewAI passes
  {"role": "...", "content": "..."} not objects with .content attr)
- Move budget gate check before start time recording so blocked calls
  don't leak entries in _call_start_times
- Fix unused imports (field, TYPE_CHECKING, Callable) and import order
- Fix docstring referencing nonexistent cost_model_override
- Replace yield with return in test fixture (PT022)
- Add 7 new tests: dict/object message extraction, blocked call leak

* docs(plan): claim v2 enforce-actions feature branch

* feat(harness): enforce switch-model, deny-tool, and stop actions

* feat(harness): implement enforce actions for v2 harness

* fix(harness): clarify observe traces and hard-stop semantics

* perf(harness): optimize model utility hot paths

* refactor(harness): unify pricing profiles across integrations

* docs(plan): claim langchain harness extension branch

* feat(harness): add privacy-safe decision telemetry and callback hooks

* fix(harness): address telemetry review findings

- Use time.monotonic() for duration_ms calculation instead of wall-clock
  delta (avoids NTP/suspend clock jumps)
- Extract sanitize constants (_MAX_ACTION_LEN, _MAX_REASON_LEN, _MAX_MODEL_LEN)
- Log warning when record() receives empty action (was silently defaulting)
- Cache CallbackEvent import in _emit_harness_decision for hot-path perf
- Add tests: no-callback-manager noop, empty-action warning, duration field
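The monotonic-clock fix looks roughly like the sketch below (the wrapper is illustrative; the real code times the LLM call inline):

```python
import time

def timed_call(fn, *args, **kwargs):
    """Measure duration with time.monotonic() so NTP adjustments or
    system suspend cannot produce negative or inflated durations."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    duration_ms = (time.monotonic() - start) * 1000.0
    return result, duration_ms
```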

* fix(harness): avoid shadowing cascadeflow.agent module

* style: apply black formatting for harness integration files

* feat(langchain): add harness-aware callback and state extractor

* feat(langchain): auto-attach harness callback in active run scopes

* docs(plan): mark langchain harness extension branch completed

* fix(langchain): address PR #161 review findings

- Document enforce-mode limitations for switch_model and deny_tool
- Replace per-handler _executed_tool_calls with run_ctx.tool_calls
- Fix _extract_candidate_state fallback leaking arbitrary kwargs
- Remove return-in-finally (B012) and fix import ordering
- Separate langgraph from langchain optional extra in pyproject.toml
- Add 4 edge-case tests: no-run-context safety, state extraction
  guard, and run_ctx tool_calls gating

* fix(langchain): enforce tool caps on executed calls and harden tool extraction

* fix(harness): avoid shadowing cascadeflow.agent module

* feat(bench): add reproducibility pipeline for V2 Go/No-Go validation

Add 5 new benchmark modules and 15 unit tests that enable third-party
reproducibility and automated V2 readiness checks:

- repro.py: environment fingerprint (git SHA, packages, platform)
- baseline.py: save/load baselines, delta comparison, Go/No-Go gates
- harness_overhead.py: decision-path p95 measurement (<5ms gate)
- observe_validation.py: observe-mode zero-change proof (6 cases)
- artifact.py: JSON artifact bundler + REPRODUCE.md generation

Extends run_all.py with --baseline, --harness-mode, --with-repro flags.
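An environment fingerprint like the one repro.py captures can be assembled from the standard library; the field names here are illustrative, not repro.py's actual schema:

```python
import platform
import subprocess
import sys

def environment_fingerprint() -> dict:
    """Capture enough environment detail to reproduce a benchmark run."""
    try:
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        git_sha = None  # not inside a git checkout
    return {
        "git_sha": git_sha,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
```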

* docs(plan): update workboard — bench-repro-pipeline PR #163 in review

* style(bench): apply linter formatting to repro pipeline files

* style(langchain): finalize harness callback typing and formatting

* feat(integrations): add Google ADK harness plugin

Add CascadeFlowADKPlugin(BasePlugin) that intercepts all LLM calls
across ADK Runner agents for budget enforcement, cost/latency/energy
tracking, tool call counting, and trace recording.

New files:
- cascadeflow/harness/pricing.py — shared pricing table with Gemini models
- cascadeflow/integrations/google_adk.py — plugin + enable/disable API
- tests/test_google_adk_integration.py — 49 tests
- docs/guides/google_adk_integration.md
- examples/integrations/google_adk_harness.py

Modified:
- cascadeflow/integrations/__init__.py — register integration
- pyproject.toml — add google-adk optional extra

* fix: resolve import regression and callback-key collision

- Remove harness `agent` from top-level cascadeflow namespace to avoid
  shadowing the cascadeflow.agent module (breaks dotted-path patches in
  test_agent.py and test_agent_p0_tool_loop.py)
- Use id(callback_context) fallback in ADK plugin _callback_key() when
  invocation_id and agent_name are both empty, preventing state map
  collisions under concurrency
- Add 4 tests for callback-key collision scenario
- Update test_harness_api to import agent from cascadeflow.harness

* fix: address PR #165 review — 5 findings resolved

1. HIGH: off mode now respected — before/after callbacks return early
   when ctx.mode == "off", preventing metric tracking in off mode

2. HIGH: versioned Gemini model IDs now resolve correctly — added
   _resolve_pricing_key() with suffix stripping (-preview-XX-XX,
   -YYYYMMDD, -latest, -exp-N) and longest-prefix fallback matching

3. MEDIUM: callback key collision fixed — switched from
   (invocation_id, agent_name) tuple to id(callback_context) int key,
   guaranteeing uniqueness even for concurrent calls with same IDs

4. MEDIUM: fail_open tests now patch the correct symbol
   (cascadeflow.integrations.google_adk.get_current_run instead of
   cascadeflow.harness.api.get_current_run)

5. MEDIUM: budget error response no longer leaks spend/limit numbers —
   user-facing message is generic, exact figures logged at warning level

Added 13 new tests: off-mode behavior (2), versioned model pricing (7),
callback key collision (4). Total: 62 ADK tests pass.
Full suite: 1097 passed, 69 skipped, 0 failures.

* feat(harness): add anthropic python auto-instrumentation for v2.1

* feat(core): deliver v2.1 ts harness parity and sdk auto-instrumentation

* test(harness): add comprehensive Anthropic auto-instrumentation tests

Add 29 tests covering the Anthropic Python SDK monkey-patching that was
introduced in v2.1. Tests cover usage extraction, tool call counting,
sync/async wrapper behavior, budget enforcement in enforce mode, stream
passthrough, cost/energy/latency tracking, and init/reset lifecycle.

* feat(harness): instrument Anthropic streaming usage and tool calls

* fix(harness): finalize stream metrics on errors and harden env parsing

* docs: add harness quickstart and missing integration coverage

* feat(n8n): add multi-dimensional harness integration to Agent node

Port the Python harness decision engine to TypeScript and wire it into
the n8n Agent node. Tracks 5 dimensions (cost, latency, energy, tool
calls, quality) across every LLM call. Observe mode is on by default;
enforce mode stops the agent loop when limits are hit.

- Add nodes/harness/ with pricing (18 models, fuzzy resolution),
  HarnessRunContext (7-step decision cascade, compliance allowlists,
  KPI-weighted scoring), and 43 tests
- Replace hardcoded estimatesPerMillion in CascadeChatModel with shared
  harness/pricing.ts (broader model coverage + suffix stripping)
- Add harness UI parameters to Agent node (mode, budget, tool cap,
  latency cap, energy cap, compliance, KPI weights)
- Wire pre-call checks and tool-call counting into agent executor loop
- Add harness summary to Agent output JSON

* fix(google-adk): initialize plugin name and stabilize callback correlation

* chore(dx): clarify integration prerequisites and add optional integration CI

* style: apply Black formatting to 7 Python files

Fix CI Python Code Quality check — these files drifted from Black
formatting after recent merges into the integration branch.

* chore(ci/docs): enforce integration matrix across python versions

* style: fix ruff I001 import sorting in google_adk_harness example

* feat(benchmarks): add baseline and savings metrics to agentic tool benchmark

* feat(dx): add LangChain harness docs, harness example, and llms.txt

Close V2 Go/No-Go gaps:
- Add harness section to langchain_integration.md documenting
  HarnessAwareCascadeFlowCallbackHandler and get_harness_callback
- Create langchain_harness.py example (matches CrewAI/OpenAI Agents/ADK pattern)
- Create llms.txt at repo root for LLM-readable project discovery
- Update V2 workboard: all feature branches merged, Go/No-Go checklist updated

* harden harness: input validation, trace rotation, NaN guard, phantom model fix

- Add _validate_harness_params() to init() and run() — rejects negative
  budget/tool_calls/latency/energy and invalid compliance strings
- Add trace rotation (MAX_TRACE_ENTRIES=1000) in both Python and TypeScript
  to prevent unbounded memory growth in long-running agents
- Add sanitizeNumericParam() in n8n harness.ts — coerces NaN/Infinity/negative
  config values to null
- Remove phantom gpt-5-nano from llms.txt (not in any pricing table)
- Document HarnessRunContext thread-safety limitation in docstring
- Add 10 new tests covering validation, compliance, and trace rotation

* docs: reframe positioning as agent runtime intelligence layer + add Mintlify docs site

Phase 0 — GitHub refresh:
- pyproject.toml: update description, keywords, classifier to Production/Stable
- __init__.py: replace emoji docstring with harness API focus
- llms.txt: expand from 88 to 214 lines (HarnessConfig, pricing, energy, integrations)
- README.md: new H1, comparison table, Harness API section, 6 new feature rows
- docs/README.md: Mintlify banner, add LangChain to integrations list

Phase 1 — Mintlify docs site (docs-site/):
- docs.json config (palm theme, 5 tabs, full navigation)
- 36 MDX pages: Get Started (4), Harness (8), Integrations (7),
  API Reference (8), Examples (6), index + changelog + contributing
- Logo assets copied from .github/assets/

* fix: switch GitHub Stars badge from social to flat style

Social-style shields.io badges intermittently render as "invalid"
due to GitHub API rate limiting. Flat style is more reliable.
@saschabuehrle (Author) commented:

Superseded by #164 — all commits included in the integration train.

