feat: merge agent-intelligence v2 integration train into main #164
Merged
saschabuehrle merged 49 commits into main on Mar 5, 2026
Conversation
Replace the instrument.py scaffold with a full implementation that patches openai.resources.chat.completions.Completions.create (sync) and AsyncCompletions.create (async) for harness observe/enforce modes.

Key capabilities:
- Class-level patching of the sync and async create methods
- Streaming wrappers (_InstrumentedStream, _InstrumentedAsyncStream) that capture usage metrics after all chunks are consumed
- Cost estimation from a built-in pricing table
- Energy estimation using deterministic model coefficients
- Tool call counting in both responses and streaming chunks
- Budget-remaining tracking within scoped runs
- Idempotent patching with a clean unpatch/reset path

Context tracking per call:
- cost, step_count, latency_used_ms, energy_used, tool_calls
- budget_remaining auto-updated when budget_max is set
- model_used and decision trace via ctx.record()

Added step_count, latency_used_ms, energy_used fields to HarnessRunContext in api.py. Hooked patch_openai into init() and unpatch_openai into reset().

39 new tests covering: patch lifecycle, sync/async wrappers, sync/async stream wrappers, cost/energy estimation, nested run isolation, and edge cases (no usage, no choices, missing chunks). All 63 harness tests pass (39 instrument + 24 api).
…m usage injection
- init(mode="off") now calls unpatch_openai() if previously patched
- Trace records actual mode (observe/enforce) instead of always "observe"
- Enforce mode raises BudgetExceededError pre-call when budget exhausted
- Auto-inject stream_options.include_usage=True for streaming requests
- Add pytest.importorskip("openai") for graceful skip when not installed
- 10 new tests covering all four fixes (73 total pass)
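The stream_options auto-injection fix can be sketched like this. `prepare_kwargs` is a hypothetical helper name; the `stream` / `stream_options.include_usage` parameter names follow the OpenAI Chat Completions API:

```python
# Sketch: auto-inject stream_options.include_usage=True for streaming
# requests so usage metrics arrive in the final chunk (illustrative).
def prepare_kwargs(kwargs: dict) -> dict:
    if kwargs.get("stream"):
        opts = dict(kwargs.get("stream_options") or {})
        # setdefault: respect a value the caller set explicitly
        opts.setdefault("include_usage", True)
        kwargs["stream_options"] = opts
    return kwargs
```

Non-streaming requests pass through untouched, and an explicit caller-provided `include_usage` is preserved rather than overwritten.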
Implements the cascadeflow.integrations.crewai module, which hooks into CrewAI's native llm_hooks system (v1.5+) to feed cost, latency, energy, and step metrics into harness run contexts.
- before_llm_call: budget gate in enforce mode, latency tracking
- after_llm_call: token estimation, cost/energy/step accounting
- enable()/disable() lifecycle with fail_open and budget_gate config
- 37 tests covering hooks, estimation, enable/disable, and edge cases
- Fixed __init__.py import ordering (CREWAI_AVAILABLE before __all__)
- Add crewai extra to pyproject.toml (pip install cascadeflow[crewai])
- Handle dict messages in _extract_message_content (CrewAI passes
{"role": "...", "content": "..."} not objects with .content attr)
- Move budget gate check before start time recording so blocked calls
don't leak entries in _call_start_times
- Fix unused imports (field, TYPE_CHECKING, Callable) and import order
- Fix docstring referencing nonexistent cost_model_override
- Replace yield with return in test fixture (PT022)
- Add 7 new tests: dict/object message extraction, blocked call leak
- Use time.monotonic() for the duration_ms calculation instead of a wall-clock delta (avoids NTP/suspend clock jumps)
- Extract sanitize constants (_MAX_ACTION_LEN, _MAX_REASON_LEN, _MAX_MODEL_LEN)
- Log a warning when record() receives an empty action (was silently defaulting)
- Cache the CallbackEvent import in _emit_harness_decision for hot-path performance
- Add tests: no-callback-manager noop, empty-action warning, duration field
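The monotonic-clock change is a small but meaningful correctness fix: `time.monotonic()` cannot go backwards, so durations survive NTP adjustments and system suspend. A minimal illustration:

```python
# Sketch: measure call duration with a monotonic clock (illustrative).
# time.time() can jump (NTP sync, DST, suspend); time.monotonic() cannot.
import time

start = time.monotonic()
time.sleep(0.01)  # stand-in for the instrumented LLM call
duration_ms = (time.monotonic() - start) * 1000.0
```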
Add 5 new benchmark modules and 15 unit tests that enable third-party reproducibility and automated V2 readiness checks:
- repro.py: environment fingerprint (git SHA, packages, platform)
- baseline.py: save/load baselines, delta comparison, Go/No-Go gates
- harness_overhead.py: decision-path p95 measurement (<5ms gate)
- observe_validation.py: observe-mode zero-change proof (6 cases)
- artifact.py: JSON artifact bundler + REPRODUCE.md generation

Extends run_all.py with --baseline, --harness-mode, and --with-repro flags.
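A minimal sketch of what an environment fingerprint like repro.py's might collect. The field names here are assumptions, and the git lookup degrades gracefully when run outside a repository:

```python
# Sketch: reproducibility fingerprint (git SHA, interpreter, platform).
# Field names are illustrative, not the actual repro.py schema.
import platform
import subprocess
import sys

def fingerprint() -> dict:
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except Exception:
        sha = None  # not a git checkout, or git unavailable
    return {
        "git_sha": sha,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
```

Bundling such a fingerprint next to benchmark results is what lets a third party tell whether a delta comes from the code or from the environment.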
Add CascadeFlowADKPlugin(BasePlugin), which intercepts all LLM calls across ADK Runner agents for budget enforcement, cost/latency/energy tracking, tool call counting, and trace recording.

New files:
- cascadeflow/harness/pricing.py — shared pricing table with Gemini models
- cascadeflow/integrations/google_adk.py — plugin + enable/disable API
- tests/test_google_adk_integration.py — 49 tests
- docs/guides/google_adk_integration.md
- examples/integrations/google_adk_harness.py

Modified:
- cascadeflow/integrations/__init__.py — register the integration
- pyproject.toml — add the google-adk optional extra
- Remove the harness `agent` export from the top-level cascadeflow namespace to avoid shadowing the cascadeflow.agent module (it broke dotted-path patches in test_agent.py and test_agent_p0_tool_loop.py)
- Use an id(callback_context) fallback in the ADK plugin's _callback_key() when invocation_id and agent_name are both empty, preventing state-map collisions under concurrency
- Add 4 tests for the callback-key collision scenario
- Update test_harness_api to import agent from cascadeflow.harness
1. HIGH: off mode is now respected — before/after callbacks return early when ctx.mode == "off", preventing metric tracking in off mode
2. HIGH: versioned Gemini model IDs now resolve correctly — added _resolve_pricing_key() with suffix stripping (-preview-XX-XX, -YYYYMMDD, -latest, -exp-N) and longest-prefix fallback matching
3. MEDIUM: callback-key collision fixed — switched from an (invocation_id, agent_name) tuple to an id(callback_context) int key, guaranteeing uniqueness even for concurrent calls with the same IDs
4. MEDIUM: fail_open tests now patch the correct symbol (cascadeflow.integrations.google_adk.get_current_run instead of cascadeflow.harness.api.get_current_run)
5. MEDIUM: the budget error response no longer leaks spend/limit numbers — the user-facing message is generic, and the exact figures are logged at warning level

Added 13 new tests: off-mode behavior (2), versioned model pricing (7), callback-key collision (4). Total: 62 ADK tests pass. Full suite: 1097 passed, 69 skipped, 0 failures.
Add 29 tests covering the Anthropic Python SDK monkey-patching that was introduced in v2.1. Tests cover usage extraction, tool call counting, sync/async wrapper behavior, budget enforcement in enforce mode, stream passthrough, cost/energy/latency tracking, and init/reset lifecycle.
Port the Python harness decision engine to TypeScript and wire it into the n8n Agent node. It tracks 5 dimensions (cost, latency, energy, tool calls, quality) across every LLM call. Observe mode is on by default; enforce mode stops the agent loop when limits are hit.
- Add nodes/harness/ with pricing (18 models, fuzzy resolution), HarnessRunContext (7-step decision cascade, compliance allowlists, KPI-weighted scoring), and 43 tests
- Replace the hardcoded estimatesPerMillion in CascadeChatModel with the shared harness/pricing.ts (broader model coverage + suffix stripping)
- Add harness UI parameters to the Agent node (mode, budget, tool cap, latency cap, energy cap, compliance, KPI weights)
- Wire pre-call checks and tool-call counting into the agent executor loop
- Add a harness summary to the Agent output JSON
Fix CI Python Code Quality check — these files drifted from Black formatting after recent merges into the integration branch.
Close V2 Go/No-Go gaps:
- Add a harness section to langchain_integration.md documenting HarnessAwareCascadeFlowCallbackHandler and get_harness_callback
- Create the langchain_harness.py example (matches the CrewAI/OpenAI Agents/ADK pattern)
- Create llms.txt at the repo root for LLM-readable project discovery
- Update the V2 workboard: all feature branches merged, Go/No-Go checklist updated
…model fix
- Add _validate_harness_params() to init() and run() — rejects negative budget/tool_calls/latency/energy values and invalid compliance strings
- Add trace rotation (MAX_TRACE_ENTRIES=1000) in both Python and TypeScript to prevent unbounded memory growth in long-running agents
- Add sanitizeNumericParam() in the n8n harness.ts — coerces NaN/Infinity/negative config values to null
- Remove the phantom gpt-5-nano from llms.txt (it is not in any pricing table)
- Document the HarnessRunContext thread-safety limitation in its docstring
- Add 10 new tests covering validation, compliance, and trace rotation
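A Python sketch of the sanitizeNumericParam behaviour described above. The shipped version is TypeScript in the n8n harness.ts; this is an illustrative equivalent, with `None` standing in for null:

```python
# Sketch: collapse invalid numeric config values (NaN, infinity,
# negatives, non-numbers) to None so they never reach the harness gates.
import math

def sanitize_numeric_param(value):
    try:
        v = float(value)
    except (TypeError, ValueError):
        return None
    if math.isnan(v) or math.isinf(v) or v < 0:
        return None
    return v
```

Treating a bad value as "no limit set" rather than raising keeps a misconfigured workflow running in observe mode instead of failing at startup.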
…intlify docs site

Phase 0 — GitHub refresh:
- pyproject.toml: update description, keywords, and classifier to Production/Stable
- __init__.py: replace the emoji docstring with a harness-API-focused one
- llms.txt: expand from 88 to 214 lines (HarnessConfig, pricing, energy, integrations)
- README.md: new H1, comparison table, Harness API section, 6 new feature rows
- docs/README.md: Mintlify banner, add LangChain to the integrations list

Phase 1 — Mintlify docs site (docs-site/):
- docs.json config (palm theme, 5 tabs, full navigation)
- 36 MDX pages: Get Started (4), Harness (8), Integrations (7), API Reference (8), Examples (6), index + changelog + contributing
- Logo assets copied from .github/assets/
Social-style shields.io badges intermittently render as "invalid" due to GitHub API rate limiting; the flat style is more reliable.
Summary
Validation
Python quality gates
JS/TS quality gates
Python test suites
E2E checks (live APIs)
Notes