feat(instrumentation-microsoft-agent-framework): add plugin per execute.md#229
feat(instrumentation-microsoft-agent-framework): add plugin per execute.md#229rangemer333-cell wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a new GenAI instrumentation plugin, opentelemetry-instrumentation-microsoft-agent-framework, to automatically enrich Microsoft Agent Framework (MAF) spans into ARMS GenAI semantic conventions via a custom SpanProcessor, plus an optional ReAct-step monkey patch.
Changes:
- Added
MAFSemanticProcessorto classify/enrich MAF spans (kind/op/framework/rename/provider normalization/TTFT) and aggregate ARMS gauges. - Added optional
react_step_patchto emit STEP spans around ReAct loop LLM calls viaExtendedTelemetryHandler.react_step(). - Added unit tests for config parsing, processor enrichment/metrics behavior, and patch idempotency/coroutine behavior without requiring MAF installed.
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/src/opentelemetry/instrumentation/microsoft_agent_framework/init.py | Instrumentor: enables MAF native telemetry, registers span processor, optionally applies ReAct patch |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/src/opentelemetry/instrumentation/microsoft_agent_framework/span_processor.py | Core enrichment + metrics aggregation logic |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/src/opentelemetry/instrumentation/microsoft_agent_framework/react_step_patch.py | Optional ReAct loop patch emitting STEP spans |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/src/opentelemetry/instrumentation/microsoft_agent_framework/semantic_conventions.py | Semconv constants + MAF→gen_ai rename map + provider normalization map |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/src/opentelemetry/instrumentation/microsoft_agent_framework/config.py | Env var parsing for enabling instrumentation/metrics/react-step/sensitive data/slow threshold |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/src/opentelemetry/instrumentation/microsoft_agent_framework/package.py | Declares instrumented dependency + metrics support |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/src/opentelemetry/instrumentation/microsoft_agent_framework/version.py | Package version |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/pyproject.toml | Packaging metadata and entry-point registration |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/README.rst | Usage + configuration docs |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/tests/test_processor.py | Processor enrichment/metrics tests (incl. MCP + regressions) |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/tests/test_react_step.py | ReAct patch tests and direct handler STEP span shape test |
| instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/tests/test_config.py | Config env var parsing tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self._live_spans: Dict[str, OtelSpan] = {} | ||
| self._span_parents: Dict[str, Optional[str]] = {} | ||
| self._slow_threshold_ns = int(slow_threshold_ms) * 1_000_000 | ||
| self._capture_sensitive = capture_sensitive_data | ||
| self._counters = _Counters() | ||
| self._meter = None | ||
| self._gauges: list[ObservableGauge] = [] | ||
| self._metrics_enabled = metrics_enabled | ||
| if metrics_enabled: |
| c = self._counters | ||
|
|
||
| def _calls_cb(options): | ||
| for (model, kind), count in c.calls_count.items(): | ||
| yield _obs( | ||
| count, | ||
| {"modelName": model or "unknown", "spanKind": kind}, | ||
| ) | ||
|
|
||
| def _duration_cb(options): | ||
| for (model, kind), total in c.calls_duration_ns_sum.items(): | ||
| count = max(c.calls_count.get((model, kind), 0), 1) | ||
| yield _obs( | ||
| total / count / 1e9, | ||
| {"modelName": model or "unknown", "spanKind": kind}, | ||
| ) | ||
|
|
| def _error_cb(options): | ||
| for (model, kind), count in c.calls_error_count.items(): | ||
| yield _obs( | ||
| count, | ||
| {"modelName": model or "unknown", "spanKind": kind}, | ||
| ) | ||
|
|
||
| def _slow_cb(options): | ||
| for (model, kind), count in c.calls_slow_count.items(): | ||
| yield _obs( | ||
| count, | ||
| {"modelName": model or "unknown", "spanKind": kind}, | ||
| ) | ||
|
|
| def _uninstrument(self, **kwargs: Any) -> None: | ||
| if self._react_applied: | ||
| revert_react_step_patch() | ||
| self._react_applied = False | ||
| if self._processor is not None: | ||
| try: | ||
| self._processor.shutdown() | ||
| except Exception as exc: # pragma: no cover - defensive | ||
| logger.debug("processor shutdown error: %s", exc) | ||
| self._processor = None |
| assert asyncio.iscoroutine(coro), ( | ||
| "FIL wrapper must return a coroutine so MAF can `await` it" | ||
| ) | ||
| result = asyncio.get_event_loop().run_until_complete(coro) |
ralf0131
left a comment
There was a problem hiding this comment.
Summary
New opentelemetry-instrumentation-microsoft-agent-framework plugin (+2233 lines, 14 files, 36 tests). Well-structured implementation following the established openai-agents-v2 pattern: MAFSemanticProcessor enriches native MAF OTel spans with ARMS GenAI semantic conventions (span kind, operation name, attribute renaming, provider normalization, TTFT backfill, status, 6 metrics gauges). ReAct step patch is opt-in and cleanly revertible. Test coverage is thorough — classification, MCP detection, provider normalization, TTFT, metrics, and patch apply/revert are all covered.
Overall verdict: solid work for a first-time contribution. No blocking issues found. A few suggestions below for production hardening.
Findings
- [Warning] span_processor.py:684 — Thread-safety of metrics counters (
+=on defaultdict) - [Info] span_processor.py:553 — Potential
_live_spansmemory growth on un-ended spans - [Info] semantic_conventions.py:106 — Provider normalization conflates framework with provider
- [Info] init.py:119 — Private SDK internals for processor ordering
Suggestions
- For the metrics counters, consider a
threading.Lockaround_aggregate_metrics— the overhead is negligible sinceon_endis not a hot path per-span, and it eliminates a class of subtle race conditions in multi-threaded apps. - For
_live_spans, a simple sweep inforce_flushthat removes entries whose span start time is older than e.g. 60s would prevent unbounded growth from orphaned spans. - Consider adding an integration test that verifies the processor works end-to-end with a real (or mock) MAF tracer provider, not just unit tests with mock spans.
First-Time Contributor Note
Welcome and thank you for this high-quality contribution! The code is clean, well-documented, and follows the repository's established patterns. The E2E validation (9/9 examples) is impressive. Please don't let the suggestions above discourage you — they are minor hardening ideas, not blockers.
Automated review by github-manager-bot
sipercai
left a comment
There was a problem hiding this comment.
I checked the existing review threads and avoided duplicating the already-open findings for the non-MAF span guard, _uninstrument() cleanup, metrics locking, and provider normalization. Adding a few additional findings from local pytest/behavior repro, Weaver live-check, and current CI. This is a comment review only, not a request-changes review.
Separate CI note: changelog, generate, precommit, and typecheck are currently failing. generate wants updates to generated bootstrap/docs files, and changelog needs either an entry or the skip label.
| GEN_AI_USAGE_OUTPUT_TOKENS = "gen_ai.usage.output_tokens" | ||
| GEN_AI_REACT_ROUND = "gen_ai.react.round" | ||
| GEN_AI_REACT_FINISH_REASON = "gen_ai.react.finish_reason" | ||
| GEN_AI_FRAMEWORK = "gen_ai.framework" |
There was a problem hiding this comment.
gen_ai.framework is not defined in the current LoongSuite semantic-conventions registry; live-check reports it as a missing attribute and a collision with the existing gen_ai namespace. Either add this extension to the registry/docs first, or use an already registered/non-gen_ai.* attribute key so emitted spans pass the LoongSuite profile.
|
跟进修复已推到 PR head: 本轮主要修复:
本地验证结果:
ARMS 云端读回也尝试过多条路径(ARMS collector、SLS OTLP、现有 service/new service),但这次没有稳定拿到 backend readback,主要卡在 ingestion endpoint/auth/instance 映射上;因此这轮结论以真实 MAF runtime + 本地 OTLP + Weaver live-check 为准。GitHub Actions 已批准,目前仍在 |
ralf0131
left a comment
There was a problem hiding this comment.
Summary
Re-review after 3 new commits (review feedback addressed + gen-ai convention alignment). The previous [Warning] on thread-safety is now resolved — threading.Lock guards all metrics counter operations. The framework/provider conflation is also resolved — gen_ai.framework is set separately from gen_ai.provider.name. Code quality remains high.
CI Status (blocking — must fix before merge)
- changelog ❌ — No CHANGELOG entry. Add one or add the
Skip Changeloglabel. - precommit ❌ —
ruff formatreformats 1 file;rstcheckflags README.rst:45 (title underline too short). See inline comment. - typecheck ❌ — Pyright errors in
vertexai/patch.py(unnecessary# type: ignore). These are pre-existing — not introduced by this PR. Safe to ignore.
Remaining Findings (non-blocking)
- [Info] span_processor.py:826 —
force_flushstill doesn't sweep stale_live_spans. Carried over from previous review; minor hardening suggestion.
First-Time Contributor Note
Great follow-up work addressing the review feedback! The thread-safety fix and framework/provider separation are exactly right. Once the two CI formatting issues (changelog + rstcheck) are resolved, this PR is in good shape. The typecheck failure is in a different package and not your responsibility.
Automated review by github-manager-bot
ralf0131
left a comment
There was a problem hiding this comment.
Summary
Re-review after commit 8c174f7 ("fix(maf): address review comments and ci"). All previously identified findings are resolved — thread-safety, provider normalization, and CI are all addressed. CI is fully green, CLA signed. Approving.
Findings (all resolved ✅)
- [Resolved] Thread-safety —
threading.Lock()now guards all_live_spans/_span_parentsaccess inon_start,on_end,shutdown, and_sweep_stale_live_spans. - [Resolved] Provider normalization —
microsoft.agent_frameworkno longer collapsed toopenai; unknown providers are lower-cased and kept distinct (correct, since MAF routes to multiple underlying providers). - [Resolved] CI failures —
changelog(CHANGELOG.md added),precommit(ruff format + rstcheck fixed),typecheck(pre-existing Pyright errors, not introduced by this PR).
New Functionality
_sweep_stale_live_spans()— good addition; prevents memory leaks from spans started but never ended, bounded by 60s max age. Test coverage intest_force_flush_sweeps_stale_live_spans.
Tests
All tests updated to reflect new provider normalization behavior. New test for stale span sweep. Coverage looks solid.
Automated review by github-manager-bot
…te.md Implements the hybrid "SpanProcessor + optional ReAct step patch" plan documented in llm-dev/microsoft-agent-framework/investigate/execute.md. - MicrosoftAgentFrameworkInstrumentor: enables MAF native OTel layers (force=True) and registers MAFSemanticProcessor. - MAFSemanticProcessor (span_processor.py): injects gen_ai.span.kind, gen_ai.operation.name, renames MAF private-prefix attributes to gen_ai.*, normalizes provider.name (azure_openai -> openai), backfills TTFT from streaming events, sets StatusCode.OK on success, aggregates the 6 ARMS gauges. Reuses opentelemetry.util.genai.utils.gen_ai_json_dumps (aligned with openai-agents-v2/span_processor.py:27) to coerce dict/list attribute values into JSON strings. - react_step_patch.py (opt-in via ARMS_MAF_REACT_STEP_ENABLED): emits one react step span per LLM round-trip inside FunctionInvocationLayer via ExtendedTelemetryHandler.react_step() from opentelemetry-util-genai. - config.py: env switches (master, sensitive data, react step, slow threshold). - tests: 23 passing unit tests covering span classification, metric aggregation, provider normalization, TTFT backfill, dict coercion, and react_step handler behavior.
[M1] MCP span classification: detect mcp.method.name attribute (or SpanKind.CLIENT + mcp.* fallback) in _classify_span and return (CLIENT, MCP). MAF_SPAN_NAME_PREFIXES documents that MCP is detected via attribute rather than prefix (method names are unbounded). [M2] revert_react_step_patch: capture originals BEFORE wrapping (via __wrapped__ unwrap chain), and switch from broken decorator form to wrap_function_wrapper(class, name, wrapper) + @wrapt.decorator. revert now restores the original; apply->revert->apply does not stack wrappers. [L1] _safe_dumps: cap output at 4096 chars (execute.md single-field cap) since gen_ai_json_dumps only serializes. Docstring + module docstring updated to match actual behavior. Tests: +6 (test_mcp_span_classified_as_client, mcp client-kind fallback, non-mcp client negative, _safe_dumps truncation, apply->revert->apply round-trip, _unwrap_to_function). 29 passed.
- P0: react_step_patch wrappers now return coroutines from sync wrappers so MAF's await layer.get_response no longer raises TypeError. ContextVar tokens are set inside the coroutine body so set/reset share the same asyncio task context. - P1: extend op_name override condition to include CLIENT span kind so MCP tools/call inner spans get gen_ai.operation.name=mcp even when MAF pre-wrote execute_tool. - P3: provider.name normalization now handles sequence values (MAF emits list-wrapped values on AGENT spans) and falls back to case-insensitive matching so microsoft.agent_framework -> openai on AGENT spans. - P5: instrument() prepends MAFSemanticProcessor to the SDK processor tuple so on_end enrichments run before exporter processors registered earlier in bootstrap. Adds 7 unit tests; pytest tests/ -> 36 passed.
8c174f7 to
f3fe856
Compare
ralf0131
left a comment
There was a problem hiding this comment.
Summary
Re-review after force-push (commits 383158a → f3fe856). The 3 new commits by @rangemer333-cell move the plugin into the loongsuite matrix and align telemetry with gen-ai semantic conventions. All changes are improvements with no issues found.
Key changes reviewed (commits 4–6)
- Thread-safety (span_processor.py):
_counter_lockand_live_span_lock(threading.Lock) now guard all access to_live_spans,_span_parents, and_counters. Observable gauge callbacks collect into local lists under the lock then yield. Lock ordering is consistent (live_span_lock → counter_lock), no deadlock risk. - Proper uninstrument (init.py): New
_remove_span_processor()removes the semantic processor from the tracer provider on teardown, preventing stale processor leaks. Uses the same defensivegetattrpattern as the prepend path. - Non-MAF span filtering (span_processor.py):
_is_maf_span()guard inon_endskips non-MAF spans — important correctness fix since the processor is prepended to a shared tracer provider. - Stale span sweep (span_processor.py):
force_flush()now cleans orphaned entries from_live_spans/_span_parents, preventing memory leaks from unclosed spans. - Semantic convention alignment: removed
gen_ai.framework/gen_ai.user.time_to_first_tokenattributes; provider normalization preservesmicrosoft.agent_framework(lowercased) instead of collapsing toopenai; successful spans leftUNSET(matching OTel SDK default + Weaver validation); addedgen_ai.response.finish_reasonsJSON-string normalization. - Tests: comprehensive new coverage for thread-safety, force_flush sweep, uninstrument processor removal, non-MAF filtering, finish-reasons normalization, and provider normalization changes.
CI & CLA
- CLA: signed (license/cla ✓)
- CI: green (only Auto-Assign-Owners was pending, now passed)
LGTM. Re-approving after the force-push invalidated the previous approval. Ready to merge.
Automated review by github-manager-bot
Closes #52
Summary
新增
opentelemetry-instrumentation-microsoft-agent-framework插件,为 Microsoft Agent Framework 框架提供自动插桩能力,按execute.md实现方案落地,遵循/home/admin/semantic-conventions/arms_docs/trace/gen-ai.md语义规范。rangemer333-cell:feat/microsoft-agent-framework(HEAD9caf6641)alibaba:main${WORKSPACE_ROOT}/llm-dev/microsoft-agent-framework/investigate/execute.md(WORKSPACE_ROOT=/apsara/loongsuite-plugin-microsoft-agent-framework.nPqhpw)${WORKSPACE_ROOT}/validation/microsoft_agent_framework_verification_report.mdvalidation/spans_v5.json(49 spans)validation/logs_v5/exNN.log改动范围
新增插件目录
instrumentation-genai/opentelemetry-instrumentation-microsoft-agent-framework/,共 14 文件 / +2233 行:__init__.py— instrument 入口、processor 前置注册span_processor.py—MAFSemanticProcessor,复用opentelemetry.util.genai.utils的截断/PII/gen_ai_json_dumps,对齐openai-agents-v2/span_processor.pyreact_step_patch.py— ReAct step span 走util/opentelemetry-util-genai的ExtendedTelemetryHandler.react_step()semantic_conventions.py/config.py/package.py/version.py/README.rst/pyproject.tomltests/— 36 单测单测
pytest tests/→ 36 passed(原 29 + 本轮新增 7 覆盖 P0/P1/P3 修复路径)。E2E 观测结论(9/9 example 符合)
namespace
loongsuite-maf-e2e,9 个 Deploymentmaf-example-01..09,镜像acos-demo-registry.cn-hangzhou.cr.aliyuncs.com/private-mesh/maf-e2e:plugin-9caf6641。tools/callinner CLIENT spanop=mcp+mcp.method.name=tools/call+gen_ai.tool.name;AGENTprovider=openai归一Review 修复摘要(P0/P1/P3/P5)
react_step_patch.py:91-167:_fil_wrapper/_chat_wrapper改为同步函数返回_scoped()/_step_scoped()coroutine,修复 MAF_agents.py:964await layer.get_response(...)TypeError;ContextVar token 的 set/reset 移入 coroutine body,修复跨 taskValueError。span_processor.py:589-599:gen_ai.operation.name覆盖条件由{TASK, AGENT}扩为{TASK, AGENT, CLIENT},覆盖 MCPtools/callinner span。span_processor.py:218-241:_normalize_provider三步兜底(list/tuple 取首元素 → 精确匹配PROVIDER_NAME_NORMALIZE→.lower()),MAF 写入的microsoft.agent_framework归一为openai。__init__.py:_instrument:add_span_processor之后将MAFSemanticProcessor前置到_active_span_processor._span_processors首位,异常静默降级。Known Limitations
gen_ai.react.round恒为1(ex07 ReAct step span)gen_ai.react.round均为1,未随轮次递增(预期 1/2/3)。execute.md走ExtendedTelemetryHandler.react_step()正确产出 STEP span,仅 round 字段无法填充真实值。validation/spans_v5.jsonex07 三个 STEP span;console 日志validation/logs_v5/ex07.log。semantic-conventions规范 owner 给出 fallback 方案后再补齐。Test Plan
pytest tests/→ 36 passed🤖 Generated with Claude Code