feat(observability): config-gated Prometheus + OTLP exporters (MCP-32) #746
Wire the existing observability scaffolding into the running daemon and make
it operator-configurable. Both exporters are OFF by default.
- config: add observability.metrics / observability.tracing blocks
(enabled, protocol http|grpc, endpoint, sample_rate) with validation/repair.
- server: build the observability config from file config, pass the manager to
the HTTP API (so /metrics is actually served) and to the MCP proxy; close the
tracer provider on shutdown. Previously the manager was created but the API
received nil, so /metrics and tool-call metrics were dead code.
- tracing: support both OTLP/HTTP and OTLP/gRPC transports; attach optional
resource attributes.
- tool calls: record latency/outcome metrics and open an OTLP span wrapping the
upstream hop, inline at the call site (no-op when disabled).
- metrics: add mcpproxy_quarantine_events_total{scope,action}.
- bridge: project runtime events (server stats, quarantine changes) onto
Prometheus gauges/counters, decoupled from business logic.
- server edition: annotate tool-call spans with user_id + profile via a
build-tagged seam (span attributes, not metric labels, to bound cardinality).
- docs: "Observability for mcpproxy" page (scrape config, metric reference,
OTLP setup) + reference Grafana dashboard in contrib/grafana/.
- regenerate OpenAPI spec for the new config fields.
Related MCP-32
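The validation/repair step for the new config block could look roughly like the sketch below. The struct name, JSON tags, and repair defaults (fall back to "http", clamp sample_rate into [0,1]) are assumptions made for illustration; only the field names come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TracingConfig mirrors the fields listed in the commit message.
// The type name and JSON tags are assumptions, not the project's actual code.
type TracingConfig struct {
	Enabled    bool    `json:"enabled"` // zero value is false, so tracing is OFF by default
	Protocol   string  `json:"protocol"`
	Endpoint   string  `json:"endpoint"`
	SampleRate float64 `json:"sample_rate"`
}

// Repair normalizes invalid values in place and returns a description of
// every change it made. The chosen defaults are hypothetical.
func (c *TracingConfig) Repair() []string {
	var fixes []string
	if c.Protocol != "http" && c.Protocol != "grpc" {
		fixes = append(fixes, fmt.Sprintf("protocol %q replaced with \"http\"", c.Protocol))
		c.Protocol = "http"
	}
	if c.SampleRate < 0 || c.SampleRate > 1 {
		fixes = append(fixes, fmt.Sprintf("sample_rate %v clamped to [0,1]", c.SampleRate))
		if c.SampleRate < 0 {
			c.SampleRate = 0
		} else {
			c.SampleRate = 1
		}
	}
	return fixes
}

func main() {
	raw := []byte(`{"enabled": true, "protocol": "udp", "endpoint": "localhost:4318", "sample_rate": 1.5}`)
	var cfg TracingConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	for _, fix := range cfg.Repair() {
		fmt.Println("repaired:", fix)
	}
	fmt.Println(cfg.Protocol, cfg.SampleRate)
}
```

Repairing in place rather than rejecting the file keeps a typo in an optional block from stopping the daemon, at the cost of silently changing operator intent unless the returned fixes are logged.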
Deploying mcpproxy-docs
| Latest commit: | 78c8d4f |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://91d5a66b.mcpproxy-docs.pages.dev |
| Branch Preview URL: | https://feat-mcp-32-observability-ex.mcpproxy-docs.pages.dev |
📦 Build Artifacts
Download with the GitHub CLI: gh run download 27955262300 --repo smart-mcp-proxy/mcpproxy-go
CodexReviewer: changes requested. Substantive: observability health manager is built for every server and replaces the readiness handler even when observability is disabled (server.go:138 + httpapi/server.go:563) → |
…ed (MCP-32)
CodexReviewer finding: passing a non-nil observability manager made the HTTP
API skip its controller-backed health block entirely, so /readyz (and the
/livez, /health aliases) either 404'd or fell back to the observability health
manager's vacuous always-ready handler. A config-gated feature must be a no-op
for readiness when disabled.
- httpapi: register /metrics independently of the health endpoints; only let
the observability health manager own /healthz + /readyz when it is actually
enabled, otherwise keep the controller-backed handlers authoritative.
- server: disable the observability health manager in buildObservabilityConfig
(out of scope for MCP-32; its readiness is vacuous) so controller readiness
stays authoritative even when metrics are enabled.
- tests: regression test that with metrics ON / health OFF, /readyz reflects
controller readiness (503 when not ready) and the liveness aliases + /metrics
are all served; assert buildObservabilityConfig keeps health off.
Related MCP-32
Co-Authored-By: Paperclip <noreply@paperclip.ing>
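The gating rule in this fix can be illustrated with a small stdlib-only sketch. Everything here is hypothetical (the routes function, its flags, the statusOf helper); the real code uses a chi router and the project's own controller and health manager. It only shows the shape: /metrics is registered on its own flag, and the always-ready handler takes over /readyz only when the health manager is enabled.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// routes builds a mux where /metrics and /readyz are gated independently.
// All names are illustrative stand-ins, not the project's API.
func routes(metricsOn, healthOn bool, controllerReady func() bool) *http.ServeMux {
	mux := http.NewServeMux()
	if metricsOn {
		mux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprintln(w, "mcpproxy_uptime_seconds 1")
		})
	}
	if healthOn {
		// Stand-in for the observability health manager: always ready.
		mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusOK)
		})
	} else {
		// Controller-backed readiness stays authoritative.
		mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
			if !controllerReady() {
				http.Error(w, "not ready", http.StatusServiceUnavailable)
				return
			}
			w.WriteHeader(http.StatusOK)
		})
	}
	return mux
}

// statusOf issues an in-memory GET and returns the response status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	// Metrics ON, health manager OFF, controller not ready.
	mux := routes(true, false, func() bool { return false })
	fmt.Println("/readyz:", statusOf(mux, "/readyz"))
	fmt.Println("/metrics:", statusOf(mux, "/metrics"))
}
```

With metrics on and the health manager off, the sketch answers 503 on /readyz while still serving /metrics, which is the combination the regression test describes.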
QATester BLOCK — PR #746 (MCP-32 Observability)SHA: Finding:
CEO/Claude fallback code review (MCP-3137: GeminiCritic adapter failed, CEO acting as model-diverse fallback per MCP-3066)
VERDICT: REQUEST_CHANGES
Summary: Config-gating logic is correct (exporters off by default), and nil-guarding is solid throughout. One blocking resource leak found.
Bug (blocking)
Concerns (non-blocking)
…P-3135)
The /metrics handler is registered on the httpapi chi router but the outer
http.ServeMux only forwarded /api/, /events, and the health endpoints, so
GET /metrics returned 404 even with observability.metrics.enabled=true.
Extract the route forwarding into Server.registerHTTPHandlers and add a
gated mux.Handle("/metrics", httpAPIServer) so the endpoint is reachable
only when the Prometheus exporter is enabled. Covered by a unit test on the
real outer mux and a full-binary e2e that scrapes /metrics for 200 +
mcpproxy_uptime_seconds.
Related #746
…idge
runMetricsBridge calls SubscribeEvents() but never UnsubscribeEvents(),
leaking a channel in the runtime's eventSubs map on every call. This blocks
PR #746 from merging (the leak causes a permanent subscriber that silently
drops events after the bridge goroutine exits).
Add defer s.runtime.UnsubscribeEvents(events) right after subscribing,
matching the existing pattern in listenForRoutingModeRefresh.
Related #746
✅ Gatekeeper approval — MCP-32 observability exporters. Per the approved gate (Codex-first + verification): CodexReviewer's first-review caught a /readyz regression (config-gated feature changing readiness when OFF); BackendEngineer fixed it (51d7264 — keep controller-backed /readyz when health disabled) + a /metrics routing fix, with a regression test. CI fully green (33 checks: unit/E2E/integration/server-edition/CodeQL). Operator-verified the fix in code. Kimi re-review verdict was a CI-timing artifact on the now-green head, not substantive. Author (BackendEngineer) ≠ approver.
Summary
Ships the observability exporters for the daemon (roadmap #8 / MCP-32). The
internal/observability scaffolding already existed on main but was dead
code: the manager was created with a hardcoded default and the HTTP API was
constructed with nil observability, so /metrics was never served, tool-call
metrics were never recorded, and the OTLP tracing helpers were never called.
This PR makes it real and operator-configurable, OFF by default.
What's included (exit criteria)
- /metrics Prometheus endpoint: served on the existing HTTP listener, gated by
observability.metrics.enabled (default false).
- OTLP tracing: spans for tool calls and their upstream hop, gated by
observability.tracing.enabled (default false).
- Reference Grafana dashboard: contrib/grafana/mcpproxy-dashboard.json.
- Docs: docs/features/observability.md (scrape config, metric reference, OTLP
collector setup, dashboard import).
- Server edition: tool-call spans annotated with user_id and profile via a
//go:build server seam (span attributes, not Prometheus labels, to keep metric
cardinality bounded).
Metrics now wired
mcpproxy_tool_calls_total / _duration_seconds (inline at the call site),
mcpproxy_quarantine_events_total{scope,action} (new), upstream-health gauges
(servers_total/connected/quarantined, tools_total, docker_containers_active)
via an event-bus bridge, plus the pre-existing HTTP and OAuth-refresh metrics
that are now actually scrapeable.
Config
{
  "observability": {
    "metrics": { "enabled": true },
    "tracing": {
      "enabled": true,
      "protocol": "http",
      "endpoint": "localhost:4318",
      "sample_rate": 0.1
    }
  }
}
Testing
- transport selection, the quarantine counter, the metrics event bridge, the
config-gating mapping, and the inline tool-call recorder.
- go test ./internal/config/ ./internal/observability/ ./internal/server/ green
(incl. -race); E2E binary-startup tests (TestBinaryStartupAndShutdown,
TestBinaryAPIEndpoints, TestBinaryHealthAndRecovery) pass with the new wiring.
- golangci-lint v2 clean on both editions; both editions build.
Notes / follow-ups
- traceparent propagation to upstream servers over the wire would require
instrumenting the upstream transports; this PR creates the local spans (tool
call + upstream hop). Can be a follow-up.
- TODO(MCP-32) for a rendered dashboard screenshot, which needs a live Grafana
instance.
Related MCP-32