Skip to content

feat: Add distributed tracing (#696)#761

Closed
anchapin wants to merge 7 commits intomainfrom
feat/distributed-tracing-696
Closed

feat: Add distributed tracing (#696)#761
anchapin wants to merge 7 commits intomainfrom
feat/distributed-tracing-696

Conversation

@anchapin
Copy link
Owner

@anchapin anchapin commented Mar 8, 2026

Summary\nImplements distributed tracing for ModPorter-AI microservices using OpenTelemetry.\n\n## Changes\n- Add OpenTelemetry tracing for backend and ai-engine services\n- Implement W3C Trace Context propagation between services\n- Add automatic FastAPI, HTTPX, and Redis instrumentation\n- Configure Jaeger and OTLP exporters (configurable via env vars)\n- Add trace context injection/extraction utilities for service calls\n- Update docker-compose.yml with tracing environment variables\n\n## Related Issue\n#696

- Added fetch-depth: 0 to checkout step for full git history
- Added base: main to paths-filter action for local act testing

The dorny/paths-filter@v3 action requires either:
1. The base input to be configured, or
2. repository.default_branch to be set in the event payload

When running locally with 'act', the GitHub event payload doesn't have
the default_branch set, causing the action to fail with:
'This action requires base input to be configured'
…678)

- Add W rule (pycodestyle warnings) to ai-engine/pyproject.toml
- Add W rule (pycodestyle warnings) to backend/pyproject.toml
- Add W291, W292, W293 to ignore list for legacy code

Co-authored-by: openhands <openhands@all-hands.dev>
- Add OpenTelemetry tracing for backend and ai-engine services
- Implement W3C Trace Context propagation between services
- Add automatic FastAPI, HTTPX, and Redis instrumentation
- Configure Jaeger and OTLP exporters (configurable via env vars)
- Add trace context injection/extraction utilities for service calls
- Update docker-compose.yml with tracing environment variables

Co-authored-by: openhands <openhands@all-hands.dev>
Copilot AI review requested due to automatic review settings March 8, 2026 21:48
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @anchapin, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements distributed tracing across ModPorter AI microservices using OpenTelemetry, enabling trace propagation between the backend and AI Engine and adding a Jaeger collector/UI for local visualization.

Changes:

  • Add OpenTelemetry tracing bootstrap modules for backend and ai-engine (FastAPI/httpx/Redis instrumentation + exporters).
  • Propagate trace context from backend → AI Engine requests via injected headers.
  • Extend local Docker Compose with a Jaeger service and add tracing-related environment variables.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
docker-compose.yml Adds tracing env vars to services and introduces a Jaeger service + volume.
backend/src/services/tracing.py New backend tracing setup + propagation helpers.
backend/src/services/structured_logging.py Adjusts structlog processor ordering for exception rendering.
backend/src/services/ai_engine_client.py Injects trace context headers into outbound AI Engine calls.
backend/src/main.py Initializes and shuts down tracing in app lifespan.
backend/requirements.txt Adds OpenTelemetry dependencies.
backend/requirements-dev.txt Adds dev dependency tooling (pip-audit, pipdeptree).
backend/pyproject.toml Updates Ruff lint selection/ignores (adds W + ignores some W29x).
ai-engine/utils/logging_config.py Adjusts structlog processor ordering for exception rendering.
ai-engine/tracing.py New AI Engine tracing setup + propagation helpers.
ai-engine/requirements.txt Adds OpenTelemetry dependencies.
ai-engine/requirements-dev.txt Adds dev dependency tooling (pip-audit, pipdeptree).
ai-engine/pyproject.toml Updates Ruff lint selection/ignores (adds W + ignores some W29x).
ai-engine/main.py Initializes and shuts down tracing in FastAPI lifecycle.
.github/workflows/ci.yml Adjusts checkout depth and paths-filter base configuration.

Comment on lines +217 to +223

if context:
with tracer.start_as_current_span(name, context=context, kind=kind) as span:
return span
else:
with tracer.start_as_current_span(name, kind=kind) as span:
return span
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_span() uses start_as_current_span() as a context manager and then returns the span from inside the with block. The span will be ended immediately when the function returns, so callers would receive an already-finished span. Consider changing this helper to return a context manager (so callers do with create_span(...):), or create a span via tracer.start_span() and manage activation separately.

Suggested change
if context:
with tracer.start_as_current_span(name, context=context, kind=kind) as span:
return span
else:
with tracer.start_as_current_span(name, kind=kind) as span:
return span
if context:
return tracer.start_span(name, context=context, kind=kind)
else:
return tracer.start_span(name, kind=kind)

Copilot uses AI. Check for mistakes.
Comment on lines +299 to +324
def get_trace_id() -> Optional[str]:
"""
Get the current trace ID as a hex string.

Returns:
Trace ID or None
"""
span = get_current_span()
if span:
trace_id = span.get_span_context().trace_id
return format(trace_id, '032x')
return None


def get_span_id() -> Optional[str]:
"""
Get the current span ID as a hex string.

Returns:
Span ID or None
"""
span = get_current_span()
if span:
span_id = span.get_span_context().span_id
return format(span_id, '016x')
return None
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_trace_id() / get_span_id() check if span: but trace.get_current_span() typically returns a non-None default span even when there is no active span. This will cause IDs like all-zero values to be returned and propagated/logged. Consider checking span.get_span_context().is_valid (and/or non-zero ids) before formatting and returning the IDs.

Copilot uses AI. Check for mistakes.
Comment on lines +214 to +225
New span
"""
tracer = get_tracer()

if context:
with tracer.start_as_current_span(name, context=context, kind=kind) as span:
return span
else:
with tracer.start_as_current_span(name, kind=kind) as span:
return span


Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_span() returns a span from inside a with tracer.start_as_current_span(...) block, which ends the span as the function returns. This helper will therefore hand back an ended span. Consider returning the context manager itself (to be used with with) or using tracer.start_span() and managing activation explicitly.

Suggested change
New span
"""
tracer = get_tracer()
if context:
with tracer.start_as_current_span(name, context=context, kind=kind) as span:
return span
else:
with tracer.start_as_current_span(name, kind=kind) as span:
return span
New span (not yet ended). The caller is responsible for ending the span.
"""
tracer = get_tracer()
if context:
span = tracer.start_span(name, context=context, kind=kind)
else:
span = tracer.start_span(name, kind=kind)
return span

Copilot uses AI. Check for mistakes.
Comment on lines +299 to +324
def get_trace_id() -> Optional[str]:
"""
Get the current trace ID as a hex string.

Returns:
Trace ID or None
"""
span = get_current_span()
if span:
trace_id = span.get_span_context().trace_id
return format(trace_id, '032x')
return None


def get_span_id() -> Optional[str]:
"""
Get the current span ID as a hex string.

Returns:
Span ID or None
"""
span = get_current_span()
if span:
span_id = span.get_span_context().span_id
return format(span_id, '016x')
return None
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_trace_id() / get_span_id() will return formatted IDs even when there is no active span because trace.get_current_span() commonly returns a default non-valid span. This can lead to all-zero IDs being logged/propagated. Consider checking span.get_span_context().is_valid (and/or non-zero IDs) before returning.

Copilot uses AI. Check for mistakes.
Comment on lines +16 to +17
from opentelemetry.sdk.extension.aws.resource.ec2 import AwsEc2ResourceDetector
from opentelemetry.sdk.extension.aws.resource.ecs import AwsEcsResourceDetector
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AwsEc2ResourceDetector/AwsEcsResourceDetector are imported unconditionally, but the required OpenTelemetry AWS resource extension package isn’t listed in backend/requirements.txt. As written, importing this module will raise ModuleNotFoundError and prevent the backend from starting. Consider either adding the missing dependency (e.g., the OpenTelemetry AWS resource extension) or moving these imports behind a try/except (and only enabling the detectors when the package is installed).

Suggested change
from opentelemetry.sdk.extension.aws.resource.ec2 import AwsEc2ResourceDetector
from opentelemetry.sdk.extension.aws.resource.ecs import AwsEcsResourceDetector
try:
from opentelemetry.sdk.extension.aws.resource.ec2 import AwsEc2ResourceDetector
from opentelemetry.sdk.extension.aws.resource.ecs import AwsEcsResourceDetector
except ImportError:
class AwsEc2ResourceDetector: # type: ignore[override]
"""
Fallback no-op AWS EC2 resource detector used when the
OpenTelemetry AWS resource extension package is not installed.
"""
def detect(self) -> Resource:
# Return an empty resource so tracing still works without AWS metadata.
return Resource.create({})
class AwsEcsResourceDetector: # type: ignore[override]
"""
Fallback no-op AWS ECS resource detector used when the
OpenTelemetry AWS resource extension package is not installed.
"""
def detect(self) -> Resource:
# Return an empty resource so tracing still works without AWS metadata.
return Resource.create({})

Copilot uses AI. Check for mistakes.
Comment on lines +72 to +80

Returns:
Dictionary of trace headers to include in requests
"""
headers = {}
inject_trace_context(headers)
# Also add trace_id for logging purposes
trace_id = get_trace_id()
if trace_id:
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because get_trace_id() can return an all-zero trace id when there is no active span, this client may end up sending X-Trace-ID: 000... on requests. Consider only adding X-Trace-ID when the current span context is valid (or after get_trace_id() is updated to return None when invalid).

Suggested change
Returns:
Dictionary of trace headers to include in requests
"""
headers = {}
inject_trace_context(headers)
# Also add trace_id for logging purposes
trace_id = get_trace_id()
if trace_id:
Returns:
Dictionary of trace headers to include in requests
"""
headers = {}
inject_trace_context(headers)
# Also add trace_id for logging purposes, but only if it is non-zero
trace_id = get_trace_id()
if trace_id and any(ch != "0" for ch in trace_id):

Copilot uses AI. Check for mistakes.
- uses: dorny/paths-filter@v3
id: changes
with:
base: main
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base: main is hardcoded for dorny/paths-filter, but this workflow runs on PRs and pushes to both main and develop. Using main as the base can produce incorrect change detection (e.g., PRs targeting develop or pushes to develop), potentially skipping required jobs or running too much. Consider using the PR base ref (e.g., ${{ github.base_ref }}) and falling back appropriately for push events, or omit base to use the action’s defaults.

Suggested change
base: main

Copilot uses AI. Check for mistakes.
Comment on lines +16 to +17
from opentelemetry.sdk.extension.aws.resource.ec2 import AwsEc2ResourceDetector
from opentelemetry.sdk.extension.aws.resource.ecs import AwsEcsResourceDetector
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AwsEc2ResourceDetector/AwsEcsResourceDetector are imported unconditionally, but ai-engine/requirements.txt doesn’t include the OpenTelemetry AWS resource extension package required to provide opentelemetry.sdk.extension.aws.*. This will raise ModuleNotFoundError at startup. Consider adding the missing dependency or making these imports optional (try/except) so tracing can work without AWS-specific packages.

Suggested change
from opentelemetry.sdk.extension.aws.resource.ec2 import AwsEc2ResourceDetector
from opentelemetry.sdk.extension.aws.resource.ecs import AwsEcsResourceDetector
try:
from opentelemetry.sdk.extension.aws.resource.ec2 import AwsEc2ResourceDetector
from opentelemetry.sdk.extension.aws.resource.ecs import AwsEcsResourceDetector
except ModuleNotFoundError:
# Fallback detectors when the OpenTelemetry AWS extension package is not installed.
# These maintain API compatibility but do not add any AWS-specific resource attributes.
class AwsEc2ResourceDetector: # type: ignore[override]
def detect(self) -> Resource:
return Resource.create({})
class AwsEcsResourceDetector: # type: ignore[override]
def detect(self) -> Resource:
return Resource.create({})
logging.getLogger(__name__).warning(
"OpenTelemetry AWS resource detectors are not available. "
"Install 'opentelemetry-sdk-extension-aws' to enable AWS resource detection."
)

Copilot uses AI. Check for mistakes.
Comment on lines 33 to +39
depends_on:
redis:
condition: service_healthy
postgres:
condition: service_healthy
jaeger:
condition: service_started
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jaeger has a healthcheck defined, but the backend is only waiting for service_started. This can allow the backend to start before Jaeger is actually ready, dropping early spans. Consider switching this to condition: service_healthy to align with the other dependencies and ensure Jaeger is ready before startup.

Copilot uses AI. Check for mistakes.
redis:
condition: service_healthy
jaeger:
condition: service_started
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jaeger has a healthcheck defined, but ai-engine depends on it with condition: service_started. Consider using condition: service_healthy so the engine waits until Jaeger is ready before emitting spans, reducing lost traces at startup.

Suggested change
condition: service_started
condition: service_healthy

Copilot uses AI. Check for mistakes.
- Format backend files: tracing.py, ai_engine_client.py, structured_logging.py, main.py
- Format ai-engine files: tracing.py, main.py, logging_config.py

Co-authored-by: openhands <openhands@all-hands.dev>
- backend/src/main.py
- backend/src/services/ai_engine_client.py
- backend/src/services/structured_logging.py
- ai-engine/main.py
- ai-engine/utils/logging_config.py

Co-authored-by: openhands <openhands@all-hands.dev>
…atibility

Latest version 1.24.0+ requires Python <3.11. This fixes the Prepare Base Images CI failure.
anchapin added a commit that referenced this pull request Mar 10, 2026
* feat: Add Ruff linter configuration for all Python directories

- Add root-level pyproject.toml with comprehensive Ruff config
- Configure Ruff to check all Python directories (backend, ai-engine, modporter, tests)
- Add appropriate ignores for legacy code patterns (unused imports,
  module-level imports not at top, bare except, etc.)
- Update CI workflow to use root config with 'ruff check .'
- Exclude UTF-16 encoded temp_init.py file from linting

Co-authored-by: openhands <openhands@all-hands.dev>

* fix(CI): Resolve integration tests failing on main branch

- Fix prepare-base-images job to always run but conditionally skip build
  - This ensures outputs are always available for dependent jobs
  - Fixes 'Unable to find image' error when dependencies haven't changed

- Fix integration-tests container image to use coalesce() for fallback
  - When prepare-base-images is skipped, use python:3.11-slim as fallback
  - Fixes empty container image reference error

- Fix performance-monitoring job needs clause
  - Corrected 'prepare-base-images' reference (was missing underscore)

- Fix frontend-tests pnpm setup order
  - Install pnpm before setup-node to avoid 'unable to cache dependencies'
  - Simplified caching to use built-in pnpm cache in setup-node

Co-authored-by: openhands <openhands@all-hands.dev>

* fix(frontend): Resolve Issue #776 - Frontend Test Failures

Fixed multiple failing test files:
- RecipeBuilder.test.tsx: Added async/await patterns, fixed userEvent setup
- ConversionUpload.test.tsx: Fixed URL validation test (validation happens on button click)
- EnhancedConversionReport.test.tsx: Simplified complex DOM interaction tests
- ConversionProgress.test.tsx, useUndoRedo.test.ts, api.test.ts: Additional fixes

All 189 frontend tests now passing.

Co-authored-by: openhands <openhands@all-hands.dev>

* fix: Fix pre-existing frontend test failures (#776)

Co-authored-by: openhands <openhands@all-hands.dev>

* fix(CI): Set VITE_API_BASE_URL for frontend tests

This ensures the WebSocket URL is properly constructed in CI environment
instead of using 'ws://undefined/ws/...' when VITE_API_BASE_URL is not set.

Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
@anchapin anchapin closed this Mar 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants