Skip to content

[observability] most basic OpenTelemetry integration into MCK #93

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nammn
Copy link
Collaborator

@nammn nammn commented May 7, 2025

Summary

This pull request introduces OpenTelemetry tracing support to the MongoDB Kubernetes Operator and its related components. Key changes include the integration of OpenTelemetry libraries, the addition of tracing configuration, and updates to ensure trace propagation across the application. These changes enhance observability and debugging capabilities.

In our CI suite this means we will have the following kind of traces:

trace_id: abc123

                         ┌────────────────────┐
                         │     Evergreen      │
                         │  span_id: ROOT     │
                         │  parent_id: none   │
                         └─────────┬──────────┘
                                   │
        ┌──────────────────────────┼─────────────────────────┐
        ▼                          ▼                         ▼
┌──────────────┐         ┌────────────────┐         ┌────────────────────┐
│   E2E Test   │         │   Operator     │         │     (Other…)       │
│ span_id: A1  │         │ span_id: B1    │         │                    │
│ parent: ROOT │         │ parent: ROOT   │         │                    │
└──────┬───────┘         └──────┬─────────┘         └────────────────────┘
       │                        │
       ▼                        ▼
┌──────────────┐         ┌────────────────────┐
│ E2E Function │         │   Reconcile Loop   │
│ span_id: A2  │         │   span_id: B2      │
│ parent: A1   │         │   parent: B1       │
└──────────────┘         └────────────────────┘

OpenTelemetry Integration:

  • Tracing in main.go:
    • Added OpenTelemetry setup in the main function, including trace and span ID extraction from environment variables and the creation of a root span for the operator. Tracing context is propagated across controllers and shutdown processes are handled gracefully.
  • Telemetry in pkg/telemetry/client.go: <--- this is good to know if we happen to make a change and happen to send to prod atlas
    • Added a span to the SendEventWithRetry function to capture telemetry events and include the Atlas base URL as a span attribute.

Helm Chart Updates:

  • Operator configuration:
    • Added OpenTelemetry-specific environment variables (OTEL_TRACE_ID, OTEL_PARENT_ID, OTEL_EXPORTER_OTLP_ENDPOINT) to the operator's deployment template. ([helm_chart/templates/operator.yamlR83-R90](https://github.com/mongodb/mongodb-kubernetes/pull/93/files#diff-5d2e377a6806023ca9eff60be4d7e5cd879803de2bd3800b630f479f8728f322R83-R90))
    • Introduced OpenTelemetry configuration options (enabled, traceID, parentID, collectorEndpoint) in the Helm chart's values.yaml.

Dependency Updates:

  • Go module dependencies:
    • Added OpenTelemetry-related libraries (otel, otel/sdk, otel/trace, etc.) to go.mod.

Proof of Work

  • e.g. patch

  • generated traces in our ci: Link
    Screenshot 2025-05-21 at 15 19 20

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you checked for release_note changes?

Reminder (Please remove this when merging)

  • Please try to Approve or Reject Changes the PR, keep PRs in review as short as possible
  • Our Short Guide for PRs: Link
  • Remember the following Communication Standards - use comment prefixes for clarity:
    • blocking: Must be addressed before approval.
    • follow-up: Can be addressed in a later PR or ticket.
    • q: Clarifying question.
    • nit: Non-blocking suggestions.
    • note: Side-note, non-actionable. Example: Praise
    • --> no prefix is considered a question

@nammn nammn changed the title add initial operator tracing support OpenTelemetry integration into MCK May 7, 2025
@nammn nammn force-pushed the traces-operator branch from 19334f0 to 388f4ad Compare May 21, 2025 08:00
@nammn nammn changed the title OpenTelemetry integration into MCK most basic OpenTelemetry integration into MCK May 21, 2025
@nammn nammn changed the title most basic OpenTelemetry integration into MCK [observability] most basic OpenTelemetry integration into MCK May 21, 2025
@nammn nammn force-pushed the traces-operator branch 3 times, most recently from 1b6922c to 3420830 Compare May 21, 2025 08:45
- name: OTEL_SERVICE_NAME
value: evergreen-agent
value: mongodb-e2e-tests
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the service-name for better querying

reset_namespace "$(kubectl config current-context)" "${NAMESPACE}" || true
fi
# If the test passed, then the namespace is removed
delete_operator "${NAMESPACE}"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should remove the operator to test teardown at the end of the test-run + it enables us to ensure traces are exported in time

@nammn nammn force-pushed the traces-operator branch from 2391b87 to b75faef Compare May 21, 2025 13:22
@nammn nammn marked this pull request as ready for review May 21, 2025 13:23
@nammn nammn requested a review from a team as a code owner May 21, 2025 13:23
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces basic OpenTelemetry tracing support into the MongoDB Kubernetes Operator to improve observability and debugging. Key changes include the integration of OpenTelemetry libraries and configuration in main.go and pkg/telemetry, updates to Helm chart values and templates for trace propagation, and accompanying changes in deployment and test scripts to ensure correct trace information is passed.

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
scripts/funcs/operator_deployment Adds OpenTelemetry environment variable handling to operator deployment configuration.
scripts/funcs/kubernetes Updates operator naming and adjusts deployment deletion commands.
scripts/evergreen/e2e/e2e.sh Introduces cluster diagnostics and cleanup functions for enhanced troubleshooting.
scripts/evergreen/deployments/test-app/templates/mongodb-enterprise-tests.yaml Modifies environment variables to set the correct service name in traces.
pkg/telemetry/* Implements tracing setup, span creation, and propagation via new OpenTelemetry integrations.
pipeline.py Updates comment on trace flags to reflect sampling behavior.
main.go Integrates tracing setup with graceful shutdown and propagates trace context to controllers and webhooks.
helm_chart/* Adds OpenTelemetry configuration options to Helm chart values and templates.
go.mod Introduces OpenTelemetry dependencies.
docker/mongodb-kubernetes-tests/tests/conftest.py Replaces print with logger and corrects a spelling mistake in a comment.
LICENSE-THIRD-PARTY Updates third-party dependency list to include new OpenTelemetry libraries.

Copy link
Contributor

@lsierant lsierant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very, very nice change and I would like to have it included. But as of right now I think we should discuss this PR a bit more before deciding to merge. There are too many end-user implications to just LGTM it now.

@@ -123,11 +127,89 @@ func main() {
// Namespace where the operator is installed
currentNamespace := env.ReadOrPanic(util.CurrentNamespace)

// Get trace and span IDs from environment variables
traceIDHex := os.Getenv("OTEL_TRACE_ID")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why Hex in the name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the api follows hex: https://www.w3.org/TR/trace-context/#parent-id

@@ -32,6 +32,13 @@ operator:
- mongodbcommunity
- mongodbsearch

# OpenTelemetry configuration
opentelemetry:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be ever configured by the users? I wonder if we actually need that exposed. I'm a bit worried that it will be confusing why we have operator.telemetry and operator.opentelemetry in values.yaml. How this is related to each other, is there a possibility for the users to confuse the two and expect no telemetry at all when opentelemetry=false?

I think we're lacking docs or explanations here. If we don't have good docs/story for it, I'd prefer to not expose the settings here at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that what we use otel here is just tracing. What if we don't mention any "telemetry" but always refer to it as tracing to not confuse with our telemetry gathering?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now its just tracing - but we might more in the future. But for now its tightly coupled. We could not expose this and just use it for us

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't necessarily want to expose this; but how can we use it while not exposing it in our current e2e test infrastructure?

main.go Outdated
log.Infof("Setting up tracing with ID: %s, Parent ID: %s, Endpoint: %s", traceIDHex, spanIDHex, endpoint)
traceCtx, tp, err := telemetry.SetupTracing(signalCtx, traceIDHex, spanIDHex, endpoint)
if err != nil {
log.Errorf("Failed to setup tracing: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we change every message where we mention "tracing" to have it as "opentelemetry tracing"?

main.go Outdated
log.Errorf("Failed to setup tracing: %v", err)
// Continue without tracing instead of failing
traceCtx = signalCtx
} else if tp != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we extract that tracing code to a separate file/function? It's a quite a chunk of pretty verbose code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

main.go Outdated
attribute.Bool("root", true),
)

log.Info("Ending root operator span and flushing it to exporter")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The messages on Info level will appear in normal, regular logs for every customer. And it looks like this is not guarded by having otel enabled or not.

Please review every log message and consider if the customer seeing it in the logs will know what is this about and not be confusing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all reworked to debug, thanks for the catch!

func RunTelemetry(ctx context.Context, mongodbImage, databaseNonStaticImage, namespace string, operatorClusterMgr manager.Manager, clusterMap map[string]cluster.Cluster, atlasClient *Client, configuredOperatorEnv util.OperatorEnvironment) {
Logger.Debug("sending telemetry!")
func RunTelemetry(leaderTrace context.Context, mongodbImage, databaseNonStaticImage, namespace string, operatorClusterMgr manager.Manager, clusterMap map[string]cluster.Cluster, atlasClient *Client, configuredOperatorEnv util.OperatorEnvironment) {
Logger.Debug("Collecting telemetry!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uppercased Logger is quite surprising to me. I know it's been there before this change, but can you explain why it's a different convention than other log *zap.SugaredLog conventions we use?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason was to expose a global package level Logger which is always ready and correctly initialized for the independent telemetry module. This saves us the need for creating a new sugaredLogger everytime + the potential of configuring it wrongly

@nammn nammn force-pushed the traces-operator branch from a3c5077 to 1441f82 Compare May 23, 2025 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants