-
Notifications
You must be signed in to change notification settings - Fork 8
[observability] most basic OpenTelemetry integration into MCK #93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
1b6922c
to
3420830
Compare
- name: OTEL_SERVICE_NAME | ||
value: evergreen-agent | ||
value: mongodb-e2e-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the service-name for better querying
scripts/evergreen/e2e/e2e.sh
Outdated
reset_namespace "$(kubectl config current-context)" "${NAMESPACE}" || true | ||
fi | ||
# If the test passed, then the namespace is removed | ||
delete_operator "${NAMESPACE}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should remove the operator to test teardown at the end of the test-run + it enables us to ensure traces are exported in time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces basic OpenTelemetry tracing support into the MongoDB Kubernetes Operator to improve observability and debugging. Key changes include the integration of OpenTelemetry libraries and configuration in main.go and pkg/telemetry, updates to Helm chart values and templates for trace propagation, and accompanying changes in deployment and test scripts to ensure correct trace information is passed.
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
scripts/funcs/operator_deployment | Adds OpenTelemetry environment variable handling to operator deployment configuration. |
scripts/funcs/kubernetes | Updates operator naming and adjusts deployment deletion commands. |
scripts/evergreen/e2e/e2e.sh | Introduces cluster diagnostics and cleanup functions for enhanced troubleshooting. |
scripts/evergreen/deployments/test-app/templates/mongodb-enterprise-tests.yaml | Modifies environment variables to set the correct service name in traces. |
pkg/telemetry/* | Implements tracing setup, span creation, and propagation via new OpenTelemetry integrations. |
pipeline.py | Updates comment on trace flags to reflect sampling behavior. |
main.go | Integrates tracing setup with graceful shutdown and propagates trace context to controllers and webhooks. |
helm_chart/* | Adds OpenTelemetry configuration options to Helm chart values and templates. |
go.mod | Introduces OpenTelemetry dependencies. |
docker/mongodb-kubernetes-tests/tests/conftest.py | Replaces print with logger and corrects a spelling mistake in a comment. |
LICENSE-THIRD-PARTY | Updates third-party dependency list to include new OpenTelemetry libraries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very, very nice change and I would like to have it included. But as of right now I think we should discuss this PR a bit more before deciding to merge. There are too many end-user implications to just LGTM it now.
@@ -123,11 +127,89 @@ func main() { | |||
// Namespace where the operator is installed | |||
currentNamespace := env.ReadOrPanic(util.CurrentNamespace) | |||
|
|||
// Get trace and span IDs from environment variables | |||
traceIDHex := os.Getenv("OTEL_TRACE_ID") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why Hex in the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the api follows hex: https://www.w3.org/TR/trace-context/#parent-id
@@ -32,6 +32,13 @@ operator: | |||
- mongodbcommunity | |||
- mongodbsearch | |||
|
|||
# OpenTelemetry configuration | |||
opentelemetry: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be ever configured by the users? I wonder if we actually need that exposed. I'm a bit worried that it will be confusing why we have operator.telemetry
and operator.opentelemetry
in values.yaml. How this is related to each other, is there a possibility for the users to confuse the two and expect no telemetry at all when opentelemetry=false?
I think we're lacking docs or explanations here. If we don't have good docs/story for it, I'd prefer to not expose the settings here at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that what we use otel here is just tracing. What if we don't mention any "telemetry" but always refer to it as tracing to not confuse with our telemetry gathering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now its just tracing - but we might more in the future. But for now its tightly coupled. We could not expose this and just use it for us
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't necessarily want to expose this; but how can we use it while not exposing it in our current e2e test infrastructure?
main.go
Outdated
log.Infof("Setting up tracing with ID: %s, Parent ID: %s, Endpoint: %s", traceIDHex, spanIDHex, endpoint) | ||
traceCtx, tp, err := telemetry.SetupTracing(signalCtx, traceIDHex, spanIDHex, endpoint) | ||
if err != nil { | ||
log.Errorf("Failed to setup tracing: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we change every message where we mention "tracing" to have it as "opentelemetry tracing"?
main.go
Outdated
log.Errorf("Failed to setup tracing: %v", err) | ||
// Continue without tracing instead of failing | ||
traceCtx = signalCtx | ||
} else if tp != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we extract that tracing code to a separate file/function? It's a quite a chunk of pretty verbose code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
main.go
Outdated
attribute.Bool("root", true), | ||
) | ||
|
||
log.Info("Ending root operator span and flushing it to exporter") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The messages on Info level will appear in normal, regular logs for every customer. And it looks like this is not guarded by having otel enabled or not.
Please review every log message and consider if the customer seeing it in the logs will know what is this about and not be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all reworked to debug, thanks for the catch!
func RunTelemetry(ctx context.Context, mongodbImage, databaseNonStaticImage, namespace string, operatorClusterMgr manager.Manager, clusterMap map[string]cluster.Cluster, atlasClient *Client, configuredOperatorEnv util.OperatorEnvironment) { | ||
Logger.Debug("sending telemetry!") | ||
func RunTelemetry(leaderTrace context.Context, mongodbImage, databaseNonStaticImage, namespace string, operatorClusterMgr manager.Manager, clusterMap map[string]cluster.Cluster, atlasClient *Client, configuredOperatorEnv util.OperatorEnvironment) { | ||
Logger.Debug("Collecting telemetry!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uppercased Logger is quite surprising to me. I know it's been there before this change, but can you explain why it's a different convention than other log *zap.SugaredLog conventions we use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason was to expose a global package level Logger which is always ready and correctly initialized for the independent telemetry module. This saves us the need for creating a new sugaredLogger everytime + the potential of configuring it wrongly
Summary
This pull request introduces OpenTelemetry tracing support to the MongoDB Kubernetes Operator and its related components. Key changes include the integration of OpenTelemetry libraries, the addition of tracing configuration, and updates to ensure trace propagation across the application. These changes enhance observability and debugging capabilities.
In our CI suite this means we will have the following kind of traces:
OpenTelemetry Integration:
main.go
:main
function, including trace and span ID extraction from environment variables and the creation of a root span for the operator. Tracing context is propagated across controllers and shutdown processes are handled gracefully.pkg/telemetry/client.go
: <--- this is good to know if we happen to make a change and happen to send to prod atlasSendEventWithRetry
function to capture telemetry events and include the Atlas base URL as a span attribute.Helm Chart Updates:
OTEL_TRACE_ID
,OTEL_PARENT_ID
,OTEL_EXPORTER_OTLP_ENDPOINT
) to the operator's deployment template. ([helm_chart/templates/operator.yamlR83-R90](https://github.com/mongodb/mongodb-kubernetes/pull/93/files#diff-5d2e377a6806023ca9eff60be4d7e5cd879803de2bd3800b630f479f8728f322R83-R90)
)enabled
,traceID
,parentID
,collectorEndpoint
) in the Helm chart'svalues.yaml
.Dependency Updates:
otel
,otel/sdk
,otel/trace
, etc.) togo.mod
.Proof of Work
e.g. patch
generated traces in our ci: Link

Checklist
Reminder (Please remove this when merging)