This issue has been moved to a discussion.
Pre-RFC: Datadog Tracer Provider for Powertools-Java
Hi team! 👋 Before drafting a full RFC, I wanted to gauge interest in adding a Datadog tracer implementation to Powertools for AWS Lambda (Java).
Why?
This would give Powertools + Datadog customers the ability to go beyond Datadog's serverless layer, instrumenting downstream calls and introducing custom spans whilst relying on the ease of use of Powertools for Java.
High-level idea
- Extract today's X-Ray code behind a minimal interface (PowertoolsTracer).
- Publish separate artifacts:
  - powertools-tracing-core
  - powertools-tracing-xray (current impl, default)
  - powertools-tracing-datadog (new impl)
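To make the idea concrete, here is a purely hypothetical sketch of what such a minimal interface could look like - only the name PowertoolsTracer comes from the proposal above; the method names and signatures are illustrative assumptions, not an existing Powertools API:

```java
// Hypothetical sketch - not an existing Powertools API. Only the name
// PowertoolsTracer is taken from the proposal; everything else is illustrative.
import java.util.function.Supplier;

public interface PowertoolsTracer {

    // Run a unit of work inside a subsegment/span and return its result.
    <T> T withSubsegment(String name, Supplier<T> work);

    // Attach an annotation (e.g. ColdStart, Service) to the current span.
    void putAnnotation(String key, String value);

    // Record request/response/error payloads as metadata, driven by env vars.
    void putMetadata(String key, Object value);
}
```

An X-Ray-backed implementation would then live in powertools-tracing-xray, and a Datadog-backed one in powertools-tracing-datadog.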
Next steps
- Confirm project interest / any concerns with the approach.
- If positive, I'll submit a full RFC with API sketch, migration plan, and PoC link.
- Iterate on feedback and open implementation PRs behind a feature flag.
Would this addition be of interest to the maintainers? Happy to flesh out details if the direction sounds reasonable!
Scott
phipag commented on Jun 17, 2025
Hey @scottgerring, thanks for opening this issue. This is a really cool idea.
In Python, we have something similar for the Metrics utility already (https://docs.powertools.aws.dev/lambda/python/latest/core/metrics/datadog/). With the refactoring of the metrics module in v2, we now also have the ability to support such providers in Metrics in Java. I believe that the provider pattern (with sub-modules) you are suggesting is really good in general and could be a valuable addition to Tracing.
Before we move on to an RFC stage, let me triage this with the team and get back to you with more details (by next week).
scottgerring commented on Jun 18, 2025
Hey @phipag thanks for the quick turnaround!
Yes - do let me know - FWIW I'm out until the 3rd of July, so on my side there's no great rush. And we could certainly do this for metrics, too :D
scottgerring commented on Jul 2, 2025
Hey @phipag , did you folks have a chat about this? 👼
leandrodamascena commented on Jul 2, 2025
Hi @scottgerring! Long time no see, my friend. I hope life and family are going well. By the way, thank you so much for all the hard work you put into the Java v2 branch before you left AWS. It paved the way for @phipag to get there.
Let me give you some additional details about the decisions we’ve made in the past that will drive other decisions in the future.
1/ Powertools support for third-party observability providers
Two years ago, we decided to start supporting third-party o11y providers, primarily because we knew customers were using providers other than CloudWatch for a variety of reasons: because it's part of their foundation, because they want to have all their data in one place, and for a thousand other reasons customers have. Our first bet was to add support for sending metrics directly to Datadog in Powertools Python. This decision was made mainly because I had some previous experience with the Datadog Extension/Forwarder, which helped me understand the edge cases and the best ways to implement it. To be honest, I think this was a great implementation, and we see some customers using it and being happy with it.
2/ Datadog, NewRelic, HoneyComb, AppDynamics, and dozens of other providers
I remember a discussion we had back then: OK, now that we've added support for Datadog, customers are going to ask for NewRelic, HoneyComb, AppDynamics, and dozens of other observability tools - how do we support all of these integrations? And who is responsible for adding and maintaining this code? Sure, it's Powertools code, but we recognize that we may not have enough knowledge about each provider.
3/ Why not a standard
At that time, OpenTelemetry was growing and becoming a standard for some types of workloads, but we must recognize that this new standard was still in the adoption phase: Tracer was the first stable/GA API in OTEL, Metrics had only recently gone GA, and Logs/Events were still in the RC phase. Additionally, customers were still figuring out how to do things the right way with OTEL and Lambda, and o11y providers were still making OTLP endpoints stable and the recommended way to go.
And because of all this uncertainty around the OTEL ecosystem (note that I'm not questioning OTEL/OTLP itself here), we decided to make it provider-specific.
4/ Looking to the Future
Given this scenario and the changes of the last 2 years, we are evaluating new opportunities for the future. While we will maintain our standard user experience using CloudWatch + X-Ray, why not adopt OTEL in Powertools with the same experience we have today, implement this protocol as the standard, and let customers decide where to send this data? If I am not mistaken, this is Datadog's recommended way to send traces and may be the recommended way for other providers as well. I am not saying that this is the final decision, or that we are not open to implementing a direct integration with Datadog, but we are considering what is best for our customers in the future.
I would like to take this opportunity to talk to you about some of the challenges with OTEL and Lambda, and how you see the developer-side experience when sending data to Datadog using OTEL + Lambda. Please let me know if you're open to scheduling a meeting to discuss this further.
Leandro
scottgerring commented on Jul 3, 2025
Hey @leandrodamascena lovely to hear from you again! It's been a while :)
I think your take is fundamentally clear-eyed; if we can do it with OTel and support everything, that's great. The issue with OTel and Java in the past has been that the cold-start impact is significant. I'm not sure if this is still the case - but it's certainly worth validating! We'd also want to make sure that it doesn't introduce CRaC issues, as I believe @phipag is working on making pt-java play well there.
Folks can use OTel/OTLP, or they can use the Datadog instrumentation.
Chucked a meeting in your calendars - let's chat next week!
phipag commented on Jul 5, 2025
Absolutely, we already have a PR opened by a contributor #1861.
Is there anything specific you have in mind that would introduce CRaC issues?
scottgerring commented on Jul 7, 2025
Nope! Just flagging that we should keep an eye on it :D
scottgerring commented on Jul 7, 2025
I had a play with this this morning. I think we should:
- Use io.opentelemetry.api.trace.Span within the aspect for the Powertools @Tracing annotation to push our extra bits out - the ColdStart and ServiceName annotations, plus the request, response, and error data according to the env vars
- Deprecate @Tracing in favour of OTel's @WithSpan
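For illustration only (this is not existing Powertools code), here is roughly what the aspect could do through the OTel API - the attribute keys, class name, and parameter names below are assumptions:

```java
import io.opentelemetry.api.trace.Span;

// Illustrative sketch: decorate whichever span is current - e.g. the one created
// by the ADOT/Datadog auto-instrumentation layer or by OTel's @WithSpan.
final class PowertoolsSpanDecorator {

    static void decorateCurrentSpan(boolean coldStart, String serviceName, String responseJson) {
        Span span = Span.current();
        span.setAttribute("ColdStart", coldStart);
        span.setAttribute("Service", serviceName); // e.g. from POWERTOOLS_SERVICE_NAME
        if (responseJson != null) {                // response capture toggled by env vars
            span.setAttribute("Response", responseJson);
        }
    }
}
```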
It seems that X-Ray is migrating to using OTel as its API as well, which makes things more interesting! It looks like you can do auto-instrumentation with Lambda too, which is the same sort of mechanism Datadog can use to make things easy.
The destination-specific configuration becomes glue, and working out how to handle this is going to be important:
- localhost:4317 as provided by the agent layer

So, something like this:
- powertools-tracing - the core tracing module. This depends only on io.opentelemetry.opentelemetry-api - and this is all we need for Datadog and X-Ray if we encourage auto-instrumentation
- powertools-tracing-otlp - depends on powertools-tracing, opentelemetry-sdk, and opentelemetry-exporter-otlp. Provides some way of letting the user specify their OTLP endpoints (env var?)

I have test code for the API-first way with Datadog and associated CDK setup I can share after we chat about what we want to do!
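As a rough, purely illustrative sketch of what that OTLP-endpoint glue could look like - the powertools-tracing-otlp module is only proposed above, and the fallback behaviour here is an assumption, though OTEL_EXPORTER_OTLP_ENDPOINT itself is the standard OTel environment variable:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;

// Illustrative only: resolve the OTLP endpoint from the standard env var,
// falling back to the local endpoint exposed by an agent/collector layer.
final class OtlpEndpointGlue {

    static OtlpGrpcSpanExporter buildExporter() {
        String endpoint = System.getenv().getOrDefault(
                "OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317");
        return OtlpGrpcSpanExporter.builder()
                .setEndpoint(endpoint)
                .build();
    }
}
```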
leandrodamascena commented on Jul 7, 2025
Hey @scottgerring, thanks for sharing some thoughts - that's exactly what I'm working on at this very moment, and I really appreciate it.
We basically have two distinct worlds in OTEL, and OTEL+Lambda in general: auto-instrumentation and manual instrumentation. I'm trying to find the right balance between the two worlds and provide a good experience by taking the burden of the OTEL API - which is super complex - off the customer, while still allowing the customer to do whatever customizations they want. Finding that middle ground is hard LOL.
Our main idea is not to remove support for X-Ray or discontinue it, but to create a new provider that will have the same experience but will send data via OTEL. We cannot remove support for X-Ray because, even though X-Ray supports data ingestion through the OTLP endpoint, the dependencies are different, the infrastructure changes (it needs a collector and a layer), and this could break existing customers.
I have this code working in Python with auto-instrumentation using this: https://aws-otel.github.io/docs/getting-started/lambda/lambda-python. I use the existing span segment, that is the Lambda Handler, and then I just set the new attributes such as ColdStart and Service. This works perfectly using this extension because it also creates the root span, which is the Lambda Handler, and then we can start adding more spans/attributes/context or whatever you want to it. In the case of customers who don't use this extension, we need to create the root span on their behalf, and I still don't know exactly where I need to create it to have a meaningful view of all the spans.

This is hard to say, because the destination matters if customers are not using collectors inside Lambda. I agree that this should be the default experience: Lambda code -> send to collector endpoint -> destination. But this is sometimes not what customers are doing: some customers export the JSON created by the OTEL SDK and then aggregate it using Kinesis, for example, while others use the exporter directly in their code with a sync HTTP call + batching.
I come back here to my initial point of the discussion: auto-instrumentation or manual instrumentation. If we decide to support manual instrumentation - which for me makes 100% sense - we need to allow the customer to configure the tracer instance with the exporter they want, something like this:
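The snippet originally referenced here isn't reproduced, but as a hedged sketch of the manual path using plain OTel SDK APIs (no Powertools-specific API implied), it could look roughly like this, with the customer choosing the exporter:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

// Illustrative sketch: the customer wires up the OTel SDK with the exporter of
// their choice and registers it globally, so any instrumentation (including a
// future Powertools provider) can pick the tracer up via GlobalOpenTelemetry.
public class TracingBootstrap {

    public static OpenTelemetry init() {
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(
                        OtlpGrpcSpanExporter.builder()
                                .setEndpoint("http://localhost:4317") // e.g. an agent/collector layer
                                .build())
                        .build())
                .build();

        return OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();
    }
}
```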
I'm not sure. I was reading the Datadog documentation and it seems like we need to instrument the code to send the traces. But of course I might be missing something and not understanding this. I need to better understand how we can support most providers with the same API/experience.
Both the Datadog layer and the ADOT layer work the same way: they auto-instrument third-party libraries. The ADOT layer also gives us the exceptions, stack traces, and attributes for free, and I imagine Datadog's does too.
Our idea is not to change the current tracing module, but to create a new one called powertools-tracing-otlp with specific dependencies, or something like that. This is another point we need to think about. We don't necessarily need to bring these dependencies if customers are using the ADOT layer, because it brings them in the layer and we can help the customer reduce the size of the Lambda package. But yes, customers must have those dependencies.
Nice, this is amazing! We can talk more about this in our meeting.
As always, thank you so much for sharing your knowledge and helping us get there. I'm super excited about this integration and we definitely intend to support Datadog from day one.