This issue has been moved to a discussion.

Tracing module - implement other providers #1894

scottgerring (Contributor)

Description

Pre-RFC: Datadog Tracer Provider for Powertools-Java

Hi team! 👋 Before drafting a full RFC, I wanted to gauge interest in adding a Datadog tracer implementation to Powertools for AWS Lambda (Java).

Why?

This would give Powertools + Datadog customers the ability to go beyond Datadog's serverless layer, instrumenting downstream calls and introducing custom spans whilst relying on the ease of use of Powertools for Java.

High-level idea

  • Extract today's X-Ray code behind a minimal interface (PowertoolsTracer) - see the sketch after this list.
  • Publish separate artifacts:
    • powertools-tracing-core
    • powertools-tracing-xray (current impl, default)
    • powertools-tracing-datadog (new impl)
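
For illustration only, a hypothetical sketch of what the extracted interface could look like (the name comes from this proposal; the methods are illustrative and only roughly mirror the kinds of operations the current X-Ray-backed tracing code performs, not a final API):

    import java.util.function.Supplier;

    /** Hypothetical sketch - method names and shapes are illustrative, not a final API. */
    public interface PowertoolsTracer {
        void putAnnotation(String key, String value);         // e.g. ColdStart, Service
        void putMetadata(String key, Object value);           // request/response/error payloads
        <T> T withSubsegment(String name, Supplier<T> work);  // wrap a unit of work in a child span/subsegment
        void captureError(Throwable error);                   // record an exception on the current span/segment
    }

The X-Ray and Datadog artifacts would then each ship an implementation of this interface, with the X-Ray one remaining the default.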

Next steps

  1. Confirm project interest / any concerns with the approach.
  2. If positive, I'll submit a full RFC with API sketch, migration plan, and PoC link.
  3. Iterate on feedback and open implementation PRs behind a feature flag.

Would this addition be of interest to the maintainers? Happy to flesh out details if the direction sounds reasonable!

Scott

Activity

phipag (Contributor) commented on Jun 17, 2025

Hey @scottgerring, thanks for opening this issue. This is a really cool idea.

In Python, we already have something similar for the Metrics utility (https://docs.powertools.aws.dev/lambda/python/latest/core/metrics/datadog/). With the refactoring of the Metrics module in v2, we now also have the ability to support such providers for Metrics in Java. I believe that the provider pattern (with sub-modules) you are suggesting is really good in general and might be a valuable addition to Tracing.

Before we move on to an RFC stage, let me triage this with the team and get back to you with more details (by next week).

scottgerring (Contributor, Author) commented on Jun 18, 2025

Hey @phipag thanks for the quick turnaround!
Yes - do let me know - FWIW I'm out until the 3rd of July, so on my side there's no great rush. And we could certainly do this for metrics, too :D

scottgerring (Contributor, Author) commented on Jul 2, 2025

Hey @phipag thanks for the quick turnaround! Yes - do let me know - FWIW i'm out until the 3rd of July so on my side there's no great rush. And we could certainly do this for metrics, too :D

Hey @phipag, did you folks have a chat about this? 👼

leandrodamascena (Contributor) commented on Jul 2, 2025

Hi @scottgerring! Long time no see, my friend. I hope life and family are going well. By the way, thank you so much for all the hard work you put into the Java v2 branch before you left AWS. It paved the way for @phipag to get there.

Let me give you some additional details about the decisions we’ve made in the past that will drive other decisions in the future.

1/ Powertools added support for third-party observability providers

Two years ago, we decided to start supporting third-party o11y providers, primarily because we knew customers were using providers other than CloudWatch for a variety of reasons: because it's part of their foundation, because they want to have their data in one place, and for a thousand other reasons customers have. Our first bet was to add support for sending metrics directly to Datadog in Powertools Python. This decision was made mainly because I had some previous experience with the Datadog Extension/Forwarder, which helped me understand edge cases and the best ways to implement them. To be honest, I think this was a great implementation, and we see some customers using it and being happy with it.

2/ Datadog, NewRelic, HoneyComb, AppDynamics, and dozens of other providers

I remember a discussion we had back then: OK, now that we've added support for Datadog, customers are going to ask for NewRelic, HoneyComb, AppDynamics, and dozens of other observability tools - how do we support all of these integrations? And who is responsible for adding and maintaining this code? Sure, it's Powertools code, but we recognize that we may not have enough knowledge about each provider.

3/ Why not a standard?

At that time, OpenTelemetry was growing and becoming a standard for some types of workloads, but we must recognize that this new standard was still in the adoption phase: Traces was the first stable/GA API in OTEL, Metrics had only recently gone GA, and Logs/Events were still in the RC phase. Additionally, customers were still figuring out how to do things the right way with OTEL and Lambda, and o11y providers were still working to make OTLP endpoints stable and the recommended way to go.
And because of all this uncertainty around the OTEL ecosystem (note that I'm not questioning OTEL/OTLP itself here), we decided to make it provider-specific.

4/ Looking to the Future

Given this scenario and the changes we have seen in the last two years, we are evaluating new opportunities for the future. While we will maintain our standard user experience using CloudWatch + X-Ray, why not adopt OTEL in Powertools with the same experience we have today, make it the standard protocol we implement, and let customers decide where to send the data? If I am not mistaken, this is the recommended way for Datadog to send traces, and it may be the recommended way for other providers as well. I am not saying that this is the final decision, or that we are not open to implementing a direct integration with Datadog, but we are considering what is best for our customers in the future.

I would like to take this opportunity to talk to you about some of the challenges with OTEL and Lambda, and how you see this experience from the developer side when sending data to Datadog using OTEL + Lambda. Please let me know if you're open to scheduling a meeting to discuss this further.

Leandro

scottgerring (Contributor, Author) commented on Jul 3, 2025

Hey @leandrodamascena lovely to hear from you again! It's been a while :)

I think your take is fundamentally clear-eyed; if we can do it with OTel and support everything, that's great. The issue with OTel and Java in the past has been that the cold-start impact is significant. I'm not sure if this is still the case or not - but it's certainly worth validating! Here we'd also want to make sure that it's not introducing CRaC issues, as I believe @phipag is working on making pt-java play well here.

this is the recommended way for Datadog to send traces

Folks can use OTel/OTLP, or they can use the Datadog instrumentation.

Chucked a meeting in your calendars - let's chat next week!

phipag (Contributor) commented on Jul 5, 2025

Here we'd also want to make sure that it's not introducing CRaC issues as I believe @phipag is working on making pt-java play well here.

Absolutely, we already have a PR opened by a contributor #1861.

Is there anything specific you have in mind that would introduce CRaC issues?

scottgerring (Contributor, Author) commented on Jul 7, 2025

Here we'd also want to make sure that it's not introducing CRaC issues as I believe @phipag is working on making pt-java play well here.

Absolutely, we already have a PR opened by a contributor #1861.

Is there anything specific you have in mind that would introduce CRaC issues?

Nope! Just flagging that we should keep an eye on it :D

scottgerring (Contributor, Author) commented on Jul 7, 2025

I had a play with this this morning. I think we should:

  • use the OpenTelemetry API as the API in Powertools for tracing
    • we can then use io.opentelemetry.api.trace.Span within the aspect for the Powertools @Tracing annotation to push our extra bits out - ColdStart and ServiceName annotations, plus the request, response, and error data according to the env vars (see the sketch after this list)
    • we could also choose to deprecate @Tracing in favour of OTel's @WithSpan
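
A rough sketch of what that aspect body could look like against the OTel API (hypothetical: the class name, the aspect wiring, and the `service` parameter are simplified placeholders, not the real Powertools aspect):

    import io.opentelemetry.api.trace.Span;
    import org.aspectj.lang.ProceedingJoinPoint;

    // Hypothetical sketch: annotate the current span the way @Tracing annotates X-Ray subsegments today.
    public final class TracingAspectSketch {
        private static volatile boolean coldStart = true;

        public static Object around(ProceedingJoinPoint pjp, String service) throws Throwable {
            Span span = Span.current();               // span created by the handler / auto-instrumentation layer
            span.setAttribute("ColdStart", coldStart);
            span.setAttribute("Service", service);
            coldStart = false;
            try {
                return pjp.proceed();                 // run the @Tracing-annotated method
            } catch (Throwable t) {
                span.recordException(t);              // record the error; the real aspect would gate this on the capture-error env var
                throw t;
            }
        }
    }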

It seems that X-Ray is migrating to using OTel as its API as well, which makes things more interesting! It looks like you can do auto-instrumentation with Lambda too, which is the same sort of mechanism Datadog can use to make things easy.

The destination-specific configuration becomes glue, and working out how to handle this is going to be important:

  • In both the X-Ray and Datadog cases, we instrument using the OpenTelemetry API, then we can either use auto-instrumentation via layers, or explicitly add a bunch of provider-specific deps and explicitly sink via OTLP to e.g. localhost:4317 as provided by the agent layer
    • In the Datadog case, auto-instrumentation is desirable, because it is much simpler on the app side and you get a pile of libraries automatically instrumented for you (HTTP clients, SQL clients, frameworks, etc.)
    • In the X-Ray case I would expect this to hold too, but someone from AWS can have opinions here ;)
  • In the generic otel case, we need to additionally depend on the OTel SDK and take user configuration for how (gRPC / HTTP) and where (localhost:4317) to sink the data

So, something like this:

  • powertools-tracing - the core tracing module. This depends only on io.opentelemetry:opentelemetry-api - and this is all we need for Datadog and X-Ray if we encourage auto-instrumentation
  • powertools-tracing-otlp - depends on powertools-tracing, opentelemetry-sdk, and opentelemetry-exporter-otlp. Provides some way of letting the user specify their OTLP endpoint (env var?) - see the sketch after this list
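
As a strawman for that OTLP module, the wiring could look something like this (hypothetical: the env var name and class are placeholders; it just shows the opentelemetry-sdk + opentelemetry-exporter-otlp pieces the module would pull in):

    import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
    import io.opentelemetry.sdk.OpenTelemetrySdk;
    import io.opentelemetry.sdk.trace.SdkTracerProvider;
    import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

    // Hypothetical sketch of powertools-tracing-otlp: read the OTLP endpoint from an env var
    // (illustrative name) and register a global tracer provider that exports over gRPC.
    public final class OtlpTracingConfig {
        public static OpenTelemetrySdk init() {
            String endpoint = System.getenv()
                    .getOrDefault("POWERTOOLS_OTLP_ENDPOINT", "http://localhost:4317");
            OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                    .setEndpoint(endpoint)
                    .build();
            SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                    .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                    .build();
            return OpenTelemetrySdk.builder()
                    .setTracerProvider(tracerProvider)
                    .buildAndRegisterGlobal();
        }
    }

The gRPC vs. HTTP choice from the earlier bullet would be a second toggle on top of the endpoint.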

I have test code for the API-first way with Datadog and associated CDK setup I can share after we chat about what we want to do!

leandrodamascena (Contributor) commented on Jul 7, 2025

Hey @scottgerring, thanks for sharing some thoughts - this is exactly what I'm working on at this very moment, and I really appreciate it.

We basically have two distinct worlds in OTEL, and OTEL + Lambda in general: auto-instrumentation and manual instrumentation. I'm trying to find the right balance between the two and provide a good experience by taking the burden of the OTEL API - which is super complex - off the customer, while still allowing them to do whatever customizations they want. Finding that middle ground is hard LOL.

Our main idea is not to remove support for X-Ray or discontinue it, but to create a new provider that will offer the same experience while sending data to OTEL. We cannot remove support for X-Ray because, even though X-Ray supports data ingestion through the OTLP endpoint, the dependencies are different and the infrastructure changes (it needs a collector and a layer), and this could break existing customers.

I had a play with this this morning. I think we should:

  • use opentelemetry API as the API in powertools for tracing

    • we can then use io.opentelemetry.api.trace.Span within the aspect for powertools @Tracing annotation to push our extra bits out - ColdStart and ServiceName annotations, plus the request, response, and error data according to the env vars
    • we could also choose to deprecate @Tracing, in favour of OTel's @WithSpan

I have this code working in Python with auto-instrumentation using this: https://aws-otel.github.io/docs/getting-started/lambda/lambda-python. I use the existing span/segment, which is the Lambda handler span, and I just set the new attributes such as ColdStart and Service. This works perfectly with the extension because it also creates the root span for the Lambda handler, and then we can add more spans/attributes/context or whatever we want to it. But for customers who don't use this extension, we need to create the root span on their behalf, and I still don't know exactly where to create it to get a meaningful view of all the spans.

    # Method on the tracer provider class; assumes `functools`, `logger`,
    # `trace` (from opentelemetry), `Callable`/`Any` from typing, and the
    # `_is_cold_start()` helper are available in the surrounding module.
    def capture_lambda_handler(
        self,
        lambda_handler: Callable | None = None,
    ) -> Callable[..., Any]:

        @functools.wraps(lambda_handler)
        def decorate(event, context, **kwargs):
            # Reuse the current span (the Lambda handler span created by the layer)
            span = trace.get_current_span()
            lambda_handler_name = lambda_handler.__name__
            try:
                logger.debug("Calling lambda handler")
                response = lambda_handler(event, context, **kwargs)
                logger.debug("Received lambda handler response successfully")
            except Exception:
                logger.exception(f"Exception received from {lambda_handler_name}")
                raise
            finally:
                # Annotate the handler span with the cold start flag regardless of outcome
                cold_start = _is_cold_start()
                logger.debug("Annotating cold start")
                span.set_attribute(key="ColdStart", value=cold_start)

            return response

        return decorate

It seems that x-ray is migrating to using OTel as its API as well, which makes thing more interesting! It looks like you can do auto-instrumentation with Lambda too, which is the same sort of mechanism Datadog can use to make things easy.

The destination-specific configuration becomes glue, and working out how to handle this is going to be important:

This is hard to say, because the destination matters if customers are not using collectors inside Lambda. I agree that this should be the default experience: Lambda code -> send to collector endpoint -> destination. But this is sometimes not what customers are doing: some export the JSON created by the OTEL SDK and then aggregate it using Kinesis, for example, and others export directly from their code with a sync HTTP call + batching.

I come back here to my initial point of the discussion: auto-instrumentation or manual instrumentation. If we decide to support manual instrumentation - which for me makes 100% sense - we need to allow the customer to configure the tracer instance with the exporter they want, something like this:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Set up the tracer with Console exporter
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(tracer_provider)

# Get a tracer for your module
tracer = trace.get_tracer(__name__)

  • In both x-ray and Datadog cases, we instrument using the OpenTelemetry API, then we can either use auto-instrumentation via layers, or explicitly add a bunch of provider specific deps and explicitly sink via OTLP to e.g. localhost:4317 as provided by the agent layer

I'm not sure. I was reading the Datadog documentation and it seems like we need to instrument the code to send the traces. But of course I might be missing something and not understanding this. I need to better understand how we can support most providers with the same API/experience.

  • In the Datadog case, auto-instrumentation is desirable, because it is much simpler on the app side and you get a pile of libraries automatically instrumented for you (HTTP clients, SQL clients, frameworks, etc.)

Both the Datadog layer and the ADOT layer work the same way: they auto-instrument third-party libraries. The ADOT layer also gives us the exceptions, stack traces, and attributes for free, and I imagine Datadog does too.

  • In the X-Ray case I would expect this to hold too, but someone from AWS can have opinions here ;)
  • In the generic otel case, we need to additionally depend on the OTel SDK and take user configuration for how (gRPC / HTTP) and where (localhost:4317) to sink the data

So, something like this:

  • powertools-tracing - the core tracing module. This depends only on io.opentelemetry.opentelemetry-api - and this is all we need for Datadog and X-Ray if we encourage auto-instrumentation
  • powertools-tracing-otlp - depends on powertools-tracing and opentelemetry-sdk and opentelemetry-exporter-otlp . Provides some way of letting the user specifying their OTLP endpoints (env var?)

Our idea is not to change the current tracing module, but to create a new one called powertools-tracing-otlp with specific dependencies, or something like that. This is another point we need to think about. We don't necessarily need to bring these dependencies in if customers are using the ADOT layer, because it brings them in the layer and we can help the customer reduce the size of the Lambda package. But yes, customers must have those dependencies available.

I have test code for the API-first way with Datadog and associated CDK setup I can share after we chat about what we want to do!

Nice, this is amazing! We can talk more about this in our meeting.

As always, thank you so much for sharing your knowledge and helping us get there. I'm super excited about this integration and we definitely intend to support Datadog from day one.

locked and limited conversation to collaborators on Jul 7, 2025
converted this issue into a discussion #1919 on Jul 7, 2025