Commit 2ba11bc

trentm and felixbarny authored
Spec adding span links for instrumentation of messaging systems (#616)
- Specify that span links should be used on "messaging" spans and on "messaging" transactions that cover a batch of messages, when messages include trace context.
- Drop special-casing of single-message SQS/SNS Lambda triggers.
- Specify a guard/limit of 1000 messages to check for trace-context.
- Starts documenting message metadata mechanisms to use for propagating trace-context, for those systems that provide a facility.

Closes: #606
Co-authored-by: Felix Barnsteiner <[email protected]>
1 parent ad59bb5 commit 2ba11bc

4 files changed, +73 -38 lines changed

.gitignore

+1
Original file line number | Diff line number | Diff line change
@@ -0,0 +1 @@
1+
.DS_Store

specs/agents/tracing-instrumentation-aws-lambda.md

+31-26
@@ -133,31 +133,31 @@ In version 2.0, the `${event.requestContext.routeKey}` can have the format `GET
133133
If `use_path_as_transaction_name` is applicable and set to `true`, use `${event.requestContext.http.method} ${event.requestContext.http.path}` as the transaction name.
134134

135135
### SQS / SNS
136-
Lambda functions that are triggered by SQS (or SNS) accept an `event` input that may contain one or more SQS / SNS messages in the `event.records` array. All message-related context information (including the `traceparent`) is encoded in the individual message attributes (if at all). We cannot (automatically) wrap the processing of the individual messages that are sent as a batch of messages with a single `event`.
137136

138-
Thus, in case that an SQS / SNS `event` contains **exactly one** SQS / SNS message, the agents must apply the following, messaging-specific retrieval of information. Otherwise, the agents should apply the [Generic Lambda Instrumentation](generic-lambda-instrumentation) as described above.
137+
Lambda functions that are triggered by SQS (or SNS) accept an `event` input that may contain one or more SQS / SNS messages in the `event.records` array. All message-related context information (including the `traceparent`) is encoded in the individual message attributes (if at all).
139138

140-
With only one message in `event.records`, the agents can use the single SQS / SNS `record` to retrieve the `traceparent` and `tracestate` from message attributes and use it for starting the lambda transaction.
139+
#### SQS
141140

142-
In addition the following fields should be set for Lambda functions triggered by SQS or SNS:
141+
Agents SHOULD check each record, [up to a maximum of 1000](tracing-instrumentation-messaging.md#receiving-trace-context),
142+
for a `traceparent` message attribute, and create a [span link](span-links.md)
143+
on the transaction for each message with trace-context.
144+
145+
In addition to [the generic Lambda transaction fields](#generic-lambda-instrumentation)
146+
the following fields SHOULD be set. The use of `records[0]` below depends on the
147+
understanding from [AWS Lambda SQS docs](https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html)
148+
that a trigger invocation can only include messages from *one* queue.
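The record check described above can be sketched as follows. This is a minimal illustration, not the agents' implementation: the function name, the constant name, and the `{"trace_id", "span_id"}` link shape are assumptions made for the example, and it assumes the SQS event payload exposes its messages under a `Records` key with `messageAttributes.<name>.stringValue` entries. The `traceparent` check is a simplified format match rather than a full W3C validation.

```python
import re

# Guard from the messaging spec: check at most this many records (name is hypothetical).
MAX_RECORDS_TO_CHECK = 1000

# Simplified W3C traceparent shape: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$"
)


def span_links_from_sqs_event(event):
    """Return one span link per SQS record that carries a valid-looking traceparent."""
    links = []
    records = event.get("Records") or []
    for record in records[:MAX_RECORDS_TO_CHECK]:
        attrs = record.get("messageAttributes") or {}
        attr = attrs.get("traceparent")
        if not attr:
            continue  # message was sent without trace-context
        match = _TRACEPARENT_RE.match((attr.get("stringValue") or "").strip())
        if match:
            # Assumed link shape for illustration: trace id plus parent span id.
            links.append({"trace_id": match.group(1), "span_id": match.group(2)})
    return links
```

Records without a `traceparent` attribute, or with a malformed value, are skipped, so the transaction only receives links for messages that actually carry trace-context.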
143149

144-
#### SQS
145150
Field | Value | Description | Source
146151
--- | --- | --- | ---
147152
`type` | `messaging`| Transaction type: constant value for SQS. | -
148-
`name` | e.g. `RECEIVE SomeQueue` | Transaction name: Follow the [messaging spec](./tracing-instrumentation-messaging.md) for transaction naming. | Simple queue name can be derived from the 6th segment of `record.eventSourceArn`.
153+
`name` | e.g. `RECEIVE SomeQueue` | Transaction name: Follow the [messaging spec](./tracing-instrumentation-messaging.md) for transaction naming. | Simple queue name can be derived from the 6th segment of `records[0].eventSourceArn`.
149154
`faas.trigger.type` | `pubsub` | Constant value for message based triggers | -
150-
`faas.trigger.request_id` | e.g. `someMessageId` | SQS message ID. | `record.messageId`
151-
`context.service.origin.name` | e.g. `my-queue` | SQS queue name | Simple queue name can be derived from the 6th segment of `record.eventSourceArn`.
152-
`context.service.origin.id` | e.g. `arn:aws:sqs:us-east-2:123456789012:my-queue` | SQS queue ARN. | `record.eventSourceArn`
155+
`context.service.origin.name` | e.g. `my-queue` | SQS queue name | Simple queue name can be derived from the 6th segment of `records[0].eventSourceArn`.
156+
`context.service.origin.id` | e.g. `arn:aws:sqs:us-east-2:123456789012:my-queue` | SQS queue ARN. | `records[0].eventSourceArn`
153157
`context.cloud.origin.service.name` | `sqs` | Fix value for SQS. | -
154-
`context.cloud.origin.region` | e.g. `us-east-1` | SQS queue region. | `record.awsRegion`
155-
`context.cloud.origin.account.id` | e.g. `12345678912` | Account ID of the SQS queue. | Parse account segment (5th) from `record.eventSourceArn`.
158+
`context.cloud.origin.region` | e.g. `us-east-1` | SQS queue region. | `records[0].awsRegion`
159+
`context.cloud.origin.account.id` | e.g. `12345678912` | Account ID of the SQS queue. | Parse account segment (5th) from `records[0].eventSourceArn`.
156160
`context.cloud.origin.provider` | `aws` | Use `aws` as fix value. | -
157-
`context.message.queue.name` | e.g. `my-queue` | The SQS queue name. | The 6th segment of `record.eventSourceArn`
158-
`context.message.age.ms` | e.g. `3298` | Age of the message in milliseconds. `current_time` - `SentTimestamp`, if SentTimestamp is available. | Message attribute with key `SentTimestamp`.
159-
`context.message.body` | - | The message body. Should only be captured if body capturing is enabled in the configuration. | `record.body`
160-
`context.message.headers` | - | The message attributes. Should only be captured, if capturing headers is enabled in the configuration. | Use the `stringValue` of entries in `record.messageAttributes` with `dataType == "String" || dataType == "Number"`. [Other attribute types](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-message-metadata.html#message-attribute-data-types) are ignored.
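The "Nth segment" wording in the Source column counts colon-separated ARN segments starting at 1. A small sketch of that parsing, assuming the standard `arn:partition:service:region:account-id:resource` layout; the function name and the returned keys are hypothetical and only used for illustration:

```python
def parse_origin_from_arn(arn):
    """Split an ARN into the segments referenced by the field table.

    Segments are 1-based in the spec text: 3rd = service, 4th = region,
    5th = account id, 6th = resource (queue or topic name).
    """
    # Limit the split so a resource containing ':' stays in one piece.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a recognizable ARN: {arn!r}")
    return {
        "service": parts[2],
        "region": parts[3],
        "account_id": parts[4],
        "name": parts[5],
    }
```

For the example ARN in the table, `parse_origin_from_arn("arn:aws:sqs:us-east-2:123456789012:my-queue")` yields the queue name `my-queue`, region `us-east-2`, and account id `123456789012`.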
161161

162162
An example SQS event:
163163

@@ -206,23 +206,28 @@ An example SQS event:
206206

207207
#### SNS
208208

209+
Agents SHOULD check each record, [up to a maximum of 1000](tracing-instrumentation-messaging.md#receiving-trace-context),
210+
for a `traceparent` message attribute (`Records.*.Sns.MessageAttributes`), and
211+
create a [span link](span-links.md) on the transaction for each message with
212+
trace-context.
213+
214+
In addition to [the generic Lambda transaction fields](#generic-lambda-instrumentation)
215+
the following fields should be set. The use of `records[0]` is based on the
216+
understanding, from ["all notification messages will contain a single published
217+
message"](https://aws.amazon.com/sns/faqs/#Reliability), that an SNS trigger
218+
will only ever have a single record.
219+
209220
Field | Value | Description | Source
210221
--- | --- | --- | ---
211222
`type` | `messaging`| Transaction type: constant value for SNS. | -
212-
`name` | e.g. `RECEIVE SomeTopic` | Transaction name: Follow the [messaging spec](./tracing-instrumentation-messaging.md) for transaction naming. | Simple topic name can be derived from the 6th segment of `record.sns.topicArn`.
223+
`name` | e.g. `RECEIVE SomeTopic` | Transaction name: Follow the [messaging spec](./tracing-instrumentation-messaging.md) for transaction naming. | Simple topic name can be derived from the 6th segment of `records[0].sns.topicArn`.
213224
`faas.trigger.type` | `pubsub` | Constant value for message based triggers | -
214-
`faas.trigger.reuqest_id` | e.g. `someMessageId` | SNS message ID. | `record.sns.messageId`
215-
`context.service.origin.name` | e.g. `my-topic` | SNS topic name | Simple topic name can be derived from the 6th segment of `record.sns.topicArn`.
216-
`context.service.origin.id` | e.g. `arn:aws:sns:us-east-2:123456789012:my-topic` | SNS topic ARN. | `record.sns.topicArn`
217-
`context.service.origin.version` | e.g. `2.1` | SNS event version | `record.eventVersion`
225+
`context.service.origin.name` | e.g. `my-topic` | SNS topic name | Simple topic name can be derived from the 6th segment of `records[0].sns.topicArn`.
226+
`context.service.origin.id` | e.g. `arn:aws:sns:us-east-2:123456789012:my-topic` | SNS topic ARN. | `records[0].sns.topicArn`
218227
`context.cloud.origin.service.name` | `sns` | Fix value for SNS. | -
219-
`context.cloud.origin.region` | e.g. `us-east-1` | SNS topic region. | Parse region segment (4th) from `record.sns.topicArn`.
220-
`context.cloud.origin.account.id` | e.g. `12345678912` | Account ID of the SNS topic. | Parse account segment (5th) from `record.sns.topicArn`.
228+
`context.cloud.origin.region` | e.g. `us-east-1` | SNS topic region. | Parse region segment (4th) from `records[0].sns.topicArn`.
229+
`context.cloud.origin.account.id` | e.g. `12345678912` | Account ID of the SNS topic. | Parse account segment (5th) from `records[0].sns.topicArn`.
221230
`context.cloud.origin.provider` | `aws` | Use `aws` as fix value. | -
222-
`context.message.queue.name` | e.g. `my-topic` | The SNS topic name. | The 6th segment of `record.sns.topicArn`
223-
`context.message.age.ms` | e.g. `3298` | Age of the message in milliseconds. `current_time` - `snsTimestamp`. | `record.sns.timestamp`
224-
`context.message.body` | - | The message body. Should only be captured if body capturing is enabled in the configuration. | `record.sns.message`
225-
`context.message.headers` | - | The message attributes. Should only be captured, if capturing headers is enabled in the configuration. | Use the `Value` of entries in `record.Sns.MessageAttributes` with `Type == "String"`. [Other attribute types](https://docs.aws.amazon.com/sns/latest/dg/sns-message-attributes.html#SNSMessageAttributes.DataTypes) are ignored. |
226231

227232
An example SNS event:
228233

specs/agents/tracing-instrumentation-aws.md

+1
@@ -2,6 +2,7 @@
22

33
We describe how to instrument some of AWS' services in this document.
44
Some of the services can use existing specs. When there are differences or additions, they have been noted below.
5+
The spec for [instrumenting AWS Lambda](tracing-instrumentation-aws-lambda.md) is in a separate document.
56

67
### S3 (Simple Storage Service)
78

specs/agents/tracing-instrumentation-messaging.md

+40-12
@@ -15,9 +15,9 @@ occurring within a traced transaction.
1515

1616
![publish](uml/publish.svg)
1717

18-
### Trace Context
18+
### Sending Trace Context
1919

20-
if the messaging system exposes an API for sending additional message properties/metadata, it
20+
If the messaging system exposes a mechanism for sending additional [message metadata](#message-metadata), it
2121
SHOULD be used to propagate the [Trace Context](https://www.w3.org/TR/trace-context/) of the
2222
`messaging` span, to continue the [distributed trace](tracing-distributed-tracing.md).
2323

@@ -68,19 +68,28 @@ If creating a transaction for the processing of each message in a batch is not p
6868
the agent SHOULD create a single `messaging` transaction for the processing of the batch
6969
of messages.
7070

71-
### Trace Context
71+
### Receiving Trace Context
7272

73-
When message reception is captured as a `messaging` transaction, if the messaging
74-
system exposes an API for sending additional message properties/metadata, it
75-
SHOULD be checked for the presence of [Trace Context](https://www.w3.org/TR/trace-context/).
76-
If Trace Context is present, it SHOULD be propagated to the `messaging` transaction
73+
This section applies to messaging systems that support [message metadata](#message-metadata).
74+
The instrumentation of message reception SHOULD check message metadata for the
75+
presence of [Trace Context](https://www.w3.org/TR/trace-context/).
76+
77+
When single message reception is captured as a `messaging` transaction,
78+
and a Trace Context is present, it SHOULD be used as the parent of the `messaging` transaction
7779
to continue the [distributed trace](tracing-distributed-tracing.md).
7880

79-
If a batch of messages is processed in a single `messaging` transaction, it may be
80-
possible that each message in the batch has its own Trace Context. In this
81-
scenario, it is not currently possible to propagate a Trace Context to the `messaging`
82-
transaction, since there a multiple contexts present. It may be possible to capture
83-
these in future through [span links](https://github.com/elastic/apm/issues/122).
81+
Otherwise (a single message being captured as a `messaging` span, or a batch
82+
of messages is processed in a single `messaging` transaction or span), a
83+
[span link](span-links.md) SHOULD be added for each message with Trace Context.
84+
This includes the case where the size of the batch of received messages is one.
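The parent-versus-link decision in the two paragraphs above can be summarized in a short sketch. Everything here is hypothetical naming for illustration (`plan_trace_context`, the `kind` and `single_message_api` parameters, the returned dict); it assumes the caller has already extracted one trace-context value per message, with `None` for messages that carry none.

```python
MAX_MESSAGES_TO_CHECK = 1000  # guard on agent overhead for very large batches


def plan_trace_context(kind, trace_contexts, single_message_api):
    """Decide how received trace-contexts are attached.

    kind: "transaction" or "span" (what the reception is captured as).
    trace_contexts: one entry per received message, None if absent.
    single_message_api: True when the receive API delivers exactly one
        message by design, as opposed to a batch that happens to have size one.
    """
    present = [c for c in trace_contexts[:MAX_MESSAGES_TO_CHECK] if c is not None]
    if kind == "transaction" and single_message_api:
        # Single-message reception captured as a transaction: continue the trace.
        return {"parent": present[0] if present else None, "links": []}
    # Spans, and batch transactions (including batches of size one): link instead.
    return {"parent": None, "links": present}
```

Note that a batch of size one still takes the span-link branch; only a genuinely single-message reception captured as a transaction uses the trace-context as its parent.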
85+
86+
The number of events processed for trace context SHOULD be limited to a maximum
87+
of 1000, as a guard on agent overhead for extremely large batches of events.
88+
(For example, SQS's maximum batch size is 10000 messages. The maximum number of
89+
span links that could be sent for a single transaction/span to APM server with
90+
the default configuration is approximately 4000: 307200 bytes
91+
[APM server `max_event_size` default](https://www.elastic.co/guide/en/apm/server/current/configuration-process.html#max_event_size)
92+
/ 77 bytes per serialized span link.)
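The "approximately 4000" figure can be reproduced with a few lines of arithmetic. The 77-byte size is taken from the text above; the accounting for it shown here (a compact JSON object with a 32-hex-digit `trace_id` and a 16-hex-digit `span_id`, plus one separating comma) is one plausible reading and an assumption of this sketch, as are the variable names.

```python
import json

MAX_EVENT_SIZE_BYTES = 307200  # APM server `max_event_size` default, per the text

# Assumed serialized form of one span link, compact JSON without whitespace.
sample_link = {"trace_id": "0" * 32, "span_id": "0" * 16}
link_bytes = len(json.dumps(sample_link, separators=(",", ":"))) + 1  # +1 for the list comma

# Upper bound on links per event, ignoring every other field of the event.
max_links = MAX_EVENT_SIZE_BYTES // link_bytes

print(link_bytes, max_links)
```

This is an upper bound only: a real transaction or span carries other fields that also count toward the event size, which is one reason the 1000-message guard sits well below it.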
8493

8594
### Examples
8695

@@ -188,6 +197,25 @@ queues/topics/exchanges will be ignored.
188197
| Central config | `true` |
189198

190199

200+
### Message metadata
201+
202+
To support distributed tracing with automatic instrumentation, the messaging
203+
system must provide a mechanism to add metadata/properties/attributes to
204+
individual messages, akin to HTTP headers. If an APM agent supports
205+
trace-context for a given messaging system, it MUST use the following mechanisms
206+
so that cross-language tracing works:
207+
208+
| Messaging system | Mechanism |
209+
| ---------------------- | --------- |
210+
| Azure Queue | No mechanism |
211+
| Azure Service Bus | Possibly `Diagnostic-Id` [application property](https://docs.microsoft.com/en-us/dotnet/api/azure.messaging.servicebus.servicebusmessage.applicationproperties). See [this doc](https://docs.microsoft.com/en-us/azure/service-bus-messaging/service-bus-end-to-end-tracing). |
212+
| Java Messaging Service | [Message Properties](https://docs.oracle.com/javaee/7/api/javax/jms/Message.html) |
213+
| Apache Kafka | [Kafka Record headers](https://cwiki.apache.org/confluence/display/KAFKA/KIP-82%2B-%2BAdd%2BRecord%2BHeaders) using [binary trace context fields](tracing-distributed-tracing.md#binary-fields) |
214+
| RabbitMQ | [Message Attributes](https://www.rabbitmq.com/tutorials/amqp-concepts.html#messages) (a.k.a. `AMQP.BasicProperties` in [Java API](https://www.rabbitmq.com/api-guide.html)) |
215+
| AWS SQS | [SQS message attributes](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-message-metadata.html), if within message attribute limits. See [AWS instrumentation](tracing-instrumentation-aws.md). |
216+
| AWS SNS | [SNS message attributes](https://docs.aws.amazon.com/sns/latest/dg/sns-message-attributes.html), if within message attribute limits. See [AWS instrumentation](tracing-instrumentation-aws.md). |
217+
218+
191219
### AWS messaging systems
192220

193221
The instrumentation of [SQS](tracing-instrumentation-aws.md#sqs-simple-queue-service) and [SNS](tracing-instrumentation-aws.md#sns-aws-simple-notification-service) services generally follow this spec, with some nuances specified in the linked specs.
