What we have
We run our workloads on a Fargate cluster; the OS/architecture is Linux/X86_64, the platform version is 1.4.0, and the network mode is awsvpc.
We collect ECS container insights data and persist it to an S3 bucket for long-term storage and analytics.
We also use Datadog for near-real-time monitoring and alerting. The Datadog agent runs as a sidecar in every task and collects container runtime data from the ECS task metadata endpoint.
As we scaled out, our bill for ECS container insights increased dramatically, so we started replacing them with an OTEL collector. The collector also runs as a sidecar; its pipeline uses the awsecscontainermetricsreceiver to scrape the ECS task metadata endpoint and the s3exporter to write the data to an S3 bucket in the same format as ECS container insights.
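For context, here is roughly what any of these sidecars sees when it polls the endpoint. This is a minimal sketch rather than our actual pipeline; the /task/stats path comes from the v4 task metadata endpoint, and the networks / rx_bytes / tx_bytes field names are my assumption about the Docker-stats-style payload, so adjust as needed.

```python
# Minimal sketch: poll the ECS task metadata endpoint (v4) and derive
# per-container network rates the way a sidecar collector would.
# Field names (networks -> <iface> -> rx_bytes / tx_bytes) are assumed.
import json
import os
import time
import urllib.request

METADATA_URI = os.environ["ECS_CONTAINER_METADATA_URI_V4"]

def task_network_counters():
    """Return {container_id: (rx_bytes, tx_bytes)} from /task/stats."""
    with urllib.request.urlopen(f"{METADATA_URI}/task/stats") as resp:
        stats = json.load(resp)
    counters = {}
    for container_id, s in stats.items():
        if not s:  # stats can be null for containers that just started or stopped
            continue
        networks = s.get("networks") or {}
        rx = sum(iface.get("rx_bytes", 0) for iface in networks.values())
        tx = sum(iface.get("tx_bytes", 0) for iface in networks.values())
        counters[container_id] = (rx, tx)
    return counters

if __name__ == "__main__":
    interval = 20  # roughly the order of magnitude of the collectors' poll intervals
    before = task_network_counters()
    time.sleep(interval)
    after = task_network_counters()
    for cid, (rx1, tx1) in after.items():
        rx0, tx0 = before.get(cid, (rx1, tx1))
        print(cid[:12],
              "rx B/s:", (rx1 - rx0) / interval,
              "tx B/s:", (tx1 - tx0) / interval)
```

Under awsvpc all containers in the task share the task ENI, so the interesting question is whether each consumer attributes the task-level counters to every container or divides them up.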
The problem
Comparing the data, we realised that although the CPU and memory statistics show no significant difference between OTEL and ECS insights, the network TX/RX rates reported by OTEL are about 3x higher than those reported by ECS insights. We also used Datadog as a third reference source, and it agrees more with OTEL than with ECS insights: the Datadog data is more granular, but its TX/RX rates are of roughly the same magnitude as the OTEL values and are consistently higher than the ECS insights values.
Other observations
I looked through the code of the CloudWatch agent, the ECS agent, and the OTEL awsecscontainermetricsreceiver, hoping to uncover something suspicious.
We noticed long ago that the network TX/RX rates are the same for all containers within a task. I believe I saw confirmation of that in the ECS agent code, where task-level metrics are used to populate the container-level metrics when the network mode is awsvpc.
I also noticed code that splits the task-level network stats equally between the containers. Maybe that explains the 3x difference somewhere, since we have 3 containers per task.
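If that is indeed what happens, the arithmetic lines up with what we observe. Here is a toy sketch (the task-level rate and container names are made up, and the split-vs-full-rate behaviour is my reading of the code, not something I have confirmed):

```python
# Back-of-the-envelope sketch of the suspected discrepancy, assuming
# insights divides the task-level rate equally between containers while
# the OTEL receiver reports the full task-level rate for each container.
# The task rate and container names are made-up placeholders.
TASK_TX_BYTES_PER_SEC = 1_500_000
CONTAINERS = ["app", "datadog-agent", "otel-collector"]

insights_style = {c: TASK_TX_BYTES_PER_SEC / len(CONTAINERS) for c in CONTAINERS}
otel_style = {c: TASK_TX_BYTES_PER_SEC for c in CONTAINERS}

for c in CONTAINERS:
    print(f"{c}: insights={insights_style[c]:.0f} B/s, "
          f"otel={otel_style[c]:.0f} B/s, "
          f"ratio={otel_style[c] / insights_style[c]:.1f}")  # ratio == 3.0
```

With 3 containers per task, dividing the task-level rate by 3 on one side while attributing the full rate to each container on the other gives exactly the 3x gap we see per container.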
I also noticed that the CloudWatch agent uses the OTEL receiver's code behind the scenes, so in theory it should send the same data to CloudWatch, if that is how it works on Fargate. In reality, the data in CloudWatch logs and metrics for ECS insights is still different.
There seems to be a problem with the data, whether it is a race condition when multiple components poll the same metadata endpoint or a different approach to interpreting it. Since it is the ECS agent that publishes the data on the task metadata endpoint, I see it as the potential culprit. Please investigate, because this adversely affects our analytics and makes us doubt the ECS insights data.
Sample data
I extracted one hour of insights data for one specific container for comparison, using queries/selectors to drill down to the specific task family, task ID, and container.
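For reproducibility, here is roughly how I compared the two exports. The column names and filter values below are placeholders rather than the exact headers and IDs in the attached CSVs; the file names match the attachments under "Raw data".

```python
# Rough sketch of the comparison over the exported CSVs. Column names
# (task_family, task_id, container_name, tx_bytes_per_sec) and the filter
# values are placeholders, not the exact headers/IDs in the attachments.
import csv
from statistics import mean

def tx_series(path, family, task_id, container):
    rows = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if (row["task_family"] == family
                    and row["task_id"] == task_id
                    and row["container_name"] == container):
                rows.append(float(row["tx_bytes_per_sec"]))
    return rows

ecs = tx_series("network - ecs insights.csv", "my-task-family", "my-task-id", "app")
otel = tx_series("network - otel collector.csv", "my-task-family", "my-task-id", "app")
print("mean TX, ecs insights:", mean(ecs))
print("mean TX, otel:        ", mean(otel))
print("ratio otel / insights:", mean(otel) / mean(ecs))
```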
Network TX rate
ECS container insights:
OTEL insights:
Datadog insights:
Network RX rate
ECS container insights:
OTEL insights:
Datadog insights:
Raw data
Extracted from CloudWatch log group for ECS insights:
network - ecs insights.csv
Extracted from data persisted in S3 for OTEL insights:
network - otel collector.csv