Discrepancy in network TX/RX rate as reported by AWS ECS insights and task metadata endpoint #4505

Open
sergei-ivanov opened this issue Feb 15, 2025 · 0 comments

What we have

We run our workloads on a Fargate cluster; the OS/architecture is Linux/X86_64, the platform version is 1.4.0, and the network mode is awsvpc.
We collect ECS container insights and persist them to an S3 bucket for long-term storage and analytics.
We also use Datadog for near-real-time monitoring and alerting. The Datadog agent runs as a sidecar in all tasks and collects container runtime data from the ECS task metadata endpoint.
As we scaled out, our bill for ECS insights increased dramatically, so we have started replacing them with an OTEL collector. It also runs as a sidecar; its pipeline uses awsecscontainermetricsreceiver to scrape the ECS task metadata endpoint and s3exporter to dump the data into an S3 bucket in the same format as ECS container insights.
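For context, this is roughly what the sidecars poll. A minimal sketch (not our actual pipeline), assuming the documented v4 task metadata endpoint exposed via ECS_CONTAINER_METADATA_URI_V4; the exact shape of the stats payload may differ between platform versions:

```python
# Sketch: dump the per-container network counters from the v4 task metadata endpoint.
import json
import os
import urllib.request

base = os.environ["ECS_CONTAINER_METADATA_URI_V4"]

with urllib.request.urlopen(f"{base}/task/stats") as resp:
    stats = json.load(resp)  # Docker-style stats keyed by container runtime ID

for container_id, s in stats.items():
    # Cumulative byte counters per network interface (may be absent for some containers)
    for iface, net in (s.get("networks") or {}).items():
        print(container_id[:12], iface, net.get("rx_bytes"), net.get("tx_bytes"))
```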

The problem

Comparing the data, we realised that although the CPU and memory statistics show no significant difference between OTEL and ECS insights, the network TX/RX rates reported by OTEL are about 3x higher than those reported by ECS insights. We also used Datadog as a third reference source, and Datadog agrees more with OTEL than with ECS insights. The Datadog data is more granular, but its TX/RX rate values are of roughly the same magnitude as the OTEL values and are consistently higher than ECS insights.
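By "rate" I mean bytes per second derived from the cumulative counters above: the delta between two samples divided by the elapsed time. A trivial sketch of that calculation (illustrative numbers only, not any agent's actual code):

```python
def network_rate(prev, curr, elapsed_s):
    """prev and curr are (rx_bytes, tx_bytes) cumulative counters taken elapsed_s apart."""
    rx_rate = (curr[0] - prev[0]) / elapsed_s
    tx_rate = (curr[1] - prev[1]) / elapsed_s
    return rx_rate, tx_rate  # bytes per second

# e.g. two samples taken 20 seconds apart
print(network_rate((500_000, 1_000_000), (2_500_000, 4_000_000), 20.0))
# -> (100000.0, 150000.0)
```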

Other observations

I looked through the code of the CloudWatch agent, the ECS agent, and the OTEL awsecscontainermetricsreceiver, hoping to uncover something suspicious.

We noticed long ago that the network TX/RX rates are the same for all containers within a task. I believe I saw confirmation of that in the ECS agent code, where it uses task-level metrics to populate the container-level metrics when the network mode is awsvpc.

I also noticed code that splits the task-level network stats equally between the containers. Maybe that explains the 3x difference somewhere (we have 3 containers per task).
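A toy illustration of the mismatch I suspect (this is my reading of the behaviour, not the actual agent code, and the numbers and container names are made up): if one consumer divides the task-level figure evenly across containers while another attributes the full task-level figure to each container, the two disagree by exactly the container count.

```python
# Toy illustration of the suspected factor-of-n mismatch, with n = 3 containers per task.
task_tx_bytes_per_s = 300_000
containers = ["app", "datadog-agent", "otel-collector"]  # hypothetical 3-container task

# Interpretation A: split the task-level figure evenly across containers.
split = {c: task_tx_bytes_per_s / len(containers) for c in containers}

# Interpretation B: report the task-level figure for every container.
full = {c: task_tx_bytes_per_s for c in containers}

for c in containers:
    print(c, split[c], full[c], full[c] / split[c])  # ratio is 3.0 for each container
```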

I also noticed that the CloudWatch agent uses the OTEL receiver's code behind the scenes, so in theory it should be sending the same data to CloudWatch, if that is how it works on Fargate. In reality, the data in CloudWatch logs and metrics for ECS insights is still different.

There seems to be a problem with the data, whether it is a race condition when multiple components poll the same metadata endpoint, or a different approach to interpreting the data. Since it is the ECS agent that publishes the data on the task metadata endpoint, I see it as the likely culprit. Please investigate, because this adversely affects our analytics and makes us doubt the ECS insights data.

Sample data

I extracted one hour of insights data for one specific container for comparison. I used queries/selectors to drill down to the specific task family, task ID, and container.

Network TX rate

ECS container insights: (screenshot attached)

OTEL insights: (screenshot attached)

Datadog insights: (screenshot attached)

Network RX rate

ECS container insights: (screenshot attached)

OTEL insights: (screenshot attached)

Datadog insights: (screenshot attached)

Raw data

Extracted from CloudWatch log group for ECS insights:

network - ecs insights.csv

Extracted from data persisted in S3 for OTEL insights:

network - otel collector.csv
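
The comparison itself was done by hand; a rough sketch of it is below. The column names ("NetworkTxBytes", "NetworkTxRate") are placeholders, not the real headers of the attached files.

```python
# Sketch: compare median TX rates from the two attached CSVs (column names are assumed).
import csv
import statistics

def tx_rates(path, column):
    with open(path, newline="") as f:
        return [float(row[column]) for row in csv.DictReader(f) if row.get(column)]

ecs = tx_rates("network - ecs insights.csv", "NetworkTxBytes")     # assumed column name
otel = tx_rates("network - otel collector.csv", "NetworkTxRate")   # assumed column name

print("ECS insights median:", statistics.median(ecs))
print("OTEL median:        ", statistics.median(otel))
print("ratio:", statistics.median(otel) / statistics.median(ecs))  # ~3x in our data
```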
