Discrepancy in network TX/RX rate as reported by AWS ECS insights and task metadata endpoint #4505

Open
sergei-ivanov opened this issue Feb 15, 2025 · 0 comments

What we have

We run our workloads on a Fargate cluster; the OS/architecture is Linux/X86_64, the platform version is 1.4.0, and the network mode is awsvpc.
We collect ECS container insights and persist them to an S3 bucket for long-term storage and analytics.
We also use Datadog for near-real-time monitoring and alerting. The Datadog agent runs as a sidecar in all tasks and collects container runtime data from the ECS task metadata endpoint.
As we scaled out, our bill for ECS insights increased dramatically, so we have started replacing them with an OTEL collector. It also runs as a sidecar; its pipeline uses awsecscontainermetricsreceiver to scrape the ECS task metadata endpoint and s3exporter to dump the data into an S3 bucket in the same format as ECS container insights.
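For context, this is roughly what the sidecars poll. A minimal sketch (not our actual pipeline), assuming the documented v4 task metadata endpoint exposed via ECS_CONTAINER_METADATA_URI_V4; the exact shape of the stats payload may differ between platform versions:

```python
# Sketch: dump the per-container network counters from the v4 task metadata endpoint.
import json
import os
import urllib.request

base = os.environ["ECS_CONTAINER_METADATA_URI_V4"]

with urllib.request.urlopen(f"{base}/task/stats") as resp:
    stats = json.load(resp)  # Docker-style stats keyed by container runtime ID

for container_id, s in stats.items():
    # Cumulative byte counters per network interface (may be absent for some containers)
    for iface, net in (s.get("networks") or {}).items():
        print(container_id[:12], iface, net.get("rx_bytes"), net.get("tx_bytes"))
```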

The problem

Comparing the data, we realised that although the CPU and memory statistics show no significant difference between OTEL and ECS insights, the network TX/RX rates reported by OTEL are about 3x higher than those reported by ECS insights. We also used Datadog as a third reference source, and Datadog agrees more with OTEL than with ECS insights. The Datadog data is more granular, but its TX/RX rate values are of roughly the same magnitude as the OTEL values and are consistently higher than ECS insights.
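By "rate" I mean bytes per second derived from the cumulative counters above: the delta between two samples divided by the elapsed time. A trivial sketch of that calculation (illustrative numbers only, not any agent's actual code):

```python
def network_rate(prev, curr, elapsed_s):
    """prev and curr are (rx_bytes, tx_bytes) cumulative counters taken elapsed_s apart."""
    rx_rate = (curr[0] - prev[0]) / elapsed_s
    tx_rate = (curr[1] - prev[1]) / elapsed_s
    return rx_rate, tx_rate  # bytes per second

# e.g. two samples taken 20 seconds apart
print(network_rate((500_000, 1_000_000), (2_500_000, 4_000_000), 20.0))
# -> (100000.0, 150000.0)
```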

Other observations

I looked through the code of the CloudWatch agent, the ECS agent, and the OTEL awsecscontainermetricsreceiver, hoping to uncover something suspicious.

We noticed long ago that the network TX/RX rates are the same for all containers within a task. I believe I saw confirmation of that in the ECS agent code, where it uses task-level metrics to populate the container-level metrics when the network mode is awsvpc.

I also noticed code that splits the task-level network stats equally between the containers. Maybe that explains the 3x difference somewhere (we have 3 containers per task).
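A toy illustration of the mismatch I suspect (this is my reading of the behaviour, not the actual agent code, and the numbers and container names are made up): if one consumer divides the task-level figure evenly across containers while another attributes the full task-level figure to each container, the two disagree by exactly the container count.

```python
# Toy illustration of the suspected factor-of-n mismatch, with n = 3 containers per task.
task_tx_bytes_per_s = 300_000
containers = ["app", "datadog-agent", "otel-collector"]  # hypothetical 3-container task

# Interpretation A: split the task-level figure evenly across containers.
split = {c: task_tx_bytes_per_s / len(containers) for c in containers}

# Interpretation B: report the task-level figure for every container.
full = {c: task_tx_bytes_per_s for c in containers}

for c in containers:
    print(c, split[c], full[c], full[c] / split[c])  # ratio is 3.0 for each container
```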

I also noticed that the CloudWatch agent uses the OTEL receiver's code behind the scenes, so in theory it should be sending the same data to CloudWatch, if that is how it works on Fargate. In reality, the data in CloudWatch logs and metrics for ECS insights is still different.

There seems to be a problem with the data, whether it is a race condition when multiple components poll the same metadata endpoint, or a different approach to interpreting the data. Since it is the ECS agent that publishes the data on the task metadata endpoint, I see it as the likely culprit. Please investigate, because this adversely affects our analytics and makes us doubt the ECS insights data.

Sample data

I extracted one hour of insights data for one specific container for comparison. I used queries/selectors to drill down to the specific task family, task ID, and container.

Network TX rate

ECS container insights: (screenshot attached)

OTEL insights: (screenshot attached)

Datadog insights: (screenshot attached)

Network RX rate

ECS container insights: (screenshot attached)

OTEL insights: (screenshot attached)

Datadog insights: (screenshot attached)

Raw data

Extracted from CloudWatch log group for ECS insights:

network - ecs insights.csv

Extracted from data persisted in S3 for OTEL insights:

network - otel collector.csv
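
The comparison itself was done by hand; a rough sketch of it is below. The column names ("NetworkTxBytes", "NetworkTxRate") are placeholders, not the real headers of the attached files.

```python
# Sketch: compare median TX rates from the two attached CSVs (column names are assumed).
import csv
import statistics

def tx_rates(path, column):
    with open(path, newline="") as f:
        return [float(row[column]) for row in csv.DictReader(f) if row.get(column)]

ecs = tx_rates("network - ecs insights.csv", "NetworkTxBytes")     # assumed column name
otel = tx_rates("network - otel collector.csv", "NetworkTxRate")   # assumed column name

print("ECS insights median:", statistics.median(ecs))
print("OTEL median:        ", statistics.median(otel))
print("ratio:", statistics.median(otel) / statistics.median(ecs))  # ~3x in our data
```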
