[FA] - Fix apm_inject test verifySharedLib fail #50414
Conversation
🎯 Code Coverage: Commit SHA 9f1e9e7
Files inventory check summary

File checks results against ancestor b182c4f5, for datadog-agent_7.80.0~devel.git.507.9f1e9e7.pipeline.111712319-1_amd64.deb: No change detected
Static quality checks

✅ Please find below the results from static quality gates: 32 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector

Regression Detector Results (Metrics dashboard)
Baseline: b182c4f
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -0.05 | [-2.95, +2.84] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | otlp_ingest_logs | memory utilization | +0.90 | [+0.80, +1.00] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.63 | [+0.39, +0.87] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | +0.45 | [+0.40, +0.49] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.44 | [+0.24, +0.63] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.29 | [+0.08, +0.49] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.27 | [+0.15, +0.38] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.04 | [-0.37, +0.44] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.03 | [-0.19, +0.25] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.02 | [-0.43, +0.47] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.19, +0.20] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.00 | [-0.07, +0.07] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.00 | [-0.56, +0.55] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.00 | [-0.15, +0.14] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.02 | [-0.15, +0.12] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.02 | [-0.26, +0.22] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -0.05 | [-2.95, +2.84] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.09 | [-0.14, -0.05] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | -0.13 | [-0.29, +0.03] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.15 | [-0.20, -0.10] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.18 | [-0.39, +0.03] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.18 | [-0.34, -0.02] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.20 | [-0.24, -0.15] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -0.97 | [-1.93, -0.01] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 659 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 248.32MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 698 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 141.90MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 479.44MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 177.70MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 358.82 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 382.64MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide that a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that, if our statistical model is accurate, there is at least a 90.00% chance of a difference in performance between the baseline and comparison variants.
- Its configuration does not mark it "erratic".
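The three criteria above can be sketched as a single predicate. This is a hypothetical illustration of the decision rule as described, not the Regression Detector's actual implementation; the function and parameter names are invented for clarity.

```go
package main

import (
	"fmt"
	"math"
)

// isRegression sketches the decision rule described above: flag a change
// only if the effect is large enough, the confidence interval excludes
// zero, and the experiment is not configured as erratic.
func isRegression(deltaMeanPct, ciLow, ciHigh float64, erratic bool) bool {
	const tolerance = 5.0 // effect size tolerance: |Δ mean %| ≥ 5.00%
	bigEnough := math.Abs(deltaMeanPct) >= tolerance
	ciExcludesZero := ciLow > 0 || ciHigh < 0
	return bigEnough && ciExcludesZero && !erratic
}

func main() {
	// The quality_gate_logs row: -0.97% with CI [-1.93, -0.01]. The CI
	// excludes zero, but the effect is below the 5% tolerance, so it is
	// not flagged as a regression.
	fmt.Println(isRegression(-0.97, -1.93, -0.01, false)) // false
}
```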
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
// written to /etc/ld.so.preload.
// A plain non-zero exit means the library ran, hit an error (e.g. an
// AppArmor-blocked syscall), and exited gracefully — it will not crash
// application processes at runtime.
I'm not sure about this.
It seems our injector code is actually being blocked by AppArmor (because of telemetry?) so it will make AppArmor block all apps on the system (better than crashing but still unusable).
I'm not sure either; that's why I'm keeping this as a draft, to take the time to find a proper fix.
What does this PR do?
Fixes APM inject tests flaking on Debian 12 and Ubuntu 24.04 by making `verifySharedLib` only block installation on fatal crash signals, and removes the now-unnecessary flake marks.

Motivation

The new `datadog-apm-inject` release (538720e) introduced telemetry code (9c23a97c) that, when the C library is loaded during the pre-write sanity check (`verifySharedLib`), attempts network/procfs operations that AppArmor blocks on Ubuntu 24.04 and Debian 12. This caused `echo 1` to exit with a non-zero code, which the sanity check incorrectly treated the same as a fatal crash, aborting the install entirely.

The original check blocked the install on any non-zero exit from `echo 1`. But the purpose of `verifySharedLib` is narrower: prevent a library that raises fatal crash signals (SIGSEGV, SIGABRT) from being written to `/etc/ld.so.preload`, where it would kill every process on the system. A library that exits non-zero due to a blocked AppArmor syscall does not crash host processes at runtime; it fails gracefully, and injection simply doesn't happen.

The `TestAppArmor` assertion was already fixed separately in #50400. The flake marks added in #50387 can now be removed.

Describe how you validated your changes
Additional Notes
The root cause of the non-zero exit belongs in the auto_inject repo (a crashtracker fix `ebbfe1b4` exists there but is not yet in a pinned release). This PR makes the agent-side sanity check correctly tolerant of graceful failures while preserving protection against libraries that actually crash host processes.
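The narrowed sanity check described above can be sketched as follows. This is a hypothetical stand-in, not the agent's actual code: the function name mirrors `verifySharedLib`, but the command, signal list, and error handling are illustrative. The key distinction is that termination by a fatal signal blocks the install, while a plain non-zero exit does not.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// verifySharedLib runs a trivial command with the candidate library
// preloaded and blocks installation only if the process dies from a
// fatal crash signal (a graceful non-zero exit is tolerated).
func verifySharedLib(libPath string) error {
	cmd := exec.Command("echo", "1")
	cmd.Env = append(cmd.Environ(), "LD_PRELOAD="+libPath)
	err := cmd.Run()
	if err == nil {
		return nil // library loaded and the command ran cleanly
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		if ws, ok := exitErr.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
			switch ws.Signal() {
			case syscall.SIGSEGV, syscall.SIGABRT, syscall.SIGBUS, syscall.SIGILL:
				// Fatal crash signal: writing this library to
				// /etc/ld.so.preload would kill every process.
				return fmt.Errorf("library crashed with %v", ws.Signal())
			}
		}
		// Plain non-zero exit (e.g. an AppArmor-blocked syscall): the
		// library failed gracefully, so do not block the install.
		return nil
	}
	return err // the command could not be started at all
}

func main() {
	// Preloading a nonexistent path makes the dynamic loader print a
	// warning, but echo still exits 0, so this reports no error.
	fmt.Println(verifySharedLib("/nonexistent-lib.so"))
}
```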