[Backport 7.79.x] Ignore enriched SBOMs if image is missing from workloadmeta#50399
Conversation
…### What does this PR do? When the SBOM remote collector receives a runtime SBOM from system-probe, it now silently drops the event if the corresponding image is not yet present in workloadmeta or if its Trivy scan has not completed (Status is Pending or empty). Previously, the collector would fall back to storing the system-probe BOM directly in that case. In the success path (Trivy scan complete), the collector merges runtime properties from the system-probe BOM into the existing Trivy BOM and preserves the scan metadata (Status, GenerationTime, GenerationDuration, GenerationMethod, Error) from the existing image SBOM when compressing the merged result. A nil-entity guard is also added in handleEvents to safely skip empty events produced by the early-return path. ### Motivation Permanent Trivy scan skip. When system-probe sent its SBOM before Trivy finished scanning, the collector would store the system-probe BOM (reduced package set) as the remote_sbom_collector entity with no Status. The ContainerImageMetadata.Merge logic prefers remote_sbom_collector (alphabetically first), so the merged image entity ended up with Status="". The image_sbom_trivy.go subscription loop skips images whose SBOM Status != Pending, so the Trivy scan was permanently skipped for that image. ### Describe how you validated your changes ### Additional Notes Co-authored-by: sylvain.baubeau <sylvain.baubeau@datadoghq.com> (cherry picked from commit 8fb5bb5) ___ Co-authored-by: Sylvain Baubeau <sylvain.baubeau@datadoghq.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: acd1b1f2b6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| seclog.Debugf("Failed to get image SBOM for container '%s': %v", sbom.ContainerID, err) | ||
| } | ||
|
|
||
| sbom.status = imageSBOM.Status |
There was a problem hiding this comment.
Guard image SBOM lookup before dereferencing
When forwarding, getContainerSBOM can legitimately return nil (for example while workloadmeta is still catching up or the image SBOM is not available yet), but the callback still unconditionally reads imageSBOM.Status. In that case this path panics with a nil-pointer dereference, which can terminate SBOM forwarding for that container during normal startup races.
Useful? React with 👍 / 👎.
|
|
||
| if existingImage.SBOM.Status == workloadmeta.Pending || existingImage.SBOM.Status == "" { | ||
| log.Debugf("Image %s SBOM is still in state '%s', skipping merge for now", imageID, existingImage.SBOM.Status) | ||
| return workloadmeta.Event{}, fmt.Errorf("image %s SBOM is still pending", imageID) |
There was a problem hiding this comment.
Skip pending-image SBOM merges without returning errors
This branch labels the condition as a normal "skipping merge for now" case, but it returns an error; handleEvents logs conversion errors at warning level, so expected pending-state races now produce repeated warnings. That turns a transient/expected state into noisy error logging and undermines the intended "ignore until ready" behavior.
Useful? React with 👍 / 👎.
Files inventory check summaryFile checks results against ancestor 9f24ee81: Results for datadog-agent_7.79.0~rc.5.git.1.acd1b1f.pipeline.111596966-1_amd64.deb:No change detected |
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 9f24ee8 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -0.17 | [-3.10, +2.76] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +2.51 | [+2.33, +2.70] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.66 | [+0.60, +0.72] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.24 | [+0.09, +0.40] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.22 | [+0.05, +0.40] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.18 | [+0.12, +0.24] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.14 | [-0.05, +0.33] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.14 | [+0.05, +0.23] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.10 | [-0.30, +0.49] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.06 | [-0.08, +0.21] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.02 | [-0.13, +0.17] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.02 | [-0.18, +0.21] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.00 | [-0.20, +0.21] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.01 | [-0.09, +0.08] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.03 | [-0.56, +0.50] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.05 | [-0.49, +0.39] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.06 | [-0.09, -0.02] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | -0.13 | [-0.19, -0.08] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.14 | [-0.36, +0.09] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.15 | [-0.25, -0.05] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -0.17 | [-3.10, +2.76] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -0.39 | [-0.63, -0.16] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.41 | [-0.47, -0.35] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -0.64 | [-2.22, +0.94] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 683 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 278.33MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 712 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.24GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.22GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 4 = 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 176.11MiB ≤ 181MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 4 = 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 497.86MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 207.73MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 350.67 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 413.57MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Backport 8fb5bb5 from #50226.
What does this PR do?
When the SBOM remote collector receives a runtime SBOM from system-probe, it now silently drops the event if the corresponding image is not yet present in
workloadmeta or if its Trivy scan has not completed (Status is Pending or empty). Previously, the collector would fall back to storing the system-probe
BOM directly in that case.
In the success path (Trivy scan complete), the collector merges runtime properties from the system-probe BOM into the existing Trivy BOM and preserves the
scan metadata (Status, GenerationTime, GenerationDuration, GenerationMethod, Error) from the existing image SBOM when compressing the merged result.
A nil-entity guard is also added in handleEvents to safely skip empty events produced by the early-return path.
Motivation
Permanent Trivy scan skip. When system-probe sent its SBOM before Trivy finished scanning, the collector would store the system-probe BOM (reduced package
set) as the remote_sbom_collector entity with no Status. The ContainerImageMetadata.Merge logic prefers remote_sbom_collector (alphabetically first), so
the merged image entity ended up with Status="". The image_sbom_trivy.go subscription loop skips images whose SBOM Status != Pending, so the Trivy scan
was permanently skipped for that image.
Describe how you validated your changes
Additional Notes