dyninst/symdb: stream JSON into gzip, chunk by compressed size#50396
gh-worker-dd-mergequeue-cf854d[bot] merged 2 commits into main
Conversation
The SymDB upload pipeline previously held each batch in memory three times
over: as a []Scope slice, as a marshalled JSON []byte, and as a gzipped
[]byte. Batches were flushed by buffered function count (default 10000),
which let the in-memory []Scope grow large before any compression.
Replace UploadBatch([]Scope) with a streaming BatchEncoder that owns the
gzip writer and a json.Encoder wrapping it. Scopes are encoded straight
into the gzip stream as they arrive, the caller no longer accumulates a
slice, and flushes are triggered when the compressed buffer reaches a
threshold (DefaultFlushThresholdBytes = 2 MiB). The envelope is written
inside the gzip stream as {service,version,language,upload_id,batch_num,
scopes:[...],final}, with final written at flush time.
Threshold is soft: gzip's internal window means the flushed payload may
overshoot by up to ~32 KiB. A threshold <= 0 forces per-scope flushing,
preserving the cancel-between-flushes test behaviour previously achieved
with maxBufferFuncs=1.
ErrUpload is exposed as a sentinel so callers can distinguish HTTP-side
failures (retryable) from local encoder errors via errors.Is.
https://datadoghq.atlassian.net/browse/DEBUG-5553
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73310b4273
```go
// BatchEncoder streams Scope objects into a gzip-compressed SymDB JSON
// envelope and uploads chunks to the SymDB intake whenever the compressed
// payload reaches the configured threshold. A single BatchEncoder
// corresponds to one logical upload (one UploadID) and may emit multiple
// batches.
type BatchEncoder struct {
```
Restore the exported batch upload API or update CLI
This replaces the exported UploadBatch/UploadInfo API, but pkg/dyninst/symdb/cli/main.go still imports this package and calls up.UploadBatch(..., uploader.UploadInfo{...}) under the same linux_bpf build tag. I checked the repo with rg "UploadBatch|UploadInfo"; that CLI is the remaining caller, so building the SymDB CLI package now fails until it is migrated to NewBatchEncoder or a compatibility wrapper is kept.
Addressed in the latest push: the CLI is migrated to NewBatchEncoder in the same commit, so the linux_bpf build is no longer broken.
```go
// the gzip writer's internal window may not yet be reflected in the buffer
// length, so the actual flushed payload may overshoot by up to the gzip
// window size (~32 KiB).
func WithFlushThreshold(bytes int) BatchEncoderOption {
```
it looks like this option is not really optional; each caller needs to provide it, otherwise they get weird behavior for 0. Let's get rid of the options and pass it explicitly to the ctor.
Done. Dropped WithFlushThreshold and the flushThreshold field; the encoder no longer manages a threshold. Replaced ShouldFlush with Size() and let the caller decide when to flush.
```go
	}
	b.gz = gzip.NewWriter(&b.buf)
	b.enc = json.NewEncoder(b.gz)
	b.enc.SetEscapeHTML(false)
```
what's up with this SetEscapeHTML? We didn't have it before?
It isn't needed, and it is weird to have added it, but I think it is technically what we want.
From the encoding/json docs: string values encode as JSON strings coerced to valid UTF-8, replacing invalid bytes with the Unicode replacement rune. So that the JSON will be safe to embed inside HTML <script> tags, the string is encoded using HTMLEscape, which escapes "<", ">", "&", U+2028, and U+2029 to "\u003c", "\u003e", "\u0026", "\u2028", and "\u2029". This replacement can be disabled when using an Encoder by calling Encoder.SetEscapeHTML(false).
Kept it but added a comment. The previous json.Marshal path also escaped (it's the default for both Marshal and Encoder), so disabling it here is a small intentional wire-format change rather than preservation — without it, a Go type name like chan<- would be emitted as chan\u003c- in the SymDB payload, which nobody wants.
```go
	// this upload.
	Final bool
```

```go
// BatchEncoder streams Scope objects into a gzip-compressed SymDB JSON
// envelope and uploads chunks to the SymDB intake whenever the compressed
```
this comment is not really accurate, is it? The Encoder doesn't upload anything by itself; you need to call Flush(). If it doesn't do the flushing on its own, consider letting the caller deal with the threshold too (so only provide a Size() function)
Did both. Fixed the doc comment (the encoder no longer claims to flush on its own; the caller drives Flush). Dropped the threshold from the encoder in favour of Size() — the caller now owns the threshold check.
Files inventory check summary

File checks results against ancestor 84ef6841. Results for datadog-agent_7.80.0~devel.git.482.efbc8af.pipeline.111607510-1_amd64.deb: No change detected.
```go
}

// NewBatchEncoder creates a BatchEncoder for a single logical upload.
func (s *SymDBUploader) NewBatchEncoder(uploadID uuid.UUID, opts ...BatchEncoderOption) *BatchEncoder {
```
actually, massage the flow some more -- as it stands, the caller creates a SymDBUploader only to call NewBatchEncoder() on it, and uses the Encoder from then on. Either make the caller only aware of the uploader, or only aware of the encoder.
Went with encoder-only. SymDBUploader is now an unexported helper inside BatchEncoder; callers construct the encoder directly via NewBatchEncoder(url, service, version, runtimeID, uploadID, headers...).
Static quality checks

✅ Please find below the results from static quality gates.

Successful checks: 23 successful checks with minimal change (< 2 KiB).

On-wire sizes (compressed)
Regression Detector Results

Baseline: e8100bb. Optimization Goals: ✅ No significant changes detected.
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +0.04 | [-2.89, +2.97] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_metrics_logs | memory utilization | +1.20 | [+0.95, +1.45] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.79 | [+0.55, +1.03] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.56 | [+0.36, +0.76] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.52 | [+0.45, +0.59] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.51 | [+0.33, +0.69] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.49 | [+0.34, +0.64] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.29 | [+0.24, +0.34] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_security_idle | memory utilization | +0.27 | [+0.19, +0.35] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +0.22 | [+0.13, +0.31] | 1 | Logs |
| ➖ | quality_gate_security_mean_fs_load | memory utilization | +0.17 | [+0.13, +0.21] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.16 | [-0.00, +0.31] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.05 | [-0.35, +0.45] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | +0.04 | [-2.89, +2.97] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.02 | [-0.42, +0.46] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.09, +0.10] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.19, +0.20] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.17, +0.12] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.03 | [-0.23, +0.18] | 1 | Logs |
| ➖ | quality_gate_security_no_fs_load | memory utilization | -0.03 | [-0.14, +0.07] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_logs | % cpu utilization | -0.04 | [-1.01, +0.92] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.04 | [-0.10, +0.01] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.05 | [-0.61, +0.51] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.11 | [-0.21, -0.00] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.51 | [-0.55, -0.47] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -1.86 | [-2.07, -1.64] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 713 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 244.86MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 685 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 142.36MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 471.54MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 177.55MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 353.16 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 376.60MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_security_idle | cpu_usage | 10/10 | 25.23 ≤ 40 | bounds checks dashboard |
| ✅ | quality_gate_security_idle | memory_usage | 10/10 | 290.62MiB ≤ 330MiB | bounds checks dashboard |
| ✅ | quality_gate_security_mean_fs_load | cpu_usage | 10/10 | 52.56 ≤ 70 | bounds checks dashboard |
| ✅ | quality_gate_security_mean_fs_load | memory_usage | 10/10 | 267.21MiB ≤ 320MiB | bounds checks dashboard |
| ✅ | quality_gate_security_no_fs_load | cpu_usage | 10/10 | 19.68 ≤ 40 | bounds checks dashboard |
| ✅ | quality_gate_security_no_fs_load | memory_usage | 10/10 | 279.78MiB ≤ 320MiB | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_security_idle, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_mean_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_mean_fs_load, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_no_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_no_fs_load, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- Drop the WithFlushThreshold functional option. The threshold isn't truly optional (every caller needs to set it, and the 0 sentinel for flush-after-every-scope was a wart), and the encoder doesn't auto-flush anyway. Replace ShouldFlush with a Size() accessor and let the caller decide when to flush.
- Hide SymDBUploader inside the BatchEncoder: callers now construct an encoder directly with NewBatchEncoder(url, service, version, runtimeID, uploadID, headers...) rather than going through SymDBUploader as an intermediary they then never use again.
- Comment why we SetEscapeHTML(false): without it, '<', '>' and '&' inside string values get emitted as their six-character \u00XX escape, which would corrupt Go type names like 'chan<-' in the SymDB payload.
- Fix the BatchEncoder doc comment: the encoder doesn't ship batches on its own; the caller must call Flush.
- Migrate the symdb CLI to the new API so the package keeps building. As a side effect, this fixes a pre-existing bug where the CLI always sent BatchNum: 0 (the local batchNum was declared but never incremented); BatchEncoder increments it correctly per batch.
b4a69e2 to efbc8af (force-push)

/merge

View all feedbacks in Devflow UI.
Motivation
We've seen some OOMs uploading symdb data.
Describe how you validated your changes
There's some testing but could be more I suppose.