eBPF-based format profiling for Vortex #8337

m7kss1 · 2026-06-10T14:31:57Z

the idea is to expose a small number of semantic instrumentation points in vortex: scan boundaries, segment requests, coalesced reads, pruning/filter results, decode spans, and canonicalization fallbacks. eBPF programs can attach to these points with uprobes, aggregate counters in kernel maps, and combine them with kernel signals such as read syscalls, read-size histograms, and optional PMU counters.
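
as a rough illustration, one possible shape for such an instrumentation point on the Rust side is a never-inlined function with plain integer arguments that does nothing except give a uprobe something to attach to. the names `vx_probe_decode_begin`, `vx_probe_decode_end`, and `decode_with_markers` are hypothetical and not taken from vortex or the PoC, so treat this as a sketch under those assumptions:

```rust
use std::hint::black_box;

// Hypothetical marker functions: the names and argument layout are
// assumptions for illustration, not vortex's actual API.
//
// `#[inline(never)]` keeps each marker as a real function with its own
// symbol, so a uprobe has an address to attach to. `extern "C"` gives the
// integer arguments a predictable calling convention for the probe to read.
// A real build would probably also want unmangled, stable symbol names.

#[inline(never)]
pub extern "C" fn vx_probe_decode_begin(encoding_id: u32, rows: u64) {
    // black_box hints to the optimizer that the arguments are used.
    black_box((encoding_id, rows));
}

#[inline(never)]
pub extern "C" fn vx_probe_decode_end(encoding_id: u32, bytes: u64) {
    black_box((encoding_id, bytes));
}

/// Wraps one decode call in a begin/end marker pair. The closure returns the
/// decoded value plus the number of bytes it produced.
pub fn decode_with_markers<T>(
    encoding_id: u32,
    rows: u64,
    decode: impl FnOnce() -> (T, u64),
) -> T {
    vx_probe_decode_begin(encoding_id, rows);
    let (out, bytes) = decode();
    vx_probe_decode_end(encoding_id, bytes);
    out
}
```

with no probe attached, each marker costs one function call; timing and counting would happen on the eBPF side, keyed by the arguments.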

note: this is not meant to replace perf/flamegraphs etc. those tools are good at showing where cpu time is spent. they do not directly explain format-level behavior: whether a layout caused more physical reads, whether pruning became less effective, which encoding dominated decode cost etc

the kind of statistics this could collect:

- logical vs physical bytes read, read amplification, coalescing factor
- read syscall count, read sizes, tiny reads
- footer / metadata read cost
- pruning and filter selectivity
- split count, split duration, concurrency, skipped splits
- decode calls and decode time per encoding
- rows/s and MB/s per encoding
- canonicalization fallback counts by operation and encoding
- optional cache misses, instructions, cycles
- cold/warm cache indicators from procfs
- any off-cpu statistics (e.g. scheduler switches)
- network
- direct gpu load validation (to check in the future whether the fallback path disk/network -> CPU RAM -> H2D still happens; afaik it's not implemented yet)
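
for the read-size and tiny-read items in the list above, a common way to keep aggregation cheap is a power-of-two histogram, which is also how many eBPF tools bucket sizes. the sketch below is a userspace model of that idea in plain Rust, not the PoC's code; `Log2Hist` and the tiny-read threshold are assumptions for illustration:

```rust
/// Number of power-of-two buckets; bucket i counts sizes in [2^i, 2^(i+1)),
/// except bucket 0, which also holds size 0.
const BUCKETS: usize = 64;

#[derive(Debug)]
pub struct Log2Hist {
    counts: [u64; BUCKETS],
    total: u64,
}

impl Log2Hist {
    pub fn new() -> Self {
        Log2Hist { counts: [0; BUCKETS], total: 0 }
    }

    /// Bucket index is floor(log2(size)); size 0 maps to bucket 0.
    fn bucket(size: u64) -> usize {
        if size == 0 { 0 } else { 63 - size.leading_zeros() as usize }
    }

    pub fn record(&mut self, size: u64) {
        self.counts[Self::bucket(size)] += 1;
        self.total += 1;
    }

    /// Lower bound of the bucket that contains the q-quantile (q in 0.0..=1.0).
    /// The result is always a power of two, so it is an estimate only.
    pub fn quantile_lower_bound(&self, q: f64) -> Option<u64> {
        if self.total == 0 {
            return None;
        }
        let rank = ((q * self.total as f64).ceil() as u64).max(1);
        let mut seen = 0u64;
        for (i, &c) in self.counts.iter().enumerate() {
            seen += c;
            if seen >= rank {
                return Some(1u64 << i);
            }
        }
        None
    }

    /// Number of recorded sizes strictly below `threshold`, which must be a
    /// power of two so that it lines up with a bucket boundary.
    pub fn count_below(&self, threshold: u64) -> u64 {
        assert!(threshold.is_power_of_two());
        self.counts[..Self::bucket(threshold)].iter().sum()
    }
}
```

a "tiny reads" counter would then be `count_below(4096)` or whatever threshold is chosen; the 4096 here is only an example value.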

the main advantage over manual instrumentation is that most aggregation stays outside the hot path. vortex only exposes stable semantic events; eBPF collects and aggregates them without threading metric state through every reader, executor, and engine integration. most importantly, this approach makes the same counters usable across the different workloads and query engines that vortex integrates with

my view is that this would complement existing benchmarks and profilers: benchmarks tell us that something changed, flamegraphs tell us where cpu went, but format profiling could explain what changed in the format behaviour

AdamGS (Maintainer) · 2026-06-10T14:46:11Z

That sounds like a great idea. The observability story in Vortex is currently somewhat neglected, but this might be a good feature to motivate us to make it better.

m7kss1 (Author) · 2026-06-10T14:46:12Z

actually, i've already vibecoded a PoC implementation for the datafusion engine using aya

note: the numbers below are illustrative and were collected on a debug build:

$ sudo -E vx profile query lineitem_7.vortex --sql "select sum(l_extendedprice) from data where l_shipdate < date '1995-01-01'"
...
vx profile: attached 12 markers to pid 3060932

{
  "engine": "datafusion",
  "metrics": {
    "cold.major_faults": 0,
    "cold.page_cache_hit_ratio": 1.0,
    "cold.storage_read_bytes": 0,
    "decode.total_calls": 91,
    "decode.total_ms": 1088.0898,
    "decode.vortex.binary.bytes": 0,
    "decode.vortex.binary.calls": 2,
    "decode.vortex.binary.mb_per_sec": 0.0,
    "decode.vortex.binary.ms": 0.0966,
    "decode.vortex.binary.rows": 2,
    "decode.vortex.binary.rows_per_sec": 20705.8629,
    "decode.vortex.constant.bytes": 74,
    "decode.vortex.constant.calls": 19,
    "decode.vortex.constant.mb_per_sec": 0.2018,
    "decode.vortex.constant.ms": 0.3667,
    "decode.vortex.constant.rows": 19,
    "decode.vortex.constant.rows_per_sec": 51818.5587,
    "decode.vortex.filter.bytes": 0,
    "decode.vortex.filter.calls": 34,
    "decode.vortex.filter.mb_per_sec": 0.0,
    "decode.vortex.filter.ms": 93.7518,
    "decode.vortex.filter.rows": 1354792,
    "decode.vortex.filter.rows_per_sec": 14450839.5227,
    "decode.vortex.pco.bytes": 11003328,
    "decode.vortex.pco.calls": 35,
    "decode.vortex.pco.mb_per_sec": 11.0714,
    "decode.vortex.pco.ms": 993.8516,
    "decode.vortex.pco.rows": 3152593,
    "decode.vortex.pco.rows_per_sec": 3172096.4255,
    "filter.conjunct.0.ms": 1057.7027,
    "filter.conjunct.0.rows_in": 1576200,
    "filter.conjunct.0.rows_kept": 677396,
    "filter.conjunct.0.selectivity": 0.4298,
    "filter.rows_in": 1576200,
    "filter.rows_kept": 677396,
    "filter.selectivity": 0.4298,
    "io.coalescing_factor_avg": 4.2,
    "io.logical_segment_bytes": 6412096,
    "io.physical_read_bytes": 6722036,
    "io.physical_reads": 5,
    "io.read_amplification": 1.0483,
    "io.read_size_p50_bytes": 64.0,
    "io.read_size_p99_bytes": 2097152.0,
    "io.read_syscall_bytes": 6796154,
    "io.read_syscalls": 59,
    "io.segment_requests": 21,
    "io.segment_size_p50_bytes": 262144.0,
    "io.segment_size_p99_bytes": 262144.0,
    "io.tiny_reads": 54,
    "memory.rss_peak_bytes": 1428934656,
    "memory.rss_peak_delta_bytes": 1363107840,
    "metadata.footer_bytes": 65535,
    "metadata.footer_reads": 1,
    "pruning.conjunct.0.ms": 36.4686,
    "pruning.conjunct.0.rows_in": 1510664,
    "pruning.conjunct.0.rows_kept": 1510664,
    "pruning.pruned_ratio": 0.0,
    "pruning.rows_in": 1576200,
    "pruning.rows_kept": 1576200,
    "pushdown_fallback.total": 17,
    "pushdown_fallback.vortex.filter=>vortex.pco.count": 17,
    "scan.count": 1,
    "scan.rows_out": 677396,
    "scan.split_duration_p50_ms": 67.1089,
    "scan.split_duration_p99_ms": 67.1089,
    "scan.split_peak_concurrent": 17,
    "scan.splits": 3,
    "scan.splits_pruned": 0
  },
  "query": "select sum(l_extendedprice) from data where l_shipdate < date '1995-01-01'",
  "target": "lineitem_7.vortex",
  "wall_ms": 249.2569
}
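
for readers wondering how the derived fields relate to the raw counters, here is a small sketch of plausible definitions. these formulas are assumptions rather than the PoC's actual code, although they are consistent with the figures in the output above (with MB taken as 10^6 bytes); the function names are made up for illustration:

```rust
/// Division that returns 0.0 instead of NaN or infinity when the
/// denominator is zero.
fn ratio(num: f64, den: f64) -> f64 {
    if den == 0.0 { 0.0 } else { num / den }
}

/// Physical bytes read per logical segment byte requested.
pub fn read_amplification(physical_read_bytes: u64, logical_segment_bytes: u64) -> f64 {
    ratio(physical_read_bytes as f64, logical_segment_bytes as f64)
}

/// Fraction of input rows that survive a filter.
pub fn selectivity(rows_kept: u64, rows_in: u64) -> f64 {
    ratio(rows_kept as f64, rows_in as f64)
}

/// Fraction of input rows removed by pruning.
pub fn pruned_ratio(rows_kept: u64, rows_in: u64) -> f64 {
    if rows_in == 0 { 0.0 } else { 1.0 - selectivity(rows_kept, rows_in) }
}

/// Throughput in rows per second from a row count and a duration in ms.
pub fn rows_per_sec(rows: u64, ms: f64) -> f64 {
    ratio(rows as f64 * 1000.0, ms)
}

/// Throughput in MB/s (1 MB = 1_000_000 bytes) from bytes and a duration in ms.
pub fn mb_per_sec(bytes: u64, ms: f64) -> f64 {
    ratio(bytes as f64 * 1000.0, ms) / 1_000_000.0
}
```

for example, `read_amplification(6722036, 6412096)` comes out at about 1.0483 and `selectivity(677396, 1576200)` at about 0.4298, matching `io.read_amplification` and `filter.selectivity` in the output.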

AdamGS (Maintainer) · 2026-06-10T15:10:11Z

That looks really promising, do you have a branch you can share?
