Netstacklat: Add filtering and grouping #129

simosund · 2025-06-05T17:58:23Z

Hi @tohojo and @netoptimizer. This is only draft PR for now to facilitate some discussion. For a real PR I imagine we break this down into some smaller chunks (this is 4 different branches stacked on top of each other). Each of those essentially deals with:

Change the socket enqueue hook because the one I used had some issues
Multiplex all histograms into a single hashmap (needed for the grouping support, and is generally more convenient)
Additional filtering options (ifindex, network namespace and cgroup. TODO: add filter for non-empty socket rx queues)
Groupby support (ifindex and cgroups)

So I do not expect you to review this in detail as of now (although if you have nothing better to do feel free to do so, do not have any major cleanup planned). I would however like some feedback on some design decisions and the limitations they bring. The design decisions/limitations that I would like you to consider are:

For multiplexing histograms:
- In userspace I maintain a sorted array of the histogram keys to allow relatively fast lookup via bsearch. Insertion is not very efficient (O(n)), but should be manageable as long as we do not have more than few thousand histograms (will do some more performance tests).
- If the BPF hashmap runs out of space (which may happen with the groupby options) it will simply not create any new entries. This may result in it keeping partial histograms, where some buckets are miss-reported as being empty. I do not attempt to evict any existing entry as I do not really have any good grounds to do that on. LRU could be an option, but Jesper mentioned they do not work too well with many CPUs.
- If userspace runs out of space for histogram keys (which may happen with groupby), it will warn that it's missing some histograms collected by the eBPF programs, but keep running.
For ifindex filtering:
- I do not support ifindex > 16384 (can be bumped up a bit, but not practical to support >> 1000000 (see commit message for b05cd38 - "netstacklat: Add filtering for network interfaces")
- I only check the ifindex, I do not jointly consider ifindex - namespace combinations (only really intend you to track one namespace, see next point)
For network namespace filtering:
- By default I set netstacklat to filter for the same network namespace as you run the tool in (which will probably be the root network namespace)
- I only support filtering for a single network namespace (see message for eb291bd - "netstacklat: Add filtering for network namespace"). However, you can disable the filtering to get data for all network namespaces.
For groupby options:
- You supply each groupby option as it's own command-line argument (i.e. --groupby-interface and --groupby-cgroup). It would perhaps be nicer to have --groupby interface,cgroup, but I find that adds needless complexity as long as we only have 2 options. For now it's unlikely I'll add many more groupby options, as the amount histograms you get for all combinations quickly explode.
- For --groupby-interface the user space agent prints out the ifindex, not the interface name (see message for d42cdc2 - "netstacklat: Add option to groupby interface"). The provided ebpf-exporter config attempts to print out the interface names (but I assume that will not work as intended if the ifindex is from a different namespace).
- For --groupby-cgroup the user space agent prints out the cgroup ID (inode), not the path (see message for cf7db6f - "netstacklat: Add option to groupby cgroup"). The provided ebpf-exporter config prints out the paths.

Further details can be found in each commit message.

@netoptimizer There's currently (AFAIK) no convenient way to provide the various values to filter for with ebpf-exporter. If PR 531 for ebpf-exporter is merged it seems like you would be able to provide a cgroup path regex in the YAML config, but that still leaves filtering on PIDs, interfaces and network namespace. The network namespace can be hardcoded in the config in the eBPF source code, but the rest need to be set in BPF maps (after their corresponding filter_xxx member in the config in the eBPF source has been set to true). That can be done externally with bpftool. Can hack up some shell script that does that if you want.

Previously netstacklat used fexit:tcp_data_queue as the socket enqueue point for TCP and fexit:udp[v6]_queue_rcv_one_skb for UDP. For both of these functions, the skb may actually have been freed by the time they return, leading us to read invalid data. Furthermore, not all calls to these functions necessarily end up enqueuing the skb to the socket, as they may be dropped for various reasons. For TCP, there are also some fast paths that may enqueue data to the socket without going through tcp_data_queue. Therefore, update these probes to hook more suitable functions. For TCP, use the tcp_queue_rcv function (which is called by tcp_data_queue when the data is actually queued to the socket). This function is much closer to the actual socket enqueue point, will never free the skb itself (although its return value may indicate to the calling function that it should be freed as it's be coalesced into tail skb in the receive queue), is only called when the skb is actually queued to the socket, and is also called in a fast path of tcp_rcv_established that bypasses tcp_data_queue. For UDP, use __udp_enqueue_schedule_skb, which udp[v6]_queue_rcv_one_skb functions call when they actually attempt to enqueue the skb to the socket. This function may still fail to enqueue to skb to the socket (if e.g. the socket buffer is full), so check the return value so that we only report the instances where the skb is successfully enqueued. This function is called by both the IPv4 and IPv6 UDP paths, so similar to the TCP case we only need to hook a single function now. Signed-off-by: Simon Sundberg <[email protected]>

Change the way that the latency histograms are stored in BPF maps. Instead of keeping a separate array map for each histogram, store all histograms in a single hash map, encoding the hook as part of the key. This results in higher overhead (as hash lookups are slower than array lookups), but is much more flexible. This makes it easier to add additional hook points as no new maps (and related code for mapping hooks to maps) need to be added. Furthermore, in the future it allows to easily group the results on various aspects by adding additional members to the key. On the userspace side, maintain a sorted array of the encountered histogram keys and a mapping to the corresponding histogram buckets. Instead of keeping a separate key for each histogram bucket (as the BPF maps do to be compatible with ebpf-exporter), restructure the data so only a single key is used per histogram. Essentially remove the bucket member from the key, keeping a full histogram (where any missing buckets are zeroed) for the remaining unique members in the histogram key (so far just the hook identifier). Keeping the array of histogram keys sorted allows for relatively quick lookups using binary search. When a new histogram key is encountered it will incur significant overhead the first time as it needs to be inserted into the right place in the array, but lookups ought to be much more common than inserting new keys. While this data structure will not scale well to a very large amount of unique keys (insertion time is O(n), lookup O(log n)), it avoids implementing or adding dependencies to more complicated data structures like trees or hash maps. As long as we do not need to keep track of many thousands of histograms, this solution should be good enough. Signed-off-by: Simon Sundberg <[email protected]>

Update the ebpf-exporter config to match the change to how the histograms are stored in the previous commit. As all histograms are stored in a single map, adding additional hooks in the future will only require adding a single line to the hook static_map. Signed-off-by: Simon Sundberg <[email protected]>

Refactor the parsing of arguments that accept lists of values (eg. --pids 1,2,3). Introduce a generic function for parsing delimited string lists and reuse that function to avoid repeating similar logic. This will simplify adding additional arguments that accept lists of values in the future. Signed-off-by: Simon Sundberg <[email protected]>

The full array of pid-values that could be parsed from the user was kept inside the config struct, which is allocated on stack (in the main function). Change the config struct to only keep a pointer to this array, and allocate it on the heap instead to avoid keeping this relatively large data structure on the stack. While not necessarily a large problem yet, establishing this pattern reduces the risk of running out of stack as new fields to filter for are added down the line or the maximum number of values to parse is increased. Also rename MAX_FIILTER_PIDS to MAX_PARSED_PIDS to better reflect what it actually is, and update the comment in parse_arguments() to reflect that the option is called pids and not filter-pids. Signed-off-by: Simon Sundberg <[email protected]>

The filtering for specific pids (--pids/-p) makes use of a BPF array map where each entry to be included is set to 1 (the rest remain 0 as all entries by default are zeroed in array maps). Generalize the user space logic that initializes the entries in this filter map so that it can be reused by other similar features in the future. Signed-off-by: Simon Sundberg <[email protected]>

Add the option -i/--interfaces option to filter for specific network interfaces. The interfaces can either be provided as interface names or indices, although if the interfaces are in another namespace than the netstacklat userspace agent is running in they SHOULD be provided as indices. The names are resolved in the current namespace, and may therefore fail or yield incorrect indices if the interface is in another namespace. Unlike the previous -p/--pids option, this option applies to all existing probe points in netstacklat. On the eBPF side, use skb->skb_iif if the skb is available as context, otherwise use the sk->sk_rx_dst_ifindex from the socket. Use a similar approach as the previous PID filter, where an array map is used to hold ifindices that should be filtered for, allowing a quick lookup. While the ifindex (unlike the PID) does not seem to have a clear upper limit, limit it to 16384 (IFINDEX_MAX) to keep the filter map reasonably small while still supporting the vast majority of scenarios. Note that internally filtering is applied on the interface index (ifindex), regardless if the option provided the index or the name for the interface. If the same ifindex is repeated in multiple network namespaces, it will include traffic for all of them. A future commit will add an option to also filter for a specific network namespaces. Signed-off-by: Simon Sundberg <[email protected]>

Add the -n/--network-namespace option, which let's the user specify which network namespace ID (inode number) should be monitored. Apply the filtering to all current netstacklat probe points. Use the value 0 (default) to filter for the network namespace that the netstacklat application itself is running in. Use value -1 to disable the filtering, including data from all network namespaces (equivalent with the behavior before this commit). Only support filtering for a single network namespace (or all namespaces if the filtering is disabled). This minimizes runtime overhead by allowing the ID to filter for to be kept as a constant in the eBPF program. Supporting multiple values would require an additional map lookup, and due to the wide range of IDs it would have to be a hashmap lookup, which would add considerable overhead for a rather niche use case (monitoring multiple network namespaces). Note that this option will interact with the -i/--interfaces option, as the ifindex that the --interface option filters for are relative to the network namespace set by this option. Signed-off-by: Simon Sundberg <[email protected]>

Add the -c/--cgroups option that lets the user specify one or more cgroups (v2) to filter for. The cgroups can either be provided through their absolute path (including mount path) or as the cgroup ID (inode). This filter only applies to the probe points running in process context, just like the PIDs filter, which is currently only the socket dequeue probes (tcp-read and udp-read). To keep it simple (and avoid high run-time overhead), only do an exact match on the cgroup ID. Do not consider the hierarchical relationship between cgroups. I.e. if the output is filtered for a parent cgroup, the children of that cgroup will NOT be included unless the children cgroups have also been explicitly specified. To support the wide range of possible cgroup IDs, keep the cgroups to filter for in a sparse hasmap (where only values to include have entries) instead of the dense array maps (where all possible values have keys but those to include have non-zero values) like previous multi-valued filters. This unfortuantely adds considerable overhead for doing an additional hash map lookup, but keeping a dense map for all possible IDs is not feasible. Signed-off-by: Simon Sundberg <[email protected]>

Add the -q/--nonempty-queue option, which when enabled only includes latency values when the socket receive queue is non-empty. Only apply this to the socket-read hooks (tcp-socket-read, udp-socket-read), where the probes are triggered AFTER the skbs have been read from the socket queue, and a non-empty queue therefore signifies that additional data remains after the read. The idea behind this hook is to offer a way to reduce overhead (by early aborting for all instances where the socket receive queue is empty) while still capturing latency for applications that can be assumed is under some load (enough load that more data queues up than the application will immediately read). Signed-off-by: Simon Sundberg <[email protected]>

Make the userspace agent set the BPF map sizes based on configured options. This allows more suitable map sizes to be used during run time than the static limits set in the BPF programs. This avoids wasting memory by using unnecessary large maps, and might slightly improve hashmap lookup performance by sizing them based on the expected number of entries (small enough that many entries may fit in cache, large enough to avoid excessive hash collisions). Scale the histogram map based on the expected number of histograms, the PID and ifindex filter maps to fit the largest key they need to include and the cgroup filter map to fit all tracked cgroups. This also fixes a bug where the maximum allowed PID (PID_MAX_LIMIT) and ifindex (IFINDEX_MAX) did not fit in their corresponding filter maps (off-by-one error). Signed-off-by: Simon Sundberg <[email protected]>

Add the -I/--groupby-interface option to collect and report the data on a per-interface (or rather ifindex) basis. Note that the network interfaces are tracked based on their ifindex, so if network namespace filtering has been disabled and there exists interfaces in different namespaces with a common ifindex, their data will be merged into the same histogram. Always write the interface index rather than the interface name in the output. While the interface name for the same network namespace as the user space agent runs in can easily be retrieved with e.g. if_indextoname(), that will only be valid if the user has configured netstacklat to only monitor its own network namespace. If a different network namespace is monitored, or filtration for network namespaces is disabled, translating to the interface names in the current namespace might produce misleading results. An alternative could be to print out the interface names in case the current network namespace is the one monitored (the default), or the index if there's a risk that the data might be from a different namespace. However, in addition to that added complexity, that will produce somewhat inconsistent output (i.e. you might get interface names or interface indices depending on how you configure netstacklat). Signed-off-by: Simon Sundberg <[email protected]>

Add the -C/--groupby-cgroup option to collect and report data on a per-cgroup basis. Just like the -c/--cgroups option, this will only apply to probes in the process context, which is currently only tcp-socket-read and udp-socket-read. When reporting the data, print out the cgroup ID (inode number) directly instead of the cgroup path. As far as I can tell, the only way to resolve the ID into a path is the walk the entire cgroup mount (e.g. /sys/fs/cgroup) and stat each path to find the matching inode. Doing this every time the cgroup needs to be printed seems highly inefficient, and to create an efficient cache the most suitable data structure seems like a hashmap, which C lacks. Adding support for printing out the cgroup path would thus be a significant implementation effort for something we in the end primarily will rely on ebpf-exporter for anyways. Signed-off-by: Simon Sundberg <[email protected]>

simosund · 2025-06-09T17:50:05Z

I've now added the filter for non-empty socket rx-queue that Jesper requested as well. See commit 7dc64ad - netstacklat: Add filtering for non-empty rxqueue.

Many of the options to filter for a subset of values (--pids, --interfaces, and --cgroups) rely on filling BPF maps with the values to include. When running together with the provided netstacklat user space process, the user space process handles filling these maps with the values passed on the command line. However, when using the netstacklat eBPF programs with some external loader, like the ebpf-exporter, these maps have to be filled by the user in some other manner. To make it easier to use netstacklat with external eBPF loaders, provide the fill_filter_maps.sh script. Rely on bpftool to fill the BPF maps (based on the map names as defined in the netstacklat.bpf.c file). in the script. Make the script support all the current filters that make use of maps, i.e. PIDs (pid), network interfaces (iface) and cgroups (cgroup). The pid option only supports integers. The iface option support either interface names or their ifindex. The cgroup option accepts either the cgroup ID (their inode number) or the full path to the cgroup. Examples: $ ./fill_filter_maps.sh pid 1234 98765 $ ./fill_filter_maps.sh iface veth0 lo 123 $ ./fill_filter_maps.sh cgroup /sys/fs/cgroup/system.slice/prometheus.service/ 12345 Note that for the values in the filter map to actually be used by the netstacklat eBPF programs, the corresponding filter_{pid,ifindex,cgroup} value must be true (by default they're all false). The netstacklat user space process normally takes care of enabling these as needed, but if used with an external loader the easiest way to enable these is probably to just change them in user_config at the start of netstacklat.bpf.c (and recompile). Also note that this script can also be used together with the netstacklat user space loader to add additional values to filter for after the starting the program. However, the netstacklat user space loader automatically minimizes the size of the filter maps since commit "netstacklat: Dynamically configure map sizes". So unless the initial filter values provided as netstacklat CLI arguments resulted in sufficiently large filter maps, the fill_filter_maps.sh script may not be able to successfully add the desired values. Finally, note that as bpftool expects you to feed it the individual bytes of the keys, the order of the bytes will be dependant on the endianess of the machine. Currently only support little-endian machines, big-endian support can be added later if needed. Signed-off-by: Simon Sundberg <[email protected]>

simosund added 13 commits May 28, 2025 22:22

simosund force-pushed the netstacklat-groupby branch from cf7db6f to 3026b28 Compare June 9, 2025 17:39

simosund force-pushed the netstacklat-groupby branch from 1ed19c3 to 87ee0b4 Compare June 17, 2025 14:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Netstacklat: Add filtering and grouping #129

Netstacklat: Add filtering and grouping #129

Uh oh!

simosund commented Jun 5, 2025

Uh oh!

simosund commented Jun 9, 2025

Uh oh!

Uh oh!

Netstacklat: Add filtering and grouping #129

Are you sure you want to change the base?

Netstacklat: Add filtering and grouping #129

Uh oh!

Conversation

simosund commented Jun 5, 2025

Uh oh!

simosund commented Jun 9, 2025

Uh oh!

Uh oh!