Netstacklat: Add filtering and grouping #129

Draft
wants to merge 14 commits into
base: main
@simosund simosund commented Jun 5, 2025

Hi @tohojo and @netoptimizer. This is only a draft PR for now to facilitate some discussion. For a real PR I imagine we would break this down into smaller chunks (this is 4 different branches stacked on top of each other). Each of those essentially deals with:

  • Change the socket enqueue hook because the one I used had some issues
  • Multiplex all histograms into a single hashmap (needed for the grouping support, and is generally more convenient)
  • Additional filtering options (ifindex, network namespace and cgroup. TODO: add filter for non-empty socket rx queues)
  • Groupby support (ifindex and cgroups)

So I do not expect you to review this in detail as of now (although if you have nothing better to do, feel free to do so; I do not have any major cleanup planned). I would however like some feedback on some design decisions and the limitations they bring. The design decisions/limitations that I would like you to consider are:

  • For multiplexing histograms:

    • In userspace I maintain a sorted array of the histogram keys to allow relatively fast lookup via bsearch. Insertion is not very efficient (O(n)), but should be manageable as long as we do not have more than a few thousand histograms (I will do some more performance tests).
    • If the BPF hashmap runs out of space (which may happen with the groupby options) it will simply not create any new entries. This may result in it keeping partial histograms, where some buckets are misreported as being empty. I do not attempt to evict any existing entry as I do not really have any good grounds for choosing which one to evict. LRU could be an option, but Jesper mentioned that LRU maps do not work too well with many CPUs.
    • If userspace runs out of space for histogram keys (which may happen with groupby), it will warn that it's missing some histograms collected by the eBPF programs, but keep running.
  • For ifindex filtering:

    • I do not support ifindex > 16384 (can be bumped up a bit, but it is not practical to support >> 1000000; see commit message for b05cd38 - "netstacklat: Add filtering for network interfaces")
    • I only check the ifindex, I do not jointly consider ifindex - namespace combinations (I only really intend for you to track one namespace, see next point)
  • For network namespace filtering:

    • By default I set netstacklat to filter for the same network namespace as you run the tool in (which will probably be the root network namespace)
    • I only support filtering for a single network namespace (see message for eb291bd - "netstacklat: Add filtering for network namespace"). However, you can disable the filtering to get data for all network namespaces.
  • For groupby options:

    • You supply each groupby option as its own command-line argument (i.e. --groupby-interface and --groupby-cgroup). It would perhaps be nicer to have --groupby interface,cgroup, but I find that adds needless complexity as long as we only have 2 options. For now it's unlikely I'll add many more groupby options, as the number of histograms you get for all combinations quickly explodes.
    • For --groupby-interface the user space agent prints out the ifindex, not the interface name (see message for d42cdc2 - "netstacklat: Add option to groupby interface"). The provided ebpf-exporter config attempts to print out the interface names (but I assume that will not work as intended if the ifindex is from a different namespace).
    • For --groupby-cgroup the user space agent prints out the cgroup ID (inode), not the path (see message for cf7db6f - "netstacklat: Add option to groupby cgroup"). The provided ebpf-exporter config prints out the paths.

Further details can be found in each commit message.

@netoptimizer There's currently (AFAIK) no convenient way to provide the various values to filter for with ebpf-exporter. If PR 531 for ebpf-exporter is merged it seems like you would be able to provide a cgroup path regex in the YAML config, but that still leaves filtering on PIDs, interfaces and network namespace. The network namespace can be hardcoded in the config in the eBPF source code, but the rest need to be set in BPF maps (after their corresponding filter_xxx member in the config in the eBPF source has been set to true). That can be done externally with bpftool. I can hack up a shell script that does that if you want.

simosund added 13 commits May 28, 2025 22:22
Previously netstacklat used fexit:tcp_data_queue as the socket enqueue
point for TCP and fexit:udp[v6]_queue_rcv_one_skb for UDP. For both of
these functions, the skb may actually have been freed by the time they
return, leading us to read invalid data. Furthermore, not all calls to
these functions necessarily end up enqueuing the skb to the socket,
as they may be dropped for various reasons. For TCP, there are also
some fast paths that may enqueue data to the socket without going
through tcp_data_queue. Therefore, update these probes to hook more
suitable functions.

For TCP, use the tcp_queue_rcv function (which is called by
tcp_data_queue when the data is actually queued to the socket). This
function is much closer to the actual socket enqueue point, will never
free the skb itself (although its return value may indicate to the
calling function that it should be freed as it has been coalesced into
the tail skb in the receive queue), is only called when the skb is
actually queued to the socket, and is also called in a fast path of
tcp_rcv_established that bypasses tcp_data_queue.

For UDP, use __udp_enqueue_schedule_skb, which
udp[v6]_queue_rcv_one_skb functions call when they actually attempt to
enqueue the skb to the socket. This function may still fail to enqueue
the skb to the socket (if e.g. the socket buffer is full), so check the
return value so that we only report the instances where the skb is
successfully enqueued. This function is called by both the IPv4 and
IPv6 UDP paths, so similar to the TCP case we only need to hook a
single function now.

Signed-off-by: Simon Sundberg <[email protected]>
Change the way that the latency histograms are stored in BPF
maps. Instead of keeping a separate array map for each histogram,
store all histograms in a single hash map, encoding the hook as part
of the key.

This results in higher overhead (as hash lookups are slower than array
lookups), but is much more flexible. This makes it easier to add
additional hook points as no new maps (and related code for mapping
hooks to maps) need to be added. Furthermore, in the future it makes
it easy to group the results on various aspects by adding additional
members to the key.

On the userspace side, maintain a sorted array of the encountered
histogram keys and a mapping to the corresponding histogram
buckets. Instead of keeping a separate key for each histogram
bucket (as the BPF maps do to be compatible with ebpf-exporter),
restructure the data so only a single key is used per
histogram. Essentially remove the bucket member from the key, keeping
a full histogram (where any missing buckets are zeroed) for the
remaining unique members in the histogram key (so far just the hook
identifier).

Keeping the array of histogram keys sorted allows for relatively quick
lookups using binary search. When a new histogram key is encountered
it will incur significant overhead the first time as it needs to be
inserted into the right place in the array, but lookups ought to be
much more common than inserting new keys. While this data structure
will not scale well to a very large amount of unique keys (insertion
time is O(n), lookup O(log n)), it avoids implementing or adding
dependencies to more complicated data structures like trees or
hash maps. As long as we do not need to keep track of many thousands
of histograms, this solution should be good enough.

Signed-off-by: Simon Sundberg <[email protected]>
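
The sorted-array bookkeeping described above could look roughly like the following user-space sketch. It is not the code from this PR: the struct names, MAX_HISTS and N_BUCKETS are hypothetical, and the key only contains the hook identifier. Lookup uses bsearch() (O(log n)); insertion shifts the tail of the array with memmove() (O(n)).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_HISTS 256 /* hypothetical capacity */
#define N_BUCKETS 32  /* hypothetical number of buckets per histogram */

struct hist_key {
	uint32_t hook; /* so far the only member of the key */
};

struct hist {
	struct hist_key key;
	uint64_t buckets[N_BUCKETS]; /* missing buckets stay zeroed */
};

struct hist_table {
	struct hist entries[MAX_HISTS]; /* kept sorted by key */
	size_t count;
};

/* bsearch() passes the searched key first and the array element second */
static int cmp_key(const void *key, const void *elem)
{
	const struct hist_key *k = key;
	const struct hist *h = elem;

	return (k->hook > h->key.hook) - (k->hook < h->key.hook);
}

/* O(log n) lookup, returns NULL if the key has not been seen yet */
static struct hist *hist_lookup(struct hist_table *t,
				const struct hist_key *key)
{
	return bsearch(key, t->entries, t->count, sizeof(t->entries[0]),
		       cmp_key);
}

/* O(n) insertion: shift all larger entries one slot to keep the order */
static struct hist *hist_lookup_or_insert(struct hist_table *t,
					  const struct hist_key *key)
{
	struct hist *h = hist_lookup(t, key);
	size_t pos;

	if (h)
		return h;
	if (t->count >= MAX_HISTS)
		return NULL; /* out of space, caller should warn */

	for (pos = 0; pos < t->count; pos++)
		if (cmp_key(key, &t->entries[pos]) < 0)
			break;

	memmove(&t->entries[pos + 1], &t->entries[pos],
		(t->count - pos) * sizeof(t->entries[0]));
	memset(&t->entries[pos], 0, sizeof(t->entries[pos]));
	t->entries[pos].key = *key;
	t->count++;
	return &t->entries[pos];
}
```

Adding more members to hist_key (e.g. for grouping) only requires extending cmp_key() to compare them in a fixed order.
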
Update the ebpf-exporter config to match the change to how the
histograms are stored in the previous commit.

As all histograms are stored in a single map, adding additional hooks
in the future will only require adding a single line to the hook
static_map.

Signed-off-by: Simon Sundberg <[email protected]>
Refactor the parsing of arguments that accept lists of
values (e.g. --pids 1,2,3). Introduce a generic function for
parsing delimited string lists and reuse that function to avoid
repeating similar logic. This will simplify adding additional
arguments that accept lists of values in the future.

Signed-off-by: Simon Sundberg <[email protected]>
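
A generic list parser of the kind described above might be sketched as follows. The function name parse_ulong_list() and its signature are assumptions for illustration, not the helper added in this commit. Note that strtok_r() skips empty tokens, so "1,,2" parses as two values.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: split str on any character in delim and parse each
 * token as a base-10 unsigned long. Returns the number of parsed values,
 * or a negative errno value on failure.
 */
static int parse_ulong_list(const char *str, const char *delim,
			    unsigned long *vals, size_t max_vals)
{
	char *copy, *tok, *save = NULL, *end;
	unsigned long v;
	int n = 0;

	copy = strdup(str); /* strtok_r() modifies its input */
	if (!copy)
		return -ENOMEM;

	for (tok = strtok_r(copy, delim, &save); tok;
	     tok = strtok_r(NULL, delim, &save)) {
		if ((size_t)n >= max_vals) {
			n = -E2BIG; /* more values than the caller can hold */
			break;
		}

		errno = 0;
		v = strtoul(tok, &end, 10);
		if (errno || end == tok || *end != '\0') {
			n = -EINVAL; /* token is not a plain integer */
			break;
		}
		vals[n++] = v;
	}

	free(copy);
	return n;
}
```

Each list-valued option can then call the same helper and only differ in how the resulting values are validated and stored.
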
The full array of pid-values that could be parsed from the user was
kept inside the config struct, which is allocated on stack (in the
main function). Change the config struct to only keep a pointer to
this array, and allocate it on the heap instead to avoid keeping this
relatively large data structure on the stack.

While not necessarily a large problem yet, establishing this pattern
reduces the risk of running out of stack as new fields to filter for
are added down the line or the maximum number of values to parse is
increased.

Also rename MAX_FIILTER_PIDS to MAX_PARSED_PIDS to better reflect what
it actually is, and update the comment in parse_arguments() to reflect
that the option is called pids and not filter-pids.

Signed-off-by: Simon Sundberg <[email protected]>
The filtering for specific pids (--pids/-p) makes use of a BPF array
map where each entry to be included is set to 1 (the rest remain 0 as
all entries by default are zeroed in array maps). Generalize the user
space logic that initializes the entries in this filter map so that it
can be reused by other similar features in the future.

Signed-off-by: Simon Sundberg <[email protected]>
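
The dense filter idea can be illustrated with a plain in-memory array standing in for the BPF array map. This is only a sketch: struct dense_filter and both function names are hypothetical, and the real code would update the map through the BPF map API rather than write to a local array.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* In-memory stand-in for a BPF array map where all entries start at 0 */
struct dense_filter {
	uint64_t *vals;
	uint32_t max_entries;
};

/* Mark every key that should be included by setting its entry to 1 */
static int dense_filter_fill(struct dense_filter *f, const uint32_t *keys,
			     size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (keys[i] >= f->max_entries)
			return -ERANGE; /* key does not fit in the map */
		f->vals[keys[i]] = 1;
	}
	return 0;
}

/* A key passes the filter if its entry exists and is non-zero */
static int dense_filter_match(const struct dense_filter *f, uint32_t key)
{
	return key < f->max_entries && f->vals[key] != 0;
}
```

Because the fill logic only depends on a list of integer keys and the map size, the same routine works for any filter with a bounded key range.
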
Add the -i/--interfaces option to filter for specific network
interfaces. The interfaces can either be provided as interface names
or indices, although if the interfaces are in another namespace than
the netstacklat userspace agent is running in they SHOULD be provided
as indices. The names are resolved in the current namespace, and may
therefore fail or yield incorrect indices if the interface is in
another namespace.

Unlike the previous -p/--pids option, this option applies to all
existing probe points in netstacklat. On the eBPF side, use
skb->skb_iif if the skb is available as context, otherwise use the
sk->sk_rx_dst_ifindex from the socket. Use a similar approach as the
previous PID filter, where an array map is used to hold ifindices that
should be filtered for, allowing a quick lookup. While the
ifindex (unlike the PID) does not seem to have a clear upper limit,
limit it to 16384 (IFINDEX_MAX) to keep the filter map reasonably
small while still supporting the vast majority of scenarios.

Note that internally filtering is applied on the interface
index (ifindex), regardless of whether the option provided the index
or the name for the interface. If the same ifindex is repeated in
multiple network namespaces, it will include traffic for all of them.
A future commit will add an option to also filter for a specific
network namespace.

Signed-off-by: Simon Sundberg <[email protected]>
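
Accepting either an index or a name could be sketched as below. The function parse_iface() is hypothetical; if_nametoindex() is the standard libc call, and it resolves the name in the namespace of the calling process, which is why names from another namespace may fail or resolve to the wrong index. IFINDEX_MAX is the limit quoted above.

```c
#include <errno.h>
#include <net/if.h>
#include <stdlib.h>

#define IFINDEX_MAX 16384 /* limit quoted in the commit message */

/*
 * Hypothetical helper: interpret str as an ifindex if it is a plain
 * integer, otherwise as an interface name in the current namespace.
 * Returns the ifindex, or a negative errno value on failure.
 */
static long parse_iface(const char *str)
{
	unsigned long idx;
	char *end;

	errno = 0;
	idx = strtoul(str, &end, 10);
	if (errno || end == str || *end != '\0') {
		/* Not a number, try to resolve it as an interface name */
		idx = if_nametoindex(str); /* returns 0 on failure */
		if (idx == 0)
			return -ENODEV;
	}

	if (idx == 0 || idx > IFINDEX_MAX)
		return -ERANGE;
	return (long)idx;
}
```
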
Add the -n/--network-namespace option, which lets the user specify
which network namespace ID (inode number) should be monitored. Apply
the filtering to all current netstacklat probe points.

Use the value 0 (default) to filter for the network namespace that the
netstacklat application itself is running in. Use value -1 to disable
the filtering, including data from all network namespaces (equivalent
with the behavior before this commit).

Only support filtering for a single network namespace (or all
namespaces if the filtering is disabled). This minimizes runtime
overhead by allowing the ID to filter for to be kept as a constant in
the eBPF program. Supporting multiple values would require an
additional map lookup, and due to the wide range of IDs it would have
to be a hashmap lookup, which would add considerable overhead for a
rather niche use case (monitoring multiple network namespaces).

Note that this option will interact with the -i/--interfaces option,
as the ifindices that the --interfaces option filters for are relative
to the network namespace set by this option.

Signed-off-by: Simon Sundberg <[email protected]>
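
The 0 / -1 / explicit-ID semantics can be sketched in user space as follows. The struct and function names are hypothetical and how the value is handed to the eBPF program is not shown. The inode of the current network namespace is read with stat() on /proc/self/ns/net.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical user-space view of the namespace filter setting */
struct netns_filter {
	bool enabled;
	uint64_t inode;
};

/* Inode number of the network namespace we run in, or 0 on error */
static uint64_t own_netns_inode(void)
{
	struct stat st;

	if (stat("/proc/self/ns/net", &st) != 0)
		return 0;
	return (uint64_t)st.st_ino;
}

/*
 * Map the option value to a filter: -1 disables filtering, 0 selects the
 * namespace of the tool itself, any other value is used as the inode
 * number to filter for (no further validation in this sketch).
 */
static struct netns_filter resolve_netns_filter(long long requested,
						uint64_t own_inode)
{
	struct netns_filter f = { .enabled = true, .inode = 0 };

	if (requested == -1)
		f.enabled = false;
	else if (requested == 0)
		f.inode = own_inode;
	else
		f.inode = (uint64_t)requested;

	return f;
}
```

A caller would typically pass own_netns_inode() as the second argument, so that the default value 0 resolves to the namespace the tool was started in.
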
Add the -c/--cgroups option that lets the user specify one or more
cgroups (v2) to filter for. The cgroups can either be provided through
their absolute path (including mount path) or as the cgroup
ID (inode). This filter only applies to the probe points running in
process context, just like the PIDs filter, which is currently only
the socket dequeue probes (tcp-read and udp-read).

To keep it simple (and avoid high run-time overhead), only do an exact
match on the cgroup ID. Do not consider the hierarchical relationship
between cgroups. I.e. if the output is filtered for a parent cgroup,
the children of that cgroup will NOT be included unless the children
cgroups have also been explicitly specified.

To support the wide range of possible cgroup IDs, keep the cgroups to
filter for in a sparse hashmap (where only values to include have
entries) instead of the dense array maps (where all possible values
have keys but those to include have non-zero values) like previous
multi-valued filters. This unfortunately adds considerable overhead
for doing an additional hash map lookup, but keeping a dense map for
all possible IDs is not feasible.

Signed-off-by: Simon Sundberg <[email protected]>
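
Resolving a cgroup argument to an ID could look like the sketch below, assuming (as stated above) that the cgroup ID is the inode number of the cgroup directory. The helper name parse_cgroup() is hypothetical. It does nothing about child cgroups, matching the exact-match semantics of the filter.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: accept either an absolute cgroup path or a numeric
 * cgroup ID. For a path, the ID is taken to be the inode number of the
 * directory. Returns 0 on success or a negative errno value.
 */
static int parse_cgroup(const char *str, uint64_t *id)
{
	unsigned long long v;
	struct stat st;
	char *end;

	if (str[0] == '/') {
		if (stat(str, &st) != 0)
			return -errno;
		if (!S_ISDIR(st.st_mode))
			return -ENOTDIR;
		*id = (uint64_t)st.st_ino;
		return 0;
	}

	errno = 0;
	v = strtoull(str, &end, 10);
	if (errno || end == str || *end != '\0')
		return -EINVAL;

	*id = v;
	return 0;
}
```
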
Add the -q/--nonempty-queue option, which when enabled only includes
latency values when the socket receive queue is non-empty.

Only apply this to the socket-read hooks (tcp-socket-read,
udp-socket-read), where the probes are triggered AFTER the skbs have
been read from the socket queue, and a non-empty queue therefore
signifies that additional data remains after the read.

The idea behind this option is to offer a way to reduce overhead (by
aborting early for all instances where the socket receive queue is
empty) while still capturing latency for applications that can be
assumed to be under some load (enough load that more data queues up
than the application will immediately read).

Signed-off-by: Simon Sundberg <[email protected]>
Make the userspace agent set the BPF map sizes based on configured
options. This allows more suitable map sizes to be used during run
time than the static limits set in the BPF programs. This avoids
wasting memory by using unnecessarily large maps, and might slightly
improve hashmap lookup performance by sizing them based on the
expected number of entries (small enough that many entries may fit in
cache, large enough to avoid excessive hash collisions).

Scale the histogram map based on the expected number of histograms,
the PID and ifindex filter maps to fit the largest key they need to
include and the cgroup filter map to fit all tracked cgroups.

This also fixes a bug where the maximum allowed PID (PID_MAX_LIMIT)
and ifindex (IFINDEX_MAX) did not fit in their corresponding filter
maps (off-by-one error).

Signed-off-by: Simon Sundberg <[email protected]>
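
The off-by-one mentioned above comes from array maps being indexed from 0: to hold key K the map needs at least K + 1 entries. A minimal sketch of the sizing rule for the dense filter maps, with a hypothetical function name:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of entries a dense (array) filter map needs in order to hold the
 * largest key in keys[]. Array maps cannot have zero entries, so return 1
 * if there are no keys at all. Assumes keys are well below UINT32_MAX.
 */
static uint32_t dense_map_entries(const uint32_t *keys, size_t n)
{
	uint32_t max = 0;

	for (size_t i = 0; i < n; i++)
		if (keys[i] > max)
			max = keys[i];

	return n ? max + 1 : 1;
}
```

With libbpf, a value like this would be applied with bpf_map__set_max_entries() before the BPF object is loaded; whether this PR does exactly that is not shown here.
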
Add the -I/--groupby-interface option to collect and report the data
on a per-interface (or rather ifindex) basis. Note that the network
interfaces are tracked based on their ifindex, so if network namespace
filtering has been disabled and there exist interfaces in different
namespaces with a common ifindex, their data will be merged into the
same histogram.

Always write the interface index rather than the interface name in the
output. While the interface name for the same network namespace as the
user space agent runs in can easily be retrieved with
e.g. if_indextoname(), that will only be valid if the user has
configured netstacklat to only monitor its own network namespace. If a
different network namespace is monitored, or filtration for network
namespaces is disabled, translating to the interface names in the
current namespace might produce misleading results.

An alternative could be to print out the interface names in case the
current network namespace is the one monitored (the default), or the
index if there's a risk that the data might be from a different
namespace. However, in addition to that added complexity, that will
produce somewhat inconsistent output (i.e. you might get interface
names or interface indices depending on how you configure
netstacklat).

Signed-off-by: Simon Sundberg <[email protected]>
Add the -C/--groupby-cgroup option to collect and report data on a
per-cgroup basis. Just like the -c/--cgroups option, this will only
apply to probes in the process context, which is currently only
tcp-socket-read and udp-socket-read.

When reporting the data, print out the cgroup ID (inode number)
directly instead of the cgroup path. As far as I can tell, the only
way to resolve the ID into a path is to walk the entire cgroup
mount (e.g. /sys/fs/cgroup) and stat each path to find the matching
inode. Doing this every time the cgroup needs to be printed seems
highly inefficient, and to create an efficient cache the most suitable
data structure seems like a hashmap, which C lacks. Adding support for
printing out the cgroup path would thus be a significant
implementation effort for something we in the end primarily will rely
on ebpf-exporter for anyways.

Signed-off-by: Simon Sundberg <[email protected]>
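
For reference, the walk described above could be sketched like this. It is not part of this PR: the function name is hypothetical and it performs a full depth-first traversal on every call, which is exactly the cost the commit message wants to avoid.

```c
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Depth-first search in and below dir for a directory whose inode number
 * equals id. On success, write its path to out and return 0. Return -1 if
 * no match is found or the path does not fit in out.
 */
static int cgroup_id_to_path(const char *dir, uint64_t id, char *out,
			     size_t len)
{
	struct dirent *ent;
	char child[4096];
	struct stat st;
	int ret = -1;
	DIR *d;
	int n;

	/* lstat() so that symlinks are never followed into loops */
	if (lstat(dir, &st) != 0 || !S_ISDIR(st.st_mode))
		return -1;

	if ((uint64_t)st.st_ino == id) {
		n = snprintf(out, len, "%s", dir);
		return (n >= 0 && (size_t)n < len) ? 0 : -1;
	}

	d = opendir(dir);
	if (!d)
		return -1;

	while (ret != 0 && (ent = readdir(d)) != NULL) {
		if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
			continue;

		n = snprintf(child, sizeof(child), "%s/%s", dir, ent->d_name);
		if (n < 0 || (size_t)n >= sizeof(child))
			continue; /* path too long, skip this entry */

		ret = cgroup_id_to_path(child, id, out, len);
	}

	closedir(d);
	return ret;
}
```

Calling it with the cgroup mount point (e.g. /sys/fs/cgroup) as dir visits the whole hierarchy in the worst case.
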
@simosund simosund force-pushed the netstacklat-groupby branch from cf7db6f to 3026b28 Compare June 9, 2025 17:39
simosund commented Jun 9, 2025

I've now added the filter for non-empty socket rx-queue that Jesper requested as well. See commit 7dc64ad - netstacklat: Add filtering for non-empty rxqueue.

Many of the options to filter for a subset of values (--pids,
--interfaces, and --cgroups) rely on filling BPF maps with the values
to include. When running together with the provided netstacklat user
space process, the user space process handles filling these maps with
the values passed on the command line. However, when using the
netstacklat eBPF programs with some external loader, like the
ebpf-exporter, these maps have to be filled by the user in some other
manner.

To make it easier to use netstacklat with external eBPF loaders,
provide the fill_filter_maps.sh script. Rely on bpftool to fill the
BPF maps (based on the map names as defined in the netstacklat.bpf.c
file). Make the script support all the current filters
that make use of maps, i.e. PIDs (pid), network interfaces (iface) and
cgroups (cgroup).

The pid option only supports integers. The iface option supports either
interface names or their ifindex. The cgroup option accepts either the
cgroup ID (their inode number) or the full path to the cgroup.

Examples:
$ ./fill_filter_maps.sh pid 1234 98765
$ ./fill_filter_maps.sh iface veth0 lo 123
$ ./fill_filter_maps.sh cgroup /sys/fs/cgroup/system.slice/prometheus.service/ 12345

Note that for the values in the filter map to actually be used by the
netstacklat eBPF programs, the corresponding
filter_{pid,ifindex,cgroup} value must be true (by default they're all
false). The netstacklat user space process normally takes care of
enabling these as needed, but if used with an external loader the
easiest way to enable these is probably to just change them in
user_config at the start of netstacklat.bpf.c (and recompile).

Also note that this script can be used together with the
netstacklat user space loader to add additional values to filter for
after starting the program. However, the netstacklat user space
loader automatically minimizes the size of the filter maps since
commit "netstacklat: Dynamically configure map sizes". So unless the
initial filter values provided as netstacklat CLI arguments resulted
in sufficiently large filter maps, the fill_filter_maps.sh script may
not be able to successfully add the desired values.

Finally, note that as bpftool expects you to feed it the individual
bytes of the keys, the order of the bytes will be dependent on the
endianness of the machine. Currently only little-endian machines are
supported; big-endian support can be added later if needed.

Signed-off-by: Simon Sundberg <[email protected]>
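
To make the byte-order point concrete, the following sketch shows how an integer key is laid out as the space-separated hex bytes that bpftool takes after "key hex" on a little-endian machine. The function name is hypothetical, and it is written in C for illustration although the script itself is a shell script.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write the key_size least significant bytes of key to out as
 * space-separated hex, lowest byte first (little-endian layout).
 * Returns 0 on success, -1 if key_size is invalid or out is too small.
 */
static int key_to_le_hex(uint64_t key, size_t key_size, char *out,
			 size_t len)
{
	size_t pos = 0;

	if (key_size == 0 || key_size > sizeof(key))
		return -1;

	for (size_t i = 0; i < key_size; i++) {
		unsigned int byte = (key >> (8 * i)) & 0xff;
		int n = snprintf(out + pos, len - pos, i ? " %02x" : "%02x",
				 byte);

		if (n < 0 || (size_t)n >= len - pos)
			return -1;
		pos += (size_t)n;
	}
	return 0;
}
```

For a 4-byte key, PID 1234 (0x04d2) becomes "d2 04 00 00", which would then be passed along the lines of `bpftool map update name <map_name> key hex d2 04 00 00 value hex ...`, where the map name and the width of the value depend on the map definition in netstacklat.bpf.c. On a big-endian machine the bytes would have to be emitted in the opposite order.
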
@simosund simosund force-pushed the netstacklat-groupby branch from 1ed19c3 to 87ee0b4 Compare June 17, 2025 14:22