This framework currently cleans up stale sockets (those stuck in the CLOSE-WAIT state) after a user-configured timeout (60s by default).
The BPF side already removes a socket from the set of deletion candidates if it leaves the CLOSE-WAIT state in time, as shown in the code below:
    if (oldstate == TCP_CLOSE_WAIT && newstate != TCP_CLOSE_WAIT) {
        bpf_map_delete_elem(&close_wait_tracker, &key);
    }
At the same time, the user-space application also cleans up sockets once they time out, as shown in the code below:
    if age > timeoutNs {
        log.Printf("Stale CLOSE_WAIT: %s:%d -> %s:%d (age=%v, netns=%s)",
            socket.FormatIP(key.SrcIp), socket.Ntohs(key.SrcPort),
            socket.FormatIP(key.DstIp), socket.Ntohs(key.DstPort),
            time.Duration(age), netns.GetNameByIno(info.NetnsIno))
        err := socket.DestroySocketNetnsIno(
            info.NetnsIno,
            key.Proto,
            key.SrcIp, key.SrcPort,
            key.DstIp, key.DstPort,
        )
        ...
    }
Handling deletion in two places can cause race conditions that are hard to deal with, because there are no synchronization primitives between kernel and user space.
For this reason, a different approach is worth exploring.
The idea is to move the timeout handling to the BPF side as well, and have it notify user space efficiently (e.g. using perf events) so that user space only performs the actual cleanup.
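
As a rough illustration of the user-space half of this idea, the sketch below decodes one notification record and passes it to a cleanup callback. It is only a sketch under stated assumptions: the `staleEvent` layout, the helper names (`encodeStaleEvent`, `decodeStaleEvent`, `handleRecord`), the IPv4-only fields and the little-endian encoding are all hypothetical and not part of the current framework. The callback stands in for the existing `socket.DestroySocketNetnsIno` call.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// staleEvent is a HYPOTHETICAL fixed-size record that the BPF side could emit
// once a socket has exceeded the CLOSE_WAIT timeout. The layout (32 bytes,
// IPv4 only) is an assumption made for this sketch, not the framework's ABI.
type staleEvent struct {
	AgeNs    uint64   // time spent in CLOSE_WAIT, in nanoseconds
	NetnsIno uint32   // network namespace inode
	SrcIP    uint32   // kept exactly as the BPF program stored it
	DstIP    uint32
	SrcPort  uint16
	DstPort  uint16
	Proto    uint8
	_        [7]uint8 // explicit padding up to 32 bytes
}

// encodeStaleEvent serializes an event; it exists only so the sketch can
// build a sample record without a running BPF program.
func encodeStaleEvent(ev staleEvent) []byte {
	var buf bytes.Buffer
	// binary.Write cannot fail here: the struct is fixed-size and
	// bytes.Buffer never returns a write error.
	_ = binary.Write(&buf, binary.LittleEndian, ev)
	return buf.Bytes()
}

// decodeStaleEvent parses one raw record, however it was obtained from the
// notification channel (e.g. a perf buffer sample).
func decodeStaleEvent(raw []byte) (staleEvent, error) {
	var ev staleEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev)
	return ev, err
}

// handleRecord decodes a record and hands it to destroy. User space no longer
// evaluates the timeout itself; it only reacts to what the BPF side reports.
func handleRecord(raw []byte, destroy func(staleEvent) error) error {
	ev, err := decodeStaleEvent(raw)
	if err != nil {
		return fmt.Errorf("decode stale event: %w", err)
	}
	return destroy(ev)
}

func main() {
	raw := encodeStaleEvent(staleEvent{
		AgeNs: 61_000_000_000, NetnsIno: 7, SrcPort: 8080, DstPort: 443, Proto: 6,
	})
	err := handleRecord(raw, func(ev staleEvent) error {
		// Placeholder for the real cleanup call.
		fmt.Printf("would destroy socket: netns=%d proto=%d age=%dns\n",
			ev.NetnsIno, ev.Proto, ev.AgeNs)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

With a design along these lines, the timeout decision and the tracker-map bookkeeping would live in one place (the BPF side), and user space would become a pure consumer of expiry events. Whether this fully removes the race still depends on how the BPF side orders the map deletion relative to emitting the event.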