
Conversation

kernel-patches-daemon-bpf-rc[bot]

Pull request for series with
subject: bpf: Allow opt-out from sk->sk_prot->memory_allocated.
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1008847

@kernel-patches-daemon-bpf-rc

Upstream branch: de73422
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1008847
version: 1

@kernel-patches-daemon-bpf-rc

Upstream branch: 0db4941
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1008847
version: 1

@kernel-patches-daemon-bpf-rc

Upstream branch: 56b4d16
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1008847
version: 1

@kernel-patches-daemon-bpf-rc

Upstream branch: 2e36338
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1008847
version: 1

@kernel-patches-daemon-bpf-rc

Upstream branch: 7dc484f
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1008847
version: 1

@kernel-patches-review-bot

commit ebe299a7398900b592bd3a27c68eefaf505b3be3
Author: Kuniyuki Iwashima <[email protected]>

selftest: bpf: Add test for sk->sk_bypass_prot_mem.

This commit adds a BPF selftest that validates the sk_bypass_prot_mem
functionality for TCP/UDP sockets over IPv4/IPv6, testing memory
accounting behavior with and without the bypass enabled.

> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_bypass_prot_mem.c b/tools/testing/selftests/bpf/prog_tests/sk_bypass_prot_mem.c
> new file mode 100644
> index 000000000..ce4fc47b6
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sk_bypass_prot_mem.c
> @@ -0,0 +1,282 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2025 Google LLC */
> +
> +#include <test_progs.h>
> +#include "sk_bypass_prot_mem.skel.h"
> +#include "network_helpers.h"
> +
> +#define NR_PAGES	32
> +#define NR_SOCKETS	2
> +#define BUF_TOTAL	(NR_PAGES * 4096 / NR_SOCKETS)
> +#define BUF_SINGLE	1024
> +#define NR_SEND		(BUF_TOTAL / BUF_SINGLE)
> +
> +struct test_case {
> +	char name[8];
> +	int family;
> +	int type;
> +	int (*create_sockets)(struct test_case *test_case, int sk[], int len);
> +	long (*get_memory_allocated)(struct test_case *test_case, struct sk_bypass_prot_mem *skel);
> +};
> +
> +static int tcp_create_sockets(struct test_case *test_case, int sk[], int len)
> +{
> +	int server, i;
> +
> +	server = start_server(test_case->family, test_case->type, NULL, 0, 0);
> +	ASSERT_GE(server, 0, "start_server_str");
> +
> +	/* Keep for-loop so we can change NR_SOCKETS easily. */
> +	for (i = 0; i < len; i += 2) {
> +		sk[i] = connect_to_fd(server, 0);
> +		if (sk[i] < 0) {
> +			ASSERT_GE(sk[i], 0, "connect_to_fd");
> +			return sk[i];

Can tcp_create_sockets() leak the server fd here?  If connect_to_fd()
fails, the function returns without closing server, and the close label
in the caller, check_bypass(), only closes the sk[] elements, not server.

> +		}
> +
> +		sk[i + 1] = accept(server, NULL, NULL);
> +		if (sk[i + 1] < 0) {
> +			ASSERT_GE(sk[i + 1], 0, "accept");
> +			return sk[i + 1];

Same question for accept() failure - does this leak the server fd?

> +		}
> +	}
> +
> +	close(server);
> +
> +	return 0;
> +}
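
A possible fix for both fd leaks above, sketched for illustration only
(same helpers as the patch, untested; errors break out of the loop so
that server is closed on every path):

  static int tcp_create_sockets(struct test_case *test_case, int sk[], int len)
  {
  	int server, i, err = 0;

  	server = start_server(test_case->family, test_case->type, NULL, 0, 0);
  	if (!ASSERT_GE(server, 0, "start_server"))
  		return server;

  	/* Keep for-loop so we can change NR_SOCKETS easily. */
  	for (i = 0; i < len; i += 2) {
  		sk[i] = connect_to_fd(server, 0);
  		if (!ASSERT_GE(sk[i], 0, "connect_to_fd")) {
  			err = sk[i];
  			break;
  		}

  		sk[i + 1] = accept(server, NULL, NULL);
  		if (!ASSERT_GE(sk[i + 1], 0, "accept")) {
  			err = sk[i + 1];
  			break;
  		}
  	}

  	/* Closed on both the success and the failure paths. */
  	close(server);

  	return err;
  }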
> +
> +static int udp_create_sockets(struct test_case *test_case, int sk[], int len)
> +{
> +	int i, j, err, rcvbuf = BUF_TOTAL;
> +
> +	/* Keep for-loop so we can change NR_SOCKETS easily. */
> +	for (i = 0; i < len; i += 2) {
> +		sk[i] = start_server(test_case->family, test_case->type, NULL, 0, 0);
> +		if (sk[i] < 0) {
> +			ASSERT_GE(sk[i], 0, "start_server");
> +			return sk[i];
> +		}
> +
> +		sk[i + 1] = connect_to_fd(sk[i], 0);
> +		if (sk[i + 1] < 0) {
> +			ASSERT_GE(sk[i + 1], 0, "connect_to_fd");
> +			return sk[i + 1];

Can udp_create_sockets() leak sk[i] here?  If connect_to_fd() fails
after start_server() succeeded, sk[i] holds a valid server fd, but the
function returns without closing it.  The caller will attempt close()
on all sk[] elements, and since i advances by 2 per iteration, a failure
on the first iteration (i=0) leaves sk[0] to be closed by the caller,
so that case might be OK.  But if a later iteration fails, previously
created sockets from earlier iterations would remain open and leak.

> +		}
> +
> +		err = connect_fd_to_fd(sk[i], sk[i + 1], 0);
> +		if (err) {
> +			ASSERT_EQ(err, 0, "connect_fd_to_fd");
> +			return err;

Can udp_create_sockets() leak sk[i] and sk[i+1] here?  If
connect_fd_to_fd() fails, both sockets were created but the function
returns without closing them.

> +		}
> +
> +		for (j = 0; j < 2; j++) {
> +			err = setsockopt(sk[i + j], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(int));
> +			if (err) {
> +				ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)");
> +				return err;

Same question - can this leak sk[i] and sk[i+1]?

> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
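
The same pattern would cover the three leak paths above; a hedged
sketch that unwinds every socket created so far before returning,
assuming the caller skips its own close loop when an error is returned
(otherwise the closed entries would need to be reset to -1):

  static int udp_create_sockets(struct test_case *test_case, int sk[], int len)
  {
  	int i, j, err = 0, rcvbuf = BUF_TOTAL;

  	for (i = 0; i < len; i++)
  		sk[i] = -1;

  	/* Keep for-loop so we can change NR_SOCKETS easily. */
  	for (i = 0; i < len && !err; i += 2) {
  		sk[i] = start_server(test_case->family, test_case->type, NULL, 0, 0);
  		if (!ASSERT_GE(sk[i], 0, "start_server")) {
  			err = sk[i];
  			break;
  		}

  		sk[i + 1] = connect_to_fd(sk[i], 0);
  		if (!ASSERT_GE(sk[i + 1], 0, "connect_to_fd")) {
  			err = sk[i + 1];
  			break;
  		}

  		err = connect_fd_to_fd(sk[i], sk[i + 1], 0);
  		if (!ASSERT_EQ(err, 0, "connect_fd_to_fd"))
  			break;

  		for (j = 0; j < 2 && !err; j++) {
  			err = setsockopt(sk[i + j], SOL_SOCKET, SO_RCVBUF,
  					 &rcvbuf, sizeof(int));
  			ASSERT_EQ(err, 0, "setsockopt(SO_RCVBUF)");
  		}
  	}

  	if (err) {
  		/* Unwind fds opened in earlier iterations, too. */
  		for (i = 0; i < len; i++)
  			if (sk[i] >= 0)
  				close(sk[i]);
  	}

  	return err;
  }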

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: selftest: bpf: Add test for sk->sk_bypass_prot_mem.
CI run summary: https://github.com/kernel-patches/bpf-rc/actions/runs/18413992460

@kernel-patches-daemon-bpf-rc

Forwarding comment 3391564277 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-review-bot

commit b9280697a877872a83bc3e90fbafbe8023bce9af
Author: Kuniyuki Iwashima <[email protected]>

tcp: Save lock_sock() for memcg in inet_csk_accept().

This patch moves memcg operations from inet_csk_accept() to
__inet_accept() to save one lock_sock() call per accepted socket.
It adds a protocol check to handle SCTP's use of sk_alloc() instead
of sk_clone_lock().

> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 3109c5ec3..e8771faa5 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -755,6 +755,28 @@ EXPORT_SYMBOL(inet_stream_connect);
>
>  void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *newsk)
>  {
> +	/* TODO: use sk_clone_lock() in SCTP and remove protocol checks */
> +	if (mem_cgroup_sockets_enabled &&
> +	    (!IS_ENABLED(CONFIG_IP_SCTP) || sk_is_tcp(newsk))) {

Does this condition handle MPTCP sockets correctly when CONFIG_IP_SCTP
is enabled?

Looking at the protocol check:
- sk_is_tcp() returns true only when sk->sk_protocol == IPPROTO_TCP
- MPTCP sockets have sk->sk_protocol == IPPROTO_MPTCP (262)
- When CONFIG_IP_SCTP=y and newsk is MPTCP: (!true || false) = false

So for MPTCP sockets, the condition evaluates to false and the memcg
initialization is skipped.

However, MPTCP uses sk_clone_lock() which sets newsk->sk_memcg = NULL
(net/core/sock.c sk_clone_lock), expecting __inet_accept() to initialize
it. Call chain: mptcp_stream_accept() -> inet_csk_accept() ->
__inet_accept() with newsk having IPPROTO_MPTCP.

Can MPTCP sockets end up without memcg initialization when SCTP is
configured in the kernel?
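
If this turns out to be a real gap, one conceivable adjustment (a
sketch only, not the author's fix; sk_is_mptcp() tests sk->sk_protocol
== IPPROTO_MPTCP) would be to widen the check:

  	if (mem_cgroup_sockets_enabled &&
  	    (!IS_ENABLED(CONFIG_IP_SCTP) ||
  	     sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {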

> +		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
> +
> +		mem_cgroup_sk_alloc(newsk);
> +
> +		if (mem_cgroup_from_sk(newsk)) {
> +			int amt;
> +
> +			/* The socket has not been accepted yet, no need
> +			 * to look at newsk->sk_wmem_queued.
> +			 */
> +			amt = sk_mem_pages(newsk->sk_forward_alloc +
> +					   atomic_read(&newsk->sk_rmem_alloc));
> +			if (amt)
> +				mem_cgroup_sk_charge(newsk, amt, gfp);
> +		}
> +
> +		kmem_cache_charge(newsk, gfp);
> +	}
> +
>  	sock_rps_record_flow(newsk);
>  	WARN_ON(!((1 << newsk->sk_state) &
>  		  (TCPF_ESTABLISHED | TCPF_SYN_RECV |

[ ... ]

> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index cdd1e12aa..3b83b66b2 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -712,31 +712,6 @@ struct sock *inet_csk_accept(struct sock *sk, struct proto_accept_arg *arg)
>
>  	release_sock(sk);
>
> -	if (mem_cgroup_sockets_enabled) {
> -		gfp_t gfp = GFP_KERNEL | __GFP_NOFAIL;
> -		int amt = 0;
> -
> -		/* atomically get the memory usage, set and charge the
> -		 * newsk->sk_memcg.
> -		 */
> -		lock_sock(newsk);
> -
> -		mem_cgroup_sk_alloc(newsk);
> -		if (mem_cgroup_from_sk(newsk)) {
> -			/* The socket has not been accepted yet, no need
> -			 * to look at newsk->sk_wmem_queued.
> -			 */
> -			amt = sk_mem_pages(newsk->sk_forward_alloc +
> -					   atomic_read(&newsk->sk_rmem_alloc));
> -		}
> -
> -		if (amt)
> -			mem_cgroup_sk_charge(newsk, amt, gfp);
> -		kmem_cache_charge(newsk, gfp);
> -
> -		release_sock(newsk);
> -	}
> -
>  	if (req)
>  		reqsk_put(req);
>



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: tcp: Save lock_sock() for memcg in inet_csk_accept().
CI run summary: https://github.com/kernel-patches/bpf-rc/actions/runs/18413992460

@kernel-patches-daemon-bpf-rc

Forwarding comment 3391579537 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf-rc

Upstream branch: 4c97c4b
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1008847
version: 1

If memcg is enabled, accept() acquires lock_sock() twice for each new
TCP/MPTCP socket in inet_csk_accept() and __inet_accept().

Let's move memcg operations from inet_csk_accept() to __inet_accept().

Note that SCTP somehow allocates a new socket by sk_alloc() in
sk->sk_prot->accept() and clones fields manually, instead of using
sk_clone_lock().

mem_cgroup_sk_alloc() is called for SCTP before __inet_accept(),
so I added the protocol check in __inet_accept(), but this can be
removed once SCTP uses sk_clone_lock().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

q2ven added 5 commits October 10, 2025 16:43
Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_proto->memory_allocated.

Sometimes, system processes do not want that limitation.  For a similar
purpose, there is SO_RESERVE_MEM for sockets under memcg.

Also, by opting out of the per-protocol accounting, sockets under memcg
can avoid paying costs for two orthogonal memory accounting mechanisms.
A microbenchmark result is in the subsequent bpf patch.

Let's allow opt-out from the per-protocol memory accounting if
sk->sk_bypass_prot_mem is true.

sk->sk_bypass_prot_mem and sk->sk_prot are placed in the same cache
line, and sk_has_account() always fetches sk->sk_prot before accessing
sk->sk_bypass_prot_mem, so there is no extra cache miss for this patch.
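
For illustration, the fast path then amounts to the following check
(the helper name here is hypothetical; the series folds the test into
the existing charge/uncharge paths rather than adding a new helper):

  /* Hypothetical helper, not from the patch: charge the per-protocol
   * counters only when the socket has accounting and did not opt out. */
  static inline bool sk_charges_prot_mem(const struct sock *sk)
  {
  	return sk_has_account(sk) && !sk->sk_bypass_prot_mem;
  }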

The following patches will set sk->sk_bypass_prot_mem to true, and
the per-protocol memory accounting will then be skipped.

Note that this does NOT disable memcg accounting, but only the
per-protocol one.

Another option, which would avoid using the hole in struct sock_common,
is to create sk_prot variants like tcp_prot_bypass, but this would
complicate the SOCKMAP logic, tcp_bpf_prots, etc.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
If a socket has sk->sk_bypass_prot_mem flagged, the socket opts out
of the global protocol memory accounting.

Let's control the flag by a new sysctl knob.

The flag is written once during socket(2) and is inherited by child
sockets.

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.

Setup:

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
  # ulimit -n 524288

Without net.core.bypass_prot_mem, charged to tcp_mem & memcg:

  # python3 pressure.py &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 22642688 <-------------------------------------- charged to memcg
  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
  # ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port  Peer Address:Port
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
  # nstat | grep Pressure || echo no pressure
  TcpExtTCPMemoryPressures        1                  0.0

With net.core.bypass_prot_mem=1, charged to memcg only:

  # sysctl -q net.core.bypass_prot_mem=1
  # python3 pressure.py &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2757468160 <------------------------------------ charged to memcg
  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
  # ss -tn | head -n 5
  State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
  # nstat | grep Pressure || echo no pressure
  no pressure

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
We will support flagging sk->sk_bypass_prot_mem via bpf_setsockopt()
at the BPF_CGROUP_INET_SOCK_CREATE hook.

BPF_CGROUP_INET_SOCK_CREATE is invoked by __cgroup_bpf_run_filter_sk()
that passes a pointer to struct sock to the bpf prog as void *ctx.

But there are no bpf_func_proto for bpf_setsockopt() that receives
the ctx as a pointer to struct sock.

Also, bpf_getsockopt() will be necessary for a cgroup with multiple
bpf progs running.

Let's add new bpf_setsockopt() and bpf_getsockopt() variants for
BPF_CGROUP_INET_SOCK_CREATE.

Note that inet_create() is not under lock_sock() and has the same
semantics as bpf_lsm_unlocked_sockopt_hooks.
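
For illustration, such a variant could follow the existing
bpf_sk_setsockopt() pattern in net/core/filter.c (a sketch; the names
and exact proto flags in the actual patch may differ):

  /* Sketch only: an unlocked setsockopt variant taking the raw
   * struct sock *ctx that BPF_CGROUP_INET_SOCK_CREATE passes in. */
  BPF_CALL_5(bpf_sock_create_setsockopt, struct sock *, sk,
  	   int, level, int, optname, char *, optval, int, optlen)
  {
  	/* inet_create() runs without lock_sock(), like
  	 * bpf_lsm_unlocked_sockopt_hooks, so use the unlocked helper. */
  	return __bpf_setsockopt(sk, level, optname, optval, optlen);
  }

  static const struct bpf_func_proto bpf_sock_create_setsockopt_proto = {
  	.func		= bpf_sock_create_setsockopt,
  	.gpl_only	= false,
  	.ret_type	= RET_INTEGER,
  	.arg1_type	= ARG_PTR_TO_CTX,
  	.arg2_type	= ARG_ANYTHING,
  	.arg3_type	= ARG_ANYTHING,
  	.arg4_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
  	.arg5_type	= ARG_CONST_SIZE,
  };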

Signed-off-by: Kuniyuki Iwashima <[email protected]>
If a socket has sk->sk_bypass_prot_mem flagged, the socket opts out
of the global protocol memory accounting.

This is easily controlled by net.core.bypass_prot_mem sysctl, but it
lacks flexibility.

Let's support flagging (and clearing) sk->sk_bypass_prot_mem via
bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook.

  int val = 1;

  bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_BYPASS_PROT_MEM,
                 &val, sizeof(val));

As with net.core.bypass_prot_mem, this is inherited by child sockets,
and BPF always takes precedence over sysctl at socket(2) and accept(2).

SK_BPF_BYPASS_PROT_MEM is only supported at BPF_CGROUP_INET_SOCK_CREATE
and not on other hooks, for the following reasons:

  1. UDP charges memory under sk->sk_receive_queue.lock instead
     of lock_sock()

  2. Modifying the flag after skbs have been charged to the socket
     would require adjusting the accounting during bpf_setsockopt(),
     which complicates the logic unnecessarily

We can support other hooks later if a real use case justifies that.
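
Pieced together, a complete program for this hook might look as
follows (a sketch modeled on the snippet above and the bpftool
invocation later in this thread; the actual selftest program is in the
last patch, and SK_BPF_BYPASS_PROT_MEM is assumed to be visible via
vmlinux.h as a UAPI enum):

  // SPDX-License-Identifier: GPL-2.0
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  #define SOL_SOCKET 1	/* a macro in UAPI, so not present in vmlinux.h */

  SEC("cgroup/sock_create")
  int sock_create(struct bpf_sock *ctx)
  {
  	int val = 1;

  	bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_BYPASS_PROT_MEM,
  		       &val, sizeof(val));

  	return 1;	/* allow the socket to be created */
  }

  char LICENSE[] SEC("license") = "GPL";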

Most of the changes are inlined and hard to trace, but a microbenchmark
on __sk_mem_raise_allocated() during neper/tcp_stream showed that more
samples completed faster with sk->sk_bypass_prot_mem == 1.  The
difference would be even more visible under tcp_mem pressure (though
that would not be a fair comparison).

  # bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; }
    kretprobe:__sk_mem_raise_allocated /@start[tid]/
    { @EnD[tid] = nsecs - @start[tid]; @times = hist(@EnD[tid]); delete(@start[tid]); }'
  # tcp_stream -6 -F 1000 -N -T 256

Without bpf prog:

  [128, 256)          3846 |                                                    |
  [256, 512)       1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
  [1K, 2K)          198207 |@@@@@@                                              |
  [2K, 4K)           31199 |@                                                   |

With bpf prog in the next patch:
  (must be attached before tcp_stream)
  # bpftool prog load sk_bypass_prot_mem.bpf.o /sys/fs/bpf/test type cgroup/sock_create
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/test

  [128, 256)          6413 |                                                    |
  [256, 512)       1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
  [1K, 2K)          117031 |@@@@                                                |
  [2K, 4K)           11773 |                                                    |

Signed-off-by: Kuniyuki Iwashima <[email protected]>
The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without sk->sk_bypass_prot_mem, which can be turned on by
net.core.bypass_prot_mem or bpf_setsockopt(SK_BPF_BYPASS_PROT_MEM).

  1. Create socket pairs
  2. Send NR_PAGES (32) of data (TCP consumes around 35 pages,
     and UDP consumes 66 pages due to skb overhead)
  3. Read memory_allocated from sk->sk_prot->memory_allocated and
     sk->sk_prot->memory_per_cpu_fw_alloc
  4. Check if unread data is charged to memory_allocated

If sk->sk_bypass_prot_mem is set, memory_allocated should not change,
but we allow a small error (up to 10 pages) in case other processes
on the host use some amount of TCP/UDP memory.

The number of allocated pages is buffered in the per-CPU variable
{tcp,udp}_memory_per_cpu_fw_alloc, up to +/- net.core.mem_pcpu_rsv,
before being reported to {tcp,udp}_memory_allocated.

At step 3, memory_allocated is calculated from these two variables at
an fentry hook on the socket create function.
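
In kernel terms, the value the test reconstructs is the following sum
(field names are from struct proto; this is a sketch of the concept,
not the test's BPF program):

  /* The effective page count is the global counter plus each CPU's
   * unflushed forward-alloc delta. */
  static long effective_memory_allocated(const struct proto *prot)
  {
  	long pages = atomic_long_read(prot->memory_allocated);
  	int cpu;

  	for_each_possible_cpu(cpu)
  		pages += *per_cpu_ptr(prot->memory_per_cpu_fw_alloc, cpu);

  	return pages;
  }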

We drain the receive queue only for UDP before close() because the UDP
receive queue is destroyed after an RCU grace period.  When I printed
memory_allocated, the UDP bypass cases sometimes saw the no-bypass
case's leftover, but it is still within the small error range (<10 pages).

  bpf_trace_printk: memory_allocated: 0   <-- TCP no-bypass
  bpf_trace_printk: memory_allocated: 35
  bpf_trace_printk: memory_allocated: 0   <-- TCP w/ sysctl
  bpf_trace_printk: memory_allocated: 0
  bpf_trace_printk: memory_allocated: 0   <-- TCP w/ bpf
  bpf_trace_printk: memory_allocated: 0
  bpf_trace_printk: memory_allocated: 0   <-- UDP no-bypass
  bpf_trace_printk: memory_allocated: 66
  bpf_trace_printk: memory_allocated: 2   <-- UDP w/ sysctl (2 pages leftover)
  bpf_trace_printk: memory_allocated: 2
  bpf_trace_printk: memory_allocated: 2   <-- UDP w/ bpf (2 pages leftover)
  bpf_trace_printk: memory_allocated: 2

We prefer finishing the tests quickly over sleeping long enough for
call_rcu() + sk_destruct().

The test completes within 2s on QEMU (64 CPUs) w/ KVM.

  # time ./test_progs -t sk_bypass
  #371/1   sk_bypass_prot_mem/TCP  :OK
  #371/2   sk_bypass_prot_mem/UDP  :OK
  #371/3   sk_bypass_prot_mem/TCPv6:OK
  #371/4   sk_bypass_prot_mem/UDPv6:OK
  #371     sk_bypass_prot_mem:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

  real	0m1.481s
  user	0m0.181s
  sys	0m0.441s

Signed-off-by: Kuniyuki Iwashima <[email protected]>