Do not allow hw filter stealing in no-5tuple support scenarios#279

Open
mina wants to merge 1 commit into Xilinx-CNS:master from mina:af-xdp-filter-steal-fix
Conversation


@mina mina commented May 27, 2025

Currently we're running into the following behavior in an AF_XDP-based deployment (which has 3-tuple-only filter support):

  • Onload stack A starts and listens on port X.
  • Onload stack A exits; its cleanup of hw filters is delayed.
  • Onload stack B starts and listens on the same port X.
  • Onload stack B sees the 3-tuple hw filter inserted by stack A, which has not yet been cleaned.
  • In the current code, stack B will reuse the hw filter of stack A, but that 3-tuple hw filter points to the wrong stack, and the wrong RX queue.
  • This causes all the sockets created by stack B to become unaccelerated, because they land on the rxq that belonged to stack A, not the rxq that belongs to stack B.

The root cause of this issue is that oof_socket_add_wild expects oof_local_port_addr_fixup_wild to fix hw filters that point to different stacks, but that will not actually happen if (!oof_socket_can_share_hw_filter && fm->fm_hwports_no5tuple). This creates the undesirable behavior of all sockets in the new stack being unaccelerated, even when the user passed EF_NO_FAIL=0 (!)

Fix this by returning an error in this edge case. This means that a new onload stack will not reuse a hw filter that doesn't belong to it. The user can restart the workload when the old onload stack has finished cleanup of hw filters.

Note that this patch will change the behavior for all fm->fm_hwports_no5tuple deployments. This seems beneficial as all those deployments should see the same bug, but other options are possible:

  • We could add a module parameter that configures this behavior.
  • We may be able to only disable this behavior for AF_XDP deployments.

Tested by running these commands with onload in a tight loop:

onload tcp_rr -F 1000
onload udp_rr -F 1000

tcp_rr and udp_rr reuse the same ports by default. In all cases, onload starts successfully if there is no existing hw filter left over from the previous onload invocation, and fails with -EEXIST if there is a left-over filter from the previous stack.

Change-Id: I6e196ac55cc81efbc537206cdc6e20b2068a7e3d

@mina mina requested a review from a team as a code owner May 27, 2025 22:23
@sianj-xilinx
Contributor

The way that sort of scenario is expected to be handled is through use of the existing onload module parameters oof_shared_keep_thresh and oof_shared_steal_thresh. In this case I would expect that by setting oof_shared_steal_thresh=0 you should allow the second stack to steal the existing wild filter. Have you tried this?

@mina
Author

mina commented Jun 2, 2025

Thanks for the suggestion. I've tried oof_shared_steal_thresh=0, and we've root-caused a related bug in oof_local_port_addr_fixup_wild that was preventing the stealing from working on our platform. Long story short, though, oof_shared_steal_thresh=0 plus our related fix does not fully resolve the issue.

With oof_shared_steal_thresh=0 and our fix to oof_local_port_addr_fixup_wild, what I see is:

  1. One onload stack starts, exits, and leaves behind a hw filter.
  2. The next onload stack starts, and oof_socket_add_wild sees that there is an existing hw filter.
  3. Asynchronously, oof_local_port_addr_fixup_wild runs (I guess when the previous onload stack starts its delayed cleanup); it deletes the hw filter pointing to the old stack and inserts one pointing to the new stack.

This all works as expected, except in the case where the new onload stack receives data between steps 2 and 3. In this case, I still see some unaccelerated sockets:

Every 1.0s: onload_stackdump stats | grep -i accept                                                                                                                                                  almasrymina-3: Mon Jun  2 22:56:49 2025

tcp_accept_os: 573
ul_accepts: 428

What happened in this run is that 573 sockets were opened before oof_local_port_addr_fixup_wild ran, and those were unaccelerated (tcp_accept_os); the 428 sockets opened after oof_local_port_addr_fixup_wild ran and fixed up the hw filters became accelerated (ul_accepts).

The solution in this patch still seems valid to me; let me know if you disagree. Basically, we make it so that if oof_socket_add_wild sees an existing filter pointing to another onload stack, it fails creating the new stack, rather than relying on a future call to oof_local_port_addr_fixup_wild to fix the hw filters (since there is a time window before the hw filter is stolen, which means unaccelerated sockets).

We will share the fix we found for the stealing in oof_local_port_addr_fixup_wild to work properly separately.

@zhuyifei1999
Copy link
Copy Markdown
Contributor

We will share the fix we found for the stealing in oof_local_port_addr_fixup_wild to work properly separately.

Is that #282?
