alwa-nordic
Contributor

@alwa-nordic alwa-nordic commented Oct 9, 2025

This PR moves tx_processor off the system workqueue to a dedicated workqueue to prevent deadlocks. See commits for details.

Core changes

  • New bt_taskq workqueue - "for quick non-blocking Bluetooth tasks"
  • Move tx_processor to bt_taskq - It is now safer to block on the system work queue (which the Host unfortunately does)

Fallout fixes

  • Defer ATT user cb - User callbacks like write_cmd_cb stay in system work queue
  • Grab some RAM - BT_MAX_CONN reduced from 62 to 61 in peripheral_identity sample to fit bt_taskq

Cleanups

  • Fewer workarounds - bt_cmd_send_sync workaround disabled when tx_processor uses dedicated thread

@alwa-nordic alwa-nordic force-pushed the bt-taskq branch 7 times, most recently from a9fc732 to 03c365b Compare October 10, 2025 16:56
@nashif

This comment was marked as outdated.

@alwa-nordic alwa-nordic changed the title Bt taskq Bluetooth: Host: Move tx_processor to bt_taskq Oct 10, 2025
@alwa-nordic alwa-nordic force-pushed the bt-taskq branch 6 times, most recently from a16c556 to 38bd9d0 Compare October 14, 2025 07:59
ATT is invoking user callbacks in its net_buf destroy function. It is
common practice that these callbacks can block on bt_hci_cmd_alloc().
This is a deadlock when the net_buf_unref() happens inside the HCI
driver, invoked from tx_processor.

Blocking callbacks like this appear in our own samples. See further down
about how this problem was detected.

tx_processor does not protect against blocking callbacks, so they are
de facto forbidden. The Host should not equip net_bufs with dangerous
destroy callbacks.

This commit makes ATT defer its net_buf destruction and user callback
invocation to the system workqueue, so that net_buf_unref is safe to
call from non-blocking threads. In the case of the deadlock, the
net_buf_unref() was below the tx_processor in the call stack, which (at
the time of this commit) is on the system work queue, so deferring it to
the system work queue preserves the existing behavior.
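
The deferral described above can be sketched in portable C. This is only a model under stated assumptions, not the ATT code from this PR: `struct buf`, `buf_unref`, `buf_destroy_deferred`, `deferred_work_handler` and `example_user_cb` are hypothetical names, and a plain linked list stands in for work submitted to the system workqueue. The point it shows is that dropping the last reference only queues the buffer, so the user callback runs later in a context that is allowed to block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a net_buf that carries a user callback. */
struct buf {
	int ref;
	void (*user_cb)(struct buf *buf);
	struct buf *next; /* link used while queued for deferred destruction */
};

/* Hypothetical FIFO standing in for work queued on the system workqueue. */
static struct buf *deferred_head;
static struct buf **deferred_tail = &deferred_head;

/* Counts how often the example user callback has run. */
static int user_cb_calls;

/* Destroy hook: never runs user code itself, it only queues the buffer.
 * That makes it safe to reach from a thread that must not block.
 */
static void buf_destroy_deferred(struct buf *buf)
{
	buf->next = NULL;
	*deferred_tail = buf;
	deferred_tail = &buf->next;
}

/* Drop one reference; the last reference triggers the destroy hook. */
static void buf_unref(struct buf *buf)
{
	assert(buf->ref > 0);
	if (--buf->ref == 0) {
		buf_destroy_deferred(buf);
	}
}

/* Runs later in the context that may block and invokes the user
 * callbacks of all queued buffers in FIFO order.
 */
static void deferred_work_handler(void)
{
	while (deferred_head != NULL) {
		struct buf *buf = deferred_head;

		deferred_head = buf->next;
		if (deferred_head == NULL) {
			deferred_tail = &deferred_head;
		}
		if (buf->user_cb != NULL) {
			buf->user_cb(buf);
		}
	}
}

/* Example user callback; a real one might block, which is why it is
 * only ever called from deferred_work_handler().
 */
static void example_user_cb(struct buf *buf)
{
	(void)buf;
	user_cb_calls++;
}
```

In this model, calling `buf_unref()` on a buffer with one reference leaves `user_cb_calls` unchanged; the callback only runs once `deferred_work_handler()` is called from the blocking-capable context.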

A future improvement may be to allow the user to provide their own
workqueue for ATT callbacks.

This deadlock was detected because the following test was failing while
moving tx_processor to the bt_taskq:

    tests/bsim/bluetooth/ll/throughput/tests_scripts/gatt_write.sh

The above test has an ATT callback, `write_cmd_cb`, that invokes
`bt_conn_le_param_update`, which can block waiting for `tx_processor`.

The reason it was not failing while tx_processor was on the system work
queue is that the GATT API has a special non-blocking behavior when
called from the system work queue.

Signed-off-by: Aleksander Wasaznik <[email protected]>
Reduce BT_MAX_CONN from 62 to 61 to make it build on integration
platform qemu_cortex_m3/ti_lm3s6965 when we add bt_taskq in subsequent
commit.

Signed-off-by: Aleksander Wasaznik <[email protected]>
Add a new workqueue bt_taskq specifically designed for quick
non-blocking work items in the Bluetooth subsystem.

Signed-off-by: Aleksander Wasaznik <[email protected]>
It's not safe for the tx_processor to share the system workqueue with
work items that block the thread until tx_processor runs. This is a
deadlock.

The Bluetooth Host itself performs these operations, usually involving
bt_hci_cmd_alloc(), on the system workqueue.

This change effectively gives tx_processor its own thread, like the BT
TX thread that used to exist. But, this time the thread is intended to
be shared with any other non-blocking Bluetooth Host tasks.
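
The deadlock can be illustrated with a small single-threaded simulation in portable C. This is a sketch under assumptions, not Zephyr code: `enum item`, `struct queue` and `all_work_completes` are hypothetical names, each queue models one workqueue thread that runs its items strictly in order, and an `ITEM_WAIT_FOR_TX` item models work that blocks its thread until tx_processor has run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Two kinds of work items in this model. */
enum item {
	ITEM_TX_PROCESSOR, /* never blocks; marks TX as processed */
	ITEM_WAIT_FOR_TX,  /* blocks its thread until TX was processed */
};

/* One workqueue thread: items run in order, one at a time. */
struct queue {
	const enum item *items;
	size_t len;
	size_t head; /* index of the next item to run */
};

/* Step every queue until nothing can make progress any more.
 * Returns true if all items completed, false if some queue is stuck
 * behind a blocked item (a deadlock in this model).
 */
static bool all_work_completes(struct queue *queues, size_t n_queues)
{
	bool tx_done = false;
	bool progress = true;

	assert(queues != NULL);

	while (progress) {
		progress = false;
		for (size_t i = 0; i < n_queues; i++) {
			struct queue *q = &queues[i];

			if (q->head == q->len) {
				continue; /* queue is empty */
			}
			if (q->items[q->head] == ITEM_WAIT_FOR_TX && !tx_done) {
				continue; /* thread is blocked on this item */
			}
			if (q->items[q->head] == ITEM_TX_PROCESSOR) {
				tx_done = true;
			}
			q->head++;
			progress = true;
		}
	}

	for (size_t i = 0; i < n_queues; i++) {
		if (queues[i].head != queues[i].len) {
			return false;
		}
	}
	return true;
}
```

With a single queue holding `ITEM_WAIT_FOR_TX` followed by `ITEM_TX_PROCESSOR`, the function returns false because the waiter never yields the thread. Moving `ITEM_TX_PROCESSOR` to a second queue lets it run independently, the waiter unblocks, and the function returns true.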

Under the bt_taskq rules, tx_processor is supposed to be non-blocking and
to only have code under our control on the thread stack. Unfortunately, this is
not entirely true currently. But we consider it close enough for now and
will ensure it starts adhering to the rules in the future. Examples of
problems:

 - The tx_processor invokes bt_hci_send(), driver code which has no
   rules limiting what it can do on our thread.
 - The tx_processor invokes net_buf_unref() on stack-external net_buf
   which executes user code on our thread.

Signed-off-by: Aleksander Wasaznik <[email protected]>
The workaround in bt_cmd_send_sync is no longer needed when tx_processor
runs on a dedicated bt_taskq and not on system workqueue.

But for defensive programming, we keep the workaround in place and log a
warning if it's triggered. If CONFIG_TEST is enabled, we panic instead.

Signed-off-by: Aleksander Wasaznik <[email protected]>

@alwa-nordic alwa-nordic marked this pull request as ready for review October 14, 2025 13:33
@zephyrbot zephyrbot added area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Samples Samples area: Bluetooth labels Oct 14, 2025
@alwa-nordic
Contributor Author

Future work: This should make CONFIG_BT_RECV_WORKQ_SYS=y less problematic, since tx_processor is no longer blocked by blocking work on the system work queue. Should we update our opinion about its safety? Is it safer to enable BT_RECV_WORKQ_SYS or BT_TASKQ_SYSTEM_WORKQUEUE if you have to choose? The stack size for dedicated bt_workq is smaller.

Contributor

@PavelVPV PavelVPV left a comment

I don't think it is correct to run TX processor on a generic workqueue (even if it is Bluetooth Host specific). As soon as the generic workq is used for command sending, a deadlock will occur.

Also, currently there is no use-case for bt_taskq. I don't think introducing it now is right.

@alwa-nordic
Contributor Author

As soon as the generic workq is used for command sending

That's not allowed.

@alwa-nordic
Contributor Author

Also, currently there is no use-case for bt_taskq. I don't think introducing it now is right.

Running tx_processor?

@PavelVPV
Contributor

I mean, the way it is defined is how it can be used. It is quite hard to track in the code which API may eventually allocate or send a command to the Controller (which I guess is the main work of the Host).

TX processor needs its own thread for now. This solves the exact problem. I don't see a problem that BT task solves currently.

@alwa-nordic
Contributor Author

alwa-nordic commented Oct 14, 2025

It's not a generic work queue. The reason I define a bt_taskq is to establish that blocking on this thread is an error. That includes any blocking in tx_processor. Maybe we will find errors in tx_processor, but then we know we have to fix them.

This is in contrast to simply a thread dedicated for tx_processor. Then it's not an error to block in tx_processor.

We are concerned about RAM usage. We simply can't afford many threads. I want us to have a ready-to-use place for non-blocking tasks. It makes adding more tasks later easy and rewards this. Finding RAM for taskq was hard enough. Adding more threads later will be even harder. I really don't want tx_processor to need its own thread.

It is quite hard in the code to track which API may eventually allocate or send command to Controller (which I guess is the main work for Host).

Yeah. Writing good code is hard. We will have to maintain discipline with contracts on the taskq.

TX processor needs its own thread for now.

Does it? Why? Let's fix that!

This solves the exact problem. I don't see a problem that BT task solves currently.

What is 'this'?

@alwa-nordic alwa-nordic added the Bluetooth Review Discussion in the Bluetooth WG meeting required label Oct 14, 2025
@jhedberg
Member

I need to put some time aside to do a proper review. However, my initial question is: is this complementary to #93033, or an alternative approach to the same issue?

@PavelVPV
Contributor

It's not a generic work queue. The reason I define a bt_taskq is to establish that blocking on this thread is an error. That includes any blocking in tx_processor. Maybe we will find errors in tx_processor, but then we know we have to fix them.

This will end up in a situation where a blocking call that allocates a command buffer doesn't block the thread 99% of the time, but does block it 1% of the time, thus changing application behavior.

This is in contrast to simply a thread dedicated for tx_processor. Then it's not an error to block in tx_processor.

This is fine, still, the thread should exclusively be used for tx processor for now. Later this can be changed, but now there's nothing that requires this.

We are concerned about RAM usage. We simply can't afford many threads. I want us to have a ready-to-use place for non-blocking tasks. It makes adding more tasks later easy and rewards this. Finding RAM for taskq was hard enough. Adding more threads later will be even harder. I really don't want tx_processor to need its own thread.

It doesn't change anything. This is now a new thread used by tx processor. Nothing else is using it. It is obvious that it will require memory. But now the thread analyzer needs to be run to check how much is freed on sysworkq.

It is quite hard in the code to track which API may eventually allocate or send command to Controller (which I guess is the main work for Host).

Yeah. Writing good code is hard. We will have to maintain discipline with contracts on the taskq.

Sure, but how are you going to ensure this if even the bug that triggered this change has been hiding since the removal of the tx processor thread?

TX processor needs its own thread for now.

Does it? Why? Let's fix that!

I mean, this entire task is driven by the deadlock, as we discussed in the team not long ago. I will remind you of the ticket in a PM.

This solves the exact problem. I don't see a problem that BT task solves currently.

What is 'this'?

This -> a dedicated thread for tx processor.
