-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Bluetooth: Host: Move tx_processor to bt_taskq #97288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
a9fc732
to
03c365b
Compare
This comment was marked as outdated.
This comment was marked as outdated.
a16c556
to
38bd9d0
Compare
ATT is invoking user callbacks in its net_buf destroy function. It is common practice that these callbacks can block on bt_hci_cmd_alloc(). This is a deadlock when the net_buf_unref() happens inside the HCI driver, invoked from tx_processor. Blocking callbacks like this appear in our own samples. See further down about how this problem was detected. tx_processor not protect against blocking callbacks so it is de-facto forbidden. The Host should not equip net_bufs with dangerous destroy callbacks. This commit makes ATT defer its net_buf destruction and user callback invocation to the system workqueue, so that net_buf_unref is safe to call from non-blocking threads. In the case of the deadlock, the net_buf_unref() was below the tx_processor in the call stack, which (at the time of this commit) is on the system work queue, so defering it to the system work queue is preserving the existing behavior. Future improvement may be to allow the user to provide their own workqueue for ATT callbacks. This deadlock was detected because the following test was failing while moving tx_processor to the bt_taskq: tests/bsim/bluetooth/ll/throughput/tests_scripts/gatt_write.sh The above test has an ATT callback `write_cmd_cb` invokes `bt_conn_le_param_update` can block waiting for `tx_processor`. The reason it was not failing while tx_processor was on the system work queue is that the GATT API has a special non-blocking behavior when called from the system work queue. Signed-off-by: Aleksander Wasaznik <[email protected]>
Reduce BT_MAX_CONN from 62 to 61 to make it build on integration platform qemu_cortex_m3/ti_lm3s6965 when we add bt_taskq in subsequent commit. Signed-off-by: Aleksander Wasaznik <[email protected]>
Add a new workqueue bt_taskq specifically designed for quick non-blocking work items in the Bluetooth subsystem. Signed-off-by: Aleksander Wasaznik <[email protected]>
It's not safe for the tx_processor to share the system workqueue with work items that block the thread until tx_processor runs. This is a deadlock. The Bluetooth Host itself performs these operations, usually involving bt_hci_cmd_alloc(), on the system workqueue. This change effectively gives tx_processor its own thread, like the BT TX thread that used to exist. But, this time the thread is intended to be shared with any other non-blocking Bluetooth Host tasks. The bt_taskq rules tx_processor is supposed to be non-blocking and only have code under our control on the thread stack. Unfortunately, this is not entirely true currently. But we consider it close enough for now and will ensure it starts adhering to the rules in the future. Examples of problems: - The tx_processor invokes bt_hci_send(), driver code which has no rules limiting what it can do on our thread. - The tx_processor invokes net_buf_unref() on stack-external net_buf which executes user code on our thread. Signed-off-by: Aleksander Wasaznik <[email protected]>
The workaround in bt_cmd_send_sync is no longer needed when tx_processor runs on a dedicated bt_taskq and not on system workqueue. But for defensive programming, we keep the workaround in place and log a warning if it's triggered. If CONFIG_TEST is enabled, we panic instead. Signed-off-by: Aleksander Wasaznik <[email protected]>
|
Future work: This should make CONFIG_BT_RECV_WORKQ_SYS=y less problematic, since tx_processor is no longer blocked by blocking work on the system work queue. Should we update our opinion about its safety? Is it safer to enable BT_RECV_WORKQ_SYS or BT_TASKQ_SYSTEM_WORKQUEUE if you have to choose? The stack size for dedicated bt_workq is smaller. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is correct to run TX processor on a generic workqueue (even if it is Bluetooth Host specific). As soon as the generic workq is used for command sending, a deadlock will occur.
Also, currently there is no use-case for bt_taskq
. I don't think introducing it now is right.
That's not allowed. |
Running |
I mean, the it is defined is how it can be used. It is quite hard in the code to track which API may eventually allocate or send command to Controller (which I guess is the main work for Host). TX processor needs its own thread for now. This solves the exact problem. I don't see a problem that BT task solves currently. |
It's not a generic work queue. The reason I define a bt_taskq is to establish that blocking on this thread is an error. That includes any blocking in tx_processor. Maybe we will find errors in tx_processor, but then we know we have to fix them. This is in contrast to simply a thread dedicated for tx_processor. Then it's not an error to block in tx_processor. We are concerned about RAM usage. We simply can't afford many threads. I want us to have a ready-to-use place for non-blocking tasks. It makes adding more tasks later easy and rewards this. Finding RAM for taskq was hard enough. Adding more threads later will be even harder. I really don't want tx_processor to need its own thread.
Yeah. Writing good code is hard. We will have to maintain discipline with contracts on the taskq.
Does it? Why? Let's fix that!
What is 'this'? |
I need to put some time aside to do a proper review, however initial question is that is this complementary to #93033 or an alternate approach for the same issue? |
This will end up in a situation where blocking call allocating a command buffer 99% of time doesn't block the thread, and in 1% blocks, thus changing application behavior.
This is fine, still, the thread should exclusively be used for tx processor for now. Later this can be changed, but now there's nothing that requires this.
It doesn't change anything. This is now a new thread used by tx processor. Nothing else is using it. It is obvious that it will require memory. But now thread analyzer needs to be run to check how much is freed on sysworkq.
Sure, but how are you going to ensure this if even the bug that was triggered this change was hiding since removing of tx processor thread?
I mean, this entire task is driven by the deadlock As we discuss in the team not a long time ago. I will remind you ticket in PM.
This -> a dedicated thread for tx processor. |
This PR moves
tx_processor
off the system workqueue to a dedicated workqueue to prevent deadlocks. See commits for details.Core changes
Fallout fixes
write_cmd_cb
stay in system work queueCleanups
bt_cmd_send_sync
workaround disabled when tx_processor uses dedicated thread