Describe the bug
Referring to FreeRTOS SMP tag: V11.2.0 with the following relevant configuration:
#define configNUMBER_OF_CORES 2
#define configRUN_MULTIPLE_PRIORITIES 1
#define configUSE_PREEMPTION 1
#define configUSE_TASK_PREEMPTION_DISABLE 1
#define configUSE_TIME_SLICING 1
A task is performing a critical operation and cannot be interrupted. It calls vTaskPreemptionDisable(NULL) before the operation and vTaskPreemptionEnable(NULL) after the operation. There is another task at the same priority, which could run via time slicing but for the pre-emption being disabled. The critical operation is lengthy and involves switching the hardware into different modes, so an attempted context switch during the operation will crash the system. However, interrupts need to remain enabled and be serviced as normal.
During the critical operation, a timer tick occurs, causing an interrupt which calls xTaskIncrementTick(). The line
|
xYieldPendings[ xCoreID ] = pdTRUE; |
sets
xYieldPendings[ xCoreID ] = pdTRUE, because there is another task at the same priority that could run via time slicing, but the line
|
if( pxCurrentTCBs[ xCoreID ]->xPreemptionDisable == pdFALSE ) |
prevents this flag being acted upon immediately, because
pxCurrentTCBs[ xCoreID ]->xPreemptionDisable != pdFALSE and thus it neither sets
xSwitchRequired = pdTRUE nor calls
prvYieldCore( xCoreID ). So far, so good.
Then, still during the critical operation, an unrelated I/O operation completes, causing an interrupt which calls xTaskGenericNotifyFromISR(). The unrelated task being notified, was blocked waiting for this notification, so it gets to line
|
listINSERT_END( &( xPendingReadyList ), &( pxTCB->xEventListItem ) ); |
which adds it to the ready list, and then line
|
prvYieldForTask( pxTCB ); |
which has to make a decision about yielding to the newly woken task. It first calls
prvYieldForTask( pxTCB ) but this routine does not do anything because it correctly sees that
pxCurrentTCBs[ xCoreID ]->xPreemptionDisable != pdFALSE and so context switching is not allowed. Still good.
The bug is that upon return from prvYieldForTask(), the next line
|
if( xYieldPendings[ portGET_CORE_ID() ] == pdTRUE ) |
is checking whether
xYieldPendings[ portGET_CORE_ID() ] == pdTRUE, obviously intending determine the action taken by
prvYieldForTask() which is a void function so doesn't inform whether it yielded or not. Although
prvYieldForTask() did not yield, there is a leftover flag in
xYieldPendings[] which was set earlier by
xTaskIncrementTick(), even though that flag was not acted upon at the time. So it thinks that
prvYieldForTask() has yielded and goes on to line
|
*pxHigherPriorityTaskWoken = pdTRUE; |
which sets
*pxHigherPriorityTaskWoken = pdTRUE.
My I/O completion interrupt handler ends with a line like portYIELD_FROM_ISR(task_woken), and on the Xtensa port that I'm using, it ends up calling vTaskSwitchContext() from the assembly ISR wrapper, instead of returning to the user code that is performing my critical operation. The actual context switching is done in prvSelectHighestPriorityTask() and the if-block at
|
if( pxTCB->xTaskRunState == taskTASK_NOT_RUNNING ) |
appears to switch in a new ready task (that is not running) without checking
pxCurrentTCBs[ xCoreID ]->xPreemptionDisable. It does have such a check further down, at line
|
if( pxCurrentTCBs[ uxCore ]->xPreemptionDisable == pdFALSE ) |
, but this is only in regard to re-scheduling the evicted task on another core. The task with pre-emption disabled has already been evicted.
This bug has been verified on real hardware. I had to reconstruct the above sequence of events through tracing, so it is possible that the exact failure sequence could vary slightly from what I described. Indeed it fails in several different ways depending on the exact timing of the timer ticks and unrelated I/O operations (of which several are going on simultaneously). I can see two other possible failure paths through xTaskResumeAll() or xTaskResumeFromISR(), and these are called from basically everywhere -- I think another of my failure cases is caused by my calling xSemaphoreGiveFromISR() -> xQueueGenericSend() -> xTaskResumeAll(), but I haven't definitively verified this. What I can say, is that it is definitely dangerous to leave a true value in xYieldPendings[] because there are a number of ways that it could be acted upon later.
My diagnosis is that the incorrect logic is in xTaskIncrementTick(), and that the bug was introduced with commit https://github.com/FreeRTOS/FreeRTOS-Kernel/tree/294569e495515e238dc890a8b4613d01260d1f06 which was intended to optimize the logic for configNUMBER_OF_CORES > 1. See #1118 for the discussion. The original logic was correct, as far as I can see. Backing out this commit fixes the issue for me on real hardware.
Incidentally, the discussion of that PR is claiming that the original code line
BaseType_t xYieldRequiredForCore[ configNUMBER_OF_CORES ] = { pdFALSE };
was leaving some array members uninitialized. That's not the case. It has always been legal C to initialize just the first array member and rely on the compiler to fill the rest with zeros, even for an auto array as here. MISRA may have clouded the issue -- I believe MISRA accepts an initializer of {0} to initialize the entire array.
Target
- Development board: ESP32-S3-MINI-1-N4R2
- Instruction Set Architecture: Xtensa LX
- IDE and version: command line
- Toolchain and version: xtensa-esp-elf-gcc (crosstool-NG esp-13.2.0_20240530) 13.2.0
Host
- Host OS: Devuan
- Version: Excalibur
To Reproduce
Suggest to reproduce using ESP-IDF with its OTA update feature, which needs to disable pre-emption in order to write the flash chip. A non-trivial test program would have to be written with the following features:
- Use menuconfig to set the FreeRTOS type to FreeRTOS-SMP rather than Espressif's custom FreeRTOS.
- Manually upgrade the FreeRTOS-SMP version to the appropriate tag -- Espressif uses an older one without the bug, when they eventually upgrade to a newer version they are likely to encounter this bug eventually.
- Have either semaphores or direct-to-task notifications happening during the OTA update. Have this happen from interrupts marked as IRAM-safe.
- Have multiple tasks of the same priority as the one writing the flash during the OTA update, that could execute but for the preemption being disabled.
- Suggest to have task(s) blocked waiting on the semaphores or direct-to-task notifications. Suggest these additional tasks should have higher priority.
Expected behavior
A task which has disabled pre-emption should never be pre-empted, regardless of what interrupts happen.
Screenshots
Screenshots of the failure mode and the diagnostic tracing can be provided upon request. Additional tests can be performed upon request if there is a suggestion that the current logic should have worked or that the failure mode is different than described.
Additional context
None
Describe the bug
Referring to FreeRTOS SMP
tag: V11.2.0with the following relevant configuration:A task is performing a critical operation and cannot be interrupted. It calls
vTaskPreemptionDisable(NULL)before the operation andvTaskPreemptionEnable(NULL)after the operation. There is another task at the same priority, which could run via time slicing but for the pre-emption being disabled. The critical operation is lengthy and involves switching the hardware into different modes, so an attempted context switch during the operation will crash the system. However, interrupts need to remain enabled and be serviced as normal.During the critical operation, a timer tick occurs, causing an interrupt which calls
xTaskIncrementTick(). The lineFreeRTOS-Kernel/tasks.c
Line 4882 in 0adc196
xYieldPendings[ xCoreID ] = pdTRUE, because there is another task at the same priority that could run via time slicing, but the lineFreeRTOS-Kernel/tasks.c
Line 4931 in 0adc196
pxCurrentTCBs[ xCoreID ]->xPreemptionDisable != pdFALSEand thus it neither setsxSwitchRequired = pdTRUEnor callsprvYieldCore( xCoreID ). So far, so good.Then, still during the critical operation, an unrelated I/O operation completes, causing an interrupt which calls
xTaskGenericNotifyFromISR(). The unrelated task being notified, was blocked waiting for this notification, so it gets to lineFreeRTOS-Kernel/tasks.c
Line 8154 in 0adc196
FreeRTOS-Kernel/tasks.c
Line 8182 in 0adc196
prvYieldForTask( pxTCB )but this routine does not do anything because it correctly sees thatpxCurrentTCBs[ xCoreID ]->xPreemptionDisable != pdFALSEand so context switching is not allowed. Still good.The bug is that upon return from
prvYieldForTask(), the next lineFreeRTOS-Kernel/tasks.c
Line 8184 in 0adc196
xYieldPendings[ portGET_CORE_ID() ] == pdTRUE, obviously intending determine the action taken byprvYieldForTask()which is a void function so doesn't inform whether it yielded or not. AlthoughprvYieldForTask()did not yield, there is a leftover flag inxYieldPendings[]which was set earlier byxTaskIncrementTick(), even though that flag was not acted upon at the time. So it thinks thatprvYieldForTask()has yielded and goes on to lineFreeRTOS-Kernel/tasks.c
Line 8188 in 0adc196
*pxHigherPriorityTaskWoken = pdTRUE.My I/O completion interrupt handler ends with a line like
portYIELD_FROM_ISR(task_woken), and on the Xtensa port that I'm using, it ends up callingvTaskSwitchContext()from the assembly ISR wrapper, instead of returning to the user code that is performing my critical operation. The actual context switching is done inprvSelectHighestPriorityTask()and the if-block atFreeRTOS-Kernel/tasks.c
Line 1086 in 0adc196
pxCurrentTCBs[ xCoreID ]->xPreemptionDisable. It does have such a check further down, at lineFreeRTOS-Kernel/tasks.c
Line 1250 in 0adc196
This bug has been verified on real hardware. I had to reconstruct the above sequence of events through tracing, so it is possible that the exact failure sequence could vary slightly from what I described. Indeed it fails in several different ways depending on the exact timing of the timer ticks and unrelated I/O operations (of which several are going on simultaneously). I can see two other possible failure paths through
xTaskResumeAll()orxTaskResumeFromISR(), and these are called from basically everywhere -- I think another of my failure cases is caused by my callingxSemaphoreGiveFromISR()->xQueueGenericSend()->xTaskResumeAll(), but I haven't definitively verified this. What I can say, is that it is definitely dangerous to leave a true value inxYieldPendings[]because there are a number of ways that it could be acted upon later.My diagnosis is that the incorrect logic is in
xTaskIncrementTick(), and that the bug was introduced with commit https://github.com/FreeRTOS/FreeRTOS-Kernel/tree/294569e495515e238dc890a8b4613d01260d1f06 which was intended to optimize the logic forconfigNUMBER_OF_CORES > 1. See #1118 for the discussion. The original logic was correct, as far as I can see. Backing out this commit fixes the issue for me on real hardware.Incidentally, the discussion of that PR is claiming that the original code line
was leaving some array members uninitialized. That's not the case. It has always been legal C to initialize just the first array member and rely on the compiler to fill the rest with zeros, even for an
autoarray as here. MISRA may have clouded the issue -- I believe MISRA accepts an initializer of{0}to initialize the entire array.Target
Host
To Reproduce
Suggest to reproduce using ESP-IDF with its OTA update feature, which needs to disable pre-emption in order to write the flash chip. A non-trivial test program would have to be written with the following features:
Expected behavior
A task which has disabled pre-emption should never be pre-empted, regardless of what interrupts happen.
Screenshots
Screenshots of the failure mode and the diagnostic tracing can be provided upon request. Additional tests can be performed upon request if there is a suggestion that the current logic should have worked or that the failure mode is different than described.
Additional context
None