Skip to content

Conversation

@ixhamza
Copy link
Member

@ixhamza ixhamza commented Nov 17, 2025

Motivation and Context

This fixes a critical NULL pointer dereference that causes kernel panics when timer-based tasks are cancelled under high concurrency. The bug manifests during frequent task cancellations, particularly with snapshot automount expiry under memory pressure or high I/O load.

Description

The race condition occurs in taskq_cancel_id() when checking timer_pending() before calling timer_delete_sync(). The sequence is:

  1. Timer expires and is dequeued from timer wheel - timer_pending() returns FALSE
  2. taskq_cancel_id() skips timer_delete_sync() due to FALSE result
  3. task_done() frees the task, setting tqent_func = NULL and clearing flags
  4. Timer callback (task_expire) finally executes on another CPU
  5. Callback doesn't see CANCEL flag (already cleared) and queues the freed task
  6. Worker thread attempts to execute NULL tqent_func → kernel panic

The fix removes the unsafe conditional check and always calls timer_delete_sync() unconditionally. This ensures the timer callback completes before the task is freed, preventing the use-after-free vulnerability.

Kernel Panic:

[   64.447544] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   64.451222] #PF: supervisor instruction fetch in kernel mode
[   64.454037] #PF: error_code(0x0010) - not-present page
[   64.456747] PGD 0 P4D 0 
[   64.458224] Oops: Oops: 0010 [#1] PREEMPT SMP NOPTI
[   64.460972] CPU: 7 UID: 0 PID: 1708 Comm: spl_delay_taskq Not tainted 6.12.43-production+ #652
[   64.465751] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   64.471032] RIP: 0010:0x0
[   64.472597] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[   64.476205] RSP: 0018:ffffc9000c723da8 EFLAGS: 00010246
[   64.481669] RAX: 0000000000000000 RBX: ffff888102864000 RCX: 0000000000000001
[   64.488815] RDX: 00000000ffffffff RSI: fffffffffffffffe RDI: 0000000000000000
[   64.495298] RBP: ffffc9000c723ed0 R08: 000000003fffffff R09: fffffffffffffffe
[   64.501402] R10: ffffffff83a6ae40 R11: 0000000000000001 R12: ffff888117a6a600
[   64.507425] R13: ffff8881146919c0 R14: ffff88811085c900 R15: ffff888102864118
[   64.513305] FS:  0000000000000000(0000) GS:ffff888c0f780000(0000) knlGS:0000000000000000
[   64.520580] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   64.525437] CR2: ffffffffffffffd6 CR3: 0000000118598004 CR4: 0000000000770ef0
[   64.529825] PKRU: 55555554
[   64.530777] Call Trace:
[   64.531669]  <TASK>
[   64.532579]  taskq_thread+0x287/0x5d0
[   64.533981]  ? __pfx_default_wake_function+0x10/0x10
[   64.535694]  ? __pfx_taskq_thread+0x10/0x10
[   64.537084]  kthread+0xf3/0x120
[   64.538314]  ? __pfx_kthread+0x10/0x10
[   64.539391]  ret_from_fork+0x3d/0x60
[   64.540680]  ? __pfx_kthread+0x10/0x10
[   64.542053]  ret_from_fork_asm+0x1a/0x30
[   64.543488]  </TASK>
[   64.544298] Modules linked in:
[   64.545332] CR2: 0000000000000000

How Has This Been Tested?

The race condition can be made reliably reproducible by applying this debug patch to widen the race window:

--- a/module/os/linux/spl/spl-taskq.c
+++ b/module/os/linux/spl/spl-taskq.c
@@ -633,6 +633,7 @@ taskq_cancel_id(taskq_t *tq, taskqid_t id, boolean_t wait)

        t->tqent_flags |= TQENT_FLAG_CANCEL;
        TQSTAT_INC(tq, tasks_cancelled);
+       mdelay(10);  /* Debug: widen race window */

        /*
         * The task_expire() function takes the tq->tq_lock so drop

Reproduction script:

zpool create -f testpool mirror /dev/sdc /dev/sdd -O mountpoint=none
mkdir -p /run/testfs
zfs create -o mountpoint=/run/testfs -o snapdir=visible testpool/testfs
echo 1 > /sys/module/zfs/parameters/zfs_expire_snapshot
echo 524288 > /sys/module/zfs/parameters/zfs_arc_dnode_limit
for i in {1..100}; do zfs snapshot testpool/testfs@snap$i; done
bash -c 'for attempt in {1..43200}; do for i in {1..100}; do \
ls /run/testfs/.zfs/snapshot/snap$i/ >/dev/null 2>&1 & done; \
sleep 1; echo $attempt; done'

Results:

  • Without fix + debug patch: Panic occurs within seconds to minutes with NULL pointer dereference
  • With fix + debug patch: No panic after hours of continuous testing
  • This bug was discovered while testing the deadlock fix in PR #17941. Without this fix, panic occurs around 10 hour mark; with this fix, no panic after 24+ hours of testing

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Remove unsafe timer_pending() check in taskq_cancel_id() that created a
race where:
- Timer expires and timer_pending() returns FALSE
- task_done() frees task with tqent_func = NULL
- Timer callback executes and queues freed task
- Worker thread crashes executing NULL function

Always call timer_delete_sync() unconditionally to ensure timer callback
completes before task is freed.

Reliably reproducible by injecting mdelay(10) after setting CANCEL flag
to widen the race window, combined with frequent task cancellations
(e.g., snapshot automount expiry).

Signed-off-by: Ameer Hamza <[email protected]>
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Nov 17, 2025
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 19, 2025
@behlendorf behlendorf merged commit 36e4f18 into openzfs:master Nov 19, 2025
23 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Accepted Ready to integrate (reviewed, tested)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants