[Bug]: SEGV null pointer dereference in qWorkerInit (qworker.c:1481) during vnode startup on 3.4.0.9 #35280

@Danceiny

Description

Bug Description

taosd crashes with SIGSEGV (null pointer dereference at address 0x0) in qWorkerInit (qworker.c:1481) when a vnode-query thread starts up. The crash is deterministic — all 4 consecutive restarts hit the exact same binary offset. After the 4th crash, systemd triggered start-limit-hit and prevented further restarts until manual intervention.

To Reproduce

The crash occurs during vnode startup/restart under the following conditions:

  1. 3-node TDengine cluster (community edition 3.4.0.9) running on Ubuntu 24.04
  2. After WAL replay completes on node s3, the vnode-query thread calls qWorkerInit to initialize the query worker
  3. Inside qWorkerInit, an internal resource allocation (hash table, timer, or memory) fails
  4. The error-handling path triggers a null pointer dereference, causing SEGV
  5. Process crashes, systemd auto-restarts, hits the same bug 4 times, then start-limit-hit prevents further restart

The crash happened at 09:21:59, approximately 2 seconds after the last WAL commit at 09:21:57. A manual restart at 14:15 succeeded, suggesting a timing/resource-contention issue rather than persistent data corruption.

Expected Behavior

qWorkerInit should handle resource allocation failures gracefully — return an error code without crashing. The vnode should either retry initialization or report the error to the cluster, not segfault.

Crash Analysis

System journal (journalctl -u taosd) shows 4 identical crashes:

May 05 09:21:59 s3 taosd[24999]: taosd: qworker.c:1481: qWorkerInit: Assertion `(0) >= (0)' failed.
May 05 09:21:59 s3 kernel: taosd[24999]: segfault at 0 ip 0000564f7c98eff5 sp 00007f3c68f96940 error 4 in taosd[564f7c300000+1ac0000]
May 05 09:21:59 s3 kernel: Code: 00 00 00 00 00 00 00 00 00 e8 f6 0a da 00 <c7> 45 cc 0f 07 00 80 83 7d cc 00 74 0f b8 00 00 00 00 e8 aa 52 d9 00 8b 55 cc 89 10

addr2line resolution of the crash IP 0x68eff5:

$ addr2line -e /usr/local/taos/bin/taosd -f 0x68eff5
qWorkerInit
/path/to/source/libs/qworker/src/qworker.c:1481

All 4 crashes resolve to the exact same offset 0x68eff5 within the taosd binary.

Apport captured 4 core dumps (~800MB each, ~3.2GB total) at /var/lib/apport/coredump/.

Root Cause Analysis (source code level)

Two bugs in qWorkerInit (libs/qworker/src/qworker.c) contribute to the crash:

Bug 1: Missing terrno initialization on allocation failure

When taosHashInit() or taosTmrInit() returns NULL due to an internal memory allocation failure, terrno is not set on all code paths inside those functions. The subsequent QW_ERR_JRET(terrno) then reads a stale or uninitialized error value.

More critically, QW_RET(terrno) expands to:

#define QW_RET(c)                     \
  do {                                \
    int32_t _code = (c);              \
    if (_code != TSDB_CODE_SUCCESS) { \
      terrno = _code;                 \
    }                                 \
    return _code;                     \
  } while (0)

where terrno is (*taosGetErrno()) and taosGetErrno() returns &tsErrno (a __thread TLS variable). If the vnode-query thread's TLS is not properly initialized at this early stage, writing to this address causes the SEGV.

Bug 2: Use-after-free / NULL dereference in the schHash failure path (qworker.c lines 1510-1511)

if (NULL == mgmt->schHash) {
    taosMemoryFreeClear(mgmt);                              // frees mgmt, sets to NULL
    qError("init %d scheduler hash failed", mgmt->cfg.maxSchedulerNum);  // dereferences NULL!
    QW_ERR_JRET(terrno);
}

taosMemoryFreeClear(mgmt) frees the allocation and sets mgmt to NULL; qError then reads mgmt->cfg.maxSchedulerNum, a NULL pointer dereference through a just-freed pointer.

Environment (please complete the following information):

  • OS: Ubuntu 24.04 LTS (Linux 6.8.0-107-generic, x86_64)
  • Memory: 7.8 GB RAM, 4 vCPU
  • Disk: 142 GB (46% used)
  • TDengine Version: 3.4.0.9.community (git: ed90f14, build: 2026-03-10)
  • Cluster: 3 nodes, s3 crashed while s1/s2 remained healthy

Additional Context

  • Searched existing TDengine GitHub issues: no prior report matches this specific qWorkerInit crash pattern
  • Verified source code: the bug is present and identical in all versions from 3.4.0.9 through 3.4.1.7 (latest), including main branch
  • The crash does not appear to be caused by: OOM, disk I/O errors, port conflicts, file permissions, or WAL corruption
  • 4 apport core dumps are available if the maintainers need them for further analysis
