Bug Description
taosd crashes with SIGSEGV (null pointer dereference at address 0x0) in qWorkerInit (qworker.c:1481) when a vnode-query thread starts up. The crash was repeatable across restarts: all 4 consecutive automatic restarts hit the exact same binary offset. After the 4th crash, systemd triggered start-limit-hit and prevented further restarts until manual intervention.
To Reproduce
The crash occurs during vnode startup/restart under the following conditions:
- 3-node TDengine cluster (community edition 3.4.0.9) running on Ubuntu 24.04
- After WAL replay completes on node s3, the vnode-query thread calls qWorkerInit to initialize the query worker
- Inside qWorkerInit, an internal resource allocation (hash table, timer, or memory) fails
- The error-handling path triggers a null pointer dereference, causing SEGV
- Process crashes, systemd auto-restarts, hits the same bug 4 times, then start-limit-hit prevents further restart
The crash happened at 09:21:59, approximately 2 seconds after the last WAL commit at 09:21:57. A manual restart at 14:15 succeeded, indicating this is a timing/resource contention issue rather than persistent data corruption.
Expected Behavior
qWorkerInit should handle resource allocation failures gracefully — return an error code without crashing. The vnode should either retry initialization or report the error to the cluster, not segfault.
Crash Analysis
System journal (journalctl -u taosd) shows 4 identical crashes:
May 05 09:21:59 s3 taosd[24999]: taosd: qworker.c:1481: qWorkerInit: Assertion `(0) >= (0)' failed.
May 05 09:21:59 s3 kernel: taosd[24999]: segfault at 0 ip 0000564f7c98eff5 sp 00007f3c68f96940 error 4 in taosd[564f7c300000+1ac0000]
May 05 09:21:59 s3 kernel: Code: 00 00 00 00 00 00 00 00 00 e8 f6 0a da 00 <c7> 45 cc 0f 07 00 80 83 7d cc 00 74 0f b8 00 00 00 00 e8 aa 52 d9 00 8b 55 cc 89 10
addr2line resolution of the crash offset 0x68eff5 (crash IP 0x564f7c98eff5 minus the mapping base 0x564f7c300000 from the kernel log):
$ addr2line -e /usr/local/taos/bin/taosd -f 0x68eff5
qWorkerInit
/path/to/source/libs/qworker/src/qworker.c:1481
All 4 crashes resolve to the exact same offset 0x68eff5 within the taosd binary.
Apport captured 4 core dumps (~800MB each, ~3.2GB total) at /var/lib/apport/coredump/.
Root Cause Analysis (source code level)
Two bugs in qWorkerInit (libs/qworker/src/qworker.c) contribute to the crash:
Bug 1: Missing terrno initialization on allocation failure
When taosHashInit() or taosTmrInit() returns NULL due to internal memory allocation failure, terrno is not set by these functions in all code paths. Subsequently, QW_ERR_JRET(terrno) reads a stale/uninitialized error value.
More critically, QW_RET(terrno) expands to:
#define QW_RET(c)                                                          \
  do {                                                                     \
    int32_t _code = (c);                                                   \
    if (_code != TSDB_CODE_SUCCESS) {                                      \
      terrno = _code; /* expands to: (*taosGetErrno()) = _code */          \
    }                                                                      \
    return _code;                                                          \
  } while (0)
where terrno is (*taosGetErrno()) and taosGetErrno() returns &tsErrno (a __thread TLS variable). If the vnode-query thread's TLS is not properly initialized at this early stage, writing to this address causes the SEGV.
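One way to avoid propagating a stale or zero error value is to clear the thread-local error before the allocation and fall back to an explicit out-of-memory code if the allocator left it untouched. The sketch below is an illustration under assumptions, not the TDengine implementation: sketch_terrno, sketch_hash_init, sketch_init_hash and the SKETCH_* codes are hypothetical stand-ins for terrno, taosHashInit and the TSDB_CODE_* values.

```c
#include <assert.h>
#include <stdlib.h>

enum { SKETCH_SUCCESS = 0, SKETCH_OUT_OF_MEMORY = -1 };

/* Hypothetical stand-in for the thread-local terrno. */
static _Thread_local int sketch_terrno = SKETCH_SUCCESS;

/* Hypothetical allocator that can fail WITHOUT setting sketch_terrno,
 * mirroring the behaviour described for taosHashInit()/taosTmrInit(). */
static void *sketch_hash_init(int simulateFailure) {
  return simulateFailure ? NULL : malloc(16);
}

/* Returns SKETCH_SUCCESS or a guaranteed non-success code on failure. */
int sketch_init_hash(int simulateFailure, void **out) {
  assert(out != NULL);
  sketch_terrno = SKETCH_SUCCESS; /* drop any stale value from earlier calls */
  *out = sketch_hash_init(simulateFailure);
  if (*out == NULL) {
    /* Use the allocator's code if it set one, otherwise an explicit fallback. */
    int code = (sketch_terrno != SKETCH_SUCCESS) ? sketch_terrno
                                                 : SKETCH_OUT_OF_MEMORY;
    sketch_terrno = code;
    return code;
  }
  return SKETCH_SUCCESS;
}
```

The point of the fallback is that the caller never receives a success code, or an unrelated earlier error, together with a NULL handle.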
Bug 2: NULL pointer dereference after free in schHash NULL check path (lines 1510-1511)
if (NULL == mgmt->schHash) {
  taosMemoryFreeClear(mgmt);  // frees mgmt, sets to NULL
  qError("init %d scheduler hash failed", mgmt->cfg.maxSchedulerNum);  // dereferences NULL!
  QW_ERR_JRET(terrno);
}
taosMemoryFreeClear(mgmt) frees and NULLs the pointer, then qError immediately dereferences mgmt->cfg.maxSchedulerNum — a classic NULL pointer dereference after free.
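A possible fix is simply to log before releasing the structure. The following self-contained sketch shows that ordering with stand-in types; DemoMgmt, DemoCfg, DEMO_FREE_CLEAR and demo_worker_init are hypothetical names used only for illustration, and fprintf stands in for qError.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

enum { DEMO_OK = 0, DEMO_ERR_OOM = -1 };

typedef struct { int maxSchedulerNum; } DemoCfg;
typedef struct { DemoCfg cfg; void *schHash; } DemoMgmt;

/* Stand-in for taosMemoryFreeClear as described above: free, then set NULL. */
#define DEMO_FREE_CLEAR(p) do { free(p); (p) = NULL; } while (0)

/* failHash simulates the scheduler hash allocation returning NULL. */
int demo_worker_init(int maxSchedulerNum, int failHash, DemoMgmt **out) {
  assert(out != NULL);
  *out = NULL;

  DemoMgmt *mgmt = calloc(1, sizeof(*mgmt));
  if (mgmt == NULL) return DEMO_ERR_OOM;
  mgmt->cfg.maxSchedulerNum = maxSchedulerNum;

  mgmt->schHash = failHash ? NULL : malloc(16);
  if (mgmt->schHash == NULL) {
    /* Log while mgmt is still valid, and only then release it. */
    fprintf(stderr, "init %d scheduler hash failed\n", mgmt->cfg.maxSchedulerNum);
    DEMO_FREE_CLEAR(mgmt);
    return DEMO_ERR_OOM;
  }

  *out = mgmt;
  return DEMO_OK;
}
```

An alternative with the same effect would be to copy maxSchedulerNum into a local variable before the free; either way the error path no longer touches mgmt after it has been cleared.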
Environment
- OS: Ubuntu 24.04 LTS (Linux 6.8.0-107-generic, x86_64)
- Memory: 7.8 GB RAM, 4 vCPU
- Disk: 142 GB (46% used)
- TDengine Version: 3.4.0.9.community (git: ed90f14, build: 2026-03-10)
- Cluster: 3 nodes, s3 crashed while s1/s2 remained healthy
Additional Context
- Checked all TDengine GitHub issues: no existing report matches this specific qWorkerInit crash pattern
- Verified source code: the bug is present and identical in all versions from 3.4.0.9 through 3.4.1.7 (latest), including main branch
- The crash does not appear to be caused by: OOM, disk I/O errors, port conflicts, file permissions, or WAL corruption
- 4 apport core dumps are available if the maintainers need them for further analysis