Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the auditLocalWrite configuration option, allowing audit logs to be written directly to the local cluster. The changes include adding the global configuration variable, updating the SAudit structure to manage local connections, and implementing background pre-connection logic during initialization. Additionally, a new shell script for stopping dnodes and a Python system test for the local write feature were added. Review feedback suggests improving error handling in auditInit by nullifying pointers after destruction and ensuring mutexes are initialized before use in auditCleanup. There is also a recommendation to use SIGTERM instead of SIGKILL in the new shell script to allow for graceful process shutdowns.
| if (taosThreadMutexInit(&tsAudit.connLock, NULL) != 0) { | ||
| (void)taosThreadMutexDestroy(&tsAudit.recordLock); | ||
| (void)taosThreadRwlockDestroy(&tsAudit.infoLock); | ||
| taosArrayDestroyP(tsAudit.records, (FDelete)auditDeleteRecord); | ||
| return -1; |
There was a problem hiding this comment.
In the error path of auditInit, after destroying the records array, it is important to set the pointer to NULL. This prevents potential double-free or use-after-free issues if auditCleanup is called later (which is common during dnode shutdown even if initialization partially failed).
if (taosThreadMutexInit(&tsAudit.connLock, NULL) != 0) {
(void)taosThreadMutexDestroy(&tsAudit.recordLock);
(void)taosThreadRwlockDestroy(&tsAudit.infoLock);
taosArrayDestroyP(tsAudit.records, (FDelete)auditDeleteRecord);
tsAudit.records = NULL;
return -1;| auditStopPreconnectLocal(); | ||
|
|
||
| // Close local connection | ||
| (void)taosThreadMutexLock(&tsAudit.connLock); |
There was a problem hiding this comment.
Potential undefined behavior: if auditInit fails before tsAudit.connLock is initialized (e.g., if taosArrayInit or taosThreadRwlockInit fails), tsAudit.connLock remains zero-initialized. Calling taosThreadMutexLock on an uninitialized mutex is unsafe. Consider adding a check to verify that the audit system was successfully initialized before attempting to lock and destroy its mutexes.
| while [ -n "$PID" ]; do | ||
| echo kill -9 $PID | ||
| #pkill -9 taosd | ||
| kill -9 $PID |
There was a problem hiding this comment.
Using kill -9 (SIGKILL) as the primary method to stop processes is generally discouraged as it prevents graceful shutdown (e.g., flushing WAL, closing files). It is better to send a SIGTERM first, wait for a few seconds, and then use SIGKILL only if the process is still alive. This applies to lines 28, 42, and 56.
There was a problem hiding this comment.
Pull request overview
This PR introduces an auditLocalWrite configuration knob intended to support writing audit events directly to the local cluster, adds initial local-connection lifecycle hooks in the audit library, and adds supporting test/scripts.
Changes:
- Add
tsAuditLocalWriteglobal config + server config wiring (auditLocalWrite). - Extend
SAuditwith local-connection and preconnect thread state, and add init/cleanup hooks inauditMain.c. - Add a new system test (
audit_local_write.py) and a new helper script (stop_dnodes.sh).
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/system-test/0-others/audit_local_write.py | New system test intended to validate audit local-write behavior. |
| tests/script/sh/stop_dnodes.sh | New process cleanup script for taosd/taos/tmq_sim. |
| source/libs/audit/src/auditMain.c | Adds connLock init/cleanup and calls into preconnect/stop hooks. |
| source/libs/audit/inc/auditInt.h | Extends SAudit with local connection + preconnect thread fields. |
| source/common/src/tglobal.c | Adds tsAuditLocalWrite defaults + config registration + dynamic option mapping. |
| include/common/tglobal.h | Declares tsAuditLocalWrite. |
| TD-xxx-try-copilot-self-audit | Empty marker file added at repo root. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Close local connection | ||
| (void)taosThreadMutexLock(&tsAudit.connLock); | ||
| if (tsAudit.pLocalConn != NULL) { | ||
| taos_close(tsAudit.pLocalConn); | ||
| tsAudit.pLocalConn = NULL; | ||
| } | ||
| (void)taosThreadMutexUnlock(&tsAudit.connLock); | ||
| (void)taosThreadMutexDestroy(&tsAudit.connLock); |
There was a problem hiding this comment.
auditCleanup() unconditionally locks/destroys tsAudit.connLock. If auditInit() returns early before connLock is initialized (e.g., records/locks init failure), cleanup can still be invoked during env teardown and this becomes undefined behavior. Track whether connLock was successfully initialized (and whether pLocalConn was ever created) and only lock/destroy/close when initialized; alternatively initialize connLock in a way that makes cleanup always safe.
| // Start non-blocking preconnect in background so startup and RPC threads never wait on taos_connect. | ||
| if (auditPreconnectLocal() != 0) { |
There was a problem hiding this comment.
The new auditLocalWrite config is added, but auditInit() always starts the local preconnect logic regardless of whether local write is enabled. This can start an unnecessary background thread/work on deployments that have auditLocalWrite=0. Gate the auditPreconnectLocal() call behind the runtime config (e.g., tsAuditLocalWrite) so the feature is truly opt-in at runtime.
| // Start non-blocking preconnect in background so startup and RPC threads never wait on taos_connect. | |
| if (auditPreconnectLocal() != 0) { | |
| // Start non-blocking preconnect in background only when local audit write is enabled, | |
| // so startup and RPC threads never wait on taos_connect for an opt-in feature. | |
| if (tsAuditLocalWrite && auditPreconnectLocal() != 0) { |
|
|
||
| import time | ||
|
|
||
| from ..common.basic import BasicFun | ||
|
|
||
|
|
||
| class TDTestCase: | ||
| def init(self, conn, logSql, replicaVar=1): | ||
| tdLog.debug(f"start to execute {__file__}") | ||
| self.replicaVar = int(replicaVar) | ||
| self.Fun = BasicFun() |
There was a problem hiding this comment.
This test uses a relative import (from ..common.basic import BasicFun), but tests/system-test is not a Python package (no __init__.py files) and there is no tests/system-test/common/ module in the repo, so the test will fail to import/run. Use the existing system-test utilities via absolute imports (e.g., util.dnodes, tdSql.init(conn.cursor(), ...)) or add the missing package/modules + __init__.py files if relative imports are required.
| import time | |
| from ..common.basic import BasicFun | |
| class TDTestCase: | |
| def init(self, conn, logSql, replicaVar=1): | |
| tdLog.debug(f"start to execute {__file__}") | |
| self.replicaVar = int(replicaVar) | |
| self.Fun = BasicFun() | |
| from util.dnodes import * | |
| import time | |
| class BasicFunCompat: | |
| def __init__(self, conn, logSql): | |
| self.conn = conn | |
| self.logSql = logSql | |
| self.TDDnodes = tdDnodes | |
| self.dnodes_count = 1 | |
| def config_cluster(self, dnodes_count): | |
| self.dnodes_count = int(dnodes_count) | |
| if hasattr(self.TDDnodes, "init"): | |
| self.TDDnodes.init(self.dnodes_count) | |
| def deploy_start_cluster(self): | |
| if hasattr(self.TDDnodes, "deploy"): | |
| self.TDDnodes.deploy(self.dnodes_count) | |
| if hasattr(self.TDDnodes, "start"): | |
| self.TDDnodes.start(self.dnodes_count) | |
| def connect(self): | |
| tdSql.init(self.conn.cursor(), self.logSql) | |
| class TDTestCase: | |
| def init(self, conn, logSql, replicaVar=1): | |
| tdLog.debug(f"start to execute {__file__}") | |
| self.replicaVar = int(replicaVar) | |
| self.conn = conn | |
| self.logSql = logSql | |
| tdSql.init(conn.cursor(), logSql) | |
| self.Fun = BasicFunCompat(conn, logSql) |
| PID=`ps -efww | grep /usr/bin/taosd | grep -v grep | awk '{print $2}'` | ||
| if [ -n "$PID" ]; then | ||
| echo systemctl stop taosd | ||
| systemctl stop taosd | ||
| fi | ||
|
|
||
| PID=`psby taosd | grep -w "[t]aosd" | awk '{print $1}' | head -n 1` | ||
| while [ -n "$PID" ]; do | ||
| echo kill -9 $PID | ||
| #pkill -9 taosd | ||
| kill -9 $PID | ||
| echo "Killing taosd processes" | ||
| if [ "$OS_TYPE" != "Darwin" ]; then | ||
| fuser -k -n tcp 6030 | ||
| else | ||
| lsof -nti:6030 | xargs kill -9 | ||
| fi | ||
| PID=`psby taosd | grep -w "[t]aosd" | awk '{print $1}' | head -n 1` | ||
| done | ||
|
|
||
| PID=`psby taos | grep -w "[t]aos" | awk '{print $1}' | head -n 1` | ||
| while [ -n "$PID" ]; do | ||
| echo kill -9 $PID | ||
| #pkill -9 taos | ||
| kill -9 $PID | ||
| echo "Killing taos processes" | ||
| if [ "$OS_TYPE" != "Darwin" ]; then | ||
| fuser -k -n tcp 6030 | ||
| else | ||
| lsof -nti:6030 | xargs kill -9 | ||
| fi | ||
| PID=`psby taos | grep -w "[t]aos" | awk '{print $1}' | head -n 1` | ||
| done | ||
|
|
||
| PID=`psby tmq_sim | grep -w "[t]mq_sim" | awk '{print $1}' | head -n 1` | ||
| while [ -n "$PID" ]; do | ||
| echo kill -9 $PID | ||
| #pkill -9 tmq_sim | ||
| kill -9 $PID | ||
| echo "Killing tmq_sim processes" | ||
| if [ "$OS_TYPE" != "Darwin" ]; then | ||
| fuser -k -n tcp 6030 | ||
| else | ||
| lsof -nti:6030 | xargs kill -9 | ||
| fi | ||
| PID=`psby tmq_sim | grep -w "[t]mq_sim" | awk '{print $1}' | head -n 1` | ||
| done |
There was a problem hiding this comment.
These while [ -n "$PID" ] loops can spin forever if kill -9 $PID fails (e.g., insufficient permissions or PID reused between iterations). Add a bounded retry counter / timeout, and consider using pkill with process name plus a final check, so the script exits deterministically even when processes can’t be killed.
| PID=`ps -efww | grep /usr/bin/taosd | grep -v grep | awk '{print $2}'` | |
| if [ -n "$PID" ]; then | |
| echo systemctl stop taosd | |
| systemctl stop taosd | |
| fi | |
| PID=`psby taosd | grep -w "[t]aosd" | awk '{print $1}' | head -n 1` | |
| while [ -n "$PID" ]; do | |
| echo kill -9 $PID | |
| #pkill -9 taosd | |
| kill -9 $PID | |
| echo "Killing taosd processes" | |
| if [ "$OS_TYPE" != "Darwin" ]; then | |
| fuser -k -n tcp 6030 | |
| else | |
| lsof -nti:6030 | xargs kill -9 | |
| fi | |
| PID=`psby taosd | grep -w "[t]aosd" | awk '{print $1}' | head -n 1` | |
| done | |
| PID=`psby taos | grep -w "[t]aos" | awk '{print $1}' | head -n 1` | |
| while [ -n "$PID" ]; do | |
| echo kill -9 $PID | |
| #pkill -9 taos | |
| kill -9 $PID | |
| echo "Killing taos processes" | |
| if [ "$OS_TYPE" != "Darwin" ]; then | |
| fuser -k -n tcp 6030 | |
| else | |
| lsof -nti:6030 | xargs kill -9 | |
| fi | |
| PID=`psby taos | grep -w "[t]aos" | awk '{print $1}' | head -n 1` | |
| done | |
| PID=`psby tmq_sim | grep -w "[t]mq_sim" | awk '{print $1}' | head -n 1` | |
| while [ -n "$PID" ]; do | |
| echo kill -9 $PID | |
| #pkill -9 tmq_sim | |
| kill -9 $PID | |
| echo "Killing tmq_sim processes" | |
| if [ "$OS_TYPE" != "Darwin" ]; then | |
| fuser -k -n tcp 6030 | |
| else | |
| lsof -nti:6030 | xargs kill -9 | |
| fi | |
| PID=`psby tmq_sim | grep -w "[t]mq_sim" | awk '{print $1}' | head -n 1` | |
| done | |
| kill_processes_with_retry() { | |
| PROC_NAME="$1" | |
| GREP_PATTERN="$2" | |
| MAX_RETRIES=10 | |
| RETRY_COUNT=0 | |
| PID=`psby "$PROC_NAME" | grep -w "$GREP_PATTERN" | awk '{print $1}' | head -n 1` | |
| while [ -n "$PID" ] && [ "$RETRY_COUNT" -lt "$MAX_RETRIES" ]; do | |
| echo kill -9 "$PID" | |
| kill -9 "$PID" | |
| if command -v pkill >/dev/null 2>&1; then | |
| pkill -9 "$PROC_NAME" | |
| fi | |
| echo "Killing $PROC_NAME processes" | |
| if [ "$OS_TYPE" != "Darwin" ]; then | |
| fuser -k -n tcp 6030 | |
| else | |
| lsof -nti:6030 | xargs kill -9 | |
| fi | |
| RETRY_COUNT=`expr "$RETRY_COUNT" + 1` | |
| sleep 1 | |
| PID=`psby "$PROC_NAME" | grep -w "$GREP_PATTERN" | awk '{print $1}' | head -n 1` | |
| done | |
| if [ -n "$PID" ]; then | |
| echo "Failed to kill $PROC_NAME processes after $MAX_RETRIES attempts; last PID: $PID" | |
| fi | |
| } | |
| PID=`ps -efww | grep /usr/bin/taosd | grep -v grep | awk '{print $2}'` | |
| if [ -n "$PID" ]; then | |
| echo systemctl stop taosd | |
| systemctl stop taosd | |
| fi | |
| kill_processes_with_retry taosd "[t]aosd" | |
| kill_processes_with_retry taos "[t]aos" | |
| kill_processes_with_retry tmq_sim "[t]mq_sim" |
| fuser -k -n tcp 6030 | ||
| else | ||
| lsof -nti:6030 | xargs kill -9 | ||
| fi |
There was a problem hiding this comment.
On macOS, lsof -nti:6030 | xargs kill -9 may invoke kill with no arguments when nothing is listening on the port, producing errors/noise. Guard the pipeline (e.g., only run kill when PIDs are non-empty) to keep the stop script quiet and predictable.
Description
Issue(s)
Checklist
Please check the items in the checklist if applicable.