Skip to content

dmchen/try-copilot-self-audit#35244

Open
cadem wants to merge 1 commit into3.0from
dmchen/try-copilot-self-audit
Open

dmchen/try-copilot-self-audit#35244
cadem wants to merge 1 commit into3.0from
dmchen/try-copilot-self-audit

Conversation

@cadem
Copy link
Copy Markdown
Contributor

@cadem cadem commented Apr 28, 2026

Description

Issue(s)

  • Close/close/Fix/fix/Resolve/resolve: Issue Link

Checklist

Please check the items in the checklist if applicable.

  • Is the user manual updated?
  • Are the test cases passed and automated?
  • Is there no significant decrease in test coverage?

Copilot AI review requested due to automatic review settings April 28, 2026 02:39
@cadem cadem requested review from a team, dapan1121, guanshengliang and zitsen as code owners April 28, 2026 02:39
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the auditLocalWrite configuration option, allowing audit logs to be written directly to the local cluster. The changes include adding the global configuration variable, updating the SAudit structure to manage local connections, and implementing background pre-connection logic during initialization. Additionally, a new shell script for stopping dnodes and a Python system test for the local write feature were added. Review feedback suggests improving error handling in auditInit by nullifying pointers after destruction and ensuring mutexes are initialized before use in auditCleanup. There is also a recommendation to use SIGTERM instead of SIGKILL in the new shell script to allow for graceful process shutdowns.

Comment on lines +57 to +61
if (taosThreadMutexInit(&tsAudit.connLock, NULL) != 0) {
(void)taosThreadMutexDestroy(&tsAudit.recordLock);
(void)taosThreadRwlockDestroy(&tsAudit.infoLock);
taosArrayDestroyP(tsAudit.records, (FDelete)auditDeleteRecord);
return -1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In the error path of auditInit, after destroying the records array, it is important to set the pointer to NULL. This prevents potential double-free or use-after-free issues if auditCleanup is called later (which is common during dnode shutdown even if initialization partially failed).

  if (taosThreadMutexInit(&tsAudit.connLock, NULL) != 0) {
    (void)taosThreadMutexDestroy(&tsAudit.recordLock);
    (void)taosThreadRwlockDestroy(&tsAudit.infoLock);
    taosArrayDestroyP(tsAudit.records, (FDelete)auditDeleteRecord);
    tsAudit.records = NULL;
    return -1;

auditStopPreconnectLocal();

// Close local connection
(void)taosThreadMutexLock(&tsAudit.connLock);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Potential undefined behavior: if auditInit fails before tsAudit.connLock is initialized (e.g., if taosArrayInit or taosThreadRwlockInit fails), tsAudit.connLock remains zero-initialized. Calling taosThreadMutexLock on an uninitialized mutex is unsafe. Consider adding a check to verify that the audit system was successfully initialized before attempting to lock and destroy its mutexes.

while [ -n "$PID" ]; do
echo kill -9 $PID
#pkill -9 taosd
kill -9 $PID
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using kill -9 (SIGKILL) as the primary method to stop processes is generally discouraged as it prevents graceful shutdown (e.g., flushing WAL, closing files). It is better to send a SIGTERM first, wait for a few seconds, and then use SIGKILL only if the process is still alive. This applies to lines 28, 42, and 56.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an auditLocalWrite configuration knob intended to support writing audit events directly to the local cluster, adds initial local-connection lifecycle hooks in the audit library, and adds supporting test/scripts.

Changes:

  • Add tsAuditLocalWrite global config + server config wiring (auditLocalWrite).
  • Extend SAudit with local-connection and preconnect thread state, and add init/cleanup hooks in auditMain.c.
  • Add a new system test (audit_local_write.py) and a new helper script (stop_dnodes.sh).

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/system-test/0-others/audit_local_write.py New system test intended to validate audit local-write behavior.
tests/script/sh/stop_dnodes.sh New process cleanup script for taosd/taos/tmq_sim.
source/libs/audit/src/auditMain.c Adds connLock init/cleanup and calls into preconnect/stop hooks.
source/libs/audit/inc/auditInt.h Extends SAudit with local connection + preconnect thread fields.
source/common/src/tglobal.c Adds tsAuditLocalWrite defaults + config registration + dynamic option mapping.
include/common/tglobal.h Declares tsAuditLocalWrite.
TD-xxx-try-copilot-self-audit Empty marker file added at repo root.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +79 to +86
// Close local connection
(void)taosThreadMutexLock(&tsAudit.connLock);
if (tsAudit.pLocalConn != NULL) {
taos_close(tsAudit.pLocalConn);
tsAudit.pLocalConn = NULL;
}
(void)taosThreadMutexUnlock(&tsAudit.connLock);
(void)taosThreadMutexDestroy(&tsAudit.connLock);
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auditCleanup() unconditionally locks/destroys tsAudit.connLock. If auditInit() returns early before connLock is initialized (e.g., records/locks init failure), cleanup can still be invoked during env teardown and this becomes undefined behavior. Track whether connLock was successfully initialized (and whether pLocalConn was ever created) and only lock/destroy/close when initialized; alternatively initialize connLock in a way that makes cleanup always safe.

Copilot uses AI. Check for mistakes.
Comment on lines +64 to +65
// Start non-blocking preconnect in background so startup and RPC threads never wait on taos_connect.
if (auditPreconnectLocal() != 0) {
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new auditLocalWrite config is added, but auditInit() always starts the local preconnect logic regardless of whether local write is enabled. This can start an unnecessary background thread/work on deployments that have auditLocalWrite=0. Gate the auditPreconnectLocal() call behind the runtime config (e.g., tsAuditLocalWrite) so the feature is truly opt-in at runtime.

Suggested change
// Start non-blocking preconnect in background so startup and RPC threads never wait on taos_connect.
if (auditPreconnectLocal() != 0) {
// Start non-blocking preconnect in background only when local audit write is enabled,
// so startup and RPC threads never wait on taos_connect for an opt-in feature.
if (tsAuditLocalWrite && auditPreconnectLocal() != 0) {

Copilot uses AI. Check for mistakes.
Comment on lines +4 to +14

import time

from ..common.basic import BasicFun


class TDTestCase:
def init(self, conn, logSql, replicaVar=1):
tdLog.debug(f"start to execute {__file__}")
self.replicaVar = int(replicaVar)
self.Fun = BasicFun()
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test uses a relative import (from ..common.basic import BasicFun), but tests/system-test is not a Python package (no __init__.py files) and there is no tests/system-test/common/ module in the repo, so the test will fail to import/run. Use the existing system-test utilities via absolute imports (e.g., util.dnodes, tdSql.init(conn.cursor(), ...)) or add the missing package/modules + __init__.py files if relative imports are required.

Suggested change
import time
from ..common.basic import BasicFun
class TDTestCase:
def init(self, conn, logSql, replicaVar=1):
tdLog.debug(f"start to execute {__file__}")
self.replicaVar = int(replicaVar)
self.Fun = BasicFun()
from util.dnodes import *
import time
class BasicFunCompat:
def __init__(self, conn, logSql):
self.conn = conn
self.logSql = logSql
self.TDDnodes = tdDnodes
self.dnodes_count = 1
def config_cluster(self, dnodes_count):
self.dnodes_count = int(dnodes_count)
if hasattr(self.TDDnodes, "init"):
self.TDDnodes.init(self.dnodes_count)
def deploy_start_cluster(self):
if hasattr(self.TDDnodes, "deploy"):
self.TDDnodes.deploy(self.dnodes_count)
if hasattr(self.TDDnodes, "start"):
self.TDDnodes.start(self.dnodes_count)
def connect(self):
tdSql.init(self.conn.cursor(), self.logSql)
class TDTestCase:
def init(self, conn, logSql, replicaVar=1):
tdLog.debug(f"start to execute {__file__}")
self.replicaVar = int(replicaVar)
self.conn = conn
self.logSql = logSql
tdSql.init(conn.cursor(), logSql)
self.Fun = BasicFunCompat(conn, logSql)

Copilot uses AI. Check for mistakes.
Comment on lines +18 to +64
PID=`ps -efww | grep /usr/bin/taosd | grep -v grep | awk '{print $2}'`
if [ -n "$PID" ]; then
echo systemctl stop taosd
systemctl stop taosd
fi

PID=`psby taosd | grep -w "[t]aosd" | awk '{print $1}' | head -n 1`
while [ -n "$PID" ]; do
echo kill -9 $PID
#pkill -9 taosd
kill -9 $PID
echo "Killing taosd processes"
if [ "$OS_TYPE" != "Darwin" ]; then
fuser -k -n tcp 6030
else
lsof -nti:6030 | xargs kill -9
fi
PID=`psby taosd | grep -w "[t]aosd" | awk '{print $1}' | head -n 1`
done

PID=`psby taos | grep -w "[t]aos" | awk '{print $1}' | head -n 1`
while [ -n "$PID" ]; do
echo kill -9 $PID
#pkill -9 taos
kill -9 $PID
echo "Killing taos processes"
if [ "$OS_TYPE" != "Darwin" ]; then
fuser -k -n tcp 6030
else
lsof -nti:6030 | xargs kill -9
fi
PID=`psby taos | grep -w "[t]aos" | awk '{print $1}' | head -n 1`
done

PID=`psby tmq_sim | grep -w "[t]mq_sim" | awk '{print $1}' | head -n 1`
while [ -n "$PID" ]; do
echo kill -9 $PID
#pkill -9 tmq_sim
kill -9 $PID
echo "Killing tmq_sim processes"
if [ "$OS_TYPE" != "Darwin" ]; then
fuser -k -n tcp 6030
else
lsof -nti:6030 | xargs kill -9
fi
PID=`psby tmq_sim | grep -w "[t]mq_sim" | awk '{print $1}' | head -n 1`
done
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These while [ -n "$PID" ] loops can spin forever if kill -9 $PID fails (e.g., insufficient permissions or PID reused between iterations). Add a bounded retry counter / timeout, and consider using pkill with process name plus a final check, so the script exits deterministically even when processes can’t be killed.

Suggested change
PID=`ps -efww | grep /usr/bin/taosd | grep -v grep | awk '{print $2}'`
if [ -n "$PID" ]; then
echo systemctl stop taosd
systemctl stop taosd
fi
PID=`psby taosd | grep -w "[t]aosd" | awk '{print $1}' | head -n 1`
while [ -n "$PID" ]; do
echo kill -9 $PID
#pkill -9 taosd
kill -9 $PID
echo "Killing taosd processes"
if [ "$OS_TYPE" != "Darwin" ]; then
fuser -k -n tcp 6030
else
lsof -nti:6030 | xargs kill -9
fi
PID=`psby taosd | grep -w "[t]aosd" | awk '{print $1}' | head -n 1`
done
PID=`psby taos | grep -w "[t]aos" | awk '{print $1}' | head -n 1`
while [ -n "$PID" ]; do
echo kill -9 $PID
#pkill -9 taos
kill -9 $PID
echo "Killing taos processes"
if [ "$OS_TYPE" != "Darwin" ]; then
fuser -k -n tcp 6030
else
lsof -nti:6030 | xargs kill -9
fi
PID=`psby taos | grep -w "[t]aos" | awk '{print $1}' | head -n 1`
done
PID=`psby tmq_sim | grep -w "[t]mq_sim" | awk '{print $1}' | head -n 1`
while [ -n "$PID" ]; do
echo kill -9 $PID
#pkill -9 tmq_sim
kill -9 $PID
echo "Killing tmq_sim processes"
if [ "$OS_TYPE" != "Darwin" ]; then
fuser -k -n tcp 6030
else
lsof -nti:6030 | xargs kill -9
fi
PID=`psby tmq_sim | grep -w "[t]mq_sim" | awk '{print $1}' | head -n 1`
done
kill_processes_with_retry() {
PROC_NAME="$1"
GREP_PATTERN="$2"
MAX_RETRIES=10
RETRY_COUNT=0
PID=`psby "$PROC_NAME" | grep -w "$GREP_PATTERN" | awk '{print $1}' | head -n 1`
while [ -n "$PID" ] && [ "$RETRY_COUNT" -lt "$MAX_RETRIES" ]; do
echo kill -9 "$PID"
kill -9 "$PID"
if command -v pkill >/dev/null 2>&1; then
pkill -9 "$PROC_NAME"
fi
echo "Killing $PROC_NAME processes"
if [ "$OS_TYPE" != "Darwin" ]; then
fuser -k -n tcp 6030
else
lsof -nti:6030 | xargs kill -9
fi
RETRY_COUNT=`expr "$RETRY_COUNT" + 1`
sleep 1
PID=`psby "$PROC_NAME" | grep -w "$GREP_PATTERN" | awk '{print $1}' | head -n 1`
done
if [ -n "$PID" ]; then
echo "Failed to kill $PROC_NAME processes after $MAX_RETRIES attempts; last PID: $PID"
fi
}
PID=`ps -efww | grep /usr/bin/taosd | grep -v grep | awk '{print $2}'`
if [ -n "$PID" ]; then
echo systemctl stop taosd
systemctl stop taosd
fi
kill_processes_with_retry taosd "[t]aosd"
kill_processes_with_retry taos "[t]aos"
kill_processes_with_retry tmq_sim "[t]mq_sim"

Copilot uses AI. Check for mistakes.
Comment on lines +31 to +34
fuser -k -n tcp 6030
else
lsof -nti:6030 | xargs kill -9
fi
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On macOS, lsof -nti:6030 | xargs kill -9 may invoke kill with no arguments when nothing is listening on the port, producing errors/noise. Guard the pipeline (e.g., only run kill when PIDs are non-empty) to keep the stop script quiet and predictable.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants