-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Auditd client goroutine leak #42933
base: main
Are you sure you want to change the base?
Fix Auditd client goroutine leak #42933
Conversation
This pull request doesn't have a |
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is sort of unfortunate. It looks like the draining behaviour should be in the library. I looked into this and this was also suggested in the change that added the deadlock protection.
closeWaiter := &sync.WaitGroup{} | ||
closeWaiter.Add(1) | ||
go func(testClient *libaudit.AuditClient) { | ||
for { | ||
_, err := client.Netlink.Receive(true, discard) | ||
_, err := testClient.Netlink.Receive(true, discard) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
for {
_, err := client.Netlink.Receive(true, discard)
with associated changes below.
rationale:
- hyperlocal labels don't need to be so heavy
- localise the wg
- the captured scope is just as safe as passing the parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agreed, that's a bit cleaner
// If we don't wait for the socket to drain, the calling function may tinker with | ||
// the AuditClient pointer while the goroutine is trying to drain it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify this comment? AFAICS the *libaudit.AuditClient
that's passed in here is never altered after the call returns.
// If we don't wait for the socket to drain, the calling function can | ||
// assign a new client instance to the same pointer while our goroutine is using it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the situation, then it seems to me that referring to the containing struct and nilling out the field on completion would do this without a wait group. This would be
defer func() {
ms.control.Close()
client := ms.client
ms.client = nil
closeAuditClient(client, ms.log)
}()
in *MetricSet.Run
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we looking at the same chunk of code? The problem is in this loop:
for {
select {
case <-reporter.Done():
return
case <-timer.C:
if status, err := client.GetStatus(); err == nil {
ms.updateKernelLostMetric(status.Lost)
} else {
ms.log.Error("get status request failed:", err)
closeAuditClient(client, ms.log)
client, err = libaudit.NewAuditClient(nil)
if err != nil {
ms.log.Errorw("Failure creating audit monitoring client", "error", err)
reporter.Error(err)
return
}
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have client
being closed in the call which runs in a goroutine, and then client
is overwritten by the call to libaudit.NewClient
, so the old value is no longer available to anyone except the scope in the previous call to closeAuditClient
. There is another case like that nearby which is also similarly safe. In the case in *MetricSet.Run
this is not true since the ms.client
is passed in and retained in the MetricSet
; it's never used after that, but it could be, so nilling it out is an relinquishment of ownership.
Proposed commit message
This fixes a rather severe (and interesting) goroutine leak where a failure in
Run()
where ifclient.GetStatus()
fails, we can assign a new client to theclient
pointer before the goroutine created bycloseAuditClient()
exits, thus causing that goroutine to just live forever and periodically read from the client socket.I can't get
client.GetStatus()
to fail on its own, but I did some testing where I hard-coded an error return, and auditbeat will grow about 0-3 goroutines a minute. In the real world, I've seen this result in 30K or so goroutines after a few days of running.This just adds a waitgroup, so
closeAuditClient()
blocks until it's actually done.This currently doesn't add any tests, I'm debating on if we want to refactor
Run()
so we can better add some unit tests to this; ideally we would come up with something that can test both this and the original bug fixed by #26032We might also want to make
ms.kernelLost.enabled
, to enable workarounds for this in the future.Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.