Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Auditd client goroutine leak #42933

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

fearful-symmetry
Copy link
Contributor

Proposed commit message

This fixes a rather severe (and interesting) goroutine leak where a failure in Run() where if client.GetStatus() fails, we can assign a new client to the client pointer before the goroutine created by closeAuditClient() exits, thus causing that goroutine to just live forever and periodically read from the client socket.

I can't get client.GetStatus() to fail on its own, but I did some testing where I hard-coded an error return, and auditbeat will grow about 0-3 goroutines a minute. In the real world, I've seen this result in 30K or so goroutines after a few days of running.

This just adds a waitgroup, so closeAuditClient() blocks until it's actually done.

This currently doesn't add any tests, I'm debating on if we want to refactor Run() so we can better add some unit tests to this; ideally we would come up with something that can test both this and the original bug fixed by #26032

We might also want to make ms.kernelLost.enabled, to enable workarounds for this in the future.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry self-assigned this Feb 26, 2025
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner February 26, 2025 22:03
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 26, 2025
@botelastic
Copy link

botelastic bot commented Feb 26, 2025

This pull request doesn't have a Team:<team> label.

@fearful-symmetry fearful-symmetry changed the title Auditd client go leak Fix Auditd client goroutine leak Feb 26, 2025
Copy link
Contributor

mergify bot commented Feb 26, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

Copy link
Contributor

@efd6 efd6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is sort of unfortunate. It looks like the draining behaviour should be in the library. I looked into this and this was also suggested in the change that added the deadlock protection.

Comment on lines 157 to 161
closeWaiter := &sync.WaitGroup{}
closeWaiter.Add(1)
go func(testClient *libaudit.AuditClient) {
for {
_, err := client.Netlink.Receive(true, discard)
_, err := testClient.Netlink.Receive(true, discard)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest

var wg sync.WaitGroup
wg.Add(1)
go func() {
    defer wg.Done()
    for {
        _, err := client.Netlink.Receive(true, discard)

with associated changes below.

rationale:

  • hyperlocal labels don't need to be so heavy
  • localise the wg
  • the captured scope is just as safe as passing the parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed, that's a bit cleaner

Comment on lines 175 to 176
// If we don't wait for the socket to drain, the calling function may tinker with
// the AuditClient pointer while the goroutine is trying to drain it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify this comment? AFAICS the *libaudit.AuditClient that's passed in here is never altered after the call returns.

Comment on lines +176 to +177
// If we don't wait for the socket to drain, the calling function can
// assign a new client instance to the same pointer while our goroutine is using it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the situation, then it seems to me that referring to the containing struct and nilling out the field on completion would do this without a wait group. This would be

	defer func() {
		ms.control.Close()
		client := ms.client
		ms.client = nil
		closeAuditClient(client, ms.log)
	}()

in *MetricSet.Run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we looking at the same chunk of code? The problem is in this loop:

			for {
				select {
				case <-reporter.Done():
					return
				case <-timer.C:
					if status, err := client.GetStatus(); err == nil {
						ms.updateKernelLostMetric(status.Lost)
					} else {
						ms.log.Error("get status request failed:", err)
						closeAuditClient(client, ms.log)
						client, err = libaudit.NewAuditClient(nil)
						if err != nil {
							ms.log.Errorw("Failure creating audit monitoring client", "error", err)
							reporter.Error(err)
							return
						}
					}
				}
			}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have client being closed in the call which runs in a goroutine, and then client is overwritten by the call to libaudit.NewClient, so the old value is no longer available to anyone except the scope in the previous call to closeAuditClient. There is another case like that nearby which is also similarly safe. In the case in *MetricSet.Run this is not true since the ms.client is passed in and retained in the MetricSet; it's never used after that, but it could be, so nilling it out is an relinquishment of ownership.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs_team Indicates that the issue/PR needs a Team:* label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants