Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix lost recovery notifications after recovery outside of notification time period #10187

Merged
merged 1 commit into from
Oct 24, 2024

Conversation

julianbrost
Copy link
Contributor

@julianbrost julianbrost commented Oct 11, 2024

This is fixed by using Checkable::GetStateBeforeSuppression() only where relevant.

Not all calls to Checkable::NotificationReasonApplies() need GetStateBeforeSuppression() to be checked. In fact, for one caller, FireSuppressedNotifications() in
lib/notification/notificationcomponent.cpp, the state before suppression may not even be initialized properly, so that the default value of OK is used which can lead to incorrect return values. Note the difference between suppressions happening on the level of the Checkable object level and the Notification object level. Only the first sets the state before suppression in the Checkable object, but so far, also the latter used that value incorrectly.

This commit moves the check of GetStateBeforeSuppression() from Checkable::NotificationReasonApplies() to the one place where it's actually relevant: Checkable::FireSuppressedNotifications(). This made the existing call to NotificationReasonApplies() unneccessary as it would always return true: the type argument is computed based on the current check result, so there's no need to check it against the current check result.

Tests

I've written a short stand-alone Icinga 2 config file that can be used to reproduce the bug. It used some wizardry you wouldn't want to use in a production config to create a host that starts in a problem state and recovers 60 seconds after startup. In addition to that, it configures a notification with a dynamically generated time period that spans the full current day but has a gab between 30 seconds and 90 seconds after daemon startup. Thus, a sent recovery notification should be logged 90 seconds after daemon startup.

icinga2.conf
include <itl>

// Don't try this at home!
// This will cause pain and suffering if used for other purposes than testing!
// It will cause problems in multi-node clusters!
// Don't say you haven't been warned!
var t = get_time()
var leavePeriodAt = t + 30s
log("leaving period at " + DateTime(leavePeriodAt).format("%T"))
var recoverAt     = t + 60s
log("recovering at " + DateTime(recoverAt).format("%T"))
var enterPeriodAt = t + 90s
log("entering period at " + DateTime(enterPeriodAt).format("%T"))

object Host "test-host" use (recoverAt) {
	check_command = "dummy"
	max_check_attempts = 1
	check_interval = 10s
	retry_interval = 10s
	vars.dummy_state = () use (recoverAt) => {
		inside = ", inside test-period: " + String(get_time_period("test-period").is_inside)
		if (get_time() > recoverAt) {
			log("check OK" + inside)
			return 0
		} else {
			log("check CRITICAL" + inside)
			return 2
		}
	}
}

object TimePeriod "test-period" use (leavePeriodAt, enterPeriodAt) {
	ranges[DateTime(leavePeriodAt).format("%A").lower()] = "00:00:00-" + DateTime(leavePeriodAt).format("%T") + "," + DateTime(enterPeriodAt).format("%T") + "-24:00:00"
	log(ranges)
}

object Notification "test-notification" {
	host_name = "test-host"
	command = "nop"
	period = "test-period"
	users = ["dummy"]
}

object NotificationCommand "nop" {
	command = ["true"]
}

object User "dummy" {}
object CheckerComponent "checker" {}
object NotificationComponent "notification" {}

It can easily be fired up in a container like this:

docker run --rm -it -v $(pwd)/icinga2.conf:/icinga2.conf:ro icinga/icinga2 timeout 2m icinga2 daemon -c /icinga2.conf

Current master branch

[2024-10-11 11:48:38 +0000] information/cli: Icinga application loader (version: v2.14.0-308-g5e9e0bbcd; debug)
[...]
[2024-10-11 11:48:38 +0000] information/config: leaving period at 11:49:08
[2024-10-11 11:48:38 +0000] information/config: recovering at 11:49:38
[2024-10-11 11:48:38 +0000] information/config: entering period at 11:50:08
[...]
[2024-10-11 11:48:46 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:48:46 +0000] information/Checkable: Checkable 'test-host' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-10-11 11:48:46 +0000] information/Notification: Sending 'Problem' notification 'test-host!test-notification' for user 'dummy'
[2024-10-11 11:48:46 +0000] information/Notification: Completed sending 'Problem' notification 'test-host!test-notification' for checkable 'test-host' and user 'dummy' using command 'nop'.
[2024-10-11 11:48:55 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:49:05 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:49:15 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:49:25 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:49:35 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:49:45 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:49:45 +0000] information/Checkable: Checkable 'test-host' has 1 notification(s). Checking filters for type 'Recovery', sends will be logged.
[2024-10-11 11:49:55 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:50:05 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:50:15 +0000] information/config: check OK, inside test-period: true
[2024-10-11 11:50:25 +0000] information/config: check OK, inside test-period: true
[2024-10-11 11:50:35 +0000] information/config: check OK, inside test-period: true

The problem notification is logged, though there's no recovery notification logged around the time when the output switches back to "inside test-period: true".

This PR

[2024-10-11 11:51:05 +0000] information/cli: Icinga application loader (version: v2.14.0-309-g7d0a43f92; debug)
[...]
[2024-10-11 11:51:05 +0000] information/config: leaving period at 11:51:35
[2024-10-11 11:51:05 +0000] information/config: recovering at 11:52:05
[2024-10-11 11:51:05 +0000] information/config: entering period at 11:52:35
[...]
[2024-10-11 11:51:13 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:51:13 +0000] information/Checkable: Checkable 'test-host' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-10-11 11:51:13 +0000] information/Notification: Sending 'Problem' notification 'test-host!test-notification' for user 'dummy'
[2024-10-11 11:51:13 +0000] information/Notification: Completed sending 'Problem' notification 'test-host!test-notification' for checkable 'test-host' and user 'dummy' using command 'nop'.
[2024-10-11 11:51:23 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:51:32 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:51:41 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:51:51 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:52:00 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:52:10 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:52:10 +0000] information/Checkable: Checkable 'test-host' has 1 notification(s). Checking filters for type 'Recovery', sends will be logged.
[2024-10-11 11:52:19 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:52:28 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:52:35 +0000] information/Notification: Sending 'Recovery' notification 'test-host!test-notification' for user 'dummy'
[2024-10-11 11:52:35 +0000] information/Notification: Completed sending 'Recovery' notification 'test-host!test-notification' for checkable 'test-host' and user 'dummy' using command 'nop'.
[2024-10-11 11:52:38 +0000] information/config: check OK, inside test-period: true
[2024-10-11 11:52:47 +0000] information/config: check OK, inside test-period: true
[2024-10-11 11:52:57 +0000] information/config: check OK, inside test-period: true

Comparison with PR #10032

My rationale for this PR is that it simplifies the code compared to #10032.

Looking purely at the functionality, there should be no difference between the two. Actually, if you take #10032, move the !(cr->GetState() == GetStateBeforeSuppression() && GetSuppressedNotifications() & stateNotifications) part proposed there to each call site of Checkable::NotificationReasonApplies() and then remove all the redundant checks, you should end up pretty much with this PR.

fixes #10025
closes #10032
refs #9207 (introduced the issue)

This fixes an issue where recovery notifications get lost if they happen
outside of a notification time period.

Not all calls to `Checkable::NotificationReasonApplies()` need
`GetStateBeforeSuppression()` to be checked. In fact, for one caller,
`FireSuppressedNotifications()` in
`lib/notification/notificationcomponent.cpp`, the state before suppression may
not even be initialized properly, so that the default value of OK is used which
can lead to incorrect return values. Note the difference between suppressions
happening on the level of the `Checkable` object level and the `Notification`
object level. Only the first sets the state before suppression in the
`Checkable` object, but so far, also the latter used that value incorrectly.

This commit moves the check of `GetStateBeforeSuppression()` from
`Checkable::NotificationReasonApplies()` to the one place where it's actually
relevant: `Checkable::FireSuppressedNotifications()`. This made the existing
call to `NotificationReasonApplies()` unneccessary as it would always return
true: the `type` argument is computed based on the current check result, so
there's no need to check it against the current check result.
@julianbrost julianbrost added bug Something isn't working area/notifications Notification events consider backporting Should be considered for inclusion in a bugfix release labels Oct 11, 2024
@cla-bot cla-bot bot added the cla/signed label Oct 11, 2024
@icinga-probot icinga-probot bot added the ref/IP label Oct 11, 2024
@julianbrost julianbrost changed the title Use Checkable::GetStateBeforeSuppression() only where relevant Fix lost recovery notifications after recovery outside of notification time period Oct 11, 2024
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LFTM!

@yhabteab yhabteab added this to the 2.15.0 milestone Oct 21, 2024
lib/icinga/checkable-notification.cpp Show resolved Hide resolved
lib/icinga/checkable-notification.cpp Show resolved Hide resolved
lib/icinga/checkable-notification.cpp Show resolved Hide resolved
lib/icinga/checkable-notification.cpp Show resolved Hide resolved
@Al2Klimov Al2Klimov merged commit fb8badf into master Oct 24, 2024
26 checks passed
@Al2Klimov Al2Klimov deleted the state-before-suppression branch October 24, 2024 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/notifications Notification events bug Something isn't working cla/signed consider backporting Should be considered for inclusion in a bugfix release ref/IP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing notification after recovery outside the time period
3 participants