Fix lost recovery notifications after recovery outside of notification time period #10187

julianbrost · 2024-10-11T11:57:21Z

This is fixed by using Checkable::GetStateBeforeSuppression() only where relevant.

Not all calls to Checkable::NotificationReasonApplies() need GetStateBeforeSuppression() to be checked. In fact, for one caller, FireSuppressedNotifications() in
lib/notification/notificationcomponent.cpp, the state before suppression may not even be initialized properly, so that the default value of OK is used which can lead to incorrect return values. Note the difference between suppressions happening on the level of the Checkable object level and the Notification object level. Only the first sets the state before suppression in the Checkable object, but so far, also the latter used that value incorrectly.

This commit moves the check of GetStateBeforeSuppression() from Checkable::NotificationReasonApplies() to the one place where it's actually relevant: Checkable::FireSuppressedNotifications(). This made the existing call to NotificationReasonApplies() unneccessary as it would always return true: the type argument is computed based on the current check result, so there's no need to check it against the current check result.

Tests

I've written a short stand-alone Icinga 2 config file that can be used to reproduce the bug. It used some wizardry you wouldn't want to use in a production config to create a host that starts in a problem state and recovers 60 seconds after startup. In addition to that, it configures a notification with a dynamically generated time period that spans the full current day but has a gab between 30 seconds and 90 seconds after daemon startup. Thus, a sent recovery notification should be logged 90 seconds after daemon startup.

icinga2.conf

include <itl>

// Don't try this at home!
// This will cause pain and suffering if used for other purposes than testing!
// It will cause problems in multi-node clusters!
// Don't say you haven't been warned!
var t = get_time()
var leavePeriodAt = t + 30s
log("leaving period at " + DateTime(leavePeriodAt).format("%T"))
var recoverAt     = t + 60s
log("recovering at " + DateTime(recoverAt).format("%T"))
var enterPeriodAt = t + 90s
log("entering period at " + DateTime(enterPeriodAt).format("%T"))

object Host "test-host" use (recoverAt) {
	check_command = "dummy"
	max_check_attempts = 1
	check_interval = 10s
	retry_interval = 10s
	vars.dummy_state = () use (recoverAt) => {
		inside = ", inside test-period: " + String(get_time_period("test-period").is_inside)
		if (get_time() > recoverAt) {
			log("check OK" + inside)
			return 0
		} else {
			log("check CRITICAL" + inside)
			return 2
		}
	}
}

object TimePeriod "test-period" use (leavePeriodAt, enterPeriodAt) {
	ranges[DateTime(leavePeriodAt).format("%A").lower()] = "00:00:00-" + DateTime(leavePeriodAt).format("%T") + "," + DateTime(enterPeriodAt).format("%T") + "-24:00:00"
	log(ranges)
}

object Notification "test-notification" {
	host_name = "test-host"
	command = "nop"
	period = "test-period"
	users = ["dummy"]
}

object NotificationCommand "nop" {
	command = ["true"]
}

object User "dummy" {}
object CheckerComponent "checker" {}
object NotificationComponent "notification" {}

It can easily be fired up in a container like this:

docker run --rm -it -v $(pwd)/icinga2.conf:/icinga2.conf:ro icinga/icinga2 timeout 2m icinga2 daemon -c /icinga2.conf

Current master branch

[2024-10-11 11:48:38 +0000] information/cli: Icinga application loader (version: v2.14.0-308-g5e9e0bbcd; debug)
[...]
[2024-10-11 11:48:38 +0000] information/config: leaving period at 11:49:08
[2024-10-11 11:48:38 +0000] information/config: recovering at 11:49:38
[2024-10-11 11:48:38 +0000] information/config: entering period at 11:50:08
[...]
[2024-10-11 11:48:46 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:48:46 +0000] information/Checkable: Checkable 'test-host' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-10-11 11:48:46 +0000] information/Notification: Sending 'Problem' notification 'test-host!test-notification' for user 'dummy'
[2024-10-11 11:48:46 +0000] information/Notification: Completed sending 'Problem' notification 'test-host!test-notification' for checkable 'test-host' and user 'dummy' using command 'nop'.
[2024-10-11 11:48:55 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:49:05 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:49:15 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:49:25 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:49:35 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:49:45 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:49:45 +0000] information/Checkable: Checkable 'test-host' has 1 notification(s). Checking filters for type 'Recovery', sends will be logged.
[2024-10-11 11:49:55 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:50:05 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:50:15 +0000] information/config: check OK, inside test-period: true
[2024-10-11 11:50:25 +0000] information/config: check OK, inside test-period: true
[2024-10-11 11:50:35 +0000] information/config: check OK, inside test-period: true

The problem notification is logged, though there's no recovery notification logged around the time when the output switches back to "inside test-period: true".

This PR

[2024-10-11 11:51:05 +0000] information/cli: Icinga application loader (version: v2.14.0-309-g7d0a43f92; debug)
[...]
[2024-10-11 11:51:05 +0000] information/config: leaving period at 11:51:35
[2024-10-11 11:51:05 +0000] information/config: recovering at 11:52:05
[2024-10-11 11:51:05 +0000] information/config: entering period at 11:52:35
[...]
[2024-10-11 11:51:13 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:51:13 +0000] information/Checkable: Checkable 'test-host' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-10-11 11:51:13 +0000] information/Notification: Sending 'Problem' notification 'test-host!test-notification' for user 'dummy'
[2024-10-11 11:51:13 +0000] information/Notification: Completed sending 'Problem' notification 'test-host!test-notification' for checkable 'test-host' and user 'dummy' using command 'nop'.
[2024-10-11 11:51:23 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:51:32 +0000] information/config: check CRITICAL, inside test-period: true
[2024-10-11 11:51:41 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:51:51 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:52:00 +0000] information/config: check CRITICAL, inside test-period: false
[2024-10-11 11:52:10 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:52:10 +0000] information/Checkable: Checkable 'test-host' has 1 notification(s). Checking filters for type 'Recovery', sends will be logged.
[2024-10-11 11:52:19 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:52:28 +0000] information/config: check OK, inside test-period: false
[2024-10-11 11:52:35 +0000] information/Notification: Sending 'Recovery' notification 'test-host!test-notification' for user 'dummy'
[2024-10-11 11:52:35 +0000] information/Notification: Completed sending 'Recovery' notification 'test-host!test-notification' for checkable 'test-host' and user 'dummy' using command 'nop'.
[2024-10-11 11:52:38 +0000] information/config: check OK, inside test-period: true
[2024-10-11 11:52:47 +0000] information/config: check OK, inside test-period: true
[2024-10-11 11:52:57 +0000] information/config: check OK, inside test-period: true

Comparison with PR #10032

My rationale for this PR is that it simplifies the code compared to #10032.

Looking purely at the functionality, there should be no difference between the two. Actually, if you take #10032, move the !(cr->GetState() == GetStateBeforeSuppression() && GetSuppressedNotifications() & stateNotifications) part proposed there to each call site of Checkable::NotificationReasonApplies() and then remove all the redundant checks, you should end up pretty much with this PR.

fixes #10025
closes #10032
refs #9207 (introduced the issue)

This fixes an issue where recovery notifications get lost if they happen outside of a notification time period. Not all calls to `Checkable::NotificationReasonApplies()` need `GetStateBeforeSuppression()` to be checked. In fact, for one caller, `FireSuppressedNotifications()` in `lib/notification/notificationcomponent.cpp`, the state before suppression may not even be initialized properly, so that the default value of OK is used which can lead to incorrect return values. Note the difference between suppressions happening on the level of the `Checkable` object level and the `Notification` object level. Only the first sets the state before suppression in the `Checkable` object, but so far, also the latter used that value incorrectly. This commit moves the check of `GetStateBeforeSuppression()` from `Checkable::NotificationReasonApplies()` to the one place where it's actually relevant: `Checkable::FireSuppressedNotifications()`. This made the existing call to `NotificationReasonApplies()` unneccessary as it would always return true: the `type` argument is computed based on the current check result, so there's no need to check it against the current check result.

yhabteab

LFTM!

lib/icinga/checkable-notification.cpp

julianbrost added bug Something isn't working area/notifications Notification events consider backporting Should be considered for inclusion in a bugfix release labels Oct 11, 2024

cla-bot bot added the cla/signed label Oct 11, 2024

icinga-probot bot added the ref/IP label Oct 11, 2024

julianbrost requested review from Al2Klimov and yhabteab October 11, 2024 11:57

julianbrost changed the title ~~Use Checkable::GetStateBeforeSuppression() only where relevant~~ Fix lost recovery notifications after recovery outside of notification time period Oct 11, 2024

yhabteab approved these changes Oct 21, 2024

View reviewed changes

yhabteab added this to the 2.15.0 milestone Oct 21, 2024

Al2Klimov approved these changes Oct 23, 2024

View reviewed changes

lib/icinga/checkable-notification.cpp Show resolved Hide resolved

lib/icinga/checkable-notification.cpp Show resolved Hide resolved

lib/icinga/checkable-notification.cpp Show resolved Hide resolved

lib/icinga/checkable-notification.cpp Show resolved Hide resolved

Al2Klimov merged commit fb8badf into master Oct 24, 2024
26 checks passed

Al2Klimov deleted the state-before-suppression branch October 24, 2024 08:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix lost recovery notifications after recovery outside of notification time period #10187

Fix lost recovery notifications after recovery outside of notification time period #10187

julianbrost commented Oct 11, 2024 •

edited

Loading

yhabteab left a comment

Fix lost recovery notifications after recovery outside of notification time period #10187

Fix lost recovery notifications after recovery outside of notification time period #10187

Conversation

julianbrost commented Oct 11, 2024 • edited Loading

Tests

Current master branch

This PR

Comparison with PR #10032

yhabteab left a comment

Choose a reason for hiding this comment

julianbrost commented Oct 11, 2024 •

edited

Loading