Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

administratively OFF seems to have propagated to another UPS #2822

Open
opoplawski opened this issue Feb 25, 2025 · 1 comment
Open

administratively OFF seems to have propagated to another UPS #2822

opoplawski opened this issue Feb 25, 2025 · 1 comment
Labels
APC Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time impacts-release-2.8.2 Issues reported against NUT release 2.8.2 (maybe vanilla or with minor packaging tweaks) serial port Shutdowns and overrides and battery level triggers Issues and PRs about system shutdown, especially if battery charge/runtime remaining is involved SNMP upsmon

Comments

@opoplawski
Copy link

We have systems with redundant power supplies on different UPSes. Yesterday we had to cold restart (completely power off) one of the UPSes to clear an error. This incorrectly triggered a shutdown on some of the clients.

smartups6000rt@SERVER1 is the UPS that was powered off.
smartups3000rmxl_rack1@SERVER2 was not. However, there seems to have been a communication glitch around the same time.

On the client:

Feb 24 11:04:37 nut-monitor[2789]: UPS smartups3000rmxl_rack1@SERVER2: administratively OFF or asleep
Feb 24 11:04:52 nut-monitor[2789]: UPS smartups6000rt@SERVER1: administratively OFF or asleep
Feb 24 11:05:12 nut-monitor[2789]: ups_is_off: smartups3000rmxl_rack1@SERVER2 is in state OFF for 35 sec, assuming the line is not fed (if it is calibrating etc., check the upsmon 'OFFDURATION' option)
Feb 24 11:05:12 nut-monitor[2789]: UPS [smartups3000rmxl_rack1@SERVER2] is reported as (administratively) OFF
Feb 24 11:05:17 nut-monitor[2789]: ups_is_off: smartups3000rmxl_rack1@SERVER2 is in state OFF for 40 sec, assuming the line is not fed (if it is calibrating etc., check the upsmon 'OFFDURATION' option)
Feb 24 11:05:17 nut-monitor[2789]: UPS [smartups3000rmxl_rack1@SERVER2] is reported as (administratively) OFF
Feb 24 11:05:22 nut-monitor[2789]: ups_is_off: smartups3000rmxl_rack1@SERVER2 is in state OFF for 45 sec, assuming the line is not fed (if it is calibrating etc., check the upsmon 'OFFDURATION' option)
Feb 24 11:05:22 nut-monitor[2789]: UPS [smartups3000rmxl_rack1@SERVER2] is reported as (administratively) OFF
Feb 24 11:05:27 nut-monitor[2789]: ups_is_off: smartups6000rt@SERVER1 is in state OFF for 35 sec, assuming the line is not fed (if it is calibrating etc., check the upsmon 'OFFDURATION' option)
Feb 24 11:05:27 nut-monitor[2789]: ups_is_off: smartups3000rmxl_rack1@SERVER2 is in state OFF for 50 sec, assuming the line is not fed (if it is calibrating etc., check the upsmon 'OFFDURATION' option)
Feb 24 11:05:27 nut-monitor[2789]: UPS [smartups6000rt@SERVER1] is reported as (administratively) OFF
Feb 24 11:05:27 nut-monitor[2789]: UPS [smartups3000rmxl_rack1@SERVER2] is reported as (administratively) OFF
Feb 24 11:05:27 nut-monitor[2789]: Executing automatic power-fail shutdown
Feb 24 11:05:27 nut-monitor[943636]: wall: cannot get tty name: Inappropriate ioctl for device
Feb 24 11:05:27 nut-monitor[2789]: Auto logout and shutdown proceeding
Feb 24 11:05:27 nut-monitor[943638]: wall: cannot get tty name: Inappropriate ioctl for device
Feb 24 11:05:27 nut-monitor[943637]: Network UPS Tools upsmon 2.8.2
Feb 24 11:05:32 nut-monitor[2789]: Network UPS Tools upsmon 2.8.2
Feb 24 11:05:32 nut-monitor[943639]: Shutdown scheduled for Mon 2025-02-24 11:05:32 MST, use 'shutdown -c' to cancel.

Why is it reporting smartups3000rmxl_rack1@SERVER2 as off?

On SERVER1:

Feb 24 11:07:38 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for device.description
Feb 24 11:07:38 upsd[3469324]: Data for UPS [smartups6000rt] is stale - check driver
Feb 24 11:07:38 nut-server[3469324]: Data for UPS [smartups6000rt] is stale - check driver
Feb 24 11:07:44 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for device.contact
Feb 24 11:07:50 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for device.location
Feb 24 11:07:56 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for input.voltage
Feb 24 11:08:02 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for input.voltage.maximum
Feb 24 11:08:08 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for input.voltage.minimum
Feb 24 11:08:14 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for input.frequency
Feb 24 11:08:19 nut-server[3469324]: SSL handshake done successfully with client 10.10.40.1
Feb 24 11:08:19 upsd[3469324]: SSL handshake done successfully with client 10.10.40.1
Feb 24 11:08:19 nut-server[3469324]: User [email protected] logged into UPS [smartups6000rt] (SSL)
Feb 24 11:08:19 upsd[3469324]: User [email protected] logged into UPS [smartups6000rt] (SSL)
Feb 24 11:08:20 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for input.transfer.low
Feb 24 11:08:26 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for input.transfer.high
Feb 24 11:08:32 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for input.transfer.reason
Feb 24 11:08:38 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for ups.status
Feb 24 11:08:44 snmp-ups[3469946]: [smartups6000rt] snmp_ups_walk: data stale for ups.status
Feb 24 11:08:48 nut-server[3469324]: UPS [smartups6000rt] data is no longer stale
Feb 24 11:08:48 upsd[3469324]: UPS [smartups6000rt] data is no longer stale
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for device.description
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for device.contact
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for device.location
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for input.voltage
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for input.voltage.maximum
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for input.voltage.minimum
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for input.frequency
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for input.transfer.low
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for input.transfer.high
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for input.transfer.reason
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for ups.status
Feb 24 11:09:21 snmp-ups[3469946]: [smartups6000rt] get_and_process_data: data resumed for ups.status

On SERVER2:

Feb 24 11:04:53 nut-monitor[243268]: UPS smartups6000rt@SERVER1: administratively OFF or asleep
Feb 24 11:05:28 nut-monitor[243268]: ups_is_off: smartups6000rt@SERVER1 is in state OFF for 35 sec, assuming the line is not fed (if it is calibrating etc., check the upsmon 'OFFDURATION' option)
Feb 24 11:05:28 nut-monitor[243268]: UPS [smartups6000rt@SERVER1] is reported as (administratively) OFF
...
Feb 24 11:07:11 apcsmart[244440]: Communications with UPS lost: serial port read timeout: 813(poll_data)
Feb 24 11:07:11 upsd[243986]: Data for UPS [smartups3000rmxl_rack1] is stale - check driver
Feb 24 11:07:11 nut-server[243986]: Data for UPS [smartups3000rmxl_rack1] is stale - check driver
Feb 24 11:07:33 nut-monitor[243268]: ups_is_off: smartups6000rt@SERVER1 is in state OFF for 160 sec, assuming the line is not fed (if it is calibrating etc., check the upsmon 'OFFDURATION' option)
Feb 24 11:07:33 nut-monitor[243268]: UPS [smartups6000rt@SERVER1] is reported as (administratively) OFF
Feb 24 11:07:38 nut-monitor[243268]: Poll UPS [smartups6000rt@SERVER1] failed - Data stale
Feb 24 11:07:38 nut-monitor[243268]: Communications with UPS smartups6000rt@SERVER1 lost
Feb 24 11:07:38 nut-monitor[243268]: UPS [smartups6000rt@SERVER1] was last known to be (administratively) OFF and currently is not communicating, assuming dead
Feb 24 11:07:43 nut-monitor[243268]: Poll UPS [smartups6000rt@SERVER1] failed - Data stale
Feb 24 11:07:43 nut-monitor[243268]: UPS [smartups6000rt@SERVER1] was last known to be (administratively) OFF and currently is not communicating, assuming dead
...
Feb 24 11:08:32 apcsmart[244440]: Communications with UPS re-established
Feb 24 11:08:33 nut-monitor[243268]: Poll UPS [smartups6000rt@SERVER1] failed - Data stale
Feb 24 11:08:33 nut-monitor[243268]: UPS [smartups6000rt@SERVER1] was last known to be (administratively) OFF and currently is not communicating, assuming dead
Feb 24 11:08:36 nut-server[243986]: UPS [smartups3000rmxl_rack1] data is no longer stale
Feb 24 11:08:36 upsd[243986]: UPS [smartups3000rmxl_rack1] data is no longer stale
Feb 24 11:09:04 nut-server[243986]: UPS [smartups3000rmxl_rack1] data is no longer stale
Feb 24 11:09:04 upsd[243986]: UPS [smartups3000rmxl_rack1] data is no longer stale
Feb 24 11:09:08 nut-monitor[243268]: UPS [smartups6000rt@SERVER1] was last known to be (administratively) OFF and currently is not communicating, assuming dead
Feb 24 11:09:13 nut-monitor[243268]: UPS [smartups6000rt@SERVER1] was last known to be (administratively) OFF and currently is not communicating, assuming dead
Feb 24 11:09:18 nut-monitor[243268]: UPS [smartups6000rt@SERVER1] was last known to be (administratively) OFF and currently is not communicating, assuming dead
Feb 24 11:09:23 nut-monitor[243268]: Communications with UPS smartups6000rt@SERVER1 established
Feb 24 11:09:23 nut-monitor[243268]: UPS smartups6000rt@SERVER1: no longer administratively OFF or asleep

Near as I can tell, the fact that smartups6000rt went OFF on SERVER2 somehow triggered the state of smartups3000rmxl_rack1@SERVER2 to be OFF as well.

This is with NUT 2.8.2.

@jimklimov
Copy link
Member

jimklimov commented Feb 26, 2025

Quite odd, and concerning, thanks for bringing this up although I so far struggle to somehow explain what could be happening. While writing this, I've got some ideas - positing them below.

From messages I surmise the smartups6000rt runs with snmp-ups driver on SERVER1 and data are shared via NUT to other systems, and smartups3000rmxl_rack1 is connected with a serial port and apcsmart driver on SERVER2?

Can I assume the timestamps in question are all in sync on the servers and clients that provided the logs? At what timestamps were the systems rebooted? I suppose between 11:05 and 11:07?

Are there any messages (maybe not in NUT but UPSes' on-board controller UI) from the preceding minutes, maybe an hour, which would indicate them going into calibration or similar?

In these messages I see that smartups3000rmxl_rack1@SERVER2 became considered as in OFF state some 15 sec earlier (rounded to 5-sec polling cycle) than smartups6000rt@SERVER1:

Feb 24 11:05:27 nut-monitor[2789]: ups_is_off: smartups6000rt@SERVER1 is in state OFF for 35 sec, assuming the line is not fed (if it is calibrating etc., check the upsmon 'OFFDURATION' option)
Feb 24 11:05:27 nut-monitor[2789]: ups_is_off: smartups3000rmxl_rack1@SERVER2 is in state OFF for 50 sec, assuming the line is not fed (if it is calibrating etc., check the upsmon 'OFFDURATION' option)

While it is a similar timing, it does not seem that one leaked into another programmatically.

Did the problem with smartups3000rmxl_rack1 communications start shortly before the log you posted?

My current hypothesis is that around 11:03-11:04 the serial cable of smartups3000rmxl_rack1 got unwedged as you manipulated something for smartups6000rt maintenance. Looking at apcsmart.c::ups_status_set(), a lack of any known status bits causes it to status_set("OFF");

Then you powered off the smartups6000rt, and in drivers/apc-mib.c I see several mappings that emit "OFF" (with comments about actual state like off, timedSleeping, softwareBypass, sleepingUntilPowerReturn) with the default value per snmp_info_default("ups.status", ST_FLAG_STRING, SU_INFOSIZE, APCC_OID_POWER_STATUS, "OFF", ... also being "OFF".

So I think that there was a communication loss to both devices, which defaulted to reporting "OFF" for each of them independently (or in case of SNMP, the controller might have kept running and communicating and reported that state actively, and you only powered off the load? higher debug level could expose that nuance) and then upsmon behaved as designed with that feature.

This does not "forgive" the (presumed) fact that apcsmart should not have raised the "OFF" state if it lost communications and knew about that at the time, which could be also about unfortunate timing (no status bits and raised OFF happened first, then no reply from serial port queries were recognized - then it is probably correct to keep the OFF state in light of no other knowledge about the situation, or needs a lot more logic and checks and balances to get this bit right).

Also CCing @desertwitch as the resident expert on APC and these recent logical additions in NUT :)

@jimklimov jimklimov added APC SNMP serial port upsmon Shutdowns and overrides and battery level triggers Issues and PRs about system shutdown, especially if battery charge/runtime remaining is involved Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time impacts-release-2.8.2 Issues reported against NUT release 2.8.2 (maybe vanilla or with minor packaging tweaks) labels Feb 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
APC Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time impacts-release-2.8.2 Issues reported against NUT release 2.8.2 (maybe vanilla or with minor packaging tweaks) serial port Shutdowns and overrides and battery level triggers Issues and PRs about system shutdown, especially if battery charge/runtime remaining is involved SNMP upsmon
Projects
None yet
Development

No branches or pull requests

2 participants