Nut shutdown assuming dead ups #2794

electrofloat · 2025-02-01T10:46:27Z

Hi!

I don't know if this is a bug or a misconfiguration, but here is what happened.
I have 2 machines: one is the netserver, the other is the netclient.
The netserver is connected over USB to a Back-UPS RS 550G (as reported by device.model), with the following config:

[ups]
  driver = usbhid-ups
  port = auto
  onlinedischarge_calibration = 1

Both machines run Ubuntu 24.04.1 LTS with NUT version 2.8.1-3.1ubuntu2.

The UPS seems to calibrate itself every two weeks, and the last calibration was on 2025-01-29:

2025-01-29T14:39:04.808340+01:00 server nut-monitor[2772730]: UPS ups@localhost: administratively OFF or asleep
2025-01-29T14:39:04.816059+01:00 server nut-monitor[1551623]: Network UPS Tools upsmon 2.8.1
2025-01-29T14:39:14.810120+01:00 server nut-monitor[2772730]: UPS ups@localhost: no longer administratively OFF or asleep
2025-01-29T14:39:14.818648+01:00 server nut-monitor[1551648]: Network UPS Tools upsmon 2.8.1

Ubuntu has a feature where, when a library used by a process is upgraded, it restarts the corresponding systemd unit so that the process picks up the updated library.
Recently a bunch of libraries were updated, and when NUT restarted on the netserver, the netclient machine shut itself down.
At first I did not understand what had happened, so I started the netclient manually, checked the logs, and saw this:

2025-01-30T19:57:28.515107+01:00 client nut-monitor[3178]: Poll UPS [[email protected]] failed - Write error: Broken pipe
2025-01-30T19:57:28.515556+01:00 client nut-monitor[3178]: Communications with UPS [email protected] lost
2025-01-30T19:57:28.515684+01:00 client nut-monitor[3178]: UPS [[email protected]] was last known to be calibrating and currently is not communicating, assuming dead
2025-01-30T19:57:28.515770+01:00 client nut-monitor[3178]: Executing automatic power-fail shutdown
2025-01-30T19:57:28.520256+01:00 client nut-monitor[2709826]: Network UPS Tools upsmon 2.8.1
2025-01-30T19:57:28.520422+01:00 client nut-monitor[3178]: Auto logout and shutdown proceeding
2025-01-30T19:57:28.524341+01:00 client nut-monitor[2709831]: Network UPS Tools upsmon 2.8.1
2025-01-30T19:57:33.521944+01:00 client nut-monitor[3178]: Network UPS Tools upsmon 2.8.1
2025-01-30T19:57:33.540056+01:00 client shutdown[2709872]: Shutdown scheduled for Thu 2025-01-30 19:57:33 CET, use 'shutdown -c' to cancel.
2025-01-30T19:57:33.541551+01:00 client nut-monitor[3147]: Network UPS Tools upsmon 2.8.1
2025-01-30T19:57:33.543370+01:00 client systemd[1]: nut-monitor.service: Deactivated successfully.
2025-01-30T19:57:33.543530+01:00 client systemd[1]: nut-monitor.service: Consumed 3min 45.465s CPU time, 4.5M memory peak, 556.0K memory swap peak.

So it seems the netclient thought the UPS was still calibrating when the netserver restarted its NUT systemd service. Once communication dropped, it assumed the UPS was dead and shut down immediately.

Of course, I did not check what NUT thought the state of the UPS was before the libraries were upgraded, but afterwards I could see that both the netclient and the netserver correctly reported ups.status: OL.

Yesterday and today a bunch of libraries were upgraded again and NUT was restarted on the netserver, and this time it did not trigger a shutdown (I checked this time, and the status was still OL).

Now... is this a bug where the netclient somehow did not get the memo that the UPS was no longer calibrating, or is something misconfigured?


jimklimov · 2025-02-02T07:57:40Z

Interesting situation. My guess would be that it works as designed, at least.

There are things known to you and unknown to the servers, such as that the service restart was "intentional". I have little idea how to even propagate the concept (do services know that they are being restarted and will return in a second or a minute?)

Perhaps upsmon could wait a cycle or two on connection loss (maybe tied to knowledge that the driver or its data server program is going down gracefully, and that power was not yanked from their machine or network gear; something that could be developed e.g. around driver.state, which is already reported for the curious). But in the power-event case, the safe default assumption is the pessimistic one: that we have seconds to live. At least it keeps the data safe and filesystems consistent, as much as we can ensure.

Note that knowing the UPS was calibrated does not necessarily allow us to discount the on-battery situation as safe. I have seen UPSes drop power to the load because their earlier estimate was too optimistic and calibration claimed the battery was still 20-30% full. (Batteries do degrade over time; the last calibration could have been too long ago, and your battery capacity and/or the fed load could have changed considerably in between.) Regular calibration with the same load probably reduces this possibility, but still...

electrofloat · 2025-02-02T09:43:36Z

I think the main issue here is that the netclient thought the UPS was still calibrating, as stated by this log line:
2025-01-30T19:57:28.515684+01:00 client nut-monitor[3178]: UPS [[email protected]] was last known to be calibrating and currently is not communicating, assuming dead

But it was not. The last calibration was on 2025-01-29 and the service restart happened on 2025-01-30, so one day later.
Also, the calibration takes only 10 seconds.

So the UPS was in a plain OL state when the service restart happened on the server (I know this from physically looking at it, but as I said, I did not check the netserver to be sure it also thought the state was OL). Since the client thought the UPS was in ST_CAL, it shut down the machine.

So either the server thought the UPS was still in ST_CAL and that state propagated to the netclient, or the server correctly thought it was in the OL state but that state change somehow did not reach the client for more than a full day.

jimklimov · 2025-02-04T13:21:51Z

Well... either a new calibration kicked in right during those seconds, or... From the messages above, I assume you are running a NUT v2.8.1 release build? Looking at https://github.com/networkupstools/nut/blob/v2.8.1/clients/upsmon.c#L2111-L2170 it might be that the ST_CAL flag was only raised and never un-set (no clearflag mentions it). This was fixed in v2.8.2 by the introduction of ups_is_notcal() and similar methods for other states: https://github.com/networkupstools/nut/blob/v2.8.2/clients/upsmon.c#L1234-L1242
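A minimal sketch of how a "raised but never un-set" flag would behave, compared with a variant that clears it. This is not copied from upsmon.c: the flag values, the `ups_sketch_t` struct, the `has_token()` helper and both `update_*()` functions are hypothetical, and only the names ST_CAL, setflag and clearflag echo the ones mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative flag bits; the real values in upsmon may differ. */
#define ST_ONLINE (1 << 0)
#define ST_CAL    (1 << 1)

typedef struct { int status; } ups_sketch_t;  /* bitmask of ST_* flags */

static void setflag(int *val, int flag)   { assert(val != NULL); *val |= flag; }
static void clearflag(int *val, int flag) { assert(val != NULL); *val &= ~flag; }
static bool flag_isset(int val, int flag) { return (val & flag) == flag; }

/* True if the space-separated status string contains exactly this token. */
static bool has_token(const char *status, const char *tok)
{
    size_t n = strlen(tok);
    const char *p = status;
    while (*p) {
        while (*p == ' ') p++;
        const char *end = p;
        while (*end && *end != ' ') end++;
        if ((size_t)(end - p) == n && strncmp(p, tok, n) == 0)
            return true;
        p = end;
    }
    return false;
}

/* Suspected buggy pattern: flags are raised when seen, never cleared. */
static void update_sticky(ups_sketch_t *ups, const char *status)
{
    if (has_token(status, "OL"))  setflag(&ups->status, ST_ONLINE);
    if (has_token(status, "CAL")) setflag(&ups->status, ST_CAL);
}

/* Fixed pattern: every flag is cleared when its token is absent. */
static void update_fixed(ups_sketch_t *ups, const char *status)
{
    if (has_token(status, "OL"))  setflag(&ups->status, ST_ONLINE);
    else                          clearflag(&ups->status, ST_ONLINE);
    if (has_token(status, "CAL")) setflag(&ups->status, ST_CAL);
    else                          clearflag(&ups->status, ST_CAL);
}
```

Feeding "OL CAL" and then "OL" through `update_sticky()` leaves ST_CAL set indefinitely, which would match a client that still believes in a calibration a day after it ended; `update_fixed()` drops the flag on the second update.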

Can you try a newer package or ideally a custom build per https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests to check that the current codebase actually does not have this practical buggy use-case, please?

electrofloat · 2025-02-04T13:37:26Z

It was not calibrating at that moment, that is for sure.
And yes, I am using version 2.8.1-3.1ubuntu2, which is the official version in Ubuntu Noble (24.04.1), so 2.8.1 with modifications from Debian/Ubuntu.

I'll wait for the next calibration first and check the statuses on both server and client to see what they are reporting.

Then I'll maybe recompile Ubuntu's version with the function from https://github.com/networkupstools/nut/blob/v2.8.2/clients/upsmon.c#L1234-L1242 and the call from https://github.com/networkupstools/nut/blob/v2.8.2/clients/upsmon.c#L2181-L2182 added, and see if that helps.

jimklimov · 2025-02-04T15:47:30Z

Well, IMHO monkey-patching sources like that may be a bit risky: it is just too easy to miss something. If you do go that route, use git blame (or the GitHub UI) to track down the commits and whole PRs that delivered the change, to reduce that particular risk. The changes may still have relied on some other work, possibly in other source files, that would not be in your patched history and codebase.

It may be more fruitful to use a source tarball from 2.8.2 (or generate one from the current master branch with make dist) and adjust the packaging recipes to use it whole instead of the 2.8.1 tarball.

The "in-place" builds as detailed on the Wiki should overlay much of the packaged installation, especially a recent one that reports its configure flags. The main focus is on functional replacement (same config files and runtime user/group names), though, so some "unreferenced" files might remain: e.g. if drivers are newly installed into standard dirs by default, the older packaged /lib/nut/ ones remain too, but nobody would call them anymore.

jimklimov added impacts-release-2.8.1 Issues reported against NUT release 2.8.1 (maybe vanilla or with minor packaging tweaks) and removed impacts-release-2.8.2 Issues reported against NUT release 2.8.2 (maybe vanilla or with minor packaging tweaks) labels Feb 4, 2025

jimklimov added this to the 2.8.2 milestone Feb 4, 2025
