Random intermittent SERVFAILs #1225

cinergi2 · 2025-01-18T18:29:06Z

Hello,

I regularly get SERVFAILs in the log as follows (typical example from today):

2025-01-18T13:19:18-05:00 Error unbound [93701:2] error: SERVFAIL <v10.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T13:12:20-05:00 Error unbound [93701:2] error: SERVFAIL <ocws.officeapps.live.com. AAAA IN>: misc failure
2025-01-18T13:12:20-05:00 Error unbound [93701:3] error: SERVFAIL <ocws.officeapps.live.com. A IN>: misc failure
2025-01-18T13:12:20-05:00 Error unbound [93701:0] error: SERVFAIL <ocws.officeapps.live.com. AAAA IN>: failed to get a delegation (eg. prime failure)
2025-01-18T13:03:31-05:00 Error unbound [93701:3] error: SERVFAIL <nrdp-ipv6.prod.ftl.netflix.com. A IN>: misc failure
2025-01-18T13:03:04-05:00 Error unbound [93701:0] error: SERVFAIL <v10.events.data.microsoft.com. A IN>: misc failure
2025-01-18T12:53:16-05:00 Error unbound [93701:2] error: SERVFAIL <v20.events.data.microsoft.com. A IN>: misc failure
2025-01-18T12:53:16-05:00 Error unbound [93701:1] error: SERVFAIL <v20.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:49:19-05:00 Error unbound [93701:3] error: SERVFAIL <mobile.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:49:18-05:00 Error unbound [93701:0] error: SERVFAIL <mobile.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:49:18-05:00 Error unbound [93701:1] error: SERVFAIL <mobile.events.data.microsoft.com. A IN>: misc failure
2025-01-18T12:48:07-05:00 Error unbound [93701:0] error: SERVFAIL <download.windowsupdate.com. AAAA IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:48:07-05:00 Error unbound [93701:2] error: SERVFAIL <download.windowsupdate.com. AAAA IN>: misc failure
2025-01-18T12:36:38-05:00 Error unbound [93701:2] error: SERVFAIL <mobile.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:34:27-05:00 Error unbound [93701:1] error: SERVFAIL <login.live.com. AAAA IN>: misc failure
2025-01-18T12:34:27-05:00 Error unbound [93701:0] error: SERVFAIL <login.live.com. AAAA IN>: misc failure
2025-01-18T12:33:39-05:00 Error unbound [93701:3] error: SERVFAIL <msedge.b.tlu.dl.delivery.mp.microsoft.com. A IN>: misc failure
2025-01-18T12:33:39-05:00 Error unbound [93701:2] error: SERVFAIL <msedge.b.tlu.dl.delivery.mp.microsoft.com. A IN>: misc failure
2025-01-18T12:33:03-05:00 Error unbound [93701:1] error: SERVFAIL <8-courier.push.apple.com. AAAA IN>: misc failure
2025-01-18T12:33:00-05:00 Error unbound [93701:2] error: SERVFAIL <displaycatalog.mp.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:33:00-05:00 Error unbound [93701:1] error: SERVFAIL <fs.microsoft.com. A IN>: misc failure
2025-01-18T12:33:00-05:00 Error unbound [93701:1] error: SERVFAIL <fs.microsoft.com. AAAA IN>: misc failure
2025-01-18T12:33:00-05:00 Error unbound [93701:3] error: SERVFAIL <fs.microsoft.com. AAAA IN>: misc failure
2025-01-18T12:18:31-05:00 Error unbound [93701:2] error: SERVFAIL <nrdp-ipv6.prod.ftl.netflix.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:18:31-05:00 Error unbound [93701:0] error: SERVFAIL <nrdp-ipv6.prod.ftl.netflix.com. A IN>: exceeded the maximum nameserver nxdomains

The same domains resolve fine seconds or minutes later, so it's intermittent. Any ideas how to resolve this?

Thanks

wcawijngaards · 2025-01-20T08:41:08Z

It is unusual to see 'misc failure'; it should have more details. Logs at verbosity 4 would capture those details.

The error about maximum nameserver nxdomains means that the delegations for nameservers are going to addresses that do not exist (the 'nxdomain'). There is a stop counter for this, and it is being hit. It could also be that the nameserver address lookups are failing for other reasons. One cause that could change within a minute is packet drop, so perhaps there are very large amounts of packet drops, and during that time Unbound fails its lookups. These failures cause nameserver lookups to fail, and when they exceed the counter, the query fails. Packet drop can also cause the query itself to fail, obviously, and that would be printed as a timeout.

Another thing that could happen, and be gone in a couple of minutes, is a failure such as buffer trouble in the system. That could cause system calls to all fail, and thus lookups to fail, but without a timeout, so the 'misc failure' fits. If the system is out of memory or out of buffer allowance, something like that could also look like these errors, I guess.

More information would come from a highly verbose log, at verbosity 4 for example, perhaps with logfile set to capture it in its own file. But it looks like it may be packet drop, route drop by the next-hop router, or buffer allowance, with the result that most packets cannot be sent and fail with an error.
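
For reference, a minimal unbound.conf sketch of the logging setup suggested above. The log file path is only an assumed example, and on systems that generate the configuration from a GUI these options would normally be set there rather than by editing the file directly:

```conf
server:
    # Detailed per-query debug output (the default is 1).
    verbosity: 4
    # Assumed example path; pick one the unbound user can write to.
    logfile: "/var/log/unbound-debug.log"
    # Set explicitly so that output goes to the logfile rather than syslog.
    use-syslog: no
    # Print human-readable timestamps instead of seconds since 1970.
    log-time-ascii: yes
```

Verbosity 4 produces a very large log, so it is probably best enabled only while reproducing the problem.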

cinergi2 · 2025-01-20T15:03:46Z

Thank you for the detailed reply. I don't think the issue is network packet drop or system out of memory, as my connection is very stable and system memory is only 10% utilized. Running OPNsense, by the way. Buffer allowance is a possibility I guess, though I haven't changed the default (which I assume is designed to be reasonable). This is a home network with approximately 40 devices, a combination of user devices and IOT.

One potential clue, I don't know if this is useful: After restarting unbound, I don't get any SERVFAIL messages for approximately 24 hours. The messages only start appearing after 24 hours. This makes me think it could perhaps be a caching issue?

In the statistics, my Cache Misses is equal to my Recursive Replies which I interpret to mean that all recursive requests are getting a reply (which is good). And my Request Queue Exceeded counter is 0 (also good).

I will increase the log verbosity as you suggested (currently set at the default 1) and report back.

Thanks!

wcawijngaards · 2025-01-20T15:12:54Z

I guess it is not those causes then. If it is linked to the 24 hours, that is the default maximum cache lifetime for Unbound, unless it was changed with the max-ttl settings. If so, that means a number of items could time out at around that time and need a lookup again. I wonder why the second lookup would fail.

One reason the second lookup could fail is if serve-expired or a very large minimum TTL is in use, while the cloud instances are moved to other IPs after 24 hours, causing lookup misses; those names look like they are served by cloud nameservers. But perhaps that is not it. Another is that Unbound prefers child-side information: in the first lookups it found child-side information about the nameservers, and after 24 hours it tries to update that information with lookups to those child-side zones, but they do not work. That would explain why the first lookups did work. But this kind of misconfiguration, where the child-side information mostly fails to answer until the fail counter is hit, seems a bit unlikely too, at least at that scale. Perhaps it is related to QNAME minimisation: the second lookup is minimised differently, and the cloud nameservers may have an issue with that. Setting qname-minimisation: no turns that off, so Unbound does not perform QNAME minimisation.
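
As a sketch, the QNAME minimisation test mentioned above would look like this in unbound.conf (where the configuration is generated by a GUI, the equivalent setting there would be used instead):

```conf
server:
    # Send the full query name upstream instead of minimised labels.
    qname-minimisation: no
```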

cinergi2 · 2025-01-20T16:09:33Z

Thank you once again. I do have Serve-Expired enabled with default parameter values, including the max cache lifetime and the minimum TTL. I also have Prefetch Support, Prefetch DNS Key, and Harden DNSSEC Data enabled. QNAME Minimisation is enabled but Strict QNAME is disabled. I previously had Aggressive NSEC enabled as well, but I've now disabled it because I've read that it can cause similar SERVFAIL issues. It's too early to tell if that worked since it hasn't been 24 hours. If it doesn't work, I guess the next step will be to entirely disable QNAME Minimisation.
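
Assuming the OPNsense option names map to the unbound.conf options with similar names (this mapping is an assumption and is not confirmed anywhere in this thread), the setup described above would correspond roughly to:

```conf
server:
    serve-expired: yes
    prefetch: yes
    prefetch-key: yes
    # Assumed to be what "Harden DNSSEC Data" controls.
    harden-dnssec-stripped: yes
    qname-minimisation: yes
    qname-minimisation-strict: no
    # Disabled for the current test.
    aggressive-nsec: no
```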

infideler · 2025-01-21T16:14:35Z

@cinergi2 Not sure if this info helps you (please understand I know zero about zero). I had the exact same issue with about 10% of my websites randomly not loading. I'm using mvance's Unbound-RPI Docker image on a Raspberry Pi with PiHole/Unbound, and I started getting SERVFAIL errors in PiHole itself. I went through all the options in unbound.conf until I got to "use-caps-for-id:", which was enabled. After I disabled that, things started loading much faster, with no more SERVFAIL errors.

cinergi2 · 2025-01-21T16:48:58Z

> @cinergi2 Not sure if this info helps you (please understand I know zero about zero). I had the exact same issue with about 10% of my websites randomly not loading. I'm using mvance's Unbound-RPI Docker image on a Raspberry Pi with PiHole/Unbound, and I started getting SERVFAIL errors in PiHole itself. I went through all the options in unbound.conf until I got to "use-caps-for-id:", which was enabled. After I disabled that, things started loading much faster, with no more SERVFAIL errors.

Thanks, this is already off.

Exactly 24 hours after restarting Unbound, the SERVFAIL errors returned. Disabling the Harden DNSSEC Data option didn't help. @wcawijngaards is there a way to simulate a cache timeout so that I don't have to wait 24 hours after each troubleshooting step to check if it worked? Or do I just set the Maximum TTL for RRsets and messages to a low value for testing, like a few minutes?

cinergi2 · 2025-01-21T19:30:17Z

OK, I set the maximum cache TTL to 600 seconds (10 minutes) and I was able to capture several SERVFAILs in the log with log level 4. However, even for an 8-minute interval, the log is huge (500 MB) and contains unrelated queries that succeed, so I can't post it here. Is there anything I could filter out to make it smaller? I don't see anything obvious that explains the SERVFAIL, but the log is so detailed that I could easily be missing something, especially since I'm by no means an Unbound expert...
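
The shortened cache lifetime used for this test would look roughly like the following in unbound.conf. It is meant only for reproducing the problem faster, not as a permanent setting:

```conf
server:
    # Cap cached RRsets and messages at 10 minutes
    # (the default is 86400 seconds, i.e. 24 hours).
    cache-max-ttl: 600
```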

wcawijngaards · 2025-01-22T10:00:37Z

The log is fairly straightforward to read: it prints the query that it is working on. The parts of the log that are about the query in question are the pertinent ones.

kylekrajnyak · 2025-02-13T00:46:39Z

Just chiming in to say I've been dealing with these same intermittent SERVFAIL errors too. Subscribing for updates.

kylekrajnyak · 2025-02-13T15:10:25Z

Here's an example that just happened (which impacts work) where SERVFAILs were occurring for portal.azure.com. I access that domain daily, and most of the time it works, but intermittently this happens:

Which appears like this in Pihole:

I decided to do a DIG which was successful:

Then, as soon as I tried to access it again in the browser, it worked, and showed successful in Pihole:

wcawijngaards · 2025-02-14T10:24:18Z

The 'misc failure' is printed because the result was SERVFAIL but the code recorded no other details. So there is a code path that ends in SERVFAIL without adding any information to print for it. The detailed logs for the same query, which show what it is working on, can reveal what is actually wrong. The code could then be changed to print a nicer error message. To find the problem, log at a higher verbosity, like 4 or 5, and look at the parts that are about the query in question. Unbound logs both the queries it sends to an upstream and the replies from that upstream.

kylekrajnyak · 2025-02-26T00:37:00Z

Just wanted to provide an update here to say that I've completely eliminated my SERVFAIL errors by ensuring that cache-min-ttl: 0 and serve-expired: no. Additionally, I ensured that prefetch: yes.

Obviously, your mileage may vary, but in my case at least, it seems many of the domains I frequent had issues with stale cache.
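
Written out as an unbound.conf fragment, the three settings described above would be as follows (a sketch only; whether it helps likely depends on the domains involved):

```conf
server:
    # Do not hold records longer than the TTL published by the authoritative servers.
    cache-min-ttl: 0
    # Never answer from expired cache entries.
    serve-expired: no
    # Refresh frequently used entries shortly before they expire.
    prefetch: yes
```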

cinergi2 · 2025-02-26T02:15:14Z

> Just wanted to provide an update here to say that I've completely eliminated my SERVFAIL errors by ensuring that cache-min-ttl: 0 and serve-expired: no. Additionally, I ensured that prefetch: yes.
>
> Obviously, your mileage may vary, but in my case at least, it seems many of the domains I frequent had issues with stale cache.

Thank you, I will try this as well and report back.

wcawijngaards · 2025-02-26T12:38:20Z

The setting serve-expired-client-timeout: 1800 could help with serve-expired. When serve-expired is enabled, Unbound then returns up-to-date values if they can be fetched within the timeout (1800 milliseconds), and only after that uses the expired contents. That makes more responses use fresh values, and could possibly both keep serve-expired working and remove stale content as an issue.

The default for this option has been changed in the code repository to 1800, the value suggested by RFC 8767.
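
A minimal sketch of that combination in unbound.conf, for versions where 1800 is not yet the default:

```conf
server:
    serve-expired: yes
    # Wait up to 1800 ms for a fresh answer before falling back to expired data.
    serve-expired-client-timeout: 1800
```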

Dillton mentioned this issue Feb 16, 2025

Frequent error: SERVFAIL exceeded the maximum number of sends #1234

Closed
