Random intermittent SERVFAILs #1225
Comments
It is unusual to see 'misc failure'; it should have more details, and a log at verbosity 4 would capture them. The error about maximum nameserver nxdomains means that the delegations for nameservers point to addresses that do not exist (the 'nxdomain'). There is a stop counter for this, and it is being hit. It could also be that the nameserver address lookups are failing for other reasons. One reason that could come and go within a minute is packet drop, so perhaps there are very large amounts of packet drops. During that time unbound fails to look up names. These failures cause nameserver lookups to fail, and when those exceed the counter, the query fails. Packet drop can also cause the query itself to fail, and that would be printed as a timeout. Another possibility that could disappear within a couple of minutes is a failure such as buffer trouble in the system, which could cause all system calls to fail. Lookups then fail, but not with a timeout, so the misc failure fits. If the system is out of memory or out of buffer allowance, I guess that could also look like these errors. More information would come from a highly verbose log, at verbosity 4; maybe also set logfile to capture it in its own file. But it looks like it may be packet drop, route drop by the next hop router, or buffer allowance, with the result that most packets cannot be sent and return an error.
Thank you for the detailed reply. I don't think the issue is network packet drop or the system running out of memory, as my connection is very stable and system memory is only 10% utilized. I'm running OPNsense, by the way. Buffer allowance is a possibility I guess, though I haven't changed the default (which I assume is designed to be reasonable). This is a home network with approximately 40 devices, a combination of user devices and IoT. One potential clue, though I don't know if it is useful: after restarting unbound, I don't get any SERVFAIL messages for approximately 24 hours. The messages only start appearing after 24 hours. This makes me think it could perhaps be a caching issue? In the statistics, my Cache Misses is equal to my Recursive Replies, which I interpret to mean that all recursive requests are getting a reply (which is good). And my Request Queue Exceeded counter is 0 (also good). I will increase the log verbosity as you suggested (currently set at the default 1) and report back. Thanks!
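For anyone who wants to repeat that statistics check outside the web UI, here is a minimal sketch. It assumes the counters come from the `key=value` text printed by `unbound-control stats_noreset`, and that the OPNsense labels "Cache Misses", "Recursive Replies" and "Request Queue Exceeded" correspond to `total.num.cachemiss`, `total.num.recursivereplies` and `total.requestlist.exceeded`; that mapping is an assumption. The function names and the sample numbers are made up for illustration.

```python
def parse_stats(text):
    """Parse 'key=value' lines into a dict of floats; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue
        try:
            stats[key] = float(value)
        except ValueError:
            pass  # ignore values that are not numeric
    return stats


def check_stats(stats):
    """Compare cache misses with recursive replies and report queue overflow."""
    misses = stats.get("total.num.cachemiss", 0)
    replies = stats.get("total.num.recursivereplies", 0)
    exceeded = stats.get("total.requestlist.exceeded", 0)
    return {"unanswered": misses - replies, "queue_exceeded": exceeded}


# Invented sample values, only to show the shape of the input.
SAMPLE = """\
total.num.queries=1000
total.num.cachehits=800
total.num.cachemiss=200
total.num.recursivereplies=200
total.requestlist.exceeded=0
"""

report = check_stats(parse_stats(SAMPLE))
print(report)
```

Note that a SERVFAIL answer is still a reply, so equal counters probably cannot rule out failed lookups on their own.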
I guess it is not those causes then, if it is linked to the 24h. That is the default max cache lifetime for unbound, unless it was changed with the max-ttl settings? If so, a number of items could time out at around that time and need a lookup again. I wonder why the second lookup would fail. One reason could be that serve-expired or a very large min ttl is used, while the cloud instances are moved to other IPs after 24h, causing lookup misses; those names look like they are from cloud nameservers. But perhaps that is not it. Another is that unbound prefers child-side information: in the first lookups it found child-side information for the nameservers, and after 24h it tries to update that information with lookups to those child-side zones, but they do not work. That would explain why the first lookups did work. But this kind of misconfiguration, e.g. child-side nameservers mostly not answering until the fail counter is hit, also seems a bit unlikely, at least at that scale. Perhaps it is qname minimisation related: the second lookup is qname minimised differently, and the cloud nameservers possibly have an issue with that.
Thank you once again. I do have Serve-Expired enabled with default parameter values, including the max cache lifetime and the minimum TTL. I also have Prefetch Support, Prefetch DNS Key, and Harden DNSSEC Data enabled. QNAME minimisation is enabled but Strict QNAME is disabled. I previously had Aggressive NSEC enabled as well, but I've now disabled it because I've read that it can cause similar SERVFAIL issues. It's too early to tell if that worked since it hasn't been 24 hours. If it doesn't work, I guess the next step will be to entirely disable QNAME Minimisation.
@cinergi2 Not sure if this info helps you (please understand I know zero about zero). I had the exact same issue with about 10% of my websites randomly not loading. I'm using mvance's Unbound-RPI Docker image on a Raspberry Pi with PiHole/Unbound. I started getting SERVFAIL errors in PiHole itself. I went through all the options in unbound.conf until I got to "use-caps-for-id:", which was enabled. After I disabled that, things started loading much faster, and there were no more SERVFAIL errors.
Thanks, this is already off. Exactly 24 hours after restarting unbound, the SERVFAIL errors returned. Disabling the Harden DNSSEC Data option didn't help. @wcawijngaards is there a way to simulate a cache timeout so that I don't have to wait 24 hours after each troubleshooting step to check if it worked? Or do I just set the Maximum TTL for RRsets and messages to a low value for testing, like a few minutes?
OK, I set the maximum cache TTL to 600 seconds (10 minutes) and I was able to capture several SERVFAILs in the log at verbosity 4. However, even for an 8-minute interval, the log is huge (500 MB) and contains unrelated queries that succeed, so I can't post it here. Is there anything I could filter out to make it smaller? I don't see anything obvious that explains the SERVFAIL, but the log is so detailed that I could easily be missing something, especially since I'm by no means an Unbound expert...
The log is fairly straightforward to read: it prints the query that it is working on. The parts of the log about the query in question are the pertinent ones.
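To cut a large verbosity 4 log down to the parts about one query, a small filter like the following might help. It is only a sketch: it assumes the relevant lines contain the query name as text, so lines that belong to the same query but do not repeat the name are dropped. The function name `filter_log` is made up for illustration, and the sample lines are copied from the error log in this issue.

```python
def filter_log(lines, names):
    """Yield only the log lines that mention one of the given query names."""
    # Normalise to lowercase, fully qualified form: "fs.microsoft.com."
    wanted = [n.lower().rstrip(".") + "." for n in names]
    for line in lines:
        lower = line.lower()
        if any(name in lower for name in wanted):
            yield line


# Sample lines taken from the SERVFAIL log posted in this issue.
SAMPLE = [
    "2025-01-18T12:34:27-05:00 Error unbound [93701:1] error: SERVFAIL <login.live.com. AAAA IN>: misc failure",
    "2025-01-18T12:33:00-05:00 Error unbound [93701:1] error: SERVFAIL <fs.microsoft.com. A IN>: misc failure",
    "2025-01-18T12:33:00-05:00 Error unbound [93701:1] error: SERVFAIL <fs.microsoft.com. AAAA IN>: misc failure",
]

kept = list(filter_log(SAMPLE, ["fs.microsoft.com"]))
for line in kept:
    print(line)
```

Because `filter_log` accepts any iterable of lines, an open file object can be passed in directly, so a 500 MB log is streamed line by line rather than loaded into memory.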
Just chiming in to say I've been dealing with these same intermittent |
The misc failure is printed because the result was servfail, but no other details are provided by the code. So there is a code path that ends in servfail but does not add information to print for it. The detailed logs for the same query, which show what it is working on, can reveal what is actually wrong. The code could then be changed to print a nicer error message. To find the problem, log at a higher verbosity, like 4 or 5, and look at the parts that are about the query in question. Unbound logs both the attempts where it sends a query to an upstream and the reply from that upstream.
Just wanted to provide an update here to say that I've completely eliminated my Obviously, your mileage may vary, but in my case at least, it seems many of the domains I frequent had issues with stale cache. |
Thank you, I will try this as well and report back. |
The setting for The default for this option has changed in the code repository, that has the 1800 value as default, it is suggested by RFC 8767. |
Hello,
I regularly get SERVFAILs in the log as follows (typical example from today):
2025-01-18T13:19:18-05:00 Error unbound [93701:2] error: SERVFAIL <v10.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T13:12:20-05:00 Error unbound [93701:2] error: SERVFAIL <ocws.officeapps.live.com. AAAA IN>: misc failure
2025-01-18T13:12:20-05:00 Error unbound [93701:3] error: SERVFAIL <ocws.officeapps.live.com. A IN>: misc failure
2025-01-18T13:12:20-05:00 Error unbound [93701:0] error: SERVFAIL <ocws.officeapps.live.com. AAAA IN>: failed to get a delegation (eg. prime failure)
2025-01-18T13:03:31-05:00 Error unbound [93701:3] error: SERVFAIL <nrdp-ipv6.prod.ftl.netflix.com. A IN>: misc failure
2025-01-18T13:03:04-05:00 Error unbound [93701:0] error: SERVFAIL <v10.events.data.microsoft.com. A IN>: misc failure
2025-01-18T12:53:16-05:00 Error unbound [93701:2] error: SERVFAIL <v20.events.data.microsoft.com. A IN>: misc failure
2025-01-18T12:53:16-05:00 Error unbound [93701:1] error: SERVFAIL <v20.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:49:19-05:00 Error unbound [93701:3] error: SERVFAIL <mobile.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:49:18-05:00 Error unbound [93701:0] error: SERVFAIL <mobile.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:49:18-05:00 Error unbound [93701:1] error: SERVFAIL <mobile.events.data.microsoft.com. A IN>: misc failure
2025-01-18T12:48:07-05:00 Error unbound [93701:0] error: SERVFAIL <download.windowsupdate.com. AAAA IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:48:07-05:00 Error unbound [93701:2] error: SERVFAIL <download.windowsupdate.com. AAAA IN>: misc failure
2025-01-18T12:36:38-05:00 Error unbound [93701:2] error: SERVFAIL <mobile.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:34:27-05:00 Error unbound [93701:1] error: SERVFAIL <login.live.com. AAAA IN>: misc failure
2025-01-18T12:34:27-05:00 Error unbound [93701:0] error: SERVFAIL <login.live.com. AAAA IN>: misc failure
2025-01-18T12:33:39-05:00 Error unbound [93701:3] error: SERVFAIL <msedge.b.tlu.dl.delivery.mp.microsoft.com. A IN>: misc failure
2025-01-18T12:33:39-05:00 Error unbound [93701:2] error: SERVFAIL <msedge.b.tlu.dl.delivery.mp.microsoft.com. A IN>: misc failure
2025-01-18T12:33:03-05:00 Error unbound [93701:1] error: SERVFAIL <8-courier.push.apple.com. AAAA IN>: misc failure
2025-01-18T12:33:00-05:00 Error unbound [93701:2] error: SERVFAIL <displaycatalog.mp.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:33:00-05:00 Error unbound [93701:1] error: SERVFAIL <fs.microsoft.com. A IN>: misc failure
2025-01-18T12:33:00-05:00 Error unbound [93701:1] error: SERVFAIL <fs.microsoft.com. AAAA IN>: misc failure
2025-01-18T12:33:00-05:00 Error unbound [93701:3] error: SERVFAIL <fs.microsoft.com. AAAA IN>: misc failure
2025-01-18T12:18:31-05:00 Error unbound [93701:2] error: SERVFAIL <nrdp-ipv6.prod.ftl.netflix.com. A IN>: exceeded the maximum nameserver nxdomains
2025-01-18T12:18:31-05:00 Error unbound [93701:0] error: SERVFAIL <nrdp-ipv6.prod.ftl.netflix.com. A IN>: exceeded the maximum nameserver nxdomains
The same domains resolve fine seconds or minutes later, so it's intermittent. Any ideas how to resolve this?
Thanks
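To see which failure reasons and which names dominate in a log like the one above, the lines can be tallied with a short script. This is a sketch that assumes every line has the `error: SERVFAIL <name type class>: reason` shape shown in this issue; the function name `tally` is made up for illustration, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches e.g. "error: SERVFAIL <fs.microsoft.com. A IN>: misc failure"
LINE_RE = re.compile(
    r"error: SERVFAIL <(?P<name>[^>\s]+) (?P<qtype>[^>\s]+) (?P<qclass>[^>\s]+)>: (?P<reason>.+)$"
)


def tally(lines):
    """Count SERVFAIL log lines by failure reason and by query name."""
    by_reason = Counter()
    by_name = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match:
            by_reason[match["reason"]] += 1
            by_name[match["name"]] += 1
    return by_reason, by_name


# Sample lines taken from the log in this issue.
SAMPLE = [
    "2025-01-18T13:19:18-05:00 Error unbound [93701:2] error: SERVFAIL <v10.events.data.microsoft.com. A IN>: exceeded the maximum nameserver nxdomains",
    "2025-01-18T13:12:20-05:00 Error unbound [93701:2] error: SERVFAIL <ocws.officeapps.live.com. AAAA IN>: misc failure",
    "2025-01-18T13:12:20-05:00 Error unbound [93701:3] error: SERVFAIL <ocws.officeapps.live.com. A IN>: misc failure",
    "2025-01-18T13:12:20-05:00 Error unbound [93701:0] error: SERVFAIL <ocws.officeapps.live.com. AAAA IN>: failed to get a delegation (eg. prime failure)",
]

by_reason, by_name = tally(SAMPLE)
for reason, count in by_reason.most_common():
    print(count, reason)
for name, count in by_name.most_common():
    print(count, name)
```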