---
layout: :theme/post
title: "DNS name tracking with Network Observability"
description: "Overview of DNS name tracking feature"
tags: network,observability,dns,loki,troubleshooting
authors: [memodi, jpinsonneau]
---

Network Observability has long had a feature that reports DNS latencies and
response codes for DNS resolutions in your Kubernetes cluster. The
`DNSTracking` feature can be enabled in the FlowCollector config as shown
below:

```yaml
spec:
  agent:
    ebpf:
      features:
        - DNSTracking
```

In the most recent 1.11 release, a major enhancement was added to the existing
`DNSTracking` feature: it now reports DNS query names as well, without any
additional FlowCollector configuration.

The current implementation captures DNS latencies, response codes, and query
names from DNS response packets. To understand this better, let's examine the
structure of a standard DNS response packet:

As you may have guessed, the DNS query name is captured from the Question
section of the response packet. Since DNS resolution is the first step of
every network request, this blog demonstrates how having this information can
help you troubleshoot configuration issues or potentially save on egress
costs.

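To make that layout concrete, here is a minimal, hypothetical sketch (not the
netobserv-ebpf-agent's actual code) that extracts the query name from the
Question section of a hex-encoded DNS response in plain bash. The message ID,
flags, and name used here are made up for illustration:

```bash
#!/usr/bin/env bash
# Layout per RFC 1035: a 12-byte header, then the QNAME encoded as
# length-prefixed labels terminated by a zero byte.
hex="abcd81830001000000000000056e67696e78067365727665720373766300"
#    ^ header: id=abcd flags=8183 (NXDOMAIN) qdcount=1 ...
#      question: 05 "nginx" 06 "server" 03 "svc" 00

qname=""
pos=24                            # skip the 12-byte header (24 hex chars)
while :; do
  len=$((16#${hex:pos:2}))        # label length byte
  pos=$((pos + 2))
  [ "$len" -eq 0 ] && break       # a zero length terminates the name
  label=""
  for ((i = 0; i < len; i++)); do
    label+=$(printf "\\x${hex:$((pos + i * 2)):2}")
  done
  qname="${qname}${qname:+.}${label}"
  pos=$((pos + len * 2))
done
echo "$qname"                     # -> nginx.server.svc
```

The same walk works for any question, since every label carries its own
length and the name always ends at the zero byte.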
We're running an OpenShift cluster on AWS with a simple test setup: a `client`
pod making requests to an `nginx` service in a different namespace. The nginx
service runs in the `server` namespace, while the client pod runs in the
`client` namespace.

```bash
while : ; do
  curl nginx.server.svc:80/data/100K > /dev/null 2>&1
  sleep 5
done
```
While the requests to fetch the 100K object do succeed, can you spot the
configuration issue in the curl command above? Let's look at what we see in
the flow logs:


We see several requests failing with the `NXDOMAIN` response code, while the
ones that succeed have the query name `nginx.server.svc.cluster.local`. Since
we configured the short DNS name `nginx.server.svc` in the curl command, the
cluster DNS service tries multiple search paths to find an answer, based on
the `search` directive in `/etc/resolv.conf`:

```bash
$ cat /etc/resolv.conf
search server.svc.cluster.local svc.cluster.local cluster.local us-east-2.compute.internal
nameserver 172.30.0.10
options ndots:5
```

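The stream of NXDOMAINs can be reproduced on paper: `nginx.server.svc` has
only two dots, fewer than `ndots:5`, so the resolver walks the search list
before ever trying the name as-is. A quick sketch of the lookups it generates
(suffix list taken from the `resolv.conf` above):

```bash
# Names the resolver tries, in order, for the short name "nginx.server.svc".
short="nginx.server.svc"
for suffix in server.svc.cluster.local svc.cluster.local cluster.local; do
  echo "${short}.${suffix}"
done
# The first two lookups come back NXDOMAIN; the third,
# nginx.server.svc.cluster.local, is the one that finally resolves.
```

Every request from the client pod therefore costs three DNS round trips
instead of one.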
Short DNS names for cluster services cause high load on the cluster DNS
service, resulting in higher latencies, negative caching, and increased DNS
traffic. This negative impact can be prevented by using a fully qualified
domain name (FQDN) in the requests. After updating the hostname to
`nginx.server.svc.cluster.local.` (note the trailing dot) in the curl
requests, we no longer see any NXDOMAIN responses, and the unnecessary DNS
traffic in our cluster is gone. You can imagine the performance impact if
such a configuration issue propagated to hundreds of services in your
cluster.

The web console also has new Overview panels that show the top 5 most-queried
DNS names:

Note that the `pod` filters are removed in the image above, since DNS traffic
is reported by the coredns `Service`. This visualization can help identify
suspicious domain-name activity in your cluster, and with the table view you
can narrow down which resource such activity is coming from.

While DNS name decoding has great use cases for identifying and
troubleshooting issues, it comes with some caveats that favor performance.
This feature isn't supported with Prometheus as the datastore, since storing
DNS names as metric values could cause high cardinality. That means if you
want to use this feature, you must use Loki as your datastore. Captured DNS
names are truncated at 32 bytes to limit the netobserv-ebpf-agent's memory
utilization.

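As a quick illustration of that cap (a sketch only; the agent's exact
truncation logic may differ, and the long name below is made up), everything
past byte 32 of a long query name is dropped:

```bash
# Hypothetical long query name; at most 32 bytes of it would be stored.
name="very-long-subdomain.example-service.apps.mycluster.example.com"
stored="${name:0:32}"
echo "$stored"        # -> very-long-subdomain.example-serv
echo "${#stored}"     # -> 32
```

Keep this in mind when filtering flows by DNS name: a filter on the full FQDN
of a very long name may not match the stored, truncated value.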
DNS name tracking currently does not support DNS compression pointers, a
space-saving technique defined in
[RFC 1035 section 4.1.4](https://www.rfc-editor.org/rfc/rfc1035.html#section-4.1.4).
While this is a known limitation, it has minimal practical impact, since
compression is rarely used in the Question section where queries are tracked.
Compression pointers are predominantly used in Answer sections to reference
the queried domain name.

In combination with other Network Observability features, such as built-in
alerts for overall network health, DNS name tracking will help you identify
real-world issues faster. Before we wrap up, we'd like to acknowledge <__add
acknowledgements__>.

If you'd like to share feedback or engage with us, feel free to ping us on
[slack](https://cloud-native.slack.com/archives/C08HHHDA9ND) or drop into a
[discussion](https://github.com/orgs/netobserv/discussions).


Thank you for reading!