---
layout: :theme/post
title: "DNS name tracking with Network Observability"
description: "Overview of the DNS name tracking feature"
tags: network,observability,dns,loki,troubleshooting
authors: [memodi, jpinsonneau]
---

Network Observability has long had a feature that reports DNS latencies and response codes for the DNS resolutions in your Kubernetes cluster. The `DNSTracking` feature can be enabled in the FlowCollector config as shown below:

```yaml
spec:
  agent:
    ebpf:
      features:
        - DNSTracking
```

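If you prefer applying the change from the CLI rather than editing the resource, a merge patch does the same thing. This is a sketch that assumes the default FlowCollector instance name `cluster` and that no other eBPF features are currently enabled, since the patch replaces the whole `features` list:

```shell
# Enable DNSTracking on the default FlowCollector instance (named "cluster").
# Caution: a merge patch replaces the entire spec.agent.ebpf.features list,
# so include any other features you already use.
oc patch flowcollector cluster --type=merge \
  -p '{"spec":{"agent":{"ebpf":{"features":["DNSTracking"]}}}}'
```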
In the most recent 1.11 release, a major enhancement was added to the existing `DNSTracking` feature: it now also reports DNS query names, with no additional FlowCollector configuration required.

The current implementation captures DNS latencies, response codes, and query names from DNS response packets. To understand this better, let's examine the structure of a standard DNS response packet:

![DNS Response Packet](dns-response-packet.svg)

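To make the packet layout concrete, here is a minimal Python sketch that pulls the query name out of the Question section of a DNS response. This is illustrative only: the actual capture happens inside the eBPF agent, not in Python, and the hand-crafted packet below is a hypothetical example.

```python
import struct

def parse_question_name(packet: bytes) -> str:
    """Extract the query name from the Question section of a DNS message.

    The DNS header is a fixed 12 bytes; the Question section follows
    immediately, with the name encoded as length-prefixed labels
    terminated by a zero byte.
    """
    offset = 12  # skip the fixed-size DNS header
    labels = []
    while True:
        length = packet[offset]
        if length == 0:
            break
        if length & 0xC0:
            raise ValueError("compression pointers not handled in this sketch")
        labels.append(packet[offset + 1:offset + 1 + length].decode("ascii"))
        offset += 1 + length
    return ".".join(labels)

# Hand-crafted response: header (ID=1, QR=1 response flags, 1 question),
# then the Question section for "nginx.server.svc", type A, class IN.
header = struct.pack(">HHHHHH", 1, 0x8180, 1, 0, 0, 0)
question = b"\x05nginx\x06server\x03svc\x00" + struct.pack(">HH", 1, 1)
print(parse_question_name(header + question))  # nginx.server.svc
```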
As you may have guessed, the DNS query name is captured from the Question section of the response packet. Since DNS resolution is the first step of virtually every network request, this blog demonstrates how having this information can help you troubleshoot configuration issues or potentially save on egress costs.

"We're running an OpenShift cluster on AWS with a simple test setup: a `client`
38+
pod making requests to an `nginx` service in a different namespace. The nginx
39+
service runs in the `server` namespace, while the client pod runs in the
40+
`client` namespace."
41+
42+
```bash
while : ; do
  curl nginx.server.svc:80/data/100K > /dev/null 2>&1
  sleep 5
done
```

While the requests to fetch the 100K object do succeed, can you spot the configuration issue in the curl command above? Let's look at what we see in the flow logs:

![Queries to nginx server](nginx-queries.png)

We see several requests failing with the `NXDOMAIN` response code, while the ones that succeed have the query name `nginx.server.svc.cluster.local`. Since we configured the short DNS name `nginx.server.svc` in the curl command, the cluster DNS service tries multiple search paths to find an answer, based on the `search` directive in `/etc/resolv.conf`:

```bash
cat /etc/resolv.conf
search server.svc.cluster.local svc.cluster.local cluster.local us-east-2.compute.internal
nameserver 172.30.0.10
options ndots:5
```

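The `search`/`ndots` behavior can be sketched in a few lines of Python. This is a simplified model of a glibc-style stub resolver, not the real thing (a real resolver stops at the first successful answer), using the search list from the resolv.conf above:

```python
def resolution_candidates(name, search_domains, ndots=5):
    """Names a stub resolver tries, per resolv.conf search/ndots rules.

    If the name has fewer than `ndots` dots and no trailing dot, the
    search domains are appended and tried first; the literal name is
    tried as-is only after all of them fail.
    """
    if name.endswith("."):
        return [name]  # fully qualified: exactly one lookup
    if name.count(".") >= ndots:
        return [name + "."] + [f"{name}.{d}." for d in search_domains]
    return [f"{name}.{d}." for d in search_domains] + [name + "."]

search = ["server.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "us-east-2.compute.internal"]
for candidate in resolution_candidates("nginx.server.svc", search):
    print(candidate)
```

The first two candidates produce the `NXDOMAIN` responses we saw in the flow logs; only the third, `nginx.server.svc.cluster.local.`, resolves.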
Short DNS names for cluster services cause high load on the cluster DNS service, resulting in higher latencies, negative caching, and increased DNS traffic. This negative impact can be prevented by using a Fully Qualified Domain Name (FQDN) in the requests. After updating the hostname to `nginx.server.svc.cluster.local.` in the curl requests, we no longer see any `NXDOMAIN` responses and have reduced unnecessary DNS traffic in our cluster. You can imagine the performance impact if such a configuration issue propagated to hundreds of services in your cluster.

![FQDN DNS names](fixed-client-config.png)

The web console also has a new Overview panel showing the top 5 most-queried DNS names:

![Top 5 DNS Names panel](top5-dns-name.png)

Note that `pod` filters are removed in the image above, since the DNS traffic is reported by the coredns `Service`. This visualization can help identify suspicious domain name activity in your cluster, and with the table view you can narrow down which resource such activity is coming from.

87+
While DNS name decoding has great use-cases in identifying and troubleshooting
88+
issues, it comes with some caveats to favor performance. This feature isn't
89+
supported with Prometheus as datastore since storing DNS names as metric values
90+
could cause high cardinality. That means, if you're looking to use this feature
91+
you must use Loki as your datasource. Captured DNS names will be truncated at 32
92+
bytes to balance the memory netobserv-ebpf-agent's memory utilization.
93+
94+
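The 32-byte cap means very long names lose their tail. A trivial illustration of the effect (this is a sketch, not the agent's actual code, and the limit could change between releases):

```python
MAX_CAPTURED_LEN = 32  # truncation limit described above

def truncate_captured_name(name: str) -> str:
    """Sketch: keep only the first 32 bytes of a captured DNS name."""
    return name.encode("utf-8")[:MAX_CAPTURED_LEN].decode("utf-8", "ignore")

# A hypothetical long service hostname, for illustration only.
name = "metrics.long-subdomain.apps.example-cluster.example.com"
print(truncate_captured_name(name))
```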
DNS name tracking currently does not support DNS compression pointers, a space-saving technique defined in [RFC 1035 section 4.1.4](https://www.rfc-editor.org/rfc/rfc1035.html#section-4.1.4). While this is a known limitation, it has minimal practical impact, since compression is rarely used in the Question section, where query names are captured; compression pointers are predominantly used in the Answer section to reference the queried domain name.

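For the curious, here is what following a compression pointer involves: a label-length byte with its top two bits set is not a length but a 14-bit offset back to an earlier occurrence of the name. The Python sketch below is illustrative only and deliberately simplified (a production decoder must also guard against pointer loops), not the agent's implementation:

```python
def decode_name(packet: bytes, offset: int) -> str:
    """Decode a DNS name, following RFC 1035 compression pointers.

    A length byte of the form 0b11xxxxxx starts a 2-byte pointer whose
    low 14 bits give the offset of the real name earlier in the message.
    """
    labels = []
    while True:
        length = packet[offset]
        if length == 0:
            break
        if length & 0xC0 == 0xC0:
            # Pointer: jump to the referenced offset and keep decoding.
            offset = ((length & 0x3F) << 8) | packet[offset + 1]
            continue
        labels.append(packet[offset + 1:offset + 1 + length].decode("ascii"))
        offset += 1 + length
    return ".".join(labels)

# 12 zero bytes stand in for the header; the Question at offset 12 holds
# "example.com", and an Answer-style pointer 0xC0 0x0C refers back to it.
packet = bytes(12) + b"\x07example\x03com\x00" + b"\xc0\x0c"
print(decode_name(packet, 25))  # example.com
```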
In combination with other Network Observability features, such as built-in alerts for overall network health, DNS name tracking will help you identify real-world issues faster. Before we wrap up, we'd like to acknowledge <__add acknowledgements__>.

If you'd like to share feedback or engage with us, feel free to ping us on [slack](https://cloud-native.slack.com/archives/C08HHHDA9ND) or drop in a [discussion](https://github.com/orgs/netobserv/discussions).

Thank you for reading!