---
layout: :theme/post
title: "Identifying network flow matrix with NetObserv"
description: "NetObserv provides all the data you need to build your network flow matrix, which helps you build network policies"
tags: flow,matrix,networkpolicy,flowmetrics,metrics,prometheus
authors: [jotak]
---

_Thanks to: Amogh Rameshappa Devapura and Leandro Beretta for reviewing_
The NetObserv eBPF agents can observe all the traffic going through your cluster. They extract all the meaningful metadata, which is then used to represent the network topology. There's everything you need there to understand who's talking to whom in your cluster.

In fact, there is probably _too much_ information, making it potentially hard to navigate, depending on what you're trying to achieve. For instance, to figure out the network policies needed for a given namespace, using only the raw data provided by NetObserv might not be the simplest way. We'd recommend using more advanced features, in particular the _FlowMetrics_ API, which allows you to generate metrics tailored to your needs.
## A use case: NetObserv's own network flow matrix

Let's consider this use case: using NetObserv to understand its own relationships with other components. This assumes you have already installed NetObserv and created the FlowCollector resource with **sampling set to 1**.
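For reference, sampling is configured on the eBPF agent in the `FlowCollector` resource. Here is a minimal sketch of the relevant part (double-check the field names against the FlowCollector API documentation for your version):

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    ebpf:
      # 1 means every flow is captured (no sampling),
      # which is what we want to build an exhaustive matrix.
      sampling: 1
```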
The `FlowMetrics` API does not require Loki; however, it's still recommended for better troubleshooting. Installing Loki for testing purposes is as simple as:

```bash
oc create namespace netobserv
oc apply -f https://raw.githubusercontent.com/netobserv/documents/5410e65b8e05aaabd1244a9524cfedd8ac8c56b5/examples/zero-click-loki/1-storage.yaml -n netobserv
oc apply -f https://raw.githubusercontent.com/netobserv/documents/5410e65b8e05aaabd1244a9524cfedd8ac8c56b5/examples/zero-click-loki/2-loki.yaml -n netobserv
```
### The "out-of-the-box" approach

The first thing you can do is look at the Traffic flows view in the Console plugin, filtering on the namespace you're interested in (here `netobserv`). You can filter just by source and select the "back and forth" option to get the bidirectional traffic.

![Raw flows table](./raw-flows-table.png)

Here we see some nodes talking to `flowlogs-pipeline`, `netobserv-plugin` talking to the Loki gateway, and more. All the information is accessible from there, but not in the most suitable way. It's a flat view, lacking aggregation, with redundancies, and with undesired noise, e.g. source ports (which become destination ports in responses). Moreover, the data is pulled from Loki, which is not great if you want weeks of data. We need something more concise.
The topology view certainly helps: it aggregates some of the data, for instance per owner (workload) instead of per pod.

![Topology](./topology.png)

So you clearly see the relationships between workloads. It's also better at showing long time ranges, being based on Prometheus metrics. However, it might not be the best fit for building our flow matrix, especially a large one, as it can become messy when the observed system is very complex. It also lacks the destination port information, which we may want to include.

Let's look for another approach.
### The FlowMetrics API

The [FlowMetrics API](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowMetric.md) is a very good fit here, because it allows you to shape the metrics that you want. You choose what to aggregate on, what to filter on, what to observe. What do we want here? For every connection:

- The source (client) workloads and namespaces
- The destination (server) workloads, namespaces and ports

That's mostly it.

There's a catch with the destination ports: as mentioned before, when a client connects to a server, the destination port is the server port, but in the server responses it's the other way around: the destination becomes the client. Because NetObserv operates at the L3/L4 level, it doesn't know which is the client and which is the server. So we want to rule out the client ports from the flow matrix, as they are essentially random ports.

One option to solve this is to focus on SYN packets only. When a SYN packet is observed, we can assume the destination port is the server/listening port. Since NetObserv captures the TCP flags, we can simply filter on that in our metric definition; here's the full YAML:
```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: workload-syn-in
  namespace: netobserv
spec:
  type: Counter
  labels:
    - SrcSubnetLabel
    - SrcK8S_Namespace
    - SrcK8S_OwnerName
    - SrcK8S_OwnerType
    - DstSubnetLabel
    - DstK8S_Namespace
    - DstK8S_OwnerName
    - DstK8S_OwnerType
    - DstPort
    - Flags
  flatten: [Flags]
  filters:
    - field: Flags
      value: SYN
  remap:
    SrcK8S_Namespace: from_namespace
    SrcK8S_OwnerName: from_workload
    SrcK8S_OwnerType: from_kind
    SrcSubnetLabel: from_subnet_label
    DstK8S_Namespace: namespace
    DstK8S_OwnerName: workload
    DstK8S_OwnerType: kind
    DstSubnetLabel: subnet_label
    DstPort: port
```
Explanation, part by part:

```yaml
spec:
  type: Counter
```

We're going to count every flow received that satisfies the filters. The other types are `Gauge` and `Histogram`, which are not relevant for this purpose.
In fact, in this case, we don't really care about the metric value: what we care about is the relationship between labels. So it doesn't matter too much which metric type we set here.
```yaml
labels:
  - SrcSubnetLabel
  - SrcK8S_Namespace
  - SrcK8S_OwnerName
  - SrcK8S_OwnerType
  - DstSubnetLabel
  - DstK8S_Namespace
  - DstK8S_OwnerName
  - DstK8S_OwnerType
  - DstPort
  - Flags
```

Labels are what flows are aggregated on. When several flows are recorded with the exact same set of labels, the corresponding metric counter is incremented. If any of those labels differ from the previously recorded flows, a new time series is created, starting at 1.
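To make the aggregation concrete, here is a toy sketch (plain Python, not NetObserv's actual implementation) of how flows with identical label values collapse into a single counter, while any differing label spawns a new time series:

```python
from collections import Counter

# The labels we chose in the FlowMetric (subset shown for brevity).
LABELS = ["SrcK8S_Namespace", "SrcK8S_OwnerName",
          "DstK8S_Namespace", "DstK8S_OwnerName", "DstPort"]

def aggregate(flows):
    """One counter (time series) per distinct combination of label values."""
    counters = Counter()
    for flow in flows:
        key = tuple(flow.get(label) for label in LABELS)
        counters[key] += 1
    return counters

flows = [
    {"SrcK8S_Namespace": "netobserv", "SrcK8S_OwnerName": "netobserv-plugin",
     "DstK8S_Namespace": "netobserv", "DstK8S_OwnerName": "loki-gateway", "DstPort": 8080},
    {"SrcK8S_Namespace": "netobserv", "SrcK8S_OwnerName": "netobserv-plugin",
     "DstK8S_Namespace": "netobserv", "DstK8S_OwnerName": "loki-gateway", "DstPort": 8080},
    {"SrcK8S_Namespace": "netobserv", "SrcK8S_OwnerName": "flowlogs-pipeline",
     "DstK8S_Namespace": "netobserv", "DstK8S_OwnerName": "loki-gateway", "DstPort": 8080},
]
# Two distinct label combinations: the plugin→gateway series counts 2,
# the flowlogs-pipeline→gateway series counts 1.
```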
This list of labels is roughly what we described above, plus the `SrcSubnetLabel` / `DstSubnetLabel`, which will be explained below, and the `Flags` (TCP flags), which we need for flattening and filtering as explained below.

You can read more about all the available fields [here](https://github.com/netobserv/network-observability-operator/blob/main/docs/flows-format.adoc).
```yaml
flatten: [Flags]
filters:
  - field: Flags
    value: SYN
```

Because `Flags` comes as a list of strings, we need to flatten it before we can filter. Without the flatten operation, `Flags` would appear as a list such as `Flags=SYN,ACK,RST`. When flattened, that flow is mapped into three flows (`Flags=SYN`, `Flags=ACK` and `Flags=RST`). The filter operation keeps only the SYN one, which stands for a TCP connection being established between a client and a server.
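The flatten-then-filter semantics can be illustrated with a small sketch (plain Python, a toy illustration rather than NetObserv code):

```python
def flatten_and_filter(flows, field="Flags", keep="SYN"):
    """Expand a list-valued field into one flow per value, then keep
    only the flows whose value matches the filter."""
    flattened = []
    for flow in flows:
        for value in flow[field]:
            # One copy of the flow per flag value.
            flattened.append({**flow, field: value})
    return [f for f in flattened if f[field] == keep]

flows = [{"DstPort": 8080, "Flags": ["SYN", "ACK", "RST"]}]
# The single flow is flattened into three (SYN, ACK, RST);
# only the SYN one survives the filter.
```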
```yaml
remap:
  SrcK8S_Namespace: from_namespace
  SrcK8S_OwnerName: from_workload
  SrcK8S_OwnerType: from_kind
  SrcSubnetLabel: from_subnet_label
  DstK8S_Namespace: namespace
  DstK8S_OwnerName: workload
  DstK8S_OwnerType: kind
  DstSubnetLabel: subnet_label
  DstPort: port
```

Finally, the remapping operation is optional, but it provides syntactic sugar for later manipulating the metric with PromQL, the Prometheus query language. Because the metric we are creating here, named `workload-syn-in`, focuses on the incoming traffic to our namespace of interest, we rename the `Dst*` labels in a more workload-centric fashion, and the `Src*` labels as the opposite side, prefixed with `from_`.

With this config, the query looks like: `netobserv_workload_syn_in{namespace="my-namespace"}`.
We then create another metric, almost identical except for the remapping, focused on the outgoing traffic:

```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: workload-syn-out
  namespace: netobserv
spec:
  type: Counter
  flatten: [Flags]
  labels:
    - SrcSubnetLabel
    - SrcK8S_Namespace
    - SrcK8S_OwnerName
    - SrcK8S_OwnerType
    - DstSubnetLabel
    - DstK8S_Namespace
    - DstK8S_OwnerName
    - DstK8S_OwnerType
    - DstPort
    - Flags
  filters:
    - field: Flags
      value: SYN
  remap:
    DstK8S_Namespace: to_namespace
    DstK8S_OwnerName: to_workload
    DstK8S_OwnerType: to_kind
    DstSubnetLabel: to_subnet_label
    SrcK8S_Namespace: namespace
    SrcK8S_OwnerName: workload
    SrcK8S_OwnerType: kind
    SrcSubnetLabel: subnet_label
    DstPort: port
```
With the PromQL looking like: `netobserv_workload_syn_out{namespace="my-namespace"}`.

You could actually create just one metric here instead of two, without the remapping, but the PromQL is a little less simple to reason about when you have to juggle between Src and Dst fields.
### Viewing the result
189+
190+
To get all the outgoing traffic, open your Prometheus console and run:
191+
192+
```
193+
sum(rate(netobserv_workload_syn_out{ namespace="netobserv"}[1m])) by (workload, kind, port, to_kind, to_namespace, to_subnet_label, to_workload)
194+
```
195+
196+
(Replace "netobserv" with any namespace you're interested in)
![workload-syn-out](./workload-syn-out.png)

We can see which traffic we need to allow:
- `flowlogs-pipeline` talks to `kubernetes` on 443 and to nodes on 6443 (it's the API server), and to the Loki gateway on 8080
- `netobserv-plugin` talks to the Loki gateway on 8080, the Loki frontend on 3100, Thanos on 9091 and the API server on 443.

When in doubt, we can fall back on the regular Traffic flows view of the plugin to analyse the traffic in more depth.
For the incoming traffic:

```
sum(rate(netobserv_workload_syn_in{namespace="netobserv"}[1m])) by (workload, kind, port, from_kind, from_namespace, from_subnet_label, from_workload)
```

![workload-syn-in](./workload-syn-in.png)

Here we immediately notice that many nodes are talking to `flowlogs-pipeline`, which creates some noise, so let's split the query into two: one for the nodes and another for the rest. Another possibility would be to create two different FlowMetrics, one dedicated to the traffic coming from nodes.
For nodes:

```
sum(rate(netobserv_workload_syn_in{namespace="netobserv",from_kind="Node"}[1m])) by (workload, kind, port, from_subnet_label)
```

As you can see, we do _not_ aggregate by `from_workload`, thus removing the noise of which node the traffic originates from: we don't care about that, and knowing that it comes from nodes in general is sufficient.

![workload-syn-in-nodes](./workload-syn-in-nodes.png)
Which leaves us with just two entries, and more specifically two ports:
- 2055: used for collecting netflows from the eBPF agent. The reason it's considered node traffic is that the agent is configured to use the host network.
- 8080: the port that we declared for Kubernetes health probes.

And finally, for non-nodes:

```
sum(rate(netobserv_workload_syn_in{namespace="netobserv",from_kind!="Node"}[1m])) by (workload, kind, port, from_kind, from_namespace, from_subnet_label, from_workload)
```

![workload-syn-in-no-nodes](./workload-syn-in-no-nodes.png)

Here we just see the OpenShift Console calling our plugin on port 9001, and OpenShift Monitoring fetching our metrics on 9401 (for `flowlogs-pipeline`) and 9002 (for our plugin).
### What about external traffic?

We will cover in more detail, in another blog post, how to identify external traffic in NetObserv metrics.

In short, when the traffic goes to a public / cluster-external IP, NetObserv doesn't know what it is, so you will see mostly empty fields in the related metrics:

![empty-fields](./empty-fields.png)

... that is, unless you help NetObserv understand what it is.

Of course, that means you need to know which external workloads or services your workloads are talking to. We can figure that out by going to the Traffic flows tab of the Console plugin.

Setting filters with the desired Source Namespace, and the Destination Kind set to an empty double-quoted string (`""`), shows what we want. Additionally, we can change the visible columns to show the Destination IP (click on "Show advanced options" then "Manage columns").

![external-ips](./investigate-ips.png)

There are IPs such as 3.5.205.175. A `whois` lookup shows that Amazon S3 is behind it.
Let's reconfigure our `FlowCollector` with the Amazon S3 IP ranges, so that NetObserv is aware of it.

```yaml
spec:
  processor:
    subnetLabels:
      openShiftAutoDetect: true
      customLabels:
        - cidrs:
            - 16.12.20.0/24
            - 52.95.156.0/24
            - 3.5.204.0/22
            - 52.95.154.0/23
            - 16.12.18.0/23
            - 3.5.224.0/22
            - 13.36.84.48/28
            - 13.36.84.64/28
          name: "AWS_S3_eu-west-3"
```

Now, back in the metrics, you will see this name appearing under the label `to_subnet_label` (or `from_subnet_label`).

![aws-s3](./aws-s3.png)
If you have more undetermined traffic, rinse and repeat until you identify everything; then you should have all the required pieces for your network policy.
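As an illustration of where this leads, here is a sketch of an ingress rule derived from one of the findings above (the Console calling the plugin on port 9001). The policy name and pod label selectors are hypothetical examples to adapt to your actual workloads:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-console-to-plugin   # hypothetical name
  namespace: netobserv
spec:
  podSelector:
    matchLabels:
      app: netobserv-plugin       # adapt to your pod labels
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: openshift-console
      ports:
        - protocol: TCP
          port: 9001
```

One such rule per line of the matrix, and the policy is complete.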
## Summary and additional notes

With this use case and the FlowMetrics API, we've been able to identify precisely, and in a concise way, which workloads we are talking to. We clearly identify both the ingress and egress traffic, which helps create a network policy. Some additional aspects to take into account:

- After having created the `FlowMetrics`, you should probably restart the pods that you want to monitor, so that they re-establish all their connections. This is especially needed if they use long-standing connections, where the SYN packets that we monitor aren't going to be sent again.
- We focused here on TCP connections. You can monitor UDP in a similar way, except that you won't have the SYN trick for removing the noise of source ports. The `Proto` field holds the L4 protocol and can be used for filtering ([the UDP number is 17](https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers)).
- When you create a network policy for OVN, you can use the Network Events feature, on a TechPreview OpenShift cluster, to troubleshoot traffic allowed or denied by network policies. [This previous post](https://netobserv.io/posts/monitoring-ovn-networking-events-using-network-observability/) tells you more about it.
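For the UDP case mentioned above, a sketch of the corresponding filter could look like this, assuming the same FlowMetric structure as before (the `Proto` field name is per the flows-format documentation; check the FlowMetric API docs for the exact filter syntax in your version):

```yaml
  filters:
    - field: Proto
      value: "17"   # 17 = UDP; no flatten/SYN trick needed here
```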