NLB in instance mode + Nginx Fabric Gateway API target group health. #4065

Open
bozho opened this issue Feb 21, 2025 · 1 comment

bozho commented Feb 21, 2025

Hi all,

I'm not sure if this is a bug or I'm doing something wrong here...

Our k8s v1.31 cluster setup is fairly simple: three worker nodes running some services, with Nginx Fabric as a reverse proxy and a few routes set up for those services. aws-load-balancer-controller v2.11.0 is installed using Helm and configured as an NLB in instance mode.
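For reference, this is roughly how the gateway's Service is exposed — a minimal sketch assuming the usual aws-load-balancer-controller annotations; the name, namespace, selector and ports below are placeholders, not our actual manifests:

```yaml
# Minimal sketch of the gateway Service, NLB in instance mode.
# Name/namespace/selector/ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: nginx-gateway
  namespace: nginx-gateway
  annotations:
    # Hand the Service to aws-load-balancer-controller (provisions an NLB)
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    # Instance mode: register the worker node EC2 instances, not pod IPs
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "instance"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: nginx-gateway
  ports:
    - name: https
      port: 443
      targetPort: 443
```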

This all works, including cert-manager provisioning Let's Encrypt certs and external-dns taking care of our service FQDNs by setting CNAME records pointing to the AWS load balancer's FQDN.

Nginx Fabric is installed with practically default settings (we enable Gateway API experimental features) and spins up a single pod.

The problem I'm having is that the AWS load balancer's target groups include all three worker nodes' EC2 instances, but only the one on which the nginx pod is actually running shows as healthy. The target groups run the default checks: HTTP on path /healthz, port 30632.

Each EC2 instance has two private IPv4 addresses, both in the same subnet, e.g. 10.0.6.47 and 10.0.6.195. When I log into each node and run curl http://<private IP addr>:30632/healthz, I get the following results: when a node connects to its own IPs, both IPs work; when a node connects to another node, only one of the two IPs works (and it's the same IP for node X regardless of whether node Y or Z is connecting). The "successful" IP is the one the instance's private DNS name resolves to (e.g. ip-10-0-6-47.eu-west-1.compute.internal), i.e. the address on the network interface with index 0.

netstat on all three nodes shows kube-proxy listening on port 30632:

tcp6 0  0 :::30632    :::*     LISTEN   2125/kube-proxy

There are no custom firewall rules on the nodes, and all AWS security groups are at their defaults.

It looks like the target group health checks connect to only one of the two private IPs, failing for some nodes and succeeding for others.


bozho commented Feb 21, 2025

And to answer my own question: it's because nginx fabric gateway (like nginx ingress controller) sets service.externalTrafficPolicy to Local in order to preserve source IPs; as a consequence, requests are not routed to nodes that aren't running nginx gateway/ingress pods.

Details here.

So, it looks like we have two options here:

  1. Run nginx gateway with the Cluster external traffic policy, which makes our lives easier but masks source IPs. In this case, the health checks are TCP checks performed on the traffic ports.
  2. Run nginx gateway with the Local external traffic policy if we need to preserve source IPs. In that case, we would need the AWS load balancer to only select nodes running nginx gateway pods for the target groups (see the sketch after this list). In this case, the health checks are HTTP checks on the health check port and the /healthz path.
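
For option 2, something along these lines might work — a sketch only, assuming the controller's service.beta.kubernetes.io/aws-load-balancer-target-node-labels annotation and a node label (role=edge here, made up) that we would apply ourselves to the nodes running the gateway pods:

```yaml
# Sketch for option 2 — Local traffic policy plus node selection.
# The role=edge label is hypothetical; we'd label the gateway nodes ourselves.
apiVersion: v1
kind: Service
metadata:
  name: nginx-gateway
  namespace: nginx-gateway
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "instance"
    # Only register nodes carrying this label in the target groups
    service.beta.kubernetes.io/aws-load-balancer-target-node-labels: "role=edge"
spec:
  type: LoadBalancer
  # Preserve client source IPs; the NLB health check then becomes an HTTP
  # check against /healthz on spec.healthCheckNodePort, which only succeeds
  # on nodes actually running a gateway pod
  externalTrafficPolicy: Local
```

Option 1 would be the same Service with externalTrafficPolicy: Cluster and without the target-node-labels annotation.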

Are there other options here?
