Unable to avoid unhealthy backend / 502s on rolling deployments #1718
Comments
May also be related to #1656. |
The
As soon as the |
/kind support A clarification question. Are you seeing that the requests are going to the pod that has been deleted when you get 502s or are seeing that the requests are going to the new pod? I am wondering if the issue you are facing is that the old pod on termination has not been removed from the NEG yet, so there is a latency between the pod being removed in Kubernetes and the pod being removed from the NEG. |
I can see that the 502 responses are interspersed with valid responses from the new pod, so I think you are right -- the load balancer 502 errors are because the load balancer is still sending requests to the pod that has terminated in k8s. I thought the issue was with my setting of |
Forcing the stopping container to stick around for a few extra seconds via |
Even The new pod did have the error message I originally mentioned above:
in which the "cloud.google.com/load-balancer-neg-ready" condition appears to be marked True before it should be. So, to summarize, it seems there are actually two problems contributing to 502 errors during rolling deployments:
|
@rocketraman, while these pod changes are occurring, are there any node or zone changes occurring as well? |
@swetharepakula Yes, this cluster is an autopilot cluster. When I deploy this update, the deployment generally requires a new node to be created to host the new pod. |
In addition, usually a few minutes later the pod is migrated back to the original node, and the new node is shutdown. That process rarely completes without some 502s as well. |
@rocketraman, is that node being created in a new zone? Is the autopilot cluster a regional cluster? |
@swetharepakula I haven't checked specifically on the zone of the new node. It is a regional cluster (my understanding is that all autopilot clusters are regional clusters), so I presume autopilot is indeed balancing the new node across zones. |
Yes, I just confirmed the new node is in a different zone than the existing node that hosts that pod. |
So I believe the following is what is happening:
The NEG controller does respond immediately to endpoint changes. However, there can be latency due to the time it takes for a detach operation to complete. Until the detach operation completes, the load balancer may still route traffic to the terminating pod. There are a few options to mitigate this:
Since you are already doing (1), that is probably the easiest approach.
This is a race between the Ingress controller and a workload being scheduled/started on a new node. The NEG controller sees the update before the Ingress controller and adds the endpoint. However, if this is on a node in a new zone, the NEG controller creates a new NEG in the new zone. The Ingress controller then needs to attach that NEG to the backend service. If the Ingress controller doesn't finish that before the workloads are scheduled on the node, those new pods will have their readiness gates switched to ready immediately, since the NEG will not be in any BackendService. For non-Autopilot clusters, our recommendation is to reduce the number of zone changes and try to run workloads in every zone the cluster is in. Since this is an Autopilot cluster, your options may be limited in this regard. At this time we are still looking into how to make the experience better for both of these cases. |
FYI I actually don't think this works -- at least it didn't when I tried it. I actually wouldn't expect it to:
That is too bad, but it does suggest a work-around. Scale up the deployment so that there is at least one pod per zone. That does raise system cost significantly when only one pod is required to meet load/redundancy requirements. Will this become the tracking issue for improving this behavior, or can you reference me the issue I should follow? |
The workaround I suggested with For now we will keep this issue open to communicate updates on this front. There isn't another issue open to track this work. |
I'm also getting 502 errors during availability zone changes. I added the Are there drawbacks to this workaround? Any reason it wasn't suggested in this thread before? |
I think you missed this important message which shouldn't be occurring
This can occur if the pod has been attached to a NEG, but the backend service isn't yet linked to this NEG. The result is that the old pod is detached and you are left with a wonderful backend service without any NEG, and therefore 502 errors. I also added minReadySeconds as a safety margin, to wait a bit before removing the old pod, but it isn't a 100% guaranteed success, especially if something is exceptionally slow at pod startup. As I have two backend services linked to my service, I often encounter this issue 😕. |
@swetharepakula What are the recommended steps to reliably recover a system experiencing this issue, separate from the preventative configuration you recommended? Is deleting the service, ingress, and deployment, then reapplying those k8s resources, sufficient? I'm not seeing intermittent 502s but rather persistent ones once the system enters this failure mode. Thankfully this is a development cluster and I have the option to completely destroy/deprovision and reprovision/deploy, but that's obviously not a viable option in a production cluster. |
@talzion12, your approach means you have switched to using Instance Groups, which is not our recommended or default stack. NEGs provide you with a container-native solution, while the instance group solution is susceptible to the double hop problem. NEGs are our recommended approach. @saez0pub, the NEG controller and the Ingress controller operate in parallel, so there can be a race when a new NEG is created but has not been added to the backend service yet. We have released a fix as part of ingress-gce v1.20 that should ensure that NEGs are created sooner (as soon as the node in the new zone is ready), which should hopefully give more time for the Ingress controller to add the NEG to the backend service before the workloads are scheduled in the new zone. @revero-doug, this does not sound like a 502 due to a rolling deployment. Can you expand more on what the symptoms are, and what is occurring in the cluster? Since it is a different kind of issue, can you open a new issue with those details? Thanks! |
@swetharepakula This started happening very frequently in one of our clusters after upgrading to 1.24. The k8s/gce-ingress have not been previously mentioned in this issue -- is it possible this race was introduced between 1.21 - 1.24? The version mapping in the readme has not been updated in some time...is there any other way to determine what gce-ingress version we are using? Or what future k8s version the fix in v1.20 will be tied to? |
Hello, my version is 1.23.14-gke.1800. As I'm using cloud native load balancing and Cluster IP, the NEG is created at each deployment. |
@swetharepakula I understand but I'd rather have slightly worse performance with the double hop than downtime with NEG unless I'm missing some other considerations. |
You may find this GCP documentation helpful, as it describes the problems here and some possible solutions: https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#traffic_does_not_reach_endpoints |
Indeed, but this is not a stop problem, it is a start problem: kube starts to delete the old pod while the GCLB has not yet attached the new pod to the NEG. It takes a random amount of time to fix itself, so 60 seconds is not always sufficient. There is obviously a problem in the ingress management. I'm praying each time I deploy; Kubernetes just brings me downtime 💢. Can't Google estimate the rate of occurrences of LoadBalancerNegWithoutHealthCheck in their GKE? |
Is there any update on when we might see a fix for this problem hitting GCP itself? I'm seeing 502's during rolling upgrades of pods on an autopilot cluster, even though the new pod is fully warmed up (I'm using argo rollouts to enable blue-green deployments). The |
@denismccarthykerry did you try to add a minReadySeconds of 30 seconds to your deployment ? To solve the issue, you need to ensure you are using container native load balancer https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing HTH |
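For reference, minReadySeconds is a field on the Deployment spec. The following is a minimal sketch of where it sits, assuming a hypothetical Deployment named web with a placeholder image and port. The 30-second value is the one suggested above; the rolling update settings are an assumption shown only as a common pairing, not something taken from this thread.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # hypothetical name
spec:
  replicas: 1
  minReadySeconds: 30            # a new pod must stay Ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # assumption: start the new pod before removing the old one
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web                   # hypothetical label
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example.com/web:latest   # placeholder image
          ports:
            - containerPort: 8080         # hypothetical port
```

As other comments in this thread point out, this only delays removal of the old pod and is not a guaranteed fix.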
We have |
Thanks for the reply Guillaume. I've tried minReadySeconds (my services were already annotated with If I knew it was a known bug at least I could rest a little easier that it's not something I'm doing wrong myself... |
@denismccarthykerry Is I'm also a bit surprised that adding |
You're right, that was not it. It did make it appear that the issue was resolved, but only because the health check itself was not configured correctly, so the deployment never routed traffic to the new node - so obviously I didn't get the 502's on switchover. I've deleted the comment so as not to mislead people. Looks like this one is still unresolved... |
@swetharepakula , this issue seems related to a problem we are having with 503s when HPA scales down a service. We get a bunch of pods getting killed, followed by a bunch of 503s. The difference in our configuration is an istio service mesh that injects a sidecar proxy container. From looking at the events and logs, the application container dies almost immediately on the Ultimately the problem looks the same, namely GCLB is still sending traffic to a pod that has been killed. Are there any different recommendations for fixing the issue when a service mesh is in play? Thanks! |
This can potentially be of relevance: It explains what might be happening and how to tackle it. |
While it's a great summary and shows visually what is happening, there is nothing new in it that was not already discussed above in this issue -- even my original post mentions the Also see comment #1718 (comment) and the following comments with a solution for the second problem that can cause 500s related to zones. I think this information should also be added to that article. On a managed platform, I don't think users in general should be configuring application-level resources with things like preStop hooks where the sole reason is to solve a problem with how the platform operates. |
Have you any workarounds for this problem @rocketraman ? It's still occurring for me. The downtime is a matter of a few seconds per deploy, but it is super irritating and is a real concern when considering rolling updates. I'm considering switching to some other technology at this point due to the lack of progress on this issue. E.g. if it was a case that creating a new production autopilot cluster could resolve it I would take the pain to do that, but there's no indication that that is the case. Then again, this does not seem to be a widespread issue affecting all gke clusters using global load balancing, so there must be something specific- somewhere - at the root of this problem. |
Hello. Have you checked that you are using container-native load balancing? FYI, this is not the default if you are using a shared VPC. If not, I would suggest enabling it with the explicit annotation on the service, annotations: cloud.google.com/neg: '{"ingress": true}'. Another thing you can add is a minReadySeconds parameter in your deployment, to give the load balancer and health check time to enable a new pod. |
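As a sketch of where the cloud.google.com/neg annotation mentioned above goes, here is a minimal Service manifest. The Service name, selector label, and ports are hypothetical placeholders; only the annotation key and value come from the comment above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web                      # hypothetical name
  annotations:
    # Requests NEGs for Ingress (container-native load balancing)
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app: web                     # hypothetical label
  ports:
    - port: 80
      targetPort: 8080           # hypothetical container port
```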
Thanks @GuillaumeMorini . I have added the |
I think what the troubleshooting guide fails to mention is that the pre-stop hook is not exactly the right approach. For the load balancer to know that the container is terminating, the container should start failing the readiness probe as soon as the pod termination is triggered (kubernetes/kubernetes#110191), and a pre-stop hook is just wasting time in this regard. Instead, the timeout should happen in the main process. It should listen to SIGTERM, and as soon as it is received, it should start failing the readiness probe, while continuing doing business as usual, including all other probes, until the timeout expires, at which point it should do a graceful shutdown. |
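On the Kubernetes side, this approach mainly needs a readiness endpoint the application can start failing on SIGTERM, plus a termination grace period that outlasts the time the application keeps serving. A rough sketch of the relevant pod template settings follows, assuming a hypothetical /ready endpoint on port 8080. All names and values are illustrative, and the SIGTERM handling itself lives in application code that is not shown here.

```yaml
spec:
  terminationGracePeriodSeconds: 60     # upper bound between SIGTERM and SIGKILL; must exceed the drain window
  containers:
    - name: web                         # hypothetical container
      image: example.com/web:latest     # placeholder image
      readinessProbe:
        httpGet:
          path: /ready                  # hypothetical endpoint; assumed to fail once SIGTERM is received
          port: 8080
        periodSeconds: 5                # how often the kubelet runs the probe
        failureThreshold: 1
```

How quickly the load balancer stops sending traffic still depends on its own health check interval, so the drain window in the application should be chosen with that in mind.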
That's really abusing the normal meaning of SIGTERM, and therefore coupling application code to kubernetes in an uncomfortable way. |
I'm still seeing the 502's unfortunately. Couldn't reproduce them yesterday after I made the change, happening today though when I deploy. Still waiting for some sort of a fix or mitigation for this issue. I'm amazed it doesn't seem to be more prevalent - has anybody who experienced the issue tried recreating their cluster I wonder? There's a bit of work in this for me, but if it were to work I would do it. @IvanUkhov I'll try your approach when I get the time to implement the change in the app - it does couple behaviour to K8S as @rocketraman points out, but I would sell body parts at this point to resolve the problem. |
@denismccarthykerry, it is not really a solution but a mitigation strategy. I don't see as many errors in the application load balancer as before, but I have seen a few still. One observation is that it mainly happens in clusters with minimal replication. I am experimenting with two: one has one replica per deployment, and the other three. And it is always the first one that is troubling. When there are more resources, it is more forgiving, it seems. It could be so due to the sheer number of replicas, or it could have something to do with not having a working node in each region or zone. The errors appear not so much during the rollout itself as during the subsequent rebalancing. |
@IvanUkhov , we see this issue happen when our replica counts scale down. In an effort to prepare for large increases in volume (typically after a mobile app push notification) we pre-scale the deployments by increasing min replicas. Then later we reduce the replicas to their original value. It is on the scale down that we see lots of errors from pods that k8s killed still receiving traffic. What is even more crazy is they ARE failing the readiness probes yet they continue to get traffic. Seems to take about 10s for the traffic to stop going to the killed pods regardless of state of the application container. 🤷 We did add the |
That is what happens after a rollout. It first needs to scale up to get the new pods running somewhere before doing anything with the old ones, and then it scales back down, which is when errors start to happen.
How are they failing the readiness probe if you later say you are only planning to implement that? Do you mean they are failing because they are simply not running? That probe should be failing while the rest should be working as usual for a while until the load balancer catches up. And the 10 seconds you mentioned could be the periodicity of the probe: if the pod shuts down, it can take up to 10 seconds for the rest of the system to realize what happened. That is why the pod should stay alive and continue to respond. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
I asked Google Support about this, and they said it is intended. The related code is at ingress-gce/pkg/neg/readiness/poller.go, line 236 in 1e99441.
🤷‍♂️ |
In case this helps anyone: we believe we have managed to mitigate this issue. We run Autopilot on GKE. We were getting issues where a pod was spinning up for a deployment in a zone where there was no NEG, and it was being put into service before the pod had passed health checks. This caused the 502s.
We made use of topologySpreadConstraints to ensure that we are always running a pod for each web-based deployment in every zone. If there is always a pod in a zone, then there is always a matching NEG, so when new pods start up they don't hit this race condition. Before this, we would regularly be using 2 of the 3 GKE zones and flipping between them.
We are also running some very small pods with a node selector to ensure we always have a node in each zone. (This might be overkill.) This is the YAML we needed to add to each deployment.
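For readers unfamiliar with the field, a zone spread constraint on a Deployment generally looks like the sketch below. It is an illustration of the technique, not necessarily the exact configuration this comment refers to; the app: web label is a hypothetical placeholder and the values are assumptions.

```yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # allow at most one pod of difference between zones
          topologyKey: topology.kubernetes.io/zone  # spread across zones
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web                              # hypothetical label; must match the pod template labels
```

For this to keep a pod in every zone, the Deployment needs at least as many replicas as the cluster has zones, which is the cost trade-off raised earlier in the thread.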
|
I have a GCE ingress in front of an HPA-managed deployment (at this time, with a single replica).
On a rolling deployment, I sometimes run into the backend being marked as unhealthy and resulting 502 errors, usually for about 15-20 seconds.
According to the pod events, the neg-readiness-reflector appears to mark cloud.google.com/load-balancer-neg-ready to True for the pod before it is actually ready:
While in this state, the previous pod terminates, but the load balancer does not route requests to the new pod, resulting in 502s.
I do have the deployment strategy set that should not allow this but I guess the neg being set to Ready is subverting this:
My deployment does also define a readiness probe as can be seen in the events above.
I do also have a health check configuration defined for the backend:
I found this stackoverflow in which the user works around the issue with delaying the pod stop with a sleep on the lifecycle.preStop, but that seems more like a hack than a proper solution to this issue: https://stackoverflow.com/questions/71127572/neg-is-not-attached-to-any-backendservice-with-health-checking.
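For context, the lifecycle.preStop sleep workaround mentioned here usually takes roughly the following shape in a pod template. This is a sketch under assumptions: the container name, image, and durations are hypothetical, and it presumes the image ships a sleep binary.

```yaml
spec:
  terminationGracePeriodSeconds: 60     # must be longer than the preStop sleep
  containers:
    - name: web                         # hypothetical container
      image: example.com/web:latest     # placeholder image
      lifecycle:
        preStop:
          exec:
            # Keeps the container serving for a while before SIGTERM is delivered,
            # giving the load balancer time to detach the endpoint.
            command: ["sleep", "30"]    # illustrative duration
```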