
Pod restarts without clear reason (timeout when doing GET to configmap) #3126

Closed
gals-ma opened this issue Mar 27, 2023 · 20 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

@gals-ma

gals-ma commented Mar 27, 2023

Describe the bug
aws-lb-controller restarts unexpectedly (it has happened multiple times already) when doing a GET to the leader-election configmap.

Steps to reproduce
Unknown

Expected outcome
Retry mechanism

Environment

  • AWS Load Balancer controller version 2.4.0
  • Using EKS: yes, version 1.22

Additional Context:
Pod logs before the restart:

E0327 15:21:45.005515       1 leaderelection.go:325] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Get "https://172.10.0.1:443/api/v1/namespaces/kube-system/configmaps/aws-load-balancer-controller-leader": context deadline exceeded
I0327 15:21:45.005561       1 leaderelection.go:278] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition
{"level":"error","ts":1679930505.0055985,"logger":"setup","msg":"problem running manager","error":"leader election lost"}
{"level":"info","ts":1679930505.005635,"logger":"controller.service","msg":"Shutdown signal received, waiting for all workers to finish"}
@oliviassss
Collaborator

@gals-ma, can you provide more info on this error? Before it occurred, were there any upgrades, deletions, or anything else? Can you provide more logs from before the error lines, so we can better understand the situation? You can also send the logs to k8s-alb-controller-triage AT amazon.com

@gals-ma
Author

gals-ma commented Mar 30, 2023

@gals-ma, can you provide more info on this error? Before it occurred, were there any upgrades, deletions, or anything else? Can you provide more logs from before the error lines, so we can better understand the situation? You can also send the logs to k8s-alb-controller-triage AT amazon.com

@oliviassss nothing specific happened at the same time; it happens to us from time to time.
I also saw this log:
leaderelection.go:278] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition

In addition, is there any reason to have 2 replicas of the lb-controller? Isn't it a problem with quorum when electing a leader?

@gals-ma
Author

gals-ma commented Mar 30, 2023

@gals-ma, can you provide more info on this error? Before it occurred, were there any upgrades, deletions, or anything else? Can you provide more logs from before the error lines, so we can better understand the situation? You can also send the logs to k8s-alb-controller-triage AT amazon.com

Can you also please share more information about this error?
What exactly does it mean?
Is there a way to increase the timeout of the leader-election check?

@kishorj
Collaborator

kishorj commented Apr 5, 2023

@gals-ma, the two replicas are in active-standby mode. The issue is not with the controller itself; the API server is not responding to the controller's requests. It could be due either to network connectivity issues between the controller and the API server, or to security group permissions preventing access. Does your controller recover eventually?
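
For context on what "active-standby" means here: both replicas run the same leader-election loop against a single lock object in the API server, so there is no quorum among replicas; only the replica holding the lock does any work, and a replica that cannot renew its lease in time exits so the standby can take over. Below is a minimal client-go sketch of that pattern (illustrative only, not the controller's actual code; the lock name is taken from the logs above, the durations are commonly used defaults, and a Lease lock is shown even though the older controller version in the original report used a ConfigMap-based lock):

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // each replica identifies itself, e.g. by pod name

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "aws-load-balancer-controller-leader", // lock name from the logs above
			Namespace: "kube-system",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how stale the lock may get before a standby takes over
		RenewDeadline: 10 * time.Second, // the active replica must renew within this window
		RetryPeriod:   2 * time.Second,  // how often acquire/renew is attempted
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the elected replica reaches this point and runs the reconcilers.
			},
			OnStoppedLeading: func() {
				// Renewal failed (e.g. API-server timeouts, as in the logs above);
				// the process exits and the kubelet restarts the pod.
				os.Exit(1)
			},
		},
	})
}
```

If the API server is unreachable for longer than the renew deadline, OnStoppedLeading fires and the pod restarts, which matches the "leader election lost" lines above.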

@gals-ma
Author

gals-ma commented Apr 10, 2023

@kishorj @oliviassss So after talking with AWS, it was found that the issue was actually due to etcd being defragmented, and the load-balancer-controller timing out while reaching the etcd server.

So my questions are:

  1. why does the LB controller need to contact the etcd server?
  2. Is there a way to increase the timeout (or add a retry mechanism) to avoid the restarts?

Thanks again.
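
Regarding the timeouts described above: one way to see how close API-server latency gets to the leader-election renew window is to time the same kind of request the controller's election loop makes. A minimal client-go sketch (the configmap name comes from the logs in this thread; the 10-second threshold mirrors a typical renew deadline and is an assumption):

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		start := time.Now()
		// The same kind of small GET the election loop performs against the lock object.
		_, err := client.CoreV1().ConfigMaps("kube-system").
			Get(context.TODO(), "aws-load-balancer-controller-leader", metav1.GetOptions{})
		elapsed := time.Since(start)
		fmt.Printf("lock GET took %v (err: %v)\n", elapsed, err)
		if elapsed > 10*time.Second { // assumed renew deadline; adjust to your settings
			fmt.Println("WARNING: API-server latency is in the range that loses leader election")
		}
		time.Sleep(5 * time.Second)
	}
}
```

If etcd defragmentation stalls API-server responses past the renew window, the election loop gives up, which is what produces the restarts described above.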

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 9, 2023
@lqhl

lqhl commented Sep 13, 2023

I also encountered this issue. The pod gets restarted by k8s, but I'm not sure what is affected.
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2024
@mengqiy
Member

mengqiy commented Feb 16, 2024

why does the LB controller need to contact the etcd server?

Every k8s controller that uses leader election relies on the API server to elect a leader and renew its lease. The API server uses etcd as its backing store.

Is there a way to increase the timeout (or add a retry mechanism) to avoid the restarts?

The ALB controller uses controller-runtime, which supports setting the lease duration and retry period.

It's expected to see a restart when the leader loses its lease.

Related discussion: kubernetes-sigs/controller-runtime#1774 (comment)
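
For reference, a rough sketch of where those knobs live when building a manager with controller-runtime (the values below are illustrative, longer than the common 15s/10s/2s defaults, and this is not the ALB controller's actual setup code):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Illustrative values: longer windows tolerate brief API-server or etcd stalls,
	// at the cost of slower failover to the standby replica.
	leaseDuration := 60 * time.Second
	renewDeadline := 40 * time.Second // must be shorter than leaseDuration
	retryPeriod := 10 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "aws-load-balancer-controller-leader",
		LeaderElectionNamespace: "kube-system",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

Whether these options are exposed as flags or Helm values depends on the controller version, so check the chart before assuming they can be tuned.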

@mengqiy
Member

mengqiy commented Feb 16, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2024
@nileshgadgi

I'm also facing this issue with the ALB controller; here are the logs I found before the restart. (Thanks to pod-restart-info-collector)

2024-05-21T10:31:59 E0521 10:31:59       1 leaderelection.go:330] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Get "https://172.20.0.1:443/api/v1/namespaces/kube-system/configmaps/aws-load-balancer-controller-leader": context deadline exceeded
2024-05-21T10:31:59 I0521 10:31:59       1 leaderelection.go:283] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition
2024-05-21T10:31:59 {"level":"error","ts":"2024-05-19T10:31:59Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

Could someone help with this?

@chahin-healthhelper

chahin-healthhelper commented May 22, 2024

I'm also facing this issue with the ALB controller; here are the logs I found before the restart. (Thanks to pod-restart-info-collector)

2024-05-21T10:31:59 E0521 10:31:59       1 leaderelection.go:330] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Get "https://172.20.0.1:443/api/v1/namespaces/kube-system/configmaps/aws-load-balancer-controller-leader": context deadline exceeded
2024-05-21T10:31:59 I0521 10:31:59       1 leaderelection.go:283] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition
2024-05-21T10:31:59 {"level":"error","ts":"2024-05-19T10:31:59Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

Could someone help with this?

Same here! Any update on this?

My log, BTW:

2024-05-22 14:26:59.832	{"level":"error","ts":"2024-05-22T13:26:59Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
2024-05-22 14:26:59.830	I0522 13:26:59.830195       1 leaderelection.go:283] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition
2024-05-22 14:26:59.829	E0522 13:26:59.829397       1 leaderelection.go:367] Failed to update lock: Put "https://10.100.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/aws-load-balancer-controller-leader": context deadline exceeded
2024-05-22 14:26:56.847	E0522 13:26:56.846970       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out

@nileshgadgi

I'm also facing this issue with the ALB controller; here are the logs I found before the restart. (Thanks to pod-restart-info-collector)

2024-05-21T10:31:59 E0521 10:31:59       1 leaderelection.go:330] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Get "https://172.20.0.1:443/api/v1/namespaces/kube-system/configmaps/aws-load-balancer-controller-leader": context deadline exceeded
2024-05-21T10:31:59 I0521 10:31:59       1 leaderelection.go:283] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition
2024-05-21T10:31:59 {"level":"error","ts":"2024-05-19T10:31:59Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

Could someone help with this?

@oliviassss can you help us with this issue? If there is anything we have to do or configure in AWS EKS, please suggest it. Thanks in advance!

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 10, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Aug 9, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hong539

hong539 commented Sep 13, 2024

Any Updates?

@nileshgadgi

@hong539, I've tried several approaches, but it's still unclear to me what caused the pod to restart. Please consider reopening the ticket, as the issue is still unresolved.

@David-Crty

David-Crty commented Oct 26, 2024

@mengqiy the linked discussion mentions LeaseDuration and RenewDeadline, but I don't see any way to change those values in the Helm chart or in the app.

Could #3835 solve this?

We are currently facing many restarts per day. Should we be worried about anything?
