ETCD continuously printing "transport is closing" in the logs , and etcdctl commands failing with context deadline exceeded (request timeout error) #17438
Comments
@ahrtr
Please provide the statefulset yaml.
Please provide the exact sequence of commands you run in order to create the issue once the statefulset has been applied to a cluster. @rahulbapumore please do not open multiple issues; this is already an active support discussion under #17394. Please remember, community support is provided by volunteers on a best-efforts basis and is not guaranteed. If you need more hands-on or timely support for etcd, you may be best served by engaging with a software vendor. If you would like this bug to be accepted, you need to give us a way to recreate the issue you are seeing; otherwise it will be closed as a duplicate of the support discussion.
Hi @ahrtr @jmhbnz
Then etcd goes into such a bad state that it's not able to recover at all; we tried everything but had no luck.
And this issue is always reproducible.
Thanks for providing the sequence. Can you please provide the statefulset output as yaml so I can attempt to recreate it in my own cluster.
Hi @jmhbnz, we have also observed the same issue when only a single replica is kept for a long time and this issue suddenly happens: all etcdctl commands stop working and the cluster is unrecoverable.
@ahrtr |
I'll try to repro this over the course of today; can you please give us the full config/helm chart for repro? The pod logs would also be helpful. It also seems odd that you would configure the service for peer url advertisement.
You can see from your continuous member listing that it first resolves the DNS to an IP that doesn't exist anymore, and then it does not resolve any IP at all. Which makes me believe your pod is not actually picked up by the service anymore.
/assign @tjungblu Thanks
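One way to check whether the pod is in fact still backing the Service, as suspected above, might be the following; this is a sketch, with the Service name `dced`, pod name `dced-0`, and namespace `namespace1` taken from the environment dump later in this report, not confirmed against the actual cluster:

```
# Does the client Service still list the pod as an endpoint?
kubectl -n namespace1 get endpoints dced -o wide

# Does the DNS name used by etcdctl (ETCDCTL_ENDPOINTS=dced.namespace1:2379)
# still resolve from inside the pod?
kubectl -n namespace1 exec dced-0 -- nslookup dced.namespace1
```

If the endpoints list is empty while the pod is Running, the pod has most likely failed its readiness probe and been dropped from the Service, which would explain the stale/missing DNS resolution described above.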
Hi @tjungblu,
@tjungblu
In short, this kind of state is not recoverable at all.
@tjungblu
Sorry @rahulbapumore, but this is far from being reproducible for me. What are all of those pieces?
Please give us a minimal reproducible example with actual released etcd images and tools from this repository.
Hi @tjungblu, for pod-0 it does the following ->
For pod-1 and pod-2, it does ->
entrypoint.sh is the startup script for the dced container, where we keep the etcd process running for all 3 of the pods.
Hi @tjungblu, "transport is closing" is a very generic message, which can appear after any network disturbance, or if etcd is not able to handle connections properly due to a large number of requests.
Okay, I believe we're conflating many different issues here now.
That's the purpose of etcd: you can only guarantee correct writes when you have quorum. It's not surprising that your PUT fails if the majority of your configured cluster is down. I also think that you can't just easily switch from a clustered etcd to a single-node setup by simply removing members; you would need to reconfigure the environment to match as well. If you look at other examples of how etcd is run as a StatefulSet, they seem to handle those scenarios more gracefully:
What is that bad / unrecoverable state? Looking at your PVC configs, I see
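The point above, that shrinking a cluster takes both membership changes and matching environment changes, could be sketched roughly as follows. The pod names follow the report's StatefulSet; the member IDs are placeholders:

```
# 1. Remove the members that are going away, running against a surviving member:
etcdctl member list -w table
etcdctl member remove <member-id-of-dced-1>
etcdctl member remove <member-id-of-dced-2>

# 2. Separately, reconfigure the surviving pod's environment so that it
#    describes a 1-node cluster: ETCD_INITIAL_CLUSTER should list only dced-0,
#    and the StatefulSet replica count, probes, and advertised URLs must agree.
```

Removing members alone leaves the startup environment still describing a 3-node cluster, which is one plausible way to end up in an inconsistent state on the next restart.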
The bad state is the same one for which I have given the reproduction steps ->
Hi @tjungblu Thanks |
Hi @tjungblu |
Hi @ahrtr |
I am facing the same issue. Could it be that the number of Pods per node limit has been reached?
Bug report criteria
What happened?
We have etcd 3.5.7 deployed inside a container of a pod managed by a StatefulSet controller, with only 1 replica.
The deployment runs for some days with key/values inserted constantly, and it works fine for a few days. But when the StatefulSet is scaled down to zero and then scaled back up to 1, and some etcdctl commands are entered while the pod is coming up, before etcd has completely started, that is the trigger point: all etcdctl commands stop working and etcd goes into such a bad state that it cannot be recovered unless the db file is deleted.
Firstly, etcd gives the latest balancer error below:
{"attempt":0,"caller":"[email protected]/retry_interceptor.go:62","error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.107.143.247:2379: connect: connection refused\"","logger":"etcd-client","message":"retrying of unary invoker failed","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","target":"etcd-endpoints://0xc00002c000/dced.namespace1:2379","timestamp":"2024-02-16T11:19:41.244+00:00","version":"1.2.0"}
So etcd will be up and running, but it will give "context deadline exceeded" (request timeout) errors, and it keeps printing the log line below:
{"message":"WARNING: 2024/02/16 16:04:22 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T16:04:22.567+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 16:11:46 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T16:11:46.177+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 16:24:42 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T16:24:42.662+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 16:34:52 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T16:34:52.710+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 16:41:46 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T16:41:46.176+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 16:46:46 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T16:46:46.178+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 16:51:46 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T16:51:46.177+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 16:55:12 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T16:55:12.803+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 16:56:46 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T16:56:46.178+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 17:00:17 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T17:00:17.825+00:00","version":"1.2.0"}
{"message":"WARNING: 2024/02/16 17:01:46 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"dced-0"},"service_id":"dced","severity":"warning","timestamp":"2024-02-16T17:01:46.177+00:00","version":"1.2.0"}
logs.txt
What did you expect to happen?
etcd should not have gone into such a bad state that recovery required deleting the db file and losing all data.
How can we reproduce it (as minimally and precisely as possible)?
We have etcd 3.5.7 deployed inside a container of a pod managed by a StatefulSet controller, with only 1 replica.
The deployment runs for some days with key/values inserted constantly. Then we suddenly take the pod down by scaling the StatefulSet down and scaling it back up, and before etcd gets configured, we run some etcdctl commands, which is the trigger point according to us.
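A minimal sketch of the sequence above, assuming the StatefulSet is named `dced` in namespace `namespace1` (names taken from the environment dump below); the exact etcdctl commands issued during startup are not specified in the report, so the ones shown are examples:

```
# Scale the single-replica StatefulSet down and back up:
kubectl -n namespace1 scale statefulset dced --replicas=0
kubectl -n namespace1 scale statefulset dced --replicas=1

# While the pod is still starting, before etcd is fully up,
# issue client commands against it -- this is the reported trigger:
kubectl -n namespace1 exec dced-0 -- etcdctl endpoint status -w table
kubectl -n namespace1 exec dced-0 -- etcdctl put somekey somevalue
```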
Anything else we need to know?
No response
Etcd version (please run commands below)
bash-4.4$ etcd --version
etcd Version: 3.5.7
Git SHA: 215b53c
Go Version: go1.17.13
Go OS/Arch: linux/amd64
bash-4.4$ etcdctl version
etcdctl version: 3.5.7
API version: 3.5
bash-4.4$
Etcd configuration (command line flags or environment variables)
bash-4.4$ clear
bash-4.4$ env
BOOTSTRAP_ENABLED=false
SIP_SERVICE_PORT_HTTP_METRIC_TLS=8889
VALID_PARAMETERS=valid
ETCD_INITIAL_CLUSTER_TOKEN=dced
TLS_ENABLED=true
ETCD_MAX_SNAPSHOTS=3
CLIENT_PORTS=2379
SIP_SERVICE_PORT=8889
TZ=UTC
HOSTNAME=dced-0
SIP_PORT_8889_TCP_PORT=8889
COMPONENT_VERSION=v3.5.7
HTTP_PROBE_CMD_DIR=/usr/local/bin/health
HTTP_PROBE_READINESS_CMD_TIMEOUT_SEC=15
ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379
ETCD_HEARTBEAT_INTERVAL=100
ETCD_AUTO_COMPACTION_RETENTION=100
DISARM_ALARM_PEER_INTERVAL=6
NAMESPACE=namespace1
ETCD_TRUSTED_CA_FILE=/data/combinedca/cacertbundle.pem
DB_THRESHOLD_PERCENTAGE=70
MONITOR_ALARM_INTERVAL=5
PEER_CERT_AUTH_ENABLED=true
KMS_SERVICE_HOST=10.107.175.120
SIP_PORT_8889_TCP_PROTO=tcp
KMS_PORT_8200_TCP_PROTO=tcp
TRUSTED_CA=/data/combinedca/cacertbundle.pem
PEER_CLIENTS_CERTS=/run/sec/certs/peer/srvcert.pem
FIFO_DIR=/fifo
KUBERNETES_PORT_443_TCP_PROTO=tcp
ENTRYPOINT_RESTART_ETCD=true
HTTP_PROBE_NAMESPACE=namespace1
KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1
ETCDCTL_CERT=/run/sec/certs/client/clicert.pem
ENTRYPOINT_DCED_PROCESS_INTERVAL=5
DEFRAGMENT_ENABLE=true
DCED_SERVICE_HOST=10.107.143.247
ETCD_LOG_LEVEL=info
ENTRYPOINT_CHECKSNUMBER=60
SIP_PORT=tcp://10.111.183.137:8889
KUBERNETES_PORT=tcp://10.96.0.1:443
POD_NAME=dced-0
DCED_SERVICE_PORT=2379
SIP_SERVICE_HOST=10.111.183.137
PWD=/
ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380
HOME=/home/dced
DCED_SERVICE_PORT_CLIENT_PORT_TLS=2379
ETCD_AUTO_COMPACTION_MODE=revision
KUBERNETES_SERVICE_PORT_HTTPS=443
DCED_PORT_2379_TCP_ADDR=10.107.143.247
KUBERNETES_PORT_443_TCP_PORT=443
ETCD_LOGGER=zap
PEER_AUTO_TLS_ENABLED=true
KMS_SERVICE_PORT_HTTPS_KMS=8200
ETCD_CERT_FILE=/run/sec/certs/server/srvcert.pem
ETCD_PEER_AUTO_TLS=true
DCED_PORT_2379_TCP_PORT=2379
KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443
DCED_PORT_2379_TCP=tcp://10.107.143.247:2379
LISTEN_PEER_URLS=https://0.0.0.0:2380
DEFRAGMENT_PERIODIC_INTERVAL=60
CONTAINER_NAME=dced
COMPONENT=etcd
ETCD_DATA_DIR=/data
ETCD_CLIENT_CERT_AUTH=true
TERM=xterm
KMS_PORT=tcp://10.107.175.120:8200
ETCDCTL_ENDPOINTS=dced.namespace1:2379
HTTP_PROBE_LIVENESS_CMD_TIMEOUT_SEC=15
ETCD_METRICS=basic
PEER_CLIENT_KEY_FILE=/run/sec/certs/peer/srvprivkey.pem
HTTP_PROBE_CONTAINER_NAME=dced
SIP_PORT_8889_TCP_ADDR=10.111.183.137
GODEBUG=tls13=1
ETCDCTL_API=3
DCED_PORT=tcp://10.107.143.247:2379
ETCD_SNAPSHOT_COUNT=5000
ETCD_MAX_WALS=3
SHLVL=1
KMS_PORT_8200_TCP_ADDR=10.107.175.120
HTTP_PROBE_POD_NAME=dced-0
KUBERNETES_SERVICE_PORT=443
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://dced-0.dced-peer.namespace1.svc.cluster.local:2380
HTTP_PROBE_STARTUP_CMD_TIMEOUT_SEC=15
ETCD_KEY_FILE=/run/sec/certs/server/srvprivkey.pem
ETCD_ELECTION_TIMEOUT=1000
HTTP_PROBE_SERVICE_NAME=dced
ETCDCTL_CACERT=/data/combinedca/cacertbundle.pem
ETCD_NAME=dced-0
ETCD_QUOTA_BACKEND_BYTES=268435456
SIP_PORT_8889_TCP=tcp://10.111.183.137:8889
ENTRYPOINT_PIPE_TIMEOUT=5
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ETCD_ADVERTISE_CLIENT_URLS=https://dced-0.dced.namespace1:2379
DCED_PORT=2379
KMS_SERVICE_PORT=8200
KUBERNETES_SERVICE_HOST=10.96.0.1
FLAVOUR=etcd-v3.5.7-linux-amd64
KMS_PORT_8200_TCP=tcp://10.107.175.120:8200
DCED_PORT_2379_TCP_PROTO=tcp
KMS_PORT_8200_TCP_PORT=8200
ETCDCTL_KEY=/run/sec/certs/client/cliprivkey.pem
_=/usr/bin/env
bash-4.4$
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
bash-4.4$ etcdctl member list -w table
+------------------+---------+----------------------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+----------------------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+------------+
| 7928e6047223afac | started | eric-data-distributed-coordinator-ed-0 | https://eric-data-distributed-coordinator-ed-0.eric-data-distributed-coordinator-ed-peer.zmorrah1.svc.cluster.local:2380 | https://eric-data-distributed-coordinator-ed-0.eric-data-distributed-coordinator-ed.zmorrah1:2379 | false |
+------------------+---------+----------------------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+------------+
bash-4.4$ etcdctl endpoint status -w table
+----------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| eric-data-distributed-coordinator-ed.zmorrah1:2379 | 7928e6047223afac | 3.5.7 | 2.6 MB | true | false | 3 | 231 | 231 | |
+----------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
bash-4.4$
Relevant log output
No response