etcdserver: read-only range request ... took too long (...) to execute #2692

Open
erichorwath opened this issue Mar 24, 2022 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@erichorwath

erichorwath commented Mar 24, 2022

What happened:
A slow etcd in the kind control plane slows down the entire helm install / kubectl apply.
The etcd logs contain many errors of this type:

2022-03-16 12:20:25.445202 W | etcdserver: read-only range request "key:\"/registry/helm.toolkit.fluxcd.io/helmreleases/istio-system/istio\" " with result "range_response_count:0 size:5" took too long (1.088797682s) to execute
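
To see how often and how badly requests exceed the threshold, the warnings can be pulled out of the etcd pod logs and their durations collected. The following is a minimal sketch, assuming the legacy plain-text log format shown above (newer etcd releases log structured JSON, which this pattern does not handle); the function name `parse_slow_range` is hypothetical.

```python
import re

# Matches the plain-text warning format quoted in this issue.
# Assumption: the duration is printed in seconds ("1.08s") or milliseconds ("150ms").
SLOW_RANGE = re.compile(
    r'read-only range request "(?P<request>.*)" with result "(?P<result>.*)" '
    r'took too long \((?P<duration>[0-9.]+)(?P<unit>ms|s)\) to execute'
)


def parse_slow_range(line):
    """Return (request, seconds) for a slow-range warning, or None if the line does not match."""
    match = SLOW_RANGE.search(line)
    if match is None:
        return None
    seconds = float(match.group("duration"))
    if match.group("unit") == "ms":
        seconds /= 1000.0
    return match.group("request"), seconds
```

Feeding every line of the etcd pod logs through `parse_slow_range` and sorting by the returned duration gives a quick view of which keys are affected most.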

What you expected to happen:
No etcd errors

How to reproduce it (as minimally and precisely as possible):
Start up kind and check the etcd pod logs.

Anything else we need to know?:
We could reproduce this behavior in all our scenarios, which are mainly two:

  • running on my 32GB laptop (in my case WSL2, see below)
  • running inside a Kubernetes cluster, in a Docker-in-Docker pod (see also "document how to run kind in a kubernetes pod" #303). Even when using the fastest available VMs and the biggest Premium SSDs the hyperscaler offers, we still see the same error. To me, it doesn't look like a DinD issue, but a problem related to kind.

Environment:

  • kind version: tested on v0.12.0 and v0.11.1
  • Kubernetes version: node version: kindest/node:v1.21.1@sha256:69860bda5563ac81e3c0057d654b5253219618a22ec3a346306239bba8cfa1a6 and kindest/node:v1.21.10@sha256:84709f09756ba4f863769bdcabe5edafc2ada72d3c8c44d6515fc581b66b029c
  • Docker version: (use docker info): Docker 20.10.11 (WSL2) and docker:20.10.12-dind (DinD)
  • OS (e.g. from /etc/os-release): tested on WSL2-Ubuntu20.04LTS and Alpine Linux v3.15 (DinD)

Workaround:
Running etcd in memory (tmpfs) resolves the etcd error messages and the helm install timeouts:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  etcd:
    local:
      dataDir: /tmp/etcd

See #845 (comment) or knative-extensions/net-contour#444
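
For scripted test setups, the workaround config can be written to a file and passed to `kind create cluster --config`. This is only a sketch: the cluster name `etcd-tmpfs`, the file name, and both helper functions are assumptions for illustration, and the config text is copied from the workaround above. Because etcd data then lives in memory, the cluster state does not survive a node restart.

```python
from pathlib import Path

# Config copied from the workaround above: etcd's data directory is moved to /tmp/etcd.
KIND_CONFIG = """\
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  etcd:
    local:
      dataDir: /tmp/etcd
"""


def write_config(directory, filename="kind-etcd-tmpfs.yaml"):
    """Write the kind config into `directory` and return its path (file name is hypothetical)."""
    path = Path(directory) / filename
    path.write_text(KIND_CONFIG)
    return path


def kind_create_command(config_path, name="etcd-tmpfs"):
    """Build the argument list for creating a cluster from the config (cluster name is hypothetical)."""
    return ["kind", "create", "cluster", "--name", name, "--config", str(config_path)]


# Usage (not executed here):
#   import subprocess
#   subprocess.run(kind_create_command(write_config(".")), check=True)
```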

erichorwath added the kind/bug label Mar 24, 2022
@aojea
Contributor

aojea commented Mar 24, 2022

Why do you say it is kind?
etcd needs to fsync to disk and is very IOPS-intensive. Now think of all the filesystem layers between etcd and the disk in a setup with Docker, DinD, and WSL2 ... I see this as a "setup" issue.

@erichorwath
Author

let me phrase it differently: I haven't found a (hardware/os) "setup" where kind is not having those etcd issues.

Maybe kind itself is not the problem, but rather how kind is using Docker, but I'm not an expert here...

@aojea
Contributor

aojea commented Mar 24, 2022

let me phrase it differently: I haven't found a (hardware/os) "setup" where kind is not having those etcd issues.

Maybe kind itself is not the problem, but rather how kind is using Docker, but I'm not an expert here...

It is a setup problem: Docker uses overlay2 as the default storage driver. This can illustrate the problem better:
https://www.storageconference.us/2017/Presentations/PerformanceAnalysisOfContainerizedApplications-slides.pdf

@erichorwath
Author

erichorwath commented Mar 24, 2022

And why not create a data volume for etcd when kind creates nodes?
image

@aojea
Contributor

aojea commented Mar 24, 2022

heh, it seems it already runs in a volume, I was assuming it wasn't (EDIT: lol, my memory does not work well, I can see in the issue I reported about etcd performance I mentioned it XD)

  58213 ?        Ssl   42:08  \_ etcd --advertise-client-urls=https://172.18.0.3:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --client-cert-auth=true --data-dir=/var/lib/etcd --initial-advertise-peer-urls=https://172.18.0.3:2380 --initial-clu
                "Type": "volume",
                "Name": "3a17927a7e69565f6dd30372f1211892cb1fc12e85ab91bc59f18924a6657244",
                "Source": "/home/var/docker/volumes/3a17927a7e69565f6dd30372f1211892cb1fc12e85ab91bc59f18924a6657244/_data",
                "Destination": "/var",
                "Driver": "local",
                "Mode": "",
                "RW": true,
                "Propagation": ""
            }
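
The check above (etcd's `--data-dir=/var/lib/etcd` falls under the volume mounted at `/var`) can be automated. This is a sketch, assuming the mount list was obtained with something like `docker inspect --format '{{json .Mounts}}' <node-container>` and parsed with `json.loads`; the helper name `mount_for_path` and the second entry in the sample data are illustrative, not taken from the issue.

```python
def _norm(path):
    """Strip trailing slashes but keep the root path as '/'."""
    return path.rstrip("/") or "/"


def mount_for_path(mounts, path):
    """Return the mount whose Destination is the longest prefix of `path`, or None."""
    path = _norm(path)
    best = None
    for mount in mounts:
        dest = _norm(mount["Destination"])
        covers = dest == "/" or path == dest or path.startswith(dest + "/")
        if covers and (best is None or len(dest) > len(_norm(best["Destination"]))):
            best = mount
    return best


# Sample data: the first entry mirrors the output above, the second is illustrative only.
SAMPLE_MOUNTS = [
    {"Type": "volume", "Destination": "/var", "Driver": "local", "RW": True},
    {"Type": "bind", "Destination": "/lib/modules", "RW": False},
]

etcd_mount = mount_for_path(SAMPLE_MOUNTS, "/var/lib/etcd")
print(etcd_mount["Type"] if etcd_mount else "no dedicated mount (container filesystem)")
```

A result of `None` would mean the path is served by the container's own filesystem (for example overlay2) rather than by a volume or bind mount.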

I think that this may require some benchmarks and debugging then, to understand the bottleneck
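
As a starting point for such a benchmark, one rough option is to time small synchronous writes in the directory under test, since etcd's write path depends on syncing to disk. The sketch below is an assumption-laden stand-in, not etcd's own benchmark: the block size, write count, and function names are arbitrary choices, and a dedicated tool such as fio will give more reliable numbers.

```python
import os
import statistics
import tempfile
import time


def fsync_latencies(directory, writes=200, block_size=2048):
    """Append small blocks to a temp file in `directory`, syncing after each write.

    Returns the per-write sync latencies in seconds.
    """
    # fdatasync is not available on every platform; fall back to fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    block = b"\0" * block_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, block)
            start = time.perf_counter()
            sync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies):
    """Return median, approximate 99th percentile, and maximum latency in milliseconds."""
    ordered = sorted(latencies)
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return {
        "median_ms": statistics.median(ordered) * 1000,
        "p99_ms": p99 * 1000,
        "max_ms": ordered[-1] * 1000,
    }


# Usage: compare a directory on the Docker volume with one on tmpfs, e.g.
#   print(summarize(fsync_latencies("/var/lib/etcd")))
#   print(summarize(fsync_latencies("/tmp")))
```

Running it once inside the node container against the volume-backed path and once against a tmpfs path shows how much of the latency comes from the storage stack rather than from etcd itself.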

@BenTheElder
Member

BenTheElder commented Mar 24, 2022

let me phrase it differently: I haven't found a (hardware/os) "setup" where kind is not having those etcd issues.

We generally have the opposite problem, a few users have reported issues, but only on slow spinning disks ...

There's only so much this project can do: kind ships upstream Kubernetes/etcd, and the performance of etcd and how Kubernetes uses it largely come from those projects.

And why not create a data volume for etcd when kind creates nodes?

kind is.

Regarding tmpfs:

#845 discusses an option users can leverage with tmpfs, but doing so is not something we can do by default (due to lack of data durability ...), though possible to configure with existing features as you found.

@erichorwath
Author

JFYI, during kind create cluster && helm install we saw a maximum load of ~1k IOPS.
image
In this example we used a 1TB Azure Premium SSD with 5k IOPS (30k burst).

@BenTheElder
Member

BenTheElder commented Apr 18, 2023

Some of that may be docker pulling / unpacking the image.

I don't think there's much more we can do here. It's already possible to put etcd on a tmpfs with kind, with the significant trade-offs that entails. We use a Docker volume otherwise, and I don't think there's a faster option we could implement.
