Cluster creation fails with error 137 #3160

Closed
WajahatAliAbid opened this issue Apr 6, 2023 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@WajahatAliAbid

What happened:

I installed kind just recently following the documentation and tried to create a cluster both with and without a manual configuration. It always fails with error 137, and only rarely creates a cluster when I do not specify a configuration.

The command output follows:

kind create cluster --name sample --config ~/cluster_config.yaml --retain
Creating cluster "sample" ...
 ✓ Ensuring node image (kindest/node:v1.26.3) 🖼
 ✓ Preparing nodes 📦 📦 📦 📦  
 ✓ Writing configuration 📜 
 ✗ Starting control-plane 🕹 
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged sample-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 137
Command Output: I0406 22:21:30.198623     142 initconfiguration.go:254] loading configuration from "/kind/kubeadm.conf"
W0406 22:21:30.199698     142 initconfiguration.go:331] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.26.3
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0406 22:21:30.203898     142 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I0406 22:21:30.335061     142 certs.go:519] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost sample-control-plane] and IPs [10.96.0.1 172.18.0.4 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0406 22:21:30.497530     142 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0406 22:21:30.570108     142 certs.go:519] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0406 22:21:30.723856     142 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0406 22:21:30.837962     142 certs.go:519] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost sample-control-plane] and IPs [172.18.0.4 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost sample-control-plane] and IPs [172.18.0.4 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0406 22:21:31.257037     142 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
I0406 22:21:31.292681     142 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I0406 22:21:31.339689     142 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0406 22:21:31.413808     142 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0406 22:21:31.547176     142 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I0406 22:21:31.684911     142 kubelet.go:67] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0406 22:21:31.778519     142 manifests.go:99] [control-plane] getting StaticPodSpecs
I0406 22:21:31.778788     142 certs.go:519] validating certificate period for CA certificate
I0406 22:21:31.778843     142 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0406 22:21:31.778849     142 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0406 22:21:31.778853     142 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0406 22:21:31.778857     142 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0406 22:21:31.778860     142 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
I0406 22:21:31.781061     142 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0406 22:21:31.781078     142 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0406 22:21:31.781229     142 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0406 22:21:31.781235     142 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0406 22:21:31.781238     142 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0406 22:21:31.781241     142 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0406 22:21:31.781244     142 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0406 22:21:31.781248     142 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0406 22:21:31.781251     142 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
I0406 22:21:31.781762     142 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0406 22:21:31.781778     142 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0406 22:21:31.781910     142 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0406 22:21:31.782328     142 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0406 22:21:31.783418     142 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0406 22:21:31.783429     142 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I0406 22:21:31.783811     142 loader.go:373] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0406 22:21:31.785073     142 round_trippers.go:553] GET https://sample-control-plane:6443/healthz?timeout=10s  in 0 milliseconds

The configuration file is very simple, taken from the kind documentation:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker

I tried exporting logs and got this:

ERROR: [command "docker exec --privileged sample-worker3 sh -c 'tar --hard-dereference -C /var/log/ -chf - . || (r=$?; [ $r -eq 1 ] || exit $r)'" failed with error: exit status 1, command "docker exec --privileged sample-worker sh -c 'tar --hard-dereference -C /var/log/ -chf - . || (r=$?; [ $r -eq 1 ] || exit $r)'" failed with error: exit status 1, command "docker exec --privileged sample-control-plane sh -c 'tar --hard-dereference -C /var/log/ -chf - . || (r=$?; [ $r -eq 1 ] || exit $r)'" failed with error: exit status 1, command "docker exec --privileged sample-worker2 sh -c 'tar --hard-dereference -C /var/log/ -chf - . || (r=$?; [ $r -eq 1 ] || exit $r)'" failed with error: exit status 1, [[command "docker exec --privileged sample-worker journalctl --no-pager -u kubelet.service" failed with error: exit status 1, command "docker exec --privileged sample-worker journalctl --no-pager" failed with error: exit status 1, command "docker exec --privileged sample-worker cat /kind/version" failed with error: exit status 1, command "docker exec --privileged sample-worker crictl images" failed with error: exit status 1, command "docker exec --privileged sample-worker journalctl --no-pager -u containerd.service" failed with error: exit status 1], [command "docker exec --privileged sample-worker3 journalctl --no-pager -u kubelet.service" failed with error: exit status 1, command "docker exec --privileged sample-worker3 crictl images" failed with error: exit status 1, command "docker exec --privileged sample-worker3 cat /kind/version" failed with error: exit status 1, command "docker exec --privileged sample-worker3 journalctl --no-pager" failed with error: exit status 1, command "docker exec --privileged sample-worker3 journalctl --no-pager -u containerd.service" failed with error: exit status 1], [command "docker exec --privileged sample-control-plane crictl images" failed with error: exit status 1, command "docker exec --privileged sample-control-plane cat /kind/version" failed with error: exit status 1, command "docker exec --privileged sample-control-plane journalctl --no-pager -u kubelet.service" failed with error: exit status 1, command "docker exec --privileged sample-control-plane journalctl --no-pager" failed with error: exit status 1, command "docker exec --privileged sample-control-plane journalctl --no-pager -u containerd.service" failed with error: exit status 1], [command "docker exec --privileged sample-worker2 crictl images" failed with error: exit status 1, command "docker exec --privileged sample-worker2 cat /kind/version" failed with error: exit status 1, command "docker exec --privileged sample-worker2 journalctl --no-pager" failed with error: exit status 1, command "docker exec --privileged sample-worker2 journalctl --no-pager -u kubelet.service" failed with error: exit status 1, command "docker exec --privileged sample-worker2 journalctl --no-pager -u containerd.service" failed with error: exit status 1]]]

What you expected to happen:

I expected the cluster to be created without any hiccups.

How to reproduce it (as minimally and precisely as possible):

I am not sure; I just installed kind using go install sigs.k8s.io/kind@latest
Anything else we need to know?:
None
Environment:

  • kind version: (use kind version): kind v0.18.0 go1.20.1 linux/amd64
  • Runtime info: (use docker info or podman info):
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 46
  Running: 2
  Paused: 0
  Stopped: 44
 Images: 200
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.19.0-38-generic
 Operating System: KDE neon 5.27
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.16GiB
 Name: pk-dellg155515-wajahat
 ID: UG4P:PVYX:ELEZ:RA3E:VDMY:GBMX:CWYI:GS6Y:KQOK:EAX4:7HJ2:Y6H6
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="KDE neon 5.27"
NAME="KDE neon"
VERSION_ID="22.04"
VERSION="5.27"
VERSION_CODENAME=jammy
ID=neon
ID_LIKE="ubuntu debian"
HOME_URL="https://neon.kde.org/"
SUPPORT_URL="https://neon.kde.org/"
BUG_REPORT_URL="https://bugs.kde.org/"
PRIVACY_POLICY_URL="https://kde.org/privacypolicy/"
UBUNTU_CODENAME=jammy
  • Kubernetes version: (use kubectl version):
clientVersion:
  buildDate: "2023-02-22T13:39:03Z"
  compiler: gc
  gitCommit: fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b
  gitTreeState: clean
  gitVersion: v1.26.2
  goVersion: go1.19.6
  major: "1"
  minor: "26"
  platform: linux/amd64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2023-03-30T06:34:50Z"
  compiler: gc
  gitCommit: 9e644106593f3f4aa98f8a84b23db5fa378900bd
  gitTreeState: clean
  gitVersion: v1.26.3
  goVersion: go1.19.7
  major: "1"
  minor: "26"
  platform: linux/amd64
  • Any proxies or other special environment settings?: None
@WajahatAliAbid WajahatAliAbid added the kind/bug Categorizes issue or PR as related to a bug. label Apr 6, 2023
@BenTheElder
Member

And it always fails with error 137. And only rarely creates cluster when I do not specify a configuration.

If it's failing when you add more nodes, it's typically a resource limitation, such as https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files

We can't diagnose those without the kind export logs output; you may have to run that shortly after starting kind create cluster --retain. Also, even if an error is logged, that doesn't mean no logs were exported: it should print a path for the log export, and there may still be other logs that didn't error.
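
For reference, a minimal sketch of raising the inotify limits described on that known-issues page and collecting logs while the nodes are still retained (the values shown are only illustrative, not requirements; persist them via /etc/sysctl.conf or a /etc/sysctl.d/ drop-in if they turn out to help):

# check the current limits
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
# raise them for the running kernel (example values)
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
# retry with --retain so the nodes stay around, then export logs
kind create cluster --name sample --config ~/cluster_config.yaml --retain
kind export logs --name sample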

@WajahatAliAbid
Author

So I do have a much higher number of file watchers configured than what is mentioned on that page:

fs.inotify.max_user_watches = 1638400
fs.inotify.max_user_instances = 1638400

Furthermore, I've uploaded logs here.

@BenTheElder
Member

BenTheElder commented Apr 6, 2023

from entrypoint logs 2924451674/sample2-worker3/serial.log:

chown: cannot access '': No such file or directory

That's ... bizarre?
This is the only place we chown in the node entrypoint:

chown root:root "$(which mount)" "$(which umount)"

Those binaries are shipped with the image ...
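
(For context, a quick illustration of how that message can appear: if the command substitution around which finds nothing, chown is handed an empty-string path. This only reproduces the error text; it is not what the entrypoint normally does.)

# `which` prints nothing for a missing binary, so chown sees an empty argument
chown root:root "$(which some-binary-that-does-not-exist)"
# chown: cannot access '': No such file or directory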

@BenTheElder
Member

The specific inotify requirement for Kubernetes varies with usage on the host environment (e.g. sometimes IDEs are already consuming many of them; it also depends on how many nodes, etc.), so knowing the limit doesn't help us much. But indeed, the logs don't indicate it's related to that specifically, as the nodes are exiting early during a simple filesystem operation.

@BenTheElder
Member

I think this is a variant of #3043

Security Options:
apparmor
seccomp
Profile: default
cgroupns

@WajahatAliAbid
Author

This is certainly weird. Any idea what could cause this? Or any way I can avoid this?

@BenTheElder
Member

I think you could try changing default-cgroupns-mode to host instead of private, but we'll need to handle this in kind. I'm not sure why we're not seeing this much yet. https://docs.docker.com/engine/reference/commandline/dockerd/

Looks like maybe the default is typically host; maybe a difference in distros?
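
For anyone who wants to try that, a rough sketch of switching the dockerd default via daemon.json (default-cgroupns-mode is the standard dockerd option; note this example overwrites /etc/docker/daemon.json, so merge it by hand if you already have one):

# set the daemon-wide default cgroup namespace mode to "host"
echo '{ "default-cgroupns-mode": "host" }' | sudo tee /etc/docker/daemon.json
# restart the daemon for the change to take effect
sudo systemctl restart docker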

@WajahatAliAbid
Author

It is host by default as far as I'm aware.

@WajahatAliAbid
Author

@BenTheElder I looked further into the configuration of my system, and it was being blocked by a security solution installed on my machine. I had forgotten that was installed; once I checked, I realized it was blocking mount & unmount operations. I added those to the exception list and now everything works as expected.

Thanks to all the information you provided, I managed to learn a lot of new things. Appreciate it. I'll close this issue now.

@BenTheElder
Member

Thanks, if you're able to share I'd be curious which security solution 😅

Blocking mount operations would definitely do it; we need to do lots of those for the containers, volumes, etc.

@WajahatAliAbid
Author

@BenTheElder Sorry, I cannot disclose that information, but I think any security solution like CrowdStrike or Cylance could block this.

@BenTheElder
Member

Understood! For #3174 it turns out to be #3180 anyhow 😅

@banaari

banaari commented Jan 19, 2024

I can CONFIRM from bitter experience that CrowdStrike CAN and DOES kill pods at birth, and you get a 137 error.
