
[docker installed with snap] HA Creation - Error: failed to create cluster: failed to copy certificate ca.crt: exit status 1 #724

Closed
jonstelly opened this issue Jul 19, 2019 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jonstelly

What happened:
Running kind to create an HA cluster like the one found here (except with 2 control-planes instead of 3)

kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker

What you expected to happen:
Cluster to get created and come up

How to reproduce it (as minimally and precisely as possible):
kind create cluster --retain --loglevel trace --config "./kind-cluster.yaml" --wait 5m;

Anything else we need to know?:
Creating a single control-plane cluster works fine on this machine; I deleted and recreated it several times to verify.

Debug logging output:

[addons] Applied essential addon: kube-proxy
I0719 17:37:45.225992     142 loader.go:359] Config loaded from file:  /etc/kubernetes/admin.conf
I0719 17:37:45.226805     142 loader.go:359] Config loaded from file:  /etc/kubernetes/admin.conf

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of control-plane nodes by copying certificate authorities 
and service account keys on each node and then running the following as root:

  kubeadm join 172.17.0.3:6443 --token <value withheld> \
    --discovery-token-ca-cert-hash sha256:e8f007ca6d45412c838744e330cb1516774f0dac8593f1588b90b33d3a248a57 \
    --experimental-control-plane          

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 172.17.0.3:6443 --token <value withheld> \
    --discovery-token-ca-cert-hash sha256:e8f007ca6d45412c838744e330cb1516774f0dac8593f1588b90b33d3a248a57  
DEBU[12:37:45] Running: /snap/bin/docker [docker inspect -f {{(index (index .NetworkSettings.Ports "6443/tcp") 0).HostPort}} kind-external-load-balancer] 
DEBU[12:37:45] Running: /snap/bin/docker [docker exec --privileged kind-control-plane cat /etc/kubernetes/admin.conf] 
 ✓ Starting control-plane 🕹️ 
DEBU[12:37:45] Running: /snap/bin/docker [docker exec --privileged kind-control-plane cat /kind/manifests/default-cni.yaml] 
DEBU[12:37:45] Running: /snap/bin/docker [docker exec --privileged -i kind-control-plane kubectl create --kubeconfig=/etc/kubernetes/admin.conf -f -] 
 ✓ Installing CNI 🔌 
DEBU[12:37:46] Running: /snap/bin/docker [docker exec --privileged -i kind-control-plane kubectl --kubeconfig=/etc/kubernetes/admin.conf apply -f -] 
 ✓ Installing StorageClass 💾 
DEBU[12:37:46] Running: /snap/bin/docker [docker exec --privileged kind-control-plane2 mkdir -p /etc/kubernetes/pki/etcd] 
DEBU[12:37:46] Running: /snap/bin/docker [docker cp kind-control-plane:/etc/kubernetes/pki/ca.crt /tmp/864842991/ca.crt] 
 ✗ Joining more control-plane nodes 🎮 
Error: failed to create cluster: failed to copy certificate ca.crt: exit status 1

Environment:

  • kind version: (use kind version): v0.4.0
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-21T13:09:06Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):
Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 19
Server Version: 18.06.1-ce
Storage Driver: aufs
 Root Dir: /var/snap/docker/common/var-lib-docker/aufs
 Backing Filesystem: extfs
 Dirs: 129
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: N/A (expected: 69663f0bd4b60df09991c08812a60108003fa340)
init version: 949e6fa (expected: fec3683)
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 5.0.0-20-generic
Operating System: Ubuntu Core 16
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 62.84GiB
Name: codor
ID: RR2C:GHT4:VNPO:ZXHC:RWW4:YYDG:OPA3:Y53B:WTBZ:23C3:AHLB:UWLN
Docker Root Dir: /var/snap/docker/common/var-lib-docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 53
 Goroutines: 62
 System Time: 2019-07-19T12:47:49.623878746-05:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 hub.home.local
 ubuntu:32000
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="19.04 (Disco Dingo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 19.04"
VERSION_ID="19.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=disco
UBUNTU_CODENAME=disco
@jonstelly jonstelly added the kind/bug Categorizes issue or PR as related to a bug. label Jul 19, 2019
@BenTheElder
Member

2 is invalid in kubeadm IIRC?

@jonstelly
Author

Ah, I hadn't seen that in the documentation yet... this was my first attempt to spin up an HA cluster. I'll try it again with 3 control planes like the example and make sure that works.

Assuming that's my problem, it would be nice to validate the configuration and throw a friendly error when people try to do this (assuming I'm not the only one to ever do it). I'm not really a go guy but I may see if I can figure something out and submit a pull request as penance for my not reading the documentation.

Thanks!

@BenTheElder
Member

Yes, sorry about that. I'm digging around trying to find something authoritative on the kubeadm side of things; I can't actually find anything now, but I could have sworn that it only supports 3 control planes specifically.

If/when we get that confirmed, we should document it in the kind docs and validate. If that's not true, then we have some bug here to fix 😅

cc @neolit123 @fabriziopandini

@BenTheElder
Member

So we should validate this to be odd, but we probably also need to look at getting some docs to clarify this upstream.

The cert copy issue may be unrelated to the number, however; I'll circle back to this one as I'm triaging a few other issues.

@neolit123
Member

neolit123 commented Jul 19, 2019 via email

@jonstelly
Author

I just tried a config with 3 control-planes and got the same error.

It is worth noting I might be trying to do too much on my dev machine. I'm using microk8s on this box for my development, I was really just using kind to spin up a quick larger cluster for testing some scaling behavior.

But I'll try to dig into this a bit.

@neolit123
Member

neolit123 commented Jul 20, 2019 via email

@jonstelly
Author

jonstelly commented Jul 20, 2019

I tracked it down but will need some guidance on how to fix it correctly if you'd like a pull request. As mentioned, my go skills are near non-existent so if it's easier for one of you to make this change, I won't be offended, hopefully the below helps...

TL;DR: I'm running docker from a snap, so docker doesn't have access to the host's /tmp directory that kind uses to copy around certs, etc., and the `docker cp <container>:/... /tmp/...` fails. It looks like kind needs to detect whether docker is installed as a snap and, if so, use a different temp directory.

I found some helpful info about snaps and directories here, including this command: `snap run --shell docker.docker` to get a shell where you can see the SNAP variables with `env | grep SNAP`.

I ~~hacked~~ skillfully changed the temp directory as follows and was able to spin up a cluster with 2 or 3 control planes (though I'm guessing the 2 control planes aren't really HA, since etcd uses raft):

diff --git a/pkg/cluster/internal/create/actions/kubeadmjoin/join.go b/pkg/cluster/internal/create/actions/kubeadmjoin/join.go                                                          
index 5283d98..b1b26e2 100644
--- a/pkg/cluster/internal/create/actions/kubeadmjoin/join.go
+++ b/pkg/cluster/internal/create/actions/kubeadmjoin/join.go
@@ -31,7 +31,6 @@ import (
        "sigs.k8s.io/kind/pkg/cluster/nodes"
        "sigs.k8s.io/kind/pkg/concurrent"
        "sigs.k8s.io/kind/pkg/exec"
-       "sigs.k8s.io/kind/pkg/fs"
 )

 // Action implements action for creating the kubeadm join
@@ -145,13 +144,9 @@ func runKubeadmJoinControlPlane(

        // creates a temporary folder on the host that should acts as a transit area
        // for moving necessary cluster certificates
-       tmpDir, err := fs.TempDir("", "")
-       if err != nil {
-               return err
-       }
-       defer os.RemoveAll(tmpDir)
+       var tmpDir = "/home/jon/snap/docker/current/tmp"

-       err = os.MkdirAll(filepath.Join(tmpDir, "/etcd"), os.ModePerm)
+       var err = os.MkdirAll(filepath.Join(tmpDir, "/etcd"), os.ModePerm)
        if err != nil {
                return err
        }
@@ -170,11 +165,11 @@ func runKubeadmJoinControlPlane(
                tmpPath := filepath.Join(tmpDir, fileName)
                // copies from bootstrap control plane node to tmp area
                if err := controlPlaneHandle.CopyFrom(containerPath, tmpPath); err != nil {
-                       return errors.Wrapf(err, "failed to copy certificate %s", fileName)
+                       return errors.Wrapf(err, "failed to copy certificate %s:%s from %s to %s", controlPlaneHandle, fileName, containerPath, tmpPath)                                
                }
                // copies from tmp area to joining node
                if err := node.CopyTo(tmpPath, containerPath); err != nil {
-                       return errors.Wrapf(err, "failed to copy certificate %s", fileName)
+                       return errors.Wrapf(err, "failed to copy certificate %s:%s from %s to %s", node, fileName, tmpPath, containerPath)                                              
                }
        }

@aojea
Contributor

aojea commented Jul 21, 2019

so ... the problem is that the folder created by `tmpDir, err := fs.TempDir` is not accessible within snap?
@BenTheElder do you think this is something to be fixed in https://golang.org/src/os/file.go?s=10008:10029#L319 ?

@jonstelly
Author

jonstelly commented Jul 21, 2019

so ... the problem is that the folder created by `tmpDir, err := fs.TempDir` is not accessible within snap?

Yeah, snap's model is to sandbox each installed application unless the snap component is distributed with classic confinement. But docker uses the strict confinement policy.

@BenTheElder do you think this is something to be fixed in https://golang.org/src/os/file.go?s=10008:10029#L319 ?

It feels like /tmp is the correct thing for Go's os API to return, certainly in this situation where kind is asking for a temp directory... I can't imagine a way of changing `TempDir()` to account for docker (not kind) running in a snap.

One easy check to figure this out: `which docker` returns `/snap/bin/docker` for me. I'm not 100% sure whether snap on different platforms like Fedora, etc. uses the same directory, but I would assume so. So the easy fix options seem like:

  1. Detect if docker is running from `/snap/bin/docker` and change the temp path behavior in that case
  2. Add a command-line boolean option `docker-snap` and add a note to the documentation/readme
  3. Add a command-line option `temp-path` and document this snap case where it would be useful (this might be useful elsewhere, but probably not imho)
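Option 1 could be sketched roughly like this. This is my own illustration (the `isSnapDocker` helper and the `/snap/` prefix heuristic are assumptions, not kind's actual code), but it shows the shape of the check:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// isSnapDocker reports whether a resolved docker binary path points into
// the snap mount namespace (e.g. /snap/bin/docker on Ubuntu).
func isSnapDocker(dockerPath string) bool {
	return strings.HasPrefix(dockerPath, "/snap/")
}

func main() {
	// Resolve docker the same way `which docker` would; fall back to a
	// literal for illustration if docker isn't on PATH.
	path, err := exec.LookPath("docker")
	if err != nil {
		path = "/snap/bin/docker"
	}
	if isSnapDocker(path) {
		fmt.Println("docker looks snap-confined; prefer a temp dir under $HOME")
	} else {
		fmt.Println("docker can likely read the host /tmp; os.TempDir() is fine")
	}
}
```

As discussed below, prefix detection like this isn't fully reliable (snap layouts can vary), which is one argument against option 1.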

@neolit123
Member

"1" seems like the better option to me.
snap detection can happen in:
https://github.com/kubernetes-sigs/kind/blob/master/pkg/fs/fs.go#L31

@BenTheElder
Member

Ah so, Docker installed via snap is listed under known issues and mentions `TMPDIR=$HOME` as an option that requires no modifications. https://kind.sigs.k8s.io/docs/user/known-issues/#docker-installed-with-snap

We don't need to use a temp directory to copy these in the first place and we should consider refactoring that, but in the meantime:

  • 1. Won't be reliable.
  • 3. Already has a standard equivalent in TMPDIR, and we have documented this (perhaps we need to make it more visible though??). This is also standard on UNIX / POSIX and many tools respect it. This functionality comes from the Go standard library *
  • 2. Is relatively kludgey; I don't want a precedent of flags per runtime / host / ... especially when it's not actually necessary.

* we have our own method that wraps this to internally deal with a minor oddity of the macOS platform...

Note that we do need a tempdir for building node images, so if you do that you will find the same issue there. Otherwise we've avoided this.

@BenTheElder BenTheElder changed the title HA Creation - Error: failed to create cluster: failed to copy certificate ca.crt: exit status 1 [docker installed with snap] HA Creation - Error: failed to create cluster: failed to copy certificate ca.crt: exit status 1 Jul 21, 2019
@neolit123
Member

I'd close this ticket, as the issue is documented in https://github.com/kubernetes-sigs/kind/blob/master/site/content/docs/user/known-issues.md#docker-installed-with-snap

and extend the issue template to ask users to have a look at the list of known issues before reporting.

@BenTheElder
Member

We should still prevent and/or detect this on failure. We could avoid the staging tempdir for this usage at least, but for the node build we can't.

We could start on #39 and detect this in the failure diagnostics and point to the docs automatically... (so it stays out of the "happy path"). I've still been discussing designs for that, though...

@jonstelly
Author

Ah, yep, setting TMPDIR is super simple, so I'll just set that up in my scripts... thanks.

And yeah, I had looked at the known issues, but I was fixating on the error message, and the way docker-in-snap is documented on the Known Issues page it didn't jump out at me. Maybe it would be helpful to add some detail to the error message in this case, like below? As it stands, the error doesn't mention the temp path (though the debug output does).

Thanks for the quick help on this and feel free to close this issue at your convenience.

                tmpPath := filepath.Join(tmpDir, fileName)
                // copies from bootstrap control plane node to tmp area
                if err := controlPlaneHandle.CopyFrom(containerPath, tmpPath); err != nil {
-                       return errors.Wrapf(err, "failed to copy certificate %s", fileName)
+                       return errors.Wrapf(err, "failed to copy certificate %s:%s from %s to %s", controlPlaneHandle, fileName, containerPath, tmpPath)                                
                }
                // copies from tmp area to joining node
                if err := node.CopyTo(tmpPath, containerPath); err != nil {
-                       return errors.Wrapf(err, "failed to copy certificate %s", fileName)
+                       return errors.Wrapf(err, "failed to copy certificate %s:%s from %s to %s", node, fileName, tmpPath, containerPath)                                              
                }
        }

@aojea
Contributor

aojea commented Oct 9, 2019

/close

I think that logging has improved a lot in newer versions thanks to Ben

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

/close

I think that logging has improved a lot in newer versions thanks to Ben


@BenTheElder
Member

the tempdir will be gone in the next release #1023
