kind-control-plane fails to cleanup snapshots in a timely manner [SOLVED] #2865

Closed
gvkhna opened this issue Aug 8, 2022 · 2 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.), triage/duplicate (Indicates an issue is a duplicate of other open issue.)

Comments

gvkhna commented Aug 8, 2022

What happened:

My docker.img was filling up and running out of space over time on Unraid OS.

What you expected to happen:

Old containers/images deployed to the cluster would be cleaned up.

How to reproduce it (as minimally and precisely as possible):

Deploying containers over time fills up the snapshotter directory with snapshots, I believe. I'm not sure of the exact reproduction steps; it may be as simple as provisioning and removing several deployments in a row and observing that the space is not reclaimed.

Anything else we need to know?:

This may be an upstream issue; I looked into garbage collection in Kubernetes. My deployment does not have crictl installed, but I was able to resolve the issue with a periodic cron job that runs the following:

docker exec -it kind-control-plane /bin/bash -c "crictl rmi --prune"
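
For reference, a minimal crontab sketch for scheduling that prune (the nightly schedule and log path are arbitrary choices; -it is dropped because cron has no TTY):

# Prune unused images in the kind node every night at 03:00.
0 3 * * * docker exec kind-control-plane crictl rmi --prune >> /var/log/kind-prune.log 2>&1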

Running crictl rmi --prune does remove the old snapshots, which does not seem to happen automatically, and I've seen it drop the space usage of the kind-control-plane volume in Docker. The space was being taken up in the io.containerd.snapshotter directory. I diagnosed this by walking through the kind-control-plane volume until I hit folders taking up large amounts of space; before pruning, the snapshots directory below was filling up with a huge number of snapshots.

du -sh /var/lib/docker/volumes/b9a31a525352d37694c8ba2048fb83c998038e9885b49daaff85528ffff77264/_data/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*
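
A similar check can be run from inside the node itself, without hunting for the Docker volume hash (the containerd root below is the default location in a kind node; treat the exact path as an assumption for your setup):

# Snapshot directory size as seen from inside the kind node.
docker exec kind-control-plane du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots
# Image filesystem usage as reported by the CRI.
docker exec kind-control-plane crictl imagefsinfo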

I understand garbage collection is in flux as Kubernetes transitions to a new ownership model, but I thought I would post this information as a workaround in case it is helpful to anyone or can be addressed in some manner.

Environment:

  • kind version: (use kind version): v0.12.0 go1.17.8 linux/amd64
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-03-06T21:32:53Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 36
  Running: 27
  Paused: 0
  Stopped: 9
 Images: 40
 Server Version: 20.10.14
 Storage Driver: btrfs
  Build Version: Btrfs v4.20.1
  Library Version: 102
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3df54a852345ae127d1fa3092b95168e4a88e2f8
 runc version: v1.0.3-0-gf46b6ba2
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.15.46-Unraid
 Operating System: Slackware 15.0 x86_64
 OSType: linux
 Architecture: x86_64
 CPUs: 24
 Total Memory: 62.53GiB
 Name: Tower
 ID: YWZH:D5L6:V7M5:YYJO:PQW3:GWIZ:3XSJ:AZ3R:3ZCZ:WU5M:7WP2:UDTU
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine
  • OS (e.g. from /etc/os-release):
NAME=Slackware
VERSION="15.0"
ID=slackware
VERSION_ID=15.0
PRETTY_NAME="Slackware 15.0 x86_64"
ANSI_COLOR="0;34"
CPE_NAME="cpe:/o:slackware:slackware_linux:15.0"
HOME_URL="http://slackware.com/"
SUPPORT_URL="http://www.linuxquestions.org/questions/slackware-14/"
BUG_REPORT_URL="http://www.linuxquestions.org/questions/slackware-14/"
VERSION_CODENAME=stable
@gvkhna gvkhna added the kind/bug Categorizes issue or PR as related to a bug. label Aug 8, 2022
@BenTheElder (Member) commented

kind clusters were never meant to be long-lived; they are meant to be cheap and disposable.

That said, this has an existing issue #735

Unfortunately, this one is a little trickier than it sounds at first glance and will probably require new functionality in Kubernetes or containerd or some new GC process.

(Consider: we need to avoid GCing the core side-loaded images, and currently Kubernetes image GC is handled by the kubelet, which GCs images only when disk usage crosses a percent-usage threshold; but the disk usage kind reports to the kubelet is based on total host capacity, and many users run kind in a small fraction of their total disk space.)

/triage duplicate
/close
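
For anyone who wants to experiment with the kubelet threshold mechanism described above, a hedged sketch that lowers the image GC thresholds at cluster creation: the field names are standard KubeletConfiguration fields, but the specific percentages, the patch mechanism, and whether it behaves as expected on a given kind version are assumptions, and the caveat above still applies that aggressive image GC can also remove the side-loaded core images.

# Hypothetical kind config lowering the kubelet image GC thresholds so GC
# kicks in sooner. Percentages are placeholder values; the kubelet still
# measures usage against the whole host disk, as noted above.
cat > kind-gc-config.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: KubeletConfiguration
    imageGCHighThresholdPercent: 50
    imageGCLowThresholdPercent: 40
EOF
kind create cluster --config kind-gc-config.yaml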

@k8s-ci-robot k8s-ci-robot added the triage/duplicate Indicates an issue is a duplicate of other open issue. label Aug 8, 2022
@BenTheElder BenTheElder self-assigned this Aug 8, 2022
@k8s-ci-robot (Contributor) commented

@BenTheElder: Closing this issue.

In response to the /triage duplicate and /close commands in the comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
