kind-control-plane fails to cleanup snapshots in a timely manner [SOLVED] #2865

Closed
gvkhna opened this issue Aug 8, 2022 · 2 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.), triage/duplicate (Indicates an issue is a duplicate of other open issue.)

Comments

gvkhna commented Aug 8, 2022

What happened:

My docker.img was filling up and running out of space over time on Unraid OS.

What you expected to happen:

Old containers/images deployed to the cluster would be cleaned up.

How to reproduce it (as minimally and precisely as possible):

Deploying containers over time fills up the snapshotter directory with snapshots, I believe. I'm not sure of the exact reproduction steps; it may be as simple as provisioning and removing several deployments in a row and observing that the space is not reclaimed.

Anything else we need to know?:

This may be an upstream issue; I looked into garbage collection in Kubernetes. My deployment does not have crictl installed, but I was able to resolve the issue with a periodic cron job that runs the following:

docker exec -it kind-control-plane /bin/bash -c "crictl rmi --prune"
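
For reference, a minimal crontab sketch for scheduling that prune (the nightly schedule and log path are arbitrary choices; -it is dropped because cron has no TTY):

# Prune unused images in the kind node every night at 03:00.
0 3 * * * docker exec kind-control-plane crictl rmi --prune >> /var/log/kind-prune.log 2>&1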

Running crictl rmi --prune does remove the old snapshots, which does not seem to happen automatically, and I've seen it drop the space usage of the kind-control-plane volume in Docker. The space was being taken up in the io.containerd.snapshotter directory. I diagnosed this by walking through the kind-control-plane volume until I hit folders taking up large amounts of space; before pruning, the snapshots directory below was filling up with a huge number of snapshots.

du -sh /var/lib/docker/volumes/b9a31a525352d37694c8ba2048fb83c998038e9885b49daaff85528ffff77264/_data/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*
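
A similar check can be run from inside the node itself, without hunting for the Docker volume hash (the containerd root below is the default location in a kind node; treat the exact path as an assumption for your setup):

# Snapshot directory size as seen from inside the kind node.
docker exec kind-control-plane du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots
# Image filesystem usage as reported by the CRI.
docker exec kind-control-plane crictl imagefsinfo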

I understand garbage collection is in flux as Kubernetes transitions to a new ownership model, but I thought I would post this information as a workaround in case it is helpful to anyone or can be addressed in some manner.

Environment:

  • kind version: (use kind version): v0.12.0 go1.17.8 linux/amd64
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-03-06T21:32:53Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 36
  Running: 27
  Paused: 0
  Stopped: 9
 Images: 40
 Server Version: 20.10.14
 Storage Driver: btrfs
  Build Version: Btrfs v4.20.1
  Library Version: 102
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3df54a852345ae127d1fa3092b95168e4a88e2f8
 runc version: v1.0.3-0-gf46b6ba2
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.15.46-Unraid
 Operating System: Slackware 15.0 x86_64
 OSType: linux
 Architecture: x86_64
 CPUs: 24
 Total Memory: 62.53GiB
 Name: Tower
 ID: YWZH:D5L6:V7M5:YYJO:PQW3:GWIZ:3XSJ:AZ3R:3ZCZ:WU5M:7WP2:UDTU
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine
  • OS (e.g. from /etc/os-release):
NAME=Slackware
VERSION="15.0"
ID=slackware
VERSION_ID=15.0
PRETTY_NAME="Slackware 15.0 x86_64"
ANSI_COLOR="0;34"
CPE_NAME="cpe:/o:slackware:slackware_linux:15.0"
HOME_URL="http://slackware.com/"
SUPPORT_URL="http://www.linuxquestions.org/questions/slackware-14/"
BUG_REPORT_URL="http://www.linuxquestions.org/questions/slackware-14/"
VERSION_CODENAME=stable
@gvkhna gvkhna added the kind/bug Categorizes issue or PR as related to a bug. label Aug 8, 2022
@BenTheElder (Member) commented

kind clusters were never meant to be long-lived; they are meant to be cheap and disposable.

That said, this has an existing issue #735

Unfortunately, this one is a little trickier than it sounds at first glance and will probably require new functionality in Kubernetes or containerd or some new GC process.

(Consider: we need to avoid GCing the core side-loaded images, and currently Kubernetes image GC is handled by the kubelet, which GCs images only when disk usage crosses a percent-usage threshold; but the disk usage kind reports to the kubelet is based on total host capacity, and many users run kind in a small fraction of their total disk space.)

/triage duplicate
/close
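
For anyone who wants to experiment with the kubelet threshold mechanism described above, a hedged sketch that lowers the image GC thresholds at cluster creation: the field names are standard KubeletConfiguration fields, but the specific percentages, the patch mechanism, and whether it behaves as expected on a given kind version are assumptions, and the caveat above still applies that aggressive image GC can also remove the side-loaded core images.

# Hypothetical kind config lowering the kubelet image GC thresholds so GC
# kicks in sooner. Percentages are placeholder values; the kubelet still
# measures usage against the whole host disk, as noted above.
cat > kind-gc-config.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: KubeletConfiguration
    imageGCHighThresholdPercent: 50
    imageGCLowThresholdPercent: 40
EOF
kind create cluster --config kind-gc-config.yaml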

@k8s-ci-robot k8s-ci-robot added the triage/duplicate Indicates an issue is a duplicate of other open issue. label Aug 8, 2022
@BenTheElder BenTheElder self-assigned this Aug 8, 2022
@k8s-ci-robot (Contributor) commented

@BenTheElder: Closing this issue.

In response to the /triage duplicate and /close commands in the comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
