Hi, we're using the latest STORK plugin from upstream; by default it comes with health monitoring enabled:
--health-monitor Enable health monitoring of the storage driver (default: true)
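If needed, the monitor can be switched off explicitly. A minimal sketch, assuming STORK runs as a Deployment named stork in kube-system (adjust the name and namespace to your setup) and using the usual Go boolean-flag syntax:

kubectl -n kube-system edit deployment stork
# add the flag to the stork container args, e.g.:
#   args:
#   - --health-monitor=false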
Today we ran into a painful issue. We have many nodes, and some of them occasionally get overloaded, so they keep flapping between the Online and OFFLINE states.
STORK detects these nodes, tries to reattach the volumes, and restarts the pods in place; an example log message:
time="2020-10-10T19:46:16Z" level=info msg="Deleting Pod from Node m9c17 due to volume driver status: Offline ()" Namespace=hosting Owner=ReplicaSet/hc1-wd48-678d9888fb PodName=hc1-wd48-678d9888fb-p8gck
This causes really weird behavior from the linstor-csi driver:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned hosting/hc1-wd48-678d9888fb-fsmcq to m9c17
Warning FailedMount 9m39s (x11 over 10m) kubelet, m9c17 MountVolume.WaitForAttach failed for volume "pvc-ddd150c5-94eb-48a2-9126-4d1339811752" : volume attachment is being deleted
Warning FailedMount 9m35s (x10 over 10m) kubelet, m9c17 MountVolume.SetUp failed for volume "pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = Internal desc = NodePublishVolume failed for pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4: checking device path failed: path "" does not exist
Warning FailedMount 9m7s kubelet, m9c17 MountVolume.WaitForAttach failed for volume "pvc-ddd150c5-94eb-48a2-9126-4d1339811752" : volume pvc-ddd150c5-94eb-48a2-9126-4d1339811752 has GET error for volume attachment csi-9ff6fcc944f9e40da6106d5175b34c3e53f7449ee0a990f6c2c69ba07764d9e1: volumeattachments.storage.k8s.io "csi-9ff6fcc944f9e40da6106d5175b34c3e53f7449ee0a990f6c2c69ba07764d9e1" is forbidden: User "system:node:m9c17" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node "m9c17" and this object
Warning FailedMount 8m20s kubelet, m9c17 Unable to attach or mount volumes: unmounted volumes=[vol-data-backup vol-data-web], unattached volumes=[wd48-vol-data-global run vol-data-backup wd48-vol-shared default-token-jt2jk cgroup fuse vol-data-web wd48-vol-data-proxy]: timed out waiting for the condition
Normal SuccessfulAttachVolume 8m14s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-ddd150c5-94eb-48a2-9126-4d1339811752"
Warning FailedMount 3m55s (x4 over 9m3s) kubelet, m9c17 MountVolume.SetUp failed for volume "pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = Internal desc = NodePublishVolume failed for pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4: 404 Not Found
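The kubelet itself is not allowed to read VolumeAttachment objects at cluster scope (hence the "forbidden" message above), but from a cluster-admin context they can be inspected directly. A small diagnostic sketch, reusing the names from the events:

kubectl get volumeattachments.storage.k8s.io | grep pvc-ddd150c5-94eb-48a2-9126-4d1339811752
kubectl get volumeattachment csi-9ff6fcc944f9e40da6106d5175b34c3e53f7449ee0a990f6c2c69ba07764d9e1 -o yaml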
The volume may get stuck in the DELETING state, which is also visible in the csi-attacher logs. After a while the diskless resource gets removed from the node, but the VolumeAttachment keeps existing for that node.
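To confirm that state, one can compare what LINSTOR and Kubernetes report for the node; a rough sketch using the node name from the example above:

linstor resource list | grep m9c17        # the diskless resource for the PVC no longer shows up here
kubectl get volumeattachments.storage.k8s.io -o wide | grep m9c17   # the stale attachment is still listed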
However, this stale VolumeAttachment does not allow the pod to start, because the DRBD device is missing on the node. One possible way to fix it is to create the resource manually, so that the existing VolumeAttachment is satisfied:
linstor r c m9c17 pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4 --diskless
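After that, the DRBD device for the volume exists again on the node and the pod can mount it. A quick way to verify (sketch, run on or against m9c17):

linstor resource list | grep pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4
drbdadm status pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4   # run on the node itself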
I guess this is the exact case mentioned by @rck in #52 (comment).