Controller never becomes ready after upgrade to 2.5.2 #698

Open · zrav opened this issue Aug 25, 2024 · 5 comments

zrav commented Aug 25, 2024

This is a k8s 1.30 test cluster on Ubuntu Jammy. After upgrading from 2.5.1 to 2.5.2, the controller never becomes ready:

root@k8sw3:~# modinfo drbd
filename:       /lib/modules/5.15.0-119-generic/updates/dkms/drbd.ko
softdep:        post: handshake
alias:          block-major-147-*
license:        GPL
version:        9.2.10
description:    drbd - Distributed Replicated Block Device v9.2.10
author:         Philipp Reisner <[email protected]>, Lars Ellenberg <[email protected]>
srcversion:     BADB6D1C4C401B173352EAB
depends:        lru_cache,libcrc32c
retpoline:      Y
name:           drbd
vermagic:       5.15.0-119-generic SMP mod_unload modversions
parm:           enable_faults:int
parm:           fault_rate:int
parm:           fault_count:int
parm:           fault_devs:int
parm:           disable_sendpage:bool
parm:           allow_oos:DONT USE! (bool)
parm:           minor_count:Approximate number of drbd devices (1U-255U) (uint)
parm:           usermode_helper:string
parm:           protocol_version_min:drbd_protocol_version
parm:           strict_names:restrict resource and connection names to ascii alnum and a subset of punct (drbd_strict_names)

k8s@k8scp2:~$ kubectl get pods -n piraeus-datastore -o wide
NAME                                                  READY   STATUS     RESTARTS        AGE   IP             NODE                NOMINATED NODE   READINESS GATES
ha-controller-n8hm6                                   1/1     Running    7 (33m ago)     92d   10.244.5.116   k8sw3.example.com   <none>           <none>
ha-controller-vn95v                                   1/1     Running    7 (35m ago)     92d   10.244.3.239   k8sw2.example.com   <none>           <none>
ha-controller-zhrhc                                   1/1     Running    7 (33m ago)     92d   10.244.2.216   k8sw1.example.com   <none>           <none>
linstor-controller-5bc9565d7f-gc9ks                   0/1     Running    4 (2m44s ago)   24m   10.244.5.125   k8sw3.example.com   <none>           <none>
linstor-csi-controller-5d9f5bb46f-cqkj9               0/7     Unknown    37              92d   10.244.2.217   k8sw1.example.com   <none>           <none>
linstor-csi-controller-6bd5d4dc9c-l7z5q               0/7     Init:0/1   0               24m   10.244.3.247   k8sw2.example.com   <none>           <none>
linstor-csi-node-2f2lf                                0/3     Init:0/1   0               24m   10.244.3.248   k8sw2.example.com   <none>           <none>
linstor-csi-node-qdt8f                                0/3     Init:0/1   0               24m   10.244.2.220   k8sw1.example.com   <none>           <none>
linstor-csi-node-vrvz7                                0/3     Init:0/1   0               24m   10.244.5.124   k8sw3.example.com   <none>           <none>
linstor-satellite.k8sw1.example.com-cxrzm             2/2     Running    0               24m   10.82.0.23     k8sw1.example.com   <none>           <none>
linstor-satellite.k8sw2.example.com-8kdsh             2/2     Running    0               24m   10.82.0.24     k8sw2.example.com   <none>           <none>
linstor-satellite.k8sw3.example.com-tkkzk             2/2     Running    0               24m   10.82.0.26     k8sw3.example.com   <none>           <none>
piraeus-operator-controller-manager-db9f69747-v49pl   1/1     Running    0               24m   10.244.3.245   k8sw2.example.com   <none>           <none>
piraeus-operator-gencert-7997864766-rqhdl             1/1     Running    0               24m   10.244.3.246   k8sw2.example.com   <none>           <none>

k8s@k8scp2:~/linstor$ kubectl logs -n piraeus-datastore linstor-controller-5bc9565d7f-gc9ks
time="2024-08-25T08:21:16Z" level=info msg="running k8s-await-election" version=refs/tags/v0.4.1
time="2024-08-25T08:21:16Z" level=info msg="no status endpoint specified, will not be created"
I0825 08:21:16.899744       1 leaderelection.go:250] attempting to acquire leader lease piraeus-datastore/linstor-controller...
I0825 08:21:16.919634       1 leaderelection.go:260] successfully acquired lease piraeus-datastore/linstor-controller
time="2024-08-25T08:21:16Z" level=info msg="long live our new leader: 'linstor-controller-5bc9565d7f-gc9ks'!"
time="2024-08-25T08:21:16Z" level=info msg="starting command '/usr/bin/piraeus-entry.sh' with arguments: '[startController]'"
LINSTOR, Module Controller
Version:            1.28.0 (959382f7b4fb9436fefdd21dfa262e90318edaed)
Build time:         2024-07-11T10:21:06+00:00
Java Version:       17
Java VM:            Debian, Version 17.0.11+9-Debian-1deb12u1
Operating system:   Linux, Version 5.15.0-119-generic
Environment:        amd64, 2 processors, 1986 MiB memory reserved for allocations


System components initialization in progress

Loading configuration file "/etc/linstor/linstor.toml"
08:21:17.620 [main] INFO  LINSTOR/Controller - SYSTEM - ErrorReporter DB version 1 found.
08:21:17.621 [main] INFO  LINSTOR/Controller - SYSTEM - Log directory set to: '/var/log/linstor-controller'
08:21:17.640 [main] INFO  LINSTOR/Controller - SYSTEM - Database type is Kubernetes-CRD
08:21:17.640 [Main] INFO  LINSTOR/Controller - SYSTEM - Loading API classes started.
08:21:18.131 [Main] INFO  LINSTOR/Controller - SYSTEM - API classes loading finished: 491ms
08:21:18.131 [Main] INFO  LINSTOR/Controller - SYSTEM - Dependency injection started.
08:21:18.139 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
08:21:18.140 [Main] INFO  LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
08:21:18.140 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
08:21:18.149 [Main] INFO  LINSTOR/Controller - SYSTEM - Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
08:21:18.149 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule"
08:21:18.150 [Main] INFO  LINSTOR/Controller - SYSTEM - Dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" was successful
08:21:19.707 [Main] INFO  LINSTOR/Controller - SYSTEM - Dependency injection finished: 1576ms
08:21:19.708 [Main] INFO  LINSTOR/Controller - SYSTEM - Cryptography provider: Using default cryptography module
08:21:20.511 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing authentication subsystem
08:21:21.041 [Main] INFO  LINSTOR/Controller - SYSTEM - SpaceTrackingService: Instance added as a system service
08:21:21.043 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
08:21:21.044 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing the k8s crd database connector
08:21:21.044 [Main] INFO  LINSTOR/Controller - SYSTEM - Kubernetes-CRD connection URL is "k8s"
08:21:22.463 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'K8sCrdDatabaseService' of type K8sCrdDatabaseService
08:21:22.470 [Main] INFO  LINSTOR/Controller - SYSTEM - Security objects load from database is in progress
08:21:23.119 [Main] INFO  LINSTOR/Controller - SYSTEM - Security objects load from database completed
08:21:23.121 [Main] INFO  LINSTOR/Controller - SYSTEM - Core objects load from database is in progress
08:21:26.580 [Main] INFO  LINSTOR/Controller - SYSTEM - Core objects load from database completed
08:21:26.796 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'TaskScheduleService' of type TaskScheduleService
08:21:26.800 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing network communications services
08:21:26.801 [Main] WARN  LINSTOR/Controller - SYSTEM - The SSL network communication service 'DebugSslConnector' could not be started because the keyStore file (/etc/linstor/ssl/keystore.jks) is missing
08:21:26.806 [Main] INFO  LINSTOR/Controller - SYSTEM - Created network communication service 'PlainConnector'
08:21:26.807 [Main] WARN  LINSTOR/Controller - SYSTEM - The SSL network communication service 'SslConnector' could not be started because the keyStore file (/etc/linstor/ssl/keystore.jks) is missing
08:21:26.807 [Main] INFO  LINSTOR/Controller - SYSTEM - Created network communication service 'SslConnector'
08:21:26.807 [Main] INFO  LINSTOR/Controller - SYSTEM - Reconnecting to previously known nodes
08:21:26.808 [Main] INFO  LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw1.example.com' via /10.82.0.23:3366 ...
08:21:26.818 [Main] INFO  LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw2.example.com' via /10.82.0.24:3366 ...
08:21:26.819 [Main] INFO  LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw3.example.com' via /10.82.0.26:3366 ...
08:21:26.827 [Main] INFO  LINSTOR/Controller - SYSTEM - Reconnect requests sent
08:21:26.828 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'SpaceTrackingService' of type SpaceTrackingService
08:21:26.836 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'ScheduleBackupService' of type ScheduleBackupService
08:21:26.837 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'EbsStatusPoll' of type EbsStatusPoll
08:21:26.998 [SpaceTrackingService] ERROR LINSTOR/Controller - SYSTEM - Uncaught exception in j [Report number 66CAE97D-00000-000000]

Aug 25, 2024 8:21:27 AM org.glassfish.jersey.server.wadl.WadlFeature configure
WARNING: JAX-B API not found . WADL feature is disabled.
Aug 25, 2024 8:21:27 AM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [[::]:3370]
Aug 25, 2024 8:21:27 AM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
08:21:27.792 [Main] WARN  LINSTOR/Controller - SYSTEM - Not calling 'systemd-notify' as NOTIFY_SOCKET is null
08:21:27.793 [Main] INFO  LINSTOR/Controller - SYSTEM - Controller initialized

08:21:27.794 [SslConnector] INFO  LINSTOR/Controller - SYSTEM - Sending authentication to satellite 'k8sw1.example.com'
08:21:27.826 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw3.example.com' via /10.82.0.26:3366 ...
08:21:27.827 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - Establishing connection to node 'k8sw2.example.com' via /10.82.0.24:3366 ...
08:21:27.830 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - LogArchive: Running log archive on directory: /var/log/linstor-controller
08:21:27.831 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - LogArchive: No logs to archive.
08:21:27.832 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - BalanceResourcesTask/START
08:21:27.840 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - BalanceResourcesTask/END: Adjusted: 0 - Removed: 0
08:21:27.841 [SslConnector] INFO  LINSTOR/Controller - SYSTEM - Sending authentication to satellite 'k8sw2.example.com'
08:21:27.849 [SslConnector] INFO  LINSTOR/Controller - SYSTEM - Sending authentication to satellite 'k8sw3.example.com'
08:21:28.506 [MainWorkerPool-1] INFO  LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw1.example.com' from OFFLINE -> CONNECTED
08:21:28.509 [MainWorkerPool-1] INFO  LINSTOR/Controller - SYSTEM - Satellite 'k8sw1.example.com' authenticated
08:21:28.511 [MainWorkerPool-2] INFO  LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw2.example.com' from OFFLINE -> CONNECTED
08:21:28.514 [MainWorkerPool-2] INFO  LINSTOR/Controller - SYSTEM - Satellite 'k8sw2.example.com' authenticated
08:21:28.515 [MainWorkerPool-2] INFO  LINSTOR/Controller - SYSTEM - Sending full sync to Node: 'k8sw2.example.com'.
08:21:28.519 [MainWorkerPool-1] INFO  LINSTOR/Controller - SYSTEM - Sending full sync to Node: 'k8sw1.example.com'.
08:21:28.915 [MainWorkerPool-1] INFO  LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw3.example.com' from OFFLINE -> CONNECTED
08:21:28.931 [MainWorkerPool-1] INFO  LINSTOR/Controller - SYSTEM - Satellite 'k8sw3.example.com' authenticated
08:21:28.932 [MainWorkerPool-1] INFO  LINSTOR/Controller - SYSTEM - Sending full sync to Node: 'k8sw3.example.com'.
08:21:29.288 [MainWorkerPool-2] INFO  LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw3.example.com' from CONNECTED -> ONLINE
08:21:29.472 [MainWorkerPool-1] INFO  LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw1.example.com' from CONNECTED -> ONLINE
08:21:29.487 [MainWorkerPool-2] INFO  LINSTOR/Controller - SYSTEM - Changing connection state of node 'k8sw2.example.com' from CONNECTED -> ONLINE
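
The uncaught SpaceTrackingService exception above left error report 66CAE97D-00000-000000. For reference, the full stack trace behind it can usually be pulled with the LINSTOR client (a sketch: the deployment name is inferred from the ReplicaSet shown below, and this assumes the client in the piraeus-server image can still reach the controller API):

kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor error-reports show 66CAE97D-00000-000000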

k8s@k8scp2:~$ kubectl describe -n piraeus-datastore pod linstor-controller-5bc9565d7f-gc9ks
Name:                 linstor-controller-5bc9565d7f-gc9ks
Namespace:            piraeus-datastore
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      linstor-controller
Node:                 k8sw3.example.com/10.82.0.26
Start Time:           Sun, 25 Aug 2024 07:59:36 +0000
Labels:               app.kubernetes.io/component=linstor-controller
                      app.kubernetes.io/instance=linstorcluster
                      app.kubernetes.io/name=piraeus-datastore
                      pod-template-hash=5bc9565d7f
Annotations:          kubectl.kubernetes.io/default-container: linstor-controller
Status:               Running
IP:                   10.244.5.125
IPs:
  IP:           10.244.5.125
Controlled By:  ReplicaSet/linstor-controller-5bc9565d7f
Init Containers:
  run-migration:
    Container ID:  containerd://079a193ebd297dacb65d44f98e71ef28ff406e5395569139577ddaf335813bb1
    Image:         quay.io/piraeusdatastore/piraeus-server:v1.28.0
    Image ID:      quay.io/piraeusdatastore/piraeus-server@sha256:cc21d7daf44f7f328f9175ff443a0d2289efa3038182d8892ce753d9e67178a3
    Port:          <none>
    Host Port:     <none>
    Args:
      runMigration
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 25 Aug 2024 08:00:37 +0000
      Finished:     Sun, 25 Aug 2024 08:01:22 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      K8S_AWAIT_ELECTION_ENABLED:         1
      K8S_AWAIT_ELECTION_NAME:            linstor-controller
      K8S_AWAIT_ELECTION_LOCK_NAME:       linstor-controller
      K8S_AWAIT_ELECTION_LOCK_NAMESPACE:  piraeus-datastore (v1:metadata.namespace)
      K8S_AWAIT_ELECTION_IDENTITY:        linstor-controller-5bc9565d7f-gc9ks (v1:metadata.name)
    Mounts:
      /etc/linstor from etc-linstor (ro)
      /run/migration from migration (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pkqjc (ro)
Containers:
  linstor-controller:
    Container ID:  containerd://a0e046313a75f00c56a29de1ae58f980526705be3a411c6d4d1fbf5e12773848
    Image:         quay.io/piraeusdatastore/piraeus-server:v1.28.0
    Image ID:      quay.io/piraeusdatastore/piraeus-server@sha256:cc21d7daf44f7f328f9175ff443a0d2289efa3038182d8892ce753d9e67178a3
    Port:          3370/TCP
    Host Port:     0/TCP
    Args:
      startController
    State:          Running
      Started:      Sun, 25 Aug 2024 08:21:16 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Sun, 25 Aug 2024 08:16:16 +0000
      Finished:     Sun, 25 Aug 2024 08:21:16 +0000
    Ready:          False
    Restart Count:  4
    Liveness:       http-get http://:api/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Startup:        http-get http://:api/health delay=0s timeout=1s period=10s #success=1 #failure=30
    Environment:
      JAVA_OPTS:                          -Djdk.tls.acknowledgeCloseNotify=true
      K8S_AWAIT_ELECTION_ENABLED:         1
      K8S_AWAIT_ELECTION_NAME:            linstor-controller
      K8S_AWAIT_ELECTION_LOCK_NAME:       linstor-controller
      K8S_AWAIT_ELECTION_LOCK_NAMESPACE:  piraeus-datastore (v1:metadata.namespace)
      K8S_AWAIT_ELECTION_IDENTITY:        linstor-controller-5bc9565d7f-gc9ks (v1:metadata.name)
    Mounts:
      /etc/linstor from etc-linstor (ro)
      /tmp from tmp (rw)
      /var/log/linstor-controller from var-log-linstor-controller (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pkqjc (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  etc-linstor:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      linstor-controller-config
    Optional:  false
  var-log-linstor-controller:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  migration:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-pkqjc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              worker=true
Tolerations:                 drbd.linbit.com/force-io-error:NoSchedule
                             drbd.linbit.com/lost-quorum:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  25m                  default-scheduler  Successfully assigned piraeus-datastore/linstor-controller-5bc9565d7f-gc9ks to k8sw3.example.com
  Normal   Pulling    25m                  kubelet            Pulling image "quay.io/piraeusdatastore/piraeus-server:v1.28.0"
  Normal   Pulled     24m                  kubelet            Successfully pulled image "quay.io/piraeusdatastore/piraeus-server:v1.28.0" in 52.078s (1m0.455s including waiting). Image size: 323505259 bytes.
  Normal   Created    24m                  kubelet            Created container run-migration
  Normal   Started    24m                  kubelet            Started container run-migration
  Normal   Pulled     23m                  kubelet            Container image "quay.io/piraeusdatastore/piraeus-server:v1.28.0" already present on machine
  Normal   Created    23m                  kubelet            Created container linstor-controller
  Normal   Started    23m                  kubelet            Started container linstor-controller
  Warning  Unhealthy  23m                  kubelet            Startup probe failed: Get "http://10.244.5.125:3370/health": dial tcp 10.244.5.125:3370: connect: connection refused
  Warning  Unhealthy  33s (x139 over 23m)  kubelet            Startup probe failed: HTTP probe failed with statuscode: 500
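
The failing startup probe can also be exercised by hand to confirm the HTTP 500 (a sketch, reusing the pod name and API port 3370 from the output above):

kubectl -n piraeus-datastore port-forward pod/linstor-controller-5bc9565d7f-gc9ks 3370:3370 &
curl -v http://localhost:3370/health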

Happy to provide more info if told where to look.

WanzenBug (Member) commented

08:21:26.998 [SpaceTrackingService] ERROR LINSTOR/Controller - SYSTEM - Uncaught exception in j [Report number 66CAE97D-00000-000000]

For some reason this service does not start. This could be related to corrupted DB entries. You can try cleaning the space tracking database (see the sketch after this list):

  • Create a backup of the entries: kubectl get -oyaml spacehistory.internal.linstor.linbit.com > spacehistory.yaml. Attach it to this issue so we can investigate.
  • Delete the entries: kubectl delete spacehistory.internal.linstor.linbit.com --all, then see if the controller starts.
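
Put together, the workaround looks roughly like this (a sketch: the backup file name is arbitrary, and the pod label comes from the pod description above):

# 1. Back up the space tracking entries for later debugging
kubectl get -oyaml spacehistory.internal.linstor.linbit.com > spacehistory.yaml

# 2. Delete the (possibly corrupted) entries
kubectl delete spacehistory.internal.linstor.linbit.com --all

# 3. Watch the controller pod come back to Ready
kubectl -n piraeus-datastore get pods -l app.kubernetes.io/component=linstor-controller -w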

zrav commented Aug 26, 2024

Deleting the space history did indeed make the controller become ready again. File attached.
spacehistory.txt

dimm0 commented Sep 10, 2024

Same issue here, deleting spacehistory resolved it. I can send my db if needed.

zrav commented Sep 29, 2024

We experienced this again on our production cluster when upgrading from 2.5.1. Luckily we knew about the workaround.

salekseev commented

Same issue here, deleting spacehistory resolved it.
