
Kubesolo fails to start on boot due to apiserver lease UID mismatch / RBAC bootstrap failure / Database Corruption #145

@benoitschipper

Description


On v1.1.3 (also reproduced on v1.1.2), Kubesolo occasionally fails to start as a systemd service on boot. The API server health check fails due to a lease UID precondition mismatch and an RBAC bootstrap-roles failure, causing Kubesolo to abort startup entirely. This appears to be related to stale/corrupted etcd/Kine state left over from a previous run.

Expected Behavior

Kubesolo should start cleanly on boot, even if a previous run left behind stale lease objects in etcd/Kine. The API server should either clean up stale leases automatically or tolerate the mismatch and proceed with RBAC bootstrapping.

Actual Behavior

On boot, the API server detects a UID mismatch on its own identity lease object and fails to update it:

StorageError: invalid object, Code: 4,
Key: /registry/leases/kube-system/apiserver-mglq5p6wzlxekntua2wo7lt2jm,
ResourceVersion: 0,
AdditionalErrorMsg: Precondition failed: UID in precondition: 6e41d968-09ce-48ea-8089-e9bb98772c43, UID in object meta: ""

This causes the poststarthook/rbac/bootstrap-roles hook to fail (reason withheld), which in turn causes the /healthz check to fail. After multiple retries, Kubesolo aborts and shuts down:

API server failed to start: apiserver /healthz check failed: component health check failed after multiple attempts

The systemd service then exits cleanly (Deactivated successfully) but Kubesolo is not running, requiring manual intervention.


Root Cause Hypothesis

The lease object for the API server identity (apiserver-mglq5p6wzlxekntua2wo7lt2jm) persists in etcd/Kine from the previous run with a specific UID. On the next boot, the API server generates a new UID and attempts to update the existing lease, but the precondition check (expecting the old UID) fails because the stored object has a mismatched or empty UID in its metadata.

This is likely a stale etcd/Kine database state issue — possibly triggered by an unclean shutdown (e.g., power loss), where the lease was written but not properly expired or cleaned up. The RBAC bootstrap failure may be a downstream consequence of the lease controller being in a broken state.

⚠️ Most likely caused by database corruption: the Kine/etcd datastore appears to contain an inconsistent lease object that blocks clean startup.
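To make the failure mode concrete, here is a minimal sketch (not Kubesolo or apiserver code; names are illustrative) of the storage precondition that is tripping: the lease update carries the UID the writer expects the stored object to have, and the write is rejected when the stored object's UID differs — or is empty, as in the log above.

```python
class PreconditionFailed(Exception):
    """Stand-in for the storage layer's precondition error (StorageError Code 4)."""


def update_lease(stored: dict, expected_uid: str, new_spec: dict) -> dict:
    """Reject the update when the stored object's UID does not match the precondition."""
    if stored.get("uid", "") != expected_uid:
        raise PreconditionFailed(
            f'Precondition failed: UID in precondition: {expected_uid}, '
            f'UID in object meta: "{stored.get("uid", "")}"'
        )
    stored["spec"] = new_spec
    return stored


# The broken on-disk state from the report: a lease record whose metadata lost
# its UID, so every renewal attempt using the previous run's UID fails forever.
stale = {"name": "apiserver-mglq5p6wzlxekntua2wo7lt2jm", "uid": ""}
try:
    update_lease(stale, "6e41d968-09ce-48ea-8089-e9bb98772c43", {"renewTime": "..."})
except PreconditionFailed as err:
    print(err)  # formatted like the AdditionalErrorMsg in the log excerpt
```

Because the stored UID never changes on its own, the controller can retry indefinitely and never succeed, which matches the repeated "Failed to update lease" entries.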


Reproduction Steps

  1. Run Kubesolo v1.1.3 (or v1.1.2) as a systemd service on an edge device.
  2. Perform an unclean shutdown (e.g., hard power cut or cold boot after power loss).
  3. Power the device back on and observe Kubesolo startup via journalctl -u kubesolo -f.
  4. Observe the lease UID mismatch error and RBAC bootstrap failure in logs, followed by Kubesolo aborting.

Key Log Excerpts

Lease UID precondition failure:

E0427 04:50:26.534503 630 controller.go:195] "Failed to update lease"
err="Operation cannot be fulfilled on leases.coordination.k8s.io
\"apiserver-mglq5p6wzlxekntua2wo7lt2jm\": StorageError: invalid object,
Code: 4, Key: /registry/leases/kube-system/apiserver-mglq5p6wzlxekntua2wo7lt2jm,
ResourceVersion: 0, AdditionalErrorMsg: Precondition failed:
UID in precondition: 6e41d968-09ce-48ea-8089-e9bb98772c43, UID in object meta: ""

RBAC bootstrap failure and healthz:

[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
healthz check failed | component=apiserver/healthz

API server abort:

ERR executor.go:77 > API server failed to start: apiserver /healthz check failed:
component health check failed after multiple attempts | component=apiserver
INF executor.go:96 > terminating the API server... | component=apiserver
INF main.go:251 > shutdown requested before apiserver was ready... | component=kubesolo

Environment

  • OS: Debian GNU/Linux 13 (trixie), self-built via ISAR
  • Device: Siemens IOT2050
  • Kubesolo version: v1.1.3 (also reproduced on v1.1.2)
  • Deployment type: Edge device, systemd service, subject to unclean shutdowns / power loss

Impact

After an unclean shutdown or power loss, Kubesolo may fail to start entirely and will not recover automatically. On unattended edge deployments, this means the device is effectively dead until someone manually intervenes (e.g., clears the stale etcd state or restarts the service).
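For reference, the manual intervention currently required looks roughly like the sketch below: with Kubesolo stopped, delete the stale apiserver identity lease rows from the Kine datastore. The sqlite backend, the `kine` table layout, and the database path are assumptions based on upstream Kine, not verified Kubesolo internals — check your data directory before running anything like this.

```python
import sqlite3


def purge_stale_apiserver_leases(db_path: str) -> int:
    """Delete apiserver identity lease rows from a Kine-style sqlite datastore.

    Returns the number of rows removed. Run only while Kubesolo is stopped.
    """
    prefix = "/registry/leases/kube-system/apiserver-%"
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute("DELETE FROM kine WHERE name LIKE ?", (prefix,))
        return cur.rowcount


# Hypothetical path; adjust to wherever Kubesolo keeps its Kine database.
# purge_stale_apiserver_leases("/var/lib/kubesolo/kine.db")
```

On the next start the API server would then create a fresh identity lease with a new UID instead of fighting the leftover object.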


Questions / Request

  • On the edge, irregular shutdowns are more common than you might expect, so it would be amazing if we could somehow protect against this. Is there a known mechanism in Kubesolo/Kine to detect and clean up stale API server identity leases or a corrupted database on startup?
  • Should Kubesolo's startup logic delete or ignore stale lease objects whose UID no longer matches before attempting to start the API server, possibly behind a CLI flag or environment variable for those who like living on the edge?
  • As discussed previously, automatic datastore backups might also be an interesting direction to pursue.
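The second bullet could be sketched roughly as follows. This is an assumption about how such recovery might work, not existing Kubesolo behavior, and the store interface here is a toy in-memory stand-in: on a UID precondition failure during lease renewal, delete the stale object and recreate it instead of aborting startup.

```python
import uuid


class PreconditionFailed(Exception):
    """Stand-in for the storage layer's UID precondition error."""


class LeaseStore:
    """Tiny in-memory stand-in for the lease storage backend."""

    def __init__(self):
        self.objects = {}

    def update(self, key, expected_uid, spec):
        obj = self.objects[key]
        if obj["uid"] != expected_uid:
            raise PreconditionFailed(key)
        obj["spec"] = spec

    def delete(self, key):
        self.objects.pop(key, None)

    def create(self, key, uid, spec):
        self.objects[key] = {"uid": uid, "spec": spec}


def renew_or_recreate(store, key, uid, spec):
    """Renew the identity lease, recovering from a stale object left by a prior run."""
    try:
        store.update(key, uid, spec)
    except (KeyError, PreconditionFailed):
        store.delete(key)  # drop the leftover object from the previous boot
        store.create(key, uid, spec)


# Stale state from a previous boot: the lease exists but its UID was lost.
store = LeaseStore()
key = "/registry/leases/kube-system/apiserver-mglq5p6wzlxekntua2wo7lt2jm"
store.create(key, "", {})  # empty UID, as in the reported logs
new_uid = str(uuid.uuid4())
renew_or_recreate(store, key, new_uid, {"renewTime": "now"})
```

Gating this behind an opt-in flag would keep the default conservative while letting unattended edge deployments self-heal.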
