Crash of Controllers Following Reports-Server Upgrade to 0.1.3 #260

Open
sbathgate opened this issue Feb 19, 2025 · 0 comments
Description

After upgrading to version 0.1.3 and wiping the PVC/PV associated with the Kyverno reports-server, we experienced a cascade of failures. The reports-server crashed with an OOM followed by a series of duplicate key errors, which in turn caused the Admission Controller, Background Controller, and Reports Controller to become unavailable. This rendered the cluster unable to start any new workloads because the controllers couldn’t reach the appropriate endpoints and the reports-server was unable to recover.
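For anyone triaging a similar outage, one quick way to confirm that the "appropriate endpoints" being missed are the aggregated APIs served by the reports-server is to look at the APIService conditions. Below is a minimal sketch using the official Kubernetes Python client, not anything from the reports-server project itself; it just prints every aggregated (non-local) APIService that is not reporting Available.

```python
# Rough diagnostic sketch: list aggregated APIServices and flag any that are
# not Available. Uses only the standard Kubernetes Python client; no
# reports-server-specific names are assumed.
from kubernetes import client, config

config.load_kube_config()
apireg = client.ApiregistrationV1Api()

for svc in apireg.list_api_service().items:
    if svc.spec.service is None:
        continue  # served locally by the API server, not aggregated
    available = next(
        (c.status for c in (svc.status.conditions or []) if c.type == "Available"),
        "Unknown",
    )
    if available != "True":
        print(f"{svc.metadata.name}: Available={available}")
```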

Steps to Reproduce

I have admittedly been struggling to force a reproduction on a dev cluster, as this occurred naturally on production instances that required immediate resolution. However, the issue appeared to be triggered by the upgrade from 0.1.2 to 0.1.3. The clusters in question may not have had a fresh database (deleted PV/PVC) following the upgrade, but I hadn't seen any indication that this was a requirement. I also tried scaling the reports-server replicas down to 0, and the issue didn't occur. That said, we have witnessed it on two separate clusters since upgrading.

Expected Behavior

The outage of the reports-server should not impact the Admission Controller and related components. Since our policies are configured to fail safe, the controllers should remain operational even if the reports-server is temporarily down.
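On the "fail safe" point: what we expect (and what I'd suggest others verify after an upgrade) is that the Kyverno webhook configurations carry failurePolicy: Ignore, so a dead backend does not block admission. A rough sketch of that check with the Kubernetes Python client, assuming nothing beyond the standard admissionregistration API, looks like this:

```python
# Hedged sketch: print the failurePolicy of Kyverno-owned webhook
# configurations. If admission is meant to fail open, these should be "Ignore".
from kubernetes import client, config

config.load_kube_config()
admission = client.AdmissionregistrationV1Api()

for webhook_list in (
    admission.list_validating_webhook_configuration(),
    admission.list_mutating_webhook_configuration(),
):
    for cfg in webhook_list.items:
        if "kyverno" in cfg.metadata.name:
            for wh in cfg.webhooks or []:
                print(cfg.metadata.name, wh.name, wh.failure_policy)
```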

Actual Behavior

The failure of the reports-server led to all controllers becoming unavailable. This resulted in a substantial cluster outage, as no new workloads could start due to the missing endpoints and failed controllers.

Workaround / Recovery Steps

  • Manually scale down all components to 0, including the Postgres StatefulSet.
  • Delete the problematic PV/PVC. We probably could have just dropped the offending rows, but we went with the more nuclear approach.
  • Bring all components back up to restore full cluster functionality (a rough sketch of these steps follows below).
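For reference, this is roughly what the recovery looked like, sketched with the Kubernetes Python client. The namespace and the Deployment/StatefulSet/PVC names below are assumptions based on a default Helm install and will likely differ in your release, so treat this as an outline rather than a script to run as-is.

```python
# Recovery outline (assumed resource names; adjust to your Helm release).
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

NAMESPACE = "kyverno"  # assumed install namespace

def scale(kind, name, replicas):
    """Scale a Deployment or StatefulSet to the given replica count."""
    body = {"spec": {"replicas": replicas}}
    if kind == "deployment":
        apps.patch_namespaced_deployment_scale(name, NAMESPACE, body)
    else:
        apps.patch_namespaced_stateful_set_scale(name, NAMESPACE, body)

# 1. Scale everything down, including the Postgres StatefulSet.
for dep in ["kyverno-admission-controller", "kyverno-background-controller",
            "kyverno-reports-controller", "reports-server"]:
    scale("deployment", dep, 0)
scale("statefulset", "reports-server-postgresql", 0)

# 2. Delete the problematic PVC (the bound PV follows if its reclaim policy is Delete).
core.delete_namespaced_persistent_volume_claim(
    "data-reports-server-postgresql-0", NAMESPACE)

# 3. Bring everything back up.
scale("statefulset", "reports-server-postgresql", 1)
for dep in ["reports-server", "kyverno-admission-controller",
            "kyverno-background-controller", "kyverno-reports-controller"]:
    scale("deployment", dep, 1)
```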