Description
After upgrading to version 0.1.3 and wiping the PVC/PV associated with the Kyverno reports-server, we experienced a cascade of failures. The reports-server crashed with an OOM followed by a series of duplicate key errors, which in turn caused the Admission Controller, Background Controller, and Reports Controller to become unavailable. This left the cluster unable to start any new workloads, because the controllers could not reach the appropriate endpoints and the reports-server was unable to recover.
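For anyone trying to confirm the same symptoms, this is roughly how they showed up for us. The namespace and deployment names below assume a fairly default install and may differ in other setups:

```sh
# Check pod restart reasons (we saw OOMKilled) and look for the duplicate key
# errors in the previous container logs; names are assumptions from our install
kubectl -n kyverno get pods -o wide
kubectl -n kyverno logs deploy/reports-server --previous | grep -i "duplicate key"
```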
Steps to Reproduce
I have admittedly been struggling to force a reproduction on a dev cluster, as this occurred naturally on production instances that required immediate resolution. The issue did, however, appear to follow the upgrade from 0.1.2 to 0.1.3. The clusters in question may not have had a fresh database (PV/PVC deleted) after the upgrade, but I had not seen any indication that this was a requirement. On a dev cluster I tried scaling the reports-server replicas to 0 and could not trigger the issue, yet we have now seen it on two separate clusters since upgrading.
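For reference, this is the kind of thing I tried on the dev cluster; the deployment and namespace names are from our install and may differ:

```sh
# Simulate a reports-server outage on a dev cluster (did not trigger the failure)
kubectl -n kyverno scale deployment reports-server --replicas=0
# ...observe the Kyverno controllers for a while, then restore...
kubectl -n kyverno scale deployment reports-server --replicas=1
```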
Expected Behavior
An outage of the reports-server should not impact the Admission Controller and related components. Since our policies are configured to fail safe, the controllers should remain operational even if the reports-server is temporarily down.
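For context, this expectation is based on the webhook failure policy. A quick way to check how the webhooks are set to fail; the exact webhook names depend on the Kyverno version, so treat this as a sketch:

```sh
# Inspect how the Kyverno admission webhooks fail when their backend is unreachable;
# a failurePolicy of "Ignore" should let admission requests through during an outage
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy' \
  | grep -i kyverno
```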
Actual Behavior
The failure of the reports-server led to all controllers becoming unavailable. This resulted in a substantial cluster outage, as no new workloads could start due to the missing endpoints and failed controllers.
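One check that may be useful when this happens, since the reports-server serves an aggregated API; the API group names below are an assumption based on our install:

```sh
# Check whether the aggregated APIService(s) backed by the reports-server are
# still Available; if not, API discovery and the controllers can be affected
kubectl get apiservices | grep -iE 'wgpolicyk8s|reports.kyverno'
```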
Workaround / Recovery Steps
Manually scale all components down to 0, including the Postgres StatefulSet.
Delete the problematic PV/PVC. We likely could have just dropped the offending rows, but we went with the more nuclear approach.
Bring all components back up to restore full cluster functionality (roughly the commands sketched below).
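A rough sketch of the commands behind the steps above; all resource names are from our install (Helm release names etc.) and will likely differ in other setups:

```sh
# Scale everything down, including the Postgres StatefulSet
kubectl -n kyverno scale deployment reports-server kyverno-admission-controller \
  kyverno-background-controller kyverno-reports-controller --replicas=0
kubectl -n kyverno scale statefulset reports-server-postgresql --replicas=0
# Delete the PVC (and the bound PV if its reclaim policy keeps it around)
kubectl -n kyverno delete pvc data-reports-server-postgresql-0
# Bring everything back up
kubectl -n kyverno scale statefulset reports-server-postgresql --replicas=1
kubectl -n kyverno scale deployment reports-server kyverno-admission-controller \
  kyverno-background-controller kyverno-reports-controller --replicas=1
```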