Crash of Controllers Following Reports-Server Upgrade to 0.1.3 #260

Open
sbathgate opened this issue Feb 19, 2025 · 0 comments
Description

After upgrading to version 0.1.3 and wiping the PVC/PV associated with the Kyverno reports-server, we experienced a cascade of failures. The reports-server crashed with an OOM followed by a series of duplicate key errors, which in turn caused the Admission Controller, Background Controller, and Reports Controller to become unavailable. This rendered the cluster unable to start any new workloads because the controllers couldn’t reach the appropriate endpoints and the reports-server was unable to recover.
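For anyone triaging a similar outage, one quick way to confirm that the "appropriate endpoints" being missed are the aggregated APIs served by the reports-server is to look at the APIService conditions. Below is a minimal sketch using the official Kubernetes Python client, not anything from the reports-server project itself; it just prints every aggregated (non-local) APIService that is not reporting Available.

```python
# Rough diagnostic sketch: list aggregated APIServices and flag any that are
# not Available. Uses only the standard Kubernetes Python client; no
# reports-server-specific names are assumed.
from kubernetes import client, config

config.load_kube_config()
apireg = client.ApiregistrationV1Api()

for svc in apireg.list_api_service().items:
    if svc.spec.service is None:
        continue  # served locally by the API server, not aggregated
    available = next(
        (c.status for c in (svc.status.conditions or []) if c.type == "Available"),
        "Unknown",
    )
    if available != "True":
        print(f"{svc.metadata.name}: Available={available}")
```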

Steps to Reproduce

I have admittedly been struggling to force a reproduction on a dev cluster, as this occurred naturally on production instances that required immediate resolution. However, the issue appeared to be triggered by the upgrade from 0.1.2 to 0.1.3. The clusters in question may not have had a fresh database (deleted PV/PVC) following the upgrade, but I hadn't seen any indication that this was a requirement. I also tried scaling the reports-server replicas down to 0, and the issue didn't occur. That said, we have witnessed it on two separate clusters since upgrading.

Expected Behavior

The outage of the reports-server should not impact the Admission Controller and related components. Since our policies are configured to fail safe, the controllers should remain operational even if the reports-server is temporarily down.
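On the "fail safe" point: what we expect (and what I'd suggest others verify after an upgrade) is that the Kyverno webhook configurations carry failurePolicy: Ignore, so a dead backend does not block admission. A rough sketch of that check with the Kubernetes Python client, assuming nothing beyond the standard admissionregistration API, looks like this:

```python
# Hedged sketch: print the failurePolicy of Kyverno-owned webhook
# configurations. If admission is meant to fail open, these should be "Ignore".
from kubernetes import client, config

config.load_kube_config()
admission = client.AdmissionregistrationV1Api()

for webhook_list in (
    admission.list_validating_webhook_configuration(),
    admission.list_mutating_webhook_configuration(),
):
    for cfg in webhook_list.items:
        if "kyverno" in cfg.metadata.name:
            for wh in cfg.webhooks or []:
                print(cfg.metadata.name, wh.name, wh.failure_policy)
```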

Actual Behavior

The failure of the reports-server led to all controllers becoming unavailable. This resulted in a substantial cluster outage, as no new workloads could start due to the missing endpoints and failed controllers.

Workaround / Recovery Steps

  • Manually scale down all components to 0, including the Postgres StatefulSet.
  • Delete the problematic PV/PVC. We probably could have just dropped the offending rows, but we went with the more nuclear approach.
  • Bring all components back up to restore full cluster functionality (a rough sketch of these steps follows below).
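For reference, this is roughly what the recovery looked like, sketched with the Kubernetes Python client. The namespace and the Deployment/StatefulSet/PVC names below are assumptions based on a default Helm install and will likely differ in your release, so treat this as an outline rather than a script to run as-is.

```python
# Recovery outline (assumed resource names; adjust to your Helm release).
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

NAMESPACE = "kyverno"  # assumed install namespace

def scale(kind, name, replicas):
    """Scale a Deployment or StatefulSet to the given replica count."""
    body = {"spec": {"replicas": replicas}}
    if kind == "deployment":
        apps.patch_namespaced_deployment_scale(name, NAMESPACE, body)
    else:
        apps.patch_namespaced_stateful_set_scale(name, NAMESPACE, body)

# 1. Scale everything down, including the Postgres StatefulSet.
for dep in ["kyverno-admission-controller", "kyverno-background-controller",
            "kyverno-reports-controller", "reports-server"]:
    scale("deployment", dep, 0)
scale("statefulset", "reports-server-postgresql", 0)

# 2. Delete the problematic PVC (the bound PV follows if its reclaim policy is Delete).
core.delete_namespaced_persistent_volume_claim(
    "data-reports-server-postgresql-0", NAMESPACE)

# 3. Bring everything back up.
scale("statefulset", "reports-server-postgresql", 1)
for dep in ["reports-server", "kyverno-admission-controller",
            "kyverno-background-controller", "kyverno-reports-controller"]:
    scale("deployment", dep, 1)
```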