Add automatic disk scaling based on disk usage by kurtmc · Pull Request #27 · halter/valkey-cluster-operator

kurtmc · 2026-06-10T21:39:45Z

Users no longer need to set a disk size. The operator provisions data
volumes at 1Gi (or spec.storage's request, which becomes the
initial/minimum size), measures disk usage of every pod once a minute
by running df against the data mount, and when the fullest volume in
the cluster exceeds 50% used, grows the target size for all volumes by
50%, rounded up to a whole Gi. The current target is tracked in
status.storageSize; spec is never mutated and volumes only ever grow.
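
The growth rule above can be sketched as follows. This is a minimal illustration rather than the operator's actual code: the names `shouldGrow` and `nextTargetSize`, and the shape of the usage input, are assumptions made for the example. Only the 50% threshold, the 50% step, and the round-up to a whole Gi come from the description.

```go
package main

// gi is one gibibyte in bytes.
const gi int64 = 1 << 30

// usageThreshold is the used fraction above which growth is triggered
// (50%, per the PR description).
const usageThreshold = 0.5

// shouldGrow reports whether the fullest volume exceeds the threshold.
// usedFractions holds one used/capacity ratio per pod; this input shape
// is hypothetical.
func shouldGrow(usedFractions []float64) bool {
	for _, f := range usedFractions {
		if f > usageThreshold {
			return true
		}
	}
	return false
}

// nextTargetSize grows the current target (in bytes) by 50% and rounds
// the result up to a whole Gi.
func nextTargetSize(current int64) int64 {
	grown := (current*3 + 1) / 2      // ceil(1.5 * current)
	return (grown + gi - 1) / gi * gi // round up to a Gi boundary
}
```

Under these assumptions a volume that starts at 1Gi would step through 2Gi, 3Gi, 5Gi and 8Gi as usage keeps crossing the threshold.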

- spec.storage is now optional; defaults to the cluster default storage
  class with ReadWriteOnce access
- new optional spec.storageLimit caps growth; when reached (or when the
  StorageClass does not allow volume expansion) the operator emits a
  warning event and sets the StorageLimited status condition
- another expansion is only requested once every PVC has reached the
  current target capacity, so growth cannot compound while an expansion
  is in flight (EBS allows one modification per volume per ~6h)
- stable clusters requeue every minute to keep monitoring usage
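
The limit and in-flight gating described in the list above might be combined along these lines. This is a sketch under stated assumptions: `planExpansion`, the `decision` type, clamping the proposed size to the limit, and the order of the checks are all hypothetical and not taken from the operator's source.

```go
package main

// decision is a hypothetical result type used only in this sketch.
type decision string

const (
	waitForResize decision = "wait"    // a PVC has not reached the current target yet
	limited       decision = "limited" // growth is capped or not possible
	expand        decision = "expand"  // safe to raise the target
)

// planExpansion decides what to do once usage has crossed the threshold.
// target is the current target size, proposed is the size after growth,
// limit is the optional cap (0 means unset), expandable says whether the
// StorageClass allows volume expansion, and pvcCapacities are the
// observed capacities of every PVC. All sizes use the same unit.
func planExpansion(target, proposed, limit int64, expandable bool, pvcCapacities []int64) (decision, int64) {
	if !expandable {
		return limited, target
	}
	// Never stack a new expansion on top of one that is still in flight.
	for _, c := range pvcCapacities {
		if c < target {
			return waitForResize, target
		}
	}
	// Assumption: the proposed size is clamped to the limit.
	if limit > 0 && proposed > limit {
		proposed = limit
	}
	if proposed <= target {
		return limited, target
	}
	return expand, proposed
}
```

In this sketch the `limited` outcome is where the operator would emit its warning event and set the StorageLimited condition; the target itself never moves backwards.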

jzho987

How should down scaling work? Sounds like we just don't care about handling scaling down? Which is fine.

kurtmc · 2026-06-10T22:43:43Z

@jzho987 I think we should never scale down; that keeps it quite simple and safe.

The e2e "enable auth" test failed on a host with >50% disk usage: with the local-path provisioner, df inside the pods reports the node filesystem, so every reconcile crossed the threshold, attempted to grow, hit the non-expandable StorageClass, and rewrote the StorageLimited condition whose message embedded the fluctuating usage percentage. Each status write triggered another reconcile, adding exec and API churn during the auth rolling update.

- check StorageClass expandability before measuring: non-expandable clusters get one static StorageLimited condition and no df execs
- drop the usage percentage from condition messages so SetStatusCondition only fires on real transitions (the percentage stays in the one-shot expansion event)
- clear StorageLimitReached with a Normal event once usage recedes below the threshold

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
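
The churn caused by a fluctuating condition message can be illustrated with a small stand-in. The `condition` type and `setCondition` below are simplified, hypothetical substitutes for the Kubernetes condition helpers (they are not the apimachinery API), and `ExampleReason` is a placeholder. The sketch shows only the compare-then-write idea: a write happens when any field, including the message, differs.

```go
package main

// condition is a simplified stand-in for a Kubernetes status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// setCondition stores next and reports whether anything changed. A
// changed result is what would lead to a status write and, in turn,
// another reconcile.
func setCondition(conds *[]condition, next condition) bool {
	for i := range *conds {
		if (*conds)[i].Type != next.Type {
			continue
		}
		if (*conds)[i] == next {
			return false // identical: no write needed
		}
		(*conds)[i] = next
		return true
	}
	*conds = append(*conds, next)
	return true
}

// countStatusWrites replays one reconcile per message and counts how
// many of them would have written status.
func countStatusWrites(messages []string) int {
	var conds []condition
	writes := 0
	for _, msg := range messages {
		next := condition{Type: "StorageLimited", Status: "True", Reason: "ExampleReason", Message: msg}
		if setCondition(&conds, next) {
			writes++
		}
	}
	return writes
}
```

With a message that embeds the usage percentage, every reconcile in which the percentage moved counts as a write; with a static message only the first one does.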

- MaxConcurrentReconciles 2 -> 8 so one slow cluster cannot stall reconciliation of the others now that stable clusters requeue periodically for disk monitoring
- raise client-go rate limits to 50 QPS / 100 burst; the reconciler fans out exec, valkey and status calls across every pod of every cluster and the controller-runtime default of 20/30 throttles it
- lengthen the disk usage poll interval from 1 to 5 minutes; the 50% threshold with 50% growth steps leaves enough headroom that a five-minute detection latency is safe

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
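
As a rough back-of-envelope check on the headroom argument, one can estimate how long a volume sitting at the threshold takes to fill at a steady write rate and compare that with the poll interval. The helper `timeToFull` and the write rate used in the example are hypothetical; the sketch ignores how long the expansion itself takes.

```go
package main

import "time"

// timeToFull estimates how long a volume takes to fill at a constant
// write rate. It returns 0 when the volume is already full and the
// maximum duration when nothing is being written.
func timeToFull(capacityBytes, usedBytes int64, writeBytesPerSec float64) time.Duration {
	free := capacityBytes - usedBytes
	if free <= 0 {
		return 0
	}
	if writeBytesPerSec <= 0 {
		return time.Duration(1<<63 - 1)
	}
	seconds := float64(free) / writeBytesPerSec
	return time.Duration(seconds * float64(time.Second))
}
```

For instance, a 1Gi volume at exactly 50% used and an assumed sustained write rate of 1 MiB/s gives 512 seconds, which is longer than the five-minute poll interval; larger volumes at the same rate have proportionally more time.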

Enabling a password on a running cluster deadlocked the rolling update, deterministically: a restarted replica sends AUTH (from primaryauth) to a primary that has not restarted yet and so has no requirepass; valkey treats that as a fatal replication error ("Unable to AUTH to PRIMARY"), the replica's master link stays down, and the health-gated rolling update never proceeds. This is why the "enable auth" e2e test has been failing on main since the replication catch-up health gate was added.

The operator now live-applies requirepass/primaryauth via CONFIG SET to every running pod before the rolling restart, so replication stays authenticated in every mixed-config state and the restart only makes the config persistent. Password removal is handled by the same path.

Also fix two races in the upgrade e2e test, reproduced against a live cluster with a 2s GET watch (the key never disappeared; the lone empty read happened the moment the exec-target pod was being replaced):

- cluster state and version are now verified at the same instant; previously state could pass before the last pod was replaced and version while it was terminating, declaring the rollout done early
- the data check is retried; genuinely lost data never recovers, so retrying cannot mask real loss
- give the auth rollout (six health-gated pod restarts) 10 minutes

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
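
The live-apply step described above could be planned roughly as follows. This is a sketch, not the operator's code: `authCommands` and `planLiveAuth` are hypothetical names, the pod names are placeholders, and treating an empty password as "remove auth" is an assumption about how the shared path works. The commit message does not specify the order in which pods or settings are updated, so none is implied here.

```go
package main

// authCommands returns the commands that live-apply a password on one
// running pod. Assumption: an empty password clears both settings,
// which is how password removal would reuse the same path.
func authCommands(password string) [][]string {
	return [][]string{
		{"CONFIG", "SET", "requirepass", password},
		{"CONFIG", "SET", "primaryauth", password},
	}
}

// planLiveAuth maps every running pod to the commands to send before
// the rolling restart begins. The restart then only has to make the
// configuration persistent.
func planLiveAuth(pods []string, password string) map[string][][]string {
	plan := make(map[string][][]string, len(pods))
	for _, pod := range pods {
		plan[pod] = authCommands(password)
	}
	return plan
}
```

Because every pod receives both settings while it is still running, a replica restarted later with primaryauth in its config finds a primary that already requires the same password.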

jzho987 reviewed Jun 10, 2026

kurtmc and others added 3 commits June 11, 2026 10:45

kurtmc merged commit faef41e into main Jun 11, 2026
3 checks passed

kurtmc deleted the feature/auto-scale-disk branch June 11, 2026 01:12

kurtmc merged 4 commits into main from feature/auto-scale-disk
