Add automatic disk scaling based on disk usage#27

Merged: kurtmc merged 4 commits into main from feature/auto-scale-disk on Jun 11, 2026

@kurtmc kurtmc commented Jun 10, 2026

Member

Users no longer need to set a disk size. The operator provisions data
volumes at 1Gi (or spec.storage's request, which becomes the
initial/minimum size), measures disk usage of every pod once a minute
by running df against the data mount, and when the fullest volume in
the cluster exceeds 50% used, grows the target size for all volumes by
50%, rounded up to a whole Gi. The current target is tracked in
status.storageSize; spec is never mutated and volumes only ever grow.
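The growth rule above can be sketched as follows. This is a minimal illustration, not the operator's actual code: the names `shouldGrow`, `nextTargetGi`, and the two constants are hypothetical, and sizes are assumed to be tracked as whole Gi.

```go
package main

import "fmt"

const (
	usageThresholdPercent = 50 // grow once the fullest volume is above this
	growthPercent         = 50 // each step enlarges the target by this much
)

// shouldGrow reports whether the fullest volume exceeds the threshold.
// usedPercents holds one df-style "Use%" reading per pod.
func shouldGrow(usedPercents []int) bool {
	fullest := 0
	for _, p := range usedPercents {
		if p > fullest {
			fullest = p
		}
	}
	return fullest > usageThresholdPercent
}

// nextTargetGi grows currentGi by growthPercent and rounds up to a whole Gi.
// Integer arithmetic avoids floating-point rounding surprises.
func nextTargetGi(currentGi int64) int64 {
	return (currentGi*(100+growthPercent) + 99) / 100
}

func main() {
	size := int64(1)
	for i := 0; i < 5; i++ {
		next := nextTargetGi(size)
		fmt.Printf("%dGi -> %dGi\n", size, next)
		size = next
	}
}
```

Starting from the 1Gi default, successive steps under this rule are 2, 3, 5, 8, and 12Gi.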

  • spec.storage is now optional; defaults to the cluster default storage
    class with ReadWriteOnce access
  • new optional spec.storageLimit caps growth; when reached (or when the
    StorageClass does not allow volume expansion) the operator emits a
    warning event and sets the StorageLimited status condition
  • another expansion is only requested once every PVC has reached the
    current target capacity, so growth cannot compound while an expansion
    is in flight (EBS allows one modification per volume per ~6h)
  • stable clusters requeue every minute to keep monitoring usage

@jzho987 jzho987 left a comment

How should downscaling work? It sounds like we are simply not handling scale-down, which is fine.

kurtmc commented Jun 10, 2026

Member Author

@jzho987 I think we should never scale down; that keeps it simple and safe.

kurtmc and others added 3 commits June 11, 2026 10:45
The e2e "enable auth" test failed on a host with >50% disk usage: with
the local-path provisioner, df inside the pods reports the node
filesystem, so every reconcile crossed the threshold, attempted to
grow, hit the non-expandable StorageClass, and rewrote the
StorageLimited condition whose message embedded the fluctuating usage
percentage. Each status write triggered another reconcile, adding exec
and API churn during the auth rolling update.

- check StorageClass expandability before measuring: non-expandable
  clusters get one static StorageLimited condition and no df execs
- drop the usage percentage from condition messages so SetStatusCondition
  only fires on real transitions (the percentage stays in the one-shot
  expansion event)
- clear StorageLimitReached with a Normal event once usage recedes
  below the threshold
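The second bullet can be illustrated with a small model of "only write when the condition changes". The `condition` type and `setCondition` below are simplified stand-ins written for this sketch, not the apimachinery API, and the reason and message strings are invented for the example.

```go
package main

import "fmt"

// condition is a cut-down stand-in for a Kubernetes status condition.
type condition struct {
	Type, Status, Reason, Message string
}

// setCondition stores c and reports whether anything changed. A change is
// what would cause a status write, and therefore another reconcile.
func setCondition(conds *[]condition, c condition) bool {
	for i := range *conds {
		if (*conds)[i].Type == c.Type {
			if (*conds)[i] == c {
				return false
			}
			(*conds)[i] = c
			return true
		}
	}
	*conds = append(*conds, c)
	return true
}

// messageWithUsage embeds the fluctuating usage percentage in the message.
func messageWithUsage(usedPct int) string {
	return fmt.Sprintf("volume is %d%% full but cannot be expanded", usedPct)
}

// staticMessage ignores the usage reading entirely.
func staticMessage(int) string {
	return "StorageClass does not allow volume expansion"
}

// countWrites replays usage readings and counts the reconciles that would
// have changed the condition.
func countWrites(usages []int, message func(int) string) int {
	var conds []condition
	writes := 0
	for _, u := range usages {
		c := condition{Type: "StorageLimited", Status: "True", Reason: "NotExpandable", Message: message(u)}
		if setCondition(&conds, c) {
			writes++
		}
	}
	return writes
}

func main() {
	usages := []int{71, 72, 71, 73}
	fmt.Println(countWrites(usages, messageWithUsage)) // 4: every reading rewrites status
	fmt.Println(countWrites(usages, staticMessage))    // 1: only the first transition writes
}
```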

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- MaxConcurrentReconciles 2 -> 8 so one slow cluster cannot stall
  reconciliation of the others now that stable clusters requeue
  periodically for disk monitoring
- raise client-go rate limits to 50 QPS / 100 burst; the reconciler
  fans out exec, valkey and status calls across every pod of every
  cluster and the controller-runtime default of 20/30 throttles it
- lengthen the disk usage poll interval from 1 to 5 minutes; the 50%
  threshold with 50% growth steps leaves enough headroom that a
  five-minute detection latency is safe

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Enabling a password on a running cluster deadlocked the rolling update,
deterministically: a restarted replica sends AUTH (from primaryauth) to
a primary that has not restarted yet and so has no requirepass; valkey
treats that as a fatal replication error ("Unable to AUTH to PRIMARY"),
the replica's master link stays down, and the health-gated rolling
update never proceeds. This is why the "enable auth" e2e test has been
failing on main since the replication catch-up health gate was added.

The operator now live-applies requirepass/primaryauth via CONFIG SET to
every running pod before the rolling restart, so replication stays
authenticated in every mixed-config state and the restart only makes
the config persistent. Password removal is handled by the same path.
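The ordering described above (live-apply to every pod first, restart afterwards) might be planned like this. It is only a sketch: the `step` type and `planAuthRollout` are hypothetical, how commands reach each pod is not shown, and the order of the two CONFIG SET calls within a pod is arbitrary here. An empty password is passed through unchanged, which is how removal would flow down the same path.

```go
package main

import "fmt"

// step is one unit of work in the rollout plan.
type step struct {
	Pod  string
	Kind string   // "config-set" or "restart"
	Args []string // valkey command arguments; nil for a restart
}

// authConfigCommands returns the commands that live-apply a password to one
// running pod. An empty password clears both settings.
func authConfigCommands(password string) [][]string {
	return [][]string{
		{"CONFIG", "SET", "primaryauth", password},
		{"CONFIG", "SET", "requirepass", password},
	}
}

// planAuthRollout applies the auth settings to every pod before any pod is
// restarted, so no restarted replica ever meets a primary with different
// auth settings.
func planAuthRollout(pods []string, password string) []step {
	var plan []step
	for _, pod := range pods {
		for _, args := range authConfigCommands(password) {
			plan = append(plan, step{Pod: pod, Kind: "config-set", Args: args})
		}
	}
	for _, pod := range pods {
		plan = append(plan, step{Pod: pod, Kind: "restart"})
	}
	return plan
}

func main() {
	for _, s := range planAuthRollout([]string{"pod-0", "pod-1"}, "example-password") {
		fmt.Println(s.Pod, s.Kind, s.Args)
	}
}
```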

Also fix two races in the upgrade e2e test, reproduced against a live
cluster with a 2s GET watch (the key never disappeared; the lone empty
read happened the moment the exec-target pod was being replaced):
- cluster state and version are now verified at the same instant;
  previously state could pass before the last pod was replaced and
  version while it was terminating, declaring the rollout done early
- the data check is retried; genuinely lost data never recovers, so
  retrying cannot mask real loss
- give the auth rollout (six health-gated pod restarts) 10 minutes

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@kurtmc kurtmc merged commit faef41e into main Jun 11, 2026
3 checks passed
@kurtmc kurtmc deleted the feature/auto-scale-disk branch June 11, 2026 01:12