Description
Related to InditexTech/redkeyoperator#48
Observed Behavior
Robin retains redis-cluster-3 through redis-cluster-7 in its target node list (because its ConfigMap specifies primaries: 8) even though only 3 pods exist. It enters an infinite cycle:
- Robin tries to initialize each ghost node sequentially, each failing with "failed to connect after 10 retries"
- Robin attempts CLUSTER MEET with ghost nodes, which fails with "ERR Invalid node address specified: :6379" (empty IP since the pod doesn't exist)
- Robin alternates between ScalingUp, ScalingUpError, and CheckingIntegrity status
- The operator is stuck in ScalingDown / EndingFastScaling, polling Robin every 30 seconds
- The 3 running Redis pods have cluster_slots_assigned:0, meaning slots were never distributed after recreation
Additionally, when the operator asks Robin to recreate the cluster during an ongoing CheckIntegrity operation, Robin responds: Cluster cannot be recreated right now due to conflicting operation (operation=CheckIntegrity, status=Running).
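To illustrate where the ":6379" address comes from, here is a minimal sketch (not Robin's actual code) of a target list built from primaries: 8 while only 3 pods exist. The Node type, the helper names, and the pod IPs are assumptions made up for this example.

```python
from dataclasses import dataclass

REDIS_PORT = 6379


@dataclass
class Node:
    name: str
    ip: str  # empty string when no pod backs this node (hypothetical convention)


def target_nodes(prefix, primaries, pod_ips):
    """Build the target list from the configured primaries count."""
    return [
        Node(f"{prefix}-{i}", pod_ips.get(f"{prefix}-{i}", ""))
        for i in range(primaries)
    ]


def split_ghosts(nodes):
    """Separate nodes backed by a pod from ghost nodes with no IP."""
    live = [n for n in nodes if n.ip]
    ghosts = [n for n in nodes if not n.ip]
    return live, ghosts


def meet_address(node):
    """Address that would be handed to CLUSTER MEET for this node."""
    return f"{node.ip}:{REDIS_PORT}"


# Only three pods exist; the IPs are invented for the example.
pods = {
    "redis-cluster-0": "10.0.0.10",
    "redis-cluster-1": "10.0.0.11",
    "redis-cluster-2": "10.0.0.12",
}
nodes = target_nodes("redis-cluster", 8, pods)
live, ghosts = split_ghosts(nodes)
```

Under these assumptions, ghosts holds redis-cluster-3 through redis-cluster-7 and meet_address on any of them yields ":6379", matching the address in the error above. Filtering on an empty IP before attempting initialization or CLUSTER MEET is one possible guard, not necessarily the right fix for Robin.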
Solution
Robin should accept configuration changes at any phase.
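As a rough sketch of what "at any phase" could mean, the snippet below applies a new primaries value without checking the current status. RobinState and apply_config are hypothetical names for this example only; how Robin should then re-plan its work is left open.

```python
from dataclasses import dataclass


@dataclass
class RobinState:
    primaries: int  # desired primaries from the ConfigMap
    status: str     # current phase, e.g. "CheckingIntegrity"


def apply_config(state, new_primaries):
    """Accept a new primaries count regardless of the current status.

    Returns True when the desired value changed, so the caller knows it
    must re-evaluate the target node list.
    """
    if new_primaries < 0:
        raise ValueError("primaries must be non-negative")
    changed = new_primaries != state.primaries
    state.primaries = new_primaries
    return changed


# A stale primaries value is corrected while an operation is running.
state = RobinState(primaries=8, status="CheckingIntegrity")
changed = apply_config(state, 3)
```

The point of the sketch is only that the update is not rejected because of the running operation; the status is deliberately left untouched here.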
Steps to Reproduce
There is a test in the redkeyoperator code to reproduce it:
make test-chaos GINKGO_EXTRA_OPTS='--focus="recovers when Robin ConfigMap has stale primaries from failed scale-down"'
Expected Behavior
No response
Version / Environment
No response
Additional context or logs
No response