[Bug] Robin stuck when the configuration is not updated correctly #26

@DanielDorado

Description

Related to InditexTech/redkeyoperator#48

Observed Behavior

Robin keeps redis-cluster-3 through redis-cluster-7 in its target node list because its ConfigMap still specifies primaries: 8, even though only 3 pods exist. It then enters an infinite cycle:

  1. Robin tries to initialize each ghost node in turn; every attempt fails with "failed to connect after 10 retries".
  2. Robin attempts CLUSTER MEET with the ghost nodes, which fails with "ERR Invalid node address specified: :6379" (the IP is empty because the pod does not exist).
  3. Robin alternates between the ScalingUp, ScalingUpError, and CheckingIntegrity statuses.
  4. The operator remains stuck in ScalingDown / EndingFastScaling, polling Robin every 30 seconds.
  5. The 3 running Redis pods report cluster_slots_assigned:0; slots were never distributed after recreation.
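To make the ghost-node condition concrete, here is a minimal sketch of how a target list could be checked against the pods that actually exist before any initialization or CLUSTER MEET is attempted. This is not Robin's code: `NodeTarget`, `split_targets`, `meet_args`, and the pod IPs are hypothetical names and values chosen for illustration, and the `<prefix>-<index>` naming is assumed from the pod names quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeTarget:
    name: str
    ip: str  # empty string when the pod does not exist


def split_targets(prefix, primaries, pod_ips):
    """Split the desired node names into reachable targets and ghost names.

    pod_ips maps pod name -> IP for the pods that currently exist.
    """
    reachable, ghosts = [], []
    for index in range(primaries):
        name = f"{prefix}-{index}"
        ip = pod_ips.get(name, "")
        if ip:
            reachable.append(NodeTarget(name, ip))
        else:
            ghosts.append(name)
    return reachable, ghosts


def meet_args(node, port=6379):
    """Return the (ip, port) arguments for CLUSTER MEET, rejecting an empty IP."""
    if not node.ip:
        raise ValueError(f"{node.name} has no IP; not sending CLUSTER MEET to ':{port}'")
    return node.ip, port


# Hypothetical state: ConfigMap says primaries: 8, but only 3 pods exist.
pods = {
    "redis-cluster-0": "10.0.0.10",
    "redis-cluster-1": "10.0.0.11",
    "redis-cluster-2": "10.0.0.12",
}
reachable, ghosts = split_targets("redis-cluster", 8, pods)
print([node.name for node in reachable])
print(ghosts)
```

With this kind of check, the five missing pods show up as an explicit list of ghosts instead of being retried one by one, which is the point at which a stale primaries value could be reported or reloaded.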

Additionally, when the operator asks Robin to recreate the cluster while a CheckIntegrity operation is running, Robin refuses with: "Cluster cannot be recreated right now due to conflicting operation (operation=CheckIntegrity, status=Running)".

Solution

Robin should accept configuration changes in any phase, including while another operation is running.
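One possible shape for this, sketched under assumptions: a configuration update is always accepted, and a running operation checks for a pending update between steps and aborts when its inputs have gone stale. The `Reconciler` class and its `update_config` and `checkpoint` methods are hypothetical and do not reflect Robin's actual structure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Reconciler:
    primaries: int
    operation: Optional[str] = None  # e.g. "CheckIntegrity" or "ScalingUp"
    pending_primaries: Optional[int] = None
    log: list = field(default_factory=list)

    def update_config(self, primaries):
        """Always accept the new value; the current phase never causes a rejection."""
        if self.operation is None:
            self.primaries = primaries
            self.log.append(f"applied primaries={primaries}")
        else:
            # Keep only the latest value; it is applied at the next checkpoint.
            self.pending_primaries = primaries
            self.log.append(f"queued primaries={primaries} during {self.operation}")
        return True

    def checkpoint(self):
        """Called between steps of a running operation.

        Returns True when the operation must abort because its inputs are stale.
        """
        if self.pending_primaries is None:
            return False
        self.primaries = self.pending_primaries
        self.pending_primaries = None
        self.log.append(f"aborted {self.operation}; applied primaries={self.primaries}")
        self.operation = None
        return True


# Hypothetical scenario: a scale-up towards 8 primaries is interrupted by an update to 3.
robin = Reconciler(primaries=8, operation="ScalingUp")
accepted = robin.update_config(3)
aborted = robin.checkpoint()
print(accepted, aborted, robin.primaries, robin.operation)
```

The design choice here is that the update is queued rather than applied mid-step, so an operation never sees its target change underneath it, yet it cannot keep retrying against a node list that the configuration no longer describes.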

Steps to Reproduce

A chaos test in the redkeyoperator repository reproduces the issue:

 make test-chaos GINKGO_EXTRA_OPTS='--focus="recovers when Robin ConfigMap has stale primaries from failed scale-down"'
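Once the state is reproduced, the symptom from item 5 above (cluster_slots_assigned:0) can be confirmed from the text of a CLUSTER INFO reply. The sketch below only parses such a reply; `parse_cluster_info` and `slots_missing` are hypothetical helper names, and the sample string is a made-up illustration, not captured output.

```python
TOTAL_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def parse_cluster_info(text):
    """Parse the key:value lines of a CLUSTER INFO reply into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def slots_missing(text):
    """Return how many of the 16384 slots are unassigned according to the reply."""
    assigned = int(parse_cluster_info(text).get("cluster_slots_assigned", "0"))
    return TOTAL_SLOTS - assigned


# Illustrative reply text only.
sample = "cluster_state:fail\r\ncluster_slots_assigned:0\r\ncluster_known_nodes:1\r\n"
print(slots_missing(sample))
```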

Expected Behavior

No response

Version / Environment

No response

Additional context or logs

No response

Labels: bug
