Description
Using redis-py==6.2.0. I have a 6-node Redis Cluster running in EKS (3 primaries / 3 replicas), and whenever one of the primary nodes goes down, my service raises TimeoutError despite the replica being available, having a sufficient Retry on the client, and using client-side load balancing with LoadBalancingStrategy.ROUND_ROBIN.
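For context, the production client is configured roughly along these lines (a minimal sketch, with the env var values as placeholders; the exact timeouts differ in the real service):
import os

from redis.backoff import NoBackoff
from redis.cluster import LoadBalancingStrategy, RedisCluster
from redis.retry import Retry

# Production-style client: reads are round-robined between a primary and
# its replica, and transient errors are retried up to 3 times with no backoff.
rc = RedisCluster(
    host=os.getenv("REDIS_HOST"),
    port=os.getenv("REDIS_PORT"),
    password=os.getenv("REDIS_PASSWORD"),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    retry=Retry(backoff=NoBackoff(), retries=3),
    socket_connect_timeout=1.0,
    require_full_coverage=False,
    decode_responses=True,
)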
I've boiled it down to the following so I can better observe what's happening in the RedisCluster._internal_execute_command retry loop (note the client doesn't have a Retry on it in this example because I'm handling it with the manual loop, but rest assured the running service has something like retry=Retry(backoff=NoBackoff(), retries=3) on it):
import os

from redis.cluster import LoadBalancingStrategy, RedisCluster

rc = RedisCluster(
    host=os.getenv('REDIS_HOST'),
    port=os.getenv('REDIS_PORT'),
    password=os.getenv('REDIS_PASSWORD'),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    socket_connect_timeout=1.0,
    require_full_coverage=False,
    decode_responses=True,
)

rc.set("foo", "bar")                    # True
slot = rc.determine_slot("GET", "foo")  # 12182
lbs = rc.load_balancing_strategy        # ROUND_ROBIN

for i in range(5):
    print(f"\nAttempt {i + 1}")
    try:
        # Mimic the node selection done inside _internal_execute_command:
        # ask the read load balancer which node in this slot to hit next.
        primary_name = rc.nodes_manager.slots_cache[slot][0].name
        n_slots = len(rc.nodes_manager.slots_cache[slot])
        node_idx = rc.nodes_manager.read_load_balancer.get_server_index(primary_name, n_slots, lbs)
        node = rc.nodes_manager.slots_cache[slot][node_idx]
        print(f"idx: {node_idx} | node: {node.name} | type: {node.server_type}")
        print(repr(rc._execute_command(node, "GET", "foo")))
    except Exception as e:
        print(f"Exception: {e}")
With a healthy cluster, this will output:
Attempt 1
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'
Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'
Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'
Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'
Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'
If I kill the primary node in EKS (kubectl delete pod redis-node-3, where this was the 100.66.97.179 pod) and run the loop again, I get the following (until EKS gets redis-node-3 back up and running):
Attempt 1
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'
Attempt 2
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Attempt 4
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Basically, as soon as I get the TimeoutError from the primary node, the load balancer gets stuck: instead of bouncing between the primary and the replica, it keeps trying the primary over and over.
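To illustrate what "stuck" looks like, here is a minimal, self-contained sketch. It's a toy round-robin index, not redis-py's actual LoadBalancer, and the idea that the per-primary index gets cleared after a failed attempt is my assumption about what might be happening:
# Toy model of the behaviour observed above (an assumption for illustration,
# not redis-py's LoadBalancer code): a per-primary round-robin index that
# only alternates while its state is kept between calls.
class ToyRoundRobin:
    def __init__(self) -> None:
        self.primary_to_idx: dict[str, int] = {}

    def get_server_index(self, primary: str, list_size: int) -> int:
        idx = self.primary_to_idx.setdefault(primary, 0)
        self.primary_to_idx[primary] = (idx + 1) % list_size
        return idx

    def reset(self) -> None:
        # If something like this runs after every failed attempt (e.g. as part
        # of a topology refresh), the next call starts back at index 0
        # (the primary), which matches the stuck output above.
        self.primary_to_idx.clear()


lb = ToyRoundRobin()
print([lb.get_server_index("primary:6379", 2) for _ in range(4)])  # [0, 1, 0, 1]

lb_reset_each_time = ToyRoundRobin()
seq = []
for _ in range(4):
    seq.append(lb_reset_each_time.get_server_index("primary:6379", 2))
    lb_reset_each_time.reset()
print(seq)  # [0, 0, 0, 0]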
If I instead kill the replica node in EKS, I get exactly what I'd expect: the load balancer still round-robins between both nodes:
Attempt 1
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'
Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server
Attempt 3
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'
Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server
Attempt 5
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'