Description
Using redis-py==6.2.0. I have a 6-node Redis Cluster running in EKS (3 primaries / 3 replicas), and whenever one of the primary nodes goes down, my service raises TimeoutError despite the replica being available, having a sufficient Retry on the client, and using client-side load balancing with LoadBalancingStrategy.ROUND_ROBIN.
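For context, the production client is configured roughly along these lines (a minimal sketch, with the env var values as placeholders; the exact timeouts differ in the real service):
import os

from redis.backoff import NoBackoff
from redis.cluster import LoadBalancingStrategy, RedisCluster
from redis.retry import Retry

# Production-style client: reads are round-robined between a primary and
# its replica, and transient errors are retried up to 3 times with no backoff.
rc = RedisCluster(
    host=os.getenv("REDIS_HOST"),
    port=os.getenv("REDIS_PORT"),
    password=os.getenv("REDIS_PASSWORD"),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    retry=Retry(backoff=NoBackoff(), retries=3),
    socket_connect_timeout=1.0,
    require_full_coverage=False,
    decode_responses=True,
)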
I've boiled it down to the following so I can better observe what's happening in the RedisCluster._internal_execute_command retry loop (note the client doesn't have a Retry on it in this example because I'm handling it with the manual loop, but rest assured the running service has something like retry=Retry(backoff=NoBackoff(), retries=3) on it):
import os

from redis.cluster import LoadBalancingStrategy, RedisCluster

rc = RedisCluster(
    host=os.getenv('REDIS_HOST'),
    port=os.getenv('REDIS_PORT'),
    password=os.getenv('REDIS_PASSWORD'),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    socket_connect_timeout=1.0,
    require_full_coverage=False,
    decode_responses=True,
)

rc.set("foo", "bar")                    # True
slot = rc.determine_slot("GET", "foo")  # 12182
lbs = rc.load_balancing_strategy        # ROUND_ROBIN

for i in range(5):
    print(f"\nAttempt {i + 1}")
    try:
        # Mimic the node selection done inside _internal_execute_command:
        # ask the read load balancer which node in this slot to hit next.
        primary_name = rc.nodes_manager.slots_cache[slot][0].name
        n_slots = len(rc.nodes_manager.slots_cache[slot])
        node_idx = rc.nodes_manager.read_load_balancer.get_server_index(primary_name, n_slots, lbs)
        node = rc.nodes_manager.slots_cache[slot][node_idx]
        print(f"idx: {node_idx} | node: {node.name} | type: {node.server_type}")
        print(repr(rc._execute_command(node, "GET", "foo")))
    except Exception as e:
        print(f"Exception: {e}")
With a healthy cluster, this will output:
Attempt 1
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'
Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'
Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'
Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'
Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'
If I kill the primary node in EKS (kubectl delete pod redis-node-3, where this was the 100.66.97.179 pod) and run the loop again, I get the following (until EKS gets redis-node-3 back up and running):
Attempt 1
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'
Attempt 2
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Attempt 4
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Basically, as soon as I get the TimeoutError from the primary node, the load balancer gets stuck: instead of bouncing between the primary and the replica, it keeps trying the primary over and over.
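To illustrate what "stuck" looks like, here is a minimal, self-contained sketch. It's a toy round-robin index, not redis-py's actual LoadBalancer, and the idea that the per-primary index gets cleared after a failed attempt is my assumption about what might be happening:
# Toy model of the behaviour observed above (an assumption for illustration,
# not redis-py's LoadBalancer code): a per-primary round-robin index that
# only alternates while its state is kept between calls.
class ToyRoundRobin:
    def __init__(self) -> None:
        self.primary_to_idx: dict[str, int] = {}

    def get_server_index(self, primary: str, list_size: int) -> int:
        idx = self.primary_to_idx.setdefault(primary, 0)
        self.primary_to_idx[primary] = (idx + 1) % list_size
        return idx

    def reset(self) -> None:
        # If something like this runs after every failed attempt (e.g. as part
        # of a topology refresh), the next call starts back at index 0
        # (the primary), which matches the stuck output above.
        self.primary_to_idx.clear()


lb = ToyRoundRobin()
print([lb.get_server_index("primary:6379", 2) for _ in range(4)])  # [0, 1, 0, 1]

lb_reset_each_time = ToyRoundRobin()
seq = []
for _ in range(4):
    seq.append(lb_reset_each_time.get_server_index("primary:6379", 2))
    lb_reset_each_time.reset()
print(seq)  # [0, 0, 0, 0]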
If I instead kill the replica node in EKS, I get exactly what I'd expect: the load balancer still round-robins between both nodes:
Attempt 1
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'
Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server
Attempt 3
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'
Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server
Attempt 5
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'