Expected behavior

The command should be interrupted once the configured timeout elapses.

Actual behavior

When performing Jedis operations in the production environment, the system experiences lags lasting several minutes. After troubleshooting with jstack, we found that numerous threads enter a WAITING state after calling getSlotConnection(). Upon examining the source code of JedisClusterInfoCache, I noticed that this class uses a ReentrantReadWriteLock, leading me to suspect that a held write lock is causing the prolonged read-lock blocking. Based on this, I built a utility to proactively acquire the write lock and reproduced the issue as outlined below:
1. Initialize JedisCluster.
2. Acquire the write lock.
3. Start a child thread to execute the get command (a thread holding the write lock can still acquire the read lock, so executing get in the same thread would not reproduce the blocking).
// STEP 1: init cluster
JedisCluster cluster = JedisClient.getCluster();
// STEP 2: init lock util
JedisClusterInfoCacheLockUtil util = new JedisClusterInfoCacheLockUtil(cluster);
// STEP 3: acquire the write lock
util.lockWrite();
// STEP 4: start a child thread to execute the get command
// (executing get in the same thread would re-enter the lock and not block)
Executors.newFixedThreadPool(1).execute(() -> {
    // At this point, execution blocks indefinitely.
    cluster.get("test-key");
    log.info("sub finish");
});
Thread.sleep(10000000);
util.unlockWrite();
log.info("main finish");
Execution result:
The child thread’s get command stalls indefinitely, waiting for the write lock to be released. Even with maxWaitMillis, connectionTimeout, soTimeout, and maxAttempts configured, nothing interrupts the blocked call: those settings govern pool borrowing and socket I/O, and the thread never gets that far because it is parked on the lock itself. In production this shows up as a multi-minute blocking delay.
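The mechanism can be reproduced without Jedis at all: ReentrantReadWriteLock.readLock().lock() has no timeout and does not abort on interruption, so no Jedis-level timeout can fire while a thread is parked there. A minimal standalone sketch:

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockBlockDemo {
    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
        rwl.writeLock().lock(); // main thread holds the write lock, like the demo above

        Thread reader = new Thread(() -> {
            // Parks here with no timeout; this is the state jstack reports as WAITING.
            rwl.readLock().lock();
            try {
                System.out.println("reader acquired the read lock");
            } finally {
                rwl.readLock().unlock();
            }
        });
        reader.start();

        Thread.sleep(2000);
        reader.interrupt(); // no effect: lock() acquires uninterruptibly
        Thread.sleep(1000);
        System.out.println("reader state after interrupt: " + reader.getState()); // WAITING

        rwl.writeLock().unlock(); // only releasing the write lock unblocks the reader
        reader.join();
    }
}

The reader thread stays in WAITING (matching the jstack observation) until the write lock is released; lock() swallows the interrupt and re-asserts the interrupt flag instead of throwing.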
ENV
Jedis Configuration
maxWaitMillis: 4000ms
connectionTimeout: 2000ms
maxAttempts: 3
soTimeout: 350ms
Jedis version: 3.5.0
Redis version: 6.2.14
Java version: 8
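For context, here is a hedged sketch of how these settings are typically wired into a JedisCluster (using a 3.5.0-era constructor; the node address is a placeholder). Notably, maxWaitMillis is not a Jedis parameter at all but comes from Apache Commons Pool's GenericObjectPoolConfig, which may be why it is hard to find in the Jedis codebase:

import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class ClusterClientFactory {

    // Hedged sketch mapping the reported settings onto JedisCluster 3.5.0.
    public static JedisCluster create() {
        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
        // maxWaitMillis bounds borrowing a connection from the pool,
        // not lock acquisition inside JedisClusterInfoCache.
        poolConfig.setMaxWaitMillis(4000);

        return new JedisCluster(
                new HostAndPort("127.0.0.1", 7000), // placeholder node
                2000,  // connectionTimeout (ms)
                350,   // soTimeout (ms)
                3,     // maxAttempts
                poolConfig);
    }
}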
Sorry for the long delay in responding. I am trying to understand the possible ways you could end up with what you experienced.
The configuration parameters you mentioned:
Jedis Configuration
maxWaitMillis: 4000ms
connectionTimeout: 2000ms
maxAttempts: 3
soTimeout: 350ms
are all (except maxWaitMillis, which I was not able to find anywhere) centered on the transport-layer protocol: they target its limitations and weaknesses and attempt to make behavior more practical and predictable.
What you are demonstrating, on the other hand, is a race condition/deadlock in client-side code.
Could you provide the stack trace of the WAITING threads, as well as more information about the getSlotConnection() method you mentioned?