Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] fix ClientConductor concurrent bug #1772

Conversation

WorkingChen
Copy link
Contributor

java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1 at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100) ~[?:?] at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106) ~[?:?] at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302) ~[?:?] at java.base/java.util.Objects.checkIndex(Objects.java:385) ~[?:?] at java.base/java.util.ArrayList.get(ArrayList.java:427) ~[?:?] at io.aeron.ClientConductor.checkLingeringResources(ClientConductor.java:1655) ~[aeron-client-1.46.8.jar!/:1.46.8] at io.aeron.ClientConductor.checkTimeouts(ClientConductor.java:1565) ~[aeron-client-1.46.8.jar!/:1.46.8] at io.aeron.ClientConductor.service(ClientConductor.java:1470) ~[aeron-client-1.46.8.jar!/:1.46.8] at io.aeron.ClientConductor.awaitResponse(ClientConductor.java:1529) ~[aeron-client-1.46.8.jar!/:1.46.8] at io.aeron.ClientConductor.addExclusivePublication(ClientConductor.java:473) ~[aeron-client-1.46.8.jar!/:1.46.8] at io.aeron.Aeron.addExclusivePublication(Aeron.java:294) ~[aeron-client-1.46.8.jar!/:1.46.8] at io.aeron.archive.client.AeronArchive.connect(AeronArchive.java:245) ~[aeron-archive-1.46.8.jar!/:1.46.8] at io.aeron.cluster.service.ClusteredServiceAgent.loadSnapshot(ClusteredServiceAgent.java:892) ~[aeron-cluster-1.46.8.jar!/:1.46.8] at io.aeron.cluster.service.ClusteredServiceAgent.recoverState(ClusteredServiceAgent.java:754) ~[aeron-cluster-1.46.8.jar!/:1.46.8] at io.aeron.cluster.service.ClusteredServiceAgent.onStart(ClusteredServiceAgent.java:182) ~[aeron-cluster-1.46.8.jar!/:1.46.8] at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:150) ~[agrona-1.23.1.jar!/:1.23.1] at java.base/java.lang.Thread.run(Thread.java:1584) [?:?]

@WorkingChen
Copy link
Contributor Author

We need to protect the io.aeron.ClientConductor#releaseLogBuffers method with clientLock and avoid manipulating lingeringLogBuffers objects concurrently

image

@TomatoCream
Copy link

TomatoCream commented Mar 27, 2025

We might need to lock this part too. When closing Subscription, this method will be called by a non ClientConductor thread.

    void closeImages(final Image[] images, final UnavailableImageHandler unavailableImageHandler, final long lingerNs)
    {
        for (final Image image : images)
        {
            image.close();
        }

        for (final Image image : images)
        {
            releaseLogBuffers(image.logBuffers(), image.correlationId(), lingerNs);
        }

        if (null != unavailableImageHandler)
        {
            for (final Image image : images)
            {
                notifyImageUnavailable(unavailableImageHandler, image);
            }
        }
    }

@WorkingChen
Copy link
Contributor Author

We might need to lock this part too. When closing Subscription, this method will be called by a non ClientConductor thread.

    void closeImages(final Image[] images, final UnavailableImageHandler unavailableImageHandler, final long lingerNs)
    {
        for (final Image image : images)
        {
            image.close();
        }

        for (final Image image : images)
        {
            releaseLogBuffers(image.logBuffers(), image.correlationId(), lingerNs);
        }

        if (null != unavailableImageHandler)
        {
            for (final Image image : images)
            {
                notifyImageUnavailable(unavailableImageHandler, image);
            }
        }
    }

yeah, the closeImages method is called by internalClose(), which has already been protected by clientLock in current branch.

@vyazelenko
Copy link
Contributor

False positive.

@vyazelenko vyazelenko closed this Mar 27, 2025
@vyazelenko
Copy link
Contributor

For posterity: ConsensusModule and ClusteredServiceContainer were configured using a shared AeronArchive.Context instance which lead to ConsensusModule's Aeron instance (which has no lock) being used by multiple threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants