A stream with a group of SACs may fail to "elect" an active consumer after a rolling cluster restart #9159

Closed
michaelklishin opened this issue Aug 23, 2023 · 4 comments

Comments

@michaelklishin
Collaborator

michaelklishin commented Aug 23, 2023

A user has reported the following scenario, which can be reproduced every few runs. It looks very similar to #7743 but occurs on a more recent version (e.g. 3.11.18).

The steps to reproduce are:

  1. Cluster three nodes, add some stream SAC consumers
  2. Restart RabbitMQ using a K8S rollout restart (so pods are restarted one by one)
  3. Observe no active SAC consumer after the cluster restart (evidence is collected using rabbitmqctl list_stream_consumers)
  4. Active consumer is not picked until a restart of the consumer applications
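
The check in step 3 can be automated once the consumer list has been collected. The following is a minimal sketch, not part of the original report: the ConsumerRow class and its fields (stream, reference, active) are hypothetical and assume that each row of the rabbitmqctl list_stream_consumers output has already been parsed into these values, with the consumer name used as the SAC group reference. Parsing the CLI output is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SacGroupCheck {

    /** One consumer row. Hypothetical shape: assumed to be parsed from the CLI output. */
    static final class ConsumerRow {
        final String stream;
        final String reference; // consumer name, assumed to identify the SAC group
        final boolean active;

        ConsumerRow(String stream, String reference, boolean active) {
            this.stream = stream;
            this.reference = reference;
            this.active = active;
        }
    }

    /** Returns the "stream/reference" keys of groups in which no consumer is active. */
    static Set<String> groupsWithoutActiveConsumer(List<ConsumerRow> rows) {
        // Track, per group, whether at least one member is reported as active.
        Map<String, Boolean> hasActive = new LinkedHashMap<>();
        for (ConsumerRow row : rows) {
            String group = row.stream + "/" + row.reference;
            hasActive.merge(group, row.active, Boolean::logicalOr);
        }
        // Collect the groups that never saw an active member.
        Set<String> stuck = new TreeSet<>();
        for (Map.Entry<String, Boolean> entry : hasActive.entrySet()) {
            if (!entry.getValue()) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }
}
```

A non-empty result after the rollout has finished would correspond to the state described in steps 3 and 4.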

With 3.11.2 (which does not include #7743) it can be reproduced every 2-3 runs.
With 3.11.18 (which does include #7743), it takes 10 to 15 attempts but the issue can still be reproduced.

Consumer setup code

Using RabbitMQ Java Stream client 0.10.0:

    ConsumerBuilder createConsumerBuilder(String steadyStream) {
        return rabbitMQStreamEnvironment.consumerBuilder()
                .name(SERVICE_NAME)
                .stream(steadyStream)
                .singleActiveConsumer()
                // use OffsetSpecification.next() in all cases to resume consuming where it left off
                .offset(OffsetSpecification.next())
                // use manualTrackingStrategy() because we only want to commit the offset if certain conditions are met
                .manualTrackingStrategy()
                .checkInterval(Duration.ofSeconds(rabbitMQProperties.getManualTrackingStrategyInterval()))
                .builder();
    }

Logged Exception

[warning] <0.1328.0> rabbit_stream_coordinator: failed to stop member [redacted]-publisher-prd_1680603078129697429 'rabbit@[redacted node 1]' Error: {{nodedown,'rabbit@[redacted node 1]'},{gen_server,call,[{osiris_server_sup,'rabbit@[redacted node 1]'},{terminate_child, …

Environment details

I cannot publish a collect-env tarball publicly but it will be available for the core team to inspect.

@michaelklishin
Collaborator Author

It turns out that #7743 introduced a feature flag, stream_sac_coordinator_unblock_group, that is disabled in this environment. This means the change in #7743 is only partially applied at best.

Once the flag is enabled, the reporter will conduct a new round of tests.

@michaelklishin
Collaborator Author

It looks like the issue can still be reproduced post-#7743 with the feature flag enabled, just with a probability that is several times lower :(

@mkuratczyk
Contributor

I've performed ~50 attempts each with main and 3.11.18 (for a total of ~100 attempts) and this hasn't happened. 🤷

@acogoluegnes
Contributor

The reporter mentioned they could not reproduce the issue on 3.12.3.
