A stream with a group of SACs may fail to "elect" an active consumer after a rolling cluster restart #9159

michaelklishin · 2023-08-23T12:40:35Z

Someone has reported the following scenario that can be reproduced every few runs and looks very similar to #7743 but with a more recent version (e.g. 3.11.18).

The steps to reproduce are:

1. Cluster three nodes, add some stream SAC consumers
2. Restart RabbitMQ using a K8S rollout restart (so pods are restarted one by one)
3. Observe no active SAC consumer after the cluster restart (evidence is collected using rabbitmqctl list_stream_consumers)
4. Active consumer is not picked until a restart of the consumer applications
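
To make the check in the third step repeatable across many restart attempts, the captured output of rabbitmqctl list_stream_consumers could be scanned automatically. The following is a minimal sketch, not part of the original report. It assumes the output is tab-separated, that it has a header row containing `stream` and `active` columns, and that `active` is printed as `true` or `false`. The class name `SacActiveCheck` and the sample rows are hypothetical. It groups by stream only, which is a simplification, since a SAC group is identified by the stream together with the consumer name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: flags streams whose listed consumers are all inactive.
final class SacActiveCheck {

    /**
     * Returns the streams that have at least one consumer row but no row
     * marked active. Assumes tab-separated text with a header row that
     * contains "stream" and "active" columns (an assumption about the
     * CLI output format, not something confirmed by the report).
     */
    static List<String> streamsWithoutActiveConsumer(String cliOutput) {
        int streamCol = -1;
        int activeCol = -1;
        // stream name -> whether at least one consumer row is marked active
        Map<String, Boolean> hasActive = new LinkedHashMap<>();

        for (String line : cliOutput.split("\\R")) {
            if (line.isBlank()) {
                continue;
            }
            String[] cells = line.split("\t", -1);
            for (int i = 0; i < cells.length; i++) {
                cells[i] = cells[i].trim();
            }
            if (streamCol < 0 || activeCol < 0) {
                // Still looking for the header row; banner lines are skipped.
                List<String> header = Arrays.asList(cells);
                streamCol = header.indexOf("stream");
                activeCol = header.indexOf("active");
                continue;
            }
            if (cells.length <= Math.max(streamCol, activeCol)) {
                continue; // malformed or truncated row
            }
            boolean active = "true".equals(cells[activeCol]);
            hasActive.merge(cells[streamCol], active, Boolean::logicalOr);
        }

        List<String> result = new ArrayList<>();
        hasActive.forEach((stream, active) -> {
            if (!active) {
                result.add(stream);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data in the assumed format.
        String sample = String.join("\n",
                "Listing stream consumers ...",
                "subscription_id\tstream\tactive\tactivity_status",
                "0\torders\tfalse\twaiting",
                "0\torders\tfalse\twaiting",
                "0\tinvoices\ttrue\tsingle_active",
                "0\tinvoices\tfalse\twaiting");
        System.out.println(streamsWithoutActiveConsumer(sample)); // prints [orders]
    }
}
```

A non-empty result after a rollout restart would correspond to the symptom described above: consumers are registered on the stream, but none of them has been made active.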

With 3.11.2 (which does not include #7743) it can be reproduced every 2-3 runs.
With 3.11.18 (which does include #7743), it takes 10 to 15 attempts but the issue still
can be reproduced.

Consumer setup code

Using RabbitMQ Java Stream client 0.10.0:

    ConsumerBuilder createConsumerBuilder(String steadyStream) {
        return rabbitMQStreamEnvironment.consumerBuilder()
                .name(SERVICE_NAME)
                .stream(steadyStream)
                .singleActiveConsumer()
                // use OffsetSpecification.next() for all cases to start consuming where it left off
                .offset(OffsetSpecification.next())
                // use manualTrackingStrategy() because we want to commit offset if certain conditions are met
                .manualTrackingStrategy()
                .checkInterval(Duration.ofSeconds(rabbitMQProperties.getManualTrackingStrategyInterval()))
                .builder();
    }

Logged Exception

[warning] <0.1328.0> rabbit_stream_coordinator: failed to stop member [redacted]-publisher-prd_1680603078129697429 'rabbit@[redacted node 1]' Error: {{nodedown,'rabbit@[redacted node 1]'},{gen_server,call,[{osiris_server_sup,'rabbit@[redacted node 1]'},{terminate_child, …

Environment details

I cannot publish a collect-env tarball publicly but it will be available for the core team to inspect.

michaelklishin · 2023-08-23T13:36:51Z

It turns out that #7743 introduced a feature flag, stream_sac_coordinator_unblock_group, that is disabled in this environment. This means the change in #7743 is only partially applied at best.

Once the flag is enabled, the reporter will conduct a new round of tests.

michaelklishin · 2023-08-24T11:29:27Z

Looks like the issue can be reproduced post-#7743 with the feature flag enabled, just with a probability several times lower :(

mkuratczyk · 2023-08-28T14:24:38Z

I've performed ~50 attempts each with main and 3.11.18 (for a total of ~100 attempts) and this hasn't happened. 🤷

acogoluegnes · 2023-08-29T09:17:08Z

The reporter mentioned they could not reproduce the issue on 3.12.3.

michaelklishin closed this as completed Sep 20, 2023