4.1.4: a rolling restart leaves 2 quorum queues out of 1600 without an elected leader #14780

BuJo · 2025-10-22T09:20:38Z

Describe the bug

After a Kubernetes upgrade (which merely cycles the RabbitMQ pods), two of ~2200 queues (~1600 of them being quorum queues) were "broken".

RabbitMQ v4.1.4 runs as a 3-node cluster, deployed with the RabbitMQ Operator.

I have now deleted the queue, to avoid crashlooping the client, but I tried to collect as much information as I could below.

Reproduction steps

I have no idea how to reproduce this. We have not restarted RabbitMQ since the problem occurred, to keep things stable.

Expected behavior

The queue should not become unusable.

Additional context

I tried to analyze what was going on in the background; however, I'm quite lost when it comes to the RabbitMQ internals and Erlang.

Logs for RabbitMQ:

2025-10-21 09:16:22.697845+00:00 [info] <0.34741913.1> connection 10.250.85.76:37067 -> 100.111.189.215:5671 has a client-provided name: ed752021
2025-10-21 09:16:22.699201+00:00 [info] <0.34741913.1> connection 10.250.85.76:37067 -> 100.111.189.215:5671 - ed752021: user 'ed' authenticated and granted access to vhost '/'

2025-10-21 09:17:32.943839+00:00 [error] <0.34741913.1> Error on AMQP connection <0.34741913.1> (10.250.85.76:37067 -> 100.111.189.215:5671 - ed752021, vhost: '/', user: 'ed', state: running), channel 85:
2025-10-21 09:17:32.943839+00:00 [error] <0.34741913.1>  operation basic.consume caused a connection exception internal_error: "timed out consuming from quorum queue 'broken-queue' in vhost '/': {'%2F_broken-queue',\n                                                                                                                                   '[email protected]'}"
2025-10-21 09:17:32.953388+00:00 [info] <0.34741913.1> closing AMQP connection (10.250.85.76:37067 -> 100.111.189.215:5671 - ed752021, vhost: '/', user: 'ed', duration: '1M, 10s')

Client Logs (Spring Boot):

2025-10-17T04:52:54.614Z  WARN 1 --- [ontainer#81-806] [tos,,] o.s.a.r.l.SimpleMessageListenerContainer : Consumer raised exception, processing can restart if the connection factory supports it

org.springframework.amqp.AmqpIOException: java.io.IOException
        at org.springframework.amqp.rabbit.support.RabbitExceptionTranslator.convertRabbitAccessException(RabbitExceptionTranslator.java:70) ~[spring-rabbit-3.1.12.jar:3.1.12]
        at org.springframework.amqp.rabbit.listener.BlockingQueueConsumer.setQosAndCreateConsumers(BlockingQueueConsumer.java:693) ~[spring-rabbit-3.1.12.jar:3.1.12]
        at org.springframework.amqp.rabbit.listener.BlockingQueueConsumer.start(BlockingQueueConsumer.java:638) ~[spring-rabbit-3.1.12.jar:3.1.12]
        at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer$AsyncMessageProcessingConsumer.initialize(SimpleMessageListenerContainer.java:1478) ~[spring-rabbit-3.1.12.jar:3.1.12]
        at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer$AsyncMessageProcessingConsumer.run(SimpleMessageListenerContainer.java:1318) ~[spring-rabbit-3.1.12.jar:3.1.12]
        at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]
Caused by: java.io.IOException: null
        at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:140) ~[amqp-client-5.21.0.jar:5.21.0]
        at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:136) ~[amqp-client-5.21.0.jar:5.21.0]
        at com.rabbitmq.client.impl.ChannelN.basicConsume(ChannelN.java:1408) ~[amqp-client-5.21.0.jar:5.21.0]
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
        at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
        at org.springframework.amqp.rabbit.connection.CachingConnectionFactory$CachedChannelInvocationHandler.invoke(CachingConnectionFactory.java:1201) ~[spring-rabbit-3.1.12.jar:3.1.12]
        at jdk.proxy2/jdk.proxy2.$Proxy336.basicConsume(Unknown Source) ~[na:na]
        at org.springframework.amqp.rabbit.listener.BlockingQueueConsumer.consumeFromQueue(BlockingQueueConsumer.java:730) ~[spring-rabbit-3.1.12.jar:3.1.12]
        at org.springframework.amqp.rabbit.listener.BlockingQueueConsumer.setQosAndCreateConsumers(BlockingQueueConsumer.java:687) ~[spring-rabbit-3.1.12.jar:3.1.12]
        ... 4 common frames omitted
Caused by: com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=541, reply-text=INTERNAL_ERROR - timed out consuming from quorum queue 'broken-queue' in vhost '/': {'%2F_broken-queue',
                     ..., class-id=60, method-id=20)
        at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:66) ~[amqp-client-5.21.0.jar:5.21.0]
        at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36) ~[amqp-client-5.21.0.jar:5.21.0]
        at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:552) ~[amqp-client-5.21.0.jar:5.21.0]
        at com.rabbitmq.client.impl.ChannelN.basicConsume(ChannelN.java:1402) ~[amqp-client-5.21.0.jar:5.21.0]
        ... 10 common frames omitted

rabbitmq-diagnostics quorum_status broken-queue -p /
Status of quorum queue broken-queue on node [email protected] ...
Error:
{{:case_clause, %{state: :noproc, membership: :unknown, machine_version: 7}}, [{:rabbit_quorum_queue, :"-status/2-lc$^0/1-0-", 2, [file: ~c"src/rabbit_quorum_queue.erl", line: 1251]}, {:rabbit_quorum_queue, :status, 2, []}]}

rabbitmqctl eval "ra:members({'%2F_broken-queue', '[email protected]'})."
{error,noproc}

rabbitmqctl eval "ra:members({'%2F_good-queue', '[email protected]'})."
{ok,[{'%2F_good-queue',
         '[email protected]'},
     {'%2F_good-queue',
         '[email protected]'},
     {'%2F_good-queue',
         '[email protected]'}],
    {'%2F_good-queue',
        '[email protected]'}}

rabbitmqctl eval 'application:loaded_applications().'

{ra,"Raft library","2.16.13"},

rabbitmqctl eval 'rabbit_amqqueue:list(<<"/">>).' | grep -B 3 -A 15  broken-queue

 {amqqueue,
     {resource,<<"/">>,queue,
         <<"broken-queue">>},
     true,false,none,
     [{<<"x-expires">>,signedint,60000},
      {<<"x-queue-type">>,longstr,<<"quorum">>}],
     {'%2F_broken-queue',
         '[email protected]'},
     [],[],[],undefined,undefined,[],[],live,0,[],<<"/">>,
     #{user => <<"tos">>},
     rabbit_quorum_queue,
     #{nodes =>
           ['[email protected]',
            '[email protected]',
            '[email protected]']}},

cat /var/lib/rabbitmq/mnesia/[email protected]/quorum/[email protected]/2F_TOS0TRQCNWSKCG1/config
#{id =>
      {'%2F_broken-queue',
          '[email protected]'},
  machine =>
      {module,rabbit_fifo,
          #{name =>
                '%2F_broken-queue',
            max_length => undefined,max_bytes => undefined,
            queue_resource =>
                {resource,<<"/">>,queue,
                    <<"broken-queue">>},
            created => 1760620080236,dead_letter_handler => undefined,
            delivery_limit => 20,expires => 60000,msg_ttl => undefined,
            overflow_strategy => drop_head,
            become_leader_handler =>
                {rabbit_quorum_queue,become_leader,
                    [{resource,<<"/">>,queue,
                         <<"broken-queue">>}]},
            single_active_consumer_on => false}},
  membership => voter,
  friendly_name =>
      "queue 'broken-queue' in vhost '/'",
  cluster_name =>
      '%2F_broken-queue',
  uid => <<"2F_TOS0TRQCNWSKCG1">>,initial_machine_version => 7,
  initial_members =>
      [{'%2F_broken-queue',
           '[email protected]'},
       {'%2F_broken-queue',
           '[email protected]'},
       {'%2F_broken-queue',
           '[email protected]'}],
  log_init_args =>
      #{max_checkpoints => 3,min_checkpoint_interval => 64,
        snapshot_interval => 8192,uid => <<"2F_TOS0TRQCNWSKCG1">>},
  metrics_key =>
      {resource,<<"/">>,queue,
          <<"broken-queue">>},
  ra_event_formatter =>
      {rabbit_quorum_queue,format_ra_event,
          [{resource,<<"/">>,queue,
               <<"broken-queue">>}]},
  tick_timeout => 5000,broadcast_time => 100,
  install_snap_rpc_timeout => 120000,await_condition_timeout => 30000}.

Config in /var/lib/rabbitmq/mnesia/.../quorum/2F_TOS0TRQCNWSKCG1/config (only on node 1)

#{id =>
      {'%2F_broken-queue',
          '[email protected]'},
  machine =>
      {module,rabbit_fifo,
          #{name =>
                '%2F_broken-queue',
            max_length => undefined,max_bytes => undefined,
            queue_resource =>
                {resource,<<"/">>,queue,
                    <<"broken-queue">>},
            created => 1760620080236,dead_letter_handler => undefined,
            delivery_limit => 20,expires => 60000,msg_ttl => undefined,
            overflow_strategy => drop_head,
            become_leader_handler =>
                {rabbit_quorum_queue,become_leader,
                    [{resource,<<"/">>,queue,
                         <<"broken-queue">>}]},
            single_active_consumer_on => false}},
  membership => voter,
  friendly_name =>
      "queue 'broken-queue' in vhost '/'",
  cluster_name =>
      '%2F_broken-queue',
  uid => <<"2F_TOS0TRQCNWSKCG1">>,initial_machine_version => 7,
  initial_members =>
      [{'%2F_broken-queue',
           '[email protected]'},
       {'%2F_broken-queue',
           '[email protected]'},
       {'%2F_broken-queue',
           '[email protected]'}],
  log_init_args =>
      #{max_checkpoints => 3,min_checkpoint_interval => 64,
        snapshot_interval => 8192,uid => <<"2F_TOS0TRQCNWSKCG1">>},
  metrics_key =>
      {resource,<<"/">>,queue,
          <<"broken-queue">>},
  ra_event_formatter =>
      {rabbit_quorum_queue,format_ra_event,
          [{resource,<<"/">>,queue,
               <<"broken-queue">>}]},
  tick_timeout => 5000,broadcast_time => 100,
  install_snap_rpc_timeout => 120000,await_condition_timeout => 30000}.

ls /var/lib/rabbitmq/mnesia/.../quorum/2F_TOS0TRQCNWSKCG1/

0000000000000001.segment  checkpoints  config  snapshots

rabbitmq@rabbitmq-server-1:/$ rabbitmqctl eval "ra:members({'%2F_broken-queue', '[email protected]'})."
{error,noproc}
rabbitmq@rabbitmq-server-1:/$ rabbitmqctl eval "ra:members({'%2F_broken-queue', '[email protected]'})."
{error,noproc}
rabbitmq@rabbitmq-server-1:/$ rabbitmqctl eval "ra:members({'%2F_broken-queue', '[email protected]'})."
{timeout,
    {'%2F_broken-queue',
        '[email protected]'}}

I tried recovering the queue (with whatever the AI said...)

rabbitmqctl eval "QName = rabbit_misc:r(<<\"\/\">>, queue, <<\"broken-queue\">>), {ok, Q} = rabbit_amqqueue:lookup(QName), rabbit_quorum_queue:is_recoverable(Q)."
true

rabbitmqctl eval "QName = rabbit_misc:r(<<\"\/\">>, queue, <<\"broken-queue\">>), {ok, Q} = rabbit_amqqueue:lookup(QName), rabbit_quorum_queue:recover(Q)."
Error:
{:undef, [{:rabbit_quorum_queue, :recover, [{:amqqueue, {:resource, "/", :queue, "broken-queue"}, true, false, :none, [{"x-expires", :signedint, 60000}, {"x-queue-type", :longstr, "quorum"}], {:"%2F_broken-queue", :"[email protected]"}, [], [], [], :undefined, :undefined, [], [], :live, 0, [], "/", %{user: "tos"}, :rabbit_quorum_queue, %{nodes: [:"[email protected]", :"[email protected]", :"[email protected]"]}}], []}, {:erl_eval, :do_apply, 7, [file: ~c"erl_eval.erl", line: 915]}, {:erl_eval, :exprs, 2, []}]}

rabbitmqctl eval "
{ok, [Config]} = file:consult('/var/lib/rabbitmq/mnesia/[email protected]/quorum/[email protected]/2F_TOS0TRQCNWSKCG1/config'),
ra_server_sup_sup:start_server(coordination, Config).
"
{error,
    {shutdown,
        {failed_to_start_child,
            '%2F_broken-queue',
            {already_started,<13876.17309.0>}}}}

kjnilsson · 2025-10-22T09:35:37Z

Maintainer

Please don't just do what AI says. Almost everything there is incorrect.

Somewhere in your broker logs there will be further information as to why at least 2 of the 3 members of the queue aren't running. Perhaps they encountered an exception during recovery.

You seem comfortable running arbitrary commands in your environment, so you could try:

rabbitmqctl eval "ra:restart_server(quorum_queues, {'%2F_broken-queue', '[email protected]'})."

This will retry the start phase of the queue member on node [email protected]. It will most likely fail, but you should see the crash reason in the [email protected] node logs.
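
To retry the start phase on each member in turn, the same call can be rendered once per node. A minimal Python sketch, assuming the Ra server name is the percent-encoded `<vhost>_<queue>` string (inferred only from `'%2F_broken-queue'` in the output above) and using hypothetical node names; substitute the ones from the queue record's `nodes` list.

```python
from urllib.parse import quote_plus


def ra_server_name(vhost: str, queue: str) -> str:
    """Build the Ra server name seen in this thread, e.g. '%2F_broken-queue'.

    Assumption: the name is the percent-encoded "<vhost>_<queue>" string.
    """
    return quote_plus(f"{vhost}_{queue}")


def restart_command(vhost: str, queue: str, node: str) -> str:
    """Render the rabbitmqctl eval call suggested above for one member."""
    server_id = f"{{'{ra_server_name(vhost, queue)}', '{node}'}}"
    return f'rabbitmqctl eval "ra:restart_server(quorum_queues, {server_id})."'


# Hypothetical node names, for illustration only.
nodes = ["rabbit@node-0", "rabbit@node-1", "rabbit@node-2"]
for node in nodes:
    print(restart_command("/", "broken-queue", node))
```

Run the printed commands one at a time and check the target node's log after each, since the point of the exercise is to capture the crash reason.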

8 replies

kjnilsson Oct 22, 2025
Maintainer

hmm ok, it would have been useful to have the stack traces for the broken queues. Would you be up for restarting your env a few more times at some point to see if it occurs again?

BuJo Oct 22, 2025
Author

Yes, we can restart the cluster at will.
Is there a way to check for queues that are similarly broken, so I can quickly check after a restart whether something can be analyzed further?

kjnilsson Oct 22, 2025
Maintainer

I think looking for crash logs for the queues may be the easiest
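
The crash reason itself has to be read from the logs, but the consume timeout shown in the report gives a quick way to list which queues are affected. A small Python sketch that scans broker log lines for that message; the regular expression is based only on the log line quoted in the report, and other RabbitMQ versions may word it differently.

```python
import re
from collections import Counter

# Pattern taken from the broker log line quoted earlier in this thread.
TIMEOUT_RE = re.compile(
    r"timed out consuming from quorum queue '([^']+)' in vhost '([^']+)'"
)


def affected_queues(lines):
    """Count consume timeouts per (vhost, queue) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            queue, vhost = match.groups()
            counts[(vhost, queue)] += 1
    return counts


# Sample line shortened from the report; real input would be a log file.
sample = [
    "2025-10-21 09:17:32.943839+00:00 [error] <0.34741913.1>  operation "
    "basic.consume caused a connection exception internal_error: "
    "\"timed out consuming from quorum queue 'broken-queue' in vhost '/': ...",
    "2025-10-21 09:17:32.953388+00:00 [info] <0.34741913.1> closing AMQP connection",
]
for (vhost, queue), n in affected_queues(sample).most_common():
    print(f"{n}\t{vhost}\t{queue}")
```

For a real check, pass an open log file (or the captured pod log) to `affected_queues` instead of `sample`.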

kjnilsson Oct 22, 2025
Maintainer

on second thought maybe rabbitmqctl list_queues with the right columns could be useful
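
As one possible reading of "the right columns": `rabbitmqctl list_queues` lists `type`, `leader`, `members` and `online` among its info items (worth confirming with `rabbitmqctl help list_queues` on your version). The Python sketch below filters already-parsed rows for quorum queues that report no leader or lack an online majority. The row shape (one dict per queue, `leader` empty when none is elected, `members` and `online` as lists of node names) is an assumption about the `--formatter json` output, not something confirmed in this thread, and a member in the `noproc` state may surface differently.

```python
def suspicious_quorum_queues(rows):
    """Return names of quorum queues with no leader or no online majority."""
    flagged = []
    for row in rows:
        if row.get("type") != "quorum":
            continue  # only quorum queues have Raft members
        members = row.get("members") or []
        online = row.get("online") or []
        has_leader = bool(row.get("leader"))
        has_majority = len(online) > len(members) // 2
        if not has_leader or not has_majority:
            flagged.append(row["name"])
    return flagged


# Hypothetical rows, shaped like the assumed JSON output.
rows = [
    {"name": "good-queue", "type": "quorum", "leader": "n1",
     "members": ["n0", "n1", "n2"], "online": ["n0", "n1", "n2"]},
    {"name": "broken-queue", "type": "quorum", "leader": "",
     "members": ["n0", "n1", "n2"], "online": ["n2"]},
    {"name": "classic-queue", "type": "classic"},
]
print(suspicious_quorum_queues(rows))
```

In practice the rows would come from something like `json.loads` applied to the output of `rabbitmqctl list_queues name type leader members online --formatter json`, run shortly after each restart.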

BuJo Oct 22, 2025
Author

All right, I will try breaking it again. Thank you for the help!

michaelklishin · 2025-10-23T07:53:31Z

Maintainer

With 1600 queues this can be a known scenario addressed by #14401, which will ship in 4.2.0. There is a 4.2.0-rc.1 for those willing to try #14401.

0 replies
