Khepri - Auto delete queues after a network partition are stuck / down #14527

luos · 2025-09-11T09:41:52Z

luos
Sep 11, 2025

Describe the bug

Hi,

We were testing out Khepri and noticed that some auto-delete durable queues can become unavailable if there is a network partition.

Reproduction steps

1. Create a 3-node cluster running 4.1.3 with Khepri enabled
2. Declare an auto-delete queue (transient or durable) with a consumer on NODE-1
3. Fully partition NODE-1 from the rest of the cluster
4. Observe the queue going to the "down" state on NODE-2 and NODE-3
5. The consumer gets disconnected from NODE-1
6. Restore the network for NODE-1
7. The queue shows the "running" state but cannot function. It can be deleted manually.
8. On re-declaration, the operation fails with a not_found exception:

2025-09-11 08:40:24.537518+00:00 [error] <0.2066.0> Channel error on connection <0.2057.0> (172.31.31.98:18500 -> 172.31.22.178:5672, vhost: '/', user: 'testuser'), channel 1:
2025-09-11 08:40:24.537518+00:00 [error] <0.2066.0> operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'cq-1' in vhost '/' due to timeout
2025-09-11 08:40:24.869547+00:00 [warning] <0.2057.0> closing AMQP connection <0.2057.0> (172.31.31.98:18500 -> 172.31.22.178:5672 - perf-test-configuration-0, vhost: '/', user: 'testuser', duration: '1M, 2s'):

Expected behavior

The auto-delete queue is deleted on the majority partition because it lost its consumer.

Additional context

Answered by dumbbell

Sep 30, 2025

michaelklishin · 2025-09-11T16:10:26Z

michaelklishin
Sep 11, 2025
Maintainer

@luos I don't have any logs to work with, but very likely this comes down to the fact that:

Node 1 is in the minority, so it won't be able to perform any schema database updates
Queue recovery entirely depends on the data it gets from the schema database

So this is not an issue with the implementation. It is a fundamental incompatibility between non-replicated durable queues and Khepri's assumptions, which stem from Raft (a node in a minority cannot perform any writes).
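To make the minority rule concrete for the 3-node reproduction above, here is a minimal sketch of the majority check that Raft-style systems rely on. The module and function names are hypothetical and this is not Khepri or Ra code; it only illustrates the arithmetic.

```erlang
-module(quorum_sketch).
-export([can_write/2]).

%% Hypothetical helper, for illustration only.
%% ClusterSize: total number of members in the cluster.
%% Reachable:   members this node can currently reach, itself included.
%% A write needs a strict majority of the cluster.
can_write(ClusterSize, Reachable) ->
    Reachable >= (ClusterSize div 2) + 1.
```

With three nodes, an isolated NODE-1 reaches only itself, so `quorum_sketch:can_write(3, 1)` is `false`, while NODE-2 and NODE-3 together give `quorum_sketch:can_write(3, 2)`, which is `true`.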

You are welcome to investigate this further to have a more detailed description of what's going on.

0 replies

michaelklishin · 2025-09-11T18:28:16Z

michaelklishin
Sep 11, 2025
Maintainer

@luos @dumbbell suggests that we have seen this before with exclusive CQs, and it is pending an investigation but currently there are higher priority Khepri improvements.

1 reply

dumbbell Sep 11, 2025
Maintainer

The Khepri improvements being worked on are to address the issue with exclusive queues. I don't know if this will fix anything with auto-delete (unlikely).

kjnilsson · 2025-09-11T19:48:54Z

kjnilsson
Sep 11, 2025
Maintainer

I would expect the auto-delete queue to stay running until it is able to perform the metadata store update to delete itself.

4 replies

kjnilsson Sep 12, 2025
Maintainer

Quorum queues have a similar issue when a delete happens on a Khepri minority:

rabbitmq-server/deps/rabbit/src/rabbit_quorum_queue.erl

Lines 885 to 930 in 403e5aa

    
    case ra:delete_cluster(Servers, Timeout) of
        {ok, {_, LeaderNode} = Leader} ->
            MRef = erlang:monitor(process, Leader),
            receive
                {'DOWN', MRef, process, _, _} ->
                    %% leader is down,
                    %% force delete remaining members
                    ok = force_delete_queue(lists:delete(Leader, Servers)),
                    ok
            after Timeout ->
                    erlang:demonitor(MRef, [flush]),
                    ok = force_delete_queue(Servers)
            end,
            notify_decorators(QName, shutdown),
            case delete_queue_data(Q, ActingUser) of
                ok ->
                    _ = erpc_call(LeaderNode, rabbit_core_metrics, queue_deleted, [QName],
                                  ?RPC_TIMEOUT),
                    {ok, ReadyMsgs};
                {error, timeout} = Err ->
                    Err
            end;
        {error, {no_more_servers_to_try, Errs}} ->
            case lists:all(fun({{error, noproc}, _}) -> true;
                              (_) -> false
                           end, Errs) of
                true ->
                    %% If all ra nodes were already down, the delete
                    %% has succeed
                    ok;
                false ->
                    %% attempt forced deletion of all servers
                    ?LOG_WARNING(
                      "Could not delete quorum '~ts', not enough nodes "
                       " online to reach a quorum: ~255p."
                       " Attempting force delete.",
                      [rabbit_misc:rs(QName), Errs]),
                    ok = force_delete_queue(Servers),
                    notify_decorators(QName, shutdown)
            end,
            case delete_queue_data(Q, ActingUser) of
                ok ->
                    {ok, ReadyMsgs};
                {error, timeout} = Err ->
                    Err
            end

kjnilsson Sep 12, 2025
Maintainer

We could perhaps reduce some of these by making the metadata store update for deletes two-phase:

1. Update the queue record with a "deleting" status
2. Perform the actual delete
3. Remove the queue record

If 1 fails we don't proceed with the rest or enter a retry loop. Ofc partitions can happen between 1 and 3 but it is much less likely to occur.
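The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not RabbitMQ code: `Store` is a hypothetical fun standing in for the metadata store, `DeleteFun` stands in for the real queue deletion, and the retry count of 3 is arbitrary.

```erlang
-module(two_phase_delete_sketch).
-export([delete/3, demo/0]).

%% Store:     fun(Op) -> ok | {error, Reason}; a hypothetical stand-in
%%            for the metadata store, not a real RabbitMQ API.
%% DeleteFun: fun(QName) -> ok; performs the actual queue deletion.
delete(Store, DeleteFun, QName) ->
    %% Step 1: mark the record as deleting. If this fails (for example
    %% because the node is in a minority), stop before touching the queue.
    case Store({mark_deleting, QName}) of
        ok ->
            %% Step 2: perform the actual delete.
            ok = DeleteFun(QName),
            %% Step 3: remove the record, retrying a bounded number of times.
            remove_record(Store, QName, 3);
        {error, _} = Err ->
            Err
    end.

remove_record(_Store, _QName, 0) ->
    %% The record stays marked as deleting so a later pass can finish the job.
    {error, record_still_marked_deleting};
remove_record(Store, QName, Retries) ->
    case Store({remove_record, QName}) of
        ok -> ok;
        {error, _} -> remove_record(Store, QName, Retries - 1)
    end.

%% Small demonstration with a healthy store and a store that always times out.
demo() ->
    Healthy = fun(_Op) -> ok end,
    Minority = fun(_Op) -> {error, timeout} end,
    Deleter = fun(_QName) -> ok end,
    ok = delete(Healthy, Deleter, <<"cq-1">>),
    {error, timeout} = delete(Minority, Deleter, <<"cq-1">>),
    ok.
```

In this sketch a failure of step 1 returns the error to the caller; entering a retry loop at that point instead would be an equally valid reading of the proposal.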

luos Sep 12, 2025
Author

Hi,

Thanks for reviewing this.

I think this may also be a problem if, say, the node hosting the AD queue simply crashes, in which case the live process never comes back, so it cannot delete itself. (I haven't tested this; maybe there is some cleanup on restart.) I'd also guess that it's relatively common to have a partition and a node crash at the same time, which means the process is no longer alive.

I am not sure what would be a good solution, I agree it's similar but different to the exclusive queue case.

Maybe something when the node reconnects to the cluster could clean up ad/transient queues which are no longer running?

It's probably fine to leave the queue record in place during the partition, as the queue could still be running on the other side, so somehow the reconnecting node could be responsible for the cleanup.
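A rough sketch of what such a cleanup pass on the reconnecting node could look like is shown below. Everything here is hypothetical: the `{Name, Pid, Kind}` tuples are a simplified stand-in for real queue records, and the sketch only selects candidates rather than deleting anything.

```erlang
-module(rejoin_cleanup_sketch).
-export([stale_queues/1, demo/0]).

%% Records: list of {Name, Pid, Kind} tuples, a hypothetical stand-in for
%% queue records. Kind is auto_delete | transient | durable.
%% Returns the names of auto-delete/transient queues hosted on this node
%% whose process is no longer alive.
stale_queues(Records) ->
    [Name || {Name, Pid, Kind} <- Records,
             Kind =:= auto_delete orelse Kind =:= transient,
             %% is_process_alive/1 only accepts local pids, so check the
             %% owning node first.
             node(Pid) =:= node(),
             not is_process_alive(Pid)].

demo() ->
    Alive = spawn(fun() -> receive stop -> ok end end),
    {Dead, MRef} = spawn_monitor(fun() -> ok end),
    %% Wait until the short-lived process has really exited.
    receive {'DOWN', MRef, process, Dead, _} -> ok end,
    Records = [{<<"ad-1">>, Alive, auto_delete},
               {<<"ad-2">>, Dead, auto_delete},
               {<<"dq-1">>, Dead, durable}],
    [<<"ad-2">>] = stale_queues(Records),
    Alive ! stop,
    ok.
```

Only queues local to the reconnecting node are considered, which matches the idea that the other side of the partition cannot know whether the process is still running.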

dumbbell Sep 19, 2025
Maintainer

The feature I'm adding to Khepri would make Khepri delete the queue record when the queue process exits. During a network partition, it would wait for the partition to resolve (and check if the process is still running) or for the node to be explicitly removed from the cluster (which covers the node loss).

The queue record would have a keep_while condition pointing to the queue process.

That said, the node loss technically is already covered: RabbitMQ deletes queue records associated with a node that is removed from the cluster. Currently it only does it for durable queues, but #14573 takes care of transient queues.

@kjnilsson: What do you think? Do you have a scenario in mind where this is not enough or does not work?

dumbbell · 2025-09-30T10:26:06Z

dumbbell
Sep 30, 2025
Maintainer

After several iterations, I could improve the fix I prepared for a similar issue with exclusive queues. It’s available in #14573.

@luos, could you please give it a try and tell me if it is fixed for you too?

3 replies

dumbbell Oct 1, 2025
Maintainer

The pull request was merged, but I’m still interested in feedback :-)

luos Oct 2, 2025
Author

Hi @dumbbell , thank you for preparing a fix for this!

I ran a test with the rabbitmq:pr-14573-otp28 container in a 3-node cluster.

Started perftest with 10 auto delete queues distributed among the nodes. Introduced a full network partition on a node. Then "due to the partition" clients also got disconnected (killed perftest).

I also did the same test with exclusive queues.

In both cases, during the partition the unavailable queues were in a down state on both sides of the partition. After the partition, the queues got cleaned up.

In the case where the clients did not disconnect, the auto-delete queues seemed to stay alive.

I think this behaviour seems reasonable and working as I'd expect. Thank you again!

dumbbell Oct 2, 2025
Maintainer

Awesome, thank you for testing!

The fix will be shipped with RabbitMQ 4.2.0, but not 4.1.x, because it only applies to Khepri and is a bit too invasive for my taste this late in the 4.1.x patch series.
