[rabbitmq-discuss] mirrored cluster crashes after node failure
Matt Wheeler
matt.wheeler at 4cite.com
Mon Sep 30 19:34:40 BST 2013
We have a 3-node RabbitMQ cluster consisting of 2 disk nodes and one memory
node (the disk nodes are rabbitmq-00 and rabbitmq-01, the memory node is core-01).
Queues are durable and mirrored (they show +2 in the management UI, etc.) and show
as synchronised (a sketch of how they are declared follows the listings below):
# rabbitmqctl list_queues name slave_pids synchronised_slave_pids
Listing queues ...
...
SVC_mailbox_lookup  [<'rabbit@rabbitmq-01'.2.301.0>, <'rabbit@core-01'.1.268.0>]  [<'rabbit@core-01'.1.268.0>, <'rabbit@rabbitmq-01'.2.301.0>]
...
# rabbitmqctl list_policies
Listing policies ...
/ ha-all ^SVC_ {"ha-mode":"all"} 0
...done.
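For context, the queues are declared by the clients roughly like this (a
simplified sketch; pika, the host name, and the message body here are just
illustrative, the real client code is more involved):

import pika

# connect to one of the cluster nodes (host name is illustrative)
conn = pika.BlockingConnection(pika.ConnectionParameters(host='rabbitmq-00'))
channel = conn.channel()

# durable queue whose name matches the ^SVC_ pattern of the ha-all policy above
channel.queue_declare(queue='SVC_mailbox_lookup', durable=True)

# publish persistent messages so they are written to disk as well as mirrored
channel.basic_publish(
    exchange='',
    routing_key='SVC_mailbox_lookup',
    body='lookup request',
    properties=pika.BasicProperties(delivery_mode=2))

conn.close()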
We put in an SSD mounted at /var/lib/rabbitmq to host the Mnesia database on
rabbitmq-00/01. We only used a single drive, figuring that if the disk
failed the node would crash and the others in the HA cluster would take
over; all clients have been coded for failover (a rough sketch of that logic follows).
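The failover logic in the clients is essentially a retry loop over the broker
hosts, something along these lines (a simplified sketch, again using pika as a
stand-in for our real client code):

import pika

HOSTS = ['rabbitmq-00', 'rabbitmq-01', 'core-01']  # cluster nodes

def connect_with_failover():
    # try each node in turn; return the first connection that succeeds
    for host in HOSTS:
        try:
            return pika.BlockingConnection(pika.ConnectionParameters(host=host))
        except pika.exceptions.AMQPConnectionError:
            continue  # node down or unreachable, try the next one
    raise RuntimeError('no RabbitMQ node reachable')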
The SSD on rabbitmq-00 failed. I don't have logs of that event from
rabbitmq-00's point of view; for some reason it didn't write out anything.
I do have it from rabbitmq-01's:
=INFO REPORT==== 26-Sep-2013::16:07:20 ===
Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Slave <'rabbit@rabbitmq-01'.3.785.0> saw deaths of mirrors <'rabbit@rabbitmq-00'.3.1415.0>
=INFO REPORT==== 26-Sep-2013::16:07:20 ===
Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Promoting slave <'rabbit@rabbitmq-01'.3.785.0> to master
But then:
=ERROR REPORT==== 26-Sep-2013::16:17:17 ===
connection <0.487.0>, channel 1 - soft error:
{amqp_error,not_found,
 "home node 'rabbit@core-01' of durable queue 'SVC_mailbox_lookup' in vhost '/' is down or inaccessible",
 'queue.declare'}
This is repeated for each queue.
It looks like rabbitmq-01 took over as master, but then the nodes became
non-responsive because they couldn't write to disk on core-01 (the memory
node).
We shut down whatever was still running on rabbitmq-00, and everything was
still unavailable. We then shut down core-01 and lastly rabbitmq-01, then
restarted rabbitmq-01, but it came up with NO queues.
Is this an error in the way the HA cluster handles failover, or an error in
our configuration? Should we not mix memory and disk nodes in an HA cluster?
I'm trying to figure this out because we want to be sure that if any node
in the cluster fails, the others take over seamlessly. Our code does
that; we just need the cluster to soldier on and no records to be lost.
Thanks.