[rabbitmq-discuss] mirrored cluster crashes after node failure
Matt Wheeler
matt.wheeler at 4cite.com
Mon Sep 30 19:34:40 BST 2013
We have a 3-node RabbitMQ cluster consisting of 2 disk nodes and one memory
node (the disk nodes are rabbitmq-00 and rabbitmq-01, the memory node is core-01).
Queues are durable and mirrored (they show +2 in the management UI, etc.) and show
as synchronised (a sketch of how they are declared follows the listings below):
# rabbitmqctl list_queues name slave_pids synchronised_slave_pids
Listing queues ...
...
SVC_mailbox_lookup  [<'rabbit@rabbitmq-01'.2.301.0>, <'rabbit@core-01'.1.268.0>]  [<'rabbit@core-01'.1.268.0>, <'rabbit@rabbitmq-01'.2.301.0>]
...
# rabbitmqctl list_policies
Listing policies ...
/ ha-all ^SVC_ {"ha-mode":"all"} 0
...done.
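For context, the queues are declared by the clients roughly like this (a
simplified sketch; pika, the host name, and the message body here are just
illustrative, the real client code is more involved):

import pika

# connect to one of the cluster nodes (host name is illustrative)
conn = pika.BlockingConnection(pika.ConnectionParameters(host='rabbitmq-00'))
channel = conn.channel()

# durable queue whose name matches the ^SVC_ pattern of the ha-all policy above
channel.queue_declare(queue='SVC_mailbox_lookup', durable=True)

# publish persistent messages so they are written to disk as well as mirrored
channel.basic_publish(
    exchange='',
    routing_key='SVC_mailbox_lookup',
    body='lookup request',
    properties=pika.BasicProperties(delivery_mode=2))

conn.close()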
We put in an SSD mounted at /var/lib/rabbitmq to host the Mnesia database on
rabbitmq-00/01. We only used a single drive, figuring that if the disk
failed the node would crash and the others in the HA cluster would take
over; all clients have been coded for failover (a rough sketch of that logic follows).
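The failover logic in the clients is essentially a retry loop over the broker
hosts, something along these lines (a simplified sketch, again using pika as a
stand-in for our real client code):

import pika

HOSTS = ['rabbitmq-00', 'rabbitmq-01', 'core-01']  # cluster nodes

def connect_with_failover():
    # try each node in turn; return the first connection that succeeds
    for host in HOSTS:
        try:
            return pika.BlockingConnection(pika.ConnectionParameters(host=host))
        except pika.exceptions.AMQPConnectionError:
            continue  # node down or unreachable, try the next one
    raise RuntimeError('no RabbitMQ node reachable')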
The SSD on rabbitmq-00 failed. I don't have logs of that event from
rabbitmq-00's point of view; for some reason it didn't write out anything.
I do have it from rabbitmq-01's:
=INFO REPORT==== 26-Sep-2013::16:07:20 ===
Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Slave <'rabbit@rabbitmq-01'.3.785.0> saw deaths of mirrors <'rabbit@rabbitmq-00'.3.1415.0>
=INFO REPORT==== 26-Sep-2013::16:07:20 ===
Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Promoting slave <'rabbit@rabbitmq-01'.3.785.0> to master
But then:
=ERROR REPORT==== 26-Sep-2013::16:17:17 ===
connection <0.487.0>, channel 1 - soft error:
{amqp_error,not_found,
 "home node 'rabbit@core-01' of durable queue 'SVC_mailbox_lookup' in vhost '/' is down or inaccessible",
 'queue.declare'}
This is repeated for each queue.
It looks like rabbitmq-01 took over as master, but then the nodes became
non-responsive because they couldn't write to disk on core-01 (the memory
node).
We shut down whatever was still running on rabbitmq-00, and everything was
still unavailable. We then shut down core-01 and lastly rabbitmq-01, then
restarted rabbitmq-01, but it came up with NO queues.
Is this an error in the way the HA cluster handles failover, or an error in
our configuration? Should we not mix memory and disk nodes in an HA cluster?
I'm trying to figure this out because we want to be sure that if any node
in the cluster fails, the others take over seamlessly. Our code does
that; we just need the cluster to soldier on and no records to be lost.
Thanks.