[rabbitmq-discuss] mirrored cluster crashes after node failure
Simon MacMullen
simon at rabbitmq.com
Tue Oct 1 10:33:56 BST 2013
That's not how things should go. Could you post the complete logs
somewhere in case there's anything pointing to what happened?
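On a typical install the node logs live under /var/log/rabbitmq/, and
"rabbitmqctl report" bundles status, queues and policies into one dump
(general pointers, assuming default paths — adjust for your layout):

```shell
# main and SASL logs for each node, on a default Debian-style install:
ls /var/log/rabbitmq/
# one-shot dump of status, queues, policies etc. for posting:
rabbitmqctl report > rabbitmq-report.txt
```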
Cheers, Simon
On 30/09/13 19:34, Matt Wheeler wrote:
> We have a 3 node rabbitmq cluster consisting of 2 disk nodes and one
> memory node. (disk nodes are rabbitmq-00 and 01, memory node is core-01)
>
> Queues are durable and mirrored (they show +2 in the control panel, etc.)
> and show as synchronised:
>
> # rabbitmqctl list_queues name slave_pids synchronised_slave_pids
> Listing queues ...
> ...
> SVC_mailbox_lookup    [<'rabbit at rabbitmq-01'.2.301.0>, <'rabbit at core-01'.1.268.0>]    [<'rabbit at core-01'.1.268.0>, <'rabbit at rabbitmq-01'.2.301.0>]
> ...
>
> # rabbitmqctl list_policies
> Listing policies ...
> /    ha-all    ^SVC_    {"ha-mode":"all"}    0
> ...done.
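> That policy would have been set with something along these lines (a
> reconstruction from the listed fields, not our exact command):
>
> ```shell
> # vhost "/", policy "ha-all", pattern ^SVC_, definition {"ha-mode":"all"}
> rabbitmqctl set_policy -p / ha-all "^SVC_" '{"ha-mode":"all"}'
> ```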
>
>
> We put in an SSD mounted at '/var/lib/rabbitmq' to host the Mnesia
> database on rabbitmq-00/01. We used only a single drive, figuring that if
> the disk failed the node would crash and the others in the HA cluster
> would take over; all clients have been coded for failover.
>
> The SSD on rabbitmq-00 failed. I don't have logs of that event from
> rabbitmq-00's point of view; for some reason it didn't write out anything.
>
> I do have the logs from rabbitmq-01's side:
>
> =INFO REPORT==== 26-Sep-2013::16:07:20 ===
> Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Slave
> <'rabbit at rabbitmq-01'.3.785.0> saw
> deaths of mirrors <'rabbit at rabbitmq-00'.3.1415.0>
>
> =INFO REPORT==== 26-Sep-2013::16:07:20 ===
> Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Promoting
> slave <'rabbit at rabbitmq-01'.3.785.0>
> to master
>
> but then:
>
> =ERROR REPORT==== 26-Sep-2013::16:17:17 ===
> connection <0.487.0>, channel 1 - soft error:
> {amqp_error,not_found,
> "home node 'rabbit at core-01' of durable queue 'SVC_mailbox_lookup' in
> vhost '/' is down or inaccessible",
> 'queue.declare'}
>
> This is repeated for each queue.
>
> It looks like rabbitmq-01 took over as master, but then the nodes became
> non-responsive because they couldn't write to disk on core-01 (the memory
> node).
>
> We shut down whatever was still running on rabbitmq-00, and everything
> was still unavailable. We then shut down core-01 and lastly rabbitmq-01,
> then restarted rabbitmq-01, but it came up with NO queues.
>
> Is this an error in the way the HA cluster handles failover, or an error
> in our configuration? Should we not mix memory and disk nodes in an HA
> cluster?
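> For reference, this is how we check the node types; cluster_status lists
> the disc and RAM nodes separately:
>
> ```shell
> # output includes {nodes,[{disc,[...]},{ram,[...]}]} and running_nodes
> rabbitmqctl cluster_status
> ```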
>
> I'm trying to figure this out because we want to be sure that if any
> node in the cluster fails, the others take over seamlessly. Our code
> does that; we just need the clusters to soldier on without losing any
> records.
>
> Thanks.
>
>
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>
--
Simon MacMullen
RabbitMQ, Pivotal