Active-active crash report

Matthew Sackman matthew at rabbitmq.com
Thu Apr 26 23:31:24 BST 2012

Hi Vadim,

On Thu, Apr 26, 2012 at 01:01:20PM -0700, Vadim Chekan wrote:
> I'm testing my active-active setup (2.8.1, linux 64) and I am randomly
> running into some crashes when I'm stopping a node. I can stop one node
> but another one fails along with it. Below is a crash log.
> =ERROR REPORT==== 26-Apr-2012::12:15:59 ===
> Discarding message
> {'$gen_call',{<0.1955.0>,#Ref<>},{add_on_right,{9,<0.1955.0>}}}
> from <0.1955.0> to <0.26823.834>
>  in an old incarnation (2) of this node (3)

I'm worried about these messages. Someone else on this list has seen
this sort of thing too and it's causing them trouble. I've not seen this
issue myself in testing, which is frustrating. However, that's not the
cause of your crash in this case (I think).

> ** Generic server <0.1800.0> terminating
> ** Last message in was {'$gen_cast',{gm_deaths,[<0.4684.0>]}}
> ** When Server state == {state,
>                             {amqqueue,
>                                 {resource,<<"/">>,queue,<<"test_29">>},
>                                 true,false,<0.1433.0>,
>                                 [{<<"x-ha-policy">>,longstr,<<"all">>},
>                                  {<<"x-message-ttl">>,signedint,600000}],
>                                 <0.1799.0>,[],all},
>                             <0.1801.0>,
>                             {dict,0,16,16,8,80,48,
>                                 {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
>                                  []},
>                                 {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
>                                   [],[]}}},
>                             #Fun<rabbit_mirror_queue_master.1.2951048>,
>                             #Fun<rabbit_mirror_queue_master.2.72654940>}
> ** Reason for termination ==
> ** {{case_clause,{ok,<3066.9234.0>,[<0.4683.0>]}},
>     [{rabbit_mirror_queue_coordinator,handle_cast,2},
>      {gen_server2,handle_msg,2},
>      {proc_lib,wake_up,3}]}

Well, this is very odd. We fixed a bug that looked like this, but it was
fixed in 2.7.1 (and was related to x-ha-policy = nodes). Could you just
check that you really are running 2.8.1? We're not aware of any bug in
this area in 2.8.1, but that's certainly not saying there isn't one
there! Is there any particular sequence of events that you can perform
that reliably triggers this crash? Could you also check the logs of the
other nodes (both .log and -sasl.log) to see if there are further crash
reports in there?

Also, lots of bugs have been discovered relating to the code changes
made to add DLX support in 2.8.1, especially in relation to HA. It's
possible one of the issues I found with TTL and HA is causing this.
2.8.2 should be out soonish, which might introduce fewer new bugs than
it fixes, but in the meantime, could you try without the TTL and see if
that helps?
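
To make the "without the TTL" test concrete, here is a rough sketch of
building the queue arguments with and without x-message-ttl. The helper
name queue_arguments is made up for illustration, and the commented-out
pika calls assume a Python client, which may well not be what you are
using; the argument names and values are taken from your crash log.

```python
# Hypothetical helper (not part of RabbitMQ or any client library):
# builds the queue arguments seen in the crash log, with the TTL optional.
def queue_arguments(ha_policy="all", ttl_ms=None):
    args = {"x-ha-policy": ha_policy}
    if ttl_ms is not None:
        # Only add x-message-ttl when a TTL is explicitly requested.
        args["x-message-ttl"] = ttl_ms
    return args


# Arguments as in the crash log (mirrored queue with a 10 minute TTL).
with_ttl = queue_arguments(ttl_ms=600000)

# Arguments for the test run: mirrored queue, no TTL.
without_ttl = queue_arguments()

# Assumed usage with the pika client (requires a running broker):
#   import pika
#   conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
#   ch = conn.channel()
#   ch.queue_declare(queue="test_29_no_ttl", durable=True,
#                    arguments=without_ttl)
```

Note that a queue cannot be redeclared under the same name with
different arguments, so either delete the existing queue first or use a
fresh name for the test, as the sketch does.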


