[rabbitmq-discuss] Network partition detected, did not recover gracefully even though cluster_partition_handling = autoheal

Tue Dec 10 18:17:13 GMT 2013

I received a message regarding my cluster state saying "Network partition 
detected". I went to check my RabbitMQ logs and I can see a bunch of error 
reports like this:

=ERROR REPORT==== 7-Dec-2013::08:46:18 ===
** Generic server <0.507.0> terminating
** Last message in was {'DOWN',#Ref<0.0.0.74464>,process,<7022.1390.0>,
                               noconnection}
** When Server state == {state,
                            {0,<0.507.0>},
                            {{7,<7022.1390.0>},#Ref<0.0.0.74464>},
                            {{0,<7021.456.0>},#Ref<0.0.0.69498>},
                            {resource,<<"UAT_ENT">>,queue,
                                <<"queue.1">>},
                            rabbit_mirror_queue_coordinator,
                            {8,
                             [{{0,<7021.456.0>},
                               {view_member,
                                   {0,<7021.456.0>},
                                   [],
                                   {0,<0.507.0>},
                                   {7,<7022.1390.0>}}},
                              {{0,<0.507.0>},
                               {view_member,
                                   {0,<0.507.0>},
                                   [],
                                   {7,<7022.1390.0>},
                                   {0,<7021.456.0>}}},
                              {{7,<7022.1390.0>},
                               {view_member,
                                   {7,<7022.1390.0>},
                                   [],
                                   {0,<7021.456.0>},
                                   {0,<0.507.0>}}}]},
                            43,
                            [{{0,<7021.456.0>},{member,{[],[]},0,0}},
                             {{0,<0.507.0>},{member,{[],[]},43,43}},
                             {{7,<7022.1390.0>},{member,{[],[]},0,0}}],
                            [<0.506.0>],
                            {[],[]},
                            [],undefined,
                            #Fun<rabbit_misc.execute_mnesia_transaction.1>}
** Reason for termination == 
** {function_clause,[{orddict,fetch,
                              [{0,<0.507.0>},[]],
                              [{file,"orddict.erl"},{line,72}]},
                     {gm,check_neighbours,1,[]},
                     {gm,handle_info,2,[]},
                     {gen_server2,handle_msg,2,[]},
                     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,237}]}]}

After I restarted the affected node, which cleared the network partition 
warning, the following message appears many times in my logs:

Discarding message {'$gen_call',{<0.26793.8>,#Ref<0.0.1.31326>},stat} from 
<0.26793.8> to <0.433.0> in an old incarnation (3) of this node (2)

I'm not sure why it failed, but other systems also indicated a network 
failure, so I assume that was the cause. My issue is that the cluster 
never tried to recover afterwards, even though in my rabbitmq.conf I 
have cluster_partition_handling set to autoheal. It is my understanding 
that autoheal causes the nodes to repair a network partition on their 
own. Is this assumption incorrect?
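
For reference, a minimal sketch of how this setting is usually written, 
assuming the Erlang-term configuration file format (typically named 
rabbitmq.config). The actual file on these nodes is not shown here, so 
any other entries it contains are omitted:

```erlang
%% Sketch only: assumes the Erlang-term config format and no other
%% settings under the rabbit application.
[
  {rabbit, [
    %% Accepted values include ignore, pause_minority and autoheal.
    {cluster_partition_handling, autoheal}
  ]}
].
```

Running "rabbitmqctl environment" on each node should show whether the 
value was actually picked up, and "rabbitmqctl cluster_status" lists any 
partitions the node currently sees.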