[rabbitmq-discuss] Network partition detected, did not recover gracefully even though cluster_partition_handling = autoheal
Nicholas Stuart
nicholasastuart at gmail.com
Tue Dec 10 18:17:13 GMT 2013
I received a message regarding my cluster state saying "Network partition
detected". I went to check my RabbitMQ logs and I can see a bunch of error
reports like this:
=ERROR REPORT==== 7-Dec-2013::08:46:18 ===
** Generic server <0.507.0> terminating
** Last message in was {'DOWN',#Ref<0.0.0.74464>,process,<7022.1390.0>,
noconnection}
** When Server state == {state,
{0,<0.507.0>},
{{7,<7022.1390.0>},#Ref<0.0.0.74464>},
{{0,<7021.456.0>},#Ref<0.0.0.69498>},
{resource,<<"UAT_ENT">>,queue,
<<"queue.1">>},
rabbit_mirror_queue_coordinator,
{8,
[{{0,<7021.456.0>},
{view_member,
{0,<7021.456.0>},
[],
{0,<0.507.0>},
{7,<7022.1390.0>}}},
{{0,<0.507.0>},
{view_member,
{0,<0.507.0>},
[],
{7,<7022.1390.0>},
{0,<7021.456.0>}}},
{{7,<7022.1390.0>},
{view_member,
{7,<7022.1390.0>},
[],
{0,<7021.456.0>},
{0,<0.507.0>}}}]},
43,
[{{0,<7021.456.0>},{member,{[],[]},0,0}},
{{0,<0.507.0>},{member,{[],[]},43,43}},
{{7,<7022.1390.0>},{member,{[],[]},0,0}}],
[<0.506.0>],
{[],[]},
[],undefined,
#Fun<rabbit_misc.execute_mnesia_transaction.1>}
** Reason for termination ==
** {function_clause,[{orddict,fetch,
[{0,<0.507.0>},[]],
[{file,"orddict.erl"},{line,72}]},
{gm,check_neighbours,1,[]},
{gm,handle_info,2,[]},
{gen_server2,handle_msg,2,[]},
{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,237}]}]}
Restarting the troubled node cleared the network partition message, but I
now see the following message in my logs many times:
Discarding message {'$gen_call',{<0.26793.8>,#Ref<0.0.1.31326>},stat} from
<0.26793.8> to <0.433.0> in an old incarnation (3) of this node (2)
I'm not sure why it failed, but other systems reported network failures
around the same time, so I assume that was the cause. My issue is that the
cluster never tried to recover on its own afterwards, even though
cluster_partition_handling is set to autoheal in my rabbitmq.conf. My
understanding is that autoheal makes the nodes repair a network partition
automatically. Is that assumption incorrect?
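
For reference, a minimal sketch of the setting in question, assuming the
classic Erlang-terms configuration file format. Only the
cluster_partition_handling key and its autoheal value come from the
description above; the surrounding layout is an assumption and is not
copied from the affected node:

```erlang
%% Sketch only: assumes the Erlang-terms config format.
%% The key and value are the ones named in this message.
[
  {rabbit, [
    %% Ask the cluster to pick a winning partition and restart the
    %% nodes in the losing partition(s) once connectivity returns.
    {cluster_partition_handling, autoheal}
  ]}
].
```

Running `rabbitmqctl environment` on each node should show whether the
value was actually picked up, and `rabbitmqctl cluster_status` lists any
partitions the node currently knows about.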