[rabbitmq-discuss] Problems with the {cluster_partition_handling, pause_minority} option
David Rodrigues
rdgsdavid at gmail.com
Wed Mar 5 17:18:14 GMT 2014
Dear RabbitMQ Community,
I'm having some problems with the {cluster_partition_handling,
pause_minority} option and I would like to share my questions with you.
First, the architecture. I'm running RabbitMQ 3.2.4 (Erlang R14A) on 3
nodes (rabbitmq@rabbitmq-test01, rabbitmq@rabbitmq-test02 and
rabbitmq@rabbitmq-test03) on a virtualized platform - quite similar to
EC2. And because of that I have, from time to time, connectivity issues -
I know, that's bad ;)
Digging into the docs, I found that the best way to handle the problem
is the pause_minority option -
https://www.rabbitmq.com/partitions.html.
But from time to time my nodes get disconnected from each other and do not
recover automatically. Fortunately, I have managed to reproduce the problem.
Here are the steps.
THE CONFIGURATION FILE
***********************************************************
My configuration file is quite simple:
%% -*- mode: erlang -*-
[
{rabbit,
[
{auth_mechanisms, ['PLAIN', 'AMQPLAIN']},
{default_vhost, <<"/">>},
{default_user, <<"admin">>},
{default_pass, <<"admin">>},
{default_permissions, [<<".*">>, <<".*">>, <<".*">>]},
{default_user_tags, [administrator]},
{cluster_partition_handling, pause_minority},
{cluster_nodes, {['rabbitmq@rabbitmq-test01', 'rabbitmq@rabbitmq-test02',
'rabbitmq@rabbitmq-test03'], disc}}
]},
{kernel, []},
{rabbitmq_management, []},
{rabbitmq_management_agent, []},
{rabbitmq_shovel,
[{shovels, []}
]},
{rabbitmq_stomp, []},
{rabbitmq_mqtt, []},
{rabbitmq_amqp1_0, []},
{rabbitmq_auth_backend_ldap, []}
].
As you can see, the {cluster_partition_handling, pause_minority} option is
there.
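As a quick sanity check (just a sketch - on a real node you would point it at
/etc/rabbitmq/rabbitmq.config, or ask the running broker directly with
rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'),
you can grep the config for the option. An inline copy of the relevant stanza
stands in for the real file here so the snippet runs anywhere:

```shell
# Hypothetical helper: prints "pause_minority" if the option is set in the
# given config file, "missing" otherwise.
check_partition_handling() {
  grep -o 'pause_minority' "$1" || echo "missing"
}

# Inline stand-in for /etc/rabbitmq/rabbitmq.config:
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[{rabbit, [{cluster_partition_handling, pause_minority}]}].
EOF
check_partition_handling "$cfg"   # prints: pause_minority
rm -f "$cfg"
```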
PAUSE_MINORITY WORKING
***********************************************************
When the network outage is long enough, the option works perfectly.
To simulate a connection problem on rabbitmq-test03 I run:
iptables -A INPUT -s rabbitmq-test01 -j DROP; iptables -A OUTPUT -d rabbitmq-test01 -j DROP
iptables -A INPUT -s rabbitmq-test02 -j DROP; iptables -A OUTPUT -d rabbitmq-test02 -j DROP
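The commands above can be wrapped in a small generator so the same test is easy
to repeat against any set of peers (a sketch; the rules are echoed rather than
executed, so you can review them and then pipe them to sh as root):

```shell
# Hypothetical helper: emit the iptables rules that cut the local node off
# from each named peer. To actually apply them: partition_rules ... | sh
partition_rules() {
  for peer in "$@"; do
    echo "iptables -A INPUT -s $peer -j DROP"
    echo "iptables -A OUTPUT -d $peer -j DROP"
  done
}
partition_rules rabbitmq-test01 rabbitmq-test02
```

Afterwards, `iptables -F` restores connectivity; varying how long you wait in
between (e.g. 180 s vs. 60 s) reproduces the two scenarios below.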
Then wait long enough for the following messages to appear in the logs of
rabbitmq-test03 (approximately 180 seconds):
=ERROR REPORT==== 5-Mar-2014::16:51:02 ===
** Node 'rabbitmq@rabbitmq-test02' not responding **
** Removing (timedout) connection **
=ERROR REPORT==== 5-Mar-2014::16:51:02 ===
** Node 'rabbitmq@rabbitmq-test01' not responding **
** Removing (timedout) connection **
=INFO REPORT==== 5-Mar-2014::16:51:02 ===
rabbit on node 'rabbitmq@rabbitmq-test02' down
=WARNING REPORT==== 5-Mar-2014::16:51:30 ===
Cluster minority status detected - awaiting recovery
=INFO REPORT==== 5-Mar-2014::16:51:30 ===
rabbit on node 'rabbitmq@rabbitmq-test01' down
=INFO REPORT==== 5-Mar-2014::16:51:30 ===
Stopping RabbitMQ
=INFO REPORT==== 5-Mar-2014::16:51:30 ===
stopped TCP Listener on [::]:5672
=WARNING REPORT==== 5-Mar-2014::16:51:58 ===
Cluster minority status detected - awaiting recovery
When the rules are flushed (iptables -F), connectivity is reestablished and
the cluster works perfectly.
In the logs:
=INFO REPORT==== 5-Mar-2014::16:52:58 ===
started TCP Listener on [::]:5672
=INFO REPORT==== 5-Mar-2014::16:52:58 ===
rabbit on node 'rabbitmq@rabbitmq-test01' up
=INFO REPORT==== 5-Mar-2014::16:52:58 ===
rabbit on node 'rabbitmq@rabbitmq-test02' up
Finally, the cluster status:
Cluster status of node 'rabbitmq@rabbitmq-test03' ...
[{nodes,[{disc,['rabbitmq@rabbitmq-test01','rabbitmq@rabbitmq-test02',
'rabbitmq@rabbitmq-test03']}]},
{running_nodes,['rabbitmq@rabbitmq-test01','rabbitmq@rabbitmq-test02',
'rabbitmq@rabbitmq-test03']},
{partitions,[]}]
...done.
So far, so good. The option works flawlessly.
PAUSE_MINORITY NOT WORKING
***********************************************************
Life is not so bright when the network partition is not long enough.
On rabbitmq-test03 I will run my iptables commands again:
iptables -A INPUT -s rabbitmq-test01 -j DROP; iptables -A OUTPUT -d rabbitmq-test01 -j DROP
iptables -A INPUT -s rabbitmq-test02 -j DROP; iptables -A OUTPUT -d rabbitmq-test02 -j DROP
However, this time I'll wait only 60 seconds before flushing the rules with
iptables -F.
And that's the result in rabbitmq-test03 logs:
=INFO REPORT==== 5-Mar-2014::16:55:00 ===
rabbit on node 'rabbitmq@rabbitmq-test02' down
=ERROR REPORT==== 5-Mar-2014::16:55:00 ===
Mnesia('rabbitmq@rabbitmq-test03'): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network,
'rabbitmq@rabbitmq-test02'}
=INFO REPORT==== 5-Mar-2014::16:55:00 ===
rabbit on node 'rabbitmq@rabbitmq-test01' down
=INFO REPORT==== 5-Mar-2014::16:55:00 ===
Statistics database started.
=INFO REPORT==== 5-Mar-2014::16:55:00 ===
Statistics database started.
Again, the result is quite ugly on rabbitmq-test01:
=ERROR REPORT==== 5-Mar-2014::16:55:00 ===
** Node 'rabbitmq@rabbitmq-test03' not responding **
** Removing (timedout) connection **
=INFO REPORT==== 5-Mar-2014::16:55:00 ===
rabbit on node 'rabbitmq@rabbitmq-test03' down
=INFO REPORT==== 5-Mar-2014::16:55:01 ===
global: Name conflict terminating {rabbit_mgmt_db,<2669.1582.0>}
Finally my cluster status:
Cluster status of node 'rabbitmq@rabbitmq-test03' ...
[{nodes,[{disc,['rabbitmq@rabbitmq-test01','rabbitmq@rabbitmq-test02',
'rabbitmq@rabbitmq-test03']}]},
{running_nodes,['rabbitmq@rabbitmq-test03']},
{partitions,[{'rabbitmq@rabbitmq-test03',['rabbitmq@rabbitmq-test02']}]}]
...done.
That's it. Even with the pause_minority option, my cluster ended up partitioned.
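A small check like the following can at least flag this state from cron or
monitoring (a sketch: a captured sample of the status output is used here so
the snippet is self-contained; in practice you would pipe in the live output
of rabbitmqctl cluster_status):

```shell
# Hypothetical helper: a healthy cluster reports "{partitions,[]}" in its
# status; anything else inside the partitions tuple means at least one node
# sees a partition. Reads the status text on stdin.
cluster_healthy() {
  grep -q '{partitions,\[\]}'
}

# Captured sample standing in for: rabbitmqctl cluster_status
sample="[{running_nodes,['rabbitmq@rabbitmq-test03']},
 {partitions,[{'rabbitmq@rabbitmq-test03',['rabbitmq@rabbitmq-test02']}]}]"
if printf '%s' "$sample" | cluster_healthy; then
  echo "cluster OK"
else
  echo "partition detected"
fi
```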
SYNOPSIS
***********************************************************
In short, if the network outage is long enough everything goes according to
plan and the cluster works perfectly once connectivity is
reestablished. However, if the network outage has an intermediate duration
(not too short, not too long) the pause_minority option does not seem to work.
Are you aware of this problem? Is there any solution to cope with this
particular situation?
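Until then, the partitions page suggests recovering manually by restarting the
RabbitMQ application on nodes in the losing partition. A sketch of that step
(the commands are echoed rather than executed, so they can be reviewed before
being run on the stuck node):

```shell
# Hypothetical helper: print the manual recovery commands for a node stuck
# in a partition. stop_app/start_app restart the RabbitMQ application only,
# not the whole Erlang VM, which lets the node rejoin and resync.
recover_minority_node() {
  echo "rabbitmqctl stop_app"
  echo "rabbitmqctl start_app"
}
recover_minority_node
```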
Thanks,
David