[rabbitmq-discuss] cluster buster

Mark Ward ward.mark at gmail.com
Wed Oct 17 16:53:13 BST 2012


Hello,
I am testing what happens to the cluster when nodes are shutdown.  The 
results of my testing have not been positive.  I hope to either be advised 
if I have a mistake in my cluster setup or if something is really going 
wrong.

The issue is when I shutdown a server using the Windows Services to stop a 
node the cluster does fail over the queue but does not continue delivering 
messages to the subscriber.  The subscriber also does not receive any 
errors.  The client's connection still has a running status but no data 
flows.  I am able to reproduce this error.

I have added two test run results to this email. I have more tests and the 
results saved if more information is required.  I can also rerun future 
tests.

I have the following puny virtual machines for my cluster.
3 Virtual Machines
• Windows 2008 R2 sp1 64-bit
• 1 GB of RAM
• 16 GB of drive space free
RabbitMQ: 2.8.7
Erlang: R15B02
RabbitMQ is setup as the following on each of the nodes:
[ {kernel, [{inet_dist_listen_min, 55021}, {inet_dist_listen_max, 55021} ] 
},
{rabbit, 
[{cluster_nodes, ['rabbit at RIOBARON-1', 'rabbit at RIOOVERLORD-1', 
'rabbit at CUST1-MASTER']}
,{vm_memory_high_watermark, 0.2}]
} 
].

Virtual Host: MoBunniesMoProblems
Queue: MoData: Mirror(all) and Durable
Admin is setup on all nodes.
Using the .NET client version: 2.8.7 (nuget download)
The publisher uses SelectConfirm with messages set to persistent.
The subscriber uses BasicConsume with BasicQos set depending on the test. 
(10, 1000, 5)
Message size has been 3,196 bytes
100,000 messages sent in each test.

---------------------------------------------------------------------------------------------
Test 01:
Assumptions: shutdown cluster node.  A new server will be master of the 
queue.  subscriber should continue to receive messages.  reconnection is 
possible and duplicate messages are possible.
Result: A node took ownership of the node's queue.  The subscriber failed 
to continue after the node was shutdown.  connection was never dropped.

Test Notes
Before testing make sure all servers in the cluster are active and see each 
other.  Admin confirmed this with all green for nodes.
Publish 100k messages 3,196 in size with no subscriber active.
When all messages have been published Turn on subscriber.
During subscriber's active download of the messages shut down the node that 
is the master of the queue.
Queue will fail over to the next oldest node.
Subscriber connection will not be inturrupted but will stop receiving data.

riobaron-1: the connection server.  The publisher will send to this server. 
 The subscriber will connect to this server.
riooverlord-1: MoData queue master.  queue was mirrored on riobaron-1 and 
cust1-master.  Stats server.
cust1-master: mirror: second oldest server in the cluster.  

During the publishing the high watermark for memory and drive space was 
never reached.

After riooverlord-1 was shutdown using Windows Service manager to shutdown 
the rabbitMQ service.  cust1-master was promoted as the MoData queue master.
Subscriber data send stopped but the connection was still active.  Timeout 
on the connection was set to 25 seconds.  The connection never timed out 
even after letting it sit for 24 hours.

Results of log files:
riobaron-1's log file
=INFO REPORT==== 16-Oct-2012::08:02:15 ===
rabbit on node 'rabbit at RIOOVERLORD-1' down


=ERROR REPORT==== 16-Oct-2012::08:02:15 ===
Mnesia('rabbit at RIOBARON-1'): ** ERROR ** mnesia_event got 
{inconsistent_database, bad_decision, 'rabbit at RIOOVERLORD-1'}

cust1-master's log file

=INFO REPORT==== 16-Oct-2012::08:02:18 ===
Statistics database started.


=INFO REPORT==== 16-Oct-2012::08:02:18 ===
Mirrored-queue (queue 'Q' in vhost '/'): Slave 
<'rabbit at CUST1-MASTER'.2.255.0> saw deaths of mirrors 
<'rabbit at RIOOVERLORD-1'.1.220.0> 


=INFO REPORT==== 16-Oct-2012::08:02:18 ===
Mirrored-queue (queue 'Q' in vhost '/'): Promoting slave 
<'rabbit at CUST1-MASTER'.2.255.0> to master


=INFO REPORT==== 16-Oct-2012::08:02:18 ===
rabbit on node 'rabbit at RIOOVERLORD-1' down


=INFO REPORT==== 16-Oct-2012::08:02:18 ===
Mirrored-queue (queue 'MoData' in vhost 'MoBunnyMoProblems'): Slave 
<'rabbit at CUST1-MASTER'.2.253.0> saw deaths of mirrors 
<'rabbit at RIOOVERLORD-1'.1.219.0> 


=INFO REPORT==== 16-Oct-2012::08:02:18 ===
Mirrored-queue (queue 'MoData' in vhost 'MoBunnyMoProblems'): Promoting 
slave <'rabbit at CUST1-MASTER'.2.253.0> to master


=ERROR REPORT==== 16-Oct-2012::08:02:18 ===
Mnesia('rabbit at CUST1-MASTER'): ** ERROR ** mnesia_event got 
{inconsistent_database, bad_decision, 'rabbit at RIOOVERLORD-1'}

riooverlord-1's log file

=INFO REPORT==== 16-Oct-2012::08:02:16 ===
Stopping Rabbit


=INFO REPORT==== 16-Oct-2012::08:02:16 ===
    application: rabbitmq_management
    exited: stopped
    type: permanent


=INFO REPORT==== 16-Oct-2012::08:02:16 ===
    application: rabbitmq_management_agent
    exited: stopped
    type: permanent


=INFO REPORT==== 16-Oct-2012::08:02:16 ===
stopped TCP Listener on 0.0.0.0:5672


=INFO REPORT==== 16-Oct-2012::08:02:16 ===
stopped TCP Listener on [::]:5672


=INFO REPORT==== 16-Oct-2012::08:02:16 ===
    application: rabbit
    exited: stopped
    type: permanent


=ERROR REPORT==== 16-Oct-2012::08:02:16 ===
Error in process <0.20312.2> on node 'rabbit at RIOOVERLORD-1' with exit 
value: 
{badarg,[{mnesia_tm,commit_participant,6,[{file,"mnesia_tm.erl"},{line,1750}]}]}



=INFO REPORT==== 16-Oct-2012::08:02:16 ===
    application: mnesia
    exited: stopped
    type: permanent


=INFO REPORT==== 16-Oct-2012::08:02:16 ===
    application: os_mon
    exited: stopped
    type: permanent


=INFO REPORT==== 16-Oct-2012::08:02:16 ===
Halting Erlang VM


---------------------------------------------------------------------------------------------
Test 02:
Assumption: send data into cluster. shutdown cluster master node of the 
queue.  subscriber should continue to receive messages.
Result: a new node promoted to master of the queue upon the shutdown. 
 subscriber did not receive data after the promotion.  connection was still 
running and no errors from the client.

Test Notes:
All nodes in the cluster were online and green.  The queue was mirrored on 
all nodes.
queue starts off with 0 messages in the queue.

riobaron-1: connection server.  publisher connected to this server. 
 subscriber connects to this server.  youngest server in the cluster
cust1-master: stats server and master of MoData queue.   oldest server
riooverlord: second oldest server.

After the publisher completed the subscriber was turned on.  During the 
subscribers connection and active data being received the riobaron-1 server 
was shutdown.  The subscriber reconnected to riooverlord-1.  Data resumed 
to send to the subscriber.  During the subscription the cust1-master node 
was shutdown.  Subscription stopped.  connection remained active and 
running.
rioverlord-1 become the new master of the MoData queue.

Waited on the cluster but the connection remained active but no data 
received from the cluster.  When the client is shutdown and makes a new 
connection sending will resume.

Results from log files:

riobaron-1: 
=INFO REPORT==== 17-Oct-2012::08:15:35 ===
rabbit on node 'rabbit at RIOOVERLORD-1' up


=INFO REPORT==== 17-Oct-2012::08:24:46 ===
Stopping Rabbit


=INFO REPORT==== 17-Oct-2012::08:24:46 ===
    application: rabbitmq_management
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:24:46 ===
    application: rabbitmq_management_agent
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:24:46 ===
stopped TCP Listener on 0.0.0.0:5672


=INFO REPORT==== 17-Oct-2012::08:24:46 ===
stopped TCP Listener on [::]:5672


=INFO REPORT==== 17-Oct-2012::08:24:46 ===
    application: rabbit
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:24:46 ===
    application: mnesia
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:24:46 ===
    application: os_mon
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:24:46 ===
Halting Erlang VM

cust1-master:
=INFO REPORT==== 17-Oct-2012::08:15:12 ===
Statistics database started.


=INFO REPORT==== 17-Oct-2012::08:15:32 ===
rabbit on node 'rabbit at RIOBARON-1' up


=INFO REPORT==== 17-Oct-2012::08:15:38 ===
rabbit on node 'rabbit at RIOOVERLORD-1' up


=INFO REPORT==== 17-Oct-2012::08:16:03 ===
Adding mirror of queue 'MoData' in vhost 'MoBunnyMoProblems' on node 
'rabbit at RIOOVERLORD-1': <3151.343.0>


=INFO REPORT==== 17-Oct-2012::08:16:03 ===
Adding mirror of queue 'MoData' in vhost 'MoBunnyMoProblems' on node 
'rabbit at RIOBARON-1': <3150.354.0>


=INFO REPORT==== 17-Oct-2012::08:24:49 ===
rabbit on node 'rabbit at RIOBARON-1' down


=INFO REPORT==== 17-Oct-2012::08:27:11 ===
rabbit on node 'rabbit at RIOBARON-1' up


=INFO REPORT==== 17-Oct-2012::08:46:05 ===
rabbit on node 'rabbit at RIOBARON-1' down


=INFO REPORT==== 17-Oct-2012::08:46:44 ===
Stopping Rabbit


=INFO REPORT==== 17-Oct-2012::08:46:44 ===
    application: rabbitmq_management
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:46:44 ===
    application: rabbitmq_management_agent
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:46:44 ===
stopped TCP Listener on 0.0.0.0:5672


=INFO REPORT==== 17-Oct-2012::08:46:44 ===
stopped TCP Listener on [::]:5672


=INFO REPORT==== 17-Oct-2012::08:46:45 ===
    application: rabbit
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:46:45 ===
    application: mnesia
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:46:45 ===
    application: os_mon
    exited: stopped
    type: permanent


=INFO REPORT==== 17-Oct-2012::08:46:45 ===
Halting Erlang VM

riooverlord-1:

=INFO REPORT==== 17-Oct-2012::08:24:47 ===
rabbit on node 'rabbit at RIOBARON-1' down


=INFO REPORT==== 17-Oct-2012::08:24:47 ===
Mirrored-queue (queue 'MoData' in vhost 'MoBunnyMoProblems'): Slave 
<'rabbit at RIOOVERLORD-1'.3.343.0> saw deaths of mirrors 
<'rabbit at RIOBARON-1'.2.354.0> 


=INFO REPORT==== 17-Oct-2012::08:24:47 ===
Mirrored-queue (queue 'Q' in vhost '/'): Slave 
<'rabbit at RIOOVERLORD-1'.3.258.0> saw deaths of mirrors 
<'rabbit at RIOBARON-1'.2.252.0> 



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20121017/59a63a76/attachment.htm>


More information about the rabbitmq-discuss mailing list