[rabbitmq-discuss] cluster buster

Mark Ward ward.mark at gmail.com
Wed Oct 17 17:27:06 BST 2012


Thanks Simon,

This is putting me on the right track to resolve this in my client.  I am 
using the .NET client and if I am understanding it correctly to handle the 
cancel extension I need to extend the QueueingBasicConsumer and override 
the OnCancel() method to provide my clients access to the cancel extension?

http://www.rabbitmq.com/consumer-cancel.html

http://www.rabbitmq.com/releases/rabbitmq-dotnet-client/v1.4.0/rabbitmq-dotnet-client-1.4.0-net-2.0-htmldoc/type-RabbitMQ.Client.QueueingBasicConsumer.html

-Mark


-------------------------------------------
Oct 17, 2012; 11:10am   Simon MacMullen-2
On 17/10/12 16:53, Mark Ward wrote: 
> The issue is when I shutdown a server using the Windows Services to stop 
> a node the cluster does fail over the queue but does not continue 
> delivering messages to the subscriber.  The subscriber also does not 
> receive any errors.  The client's connection still has a running status 
> but no data flows.  I am able to reproduce this error. 

Have you read this: 

http://www.rabbitmq.com/ha.html#behaviour

"Clients that were consuming from the mirrored-queue and support our 
Consumer Cancellation Notifications extension will receive a 
notification that their subscription to the mirrored-queue has been 
abruptly cancelled. At this point they should re-consume from the queue, 
which will pick up the new master. The reason for sending this 
notification is that informing clients of the loss of the master is 
essential: otherwise the client may continue to issue acknowledgements 
for messages they were sent by the old, failed master, and not expect 
that they might be about to see the same messages again, this time sent 
by the new master. Of course, clients that were connected to the failed 
node will find their connections failed, and will need to reconnect to a 
surviving node of the cluster." 

? 

If the consumer is connected to a node other than the one that failed, 
it will not be disconnected, but it will be sent a cancellation 
notification and will have to reissue basic.consume. 

Cheers, Simon 

-- 
Simon MacMullen 
RabbitMQ, VMware 

On Wednesday, October 17, 2012 10:53:13 AM UTC-5, Mark Ward wrote:
>
> Hello,
> I am testing what happens to the cluster when nodes are shutdown.  The 
> results of my testing have not been positive.  I hope to either be advised 
> if I have a mistake in my cluster setup or if something is really going 
> wrong.
>
> The issue is when I shutdown a server using the Windows Services to stop a 
> node the cluster does fail over the queue but does not continue delivering 
> messages to the subscriber.  The subscriber also does not receive any 
> errors.  The client's connection still has a running status but no data 
> flows.  I am able to reproduce this error.
>
> I have added two test run results to this email. I have more tests and the 
> results saved if more information is required.  I can also rerun future 
> tests.
>
> I have the following puny virtual machines for my cluster.
> 3 Virtual Machines
> • Windows 2008 R2 sp1 64-bit
> • 1 GB of RAM
> • 16 GB of drive space free
> RabbitMQ: 2.8.7
> Erlang: R15B02
> RabbitMQ is setup as the following on each of the nodes:
> [ {kernel, [{inet_dist_listen_min, 55021}, {inet_dist_listen_max, 55021} 
> ] },
> {rabbit, 
> [{cluster_nodes, ['rabbit at RIOBARON-1', 'rabbit at RIOOVERLORD-1', 
> 'rabbit at CUST1-MASTER']}
> ,{vm_memory_high_watermark, 0.2}]
> } 
> ].
>
> Virtual Host: MoBunniesMoProblems
> Queue: MoData: Mirror(all) and Durable
> Admin is setup on all nodes.
> Using the .NET client version: 2.8.7 (nuget download)
> The publisher uses SelectConfirm with messages set to persistent.
> The subscriber uses BasicConsume with BasicQos set depending on the test. 
> (10, 1000, 5)
> Message size has been 3,196 bytes
> 100,000 messages sent in each test.
>
>
> ---------------------------------------------------------------------------------------------
> Test 01:
> Assumptions: shutdown cluster node.  A new server will be master of the 
> queue.  subscriber should continue to receive messages.  reconnection is 
> possible and duplicate messages are possible.
> Result: A node took ownership of the node's queue.  The subscriber failed 
> to continue after the node was shutdown.  connection was never dropped.
>
> Test Notes
> Before testing make sure all servers in the cluster are active and see 
> each other.  Admin confirmed this with all green for nodes.
> Publish 100k messages 3,196 in size with no subscriber active.
> When all messages have been published Turn on subscriber.
> During subscriber's active download of the messages shut down the node 
> that is the master of the queue.
> Queue will fail over to the next oldest node.
> Subscriber connection will not be inturrupted but will stop receiving data.
>
> riobaron-1: the connection server.  The publisher will send to this 
> server.  The subscriber will connect to this server.
> riooverlord-1: MoData queue master.  queue was mirrored on riobaron-1 and 
> cust1-master.  Stats server.
> cust1-master: mirror: second oldest server in the cluster.  
>
> During the publishing the high watermark for memory and drive space was 
> never reached.
>
> After riooverlord-1 was shutdown using Windows Service manager to shutdown 
> the rabbitMQ service.  cust1-master was promoted as the MoData queue master.
> Subscriber data send stopped but the connection was still active.  Timeout 
> on the connection was set to 25 seconds.  The connection never timed out 
> even after letting it sit for 24 hours.
>
> Results of log files:
> riobaron-1's log file
> =INFO REPORT==== 16-Oct-2012::08:02:15 ===
> rabbit on node 'rabbit at RIOOVERLORD-1' down
>
>
> =ERROR REPORT==== 16-Oct-2012::08:02:15 ===
> Mnesia('rabbit at RIOBARON-1'): ** ERROR ** mnesia_event got 
> {inconsistent_database, bad_decision, 'rabbit at RIOOVERLORD-1'}
>
> cust1-master's log file
>
> =INFO REPORT==== 16-Oct-2012::08:02:18 ===
> Statistics database started.
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:18 ===
> Mirrored-queue (queue 'Q' in vhost '/'): Slave 
> <'rabbit at CUST1-MASTER'.2.255.0> saw deaths of mirrors 
> <'rabbit at RIOOVERLORD-1'.1.220.0> 
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:18 ===
> Mirrored-queue (queue 'Q' in vhost '/'): Promoting slave 
> <'rabbit at CUST1-MASTER'.2.255.0> to master
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:18 ===
> rabbit on node 'rabbit at RIOOVERLORD-1' down
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:18 ===
> Mirrored-queue (queue 'MoData' in vhost 'MoBunnyMoProblems'): Slave 
> <'rabbit at CUST1-MASTER'.2.253.0> saw deaths of mirrors 
> <'rabbit at RIOOVERLORD-1'.1.219.0> 
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:18 ===
> Mirrored-queue (queue 'MoData' in vhost 'MoBunnyMoProblems'): Promoting 
> slave <'rabbit at CUST1-MASTER'.2.253.0> to master
>
>
> =ERROR REPORT==== 16-Oct-2012::08:02:18 ===
> Mnesia('rabbit at CUST1-MASTER'): ** ERROR ** mnesia_event got 
> {inconsistent_database, bad_decision, 'rabbit at RIOOVERLORD-1'}
>
> riooverlord-1's log file
>
> =INFO REPORT==== 16-Oct-2012::08:02:16 ===
> Stopping Rabbit
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:16 ===
>     application: rabbitmq_management
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:16 ===
>     application: rabbitmq_management_agent
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:16 ===
> stopped TCP Listener on 0.0.0.0:5672
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:16 ===
> stopped TCP Listener on [::]:5672
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:16 ===
>     application: rabbit
>     exited: stopped
>     type: permanent
>
>
> =ERROR REPORT==== 16-Oct-2012::08:02:16 ===
> Error in process <0.20312.2> on node 'rabbit at RIOOVERLORD-1' with exit 
> value: 
> {badarg,[{mnesia_tm,commit_participant,6,[{file,"mnesia_tm.erl"},{line,1750}]}]}
>
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:16 ===
>     application: mnesia
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:16 ===
>     application: os_mon
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 16-Oct-2012::08:02:16 ===
> Halting Erlang VM
>
>
>
> ---------------------------------------------------------------------------------------------
> Test 02:
> Assumption: send data into cluster. shutdown cluster master node of the 
> queue.  subscriber should continue to receive messages.
> Result: a new node promoted to master of the queue upon the shutdown. 
>  subscriber did not receive data after the promotion.  connection was still 
> running and no errors from the client.
>
> Test Notes:
> All nodes in the cluster were online and green.  The queue was mirrored on 
> all nodes.
> queue starts off with 0 messages in the queue.
>
> riobaron-1: connection server.  publisher connected to this server. 
>  subscriber connects to this server.  youngest server in the cluster
> cust1-master: stats server and master of MoData queue.   oldest server
> riooverlord: second oldest server.
>
> After the publisher completed the subscriber was turned on.  During the 
> subscribers connection and active data being received the riobaron-1 server 
> was shutdown.  The subscriber reconnected to riooverlord-1.  Data resumed 
> to send to the subscriber.  During the subscription the cust1-master node 
> was shutdown.  Subscription stopped.  connection remained active and 
> running.
> rioverlord-1 become the new master of the MoData queue.
>
> Waited on the cluster but the connection remained active but no data 
> received from the cluster.  When the client is shutdown and makes a new 
> connection sending will resume.
>
> Results from log files:
>
> riobaron-1: 
> =INFO REPORT==== 17-Oct-2012::08:15:35 ===
> rabbit on node 'rabbit at RIOOVERLORD-1' up
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:46 ===
> Stopping Rabbit
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:46 ===
>     application: rabbitmq_management
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:46 ===
>     application: rabbitmq_management_agent
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:46 ===
> stopped TCP Listener on 0.0.0.0:5672
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:46 ===
> stopped TCP Listener on [::]:5672
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:46 ===
>     application: rabbit
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:46 ===
>     application: mnesia
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:46 ===
>     application: os_mon
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:46 ===
> Halting Erlang VM
>
> cust1-master:
> =INFO REPORT==== 17-Oct-2012::08:15:12 ===
> Statistics database started.
>
>
> =INFO REPORT==== 17-Oct-2012::08:15:32 ===
> rabbit on node 'rabbit at RIOBARON-1' up
>
>
> =INFO REPORT==== 17-Oct-2012::08:15:38 ===
> rabbit on node 'rabbit at RIOOVERLORD-1' up
>
>
> =INFO REPORT==== 17-Oct-2012::08:16:03 ===
> Adding mirror of queue 'MoData' in vhost 'MoBunnyMoProblems' on node 
> 'rabbit at RIOOVERLORD-1': <3151.343.0>
>
>
> =INFO REPORT==== 17-Oct-2012::08:16:03 ===
> Adding mirror of queue 'MoData' in vhost 'MoBunnyMoProblems' on node 
> 'rabbit at RIOBARON-1': <3150.354.0>
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:49 ===
> rabbit on node 'rabbit at RIOBARON-1' down
>
>
> =INFO REPORT==== 17-Oct-2012::08:27:11 ===
> rabbit on node 'rabbit at RIOBARON-1' up
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:05 ===
> rabbit on node 'rabbit at RIOBARON-1' down
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:44 ===
> Stopping Rabbit
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:44 ===
>     application: rabbitmq_management
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:44 ===
>     application: rabbitmq_management_agent
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:44 ===
> stopped TCP Listener on 0.0.0.0:5672
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:44 ===
> stopped TCP Listener on [::]:5672
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:45 ===
>     application: rabbit
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:45 ===
>     application: mnesia
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:45 ===
>     application: os_mon
>     exited: stopped
>     type: permanent
>
>
> =INFO REPORT==== 17-Oct-2012::08:46:45 ===
> Halting Erlang VM
>
> riooverlord-1:
>
> =INFO REPORT==== 17-Oct-2012::08:24:47 ===
> rabbit on node 'rabbit at RIOBARON-1' down
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:47 ===
> Mirrored-queue (queue 'MoData' in vhost 'MoBunnyMoProblems'): Slave 
> <'rabbit at RIOOVERLORD-1'.3.343.0> saw deaths of mirrors 
> <'rabbit at RIOBARON-1'.2.354.0> 
>
>
> =INFO REPORT==== 17-Oct-2012::08:24:47 ===
> Mirrored-queue (queue 'Q' in vhost '/'): Slave 
> <'rabbit at RIOOVERLORD-1'.3.258.0> saw deaths of mirrors 
> <'rabbit at RIOBARON-1'.2.252.0> 
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20121017/3c074348/attachment.htm>


More information about the rabbitmq-discuss mailing list