[rabbitmq-discuss] RabbitMQ broker crashing under heavy load with mirrored queues

Venkat vveludan at gmail.com
Tue Jan 10 23:13:15 GMT 2012


Hi Steve, please find the following:

> If you are lucky (and it appears that you are, or else you are
> using auto acknowledgements) then none of the acknowledgements are lost
> (or none are required!).
In my queue consumer no acknowledgeMode is set.
SimpleMessageListenerContainer defaults the acknowledgeMode to AUTO when
it is not set explicitly.
In other words, auto acknowledgement is used.
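
For reference, here is a minimal sketch (not my actual consumer code; the
host name and queue name are placeholders) showing how a
SimpleMessageListenerContainer ends up with AUTO acknowledgements when
nothing is configured:

	import org.springframework.amqp.core.Message;
	import org.springframework.amqp.core.MessageListener;
	import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
	import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

	public class ConsumerSketch {

		public static SimpleMessageListenerContainer buildContainer() {
			// placeholder host; the real test connects through HAProxy
			CachingConnectionFactory connectionFactory = new CachingConnectionFactory("t-2");

			SimpleMessageListenerContainer container =
					new SimpleMessageListenerContainer(connectionFactory);
			container.setQueueNames("test.queue");   // placeholder queue name
			container.setMessageListener(new MessageListener() {
				public void onMessage(Message message) {
					// process the message; if this returns normally the
					// container acknowledges it on our behalf
				}
			});
			// No call to setAcknowledgeMode(...), so the container keeps its
			// default of AcknowledgeMode.AUTO and acknowledges each message
			// automatically after the listener completes.
			return container;
		}
	}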

> If they do not resend, then this could be the source of the lost messages
> -- they were not sent in the first place.
The producer may have been connected to the crashed node.
I have seen two AmqpExceptions in the log.
I didn't have logic in my producer code to resend the message on an
AmqpException.
In the catch block I have added code to resend the message, as follows:
	@Override
	public void convertAndSend(final Object message) {
		MessageProperties props = null;
		try {
			props = new MessageProperties();
			// setting delivery mode as PERSISTENT
			props.setDeliveryMode(MessageDeliveryMode.PERSISTENT);
			send(getMessageConverter().toMessage(message, props));
		} catch (AmqpException amqpe) {
			// resend the message once if the first publish attempt fails
			send(getMessageConverter().toMessage(message, props));
		}
	}
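
A single resend in the catch block can of course still fail if the broker
stays unreachable for a moment. For completeness, here is a more defensive
variant of the same method that I sketched but did not use in the test run
below; the retry count and delay are arbitrary illustration values:

	// Sketch only (not the code used in the test run below): retry the
	// publish a bounded number of times instead of exactly once.
	@Override
	public void convertAndSend(final Object message) {
		MessageProperties props = new MessageProperties();
		props.setDeliveryMode(MessageDeliveryMode.PERSISTENT);

		AmqpException lastError = null;
		for (int attempt = 0; attempt < 3; attempt++) {
			try {
				send(getMessageConverter().toMessage(message, props));
				return;
			} catch (AmqpException amqpe) {
				lastError = amqpe;
				try {
					Thread.sleep(500);   // brief pause before retrying
				} catch (InterruptedException ie) {
					Thread.currentThread().interrupt();
					break;
				}
			}
		}
		throw lastError;   // give up once the retries are exhausted
	}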



After adding the code to resend the message on exception, I reran the
test.
This time no messages were lost; all 20K messages were processed.
[ecloud at t-3 log]$ grep Thread-0 central-log.log | wc -l
2000
[ecloud at t-3 log]$ grep Thread-1 central-log.log | wc -l
2000
[ecloud at t-3 log]$ grep Thread-2 central-log.log | wc -l
2000
[ecloud at t-3 log]$ grep Thread-3 central-log.log | wc -l
2000
[ecloud at t-3 log]$ grep Thread-4 central-log.log | wc -l
2000
[ecloud at t-3 log]$ grep Thread-5 central-log.log | wc -l
2000
[ecloud at t-3 log]$ grep Thread-6 central-log.log | wc -l
2000
[ecloud at t-3 log]$ grep Thread-7 central-log.log | wc -l
2000
[ecloud at t-3 log]$ grep Thread-8 central-log.log | wc -l
2000
[ecloud at t-3 log]$ grep Thread-9 central-log.log | wc -l
2000

> This is interesting, too. Can you supply us with the complete output from
> rabbitmqctl status for both nodes, and explain exactly what you mean by
> 'run rabbitmqctl on NodeA'?

I meant running the report command.
This time I have captured all the details.
At the beginning of the test I had only node t-2, and the report command
worked fine on t-2.

Then I joined node t-4 to the cluster as follows:

/usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl stop_app
/usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl reset
/usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl cluster rabbit at t-2 rabbit at t-4

Then I ran the report command on t-2. The following error was displayed:
[ecloud at t-2 vv]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl report > /home/ecloud/vv/rep3.txt
Error: unable to connect to node 'rabbit at t-2': nodedown
diagnostics:
- nodes and their ports on t-2: [{rabbit,19667},
                                       {rabbitmqctl28789,44074}]
- current node: 'rabbitmqctl28789 at t-2'
- current node home dir: /home/ecloud
- current node cookie hash: VLhPX0Ti0bNE//tFwfQQGA==

Then I ran the status command on node t-2:

[ecloud at t-2 vv]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl -n rabbit at t-2 status

Status of node 'rabbit at t-2' ...
[{pid,28188},
 {running_applications,[{rabbit,"RabbitMQ","2.7.1"},
                        {mnesia,"MNESIA  CXC 138 12","4.4.19"},
                        {os_mon,"CPO  CXC 138 46","2.2.6"},
                        {sasl,"SASL  CXC 138 11","2.1.9.4"},
                        {stdlib,"ERTS  CXC 138 10","1.17.4"},
                        {kernel,"ERTS  CXC 138 10","2.14.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1]
[async-threads:30] [kernel-poll:true]\n"},
 {memory,[{total,60196864},
          {processes,10386136},
          {processes_used,10379416},
          {system,49810728},
          {atom,1122009},
          {atom_used,1117604},
          {binary,69968},
          {code,11235581},
          {ets,793680}]},
 {vm_memory_high_watermark,0.39999999962067656},
 {vm_memory_limit,843607244}]
...done.

The following is the output of the status command on node t-4:
[ecloud at t-4 ~]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl -n rabbit at t-4 status
Status of node 'rabbit at t-4' ...
[{pid,22424},
 {running_applications,[{rabbit,"RabbitMQ","2.7.1"},
                        {os_mon,"CPO  CXC 138 46","2.2.6"},
                        {mnesia,"MNESIA  CXC 138 12","4.4.19"},
                        {sasl,"SASL  CXC 138 11","2.1.9.4"},
                        {stdlib,"ERTS  CXC 138 10","1.17.4"},
                        {kernel,"ERTS  CXC 138 10","2.14.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R14B03 (erts-5.8.4) [source] [64-bit] [smp:
4:4] [rq:4]
 [async-threads:30] [kernel-poll:true]\n"},
 {memory,[{total,68521368},
          {processes,10537392},
          {processes_used,10500904},
          {system,57983976},
          {atom,1125249},
          {atom_used,1122051},
          {binary,3123584},
          {code,11235605},
          {ets,2940008}]},
 {vm_memory_high_watermark,0.3999999994293313},
 {vm_memory_limit,420559257}]
...done.


Thanks
Venkat




On Jan 10, 11:27 am, Steve Powell <st... at rabbitmq.com> wrote:
> Hi Venkat,
>
> I'm glad things are better under 2.7.1.
>
> > I have one question, referring to http://www.rabbitmq.com/ha.html:
> >> As a result of the requeuing, clients that re-consume from the queue
> >> must be aware that they are likely to subsequently receive messages
> >> that they have seen previously
>
> This is an accurate quote, and is still true.  Acknowledgements are only sent
> to the master and then copied to the slaves, so the slaves might not know
> about some of them if the master goes down before some acknowledgements can
> be forwarded.  If you are lucky (and it appears that you are, or else you are
> using auto acknowledgements) then none of the acknowledgements are lost
> (or none are required!).
>
> > You notice that there are two lines displaying 1999, this is because
> > two messages were lost. Otherwise you see 2000 messages processed
> > from each thread.
>
> > From this, does it mean that I don't have to worry about duplicate
> > messages due to requeing?
>
> No, it doesn't mean that. If you have explicit acknowledgements by your
> consumers, then when the master fails the slave may redeliver some messages
> that were acknowledged, as well as the ones that weren't.
>
> What interests me is the messages that are lost. If I understand it
> correctly, messages are published to the master and all the slaves
> simultaneously, so the failure of the master shouldn't lose any messages.
>
> Having said that, you haven't said to which broker your test apps connect.
> If they were connected to the master at the time, then what do they do when
> the master fails?  Do they automatically reconnect (I presume this is in
> the tests' logs)? Do they resend the last message (which will have failed
> because the connection will have been dropped)?
>
> If they do not resend, then this could be the source of the lost messages
> -- they were not sent in the first place.
>
> Please can you explain just a little more about the test thread connection
> history, and to which broker they are connected?  I would expect that, if
> they are connected to the slave, then you won't see any lost messages in
> this test scenario.
>
> > I assuming that HA Proxy
> > was not quick enough to detect about the Crashed Node A and thus those
> > 2/3 messages were routed to crashed NodeA. Please correct me if I am
> > wrong.
>
> I don't think this is the problem, as messages are published to all the
> brokers mirroring the queue.
>
> > The other thing that I just wanted to bring it to your attention (it
> > doesn't bother me). It is as follows:
> > I have NodeA in the beginning of the cluster then I join NodeB to the
> > cluster.
> > If I run rabbitmqctl report on NodeA, it throws an error saying that
> > NodeA is down (when it is really not down). But it works fine on
> > NodeB.
>
> This is interesting, too. Can you supply us with the complete output from
> rabbitmqctl status for both nodes, and explain exactly what you mean by
> 'run rabbitmqctl on NodeA'?
>
> Thank you for reporting these issues.
>
> Steve Powell  (a curious bunny)
> ----------some more definitions from the SPD----------
> avoirdupois (phr.) 'Would you like peas with that?'
> distribute (v.) To denigrate an award ceremony.
> definite (phr.) 'It's hard of hearing, I think.'
> modest (n.) The most mod.
>
> On 9 Jan 2012, at 23:39, Venkat wrote:
>
> > Hi Steve I have run some tests using RabbitMQ 2.7.1 please find the
> > following:
>
> ...(elided)
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-disc... at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
