[rabbitmq-discuss] RabbitMQ broker crashing under heavy load with mirrored queues

Venkat vveludan at gmail.com
Wed Jan 11 01:22:28 GMT 2012


Steve, I just wanted to add a further note.
I didn't have to write any reconnection mechanism for the queue consumer,
because Spring AMQP provides a built-in reconnection mechanism in
SimpleMessageListenerContainer.
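
For reference, here is a minimal sketch of the kind of container setup I mean
(the host, queue name and listener body are placeholders rather than my actual
configuration); the classes come from the org.springframework.amqp packages:

        // assumes a CachingConnectionFactory pointing at the load balancer / broker host
        CachingConnectionFactory connectionFactory = new CachingConnectionFactory("haproxy-host");
        SimpleMessageListenerContainer container = new SimpleMessageListenerContainer();
        container.setConnectionFactory(connectionFactory);
        container.setQueueNames("test.queue");                // placeholder queue name
        container.setMessageListener(new MessageListener() {
                public void onMessage(Message message) {
                        // application processing goes here
                }
        });
        container.setRecoveryInterval(5000);   // after a connection failure, retry every 5 seconds
        container.start();

When the connection drops (for example because the node it was attached to
crashed), the container keeps retrying at the recovery interval, so the
consumer resumes on its own once a broker is reachable again.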

Thanks
Venkat

On Jan 10, 6:13 pm, Venkat <vvelu... at gmail.com> wrote:
> Hi Steve, please find the following:
>
> > If you are lucky (and it appears that you are, or else you are
> > using auto acknowledgements) then none of the acknowledgements are lost
> > (or none are required!).
>
> In my queue consumer no acknowledgeMode is set.
> SimpleMessageListenerContainer defaults the acknowledgeMode to AUTO when it
> is not set explicitly.
> In other words, auto acknowledgement is used.
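>
> For completeness, that default could also be set explicitly on the container.
> This is only a one-line sketch, not something I changed for this test:
>
>         container.setAcknowledgeMode(AcknowledgeMode.AUTO);   // the container acks once the listener returns normally
>
> With MANUAL the listener has to acknowledge for itself, and NONE turns
> acknowledgements off entirely.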
>
> > If they do not resend, then this could be the source of the lost messages
> > -- they were not sent in the first place.
>
> The producer may have been connected to the crashed node.
> I have seen two AmqpExceptions in the log.
> I didn't have any logic in my producer code to resend the message on an
> AmqpException.
> In the catch block I have now added code to resend the message, as
> follows:
>         @Override
>         public void convertAndSend(final Object message) {
>                 MessageProperties props = new MessageProperties();
>                 // mark the message as persistent so it survives a broker restart
>                 props.setDeliveryMode(MessageDeliveryMode.PERSISTENT);
>                 try {
>                         send(getMessageConverter().toMessage(message, props));
>                 } catch (AmqpException amqpe) {
>                         // the connection was probably lost (e.g. the node we were
>                         // attached to crashed); resend on the re-established connection
>                         send(getMessageConverter().toMessage(message, props));
>                 }
>         }
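>
> A single blind retry like the one above can still fail if the connection has
> not been re-established yet. A slightly more defensive variant (just a sketch
> with an arbitrary retry count and pause, not what I ran in this test) would be:
>
>         // sketch: bounded retry with a short pause between attempts
>         private void sendWithRetry(Message rabbitMessage) {
>                 for (int attempt = 1; ; attempt++) {
>                         try {
>                                 send(rabbitMessage);
>                                 return;
>                         } catch (AmqpException amqpe) {
>                                 if (attempt >= 3) {
>                                         throw amqpe;          // give up after three attempts
>                                 }
>                                 try {
>                                         Thread.sleep(500);    // brief pause before retrying
>                                 } catch (InterruptedException ie) {
>                                         Thread.currentThread().interrupt();
>                                         throw amqpe;
>                                 }
>                         }
>                 }
>         }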
>
> After adding the code to resend the message on exception, I reran the
> test.
> This time no messages were lost; all 20K messages were
> processed.
> [ecloud@t-3 log]$ grep Thread-0 central-log.log | wc -l
> 2000
> [ecloud@t-3 log]$ grep Thread-1 central-log.log | wc -l
> 2000
> [ecloud@t-3 log]$ grep Thread-2 central-log.log | wc -l
> 2000
> [ecloud@t-3 log]$ grep Thread-3 central-log.log | wc -l
> 2000
> [ecloud@t-3 log]$ grep Thread-4 central-log.log | wc -l
> 2000
> [ecloud@t-3 log]$ grep Thread-5 central-log.log | wc -l
> 2000
> [ecloud@t-3 log]$ grep Thread-6 central-log.log | wc -l
> 2000
> [ecloud@t-3 log]$ grep Thread-7 central-log.log | wc -l
> 2000
> [ecloud@t-3 log]$ grep Thread-8 central-log.log | wc -l
> 2000
> [ecloud@t-3 log]$ grep Thread-9 central-log.log | wc -l
> 2000
>
> > This is interesting, too. Can you supply us with the complete output from
> > rabbitmqctl status for both nodes, and explain exactly what you mean by
> > 'run rabbitmqctl on NodeA'?
>
> I meant running the report command.
> This time I have captured all the details.
> At the beginning of the test I had only node t-2.
> Running the report command worked fine on the t-2 node.
>
> Then I joined the t-4 node to the cluster as follows:
>
> /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl stop_app
> /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl reset
> /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl cluster rabbit@t-2 rabbit@t-4
>
> Then I ran the report command on t-2. The following error was displayed:
> [ecloud@t-2 vv]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl report > /home/ecloud/vv/rep3.txt
> Error: unable to connect to node 'rabbit@t-2': nodedown
> diagnostics:
> - nodes and their ports on t-2: [{rabbit,19667},
>                                        {rabbitmqctl28789,44074}]
> - current node: 'rabbitmqctl28789@t-2'
> - current node home dir: /home/ecloud
> - current node cookie hash: VLhPX0Ti0bNE//tFwfQQGA==
>
> I ran the status command on the t-2 node:
>
> [ecloud@t-2 vv]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl -n rabbit@t-2 status
>
> Status of node 'rabbit@t-2' ...
> [{pid,28188},
>  {running_applications,[{rabbit,"RabbitMQ","2.7.1"},
>                         {mnesia,"MNESIA  CXC 138 12","4.4.19"},
>                         {os_mon,"CPO  CXC 138 46","2.2.6"},
>                         {sasl,"SASL  CXC 138 11","2.1.9.4"},
>                         {stdlib,"ERTS  CXC 138 10","1.17.4"},
>                         {kernel,"ERTS  CXC 138 10","2.14.4"}]},
>  {os,{unix,linux}},
>  {erlang_version,"Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
>  {memory,[{total,60196864},
>           {processes,10386136},
>           {processes_used,10379416},
>           {system,49810728},
>           {atom,1122009},
>           {atom_used,1117604},
>           {binary,69968},
>           {code,11235581},
>           {ets,793680}]},
>  {vm_memory_high_watermark,0.39999999962067656},
>  {vm_memory_limit,843607244}]
> ...done.
>
> The following is the output of the status command on the t-4 node:
> [ecloud@t-4 ~]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl -n rabbit@t-4 status
> Status of node 'rabbit@t-4' ...
> [{pid,22424},
>  {running_applications,[{rabbit,"RabbitMQ","2.7.1"},
>                         {os_mon,"CPO  CXC 138 46","2.2.6"},
>                         {mnesia,"MNESIA  CXC 138 12","4.4.19"},
>                         {sasl,"SASL  CXC 138 11","2.1.9.4"},
>                         {stdlib,"ERTS  CXC 138 10","1.17.4"},
>                         {kernel,"ERTS  CXC 138 10","2.14.4"}]},
>  {os,{unix,linux}},
>  {erlang_version,"Erlang R14B03 (erts-5.8.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:30] [kernel-poll:true]\n"},
>  {memory,[{total,68521368},
>           {processes,10537392},
>           {processes_used,10500904},
>           {system,57983976},
>           {atom,1125249},
>           {atom_used,1122051},
>           {binary,3123584},
>           {code,11235605},
>           {ets,2940008}]},
>  {vm_memory_high_watermark,0.3999999994293313},
>  {vm_memory_limit,420559257}]
> ...done.
>
> Thanks
> Venkat
>
> On Jan 10, 11:27 am, Steve Powell <st... at rabbitmq.com> wrote:
>
> > Hi Venkat,
>
> > I'm glad things are better under 2.7.1.
>
> > > I have one question, referring to http://www.rabbitmq.com/ha.html:
> > >> As a result of the requeuing, clients that re-consume from the queue
> > >> must be aware that they are likely to subsequently receive messages
> > >> that they have seen previously
>
> > This is an accurate quote, and is still true.  Acknowledgements are only sent
> > to the master and then copied to the slaves, so the slaves might not know
> > about some of them if the master goes down before some acknowledgements can
> > be forwarded.  If you are lucky (and it appears that you are, or else you are
> > using auto acknowledgements) then none of the acknowledgements are lost
> > (or none are required!).
>
> > > You notice that there are two lines displaying 1999, this is because
> > > two messages were lost. Otherwise you see 2000 messages processed
> > > from each thread.
>
> > > From this, does it mean that I don't have to worry about duplicate
> > > messages due to requeing?
>
> > No, it doesn't mean that. If you have explicit acknowledgements by your
> > consumers, then when the master fails the slave may redeliver some messages
> > that were acknowledged, as well as the ones that weren't.
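>
> > If the duplicates matter, the usual remedy is to make the consumer idempotent,
> > for example by remembering the ids of messages it has already processed. A
> > rough sketch (it assumes your producer sets a unique messageId on each message,
> > which your current test does not appear to do):
>
> >     // sketch: ignore redeliveries of messages that were already handled
> >     private final Set<String> processedIds =
> >             Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
>
> >     public void onMessage(Message message) {
> >             String id = message.getMessageProperties().getMessageId();
> >             if (id != null && !processedIds.add(id)) {
> >                     return;   // already seen this message; drop the duplicate
> >             }
> >             // ... normal processing ...
> >     }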
>
> > What interests me is the messages that are lost. If I understand it
> > correctly, messages are published to the master and all the slaves
> > simultaneously, so the failure of the master shouldn't lose any messages.
>
> > Having said that, you haven't said to which broker your test apps connect.
> > If they were connected to the master at the time, then what do they do when
> > the master fails?  Do they automatically reconnect (I presume this is in
> > the tests' logs)? Do they resend the last message (which will have failed
> > because the connection will have been dropped)?
>
> > If they do not resend, then this could be the source of the lost messages
> > -- they were not sent in the first place.
>
> > Please can you explain just a little more about the test thread connection
> > history, and to which broker they are connected?  I would expect that, if
> > they are connected to the slave, then you won't see any lost messages in
> > this test scenario.
>
> > > I am assuming that HAProxy
> > > was not quick enough to detect the crashed NodeA and thus those
> > > 2 or 3 messages were routed to the crashed NodeA. Please correct me if I am
> > > wrong.
>
> > I don't think this is the problem, as messages are published to all the
> > brokers mirroring the queue.
>
> > > The other thing that I just wanted to bring to your attention (it
> > > doesn't bother me) is as follows:
> > > I have NodeA in the beginning of the cluster then I join NodeB to the
> > > cluster.
> > > If I run rabbitmqctl report on NodeA, it throws an error saying that
> > > NodeA is down (when it is really not down). But it works fine on
> > > NodeB.
>
> > This is interesting, too. Can you supply us with the complete output from
> > rabbitmqctl status for both nodes, and explain exactly what you mean by
> > 'run rabbitmqctl on NodeA'?
>
> > Thank you for reporting these issues.
>
> > Steve Powell  (a curious bunny)
> > ----------some more definitions from the SPD----------
> > avoirdupois (phr.) 'Would you like peas with that?'
> > distribute (v.) To denigrate an award ceremony.
> > definite (phr.) 'It's hard of hearing, I think.'
> > modest (n.) The most mod.
>
> > On 9 Jan 2012, at 23:39, Venkat wrote:
>
> > > Hi Steve I have run some tests using RabbitMQ 2.7.1 please find the
> > > following:
>
> > ...(elided)
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-disc... at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss


More information about the rabbitmq-discuss mailing list