[rabbitmq-discuss] RabbitMQ broker crashing under heavy load with mirrored queues

Venkat vveludan at gmail.com
Mon Jan 23 00:39:27 GMT 2012


Hi Steve, sorry for the delayed response. Please find my answers below:

> is it still true that the rabbitmqctl report
> command is failing on t-2? I wondered if this might be something to do with the
> user you are running the rabbitmqctl command under?  Try it with and without
> sudo, for example.

I was running the command without sudo. I will try with sudo and let
you know.

> Your message rate of 40k with only one exception/loss is good, isn't it? I'm not
> certain that the recreate connection code you have used is all necessary, but if
> it works for you, that's fine. What made you put in a 2-second delay (why a
> delay and why 2 seconds)?

Steve, in the HAProxy config the check interval was set to 2 seconds,
so I used a 2-second delay to match it.
While posting 40K messages, only one message had to be sent through the
retry path, when I brought down the broker. In other words, all 40K
messages were posted to the queue.
But the consumer was losing 4K to 5K messages. I ran several tests, and
4K-5K messages were consistently lost.
Finally I enabled channel transactions on the publisher as follows:
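As an aside, the retry-after-one-check-interval idea above can be factored into a small generic helper. This is only an illustrative sketch (the class and method names are hypothetical, not from spring-amqp): if a send fails because HAProxy has not yet noticed the dead backend, wait roughly one health-check interval and try again.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper illustrating the pattern described above:
// retry a task up to maxAttempts times, sleeping delayMillis between
// attempts (e.g. 2000 ms, to match the HAProxy check interval).
public class RetryWithDelay {
    public static <T> T call(Callable<T> task, int maxAttempts, long delayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;  // remember the failure and retry after the delay
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;  // all attempts failed; surface the last exception
    }
}
```

The publish-with-retry code further down the thread is essentially one unrolled iteration of this loop, with the added step of rebuilding the ConnectionFactory before the second attempt.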
Finally I used channel transaction while posting messages as follows:
	@Bean
	public RabbitTemplate rabbitTemplate() {
		RabbitTemplate template = new RabbitTemplate(connectionFactory());
		// Transacted channel: each send is wrapped in tx.select/tx.commit
		template.setChannelTransacted(true);
		template.setMessageConverter(messageConverter());
		configureMDBTemplate(template);
		return template;
	}

Steve, I am not sure whether I can use publisher confirms with
spring-amqp; that is why I used channel transactions instead.
Even with channel transactions, 4K to 5K messages were lost,
consistently across 10-15 runs.
I verified the loss with the queue consumer stopped, so that I could
track the received message count.
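For reference, the publisher confirms Steve suggested amount to the broker acknowledging publish sequence numbers, where an ack with multiple=true covers every outstanding tag up to and including that number. The client-side bookkeeping can be sketched independently of any library (the class and method names here are hypothetical, not part of spring-amqp or the plain Java client):

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical bookkeeping for publisher confirms: record the sequence
// number of each published message, and remove entries as the broker
// acks them. An ack with multiple=true confirms every tag up to and
// including the given one; multiple=false confirms only that tag.
public class ConfirmTracker {
    private final TreeSet<Long> outstanding = new TreeSet<>();

    public void published(long seqNo) {
        outstanding.add(seqNo);
    }

    public void handleAck(long seqNo, boolean multiple) {
        if (multiple) {
            outstanding.headSet(seqNo, true).clear(); // confirms tags <= seqNo
        } else {
            outstanding.remove(seqNo);
        }
    }

    public Set<Long> unconfirmed() {
        return outstanding;
    }
}
```

With the plain Java client this bookkeeping would hang off a ConfirmListener registered after channel.confirmSelect(); any tag still unconfirmed when the connection drops is a candidate for republishing, which is exactly the loss window a transacted channel also tries to close, at higher cost per message.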

Thanks
Venkat

On Jan 16, 10:21 am, Steve Powell <st... at rabbitmq.com> wrote:
> Hi Venkat,
>
> I'm not at all sure what is happening with the cluster nodes. But it is hard to
> tell with the information provided.  It looks as though your nodes are both
> running as disc nodes happily -- is it still true that the rabbitmqctl report
> command is failing on t-2? I wondered if this might be something to do with the
> user you are running the rabbitmqctl command under?  Try it with and without
> sudo, for example.
>
> Your message rate of 40k with only one exception/loss is good, isn't it? I'm not
> certain that the recreate connection code you have used is all necessary, but if
> it works for you, that's fine. What made you put in a 2-second delay (why a
> delay and why 2 seconds)?
>
> The only other thing I might suggest is that you investigate publisher confirms.
> This is a lightweight way of knowing that a publish actually got through to the
> rabbitmq node and was successfully passed on (or stored). See
> [http://www.rabbitmq.com/blog/2011/02/10/introducing-publisher-confirms/] for an
> introduction using Java, and
> [http://www.rabbitmq.com/extensions.html#publishing] for the AMQP details. It
> may be just what you want to know when wondering if your message is lost.
>
> Steve Powell  (a happy bunny)
> ----------some more definitions from the SPD----------
> avoirdupois (phr.) 'Would you like peas with that?'
> distribute (v.) To denigrate an award ceremony.
> definite (phr.) 'It's hard of hearing, I think.'
> modest (n.) The most mod.
>
> On 13 Jan 2012, at 05:03, Venkat wrote:
>
> (You did start_app after the cluster command, didn't you???  :-))
>
> Hi Steve I did restart the the app.
> Following are the steps I have performed on both nodes:
>
> Starting the second node t-4:
> ./rabbitmq-server -detached
>
> Steps to join t-4 node to t-2:
> /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl stop_app
> /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl reset
> /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl cluster rabbit at t-2 rabbit at t-4
> Clustering node 'rabbit at t-4' with ['rabbit at t-2',
>                                     'rabbit at t-4'] ...
> ...done.
> /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl start_app
> Starting node 'rabbit at t-4' ...
> ...done.
>
> Running cluster_status on t-4 node:
> [ecloud at t-4 sbin]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl cluster_status
> Cluster status of node 'rabbit at t-4' ...
> [{nodes,[{disc,['rabbit at t-4','rabbit at t-2']}]},
> {running_nodes,['rabbit at t-2','rabbit at t-4']}]
> ...done.
>
> Running cluster_status on t-2 node (to which t-4 is joined):
> [ecloud at t-2 vv]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl cluster_status
> Cluster status of node 'rabbit at t-2' ...
> [{nodes,[{disc,['rabbit at t-4','rabbit at t-2']}]},
> {running_nodes,['rabbit at t-4','rabbit at t-2']}]
> ...done.
>
> --------------------------------------------------------------------------------
> I have been testing the HA feature with a different scenario.
> In my previous test the messages were pumped in by a SOAP service,
> which published at a slow rate.
> This time I used a test that pumps in messages by calling a plain Java
> service, and I increased the message count from 20K to 40K.
> I am finding that messages are lost while pumping into the queue.
> As you mentioned earlier, this could be due to connecting to a dead
> broker.
> I modified the producer code to wait 2 seconds and then set a fresh
> ConnectionFactory, as follows:
>
> @Override
> public void convertAndSend(final Object message) throws AmqpException
> {
>   MessageProperties props = null;
>   try {
>     props = new MessageProperties();
>     props.setDeliveryMode(MessageDeliveryMode.PERSISTENT);   // setting delivery mode as PERSISTENT
>     send(getMessageConverter().toMessage(message, props));
>   } catch (AmqpException amqpe) {
>     System.out.println("Exception occurred while sending:" + amqpe.getMessage());
>     try {
>       Thread.sleep(2000);
>     } catch (InterruptedException e) {
>       e.printStackTrace();
>     }
>     Properties props1 = FrameworkServiceLocator.getInstance().
>       getCommonsConfigurationService(ServiceConstants.DMB_COMMONS_CONFIG_SERVICE).
>       getProperties(CommonsConfigurationConstants.RABBIT_MQ_CONFIG_NAME);
>     String rabbitMQUser = props1.getProperty(CommonsConfigurationConstants.RABBITMQ_USER);
>     String rabbitMQPassword = props1.getProperty(CommonsConfigurationConstants.RABBITMQ_PASSWORD);
>     String rabbitMQHost = props1.getProperty(CommonsConfigurationConstants.RABBITMQ_HOST);
>     String rabbitMQChannelCacheSize = props1.getProperty(CommonsConfigurationConstants.RABBITMQ_CHANNEL_CACHE_SIZE);
>     CachingConnectionFactory connectionFactory = new CachingConnectionFactory(rabbitMQHost);
>     connectionFactory.setChannelCacheSize(Integer.parseInt(rabbitMQChannelCacheSize));
>     connectionFactory.setUsername(rabbitMQUser);
>     connectionFactory.setPassword(rabbitMQPassword);
>     setConnectionFactory(connectionFactory);
>     try {
>       send(getMessageConverter().toMessage(message, props));
>     } catch (AmqpException e1) {
>       e1.printStackTrace();
>     }
>   }
> }
>
> After this change was made, I saw an exception occur once while
> sending 40K messages:
> java.net.SocketException: Broken pipe
> I ran the test 10-15 times; each time 5K-6K messages were lost,
> but this exception occurred only once.
>
> Thanks
> Venkat
>
> On Jan 11, 12:55 pm, Steve Powell <st... at rabbitmq.com> wrote:
> Hi Venkat,
>
> This time there were no messages lost. All 20K messages were
> processed.
>
> That's great.
>
> I'm trying to figure out what might be wrong with
> rabbitmqctl report; I'll get back to you.
>
> Meanwhile, running
>        rabbitmqctl -n rabbit at t-2 status
> ON NODE t-4 might be interesting.
>
> Also, can you tell us the output from
>        rabbitmqctl cluster_status
> on both nodes, please.
>
> It is not clear if you have issued the stop_app and start_app and
> reset/force_reset commands properly (you probably have), so could you follow
> the steps as described in the clustering guide, and issue
> rabbitmqctl cluster_status on both nodes after each cluster change?
> We should be able to see where things went wrong, then.
>
> (You did start_app after the cluster command, didn't you???  :-))
>
> Cheers,
>
> Steve Powell  (a hoppy bunny)
> ----------some more definitions from the SPD----------
> avoirdupois (phr.) 'Would you like peas with that?'
> distribute (v.) To denigrate an award ceremony.
> definite (phr.) 'It's hard of hearing, I think.'
> modest (n.) The most mod.
>
> On 11 Jan 2012, at 01:22, Venkat wrote:
> > ...
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-disc... at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

