[rabbitmq-discuss] RabbitMQ broker crashing under heavy load with mirrored queues

Mon Jan 16 15:21:19 GMT 2012

Hi Venkat,

I'm not at all sure what is happening with the cluster nodes; it is hard to
tell with the information provided.  It looks as though your nodes are both
running as disc nodes happily -- is it still true that the rabbitmqctl report
command is failing on t-2? I wondered if this might be something to do with the
user you are running the rabbitmqctl command under?  Try it with and without
sudo, for example.

Your message rate of 40k with only one exception/loss is good, isn't it? I'm not
certain that the recreate connection code you have used is all necessary, but if
it works for you, that's fine. What made you put in a 2-second delay (why a
delay and why 2 seconds)?
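
If the delay is there to give the broker or the network time to recover, a
bounded retry with a growing delay is one common alternative to a single fixed
pause. A rough sketch in plain Java follows. The class name RetryWithBackoff
and its parameters are invented for illustration, and the action passed in
stands for whatever send call you actually use; this is not taken from your
code or from any RabbitMQ or Spring API.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries an action a bounded number of times,
// doubling the wait between attempts.
class RetryWithBackoff {
    static <T> T call(Callable<T> action, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry after an interrupt; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);   // wait before the next attempt
                    delay *= 2;            // back off: 1x, 2x, 4x, ...
                }
            }
        }
        // Every attempt failed: surface the last failure to the caller.
        throw last;
    }
}
```

The point of the bound is that a send which keeps failing is reported to the
caller rather than silently dropped after one retry.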

The only other thing I might suggest is that you investigate publisher confirms.
This is a lightweight way of knowing that a publish actually got through to the
rabbitmq node and was successfully passed on (or stored). See
[http://www.rabbitmq.com/blog/2011/02/10/introducing-publisher-confirms/] for an
introduction using Java, and
[http://www.rabbitmq.com/extensions.html#publishing] for the AMQP details. It
may be just what you want to know when wondering if your message is lost.
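
To make the bookkeeping behind confirms concrete, here is a minimal sketch in
plain Java (no RabbitMQ classes, so it runs on its own) of how a publisher
might track which published messages are still unconfirmed. The class name
ConfirmTracker and its methods are made up for illustration. In a real client
you would feed it the sequence number the channel reports before each publish
and call it from the ack/nack callbacks, where an ack with multiple=true covers
every sequence number up to and including the one given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical helper: remembers published messages until the broker confirms them.
class ConfirmTracker {
    // sequence number -> message body, kept ordered so a "multiple" ack can clear a range
    private final ConcurrentNavigableMap<Long, String> outstanding = new ConcurrentSkipListMap<>();

    // Call just before publishing, with the sequence number assigned to the message.
    void published(long seqNo, String body) {
        outstanding.put(seqNo, body);
    }

    // Broker acknowledged seqNo; if multiple is true, everything up to seqNo is acknowledged.
    void acked(long seqNo, boolean multiple) {
        if (multiple) {
            outstanding.headMap(seqNo, true).clear();
        } else {
            outstanding.remove(seqNo);
        }
    }

    // Broker rejected seqNo (and earlier ones if multiple); returns the bodies to republish.
    List<String> nacked(long seqNo, boolean multiple) {
        List<String> toResend = new ArrayList<>();
        if (multiple) {
            ConcurrentNavigableMap<Long, String> head = outstanding.headMap(seqNo, true);
            toResend.addAll(head.values());
            head.clear();
        } else {
            String body = outstanding.remove(seqNo);
            if (body != null) {
                toResend.add(body);
            }
        }
        return toResend;
    }

    // Messages published but not yet confirmed; non-zero at shutdown means possible loss.
    int unconfirmedCount() {
        return outstanding.size();
    }
}
```

If unconfirmedCount() is still non-zero when a connection drops, those are the
messages to republish on the new connection.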

Steve Powell  (a happy bunny)
----------some more definitions from the SPD----------
avoirdupois (phr.) 'Would you like peas with that?'
distribute (v.) To denigrate an award ceremony.
definite (phr.) 'It's hard of hearing, I think.'
modest (n.) The most mod.

On 13 Jan 2012, at 05:03, Venkat wrote:

> (You did start_app after the cluster command, didn't you???  :-))

Hi Steve, I did restart the app.
Following are the steps I have performed on both nodes:

Starting the second node t-4:
./rabbitmq-server -detached

Steps to join t-4 node to t-2:
/usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl stop_app
/usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl reset
/usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl cluster rabbit@t-2 rabbit@t-4
Clustering node 'rabbit@t-4' with ['rabbit@t-2',
                                   'rabbit@t-4'] ...
...done.
/usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl start_app
Starting node 'rabbit@t-4' ...
...done.

Running cluster_status on t-4 node:
[ecloud@t-4 sbin]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@t-4' ...
[{nodes,[{disc,['rabbit@t-4','rabbit@t-2']}]},
{running_nodes,['rabbit@t-2','rabbit@t-4']}]
...done.

Running cluster_status on t-2 node (to which t-4 is joined):
[ecloud@t-2 vv]$ /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@t-2' ...
[{nodes,[{disc,['rabbit@t-4','rabbit@t-2']}]},
{running_nodes,['rabbit@t-4','rabbit@t-2']}]
...done.

--------------------------------------------------------------------------------
I have been testing the HA feature with a different scenario.
In my previous test the messages were pumped in through a SOAP service,
which pumped messages at a slow rate.
This time I used a test that pumps in messages by calling a plain Java
service, and I increased the number of messages from 20K to 40K.
I am finding that messages are lost while pumping into the queue.
As you mentioned earlier, this could be due to connecting to a dead
broker.
I modified the producer code to wait 2 seconds and then
set a fresh ConnectionFactory, as follows:

@Override
public void convertAndSend(final Object message) throws AmqpException
{
  MessageProperties props = null;
  try {
    props = new MessageProperties();
    props.setDeliveryMode(MessageDeliveryMode.PERSISTENT);   //setting delivery mode as PERSISTENT
    send(getMessageConverter().toMessage(message, props));
  } catch (AmqpException amqpe) {
    System.out.println("Exception occurred while sending:"+amqpe.getMessage());
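    try {
      Thread.sleep(2000);   // wait 2 seconds before reconnecting
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();   // restore the interrupt flag rather than swallowing it
      e.printStackTrace();
    }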
    Properties props1 = FrameworkServiceLocator.getInstance().
      getCommonsConfigurationService(ServiceConstants.DMB_COMMONS_CONFIG_SERVICE).
      getProperties(CommonsConfigurationConstants.RABBIT_MQ_CONFIG_NAME);
    String rabbitMQUser = props1.getProperty(CommonsConfigurationConstants.RABBITMQ_USER);
    String rabbitMQPassword = props1.getProperty(CommonsConfigurationConstants.RABBITMQ_PASSWORD);
    String rabbitMQHost = props1.getProperty(CommonsConfigurationConstants.RABBITMQ_HOST);
    String rabbitMQChannelCacheSize = props1.getProperty(CommonsConfigurationConstants.RABBITMQ_CHANNEL_CACHE_SIZE);
    CachingConnectionFactory connectionFactory = new CachingConnectionFactory(rabbitMQHost);
    connectionFactory.setChannelCacheSize(Integer.parseInt(rabbitMQChannelCacheSize));
    connectionFactory.setUsername(rabbitMQUser);
    connectionFactory.setPassword(rabbitMQPassword);
    setConnectionFactory(connectionFactory);
    try {
      send(getMessageConverter().toMessage(message, props));
    } catch(AmqpException e1) {
      e1.printStackTrace();
    }
  }
}

After this change, I saw the following exception once while
sending 40K messages:
java.net.SocketException: Broken pipe.
I ran the test 10-15 times. Each time 5K-6K messages were lost,
but this exception occurred only once.

Thanks
Venkat

On Jan 11, 12:55 pm, Steve Powell <st... at rabbitmq.com> wrote:
Hi Venkat,

> This time there were no messages lost. All 20K messages were
> processed.

That's great.

I'm trying to figure out what might be wrong with
rabbitmqctl report; I'll get back to you.

Meanwhile, running
       rabbitmqctl -n rabbit@t-2 status
ON NODE t-4 might be interesting.

Also, can you tell us the output from
       rabbitmqctl cluster_status
on both nodes, please.

It is not clear if you have issued the stop_app and start_app and
reset/force_reset commands properly (you probably have), so could you follow
the steps as described in the clustering guide, and issue
rabbitmqctl cluster_status on both nodes after each cluster change?
We should be able to see where things went wrong, then.

(You did start_app after the cluster command, didn't you???  :-))

Cheers,

Steve Powell  (a hoppy bunny)

On 11 Jan 2012, at 01:22, Venkat wrote:
> ...

_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss at lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss