[rabbitmq-discuss] 3.1.1 - Errors during failover

Mon Jun 10 16:26:27 BST 2013

Thanks for taking the time to look at those logs.

The purpose of the test is to confirm that the client application, server application, LVS and RMQ itself all perform properly during fail-over. There is certainly messaging activity during the test, although relatively modest by RMQ standards judging from the rates I see discussed on this forum.

Of the 10 queues, three queues were operating with a combined average rate of around 200 messages per second from client app to server app. The publish rate will burst to instantaneous peaks possibly exceeding 500 messages per second for a short time. A fourth queue is returning responses from server app to client app at somewhere around 300 to 400 messages per second. A fifth queue is used for application heartbeats from server to client at around 1 message every 2 seconds. The remaining 5 queues were idle during the test. The messages are in the range of hundreds of bytes to a couple of kilobytes. Each message has a per-message TTL configured - 10 seconds for the application heartbeats, 90 seconds otherwise.

When either the client or the server application detects a channel / connection shutdown, it enters a reconnect cycle, making two attempts to connect each second until it is able to re-establish the connection and resume operation. Each application uses a single RMQ connection for both consuming and publishing. All publishing happens on a single channel per app using publisher confirms.
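
For readers following along, the reconnect cycle described above could look roughly like the sketch below. This is not the actual application code: `reconnect`, its parameters, and the choice to treat `OSError` as the "connection failed" signal are all assumptions made for illustration, and `connect` stands in for whatever call the client library uses to open a connection.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def reconnect(connect: Callable[[], T],
              attempts_per_second: float = 2.0,
              max_attempts: Optional[int] = None,
              sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds, pausing between failed attempts.

    Hypothetical helper: 'connect' is any zero-argument callable that
    returns a connection object or raises OSError (which includes
    ConnectionRefusedError) while the broker is unreachable.
    """
    delay = 1.0 / attempts_per_second  # 0.5 s for two attempts per second
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except OSError:
            # Give up only if the caller set a limit; otherwise keep trying.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
```

With the defaults the loop retries indefinitely at two attempts per second, which matches the behaviour described. A real client would likely also need to re-open its channel, re-enable publisher confirms and re-register consumers after `connect()` returns.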

I am able to reliably reproduce an error of some kind by running the test for long enough (a couple of dozen fail-overs is typically enough, although sometimes fewer than 10). I haven't paid sufficiently close attention to be able to say whether the error is always the same. However, the symptoms are not always exactly the same. Usually after an error has occurred, one or the other broker will refuse to shut down gracefully. Sometimes queues vanish and I have to reconfigure them.

In order to gather a clean sample for the logs I posted I deleted /var/lib/rabbitmq/mnesia and reconfigured the cluster and queues from scratch. I then rebooted the nodes to ensure a clean starting point.

Thanks again,

Nathanael

Simon MacMullen wrote:

Hi, thanks. This is definitely an odd-looking error. Can you tell us more about what you're doing? Are you just starting / stopping nodes, or is there messaging activity going on at the same time (and if so, what)?

Cheers, Simon

On 10/06/13 11:32, Rensen, Nathanael wrote:
I've attached the sasl log from mq-002. Sorry I didn't include that originally.

Thanks for taking a look.

Nathanael

Simon MacMullen wrote:

Hi. Looking at the logs it seems like the message store on mq-002 crashed / shut down unexpectedly, but there's no information about this in the log. Do you have the corresponding sasl log?

Cheers, Simon

On 09/06/13 06:03, Rensen, Nathanael wrote:
While testing a fail-over scenario with RabbitMQ 3.1.1 I have repeatedly encountered errors, sometimes resulting in durable queues vanishing.

The cluster consists of two brokers, with LVS / keepalived used to connect clients to a functional broker. There are 10 mirrored queues, each of which has ha-sync-mode = automatic. A script shuts down one broker or the other in turn using 'service rabbitmq-server {start|stop}', such that there is always one broker running, leaving at least 30 seconds between each start / stop. I expect this test to be able to run indefinitely without destabilising the cluster; however, I have not been able to achieve more than a few dozen fail-overs without some error occurring. I'm hoping someone may have some insight or suggestions as to how to stabilise this environment.
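
As an illustration only, the stop / start rotation described above might be driven by something like the following sketch. The original test script is not shown in this thread, so `failover_steps`, `run_failover_test` and the `run_service` callback are hypothetical names; how the 'service rabbitmq-server' command is actually dispatched to each host is left to the caller.

```python
import itertools
import time
from typing import Callable, Iterator, Sequence, Tuple


def failover_steps(brokers: Sequence[str]) -> Iterator[Tuple[str, str]]:
    """Yield (broker, action) pairs: stop one broker, start it again,
    then move on to the next, so at most one broker is down at a time."""
    for broker in itertools.cycle(brokers):
        yield broker, "stop"
        yield broker, "start"


def run_failover_test(brokers: Sequence[str],
                      run_service: Callable[[str, str], None],
                      cycles: int,
                      pause_seconds: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> None:
    """Run 'cycles' stop/start pairs, pausing after every action.

    run_service(broker, action) is a hypothetical callback that would run
    'service rabbitmq-server <action>' on the given host.
    """
    for broker, action in itertools.islice(failover_steps(brokers), 2 * cycles):
        run_service(broker, action)
        sleep(pause_seconds)


if __name__ == "__main__":
    # Dry run: print the plan instead of touching any broker.
    run_failover_test(["zg-dev-mq-002", "zg-dev-mq-003"],
                      lambda broker, action: print(broker, action),
                      cycles=2, sleep=lambda seconds: None)
```

Keeping the schedule separate from the command execution makes it easy to dry-run the plan and to log exactly which broker was stopped or started before an error appears.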

I have included basic environment details below and attached logs from both brokers showing one example. In this case zg-dev-mq-003 was stopped at 11:32:21 and went through what appears to be a clean shutdown:

=INFO REPORT==== 9-Jun-2013::11:33:22 ===
Halting Erlang VM

zg-dev-mq-002 detected the other broker down and promoted itself to master. Then after accepting connections from clients it logged an error as shown below:

=INFO REPORT==== 9-Jun-2013::11:33:22 ===
rabbit on node 'rabbit@zg-dev-mq-003' down

=INFO REPORT==== 9-Jun-2013::11:33:22 ===
accepting AMQP connection <0.427.0> (10.0.72.36:61434 -> 172.17.0.73:5672)

=INFO REPORT==== 9-Jun-2013::11:33:22 ===
accepting AMQP connection <0.430.0> (10.0.72.36:61435 -> 172.17.0.73:5672)

=ERROR REPORT==== 9-Jun-2013::11:33:22 ===
** Generic server <0.418.0> terminating
** Last message in was {'$gen_cast',
                          {delete_and_terminate,
                           {badarg,
                            [{ets,insert_new,
                              [360523,
                               {{<<10,71,177,42,66,240,207,204,251,26,181,155,
                                   246,83,172,137>>,
                                 <<120,196,170,245,109,158,126,84,92,250,21,193,
                                   123,113,128,48>>},
                                -1}],
                              []},
                             {rabbit_msg_store,client_update_flying,3,[]},
                             {rabbit_msg_store,'-remove/2-lc$^0/1-0-',2,[]},
                             {rabbit_msg_store,remove,2,[]},
                             {rabbit_variable_queue,
                              '-with_immutable_msg_store_state/3-fun-0-',2,[]},
                             {rabbit_variable_queue,with_msg_store_state,3,[]},
                             {rabbit_variable_queue,
                              with_immutable_msg_store_state,3,[]},
                             {rabbit_variable_queue,'-ack/2-lc$^0/1-0-',2,
                              []}]}}}
etc

Environment details (same for both brokers):

[root@zg-dev-mq-002]# uname -a
Linux zg-dev-mq-002.zettagrid.local 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

[root@zg-dev-mq-002]# cat /etc/centos-release
CentOS release 6.4 (Final)

[root@zg-dev-mq-002]# yum list installed | egrep 'rabbit|erlang'
esl-erlang.x86_64          R16B-2       @/esl-erlang-R16B-2.x86_64
esl-erlang-compat.noarch   R14B-1.el6   @/esl-erlang-compat-R14B-1.el6.noarch
rabbitmq-server.noarch     3.1.1-1      @/rabbitmq-server-3.1.1-1.noarch

Thanks very much,

Nathanael
