[rabbitmq-discuss] 3.1.1 - Errors during failover

Thu Jun 13 13:37:39 BST 2013

Hi. We've failed to reproduce this using a similar workload, and some 
time spent staring at the relevant bits of code hasn't turned up 
anything either.

So to try to narrow things down a bit:

1) Does the problem still occur if you disable automatic eager sync? 
(And don't eager sync manually either.)

2) Can you provide the Mnesia directory and logs from a machine which 
has just failed?
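
For (1), a minimal sketch of what switching off automatic eager sync could look like, assuming the queues are mirrored through a policy. The policy name "ha-all", the pattern "^" and "ha-mode": "all" are placeholders, not taken from the report; substitute whatever policy currently applies. The snippet only builds the rabbitmqctl set_policy command line, it does not run it.

```python
import json


def manual_sync_policy_command(name="ha-all", pattern="^"):
    """Build a rabbitmqctl command that sets ha-sync-mode to manual.

    name and pattern are hypothetical placeholders; use the policy
    that currently applies to the mirrored queues.
    """
    definition = {
        "ha-mode": "all",          # assumption: mirror to all nodes
        "ha-sync-mode": "manual",  # i.e. no automatic eager sync
    }
    return ["rabbitmqctl", "set_policy", name, pattern, json.dumps(definition)]


if __name__ == "__main__":
    print(" ".join(manual_sync_policy_command()))
```

The JSON definition needs shell quoting if the printed command is pasted into a terminal.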

Cheers, Simon

On 10/06/13 16:26, Rensen, Nathanael wrote:
> Thanks for taking the time to look at those logs.
>
> The purpose of the test is to confirm that the client application,
> server application, LVS and RMQ itself all perform properly during
> fail-over. There is certainly messaging activity during the test,
> although relatively modest by RMQ standards judging from the rates I
> see discussed on this forum.
>
> Of the 10 queues, three queues were operating with a combined average
> rate of around 200 messages per second from client app to server app.
> The publish rate will burst to instantaneous peaks possibly exceeding
> 500 messages per second for a short time. A fourth queue is returning
> responses from server app to client app at somewhere around 300 to
> 400 messages per second. A fifth queue is used for application
> heartbeats from server to client at around 1 message every 2 seconds.
> The remaining 5 queues were idle during the test. The messages are in
> the range of hundreds of bytes to a couple of kilobytes. Each message
> has a per-message TTL configured - 10 seconds for the application
> heartbeats, 90 seconds otherwise. -- When either client or server
> application detects a channel / connection shutdown it enters a
> reconnect cycle making two attempts to connect each second until it
> is able to re-establish connection and resume operation. Each
> application uses a single RMQ connection for consume and publish. All
> publishing happens on a single channel per app using publisher
> confirms.
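The reconnect cycle described above can be sketched roughly as follows. This is an illustration only, not the applications' actual code: connect is a hypothetical callable standing in for the real client library call, and it is assumed to raise ConnectionError while the broker is unreachable.

```python
import time


def reconnect(connect, attempts_per_second=2, sleep=time.sleep):
    """Call connect() until it succeeds; return (connection, failed_attempts).

    connect is a hypothetical callable that returns a connection or
    raises ConnectionError (a stand-in for the real client library).
    """
    delay = 1.0 / attempts_per_second  # two attempts per second -> 0.5 s
    failures = 0
    while True:
        try:
            return connect(), failures
        except ConnectionError:
            failures += 1
            sleep(delay)  # wait before the next attempt
```

Passing sleep as a parameter keeps the loop testable without real delays.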
>
> I am able to reliably reproduce an error of some kind by running the
> test for long enough (a couple of dozen fail-overs is typically
> enough, although sometimes fewer than 10). I haven't paid
> sufficiently close attention to be able to say whether the error is
> always the same. However the symptoms are not always exactly the
> same. Usually after an error has occurred one or other broker will
> refuse to shut down gracefully. Sometimes queues vanish and I will
> have to reconfigure them.
>
> In order to gather a clean sample for the logs I posted I deleted
> /var/lib/rabbitmq/mnesia and reconfigured the cluster and queues from
> scratch. I then rebooted the nodes to ensure a clean starting point.
>
> Thanks again,
>
> Nathanael
>
>
> Simon MacMullen wrote:
>
> Hi, thanks. This is definitely an odd-looking error. Can you tell us
> more about what you're doing? Are you just starting / stopping nodes,
> or is there messaging activity going on at the same time (and if so,
> what?)
>
> Cheers, Simon
>
>
> On 10/06/13 11:32, Rensen, Nathanael wrote:
>
> I've attached the sasl log from mq-002. Sorry I didn't include that
> originally.
>
> Thanks for taking a look.
>
> Nathanael
>
> Simon MacMullen wrote:
>
> Hi. Looking at the logs it seems like the message store on mq-002
> crashed / shut down unexpectedly, but there's no information about
> this in the log. Do you have the corresponding sasl log?
>
> Cheers, Simon
>
>
> On 09/06/13 06:03, Rensen, Nathanael wrote:
>
> While testing a fail-over scenario with RabbitMQ 3.1.1 I have
> repeatedly encountered errors, sometimes resulting in durable queues
> vanishing.
>
> The cluster consists of two brokers using LVS / keepalived in order
> to connect clients to a functional broker. There are 10 mirrored
> queues, each of which has ha-sync-mode = automatic. A script is used
> to shut down one broker or the other in turn using 'service
> rabbitmq-server {start|stop}', such that there is always one broker
> running and leaving at least 30 seconds between each start / stop. I
> am expecting that this test should be able to run indefinitely
> without destabilising the cluster, however I have not been able to
> achieve more than a few dozen fail-overs without some error
> occurring. I'm hoping someone may have some insight or suggestions as
> to how to stabilise this environment.
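The test script itself is not shown in the thread. As a rough sketch of the alternation it describes, the following generates the order of stop / start steps; the node names and the assumption that each node is restarted before the other is stopped are illustrative only. A caller would run 'service rabbitmq-server <action>' on the named node for each step and wait at least 30 seconds between steps.

```python
from itertools import cycle


def failover_steps(nodes, rounds):
    """Yield (node, action) pairs that stop and restart each node in turn.

    A node is started again before the next one is stopped, so at
    least one broker is running at every step.
    """
    order = cycle(nodes)
    for _ in range(rounds):
        node = next(order)
        yield node, "stop"
        yield node, "start"


if __name__ == "__main__":
    for node, action in failover_steps(["zg-dev-mq-002", "zg-dev-mq-003"], 4):
        print(node, action)
```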
>
> I have included basic environment details below and attached logs
> from both brokers showing one example. In this case zg-dev-mq-003 was
> stopped at 11:32:21 and went through what appears to be a clean
> shutdown:
>
> =INFO REPORT==== 9-Jun-2013::11:33:22 ===
> Halting Erlang VM
>
> zg-dev-mq-002 detected the other broker down and promoted itself to
> master. Then after accepting connections from clients it logged an
> error as shown below:
>
> =INFO REPORT==== 9-Jun-2013::11:33:22 ===
> rabbit on node 'rabbit@zg-dev-mq-003' down
>
> =INFO REPORT==== 9-Jun-2013::11:33:22 ===
> accepting AMQP connection <0.427.0> (10.0.72.36:61434 -> 172.17.0.73:5672)
>
> =INFO REPORT==== 9-Jun-2013::11:33:22 ===
> accepting AMQP connection <0.430.0> (10.0.72.36:61435 -> 172.17.0.73:5672)
>
> =ERROR REPORT==== 9-Jun-2013::11:33:22 ===
> ** Generic server <0.418.0> terminating
> ** Last message in was {'$gen_cast',
>     {delete_and_terminate,
>      {badarg,
>       [{ets,insert_new,
>         [360523,
>          {{<<10,71,177,42,66,240,207,204,251,26,181,155,246,83,172,137>>,
>            <<120,196,170,245,109,158,126,84,92,250,21,193,123,113,128,48>>},
>           -1}],
>         []},
>        {rabbit_msg_store,client_update_flying,3,[]},
>        {rabbit_msg_store,'-remove/2-lc$^0/1-0-',2,[]},
>        {rabbit_msg_store,remove,2,[]},
>        {rabbit_variable_queue,'-with_immutable_msg_store_state/3-fun-0-',2,[]},
>        {rabbit_variable_queue,with_msg_store_state,3,[]},
>        {rabbit_variable_queue,with_immutable_msg_store_state,3,[]},
>        {rabbit_variable_queue,'-ack/2-lc$^0/1-0-',2,[]}]}}}
> etc
>
> Environment details (same for both brokers):
>
> [root@zg-dev-mq-002]# uname -a
> Linux zg-dev-mq-002.zettagrid.local 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@zg-dev-mq-002]# cat /etc/centos-release
> CentOS release 6.4 (Final)
>
> [root@zg-dev-mq-002]# yum list installed | egrep 'rabbit|erlang'
> esl-erlang.x86_64           R16B-2       @/esl-erlang-R16B-2.x86_64
> esl-erlang-compat.noarch    R14B-1.el6   @/esl-erlang-compat-R14B-1.el6.noarch
> rabbitmq-server.noarch      3.1.1-1      @/rabbitmq-server-3.1.1-1.noarch
>
> Thanks very much,
>
> Nathanael
>

-- 
Simon MacMullen
RabbitMQ, Pivotal