[rabbitmq-discuss] Outage with 3-node RabbitMQ 3.1.3 Cluster

Wed Nov 6 16:18:29 GMT 2013

On 6 November 2013 16:02, Matt Wise <matt at nextdoor.com> wrote:
> See comments inline.
>
>
> On Wed, Nov 6, 2013 at 2:37 AM, Tim Watson <tim at rabbitmq.com> wrote:
>>
>> Hi Matt,
>>
>> Sorry to hear you've been running into problems.
>>
>> On 5 Nov 2013, at 22:05, Matt Wise wrote:
>>
>> > (sorry if this gets posted twice.. first email never seemed to make it
>> > to the list)
>> >
>> > Hey... I had a pretty rough time today with a 3-node RabbitMQ 3.1.3
>> > cluster that's under pretty heavy use (6-7 million messages per day -- 100MB
>> > peak bandwidth per node). I want to pose a few questions here. First off,
>> > here's the basic configuration though.
>> >
>> > Configuration:
>> >   serverA, serverB and serverC are all configured with RabbitMQ 3.1.3.
>> > They each are configured via Puppet ... and Puppet uses a dynamic node
>> > discovery plugin (zookeeper) to find the nodes. The node lists are
>> > hard-coded into the rabbitmq.config file. A dynamic server list generator
>> > supplies Puppet with this list of servers (and is not really necessary to
>> > describe here in this email).
>> >
>> > Scenario:
>> >   A momentary configuration blip caused serverA and serverB to begin
>> > reconfiguring their rabbitmq.config files... when they did this, they also
>> > both issued a 'service rabbitmq restart' command. This command took
>> > 40+ minutes and ultimately failed. During this failure, RabbitMQ was
>> > technically running and accepting connections to the TCP ports ... but it
>> > would not actually answer any queries. Commands like list_queues would hang
>> > indefinitely.
>> >
>>
>> What HA recovery policy (if any) do you have set up? A and B might get a
>> different "view of the world" set up in their respective rabbitmq.config
>> files (either to each other and/or to C) and then get restarted, but this
>> shouldn't affect their view of the cluster, because as per
>> http://www.rabbitmq.com/clustering.html:
>>
>> "Note that the cluster configuration is applied only to fresh nodes. A
>> fresh node is a node which has just been reset or is being started for the
>> first time. Thus, the automatic clustering won't take place after restarts
>> of nodes. This means that any change to the clustering via rabbitmqctl will
>> take precedence over the automatic clustering configuration."
>>
>
> So far we've taken the approach that clustering configuration should be
> hard-coded into the rabbitmq.config files. This works well in explicitly
> defining all of the hosts in a cluster on every machine, but it also means
> that adding a 4th node to a 3-node cluster will cause the 3 running live
> nodes to do a full service restart, which is bad. Our rabbitmq.config, though,
> is identical on all of the machines (other than the server list, which may
> have been in flux when Puppet was restarting these services).
>
>> [
>>         {rabbit, [
>>                 {log_levels, [{connection, warning}]},
>>                 {cluster_partition_handling, pause_minority},
>>                 {tcp_listeners, [5672]},
>>                 {ssl_listeners, [5673]},
>>                 {ssl_options, [
>>                         {cacertfile, "/etc/rabbitmq/ssl/cacert.pem"},
>>                         {certfile, "/etc/rabbitmq/ssl/cert.pem"},
>>                         {keyfile, "/etc/rabbitmq/ssl/key.pem"},
>>                         {verify, verify_peer},
>>                         {fail_if_no_peer_cert, true}
>>                 ]},
>>                 {cluster_nodes, ['rabbit@i-23cf477b', 'rabbit@i-07d8bc5f',
>>                                  'rabbit@i-a3291cf8']}
>>         ]}
>> ].
>
>
>>
>> > Questions:
>> >   1. We only had ~2500 messages in the queues (they are HA'd and
>> > durable). The policy is { 'ha-mode': 'all' }. When serverA and serverB
>> > restarted, why did they never come up? Unfortunately in the restart process,
>> > they blew away their log files as well which makes this really tough to
>> > troubleshoot.
>>
>> It's nigh on impossible to guess what might've gone wrong without any log
>> files to verify against. We could sit and stare at all the relevant code for
>> weeks and not spot a bug that's been triggered here, since if it were
>> obvious we would've fixed it already.
>>
>> If you can give us a very precise set of steps (and timings) that led to
>> this situation, I can try and replicate what you've seen, but I don't fancy
>> my chances to be honest.
>
>
> It's a tough one for us to reproduce.. but I think the closest steps would
> be:
>
>   1. Create a 3-node cluster... configured with similar config to the one I
> pasted above.
>   2. Create enough publishers and subscribers that you have a few hundred
> messages/sec going through the three machines.
>   3. On MachineA and MachineC, remove MachineB from the config file.
>   4. Restart MachineA's rabbitmq daemon using the init script.
>   5. Wait 3 minutes... theoretically #4 is still in process.. now issue the
> same restart to MachineC.
>
>   Fail.
>
> That's our best guess right now.. but agreed, the logs are a problem. Can we
> configure RabbitMQ to log through syslog for the future?

syslog-ng can tail the RabbitMQ log files and forward each line to an
arbitrary destination (another file, Papertrail, etc.).
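
If installing syslog-ng isn't an option, the same tail-and-forward idea can
be sketched in a few lines of Python. This is only a rough illustration, not
something taken from RabbitMQ itself: the helper names (follow,
forward_to_syslog), the "rabbitmq-tail" logger name, the /dev/log socket and
the example log path are all my own assumptions. It reopens the file when it
is replaced or truncated, since the logs in this thread were wiped on restart.

```python
import logging
import logging.handlers
import os
import time


def follow(path, poll_interval=0.5, max_idle_polls=None):
    """Yield complete lines from `path`, then keep yielding appended lines.

    Hypothetical helper. Reads from the start of the file, and reopens it
    if it is replaced (new inode) or truncated. `max_idle_polls` stops the
    loop after that many consecutive polls without a new complete line;
    leave it as None to follow forever.
    """
    handle, inode, idle = None, None, 0
    try:
        while max_idle_polls is None or idle < max_idle_polls:
            try:
                stat = os.stat(path)
            except FileNotFoundError:
                stat = None  # file is gone for now; keep polling
            if stat is not None and (
                handle is None
                or stat.st_ino != inode
                or stat.st_size < handle.tell()
            ):
                # First open, replaced file, or truncated file: start over.
                # Unread lines of a replaced file are dropped in this sketch.
                if handle is not None:
                    handle.close()
                handle = open(path, "rb")
                inode = stat.st_ino
            line = handle.readline() if handle is not None else b""
            if line.endswith(b"\n"):
                idle = 0
                yield line[:-1].decode("utf-8", errors="replace")
            else:
                if line:
                    # Partial line: rewind and wait for the writer to finish.
                    handle.seek(-len(line), os.SEEK_CUR)
                idle += 1
                time.sleep(poll_interval)
    finally:
        if handle is not None:
            handle.close()


def forward_to_syslog(path, address="/dev/log", **follow_kwargs):
    """Send every line of `path` to the local syslog daemon.

    Assumes a Unix syslog socket at `address`; pass a (host, port) tuple
    to send over UDP to a remote collector instead.
    """
    logger = logging.getLogger("rabbitmq-tail")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address=address))
    for line in follow(path, **follow_kwargs):
        logger.info("%s", line)


# Example call (the path is an assumption; check where your package logs):
# forward_to_syslog("/var/log/rabbitmq/rabbit@myhost.log")
```

Run under a supervisor of your choice, this keeps a copy of the broker log
outside the directory that the restart wiped.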

frank

>> >   2. I know that restarting serverA and serverB at nearly the same time
>> > is obviously a bad idea -- we'll be implementing some changes so this
>> > doesn't happen again -- but could this have led to data corruption?
>>
>> It's possible, though obviously that shouldn't really happen. How close
>> were the restarts to one another? How many HA queues were mirrored across
>> these nodes, and were they all very busy (as your previous comment about
>> load seems to suggest)? We could try replicating that scenario in our tests,
>> though it's not easy to get the timing right and obviously the existence of
>> network infrastructure on which the nodes are running won't be the same (and
>> that can make a surprisingly big difference IME).
>
>
> The restarts were within a few minutes of each other. There are 5 queues,
> and all 5 queues are set to mirror to 'all' nodes in the cluster. They were
> busy, but no more than maybe 100 messages/sec coming in/out.
>
>>
>>
>> > Once the entire RabbitMQ farm was shut down, we actually were forced to
>> > move the rabbitmq data directory out of the way and start up the farm
>> > completely with blank databases. It seemed that RabbitMQ 3.1.3 really did
>> > not want to recover from this failure. Any thoughts?
>> >
>> >   3. Lastly .. in the event of future failures, what tools are there for
>> > recovering our Mnesia databases? Is there any way we can dump out the data
>> > into some raw form, and then import it back into a new fresh cluster?
>> >
>>
>> I'm afraid there are not, at least not "off the shelf" ones anyway. If you
>> are desperate to recover important production data however, I'm sure we
>> could explore the possibility of trying to help with that somehow. Let me
>> know and I'll make some enquiries at this end.
>
>
> At this point we can move on from the data loss... but it does make for an
> interesting issue. Having tools to analyze the Mnesia DB and get "most of"
> the messages out in some format where they could be re-injected into a fresh
> cluster would be an incredibly useful tool. I wonder how hard it is to do?
>
>>
>> Cheers,
>> Tim
>>
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss