[rabbitmq-discuss] Crash with RabbitMQ 3.1.5

Wed Oct 16 15:29:58 BST 2013

Hello David!

On 16 Oct 2013, at 15:14, David Harrison wrote:
> Hoping someone can help me out.  We recently experienced 2 crashes with our RabbitMQ cluster.  After the first crash, we moved the Mnesia directories elsewhere and started RabbitMQ again, which got us up and running.  The second time it happened, we had the original nodes plus an additional 5 nodes that we had added to the cluster and were planning to leave in place while shutting the old nodes down.
> 

What version of rabbit are you running, and how was it installed?

> During the crash symptoms were as follows:
> 
> - Escalating (and sudden) CPU utilisation on some (but not all) nodes

We've fixed at least one bug with that symptom in recent releases.

> - Increasing time to publish on queues (and specifically on a test queue we have set up that exists only to test publishing and consuming from the cluster hosts)

Are there multiple publishers on the same connection/channel when this happens? It wouldn't be unusual, if the server was struggling, to see flow control kick in and affect publishers in this fashion.

> - Running `rabbitmqctl cluster_status` gets increasingly slow (some nodes eventually taking up to 10m to return with the response data - some were fast and took 5s)

Wow, 10m is amazingly slow. Can you provide log files covering the period when these problems occurred?

> - When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn't return

Again, we've fixed bugs in that area in recent releases.

> --- When that doesn't return we eventually have to ctrl-c the command
> --- We have to issue a kill signal to rabbit to stop it
> --- Do the same to the epmd process

Even if you have to `kill -9` a rabbit node, you shouldn't need to kill epmd. In theory at least. If that was necessary to fix the "state of the world", it would be indicative of a problem related to the Erlang distribution mechanism, but I very much doubt that's the case here.

> Config / details as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that all queues are mirrored "ha-mode:all"), running on Linux
> 

How many queues are we talking about here?

> [
>         {rabbit, [
>                 {cluster_nodes, {['rabbit@b05.internal', 'rabbit@b06.internal','rabbit@b07.internal','rabbit@b08.internal','rabbit@b09.internal'], disc}},
>                 {cluster_partition_handling, pause_minority}

Are you sure that what you're seeing is not caused by a network partition? If it were, any nodes in a minority island would "pause", which would certainly lead to the kind of symptoms you've mentioned here, viz rabbitmqctl calls not returning and so on.

> The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").
> 

That's a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slowness and unresponsiveness.
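
For reference, the two settings that govern when paging starts can be sketched as a rabbitmq.config fragment. The values below are what I believe to be the stock defaults for the 3.x series, not something taken from your config, so treat them as assumptions and check them against your own file:

```erlang
[
  {rabbit, [
    %% Assumed default: once the node uses this fraction of installed RAM,
    %% the memory alarm goes off and publishing connections are blocked.
    {vm_memory_high_watermark, 0.4},
    %% Assumed default: queues start paging messages out to disk when
    %% memory use reaches this fraction of the high watermark.
    {vm_memory_high_watermark_paging_ratio, 0.5}
  ]}
].
```

With these values, paging should begin at roughly 0.4 * 0.5 = 20% of installed RAM, and publishers should be blocked at 40%.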

> System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
> 

I'd strongly suggest upgrading to R16B02 if you can. R14 is pretty ancient and a *lot* of bug fixes have appeared in erts + OTP since then.

> When I look at the Process Information it seems there's a small number with a lot of messages queued, and the rest are an order of magnitude lower:
> 

That's not unusual.

> When I view the second process (the first one crashes erlang on me), I see a large number of sender_death events (not sure if these are common or highly unusual?)
> 
> {'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}
> 

Interesting - will take a look at that. If you could provide logs for the participating nodes during this whole time period, that would help a lot.

> mixed in with other more regular events:
> 

Actually, sender_death messages are not "irregular" as such. They're just notifying the GM group members that another member (on another node) has died. This is quite normal with mirrored queues, when nodes get partitioned or stopped due to cluster recovery modes.

Cheers,
Tim