[rabbitmq-discuss] Cluster issues

Tim Watson tim at rabbitmq.com
Fri Nov 23 17:09:21 GMT 2012


Hi Carl,

On 22 Nov 2012, at 05:01, Carl Hörberg wrote:

> I have a 3-machine cluster: 1 disc node and 2 ram nodes.
> 
> I stopped (soft, no crash or kill) the disc node.
> 

It's best *not* to leave a cluster with no disc nodes, as that represents an inherent risk: if the ram nodes die whilst the disc node is out of action (for maintenance or whatever), there is no durable copy of the cluster state left. Better to have two disc nodes and one ram node if possible.
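
For reference, on the 2.8.x series a node's disc/ram status is set when it joins: with the old-style `rabbitmqctl cluster` command, a node becomes a disc node if its own name appears in the node list, and a ram node otherwise. A minimal sketch, using hypothetical node names rabbit@nodeA and rabbit@nodeB:

    # run on rabbit@nodeB - listing its own name below makes it a
    # disc node; omitting it would make it a ram node
    rabbitmqctl stop_app
    rabbitmqctl reset
    rabbitmqctl cluster rabbit@nodeA rabbit@nodeB
    rabbitmqctl start_app

(Note that `reset` wipes the node's local state, so only run it on a node you intend to re-join.)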

> One of the RAM nodes reports:
> 
> =ERROR REPORT==== 19-Nov-2012::08:29:32 ===
> ** Generic server <0.865.329> terminating 
> ** Last message in was {'$gen_cast',
>        {event,
>         {event,channel_stats,
>          [{pid,<24148.17585.32>},
>           {transactional,false},
>           {confirm,false},
>           {consumer_count,1},
>           {messages_unacknowledged,0},
>           {messages_unconfirmed,0},
>           {messages_uncommitted,0},
>           {acks_uncommitted,0},
>           {prefetch_count,16},
>           {client_flow_blocked,false},
>           {channel_queue_stats,[{<6988.1853.0>,[{ack,7}]}]},
>           {channel_exchange_stats,
>            [{{resource,<<"vhost1">>,exchange,<<"queue1">>},
>              [{publish,34649}]}]},
>           {channel_queue_exchange_stats,[]}],
>          {1353,313843,894086}}}}
> ** When Server state == {state,[{channel_exchange_stats,2244223060},
>                                 {channel_queue_exchange_stats,2244227157},
>                                 {channel_queue_stats,2244218963},
>                                 {channel_stats,2244210768},
>                                 {connection_stats,2244206619},
>                                 {consumers,2244214866},
>                                 {queue_stats,2244202517}],
>                          5000}
> ** Reason for termination ==
> ** {badarith,[{rabbit_mgmt_db,rate,5},
>               {rabbit_mgmt_db,'-rates/5-lc$^0/1-0-',5},
>               {rabbit_mgmt_db,'-rates/5-lc$^1/1-1-',6},
>               {rabbit_mgmt_db,rates,5},
>               {rabbit_mgmt_db,handle_fine_stat,7},
>               {rabbit_mgmt_db,'-handle_fine_stats/4-lc$^1/1-1-',4},
>               {rabbit_mgmt_db,'-handle_event/2-lc$^1/1-0-',4},
>               {rabbit_mgmt_db,handle_event,2}]}
> 

This is a known bug, I'm afraid - see http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-November/023538.html for another report of the same crash. We're working to resolve it at the moment and hope to get a fix out as soon as possible.
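
For the curious: the stats database computes message rates as a count delta divided by a timestamp delta, and a badarith in a rate calculation is what you would expect if that time delta ever comes out as zero (for example, two samples carrying the same timestamp). The following is a sketch of the failure mode only, not the actual rabbit_mgmt_db code:

    %% Sketch: rate = count delta / time delta, with timestamps of the
    %% form returned by erlang:now/0; timer:now_diff/2 yields microseconds.
    rate(OldCount, NewCount, OldTS, NewTS) ->
        Micros = timer:now_diff(NewTS, OldTS),
        (NewCount - OldCount) * 1000000 / Micros.  %% badarith if Micros == 0

    %% A guarded version simply reports a zero rate instead of crashing:
    safe_rate(OldCount, NewCount, OldTS, NewTS) ->
        case timer:now_diff(NewTS, OldTS) of
            0      -> 0.0;
            Micros -> (NewCount - OldCount) * 1000000 / Micros
        end.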

> The disc node was started (with a new IP), and it logged this:
> 
> =WARNING REPORT==== 19-Nov-2012::22:55:27 ===
> msg_store_persistent: recovery terms differ from present
> rebuilding indices from scratch
> 
> 

I would not expect a node stopped with `rabbitmqctl stop` to exhibit this behaviour; however, the fact that you've recycled the IP address could be a factor, though it shouldn't be. We'll spend some time looking into that.
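
For what it's worth, cluster membership is tracked by Erlang node name (by default rabbit@<short hostname>), not by IP address, so a recycled IP should be invisible to the other members as long as the hostname resolves to the new address. You can check the name a node is actually running under, e.g. (with a hypothetical hostname 'disknode'):

    rabbitmqctl -n rabbit@disknode status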

> 
> And it takes about 10-15min before it starts.
> 

It is not unusual for a message store containing a large volume of messages to take quite some time to recover completely, especially when, as in your log above, the indices have to be rebuilt from scratch.

> But the other nodes never recognized the node - could that be because the DNS wasn't updated? It took about 1 min before the DNS resolved correctly, that is, before the disc node's hostname resolved to the new IP, but I waited longer than that.
> 
> Meanwhile the disc node reported a lot of these messages:
> 
> =ERROR REPORT==== 19-Nov-2012::23:06:55 ===
> Discarding message {'$gen_call',{<0.2291.0>,#Ref<0.0.5.101573>},{notify_down,<5145.1671.3>}} from <0.2291.0> to <0.1733.0> in an old incarnation (3) of this node (2)
> 
> 

This could be indicative of a bug, as it shows a channel close notification being sent to a queue record that should've been deleted (or replaced) when the node restarted. We will investigate that.
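
For context, the incarnation numbers in that message come from the Erlang VM: each time a node restarts under the same name it gets a new 'creation' value, and pids minted by a previous incarnation stop being routable, so messages addressed to them are dropped with exactly this warning. You can inspect the current value from an Erlang shell attached to the broker (e.g. via `erl -remsh`, cookie permitting):

    1> erlang:system_info(creation).
    3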

> 
> So I restarted the RAM nodes too; now all cluster nodes could communicate again, but the mgmt interface reported:
> 
> "Statistics database could not be contacted. Message rates and queue lengths will not be shown."
> 
> So I stopped all nodes (the ram nodes first and the disc node last), and then brought them back up again (disc node first), and now the cluster functioned as normal.
> 

Well I'm glad you got it under control eventually.
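
For future reference though, a full cluster restart shouldn't be necessary just to revive the stats database: restarting the management application on the node that hosts it is usually enough. A sketch, assuming your rabbitmqctl supports `eval` and that the stats DB lives on the node you run this against:

    rabbitmqctl eval 'application:stop(rabbitmq_management), application:start(rabbitmq_management).'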

> Any idea what was going on? 
> 
> RabbitMQ 2.8.7
> Erlang R14B04
> 
> 

Could you clarify one or two other details about your environment please? Firstly, do you have HA (mirrored) queues set up in this cluster? Secondly, was the disc node designated the 'master' when you shut it down? And just to confirm: you shut the disc node down by issuing `rabbitmqctl stop` or an equivalent (such as `service rabbitmq-server stop`), right? If you could provide any of the nodes' rabbit and rabbit-sasl logs, that would be very useful.
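
On most distributions the logs live under /var/log/rabbitmq/ as <nodename>.log and <nodename>-sasl.log. If your rabbitmqctl has the `report` subcommand, its output would be helpful too, e.g. (hypothetical node name):

    rabbitmqctl report > rabbit-report.txt
    ls /var/log/rabbitmq/rabbit@disknode.log /var/log/rabbitmq/rabbit@disknode-sasl.log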

Thanks for reporting this - I will try to keep the list informed of any developments in tracking it down!

Cheers,
Tim


