[rabbitmq-discuss] Cluster issues

Thu Nov 22 05:01:42 GMT 2012

I have a 3-machine cluster: 1 disc node and 2 RAM nodes.

I stopped the disc node cleanly (a soft stop, no crash or kill).

One of the RAM nodes reports:

=ERROR REPORT==== 19-Nov-2012::08:29:32 ===
** Generic server <0.865.329> terminating 
** Last message in was {'$gen_cast',
{event,
{event,channel_stats,
[{pid,<24148.17585.32>},
{transactional,false},
{confirm,false},
{consumer_count,1},
{messages_unacknowledged,0},
{messages_unconfirmed,0},
{messages_uncommitted,0},
{acks_uncommitted,0},
{prefetch_count,16},
{client_flow_blocked,false},
{channel_queue_stats,[{<6988.1853.0>,[{ack,7}]}]},
{channel_exchange_stats,
[{{resource,<<"vhost1">>,exchange,
<<"queue1">>},
[{publish,34649}]}]},
{channel_queue_exchange_stats,[]}],
{1353,313843,894086}}}}
** When Server state == {state,[{channel_exchange_stats,2244223060},
{channel_queue_exchange_stats,2244227157},
{channel_queue_stats,2244218963},
{channel_stats,2244210768},
{connection_stats,2244206619},
{consumers,2244214866},
{queue_stats,2244202517}],
5000}
** Reason for termination == 
** {badarith,[{rabbit_mgmt_db,rate,5},
{rabbit_mgmt_db,'-rates/5-lc$^0/1-0-',5},
{rabbit_mgmt_db,'-rates/5-lc$^1/1-1-',6},
{rabbit_mgmt_db,rates,5},
{rabbit_mgmt_db,handle_fine_stat,7},
{rabbit_mgmt_db,'-handle_fine_stats/4-lc$^1/1-1-',4},
{rabbit_mgmt_db,'-handle_event/2-lc$^1/1-0-',4},
{rabbit_mgmt_db,handle_event,2}]}

The disc node was started (with a new IP), and it logs this:

=WARNING REPORT==== 19-Nov-2012::22:55:27 ===
msg_store_persistent: recovery terms differ from present
rebuilding indices from scratch

It took about 10-15 minutes before it started.

But the other nodes never recognized the restarted node. Could that be because DNS hadn't been updated yet? It took about 1 minute before the disc node's hostname resolved to the new IP, but I waited longer than that.
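
A minimal sketch of how the DNS side could be checked from each RAM node. The hostname and IP used in the usage note are hypothetical placeholders, not values from my setup, and the check only covers what the operating system's resolver returns; it says nothing about what the Erlang VM itself has resolved.

```python
import socket


def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True if `hostname` currently resolves to `expected_ip`.

    `resolver` defaults to the OS resolver; it is a parameter only so the
    function can be exercised without real DNS.
    """
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # socket.gaierror and socket.herror are both OSError subclasses:
        # the name could not be resolved at all.
        return False
    return expected_ip in addresses
```

Running something like `resolves_to("disc-node.example", "10.0.0.5")` on each RAM node would show whether that machine already sees the new address.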

Meanwhile the disc node reported a lot of these messages:

=ERROR REPORT==== 19-Nov-2012::23:06:55 ===
Discarding message {'$gen_call',{<0.2291.0>,#Ref<0.0.5.101573>},{notify_down,<5145.1671.3>}} from <0.2291.0> to <0.1733.0> in an old incarnation (3) of this node (2)

So I restarted the RAM nodes too. After that all cluster nodes could communicate again, but the management interface reported:

"Statistics database could not be contacted. Message rates and queue lengths will not be shown."

So I stopped all nodes (the RAM nodes first and the disc node last), then brought them back up again (disc node first), and now the cluster functions normally.

Any idea what was going on? 

RabbitMQ 2.8.7
Erlang R14B4