[rabbitmq-discuss] Http Management Plugin: statistics_db_node not_running

Wed Jan 22 17:02:24 GMT 2014

Hello,

I've experienced an issue that is very similar to the following two threads:

http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2013-July/028437.html 
http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2011-January/010859.html

I'm posting here partly to look for an explanation and partly to document how I "fixed" it. On Saturday, one of my boxes crashed due to a disk failure. The failed node rebooted and rejoined the cluster without any manual intervention. Before the node came back online, a script which gathers statistics from the HTTP API began to fail, and it kept failing after the node rejoined. The formatted JSON from the /api/overview resource can be found here: https://gist.github.com/anonymous/fe81ff0e890d26b220e6

I believe the point of interest is "statistics_db_node":"not_running". I resolved the problem by issuing stop_app / start_app on four of the five nodes, one node at a time, checking the API after each restart.
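
For anyone who wants to automate the same check, here is a rough Python sketch of what I did by hand. It is not an official procedure. The function names, the base URL and the credentials are all made up for illustration, and it assumes a healthy /api/overview reports a node name in "statistics_db_node" where mine showed "not_running". It also assumes rabbitmqctl can reach each node with -n (same Erlang cookie); running stop_app / start_app locally on each box works just as well.

```python
import base64
import json
import subprocess
import urllib.request
from typing import Callable, Iterable, List


def stats_db_running(overview: dict) -> bool:
    """True if the overview names a statistics DB node (assumed healthy shape)."""
    node = overview.get("statistics_db_node")
    return node is not None and node != "not_running"


def fetch_overview(base_url: str, user: str, password: str) -> dict:
    """GET <base_url>/api/overview with HTTP basic auth and parse the JSON."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/overview",
        headers={"Authorization": "Basic " + token},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def restart_app(node: str) -> None:
    """Run stop_app then start_app against one node (assumes -n can reach it)."""
    for command in ("stop_app", "start_app"):
        subprocess.run(["rabbitmqctl", "-n", node, command], check=True)


def restart_until_healthy(
    nodes: Iterable[str],
    get_overview: Callable[[], dict],
    restart: Callable[[str], None] = restart_app,
) -> List[str]:
    """Restart nodes one at a time, re-checking the API before each restart.

    Stops as soon as the statistics DB is reported as running and returns
    the nodes that were restarted. If the list is exhausted, the caller
    should check the API once more.
    """
    restarted = []
    for node in nodes:
        if stats_db_running(get_overview()):
            break
        restart(node)
        restarted.append(node)
    return restarted
```

Usage would be something like restart_until_healthy(["rabbit@rabbitmq01", "rabbit@rabbitmq02"], lambda: fetch_overview("http://localhost:15672", "guest", "guest")), with the host, port and credentials replaced by your own. A real script would probably also want to wait a few seconds after each start_app before polling again.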

I'm guessing that through sheer bad luck, the node which crashed was the active statistics_db_node. What I don't understand is why I had to manually intervene to get the statistics db back online. Is this a bug? What is the expected behavior here?

Thanks!
Ben

Additional notes:

- When visiting the management plugin from a browser, the overview tab displays the error message: "TypeError: Cannot read property 'connections' of undefined". Additionally, the connections and channels tabs are both empty. Exchanges and Queues are populated as expected, but the statistics-related columns (rates and counts) are all empty. Clicking on an exchange fails with "ReferenceError: exchange is not defined". Clicking on a queue fails with "TypeError: Cannot read property 'ram_msg_count' of undefined".

- Everything aside from the management API appears to have continued to work correctly.

- I'm running a five-node cluster on version 3.2.2 in pause_minority mode.

- I have experienced the same web-client issue on 3.1.1 under different circumstances.

- I found the following message in the crashed host's logs, which repeated once for every host in the cluster:

=ERROR REPORT==== 18-Jan-2014::17:59:26 ===
Mnesia(rabbit@rabbitmq04): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, rabbit@rabbitmq05}