[rabbitmq-discuss] Someone else with a nodedown error

Fri May 17 16:05:27 BST 2013

Thanks Tim, I will send you a link to the log files privately. We do have
mirrored queues: we set up an HA policy to mirror all queues to exactly 2
of the 4 nodes. As of yet we have not made use of any synchronization
policy.
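
For reference, a policy of that shape can be sketched roughly as below. This is only an illustration, not the exact command that was run on this cluster: the policy name "ha-two", the catch-all pattern "^" and the helper function are assumptions made up for the example. The definition keys (ha-mode, ha-params, ha-sync-mode) are the ones RabbitMQ 3.1 uses for mirrored-queue policies.

```python
import json
import shlex


def ha_policy_command(name, pattern, copies, auto_sync=False):
    """Build a `rabbitmqctl set_policy` command line for an 'exactly N' HA policy.

    `name` and `pattern` are hypothetical placeholders chosen by the caller.
    """
    definition = {"ha-mode": "exactly", "ha-params": copies}
    if auto_sync:
        # Optional: ask new mirrors to synchronise automatically (3.1+).
        definition["ha-sync-mode"] = "automatic"
    args = ["rabbitmqctl", "set_policy", name, pattern,
            json.dumps(definition, sort_keys=True)]
    # Quote each argument so the JSON survives the shell intact.
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Mirror every queue to exactly 2 nodes, with no synchronisation setting.
    print(ha_policy_command("ha-two", "^", 2))
```

Passing auto_sync=True would add the synchronisation setting that we have not used so far.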

We start all rabbit nodes via:

sudo /etc/init.d/rabbitmq-server start

We do have Chef managing these servers, and it has since caused a restart on
2 of our 4 nodes; it is now temporarily disabled. I will send you the log
files for all 4 nodes dating back several days. One thing I did notice in the
log files for 3 of the 4 nodes:

=ERROR REPORT==== 16-May-2013::23:27:20 ===
connection <0.25853.253>, channel 1 - soft error:
{amqp_error,not_found,
            "home node 'rabbit@rabbit-box' of durable queue 'my.queue.name' in vhost '/' is down or inaccessible",
            'queue.declare'}

When looking at the log files you will notice many entries like:

=INFO REPORT==== 17-May-2013::09:15:42 ===
accepting AMQP connection <0.5117.0> (IP:55913 -> IP:5672)

=WARNING REPORT==== 17-May-2013::09:15:42 ===
closing AMQP connection <0.5117.0> (IP:55913 -> IP:5672):
connection_closed_abruptly

Those are our load balancers checking node health; sorry for the log
spam.

On Fri, May 17, 2013 at 9:32 AM, Tim Watson <tim at rabbitmq.com> wrote:

> Hmn
>
> On 17 May 2013, at 13:45, Eric Berg wrote:
>
> Thanks for your response Tim. If you would like SSH access to these boxes
> let me know, we can work something out privately. Thanks!
>
>
> Ok, though first of all, could you supply logs for the nodes in question?
> A private drop box would be fine.
>
> Update from yesterday:
> It looks like 2 of the 4 nodes in our cluster have finally shut down, all
> channels are now gone. Another node in the cluster hangs on
> > sudo rabbitmqctl status
>
> and the final node in the cluster appears to be running just fine. It
> however sees the unresponsive node in the cluster status as a running node,
> as does the web UI.
>
>
> Right, so we've still got an unresponsive node. Do you have any mirrored
> queues, and if so, what synchronisation and/or recovery policies are you
> using?
>
>
> *When you upgraded your cluster, what RabbitMQ version did you upgrade
> from and to, and did you upgrade Erlang as well and if so, which versions
> were involved?*
> - We upgraded from 3.0.4 to 3.1.0. We did not upgrade Erlang; it was/is at
> version R15B03. We did however install RabbitMQ via RPM with the --nodeps
> flag because it did not detect the Erlang dependency correctly. We had
> previously installed Erlang:
>
> esl-erlang.x86_64    R15B03-2           @erlang-solutions
>
>
> Hmn, I suppose it's possible that this re-install went wrong somehow and
> is causing some of the things below.
>
> *What happens if you start up Erlang by itself, using `erl -sname test` -
> do you still see all those screwy warnings?*
> All 4 of the nodes can run this without issue as my user. When I sudo su
> to the rabbitmq user I get errors on 2 of the 4 nodes, as such:
>
>
> Well the nodes should always be running as the rabbitmq user, so how're
> you starting them as your user? That might be at the root of some of these
> problems, viz the rabbitmq-server (service) should always run as the
> rabbitmq user and when issuing rabbitmqctl commands and the like, you would
> normally do `$ sudo rabbitmqctl status` and so on. Log files would
> definitely help though.
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>
>