[rabbitmq-discuss] Upgrade fail

Thu Apr 3 14:11:23 BST 2014

On 02/04/14 20:19, Peter Kopias wrote:
> Hi.
>
>   I've tried to upgrade my cluster to 3.3.0.

<snip>

> BOOT FAILED
> ===========
>
> Timeout contacting cluster nodes: [rabbit@node3,rabbit@node2].
>
> DIAGNOSTICS
> ===========
>
> attempted to contact: [rabbit@node3,rabbit@node2]
>
> rabbit@node3:
>    * found rabbit (port 25672)
>    * TCP connection succeeded
>    * suggestion: hostname mismatch?
>    * suggestion: is the cookie set correctly?
> rabbit@node2:
>    * found rabbit (port 25672)
>    * TCP connection succeeded
>    * suggestion: hostname mismatch?
>    * suggestion: is the cookie set correctly?

This is quite weird. The diagnostics indicate that there actually is 
some process up, registered with epmd and listening on the Erlang 
distribution port for both node2 and node3.

> WHY is it waiting for node2, node3, if node1 was the last to stop, and
> it should come online in itself without them?

Very good question.

I was able to reproduce this precise error message by doing the 
equivalent of:

(stop node1)
rabbitmqctl -n node2 stop_app
rabbitmqctl -n node3 stop_app
(start node1)

i.e. leaving the Erlang VM running but stopping RabbitMQ (and Mnesia) 
on node2 and node3.

Is there any possibility at all you could have done something like this? 
Is there a beam.smp process running on those nodes?

I am especially puzzled since the nodes have registered 25672 as a 
distribution port - prior to 3.3.0 the distribution port would be chosen 
at random, or you could configure it. So the presence of 25672 strongly 
indicates that those nodes have been upgraded to 3.3.0 and are running 
an Erlang VM - with RabbitMQ stopped?
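
A minimal sketch of how one might check this on node2 and node3, assuming 
Python is available there. It asks epmd which Erlang nodes are registered 
and on which distribution port. The helper name parse_epmd_names is 
hypothetical, and the "name <node> at port <port>" line format of 
"epmd -names" output is an assumption of this sketch.

```python
import re
import subprocess

# Distribution port reported by the diagnostics quoted above.
EXPECTED_DIST_PORT = 25672


def parse_epmd_names(output):
    """Return {node_name: port} parsed from `epmd -names` output.

    Assumes lines of the form: name rabbit at port 25672
    """
    nodes = {}
    for line in output.splitlines():
        match = re.match(r"^name (\S+) at port (\d+)$", line.strip())
        if match:
            nodes[match.group(1)] = int(match.group(2))
    return nodes


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["epmd", "-names"], capture_output=True, text=True
        )
        registered = parse_epmd_names(result.stdout)
    except FileNotFoundError:
        # epmd is not on PATH, so nothing can be concluded from here.
        registered = {}
        print("epmd not found on PATH")

    for name, port in registered.items():
        note = " (matches the diagnostics)" if port == EXPECTED_DIST_PORT else ""
        print(f"Erlang node '{name}' registered on port {port}{note}")
    if not registered:
        print("no Erlang nodes registered with epmd")
```

If this lists a node named rabbit on port 25672, an Erlang VM is up on 
that machine. Whether RabbitMQ itself is running inside it would then 
need a separate check, for example with rabbitmqctl status.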

> The cookie is the same, the hostnames are correct.

If I am correct about the diagnosis above then the diagnostics don't 
currently handle this situation well; they don't check for the case 
where Erlang is running but RabbitMQ isn't. This will get fixed in the 
next release.

>   I'm trying to start node2 and node3 to have at least one running but
> they always fail, even if I copied back the previous disk state, and
> upgraded again...

How do they fail? Because it sounds like something is running.

>   Not a single node running currently, looks like I have to:
>   - reinstall each rabbitmq-server with clean /var/lib/rabbitmq dirs
>   - rebuild cluster
>   - rebuild the virtual hosts
>   - recreate all users and permissions
>   - all queues and exchanges and policies
>
>   Thats not the way upgrade should happen. :(

No, it really isn't. :-(

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal