[rabbitmq-discuss] rabbitmq 3.1.4 upgrade lost cluster config

Wed Aug 14 20:17:17 BST 2013

On Wed, Aug 14, 2013 at 03:14:20PM +0100, Emile Joubert wrote:

> If you ran the same sequence of steps on different clusters that had
> identical configuration then you should get the same result. Either
> the clusters did not have the same configuration or the sequence of
> steps was different.

We use puppet.  The configurations are templated and enforced to be
identical except for cluster node names, which are different between
clusters.  If a config file that is read only at server start, such
as /etc/rabbitmq/rabbitmq.config, changes, then puppet automatically
restarts rabbitmq.

> Compare the logfiles from the nodes in the first cluster with the
> logfiles from the second cluster. The differences should indicate the
> cause. Pay close attention to the order of messages of the form
> 
>   rabbit on node 'name@host' up/down

The logs aren't helping me.  In particular, the order of "up" and
"down" events is equivalent between the clusters up until the time of
the failure.
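
For anyone who wants to repeat the comparison, here is a minimal sketch
of how the up/down ordering can be pulled out of a log and compared
between clusters.  It assumes only that the log contains lines of the
form quoted above; the function name node_events and the rename
argument are made up for illustration and are not part of rabbitmq.

```python
import re

# Matches the "rabbit on node <name> up" / "... down" messages quoted above.
# Assumption: the node name is the text between "node" and the final up/down.
EVENT_RE = re.compile(r"rabbit on node (.+?) (up|down)\b")


def node_events(log_text, rename=None):
    """Return the ordered list of (node, state) events found in log_text.

    rename is an optional dict of substring replacements applied to each
    node name, so that logs from two clusters with different hostnames
    can be compared directly.
    """
    rename = rename or {}
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not an up/down message
        node = match.group(1).strip("'")
        for old, new in rename.items():
            node = node.replace(old, new)
        events.append((node, match.group(2)))
    return events
```

Calling node_events() on the log text from each cluster, with rename
mapping one cluster's hostnames onto the other's, gives two lists that
can be compared with ==.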

> Also compare the configurations using "rabbitmqctl environment" on both
> clusters and make sure they are the same.

They are indeed the same, except for file names (which incorporate the
nodenames) and cluster members (which again incorporate the
nodenames).  If I compensate for the above with something like
perl -pi -e 's/hostname91/hostname11/', the configs are identical.
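
The same check can be sketched in Python.  This assumes the output of
"rabbitmqctl environment" from each cluster has already been captured
as a string; normalized_diff and its arguments are hypothetical names
used only for this example.

```python
import difflib


def normalized_diff(env_a, env_b, rename):
    """Diff two captured environment dumps after renaming hostnames.

    rename maps hostname substrings from cluster A to the matching
    names in cluster B.  An empty result means the two dumps are
    identical once the names are accounted for.
    """
    for old, new in rename.items():
        env_a = env_a.replace(old, new)
    return list(
        difflib.unified_diff(
            env_a.splitlines(),
            env_b.splitlines(),
            "cluster_a",
            "cluster_b",
            lineterm="",
        )
    )
```

Any lines in the result point at a setting that really differs between
the two clusters, rather than at a hostname.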

This looks to me like a code bug.  A race condition or any number of
other classes of bug could explain why two identically-configured
clusters would exhibit different behavior when run through the same
sequence of operations.

- Morty