[rabbitmq-discuss] *Advisory* Clustering not working for some connections

Michael Bridgen mikeb at rabbitmq.com
Mon Oct 25 14:04:30 BST 2010


All,

We are pretty sure we have traced this to the inter-node routing code 
used in clustering.

It's quite hard to reproduce, and in general seems uncommon; however, 
/if/ you are using clustering /and/ you see these symptoms:

  - publishing across nodes stops working
  - rabbitmqctl list_queues or list_connections don't respond
  - RabbitMQ has to have a hard restart

it is likely to be this problem, and we advise trying to adapt your 
set-up to work without clustering for the time being.
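
One way to watch for the second symptom is to run the rabbitmqctl commands under a timeout, so that a hang is reported instead of blocking the shell. The following is only a minimal sketch: the `probe` helper and the timeout value are illustrative assumptions, not part of RabbitMQ.

```python
import subprocess


def probe(command, timeout_seconds=10.0):
    """Run a command and classify the outcome.

    Returns "ok" on exit status 0, "failed" on a non-zero status,
    "hung" if the command does not finish in time, and "missing"
    if the executable cannot be found.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"


# Hypothetical usage on a broker node:
#   for sub in ("list_queues", "list_connections"):
#       print(sub, probe(["rabbitmqctl", sub]))
```

A "hung" result on its own does not prove this bug; it only matches the symptom described above.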

We are working on fixing it in the next release.


Michael

--
Michael Bridgen
Staff Engineer, RabbitMQ


>    I just noticed that in my tests the producers are also getting
> blocked (we are using the Java client and basicPublish()).
>
>    "rabbitmqctl list_consumers" also becomes unresponsive at this point.
>
>    Any help would be really appreciated.
>
>    Thank you,
>
> --
> Ivan Sanchez
>
> On Oct 21, 3:55 pm, Ivan Sanchez<s4nc... at gmail.com>  wrote:
>>    Hi all,
>>
>>    We are trying to run a cluster of two RabbitMQ machines on Amazon EC2
>> and, although it runs fine for a little while, at some stage it stops
>> working, but only for messages whose producer and consumer are connected
>> to different nodes. At this point "rabbitmqctl list_connections" becomes
>> completely unresponsive, as do attempts to restart the servers. The
>> only option is to kill -9 all Erlang processes and start them again.
>>
>>    rabbitmqctl status shows:
>>
>> Status of node rabbit@rabbit1 ...
>> [{running_applications,
>>       [{rabbit_management,"RabbitMQ Management Console","2.1.1"},
>>        {webmachine,"webmachine","1.7.0"},
>>        {amqp_client,"RabbitMQ AMQP Client","2.1.1"},
>>        {rabbit,"RabbitMQ","2.1.0"},
>>        {os_mon,"CPO  CXC 138 46","2.2.5"},
>>        {sasl,"SASL  CXC 138 11","2.1.9"},
>>        {rabbit_mochiweb,"RabbitMQ Mochiweb Embedding","2.1.1"},
>>        {mochiweb,"MochiMedia Web Server","1.3"},
>>        {crypto,"CRYPTO version 1","1.6.4"},
>>        {inets,"INETS  CXC 138 49","5.3"},
>>        {mnesia,"MNESIA  CXC 138 12","4.4.13"},
>>        {stdlib,"ERTS  CXC 138 10","1.16.5"},
>>        {kernel,"ERTS  CXC 138 10","2.13.5"}]},
>>   {nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
>>   {running_nodes,[rabbit@rabbit2,rabbit@rabbit1]}]
>> ...done.
>>
>> Status of node rabbit@rabbit2 ...
>> [{running_applications,
>>       [{rabbit_management,"RabbitMQ Management Console","2.1.1"},
>>        {webmachine,"webmachine","1.7.0"},
>>        {amqp_client,"RabbitMQ AMQP Client","2.1.1"},
>>        {rabbit,"RabbitMQ","2.1.0"},
>>        {os_mon,"CPO  CXC 138 46","2.2.5"},
>>        {sasl,"SASL  CXC 138 11","2.1.9"},
>>        {rabbit_mochiweb,"RabbitMQ Mochiweb Embedding","2.1.1"},
>>        {mochiweb,"MochiMedia Web Server","1.3"},
>>        {crypto,"CRYPTO version 1","1.6.4"},
>>        {inets,"INETS  CXC 138 49","5.3"},
>>        {mnesia,"MNESIA  CXC 138 12","4.4.13"},
>>        {stdlib,"ERTS  CXC 138 10","1.16.5"},
>>        {kernel,"ERTS  CXC 138 10","2.13.5"}]},
>>   {nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
>>   {running_nodes,[rabbit@rabbit1,rabbit@rabbit2]}]
>> ...done.
>>
>> In the logs of rabbit2, the only errors I see are some of these:
>>
>> =ERROR REPORT==== 21-Oct-2010::14:40:47 ===
>> exception on TCP connection <0.19069.0> from 88.211.55.18:13580
>> {bad_header,<<"<policy-">>}
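
A note on the log entry above: an AMQP client opens with an 8-byte protocol header beginning "AMQP", and the 8 bytes reported here, "<policy-", match the start of a Flash socket policy request ("<policy-file-request/>"). That would suggest a non-AMQP client reaching the broker port on its public address, which may well be unrelated to the clustering problem. A small sketch of that reasoning, with the `classify_handshake` helper being purely illustrative:

```python
# Text a Flash player sends when asking a server for its socket policy.
FLASH_POLICY_REQUEST = b"<policy-file-request/>"


def classify_handshake(first_bytes):
    """Guess what kind of client sent the first bytes of a connection."""
    header = first_bytes[:8]  # the broker reports the first 8 bytes
    if header.startswith(b"AMQP"):
        return "amqp"
    if header and FLASH_POLICY_REQUEST.startswith(header):
        return "flash-policy-request"
    return "unknown"


# The bytes from the error report classify as a policy request:
#   classify_handshake(b"<policy-")  ->  "flash-policy-request"
```

This is an assumption drawn from the logged bytes alone; the log does not say which client sent them.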
>>
>>    Other information:
>>    - The hostnames (rabbit1, rabbit2) are defined in /etc/hosts on both
>> machines using their private IPs, and consumers access them through a
>> DNS round-robin to their public IPs
>>    - Both machines set NODENAME=rabbit@<host> in
>> /etc/rabbitmq/rabbitmq.conf
>>    - The cluster is defined in /etc/rabbitmq/rabbitmq.config using
>> {cluster_nodes, ['rabbit@rabbit1','rabbit@rabbit2']}
>>    - We are using RabbitMQ 2.1.0 and Erlang R13B04 (erts-5.7.5)
>> [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [hipe]
>> [kernel-poll:false]
>>
>>    Any ideas about what could be wrong?
>>
>> --
>> Ivan Sanchez
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-disc... at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss


