[rabbitmq-discuss] Clustering not working for some connections

Thu Oct 21 15:55:05 BST 2010

  Hi all,

  We are trying to run a cluster of 2 RabbitMQ machines on Amazon EC2.
It runs fine for a little while, but at some stage it stops working,
though only for messages where the producer and consumer are connected
to different nodes. At this point, "rabbitmqctl list_connections"
becomes completely unresponsive, and so do attempts to restart the
servers. The only option is to kill -9 all the Erlang processes and
start them again.

  rabbitmqctl status shows:

Status of node rabbit@rabbit1 ...
[{running_applications,
     [{rabbit_management,"RabbitMQ Management Console","2.1.1"},
      {webmachine,"webmachine","1.7.0"},
      {amqp_client,"RabbitMQ AMQP Client","2.1.1"},
      {rabbit,"RabbitMQ","2.1.0"},
      {os_mon,"CPO  CXC 138 46","2.2.5"},
      {sasl,"SASL  CXC 138 11","2.1.9"},
      {rabbit_mochiweb,"RabbitMQ Mochiweb Embedding","2.1.1"},
      {mochiweb,"MochiMedia Web Server","1.3"},
      {crypto,"CRYPTO version 1","1.6.4"},
      {inets,"INETS  CXC 138 49","5.3"},
      {mnesia,"MNESIA  CXC 138 12","4.4.13"},
      {stdlib,"ERTS  CXC 138 10","1.16.5"},
      {kernel,"ERTS  CXC 138 10","2.13.5"}]},
 {nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
 {running_nodes,[rabbit@rabbit2,rabbit@rabbit1]}]
...done.

Status of node rabbit@rabbit2 ...
[{running_applications,
     [{rabbit_management,"RabbitMQ Management Console","2.1.1"},
      {webmachine,"webmachine","1.7.0"},
      {amqp_client,"RabbitMQ AMQP Client","2.1.1"},
      {rabbit,"RabbitMQ","2.1.0"},
      {os_mon,"CPO  CXC 138 46","2.2.5"},
      {sasl,"SASL  CXC 138 11","2.1.9"},
      {rabbit_mochiweb,"RabbitMQ Mochiweb Embedding","2.1.1"},
      {mochiweb,"MochiMedia Web Server","1.3"},
      {crypto,"CRYPTO version 1","1.6.4"},
      {inets,"INETS  CXC 138 49","5.3"},
      {mnesia,"MNESIA  CXC 138 12","4.4.13"},
      {stdlib,"ERTS  CXC 138 10","1.16.5"},
      {kernel,"ERTS  CXC 138 10","2.13.5"}]},
 {nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
 {running_nodes,[rabbit@rabbit1,rabbit@rabbit2]}]
...done.

In the logs of rabbit2, the only errors I see are a few like this:

=ERROR REPORT==== 21-Oct-2010::14:40:47 ===
exception on TCP connection <0.19069.0> from 88.211.55.18:13580
{bad_header,<<"<policy-">>}

  Other information:
  - The hostnames (rabbit1, rabbit2) are defined in /etc/hosts on both
machines using their private IPs, and consumers access them through a
DNS round-robin to their public IPs
  - Both machines set NODENAME=rabbit@<host> in
/etc/rabbitmq/rabbitmq.conf
  - The cluster is defined in /etc/rabbitmq/rabbitmq.config using
{cluster_nodes, ['rabbit@rabbit1','rabbit@rabbit2']}
  - We are using RabbitMQ 2.1.0 and Erlang R13B04 (erts-5.7.5)
[source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [hipe]
[kernel-poll:false]
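
  For reference, a minimal sketch of how the cluster setting sits in
/etc/rabbitmq/rabbitmq.config, with node names in the usual name@host
form. This assumes the file holds nothing besides the cluster_nodes
entry; the surrounding rabbit application section is the standard
wrapper for this kind of file, not copied verbatim from our machines:

```erlang
%% /etc/rabbitmq/rabbitmq.config (sketch; identical on both machines)
[
  {rabbit, [
    %% Both nodes are listed, so both join the cluster as disc nodes,
    %% which matches the {nodes,[{disc,[...]}]} entry in the status output.
    {cluster_nodes, ['rabbit@rabbit1', 'rabbit@rabbit2']}
  ]}
].
```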

  Any ideas what could be wrong?

--
Ivan Sanchez