[rabbitmq-discuss] RabbitMQ load balancing/failover with LVS
Niko Felger
niko.felger at googlemail.com
Mon Jul 27 16:55:09 BST 2009
Hi,
We have been working with a single RabbitMQ node for about a month
now, and have been very happy with it, so we decided to add a second
node, mainly for failover, since our load is moderate. We hooked up
two nodes following http://www.rabbitmq.com/clustering.html:
Status of node 'rabbit at dc1-live-mq1' ...
[{running_applications,[{rabbit,"RabbitMQ","1.5.5"},
{mnesia,"MNESIA CXC 138 12","4.3.5"},
{os_mon,"CPO CXC 138 46","2.1.2.1"},
{sasl,"SASL CXC 138 11","2.1.5.1"},
{stdlib,"ERTS CXC 138 10","1.14.5"},
{kernel,"ERTS CXC 138 10","2.11.5"}]},
{nodes,['rabbit at dc1-live-mq2','rabbit at dc1-live-mq1']},
{running_nodes,['rabbit at dc1-live-mq2','rabbit at dc1-live-mq1']}]
...done.
Status of node 'rabbit at dc1-live-mq2' ...
[{running_applications,[{rabbit,"RabbitMQ","1.5.5"},
{mnesia,"MNESIA CXC 138 12","4.3.5"},
{os_mon,"CPO CXC 138 46","2.1.2.1"},
{sasl,"SASL CXC 138 11","2.1.5.1"},
{stdlib,"ERTS CXC 138 10","1.14.5"},
{kernel,"ERTS CXC 138 10","2.11.5"}]},
{nodes,['rabbit at dc1-live-mq2','rabbit at dc1-live-mq1']},
{running_nodes,['rabbit at dc1-live-mq1','rabbit at dc1-live-mq2']}]
...done.
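For reference, the clustering steps we followed amount to roughly the following (a sketch against the 1.5.x rabbitmqctl; the exact invocation is in the guide linked above, so treat this as an outline rather than a recipe):

```shell
# On dc1-live-mq2, join it to the cluster formed by dc1-live-mq1.
rabbitmqctl stop_app
rabbitmqctl reset
# Listing both nodes makes mq2 a disk node; listing only mq1
# would make it a RAM node.
rabbitmqctl cluster rabbit@dc1-live-mq1 rabbit@dc1-live-mq2
rabbitmqctl start_app
```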
When we point our clients at either RabbitMQ node directly,
everything works fine.
In order to allow our clients to always point to a single host,
regardless of which nodes are up, we set up LVS load balancing on a
third server called 'lb1'. However, once we do this, we experience
issues with low-volume queues. It goes roughly like this:
- Consumer starts and establishes a connection to lb1.
- lb1 forwards packets from the consumer to e.g. mq1.
- At this point, the consumer has an established connection to lb1,
mq1 has an established connection directly to the consumer, and
messages published to the queue reach the consumer.
- After ~5-10 minutes without messages published to the queue, the
connection on the consumer goes away, and it establishes a new
connection to lb1. At this point mq1 still has an established
connection to the consumer on the original port, in addition to the
new connection. Messages published to the queue in question are no
longer delivered to the consumer.
- We start another consumer, but it doesn't receive messages either.
- After some more time, the original connection times out
({inet_error,etimedout}), and messages get processed again, but only
by the second consumer.
It may be worth mentioning that the consumer subscribes to the queue
with auto-ack turned off.
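(One detail that may matter: while the queue is quiet, nothing keeps the director's connection entry warm. A commonly suggested workaround is TCP keepalive probes at an interval shorter than the balancer's idle timeout; a minimal sketch of what that looks like at the socket level in Python, illustrative only and not our actual client code. AMQP-level heartbeats, where the client library supports them, would have a similar effect inside the protocol.)

```python
import socket

def keepalive_socket(idle=60, interval=30, count=3):
    """Create a TCP socket with keepalive probes enabled, so an
    intermediary (e.g. an LVS director) sees periodic traffic on
    otherwise idle connections and keeps its state entry alive."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning: start probing after `idle` seconds of
    # silence, probe every `interval` seconds, and give up after
    # `count` failed probes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return s

sock = keepalive_socket()
# getsockopt returns a nonzero value once keepalive is enabled.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
sock.close()
```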
The problem seems to be the load balancer dropping idle connections.
Since we're using the same load balancer successfully in a few other
cases, though, I thought I'd ask for input on whether this is even a
sensible failover strategy for RabbitMQ, and whether anyone has
experience with setups similar to ours.
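(On the LVS side, the director's idle timeout for established TCP sessions looks like the relevant knob; a sketch of raising it with ipvsadm on lb1, with illustrative values:)

```shell
# ipvsadm --set <tcp> <tcpfin> <udp>: idle timeouts in seconds for
# established TCP sessions, TCP sessions after FIN, and UDP entries.
# Raise the established-TCP timeout to one hour so quiet AMQP
# connections are not expired from the connection table.
ipvsadm --set 3600 120 300
```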
Thanks!
niko
PS: We're also seeing plenty of this in the rabbit.log, repeating
every 30 seconds:
=ERROR REPORT==== 27-Jul-2009::16:27:24 ===
** Generic server <0.9049.9> terminating
** Last message in was {inet_async,#Port<0.222>,41513,{ok,#Port<0.236483>}}
** When Server state == {state,{rabbit_networking,start_client,[]},
#Port<0.222>,
41513}
** Reason for termination ==
** {{badmatch,{error,enotconn}},
[{tcp_acceptor,handle_info,2},
{gen_server,handle_msg,6},
{proc_lib,init_p,5}]}
From rabbit-sasl.log:
=CRASH REPORT==== 27-Jul-2009::16:27:24 ===
crasher:
pid: <0.9049.9>
registered_name: []
error_info: {{badmatch,{error,enotconn}},
[{tcp_acceptor,handle_info,2},
{gen_server,handle_msg,6},
{proc_lib,init_p,5}]}
initial_call: {gen,init_it,
[gen_server,
<0.179.0>,
<0.179.0>,
tcp_acceptor,
{{rabbit_networking,start_client,[]},#Port<0.222>},
[]]}
ancestors: ['tcp_acceptor_sup_0.0.0.0:5672',
<0.178.0>,
rabbit_sup,
<0.105.0>]
messages: []
links: [<0.179.0>,#Port<0.236483>]
dictionary: []
trap_exit: false
status: running
heap_size: 233
stack_size: 21
reductions: 166
neighbours:
=SUPERVISOR REPORT==== 27-Jul-2009::16:27:24 ===
Supervisor: {local,
'tcp_acceptor_sup_0.0.0.0:5672'}
Context: child_terminated
Reason: {{badmatch,{error,enotconn}},
[{tcp_acceptor,handle_info,2},
{gen_server,handle_msg,6},
{proc_lib,init_p,5}]}
Offender: [{pid,<0.9049.9>},
{name,tcp_acceptor},
{mfa,{tcp_acceptor,start_link,
[{rabbit_networking,start_client,[]},
#Port<0.222>]}},
{restart_type,transient},
{shutdown,brutal_kill},
{child_type,worker}]