[rabbitmq-discuss] RabbitMQ load balancing/failover with LVS
Niko Felger
niko.felger at googlemail.com
Mon Jul 27 16:55:09 BST 2009
Hi,
We have been working with a single RabbitMQ node for about a month
now, and have been very happy with it, so we decided to add a second
node, mainly for failover, since our load is moderate. We hooked up
two nodes following http://www.rabbitmq.com/clustering.html:
Status of node 'rabbit at dc1-live-mq1' ...
[{running_applications,[{rabbit,"RabbitMQ","1.5.5"},
{mnesia,"MNESIA CXC 138 12","4.3.5"},
{os_mon,"CPO CXC 138 46","2.1.2.1"},
{sasl,"SASL CXC 138 11","2.1.5.1"},
{stdlib,"ERTS CXC 138 10","1.14.5"},
{kernel,"ERTS CXC 138 10","2.11.5"}]},
{nodes,['rabbit at dc1-live-mq2','rabbit at dc1-live-mq1']},
{running_nodes,['rabbit at dc1-live-mq2','rabbit at dc1-live-mq1']}]
...done.
Status of node 'rabbit at dc1-live-mq2' ...
[{running_applications,[{rabbit,"RabbitMQ","1.5.5"},
{mnesia,"MNESIA CXC 138 12","4.3.5"},
{os_mon,"CPO CXC 138 46","2.1.2.1"},
{sasl,"SASL CXC 138 11","2.1.5.1"},
{stdlib,"ERTS CXC 138 10","1.14.5"},
{kernel,"ERTS CXC 138 10","2.11.5"}]},
{nodes,['rabbit at dc1-live-mq2','rabbit at dc1-live-mq1']},
{running_nodes,['rabbit at dc1-live-mq1','rabbit at dc1-live-mq2']}]
...done.
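For reference, the clustering steps we followed amount to roughly the following (a sketch against the 1.5.x rabbitmqctl; the exact invocation is in the guide linked above, so treat this as an outline rather than a recipe):

```shell
# On dc1-live-mq2, join it to the cluster formed by dc1-live-mq1.
rabbitmqctl stop_app
rabbitmqctl reset
# Listing both nodes makes mq2 a disk node; listing only mq1
# would make it a RAM node.
rabbitmqctl cluster rabbit@dc1-live-mq1 rabbit@dc1-live-mq2
rabbitmqctl start_app
```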
When we point our clients at either RabbitMQ node directly,
everything works fine.
In order to allow our clients to always point to a single host,
regardless of which nodes are up, we set up LVS load balancing on a
third server called 'lb1'. However, once we do this, we experience
issues with low-volume queues. It goes roughly like this:
- Consumer starts and establishes a connection to lb1.
- lb1 forwards packets from the consumer to e.g. mq1.
- At this point, the consumer has an established connection to lb1,
mq1 has an established connection directly to the consumer, and
messages published to the queue reach the consumer.
- After ~5-10 minutes without messages published to the queue, the
connection on the consumer goes away, and it establishes a new
connection to lb1. At this point mq1 still has an established
connection to the consumer on the original port, in addition to the
new connection. Messages published to the queue in question are no
longer delivered to the consumer.
- We start another consumer, but it doesn't receive messages either.
- After some more time, the original connection times out
({inet_error,etimedout}), and messages get processed again, but only
by the second consumer.
It may be worth mentioning that the consumer subscribes to the queue
with auto-ack turned off.
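(One detail that may matter: while the queue is quiet, nothing keeps the director's connection entry warm. A commonly suggested workaround is TCP keepalive probes at an interval shorter than the balancer's idle timeout; a minimal sketch of what that looks like at the socket level in Python, illustrative only and not our actual client code. AMQP-level heartbeats, where the client library supports them, would have a similar effect inside the protocol.)

```python
import socket

def keepalive_socket(idle=60, interval=30, count=3):
    """Create a TCP socket with keepalive probes enabled, so an
    intermediary (e.g. an LVS director) sees periodic traffic on
    otherwise idle connections and keeps its state entry alive."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning: start probing after `idle` seconds of
    # silence, probe every `interval` seconds, and give up after
    # `count` failed probes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return s

sock = keepalive_socket()
# getsockopt returns a nonzero value once keepalive is enabled.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
sock.close()
```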
The problem seems to be the load balancer dropping idle connections.
Since we're using the same load balancer successfully in a few other
cases, though, I thought I'd ask for input on whether this is even a
sensible failover strategy for RabbitMQ, and whether anyone has
experience with setups similar to ours.
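(On the LVS side, the director's idle timeout for established TCP sessions looks like the relevant knob; a sketch of raising it with ipvsadm on lb1, with illustrative values:)

```shell
# ipvsadm --set <tcp> <tcpfin> <udp>: idle timeouts in seconds for
# established TCP sessions, TCP sessions after FIN, and UDP entries.
# Raise the established-TCP timeout to one hour so quiet AMQP
# connections are not expired from the connection table.
ipvsadm --set 3600 120 300
```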
Thanks!
niko
PS: We're also seeing plenty of this in the rabbit.log, repeating
every 30 seconds:
=ERROR REPORT==== 27-Jul-2009::16:27:24 ===
** Generic server <0.9049.9> terminating
** Last message in was {inet_async,#Port<0.222>,41513,{ok,#Port<0.236483>}}
** When Server state == {state,{rabbit_networking,start_client,[]},
#Port<0.222>,
41513}
** Reason for termination ==
** {{badmatch,{error,enotconn}},
[{tcp_acceptor,handle_info,2},
{gen_server,handle_msg,6},
{proc_lib,init_p,5}]}
From rabbit-sasl.log:
=CRASH REPORT==== 27-Jul-2009::16:27:24 ===
crasher:
pid: <0.9049.9>
registered_name: []
error_info: {{badmatch,{error,enotconn}},
[{tcp_acceptor,handle_info,2},
{gen_server,handle_msg,6},
{proc_lib,init_p,5}]}
initial_call: {gen,init_it,
[gen_server,
<0.179.0>,
<0.179.0>,
tcp_acceptor,
{{rabbit_networking,start_client,[]},#Port<0.222>},
[]]}
ancestors: ['tcp_acceptor_sup_0.0.0.0:5672',
<0.178.0>,
rabbit_sup,
<0.105.0>]
messages: []
links: [<0.179.0>,#Port<0.236483>]
dictionary: []
trap_exit: false
status: running
heap_size: 233
stack_size: 21
reductions: 166
neighbours:
=SUPERVISOR REPORT==== 27-Jul-2009::16:27:24 ===
Supervisor: {local,
'tcp_acceptor_sup_0.0.0.0:5672'}
Context: child_terminated
Reason: {{badmatch,{error,enotconn}},
[{tcp_acceptor,handle_info,2},
{gen_server,handle_msg,6},
{proc_lib,init_p,5}]}
Offender: [{pid,<0.9049.9>},
{name,tcp_acceptor},
{mfa,{tcp_acceptor,start_link,
[{rabbit_networking,start_client,[]},
#Port<0.222>]}},
{restart_type,transient},
{shutdown,brutal_kill},
{child_type,worker}]