[rabbitmq-discuss] Rabbit Cluster "hangs" on consumer disconnects
Cordell, Ron
Ron.Cordell at RelayHealth.com
Fri Nov 1 17:45:27 GMT 2013
I’m seeing a strange issue with the Rabbit cluster when we shut down the applications that connect to the cluster.
Here is the setup:
Rabbit 3.1.5 Erlang 16B02 running as a 2-node cluster behind an F5 load balancer
The F5 is currently set to send connections to ONLY the 01 node, but provides a virtual IP to the application.
The application consists of a 5 Windows services running on 4 servers each – all 5 services connect to the Rabbit cluster and those services are duplicated across 4 servers.
What seems to happen (but not always) is the following sequence of events:
1. Issue commands to shut down Windows Services, Windows Services begin shutting down
2. At some point during the shutdown process we see the following in the Rabbit logs (truncated for brevity) - there is nothing in the logs before this event, by the way
=WARNING REPORT==== 31-Oct-2013::13:35:57 ===
closing AMQP connection <0.3379.0> (10.100.193.10:50234 -> 10.100.193.80:5672):
connection_closed_abruptly
=WARNING REPORT==== 31-Oct-2013::13:35:57 ===
closing AMQP connection <0.3383.0> (10.100.193.10:50235 -> 10.100.193.80:5672):
connection_closed_abruptly
=WARNING REPORT==== 31-Oct-2013::13:35:57 ===
closing AMQP connection <0.3926.0> (10.100.193.10:50264 -> 10.100.193.80:5672):
connection_closed_abruptly
1. The Windows Services that are trying to shut down hang, unable to shut down cleanly
2. The F5 load balancer is using a monitor to check to see if the Rabbit nodes are up via the HTTP API, and it shows that both are not responding, so the F5 stops routing traffic to either Rabbit node in the cluster
3. Logging onto the rabbit node,
* we are able to use the Erlang ping/pong to verify connectivity between nodes
* The management plugin UI is unresponsive
* We issue a rabbitmqctl stop_app command to node 01, which responds, and we see the following in the rabbit log file:
=INFO REPORT==== 31-Oct-2013::13:56:58 ===
Stopping RabbitMQ
=ERROR REPORT==== 31-Oct-2013::13:57:00 ===
** Generic server <0.6855.0> terminating
** Last message in was {'$gen_cast',
{method,
{'exchange.unbind',0,
<<"ha-execute_suppressmessagedeliveryactivity">>,
<<"MassTransit.Diagnostics.Tracing:GetMessageTraceList">>,
<<>>,false,[]},
none,noflow}}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,<0.4073.0>,
<0.6853.0>,<0.4073.0>,
<<"10.100.193.10:50286 -> 10.100.193.80:5672">>,
{lstate,<0.6854.0>,false,false},
none,1,
{[],[]},
{user,<<"SJINRMQVORTEXWRITE">>,[],
rabbit_auth_backend_ldap,
{impl,
"CN=SJINRMQ VORTEXWRITE,OU=INRMQ,OU=INTEGRATION,OU=Environments,OU=Servers,DC=RHB,DC=AD",
<<"sQ*4~j5Ca9F_7G">>}},
<<"vortex">>,<<>>,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{set,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{set,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
<0.4070.0>,
{state,fine,5000,undefined},
false,1,
{{0,nil},{0,nil}},
[],
[{<<"publisher_confirms">>,bool,true},
{<<"exchange_exchange_bindings">>,bool,true},
{<<"consumer_cancel_notify">>,bool,true},
{<<"basic.nack">>,bool,true}],
none}
** Reason for termination ==
** {{shutdown,
{gen_server,call,
[rabbit_auth_backend_ldap,
{check_resource,
[{username,<<"SJINRMQVORTEXWRITE">>},
{user_dn,
"CN=SJINRMQ VORTEXWRITE,OU=INRMQ,OU=INTEGRATION,OU=Environments,OU=Servers,DC=RHB,DC=AD"},
{vhost,<<"vortex">>},
{resource,exchange},
{name,<<"ha-execute_suppressmessagedeliveryactivity">>},
{permission,write}],
{user,<<"SJINRMQVORTEXWRITE">>,[],rabbit_auth_backend_ldap,
{impl,
"CN=SJINRMQ VORTEXWRITE,OU=INRMQ,OU=INTEGRATION,OU=Environments,OU=Servers,DC=RHB,DC=AD",
<<"sQ*4~j5Ca9F_7G">>}}},
infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,188}]},
{rabbit_access_control,check_access,5,[]},
{rabbit_channel,check_resource_access,3,[]},
{rabbit_channel,binding_action,9,[]},
{rabbit_channel,handle_cast,2,[]},
{gen_server2,handle_msg,2,[]},
{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}
=ERROR REPORT==== 31-Oct-2013::13:57:00 ===
** Generic server <0.16344.0> terminating
** Last message in was {'$gen_cast',
{method,
{'basic.publish',0,
<<"RelayHealth.MessageExchange.Core.Configuration.Events:MessageSubscribersLoaded">>,
<<>>,false,false},
{content,60,none,
<<48,128,0,0,0,50,12,67,111,110,116,101,110,
116,45,84,121,112,101,83,0,0,0,32,97,112,
112,108,105,99,97,116,105,111,110,47,118,
110,100,46,109,97,115,115,116,114,97,110,
115,105,116,43,106,115,111,110,2,36,49,97,
102,100,48,48,48,48,45,53,50,97,102,45,
100,52,97,101,45,50,56,50,57,45,48,56,100,
48,97,48,52,56,50,53,102,102>>,
rabbit_framing_amqp_0_9_1,
[<<"{\r\n \"destinationAddress\": \"rabbitmq://rabbit.integration.rhb.ad/vortex/RelayHealth.MessageExchange.Core.Configuration.Events:MessageSubscribersLoaded\",\r\n \"headers\": {},\r\n \"message\": {\r\n \"eventId\": \"1afd0000-52af-d4ae-dcd2-08d0a04825fe\",\r\n \"timeCompleted\": \"2013-10-31T08:56:47.178316Z\",\r\n \"duration\": \"00:00:00.0624004\",\r\n \"initiatingEventId\": \"1afd0000-52af-d4ae-6927-08d0a04825fc\",\r\n \"machineName\": \"SJINWEB03\",\r\n \"processName\": \"RelayHealth.Services.SubscriberActivityService\",\r\n \"messageSubscriberCount\": 264\r\n },\r\n \"messageType\": [\r\n \"urn:message:RelayHealth.MessageExchange.Core.Configuration.MessageSubscribers:MessageSubscriberCache+MessageSubscribersLoadedImpl\",\r\n \"urn:message:RelayHealth.MessageExchange.Core.Configuration.Events:MessageSubscribersLoaded\"\r\n ],\r\n \"sourceAddress\": \"rabbitmq://rabbit.integration.rhb.ad/vortex/ha-subscriber_activity_service_control_sjinweb03?ha=true&prefetch=1\"\r\n}">>]},
flow}}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,<0.16335.0>,
<0.16342.0>,<0.16335.0>,
<<"10.100.193.10:54839 -> 10.100.193.80:5672">>,
{lstate,<0.16343.0>,false,false},
none,1,
{[],[]},
{user,<<"SJINRMQVORTEXWRITE">>,[],
rabbit_auth_backend_ldap,
{impl,
"CN=SJINRMQ VORTEXWRITE,OU=INRMQ,OU=INTEGRATION,OU=Environments,OU=Servers,DC=RHB,DC=AD",
<<"sQ*4~j5Ca9F_7G">>}},
<<"vortex">>,<<>>,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{set,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
{set,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]}}},
<0.16333.0>,
{state,fine,5000,undefined},
true,233,
{{0,nil},{0,nil}},
[],
[{<<"publisher_confirms">>,bool,true},
{<<"exchange_exchange_bindings">>,bool,true},
{<<"consumer_cancel_notify">>,bool,true},
{<<"basic.nack">>,bool,true}],
none}
** Reason for termination ==
** {{shutdown,
{gen_server,call,
[rabbit_auth_backend_ldap,
{check_resource,
[{username,<<"SJINRMQVORTEXWRITE">>},
{user_dn,
"CN=SJINRMQ VORTEXWRITE,OU=INRMQ,OU=INTEGRATION,OU=Environments,OU=Servers,DC=RHB,DC=AD"},
{vhost,<<"vortex">>},
{resource,exchange},
{name,
<<"RelayHealth.MessageExchange.Core.Configuration.Events:MessageSubscribersLoaded">>},
{permission,write}],
{user,<<"SJINRMQVORTEXWRITE">>,[],rabbit_auth_backend_ldap,
{impl,
"CN=SJINRMQ VORTEXWRITE,OU=INRMQ,OU=INTEGRATION,OU=Environments,OU=Servers,DC=RHB,DC=AD",
<<"sQ*4~j5Ca9F_7G">>}}},
infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,188}]},
{rabbit_access_control,check_access,5,[]},
{rabbit_channel,check_resource_access,3,[]},
{rabbit_channel,handle_method,3,[]},
{rabbit_channel,handle_cast,2,[]},
{gen_server2,handle_msg,2,[]},
{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}
=ERROR REPORT==== 31-Oct-2013::13:57:00 ===
** Generic server <0.7275.0> terminating
** Last message in was {'$gen_cast’,
. . . Omitting the rest for brevity as there is one of these for every exchange
1. We also issue the stop-app on Rabbit node 02 of the cluster
2. Then we issue rabbitmqctl start_app on Node 01, and rabbit starts. The F5 sees that the node is available, adds it to the pool of available nodes and begins routing traffic to it, and the Windows services are able to complete shut down.
I’ve verified that the application code is shutting down connections cleanly. We also don’t see this behavior with the same application in environments where no rabbit cluster is setup, just a single instance.
This appears to be something that happens to Rabbit itself – the Erlang portion of things seems to be working OK.
I’m attempting to replicate this behavior in other environments so I can try to isolate it but haven’t had a chance to do so thus far, but it does happen in every environment where we have a rabbit cluster.
Any ideas of what I might be missing?
Cheers,
Ron
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20131101/1c070404/attachment.htm>
More information about the rabbitmq-discuss
mailing list