[rabbitmq-discuss] Orphaned channels after connection close in Rabbit 3.1.4

Paul Bowsher paul.bowsher at gmail.com
Wed Aug 14 12:09:23 BST 2013


As an additional piece of information, the cluster consists of two nodes, 
with the queue in question (and indeed all queues) mirrored, with their 
home on node A. All the stuck connections are homed on node A. Consumers 
and publishers connect to either node using a load balancer.
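
For reference, our workers consume roughly along these lines (the load
balancer hostname is a placeholder; 'temp_queue' and the prefetch of 20 are
taken from the details elsewhere in this thread):

    require "bunny"

    # Placeholder hostname for the load balancer in front of nodes A and B.
    conn = Bunny.new(host: "rabbit-lb.example.local", vhost: "/")
    conn.start

    ch = conn.create_channel
    ch.prefetch(20)   # the prefetch count; matches the 20 unacked messages per orphan

    q = ch.queue("temp_queue", durable: true)

    q.subscribe(ack: true, block: true) do |delivery_info, _properties, payload|
      # ... process payload ...
      ch.ack(delivery_info.delivery_tag)  # manual acknowledgement
    end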

We've just forced our temporary queue (which also started exhibiting the 
same problem) to have node B as its home, and now any workers that attempt 
to consume via node A receive an error:

> Bunny::NotFound: NOT_FOUND - home node 'rabbit@nodeA' of durable queue 
> 'temp_queue' in vhost '/' is down or inaccessible
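
As a stopgap we're pointing workers straight at node B (the queue's new
home) rather than going through the load balancer; a rough sketch, with the
hostname as a placeholder:

    require "bunny"

    begin
      # Connect directly to node B, where 'temp_queue' is now homed.
      conn = Bunny.new(host: "rabbit-node-b.example.local", vhost: "/")
      conn.start

      ch = conn.create_channel
      ch.queue("temp_queue", durable: true)
    rescue Bunny::NotFound => e
      # Raised when the queue's home node is down or unreachable from the
      # node we connected to, as in the error above.
      warn "Queue unavailable: #{e.message}"
      raise
    end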


Cluster status seems to be OK. Now that we only have consumers on node B, 
we're going to leave it for a while and see whether we get any more stuck 
connections. We may also try removing node A from the cluster and rejoining 
it, to see if that remedies things.


On Wednesday, 14 August 2013 10:41:13 UTC+1, Paul Bowsher wrote:
>
> Hi,
>
> After the upgrade to RabbitMQ 3.1.4 we're seeing a large, linearly 
> increasing number of channels which seem to hang around after their 
> connection is closed. This doesn't happen on every queue (even those 
> processed by the same software), only on one particular queue. Each 
> orphaned channel leaves 20 messages (the prefetch count) unacked.
>
> Symptoms:
>
> - Initially, a larger-than-expected consumer count on the queue, as 
> reported by our monitoring
> - Stopping all expected consumers on that queue removes the expected 
> number of consumers, leaving the orphans (700+ at present)
> - Each orphaned consumer's channel is still visible in the management UI
> - The connection for each such channel is also visible, in either a "flow" 
> or "blocked" state with zero data flow. The timeout is set to 600s, but the 
> count doesn't decrease after 10 minutes
> - Forcing a stuck connection closed through the management interface 
> results in a 500 (a scripted equivalent is sketched after this list):
>
>> The server encountered an error while processing this request:
>> {exit,{normal,{gen_server,call,
>>                           [<0.16806.1347>,
>>                            {shutdown,"Closed via management plugin"},
>>                            infinity]}},
>>       [{gen_server,call,3,[{file,"gen_server.erl"},{line,188}]},
>>        {rabbit_mgmt_wm_connection,delete_resource,2,[]},
>>        {webmachine_resource,resource_call,3,[]},
>>        {webmachine_resource,do,3,[]},
>>        {webmachine_decision_core,resource_call,1,[]},
>>        {webmachine_decision_core,decision,1,[]},
>>        {webmachine_decision_core,handle_request,2,[]},
>>        {rabbit_webmachine,'-makeloop/1-fun-0-',2,[]}]}
>
>
> - The connection closure does actually succeed, and the count drops. 
> - Forcing a normal connection closed still works correctly.
> - Doing a netstat on both the client and server shows that the port 
> associated with the connection is indeed completely closed, and not in any 
> sort of TIME_WAIT or FIN_WAIT state
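>
> For reference, the same forced close can be scripted against the 
> management HTTP API; a rough sketch (host, credentials and the connection 
> name are placeholders). It hits the same endpoint the UI uses, so it may 
> return the same 500 even though the close succeeds:
>
>     require "net/http"
>     require "uri"
>     require "erb"
>
>     # Connection name exactly as shown in the management UI, e.g.
>     # "10.0.0.5:48231 -> 10.0.0.1:5672" (placeholder).
>     conn_name = "10.0.0.5:48231 -> 10.0.0.1:5672"
>
>     uri = URI("http://nodeA:15672/api/connections/" +
>               ERB::Util.url_encode(conn_name))
>
>     req = Net::HTTP::Delete.new(uri.request_uri)
>     req.basic_auth("guest", "guest")             # placeholder credentials
>     req["X-Reason"] = "Closing stuck connection" # optional reason header
>
>     res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
>     puts "#{conn_name}: HTTP #{res.code}"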
>
> This issue first occurred over the weekend, and as it went unchecked it 
> eventually exhausted the server's socket descriptors, so we had to 
> restart RabbitMQ to recover. It has happened again overnight. We can't see 
> anything in our client or server logs that gives any indication of the 
> cause. We upgraded from 3.1.0, so unfortunately we can't tell which 
> version introduced the bug, if indeed it is a bug.
>
> If anyone has seen this behaviour or has any suggestions, please let us 
> know. Also, if we can provide any debug data let us know.
>
> Thanks,
>
> Paul Bowsher
>