[rabbitmq-discuss] flow control issues

romary.kremer at gmail.com
Wed Sep 15 14:45:50 BST 2010


Hi,

Since I have not yet been able to pin down the cause of our problems, I've decided
to perform some more basic tests.

Here are the latest results I've gathered by running some tests with the
MultiCastMain program from the Java client 2.0.0.

First of all, let me introduce the modifications I made to
MultiCastMain to allow more flexibility in its configuration (a sketch of the
corresponding option handling follows the list):
	- a new optional parameter "-D" to turn durability of the
exchange and queue ON (absent means OFF)
	- a new optional parameter "-A" to turn auto-delete of the
queue ON (absent means OFF)
	- a new optional parameter "-X" to turn exclusiveness of the
queue ON (absent means OFF)
	- a new optional parameter "-Q" to give the queue a name:
useful for fetching messages from the same queue across several runs
	- a new optional parameter "-K" to use the same routing
key (for the producer) and binding key (for the consumer) across several
runs
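
The sketch below shows how those flags could be wired in, assuming the
commons-cli style of option parsing that MultiCastMain already uses; the class
and variable names are mine (illustrative), not the actual code:

	import org.apache.commons.cli.*;

	// Sketch of the extra options (illustrative class, not the real MultiCastMain)
	public class ExtraOptionsSketch {
	    public static void main(String[] args) throws ParseException {
	        Options options = new Options();
	        options.addOption(new Option("D", false, "durable exchange and queue"));
	        options.addOption(new Option("A", false, "auto-delete queue"));
	        options.addOption(new Option("X", false, "exclusive queue"));
	        options.addOption(new Option("Q", true,  "queue name (reused across runs)"));
	        options.addOption(new Option("K", true,  "routing/binding key shared by producer and consumer"));

	        CommandLine cmd = new GnuParser().parse(options, args);
	        boolean durable    = cmd.hasOption('D');   // missing flag means OFF
	        boolean autoDelete = cmd.hasOption('A');
	        boolean exclusive  = cmd.hasOption('X');
	        String  queueName  = cmd.getOptionValue('Q', "");
	        String  routingKey = cmd.getOptionValue('K', null);
	        System.out.println(durable + " " + autoDelete + " " + exclusive
	                + " " + queueName + " " + routingKey);
	    }
	}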

Here is a snippet of /etc/rabbitmq/rabbitmq.config:
	[
	  {rabbit, [
	    {vm_memory_high_watermark, 0.05},
	    {ssl_listeners, [{"0.0.0.0", 5671}]},
	    {ssl_options, [{cacertfile, "/var/lib/rabbitmq/ssl/certificate-authority/schneider-ca.pem"},
	                   {certfile,   "/var/lib/rabbitmq/ssl/broker-certs/broker-cert.pem"},
	                   {keyfile,    "/var/lib/rabbitmq/ssl/broker-certs/broker-key.pem"},
	                   {verify,     verify_none},
	                   {fail_if_no_peer_cert, false}]}
	  ]}
	].

To run into the memory limit faster, I've decreased the memory
watermark threshold to 0.05 (5%).
This corresponds to roughly 50 MB on the host we are using for this test
(the host has about 1 GB of RAM, and 0.05 of 1 GB is about 50 MB), as shown
in rabbit.log when the broker starts up:

	=INFO REPORT==== 15-Sep-2010::12:36:15 ===
	Limiting to approx 65435 file handles (58889 sockets)

	=INFO REPORT==== 15-Sep-2010::12:36:15 ===
	Memory limit set to 50MB.

Finally, here is the broker status (rabbitmqctl status):

	[{running_applications,[{rabbit_status,"RabbitMQ Status Page","0.01"},
	                        {rabbit,"RabbitMQ","2.0.0"},
	                        {ssl,"Erlang/OTP SSL application","3.10.7"},
	                        {public_key,"Public key infrastructure","0.4"},
	                        {mnesia,"MNESIA  CXC 138 12","4.4.12"},
	                        {os_mon,"CPO  CXC 138 46","2.2.4"},
	                        {sasl,"SASL  CXC 138 11","2.1.8"},
	                        {rabbit_mochiweb,"RabbitMQ Mochiweb Embedding","0.01"},
	                        {mochiweb,"MochiMedia Web Server","1.3"},
	                        {crypto,"CRYPTO version 1","1.6.3"},
	                        {inets,"INETS  CXC 138 49","5.2"},
	                        {stdlib,"ERTS  CXC 138 10","1.16.4"},
	                        {kernel,"ERTS  CXC 138 10","2.13.4"}]},
	 {nodes,[{disc,['rabbit at mystic-buntu']}]},
	 {running_nodes,['rabbit at mystic-buntu']}]

Thanks to the modifications to MultiCastMain, I've run a two-phase
producer / consumer test, as described below:

	phase 0 : initialization of the exchange and queue for the test:
		MultiCastMain -h <host> -p 5672 -s 1000 -n 1 -x 0 -y 1 -e testX -Q testQ -K testK -D -z 10
		This launches a single consumer creating the exchange testX and
queue testQ with durability ON.
		The queue is bound to the exchange using the key testK.
		The test is told to end after 10 seconds.
		
	phase 1 - launch a producer alone and watch memory occupation,
connection status and queue depth (via the rabbit_status plugin, for
instance):
		MultiCastMain -h <host> -p 5672 -s 1000 -n 1 -x 1 -y 0 -e testX -Q testQ -K testK -D

		This runs a producer that publishes 1 kB messages to the testX
exchange with routing key testK.
		These messages are queued in testQ.
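
(In Java client terms, phases 0 and 1 boil down to roughly the following.
This is only a sketch against the 2.0.0 client API, not the actual
MultiCastMain code; the class name and host are illustrative, and the
exchange type is assumed to be "direct".)

	import com.rabbitmq.client.*;

	// Rough equivalent of the phase 0 declarations plus the phase 1 publish loop.
	public class ProducerSketch {
	    public static void main(String[] args) throws Exception {
	        ConnectionFactory factory = new ConnectionFactory();
	        factory.setHost("mystic-buntu");   // illustrative host
	        factory.setPort(5672);
	        Connection conn = factory.newConnection();
	        Channel channel = conn.createChannel();

	        // -D: durable exchange and queue, bound with the shared key (-K testK)
	        channel.exchangeDeclare("testX", "direct", true);
	        channel.queueDeclare("testQ", true, false, false, null);
	        channel.queueBind("testQ", "testX", "testK");

	        byte[] payload = new byte[1000];   // -s 1000: 1 kB body
	        while (true) {
	            // Once the broker hits its memory watermark it stops reading from
	            // this connection's socket, so this call eventually blocks in the
	            // TCP write and no further sending-rate statistics get printed.
	            channel.basicPublish("testX", "testK", null, payload);
	        }
	    }
	}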

The producer is blocked after running for only a few seconds, once the memory
threshold is reached. The queue depth indicated 12,011 messages.
(You can scroll to the end of this mail for an extract of the broker
log illustrating the alarm_handler notifications.)

What do I mean by "the producer is blocked":
	- the rabbit_status plugin shows the connection in state "blocked"
	- the rabbit_status plugin shows that the depth of queue testQ remains
constant
	- the trace from MultiCastMain stops displaying new sending-rate
statistics after a while
	- the last lines in the broker log are:
		=INFO REPORT==== 15-Sep-2010::14:53:44 ===
		vm_memory_high_watermark set. Memory used:53650536 allowed:52556595

		=INFO REPORT==== 15-Sep-2010::14:53:44 ===
     		alarm_handler: {set,{vm_memory_high_watermark,[]}}
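
(As an aside on how that block is, or rather is not, signalled to the client:
below is a minimal sketch of registering the Java client's FlowListener on a
publishing channel, the callback I mentioned in my earlier mails. Since the
2.0.0 broker throttles by simply not reading from the socket rather than
sending channel.flow, I would not expect this callback to fire; the sketch
just makes that check explicit. Class name and host are illustrative.)

	import java.io.IOException;
	import com.rabbitmq.client.*;

	public class FlowWatchSketch {
	    public static void main(String[] args) throws Exception {
	        ConnectionFactory factory = new ConnectionFactory();
	        factory.setHost("mystic-buntu");   // illustrative host
	        Connection conn = factory.newConnection();
	        Channel channel = conn.createChannel();

	        // Log any channel.flow frame sent by the broker. With the 2.0.0
	        // TCP back-pressure scheme this stays silent even while the
	        // connection is reported as "blocked".
	        channel.setFlowListener(new FlowListener() {
	            public void handleFlow(boolean active) throws IOException {
	                System.out.println("channel.flow received, active=" + active);
	            }
	        });

	        byte[] payload = new byte[1000];
	        while (true) {
	            channel.basicPublish("testX", "testK", null, payload);   // testX/testK as set up in phase 0
	        }
	    }
	}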

From that point, the producer is blocked with no possible recovery. Given how
flow control is designed in v2.0.0, I would have expected the producer to be
released once messages are swapped out to disk. My second assumption was that
a consumer is needed to release the pressure on the queue, so I decided to
keep the producer running and start phase 2 as follows:
	
	phase 2 - launch a consumer alone and watch memory occupation,
connection status and queue depth (via the rabbit_status plugin, for
instance):

		MultiCastMain -h <host> -p 5672 -s 1000 -n 1 -x 0 -y 1 -e testX -Q testQ -K testK -D
		This runs a consumer that gets messages out of the queue testQ.
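
(Again only as a sketch against the 2.0.0 Java client, the consumer side
amounts to roughly the following, with explicit acks so that no unacknowledged
messages pile up on the broker; class name and host are illustrative.)

	import com.rabbitmq.client.*;

	public class ConsumerSketch {
	    public static void main(String[] args) throws Exception {
	        ConnectionFactory factory = new ConnectionFactory();
	        factory.setHost("mystic-buntu");   // illustrative host
	        Connection conn = factory.newConnection();
	        Channel channel = conn.createChannel();

	        QueueingConsumer consumer = new QueueingConsumer(channel);
	        channel.basicConsume("testQ", false, consumer);   // autoAck = false: explicit acks

	        while (true) {
	            QueueingConsumer.Delivery delivery = consumer.nextDelivery();
	            // Ack each message so nothing stays unacknowledged on the broker.
	            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
	        }
	    }
	}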

The messages are removed from testQ quite fast too, but memory
occupation does not seem to be affected at all, and the producer is
still blocked.
From the rabbit_status plugin, the status of the connections remains:
	- "blocked" for the producer
	- "blocking" for the consumer

And memory usage is still over the threshold: memory (used/available) = 53 MB / 50 MB.

The only way I found to get memory back below the threshold is to kill
the producer's connection manually on the broker side.
Killing the producer process on the client side does not do the trick:

	If I kill the producer process, the connection seems to disappear,
but memory remains over the threshold.
	The rabbit_status plugin seems to fail after this, because I get a
strange error report in the broker log (see the end of this long mail!).

By closing the connection on the broker side, memory has (although not
always) fallen back below the threshold, to normal (32 MB / 50 MB).
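
(For what it's worth, killing the JVM process is not the same as a clean AMQP
close. A clean close from the client side would be just the call below; I have
not verified whether it makes any difference to the memory behaviour, so this
is only a sketch, with conn standing for the producer's Connection.)

	// Clean client-side shutdown, as opposed to killing the JVM process.
	// conn is the producer's com.rabbitmq.client.Connection.
	conn.close(5000);   // wait up to 5 seconds for the close handshake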

I think this behaviour is not expected and may be due to a bug
somewhere, since it is fully reproducible on my configuration.
The fact that memory occupation never falls below the threshold after
messages are removed from the queue is particularly strange and
unwelcome from my point of view. I think this simple test may point to
an issue that would explain the problems I mentioned in my previous
messages, but I am not sure about it.

I am sorry for this quite long message, but I thought that the more  
details you get, the better.

best regards,

Romary.
	
The following is an extract from the broker log during phase 1

	=INFO REPORT==== 15-Sep-2010::14:53:24 ===
	accepted TCP connection on 0.0.0.0:5672 from 192.168.0.71:1845

	=INFO REPORT==== 15-Sep-2010::14:53:24 ===
	starting TCP connection <0.16120.0> from 192.168.0.71:1845

	=INFO REPORT==== 15-Sep-2010::14:53:42 ===
	vm_memory_high_watermark set. Memory used:52632808 allowed:52556595

	=INFO REPORT==== 15-Sep-2010::14:53:42 ===
    	alarm_handler: {set,{vm_memory_high_watermark,[]}}

	=INFO REPORT==== 15-Sep-2010::14:53:43 ===
	vm_memory_high_watermark clear. Memory used:52355320 allowed:52556595

	=INFO REPORT==== 15-Sep-2010::14:53:43 ===
  	alarm_handler: {clear,vm_memory_high_watermark}

	=INFO REPORT==== 15-Sep-2010::14:53:44 ===
	vm_memory_high_watermark set. Memory used:53650536 allowed:52556595

	=INFO REPORT==== 15-Sep-2010::14:53:44 ===
     	alarm_handler: {set,{vm_memory_high_watermark,[]}}

The following is an extract from the broker log during phase 2
	
	=INFO REPORT==== 15-Sep-2010::15:21:38 ===
	accepted TCP connection on 0.0.0.0:5672 from 192.168.0.71:1880

	=INFO REPORT==== 15-Sep-2010::15:21:38 ===
	starting TCP connection <0.2101.1> from 192.168.0.71:1880

The following is the error report after shutting down the producer  
process (client side)

=ERROR REPORT==== 15-Sep-2010::15:37:53 ===
** Generic server rabbit_status_web terminating
** Last message in was get_context
** When Server state == {state,1284557846311,"Wed, 15 Sep 2010 11:37:26 GMT",
                                [<<"0.0.0.0:5672 ">>],
                                [[{pid,<<"<0.16120.0>">>},
                                  {address,<<"192.168.0.58">>},
                                  {port,5672},
                                  {peer_address,<<"192.168.0.71">>},
                                  {peer_port,1845},
                                  {recv_oct,12696041},
                                  {recv_cnt,72079},
                                  {send_oct,357},
                                  {send_cnt,5},
                                  {send_pend,0},
                                  {state,<<"blocked">>},
                                  {channels,1},
                                  {user,<<"guest">>},
                                  {vhost,<<"/">>},
                                  {timeout,0},
                                  {frame_max,131072}]],
                                [[{vhost,<<"/">>},
                                  {name,<<"testQ">>},
                                  {durable,<<"true">>},
                                  {auto_delete,<<"false">>},
                                  {arguments,<<"[]">>},
                                  {pid,<<"<0.12150.0>">>},
                                  {messages_ready,0},
                                  {messages_unacknowledged,0},
                                  {messages_uncommitted,<<"undefined">>},
                                  {messages,0},
                                  {acks_uncommitted,<<"undefined">>},
                                  {consumers,0},
                                  {transactions,<<"undefined">>},
                                  {memory,6171332}]],
                                33,65535,55843456,52556595.0,161,1048576}
** Reason for termination ==
** {{badmatch,{error,enotconn}},{gen_server,call,[<0.16120.0>,info,infinity]}}

=ERROR REPORT==== 15-Sep-2010::15:37:53 ===
** Generic server <0.16124.0> terminating
** Last message in was {'EXIT',<0.16123.0>,{badmatch,{error,enotconn}}}
** When Server state == {ch,running,1,<0.16120.0>,<0.16122.0>,undefined,none,
                             {set,0,16,16,8,80,48,
                                  {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                                  {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                             1,
                             {[],[]},
                             {[],[]},
                             <<"guest">>,<<"/">>,<<>>,
                             {dict,0,16,16,8,80,48,
                                   {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                                   {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                             {dict,0,16,16,8,80,48,
                                   {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                                   {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                             <0.16121.0>,
                             {state,none,undefined}}
** Reason for termination ==
** {badmatch,{error,enotconn}}




On 10 Sep 2010, at 15:06, Marek Majkowski wrote:

> Romary,
>
> First, thanks a lot for the feedback. For example, the information about
> SSL memory usage is indeed very interesting. (If that is a big problem for
> you, you can always fall back to the old technique of using stunnel.)
>
> The flow control was heavily modified between 1.8.1 and 2.0.0. In summary:
> - 1.8.1 - we sent a Channel.flow AMQP message to everyone once
>   rabbit reached the memory limit
> - 2.0.0 - once we reach the memory limit, the connections from which we
>   hear publishes are stopped temporarily. We stop receiving bytes from
>   the tcp sockets. That 'stop' shouldn't take too long, as data should be
>   swapped out to disk and memory pressure will drop pretty quickly.
>
>
> On Wed, Sep 8, 2010 at 11:49, Romary Kremer  
> <romary.kremer at gmail.com> wrote:
>> I've started playing a bit with the latest release 2.0.0 and I'm afraid
>> it looks like there are some regressions, or at least some semantic
>> updates.
>
> It's best if you upgrade both server and client library. Do you have any
> particular problems? A lot was changed in 2.0.0 but we think it's fully
> functional. If you found something that blocks you from migrating, you
> could report a bug.
>
>> It looks like no listener is called back when the alarm handler is set or
>> cleared, while the producers are still paused / resumed
>> as they are supposed to be.
>
> Interesting. Maybe we have a race there? Or maybe you're blocking
> the main java client thread? (nothing blocking should be done from
> the main thread)
>
>>> during long running tests, we have encountered strange behaviour due to
>>> flow control:
>>>
>>> The queue depth starts to increase linearly for about 2 hours; this is
>>> coherent, since the message throughput of the single consumer
>>> is not enough to absorb the message ingress. Memory occupation grows
>>> faster as well, until the memory watermark is reached on the broker side.
>
> Are you sure your consumer is ACK-ing the messages it received?
>
>>> From that point, the producers are indeed paused, as a flow control
>>> request has been issued by the broker, but the consumer seems to be blocked
>>> as well. The queue level is flattened at its top value until the end of
>>> the test, even when memory occupation drops below the threshold.
>
> That's how 1.8.1 behaves. In 2.0.0 we introduced swapping out big  
> queues
> to disk, so the memory usage shouldn't be dependent on a queue size.
>
>>> By registering the FlowListener callback, we have noticed that not all of
>>> the producers are notified every time the alarm handler is set.
>>> Does this mean that the broker applies some heuristic to try not to block
>>> everybody every time?
>>> Or does it mean that some of the channels have been somehow blacklisted by
>>> the broker?
>
> No, in 1.8.1 broker should send 'channel.flow' to all the channels.
>
>>> Could anybody explain how the blocking of consumer is assumed to be
>>> implemented ?
>
> The best description is probably here:
>  http://www.rabbitmq.com/extensions.html#memsup
>
> But it covers 2.0.0. I'd suggest an upgrade to 2.0.0 and monitoring
> not only queue size but also number of unacknowledged messages
> ('Msg unack' in status plugin). This number should be near zero.
>
> Cheers,
>  Marek Majkowski
