[rabbitmq-discuss] queues going down

Tue Dec 14 19:24:58 GMT 2010

Hi, Peter and Steph...

On Dec 14, 2010, at 7:04 AM, C-Peter Ritchie wrote:

> Hi Jerry.  In 64-bit windows with a 32-bit process you're more likely to get a full 2GB memory; but, if IMAGE_FILE_LARGE_ADDRESS_AWARE is set (see http://www.gidhome.com/gid3gb/index.html) then a 32-bit process can get 4GB of RAM.  64-bit Windows doesn't support the /3GB option to give processes more than 2GB of virtual memory.

Good to confirm.

>  As far as I can tell (see http://support.microsoft.com/kb/294418) the application has to tell Windows it's ok addressing a full 4GB (see above link).

Right.... IIRC this is stamped in a field in the PE file, and I think it's still possible for an executable so-stamped to contain code that might do the wrong thing with 32-bit addresses whose high bit is set, so just modifying the headers of one's binaries might produce surprising results.
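
For anyone who wants to see whether a given binary carries that stamp, here is a minimal sketch in Python.  It assumes the standard PE/COFF layout (the 4-byte e_lfanew field at offset 0x3C, then the "PE\0\0" signature followed by a 20-byte COFF header whose final field is Characteristics) and the 0x0020 value of IMAGE_FILE_LARGE_ADDRESS_AWARE.  The function names are made up for illustration, and the helper builds a synthetic header rather than a real executable.

```python
import struct

# Characteristics bit defined by the PE/COFF format for large-address-aware images.
IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020


def is_large_address_aware(image: bytes) -> bool:
    """Return True if the PE image's COFF header has the large-address-aware bit set."""
    if len(image) < 0x40 or image[:2] != b"MZ":
        raise ValueError("not a DOS/PE image")
    # e_lfanew at offset 0x3C holds the file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # Characteristics is the last 2-byte field of the 20-byte COFF header,
    # which starts right after the 4-byte signature.
    (characteristics,) = struct.unpack_from("<H", image, pe_offset + 4 + 18)
    return bool(characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)


def fake_image(characteristics: int) -> bytes:
    """Build a tiny synthetic header for demonstration only (hypothetical helper)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature follows the DOS stub
    # Machine, NumberOfSections, TimeDateStamp, PointerToSymbolTable,
    # NumberOfSymbols, SizeOfOptionalHeader, Characteristics
    coff = struct.pack("<HHIIIHH", 0x014C, 0, 0, 0, 0, 0, characteristics)
    return bytes(dos) + b"PE\0\0" + coff
```

Against a real file one would pass in the bytes read from the executable in binary mode.  Note that this only reports the header bit; it says nothing about whether the code inside actually copes with addresses above 2GB.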

> The messages we're sending are about 1400 bytes (also looking at shrinking that) and we're set to persistent.  So, my concern is the effect these are having on memory and the resulting crash.  I understand Rabbit/Erlang need to do some special tricks to get any sort of performance with thousands/millions of messages; but, if that results in the queue crashing under load, this has me concerned.

Ideally one wants enough working memory for Rabbit that the amount of data queued in memory isn't going to be such that Rabbit is compelled to start swapping messages to disk.  Obviously if there's enough of an imbalance between producers and consumers, sustained for a long enough period, then no amount of RAM will keep one out of the woods forever.  Similarly disks are finite in size, disk writes take time, etc.

> I also understand Rabbit is a bit of a mash-up of Erlang and Rabbit and a bit at the mercy of varying versions of each; so, maybe this is just a matter of getting a couple of versions of things that work best...

The Rabbit to Erlang relationship is analogous to the relationship of "some version of some Java application" to a JVM/JRE release:  not particularly mashed up, but one can't have total freedom to change arbitrarily without the other staying in touch with it.  Our Windows Rabbit "complete bundle" distributions include their own Erlang to save users a step and the potential headache of running an Erlang woefully mismatched for their Rabbit release.

Is it a complete bundle distribution that you're working with, or did you put things together from separate components?

> We have a dump file that I "created" yesterday.  If you have an ftp server or something I can send it over.  It's about 1.47 MB; so, I doubt it's going to get through email very easily.

It might be very handy for us to get ahold of that.  If an Erlang runtime exits abnormally, the erl_crash.dump file should contain a lot of forensics on it.  I suspect I'll be OK receiving it on this account; failing that I can give you a gmail account that should be able to accommodate it.  So just to confirm, does it seem that you are seeing a bona fide crash, i.e. one where the Erlang VM running Rabbit is ceasing to exist, or just a Rabbit that becomes unresponsive to the outside world?
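
As a quick first look before sending the whole file, the dump's own stated reason can be pulled out with something like the sketch below.  It assumes the usual erl_crash.dump layout, in which a line beginning with "Slogan:" near the top gives the reason the VM wrote the dump.  The function name is hypothetical.

```python
from typing import Optional


def crash_slogan(dump_text: str) -> Optional[str]:
    """Return the text after 'Slogan:' in an erl_crash.dump, or None if absent."""
    for line in dump_text.splitlines():
        if line.startswith("Slogan:"):
            return line[len("Slogan:"):].strip()
    return None
```

If this returns None, the file may not be a genuine VM crash dump at all, which would itself be a useful clue in telling a dead broker from an unresponsive one.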

> This scenario is absent of any consumer; so, yes, more aggressive consumption could likely cause some benefit.  My concern is that in a distributed system, where any one of our components could go offline for any number of reasons, we could get in this situation quite easily.  It seems that data loss is quite possible due to this.

Alas, this will be something to guard against in any event, or at least monitor for.  The memory-based flow control system in Rabbit exists to let Rabbit keep running even when its available memory is overwhelmed by producer-side load, by persisting messages to disk, and then blocking connected producers until it can catch up, if it can't push stuff to disk quickly enough to match their pace.  In any distributed architecture one has to be able to monitor the health of one's subsystems and react accordingly.  Even with the most over-engineered subsystems something has to give at some point.
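
To make the threshold concrete, here is a toy model in Python of the alarm arithmetic, using the figures from the quoted message below (a 0.8 watermark applied to roughly 1GB of detected memory).  This is only a sketch of the idea and not Rabbit's actual implementation; the function names are invented for illustration.

```python
def alarm_threshold_mb(detected_mb: float, watermark: float) -> int:
    """Memory level (in whole MB) at which the alarm would trip."""
    if not 0.0 < watermark <= 1.0:
        raise ValueError("watermark must be a fraction in (0, 1]")
    return int(detected_mb * watermark)


def producers_blocked(used_mb: float, detected_mb: float, watermark: float) -> bool:
    """True while usage is at or above the threshold, i.e. producers would be paused."""
    return used_mb >= alarm_threshold_mb(detected_mb, watermark)


# With 1024MB detected and a 0.8 watermark the threshold matches the 819MB in the log.
threshold = alarm_threshold_mb(1024, 0.8)
```

The point of the model is that the threshold is a trigger for throttling producers rather than a hard cap on what the broker may use, so a monitoring check on the producer side should treat "blocked" as a normal, recoverable state.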

Jerry

> -----Original Message-----
> From: Jerry Kuch [mailto:jerryk at vmware.com]
> Sent: Monday, December 13, 2010 8:54 PM
> To: C-Peter Ritchie
> Cc: Jim Apperly; Gregory Chorebanian; Steph Swierenga; rabbitmq info; John DeTreville
> Subject: Re: queues going down
>
> Hi, Peter...
>
> My apologies...  I hit send prematurely on an earlier message I wasn't quite finished with yet...  Here's the rest of what I was going to say about your situation.
>
> The vm_memory_high_watermark setting of 0.8 in your rabbitmq.config file expresses a fraction of the amount of memory Rabbit detects in your system to use as its alarm level.  In the 'INFO REPORT' log entries at startup we see a limit of 819MB, which gives:
>
> 819MB/0.8 = 1023MB
>
> So Erlang/Rabbit appears to think your machine has 1GB of RAM for its use.  You can tweak the fraction as you want to goose the limit up to a level you'd like, but there are a couple of things you'll want to keep in mind:
>
> 1. If Erlang is running as a 32-bit process under Windows, it's probably only going to be given 2GB of user address space at best, regardless of how much physical RAM is on the box.  Off the top of my head I don't know the rules 64-bit Windows imposes on 32-bit processes, but I know that natively 32-bit versions of Windows by default split the 32-bit/4GB address space half and half between user and system space, with the option to set it instead to 3GB of user space and 1 GB system space at boot time.  That latter configuration used to be used by folks running things like memory hungry database servers.  Perhaps 64-bit versions of Windows provide some latitude on tweaking the split on a process by process basis?
>
> 2. You shouldn't push the memory limit value that Rabbit announces at startup past half or so of the available RAM on your system.  Ideally you would like to leave some breathing room for the system to experience spikes in memory usage between its garbage collections.  You also don't want it to compete too aggressively with any file caching your OS might be doing, since you could lose throughput as reads that would otherwise be cached are forced to go to disk.
>
> Bottom line, my current working hypothesis on your configuration is that your 32-bit Erlang/Rabbit process is getting at most 2GB of user address space given to it by Windows, Erlang is deciding to only grab half of that for itself, and your vm_memory_high_watermark setting is setting the alarm threshold at 80% of that, to give you the 819MB you see.
>
> Recall that this isn't a limit on how much memory Rabbit will try to use, but rather the threshold at which its memory based flow control will kick in.  With flow control in force the server will pause reads from the sockets of clients, inhibiting their ability to send contentful messages, suspending connection heartbeats, etc.  Normal server behavior resumes when the memory pressure abates with the intent being that when flow control is in force, producers are rate limited while consumers are allowed to operate normally, thereby hopefully allowing the pressure on the system to remit naturally.
>
> Is it possible that your testing scenario could benefit from having more, or more aggressive, consumers to better balance the pressure that vigorous producers are putting on the system?  And that a production system based on it could make sure to keep the consumption side of the equation sufficiently balanced?  I don't recall that we ever established for sure that your broker process was actually dying---in the absence of an erl_crash.dump file or log entries indicating crashes or imminent crashes, is it possible that it's just becoming unresponsive to producers due to flow control?
>
> Best regards,
> Jerry
>
> On Dec 13, 2010, at 10:14 AM, C-Peter Ritchie wrote:
>
>> Hi Jerry.
>>
>> I've been trying to bump up the memory limit but I can't seem to get it to change.  Based on the docs, the config file should be located at %APPDATA%\RabbitMQ\rabbitmq.config.  There was neither a RabbitMQ directory there, nor a RabbitMQ\rabbitmq.config file; so, I created them and restarted the service (with a line: [{rabbit, [{vm_memory_high_watermark, 0.8}]}] ).  The log still contained "Memory limit set to 819MB." about the time the service restarted.  The %APPDATA% value happened to point to AppData\Roaming, and I noticed that there was a RabbitMQ directory in AppData\Local.  I also tried adding a rabbitmq.config in AppData\Local\RabbitMQ; but that also did not change the memory limit.
>>
>> Is there a way to find out what RabbitMQ is configured for in terms of the location of the config file?
>>
>> Cheers -- Peter
>>
>> -----Original Message-----
>> From: Jerry Kuch [mailto:jerryk at vmware.com]
>> Sent: Friday, December 10, 2010 4:22 PM
>> To: C-Peter Ritchie
>> Cc: Jim Apperly; Gregory Chorebanian; Steph Swierenga; rabbitmq info;
>> John DeTreville
>> Subject: Re: queues going down
>>
>> Hi, Peter...
>>
>> Thanks for investigating.  That memory limit sounds like it might be a bit on the low side.  You might try bumping it up and seeing if your problem doesn't reproduce.  Does the broker process actually die, or does it just seem to stop reading the content thrown at it by your clients?
>>
>> A broker under memory pressure has a memory alarm mechanism that will stop it from reading from the sockets of connected clients until the alarm condition is resolved, either by delivery of the stored messages or by the shunting of them to disk.
>>
>> There's documentation on memory-based flow control here:
>>
>> http://www.rabbitmq.com/extensions.html
>>
>> You may find it profitable to check your current settings and try some alternatives, although by all means, the standalone repro-example would be great for us to look at...
>>
>> BTW the page above also has a note about older versions of Erlang (pre-R13B) having trouble determining how much memory they're working with on Windows...  I'm not sure if those affect your setup or not.
>>
>> Thanks,
>> Jerry
>>
>> On Dec 10, 2010, at 1:07 PM, C-Peter Ritchie wrote:
>>
>>> Hi Jim.  The computer we're running Rabbit on has 8GB of RAM.
>>>
>>> This is the last entry from the log:
>>> =INFO REPORT==== 10-Dec-2010::12:36:30 === Limiting to approx 412 file handles (368 sockets)
>>>
>>> =INFO REPORT==== 10-Dec-2010::12:36:30 === Memory limit set to 819MB.
>>>
>>> There is no CRASH REPORT text in the log file
>>>
>>> The sasl log seems to always be empty.  Attached are the log files in the log directory immediately post-crash.
>>>
>>> I'll work on getting a smaller example that reproduces the problem.  Right now we're reading data from a database and pumping messages into the queue.  The database is on a different computer.  I'll see if I can come up with an example that  is stand-alone.
>>>
>>> Cheers -- Peter
>>>
>>> -----Original Message-----
>>> From: Jerry Kuch [mailto:jerryk at vmware.com]
>>> Sent: Friday, December 10, 2010 3:12 PM
>>> To: Jim Apperly
>>> Cc: Gregory Chorebanian; Steph Swierenga; C-Peter Ritchie; rabbitmq
>>> info; John DeTreville
>>> Subject: Re: queues going down
>>>
>>> Hi, Steph and Peter...
>>>
>>> To get started hunting down the source of your problem, may I send some questions your way?  In particular:
>>>
>>> 1. How much memory is installed on the machine you're using as your Rabbit server?
>>>
>>> 2. What memory limit does Rabbit report in its logs when it starts up?  To find it check your main rabbit log for lines looking something like the following:
>>>
>>> =INFO REPORT==== 10-Dec-2010::12:04:47 === Memory limit set to 8365MB.
>>>
>>> The file handle limit, which is reported in similar format at startup might also be interesting to see.
>>>
>>> 3. If you scan through the main Rabbit log looking for ERROR REPORT or CRASH REPORT, do you find anything suspicious?
>>>
>>> 4. Are you able to send us the entirety of your main rabbit.log and rabbit-sasl.log?  (To figure out where those are landing, look at the broker's startup messages for lines that look something like:
>>>
>>> log            : /var/folders/HI/HITiCI9qFWSs0nVW3pzyCU+++TI/-Tmp-//rabbit at StrongMad.log
>>> sasl log       : /var/folders/HI/HITiCI9qFWSs0nVW3pzyCU+++TI/-Tmp-//rabbit at StrongMad-sasl.log
>>>
>>> 5. Can you distill the client code that's driving your Rabbit to failure into a small standalone program that you could pass our way so that we could attempt to locally reproduce the failure?  I could build a Windows box in a virtual machine to chase this down further if so.
>>>
>>> 6. When the crash occurs, has the rabbit server OS process stopped and ceased to exist?  Or does it remain, but in an unusable state?
>>>
>>> Thanks for any additional information you can provide...
>>>
>>> Best regards,
>>> Jerry
>>> RabbitMQ Team
>>>
>>>
>>> On Dec 10, 2010, at 11:58 AM, Jim Apperly wrote:
>>>
>>> Steph, Greg,
>>>
>>> On 10 December 2010 19:55, Gregory Chorebanian <gchorebanian at vmware.com> wrote:
>>> It was my error to assume you were already going down the road of Rabbit.  Jim - I think R&D is looking into the issue, correct?  Can you make a guess as to when we will have an answer?
>>>
>>> It's Friday evening here at Rabbit HQ (London) and most of us Rabbits have gone home.  However, I'm cc'ing John and Jerry from our Pacific Rim team who have got a lot more of Friday remaining.
>>>
>>> Guys - over to you.
>>>
>>> Jim
>>>
>>>
>>>
>>> From: Steph Swierenga [mailto:SSwierenga at cdic.ca]
>>> Sent: Friday, December 10, 2010 2:51 PM
>>> To: Gregory Chorebanian
>>> Cc: Jim Apperly; C-Peter Ritchie
>>>
>>> Subject: RE: queues going down
>>>
>>>
>>> Hey, so we're not officially in development, or production yet. We're in the process of evaluating RabbitMQ vs MSMQ as our messaging solution going forward. The problem we've encountered below would be a show-stopper for us. Is there any way we can work out a per-incident fee for this?
>>>
>>> Thanks,
>>> Steph.
>>>
>>> From: Gregory Chorebanian [mailto:gchorebanian at vmware.com]
>>> Sent: December-10-10 2:43 PM
>>> To: C-Peter Ritchie
>>> Cc: Steph Swierenga; Jim Apperly
>>> Subject: RE: queues going down
>>>
>>> No worries...
>>>
>>> We have three options really..
>>>
>>> If you are not yet in production - we have Developer support @ 2k per person.
>>>
>>> Production - is what I described below @ $3k per cpu per year.
>>>
>>> Or we can do a consulting engagement - generally these carry a 4 day minimum @ 3k per day.
>>>
>>> We can also combine the support and consulting together.
>>>
>>> I am on my cell @ 978-973-4688 feel free to give me a call.
>>>
>>> Regards,
>>> Greg
>>>
>>> From: C-Peter Ritchie [mailto:PRitchie at cdic.ca]
>>> Sent: Friday, December 10, 2010 2:40 PM
>>> To: Gregory Chorebanian
>>> Cc: Steph Swierenga; Jim Apperly
>>> Subject: RE: queues going down
>>>
>>> Hi Greg.  I'll have to defer the decision about the contract to Steph.
>>>
>>> From: Gregory Chorebanian [mailto:gchorebanian at vmware.com]
>>> Sent: Friday, December 10, 2010 1:00 PM
>>> To: C-Peter Ritchie
>>> Cc: Steph Swierenga; Jim Apperly
>>> Subject: RE: queues going down
>>>
>>> Peter,
>>>
>>> I added Jim back to the thread - he can help a bit on the technical side.  For support we contract it out by CPU - for Rabbit we charge 3k per CPU per year.
>>>
>>> Should I send the contract to you so we can get a proper support ticket open?
>>>
>>> Let me know,
>>> Greg
>>>
>>> From: C-Peter Ritchie [mailto:PRitchie at cdic.ca]
>>> Sent: Friday, December 10, 2010 11:41 AM
>>> To: Gregory Chorebanian
>>> Cc: Steph Swierenga; Jeff Miller
>>> Subject: RE: queues going down
>>>
>>> Hi Greg.  My name is Peter Ritchie and I'm working with Steph on integrating RabbitMQ into our system.
>>>
>>> We've installed RabbitMQ on a Windows 2008 server (VM: 64 bit with 4 cores).  It's set up to run as a Windows service.  We are sending it about 2,000,000 messages where each message is about 1650 bytes.  We're finding that when the queue reaches about 262,000-328,000 queued messages, the RabbitMQ server crashes.
>>>
>>> We currently don't have a deployment of 32-bit Windows to test this scenario.
>>>
>>> We're using Erlang 5.8.1.1 and RabbitMQ server 2.2.0.  This is in .Net with the RabbitMQ .NET Client 2.2.0.
>>>
>>> If there's any other information you need; please let me know.
>>>
>>> Thanks -- Peter
>>>
>>>
>>>
>>> From: Jeff Miller [mailto:millerj at vmware.com]
>>> Sent: Friday, December 10, 2010 9:48 AM
>>> To: Jim Apperly; Steph Swierenga
>>> Cc: info at rabbitmq.com; Jamie Engesser; Gregory Chorebanian
>>> Subject: RE: queues going down
>>>
>>> Steph,
>>>
>>> Greg Chorebanian will help you with this request.
>>>
>>> Jeff Miller
>>> Vice President of Sales, Americas
>>> vFabric, Cloud Application Platform Division VMware
>>> 770-241-7809
>>>
>>>
>>>
>>> From: jim.apperly at gmail.com [mailto:jim.apperly at gmail.com] On Behalf Of Jim Apperly
>>> Sent: Friday, December 10, 2010 9:31 AM
>>> To: Steph Swierenga
>>> Cc: info at rabbitmq.com; Jamie Engesser; Jeff Miller
>>> Subject: Re: queues going down
>>>
>>> Hi Steph,
>>>
>>> We offer commercial support for RabbitMQ through the SpringSource division of VMware. I am cc'ing some colleagues who will be able to connect you with the right people to follow up with this.
>>>
>>> So that we can best help you please can you tell us more about your problem? Have you tried reproducing the issue on 32bit Windows?
>>>
>>> Best wishes
>>> Jim
>>>
>>> On 10 December 2010 13:52, Steph Swierenga <SSwierenga at cdic.ca> wrote:
>>> Hey, looking for some support (commercial?) for the rabbit. We're hosting the server in a windows service running on Windows Server 2008 (64-bit) and seem to be running into a buffering problem.
>>>
>>> Can you hook me up with a call?
>>>
>>> Thanks,
>>> Steph Swierenga,
>>> Ottawa, Canada
>>> 613 850-8898
>>>
>>>
>>>
>>>
>>> <rabbit at PAYSVC-TEST.log><rabbit at PAYSVC-TEST.log.1><rabbit at PAYSVC-TEST-sasl.log><rabbit at PAYSVC-TEST-sasl.log.1>
>>
>