[rabbitmq-discuss] Failed upgrade from 1.8.1 to 2.1.0

Fri Oct 8 19:53:28 BST 2010

> Ok, how much memory is in the box, and what are the average sizes of
> these messages? Are you setting the memory_high_watermark to anything
> non-default (of 0.4)? When rabbit starts, there should be entries in the
> logs saying things like:

The messages average about 8 bytes (they are mostly IDs that look like "t3_abcde"). The machine has 7.5 GB of RAM, and the log entries you describe look like:

    =INFO REPORT==== 7-Oct-2010::15:54:19 ===
    Limiting to approx 924 file handles (829 sockets)
    =INFO REPORT==== 7-Oct-2010::15:54:19 ===
    Memory limit set to 3081MB.
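
As a rough sanity check on that figure, here is a minimal sketch that assumes the logged limit is simply the watermark fraction times the total RAM visible to the VM. That formula and the function names are assumptions for illustration, not taken from RabbitMQ itself. With the default of 0.4 and a nominal 7.5 GB, it gives about 3072 MB, close to the logged 3081 MB, which is consistent with the default watermark being in effect.

```python
# Sketch: relate the logged memory limit to the watermark fraction.
# Assumption: limit = watermark * total RAM as seen by the VM.

DEFAULT_WATERMARK = 0.4  # default fraction quoted earlier in this thread


def memory_limit_mb(total_ram_mb, watermark=DEFAULT_WATERMARK):
    """Memory limit in MB implied by a watermark fraction of total RAM."""
    return total_ram_mb * watermark


def implied_total_ram_mb(limit_mb, watermark=DEFAULT_WATERMARK):
    """Total RAM in MB implied by a logged limit and a watermark."""
    return limit_mb / watermark


if __name__ == "__main__":
    nominal_mb = 7.5 * 1024  # 7.5 GB expressed in MB
    print("limit for nominal RAM (MB):", round(memory_limit_mb(nominal_mb)))
    print("RAM implied by 3081MB limit (MB):", implied_total_ram_mb(3081))
```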

> I suspect that you're maybe hitting the memory high watermark - are
> there any entries in the logs like:

There are no messages mentioning vm_memory_high_watermark or alarms:

    ri@q01:~/rabbitmq_server/log$ egrep -i '(alarm|memory)' *
    rabbit@q01.log:Memory limit set to 3081MB.
    rabbit@q01.log.1:Memory limit set to 3081MB.
    rabbit@q01.log.1:Memory limit set to 3081MB.

As you can see in the memory graph I included, the bump at the end is this attempted upgrade, and the machine definitely didn't run out of RAM. That "924 file handles" does look quite low, though. How can I increase it?
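
One thing that can be checked from the broker's user account is the per-process open file limit. The sketch below assumes (this is not confirmed anywhere in this thread) that the "924 file handles" figure is derived from that limit. It uses Python's standard `resource` module, which is Unix-only, and the function name is hypothetical.

```python
# Sketch: inspect the per-process open file limit on a Unix system.
# Assumption: the broker's file handle figure is derived from this limit.
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", soft)
    print("hard limit:", hard)
```

If that assumption holds, raising the soft limit (for example with `ulimit -n` in the shell that starts the broker) would be the thing to try.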

> Also, are you publishing messages transient or persistent (what's the
> delivery_mode when you publish messages)?. And your consumers that
> blocked, are these the ones using basic.get or basic.consume? And are
> they acking messages or using autoack? Are any of the consumers doing
> any publishes too?

Some messages are transient and some are persistent. The blocked consumers appear to be a mix of basic.get and basic.consume. The consumers ack their own messages (not autoack). I don't think any of the consumers publish messages.

> Interesting. What actions other than basic.consume (I assume) are these
> consumers doing when they connect? Do they try and redeclare the queue
> or create any other resources?

Yes, they all try to redeclare all of the queues and bindings.

> Ok, what was the disk activity during this exercise in general?

I didn't make a note of it, but the CPU usage, including WAIT CPU, didn't noticeably increase (I have the CPU graph for that period and it's flat before, during, and after the upgrade), and I'd expect the WAIT to go up if the disk were swamped.

>> Also, this is unrelated, but I'd provisioned the new machine a couple
>> of weeks ago, and it's been sitting literally 100% idle until I tried
>> moving today. But look at its memory usage over the last week
>> <http://i.imgur.com/M7x6m.png>. What on earth could it be doing that
>> it's growing in memory when *not used*? This machine is identical to
>> our other machines (it's an EC2 node from the same AMI), but only this
>> node has this problem, and all it's running is rabbit.
> Now that is interesting. What version of Erlang are you running?

    ri@q01:~/rabbitmq_server/log$ erl --version
    Erlang R14B (erts-5.8.1) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

Thanks for the help so far :)