[rabbitmq-discuss] Failed upgrade from 1.8.1 to 2.1.0

David King dking at ketralnis.com
Fri Oct 8 01:32:36 BST 2010


I tried to upgrade from 1.8.1 to 2.1.0. The "upgrade" was a process of provisioning a new machine running 2.1.0 and slowly moving all of the producers and consumers over to it. Either 2.1.0 couldn't keep up with the load, or it hit some maximum number of connections, or something else went wrong. The graphs tell the story better: <http://i.imgur.com/flLPX.png>

At first (about 15:10 on the graph), almost all of the queues started growing. Consumers would hang, and restarting them didn't help. But some of the queues (like commentstree_q and register_vote_q) didn't have any trouble at all.

At about 15:50, after about 30 minutes of trying to figure out what was going on, I restarted rabbit. This time, some of the queues that were unconsumable before (like spam_q and corrections_q) were now working, but other queues (like indextank_changes) would still hang on consuming.

At 16:20 I gave up and started reverting back to the other queue machine (running rabbit 1.8.1). As that happened, some of the queues that were unconsumable finally started shrinking and their consumers unhung. Some of them (newcomments_q) processed all of the items in the queue basically instantly, so this isn't a case of our own app not being able to keep up. By about 16:30 I'd completed moving back to the old machine.

Potentially relevant:
* commentstree_q has 4 consumers
* corrections_q has 1 consumer
* the one with the blacked out name has 1 consumer that runs from cron, not continuously
* indextank_changes has 2 consumers
* log_q has 1 consumer
* newcomments_q has 1 consumer
* register_vote_q has ~15 consumers
* scraper_q has 5 consumers
* solrsearch_changes has 1 consumer
* spam_q has 4 consumers
* usage_q has 2 consumers

I realise this may not be enough information, but the logs were basically empty for this period (they'd have occasional messages like "someone connected" and "someone disconnected" but nothing more useful). So what information can I gather beyond this that might help diagnose why rabbit 2.1.0 can't keep up but 1.8.1 can?
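One thing that could be gathered beyond the logs is a periodic snapshot of per-queue counters from rabbitmqctl, so that a hang shows up as numbers rather than only as a graph. What follows is a minimal sketch, not a tested diagnostic: it assumes `rabbitmqctl list_queues` accepts the info items `name`, `messages_ready`, `messages_unacknowledged` and `consumers` and prints them tab-separated, and the helper names (`parse_list_queues`, `suspect_queues`, `snapshot`) are made up for illustration. The "suspect" rule is only a guess at what a hung queue looks like, and it will misfire for consumers that don't use acks, since those never show unacknowledged messages.

```python
import subprocess

# Info items requested from rabbitmqctl, in output column order.
FIELDS = ("name", "messages_ready", "messages_unacknowledged", "consumers")


def parse_list_queues(output):
    """Parse tab-separated `rabbitmqctl list_queues` output into dicts.

    Lines that do not have exactly one column per requested field
    (for example the "Listing queues ..." banner) are skipped.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != len(FIELDS):
            continue
        name, ready, unacked, consumers = parts
        try:
            rows.append({
                "name": name,
                "messages_ready": int(ready),
                "messages_unacknowledged": int(unacked),
                "consumers": int(consumers),
            })
        except ValueError:
            continue  # a counter column was not an integer; ignore the line
    return rows


def suspect_queues(rows):
    """Hypothetical heuristic: messages are waiting and consumers are
    attached, yet nothing is currently delivered and awaiting an ack."""
    return [
        r["name"] for r in rows
        if r["messages_ready"] > 0
        and r["consumers"] > 0
        and r["messages_unacknowledged"] == 0
    ]


def snapshot():
    """Run rabbitmqctl once and return the parsed rows."""
    out = subprocess.check_output(["rabbitmqctl", "list_queues"] + list(FIELDS))
    return parse_list_queues(out.decode("utf-8", "replace"))
```

Calling `suspect_queues(snapshot())` every 30 seconds or so during a migration, and logging the result with a timestamp, would give a record of which queues stalled and when, to set against the graphs.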

Also, this is unrelated, but I'd provisioned the new machine a couple of weeks ago, and it's been sitting literally 100% idle until I tried moving today. But look at its memory usage over the last week <http://i.imgur.com/M7x6m.png>. What on earth could it be doing that it's growing in memory when *not used*? This machine is identical to our other machines (it's an EC2 node from the same AMI), but only this node has this problem, and all it's running is rabbit.

