Hi all,<div><br></div><div>We were hoping to go into production next week with Rabbit's HA queues; however, testing by randomly killing processes has shown some odd behaviour. I'm not sure whether it's something odd in our setup or a set of interacting bugs.</div>
<div><br></div><div>When started, we create three queues:</div>
<div><ul><li>storm-classifications</li><li>storm-meta</li><li>storm-tracking</li></ul><div>There are 3 nodes in the cluster, with each queue mirrored to all nodes. In testing, I've been issuing kill commands to take out the beam, rabbit and erlang processes, which closes the channels and makes clients reconnect to a different node. I allow the downed node to recover and come back up before killing another one (also allowing the queues to synchronise).</div>
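<div><br></div><div>For reference, the node-cycling procedure looks roughly like this (a sketch only; the exact kill target and the rabbitmqctl info item names are from memory and may differ on other versions):</div>

```shell
# On one node of the cluster: kill the broker uncleanly, then bring it back.
# (Assumes rabbitmq-server is managed by the init script; adjust as needed.)
pkill -9 -f beam                    # take out the Erlang VM running Rabbit
sleep 5
/etc/init.d/rabbitmq-server start   # bring the node back up

# Before killing the next node, check that mirrors have re-synchronised:
rabbitmqctl list_queues name messages pid slave_pids synchronised_slave_pids
```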
</div><div><br></div><div>After doing this a couple of times, we see the following:</div><div><br></div><div><a href="http://www.evernote.com/shard/s53/sh/448a967e-b995-4f54-986d-50194955550f/416d4e71af91f6e2f6d5311f7ea9fb44" target="_blank">http://www.evernote.com/shard/s53/sh/448a967e-b995-4f54-986d-50194955550f/416d4e71af91f6e2f6d5311f7ea9fb44</a></div>
<div><br></div><div>The classifications queue is gone (taking any messages with it), and the meta queue is now mirrored to only one other node. The tracking queue looks OK, but only because it disappeared and was recreated empty.</div>
<div><br></div><div>The meta queue - despite not having any messages flowing through it in the test system - sometimes shows that one of the nodes is not synchronised for minutes after rejoining.</div><div><br></div><div>
I've also found that queues sometimes stop delivering messages when certain nodes go down (even after being left for minutes), despite being in HA mode (I haven't been able to dig into this more yet). Sometimes connections to nodes which have gone down are still shown and get stuck. netstat reveals that those connections no longer exist at the TCP level, and using the Web UI to 'Force Close' them generates an error (a red box saying it is unable to connect to the server, although the rest of the UI works fine).</div>
<div><br>
</div>
<div><div>This seems like rather odd behaviour, and it means we can't put this into production. I'm having trouble replicating it; all I know is that after cycling the nodes a few times, things stop working as we'd expect.</div>
<div><br></div><div>Today, while bringing up the cluster from scratch (shut down all instances, wipe mnesia, restart), I ended up with 3 nodes running, but one HA queue with 1 master, 2 synced slaves and 1 unsynced slave. The other queues show 1 master and 2 synced slaves, as expected. (see <a href="http://www.evernote.com/shard/s53/sh/b6345885-88d1-4d21-9614-24abda75a1cb/c2a0dd265b39d21f3e8c336c67ced979">http://www.evernote.com/shard/s53/sh/b6345885-88d1-4d21-9614-24abda75a1cb/c2a0dd265b39d21f3e8c336c67ced979</a>)</div>
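<div><br></div><div>In case it helps, this is how I'm inspecting the mirror state (I'm going by the synchronised_slave_pids info item; I'm assuming it's reliable on 2.6.0):</div>

```shell
# Show which nodes are in the cluster and running:
rabbitmqctl cluster_status

# Show the master pid, mirror pids, and which mirrors are synchronised
# for each queue:
rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids
```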
<div><br></div><div>We're currently on RabbitMQ 2.6.0, Ubuntu 11.04, on EC2.</div><div><br></div><div>Quite infuriating, and I have no idea how to fix it.</div>
</div><div><br></div><div>A</div><div><div><br></div>-- <br>Dr Ashley Brown<br>Chief Architect<br><br>e: <a href="mailto:ashley@spider.io" target="_blank">ashley@spider.io</a><br>a: <a href="http://spider.io" target="_blank">spider.io</a>, 353 The Strand, WC2R 0HS<br>
w: <a href="http://spider.io/" target="_blank">http://spider.io/</a><br>
</div>