[rabbitmq-discuss] Feature Req / Bug list

Tue Nov 26 22:23:31 GMT 2013

Hmmm... So, about 1/4 of the queues were approaching being balanced, and
then all of the hosts dropped out of the test cluster.

Looking at the system, only the second instance is even listening on any
network ports, but the beam.smp processes for all instances are still
running:

# netstat -tlpn | grep beam
tcp        0      0 0.0.0.0:44442               0.0.0.0:*                   LISTEN      20303/beam.smp
tcp        0      0 0.0.0.0:16639               0.0.0.0:*                   LISTEN      20303/beam.smp
tcp        0      0 :::5672                     :::*                        LISTEN      20303/beam.smp

# ps axf
12449 ?        S      0:08 /usr/lib64/erlang/erts-5.8.5/bin/epmd -daemon
20183 ?        Sl    13:50 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p
20303 ?        Sl    21:31 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p
20739 ?        Ss     0:04  \_ inet_gethost 4
20740 ?        S      0:04      \_ inet_gethost 4
20423 ?        Sl    11:24 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p
20543 ?        Sl    11:41 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p
20663 ?        Sl    10:18 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p
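
For anyone wanting to spot this state without eyeballing the two listings, here is a rough sketch that cross-references them and prints the beam.smp PIDs that own no listening TCP socket. The function names listening_pids and silent_beams are made up for illustration, and it assumes netstat prints each socket on a single line with the PID/program column last (which needs root for -p), so treat it as a starting point rather than a tested tool.

```shell
#!/bin/sh
# Hypothetical helpers, not part of RabbitMQ or the reproduction scripts.

# Read `netstat -tlpn` output on stdin and print the PIDs of beam processes
# that own at least one listening socket (last column is "PID/program").
listening_pids() {
    awk '$NF ~ /^[0-9]+\/beam/ { split($NF, part, "/"); print part[1] }' | sort -u
}

# $1: text of `netstat -tlpn` output
# $2: whitespace-separated list of beam.smp PIDs
# Prints every PID from $2 that has no listening socket in $1.
silent_beams() {
    listeners=$(printf '%s\n' "$1" | listening_pids)
    for pid in $2; do
        case " $(echo $listeners) " in
            *" $pid "*) ;;          # this node is still listening
            *) echo "$pid" ;;       # running, but no listeners
        esac
    done
}

# Possible usage on a live host (as root):
#   silent_beams "$(netstat -tlpn 2>/dev/null)" "$(pgrep -x beam.smp)"
```

Fed the output above, this should report every beam.smp PID except 20303, matching what I saw by hand.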

I killed them all and started them back up again. They spent a long time
rescanning all their queues, and when the main node (the one the queues
were created on) re-entered the cluster, they all crashed again. I then
blew away the test cluster and re-ran the following steps. This time the
main node, where all the queues were created and the data loaded, stopped
listening on all its ports and disconnected from the rest of the cluster.
Killing and restarting it caused it to reread all its queues, and then
drop all its network listeners and connections to the cluster again.

./create_cluster.sh && ./setup_queues.sh && ./populate_queues.sh
./rebalance_cluster.sh

The logs for instances which drop their network connections show some
pretty big Erlang stack dumps, so I've compressed and uploaded a sample
log file here for you:
https://mega.co.nz/#!31JCGSxa!LdxuhXeX_HN_px8lnQcp63RkFcRY_dW9Z_x9C7qv2aE
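
To get a quick feel for how many dumps a given log contains before reading it in full, something like the following sketch may help. It assumes the log uses the usual Erlang report headers of the form "=ERROR REPORT==== ..." and "=CRASH REPORT==== ..." at the start of a line; count_reports is a made-up name, and the path in the usage comment is only a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: tally report headers in one log file.
# Assumes header lines look like "=ERROR REPORT==== 26-Nov-2013::22:23:31 ===".
# $1: path to a log file
count_reports() {
    awk '/^=[A-Z ]+ REPORT====/ {
             sub(/^=/, "")               # drop the leading "="
             sub(/ REPORT====.*/, "")    # keep only the report type
             n[$0]++
         }
         END { for (kind in n) print kind, n[kind] }' "$1" | sort
}

# Possible usage (placeholder path):
#   count_reports /path/to/node.log
```

Each output line is a report type followed by its count, which makes it easier to compare a node that dropped its listeners against one that did not.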

Here's my latest set of scripts for reproduction at your end:
https://mega.co.nz/#!Th5UCBZS!eQe9_SmOS5qdv0tP8nUZmnjj7h1QhtOG1i4GrtoYeMU

Graeme

On Tue, Nov 26, 2013 at 1:28 PM, Graeme N <graeme at sudo.ca> wrote:

> On Tue, Nov 26, 2013 at 1:13 PM, Graeme N <graeme at sudo.ca> wrote:
>
>>
>> On Mon, Nov 25, 2013 at 9:39 AM, Simon MacMullen <simon at rabbitmq.com> wrote:
>>
>>> Graeme: we recently merged a few fixes to bugs that we found as a result
>>> of running your test scripts; these can be picked up in the latest
>>> nightlies. I am now able to run rebalance_cluster.sh and toggle_nodes.sh
>>> indefinitely without anything breaking.
>>>
>>
>>> Thanks for producing these scripts!
>>>
>>