[rabbitmq-discuss] BUG: Failing nodes with HA-queue stops message flow

Simon Lundström simlu at su.se
Mon Oct 1 13:03:16 BST 2012


Hello!

As per requested by Emile on IRC:

We get stale messages in the queue when the node that is "master" in a
HA-queue gets restarted, details below:

We are running two nodes, ram01 and disk01, on 2.8.7.
ram01 is a ram node and, surprise surprise, disk01 is a disc node.

We are using the amqp gem on Ruby for producer and consumer
(sourcecode available on demand).

The producer connects to ram01 and creates a fanout exchange which is
durable.  It then publishes X # of messages on the exchange.

The consumer connects to disk01 connects to disk01 and sets up an error
handler to reconnect if it gets an 320 error.
The consumer creates a durable queue which has x-ha-policy = all. It
when binds to the exchange.

The queue has disk01 as a node and ram01 as a mirror.

Reproduceable steps:
# Restart disk01 (via `rabbitmqctl stop` and start it up again).
# The consumer reconnects because it noticed the 320.
consumer log:
[connection.close] Reply code = 320, reply text = CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
[connection.close] Setting up a periodic reconnection timer...
# In the queue, ram01 is now node and disk01 is a mirror.
# Restart ram01 (via the same procedure as disk01)
# In the management web GUI under the queue (simlu.queue) the consumer is not listed as connected to the queue.
However, if I go to "Connections" I can see that the consumer is connected to disk01. According to the log the consumer is also connected.
# Publish message via the producer
# We now have 1 message "Ready" in the queue.
The consumer does not get any message.
# If we either:
* Restart disk01 or
* Restart consumer
the message gets delivered.
In this case, we restarted disk01.
# Message is delivered.
On ram01 we get this message in the log:
=ERROR REPORT==== 1-Oct-2012::13:55:35 ===
Discarding message {'$gen_call',{<0.173.0>,#Ref<0.0.0.967>},{notify_down,<6733.320.0>}} from <0.173.0> to <0.200.0> in an old incarnation (1) of this node (2)

Log output from disk01 and ram01 is attached.

Thanks!
- Simon

---

Simon Lundström
IT Services
Stockholm University
-------------- next part --------------

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
Mirrored-queue (queue 'simlu.queue' in vhost '/'): Slave <'rabbit at rabbitmq-test-ram01.it.su.se'.1.200.0> saw deaths of mirrors <'rabbit at rabbitmq-test-disk01.it.su.se'.2.250.0> 

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
Mirrored-queue (queue 'simlu.queue' in vhost '/'): Promoting slave <'rabbit at rabbitmq-test-ram01.it.su.se'.1.200.0> to master

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
rabbit on node 'rabbit at rabbitmq-test-disk01.it.su.se' down

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
only running disc node went down

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
rabbit on node 'rabbit at rabbitmq-test-disk01.it.su.se' up

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
cluster contains disc nodes again

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
Stopping Rabbit

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
    application: rabbitmq_management_agent
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
stopped SSL Listener on [::]:5671

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
    application: rabbit
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
    application: os_mon
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
    application: mnesia
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
Halting Erlang VM

=INFO REPORT==== 1-Oct-2012::13:50:44 ===
Limiting to approx 924 file handles (829 sockets)

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
Memory limit set to 805MB of 2013MB total.

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
Disk free limit set to 1000MB

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
Management plugin upgraded statistics to fine.

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 1-Oct-2012::13:50:45 ===
msg_store_persistent: recovery terms differ from present
rebuilding indices from scratch

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
Adding mirror of queue 'simlu.queue' in vhost '/' on node 'rabbit at rabbitmq-test-ram01.it.su.se': <0.200.0>

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
started SSL Listener on [::]:5671

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
rabbit on node 'rabbit at rabbitmq-test-disk01.it.su.se' up

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
cluster contains disc nodes again

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
Management agent started.

=INFO REPORT==== 1-Oct-2012::13:53:35 ===
accepting AMQP connection <0.253.0> (130.237.95.133:55630 -> 77.238.35.70:5671)

=INFO REPORT==== 1-Oct-2012::13:53:36 ===
closing AMQP connection <0.253.0> (130.237.95.133:55630 -> 77.238.35.70:5671)

=ERROR REPORT==== 1-Oct-2012::13:55:35 ===
Discarding message {'$gen_call',{<0.173.0>,#Ref<0.0.0.967>},{notify_down,<6733.320.0>}} from <0.173.0> to <0.200.0> in an old incarnation (1) of this node (2)


=INFO REPORT==== 1-Oct-2012::13:55:35 ===
Mirrored-queue (queue 'simlu.queue' in vhost '/'): Slave <'rabbit at rabbitmq-test-ram01.it.su.se'.2.200.0> saw deaths of mirrors <'rabbit at rabbitmq-test-disk01.it.su.se'.3.250.0> 

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
Mirrored-queue (queue 'simlu.queue' in vhost '/'): Promoting slave <'rabbit at rabbitmq-test-ram01.it.su.se'.2.200.0> to master

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
rabbit on node 'rabbit at rabbitmq-test-disk01.it.su.se' down

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
only running disc node went down

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
rabbit on node 'rabbit at rabbitmq-test-disk01.it.su.se' up

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
cluster contains disc nodes again

-------------- next part --------------

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
Stopping Rabbit

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
    application: rabbitmq_management
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
    application: rabbitmq_management_agent
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
stopped SSL Listener on [::]:5671

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
    application: rabbit
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
    application: os_mon
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
    application: mnesia
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:50:10 ===
Halting Erlang VM

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
Limiting to approx 924 file handles (829 sockets)

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
Memory limit set to 805MB of 2013MB total.

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
Disk free limit set to 1000MB

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
Management plugin upgraded statistics to fine.

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 1-Oct-2012::13:50:14 ===
msg_store_persistent: recovery terms differ from present
rebuilding indices from scratch

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
Adding mirror of queue 'simlu.queue' in vhost '/' on node 'rabbit at rabbitmq-test-disk01.it.su.se': <0.250.0>

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
started SSL Listener on [::]:5671

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
rabbit on node 'rabbit at rabbitmq-test-ram01.it.su.se' up

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
Management agent started.

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
Management plugin started. Port: 443, path: /

=INFO REPORT==== 1-Oct-2012::13:50:14 ===
Statistics database started.

=INFO REPORT==== 1-Oct-2012::13:50:15 ===
accepting AMQP connection <0.315.0> (130.237.95.133:55602 -> 77.238.35.76:5671)

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
Mirrored-queue (queue 'simlu.queue' in vhost '/'): Slave <'rabbit at rabbitmq-test-disk01.it.su.se'.3.250.0> saw deaths of mirrors <'rabbit at rabbitmq-test-ram01.it.su.se'.1.200.0> 

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
Mirrored-queue (queue 'simlu.queue' in vhost '/'): Promoting slave <'rabbit at rabbitmq-test-disk01.it.su.se'.3.250.0> to master

=INFO REPORT==== 1-Oct-2012::13:50:42 ===
rabbit on node 'rabbit at rabbitmq-test-ram01.it.su.se' down

=INFO REPORT==== 1-Oct-2012::13:50:45 ===
rabbit on node 'rabbit at rabbitmq-test-ram01.it.su.se' up

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
Stopping Rabbit

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
    application: rabbitmq_management
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
    application: rabbitmq_management_agent
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
stopped SSL Listener on [::]:5671

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
    application: rabbit
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
    application: os_mon
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
    application: mnesia
    exited: stopped
    type: permanent

=INFO REPORT==== 1-Oct-2012::13:55:35 ===
Halting Erlang VM

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
Limiting to approx 924 file handles (829 sockets)

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
Memory limit set to 805MB of 2013MB total.

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
Disk free limit set to 1000MB

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
Management plugin upgraded statistics to fine.

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 1-Oct-2012::13:55:39 ===
msg_store_persistent: recovery terms differ from present
rebuilding indices from scratch

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
Adding mirror of queue 'simlu.queue' in vhost '/' on node 'rabbit at rabbitmq-test-disk01.it.su.se': <0.250.0>

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
started SSL Listener on [::]:5671

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
rabbit on node 'rabbit at rabbitmq-test-ram01.it.su.se' up

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
Management agent started.

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
Management plugin started. Port: 443, path: /

=INFO REPORT==== 1-Oct-2012::13:55:39 ===
Statistics database started.

=INFO REPORT==== 1-Oct-2012::13:55:40 ===
accepting AMQP connection <0.315.0> (130.237.95.133:55642 -> 77.238.35.76:5671)



More information about the rabbitmq-discuss mailing list