[rabbitmq-discuss] Rabbitmq falling over & losing messages
Toby White
toby.o.h.white at googlemail.com
Thu Nov 27 15:41:59 GMT 2008
I'm seeing a failure condition in Rabbit, where I seem to lose a whole
queue's worth of messages.
I'm using RabbitMQ-1.4.0, and talking to it with py-amqplib.
If I push small messages at Rabbit as fast as I can (both client and
server on the same host) in transacted batches of 1000:
>>> t = str(time.time())
>>> ch.tx_select()
>>> for i in xrange(1000):
...     msgs = [str(1000*i+j)+" "+t for j in xrange(1000)]
...     for j in xrange(1000):
...         ch.basic_publish(amqp.Message(msgs[j], delivery_mode=2), 'amq.direct', routing_key='TX')
...     print i
...     ch.tx_commit()
...     ch.tx_select()
then initially rabbit accepts the connections, and I can see the
messages arriving on the queue, in their batches, from the broker's
shell (there is no consumer running, so all messages are building up
in the broker queue.)
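For anyone who wants to check the call sequence without my session state, the loop can be wrapped as a function. This is only a sketch: `publish_batches` and `RecordingChannel` are names made up here, and the stub merely stands in for a real py-amqplib channel so the sequence of calls can be inspected without a broker.

```python
import time


class RecordingChannel(object):
    """Hypothetical stand-in for a py-amqplib channel; it only records calls."""

    def __init__(self):
        self.calls = []

    def tx_select(self):
        self.calls.append("tx_select")

    def basic_publish(self, msg, exchange, routing_key=""):
        self.calls.append(("basic_publish", msg, exchange, routing_key))

    def tx_commit(self):
        self.calls.append("tx_commit")


def publish_batches(ch, make_message, batches=1000, batch_size=1000, stamp=None):
    """Publish `batches` transactions of `batch_size` messages each.

    Mirrors the loop in the post: same bodies, same exchange and routing
    key, and the same tx_select/tx_commit sequence.
    """
    if stamp is None:
        stamp = str(time.time())
    ch.tx_select()
    for i in range(batches):
        for j in range(batch_size):
            body = str(batch_size * i + j) + " " + stamp
            ch.basic_publish(make_message(body), "amq.direct", routing_key="TX")
        ch.tx_commit()
        ch.tx_select()


# Dry run against the stub: 2 batches of 3 messages each.
ch = RecordingChannel()
publish_batches(ch, lambda body: body, batches=2, batch_size=3, stamp="t")
```

Against a real broker, the same function would presumably be called with the real channel and `lambda body: amqp.Message(body, delivery_mode=2)` as `make_message`, leaving the batch defaults at 1000.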
However, after anywhere between 50,000 and 100,000 messages have been
published, the client gets an exception, RabbitMQ crashes, and the
queue vanishes from the server, along with all unconsumed messages.
Recreating the queue shows it empty; restarting the server entirely is
a hit-and-miss affair; sometimes it works fine (though the messages
are still missing); sometimes it fails due to a timeout on the
persister process, and the only way to restart seems to be to delete
/var/lib/rabbitmq/mnesia.
This is all entirely reproducible, the only variable being the number
of messages published before Rabbit falls over.
At the time of crashing, the erlang process is taking about 40% of the
host's CPU, and about 10% of its memory; the Python process doing the
publishing is taking a small amount of CPU and memory, nothing else of
significance is consuming any resources.
The client exception looks like:
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 3336, in tx_commit
    (90, 21), # Channel.tx_commit_ok
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 183, in wait
    frame_type, payload = self._next_frame()
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 123, in _next_frame
    return self.connection._wait_channel(self.channel_id)
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 430, in _wait_channel
    self.wait()
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 203, in wait
    return self._dispatch(method_sig, args, content)
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 115, in _dispatch
    return amqp_method(self, args)
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 563, in _close
    raise AMQPConnectionException(reply_code, reply_text, (class_id, method_id))
amqplib.client_0_8.AMQPConnectionException: (541, u'INTERNAL_ERROR', (0, 0), '')
Looking in the Rabbit logs, this happens at the crash:
=ERROR REPORT==== 27-Nov-2008::12:00:45 ===
connection <0.283.0> (running), channel 1 - error:
{commit_failed,
[{exit,
{timeout,
{gen_server,
call,
[<0.273.0>,{commit,{{1,<0.288.0>},1080079}}]}}}]}
=WARNING REPORT==== 27-Nov-2008::12:00:45 ===
Non-AMQP exit reason '{commit_failed,
[{exit,
{timeout,
{gen_server,
call,
[<0.273.0>,
{commit,{{1,<0.288.0>},
1080079}}]}}}]}'
=INFO REPORT==== 27-Nov-2008::12:00:45 ===
closing TCP connection <0.283.0> from 127.0.0.1:57226
=ERROR REPORT==== 27-Nov-2008::12:01:36 ===
** Generic server <0.273.0> terminating
** Last message in was {commit,{{1,<0.288.0>},1080079}}
** When Server state == {q,
{amqqueue,
{resource,<<"/">>,queue,<<"TX_queue">>},
true,
false,
[],
[],
none},
none,
none,
true,
1,
[...]
{{basic_message,
{resource,<<"/">>,exchange,<<"amq.direct">>},
<<"TX">>,
{content,
60,
{'P_basic',
undefined,
undefined,
undefined,
2,
undefined,
undefined,
undefined,
undefined,
undefined,
undefined,
undefined,
undefined,
undefined,
undefined},
<<16,0,2>>,
[<<"78999 1227785600.85">>]},
{{1,<0.288.0>},1080078}},
false}]},
{[],[]}}
** Reason for termination ==
** {timeout,
{gen_server,
call,
[rabbit_persister,
{commit_transaction,
{{{1,<0.288.0>},1080079},
{resource,<<"/">>,queue,<<"TX_queue">>}}}]}}
with the entire contents of the queue (79000 messages in this case)
inside the server state.
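As a quick sanity check on those numbers (a sketch only; the helper name `batch_of` is made up here): with batches of 1000, the body "78999 1227785600.85" seen in the server state is the last message of batch i=78, so batches 0 to 78 account for 79 * 1000 = 79000 messages, which matches the count above.

```python
BATCH_SIZE = 1000  # batch size used in the publishing loop


def batch_of(body):
    """Return (batch index i, position j) for a body of the form '<n> <timestamp>'."""
    n = int(body.split(" ", 1)[0])
    return divmod(n, BATCH_SIZE)


i, j = batch_of("78999 1227785600.85")
messages_through_batch = (i + 1) * BATCH_SIZE
print(i, j, messages_through_batch)  # prints: 78 999 79000
```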
What can I do to fix this?
Toby