[rabbitmq-discuss] Rabbitmq falling over & losing messages

Toby White toby.o.h.white at googlemail.com
Thu Nov 27 15:41:59 GMT 2008


I'm seeing a failure condition in Rabbit, where I seem to lose a whole  
queue's worth of messages.

I'm using RabbitMQ-1.4.0, and talking to it with py-amqplib.

If I push small messages at Rabbit as fast as I can (both client and  
server on the same host) in transacted batches of 1000:

 >>> t = str(time.time())
 >>> ch.tx_select()
 >>> for i in xrange(1000):
...     msgs = [str(1000*i+j)+" "+t for j in xrange(1000)]
...     for j in xrange(1000):
...         ch.basic_publish(amqp.Message(msgs[j], delivery_mode=2), 'amq.direct', routing_key='TX')
...     print i
...     ch.tx_commit()
...     ch.tx_select()

then initially Rabbit accepts the connections, and from the broker's
shell I can see the messages arriving on the queue in their batches
(there is no consumer running, so all messages build up in the broker
queue).

However, after anywhere between 50,000 and 100,000 messages have been  
published, the client gets an exception, RabbitMQ crashes, and the  
queue vanishes from the server, along with all unconsumed messages.  
Recreating the queue shows it empty. Restarting the server entirely is
hit-and-miss: sometimes it works fine (though the messages are still
missing); sometimes it fails due to a timeout on the persister process,
and then the only way to restart seems to be to delete
/var/lib/rabbitmq/mnesia.

This is all entirely reproducible, the only variable being the number  
of messages published/consumed before Rabbit falls over.
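
For anyone who wants to reproduce this outside an interactive session,
here is a rough sketch of the same loop as a function. The names
publish_batches and make_message are made up for illustration and are
not part of py-amqplib; the function only assumes a channel object with
the tx_select, tx_commit and basic_publish methods used in the session
above, and it takes the message constructor as an argument so that it
does not depend on a particular client library.

```python
import time


def publish_batches(ch, make_message, batches=1000, batch_size=1000,
                    exchange='amq.direct', routing_key='TX'):
    """Publish `batches` transactions of `batch_size` messages each.

    `ch` is an already-open channel (hypothetical here: anything with
    tx_select, tx_commit and basic_publish). `make_message` turns a body
    string into a message object. Returns the number of messages whose
    transaction was committed.
    """
    t = str(time.time())  # one timestamp shared by the whole run
    committed = 0
    ch.tx_select()
    for i in range(batches):
        for j in range(batch_size):
            # Same body format as the session above: "<sequence> <timestamp>"
            body = str(batch_size * i + j) + " " + t
            ch.basic_publish(make_message(body), exchange,
                             routing_key=routing_key)
        ch.tx_commit()
        committed += batch_size
        ch.tx_select()
    return committed
```

With py-amqplib this would be called as
publish_batches(ch, lambda body: amqp.Message(body, delivery_mode=2)),
with ch being the same channel as in the session above. Wrapping the
call in a try/except and printing the last committed count would show
how far the run got before the broker closed the connection.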

At the time of crashing, the erlang process is taking about 40% of the  
host's CPU, and about 10% of its memory; the Python process doing the  
publishing is taking a small amount of CPU and memory, nothing else of  
significance is consuming any resources.

The client exception looks like:

Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 3336, in tx_commit
    (90, 21),    # Channel.tx_commit_ok
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 183, in wait
    frame_type, payload = self._next_frame()
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 123, in _next_frame
    return self.connection._wait_channel(self.channel_id)
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 430, in _wait_channel
    self.wait()
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 203, in wait
    return self._dispatch(method_sig, args, content)
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 115, in _dispatch
    return amqp_method(self, args)
  File "/usr/lib/python2.5/site-packages/amqplib/client_0_8.py", line 563, in _close
    raise AMQPConnectionException(reply_code, reply_text, (class_id, method_id))
amqplib.client_0_8.AMQPConnectionException: (541, u'INTERNAL_ERROR', (0, 0), '')


Looking in the Rabbit logs, this happens at the crash:

=ERROR REPORT==== 27-Nov-2008::12:00:45 ===
connection <0.283.0> (running), channel 1 - error:
{commit_failed,
    [{exit,
         {timeout,
             {gen_server,
                 call,
                 [<0.273.0>,{commit,{{1,<0.288.0>},1080079}}]}}}]}

=WARNING REPORT==== 27-Nov-2008::12:00:45 ===
Non-AMQP exit reason '{commit_failed,
                          [{exit,
                               {timeout,
                                   {gen_server,
                                       call,
                                       [<0.273.0>,
                                        {commit,{{1,<0.288.0>},1080079}}]}}}]}'

=INFO REPORT==== 27-Nov-2008::12:00:45 ===
closing TCP connection <0.283.0> from 127.0.0.1:57226

=ERROR REPORT==== 27-Nov-2008::12:01:36 ===
** Generic server <0.273.0> terminating
** Last message in was {commit,{{1,<0.288.0>},1080079}}
** When Server state == {q,
                         {amqqueue,
                          {resource,<<"/">>,queue,<<"TX_queue">>},
                          true,
                          false,
                          [],
                          [],
                          none},
                         none,
                         none,
                         true,
                         1,
[...]
                           {{basic_message,
                             {resource,<<"/">>,exchange,<<"amq.direct">>},
                             <<"TX">>,
                             {content,
                              60,
                              {'P_basic',
                               undefined,
                               undefined,
                               undefined,
                               2,
                               undefined,
                               undefined,
                               undefined,
                               undefined,
                               undefined,
                               undefined,
                               undefined,
                               undefined,
                               undefined,
                               undefined},
                              <<16,0,2>>,
                              [<<"78999 1227785600.85">>]},
                             {{1,<0.288.0>},1080078}},
                            false}]},
                         {[],[]}}
** Reason for termination ==
** {timeout,
       {gen_server,
           call,
           [rabbit_persister,
            {commit_transaction,
                {{{1,<0.288.0>},1080079},
                 {resource,<<"/">>,queue,<<"TX_queue">>}}}]}}


with the entire contents of the queue (79000 messages in this case)  
inside the server state.

What can I do to fix this?

Toby




