[rabbitmq-discuss] RabbitMQ crashed in ets:insert_new - looks like a genuine bug...

Fri Aug 12 15:11:56 BST 2011

On Fri, Aug 12, 2011 at 11:26:51AM +0100, Matthew Sackman wrote:
> There is only one use of lists:any in the msg_store, and thus it's
> fairly easy to work out what's happened: somehow, there's been an
> attempt made to sync on disk a msg that turns out not to exist. This is,
> unsurprisingly, an unexpected situation, hence the dramatic crash.

I've now been able to reproduce it. For me, the following test causes
this:

%%%%%%%%%%%
-module(test).
-include_lib("amqp_client/include/amqp_client.hrl").

crash_tx() ->
    {ok, ConnDel} = amqp_connection:start(#amqp_params_network{}),
    {ok, ChanDel} = amqp_connection:open_channel(ConnDel),
    {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
    TxChanCount = 200,
    TxChans = [begin
                   {ok, ChanPubTx} = amqp_connection:open_channel(Conn),
                   amqp_channel:call(ChanPubTx, #'tx.select'{}),
                   ChanPubTx
               end || _ <- lists:seq(1, TxChanCount)],
    amqp_channel:call(
      ChanDel, #'queue.declare' { queue = <<"q">>, durable = true }),
    Method = #'basic.publish'{ routing_key = <<"q">> },
    Content = #amqp_msg { props = #'P_basic'{ delivery_mode = 2 },
                          payload = <<0:8>> },
    Rounds = lists:seq(1, 5),
    PerRound = lists:seq(1, 1000),
    Parent = self(),
    PubPidTxs = [spawn(
                   fun () ->
                           [begin
                                io:format("~p: ~p~n", [self(), R]),
                                amqp_channel:call(ChanPubTx, #'tx.commit'{}),
                                [ok = amqp_channel:cast(ChanPubTx, Method, Content)
                                 || _ <- PerRound]
                            end || R <- Rounds],
                           Parent ! publish_done,
                           amqp_channel:call(ChanPubTx, #'tx.commit'{})
                   end) || ChanPubTx <- TxChans],
    receive
        publish_done ->
            io:format("Deleting queue...~n"),
            amqp_channel:call(ChanDel, #'queue.delete'{ queue = <<"q">> }),
            amqp_channel:close(ChanDel),
            [exit(PubPidTx, kill) || PubPidTx <- PubPidTxs],
            [amqp_channel:close(ChanPubTx) || ChanPubTx <- TxChans]
    end,
    amqp_connection:close(ConnDel),
    [receive publish_done -> ok end || _ <- tl(TxChans)],
    amqp_connection:close(Conn).
%%%%%%%%%%%%

It's a nasty little bug. Basically, we have all sorts of optimisations
in place to make sure that when a queue is deleted, even if it has lots
of outstanding work to do, the queue can be deleted quickly. This
basically amounts to promoting the delete up the list of things to do,
the idea being that it can overtake writes and removes and such that
obviously are now not worth doing considering that the queue is going to
be deleted.

It turns out that if in the list of things to do, you have a write, and
then a sync of that write, the write itself can be ignored owing to the
promoted deletion. But the sync doesn't get ignored, and so tries to
sync the msg to disk, but the msg isn't known about on disk because it's
been ignored, hence the not_found, hence the crash.

The good news is that all of that code has changed quite a lot since
2.5.1 and so I don't think this bug can happen if you run from default
(and consequently will be fixed in the next release). I think. I'll
continue checking through the code on default to be sure of this.

Well found!

Matthew