[rabbitmq-discuss] RabbitMQ crashed in ets:insert_new - looks like a genuine bug...

Eugene Kirpichov ekirpichov at gmail.com
Fri Aug 12 16:38:18 BST 2011


Hi Matthew,

Thanks a lot for taking the time to investigate.

Are you referring to reproducing the INTERNAL_ERROR bug in tx.commit (which
only happened once and didn't cause a node crash), or to the bug that
caused the node crash (and happened on a different node of the
cluster)?

I'm using RabbitMQ 2.5.1 with Erlang R14B02, and yes, on Windows.

By the way, I ran the stress test with even more stress on the cluster
several times afterwards and wasn't able to cause it to crash again,
though before that 2 of 2 tests crashed. So I was "lucky" in a sense.

On Fri, Aug 12, 2011 at 7:11 AM, Matthew Sackman <matthew at rabbitmq.com> wrote:
> On Fri, Aug 12, 2011 at 11:26:51AM +0100, Matthew Sackman wrote:
>> There is only one use of lists:any in the msg_store, and thus it's
>> fairly easy to work out what's happened: somehow, there's been an
>> attempt made to sync on disk a msg that turns out not to exist. This is,
>> unsurprisingly, an unexpected situation, hence the dramatic crash.
>
> I've now been able to reproduce it. For me, the following test causes
> this:
>
> %%%%%%%%%%%
> -module(test).
> -export([crash_tx/0]).
> -include_lib("amqp_client/include/amqp_client.hrl").
>
> crash_tx() ->
>    {ok, ConnDel} = amqp_connection:start(#amqp_params_network{}),
>    {ok, ChanDel} = amqp_connection:open_channel(ConnDel),
>    {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
>    TxChanCount = 200,
>    TxChans = [begin
>                   {ok, ChanPubTx} = amqp_connection:open_channel(Conn),
>                   amqp_channel:call(ChanPubTx, #'tx.select'{}),
>                   ChanPubTx
>               end || _ <- lists:seq(1, TxChanCount)],
>    amqp_channel:call(
>      ChanDel, #'queue.declare' { queue = <<"q">>, durable = true }),
>    Method = #'basic.publish'{ routing_key = <<"q">> },
>    Content = #amqp_msg { props = #'P_basic'{ delivery_mode = 2 },
>                          payload = <<0:8>> },
>    Rounds = lists:seq(1, 5),
>    PerRound = lists:seq(1, 1000),
>    Parent = self(),
>    PubPidTxs = [spawn(
>                   fun () ->
>                           [begin
>                                io:format("~p: ~p~n", [self(), R]),
>                                amqp_channel:call(ChanPubTx, #'tx.commit'{}),
>                                [ok = amqp_channel:cast(ChanPubTx, Method, Content)
>                                 || _ <- PerRound]
>                            end || R <- Rounds],
>                           Parent ! publish_done,
>                           amqp_channel:call(ChanPubTx, #'tx.commit'{})
>                   end) || ChanPubTx <- TxChans],
>    receive
>        publish_done ->
>            io:format("Deleting queue...~n"),
>            amqp_channel:call(ChanDel, #'queue.delete'{ queue = <<"q">> }),
>            amqp_channel:close(ChanDel),
>            [exit(PubPidTx, kill) || PubPidTx <- PubPidTxs],
>            [amqp_channel:close(ChanPubTx) || ChanPubTx <- TxChans]
>    end,
>    amqp_connection:close(ConnDel),
>    [receive publish_done -> ok end || _ <- tl(TxChans)],
>    amqp_connection:close(Conn).
> %%%%%%%%%%%%
>
> It's a nasty little bug. Basically, we have all sorts of optimisations
> in place to make sure that when a queue is deleted, even if it has lots
> of outstanding work to do, the queue can be deleted quickly. This
> basically amounts to promoting the delete up the list of things to do,
> the idea being that it can overtake writes and removes and such that
> obviously are now not worth doing considering that the queue is going to
> be deleted.
>
> It turns out that if in the list of things to do, you have a write, and
> then a sync of that write, the write itself can be ignored owing to the
> promoted deletion. But the sync doesn't get ignored, and so tries to
> sync the msg to disk, but the msg isn't known about on disk because it's
> been ignored, hence the not_found, hence the crash.
>
> The good news is that all of that code has changed quite a lot since
> 2.5.1, so I don't think this bug can happen if you run from the default
> branch (and consequently it will be fixed in the next release). I think.
> I'll continue checking through the code on default to be sure of this.
>
> Well found!
>
> Matthew
>
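
The write, sync and promoted-delete interaction described in the quoted
message can be sketched as a toy model. This is not RabbitMQ code: the
module name, the operation tuples and the in-memory set standing in for
the on-disk message index are all assumptions made for illustration.

```erlang
%% Illustrative model only; nothing here is taken from the real msg_store.
-module(promoted_delete_sketch).
-export([run/1, demo/0]).

%% Ops is a list of pending operations: {write, Id}, {sync, Id} or delete.
%% A pending delete anywhere in the list is "promoted": it overtakes the
%% other work, so writes are treated as no longer worth doing.
run(Ops) ->
    Promoted = lists:member(delete, Ops),
    step([Op || Op <- Ops, Op =/= delete], Promoted, sets:new()).

%% OnDisk is the set of message ids the (hypothetical) store knows about.
step([], _Promoted, _OnDisk) ->
    ok;
step([{write, _Id} | Rest], true, OnDisk) ->
    %% The write is skipped because the queue is about to be deleted.
    step(Rest, true, OnDisk);
step([{write, Id} | Rest], false, OnDisk) ->
    step(Rest, false, sets:add_element(Id, OnDisk));
step([{sync, Id} | Rest], Promoted, OnDisk) ->
    %% The sync is NOT skipped, so it expects the message to exist.
    case sets:is_element(Id, OnDisk) of
        true  -> step(Rest, Promoted, OnDisk);
        false -> {error, {not_found, Id}}
    end.

demo() ->
    %% Without a delete, the write lands and the sync finds it.
    ok = run([{write, m1}, {sync, m1}]),
    %% With a promoted delete, the write is dropped but the sync still runs.
    {error, {not_found, m1}} = run([{write, m1}, {sync, m1}, delete]),
    ok.
```

Saved as promoted_delete_sketch.erl and compiled, calling
promoted_delete_sketch:demo() exercises both cases. In this model the
failure goes away if a sync is dropped whenever its write is dropped;
whether the real fix takes that shape is not stated in the thread.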



-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/

