[rabbitmq-discuss] Mnesia Corruption Bug

Thu Jun 13 15:08:39 BST 2013

Hmm, do you still have the Mnesia directory that wouldn't boot? Are you 
able to reproduce this?

Cheers, Simon

On 13/06/13 14:00, Lee Hambley wrote:
> Hi Simon,
>
> Nothing strange of that sort, we use runit to manage the process (in our
> env we need unprivileged users to be able to restart selected services,
> using runit that's as simple as chowning a named pipe).
>
> In case it matters, on STOP runit sends TERM, then waits 7s for the process
> to go away before resorting to sending KILL. (The follow-up KILL is our
> design, but in keeping with runit principles; the 7s timeout is internal
> to runit.)
>
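For readers unfamiliar with this kind of stop sequence (TERM, a grace period, then KILL), here is a minimal Python sketch of the pattern. It is not runit's implementation: the function name `stop_with_grace` and the use of `subprocess` are assumptions made purely for illustration, and only the 7s default comes from the message above.

```python
import signal
import subprocess


def stop_with_grace(proc, grace_seconds=7.0):
    """Ask a child process to stop, escalating to SIGKILL if it does not.

    Hypothetical helper: sends SIGTERM, waits up to grace_seconds for the
    process to exit, and only then sends SIGKILL. Returns the name of the
    last signal that had to be sent ("TERM" or "KILL").
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "TERM"
    except subprocess.TimeoutExpired:
        # SIGKILL cannot be caught, so the process gets no chance to
        # run any further shutdown code once this is sent.
        proc.kill()
        proc.wait()
        return "KILL"
```

A process that exits within the grace period returns "TERM"; one that is still running when the timeout expires is killed without any opportunity to finish its own shutdown.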
> We've no special file system configuration; these machines are i7 with
> RAID spinning disks (not sure what configuration, probably 2 drives).
>
> The hardware is practically new <100h usage, and was burned in and
> stress tested at install time.
>
> Happy to post fstabs, raid logs etc if you tell me what you need (and in
> weird cases, how to get it).
>
> On Thursday, June 13, 2013, Simon MacMullen wrote:
>
>     Hi Lee. I would be interested to know how you got the machine into
>     that state.
>
>     There is a bug with a similar stack trace that will be fixed in the
>     next release - but I don't think it's the same bug. In your case we
>     are seeing a message which has been published and delivered
>     according to the queue index, but only published (and not delivered)
>     according to the queue index's journal. As the journal should always
>     record the same state or newer as the main index, this should be
>     impossible.
>
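The invariant described above (the journal must record the same state as the main index, or a newer one) can be illustrated with a small Python sketch. This is a simplified model, not RabbitMQ's queue index code: the names `STATE_ORDER` and `journal_is_consistent` are hypothetical, and only the two states mentioned in the message are modelled.

```python
# Hypothetical, simplified ordering of per-message states: a message is
# published first and delivered later, so "delivered" is the newer state.
STATE_ORDER = {"published": 0, "delivered": 1}


def journal_is_consistent(index_state, journal_state):
    """Return True if the journal is at least as new as the main index."""
    return STATE_ORDER[journal_state] >= STATE_ORDER[index_state]


# Expected situation: the journal is ahead of (or equal to) the index.
print(journal_is_consistent("published", "delivered"))  # True

# The situation reported in this thread: the index says "delivered"
# while the journal only says "published".
print(journal_is_consistent("delivered", "published"))  # False
```

In this toy model the reported state is exactly the case the check rejects, which is why it is described as something that should be impossible.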
>     So to eliminate obvious causes of weirdness first: are you using an
>     unusual filesystem, or mounting the filesystem with unusual options?
>
>     Cheers, Simon
>
>     On 13/06/13 12:36, Lee Hambley wrote:
>
>         Posting this to the list after some discussion on IRC with bob2351
>         on irc.freenode.net.
>
>         We have a *slightly* strange situation with RabbitMQ: we start it
>         under `runit`, and it effectively believes that it's running in the
>         foreground. I have anecdotal evidence that this causes other
>         problems, but at least not anything that hurts too often (i.e. you
>         lose "persistent messages" in this setup).
>
>         That all aside, attached
>         ( https://gist.github.com/leehambley/5773039 ) is a stack trace
>         from a problematic box. We couldn't get it to recover (single
>         node, single replica, etc.), so we simply deleted the Mnesia
>         database, which worked well enough.
>
>         Some information about our environment:
>
>              $ erl --version
>              Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:0] [kernel-poll:false]
>              $ dpkg --list | grep rabbit
>              ii  rabbitmq-server     3.0.4-1     AMQP server written in Erlang
>              $ sudo RABBITMQ_NODENAME=ourproject rabbitmqctl status
>              Status of node ourproject at carla ...
>              [{pid,8055},
>                {running_applications,
>                    [{rabbitmq_management,"RabbitMQ Management Console","3.0.4"},
>                     {rabbitmq_management_agent,"RabbitMQ Management Agent","3.0.4"},
>                     {rabbit,"RabbitMQ","3.0.4"},
>                     {os_mon,"CPO  CXC 138 46","2.2.7"},
>                     {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.0.4"},
>                     {webmachine,"webmachine","1.9.1-rmq3.0.4-git52e62bc"},
>                     {mochiweb,"MochiMedia Web Server","2.3.1-rmq3.0.4-gitd541e9a"},
>                     {xmerl,"XML parser","1.2.10"},
>                     {inets,"INETS  CXC 138 49","5.7.1"},
>                     {mnesia,"MNESIA  CXC 138 12","4.5"},
>                     {amqp_client,"RabbitMQ AMQP Client","3.0.4"},
>                     {sasl,"SASL  CXC 138 11","2.1.10"},
>                     {stdlib,"ERTS  CXC 138 10","1.17.5"},
>                     {kernel,"ERTS  CXC 138 10","2.14.5"}]},
>                {os,{unix,linux}},
>                {erlang_version,
>                    "Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:30] [kernel-poll:true]\n"},
>                {memory,
>                    [{total,33984216},
>                     {connection_procs,756760},
>                     {queue_procs,325576},
>                     {plugins,218728},
>                     {other_proc,9518440},
>                     {mnesia,93728},
>                     {mgmt_db,148472},
>                     {msg_index,71528},
>                     {other_ets,1145600},
>                     {binary,604208},
>                     {code,17266925},
>                     {atom,1550457},
>                     {other_system,2283794}]},
>                {vm_memory_high_watermark,0.4},
>                {vm_memory_limit,6656894566},
>                {disk_free_limit,1000000000},
>                {disk_free,11247643770880},
>                {file_descriptors,
>                    [{total_limit,924},
>                     {total_used,23},
>                     {sockets_limit,829},
>                     {sockets_used,12}]},
>                {processes,[{limit,1048576},{used,345}]},
>                {run_queue,0},
>                {uptime,2692}]
>              ...done.
>
>
>         I believe this bug is already being tracked internally. I post the
>         report here in the hope that I'll have a place to attach a snapshot
>         of an Mnesia database the next time this happens to us, or that
>         someone else might find this report and be able to contribute.
>         Finally, selfishly, I hope that I'll get notified when this gets
>         fixed, so that I can upgrade and sleep at night again.
>
>         - Lee Hambley
>
>
>         _________________________________________________
>         rabbitmq-discuss mailing list
>         rabbitmq-discuss at lists.rabbitmq.com
>         https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>
>
>
>     --
>     Simon MacMullen
>     RabbitMQ, Pivotal
>
>
>
> --
> Lee Hambley
> --
> http://lee.hambley.name/
> +49 (0) 170 298 5667
>

-- 
Simon MacMullen
RabbitMQ, Pivotal