[rabbitmq-discuss] Mnesia Corruption Bug
Simon MacMullen
simon at rabbitmq.com
Thu Jun 13 15:08:39 BST 2013
Hmm, do you still have the Mnesia directory that wouldn't boot? Are you
able to reproduce this?
Cheers, Simon
On 13/06/13 14:00, Lee Hambley wrote:
> Hi Simon,
>
> Nothing strange of that sort; we use runit to manage the process (in our
> env we need unprivileged users to be able to restart selected services,
> and with runit that's as simple as chowning a named pipe).
>
> In case it matters, on STOP runit sends TERM, then waits 7s for the process
> to go away before resorting to sending KILL. (The follow-up KILL is our
> design, but in keeping with runit principles; the 7s timeout is internal
> to runit.)
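
A minimal sketch of the stop sequence described above (TERM, a grace period, then KILL), written in Python purely for illustration. It is not runit's implementation: the function name `stop_with_grace`, its parameters, and the polling interval are assumptions, and only the 7s default comes from the message. POSIX only.

```python
import signal
import subprocess
import sys
import time


def stop_with_grace(proc, grace_seconds=7.0, poll_interval=0.1):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    `proc` is a subprocess.Popen object. Returns "TERM" if the process
    exited within the grace period, otherwise "KILL".
    """
    proc.send_signal(signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        # poll() reaps the child and returns its exit status once it is gone.
        if proc.poll() is not None:
            return "TERM"
        time.sleep(poll_interval)
    # Grace period expired: SIGKILL cannot be caught or ignored.
    proc.kill()
    proc.wait()
    return "KILL"
```

If the broker ever needs longer than the grace period to shut down, the KILL ends it without a chance to finish writing its state, so the "KILL" path may be worth ruling out when looking at an inconsistent on-disk index.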
>
> We've no special file system configuration; these machines are i7s with
> RAID spinning disks (not sure what configuration, probably 2 drives).
>
> The hardware is practically new <100h usage, and was burned in and
> stress tested at install time.
>
> Happy to post fstabs, raid logs etc if you tell me what you need (and in
> weird cases, how to get it).
>
> On Thursday, June 13, 2013, Simon MacMullen wrote:
>
> Hi Lee. I would be interested to know how you got the machine into
> that state.
>
> There is a bug with a similar stack trace that will be fixed in the
> next release - but I don't think it's the same bug. In your case we
> are seeing a message which has been published and delivered
> according to the queue index, but only published (and not delivered)
> according to the queue index's journal. As the journal should always
> record the same state as the main index, or a newer one, this should be
> impossible.
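
The invariant described above can be sketched as a small check. This is an illustration in Python, not RabbitMQ's on-disk format or code: `MsgState`, `journal_violations`, and the dict-based index and journal are hypothetical, and only the two states named in the message are modelled. Entries absent from the journal are skipped, which is an assumption of the sketch.

```python
from enum import IntEnum


class MsgState(IntEnum):
    """Hypothetical per-message states, ordered from older to newer."""
    PUBLISHED = 1
    DELIVERED = 2


def journal_violations(index, journal):
    """Return the sequence ids whose journal state is older than the index state.

    `index` and `journal` map a sequence id to a MsgState. The expectation is
    that the journal is never behind the main index, so any id returned here
    is in the "should be impossible" situation.
    """
    return sorted(
        seq_id
        for seq_id, index_state in index.items()
        if seq_id in journal and journal[seq_id] < index_state
    )


# Message 1 mirrors the reported case: delivered per the index,
# but only published per the journal.
index = {1: MsgState.DELIVERED, 2: MsgState.PUBLISHED}
journal = {1: MsgState.PUBLISHED, 2: MsgState.DELIVERED}
print(journal_violations(index, journal))  # prints [1]
```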
>
> So to eliminate obvious causes of weirdness first: are you using an
> unusual filesystem, or mounting the filesystem with unusual options?
>
> Cheers, Simon
>
> On 13/06/13 12:36, Lee Hambley wrote:
>
> Posting this to the list after some discussion on IRC with bob2351 on
> irc.freenode.net.
>
> We have a *slightly* strange situation with RabbitMQ: we start it
> under `runit`, and it effectively believes that it's running in the
> foreground. I have anecdotal evidence that this causes other problems,
> but at least not anything that hurts too often (i.e. you lose
> "persistent messages" in this setup).
>
> That all aside, attached ( https://gist.github.com/leehambley/5773039 )
> is a stacktrace from a problematic box. We couldn't get it to recover
> (single node, single replica, etc.), so we simply deleted the mnesia
> database, which worked well enough.
>
> Some information about our environment:
>
> $ erl --version
> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:0] [kernel-poll:false]
> $ dpkg --list | grep rabbit
> ii rabbitmq-server 3.0.4-1 AMQP server written in Erlang
> $ sudo RABBITMQ_NODENAME=ourproject rabbitmqctl status
> Status of node ourproject@carla ...
> [{pid,8055},
>  {running_applications,
>      [{rabbitmq_management,"RabbitMQ Management Console","3.0.4"},
>       {rabbitmq_management_agent,"RabbitMQ Management Agent","3.0.4"},
>       {rabbit,"RabbitMQ","3.0.4"},
>       {os_mon,"CPO CXC 138 46","2.2.7"},
>       {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.0.4"},
>       {webmachine,"webmachine","1.9.1-rmq3.0.4-git52e62bc"},
>       {mochiweb,"MochiMedia Web Server","2.3.1-rmq3.0.4-gitd541e9a"},
>       {xmerl,"XML parser","1.2.10"},
>       {inets,"INETS CXC 138 49","5.7.1"},
>       {mnesia,"MNESIA CXC 138 12","4.5"},
>       {amqp_client,"RabbitMQ AMQP Client","3.0.4"},
>       {sasl,"SASL CXC 138 11","2.1.10"},
>       {stdlib,"ERTS CXC 138 10","1.17.5"},
>       {kernel,"ERTS CXC 138 10","2.14.5"}]},
>  {os,{unix,linux}},
>  {erlang_version,
>      "Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:30] [kernel-poll:true]\n"},
>  {memory,
>      [{total,33984216},
>       {connection_procs,756760},
>       {queue_procs,325576},
>       {plugins,218728},
>       {other_proc,9518440},
>       {mnesia,93728},
>       {mgmt_db,148472},
>       {msg_index,71528},
>       {other_ets,1145600},
>       {binary,604208},
>       {code,17266925},
>       {atom,1550457},
>       {other_system,2283794}]},
>  {vm_memory_high_watermark,0.4},
>  {vm_memory_limit,6656894566},
>  {disk_free_limit,1000000000},
>  {disk_free,11247643770880},
>  {file_descriptors,
>      [{total_limit,924},
>       {total_used,23},
>       {sockets_limit,829},
>       {sockets_used,12}]},
>  {processes,[{limit,1048576},{used,345}]},
>  {run_queue,0},
>  {uptime,2692}]
> ...done.
>
>
> I believe this bug is already being tracked internally, and I post the
> report here in the hope that I'll have a place to attach a snapshot of
> an mnesia database the next time this happens to us, or that someone
> else might find this report and be able to contribute. Finally,
> selfishly, in the hope that I'll get notified when this gets fixed, and
> I upgrade, and sleep at night again.
>
> - Lee Hambley
>
>
> _________________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>
>
>
> --
> Simon MacMullen
> RabbitMQ, Pivotal
>
>
>
> --
> Lee Hambley
> --
> http://lee.hambley.name/
> +49 (0) 170 298 5667
>
--
Simon MacMullen
RabbitMQ, Pivotal