[rabbitmq-discuss] Missing Durable Queues on Startup

Steve Powell steve at rabbitmq.com
Tue May 8 15:33:01 BST 2012


Hi Chris,

I'm no expert on setups like these, but, based on other
questions and answers on this mailing list, I would want to know:

a) are all the nodes in your cluster disk nodes?
b) was the node that you stopped and restarted a disk node or a ram node?
c) when the cluster was restarted did you start all the disk nodes
   first?
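
For (a) and (b), running 'rabbitmqctl status' on a node reports which
cluster members are disc nodes, which are ram nodes, and which are
currently running, so it should answer both questions (newer releases
also have 'rabbitmqctl cluster_status'). A trimmed sketch of the relevant
part of the output: the disc node names are taken from your logs, and the
ram node is invented purely for illustration.

    $ rabbitmqctl status
    ...
    {nodes,[{disc,[rabbit1@host1,rabbit2@host2]},
            {ram,[rabbit3@host5]}]},
    {running_nodes,[rabbit2@host2,rabbit1@host1]},
    ...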

When you say:

> We’re not using HA in Rabbit itself; the queues are just
> persistent and durable on each node in the cluster.

can you be more precise? I assumed that the node you restarted was the
one that held the queues you were missing.
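
Independent of that, after each restart it may be worth asking the broker
which durable queues it has actually recovered, e.g. by listing the queues
in the vhost that appears in your logs ("/my_app"). The queue names below
are made up, just to show the shape of the output:

    $ rabbitmqctl -p /my_app list_queues name durable
    Listing queues ...
    data.queue      true
    service.queue   true
    ...done.

Comparing that list right after start_app with what you expect to see on
that node would make missing queues visible straight away, rather than a
few hours later.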

You appear to be running version 2.5.1. Clustering has had some fixes
applied to it since then.

I would recommend upgrading to the latest release.

Steve Powell  (a happy bunny)
----------some more definitions from the SPD----------
chinchilla (n.) Cooling device for the lower jaw.
socialcast (n.) Someone to whom everyone is speaking but nobody likes.
literacy (n.) A textually transmitted disease usually contracted in childhood.

On 8 May 2012, at 01:45, Chris Larsen wrote:

> Hello, we ran into an odd situation today where RabbitMQ seemed to start properly but didn't load most of the durable queues from Mnesia. Running stop_app and then start_app brought back some of the queues, but not all. A few hours later, after we found out that not all of the queues had been restored, running stop_app and then start_app once more brought the rest of the queues online. Has anyone run into a similar situation?
>  
> Here are some notes about our setup, with a few log entries below. We have 6 machines in the cluster, split into pairs running DRBD and Pacemaker for failover. A glitchy switch caused one of these pairs to split-brain, and both MQ resources wound up on the same physical host. DRBD seemed to be fine, and it was only after we resolved the split-brain that we noticed the missing queues. There weren’t any errors in the startup_log or startup_err files. We’re not using HA in Rabbit itself; the queues are just persistent and durable on each node in the cluster.
>  
> We had a number of messages in the SASL logs with “nodedown”, so I wonder whether the MQ instances simply didn’t join the cluster properly the first couple of times but finally did on the last try. I didn’t check the status of the nodes in the cluster on each node (as suggested elsewhere) in between restarts, but I’ll give that a try if it happens again. Thanks for your help!
>  
> RabbitMQ 2.5.1
> Erlang R13B03
> Ubuntu Server 64bit 2.6.38-10
> drbd 8.3.7
>  
> =ERROR REPORT==== 7-May-2012::10:11:30 ===
> Mnesia('rabbit2@host2'): ** ERROR ** Mnesia on 'rabbit2@host2' could not connect to node(s) ['rabbit1@host1']
>  
> =INFO REPORT==== 7-May-2012::10:11:30 ===
> Limiting to approx 32668 file handles (29399 sockets)
>  
> =INFO REPORT==== 7-May-2012::10:12:46 ===
> msg_store_transient: using rabbit_msg_store_ets_index to provide index
>  
> =INFO REPORT==== 7-May-2012::10:12:46 ===
> msg_store_persistent: using rabbit_msg_store_ets_index to provide index
>  
> =WARNING REPORT==== 7-May-2012::10:12:46 ===
> msg_store_persistent: rebuilding indices from scratch
>  
> =INFO REPORT==== 7-May-2012::10:12:46 ===
> started TCP Listener on 192.168.1.1:5672
>  
> =ERROR REPORT==== 7-May-2012::10:13:42 ===
> Mnesia('rabbit2@host2'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit1@host3'}
>  
> =ERROR REPORT==== 7-May-2012::10:13:42 ===
> Mnesia('rabbit2@host2'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit2@host4'}
>  
> =SUPERVISOR REPORT==== 7-May-2012::10:13:32 ===
>      Supervisor: {<0.11398.2442>,rabbit_channel_sup}
>      Context:    shutdown
>      Reason:     reached_max_restart_intensity
>      Offender:   [{pid,<0.11400.2442>},
>                   {name,channel},
>                   {mfa,
>                       {rabbit_channel,start_link,
>                           [1,<0.11368.2442>,<0.11399.2442>,<0.11368.2442>,
>                            rabbit_framing_amqp_0_9_1,
>                            {user,<<"my_app">>,true,
>                                rabbit_auth_backend_internal,
>                                {internal_user,<<"my_app">>,
>                                    <<199,64,175,52,127,65,248,9,70,171,15,9,5,
>                                      122,73,4,195,147,238,67>>,
>                                    true}},
>                            <<"/my_app">>,[],<0.11366.2442>,
>                            #Fun<rabbit_channel_sup.0.15412730>]}},
>                   {restart_type,intrinsic},
>                   {shutdown,4294967295},
>                   {child_type,worker}]
>  
>  
> =CRASH REPORT==== 7-May-2012::10:13:33 ===
>   crasher:
>     initial call: gen:init_it/6
>     pid: <0.25562.2442>
>     registered_name: []
>     exception exit: {{badmatch,
>                          {error,
>                              [{<7748.8396.531>,
>                                {exit,
>                                    {nodedown,'rabbit1@host1'},
>                                    []}}]}},
>                      [{rabbit_channel,terminate,2},
>                       {gen_server2,terminate,3},
>                       {proc_lib,wake_up,3}]}
>       in function  gen_server2:terminate/3
>     ancestors: [<0.25560.2442>,<0.25544.2442>,<0.25542.2442>,
>                   rabbit_tcp_client_sup,rabbit_sup,<0.124.0>]
>     messages: []
>     links: [<0.25560.2442>]
>     dictionary: [{{exchange_stats,
>                        {resource,<<"/my_app">>,exchange,
>                            <<"service.exchange">>}},
>                    [{confirm,6},{publish,6}]},
>                   {{queue_exchange_stats,
>                        {<0.253.0>,
>                         {resource,<<"/my_app">>,exchange,
>                             <<"data.exchange">>}}},
>                    [{confirm,6},{publish,6}]},
>                   {delegate,delegate_4},
>                   {{monitoring,<0.253.0>},true},
>                   {{exchange_stats,
>                        {resource,<<"/my_app">>,exchange,
>                            <<"data.exchange">>}},
>                    [{confirm,6},{publish,6}]},
>                   {guid,{{11,<0.25562.2442>},11}}]
>     trap_exit: true
>     status: running
>     heap_size: 987
>     stack_size: 24
>     reductions: 11357
>   neighbours:
>  
> =SUPERVISOR REPORT==== 7-May-2012::10:13:33 ===
>      Supervisor: {<0.25560.2442>,rabbit_channel_sup}
>      Context:    child_terminated
>      Reason:     {{badmatch,
>                       {error,
>                           [{<7748.8396.531>,
>                             {exit,{nodedown,'rabbit1@host1'},[]}}]}},
>                   [{rabbit_channel,terminate,2},
>                    {gen_server2,terminate,3},
>                    {proc_lib,wake_up,3}]}
>      Offender:   [{pid,<0.25562.2442>},
>                   {name,channel},
>                   {mfa,
>                       {rabbit_channel,start_link,
>                           [1,<0.25545.2442>,<0.25561.2442>,<0.25545.2442>,
>                            rabbit_framing_amqp_0_9_1,
>                            {user,<<"my_app">>,true,
>                                rabbit_auth_backend_internal,
>                                {internal_user,<<"my_app">>,
>                                    <<199,64,175,52,127,65,248,9,70,171,15,9,5,
>                                      122,73,4,195,147,238,67>>,
>                                    true}},
>                            <<"/my_app">>,[],<0.25543.2442>,
>                            #Fun<rabbit_channel_sup.0.15412730>]}},
>                   {restart_type,intrinsic},
>                   {shutdown,4294967295},
>                   {child_type,worker}]
>  
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss


