[rabbitmq-discuss] Upgrade fail

Peter Kopias kopias.peter at gmail.com
Wed Apr 2 20:19:48 BST 2014


Hi.

 I've tried to upgrade my cluster to 3.3.0.

 Disclosure: we had already seen some strange artifacts of (possibly mnesia)
failures: a few queues showed up in listings, but I could never remove them
or even query their properties. I hoped the upgrade would fix this.

 Stop node2, node3, then node1 (node1 should be the last to stop and the
first to start).

 node1: apt-get upgrade
 node1: service rabbitmq-server start
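
 For reference, the full sequence was roughly this (the ssh wrapping is just
how I drive the boxes from a management host, so treat that part as an
assumption):

  # stop in reverse order: node1 has to be the last disc node to go down,
  # because the last node to stop should be the first to start
  ssh node3 service rabbitmq-server stop
  ssh node2 service rabbitmq-server stop
  ssh node1 service rabbitmq-server stop

  # upgrade and restart node1 first
  ssh node1 apt-get update
  ssh node1 apt-get upgrade
  ssh node1 service rabbitmq-server start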


/var/log/rabbitmq/startup_log

              RabbitMQ 3.3.0. Copyright (C) 2007-2013 GoPivotal, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit@node1.log
  ######  ##        /var/log/rabbitmq/rabbit@node1-sasl.log
  ##########
              Starting broker...dets: file
"/var/lib/rabbitmq/mnesia/rabbit at node1/schema.DAT" not properly closed,
repairing ...


BOOT FAILED
===========

Timeout contacting cluster nodes: [rabbit@node3,rabbit@node2].

DIAGNOSTICS
===========

attempted to contact: [rabbit@node3,rabbit@node2]

rabbit@node3:
  * found rabbit (port 25672)
  * TCP connection succeeded
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?
rabbit@node2:
  * found rabbit (port 25672)
  * TCP connection succeeded
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?

current node details:
- node name: rabbit@node1
- home dir: /var/lib/rabbitmq
- cookie hash: XSquoZCiCPIfZlTilwy1Dg==

BOOT FAILED
===========

Error description:
   {could_not_start,rabbit,
       {bad_return,
           {{rabbit,start,[normal,[]]},
            {'EXIT',
                {rabbit,failure_during_boot,
                    {error,
                        {timeout_waiting_for_tables,
                            [rabbit_listener,rabbit_semi_durable_route,
                             rabbit_route,rabbit_reverse_route,
                             rabbit_topic_trie_node,rabbit_topic_trie_edge,
                             rabbit_topic_trie_binding,rabbit_exchange,
                             rabbit_exchange_serial,rabbit_queue,gm_group,
                             mirrored_sup_childspec]}}}}}}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit@node1.log
   /var/log/rabbitmq/rabbit@node1-sasl.log

{"init terminating in
do_boot",{rabbit,failure_during_boot,{could_not_start,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot,{error,{timeout_waiting_for_tables,[ra
bbit_listener,rabbit_semi_durable_route,rabbit_route,rabbit_reverse_route,rabbit_topic_trie_node,rabbit_topic_trie_edge,rabbit_topic_trie_binding,rabbit_exchange,rabbit_exchange_serial,rabbit_queue
,gm_group,mirrored_sup_childspec]}}}}}}}}}



WHY is it waiting for node2 and node3, if node1 was the last to stop? It
should be able to come online on its own, without them.

The cookie is the same, the hostnames are correct.
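
 (Roughly how I checked; the cookie path is the one our Debian packages use
for the rabbitmq user, so treat that as an assumption:)

  # the cookie must be identical on all three nodes
  md5sum /var/lib/rabbitmq/.erlang.cookie

  # short hostnames must resolve consistently between the nodes
  hostname -s
  getent hosts node1 node2 node3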

Although:

/var/log/rabbitmq/startup_err
Crash dump was written to: erl_crash.dump
init terminating in do_boot ()


/var/log/rabbitmq/rabbit@node1-sasl.log

=CRASH REPORT==== 2-Apr-2014::19:51:57 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.118.0>
    registered_name: []
    exception exit: {bad_return,
                     {{rabbit,start,[normal,[]]},
                      {'EXIT',
                       {rabbit,failure_during_boot,
                        {error,
                         {timeout_waiting_for_tables,
                          [rabbit_listener,rabbit_semi_durable_route,
                           rabbit_route,rabbit_reverse_route,
                           rabbit_topic_trie_node,rabbit_topic_trie_edge,
                           rabbit_topic_trie_binding,rabbit_exchange,
                           rabbit_exchange_serial,rabbit_queue,gm_group,
                           mirrored_sup_childspec]}}}}}}
      in function  application_master:init/4 (application_master.erl, line
133)
    ancestors: [<0.117.0>]
    messages: [{'EXIT',<0.119.0>,normal}]
    links: [<0.117.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 4185
    stack_size: 27
    reductions: 322
  neighbours:

rabbitmq.config:

[{rabbit, [{vm_memory_high_watermark, 0.45},
           {vm_memory_high_watermark, 0.5},
           {disk_free_limit, 10000000000},
           {hipe_compile, false}]}].


erl_crash.dump

=erl_crash_dump:0.2
Wed Apr  2 19:51:58 2014
Slogan: init terminating in do_boot ()
System version: Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:4:4]
[async-threads:30] [hipe] [kernel-poll:true]
Compiled: Wed Jul 10 10:49:45 2013
Taints:
Atoms: 22590
=memory
total: 41536280
processes: 14801392
processes_used: 14801320
system: 26734888
atom: 703377
atom_used: 673775
binary: 1330608
code: 19294518
ets: 859648
=hash_table:atom_tab
size: 19289
used: 13349
objs: 22590
depth: 8
=index_table:atom_tab
size: 23552
limit: 1048576
entries: 22590
=hash_table:module_code
size: 397
used: 288
objs: 539
depth: 5
=index_table:module_code
size: 1024
limit: 65536
entries: 539
=hash_table:export_list
size: 6421
used: 4793
objs: 8988
depth: 6
=index_table:export_list
size: 9216
limit: 524288
entries: 8988
=hash_table:export_list
size: 6421
used: 4788
objs: 8956
depth: 6
=hash_table:process_reg
size: 47
used: 24
objs: 31
depth: 2
=hash_table:fun_table
size: 3203
used: 2189
objs: 4009
depth: 6
=hash_table:node_table
size: 5
used: 3
objs: 4
depth: 2
=hash_table:dist_table
size: 5
used: 3
objs: 4
depth: 2
=allocated_areas
sys_misc: 126685
static: 524928
atom_space: 360536 330934
atom_table: 342841
module_table: 187920
export_table: 375636
export_list: 1581888
register_table: 468
fun_table: 25714
module_refs: 25824
loaded_code: 17097536
dist_table: 99875
node_table: 323
bits_bufs_size: 0
bif_timer: 80200
link_lh: 208
process_table: 12582912
port_table: 786432
ets_misc: 49192
=allocator:sys_alloc
option e: true
option m: libc
option tt: 131072
option tp: 0
=allocator:temp_alloc[0]
versions: 2.1 3.0
option e: true
option t: option acul: %d

option ramv: 5
option sbct: 5865745
option asbcst: 524288
option rsbcst: 4145152
option rsbcmt: 90
option rmbcmt: 80
option mmbcs: 100
option mmsbc: 131072
option mmmbc: 256
option lmbcs: 18446744073709551615
option smbcs: 10485760
option mbcgs: 1048576
option mbsd: 3
option as: gf
mbcs blocks: 0 4 4
mbcs blocks size: 0 9848 9848
mbcs carriers: 1 1 1
mbcs mseg carriers: 1
mbcs sys_alloc carriers: 0
mbcs carriers size: 1048576 1048576 1048576
mbcs mseg carriers size: 1048576
mbcs sys_alloc carriers size: 0
sbcs blocks: 0 0 0
sbcs blocks size: 0 0 0
sbcs carriers: 0 0 0
sbcs mseg carriers: 0
sbcs sys_alloc carriers: 0
sbcs carriers size: 0 0 0
sbcs mseg carriers size: 0
sbcs sys_alloc carriers size: 0
temp_alloc calls: 8506
temp_free calls: 8506
temp_realloc calls: 1
mseg_alloc calls: 1
mseg_dealloc calls: 0
mseg_realloc calls: 0
sys_alloc calls: 0
sys_free calls: 0
sys_realloc calls: 0
=allocator:temp_alloc[1]
versions: 0.9 3.0
option e: true
option t: option acul: %d

option ramv: 5
option sbct: 5865745
option asbcst: 524288
option rsbcst: 4145152
option rsbcmt: 90
option rmbcmt: 80
option mmbcs: 100
option mmsbc: 131072
option mmmbc: 256
......

....

=allocator:alloc_util
option mmc: 1024
option ycs: 1048576
=allocator:instr
option m: false
option s: false
option t: false
=proc:<0.0.0>
State: Running
Name: init
Spawned as: otp_ring0:start/2
Spawned by: []
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.3.0>, <0.7.0>, <0.6.0>]
Reductions: 57344
Stack+heap: 6772
OldHeap: 6772
Heap unused: 3114
OldHeap unused: 6325
Memory: 109264
Program counter: 0x00007f94178c6f80 (init:boot_loop/2 + 64)
CP: 0x0000000000000000 (invalid)
=proc:<0.3.0>
State: Waiting
Name: erl_prim_loader
Spawned as: erlang:apply/2
Spawned by: <0.2.0>
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.0.0>, #Port<0.0>]
Reductions: 3093307
Stack+heap: 1598
OldHeap: 4185
Heap unused: 945
OldHeap unused: 2668
Memory: 47136
Program counter: 0x00007f94179369b8 (erl_prim_loader:loop/3 + 176)
CP: 0x0000000000000000 (invalid)
arity = 0
=proc:<0.6.0>
State: Waiting
Name: error_logger
Spawned as: proc_lib:init_p/5
Spawned by: <0.2.0>
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.30.0>, <0.0.0>, {from,<0.46.0>,#Ref<0.0.0.64>},
{from,<0.47.0>,#Ref<0.0.0.69>}]
Reductions: 38274
Stack+heap: 4185
OldHeap: 6772
Heap unused: 2304
OldHeap unused: 5548
Memory: 88776
Program counter: 0x00007f9417962c50 (gen_event:fetch_msg/5 + 80)
CP: 0x0000000000000000 (invalid)
arity = 0
=proc:<0.7.0>
State: Waiting
Name: application_controller
Spawned as: erlang:apply/2
Spawned by: <0.2.0>
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.0.0>, <0.35.0>, <0.9.0>]
Reductions: 208958
Stack+heap: 6772
OldHeap: 28690
Heap unused: 3968
OldHeap unused: 19495
Memory: 284712
Program counter: 0x00007f941638a1c8 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0
=proc:<0.9.0>
State: Waiting
Spawned as: proc_lib:init_p/5
Spawned by: <0.8.0>
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.10.0>, <0.7.0>]
Reductions: 44
Stack+heap: 376
OldHeap: 376
Heap unused: 335
OldHeap unused: 171
Memory: 6992
Program counter: 0x00007f941639b430 (application_master:main_loop/2 + 64)
CP: 0x0000000000000000 (invalid)
arity = 0
=proc:<0.10.0>
State: Waiting
Spawned as: application_master:start_it/4
Spawned by: <0.9.0>
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.11.0>, <0.9.0>]
Reductions: 69
Stack+heap: 233
OldHeap: 0
Heap unused: 86
OldHeap unused: 0
Memory: 2736
Program counter: 0x00007f941639d720 (application_master:loop_it/4 + 72)
CP: 0x0000000000000000 (invalid)
arity = 0
=proc:<0.11.0>
State: Waiting
Name: kernel_sup
Spawned as: proc_lib:init_p/5
Spawned by: <0.10.0>
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.10.0>, <0.13.0>, <0.12.0>, <0.18.0>, <0.25.0>, <0.24.0>,
<0.16.0>, <0.27.0>, <0.32.0>, <0.31.0>, <0.29.0>, <0.26.0>]
Reductions: 52300
Stack+heap: 6772
OldHeap: 28690
Heap unused: 3529
OldHeap unused: 24822
Memory: 285072
Program counter: 0x00007f941638a1c8 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0
=proc:<0.12.0>
State: Waiting
Name: rex
Spawned as: proc_lib:init_p/5
Spawned by: <0.11.0>
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.11.0>, {from,<22585.2.0>,#Ref<22585.0.0.50>},
{to,<0.155.0>,#Ref<0.0.0.767>}]
Reductions: 858
Stack+heap: 2586
OldHeap: 376
Heap unused: 1188
OldHeap unused: 350
Memory: 24824
Program counter: 0x00007f941638a1c8 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0
=proc:<0.13.0>
State: Waiting
Name: global_name_server
Spawned as: proc_lib:init_p/5
Spawned by: <0.11.0>
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.11.0>, <0.15.0>, <0.14.0>]
Reductions: 757
Stack+heap: 610
OldHeap: 376
Heap unused: 353
OldHeap unused: 280
Memory: 8904
Program counter: 0x00007f941638a1c8 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0
=proc:<0.14.0>
State: Waiting
Spawned as: erlang:apply/2
Spawned by: <0.13.0>
Started: Wed Apr  2 19:51:25 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.13.0>]
Reductions: 1023
Stack+heap: 1598
OldHeap: 376
Heap unused: 1088
OldHeap unused: 308





 I can't start node2 and node3, as they were stopped before node1, so they
won't start until it comes back up or the locks get removed.
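
 The only workaround I can think of is 3.3.0's new force_boot, assuming it
really does let a node start without waiting for the rest of the cluster
(I haven't dared to try it on the surviving data yet):

  # on node2, while rabbit is stopped: mark the node so that it boots
  # without waiting for the other cluster members, then start it
  rabbitmqctl force_boot
  service rabbitmq-server start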

 So currently I've got a three-legged cluster down on its six knees.

---

 I'm trying to start node2 and node3 so that at least one node is running,
but they always fail, even after copying back the previous disk state and
upgrading again...

 Not a single node is running at the moment, so it looks like I have to do
the following (roughly the sequence sketched below):
 - reinstall each rabbitmq-server with a clean /var/lib/rabbitmq dir
 - rebuild the cluster
 - rebuild the virtual hosts
 - recreate all users and permissions
 - recreate all queues, exchanges and policies
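
 Something along these lines, I suppose; the vhost, user and policy names
below are just placeholders, and the queues and exchanges would have to be
redeclared by the applications:

  # on node2 and node3, after reinstalling with a clean /var/lib/rabbitmq:
  # join them back to node1
  rabbitmqctl stop_app
  rabbitmqctl join_cluster rabbit@node1
  rabbitmqctl start_app

  # then recreate vhosts, users, permissions and the HA policy, e.g.:
  rabbitmqctl add_vhost myvhost
  rabbitmqctl add_user myuser mypassword
  rabbitmqctl set_permissions -p myvhost myuser ".*" ".*" ".*"
  rabbitmqctl set_policy -p myvhost ha-all "^" '{"ha-mode":"all"}'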

 That's not the way an upgrade should happen. :(


 Peter