[rabbitmq-discuss] Strange subtract_acks crash in RabbitMQ 3.1.0-1 (case_clause empty) on EC2
Karl Rieb
karl.rieb at gmail.com
Sat May 18 04:20:01 BST 2013
Hi,
I am running a RabbitMQ cluster of 4 nodes in AWS EC2 on *c1.xlarge* instances
using Ubuntu 12.04 LTS (kernel 3.2.0-38). Periodically two of my nodes
holding the largest queues will crash and require me to restart the
rabbitmq-server service. Looking at the logs, I'm at a loss as to why the
crash is occurring. I have tried reproducing the crash manually by
flooding my nodes with messages, but that doesn't seem to trigger the
issue. Below is output from various RabbitMQ logs:
# cat /var/log/rabbitmq/rabbit@ip-10-123-123-123-sasl.log
> =CRASH REPORT==== 18-May-2013::01:41:51 ===
> crasher:
> initial call: gen:init_it/6
> pid: <0.301.0>
> registered_name: []
> exception exit: {{case_clause,{empty,{[],[]}}},
> [{rabbit_amqqueue_process,subtract_acks,3,[]},
> {rabbit_amqqueue_process,subtract_acks,4,[]},
> {rabbit_amqqueue_process,handle_cast,2,[]},
> {gen_server2,handle_msg,2,[]},
> {proc_lib,wake_up,3,
> [{file,"proc_lib.erl"},{line,249}]}]}
> in function gen_server2:terminate/3
> ancestors: [rabbit_amqqueue_sup,rabbit_sup,<0.207.0>]
> messages: [{'$gen_cast',{resume,<0.676.0>}},
> {'$gen_cast',{ack,[27583,27606,27629,27652,27675,27698,
> 27721,27744,27767,27790,27813,27836,
> 27859,27882,27905,27928,27951,27974,
> 27997,28020,28043,28066,28089,28112,
> 28135,28158,28179,28200,28221,28242,
> 28263,28284,28305,28324,28342,28360,
> 28378,28396,28414,28432,28450,28468,
> 28486,28504,28522,28540,28558,28576,
> 28594,28612],
> <0.737.0>}},
> {'$gen_cast',{resume,<0.737.0>}},
> {'$gen_cast',{ack,[27608,27631,27654,27677,27700,27723,
> 27746,27769,27792,27815,27838,27861,
> 27884,27907,27930,27953,27976,27999,
> 28022,28045,28068,28091,28114,28137,
> 28160,28181,28202,28223,28244,28265,
> 28286,28307,28326,28344,28362,28380,
> 28398,28416,28434,28452,28470,28488,
> 28506,28524,28542,28560,28578,28596,
> 28614],
> <0.676.0>}},
> {'$gen_cast',{ack,[27578,27601,27624,27647,27670,27693,
> 27716,27739,27762,27785,27808,27831,
> 27854,27877,27900,27923,27946,27969,
> 27992,28015,28038,28061,28084,28107,
> 28130,28153,28174,28195,28216,28237,
> 28258,28279,28300,28321,28340,28358,
> 28376,28394,28412,28430,28448,28466,
> 28484,28502,28520,28538,28556,28574,
> 28592,28610],
> <0.716.0>}},
> {'$gen_cast',{resume,<0.716.0>}},
> {'$gen_cast',{ack,[27582,27605,27628,27651,27674,27697,
> 27720,27743,27766,27789,27812,27835,
> 27858,27881,27904,27927,27950,27973,
> 27996,28019,28042,28065,28088,28111,
> 28134,28157,28178,28199,28220,28241,
> 28262,28283,28304,28323,28341,28359,
> 28377,28395,28413,28431,28449,28467,
> 28485,28503,28521,28539,28557,28575,
> 28593,28611],
> <0.737.0>}}]
> links: [<0.300.0>]
> dictionary: [{{ch,<0.729.0>},
> {cr,<0.729.0>,#Ref<0.0.0.12296>,
> {[28715,28696,28677,28658,28639,28620,28602,28584,
> 28566,28548,28530,28512,28494,28476,28458,28440,
> 28422,28404,28386,28368,28350,28332,28313,28292,
> 28271,28250,28229,28208,28187,28166,28143,28120,
> 28097,28074,28051,28028,28005,27982,27959,27936,
> 27913,27890,27867,27844,27821,27798,27775,27752,
> 27729,27706,27683,27660,27637,27614],
> [27591]},
> 1,
> {[],[]},
> {qstate,<0.728.0>,active,{0,nil}},
> 11}},
> {{credit_to,<7350.877.0>},22},
> {{ch,<0.608.0>},
> {cr,<0.608.0>,#Ref<0.0.0.9041>,
> {[28708,28689,28670,28651,28632,28607,28589,28571,
> 28553,28535,28517,28499,28481,28463,28445,28427,
> 28409,28391,28373,28355,28337,28318,28297,28276,
> 28255,28234,28213,28192,28171,28150,28127,28104,
> 28081,28058,28035,28012,27989,27966,27943,27920,
> 27897,27874,27851,27828,27805,27782,27759,27736,
> 27713,27690,27667,27644,27621,27598],
> [27575]},
> 1,
> {[],[]},
> {qstate,<0.607.0>,active,{0,nil}},
> 38}},
> {{ch,<0.624.0>},
> {cr,<0.624.0>,#Ref<0.0.0.9436>,
> {[28722,28703,28684,28665,28646],[28627]},
> 1,
> {[],[]},
> {qstate,<0.623.0>,active,{0,nil}},
> 36}},
> {{ch,<0.696.0>},
> {cr,<0.696.0>,#Ref<0.0.0.11414>,
> {[],[]},
> 2,
> {[],[]},
> {qstate,<0.695.0>,active,{0,nil}},
> 24}},
> {{ch,<0.564.0>},
> {cr,<0.564.0>,#Ref<0.0.0.7285>,
> {[28718,28699,28680,28661,28642],[28623]},
> 1,
> {[],[]},
> {qstate,<0.563.0>,active,{0,nil}},
> 12}},
> {{ch,<0.644.0>},
> {cr,<0.644.0>,#Ref<0.0.0.10071>,
> {[28713,28694,28675,28656,28637,28618,28600,28582,
> 28564,28546,28528,28510,28492,28474,28456,28438,
> 28420,28402,28384,28366,28348,28330,28311,28290,
> 28269,28248,28227,28206,28185,28164,28141,28118,
> 28095,28072,28049,28026,28003,27980,27957,27934,
> 27911,27888,27865,27842,27819,27796,27773,27750,
> 27727,27704,27681,27658,27635,27612],
> [27589]},
> 1,
> {[],[]},
> {qstate,<0.643.0>,active,{0,nil}},
> 11}},
> {{ch,<0.676.0>},
> {cr,<0.676.0>,#Ref<0.0.0.11309>,
> {[28614,28613,28596,28595,28578,28577,28560,28559,
> 28542,28541,28524,28523,28506,28505,28488,28487,
> 28470,28469,28452,28451,28434,28433,28416,28415,
> 28398,28397,28380,28379,28362,28361,28344,28343,
> 28326,28325,28307,28306,28286,28285,28265,28264,
> 28244,28243,28223,28222,28202,28201,28181,28180,
> 28160,28159,28137,28136,28114,28113,28091,28090,
> 28068,28067,28045,28044,28022,28021,27999,27998,
> 27976,27975,27953,27952,27930,27929,27907,27906,
> 27884,27883,27861,27860,27838,27837,27815,27814,
> 27792,27791,27769,27768,27746,27745,27723,27722,
> 27700,27699,27677,27676,27654,27653,27631,27630,
> 27608,27607,27585],
> [27584]},
> 2,
> {[{<0.676.0>,
> {consumer,<<"amq.ctag-0cvDGmyaAbEW_j15AfhGHQ">>,
> true}}],
> [{<0.676.0>,
> {consumer,<<"amq.ctag-YqU04SIfQv6P-QVueEjbRg">>,
> true}}]},
> {qstate,<0.675.0>,suspended,{0,nil}},
> 25}},
> {{xtype_to_module,direct},rabbit_exchange_type_direct},
> {{credit_to,<7350.833.0>},31},
> {{ch,<0.745.0>},
> {cr,<0.745.0>,#Ref<0.0.0.12674>,
> {[28716,28697,28678,28659,28640,28621,28603,28585,
> 28567,28549,28531,28513,28495,28477,28459,28441,
> 28423,28405,28387,28369,28351,28333,28314,28293,
> 28272,28251,28230,28209,28188,28167,28144,28121,
> 28098,28075,28052,28029,28006,27983,27960,27937,
> 27914,27891,27868,27845,27822,27799,27776,27753,
> 27730,27707,27684,27661,27638,27615],
> [27592]},
> 1,
> {[],[]},
> {qstate,<0.744.0>,active,{0,nil}},
> 37}},
> {{ch,<0.660.0>},
> {cr,<0.660.0>,#Ref<0.0.0.10650>,
> {[28712,28693,28674,28655,28636,28617,28599,28581,
> 28563,28545,28527,28509,28491,28473,28455,28437,
> 28419,28401,28383,28365,28347,28329,28310,28289,
> 28268,28247,28226,28205,28184,28163,28140,28117,
> 28094,28071,28048,28025,28002,27979,27956,27933,
> 27910,27887,27864,27841,27818,27795,27772,27749,
> 27726,27703,27680,27657,27634,27611],
> [27588]},
> 1,
> {[],[]},
> {qstate,<0.659.0>,active,{0,nil}},
> 11}},
> {{credit_to,<7350.837.0>},32},
> {{credit_to,<7350.881.0>},17},
> {{ch,<0.737.0>},
> {cr,<0.737.0>,#Ref<0.0.0.12429>,
> {[28612,28611,28594,28593,28576,28575,28558,28557,
> 28540,28539,28522,28521,28504,28503,28486,28485,
> 28468,28467,28450,28449,28432,28431,28414,28413,
> 28396,28395,28378,28377,28360,28359,28342,28341,
> 28324,28323,28305,28304,28284,28283,28263,28262,
> 28242,28241,28221,28220,28200,28199,28179,28178,
> 28158,28157,28135,28134,28112,28111,28089,28088,
> 28066,28065,28043,28042,28020,28019,27997,27996,
> 27974,27973,27951,27950,27928,27927,27905,27904,
> 27882,27881,27859,27858,27836,27835,27813,27812,
> 27790,27789,27767,27766,27744,27743,27721,27720,
> 27698,27697,27675,27674,27652,27651,27629,27628,
> 27606,27605,27583],
> [27582]},
> 2,
> {[{<0.737.0>,
> {consumer,<<"amq.ctag-ITZqhulzHhUzx1uy1Eot3g">>,
> true}}],
> [{<0.737.0>,
> {consumer,<<"amq.ctag-hpo5ejdoJRdLrT6JS42jGw">>,
> true}}]},
> {qstate,<0.736.0>,suspended,{0,nil}},
> 11}},
> {credit_blocked,[]},
> {{credit_to,<7350.854.0>},3},
> {{ch,<0.716.0>},
> {cr,<0.716.0>,#Ref<0.0.0.11820>,
> {[28610,28609,28592,28591,28574,28573,28556,28555,
> 28538,28537,28520,28519,28502,28501,28484,28483,
> 28466,28465,28448,28447,28430,28429,28412,28411,
> 28394,28393,28376,28375,28358,28357,28340,28339,
> 28321,28320,28300,28299,28279,28278,28258,28257,
> 28237,28236,28216,28215,28195,28194,28174,28173,
> 28153,28152,28130,28129,28107,28106,28084,28083,
> 28061,28060,28038,28037,28015,28014,27992,27991,
> 27969,27968,27946,27945,27923,27922,27900,27899,
> 27877,27876,27854,27853,27831,27830,27808,27807,
> 27785,27784,27762,27761,27739,27738,27716,27715,
> 27693,27692,27670,27669,27647,27646,27624,27623,
> 27601,27600,27578],
> [27577]},
> 2,
> {[{<0.716.0>,
> {consumer,<<"amq.ctag-BTlS-GmbXNpcU-wUlMI60Q">>,
> true}}],
> [{<0.716.0>,
> {consumer,<<"amq.ctag-loc5monF8M1pz_Bb8ypumw">>,
> true}}]},
> {qstate,<0.715.0>,suspended,{0,nil}},
> 44}},
> {{ch,<0.782.0>},
> {cr,<0.782.0>,#Ref<0.0.0.15649>,
> {[28714,28695,28676,28657,28638,28619,28601,28583,
> 28565,28547,28529,28511,28493,28475,28457,28439,
> 28421,28403,28385,28367,28349,28331,28312,28291,
> 28270,28249,28228,28207,28186,28165,28142,28119,
> 28096,28073,28050,28027,28004,27981,27958,27935,
> 27912,27889,27866,27843,27820,27797,27774,27751,
> 27728,27705,27682,27659,27636,27613],
> [27590]},
> 1,
> {[],[]},
> {qstate,<0.781.0>,active,{0,nil}},
> 35}},
> {{credit_to,<7350.846.0>},22},
> {{ch,<0.684.0>},
> {cr,<0.684.0>,#Ref<0.0.0.11198>,
> {[28711,28692,28673,28654,28635,28616,28598,28580,
> 28562,28544,28526,28508,28490,28472,28454,28436,
> 28418,28400,28382,28364,28346,28328,28309,28288,
> 28267,28246,28225,28204,28183,28162,28139,28116,
> 28093,28070,28047,28024,28001,27978,27955,27932,
> 27909,27886,27863,27840,27817,27794,27771,27748,
> 27725,27702,27679,27656,27633,27610],
> [27587]},
> 1,
> {[],[]},
> {qstate,<0.683.0>,active,{0,nil}},
> 11}},
> {{ch,<0.692.0>},
> {cr,<0.692.0>,#Ref<0.0.0.11088>,
> {[28710,28691,28672,28653,28634,28615,28597,28579,
> 28561,28543,28525,28507,28489,28471,28453,28435,
> 28417,28399,28381,28363,28345,28327,28308,28287,
> 28266,28245,28224,28203,28182,28161,28138,28115,
> 28092,28069,28046,28023,28000,27977,27954,27931,
> 27908,27885,27862,27839,27816,27793,27770,27747,
> 27724,27701,27678,27655,27632,27609],
> [27586]},
> 1,
> {[],[]},
> {qstate,<0.691.0>,active,{0,nil}},
> 12}},
> {{ch,<0.648.0>},
> {cr,<0.648.0>,#Ref<0.0.0.10283>,
> {[28717,28698,28679,28660,28641],[28622]},
> 1,
> {[],[]},
> {qstate,<0.647.0>,active,{0,nil}},
> 12}},
> {guid,{{2296174761,591959305,1562876388,1822892473},1}},
> {{ch,<0.720.0>},
> {cr,<0.720.0>,#Ref<0.0.0.12036>,
> {[28723,28704,28685,28666,28647],[28628]},
> 1,
> {[],[]},
> {qstate,<0.719.0>,active,{0,nil}},
> 37}},
> {{credit_to,<7350.869.0>},39},
> {{ch,<0.636.0>},
> {cr,<0.636.0>,#Ref<0.0.0.9960>,
> {[28705,28686,28667,28648,28629,28604,28586,28568,
> 28550,28532,28514,28496,28478,28460,28442,28424,
> 28406,28388,28370,28352,28334,28315,28294,28273,
> 28252,28231,28210,28189,28168,28147,28124,28101,
> 28078,28055,28032,28009,27986,27963,27940,27917,
> 27894,27871,27848,27825,27802,27779,27756,27733,
> 27710,27687,27664,27641,27618,27595],
> [27572]},
> 1,
> {[],[]},
> {qstate,<0.635.0>,active,{0,nil}},
> 15}},
> {{credit_to,<7350.850.0>},50},
> {{ch,<0.616.0>},
> {cr,<0.616.0>,#Ref<0.0.0.9300>,
> {[28719,28700,28681,28662,28643],[28624]},
> 1,
> {[],[]},
> {qstate,<0.615.0>,active,{0,nil}},
> 12}},
> {{ch,<0.509.0>},
> {cr,<0.509.0>,#Ref<0.0.0.2243>,
> {[28302,28281,28260,28239,28218,28197,28176,28155,
> 28132,28109,28086,28063,28040,28017,27994,27971],
> [27580,27603,27626,27649,27672,27695,27718,27741,
> 27764,27787,27810,27833,27856,27879,27902,27925,
> 27948]},
> 3,
> {[],[]},
> {qstate,<0.508.0>,active,{0,nil}},
> 36}},
> {{credit_to,<7350.873.0>},17},
> {{ch,<0.588.0>},
> {cr,<0.588.0>,#Ref<0.0.0.7808>,
> {[28721,28702,28683,28664,28645],[28626]},
> 1,
> {[],[]},
> {qstate,<0.587.0>,active,{0,nil}},
> 12}},
> {{ch,<0.664.0>},
> {cr,<0.664.0>,#Ref<0.0.0.10772>,
> {[28720,28701,28682,28663,28644],[28625]},
> 1,
> {[],[]},
> {qstate,<0.663.0>,active,{0,nil}},
> 13}},
> {fhc_age_tree,{0,nil}},
> {{ch,<0.708.0>},
> {cr,<0.708.0>,#Ref<0.0.0.11695>,
> {[28707,28688,28669,28650,28631,28606,28588,28570,
> 28552,28534,28516,28498,28480,28462,28444,28426,
> 28408,28390,28372,28354,28336,28317,28296,28275,
> 28254,28233,28212,28191,28170,28149,28126,28103,
> 28080,28057,28034,28011,27988,27965,27942,27919,
> 27896,27873,27850,27827,27804,27781,27758,27735,
> 27712,27689,27666,27643,27620,27597],
> [27574]},
> 1,
> {[],[]},
> {qstate,<0.707.0>,active,{0,nil}},
> 35}},
> {{credit_to,<7350.829.0>},13},
> {{ch,<0.757.0>},
> {cr,<0.757.0>,#Ref<0.0.0.13150>,
> {[28706,28687,28668,28649,28630,28605,28587,28569,
> 28551,28533,28515,28497,28479,28461,28443,28425,
> 28407,28389,28371,28353,28335,28316,28295,28274,
> 28253,28232,28211,28190,28169,28148,28125,28102,
> 28079,28056,28033,28010,27987,27964,27941,27918,
> 27895,27872,27849,27826,27803,27780,27757,27734,
> 27711,27688,27665,27642,27619,27596],
> [27573]},
> 1,
> {[],[]},
> {qstate,<0.756.0>,active,{0,nil}},
> 49}},
> {{ch,<0.572.0>},
> {cr,<0.572.0>,#Ref<0.0.0.7536>,
> {[28709,28690,28671,28652,28633,28608,28590,28572,
> 28554,28536,28518,28500,28482,28464,28446,28428,
> 28410,28392,28374,28356,28338,28319,28298,28277,
> 28256,28235,28214,28193,28172,28151,28128,28105,
> 28082,28059,28036,28013,27990,27967,27944,27921,
> 27898,27875,27852,27829,27806,27783,27760,27737,
> 27714,27691,27668,27645,27622,27599],
> [27576]},
> 1,
> {[],[]},
> {qstate,<0.571.0>,active,{0,nil}},
> 35}},
> {{credit_to,<7350.825.0>},2},
> {{credit_to,<7350.858.0>},3}]
> trap_exit: true
> status: running
> heap_size: 121536
> stack_size: 27
> reductions: 9363545
> neighbours:
>
> =SUPERVISOR REPORT==== 18-May-2013::01:41:51 ===
> Supervisor: {local,rabbit_amqqueue_sup}
> Context: child_terminated
> Reason: {{case_clause,{empty,{[],[]}}},
> [{rabbit_amqqueue_process,subtract_acks,3,[]},
> {rabbit_amqqueue_process,subtract_acks,4,[]},
> {rabbit_amqqueue_process,handle_cast,2,[]},
> {gen_server2,handle_msg,2,[]},
> {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}
> Offender: [{pid,<0.301.0>},
> {name,rabbit_amqqueue},
> {mfa,
> {rabbit_amqqueue_process,start_link,
> [{amqqueue,
> {resource,<<"/">>,queue,
> <<"document_queue">>},
> true,false,none,[],<0.7622.0>,[],[],undefined,
> []}]}},
> {restart_type,temporary},
> {shutdown,4294967295},
> {child_type,worker}]
Notice how the *case_clause* is *empty*. The other rabbit log looks
similar:
# cat /var/log/rabbitmq/rabbit@ip-10-123-123-123.log
=ERROR REPORT==== 18-May-2013::01:41:48 ===
** Generic server <0.301.0> terminating
** Last message in was {'$gen_cast',
{ack,
[27585,27584,27607,27630,27653,27676,27699,
27722,27745,27768,27791,27814,27837,27860,
27883,27906,27929,27952,27975,27998,28021,
28044,28067,28090,28113,28136,28159,28180,
28201,28222,28243,28264,28285,28306,28325,
28343,28361,28379,28397,28415,28433,28451,
28469,28487,28505,28523,28541,28559,28577,
28595,28613],
<0.676.0>}}
** When Server state == {q,
{amqqueue,
{resource,<<"/">>,queue,
<<"document_queue">>},
true,false,none,[],<0.301.0>,[],[],undefined,[]},
none,true,rabbit_variable_queue,
{vqstate,
{0,{[],[]}},
{0,{[],[]}},
{delta,undefined,0,undefined},
{0,{[],[]}},
{0,{[],[]}},
* ... [ thousands of lines of state ] ... *
** Reason for termination ==
** {{case_clause,{empty,{[],[]}}},
[{rabbit_amqqueue_process,subtract_acks,3,[]},
{rabbit_amqqueue_process,subtract_acks,4,[]},
{rabbit_amqqueue_process,handle_cast,2,[]},
{gen_server2,handle_msg,2,[]},
{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}
> =ERROR REPORT==== 18-May-2013::01:41:51 ===
connection <0.430.0>, channel 22 - soft error:
{amqp_error,not_found,
"no queue 'document_queue' in vhost '/'",
'queue.declare'}
> =ERROR REPORT==== 18-May-2013::01:41:51 ===
connection <0.384.0>, channel 12 - soft error:
{amqp_error,not_found,
"no queue 'document_queue' in vhost '/'",
'queue.declare'}
> =ERROR REPORT==== 18-May-2013::01:41:51 ===
connection <0.384.0>, channel 15 - soft error:
{amqp_error,not_found,
"no queue 'document_queue' in vhost '/'",
'queue.declare'}
> =ERROR REPORT==== 18-May-2013::01:41:51 ===
connection <0.373.0>, channel 23 - soft error:
{amqp_error,not_found,
"no queue 'document_queue' in vhost '/'",
'queue.declare'}
* ... [ repeated a bunch of times ] ...*
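One possible reading of the exit reason, offered as a sketch rather than a diagnosis: Erlang's queue:out/1 returns {empty, Q} when the queue has nothing left, and {[],[]} is how an empty queue prints. A case expression that only matches {{value, Tag}, Rest} would therefore fail with exactly this case_clause. The Python model below is hypothetical (the names queue_out, subtract_acks and CaseClauseError are invented here, and it is not RabbitMQ's source). It assumes acks are removed from a per-channel queue of pending tags by walking that queue in order. In the logs above, the last ack message from <0.676.0> starts with 27585, 27584, while the pending queue recorded for that channel has 27584 at the front, so the example uses the same ordering.

```python
class CaseClauseError(Exception):
    """Stands in for Erlang's case_clause error (no branch matched)."""


def queue_out(queue):
    """Mimic the return shape of Erlang's queue:out/1 using a plain list."""
    if not queue:
        return ("empty", queue)
    head, *rest = queue
    return (("value", head), rest)


def subtract_acks(ack_tags, pending):
    """Remove ack_tags from pending, walking pending from the front.

    Hypothetical model: pending tags that do not match the current ack are
    set aside and kept. There is deliberately no branch for an empty queue.
    """
    acks, queue, skipped = list(ack_tags), list(pending), []
    while acks:
        head, queue = queue_out(queue)
        if head == "empty":
            # No clause handles this value, as in the crash report.
            raise CaseClauseError((head, queue))
        _, tag = head
        if tag == acks[0]:
            acks.pop(0)          # this ack is accounted for
        else:
            skipped.append(tag)  # still unacked, keep it
    return skipped + queue


# Acks in the same order as the pending queue: works.
print(subtract_acks([27584, 27585], [27584, 27585, 27607, 27608]))  # [27607, 27608]

# Acks out of order relative to the pending queue: the walk runs off the end.
try:
    subtract_acks([27585, 27584], [27584, 27585, 27607])
except CaseClauseError as err:
    print(f"case_clause: {err.args[0]}")  # case_clause: ('empty', [])
```

If this model is close to what the broker does, the crash would depend on the order of tags inside a single ack message rather than on message volume, which might explain why flooding the nodes does not reproduce it.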
I also see this in my */var/log/syslog* with regard to *beam.smp*:
# cat /var/log/syslog
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176041] INFO: task
> beam.smp:18971 blocked for more than 120 seconds.
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176053] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176060] beam.smp
> D ffff8801bfd93700 0 18971 18962 0x00000000
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176065]
> ffff8801b11cdcb8 0000000000000282 0000000000000000 ffffffffffffffe0
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176070]
> ffff8801b11cdfd8 ffff8801b11cdfd8 ffff8801b11cdfd8 0000000000013700
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176075]
> ffff8801b3180000 ffff8801b35244a0 00007f500e2079e0 ffff8801b146bb80
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176079] Call Trace:
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176089]
> [<ffffffff8165434f>] schedule+0x3f/0x60
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176094]
> [<ffffffff8106aef5>] exit_mm+0x85/0x130
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176097]
> [<ffffffff8106b10e>] do_exit+0x16e/0x450
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176102]
> [<ffffffff810797aa>] ? __dequeue_signal+0x6a/0xb0
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176105]
> [<ffffffff8106b594>] do_group_exit+0x44/0xa0
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176108]
> [<ffffffff8107c36c>] get_signal_to_deliver+0x21c/0x420
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176113]
> [<ffffffff81014825>] do_signal+0x45/0x130
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176116]
> [<ffffffff8105539d>] ? set_next_entity+0xad/0xd0
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176120]
> [<ffffffff81014ad5>] do_notify_resume+0x65/0x80
> May 17 22:46:23 ip-10-123-123-123 kernel: [244800.176124]
> [<ffffffff8165ea90>] int_signal+0x12/0x17
> *... [ repeated 9 more times with same timestamp ] ...*
There are no strange CPU spikes on my EC2 node based on AWS monitoring. The
only thing that seems to stand out is that my writes to my EBS volume spike
during the crash (~90 write OPs/sec, 4 MiB/sec write bandwidth, ~30ms/op
write latency).
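To put those EBS figures in perspective, here is a small back-of-the-envelope calculation. It assumes the three numbers are averages over the same interval, which the AWS graphs may not guarantee, so treat the results as rough.

```python
# Figures quoted above for the EBS volume during the crash.
write_ops_per_sec = 90
write_mib_per_sec = 4
latency_sec_per_op = 0.030

# Average size of a single write.
kib_per_op = write_mib_per_sec * 1024 / write_ops_per_sec

# Little's law: mean writes in flight = arrival rate * time per write.
writes_in_flight = write_ops_per_sec * latency_sec_per_op

print(f"{kib_per_op:.1f} KiB per write")          # 45.5 KiB per write
print(f"{writes_in_flight:.1f} writes in flight")  # 2.7 writes in flight
```

Under that assumption the volume averages fewer than three outstanding writes of about 45 KiB each during the spike.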
My team and I are completely stumped and would appreciate any insight you
might have into the crash reports. Could this be a problem with writes
hanging against the EBS volume, since EBS volumes are essentially
network-mounted? Is there a way to configure rabbit to be more tolerant of
these delays?
For debugging purposes, below is the status output of one of the nodes when
it is healthy (so you can see the versions of everything I'm running):
# rabbitmqctl status
> Status of node 'rabbit@ip-10-123-123-123' ...
> [{pid,28978},
> {running_applications,
> [{rabbitmq_management,"RabbitMQ Management Console","3.1.0"},
> {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.1.0"},
> {webmachine,"webmachine","1.9.1-rmq3.1.0-git52e62bc"},
> {mochiweb,"MochiMedia Web Server","2.3.1-rmq3.1.0-gitd541e9a"},
> {rabbitmq_management_agent,"RabbitMQ Management Agent","3.1.0"},
> {rabbit,"RabbitMQ","3.1.0"},
> {os_mon,"CPO CXC 138 46","2.2.11"},
> {inets,"INETS CXC 138 49","5.9.4"},
> {mnesia,"MNESIA CXC 138 12","4.8"},
> {amqp_client,"RabbitMQ AMQP Client","3.1.0"},
> {xmerl,"XML parser","1.3.3"},
> {sasl,"SASL CXC 138 11","2.3.1"},
> {stdlib,"ERTS CXC 138 10","1.19.1"},
> {kernel,"ERTS CXC 138 10","2.16.1"}]},
> {os,{unix,linux}},
> {erlang_version,
> "Erlang R16B (erts-5.10.1) [source-05f1189] [64-bit] [smp:8:8]
> [async-threads:30] [hipe] [kernel-poll:true]\n"},
> {memory,
> [{total,60179048},
> {connection_procs,5274808},
> {queue_procs,13107264},
> {plugins,354272},
> {other_proc,9741830},
> {mnesia,150456},
> {mgmt_db,12248},
> {msg_index,34768},
> {other_ets,1148248},
> {binary,4905584},
> {code,19565863},
> {atom,703377},
> {other_system,5180330}]},
> {vm_memory_high_watermark,0.8},
> {vm_memory_limit,5828434329},
> {disk_free_limit,1000000000},
> {disk_free,3953344512},
> {file_descriptors,
> [{total_limit,924},{total_used,6},{sockets_limit,829},{sockets_used,4}]},
> {processes,[{limit,1048576},{used,348}]},
> {run_queue,0},
> {uptime,3796}]
> ...done.
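As a quick sanity check on the healthy-node figures above, the snippet below works out how much headroom the node has. The values are copied from the status output. The RAM estimate assumes vm_memory_limit is simply the high watermark fraction of the memory the node detected.

```python
# Values copied from the `rabbitmqctl status` output above.
status = {
    "memory_total": 60179048,
    "vm_memory_high_watermark": 0.8,
    "vm_memory_limit": 5828434329,
    "disk_free_limit": 1000000000,
    "disk_free": 3953344512,
    "fd_used": 6,
    "fd_limit": 924,
}

# Share of the memory limit currently in use.
mem_pct = 100 * status["memory_total"] / status["vm_memory_limit"]

# Detected RAM, assuming limit = watermark * total RAM.
ram_gib = status["vm_memory_limit"] / status["vm_memory_high_watermark"] / 2**30

# How many times over the disk alarm threshold the free space is.
disk_ratio = status["disk_free"] / status["disk_free_limit"]

print(f"memory used: {mem_pct:.1f}% of limit")
print(f"detected RAM: about {ram_gib:.1f} GiB")
print(f"disk free: {disk_ratio:.1f}x the alarm threshold")
print(f"file descriptors: {status['fd_used']} of {status['fd_limit']}")
```

By these numbers the node is nowhere near its memory, disk, or file descriptor limits while healthy (roughly 1% of the memory limit in use).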
Note that I am using Erlang R16B from the Erlang Solutions repo (esl-erlang
package). I originally was using Erlang R14 from Ubuntu repos
(erlang-base), but decided to upgrade hoping it would resolve the issue
(which it did not). I have also tried running RabbitMQ 3.0.4 without
success. This issue seems to affect both versions.
Thanks,
Karl