[rabbitmq-discuss] Weird Crash (91MB message over STOMP) [Reproducible]

Sat Aug 8 10:32:31 BST 2009

Darien,

Darien Kindlund wrote:
> Actually, this problem is a bit worse.... apparently, when RabbitMQ
> restarts and recovers the persister -- even persistent messages marked
> 'ready' on OTHER durable queues are NOT retrievable by other STOMP
> clients.... I get the same type of error in the rabbit.log:
> 
> =ERROR REPORT==== 8-Aug-2009::04:21:16 ===
> STOMP Reply command unhandled: {'basic.deliver',
>                                    <<"Q_1.manager.workers">>,
>                                    1,
>                                    false,
>                                    <<"events">>,
>                                    <<"1.job.create.job.urls.job_alerts">>}

Right. I think I know what the problem is, and it is indeed a bug in the 
STOMP adapter which causes it to barf when attempting to deliver any 
message that was recovered from the persister.

> The unit test case for this would be:
> 1) Create a durable exchange
> 2) Create a durable queue
> 3) Bind the queue to the exchange
> 4) Make sure the queue has no consumers subscribed
> 5) Send a small (normal) persistent message to the queue
> 6) Crash RabbitMQ by sending a large message to a different, unrelated queue
> 7) Kill epmd
> 8) Restart RabbitMQ
> 9) Verify the (normal) message still exists via rabbitmqctl
> 10) Start STOMP consumer and attempt to subscribe to the queue
> 11) STOMP consumer waits, RabbitMQ generates the log message, but no
> persistent (normal) message gets delivered

You should be able to skip steps 6 and 7, i.e. just bounce rabbit 
normally, and still see the problem.
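
Steps 10 and 11 could be scripted roughly as follows. This is a sketch in Python, not a tested client. It assumes the STOMP adapter listens on localhost:61613 and accepts the default guest/guest login. The helper names `stomp_frame` and `wait_for_message` are made up for illustration.

```python
import socket


def stomp_frame(command, headers=None, body=b""):
    """Encode one STOMP 1.0 frame: command, header lines, blank line, body, NUL."""
    lines = [command] + [f"{k}:{v}" for k, v in (headers or {}).items()]
    return "\n".join(lines).encode("utf-8") + b"\n\n" + body + b"\x00"


def wait_for_message(queue, host="localhost", port=61613, timeout=10.0):
    """Subscribe to `queue`; return True if a MESSAGE frame arrives in time.

    Host, port and credentials are assumptions, adjust them for your broker.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(stomp_frame("CONNECT", {"login": "guest", "passcode": "guest"}))
        sock.sendall(stomp_frame("SUBSCRIBE", {"destination": queue, "ack": "auto"}))
        received = b""
        try:
            # Crude check: stop as soon as any MESSAGE frame header shows up.
            while b"MESSAGE\n" not in received:
                chunk = sock.recv(4096)
                if not chunk:  # server closed the connection
                    return False
                received += chunk
        except socket.timeout:
            return False  # consumer waited and nothing was delivered (step 11)
        return True
```

Calling `wait_for_message` with the name of the durable queue after the restart and getting `False` back, together with the "Reply command unhandled" entry in rabbit.log, would match the symptom described above.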

> I see the 'unacknowledged messages' after I start up the STOMP
> clients.

*phew*. That is much more plausible, and means there is unlikely to be a 
bug in the rabbit core.

> So, I'm thinking the order of operations is:
> 1) Unacknowledged messages exist on the queue
> 2) RabbitMQ dies
> 3) RabbitMQ starts up
> 4) Recovery mode starts, marks all un-ack'd messages as 'ready'
> 5) STOMP clients connect
> 6) RabbitMQ generates the STOMP error
> 7) I check the rabbitmqctl output, and see that there are un-ack'd messages
> 
> To be honest, I can't seem to replicate the issue where the STOMP
> clients disconnect and the messages remain 'un-ack'd' -- I'm thinking
> this error may be transient or somehow a weird corner case.  If I ever
> encounter that scenario again, I'll be sure to save the mnesia
> directory at that point.

Makes sense. There may well be a delay between the error being generated 
and the messages being moved back into the 'ready' state, particularly, 
say, when rabbit is busy dumping a large error message to a log file.

> I don't have a pure AMQP test client, but I'm curious if this error
> condition exists if the large message were sent over AMQP instead of
> STOMP...

That we have definitely tested. And no, it doesn't cause an error.

So, in summary, I think you have managed to uncover three bugs in the 
STOMP adapter:

1) attempting to deliver messages recovered from the persister via STOMP 
causes an error

2) STOMP client disconnects can result in huge error messages being logged

3) sending large messages via STOMP causes rabbit to die

Thanks for your help in tracking down these problems.

I have one last request: Would it be possible for you to construct a 
simple test case for 3? Ideally I want something along the lines of "1) 
start a clean (i.e. no existing db) rabbit with stomp enabled, 2) run 
this program, 3) see rabbit die". Based on your investigation so far, 
the program in question could perhaps be as simple as creating a large 
message and then attempting to send it over STOMP.
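
As a starting point, such a program might look something like the sketch below. It is Python using only the standard library, and it is untested against the adapter. It assumes the STOMP listener is on localhost:61613 with the default guest/guest login. The function names and the destination are placeholders, and `content-length` is the standard STOMP header for bodies of a known size.

```python
import socket


def build_send_frame(destination, body):
    """Encode a STOMP SEND frame; content-length lets the body hold any bytes."""
    head = f"SEND\ndestination:{destination}\ncontent-length:{len(body)}\n\n"
    return head.encode("utf-8") + body + b"\x00"


def send_large_message(destination, size, host="localhost", port=61613):
    """Connect, send a single message of `size` bytes, then disconnect.

    Host, port and credentials are assumptions, adjust them for your broker.
    """
    body = b"x" * size  # filler payload; only the size matters here
    with socket.create_connection((host, port)) as sock:
        sock.sendall(b"CONNECT\nlogin:guest\npasscode:guest\n\n\x00")
        sock.recv(4096)  # wait for the server's reply (normally CONNECTED)
        sock.sendall(build_send_frame(destination, body))
        sock.sendall(b"DISCONNECT\n\n\x00")
```

Something like `send_large_message("test_queue", 91 * 1024 * 1024)` would approximate the 91MB case from the subject line; "test_queue" is a placeholder name.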

Regards,

Matthias.