[rabbitmq-discuss] AMQP restart

Tue Oct 16 12:57:31 BST 2012

Hi!

On 15 Oct 2012, at 16:07, tom kelly wrote:

> Hi List,
> New user here, debugging some code I inherited so apologies if my questions below are irrelevant.
> 
> I'm investigating an amqp crash that I've seen in my logs a few times and after a code review of the amqp component I'm a bit concerned that my connections may be dying & silently failing when this crash occurs.
> 

Ok, let's look at that then.

> I'm using an older version that unlinked from the process that called "start_link", anyone know why that was? I'm publishing through this channel by calling amqp_channel:cast, so now I'm worried that if the connection & channel were closed down everything that I thought I was publishing after this error just silently failed. And because of the unlink there's no way the application would have known.
> 

I'm really not too sure about this unlink business, but if you could clarify where that was happening then I can probably look through the hg logs to try and figure out what was going on there.

> I plan upgrading to the latest version but I'm not sure that it has all the features to help solve this problem. I see that the unlink is gone and the supervision policy is still:{one_for_all, 0, 1} So I guess this means I have to trap exits and I have responsibility for reopening the connection & channel if it dies?

My reading of the supervision hierarchy is thus:

The application has a top level simple_one_for_one supervisor for all connections, which handles the amqp_connection_sup. This just ensures that each connection can actually be started, and the connection_sup is temporary, so no restarts will ever take place. This is presumably what you'd expect, as we're not trying to second guess how long your connections need to live for.

The actual connection consists of a few processes - the gen_connection, connection_type_sup and channel_sup_sup. This is a one_for_all supervisor and the actual gen_connection process is an 'intrinsic' worker, so a non-normal exit will kill the supervisor (and sibling processes), but a normal exit will take everyone else down cleanly.

Now the channel_sup_sup starts a temporary worker (amqp_channel_sup) and that starts an intrinsic worker.  So all in all, it looks to me as if the connection and channel will be properly re-established if a non-normal exit occurs.
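
Whatever the supervisors end up doing, the calling process can find out for itself that a connection or channel has gone away by monitoring the pid, which avoids depending on links at all. A minimal sketch in plain Erlang, with no client library calls in it: the module name conn_watch and both function names are made up for illustration, and Pid stands for whatever pid you got back when you started the connection or opened the channel.

```erlang
-module(conn_watch).
-export([watch/1, await_down/2]).

%% Monitor a process (e.g. a connection or channel pid) without linking to it.
%% Returns the monitor reference to match against later.
watch(Pid) when is_pid(Pid) ->
    erlang:monitor(process, Pid).

%% Wait up to Timeout milliseconds for the monitored process to die.
%% Returns {down, Reason} if it exited, or still_up if the timeout elapsed.
await_down(Ref, Timeout) ->
    receive
        {'DOWN', Ref, process, _Pid, Reason} ->
            {down, Reason}
    after Timeout ->
        still_up
    end.
```

In a real publisher you would more likely handle the 'DOWN' message in your normal receive loop (or handle_info) rather than block on it, and decide there whether to reconnect.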

> But before I restart it, what happens to any attempts to publish messages? I see there's new confirmation functionality that sounds like it might do what's required but from my reading it seems that if amqp_channel is shut down after a crash on the connection then all the confirm info is discarded. Is there no way to keep this process alive and try to re-open the connection immediately on failure?
> 

I'm not really sure what you're asking here, but my reading of the client is that if you're expecting a confirm and you've not seen it, then you can/should assume the message wasn't accepted by the broker. If you're asking about tracking the confirms between channel instances, then yes, you'll need to do that yourself, using whatever mechanism suits your design (shared/stable storage, stateful parent process, etc).
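
To make the "stateful parent process" option a bit more concrete, here is a rough sketch of the bookkeeping such a process might hold, assuming you key each published message by its publish sequence number and that confirms arrive as a sequence number plus a "multiple" flag (as basic.ack does). The module pending_confirms and all of its functions are hypothetical; how you wire it up to the client is up to you.

```erlang
-module(pending_confirms).
-export([new/0, published/3, acked/3, outstanding/1]).

%% Pending is a gb_tree keyed by publish sequence number.
new() ->
    gb_trees:empty().

%% Record a message as published but not yet confirmed.
published(SeqNo, Msg, Pending) ->
    gb_trees:enter(SeqNo, Msg, Pending).

%% Handle a confirm. With Multiple = true, every sequence number
%% up to and including SeqNo is treated as confirmed.
acked(SeqNo, false, Pending) ->
    gb_trees:delete_any(SeqNo, Pending);
acked(SeqNo, true, Pending) ->
    drop_up_to(SeqNo, Pending).

%% Remove entries from the smallest key upwards until one exceeds SeqNo.
drop_up_to(SeqNo, Pending) ->
    case gb_trees:is_empty(Pending) of
        true ->
            Pending;
        false ->
            {Smallest, _Msg, Rest} = gb_trees:take_smallest(Pending),
            if
                Smallest =< SeqNo -> drop_up_to(SeqNo, Rest);
                true -> Pending
            end
    end.

%% Messages still awaiting a confirm, oldest first.
outstanding(Pending) ->
    gb_trees:to_list(Pending).
```

If the channel dies, whatever outstanding/1 returns at that point is the set of messages to republish on the new channel. Bear in mind that sequence numbers start again on a new channel, so the republished messages need fresh keys.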

> I'm just about to plug in the new version and play with the confirmations but any explanations of the current design might help enormously,
> Thanks,
> //TTom.

Well, I hope my comments have made things a bit clearer and not worse! Please *do* feel free to come back with any questions, or to clear up anything I've not explained properly.

Cheers,
Tim