[rabbitmq-discuss] Rabbit Client Supervision Architecture

Wed Sep 22 12:49:49 BST 2010

Matthew Sackman <matthew at ...> writes:

> 
> Hi Erik,
> 
> On Wed, Sep 22, 2010 at 07:42:46AM +0000, Erik Seres wrote:
> > I have a few questions regarding the RabbitMQ Erlang client library...
> 
> Excellent. As you've probably noticed it is still very much experimental
> and in a state of flux. Things should be calming down soon though, and
> it might even become officially supported soon!
> 
> > 1) The client now has layers of supervision built in. Apparently, however,
> > the maximum restart count for the various supervisors is set to 0. This will
> > prevent the supervisor from trying to restart its children in case of a
> > crash. I have verified this to be the case by the following:
> >  - created a network type connection to the server,
> >  - opened 2 channels within that connection,
> >  - sent a non-normal exit signal to the newly opened channel process.
> > 
> > The channel process crashed, its supervisor propagated the exit signal to
> > the other channel and finally, it shut itself down, too. The client never
> > started back up again.
> >
> > Question: What would be the intended use case of a supervisor with
> > MaxRetryCount set to 0?
> 
> This is a very common trick in our codebase and indeed further
> requirements led us to introduce the "intrinsic" restart strategy in our
> supervisor2 module.
> 
> In our view, one of the most important properties of supervisors is to
> structure the relationship between different processes and lower
> coupling. The fact that it can do restarting is almost irrelevant in
> many cases.
> 
> In the case of the client, restarting the connection in the event of a
> crash of some sort is a bizarre thing to do simply because it would be
> _impossible_ to be sure that you've got the connection and all the
> channels back to the same state. Channels themselves are stateful, but
> the state is _not_ set at channel creation - e.g. basic.qos, tx.select,
> basic.consume, etc. Thus even if channel processes did restart, they
> would not be able to get back to the same state they were last in.
> 
> It's not even reasonable to restart the connection: queues can be
> declared with exclusive=true which means when the connection closes, for
> whatever reason, the queues must be deleted. Of course, when the
> connection is created, it has no idea what queues are going to be
> declared this way, so if the connection gets automatically restarted, it
> can't possibly get back to the same state.
> 
> All in all, this really means that there is just no way of hiding the
> fact the connection / channels have died from the user of the client.
> Also, AMQP is designed to pass errors back to the client by explicitly
> forcing channels or even the connection to be closed. Obviously, if the
> user has done something wrong, you would not want those events to be
> silently papered over.
> 
> > 2) There is a channels manager process sitting at the same level in the
> > supervision tree as the supervisor that supervises the supervisors of the
> > channels (and writer and framing channel). The channels manager process does
> > not have any links to the channels it apparently is to manage.
> 
> I think you're on the wrong branch - that sounds a lot like branch
> bug23024 which is not through QA yet. Please make sure you're using the
> default branch.
> 
> > Question: What is the intended purpose of the channels manager?
> 
> Basically, channel number allocation and mapping from channel number <->
> Pid. Also, don't worry about the lack of links - we use monitoring a lot
> and rely on the supervisor hierarchy to tear down the world if something
> really bad happens.
> 
> > 3) I am trying to figure out how the four layers of supervision are supposed
> > to work and can't really wrap my head around it. The way I conceive it should
> > work is something like this, from top down:
> >  - Layer 1: supervise the entire client library
> >  - Layer 2: one_for_all supervision per connection. That is, when the
> > connection, main reader or writer dies, shut down all channels within that
> > connection, restart the connection and all previously open channels.
> >  - Layer 3: one_for_one supervision per channel. That is, when a channel
> > dies, restart the channel only and not affect anything else.
> > 
> > Question: Is this the intended behavior or am I on the wrong track?
> 
> Well, no restarting will ever happen. This is by design. From the bug
> that introduced this all, the diagram is roughly:
> 
> amqp_sup (amqp_sup) (simple-141-term) (Note, this isn't there yet!)
> |
> +--undefined (amqp_connection_sup) (one-for-all) *
>    |
>    +--connection (amqp_{network,direct}_connection) (i)
>    |
>    +--channels_manager (amqp_channels_manager) (i) (bug23024)
>    |
>    +--connection_type_sup (amqp_connection_specific_sup) (i) (def)
>    |  |
>    |  +--framing (rabbit_framing_channel) (i) (N)
>    |  +--writer (rabbit_writer) (i) (N)
>    |  +--main_reader (amqp_main_reader) (i) (N)
>    |  +--rabbit_hearbeat_sender (rabbit_heartbeat) (i) (N) (def)
>    |  +--rabbit_hearbeat_receiver (rabbit_heartbeat) (i) (N) (def)
>    |  +--collector (rabbit_queue_collector) (i) (D)
>    |
>    +--channel_sup_sup (amqp_channel_sup_sup) (simple-141-kill)
>       |
>       +--undefined (amqp_channel_sup) (one-for-all) *
>          |
>          +--channel (amqp_channel) (i)
>          +--framing (rabbit_framing_channel) (i) (N) (def)
>          +--writer (rabbit_writer) (i) (N) (def)
>          +--rabbit_channel (rabbit_channel) (t) (D)
>          +--rabbit_limiter (rabbit_limiter) (t) (D) (def)
> 
> Legend:
> (N) - only in the network case
> (D) - only in the direct case
> (i) - intrinsic
> (t) - transient
> (def) - started later on
> * - multiple instances
> 
> Thus: a channel is itself multiple processes that sit under a
> supervisor (amqp_channel_sup). You have many of these under a
> channel_sup_sup, and you never care about any of them dying, which is
> why channel_sup_sup is a simple-one-for-one (standard brutal kill).
> Depending on the type of connection (network or direct), you need
> different processes which is why we have the connection_type_sup which
> is parameterised by the connection type.
> 
> That's about it really.
> 
> Matthew
> 

Hi Matthew,

Thank you for the very quick and elaborate response. The concept of how you use 
supervisors to define hierarchy is interesting and now I understand why there is
no restart ever happening. In my case, I am not using transactions so the 
stateful nature of channels did not come to mind and I was not taking that into 
consideration.
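
For readers following along, the "maximum restart count of 0" behaviour discussed above can be reproduced with a plain OTP supervisor. The sketch below is only an illustration under stated assumptions: the module name no_restart_sup and its idle workers are hypothetical, it uses the standard supervisor module rather than RabbitMQ's supervisor2, and it is not taken from the client library.

```erlang
-module(no_restart_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, init/1]).

%% Hypothetical example module; not part of amqp_client.
start_link() ->
    supervisor:start_link(?MODULE, []).

%% A trivial child: a process linked to the supervisor that idles
%% until it receives 'stop' or is killed.
start_worker() ->
    {ok, proc_lib:spawn_link(fun idle/0)}.

idle() ->
    receive stop -> ok end.

init([]) ->
    %% {Strategy, MaxRestarts, MaxSeconds}. With MaxRestarts = 0 the very
    %% first restart attempt exceeds the limit, so the supervisor stops
    %% its remaining children and exits instead of restarting anything.
    SupFlags = {one_for_all, 0, 1},
    Child = fun(Id) ->
                {Id, {?MODULE, start_worker, []},
                 permanent, brutal_kill, worker, [?MODULE]}
            end,
    {ok, {SupFlags, [Child(worker_a), Child(worker_b)]}}.
```

Sending a non-normal exit signal to either worker takes down the sibling and then the supervisor itself, which matches the teardown described earlier in this thread: the supervisor only structures the processes and never brings them back.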

I do have one question regarding connections. You explain that an
exclusive queue must be deleted when a connection closes. On that same note,
however, how do you locate a durable queue after a connection has died (or has
been closed) and you have reopened it? If one can locate a durable queue after a
disconnect/connect, which as I understand it one must be able to, then why can
one not locate an exclusive queue the same way? (Tell me if I just need to dive
into the AMQP specs to understand this.)

In my application, at startup I will open a connection, and then each
application process wanting to communicate over AMQP will open a channel. The
application will need to know if the amqp_client has crashed and the connection
and channels have been lost. So I thought I would add the process ID returned
by amqp_connection:start(network, ...) to my supervision tree. This does not
seem to work, though, as the connection PID never shows up under my application
supervisor as a child and, consequently, it never gets restarted in case of a
crash. I have the "Type" in the ChildSpec set to 'supervisor' for this child.
With that said, how should I go about detecting when/if the amqp client has
crashed?
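
One general OTP technique that may apply here is to monitor the connection process instead of trying to adopt it as a child. The sketch below is a minimal illustration, not an answer from the library authors: the module name conn_watcher and the amqp_connection_down message are hypothetical, and it assumes only that the value returned by amqp_connection:start(network, ...) is an ordinary pid.

```erlang
-module(conn_watcher).
-export([watch/2]).

%% Spawn a watcher that monitors ConnPid and tells Owner when it goes down.
%% The message name 'amqp_connection_down' is made up for this example.
watch(ConnPid, Owner) when is_pid(ConnPid), is_pid(Owner) ->
    spawn(fun() ->
        %% Unlike a link, a monitor is one-way: the watcher is notified
        %% but is not killed along with the connection.
        Ref = erlang:monitor(process, ConnPid),
        receive
            %% If ConnPid is already dead when the monitor is set up,
            %% this message arrives at once with Reason = noproc.
            {'DOWN', Ref, process, ConnPid, Reason} ->
                Owner ! {amqp_connection_down, ConnPid, Reason}
        end
    end).
```

The owning process can then decide for itself whether to reconnect and redeclare its queues and consumers when the notification arrives.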

Also, what do you think is the expected time frame before this client library 
becomes officially supported?

Thanks for your help!
Erik