[rabbitmq-discuss] Rabbit Client Supervision Architecture

Wed Sep 22 11:27:23 BST 2010

Hi Erik,

On Wed, Sep 22, 2010 at 07:42:46AM +0000, Erik Seres wrote:
> I have a few questions regarding the RabbitMQ Erlang client library...

Excellent. As you've probably noticed it is still very much experimental
and in a state of flux. Things should be calming down soon though, and
it might even become officially supported before long!

> 1) The client now has layers of supervision built in. Apparently, however, the 
> maximum restart count for the various supervisors is set to 0. This will prevent 
> the supervisor from trying to restart its children in case of a crash. I have 
> verified this to be the case by the following:
>  - created a network type connection to the server,
>  - opened 2 channels within that connection,
>  - sent a non-normal exit signal to the newly opened channel process.
> 
> The channel process crashed, its supervisor propagated the exit signal to the 
> other channel and finally, it shut down itself, too. The client has never 
> started back up again.
> 
> Question: What would be the intended use case of a supervisor with MaxRetryCount 
> set to 0?

This is a very common trick in our codebase and indeed further
requirements led us to introduce the "intrinsic" restart strategy in our
supervisor2 module.

In our view, one of the most important properties of supervisors is to
structure the relationship between different processes and lower
coupling. The fact that it can do restarting is almost irrelevant in
many cases.
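
To make the zero-restart idea concrete, here is a minimal sketch using
the stock OTP supervisor rather than supervisor2. The module name, the
stub child and the demo function are all made up for illustration, and
"permanent" stands in for "intrinsic", which stock OTP does not have.
With a restart intensity of 0, the first child death makes the
supervisor stop its remaining children and then exit itself:

```erlang
-module(zero_restart_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1, start_stub/0, demo/0]).

start_link() ->
    supervisor:start_link(?MODULE, []).

init([]) ->
    %% MaxR = 0 restarts allowed within MaxT = 1 second: any child exit
    %% that calls for a restart exceeds the limit immediately.
    SupFlags = {one_for_all, 0, 1},
    Spec = fun(Id) ->
                   {Id, {?MODULE, start_stub, []},
                    permanent, brutal_kill, worker, [?MODULE]}
           end,
    {ok, {SupFlags, [Spec(stub_a), Spec(stub_b)]}}.

%% Hypothetical stand-in for a real child: it just waits for a message.
start_stub() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

%% Kill one child, then report why the supervisor exited and whether
%% the sibling survived. Note: leaves trap_exit set in the caller.
demo() ->
    process_flag(trap_exit, true),
    {ok, Sup} = start_link(),
    [{_, PidA, _, _}, {_, PidB, _, _}] = supervisor:which_children(Sup),
    exit(PidA, kill),
    receive
        {'EXIT', Sup, Reason} ->
            {Reason, is_process_alive(PidB)}
    after 5000 ->
            timeout
    end.
```

Calling zero_restart_sketch:demo() should return {shutdown, false}: the
supervisor gave up rather than restarting, and took the sibling with it.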

In the case of the client, restarting the connection in the event of a
crash of some sort is a bizarre thing to do simply because it would be
_impossible_ to be sure that you've got the connection and all the
channels back to the same state. Channels themselves are stateful, but
the state is _not_ set at channel creation - it is built up afterwards
by methods such as basic.qos, tx.select, basic.consume and so on. Thus
even if channel processes did restart, they would not be able to get
back to the same state they were last in.

It's not even reasonable to restart the connection: queues can be
declared with exclusive=true which means when the connection closes, for
whatever reason, the queues must be deleted. Of course, when the
connection is created, it has no idea what queues are going to be
declared this way, so if the connection gets automatically restarted, it
can't possibly get back to the same state.

All in all, this really means that there is just no way of hiding the
fact that the connection / channels have died from the user of the client.
Also, AMQP is designed to pass errors back to the client by explicitly
forcing channels or even the connection to be closed. Obviously, if the
user has done something wrong, you would not want those events to be
silently papered over.
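
From the user's side, one way to notice such a death is an ordinary
monitor. This is only a sketch: the module and function names are made
up, and Pid stands for whatever connection or channel process you were
handed. It waits for the process to die and passes the exit reason to a
fun of your choosing, which is where you would rebuild the connection
and redeclare your own state explicitly:

```erlang
-module(watch_sketch).
-export([watch/2]).

%% Block until Pid exits, then pass the exit reason to OnDown.
%% If Pid is already dead, the monitor reports the reason as noproc.
watch(Pid, OnDown) when is_pid(Pid), is_function(OnDown, 1) ->
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            OnDown(Reason)
    end.
```

In practice you would run this in a process of its own, since it blocks
until the watched process goes away.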

> 2) There is a channels manager process sitting at the same level in the 
> supervision tree as the supervisor that supervises the supervisors of the 
> channels (and writer and framing channel). The channels manager process does not 
> have any links to the channels it apparently is to manage.

I think you're on the wrong branch - that sounds a lot like branch
bug23024 which is not through QA yet. Please make sure you're using the
default branch.

> Question: What is the intended purpose of the channels manager?

Basically, channel number allocation and mapping from channel number <->
Pid. Also, don't worry about the lack of links - we use monitoring a lot
and rely on the supervisor hierarchy to tear down the world if something
really bad happens.
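
As a rough idea of what such a manager does, here is a sketch of a
gen_server that hands out channel numbers, keeps the number <-> Pid
mapping and uses monitors instead of links to forget dead channels. It
is not the amqp_channels_manager code: the module name, the API and the
simple incrementing allocation are all assumptions made for
illustration.

```erlang
-module(chan_manager_sketch).
-behaviour(gen_server).
-export([start_link/0, add_channel/2, pid_of/2, number_of/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() -> gen_server:start_link(?MODULE, [], []).

%% Allocate the next channel number for Pid and start monitoring it.
add_channel(Mgr, Pid) -> gen_server:call(Mgr, {add_channel, Pid}).
pid_of(Mgr, Number)   -> gen_server:call(Mgr, {pid_of, Number}).
number_of(Mgr, Pid)   -> gen_server:call(Mgr, {number_of, Pid}).

init([]) ->
    %% AMQP reserves channel 0 for the connection itself, so start at 1.
    {ok, #{next => 1, by_number => #{}, by_pid => #{}}}.

handle_call({add_channel, Pid}, _From,
            S = #{next := N, by_number := ByN, by_pid := ByP}) ->
    %% A monitor, not a link: the manager only wants to hear about the
    %% death, it does not want to die with the channel.
    _Ref = erlang:monitor(process, Pid),
    {reply, {ok, N}, S#{next := N + 1,
                        by_number := ByN#{N => Pid},
                        by_pid := ByP#{Pid => N}}};
handle_call({pid_of, N}, _From, S = #{by_number := ByN}) ->
    {reply, maps:find(N, ByN), S};
handle_call({number_of, Pid}, _From, S = #{by_pid := ByP}) ->
    {reply, maps:find(Pid, ByP), S}.

handle_cast(_Msg, S) -> {noreply, S}.

%% Drop both directions of the mapping when a monitored channel dies.
handle_info({'DOWN', _Ref, process, Pid, _Reason},
            S = #{by_number := ByN, by_pid := ByP}) ->
    case maps:take(Pid, ByP) of
        {N, ByP1} ->
            {noreply, S#{by_number := maps:remove(N, ByN), by_pid := ByP1}};
        error ->
            {noreply, S}
    end;
handle_info(_Other, S) -> {noreply, S}.
```

Numbers are never reused in this sketch; a real manager would have to
respect the negotiated maximum channel number.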

> 3) I am trying to figure out how the four layers of supervision are supposed to 
> work and can't really wrap my head around it. The way I conceive it should work 
> is something like this, from top down:
>  - Layer 1: supervise the entire client library
>  - Layer 2: one_for_all supervision per connection. That is, when the 
> connection, main reader or writer dies, shutdown all channels within that 
> connection, restart the connection and all previously open channels.
>  - Layer 3: one_for_one supervision per channel. That is, when a channel dies, 
> restart the channel only and not affect anything else.
> 
> Question: Is this the intended behavior or am I on the wrong track?

Well, no restarting will ever happen. This is by design. From the bug
that introduced all this, the diagram is roughly:

amqp_sup (amqp_sup) (simple-141-term) (Note, this isn't there yet!)
|
+--undefined (amqp_connection_sup) (one-for-all) *
   |
   +--connection (amqp_{network,direct}_connection) (i)
   |
   +--channels_manager (amqp_channels_manager) (i) (bug23024)
   |
   +--connection_type_sup (amqp_connection_specific_sup) (i) (def)
   |  |
   |  +--framing (rabbit_framing_channel) (i) (N)
   |  +--writer (rabbit_writer) (i) (N)
   |  +--main_reader (amqp_main_reader) (i) (N)
   |  +--rabbit_hearbeat_sender (rabbit_heartbeat) (i) (N) (def)
   |  +--rabbit_hearbeat_receiver (rabbit_heartbeat) (i) (N) (def)
   |  +--collector (rabbit_queue_collector) (i) (D)
   |
   +--channel_sup_sup (amqp_channel_sup_sup) (simple-141-kill)
      |
      +--undefined (amqp_channel_sup) (one-for-all) *
         |
         +--channel (amqp_channel) (i)
         +--framing (rabbit_framing_channel) (i) (N) (def)
         +--writer (rabbit_writer) (i) (N) (def)
         +--rabbit_channel (rabbit_channel) (t) (D)
         +--rabbit_limiter (rabbit_limiter) (t) (D) (def)

Legend:
(N) - only in the network case
(D) - only in the direct case
(i) - intrinsic
(t) - transient
(def) - started later on
* - multiple instances
simple-141 - simple-one-for-one

Thus: a channel is itself multiple processes that sit under a
supervisor (amqp_channel_sup). You have many of these under a
channel_sup_sup, and you never care about any of them dying, which is
why channel_sup_sup is a simple-one-for-one (standard brutal kill).
Depending on the type of connection (network or direct), you need
different processes which is why we have the connection_type_sup which
is parameterised by the connection type.
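
The simple-one-for-one part can be sketched like this, again with the
stock OTP supervisor and made-up names. The stub child stands in for
amqp_channel_sup (which is really a supervisor with children of its
own), and the "temporary" restart type is an assumption here, chosen
because it expresses "never restarted, and nobody minds if it dies":

```erlang
-module(channel_sup_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, start_channel/1, init/1, start_stub/0]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Each call starts one more instance of the single child spec below.
start_channel(Sup) ->
    supervisor:start_child(Sup, []).

init([]) ->
    %% simple_one_for_one: many dynamic children sharing one spec.
    %% brutal_kill: on shutdown the children are killed outright.
    {ok, {{simple_one_for_one, 0, 1},
          [{channel_sup, {?MODULE, start_stub, []},
            temporary, brutal_kill, worker, [?MODULE]}]}}.

%% Hypothetical stand-in for a per-channel supervisor.
start_stub() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.
```

Killing one child started this way leaves the supervisor and the other
children running, and the dead child is simply dropped from the list.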

That's about it really.

Matthew