[rabbitmq-discuss] Rabbit Client Supervision Architecture
Erik Seres
erikseres at exosite.com
Wed Sep 22 12:49:49 BST 2010
Matthew Sackman <matthew at ...> writes:
>
> Hi Erik,
>
> On Wed, Sep 22, 2010 at 07:42:46AM +0000, Erik Seres wrote:
> > I have a few questions regarding the RabbitMQ Erlang client library...
>
> Excellent. As you've probably noticed it is still very much experimental
> and in a state of flux. Things should be calming down soon though, and
> it might even become officially supported soon!
>
> > 1) The client now has layers of supervision built in. Apparently, however, the
> > maximum restart count for the various supervisors is set to 0. This will prevent
> > the supervisor from trying to restart its children in case of a crash. I have
> > verified this to be the case by the following:
> > - created a network type connection to the server,
> > - opened 2 channels within that connection,
> > - sent a non-normal exit signal to the newly opened channel process.
> >
> > The channel process crashed, its supervisor propagated the exit signal to the
> > other channel and finally, it shut down itself, too. The client has never
> > started back up again.
> >
> > Question: What would be the intended use case of a supervisor with MaxRetryCount
> > set to 0?
>
> This is a very common trick in our codebase and indeed further
> requirements led us to introduce the "intrinsic" restart strategy in our
> supervisor2 module.
>
> In our view, one of the most important properties of supervisors is to
> structure the relationship between different processes and lower
> coupling. The fact that it can do restarting is almost irrelevant in
> many cases.
>
> In the case of the client, restarting the connection in the event of a
> crash of some sort is a bizarre thing to do simply because it would be
> _impossible_ to be sure that you've got the connection and all the
> channels back to the same state. Channels themselves are stateful, but
> the state is _not_ set at channel creation - e.g. channel.qos txn.select
> basic.consume etc etc. Thus even if channel processes did restart, they
> would not be able to get back to the same state they were last in.
>
> It's not even reasonable to restart the connection: queues can be
> declared with exclusive=true which means when the connection closes, for
> whatever reason, the queues must be deleted. Of course, when the
> connection is created, it has no idea what queues are going to be
> declared this way, so if the connection gets automatically restarted, it
> can't possibly get back to the same state.
>
> All in all, this really means that there is just no way of hiding the
> fact the connection / channels have died from the user of the client.
> Also, AMQP is designed to pass errors back to the client by explicitly
> forcing channels or even the connection to be closed. Obviously, if the
> user has done something wrong, you would not want those events to be
> silently papered over.
>
> > 2) There is a channels manager process sitting at the same level in the
> > supervision tree as the supervisor that supervises the supervisors of the
> > channels (and writer and framing channel). The channels manager process does not
> > have any links to the channels it apparently is to manage.
>
> I think you're on the wrong branch - that sounds a lot like branch
> bug23024 which is not through QA yet. Please make sure you're using the
> default branch.
>
> > Question: What is the intended purpose of the channels manager?
>
> Basically, channel number allocation and mapping from channel number <->
> Pid. Also, don't worry about the lack of links - we use monitoring a lot
> and rely on the supervisor hierarchy to tear down the world if something
> really bad happens.
>
> > 3) I am trying to figure out how the four layers of supervision are supposed to
> > work and can't really wrap my head around it. The way I conceive it should work
> > is something like this, from top down:
> > - Layer 1: supervise the entire client library
> > - Layer 2: one_for_all supervision per connection. That is, when the
> > connection, main reader or writer dies, shutdown all channels within that
> > connection, restart the connection and all previously open channels.
> > - Layer 3: one_for_one supervision per channel. That is, when a channel dies,
> > restart the channel only and not affect anything else.
> >
> > Question: Is this the intended behavior or am I on the wrong track?
>
> Well, no restarting will ever happen. This is by design. From the bug
> that introduced this all, the diagram is roughly:
>
> amqp_sup (amqp_sup) (simple-141-term) (Note, this isn't there yet!)
>  |
>  +--undefined (amqp_connection_sup) (one-for-all) *
>      |
>      +--connection (amqp_{network,direct}_connection) (i)
>      |
>      +--channels_manager (amqp_channels_manager) (i) (bug23024)
>      |
>      +--connection_type_sup (amqp_connection_specific_sup) (i) (def)
>      |   |
>      |   +--framing (rabbit_framing_channel) (i) (N)
>      |   +--writer (rabbit_writer) (i) (N)
>      |   +--main_reader (amqp_main_reader) (i) (N)
>      |   +--rabbit_hearbeat_sender (rabbit_heartbeat) (i) (N) (def)
>      |   +--rabbit_hearbeat_receiver (rabbit_heartbeat) (i) (N) (def)
>      |   +--collector (rabbit_queue_collector) (i) (D)
>      |
>      +--channel_sup_sup (amqp_channel_sup_sup) (simple-141-kill)
>          |
>          +--undefined (amqp_channel_sup) (one-for-all) *
>              |
>              +--channel (amqp_channel) (i)
>              +--framing (rabbit_framing_channel) (i) (N) (def)
>              +--writer (rabbit_writer) (i) (N) (def)
>              +--rabbit_channel (rabbit_channel) (t) (D)
>              +--rabbit_limiter (rabbit_limiter) (t) (D) (def)
>
> Legend:
> (N) - only in the network case
> (D) - only in the direct case
> (i) - intrinsic
> (t) - transient
> (def) - started later on
> * - multiple instances
>
> Thus: a channel is itself multiple processes that sit under a
> supervisor (amqp_channel_sup). You have many of these under a
> channel_sup_sup, and you never care about any of them dying, which is
> why channel_sup_sup is a simple-one-for-one (standard brutal kill).
> Depending on the type of connection (network or direct), you need
> different processes which is why we have the connection_type_sup which
> is parameterised by the connection type.
>
> That's about it really.
>
> Matthew
>
Hi Matthew,

Thank you for the very quick and detailed response. The way you use
supervisors to define hierarchy is interesting, and now I understand why no
restart ever happens. In my case, I am not using transactions, so the
stateful nature of channels did not come to mind and I was not taking it into
consideration.
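
To check that I have understood the point about a restart count of 0, here is
a minimal sketch. The module name no_restart_sup and its stand-in worker are
made up for illustration and are not taken from the client; I am only assuming
that the client's supervisors behave along these lines.

```erlang
-module(no_restart_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, worker/0]).

%% Hypothetical supervisor, for illustration only.
start_link() ->
    supervisor:start_link(?MODULE, []).

init([]) ->
    %% MaxR = 0 within MaxT = 1 second: the first child crash already
    %% exceeds the restart intensity, so the supervisor shuts down its
    %% children and itself instead of restarting anything.
    {ok, {{one_for_all, 0, 1},
          [{worker, {?MODULE, worker, []},
            permanent, 1000, worker, [?MODULE]}]}}.

%% Stand-in child process: it just waits for a stop message.
worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.
```

Killing the worker makes the supervisor give up at once and exit with reason
shutdown, which matches what I observed with the channel processes.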

I do have one question regarding connections. You explain that an
exclusive queue must be deleted when its connection closes. On that same note,
however, how do you locate a durable queue after a connection has died (or has
been closed) and you have reopened it? If one can locate a durable queue after a
disconnect/reconnect, which as I understand it one must be able to, then why can
one not locate an exclusive queue the same way? (Tell me if I just need to dive
into the AMQP specs to understand this.)

In my application, I will open a connection at startup and then each
application process wanting to communicate over AMQP will open a channel. The
application will need to know if the amqp_client has crashed and the connection
and channels have been lost. So, I thought I would add the process ID returned
by amqp_connection:start(network, ...) to my supervision tree. This does not
seem to work, though, as the connection PID never shows up as a child under my
application supervisor and, consequently, it never gets restarted in case of a
crash. I have the "Type" in the ChildSpec set to 'supervisor' for this child.
With that said, how should I go about detecting when/if the amqp client has
crashed?
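
In the meantime, the fallback I am considering is to monitor the connection
pid rather than supervise it. Below is a minimal sketch, where conn_watch is a
made-up module and a plain spawned process stands in for the pid returned by
amqp_connection:start(network, ...). Whether this is the recommended approach
is exactly what I am unsure about.

```erlang
-module(conn_watch).
-export([wait_down/2, demo/0]).

%% Block until the monitored process goes down and return its exit reason.
%% Ref must come from erlang:monitor(process, Pid).
wait_down(Ref, Pid) ->
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            {connection_down, Reason}
    end.

%% Demo with a stand-in process instead of a real AMQP connection.
demo() ->
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% Set up the monitor before the crash so the exit reason is reported.
    Ref = erlang:monitor(process, Pid),
    exit(Pid, kill),
    wait_down(Ref, Pid).
```

With a real connection, the process that receives the 'DOWN' message could
then decide for itself whether to reconnect and reopen its channels.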

Also, what do you think is the expected time frame before this client library
becomes officially supported?

Thanks for your help!

Erik