[rabbitmq-discuss] precondition_failed error with amqp_client for erlang

Mon Jul 11 13:32:45 BST 2011

Hi Max,

Sorry for the wait, we're actually rewriting that part of the Erlang
client right now.  So, some of the things I say now will change in the
near future.

> For the sake of clarity here is the glossary of terms I used in my last

Thanks for the clarification.

> Yes a list of messages that the amqp_client process sends to a subscriber,
> particularly pertaining to errors in amqp_client land, would be very
> helpful.  I'd like to be able to handle all {'DOWN',Etc} messages with my
> long running process (server).  I'm hoping to handle all hard errors so that
> a restart from either supervisor (my long running process or the
> amqp_client's) won't break the communication between the two.

You don't need to call amqp_client:start/2; amqp_connection:start/1
does that automatically.  If you really do need to do it manually, you
should probably use amqp_client:start/0, which uses the .app.  I don't
think it makes any difference to what you're doing.

The AMQP client supervises connections and connections supervise their
channels.  Neither channels nor connections are restarted if they
encounter an error; in certain cases they actually take down their
supervisors, as well.

You normally receive messages from connections and channels.  I don't
see any way for amqp_sup to send a message or even die.  It's only there
if you need a quick way to close all connections.

Connections and channels are gen_servers and you're used to working with
them through the various amqp_connection and amqp_channel methods.
Ignoring the usual calls/casts and replies, a connection shouldn't send
you any messages and a channel should only send:
  * {#'basic.deliver'{}, #'amqp_msg'{}}, #'basic.cancel'{}, and
    #'basic.cancel_ok'{}, to registered consumers (including the
    default consumer),
  * {#'basic.return'{}, #'amqp_msg'{}} to the return handler if it's
    registered,
  * #'channel.flow'{} to the flow handler if it's registered, and
  * #'basic.ack'{} and #'basic.nack'{} to the confirm handler if it's
    registered.
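
As a rough sketch of how a receiving process might sort the messages
listed above: records are just tagged tuples, so the code below matches
on the record name to stay independent of the client's header file.  The
module name, function name and returned atoms are made up for
illustration; in real code you would include amqp_client.hrl and match
on the records directly.

```erlang
-module(channel_msgs).
-export([classify/1]).

%% Hypothetical helper, not part of amqp_client.
%% A record is a tuple whose first element is the record name, so
%% element(1, R) tells us which AMQP method we were sent.

%% Methods that carry content arrive as {Method, #amqp_msg{}}.
classify({Method, Content})
  when tuple_size(Method) >= 1, element(1, Content) =:= amqp_msg ->
    case element(1, Method) of
        'basic.deliver' -> delivery;   % sent to registered consumers
        'basic.return'  -> returned;   % sent to the return handler
        _               -> unknown
    end;
%% Everything else arrives as a bare method record.
classify(Method) when tuple_size(Method) >= 1 ->
    case element(1, Method) of
        'basic.cancel'    -> consumer_cancelled;
        'basic.cancel_ok' -> consumer_cancelled;
        'channel.flow'    -> flow;      % sent to the flow handler
        'basic.ack'       -> confirm;   % sent to the confirm handler
        'basic.nack'      -> confirm;
        _                 -> unknown
    end;
classify(_) ->
    unknown.
```

A receive loop could call classify/1 on each incoming message and
dispatch on the result, logging anything that comes back as unknown.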

(We're currently changing the way consumers work: consumer behaviour
will be determined by what consumer module is registered with the
channel.  The default selective_consumer will behave more or less like
the current implementation and will send out approximately the same
messages.  By contrast, direct_consumer will forward all the messages
it receives to another process and let it handle the details of
multiplexing many consumer processes.)

In case of an error, connections and channels behave in the same way: if
the error occurs while waiting for a blocking call to finish (for
instance during an amqp_channel:call), the channel/connection will die
and the calling process will be sent one of the exit signals below (as
per gen_server normal behaviour); if the error occurs "in the background",
the channel/connection will exit with one of the reasons below.
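
For the "in the background" case, one way to see the exit reason is to
monitor the channel or connection pid and wait for the 'DOWN' message.
Here is a minimal sketch using only standard Erlang; the module and
function names are made up, and demo/0 uses a stand-in process rather
than a real channel.

```erlang
-module(watch_exit).
-export([watch/1, await/2, demo/0]).

%% Hypothetical helpers, not part of amqp_client.

%% Start monitoring a channel or connection pid.
watch(Pid) ->
    erlang:monitor(process, Pid).

%% Wait up to Timeout milliseconds for the monitored process to die.
await(Ref, Timeout) ->
    receive
        {'DOWN', Ref, process, _Pid, Reason} -> {down, Reason}
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        timeout
    end.

%% Stand-in for a channel that the server closes with a soft error.
demo() ->
    Reason = {shutdown, {server_initiated_close, 406,
                         <<"PRECONDITION_FAILED">>}},
    Pid = spawn(fun() -> receive {die, R} -> exit(R) end end),
    Ref = watch(Pid),
    Pid ! {die, Reason},
    await(Ref, 1000).
```

A monitor only delivers a message, so unlike a link it cannot take the
watching process down with the channel.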

The following are Connection's possible exit reasons:
  * normal, when the connection was closed cleanly by the *user*,
  * {timeout_waiting_for_close_ok, Reason}, {socket_closing_timeout,
    Reason}, socket_closed_unexpectedly, {socket_error, _},
    {channel0_died, Reason}, heartbeat_timeout; these are all specific
    to network connections: half deal with various environment problems,
    the other half involve errant server behaviour or a key process
    inside the connection dying;
  * {shutdown, {error, Error}} caused by gen_tcp:connect during the initial
    call to amqp_connection:start/0;
  * {shutdown, normal}, {shutdown, {server_initiated_close, Code, Text}},
    when the server initiated the connection close; here, normal just
    means that the reply_code was 200;
  * {shutdown, {app_initiated_close, Code, Text}}, {shutdown,
    {server_misbehaved, Code, Text}}, {shutdown, {internal_error, Code,
    Text}} when you or the amqp_client started shutting down the
    connection;
  * if you attempt to use a connection or channel method while the
    connection is closing, the method will return closing.
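
To make the grouping above concrete, here is a sketch that sorts a
connection exit reason into the categories just listed.  The module
name and the category atoms are invented for illustration, and the list
of reasons is only the one given in this message, which may change with
the rewrite.

```erlang
-module(conn_exit).
-export([classify/1]).

%% Hypothetical helper, not part of amqp_client.

%% Clean close requested by the user.
classify(normal) -> closed_by_user;
%% gen_tcp:connect failed while the connection was starting.
classify({shutdown, {error, Error}}) -> {connect_failed, Error};
%% The server initiated the close (reply_code 200 shows up as normal).
classify({shutdown, normal}) -> {closed_by_server, normal};
classify({shutdown, {server_initiated_close, Code, Text}}) ->
    {closed_by_server, {Code, Text}};
%% The application or the client library started the shutdown.
classify({shutdown, {app_initiated_close, Code, Text}}) ->
    {closed_by_client, {Code, Text}};
classify({shutdown, {server_misbehaved, Code, Text}}) ->
    {closed_by_client, {Code, Text}};
classify({shutdown, {internal_error, Code, Text}}) ->
    {closed_by_client, {Code, Text}};
%% Reasons specific to network connections.
classify({timeout_waiting_for_close_ok, R}) -> {network_problem, R};
classify({socket_closing_timeout, R})       -> {network_problem, R};
classify(socket_closed_unexpectedly) ->
    {network_problem, socket_closed_unexpectedly};
classify({socket_error, E})  -> {network_problem, E};
classify({channel0_died, R}) -> {network_problem, R};
classify(heartbeat_timeout)  -> {network_problem, heartbeat_timeout};
%% Anything not listed in this message.
classify(Other) -> {unknown, Other}.
```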

The following are Channel's possible exit reasons:
  * normal, when the channel was closed cleanly by the *user*, or when
    the connection is closed cleanly,
  * timed_out_flushing_channel, timed_out_waiting_close_ok, when the
    channel is waiting for various things; the first one happens when
    the channel is closing cleanly but still has pending methods to
    execute and times out; see below,
  * {shutdown, {server_misbehaved, #amqp_error{}}}, when the server does
    something illegal to the channel,
  * {shutdown, {server_initiated_close, Code, Text}}, when the server
    sends a channel.close to *this* channel (if it's sent to another
    channel, this channel will exit with a connection_closing status),
  * {shutdown, {app_initiated_close, Code, Text}}, when the user closes
    this channel, and
  * {shutdown, {connection_closing, Reason}}, when the connection is
    closing and has instructed this channel to close; the Reason is
    one of the connection closing reasons mentioned above; if the Reason
    is normal, the channel will also exit with normal.
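
One possible way to act on these reasons in a long-running process is
sketched below.  The recovery policy is an assumption, not something the
client prescribes: it follows Matthew's earlier remark that a soft error
such as PRECONDITION_FAILED should only cost you the channel, so a new
channel is opened on the same connection, while connection_closing
means the connection itself has to be replaced.  Module, function and
result names are made up.

```erlang
-module(chan_exit).
-export([next_step/1]).

%% Hypothetical policy helper, not part of amqp_client.

%% Clean close: nothing to recover.
next_step(normal) -> stop;
next_step({shutdown, {app_initiated_close, _Code, _Text}}) -> stop;
%% Only this channel is gone; assume the connection is still usable.
next_step({shutdown, {server_initiated_close, Code, Text}}) ->
    {reopen_channel, {Code, Text}};
next_step({shutdown, {server_misbehaved, Error}}) ->
    {reopen_channel, Error};
next_step(timed_out_flushing_channel) ->
    {reopen_channel, timed_out_flushing_channel};
next_step(timed_out_waiting_close_ok) ->
    {reopen_channel, timed_out_waiting_close_ok};
%% The connection is going away, so a new connection is needed too.
next_step({shutdown, {connection_closing, Reason}}) ->
    {reconnect, Reason};
%% Anything not listed in this message.
next_step(Other) -> {unknown, Other}.
```

A server that monitors its channel could feed the 'DOWN' reason to
next_step/1 and resubscribe only after the new channel is open.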

Amqp_channel holds a queue of pending RPC methods to execute.  When you
call amqp_channel:close/1, it starts "flushing", that is, it no longer
accepts new methods, but still tries to execute the already enqueued
ones.

Those are the different ways connections and channels can close as far
as I can tell.

We're currently changing the behaviour so that the various methods
return {error, Error} rather than kill the calling process.

> But I get the impression that I'm missing something about how I'm supposed
> to treat the amqp_client library with regard to amqp_client:start/2.  Should
> I be treating the amqp_client connection like mnesia (an application
> entirely independent of mine), add it to my existing supervision tree and
> share one connection throughout my application, or, what I'm currently
> doing, let each part of my application that needs to talk to amqp spin up
> and close their own connection/channel?

It's more of a standalone application, I think.  There's not much point
in supervising amqp_sup since it can't really go down.  It's probably
best to just ignore it entirely and focus on channels and connections,
which are *not* restarted on failure.

Multiple connections are fine, especially since otherwise you'd have to
worry about restarting connections *and* letting other processes know
that the connection has changed.

Does this help?

Cheers,
Alex

On Tue, Jul 05, 2011 at 11:28:46AM -0400, Max Warnock wrote:
> Sorry about the ambiguity,
> 
> For the sake of clarity here is the glossary of terms I used in my last
> email (which probably clashes with the erlang/amqp_client context you're
> coming from):
> 
> Server - I'm referring to my long running process in erlang that I have
> given a registered name and passed as the 3rd argument to
> amqp_client:subscribe/3.
> Listener - the process created by the amqp_client library when a connection
> and channel are opened
> Subscribe - calling of amqp_client:subscribe/3
> My - I'm using this pronoun to distinguish code written by me from code
> written by you cats at rabbitmq (client library, rabbitmq server, etc)
> 
> I've attached a diagram (approximation/abstraction) of how I'm interacting
> with the amqp_client library. (sorry to the mailing list if attaching a 40K
> diagram breaks etiquette).
> 
> I'm using the amqp_client library in network mode, i.e.,
> amqp_client:start(network, #amqp_params{host = Host, heartbeat=60000})
> 
> Yes a list of messages that the amqp_client process sends to a subscriber,
> particularly pertaining to errors in amqp_client land, would be very
> helpful.  I'd like to be able to handle all {'DOWN',Etc} messages with my
> long running process (server).  I'm hoping to handle all hard errors so that
> a restart from either supervisor (my long running process or the
> amqp_client's) won't break the communication between the two.
> 
> But I get the impression that I'm missing something about how I'm supposed
> to treat the amqp_client library with regard to amqp_client:start/2.  Should
> I be treating the amqp_client connection like mnesia (an application
> entirely independent of mine), add it to my existing supervision tree and
> share one connection throughout my application, or, what I'm currently
> doing, let each part of my application that needs to talk to amqp spin up
> and close their own connection/channel?
> 
> Thanks,
> -Max
> 
> On Mon, Jul 4, 2011 at 12:16 PM, Alexandru Scvorţov
> <alexandru at rabbitmq.com> wrote:
> 
> > Hi Max,
> >
> > I'm trying to run through the steps you provided, but I'm having a bit
> > of trouble following.
> >
> > Are you using a network or a direct connection? (I assume network, but
> > it probably doesn't matter)
> >
> > By server, do you mean the actual RabbitMQ server, or your application?
> > (I'm guessing your long-running application)
> >
> > By subscribe, do you mean calling amqp_channel:subscribe/3?  If so, do
> > you still need a list of the messages the channel may send its
> > subscriber?
> >
> > Or do you mean that your application is sending messages to its
> > subscribers?
> >
> > > 6.) The server supervisor restarts the server which creates a new
> > > listener, but the old listener is still hanging around trying to send
> > > to the registered name
> >
> > What's a listener?  Is it a process that receives messages from the
> > erlang client because it's the endpoint of a subscription to a queue?
> >
> > Can't you link listeners to the server so that when the server goes
> > down, it takes the listeners with it?
> >
> > > So my question then is how should I kill the amqp_client?
> >
> > What do you mean by amqp_client?  If it's an amqp_connection process,
> > you can just send it a shutdown exit.
> >
> > Thanks for the information.
> >
> > Cheers,
> > Alex
> >
> > On Fri, Jul 01, 2011 at 01:51:10PM -0400, Max Warnock wrote:
> > > Problem found.  Thanks for your help.  The problem is a strange one and
> > > has to do with me not shutting my amqp_client listener down properly if
> > > my server dies.  Here is how it manifests:
> > >
> > > 1.) Server starts up and starts up an amqp client connection and channel
> > > 2.) The server binds to that channel and starts the subscription using a
> > > registered name as the process to which messages will be sent
> > > 3.) Messages start coming in and are ack-ing fine
> > > 4.) Poor error handling in farming out processes brings the server down
> > > 5.) The server does not close the amqp_client connection
> > > 6.) The server supervisor restarts the server which creates a new
> > > listener, but the old listener is still hanging around trying to send
> > > to the registered name
> > > 7.) The older listener sends a message to the server
> > > 8.) The server tries to ack to the new listener which did not send the
> > > message
> > > 9.) The new server pukes because it never sent a message with that tag
> > >
> > > So my question then is how should I kill the amqp_client? If I send it an
> > > exit its supervisor will restart it.  This is what I was getting at with
> > > my tangential questions in the last email.  How should I shut down the
> > > amqp_client without shutting down all the other servers' amqp client
> > > listeners?
> > >
> > > Thanks for all the help,
> > > -Max
> > >
> > > On Thu, Jun 30, 2011 at 9:23 AM, Max Warnock <maxjwarnock at gmail.com> wrote:
> > >
> > > > Thanks, that's very helpful from both the possible issues to chase and
> > > > sanity check perspectives.
> > > >
> > > > I'm using erlang R13B04 with a rabbitmq server installed via gentoo's
> > > > portage at version 2.4.1. I pulled the client library from github (tag
> > > > 2.3.0, commit: 844738f9b56d34104c1ea2ac5700d0898126c5b4).
> > > >
> > > > I'm going to write some debug code to store all the tags I try to ack
> > > > on and see if I can get this error to where it's easily reproducible.
> > > > Thanks for narrowing my search, it's very helpful.  I'll keep you
> > > > updated. I must be doing something wrong somewhere.  I have a hard time
> > > > believing such a widely used library could fail so hard myself.
> > > >
> > > > One thing that would be extremely helpful is if you could point me to
> > > > some documentation which I haven't been able to find:  I'm looking for a
> > > > listing of all the events/messages that are sent out by the amqp client
> > > > to a subscriber.  What does it send when it goes down, what other soft
> > > > errors will it send out, etc.  Additionally, is there a doc somewhere
> > > > for best practices in connecting a listener to another
> > > > server/long-running process?  Not having either of those there has been
> > > > some struggle to know how to restart the subscription/listening process
> > > > if my server dies.  The amqp_client tutorial has been a great help, but
> > > > when it comes to error handling from the listening module perspective
> > > > it doesn't tell me what the library is expecting me to do.  I don't
> > > > want to have to do a bunch of engineering because I'm square peg, round
> > > > hole-ing the library.  The primary issues I'm concerned with are when
> > > > my server dies hard and is destined to be restarted by its supervisor
> > > > what should I send to the amqp client process?  Should I send it close
> > > > messages and then start a new one? Or should I reconnect to the client
> > > > library.  This wouldn't be as big of an issue but I need to use
> > > > durable/persistent queues and if I still have a listener hanging around
> > > > with the same bindings on the same queue it will eat all my messages
> > > > and send them nowhere.
> > > >
> > > > Thanks,
> > > > -Max
> > > >
> > > > On Thu, Jun 30, 2011 at 7:48 AM, Matthew Sackman <matthew at rabbitmq.com> wrote:
> > > >
> > > >> Hi Max,
> > > >>
> > > >> On Wed, Jun 29, 2011 at 06:28:59PM -0400, Max Warnock wrote:
> > > > >> > I've built a behavior in erlang to subscribe to a given topic
> > > > >> > exchange and farm out message handling.  I'm using the rabbitmq
> > > > >> > amqp_client library for erlang and when I put the system under
> > > > >> > heavy load I get, on occasion, the following error:
> > > >>
> > > >> Could you let us know which version of Rabbit, Erlang and the Erlang
> > > >> client you're using?
> > > >>
> > > >> > =ERROR REPORT==== 29-Jun-2011::18:02:18 ===
> > > >> > ** Generic server <0.1117.0> terminating
> > > >> > ** Last message in was {'$gen_cast',
> > > >> >                            {method,
> > > >> >                                {'channel.close',406,
> > > >> >                                    <<"PRECONDITION_FAILED - unknown
> > > >> delivery
> > > >> > tag 856">>,
> > > >> >                                    60,80},
> > > >>
> > > >> That's a double-ack (probably). Sadly, the AMQP 0-9-1 spec says that
> > > >> acking is not idempotent, thus it's a fault to ack the same message
> > > >> multiple times...
> > > >>
> > > >> > The server receive loop where the ack happens looks like this:
> > > >> > receive
> > > >> > ...
> > > > >> > {#'basic.deliver'{delivery_tag = Tag, routing_key = RoutingKey},
> > > > >> > #amqp_msg{payload = Payload}} ->
> > > > >> >     amqp_channel:cast(get(amqp_channel_pid),
> > > > >> >                       #'basic.ack'{delivery_tag = Tag}),
> > > > >> >     spawn_and_queue(spawn_handle_message, Module, RoutingKey,
> > > > >> >                     Payload),
> > > > >> >     loop(Module);
> > > >> > ...
> > > >> > end
> > > >>
> > > >> ...hmmm, which is so simple that I can't see how it could go wrong: if
> > > >> you're not double acking then something else must be going on to make
> > > > >> the broker think that it's not expecting an ack for that message,
> > > > >> hence the error. If you're doing some sort of reject operation - either
> > > > >> basic.nack or basic.reject on messages and you then subsequently ack
> > > > >> one of those messages then that would also cause this error. There may
> > > > >> be other cases as well.
> > > >>
> > > > >> > The amqp_client_sup can't seem to bring back the client either and
> > > > >> > dies from the retry intensity being reached.  I've done a hefty
> > > > >> > amount of googling and can't seem to find where things could be
> > > > >> > going wrong.  Before jumping into the amqp_client code I thought I'd
> > > > >> > ask the mailing list if they have any ideas.  The only thing I can
> > > > >> > think is that there is a race condition within the client library.
> > > > >> > I will be double checking my code to be sure it isn't sending the
> > > > >> > ack twice, but given the simplicity of the ack the only way it could
> > > > >> > is if it receives the same message (with identical delivery tag)
> > > > >> > from the amqp_client library twice.
> > > >>
> > > >> It could be a bug in the client library, but I'd be a little surprised
> > > >> if we're managing to duplicate messages somehow - that would be a new
> > > >> level of fail for us. ;) However, the fact that the entire connection
> > > >> dies is alarming and almost certainly a bug: PRECONDITION_FAILED is a
> > > >> soft error and should only tear down the channel, not the whole
> > > >> connection. After that, all you should have to do is create a new
> > > > >> channel and everything else should be ok. If that's not the case
> > > > >> please let us know.
> > > >>
> > > >> Best wishes,
> > > >>
> > > >> Matthew
> > > >> _______________________________________________
> > > >> rabbitmq-discuss mailing list
> > > >> rabbitmq-discuss at lists.rabbitmq.com
> > > >> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
> > > >>
> > > >
> > > >
> >
> > > _______________________________________________
> > > rabbitmq-discuss mailing list
> > > rabbitmq-discuss at lists.rabbitmq.com
> > > https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
> >
> >