[rabbitmq-discuss] Fwd: Monitoring A Queue

Tue Oct 16 18:00:30 BST 2012

I've been tasked with monitoring a queue used by two applications, which 
are considered black boxes from my point of view (App A puts messages 
into a queue, App B consumes them; I have no control over the code for 
either app). I need to measure the rate, in messages/second, at which
messages move successfully from App A to App B through the queue. We're
concerned both with pileups in the queue (producer working normally,
consumer halted) and with a lack of new messages (producer halted). The
end result will be a Nagios check plugin that alerts when throughput, in
messages/second, falls below a given level and also produces performance
data for graphing.
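For reference, a Nagios check reports its state through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a single line of output, with performance data after a "|". Below is a minimal sketch of the threshold logic. It assumes the rate has already been measured somehow. check_rate and the msgs_per_sec label are hypothetical names, and the "5:" range syntax in the perfdata means "alert when below 5".

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_rate(rate, warn_below, crit_below, label="msgs_per_sec"):
    """Return (exit_code, output_line) for a measured rate in messages/second.

    Alerts when the rate falls *below* the thresholds, so crit_below
    should be lower than warn_below. The metric name is hypothetical.
    """
    if rate is None:
        return UNKNOWN, "UNKNOWN - no rate available"
    if rate < crit_below:
        code = CRITICAL
    elif rate < warn_below:
        code = WARNING
    else:
        code = OK
    # Perfdata: label=value;warn;crit;min. A threshold written as "N:" is a
    # Nagios range meaning "alert if the value is below N".
    perfdata = "%s=%.2f;%s:;%s:;0" % (label, rate, warn_below, crit_below)
    return code, "%s - %.2f msg/s | %s" % (STATUS_NAMES[code], rate, perfdata)
```

A wrapper script would print the returned line and call sys.exit() with the returned code.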

I've been examining the JSON returned by the management API
(/api/queues/vhost_name/queue_name), and have come to the following
conclusions:
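Fetching that document from a script looks roughly like the following. This is a minimal sketch that assumes Python 3's standard library, HTTP basic auth, and the management plugin on its 2.x default port 55672 (3.x moved it to 15672). queue_url and fetch_queue are hypothetical helper names. The vhost and queue name must be percent-encoded, so the default vhost "/" becomes "%2F".

```python
import base64
import json
import urllib.parse
import urllib.request


def queue_url(host, vhost, queue, port=55672, scheme="http"):
    """Build /api/queues/<vhost>/<queue>, percent-encoding both path parts."""
    return "%s://%s:%d/api/queues/%s/%s" % (
        scheme,
        host,
        port,
        urllib.parse.quote(vhost, safe=""),  # "/" -> "%2F"
        urllib.parse.quote(queue, safe=""),
    )


def fetch_queue(url, user, password, timeout=10):
    """GET the queue's JSON document using HTTP basic auth and decode it."""
    creds = ("%s:%s" % (user, password)).encode("utf-8")
    token = base64.b64encode(creds).decode("ascii")
    req = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, queue_url("localhost", "/", "my queue") yields http://localhost:55672/api/queues/%2F/my%20queue.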

1) There are no monotonically increasing counters of the number of
messages into the queue or the number of messages out of the queue.
Therefore, there's no way for me to calculate the rate over an
arbitrary time scale.

2) The "messages" element in the JSON object is the number of messages
currently sitting in the queue, not a counter of the number that have
passed through it.

3) The API does provide backing_queue_status.avg_ingress_rate and
backing_queue_status.avg_egress_rate. Going by Matthew's post in the
"Monitoring Message Throughput" thread
(http://rabbitmq.1065348.n5.nabble.com/Monitoring-Message-Throughput-td17436.html#a17438)
from February 2010, I gather that these values are averaged over
approximately 10 seconds. I'm also assuming that ingress counts *every*
message into the queue, and egress counts *every* message taken from
the queue, regardless of ACK. So even if consumers are not acking any
messages, would these rates still include them? And does that mean
these values wouldn't be useful for a low-utilization, bursty queue
(say, a 30-message burst every few minutes)?

Are these conclusions correct?
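The concerns in points 2 and 3 can be made concrete with a small sketch. Sampling only the "messages" depth gives the net rate (ingress minus egress). That means an idle queue and a busy but balanced queue look the same. Separately, a short averaging window reads zero for most of the time on a bursty queue. net_rate and windowed_rate are hypothetical helpers. windowed_rate is a plain sliding window that mirrors the approximate 10-second period mentioned above. It is not how RabbitMQ actually computes avg_ingress_rate.

```python
def net_rate(depth_before, depth_after, seconds):
    """Net change in queue depth per second.

    This equals (ingress rate - egress rate); neither rate can be
    recovered from the depth alone.
    """
    return (depth_after - depth_before) / float(seconds)


def windowed_rate(timestamps, now, window=10.0):
    """Messages per second over the `window` seconds ending at `now`.

    A plain sliding window over message timestamps, used only as an
    illustration of short-window averaging.
    """
    recent = [t for t in timestamps if now - window < t <= now]
    return len(recent) / window


# An idle queue and a busy but balanced queue look identical by depth:
print(net_rate(0, 0, 60))        # -> 0.0 (nothing flowing at all)
print(net_rate(500, 500, 60))    # -> 0.0 (e.g. 100 msg/s in and 100 msg/s out)

# A 30-message burst at the start of each 180-second cycle:
bursts = [cycle * 180 + i * 0.1 for cycle in range(3) for i in range(30)]
print(windowed_rate(bursts, now=5))    # -> 3.0 (shortly after a burst)
print(windowed_rate(bursts, now=90))   # -> 0.0 (mid-cycle, between bursts)
```

A threshold check fed by such a short-window rate would alert between every burst, even though the producer and consumer are both healthy.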

It would be really helpful if someone could explain exactly what
avg_ingress_rate/avg_egress_rate and avg_ack_ingress_rate/avg_ack_egress_rate
mean in the JSON output. I've done a lot of testing with a Pika client
connected to RabbitMQ 2.8.7 on my workstation, but when I watch
everything simultaneously, I can't draw a clear connection between what
my client is doing and what avg_ingress_rate, avg_egress_rate, or
avg_ack_[in|e]gress_rate show.
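One way to line the values up against client activity is to log them once per second next to a timestamp. This is a minimal sketch: rate_row and watch are hypothetical names. `fetch` is assumed to be any zero-argument callable that returns the decoded JSON for /api/queues/vhost_name/queue_name. Fields missing from backing_queue_status (which may vary between RabbitMQ versions) are reported as None rather than guessed.

```python
import json
import time

FIELDS = ("avg_ingress_rate", "avg_egress_rate",
          "avg_ack_ingress_rate", "avg_ack_egress_rate")


def rate_row(queue_json):
    """Pull the queue depth and the four backing_queue_status rates."""
    bqs = queue_json.get("backing_queue_status", {})
    row = {"messages": queue_json.get("messages")}
    for name in FIELDS:
        row[name] = bqs.get(name)  # None if this version lacks the field
    return row


def watch(fetch, interval=1.0, samples=10):
    """Print one timestamped row per interval for side-by-side comparison."""
    for _ in range(samples):
        row = rate_row(fetch())
        print(time.strftime("%H:%M:%S"), json.dumps(row, sort_keys=True))
        time.sleep(interval)
```

Running this while a test client publishes, consumes, and acks at known times makes it easier to see which field moves in response to which action.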

I also found a reference to Ticket 23 (though I don't see any links
to the RMQ ticketing system) that dealt with monitoring, but no
mention of any updates to it:
http://grokbase.com/t/rabbitmq/rabbitmq-discuss/10ag2hejbb/ticket-23-new-comment-monitoring-message-throughput 

Any guidance/advice/pointers would be greatly appreciated. I've been 
working on this monitoring for two days now, and I can't find anything 
that I'm confident is a correct solution...

Thanks,
Jason Antman