[rabbitmq-discuss] last sticky wicket on map/reduce

Jon Brisbin jon.brisbin at npcinternational.com
Fri Oct 1 14:34:41 BST 2010


I'm also wondering if anyone uses counts to determine when a job is finished or not. By that I mean, increment a counter for every outgoing message and decrement the counter when a response is received. In the case of a map/reduce job, I'd need to do something like:

SQL -> Map phase = +1 (per row)
Map phase -> Reduce phase = -1 (that we got the original msg) +1 * (num of emit's)
Reduce phase -> Response|ReReduce = -1 (for emit's) +1 (for response/rereduce)
[ReReduce -> Response] = -1 +1 (for sending response)
Response = -1

Essentially, each step would decrement a counter for the incoming message and increment the counter for the outgoing message. A reduce phase might decrement the counter 1000 times and increment it once. But since the map phase incremented it 1000 times prior, the count after map/reduce would be "1". The response listener would then decrement the counter when it processed the response, see that it's now zero, and know to continue.

If my goal is to beat processing times on the AS/400 when doing large financial calculations (daily acct'g reports take several hours to generate), I can't really depend on timeouts to make sure I've gathered all my results. I want the job to return as soon as results are ready. I'd like to go to management and show them a 2 hr -> 15 min improvement by using parallel processing.

I'm just wondering if using ZooKeeper or similar to do distributed, synchronized counters will have enough atomicity to not miss a count incr/decr. If I miss even one, I'm screwed because it'll never get back to zero (or get there prematurely).

I need a sentence with a question mark or this will definitely go unanswered: are message counters like this a good way to monitor asynchronous, distributed processing state?

Thanks! :)

Jon Brisbin
Portal Webmaster
NPC International, Inc.



On Oct 1, 2010, at 8:11 AM, Jon Brisbin wrote:

> I had not really looked at the spring integration stuff for a solution. It looks interesting, though.
> 
> Thanks for the link...
> 
> Jon Brisbin
> Portal Webmaster
> NPC International, Inc.
> 
> 
> 
> On Sep 30, 2010, at 4:19 PM, Shane Witbeck wrote:
> 
>> Have you thought about using an Aggregator? Spring Integration offers this: 
>> 
>> http://static.springsource.org/spring-integration/reference/htmlsingle/spring-integration-reference.html#aggregator
>> 
>> I think Apache Camel offers this too. Both might be overkill in your case but maybe a look at how they're doing it will help.
>> 
>> HTH,
>> Shane
>> 
>> 
>> On Thu, Sep 30, 2010 at 4:41 PM, Jon Brisbin <jon.brisbin at npcinternational.com> wrote:
>> I've got a pest of a sticky wicket in my map/reduce implementation that's using Groovy for the logic and RabbitMQ for the plumbing. It's frustrating because I'm so close.
>> 
>> The problem I'm having is knowing when I'm finished. Using data like this:
>> 
>> 1
>> 2
>> 3
>> 4
>> 5
>> 6
>> 7
>> 8
>> 9
>> 10
>> END
>> 
>> The "END" goes through a separate consumer thread because it goes out on a fanout exchange (it has to go to all workers), so it comes in out-of-order from the other data:
>> 
>> 1    2    3
>>           END
>> 4    5    6
>> END  END  
>> 7    8    9
>> ...etc...
>> 
>> I can sort of work around this by keeping track of id changes in my consumers using the classic "if this.id != last.id" approach. But the last record is a tricky one because there's no key change event to trigger sending the response back. Unless I simply wait until a timeout has occurred, I'm not sure how I can tell when I've collected all the responses I'm going to get.
>> 
>> The problem is that I only know how many message I've sent and not how many to expect in return. emit() can be called multiple times from a map phase and the reduce phase can take (records per key) * (emitted) and either rereduce the result or reply back to the requestor. The requestor shouldn't know whether the result has been rereduced or not. It should simply process the return values.
>> 
>> What am I missing to handle situations like this? Should I introduce another component to this that keeps track of how many messages are sent and received? Maybe put a ZooKeeper install in somewhere and coordinate all this? I've already got Riak integrated, though I'd think ZooKeeper would be better at managing concurrent updates.
>> 
>> Any help or suggestions here would be greatly appreciated! :)
>> 
>> Jon Brisbin
>> Portal Webmaster
>> NPC International, Inc.
>> 
>> 
>> 
>> 
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>> 
>> 
>> 
>> 
>> -- 
>> -Shane
> 
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20101001/831a0347/attachment.htm>


More information about the rabbitmq-discuss mailing list