[rabbitmq-discuss] last sticky wicket on map/reduce

Jon Brisbin jon at jbrisbin.com
Fri Oct 1 15:04:09 BST 2010


On Oct 1, 2010, at 8:51 AM, Alexis Richardson wrote:

> Why can't you use a checksum instead? Each time you create a set of n subtasks from some task T, attach a fraction m/n to each subtask where m is the fraction attached to T. Start with m equals 1. The sum of the fractions will always be 1. No need for shared counters...
> 

Wouldn't I have to know the total number of subtasks I'll create by the time the first subtask goes out? I don't know that in this case.

Could you give me an example of what you're thinking?
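For illustration, here is one possible reading of the fraction idea quoted above, as a minimal in-process sketch. The function name run_job, the random split rule, and the cutoff of 1/50 are all assumptions made up for the example; nothing here talks to a broker. Each task divides only its own fraction among the subtasks it creates at that moment, so under this reading the total number of subtasks never has to be known up front.

```python
from fractions import Fraction
import random

def run_job(seed=0):
    """Simulate a job whose tasks split into an unpredictable number of subtasks."""
    rng = random.Random(seed)
    pending = [Fraction(1)]   # the root task T carries the whole weight, m = 1
    collected = Fraction(0)   # weight reported back by finished leaf tasks
    leaves = 0
    while pending:
        m = pending.pop()
        # Hypothetical rule: a task either finishes or splits into n subtasks.
        # n is decided only at the moment of the split.
        splits = m > Fraction(1, 50) and rng.random() < 0.7
        n = rng.randint(2, 5) if splits else 0
        if n == 0:
            collected += m    # leaf task: hand its fraction to the collector
            leaves += 1
        else:
            pending.extend([m / n] * n)   # each subtask is tagged with m/n
    return collected, leaves

collected, leaves = run_job()
print(leaves, collected == 1)
```

In a distributed version the collector would only need the running sum of returned fractions, and the job would be complete when that sum is exactly 1. Exact rationals are used here so the comparison is not affected by floating-point rounding.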

jb

> 
>> On Oct 1, 2010 3:35 PM, "Jon Brisbin" <jon.brisbin at npcinternational.com> wrote:
>> 
>> I'm also wondering if anyone uses counts to determine when a job is finished or not. By that I mean, increment a counter for every outgoing message and decrement the counter when a response is received. In the case of a map/reduce job, I'd need to do something like:
>> 
>> SQL -> Map phase = +1 (per row)
>> Map phase -> Reduce phase = -1 (for the original msg received) +1 * (number of emits)
>> Reduce phase -> Response|ReReduce = -1 (per emit) +1 (for the response/rereduce)
>> [ReReduce -> Response] = -1 +1 (for sending the response)
>> Response = -1
>> 
>> Essentially, each step would decrement a counter for the incoming message and increment the counter for the outgoing message. A reduce phase might decrement the counter 1000 times and increment it once. But since the map phase incremented it 1000 times prior, the count after map/reduce would be "1". The response listener would then decrement the counter when it processed the response, see that it's now zero, and know to continue.
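The bookkeeping described above can be sketched in a single process as follows. This is only an illustration under assumptions: PendingCounter and run are hypothetical names, a lock-protected integer stands in for whatever shared counter would be used, and the optional rereduce step is left out. Each step applies its decrement and increments as one combined delta, so the count cannot touch zero between consuming a message and sending its successors.

```python
import threading

class PendingCounter:
    """In-process stand-in for a shared counter (hypothetical helper)."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, delta):
        # Apply the whole delta in one step and return the new count.
        with self._lock:
            self._value += delta
            return self._value

def run(rows, emits_per_row):
    counter = PendingCounter()
    counter.add(len(rows))                  # SQL -> map: +1 per row
    emitted = []
    for row in rows:                        # map phase
        out = [(row, i) for i in range(emits_per_row)]
        counter.add(len(out) - 1)           # -1 for the row, +1 per emit
        emitted.extend(out)
    # Reduce phase: -1 per emit consumed, +1 for the single response.
    after_reduce = counter.add(1 - len(emitted))
    final = counter.add(-1)                 # response listener: -1
    return after_reduce, final

print(run(list(range(1000)), 1))
```

With 1000 rows and one emit per row, the count is 1 after the reduce step and 0 once the response listener has processed the response, matching the numbers in the description.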
>> 
>> If my goal is to beat processing times on the AS/400 when doing large financial calculations (daily accounting reports take several hours to generate), I can't really depend on timeouts to make sure I've gathered all my results. I want the job to return as soon as results are ready. I'd like to go to management and show them an improvement from 2 hours to 15 minutes by using parallel processing.
>> 
>> I'm just wondering if using ZooKeeper or similar to do distributed, synchronized counters will have enough atomicity to not miss a count incr/decr. If I miss even one, I'm screwed because it'll never get back to zero (or get there prematurely).
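On the question of missed increments: ZooKeeper's setData takes an expected version and is rejected if the node has changed since it was read, which allows a read, modify, conditional-write retry loop. The sketch below imitates that pattern with an in-memory object rather than a real ZooKeeper client; VersionedRegister, set_if_version, add, and hammer are hypothetical names invented for the example.

```python
import threading

class VersionedRegister:
    """Hypothetical in-memory stand-in for a znode: a value plus a version."""
    def __init__(self, value=0):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value, self._version

    def set_if_version(self, value, expected_version):
        # Succeeds only if nobody else has written since the caller's read.
        with self._lock:
            if self._version != expected_version:
                return False
            self._value = value
            self._version += 1
            return True

def add(register, delta):
    """Read, compute, conditionally write; retry whenever the race is lost."""
    while True:
        value, version = register.get()
        if register.set_if_version(value + delta, version):
            return

def hammer(workers=8, ops=500):
    reg = VersionedRegister()

    def work():
        for _ in range(ops):
            add(reg, +1)
            add(reg, -1)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return reg.get()[0]

print(hammer())
```

Because every write is conditional on the version that was read, concurrent updates are retried rather than overwritten, and the count returns to zero. This only covers lost updates on the counter itself; a message that is lost or processed twice would still leave the count wrong.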
>> 
>> I need a sentence with a question mark or this will definitely go unanswered: are message counters like this a good way to monitor asynchronous, distributed processing state?
>> 
>> Thanks! :)
>> Jon Brisbin
>> Portal Webmaster
>> NPC International, Inc.
>> 
>> On Oct 1, 2010, at 8:11 AM, Jon Brisbin wrote:
>> > I had not really looked at the spring integration ...
>> 
>> 
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>> 
> 


Thanks!

J. Brisbin
http://jbrisbin.com/





