[rabbitmq-discuss] last sticky wicket on map/reduce
Jon Brisbin
jon at jbrisbin.com
Fri Oct 1 15:04:09 BST 2010
On Oct 1, 2010, at 8:51 AM, Alexis Richardson wrote:
> Why can't you use a checksum instead? Each time you create a set of n subtasks from some task T, attach the fraction m/n to each subtask, where m is the fraction attached to T. Start with m = 1 for the root task. The fractions of all outstanding subtasks will then always sum to 1, so there is no need for shared counters...
>
Wouldn't I have to know how many subtasks I'll create in total when the first subtask goes out? I don't know that in this case.
Could you give me an example of what you're thinking?
jb
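
One way to read the checksum suggestion quoted above, as a minimal sketch rather than anything taken from the thread: n only has to be known at the moment a single task fans out, not for the job as a whole. The function names, the random fan-out, and the depth limit below are all hypothetical and exist only to make the example run; exact rationals (`fractions.Fraction`) are assumed because floats would accumulate rounding error and might never sum to exactly 1.

```python
from fractions import Fraction
import random


def split(weight, n):
    """Divide a task's weight evenly among its n subtasks."""
    return [weight / n] * n  # Fraction / int stays an exact Fraction


def simulate_fraction_job(seed=0, max_depth=3):
    """Run a toy job whose fan-out is unknown in advance.

    Returns (sum of completed weights, number of leaf tasks).
    """
    rng = random.Random(seed)
    pending = [(Fraction(1), 0)]   # the root task T carries m = 1
    completed = Fraction(0)        # what a collector would accumulate
    leaves = 0
    while pending:
        weight, depth = pending.pop()
        # Hypothetical rule: a task may fan out into 2-4 subtasks
        # until max_depth is reached; otherwise it finishes as a leaf.
        if depth < max_depth and rng.random() < 0.7:
            n = rng.randint(2, 4)
            pending.extend((w, depth + 1) for w in split(weight, n))
        else:
            completed += weight    # leaf reports its fraction back
            leaves += 1
    return completed, leaves


if __name__ == "__main__":
    total, leaves = simulate_fraction_job(seed=42)
    # The collector can stop as soon as the total reaches exactly 1.
    print(total == 1, leaves)
```

In this sketch the only state the collector needs is the running sum of the fractions it has received; it knows the job is finished when that sum equals 1.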
>
>> On Oct 1, 2010 3:35 PM, "Jon Brisbin" <jon.brisbin at npcinternational.com> wrote:
>>
>> I'm also wondering whether anyone uses counts to determine when a job is finished. By that I mean: increment a counter for every outgoing message and decrement it when a response is received. In the case of a map/reduce job, I'd need to do something like:
>>
>> SQL -> Map phase = +1 (per row)
>> Map phase -> Reduce phase = -1 (for receiving the original msg) +1 * (number of emits)
>> Reduce phase -> Response|ReReduce = -1 (per emit received) +1 (for response/rereduce)
>> [ReReduce -> Response] = -1 +1 (for sending response)
>> Response = -1
>>
>> Essentially, each step would decrement a counter for the incoming message and increment the counter for the outgoing message. A reduce phase might decrement the counter 1000 times and increment it once. But since the map phase incremented it 1000 times prior, the count after map/reduce would be "1". The response listener would then decrement the counter when it processed the response, see that it's now zero, and know to continue.
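
The bookkeeping described above can be traced with a small in-process sketch. This is an illustration under stated assumptions, not the poster's implementation: `JobCounter` and `simulate_counter_job` are hypothetical names, a `threading.Lock` stands in for whatever distributed counter would really be used, and the optional rereduce step is left out.

```python
import threading


class JobCounter:
    """In-process stand-in for a shared, synchronized counter."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, delta):
        """Apply delta atomically and return the new value."""
        with self._lock:
            self._value += delta
            return self._value


def simulate_counter_job(rows, emits_per_row):
    """Return the counter value after each phase of one job."""
    counter = JobCounter()
    trace = []

    # SQL -> Map: +1 for every row sent to the map phase.
    for _ in range(rows):
        counter.add(+1)
    trace.append(counter.add(0))

    # Map -> Reduce: -1 for the row received, +1 per emit, applied
    # as one delta so the count never dips between the two steps.
    for _ in range(rows):
        counter.add(-1 + emits_per_row)
    trace.append(counter.add(0))

    # Reduce -> Response: -1 per emit consumed, +1 for the response.
    trace.append(counter.add(-rows * emits_per_row + 1))

    # Response listener: -1; zero means the job is complete.
    trace.append(counter.add(-1))
    return trace


if __name__ == "__main__":
    print(simulate_counter_job(rows=1000, emits_per_row=1))
    # prints [1000, 1000, 1, 0]
```

Combining each step's decrement and increment into a single delta is a design choice of this sketch: if a step decremented first and incremented afterwards, the counter could briefly read zero while work was still outstanding.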
>>
>> If my goal is to beat processing times on the AS/400 when doing large financial calculations (daily accounting reports take several hours to generate), I can't really depend on timeouts to make sure I've gathered all my results. I want the job to return as soon as results are ready. I'd like to go to management and show them a 2 hr -> 15 min improvement from parallel processing.
>>
>> I'm just wondering whether using ZooKeeper or similar for distributed, synchronized counters will be atomic enough to never miss an increment or decrement. If I miss even one, I'm screwed, because the count will never get back to zero (or will get there prematurely).
>>
>> I need a sentence with a question mark or this will definitely go unanswered: are message counters like this a good way to monitor asynchronous, distributed processing state?
>>
>> Thanks! :)
>> Jon Brisbin Portal Webmaster NPC International, Inc.
>>
>> On Oct 1, 2010, at 8:11 AM, Jon Brisbin wrote:
>> > I had not really looked at the spring integration ...
>>
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>
Thanks!
J. Brisbin
http://jbrisbin.com/