Fibercorps Social: Map Reduce Concept Extraction

I've been fighting with the MapReduce paradigm for the last five days trying to do something that I thought would only take me a day or two. I've been implementing a concept extraction algorithm based on language modeling that I've used before and learning the twists and turns of the Hadoop libraries in the process. But I think that I've finally got the hang of how to program in this paradigm and I'll share what I've learned in case anyone reading this decides to delve into the crazy world of cloud computation themselves.

The algorithm I've been implementing is described in "Towards the Web of Concepts: Extracting Concepts from Large Datasets". It works by computing an NGram model for the text corpus and then computing a lot of metadata about each NGram like the relative frequency of that NGram's first n-1 tokens to its last n-1 tokens. It then compares the metadata to parametric minimums that are expected of a "good" concept. It works somewhat as a dynamic algorithm by computing the "goodness" of size 1 ngrams first, then size 2 using the results of size 1 and so on up to the maximum size. The final output of the algorithm is a list of the concepts in the corpus. Now to the MapReduce tricks I've learned.

As before, these techniques are probably old hat to experienced Hadoop veterans but this is all new territory to me. I started out by truly looking at the concept extraction algorithm as a mathematical function. From that standpoint I was able to write out the final boolean formula that determines whether or not a particular ngram is a concept in terms of that ngram's metadata. This meant that each metadat piece needed its own equation. Some metadata pieces' equations could only be written in terms of other metadata pieces but some could be written in terms of queries to the ngram model. This led to the development of a series of equations, each of which only needed terms already calculated by a previous equation. Each equation corresponds to a Map/Reduce cycle. The specific scheme worked out follows.

The language model ngram key/value pairs were the ones which the reducers produce with "unmarked" keys (i.e. the key of one of these was simply the stemmed ngram itself). The value of these pairs is the percentage frequency of that ngram for its size followed by the most common unstemmed form of that ngram. All other ngram key/value pairs produced by reducers are "marked" and are of the form _#intermediate . The value of one of these pairs is the value of the indicated metadata for that ngram. It thus becomes the job of every mapper to supply its reducer with the data pieces needed to compute the appropriate metadata values for ngrams of a particular size. Each mapper does this by transforming an appropriate "marked" key/value pair into an "unmarked" one in which the value is tagged in a way that the reducer may appropriately use it. An example better illustrates the general principle.

One piece of metadata which must be computed is the "preconfidence" of an ngram. This is equal to the frequency of a length n ngram divided by the frequency of its n-1 length prefix. For instance if "the king" has a frequency of .2 and "the" has a frequency of .8 then the preconfidence of "the king" is .25 . This would be computed in the following way:

Input to Mapper:
the .8
the king .2

Output of Mapper/Input to Reducer:
the .8
the king .2
the prefix_.2_the king

Output of Reducer:
the .8
the king .2
the king_preconf#intermediate .25

Fibercorps Social

Tuesday, June 7, 2011

Map Reduce Concept Extraction

No comments:

Post a Comment