Fibercorps Social: N-Gram MapReduce

The first step in building a topic modeler within the new social media framework is to develop an N-gram language model of a collection of Tweets for use in keyphrase extraction. This has proven trickier than I expected, largely because the framework is intended to let the user choose whichever NLP library they want (e.g. Apache OpenNLP, Stanford NLP, the CMU link parser, etc.). Hadoop seems to expect the Mapper and Reducer classes to be static, though, which made for a lot of tricky business in getting it to figure out dynamically which sentenceDetector, tokenize, and stem functions to call. It turns out that the way to accomplish this is to pass in a Class object through the parameters and then use Class.getMethod to look up the function by name and Method.invoke to call it. I'm sure that to anyone intimately familiar with Hadoop this type of trick is old hat, but it took me quite some time to figure out. Next up is building the generic NGramModel class and doing some testing. After that, the rest of the keyphrase extraction implementation should be reasonably straightforward.
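
To make the reflection trick concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It is not the framework's actual code: ReflectiveNlpDemo, WhitespaceNlp, and call are hypothetical names made up for illustration, and the toy tokenizer just splits on whitespace where a real adapter would wrap OpenNLP or a similar library. The sketch also assumes the chosen class reaches the worker as a class name string (for example via the job configuration) and that the target function is a public static method taking a single String.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

public class ReflectiveNlpDemo {

    // Hypothetical stand-in for an adapter around a real NLP library.
    public static class WhitespaceNlp {
        public static List<String> tokenize(String sentence) {
            return Arrays.asList(sentence.trim().split("\\s+"));
        }
    }

    // Look up a public static method that takes one String and call it.
    public static Object call(Class<?> provider, String methodName, String input)
            throws Exception {
        Method method = provider.getMethod(methodName, String.class);
        // The receiver is null because the target method is static.
        return method.invoke(null, input);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: only the class name travels with the job, so the
        // Class object is rebuilt from that string on the receiving side.
        String providerName = WhitespaceNlp.class.getName();
        Class<?> provider = Class.forName(providerName);

        System.out.println(call(provider, "tokenize", "n-gram models are fun"));
    }
}
```

The same call helper would cover sentenceDetector and stem as long as they share the one-String signature. A misspelled method name only shows up at run time as a NoSuchMethodException, which is the price of resolving functions this way.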

Sunday, May 29, 2011
