Sunday, May 29, 2011

N-Gram MapReduce

The first step in building a topic modeler within the new social media framework is to develop an N-gram language model of a collection of Tweets for use in keyphrase extraction. This has proven trickier than I thought it would be, largely because the framework is intended to let the user choose whichever NLP library they want (e.g. Apache OpenNLP, Stanford NLP, the CMU link parser, etc.). Hadoop seems to expect the Mapper and Reducer classes to be static, though, which made for a lot of tricky business in getting it to figure out dynamically which sentenceDetector, tokenize, and stem functions to call. It turns out that the way to accomplish this is to pass in a Class object through the parameters and then use Class.getMethod and Method.invoke to resolve the function at runtime. I'm sure that to anyone intimately familiar with Hadoop this type of trick is old hat, but it took me quite some time to figure it out. Now on to building the generic NGramModel class and doing some testing. After that, the rest of the keyphrase extraction implementation should be reasonably easy.
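For anyone who wants to see the reflection trick in isolation, here is a minimal sketch in plain Java with no Hadoop dependency. It is not the framework's actual code: WhitespaceNlp, tokenizeWith, and ngrams are hypothetical names made up for illustration, and the whitespace tokenizer is only a stand-in for a real NLP library wrapper. In a real job the class name would presumably travel to the mapper through the job configuration rather than a method argument.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReflectiveNlpDemo {

    /** Hypothetical stand-in for a wrapper around some NLP library. */
    public static class WhitespaceNlp {
        public static String[] tokenize(String text) {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }
    }

    /**
     * Looks up a public static tokenize(String) method on the named class
     * and calls it. The class is only known at runtime, by name.
     */
    static String[] tokenizeWith(String className, String text) throws Exception {
        Class<?> nlpClass = Class.forName(className);
        Method tokenize = nlpClass.getMethod("tokenize", String.class);
        // The first argument is null because the target method is static.
        return (String[]) tokenize.invoke(null, text);
    }

    /** Builds the word-level n-grams of a token sequence. */
    static List<String> ngrams(String[] tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Only the class name is passed around, as a plain string.
        String className = WhitespaceNlp.class.getName();
        String[] tokens = tokenizeWith(className, "hadoop makes ngram counting easy");
        // Prints [hadoop makes, makes ngram, ngram counting, counting easy]
        System.out.println(ngrams(tokens, 2));
    }
}
```

The point of the design is that a static mapper only needs a string to find the right tokenizer, so swapping NLP libraries means changing one parameter instead of the mapper code.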

Thursday, May 26, 2011

New Timeline

The fallout of the change to Hadoop has finally permeated the project, and yesterday a new timeline was drawn up. It's also more to the liking of my mentor on this project, Dr. Raghavan. Most notably, it features tiered use case subgoals that are evenly interspersed throughout the timeline, which makes it easier to track progress as the project moves along. And without further ado, the new timeline:

Based on four use cases:

1) Topic Analysis: Given a set of Tweets what are the main topics and who is talking about which one?
2) Group Identification: Which people talk/listen to each other? Who is talking about the same things? What are the people in a certain location talking about?
3) Domain Specific Questions: Answer questions specific to a particular domain, in this case politics and business, e.g. Who wants to buy what? Who likes which candidate?
4) News Prediction: Within a given domain what are the late breaking stories?

Updated timeline:

Week 1:
-Finish Twitter streaming monitor development
-Write social media monitor abstract class
-Begin developing keyphrase extraction inference rule
Week 2:
-Finish keyphrase extraction implementation
-Write abstract inference rule class
-Write abstract NLP library class
-Documentation/Refactor
Week 3:
-Write inference rules for keyphrase indexing, ranking, and co-occurrence analysis
-Begin designing interface for Topic Analysis application
Week 4:
-Finish Topic Analysis interface
-Documentation/Refactor
-Present App for Use Case 1
-Begin writing Twitter user monitor
Week 5:
-Finish Twitter user monitor
-Update social media monitor class as needed (perhaps add intermediaries)
-Write retweet inference rule
-Begin writing K-means clustering Mahout wrapper
Week 6:
-Finish K-means
-Write geographic tagging inference rule
-Develop Interface for Geographic Identification task
Week 7:
-Documentation/Refactor
-Present App for Use Case 2
-Develop several inference rules for political and business domains
Week 8:
-Develop more inference rules for political and business domains
-Develop interface for domain specific question answering
-Documentation
Week 9:
-Refactor
-Present App for Use Case 3
-Develop trend analysis inference rule
Week 10:
-Develop trust metrics for users
-Develop trust metric for rules
-Write Google News monitor
Week 11:
-Documentation/Refactor
-Combine trust metrics and trend analysis to predict breaking news
Week 12:
-Write interface for Story Predictor
-Write feedback inference rule
Week 13:
-Documentation/Refactor
-Present App for Use Case 4

Tuesday, May 24, 2011

Getting the Project Started

This blog is about the Fibercorps social media analytics framework, a 2011 Google Summer of Code project. The framework is going to be a tool for developing analytical tools for social media. The project proposal can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/blakelemoine/1 .

Right off the bat, though, there are some changes from that original document. It was pointed out to me last week that IBM has a proprietary project very similar to the one I've proposed: IBM BigSheets. That's not at all discouraging, and in fact I think it points to this concept being a good idea. Good enough for IBM to turn it into a product, in any case. Fibercorps asked me to address one of the major differences between my proposal and BigSheets, so the Fibercorps social media analytics framework will now incorporate distributed computing using Apache Hadoop.

This change triggers a cascade of other changes that will need to be propagated through the system design. The first is that the machine learning library will no longer be Java-ML but Mahout. Also, OpenCog's AtomSpace is explicitly intended to be an in-memory resource and as such will no longer be an appropriate representation. I am currently looking into using RDF to represent information. Representing RDF data in the Hadoop Distributed File System (HDFS) seems to be an open problem, but one that should be relatively easy to solve for the special case of what Fibercorps Social is intended to do. I'll update as I progress on that topic.
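One option I could imagine for this, purely as a sketch and not a decision, is to store each RDF statement as a single line of text in the N-Triples style (subject, predicate, object, then a period). A file of independent lines can be cut at any line boundary, which fits how line-oriented Hadoop input is usually processed. The class name, the example.org URIs, and the mentionsTopic predicate below are all made up for illustration, and only plain string literals are handled.

```java
public class NTripleLine {

    /** Wraps an IRI in angle brackets, as N-Triples requires. */
    static String iri(String value) {
        return "<" + value + ">";
    }

    /**
     * Quotes a plain literal, escaping backslashes, double quotes, and
     * line breaks so that the whole statement stays on one line.
     */
    static String literal(String value) {
        String escaped = value
                .replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
        return "\"" + escaped + "\"";
    }

    /** Joins three already-formatted terms into one N-Triples style line. */
    static String statement(String subject, String predicate, String object) {
        return subject + " " + predicate + " " + object + " .";
    }

    public static void main(String[] args) {
        // Hypothetical URIs: a tweet that mentions the topic "hadoop".
        String line = statement(
                iri("http://example.org/tweet/1"),
                iri("http://example.org/vocab#mentionsTopic"),
                literal("hadoop"));
        // Prints <http://example.org/tweet/1> <http://example.org/vocab#mentionsTopic> "hadoop" .
        System.out.println(line);
    }
}
```

Because every statement is self-contained on its own line, a mapper could handle statements one at a time without needing any in-memory graph.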