Sunday, June 26, 2011

First Use Case

I've been a bit negligent in posting here, so I'll just do a quick update. The first use case is now done. Rather than going through all of the particulars, here is a quick bullet list of the capabilities wrapped into the application:

  • Connect to the Twitter Streaming API

  • Add and remove topics from the stream monitor dynamically

  • Extract keyphrases from the downloaded tweets

  • Query the tweets using the extracted keyphrases

  • Examine the frequency and co-occurrence frequency of keyphrases
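
To give a feel for the last bullet, here is a minimal in-memory sketch of counting keyphrase frequency and co-occurrence frequency. It is only an illustration, not the application's code: the function name `keyphrase_counts`, the sample data, and the choice to count each keyphrase at most once per tweet are all assumptions.

```python
from collections import Counter
from itertools import combinations


def keyphrase_counts(tweets):
    """Count keyphrase frequency and pairwise co-occurrence.

    `tweets` is an iterable where each item is the list of keyphrases
    already extracted from one tweet (hypothetical input format).
    """
    freq = Counter()
    cooc = Counter()
    for phrases in tweets:
        # Assumption: a keyphrase counts once per tweet; sorting gives
        # every pair a single canonical (a, b) ordering.
        unique = sorted(set(phrases))
        freq.update(unique)
        cooc.update(combinations(unique, 2))
    return freq, cooc


# Made-up sample data, one list of keyphrases per tweet.
tweets = [
    ["hadoop", "map reduce"],
    ["hadoop", "twitter api"],
    ["hadoop", "map reduce", "twitter api"],
]
freq, cooc = keyphrase_counts(tweets)
print(freq.most_common(3))
print(cooc.most_common(3))
```

Storing each pair in sorted order means ("hadoop", "map reduce") and ("map reduce", "hadoop") land in the same counter slot.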


Getting the keyphrase model working was a real learning experience. It ended up involving seven map/reduce cycles: two are executed once, and five are executed iteratively, once for each keyphrase size from 1 to N. There is still plenty of room to improve the current design, however. The biggest issue is probably that I now realize I need a distributed database; using the map/reduce framework to perform queries is simply too slow. I will also need to adapt the keyphrase extraction algorithm so that it can be iterative. Other than that, though, it's on to the next use case, which is user/topic clustering.
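
For readers who haven't seen the pattern, the idea of repeating a map/reduce cycle for each keyphrase size from 1 to N can be sketched in a few lines of plain Python. This is a toy, single-process stand-in, not the seven-cycle design described above: it runs just one counting cycle per size, treats a "keyphrase" as a plain word n-gram, and every name in it (`map_ngrams`, `reduce_counts`, `run_cycle`, `count_all_sizes`) is hypothetical.

```python
from collections import defaultdict


def map_ngrams(tweet, n):
    """Map step: emit (ngram, 1) for every n-word window in the tweet."""
    tokens = tweet.lower().split()
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n]), 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def run_cycle(tweets, n):
    """One map/reduce cycle for candidate phrases of size n."""
    groups = defaultdict(list)
    for tweet in tweets:
        for key, value in map_ngrams(tweet, n):
            groups[key].append(value)  # stands in for the shuffle phase
    return dict(reduce_counts(k, v) for k, v in groups.items())


def count_all_sizes(tweets, max_n):
    """Repeat the cycle for each phrase size from 1 to max_n."""
    return {n: run_cycle(tweets, n) for n in range(1, max_n + 1)}


# Made-up sample tweets.
sample = ["hadoop map reduce", "map reduce is slow"]
counts = count_all_sizes(sample, 2)
print(counts[2])
```

In a real framework each call to `run_cycle` would be a separate job over the full data set, which is why the number of cycles grows with the largest phrase size.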
