Fibercorps Social: June 2011

Wednesday, June 29, 2011

Data Representation Change

This project continues to be an active learning experience. As any Map/Reduce veteran would've probably told me, it's a terrible framework to use for performing queries. In fact as soon as I began getting abysmal time performance on the queries I found all of the articles explaining why that's exactly what I should have expected. This brought me to switch over to an added complication I thought I would be able to avoid: distributed database systems.

I looked at the available systems and, although there were multiple that support Hadoop interaction, I decided to go with HBase. It's a distributed NoSQL database system that is based on Google's Bigtable. The main reason that I chose it is that it is so closely coupled with Hadoop that it seemed that it would be the one that I could pick up the easiest and thereby have the smallest impact on my timeline. It very well may be the case that in the future some other distributed database could provide a better solution for currently unforseen reasons. This provided me with a good opportunity to more clearly incorporate the Semantically-Interlinked Online Communities (SIOC) RDF vocabulary into the project. Each of the tables is a type from the ontology and the column values are the associated properties. The column families of the tables are "sioc" and (where needed) "foaf". I believe that will give the data representation the ability to be easily integrated with existing and future technologies. Now that the project is starting to really solidify into particular directions I'll start making blog posts about the involved technologies and how they apply to social analytics.

Sunday, June 26, 2011

First Use Case

I've been a bit negligent in posting here so I'll just do a quick update. The first use case is done now. Rather than going through all of the particulars I'll just make a quick bullet list of the capabilities that are wrapped into the application:

Connect to the Twitter Streaming API

Add and remove topics from the stream monitor dynamically

Extract the keyphrases from the tweets downloaded

Query the tweets using the extracted keyphrases

Examine frequency and coocurrence frequency of keyphrases

Getting the keyphrase model working was a real learning experience. It ended up involving seven map/reduce cycles. Two are executed once and five are executed iteratively over the various keyphrase sizes from 1 to N. There are some major improvement opportunities over the current design however. The largest of these potentially is that I realize now the need for a distributed database. Using the map/reduce framework to perform queries is simply too slow. Also I will need to adapt the keyphrase extraction algorithm so that it can be iterative. Other than that though it's on to the next use case which is user/topic clustering.

Tuesday, June 7, 2011

Map Reduce Concept Extraction

I've been fighting with the MapReduce paradigm for the last five days trying to do something that I thought would only take me a day or two. I've been implementing a concept extraction algorithm based on language modeling that I've used before and learning the twists and turns of the Hadoop libraries in the process. But I think that I've finally got the hang of how to program in this paradigm and I'll share what I've learned in case anyone reading this decides to delve into the crazy world of cloud computation themselves.

The algorithm I've been implementing is described in "Towards the Web of Concepts: Extracting Concepts from Large Datasets". It works by computing an NGram model for the text corpus and then computing a lot of metadata about each NGram like the relative frequency of that NGram's first n-1 tokens to its last n-1 tokens. It then compares the metadata to parametric minimums that are expected of a "good" concept. It works somewhat as a dynamic algorithm by computing the "goodness" of size 1 ngrams first, then size 2 using the results of size 1 and so on up to the maximum size. The final output of the algorithm is a list of the concepts in the corpus. Now to the MapReduce tricks I've learned.

As before, these techniques are probably old hat to experienced Hadoop veterans but this is all new territory to me. I started out by truly looking at the concept extraction algorithm as a mathematical function. From that standpoint I was able to write out the final boolean formula that determines whether or not a particular ngram is a concept in terms of that ngram's metadata. This meant that each metadat piece needed its own equation. Some metadata pieces' equations could only be written in terms of other metadata pieces but some could be written in terms of queries to the ngram model. This led to the development of a series of equations, each of which only needed terms already calculated by a previous equation. Each equation corresponds to a Map/Reduce cycle. The specific scheme worked out follows.

The language model ngram key/value pairs were the ones which the reducers produce with "unmarked" keys (i.e. the key of one of these was simply the stemmed ngram itself). The value of these pairs is the percentage frequency of that ngram for its size followed by the most common unstemmed form of that ngram. All other ngram key/value pairs produced by reducers are "marked" and are of the form _#intermediate . The value of one of these pairs is the value of the indicated metadata for that ngram. It thus becomes the job of every mapper to supply its reducer with the data pieces needed to compute the appropriate metadata values for ngrams of a particular size. Each mapper does this by transforming an appropriate "marked" key/value pair into an "unmarked" one in which the value is tagged in a way that the reducer may appropriately use it. An example better illustrates the general principle.

One piece of metadata which must be computed is the "preconfidence" of an ngram. This is equal to the frequency of a length n ngram divided by the frequency of its n-1 length prefix. For instance if "the king" has a frequency of .2 and "the" has a frequency of .8 then the preconfidence of "the king" is .25 . This would be computed in the following way:

Input to Mapper:
the .8
the king .2

Output of Mapper/Input to Reducer:
the .8
the king .2
the prefix_.2_the king

Output of Reducer:
the .8
the king .2
the king_preconf#intermediate .25