Fibercorps Social: 2011

Thursday, August 11, 2011

Keyphrase Based Trust for Topic Prediction

In the previous blog post I described how to find users that should be trusted. These users are the ones who are capable of predicting future hot topics by using them earlier than anyone else. What I didn't cover was the nature of the trust that is assigned. A person assigning trust to someone else will group people into domains of expertise. You might hear someone say something like "I'd listen to John about movies. He always knows which ones are going to be good." Ideally when a user successfully leads the trend on a topic the domain of that topic would be identified and trust would be assigned to the user/domain pair. Unfortunately the size of a domain of expertise and how domains are related is a wide open field of research. The strategy employed here is simpler.

Trust is assigned to user/keyphrase pairs. If for instance Bob uses the term "cloud computing" and it later becomes a hot topic then "Bob/cloud computing" is assigned trust. Furthermore, to approximate the addition of trust to the domain of "cloud computing" a small amount of trust is added for Bob to the keyphrases currently believed to be in the same domain. The basic principle used in Fibercorps Social to determine domains is textual proximity. A term is more likely to regularly co-occur with a term from its domain than a term not in its domain. The co-occurrence of topics can therefore be used to cluster topics in a meaningful way. Ideally hierarchical clustering or fuzzy clustering would be used but due to the time constraints of the summer program the K-Means algorithm was chosen.

Every time cycle the extracted hot topics are used to index the data set. That index is then used to build term-document vectors. These vectors are fed into the Mahout K-Means clustering algorithm to produce a number of topic clusters defined by the user based upon the number and diversity of the data set's search terms. Each of these clusters is treated as though it is a topic domain. Whenever trust is added to a user/keyphrase pair a smaller amount of trust is added for that user to each keyphrase that is in the same cluster. If a user truly is a trend leader within a domain then it is likely that, over an extended period of time, trust will be assigned for that user to the most frequent phrases of that domain. This is not because a user will necessarily successfully predict any of those phrases but because they successfully predict topics which cluster with those phrases.

To see how this trust can be used to predict new hot topics I return to the above example. Bob successfully led the trend on "cloud computing" so he is likely to lead trends on topics within the same domain as "cloud computing". In the future any phrases which Bob uses that co-occur with "cloud computing" or any other topic assigned trust will be potential future hot topics. Let us say for instance that the following is a tweet by Bob:

Still getting the hang of cloud computing. Looking into hadoop cluster maintenance.

Since the concept extraction algorithm used by Fibercorps Social only considers noun phrases the potential new hot topics from that post are:

hang

the hang

hadoop

cluster

maintenance

hadoop cluster

cluster maintenance

hadoop cluster maintenance

Each of those phrases is assigned an amount of belief proportional to the amount of trust that the system has in the "Bob/cloud computing" pair. It is also noted which user(s) contributed belief to a phrase during which cycle based on which co-ocurring phrases. If the amount of belief accumulated by a phrase in a single cycle crosses some threshold then a prediction is made that the phrase will become a topic within a fixed number of cycles. If by the end of that number of cycles the prediction has not come true then trust is taken away from the user/keyphrase pairs that contributed to making that prediction and a smaller penalty is applied to their clusters.

Throughout this blog post I've used phrases like "an amount of trust will be assigned" without specifying how to determine how much trust. My next blog post will be a discussion about the parameters involved in the various stages of the algorithm and how they are being tuned.

Monday, August 8, 2011

Principled Basis for Keyphrase Prediction Algorithm

The final stages of the FiberCorps Social Google Summer of Code 2011 project are starting to wrap up now. The biggest thing is that the tools to build the topic prediction application are now in place and its implementation is almost complete. At Dr. Raghavan's suggestion I am writing a short blog post dedicated to the rationale behind the algorithm. I am unfortunately nowhere near familiar enough with the sociological literature that is relevant. There are certainly many relevant studies and papers that bear on the topic of this post and any references to them would be welcome in the comments. I designed this algorithm based on high level concepts that I've come to through reading Stephen Levitt, Malcolm Gladwell, and David Brooks and through a conversation I had a year ago with Jure Leskovic. In a few weeks we will have empirical results that will either call into question or validate the intuitions described here.

The central concept of the algorithm is that of a "trend leader". It is assumed that in a given domain there are certain individuals that reliably are talking about the next hot topic before anyone else is. What is a "hot topic"? For the purposes of this work, a "hot topic" within a given corpus is defined as a key phrase returned by the algorithm in [1] but any other topic extraction algorithm would also work with this algorithm. Identifying the trend leaders and what their domains of expertise are is the primary task in this topic prediction algorithm.

To understand how trend leaders are identified the nature of the corpus from which hot topics are extracted must be understood. In this application the corpus is built by listening to the Twitter Streaming API, although any temporally ordered text data would work similarly. The corpus occupies a certain window of interest and is updated on a cyclic schedule. For example it might be the case that a corpus embodies three days worth of Twitter data and is updated hourly. This means that every hour the oldest hour of data is removed from the corpus and an hours worth of fresh data is added. In the discussion that follows the corpus before the update will be referred to as D_T and D_T+1 respectively. It is important to note that for the example times given D_T and D_T+1 have a large amount of overlap. The amount of overlap could be varied for experimental study but there must be a non-zero overlap for the algorithm to work.

With the corpus thus defined two distinct sets of hot topics can be discussed. C_T and C_T+1 are the hot topics extracted from D_T and D_T+1 respectively. We can form from these two sets C_new which is the set C_T+1 - C_T+. C_new is the set of hot topics which have only just become "hot". By examining D_old = D_T ∩ D_T+1 we can determine which users were talking about the topics in C_new before they were hot. These users are the likely suspects for trend leaders. The next blog posts will cover the specifics of how these trend leaders are assigned trust, how trust is used to make predictions, and how predictions are used to provide feedback to the trust metric.

References:

[1] A. G. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards
the web of concepts: Extracting concepts from large datasets.
PVLDB, 3(1):566–577, 2010.

Friday, July 22, 2011

Representing Social Media Data in a Distributed Environment Using HBase

The first problem I ran into in this project is that there is no standard way to represent social data in a distributed environment. To solve this problem I began from an agreed upon solution to a related problem and worked from there, becoming more general along the way. While there isn't a standard way to represent social data in a distributed environment there is a semantic vocabulary for describing social data. The Semantically-Interlinked Online Communities (SIOC) rdf vocabulary provides a general standardized way of representing social data in a way that enables it to be easily utilized by third-party web sites and applications. This rdf vocabulary was my starting point.

Finding a semantic representation vocabulary did not outright solve the problem though. The data storage system being used by FiberCorps Social is an HBase NoSQL database. For anyone unfamiliar with how NoSQL database tables work there are two key aspects. Firstly, relational algebra does not inherently hold for HBase tables as it would in a standard SQL table. The most important difference that falls out of that is that there are no foreign keys. Secondly, the tables are essentially key value stores and have the properties that one would normally expect a dictionary data type to have. On the surface this seems to pose a problem for representing rdf data. Rdf triples have three values and a key value store has pairs of data values. Fortunately, HBase and all other NoSQL databases I've looked at have the ability to smuggle in a third value.

An HBase table has rows that are signified by a primary row index. This row index is then paired with values stored in columns. Columns belong to a column family and have an individual name as well. The column values allow you to signify a third value. The convention in FiberCorps Social is for the column family name to be the rdf vocabulary name (e.g. sioc, foaf, etc.) and for the column name to be the predicate that column represents. The row index can then be interpreted as the subject of the predicate and the value stored in the column can be interpreted as the object of the predicate. This design is from a paper by Franke et. al..

That is the representation scheme for primary data tables but tables drawn from social media data are not the only ones needed. Secondary data sets need to be derived from the primary data sets. One example is an ngram model of the data in a primary data set. The row index in such a table is the ngram and the column data is the count. To connect a derived table to its primary table the name of the derived table is derived from the name of the primary. For example if the name of a primary dataset table is sports_tweets then the name of the table containing an ngram model of that dataset would be sports_tweets_ngram.

Index tables are also needed. Because NoSQL databases are essentially key/value stores they do not inherently support indexes on column data. This is problematic if queries based on column data values are needed. To facilitate fast column based search index tables are needed. A row value of an index table is a column value from the primary table. The column value of an index table is a list of row values from the primary table. The list specifies all rows in the primary table which have that value in their column. The first such table used in FiberCorps Social is the keyphrase index table. The row values are key phrases and the column values specify which tweets contain those keyphrases. This method of indexing is grossly inefficient with storage, effectively doubling the amount of space needed, but facilitates constant time lookup. Specifics on how to build index tables like this can be found here.

And that's how data is represented in FiberCorps Social. No specific piece of this representation scheme is truly novel and I've tried to provide links to the sources that I used to build it although I've certainly left one or two out. The main reason that I wanted to write this blog post is that there wasn't a conveniently labeled "How to" that put all of the pieces together. Now there is and hopefully it'll be of use to someone.

Wednesday, June 29, 2011

Data Representation Change

This project continues to be an active learning experience. As any Map/Reduce veteran would've probably told me, it's a terrible framework to use for performing queries. In fact as soon as I began getting abysmal time performance on the queries I found all of the articles explaining why that's exactly what I should have expected. This brought me to switch over to an added complication I thought I would be able to avoid: distributed database systems.

I looked at the available systems and, although there were multiple that support Hadoop interaction, I decided to go with HBase. It's a distributed NoSQL database system that is based on Google's Bigtable. The main reason that I chose it is that it is so closely coupled with Hadoop that it seemed that it would be the one that I could pick up the easiest and thereby have the smallest impact on my timeline. It very well may be the case that in the future some other distributed database could provide a better solution for currently unforseen reasons. This provided me with a good opportunity to more clearly incorporate the Semantically-Interlinked Online Communities (SIOC) RDF vocabulary into the project. Each of the tables is a type from the ontology and the column values are the associated properties. The column families of the tables are "sioc" and (where needed) "foaf". I believe that will give the data representation the ability to be easily integrated with existing and future technologies. Now that the project is starting to really solidify into particular directions I'll start making blog posts about the involved technologies and how they apply to social analytics.

Sunday, June 26, 2011

First Use Case

I've been a bit negligent in posting here so I'll just do a quick update. The first use case is done now. Rather than going through all of the particulars I'll just make a quick bullet list of the capabilities that are wrapped into the application:

Connect to the Twitter Streaming API

Add and remove topics from the stream monitor dynamically

Extract the keyphrases from the tweets downloaded

Query the tweets using the extracted keyphrases

Examine frequency and coocurrence frequency of keyphrases

Getting the keyphrase model working was a real learning experience. It ended up involving seven map/reduce cycles. Two are executed once and five are executed iteratively over the various keyphrase sizes from 1 to N. There are some major improvement opportunities over the current design however. The largest of these potentially is that I realize now the need for a distributed database. Using the map/reduce framework to perform queries is simply too slow. Also I will need to adapt the keyphrase extraction algorithm so that it can be iterative. Other than that though it's on to the next use case which is user/topic clustering.

Tuesday, June 7, 2011

Map Reduce Concept Extraction

I've been fighting with the MapReduce paradigm for the last five days trying to do something that I thought would only take me a day or two. I've been implementing a concept extraction algorithm based on language modeling that I've used before and learning the twists and turns of the Hadoop libraries in the process. But I think that I've finally got the hang of how to program in this paradigm and I'll share what I've learned in case anyone reading this decides to delve into the crazy world of cloud computation themselves.

The algorithm I've been implementing is described in "Towards the Web of Concepts: Extracting Concepts from Large Datasets". It works by computing an NGram model for the text corpus and then computing a lot of metadata about each NGram like the relative frequency of that NGram's first n-1 tokens to its last n-1 tokens. It then compares the metadata to parametric minimums that are expected of a "good" concept. It works somewhat as a dynamic algorithm by computing the "goodness" of size 1 ngrams first, then size 2 using the results of size 1 and so on up to the maximum size. The final output of the algorithm is a list of the concepts in the corpus. Now to the MapReduce tricks I've learned.

As before, these techniques are probably old hat to experienced Hadoop veterans but this is all new territory to me. I started out by truly looking at the concept extraction algorithm as a mathematical function. From that standpoint I was able to write out the final boolean formula that determines whether or not a particular ngram is a concept in terms of that ngram's metadata. This meant that each metadat piece needed its own equation. Some metadata pieces' equations could only be written in terms of other metadata pieces but some could be written in terms of queries to the ngram model. This led to the development of a series of equations, each of which only needed terms already calculated by a previous equation. Each equation corresponds to a Map/Reduce cycle. The specific scheme worked out follows.

The language model ngram key/value pairs were the ones which the reducers produce with "unmarked" keys (i.e. the key of one of these was simply the stemmed ngram itself). The value of these pairs is the percentage frequency of that ngram for its size followed by the most common unstemmed form of that ngram. All other ngram key/value pairs produced by reducers are "marked" and are of the form _#intermediate . The value of one of these pairs is the value of the indicated metadata for that ngram. It thus becomes the job of every mapper to supply its reducer with the data pieces needed to compute the appropriate metadata values for ngrams of a particular size. Each mapper does this by transforming an appropriate "marked" key/value pair into an "unmarked" one in which the value is tagged in a way that the reducer may appropriately use it. An example better illustrates the general principle.

One piece of metadata which must be computed is the "preconfidence" of an ngram. This is equal to the frequency of a length n ngram divided by the frequency of its n-1 length prefix. For instance if "the king" has a frequency of .2 and "the" has a frequency of .8 then the preconfidence of "the king" is .25 . This would be computed in the following way:

Input to Mapper:
the .8
the king .2

Output of Mapper/Input to Reducer:
the .8
the king .2
the prefix_.2_the king

Output of Reducer:
the .8
the king .2
the king_preconf#intermediate .25

Sunday, May 29, 2011

N-Gram MapReduce

The first step in building a topic modeler within the new social media framework is to develop an N-Gram language model of a collection of Tweets for use in keyphrase extraction. This has proven to be trickier than I thought it would be. This is largely because the framework is intended to let the user choose whichever NLP library they want (e.g. Apache OpenNLP, Stanford NLP, CMU link parser, etc.). Hadoop seems to expect the Mapper and Reducer classes to be static though which made for alot of tricky business in getting it to dynamically figure out which sentenceDetector, tokenize, and stem functions to call. It turns out that the way to accomplish this is to pass in a Class object through the parameters and then use the getMethod and invoke methods of Class and Method respectively to dynamically resolve to a function. I'm sure that to anyone intimately familiar with Hadoop this type of trick is old hat but it took quite some time for me to figure it out. Now on to building the generic NGramModel class and to do some testing. After that the rest of the keyphrase extraction implementation should be reasonably easy to implement.

Thursday, May 26, 2011

New Timeline

The fallout of the change to Hadoop has finally permeated the project and yesterday a new timeline was thought up. It's also more to my mentor's liking. (Dr. Raghavan is mentoring me on this project) Most specifically it features tiered use case subgoals that are evenly interspersed throughout the timeline. This enables a more smooth analysis of progress of the project as it progresses. And without further ado the new timeline :

Based on four use cases:

1) Topic Analysis: Given a set of Tweets what are the main topics and who is talking about which one?
2) Group Identification: Which people talk/listen to each other? Who are talking about the same things? What are the people in a certain location talking about?
3) Domain Specific Questions: Answer questions specific to a specific domain, in this case politics and business. ex: Who wants to buy what? Who likes which candidate?
4) News Prediction: Within a given domain what are the late breaking stories?

Updated timeline:

Week 1:
-Finish Twitter streaming monitor development
-Write social media monitor abstract class
-Begin developing keyphrase extraction inference rule
Week 2:
-Finish keyphrase extraction implementation
-Write abstract inference rule class
-Write abstract NLP library class
-Documentation/Refactor
Week 3:
-Write inference rules for keyphrase indexing, ranking, and cooccurence analysis
-Begin designing interface for Topic Analysis application
Week 4:
-Finish Topic Analysis interface
-Documentation/Refactor
-Present App for Use Case 1
-Begin Writing Twitter user monitor
Week 5:
-Finish Twitter user monitor
-Update social media monitor class as needed (perhaps add intermediaries)
-Write retweet inference rule
-Begin writing K-means clustering Mahout wrapper
Week 6:
-Finish K-means
-Write geographic tagging inference rule
-Develop Interface for Geographic Identification task
Week 7:
-Documentation/Refactor
-Present App for Use Case 2
-Develop several inference rules for political and business domains
Week 8:
-Develop more inference rules for political and business domains
-Develop interface for domain specific question answering
-Documentation
Week 9:
-Refactor
-Present App for Use Case 3
-Develop trend analysis inference rule
Week 10:
-Develop trust metrics for users
-Develop trust metric for rules
-Write Google News monitor
Week 11:
-Documentation/Refactor
-Combine trust metrics and trend analysis to predict breaking news
Week 12:
-Write interface for Story Predictor
-Write feedback inference rule
Week 13:
-Documentation/Refactor
-Present App for Use Case 4

Tuesday, May 24, 2011

Getting the Project Started

This blog is about the Fibercorps social media analytics framework, a 2011 Google Summer of Code project. The framework is going to be a tool for developing analytical tools for social media. The project proposal can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/blakelemoine/1 .

Right off the bat though there are some changes from that original document. It was pointed out to me last week that IBM has a very similar proprietary project to the one that I've proposed: IBM BigSheets. That's not at all discouraging and in fact I think it points to this concept being a good idea. Good enough for IBM to turn it into a product in any case. I was asked by Fibercorps to incorporate one of the major differences between my proposal and BigSheets. The Fibercorps social media analytics framework will now incorporate distributed computing using Apache Hadoop.

This change triggers a cascade of other changes that will need to be propogated through the system design. The first of which is that the machine learning library used will no longer be Java-ML but will instead be Mahout. Also OpenCog's AtomSpace is explicitly intended to be an in-memory resource and as such will no longer be an appropriate representation. I am currently looking into using RDF to represent information. Representing RDF data in the Hadoop Distributed File System (HDFS) seems to be an open problem but one that should be relatively easy to solve for the special case of what Fibercorps Social is intended to do. I'll update as I progress on that topic.