Thursday, August 11, 2011

Keyphrase Based Trust for Topic Prediction

In the previous blog post I described how to find users that should be trusted. These users are the ones who are capable of predicting future hot topics by using them earlier than anyone else. What I didn't cover was the nature of the trust that is assigned. A person assigning trust to someone else will group people into domains of expertise. You might hear someone say something like "I'd listen to John about movies. He always knows which ones are going to be good." Ideally when a user successfully leads the trend on a topic the domain of that topic would be identified and trust would be assigned to the user/domain pair. Unfortunately the size of a domain of expertise and how domains are related is a wide open field of research. The strategy employed here is simpler.

Trust is assigned to user/keyphrase pairs. If for instance Bob uses the term "cloud computing" and it later becomes a hot topic then "Bob/cloud computing" is assigned trust. Furthermore, to approximate the addition of trust to the domain of "cloud computing" a small amount of trust is added for Bob to the keyphrases currently believed to be in the same domain. The basic principle used in Fibercorps Social to determine domains is textual proximity. A term is more likely to regularly co-occur with a term from its domain than a term not in its domain. The co-occurrence of topics can therefore be used to cluster topics in a meaningful way. Ideally hierarchical clustering or fuzzy clustering would be used but due to the time constraints of the summer program the K-Means algorithm was chosen.

Every time cycle the extracted hot topics are used to index the data set. That index is then used to build term-document vectors. These vectors are fed into the Mahout K-Means clustering algorithm to produce a number of topic clusters defined by the user based upon the number and diversity of the data set's search terms. Each of these clusters is treated as though it is a topic domain. Whenever trust is added to a user/keyphrase pair a smaller amount of trust is added for that user to each keyphrase that is in the same cluster. If a user truly is a trend leader within a domain then it is likely that, over an extended period of time, trust will be assigned for that user to the most frequent phrases of that domain. This is not because a user will necessarily successfully predict any of those phrases but because they successfully predict topics which cluster with those phrases.

To see how this trust can be used to predict new hot topics I return to the above example. Bob successfully led the trend on "cloud computing" so he is likely to lead trends on topics within the same domain as "cloud computing". In the future any phrases which Bob uses that co-occur with "cloud computing" or any other topic assigned trust will be potential future hot topics. Let us say for instance that the following is a tweet by Bob:

Still getting the hang of cloud computing. Looking into hadoop cluster maintenance.


Since the concept extraction algorithm used by Fibercorps Social only considers noun phrases the potential new hot topics from that post are:

  • hang

  • the hang

  • hadoop

  • cluster

  • maintenance

  • hadoop cluster

  • cluster maintenance

  • hadoop cluster maintenance


Each of those phrases is assigned an amount of belief proportional to the amount of trust that the system has in the "Bob/cloud computing" pair. It is also noted which user(s) contributed belief to a phrase during which cycle based on which co-ocurring phrases. If the amount of belief accumulated by a phrase in a single cycle crosses some threshold then a prediction is made that the phrase will become a topic within a fixed number of cycles. If by the end of that number of cycles the prediction has not come true then trust is taken away from the user/keyphrase pairs that contributed to making that prediction and a smaller penalty is applied to their clusters.

Throughout this blog post I've used phrases like "an amount of trust will be assigned" without specifying how to determine how much trust. My next blog post will be a discussion about the parameters involved in the various stages of the algorithm and how they are being tuned.

Monday, August 8, 2011

Principled Basis for Keyphrase Prediction Algorithm

The final stages of the FiberCorps Social Google Summer of Code 2011 project are starting to wrap up now. The biggest thing is that the tools to build the topic prediction application are now in place and its implementation is almost complete. At Dr. Raghavan's suggestion I am writing a short blog post dedicated to the rationale behind the algorithm. I am unfortunately nowhere near familiar enough with the sociological literature that is relevant. There are certainly many relevant studies and papers that bear on the topic of this post and any references to them would be welcome in the comments. I designed this algorithm based on high level concepts that I've come to through reading Stephen Levitt, Malcolm Gladwell, and David Brooks and through a conversation I had a year ago with Jure Leskovic. In a few weeks we will have empirical results that will either call into question or validate the intuitions described here.

The central concept of the algorithm is that of a "trend leader". It is assumed that in a given domain there are certain individuals that reliably are talking about the next hot topic before anyone else is. What is a "hot topic"? For the purposes of this work, a "hot topic" within a given corpus is defined as a key phrase returned by the algorithm in [1] but any other topic extraction algorithm would also work with this algorithm. Identifying the trend leaders and what their domains of expertise are is the primary task in this topic prediction algorithm.

To understand how trend leaders are identified the nature of the corpus from which hot topics are extracted must be understood. In this application the corpus is built by listening to the Twitter Streaming API, although any temporally ordered text data would work similarly. The corpus occupies a certain window of interest and is updated on a cyclic schedule. For example it might be the case that a corpus embodies three days worth of Twitter data and is updated hourly. This means that every hour the oldest hour of data is removed from the corpus and an hours worth of fresh data is added. In the discussion that follows the corpus before the update will be referred to as DT and DT+1 respectively. It is important to note that for the example times given DT and DT+1 have a large amount of overlap. The amount of overlap could be varied for experimental study but there must be a non-zero overlap for the algorithm to work.

With the corpus thus defined two distinct sets of hot topics can be discussed. CT and CT+1 are the hot topics extracted from DT and DT+1 respectively. We can form from these two sets Cnew which is the set CT+1 - CT+. Cnew is the set of hot topics which have only just become "hot". By examining Dold = DT ∩ DT+1 we can determine which users were talking about the topics in Cnew before they were hot. These users are the likely suspects for trend leaders. The next blog posts will cover the specifics of how these trend leaders are assigned trust, how trust is used to make predictions, and how predictions are used to provide feedback to the trust metric.


References:

[1] A. G. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards
the web of concepts: Extracting concepts from large datasets.
PVLDB, 3(1):566–577, 2010.