FiberCorps Social: Principled Basis for Keyphrase Prediction Algorithm

The FiberCorps Social Google Summer of Code 2011 project is entering its final stages. Most importantly, the tools needed to build the topic prediction application are now in place and its implementation is almost complete. At Dr. Raghavan's suggestion I am writing a short blog post dedicated to the rationale behind the algorithm. Unfortunately I am nowhere near familiar enough with the relevant sociological literature. There are certainly many studies and papers that bear on the topic of this post, and any references to them would be welcome in the comments. I designed this algorithm based on high-level concepts that I've come to through reading Steven Levitt, Malcolm Gladwell, and David Brooks, and through a conversation I had a year ago with Jure Leskovec. In a few weeks we will have empirical results that will either validate or call into question the intuitions described here.

The central concept of the algorithm is that of a "trend leader". It is assumed that in a given domain there are certain individuals who reliably talk about the next hot topic before anyone else does. What is a "hot topic"? For the purposes of this work, a "hot topic" within a given corpus is defined as a key phrase returned by the algorithm in [1], although any other topic extraction algorithm could be substituted. Identifying the trend leaders and their domains of expertise is the primary task in this topic prediction algorithm.

To understand how trend leaders are identified, the nature of the corpus from which hot topics are extracted must be understood. In this application the corpus is built by listening to the Twitter Streaming API, although any temporally ordered text data would work similarly. The corpus covers a fixed window of interest and is updated on a cyclic schedule. For example, a corpus might hold three days' worth of Twitter data and be updated hourly. This means that every hour the oldest hour of data is removed from the corpus and an hour's worth of fresh data is added. In the discussion that follows, the corpus before and after the update will be referred to as D_T and D_T+1 respectively. It is important to note that for the example times given, D_T and D_T+1 overlap heavily: they share all but one hour of a three-day window. The amount of overlap could be varied for experimental study, but there must be a non-zero overlap for the algorithm to work.
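The sliding window described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the class name SlidingCorpus, the (timestamp, user, text) tuple layout, and the sample tweets are all hypothetical, and batches are assumed to arrive in temporal order.

```python
from collections import deque
from datetime import datetime, timedelta


class SlidingCorpus:
    """Hypothetical corpus that keeps only documents inside the most recent window."""

    def __init__(self, window):
        self.window = window
        self.docs = deque()  # (timestamp, user, text) tuples, oldest first

    def update(self, new_docs, now):
        """Add a batch of fresh documents, drop expired ones, return a snapshot."""
        for doc in sorted(new_docs, key=lambda d: d[0]):
            self.docs.append(doc)
        cutoff = now - self.window
        # Remove everything that has fallen out of the window of interest.
        while self.docs and self.docs[0][0] < cutoff:
            self.docs.popleft()
        return list(self.docs)


t0 = datetime(2011, 8, 8, 12, 0)
corpus = SlidingCorpus(window=timedelta(days=3))

# D_T: the corpus as of time t0 (made-up sample data).
d_t = corpus.update(
    [
        (t0 - timedelta(days=3) + timedelta(minutes=30), "alice", "an old tweet"),
        (t0 - timedelta(hours=1), "bob", "a recent tweet"),
    ],
    now=t0,
)

# D_T+1: one hour later, the oldest hour has expired and a fresh hour is added.
d_t1 = corpus.update(
    [(t0 + timedelta(minutes=30), "carol", "a fresh tweet")],
    now=t0 + timedelta(hours=1),
)

overlap = set(d_t) & set(d_t1)
print(sorted(user for _, user, _ in overlap))  # ['bob']
```

Here alice's tweet falls out of the window at the update and carol's tweet is new, so only bob's tweet is shared between D_T and D_T+1.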

With the corpus thus defined, two distinct sets of hot topics can be discussed. C_T and C_T+1 are the hot topics extracted from D_T and D_T+1 respectively. From these two sets we can form C_new, the set difference C_T+1 - C_T, that is, the topics in C_T+1 that are not in C_T. C_new is the set of hot topics which have only just become "hot". By examining D_old = D_T ∩ D_T+1, the data present both before and after the update, we can determine which users were talking about the topics in C_new before they were hot. These users are the likely candidates for trend leaders. The next blog posts will cover the specifics of how these trend leaders are assigned trust, how trust is used to make predictions, and how predictions are used to provide feedback to the trust metric.
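As a rough sketch of this step, the set operations might look like the following in Python. The function name candidate_trend_leaders, the (timestamp, user, text) tuple layout, and the sample data are hypothetical, and the case-insensitive substring test is a stand-in assumption for however a topic is actually matched to a tweet.

```python
def candidate_trend_leaders(d_t, d_t1, c_t, c_t1):
    """Map each newly hot topic to the users who mentioned it in D_old.

    d_t, d_t1: iterables of (timestamp, user, text) tuples (D_T and D_T+1)
    c_t, c_t1: sets of hot-topic phrases (C_T and C_T+1)
    """
    c_new = set(c_t1) - set(c_t)   # topics that have only just become hot
    d_old = set(d_t) & set(d_t1)   # documents present before and after the update
    leaders = {topic: set() for topic in c_new}
    for _, user, text in d_old:
        lowered = text.lower()
        for topic in c_new:
            # Assumption: a plain substring match counts as "talking about" a topic.
            if topic.lower() in lowered:
                leaders[topic].add(user)
    return leaders


# Made-up sample data; integer timestamps keep the example short.
d_t = [
    (1, "alice", "nothing to see here"),
    (2, "bob", "trying out Google Plus today"),
    (3, "carol", "lunch was great"),
]
d_t1 = [
    (2, "bob", "trying out Google Plus today"),
    (3, "carol", "lunch was great"),
    (4, "dave", "google plus is everywhere"),
]
c_t = {"lunch"}
c_t1 = {"lunch", "google plus"}

leaders = candidate_trend_leaders(d_t, d_t1, c_t, c_t1)
print(leaders)  # {'google plus': {'bob'}}
```

Note that dave also mentions the new topic, but only in the fresh data, so he is not counted: only users who mentioned it within D_old were talking about it before it became hot.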

References:

[1] A. G. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the web of concepts: Extracting concepts from large datasets. PVLDB, 3(1):566–577, 2010.

Monday, August 8, 2011
