The FiberCorps Social Google Summer of Code 2011 project is entering its final stages. Most importantly, the tools needed to build the topic prediction application are now in place, and its implementation is almost complete. At Dr. Raghavan's suggestion I am writing a short blog post dedicated to the rationale behind the algorithm. Unfortunately, I am nowhere near familiar enough with the relevant sociological literature. There are certainly many studies and papers that bear on the topic of this post, and any references to them would be welcome in the comments. I designed this algorithm based on high-level concepts that I've come to through reading Steven Levitt, Malcolm Gladwell, and David Brooks, and through a conversation I had a year ago with Jure Leskovec. In a few weeks we will have empirical results that will either validate or call into question the intuitions described here.
The central concept of the algorithm is that of a "trend leader". The assumption is that in a given domain certain individuals reliably talk about the next hot topic before anyone else does. What is a "hot topic"? For the purposes of this work, a "hot topic" within a given corpus is defined as a key phrase returned by the algorithm in [1], although any other topic extraction algorithm could be substituted. Identifying the trend leaders and their domains of expertise is the primary task in this topic prediction algorithm.
To understand how trend leaders are identified, one must first understand the nature of the corpus from which hot topics are extracted. In this application the corpus is built by listening to the Twitter Streaming API, although any temporally ordered text data would work similarly. The corpus covers a fixed window of interest and is updated on a cyclic schedule. For example, a corpus might hold three days' worth of Twitter data and be updated hourly. This means that every hour the oldest hour of data is removed from the corpus and an hour's worth of fresh data is added. In the discussion that follows, the corpus before and after an update will be referred to as DT and DT+1 respectively. It is important to note that for the example times given, DT and DT+1 overlap heavily: they share all but one hour of their three days of data. The amount of overlap could be varied for experimental study, but there must be a non-zero overlap for the algorithm to work.
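The rolling update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the `SlidingCorpus` class, its `update` method, and the `(timestamp, user, text)` message tuples are all hypothetical names chosen for this example, and no real Twitter client is involved.

```python
from collections import deque
from datetime import datetime, timedelta


class SlidingCorpus:
    """Hypothetical rolling corpus: keeps only messages inside a fixed time window."""

    def __init__(self, window):
        self.window = window      # e.g. timedelta(days=3)
        self.messages = deque()   # (timestamp, user, text) tuples, oldest first

    def update(self, new_messages, now):
        """Add a batch of fresh messages, then drop those older than the window.

        Assumes each batch is newer than everything already stored.
        Returns a snapshot of the corpus after the update (D_T+1).
        """
        self.messages.extend(sorted(new_messages))
        cutoff = now - self.window
        while self.messages and self.messages[0][0] <= cutoff:
            self.messages.popleft()
        return list(self.messages)


# Toy run: a three-hour window updated hourly, one message per hour.
start = datetime(2011, 8, 1)
corpus = SlidingCorpus(timedelta(hours=3))
snapshots = []
for hour in range(4):
    batch = [(start + timedelta(hours=hour, minutes=30), "user%d" % hour, "message")]
    snapshots.append(corpus.update(batch, now=start + timedelta(hours=hour + 1)))

d_t, d_t1 = snapshots[2], snapshots[3]
overlap = set(d_t) & set(d_t1)   # the data both snapshots share
```

In this toy run the last update drops the oldest message and adds one new one, so two of the three messages are common to both snapshots. That shared portion is what the rest of the algorithm relies on.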
With the corpus thus defined, two distinct sets of hot topics can be discussed. CT and CT+1 are the hot topics extracted from DT and DT+1 respectively. From these two sets we can form Cnew, the set difference CT+1 - CT, that is, the topics in CT+1 that are not in CT. Cnew is the set of hot topics which have only just become "hot". By examining Dold = DT ∩ DT+1, the data shared by both corpora, we can determine which users were talking about the topics in Cnew before they were hot. These users are the likely suspects for trend leaders. The next blog posts will cover the specifics of how these trend leaders are assigned trust, how trust is used to make predictions, and how predictions are used to provide feedback to the trust metric.
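As a rough illustration of the set operations behind Cnew and Dold, here is a minimal Python sketch. Several things in it are assumptions made for the example rather than parts of the real system: the function name `candidate_trend_leaders`, the `(timestamp, user, text)` message tuples, and above all the plain substring match, which stands in for the key phrase extraction of [1]. The hot topic sets are taken as given inputs.

```python
def candidate_trend_leaders(d_t, d_t1, hot_t, hot_t1):
    """Map each newly hot topic to the users who mentioned it in the shared data.

    d_t, d_t1     -- corpora before and after the update, as (timestamp, user, text) tuples
    hot_t, hot_t1 -- hot topics extracted from d_t and d_t1 (extraction is assumed done elsewhere)
    """
    c_new = set(hot_t1) - set(hot_t)   # topics that have only just become hot
    d_old = set(d_t) & set(d_t1)       # messages present in both corpora
    leaders = {topic: set() for topic in c_new}
    for _timestamp, user, text in d_old:
        lowered = text.lower()
        for topic in c_new:
            # Simplifying assumption: a user "talks about" a topic if the phrase appears verbatim.
            if topic.lower() in lowered:
                leaders[topic].add(user)
    return leaders


# Toy data: integer timestamps stand in for real ones.
d_t = [
    (1, "alice", "solar panels look promising"),
    (2, "bob", "lunch time"),
    (3, "carol", "Gigabit fiber is coming to town"),
]
d_t1 = [
    (2, "bob", "lunch time"),
    (3, "carol", "Gigabit fiber is coming to town"),
    (4, "dave", "everyone wants gigabit fiber"),
]
leaders = candidate_trend_leaders(d_t, d_t1, {"lunch"}, {"lunch", "gigabit fiber"})
```

Here "gigabit fiber" is the only new hot topic. Carol mentioned it in the shared data, so she is a candidate trend leader. Dave mentioned it only in the freshest hour, after it had become hot, so he is not.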
References:
[1] A. G. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards
the web of concepts: Extracting concepts from large datasets.
PVLDB, 3(1):566–577, 2010.