Thursday, August 11, 2011

Keyphrase Based Trust for Topic Prediction

In the previous blog post I described how to find users that should be trusted. These are the users who predict future hot topics by using them earlier than anyone else. What I didn't cover was the nature of the trust that is assigned. A person assigning trust to someone else will group people into domains of expertise. You might hear someone say something like "I'd listen to John about movies. He always knows which ones are going to be good." Ideally, when a user successfully leads the trend on a topic, the domain of that topic would be identified and trust would be assigned to the user/domain pair. Unfortunately, how large a domain of expertise is and how domains relate to one another are wide-open research questions. The strategy employed here is simpler.

Trust is assigned to user/keyphrase pairs. If, for instance, Bob uses the term "cloud computing" and it later becomes a hot topic, then "Bob/cloud computing" is assigned trust. Furthermore, to approximate adding trust to the whole domain of "cloud computing", a smaller amount of trust is added for Bob to the keyphrases currently believed to be in the same domain. The basic principle used in Fibercorps Social to determine domains is textual proximity. A term is more likely to regularly co-occur with a term from its own domain than with a term from outside it. The co-occurrence of topics can therefore be used to cluster topics in a meaningful way. Ideally hierarchical clustering or fuzzy clustering would be used, but due to the time constraints of the summer program the K-Means algorithm was chosen.
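
To make the bookkeeping concrete, here is a minimal Python sketch of assigning trust to a user/keyphrase pair and spilling a smaller amount over to the rest of the domain. It is not the Fibercorps Social code: the function name, the `domain_of` lookup, and both constants are assumptions made for illustration, since the actual amounts are tuning parameters.

```python
from collections import defaultdict

# Hypothetical values; the real amounts are tuning parameters.
DIRECT_TRUST = 1.0   # reward for the keyphrase the user actually led on
DOMAIN_SHARE = 0.25  # fraction of that reward given to same-domain keyphrases


def assign_trust(trust, user, keyphrase, domain_of,
                 direct=DIRECT_TRUST, share=DOMAIN_SHARE):
    """Reward (user, keyphrase) and, to a lesser degree, its domain neighbours."""
    trust[(user, keyphrase)] += direct
    for neighbour in domain_of.get(keyphrase, ()):
        if neighbour != keyphrase:
            trust[(user, neighbour)] += direct * share
    return trust


# Trust ledger keyed by (user, keyphrase); missing pairs start at zero.
trust = defaultdict(float)

# Assumed domain membership for the example.
domain_of = {"cloud computing": {"cloud computing", "hadoop", "virtualization"}}

assign_trust(trust, "Bob", "cloud computing", domain_of)
```

After the call, "Bob/cloud computing" holds the full reward while "Bob/hadoop" and "Bob/virtualization" each hold a quarter of it.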

In every time cycle the extracted hot topics are used to index the data set. That index is then used to build term-document vectors. These vectors are fed into the Mahout K-Means clustering algorithm to produce topic clusters. The number of clusters is set by the user, based on the number and diversity of the data set's search terms. Each of these clusters is treated as though it were a topic domain. Whenever trust is added to a user/keyphrase pair, a smaller amount of trust is added for that user to each keyphrase in the same cluster. If a user truly is a trend leader within a domain then it is likely that, over an extended period of time, trust will be assigned for that user to the most frequent phrases of that domain. This is not because the user will necessarily predict any of those phrases directly, but because they successfully predict topics which cluster with those phrases.
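
The clustering step can be illustrated with a small, self-contained Python sketch. It stands in for the Mahout K-Means job rather than reproducing it: the binary term-document vectors, the substring matching, the farthest-point seeding, and the sample posts are all assumptions chosen to keep the example short and deterministic.

```python
def term_document_vectors(topics, docs):
    """One vector per topic; entry j is 1.0 if the topic occurs in document j."""
    lowered = [d.lower() for d in docs]
    return {t: [1.0 if t in d else 0.0 for d in lowered] for t in topics}


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means over named vectors; returns {name: cluster index}."""
    names = list(vectors)
    # Farthest-point seeding (an assumption) keeps the example deterministic.
    centroids = [vectors[names[0]]]
    while len(centroids) < k:
        far = max(names, key=lambda n: min(sq_dist(vectors[n], c) for c in centroids))
        centroids.append(vectors[far])
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            n: min(range(k), key=lambda i: sq_dist(vectors[n], centroids[i]))
            for n in names
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for i in range(k):
            members = [vectors[n] for n in names if assignment[n] == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Made-up posts and hot topics for illustration.
docs = [
    "cloud computing on a hadoop cluster",
    "hadoop makes cloud computing cheaper",
    "the new movie trailer looks great",
    "box office numbers for the movie trailer",
]
topics = ["cloud computing", "hadoop", "movie trailer", "box office"]

assignment = kmeans(term_document_vectors(topics, docs), k=2)
```

Topics that appear in the same posts end up with similar vectors, so "cloud computing" and "hadoop" fall into one cluster and "movie trailer" and "box office" into the other.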

To see how this trust can be used to predict new hot topics, I return to the above example. Bob successfully led the trend on "cloud computing", so he is likely to lead trends on topics within the same domain as "cloud computing". In the future, any phrases Bob uses that co-occur with "cloud computing", or with any other topic for which he has been assigned trust, will be potential future hot topics. Let us say, for instance, that the following is a tweet by Bob:

Still getting the hang of cloud computing. Looking into hadoop cluster maintenance.


Since the concept extraction algorithm used by Fibercorps Social only considers noun phrases, the potential new hot topics from that post are:

  • hang

  • the hang

  • hadoop

  • cluster

  • maintenance

  • hadoop cluster

  • cluster maintenance

  • hadoop cluster maintenance
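
One way to arrive at a list like the one above is to enumerate the contiguous sub-phrases of each noun phrase. The Python sketch below does that under several assumptions: the noun phrases are taken as already extracted (the chunker itself is not shown), sub-phrases made up only of stopwords are dropped, and phrases that are already known topics are skipped. The stopword list and function name are made up for illustration.

```python
STOPWORDS = {"the", "a", "an"}  # assumed, deliberately tiny


def candidate_phrases(noun_phrases, known_topics):
    """Contiguous sub-phrases of each noun phrase, shortest first."""
    candidates = []
    for phrase in noun_phrases:
        if phrase in known_topics:
            continue  # already a topic, so not a *new* candidate
        words = phrase.split()
        for length in range(1, len(words) + 1):
            for start in range(len(words) - length + 1):
                sub = words[start:start + length]
                if all(w in STOPWORDS for w in sub):
                    continue  # e.g. a bare "the"
                text = " ".join(sub)
                if text not in known_topics and text not in candidates:
                    candidates.append(text)
    return candidates


# Noun phrases assumed to come from Bob's tweet.
noun_phrases = ["the hang", "cloud computing", "hadoop cluster maintenance"]
candidates = candidate_phrases(noun_phrases, known_topics={"cloud computing"})
```

With these inputs, `candidates` contains the same eight phrases as the list above.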


Each of those phrases is assigned an amount of belief proportional to the amount of trust that the system has in the "Bob/cloud computing" pair. The system also records which users contributed belief to a phrase, in which cycle, and on the basis of which co-occurring phrases. If the amount of belief accumulated by a phrase in a single cycle crosses some threshold, a prediction is made that the phrase will become a hot topic within a fixed number of cycles. If the prediction has not come true by the end of that number of cycles, trust is taken away from the user/keyphrase pairs that contributed to the prediction, and a smaller penalty is applied to their clusters.
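
The belief and penalty steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual implementation: belief is taken to equal the contributing pair's trust, and the threshold, penalty, cluster share, second user, and sample data are all hypothetical.

```python
from collections import defaultdict


def run_cycle(observations, trust, threshold):
    """Accumulate belief for one cycle and return the predictions made.

    observations: (user, trusted_keyphrase, candidate_phrase) triples.
    Returns {candidate: set of (user, keyphrase) pairs that contributed}.
    """
    belief = defaultdict(float)
    contributors = defaultdict(set)
    for user, keyphrase, candidate in observations:
        weight = trust.get((user, keyphrase), 0.0)
        if weight > 0:
            belief[candidate] += weight  # assumed: belief equals trust
            contributors[candidate].add((user, keyphrase))
    return {c: contributors[c] for c, b in belief.items() if b > threshold}


def penalise(trust, pairs, cluster_of, penalty, cluster_share):
    """Take trust from the pairs behind a failed prediction and their clusters."""
    for user, keyphrase in pairs:
        trust[(user, keyphrase)] -= penalty
        for neighbour in cluster_of.get(keyphrase, ()):
            if neighbour != keyphrase:
                trust[(user, neighbour)] -= penalty * cluster_share


# Hypothetical trust ledger and one cycle of observations.
trust = defaultdict(float, {
    ("Bob", "cloud computing"): 1.0,
    ("Alice", "cloud computing"): 0.8,
})
observations = [
    ("Bob", "cloud computing", "hadoop cluster"),
    ("Alice", "cloud computing", "hadoop cluster"),
    ("Bob", "cloud computing", "hang"),
]
predictions = run_cycle(observations, trust, threshold=1.5)

# Suppose "hadoop cluster" has not become a hot topic by the deadline.
cluster_of = {"cloud computing": {"cloud computing", "hadoop"}}
penalise(trust, predictions["hadoop cluster"], cluster_of,
         penalty=0.5, cluster_share=0.25)
```

Here only "hadoop cluster" gathers enough belief to trigger a prediction. When that prediction fails, both contributing pairs lose trust, and the other keyphrase in their cluster takes a smaller hit.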

Throughout this blog post I've used phrases like "an amount of trust will be assigned" without specifying how to determine how much trust. My next blog post will be a discussion about the parameters involved in the various stages of the algorithm and how they are being tuned.
