Fibercorps Social: Representing Social Media Data in a Distributed Environment Using HBase

The first problem I ran into on this project was that there is no standard way to represent social data in a distributed environment. To solve it, I started from an agreed-upon solution to a related problem and generalized from there. While there isn't a standard way to represent social data in a distributed environment, there is a semantic vocabulary for describing social data. The Semantically-Interlinked Online Communities (SIOC) RDF vocabulary provides a general, standardized way of representing social data so that it can be easily used by third-party web sites and applications. This RDF vocabulary was my starting point.

Finding a semantic vocabulary did not outright solve the problem, though. The data storage system used by FiberCorps Social is HBase, a NoSQL database. For anyone unfamiliar with how NoSQL database tables work, there are two key aspects. First, relational algebra does not inherently hold for HBase tables as it would for a standard SQL table. The most important consequence is that there are no foreign keys. Second, the tables are essentially key-value stores and have the properties one would normally expect of a dictionary data type. On the surface this seems to pose a problem for representing RDF data: an RDF triple has three values, while a key-value store holds pairs. Fortunately, HBase and all the other NoSQL databases I've looked at have a way to smuggle in a third value.

An HBase table has rows, each identified by a primary row index (the row key). The row index is paired with values stored in columns. Each column belongs to a column family and also has a name of its own. The column name is what lets you carry a third value. The convention in FiberCorps Social is for the column family name to be the RDF vocabulary name (e.g. sioc, foaf) and for the column name to be the predicate that the column represents. The row index is then interpreted as the subject of the predicate, and the value stored in the column is interpreted as the object. This design is from a paper by Franke et al.
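The mapping described above can be sketched with plain Python dictionaries standing in for an HBase table. This is only an illustration under stated assumptions, not the FiberCorps Social code: the function names, the row keys such as tweet:1001, and the choice of predicates are all hypothetical, and no HBase client is involved.

```python
from collections import defaultdict

def triple_to_cell(subject, predicate, obj):
    """Map an RDF triple onto (row key, column family, column name, value).

    Assumes the predicate is written as 'vocabulary:name', e.g. 'sioc:content'.
    """
    family, name = predicate.split(":", 1)
    return subject, family, name, obj

def put_triples(table, triples):
    """Store triples in a dict-of-dicts that mimics row -> column -> value."""
    for subject, predicate, obj in triples:
        row, family, name, value = triple_to_cell(subject, predicate, obj)
        # One value per (row, column): a repeated predicate would overwrite.
        table[row][f"{family}:{name}"] = value
    return table

# Hypothetical example data.
table = put_triples(defaultdict(dict), [
    ("tweet:1001", "sioc:content", "Go team!"),
    ("tweet:1001", "sioc:has_creator", "user:42"),
    ("user:42", "foaf:name", "Alice"),
])
```

Note that this sketch keeps a single value per row and column, so a subject with several objects for the same predicate would need a different encoding.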

That is the representation scheme for primary data tables, but tables drawn directly from social media data are not the only ones needed. Secondary data sets need to be derived from the primary data sets. One example is an n-gram model of the data in a primary data set. The row index in such a table is the n-gram and the column data is the count. To connect a derived table to its primary table, the name of the derived table is built from the name of the primary table. For example, if the name of a primary dataset table is sports_tweets, then the name of the table containing an n-gram model of that dataset would be sports_tweets_ngram.
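As a rough sketch of a derived table, the following builds n-gram counts from a primary table held in a Python dictionary. The column names sioc:content and ngram:count, the whitespace tokenizer, and the function names are assumptions made for the example; only the sports_tweets_ngram naming convention comes from the scheme described above.

```python
from collections import Counter

def derived_table_name(primary_name, suffix):
    """Name a derived table after its primary table."""
    return f"{primary_name}_{suffix}"

def build_ngram_table(primary_rows, n=2, text_column="sioc:content"):
    """Return rows keyed by n-gram, each holding a count column.

    primary_rows maps row key -> {column: value}; text_column is assumed
    to hold the text of each post. Tokenization is a naive split on spaces.
    """
    counts = Counter()
    for columns in primary_rows.values():
        tokens = columns.get(text_column, "").lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # "ngram:count" is a hypothetical column name for the count.
    return {ngram: {"ngram:count": count} for ngram, count in counts.items()}

# Hypothetical primary data.
sports_tweets = {
    "tweet:1": {"sioc:content": "go team go"},
    "tweet:2": {"sioc:content": "Go team"},
}
ngram_table_name = derived_table_name("sports_tweets", "ngram")
sports_tweets_ngram = build_ngram_table(sports_tweets, n=2)
```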

Index tables are also needed. Because NoSQL databases are essentially key-value stores, they do not inherently support indexes on column data. This is a problem if queries based on column values are needed. To make column-based search fast, a separate index table is built for the column in question. A row index of an index table is a column value from the primary table. The column value of an index table is a list of row indexes from the primary table. The list specifies all rows in the primary table that have that value in their column. The first such table used in FiberCorps Social is the keyphrase index table. The row indexes are keyphrases and the column values specify which tweets contain those keyphrases. This method of indexing is grossly inefficient with storage, effectively doubling the amount of space needed, but it facilitates constant-time lookup. Specifics on how to build index tables like this can be found here.
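The index table idea can be sketched in the same dictionary model: invert one column of the primary table so that each distinct value becomes a row pointing back at the primary rows that contain it. The column name index:rows and the function names are hypothetical, and a real HBase table would store the list serialized as bytes rather than as a Python list.

```python
def build_index_table(primary_rows, column, index_column="index:rows"):
    """Invert one column of a primary table.

    Returns rows keyed by the column's value; each row holds the sorted
    list of primary row keys that have that value in `column`.
    """
    index = {}
    for row_key, columns in primary_rows.items():
        if column in columns:
            entry = index.setdefault(columns[column], {index_column: []})
            entry[index_column].append(row_key)
    for entry in index.values():
        entry[index_column].sort()
    return index

def lookup(index, value, index_column="index:rows"):
    """Find primary row keys by column value with a single key lookup."""
    return index.get(value, {}).get(index_column, [])

# Hypothetical primary data.
primary = {
    "tweet:1": {"sioc:has_creator": "user:42"},
    "tweet:2": {"sioc:has_creator": "user:7"},
    "tweet:3": {"sioc:has_creator": "user:42"},
}
creator_index = build_index_table(primary, "sioc:has_creator")
```

Here lookup(creator_index, "user:42") returns the matching tweet keys without scanning the primary table, at the cost of storing every row key a second time.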

And that's how data is represented in FiberCorps Social. No specific piece of this representation scheme is truly novel, and I've tried to provide links to the sources I used to build it, although I've certainly left one or two out. The main reason I wanted to write this blog post is that there wasn't a conveniently labeled "How to" that put all of the pieces together. Now there is, and hopefully it'll be of use to someone.

Friday, July 22, 2011
