Paper notes: Filtering microblogging messages for Social TV; A Bootstrapping Approach to Identifying Relevant Tweets for Social TV
Social TV was named one of the ten most important emerging technologies in 2010 by the MIT Technology Review.
Social Television is a general term for technology that supports communication and social interaction in either the context of watching television, or related to TV content.
Some of these systems allow users to read microblogging messages related to the TV program they are currently watching.
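So the question discussed here is how to filter out the messages that are really related to a TV show. The simplest method, and the one we have been using all along, is as follows: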
Current Social TV applications search for these messages by issuing queries to social networks with the full title of the TV program. This naive approach can lead to low precision and recall.
A simple example shows why both precision and recall are low with this method.
The popular TV show House is an example that results in low precision.
House is an ambiguous word: besides the TV show, it has many other uses in different contexts, such as White House, House of Representatives, building, home, etc. So searching directly for House inevitably gives low precision.
Continuing with our example for the show House, there are many messages which do not mention the title of the show but make references to users, hashtags, or even actors and characters related to the show. The problem of low recall is more severe for shows with long titles.
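As noted above, the recall problem is especially noticeable for shows with long titles: few people are willing to write the full title in a tweet, and they often use abbreviations instead.
To summarize, the challenges in solving this problem are as follows: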
Our task is to retrieve microblogging messages relevant to a given TV show with high precision. Filtering messages from microblogging websites poses several challenges, including:
1. Microblogging messages are short and often lack context. For instance, Twitter messages (tweets) are limited to 140 characters and often contain abbreviated expressions such as hashtags and short URLs.
2. Many social media messages lack proper grammatical structure. Also, users of social networks pay little attention to capitalization and punctuation. This makes it difficult to apply natural language processing technologies to parse the text.
3. Many social media websites offer access to their content through search APIs, but most have rate limits. In order to filter messages we first need to collect them by issuing queries to these services. For each show we require a set of queries which provides the best tradeoff between the need to cover as many messages about the show as possible, and the need to respect the API rate limits imposed by the social network. Such queries could include the title of the show and other related strings such as hashtags and usernames related to the show. Determining which keywords best describe a TV show can be a challenge.
4. In the last decade alone, television networks have aired more than a thousand new TV shows. Obtaining training data for every show would be prohibitively expensive. Furthermore, new shows are aired every six months.
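I had thought about how to solve this problem for a long time. I had also considered building a classifier to decide whether a tweet is about a TV show, but had not worked out how to do it. This paper proposes a method for building such a classifier.
Classification is a mature technology; the key is feature selection and collecting the training set.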
We propose a bootstrapping method which is built upon 1) a small set of labeled data, 2) a large unlabeled dataset, and 3) some domain knowledge, to form a classifier that can generalize to an arbitrary number of TV shows.
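Since labeling a training set is time-consuming, only a small set of labeled data is needed here, and domain knowledge is used to select the initial classification features; this is enough to train an initial classifier. A large unlabeled dataset is then used as a test set for the initial classifier; new features are discovered during this step, and the classifier is refined step by step into a usable improved classifier.
That is the general idea of the method. The experiments show that the improved classifier has much higher recall.
Personally I think the value of this paper lies in its choice of features. Let's look at which features are selected.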
Terms related to TV watching
General terms commonly associated with watching TV. These features are collected manually and include the following three features:
tv_terms, general terms such as watching, episode, hdtv, netflix, etc.
network_terms, contains names of television networks such as cnn, bbc, pbs, etc.
season_episode,
Some users post messages which contain the season and episode number of the TV show they are currently watching.
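“S06E07”, “06x07” and even “6.7” are common ways of referring to the sixth season and the seventh episode of a particular TV show. So we need regular expressions to detect whether a message contains a season_episode reference.
For each of the features above, the feature is 1 if the tweet contains a corresponding term, and 0 otherwise.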
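The three binary features above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the term lists contain only the examples quoted in these notes, and the regular expression is an assumed pattern covering the three forms mentioned (S06E07, 06x07, 6.7).

```python
import re

# Illustrative term lists only; the paper's full lists are not given in these notes.
TV_TERMS = {"watching", "episode", "hdtv", "netflix"}
NETWORK_TERMS = {"cnn", "bbc", "pbs"}

# Assumed pattern for season/episode references such as "S06E07", "06x07", "6.7".
SEASON_EPISODE_RE = re.compile(
    r"\b(?:s\d{1,2}e\d{1,2}|\d{1,2}x\d{1,2}|\d{1,2}\.\d{1,2})\b",
    re.IGNORECASE,
)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tv_watching_features(text):
    """Return the three binary features: 1 if a matching term is present, else 0."""
    tokens = set(tokenize(text))
    return {
        "tv_terms": int(bool(tokens & TV_TERMS)),
        "network_terms": int(bool(tokens & NETWORK_TERMS)),
        "season_episode": int(bool(SEASON_EPISODE_RE.search(text))),
    }
```

Note that the loose "6.7" form also matches unrelated numbers such as prices or version numbers, so on its own this feature is noisy; it is meant to be combined with the other features by the classifier.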
General Positive Rules
rules_score,
The motivation behind the rules_score feature is the fact that many messages which discuss TV shows follow certain patterns.
For example:
<start> watching <show_name>
episode of <show_name>
<show_name> was awesome
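If we have such a list of rules, the feature is 1 when the tweet matches one of the rules, and 0 otherwise.
The question is how to find these rules. We could of course discover them one by one by hand, which would give fairly high accuracy, but it is too inefficient.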
We developed an automated way to extract such general rules and compute their probability of occurrence.
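We start from a manually compiled list of ten unambiguous TV show titles, such as “Mythbusters”, “The Simpsons”, “Grey’s Anatomy”, etc. Unambiguous means the title has no other meaning: it definitely refers to one particular TV show, as opposed to an ambiguous title such as House.
We now want to extract general rules from TV-related tweets, so we must make sure the tweets we find are truly about TV. A good way to do that is to collect them via unambiguous TV show titles, a method we have also used before.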
For each message which contained one of these titles, the algorithm replaced the title of TV shows, hashtags, references to episodes, etc. with general placeholders, then computed the occurrence of trigrams around the keywords.
This is the key step. To extract general rules, we first mask out everything specific to a particular show and then count the occurrences of trigrams.
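A rough sketch of this masking-and-counting step is shown below. The placeholder names other than <show_name> and <start>, the choice to keep only trigrams that contain <show_name>, and the normalization into relative frequencies are all assumptions made for illustration; the notes do not give the paper's exact procedure.

```python
import re
from collections import Counter


def generalize(text, titles):
    """Replace show-specific strings with placeholders and return a token list."""
    text = text.lower()
    # Mask episode references and hashtags first, then the show titles.
    text = re.sub(r"\bs\d{1,2}e\d{1,2}\b", "<episode>", text)
    text = re.sub(r"#\w+", "<hashtag>", text)
    for title in sorted(titles, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(title.lower()) + r"\b", "<show_name>", text)
    tokens = re.findall(r"<\w+>|[a-z0-9']+", text)
    return ["<start>"] + tokens


def extract_rules(messages, titles):
    """Count trigrams containing <show_name> and return their relative frequencies."""
    counts = Counter()
    for message in messages:
        tokens = generalize(message, titles)
        for i in range(len(tokens) - 2):
            trigram = tuple(tokens[i:i + 3])
            if "<show_name>" in trigram:
                counts[trigram] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trigram: n / total for trigram, n in counts.items()}
```

For instance, given a few messages about Mythbusters and The Simpsons, trigrams such as (<start>, watching, <show_name>) and (episode, of, <show_name>) come out as candidate rules, each with its share of the counted trigrams.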
Features related to show titles
Although many social media messages lack proper capitalization, when users do capitalize the titles of the shows this can be used as a feature.
title_case, which is set to 1 if the title of the show is capitalized, otherwise it has the value 0.
titles_match, if any of the titles mentioned in the message are unambiguous, we can set the value of this feature to 1.
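What is most valuable here is that the paper proposes a way to decide whether a title is unambiguous. We previously relied on stop-word statistics we compiled ourselves, but that did not work very well, especially for multi-word titles. The paper suggests using WordNet... Good.
We define an unambiguous title to be a title which has zero or one hits when searching for it in WordNet.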
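The two title features can be sketched as below. To keep the example self-contained, the WordNet lookup is passed in as a hypothetical count_senses callable rather than wired to a real WordNet; with NLTK installed, something like len(wordnet.synsets(title.replace(" ", "_"))) would be one way to supply it. The plain substring matching and the assumption that titles are supplied in their properly capitalized form are simplifications.

```python
def is_unambiguous(title, count_senses):
    """A title is unambiguous if the lexicon returns zero or one hits for it."""
    return count_senses(title) <= 1


def title_features(text, titles, count_senses):
    """Compute title_case and titles_match for one message.

    titles: show titles in their capitalized form, e.g. "House".
    count_senses: hypothetical callable mapping a title to its number of
                  WordNet hits.
    """
    lowered = text.lower()
    title_case = 0
    titles_match = 0
    for title in titles:
        if title.lower() not in lowered:
            continue  # this title is not mentioned in the message
        if title in text:
            title_case = 1  # mentioned with its original capitalization
        if is_unambiguous(title, count_senses):
            titles_match = 1
    return {"title_case": title_case, "titles_match": titles_match}
```

Substring matching will also fire on words such as "household" for the title "House"; a real implementation would match on token boundaries.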
Features based on domain knowledge crawled from online sources
One of our assumptions is that messages relevant to a show often contain names of actors, characters, or other keywords strongly related to the show.
cosine_characters, cosine_actors, and cosine_wiki, we compute the cosine similarity between a new message and the information we crawled (from TV.com and Wikipedia) about the show for each of the three features.
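A minimal sketch of these three features, assuming a simple bag-of-words model with raw term counts (the notes do not say whether the paper uses any term weighting), and assuming the crawled information for a show is available as one plain-text string per source:

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_features(message, show_info):
    """show_info maps a source name ("characters", "actors", "wiki") to crawled text."""
    msg_vec = bag_of_words(message)
    return {
        "cosine_" + source: cosine_similarity(msg_vec, bag_of_words(doc))
        for source, doc in show_info.items()
    }
```

A message mentioning "Hugh Laurie" would score high on cosine_actors for House even though the title never appears, which is how these features help recall.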
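This method can greatly improve recall, but it is fairly cumbersome to implement, and because of Twitter's access limits we cannot set too many terms per show, so we have never adopted it.
The above lists the 9 initial features. After running the initial classifier on the test set, the following additional features were found: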
pos_rules_score and neg_rules_score are natural extensions of the feature rules_score.
For instance, for the show House we can now learn positive rules such as episode of house, as well as negative rules such as in the house or the white house.
users_score and hashtags_score
Using messages labeled by Classifier #1, we can determine commonly occurring hashtags and users which often talk about a particular show. Furthermore, these features can also help us expand the set of queries for each show, thus improving the recall by searching for hashtags and users related to the show, in addition to the title.
We had thought of this before as well but never implemented it; it can improve recall.
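One way the hashtag and user statistics behind users_score and hashtags_score might be gathered is sketched here. The input format (author, text, label produced by the first classifier) and the min_count cutoff are assumptions for illustration; the notes do not describe how the paper turns these counts into scores.

```python
import re
from collections import Counter


def frequent_hashtags_and_users(labeled_messages, min_count=2):
    """Collect hashtags and authors that recur in messages labeled as relevant.

    labeled_messages: iterable of (author, text, is_relevant) tuples, where
                      is_relevant is the label assigned by the first classifier.
    min_count: hypothetical cutoff for what counts as commonly occurring.
    """
    hashtags = Counter()
    users = Counter()
    for author, text, is_relevant in labeled_messages:
        if not is_relevant:
            continue
        # Count each hashtag once per message so one spammy tweet cannot dominate.
        hashtags.update(set(re.findall(r"#\w+", text.lower())))
        users[author.lower()] += 1
    frequent_tags = {tag for tag, n in hashtags.items() if n >= min_count}
    frequent_users = {user for user, n in users.items() if n >= min_count}
    return frequent_tags, frequent_users
```

The returned sets can serve both as lookup tables for the two features and as extra search queries alongside the show title.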
rush_period, this feature is based on the observation that users of social media websites often discuss a show during the time it is on air. When classifying a new message we check how many mentions of the show there were in the previous window of 10 minutes. The feature is set to 1 if the count exceeds some threshold, and 0 otherwise.
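The rush_period check can be sketched as a count over a sliding window. The sorted list of mention timestamps and the default threshold of 50 are assumptions; the notes only say that some threshold is used with a 10-minute window.

```python
from bisect import bisect_left


def rush_period(mention_times, message_time, window_seconds=600, threshold=50):
    """Return 1 if the show was mentioned more than `threshold` times in the
    window of `window_seconds` before `message_time`, else 0.

    mention_times: sorted list of timestamps (in seconds) of earlier mentions.
    threshold: hypothetical value; it would be tuned on real data.
    """
    window_start = message_time - window_seconds
    lo = bisect_left(mention_times, window_start)
    hi = bisect_left(mention_times, message_time)
    return int(hi - lo > threshold)
```

Using binary search keeps each lookup logarithmic in the number of stored mentions, which matters when messages arrive as a stream.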