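While reading papers on social networks, I learned about the ways researchers obtain their datasets. A summary:
1 Write a crawler to collect the data. For example, "Broadcast Yourself: Understanding YouTube Uploaders" uses a DFS crawler to collect its data.
2 Use public datasets released by research institutions or companies. For example, "Friend or Frenemy? Predicting Signed Ties in Social Networks" uses the Epinions dataset from http://www.trustlet.org/wiki/Epinions_datasets; other sources include Yahoo's public datasets, Stack Overflow, and trustlet.
3 Obtain data through a website's API, for example Sina Weibo.
4 Collaborate with people inside the relevant company to obtain data, for example "Uncovering Social Network Sybils in the Wild".
5 Use data provided by specialized international organizations, for example network topology data and traceroute data.
6 Use other domain-specific data, for example adverse drug reaction data, the UCI Machine Learning Repository, and the UGC social observation dataset (under construction).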
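To make item 1 a bit more concrete, here is a minimal sketch of what a DFS crawl over a social graph could look like. It is not the crawler from the YouTube paper, whose details are not covered here. The names `dfs_crawl`, `get_neighbors`, and `max_nodes` are my own assumptions: `get_neighbors` stands in for whatever page fetch or API call returns the users linked from a given user, and `max_nodes` is an arbitrary cap so the crawl stops.

```python
from typing import Callable, Hashable, Iterable, List, Set, Tuple


def dfs_crawl(
    seed: Hashable,
    get_neighbors: Callable[[Hashable], Iterable[Hashable]],
    max_nodes: int = 1000,
) -> Tuple[List[Hashable], List[Tuple[Hashable, Hashable]]]:
    """Depth-first crawl starting from `seed`.

    `get_neighbors` is a hypothetical callback; in a real crawler it would
    fetch a profile page (or call an API) and return the linked user ids.
    Returns the visit order and the list of observed (source, target) edges.
    """
    visited: Set[Hashable] = set()
    order: List[Hashable] = []
    edges: List[Tuple[Hashable, Hashable]] = []
    stack: List[Hashable] = [seed]

    while stack and len(visited) < max_nodes:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)

        neighbors = list(get_neighbors(node))
        for nb in neighbors:
            edges.append((node, nb))
        # Push in reverse so the first-listed neighbor is explored first.
        for nb in reversed(neighbors):
            if nb not in visited:
                stack.append(nb)

    return order, edges


if __name__ == "__main__":
    # Toy in-memory graph used in place of a real website.
    toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    order, edges = dfs_crawl("a", lambda u: toy_graph.get(u, []))
    print(order)
    print(edges)
```

A real crawler would also need rate limiting, retries, and respect for the site's terms of use, none of which are shown here.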
Summary: we should prefer public datasets when starting a research project, because they let us focus on the research rather than on data collection, and they make it convenient to compare our conclusions or observations with those of other studies based on the same dataset. However, if no available dataset covers what we want to investigate, we have to crawl the sites ourselves or contact people who have the data.
Actually, I am confused about one question: do we start from a dataset and then look for problems to study, or do we discover problems in real life and then look for a dataset to study them further? Maybe it depends, which is the least informative answer. I will keep updating this list as I discover more :)