计算两篇文章相似度

计算两篇文章相似度

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 2), (10, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (10, 1)]]

例如（9，2）这个元素代表第二篇文档中id为9的单词“silver”出现了2次。

有了这些信息，我们就可以基于这些“训练文档”计算一个TF-IDF“模型”：

>>> tfidf = models.TfidfModel(corpus)

基于这个TF-IDF模型，我们可以将上述用词频表示文档向量表示为一个用tf-idf值表示的文档向量：

Based on this tf-idf model, we can express the document vector represented by word frequency as a document vector represented by tf-idf value

>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf:
... print doc
...
[(1, 0.6633689723434505), (2, 0.6633689723434505), (3, 0.2448297500958463), (6, 0.2448297500958463)]
[(7, 0.16073253746956623), (8, 0.4355066251613605), (9, 0.871013250322721), (10, 0.16073253746956623)]
[(3, 0.5), (6, 0.5), (7, 0.5), (10, 0.5)]

发现一些token貌似丢失了，我们打印一下tfidf模型中的信息：

>>> print tfidf.dfs
{0: 3, 1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 1, 9: 1, 10: 2}
>>> print tfidf.idfs
{0: 0.0, 1: 1.5849625007211563, 2: 1.5849625007211563, 3: 0.5849625007211562, 4: 0.0, 5: 0.0, 6: 0.5849625007211562, 7: 0.5849625007211562, 8: 1.5849625007211563, 9: 1.5849625007211563, 10: 0.5849625007211562}

我们发现由于包含id为0， 4， 5这3个单词的文档数（df)为3，而文档总数也为3，所以idf被计算为0了，看来gensim没有对分子加1，做一个平滑。不过我们同时也发现这3个单词分别为a, in, of这样的介词，完全可以在预处理时作为停用词干掉，这也从另一个方面说明TF-IDF的有效性。

有了tf-idf值表示的文档向量，我们就可以训练一个LSI模型，和Latent Semantic Indexing (LSI) A Fast Track Tutorial中的例子相似，我们设置topic数为2：

lsi的物理意义不太好解释，不过最核心的意义是将训练文档向量组成的矩阵SVD分解，并做了一个秩为2的近似SVD分解，可以参考那篇英文tutorail。有了这个lsi模型，我们就可以将文档映射到一个二维的topic空间中：
相关阅读:
Windows 服务程序(一)
API---注册表编程
 API---文件操作
 main(argc, char *argv[])
C 自删除技术---批处理方式
 分治法排序
 TDD尝试：nodejs单元测试
 尝试create tech team
Yum重装走过的坑
 求生欲很强的数据库
原文地址：https://www.cnblogs.com/ldphoebe/p/12228948.html