TF-IDF In Scikit-Learn
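Before the TF-IDF steps below there is one preliminary step: computing the term frequency, that is, how often each word occurs. scikit-learn provides a class for this, CountVectorizer, in sklearn.feature_extraction.text.
The small demo below shows how to use it.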
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> countVectorizer = CountVectorizer()
>>> countVectorizer
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\b\w\w+\b',
tokenizer=None, vocabulary=None)
>>> corpus = ['This is the first document.', 'This is the second second document.', 'And the third one.', 'Is this the first document?']
>>> X = countVectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<class 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Row format>
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
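The counts are now computed. Next we need to know which token each column stands for, that is, the full vocabulary of the corpus. (In scikit-learn 1.0 and later this method is named get_feature_names_out(); get_feature_names() was removed in 1.2.)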
>>> countVectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
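Each row of the matrix lines up with this list. The first row, [0, 1, 1, 1, 0, 0, 1, 0, 1], corresponds to 'This is the first document.': "and" appears 0 times, "document" appears 1 time, and so on.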
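To make the counting step concrete, here is a minimal pure-Python sketch of what CountVectorizer does with its default settings on this corpus. The helper names `tokenize` and `count_row` are made up for illustration, and the sketch assumes the default behavior shown in the repr above (lowercasing, tokens of two or more word characters); it is not scikit-learn's actual implementation.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Same idea as the default token_pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


# Sorted vocabulary: the position of each token is its column index.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})


def count_row(text):
    """Return one row of the document-term matrix for `text`."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]


matrix = [count_row(doc) for doc in corpus]

# Pair every column of the first row with the token it counts.
for token, count in zip(vocabulary, matrix[0]):
    print(f"{token!r} appears {count} time(s) in {corpus[0]!r}")
```

Reading a row next to the vocabulary like this is the quickest way to see that each column is a token and each value is a count.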
With the counts in hand, we can go on to compute TF-IDF.
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> tfidf = transformer.fit_transform(X)
>>> tfidf
<4x9 sparse matrix of type '<class 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
>>> tfidf.toarray()
array([[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674],
[ 0. , 0.27230147, 0. , 0.27230147, 0. ,
0.85322574, 0.22262429, 0. , 0.27230147],
[ 0.55280532, 0. , 0. , 0. , 0.55280532,
0. , 0.28847675, 0.55280532, 0. ],
[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674]])
There is also a class called TfidfVectorizer that combines CountVectorizer and TfidfTransformer, so we can use the simpler approach below.
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
stop_words=None, strip_accents=None, sublinear_tf=False,
token_pattern='(?u)\b\w\w+\b', tokenizer=None, use_idf=True,
vocabulary=None)
>>> vectorizer.fit_transform(corpus)
<4x9 sparse matrix of type '<class 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
>>> vectorizer.fit_transform(corpus).toarray()
array([[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674],
[ 0. , 0.27230147, 0. , 0.27230147, 0. ,
0.85322574, 0.22262429, 0. , 0.27230147],
[ 0.55280532, 0. , 0. , 0. , 0.55280532,
0. , 0.28847675, 0.55280532, 0. ],
[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674]])
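As a quick check that the two routes really agree, the sketch below runs both on the same corpus and compares the results numerically. It assumes scikit-learn and NumPy are installed and uses default parameters throughout; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Route 1: count first, then reweight the counts.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# Route 2: do both in a single estimator.
one_step = TfidfVectorizer().fit_transform(corpus).toarray()

# The two matrices should match up to floating-point rounding.
same = np.allclose(two_step, one_step)
print(same)
```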
Text Feature Extraction
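Most machine learning algorithms expect fixed-size numerical feature vectors rather than raw text of variable length. To address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely: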
- tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
- counting the occurrences of tokens in each document.
- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
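Introduction to TF-IDF
Term Frequency-Inverse Document Frequency: taken literally, it is term frequency combined with inverse document frequency. The description below explains what this statistic is for; it is used in information retrieval and text mining.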
The tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
The calculation is simple: TF-IDF(term, doc) = TF(term, doc) * IDF(term)
TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms such as "is", "of", and "that" may appear many times but carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
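The two textbook formulas above can be sketched in a few lines of plain Python. The token lists below are an assumed, already-tokenized version of the demo corpus, and `tf`, `idf`, and `tf_idf` are illustrative names; note that this is the textbook definition, not the variant scikit-learn uses.

```python
import math

# Assumed pre-tokenized documents (lowercased, punctuation removed).
docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "is", "the", "second", "second", "document"],
    ["and", "the", "third", "one"],
    ["is", "this", "the", "first", "document"],
]


def tf(term, doc):
    """Occurrences of `term` in `doc`, divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Natural log of (number of documents / documents containing `term`).

    Raises ZeroDivisionError if no document contains `term`.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Textbook TF-IDF: TF(term, doc) * IDF(term)."""
    return tf(term, doc) * idf(term, docs)


# "the" occurs in every document, so its IDF (and TF-IDF) is 0.
print(idf("the", docs))
# "second" occurs twice in one document only, so it scores much higher there.
print(tf_idf("second", docs[1], docs))
```

With this plain definition a term that appears in every document gets a weight of exactly zero, which is one reason library implementations often adjust the formula.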
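TF-IDF Implementation in Scikit-learn
Next, let's look at how the TF-IDF computation in scikit-learn differs from the definition above. The first difference is how idf(term) is computed. The class that computes tf-idf is TfidfTransformer in sklearn.feature_extraction.text (see its documentation for details). Depending on whether smooth_idf is True or False, idf is computed in one of two ways: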
- smooth_idf = True (default)
idf(t) = log((1 + n(d)) / (1 + df(d,t))) + 1
- smooth_idf = False
idf(t) = log(n(d) / df(d,t)) + 1
where n(d) is the total number of documents, df(d,t) is the number of documents that contain term t, and log is the natural logarithm. The resulting tf-idf vectors are then normalized by the Euclidean norm.
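The two idf variants are easy to compare side by side. The sketch below is plain Python with illustrative function names (`idf_smooth`, `idf_plain`); it only evaluates the formulas and does not call scikit-learn.

```python
import math


def idf_smooth(n_docs, df):
    """idf with smoothing: log((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1


def idf_plain(n_docs, df):
    """idf without smoothing: log(n / df) + 1 (requires df > 0)."""
    return math.log(n_docs / df) + 1


# Document frequencies 1..4 in a 4-document corpus.
n_docs = 4
for df in range(1, n_docs + 1):
    print(df, round(idf_smooth(n_docs, df), 4), round(idf_plain(n_docs, df), 4))
```

In both variants a term that occurs in every document gets idf = 1 rather than 0, so it is down-weighted but never discarded. The smoothed form behaves as if one extra document containing every term had been added, which avoids a division by zero.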
Next, let's walk through a demo from the official documentation that computes tf-idf.
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf.toarray()
array([[ 0.81940995, 0. , 0.57320793],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 0.47330339, 0.88089948, 0. ],
[ 0.58149261, 0. , 0.81355169]])
The first term is present 100% of the time hence not very interesting. The two other features only in less than 50% of the time hence probably more representative of the content of the documents
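At first I could not see why the documentation says 100% when the value 3 clearly does not make up 100% of the entries. If you read it the same way, you will get stuck too. I searched online for an explanation, found nothing, and went back to reread the original documentation.
So here is the explanation: why 100%?
If you read 3, 0, and 1 as ordinary numbers, you could think for a week and still not see how the calculation works. Each number here is the number of times a token occurs. To understand this, go back to what we said at the start: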
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
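Put simply, the matrix above is already the result of processing a set of texts. It is a 6x3 matrix: 6 documents (rows) and 3 tokens (columns). Suppose the three tokens are "my", "name", and "robert". Then the first row, [3, 0, 1], could come from the text ['my robert my my']. Looking back at the earlier question, the 100% now makes sense: no entry in the first column is 0, which means every row (document) contains that token.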
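The "100%" and "less than 50%" figures can be read straight off the count matrix by counting, per column, how many rows are non-zero. A small sketch, where the token names are the hypothetical ones used above:

```python
counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
]
tokens = ["my", "name", "robert"]  # hypothetical names for the three columns

n_docs = len(counts)

# Document frequency: in how many rows (documents) is each column non-zero?
doc_freq = [sum(1 for row in counts if row[col] > 0) for col in range(len(tokens))]

# Share of documents that contain each token.
share = [df / n_docs for df in doc_freq]

for token, df, frac in zip(tokens, doc_freq, share):
    print(f"{token}: in {df} of {n_docs} documents ({frac:.0%})")
```

The first column is present in all 6 documents, while the other two appear in 1 and 2 documents respectively.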
Next, the calculation itself (first row only, with log as the natural logarithm):
n(d) = 6
df(d, term1) = 6
idf(term1) = log(n(d) / df(d, term1)) + 1 = log(6/6) + 1 = 1
tf-idf(term1) = tf * idf = 3 * 1 = 3
tf-idf(term2) = tf * idf = 0 * (log(6/1) + 1) = 0
tf-idf(term3) = tf * idf = 1 * (log(6/2) + 1) ≈ 2.0986
The resulting tf-idf row, after normalization, is
[3, 0, 2.0986] / sqrt(3^2 + 0^2 + 2.0986^2) = [0.819, 0, 0.573]
For example, 0.819 = 3 / sqrt(3^2 + 0^2 + 2.0986^2)
// Euclidean norm
v(norm) = v / sqrt(v(1)^2 + ... + v(n)^2)
Here sqrt denotes the square root.
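The same steps can be applied to every row. The sketch below is a pure-Python rendering of the hand calculation (unsmoothed idf, then Euclidean normalization); `tfidf_rows` is an illustrative name, and the code assumes every column occurs in at least one document.

```python
import math

counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
]


def tfidf_rows(counts):
    """tf * (log(n / df) + 1) per entry, then L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency of each column (assumed to be at least 1).
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # An all-zero row is left as zeros instead of dividing by zero.
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


rows = tfidf_rows(counts)
for row in rows:
    print([round(v, 8) for v in row])
```

Comparing the printed rows with the array returned by TfidfTransformer(smooth_idf=False) is a handy way to confirm the reading of the formula.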
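I hope this helps people who are learning Tf-Idf in scikit-learn for the first time avoid a few detours, especially when it comes to understanding the matrix discussed above.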
