【346】TF-IDF

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus=["I come to China to travel", 
    "This is a car polupar in China",          
    "I love tea and Apple ",   
    "The work is to write some papers in science"]
>>> vectorizer=CountVectorizer()
>>> transformer = TfidfTransformer()
>>> tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
>>> print(tfidf)
  (0, 16)	0.4424621378947393
  (0, 15)	0.697684463383976
  (0, 4)	0.4424621378947393
  (0, 3)	0.348842231691988
  (1, 14)	0.45338639737285463
  (1, 9)	0.45338639737285463
  (1, 6)	0.3574550433419527
  (1, 5)	0.3574550433419527
  (1, 3)	0.3574550433419527
  (1, 2)	0.45338639737285463
  (2, 12)	0.5
  (2, 7)	0.5
  (2, 1)	0.5
  (2, 0)	0.5
  (3, 18)	0.3565798233381452
  (3, 17)	0.3565798233381452
  (3, 15)	0.2811316284405006
  (3, 13)	0.3565798233381452
  (3, 11)	0.3565798233381452
  (3, 10)	0.3565798233381452
  (3, 8)	0.3565798233381452
  (3, 6)	0.2811316284405006
  (3, 5)	0.2811316284405006
>>> print(vectorizer.get_feature_names_out().tolist())    # get_feature_names() was removed in scikit-learn 1.2
['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love', 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this', 'to', 'travel', 'work', 'write']

Note: (0, 16) refers to the first text (row 0) and the term at index 16, which is "travel". The other entries are read the same way.
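
To make the (row, column) reading concrete, here is a minimal pure-Python sketch that decodes the four entries printed for row 0. The `names` and `entries` lists are copied by hand from the output above, and the helper `decode` is a hypothetical name for illustration, not part of scikit-learn.

```python
# Feature names copied from the vectorizer output above (list index = column index).
names = ['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love',
         'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this',
         'to', 'travel', 'work', 'write']

# (row, column, weight) triples copied from the printed sparse matrix, row 0 only.
entries = [(0, 16, 0.4424621378947393),
           (0, 15, 0.697684463383976),
           (0, 4, 0.4424621378947393),
           (0, 3, 0.348842231691988)]

def decode(entries, names):
    """Hypothetical helper: replace each column index with the term it stands for."""
    return [(row, names[col], weight) for row, col, weight in entries]

decoded = decode(entries, names)
for row, term, weight in decoded:
    print(row, term, weight)
```

The first decoded entry is the term at index 16, which agrees with the note above.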

Continuing from the above, we now retrieve the TF-IDF value of each term. The tfidf variable holds a 4 x 19 matrix: one row per sentence and one column per term.

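>>> tfidf_array = tfidf.toarray()    # dense array; each row is walked below and converted to a list
>>> names_list = vectorizer.get_feature_names_out().tolist()    # list of feature names (get_feature_names() was removed in scikit-learn 1.2)
>>> for i in range(0, len(corpus)):
	print(corpus[i], '\n')
	tmp_list = tfidf_array[i].tolist()
	for j in range(0, len(names_list)):
		if tmp_list[j] != 0:
			# one tab after long names, two after short ones, so the values line up
			if len(names_list[j]) >= 7:
				print(names_list[j], '\t', tmp_list[j])
			else:
				print(names_list[j], '\t\t', tmp_list[j])
	print('')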

	
I come to China to travel 

china 		 0.348842231691988
come 		 0.4424621378947393
to 		 0.697684463383976
travel 		 0.4424621378947393

This is a car polupar in China 

car 		 0.45338639737285463
china 		 0.3574550433419527
in 		 0.3574550433419527
is 		 0.3574550433419527
polupar 	 0.45338639737285463
this 		 0.45338639737285463

I love tea and Apple  

and 		 0.5
apple 		 0.5
love 		 0.5
tea 		 0.5

The work is to write some papers in science 

in 		 0.2811316284405006
is 		 0.2811316284405006
papers 		 0.3565798233381452
science 	 0.3565798233381452
some 		 0.3565798233381452
the 		 0.3565798233381452
to 		 0.2811316284405006
work 		 0.3565798233381452
write 		 0.3565798233381452

>>>

Getting the TF (Term Frequency)

>>> X = vectorizer.fit_transform(corpus)
>>> X.toarray()
array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0],
       [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]],
      dtype=int64)
>>> vector_array = X.toarray()
>>> for i in range(0, len(corpus)):
	print(corpus[i], '\n')
	tmp_list = vector_array[i].tolist()
	for j in range(0, len(names_list)):
		if tmp_list[j] != 0:
			# one tab after long names, two after short ones, so the counts line up
			if len(names_list[j]) >= 7:
				print(names_list[j], '\t', tmp_list[j])
			else:
				print(names_list[j], '\t\t', tmp_list[j])
	print('')

I come to China to travel 

china 		 1
come 		 1
to 		 2
travel 		 1

This is a car polupar in China 

car 		 1
china 		 1
in 		 1
is 		 1
polupar 	 1
this 		 1

I love tea and Apple  

and 		 1
apple 		 1
love 		 1
tea 		 1

The work is to write some papers in science 

in 		 1
is 		 1
papers 		 1
science 	 1
some 		 1
the 		 1
to 		 1
work 		 1
write 		 1

>>>
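
The counts above are enough to reproduce the TF-IDF weights by hand. The sketch below assumes the TfidfTransformer defaults (smooth_idf=True, norm='l2', sublinear_tf=False), under which the scikit-learn documentation gives idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name `manual_tfidf` is illustrative only, and the `counts` matrix is copied from X.toarray() above.

```python
import math

# Term counts copied from X.toarray(): one row per sentence, one column per term.
counts = [
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1],
]

def manual_tfidf(counts):
    """Illustrative re-implementation, assuming smoothed idf and L2 row normalization."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many sentences each term occurs at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Scale each row to unit Euclidean length; an all-zero row stays zero.
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

weights = manual_tfidf(counts)
print(weights[0][15])  # weight of "to" (column 15) in the first sentence
```

Under these assumptions the third sentence comes out at 0.5 per term: its four terms each occur once and in no other sentence, so they share the same idf, and normalizing four equal values to unit length gives 1/sqrt(4).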

Original article: https://www.cnblogs.com/alex-bn-lee/p/10212235.html