The plan is divided into the following steps:
1. Write a Python crawler to fetch NetEase (163.com) news
2. Process the crawled text with a word segmenter to build a corpus
3. Use TF-IDF to automatically extract each article's keywords
4. Use TF-IDF to recommend similar articles
step 1a
I spent the whole day on the Python crawler, and it took a lot of effort to produce a barely usable version.
# -*- coding: utf-8 -*-

import re, urllib, sys
import pyodbc

newsLink = set()     ## all news links collected so far
processLink = set()  ## links currently being processed
newLink = set()      ## newly discovered links
viewedLink = set()   ## links already visited

## Open the given link, find all news links in the page with a regular
## expression, and add them to the global sets
def getNewsLink(link):
    ##print link
    if(link in viewedLink):
        return
    viewedLink.add(link)
    content = ""
    try:  ## this step may throw an exception
        content = urllib.urlopen(link).read().decode('gbk').encode('utf-8')
    except:
        info = sys.exc_info()
        print info[0], ":", info[1]
        print "caused by link : ", link
    ## NetEase news links look like http://news.163.com/14/0621/12/9V8V9AL60001124J.html
    m = re.findall(r"news\.163\.com/\d{2}/\d{4}/\d{2}/\w+\.html", content, re.M)
    for i in m:
        url = "http://" + i
        newLink.add(url)
        newsLink.add(url)
    print "crawled %d pages, got %d links" % (len(viewedLink), len(newsLink))

## Save the collected news IDs into the database
def saveNewsIDtoDB():
    newsID = dict()
    for link in newsLink:
        ID = link[31:47]  ## slice out the 16-character news ID
        newsID[ID] = link
    conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
    cursor = conn.cursor()
    for (ID, url) in newsID.items():
        sql = "INSERT INTO News(NewsID, Url) VALUES ('%s','%s')" % (ID, url)
        try:
            cursor.execute(sql)
        except:
            info = sys.exc_info()
            print info[0], ":", info[1]
            print "caused by sql : ", sql
    conn.commit()
    conn.close()
    print "collected %d news IDs in total" % (len(newsID))

## Crawl until the requested number of news links has been collected
def readNews(count):
    processLink = set()
    processLink.add("http://news.163.com/")
    while(len(newsLink) < count):
        for link in processLink:
            getNewsLink(link)
        processLink = newLink.copy()
        newLink.clear()

readNews(10000)
saveNewsIDtoDB()
It automatically crawls the specified number of news links and stores their IDs in the database.
NetEase does not expose a public news API, but its news URLs all follow a fixed format.
For example, in http://news.163.com/14/0621/12/9V8V9AL60001124J.html, "14" is the year, "0621" is the date, the meaning of "12" is unclear (but it is always two digits), and the trailing 16-character string is the news ID.
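Because the layout is fixed, the date and ID can be read straight from fixed character offsets. A quick illustrative snippet (not part of the crawler itself; the offsets are the same ones used in the code above and in the parser below):

# the prefix "http://news.163.com/" is exactly 20 characters long
url = "http://news.163.com/14/0621/12/9V8V9AL60001124J.html"

year = "20" + url[20:22]   # "2014"
month = url[23:25]         # "06"
day = url[25:27]           # "21"
newsID = url[31:47]        # "9V8V9AL60001124J"

print year, month, day, newsID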
After running for a few dozen minutes, it collected 10,360 news links.
step 1b
Parse each link with BeautifulSoup to get the article's title, body, and publication time.
It ran for close to an hour and produced 9,714 news records. Nearly a thousand were lost along the way: some articles had already been deleted, while others had malformed bodies (a pile of JS code was scraped instead), which caused errors when saving to the database.
That is good enough for now.
The parsing code is as follows:
# encoding: utf-8
import re, urllib, sys
import pyodbc, json
import socket
from bs4 import BeautifulSoup
socket.setdefaulttimeout(10.0)

def readNews():
    conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
    cursor = conn.cursor()
    sql = "SELECT * FROM News"
    cursor.execute(sql)
    rows = cursor.fetchall()

    updateCount = 0

    for row in rows:  # read the links from the database
        print row.NewsID, row.Url
        content = ""
        ptime = ""
        title = ""
        body = ""
        newsID = row.NewsID.strip()
        try:  # this step may throw an exception
            content = urllib.urlopen(row.Url).read()  # fetch the page
            ptime = "20" + row.Url[20:22] + "-" + row.Url[23:25] + "-" + row.Url[25:27]  # publication date from the URL
            title, body = analyzeNews(content)  # parse the page to get title and body
        except:
            info = sys.exc_info()
            print info[0], ":", info[1]
            print "caused by link : ", row.Url
            continue

        sql = "UPDATE News SET Title = '%s', Body = '%s', ptime = '%s' WHERE NewsID = '%s'" % (title, body, ptime, newsID)
        try:  # this step may throw an exception
            cursor.execute(sql)
        except:
            info = sys.exc_info()
            print info[0], ":", info[1]
            print "caused by sql : ", sql
            continue
        updateCount += 1
        if(updateCount % 100 == 0):
            conn.commit()
            print "Updated %s records so far!" % (updateCount)
    conn.commit()
    conn.close()
    print "Done: %s records updated in total!" % (updateCount)

def analyzeNews(content):
    soup = BeautifulSoup(content, from_encoding="gb18030")
    title = soup.title.get_text()[:-7]  # drop the trailing 7-character site suffix of the page title
    bodyHtml = soup.find(id = "endtext")
    if(bodyHtml == None):
        bodyHtml = soup.find(id = "text")
    if(bodyHtml == None):
        bodyHtml = soup.find(id = "endText")
    body = bodyHtml.get_text()
    body = re.sub("\n+", "\n", body)  # collapse consecutive newlines
    print title
    return title, body

readNews()
step 2
Use jieba to segment each article and store the word counts in the database, with the title weighted five times as heavily as the body.
I did not expect the database to be this fast: it executes nearly ten thousand INSERT statements per second.
The code is as follows:
# -*- coding: utf-8 -*-

import re, urllib, sys
import pyodbc
import jieba

# load the stop-word list
stop = [line.strip().decode('utf-8') for line in open('chinese_stopword.txt').readlines()]

def readNewsContent():
    conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
    cursor = conn.cursor()
    sql = "SELECT * FROM News"
    cursor.execute(sql)
    rows = cursor.fetchall()

    word_dict = dict()  # number of articles containing each word (document frequency)

    insert_count = 0
    for row in rows:  # read the news from the database
        content = row.Body
        title = row.Title
        newsID = row.NewsID.strip()
        seg_dict = sliceNews(title, content)  # segment into words

        newsWordCount = 0
        for (word, count) in seg_dict.items():
            newsWordCount += count
            # store this article's word frequencies
            sql = "INSERT INTO ContentWord(Word, Count, NewsID) VALUES ('%s',%d, '%s')" % (word, count, newsID)
            cursor.execute(sql)
            insert_count += 1
            if(insert_count % 10000 == 0):
                print "Inserted %d per-news word-frequency records!" % (insert_count)
            if(word in word_dict):  # maintain word_dict
                word_dict[word] += 1
            else:
                word_dict[word] = 1
        sql = "UPDATE News SET WordCount = '%d' WHERE NewsID = '%s'" % (newsWordCount, newsID)
        cursor.execute(sql)
    conn.commit()
    print "Inserted %d per-news word-frequency records in total!" % (insert_count)

    # store word_dict in the database
    for (word, count) in word_dict.items():
        sql = "INSERT INTO TotalWord(Word, Count) VALUES ('%s',%d)" % (word, count)
        cursor.execute(sql)
    print "Inserted %d total word-frequency records!" % (len(word_dict.items()))
    conn.commit()
    conn.close()

# Segment the input text and return its word frequencies with stop words removed
def sliceNews(title, content):
    title_segs = list(jieba.cut(title))
    segs = list(jieba.cut(content))
    for i in range(5):  # the title counts five times as much as the body
        segs += title_segs

    seg_set = set(segs)
    seg_dict = dict()
    for seg in seg_set:  # drop stop words and count word frequencies for this article
        if(seg not in stop and re.match(ur"[\u4e00-\u9fa5]+", seg)):  # keep Chinese words only
            seg_dict[seg] = segs.count(seg)

    return seg_dict

readNewsContent()
It finished in a few minutes, inserting 1,475,330 per-news word-frequency records and 135,961 total word-frequency records.
step 3
Next, compute TF-IDF values from the segmentation results, take the 20 words with the highest TF-IDF in each article as its keywords, and save them to the database.
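For reference, the values are computed as in the code below: the term frequency is the word count normalized by the article length, and the inverse document frequency is smoothed by adding 1 to the document frequency (the count stored in TotalWord):

\mathrm{tf}(w,d) = \frac{\mathrm{count}(w,d)}{|d|}, \qquad
\mathrm{idf}(w) = \log\frac{N}{\mathrm{df}(w)+1}, \qquad
\mathrm{tfidf}(w,d) = \mathrm{tf}(w,d)\cdot\mathrm{idf}(w)

where N is the total number of articles and df(w) is the number of articles containing w.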
The code is as follows:
# -*- coding: utf-8 -*-

import re, urllib, sys
import pyodbc
import math


conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
cursor = conn.cursor()
newsCount = 0
totalWordDict = dict()

def init():
    # read the total number of articles
    sql = "SELECT COUNT(*) FROM News"
    cursor.execute(sql)
    row = cursor.fetchone()
    global newsCount
    newsCount = int(row[0])
    # read the document frequencies and build the dictionary
    sql = "SELECT * FROM TotalWord"
    cursor.execute(sql)
    rows = cursor.fetchall()
    for row in rows:
        totalWordDict[row.Word.strip()] = int(row.Count)

def clean():
    conn.commit()
    conn.close()

# compute the TF-IDF values of every article's keywords
def cacluTFIDF():
    sql = "SELECT * FROM NEWS"  # iterate over all articles
    cursor.execute(sql)
    rows = cursor.fetchall()
    insertCount = 0
    for row in rows:  # compute the keyword TF-IDF values for each article
        newsID = row.NewsID.strip()
        keyWordList = calcuKeyWords(newsID)
        for keyWord in keyWordList:  # store the TF-IDF values in the database
            word = keyWord[0]
            value = keyWord[1]
            sql = "INSERT INTO TFIDF(Word, Value, NewsID) VALUES ('%s',%f, '%s')" % (word, value, newsID)
            cursor.execute(sql)
            insertCount += 1
            if(insertCount % 10000 == 0):
                print "Inserted %d TF-IDF records!" % (insertCount)
    conn.commit()
    print "Inserted %d TF-IDF records in total!" % (insertCount)

# compute the keywords of the given article
def calcuKeyWords(newsID):
    newsID = newsID.strip()
    sql = "SELECT * FROM NEWS WHERE NewsID = '%s'" % (newsID)
    cursor.execute(sql)
    newsWordCount = cursor.fetchone().WordCount  # total word count of the article

    sql = "SELECT * FROM ContentWord WHERE NewsID = '%s'" % (newsID)
    cursor.execute(sql)
    rows = cursor.fetchall()
    tfidf_dict = dict()
    global newsCount
    # build this article's TF-IDF dictionary
    for row in rows:
        word = row.Word.strip()
        count = row.Count
        tf = float(count) / newsWordCount
        idf = math.log(float(newsCount) / (totalWordDict[word] + 1))
        tfidf = tf * idf
        tfidf_dict[word] = tfidf
    # keep the 20 words with the highest TF-IDF
    keyWordList = sorted(tfidf_dict.items(), key=lambda d: d[1])[-20:]
    return keyWordList


init()
cacluTFIDF()
clean()
For example, for the article 重庆东胜煤矿5名遇难者遗体全部找到 (bodies of all five victims recovered at the Dongsheng coal mine in Chongqing),
the keywords the program computed, listed from lowest to highest weight, are:
窜年产采空区工人冒落东翼煤约矸重庆市南川名顶板工作面采煤找到遇难者重庆遗体煤矿东胜
step 4
With the keywords in hand, automatic recommendation becomes possible.
The procedure is as follows (quoted from Ruan Yifeng's blog):
(1) Use the TF-IDF algorithm to find the keywords of both articles;
(2) Take a number of keywords from each article (say 20), merge them into a single set, and compute each article's word frequencies over that set (relative frequencies can be used to compensate for differences in article length);
(3) Build the two articles' word-frequency vectors;
(4) Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles.
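For two frequency vectors a and b, the cosine similarity is

\mathrm{sim}(a,b) = \cos\theta = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}

which is exactly what calcuCosDistance computes in the code below.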
The code is as follows:
# -*- coding: utf-8 -*-

import re, urllib, sys
import pyodbc
import math

conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
cursor = conn.cursor()

def clean():
    conn.commit()
    conn.close()

# Compute the similarity of two articles: the cosine distance between their keyword vectors
def similar(newsID1, newsID2):
    newsID1 = newsID1.strip()
    newsID2 = newsID2.strip()
    # get the union of the two articles' keyword sets
    sql = "SELECT * FROM TFIDF WHERE NewsID = '%s' OR NewsID = '%s'" % (newsID1, newsID2)
    cursor.execute(sql)
    rows = cursor.fetchall()
    wordSet = set()
    for row in rows:
        wordSet.add(row.Word)
    # count how often each keyword occurs in each article, expressed as vectors
    vector1 = []
    vector2 = []
    for word in wordSet:
        sql = "SELECT * FROM ContentWord WHERE NewsID = '%s' AND Word = '%s'" % (newsID1, word)
        cursor.execute(sql)
        rows = cursor.fetchall()
        if len(rows) == 0:
            vector1.append(0)
        else:
            vector1.append(int(rows[0].Count))
        sql = "SELECT * FROM ContentWord WHERE NewsID = '%s' AND Word = '%s'" % (newsID2, word)
        cursor.execute(sql)
        rows = cursor.fetchall()
        if len(rows) == 0:
            vector2.append(0)
        else:
            vector2.append(int(rows[0].Count))
    return calcuCosDistance(vector1, vector2)

# cosine distance of two input vectors
def calcuCosDistance(a, b):
    if len(a) != len(b):
        return None
    part_up = 0.0
    a_sq = 0.0
    b_sq = 0.0
    for a1, b1 in zip(a, b):
        part_up += a1 * b1
        a_sq += a1 ** 2
        b_sq += b1 ** 2
    part_down = math.sqrt(a_sq * b_sq)
    if part_down == 0.0:
        return None
    else:
        return part_up / part_down

# Given a news ID, print the articles most similar to it
def recommand(newsID):
    limit = 5
    result = dict()
    sql = "SELECT * FROM NEWS"  # iterate over all articles
    cursor.execute(sql)
    rows = cursor.fetchall()

    newsID = newsID.strip()
    calcuCount = 0
    for row in rows:
        calcuCount += 1
        if calcuCount % 200 == 0:
            print "Computed similarity for %d news pairs" % (calcuCount)
        if row.NewsID.strip() != newsID:  # skip the article itself
            distance = similar(newsID, row.NewsID)  # similarity of the two articles
            if len(result) < limit:
                result[distance] = row.NewsID
            else:
                minDis = min(result.keys())
                if(minDis < distance):
                    del result[minDis]
                    result[distance] = row.NewsID

    print "Input news ID: %s" % (newsID)
    sql = "SELECT * FROM NEWS WHERE NewsID = '%s'" % (newsID)
    cursor.execute(sql)
    row = cursor.fetchone()
    print "Input news URL: %s" % (row.Url.encode('utf-8'))
    print "Input news title: %s" % (row.Title.decode('gb2312').encode('utf-8'))
    print "--------------------------------------"
    for sim, newsID in result.items():
        sql = "SELECT * FROM NEWS WHERE NewsID = '%s'" % (newsID)
        cursor.execute(sql)
        row = cursor.fetchone()
        print "Recommended news similarity: %f" % (sim)
        print "Recommended news ID: %s" % (row.NewsID.encode('utf-8'))
        print "Recommended news URL: %s" % (row.Url.encode('utf-8'))
        print "Recommended news title: %s" % (row.Title.decode('gb2312').encode('utf-8'))
        print ""

#print similar("2IK789GB0001121M", "2IKJ8KRJ0001121M")
recommand("A4AVPKLA00014JB5")
clean()
Input news ID: A4AVPKLA00014JB5
Input news URL: http://news.163.com/14/0823/10/A4AVPKLA00014JB5.html
Input news title: 重庆东胜煤矿5名遇难者遗体全部找到
--------------------------------------
Recommended news similarity: 0.346214
Recommended news ID: A4BHA5OO0001124J
Recommended news URL: http://news.163.com/14/0823/15/A4BHA5OO0001124J.html
Recommended news title: 安徽淮南煤矿爆炸事故救援再次发现遇难者遗体

Recommended news similarity: 0.356118
Recommended news ID: 8H0Q439K00011229
Recommended news URL: http://news.163.com/12/1123/16/8H0Q439K00011229.html
Recommended news title: 安徽淮北首富被曝用500万元买通矿难遇难者家属

Recommended news similarity: 0.320387
Recommended news ID: A3MBB7CF00014JB6
Recommended news URL: http://news.163.com/14/0815/10/A3MBB7CF00014JB6.html
Recommended news title: 黑龙江鸡西煤矿透水事故9人升井 仍有16名矿工被困

Recommended news similarity: 0.324280
Recommended news ID: 5Q92I93D000120GU
Recommended news URL: http://news.163.com/09/1211/16/5Q92I93D000120GU.html
Recommended news title: 土耳其煤矿发生瓦斯爆炸 19名矿工全部遇难

Recommended news similarity: 0.361950
Recommended news ID: 6D7J4VLR00014AED
Recommended news URL: http://news.163.com/10/0804/05/6D7J4VLR00014AED.html
Recommended news title: 贵州一煤矿发生煤与瓦斯突出事故
The recommended articles are closely related to the input.
However, recommendation has to traverse the whole database, so its time complexity is very high: similar() issues a pair of SQL queries for every keyword, and recommand() calls it once for every article in the library, so recommending for a single article takes roughly 10 minutes, which would obviously be unacceptable in practice.
Still, this is only a rough experiment, and I am personally quite satisfied with it.
My takeaway: the algorithm feels almost magical. In the code above, the program makes fairly accurate judgments automatically, without needing to understand the actual content of the news at all, which is both convenient and fun.
There is still plenty of room for optimization.
For example, useless information (source, reporter names, and the like) could be stripped while crawling.
The TF-IDF values could be adjusted according to word position; for instance, words in the first paragraph and in the first sentence of each paragraph should get a somewhat higher value (see the sketch below).
News could also be roughly categorized, so that recommending for one article no longer requires traversing the entire library.
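A minimal sketch of the position-based adjustment idea (purely hypothetical: the boost factor, the paragraph/sentence splitting, and the helper name are my own assumptions, not part of the code above):

# -*- coding: utf-8 -*-
import re

BOOST = 2.0  # assumed boost factor for prominent positions, to be tuned

def positionWeightedCounts(body, segment):
    # body: full article text (unicode); segment: a segmentation function such as jieba.cut
    # returns {word: position-weighted count}, usable in place of the plain term counts
    counts = dict()
    paragraphs = [p for p in body.split(u"\n") if p.strip()]
    for p_index, para in enumerate(paragraphs):
        sentences = re.split(u"[。！？]", para)
        for s_index, sentence in enumerate(sentences):
            # boost the first paragraph and the first sentence of every paragraph
            weight = BOOST if (p_index == 0 or s_index == 0) else 1.0
            for word in segment(sentence):
                counts[word] = counts.get(word, 0.0) + weight
    return counts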
Reference: Ruan Yifeng's blog, http://www.ruanyifeng.com/blog/2013/03/tf-idf.html