Python 3: Scraping Jianshu's 30-Day Trending List and Saving It to Both a txt File and MongoDB

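I'm new to Python, and this post records my learning process.

The same approach works for the other lists, such as New Arrivals and 7-Day Trending.

The main goal this time is to learn how to work with MongoDB from Python, and to reinforce requests and BeautifulSoup along the way.

Click through to the list to get the URL https://www.jianshu.com/trending/monthly?utm_medium=index-banner-s&utm_source=desktop

Scrolling down shows that more articles are loaded automatically via Ajax, so press F12 and watch the requests.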

The Ajax request is: https://www.jianshu.com/trending/monthly?seen_snote_ids%5B%5D=20955828&seen_snote_ids%5B%5D=21427995&seen_snote_ids%5B%5D=20906269&seen_snote_ids%5B%5D=20703931&seen_snote_ids%5B%5D=21506894&seen_snote_ids%5B%5D=21763012&seen_snote_ids%5B%5D=20948499&seen_snote_ids%5B%5D=20513670&seen_snote_ids%5B%5D=21758606&seen_snote_ids%5B%5D=21619908&seen_snote_ids%5B%5D=21793770&seen_snote_ids%5B%5D=21478996&seen_snote_ids%5B%5D=20719357&seen_snote_ids%5B%5D=21136222&seen_snote_ids%5B%5D=20946853&seen_snote_ids%5B%5D=21893085&seen_snote_ids%5B%5D=21368495&seen_snote_ids%5B%5D=20917360&seen_snote_ids%5B%5D=21749782&seen_snote_ids%5B%5D=20641197&page=2

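Looking closely, the URL carries many repeated seen_snote_ids parameters whose purpose is not obvious, so let's try removing them. Changing the URL to https://www.jianshu.com/trending/monthly?page=2 works fine, so the seen_snote_ids parameters did not affect the result of the request. That gives the endpoint https://www.jianshu.com/trending/monthly?page=N (N = 1, 2, 3, ...). Testing showed that nothing is returned at page=11, and that each page loads 20 articles.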
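The observation above can be sketched with the standard library: drop the seen_snote_ids[] parameters from a captured Ajax URL and keep only page. The helper names strip_seen_ids and page_urls are hypothetical, the captured URL is shortened to two IDs, and the upper bound of 10 pages comes from the test described above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

BASE_URL = "https://www.jianshu.com/trending/monthly"


def strip_seen_ids(url):
    """Return url without any seen_snote_ids[] query parameters."""
    parts = urlsplit(url)
    # parse_qsl decodes %5B%5D to "[]", so the key is "seen_snote_ids[]"
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != "seen_snote_ids[]"]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def page_urls(last_page=10):
    """Build the page URLs 1..last_page for the 30-day trending list."""
    return [f"{BASE_URL}?page={n}" for n in range(1, last_page + 1)]


captured = (
    "https://www.jianshu.com/trending/monthly"
    "?seen_snote_ids%5B%5D=20955828&seen_snote_ids%5B%5D=21427995&page=2"
)
print(strip_seen_ids(captured))
print(page_urls()[0], "...", page_urls()[-1])
```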

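OK, first a quick review of working with MongoDB from Python.

1. You need pymongo. I won't go into how to install it (typically pip install pymongo); Baidu or Google will tell you.

2. Start MongoDB using a configuration file.

Here is the configuration file as well: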
    # Path of the data directory
    dbpath = g:\data\db
    # Path of the log file
    logpath = D:\MongoDB\log\mongodb.log
    # Enable log output (append to the existing log file)
    logappend = true
    # Used later when setting up user management
    noauth = true
    # Port to listen on
    port = 27017

3. Use it from Python. At the time I followed a blog post that I found clear and easy to follow (link).
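As a rough illustration of the pymongo calls involved, the sketch below upserts one record keyed by its URL, so that re-running a scraper does not store the same article twice. This is a variation and not what the script in this post does (it simply inserts). save_unique and connect are hypothetical helper names; update_one with upsert=True is a standard method of a pymongo Collection, and the database and collection names jianshu and top are the ones used in the source code of this post.

```python
def save_unique(collection, item, key="URL"):
    """Insert item, or update the stored document that has the same key value.

    collection is expected to be a pymongo Collection.
    Returns True if a new document was created, False if one was updated.
    """
    result = collection.update_one(
        {key: item[key]},   # match on the article URL
        {"$set": item},     # write all fields of the record
        upsert=True,        # create the document if nothing matches
    )
    return result.upserted_id is not None


def connect(host="localhost", port=27017):
    """Return the 'top' collection of the 'jianshu' database (needs a running mongod)."""
    import pymongo  # imported lazily so save_unique can be tried without a server
    return pymongo.MongoClient(host, port).jianshu.top
```

Typical use would be to call connect() once and then save_unique(collection, item) for every parsed record.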

Finally, here is the source code:

    # Scrape Jianshu's 30-day trending list and store it in MongoDB
    import pymongo
    import requests
    from requests import RequestException
    from bs4 import BeautifulSoup

    client = pymongo.MongoClient('localhost', 27017)
    db = client.jianshu  # 'jianshu' is the database name; it is created automatically if it does not exist
    TABLENAME = 'top'


    def get_jianshu_monthTop(url):
        """Fetch one page and return its HTML, or None on failure."""
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response.text
            print(url + ',visit error')
            return None
        except RequestException:
            return None


    def parse_html(html):
        """Extract one tuple of fields per article from a page of the list."""
        base_url = 'https://www.jianshu.com'
        soup = BeautifulSoup(html, "html.parser")
        nickname = [i.string for i in soup.select('.info > .nickname')]
        span = soup.find_all('span', class_='time')
        time = []
        for i in span:
            # Keep only the date, e.g. 2017-12-27T10:11:11+08:00 becomes 2017-12-27
            time.append(i['data-shared-at'][0:10])
        title = [i.string for i in soup.select('.content > .title')]
        url = [base_url + i['href'] for i in soup.select('.content > .title')]
        intro = [i.get_text().strip() for i in soup.select('.content > .abstract')]
        readcount = [i.get_text().strip() for i in soup.select('.meta > a:nth-of-type(1)')]
        commentcount = [i.get_text().strip() for i in soup.select('.meta > a:nth-of-type(2)')]
        likecount = [i.get_text().strip() for i in soup.select('.meta > span:nth-of-type(1)')]
        tipcount = [i.get_text().strip() for i in soup.select('.meta > span:nth-of-type(2)')]
        # Note: zip stops at the shortest list
        return zip(nickname, time, title, url, intro, readcount, commentcount, likecount, tipcount)


    def save_to_mongodb(item):
        """Save one record to MongoDB."""
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        if db[TABLENAME].insert_one(item).acknowledged:
            print('save success:', item)
            return True
        print('save fail:', item)
        return False


    def save_to_file(item):
        """Append one record to result.txt, one record per line."""
        with open('result.txt', 'a', encoding='utf-8') as file:
            file.write(item)
            file.write('\n')


    def main(offset):
        url = 'https://www.jianshu.com/trending/monthly?page=' + str(offset)
        html = get_jianshu_monthTop(url)
        if html is None:  # request failed, nothing to parse
            return
        for i in parse_html(html):
            item = {
                '作者': i[0],      # author
                '发布时间': i[1],  # publish date
                '标题': i[2],      # title
                'URL': i[3],
                '简介': i[4],      # abstract
                '阅读量': i[5],    # reads
                '评论量': i[6],    # comments
                '点赞量': i[7],    # likes
                '打赏量': i[8]     # tips
            }
            save_to_mongodb(item)
            save_to_file(str(item))


    if __name__ == '__main__':
        for i in range(1, 11):
            main(i)

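parse_html keeps only the date by slicing the first ten characters of data-shared-at. A slightly more defensive alternative, shown here only as a sketch, is to parse the timestamp with datetime.fromisoformat (Python 3.7+), which raises ValueError on malformed input instead of silently returning a wrong string. The function name shared_date is hypothetical.

```python
from datetime import datetime


def shared_date(timestamp):
    """Return the YYYY-MM-DD date part of an ISO 8601 timestamp string."""
    return datetime.fromisoformat(timestamp).date().isoformat()


print(shared_date("2017-12-27T10:11:11+08:00"))  # 2017-12-27
```
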
OK, and finally the result screenshots.

TIPS: right-click the image and open it in a new tab to view it at full resolution :)

In total, 157 records were scraped.
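Ten pages of 20 articles would give 200 records, so 157 is fewer than expected, and the post does not investigate why. One possible cause, stated here as an unverified assumption, is that parse_html builds each field as a separate list and joins them with zip, which stops at the shortest list. If some articles lack a field (for instance a tip count), rows are dropped and the remaining ones can be paired with the wrong values. The toy data below is made up purely to show that behavior.

```python
from itertools import zip_longest

titles = ["A", "B", "C"]
tips = ["5", "2"]  # pretend article "B" has no tip count, so "2" really belongs to "C"

# zip silently drops "C" and pairs "B" with the value that belongs to "C"
rows = list(zip(titles, tips))
print(rows)  # [('A', '5'), ('B', '2')]

# zip_longest keeps every title, which at least makes the length mismatch visible
padded = list(zip_longest(titles, tips, fillvalue=None))
print(padded)  # [('A', '5'), ('B', '2'), ('C', None)]
```

Padding only exposes the mismatch. To align fields reliably, each article's container element would have to be parsed on its own, so that a missing field stays attached to the right article.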
Original article: https://www.cnblogs.com/zxtceq/p/9049620.html
