Date: 2019-06-09
Author: Sun
We analyze the quotes site 格言网 (https://www.geyanw.com/) and crawl its content with the requests HTTP library and the bs4 parsing library.
Project steps
- Create the project folder
--geyanwang
---spiders        # holds our crawler code
---- geyan.py     # the crawler code
---doc            # step-by-step documentation
- Create the virtual environment
cd geyanwang/
virtualenv spider --python=python3  # create a virtual environment named spider
- Install the dependencies
$ source spider/bin/activate
(spider) $ pip install requests
(spider) $ pip install lxml
(spider) $ pip install bs4
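After installation, a quick check from Python can confirm that the interpreter inside the virtual environment actually sees the three packages. The following is only a minimal sketch; the helper name `missing_modules` is hypothetical and not part of the project.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the three packages installed in this step.
    missing = missing_modules(["requests", "lxml", "bs4"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

If a module is reported as missing, the usual cause is that the virtual environment was not activated before running pip.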
- Write the code in spiders/geyan.py
# -*- coding: utf-8 -*-
__author__ = 'sun'
__date__ = '2019/6/19 2:22 PM'

from bs4 import BeautifulSoup as BSP4
import requests

g_set = set()  # URLs that have already been visited


def store_file(file_name, r):
    # Save the response body as geyan_<file_name>.html in the current directory.
    html_doc = r.text
    with open("geyan_%s.html" % file_name, "w", encoding="utf-8") as f:
        f.write(html_doc)
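

def download(url, filename='index'):
    '''
    Download a page and save a local copy of it.
    :param url: address of the page to download
    :param filename: name used for the saved file (geyan_<filename>.html)
    :return: the requests Response object for the page
    '''
    r = requests.get(url, timeout=10)  # send the request; give up after 10 seconds
    store_file(filename, r)
    return r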
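

def parse_tbox(tbox, base_domain):
    '''
    Parse one category block on the home page and download every page it links to.
    :param tbox: the <dl class="tbox"> tag of the category
    :param base_domain: site root without a trailing slash
    :return: None
    '''
    tbox_tag = tbox.select("dt a")[0].text  # category title
    print(tbox_tag)
    index = 0
    li_list = tbox.find_all("li")
    for li in li_list:
        link = base_domain + li.a['href']  # assumes href is a root-relative path
        print("index:%s, link:%s" % (index, link))
        index += 1
        if link not in g_set:  # skip links that were already downloaded
            g_set.add(link)
            filename = "%s_%s" % (tbox_tag, index)
            download(link, filename)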


def parse(response):
    '''
    Parse the home page.
    :param response: the Response returned for the page
    :return: None
    '''
    base_domain = response.url[:-1]  # strip the trailing "/" from the URL
    g_set.add(base_domain)
    # print(base_domain)
    html_doc = response.content
    soup = BSP4(html_doc, "lxml")
    tbox_list = soup.select("#p_left dl.tbox")  # one <dl class="tbox"> per category
    for tbox in tbox_list:
        parse_tbox(tbox, base_domain)


def main():
    base_url = "https://www.geyanw.com/"
    response = download(base_url)
    parse(response)


if __name__ == "__main__":
    main()
- Running the code above saves a batch of HTML files (geyan_<name>.html) to the current directory.
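`parse_tbox` builds each link by concatenating the site root with the `href` value, which only works when every `href` is a root-relative path such as `/lizhimingyan/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` also resolves page-relative and absolute links. The sketch below is not the code used above; `build_links` and the sample `href` values are hypothetical and chosen for illustration only.

```python
from urllib.parse import urljoin


def build_links(page_url, hrefs, seen=None):
    """Resolve hrefs against page_url and drop links that were already seen."""
    seen = set() if seen is None else seen
    links = []
    for href in hrefs:
        link = urljoin(page_url, href)  # handles "/a", "a", and full URLs
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


if __name__ == "__main__":
    # Hypothetical href values; the third one repeats the first and is skipped.
    sample = ["/lizhimingyan/", "renshenggeyan/", "/lizhimingyan/"]
    for link in build_links("https://www.geyanw.com/", sample):
        print(link)
```

Passing the same `seen` set to every call plays the role of `g_set` in geyan.py, so a link that appears in several category blocks is downloaded only once.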
Exercise
The geyan.py file above only processes the home page.
How would you crawl each category page by page, using multiple threads?
e.g.:
https://www.geyanw.com/lizhimingyan/
https://www.geyanw.com/renshenggeyan/
Save the crawled pages locally, using a differently named folder for each category.
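One possible starting point for the exercise is sketched below, under stated assumptions. It uses `concurrent.futures.ThreadPoolExecutor` from the standard library and names each folder after the last path segment of the category URL. The names `category_name`, `fetch_text`, `save_category`, `crawl` and the `pages` output folder are hypothetical. Only the first page of each category is downloaded, because the site's pagination URL scheme is not described here; further page URLs would have to be added once that scheme is known.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def category_name(url):
    """Use the last non-empty path segment of the URL as the folder name."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] if segments else "index"


def fetch_text(url):
    """Download one page with requests (imported here so the sketch loads without it)."""
    import requests
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def save_category(url, out_dir, fetch=fetch_text):
    """Fetch one category page and save it as <out_dir>/<category>/index.html."""
    folder = os.path.join(out_dir, category_name(url))
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(fetch(url))
    return path


def crawl(urls, out_dir="pages", fetch=fetch_text, workers=4):
    """Download all URLs on a thread pool; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: save_category(u, out_dir, fetch), urls))
```

Calling `crawl(["https://www.geyanw.com/lizhimingyan/", "https://www.geyanw.com/renshenggeyan/"])` would write one `index.html` into each of the folders `pages/lizhimingyan` and `pages/renshenggeyan`. The `fetch` parameter exists so that the download step can be swapped out, for example with a stub function when trying the code without network access.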