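Performance comparison of scraping parsers: xpath, re, and bs4
Approach
Test page: http://baijiahao.baidu.com/s?id=1644707202199076031
Extract the same data from the same page with each method, repeat the extraction 500 times, and compare the summed times.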
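The measurement idea can be sketched on its own, without Scrapy. The snippet below is a minimal sketch, not the author's code: `SAMPLE_HTML`, `extract_title`, and `total_time` are hypothetical names, the HTML string is a stand-in for the downloaded page, and `time.perf_counter()` is used here as an assumption because it is a higher-resolution clock than `time.time()`.

```python
import re
import time

# Hypothetical stand-in for the downloaded page; the real test uses response.text.
SAMPLE_HTML = "<html><head><title>Example title</title></head><body></body></html>"


def extract_title(html):
    """Extract the <title> text with a non-greedy regex."""
    return re.findall(r"<title>(.*?)</title>", html)[0]


def total_time(func, html, repeats=500):
    """Call func(html) `repeats` times and return the sum of the per-call durations."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(html)
        durations.append(time.perf_counter() - start)
    return sum(durations)


if __name__ == "__main__":
    print(f"re total over 500 runs: {total_time(extract_title, SAMPLE_HTML):.6f} s")
```

Any other extractor with the same `func(html)` shape can be passed to `total_time`, so every method is timed on the same input under the same loop.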
Test example
# -*- coding: utf-8 -*-
import re
import time

import scrapy
from bs4 import BeautifulSoup


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baijiahao.baidu.com/s?id=1644707202199076031']

    def parse(self, response):
        # One list of per-iteration timings (in seconds) for each method.
        re_time_list = []
        xpath_time_list = []
        lxml_time_list = []
        bs4_lxml_time_list = []
        html5lib_time_list = []
        bs4_html5lib_time_list = []

        for _ in range(500):
            # re
            re_start_time = time.time()
            news_title = re.findall(pattern="<title>(.*?)</title>", string=response.text)[0]
            news_content = "".join(re.findall(pattern='<span class="bjh-p">(.*?)</span>', string=response.text))
            re_time_list.append(time.time() - re_start_time)

            # xpath
            xpath_start_time = time.time()
            news_title = response.xpath("//div[@class='article-title']/h2/text()").extract_first()
            news_content = response.xpath('string(//*[@id="article"])').extract_first()
            xpath_time_list.append(time.time() - xpath_start_time)

            # bs4 + html5lib, query only: the tree is built before the timer starts
            soup = BeautifulSoup(response.text, "html5lib")
            html5lib_start_time = time.time()
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            html5lib_time_list.append(time.time() - html5lib_start_time)

            # bs4 + html5lib, parse + query: building the tree is timed as well
            bs4_html5lib_start_time = time.time()
            soup = BeautifulSoup(response.text, "html5lib")
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            bs4_html5lib_time_list.append(time.time() - bs4_html5lib_start_time)

            # bs4 + lxml, query only: the tree is built before the timer starts
            soup = BeautifulSoup(response.text, "lxml")
            lxml_start_time = time.time()
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            lxml_time_list.append(time.time() - lxml_start_time)

            # bs4 + lxml, parse + query: building the tree is timed as well
            bs4_lxml_start_time = time.time()
            soup = BeautifulSoup(response.text, "lxml")
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            bs4_lxml_time_list.append(time.time() - bs4_lxml_start_time)

        # Total time per method over the 500 iterations.
        re_result = sum(re_time_list)
        xpath_result = sum(xpath_time_list)
        lxml_result = sum(lxml_time_list)
        html5lib_result = sum(html5lib_time_list)
        bs4_lxml_result = sum(bs4_lxml_time_list)
        bs4_html5lib_result = sum(bs4_html5lib_time_list)

        print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
        print(f"re time: {re_result}")
        print(f"xpath time: {xpath_result}")
        print(f"lxml query-only time: {lxml_result}")
        print(f"html5lib query-only time: {html5lib_result}")
        print(f"bs4_lxml parse + query time: {bs4_lxml_result}")
        print(f"bs4_html5lib parse + query time: {bs4_html5lib_result}")
        print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
        # Ratios relative to re, then relative to xpath (both appear in the results).
        print(f"xpath/re :{xpath_result / re_result}")
        print(f"lxml/re :{lxml_result / re_result}")
        print(f"html5lib/re :{html5lib_result / re_result}")
        print(f"bs4_lxml/re :{bs4_lxml_result / re_result}")
        print(f"bs4_html5lib/re :{bs4_html5lib_result / re_result}")
        print(f"lxml/xpath :{lxml_result / xpath_result}")
        print(f"html5lib/xpath :{html5lib_result / xpath_result}")
        print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
Test results (times in seconds):
Run 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
re time: 0.018010616302490234
xpath time: 0.19927382469177246
lxml query-only time: 0.3410227298736572
html5lib query-only time: 0.3842911720275879
bs4_lxml parse + query time: 1.6482152938842773
bs4_html5lib parse + query time: 6.744122505187988
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
xpath/re :11.064242408196765
lxml/re :18.934539726245003
html5lib/re :21.336925154218847
bs4_lxml/re :91.51354213550078
bs4_html5lib/re :374.4526223822509
lxml/xpath :1.7113272673976896
html5lib/xpath :1.9284578525152096
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Run 2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
re time: 0.023047208786010742
xpath time: 0.18992280960083008
lxml query-only time: 0.3522317409515381
html5lib query-only time: 0.418229341506958
bs4_lxml parse + query time: 1.710503101348877
bs4_html5lib parse + query time: 7.1153998374938965
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
xpath/re :8.24059917034769
lxml/re :15.28305419636484
html5lib/re :18.14663742538819
bs4_lxml/re :74.21736476770769
bs4_html5lib/re :308.7315216154427
lxml/xpath :1.8546047296364272
html5lib/xpath :2.2021016979791463
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Run 3
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
re time: 0.014002561569213867
xpath time: 0.18992352485656738
lxml query-only time: 0.3783881664276123
html5lib query-only time: 0.39995455741882324
bs4_lxml parse + query time: 1.751767873764038
bs4_html5lib parse + query time: 7.1871068477630615
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
xpath/re :13.563484360899695
lxml/re :27.022781835827757
html5lib/re :28.56295653062267
bs4_lxml/re :125.10338662716453
bs4_html5lib/re :513.2708620660298
lxml/xpath :1.9923185751389976
html5lib/xpath :2.1058716013241323
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
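Several runs of 500 iterations each can also be collected in one go with the standard library's `timeit.repeat`. This is only a sketch of an alternative harness, not how the numbers above were produced: `SAMPLE_HTML` and `extract_with_re` are hypothetical, and the inline HTML stands in for the real page.

```python
import re
import timeit

# Hypothetical stand-in for response.text.
SAMPLE_HTML = '<title>Example title</title><span class="bjh-p">body text</span>'


def extract_with_re(html=SAMPLE_HTML):
    """Extract title and content with the same regex patterns as the test spider."""
    title = re.findall(r"<title>(.*?)</title>", html)[0]
    content = "".join(re.findall(r'<span class="bjh-p">(.*?)</span>', html))
    return title, content


# Three runs of 500 calls each; every entry is the total time of one run in seconds.
run_totals = timeit.repeat(extract_with_re, repeat=3, number=500)
mean_total = sum(run_totals) / len(run_totals)

for index, total in enumerate(run_totals, start=1):
    print(f"run {index}: {total:.6f} s")
print(f"mean: {mean_total:.6f} s")
```

`timeit` times the whole 500-call loop rather than summing per-call measurements, so the per-call timer overhead is not included in the totals.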
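Result analysis:
Analysis of the three-run averages
Each ratio is the mean time of one method divided by the mean time of the baseline, so xpath/re = 10.52 means xpath took about 10.52 times as long as re. Here lxml and html5lib are the query-only timings, while lxml(bs4) and html5lib(bs4) (bs4_lxml and bs4_html5lib in the list) also include building the BeautifulSoup tree.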
Baseline | re | xpath | lxml | html5lib | lxml(bs4) | html5lib(bs4) |
---|---|---|---|---|---|---|
re | 1 | 10.52 | 19.46 | 21.84 | 92.82 | 382.25 |
xpath | | 1 | 1.85 | 2.08 | 8.82 | 36.34 |
lxml | | | 1 | 1.12 | 4.77 | 19.64 |
html5lib | | | | 1 | 4.25 | 17.50 |
lxml(bs4) | | | | | 1 | 4.12 |
html5lib(bs4) | | | | | | 1 |
- xpath/re :10.52
- lxml/re :19.46
- html5lib/re :21.84
- bs4_lxml/re :92.82
- bs4_html5lib/re :382.25
- lxml/xpath :1.85
- html5lib/xpath :2.08
- bs4_lxml/xpath :8.82
- bs4_html5lib/xpath :36.34
- html5lib/lxml :1.12
- bs4_lxml/lxml :4.77
- bs4_html5lib/lxml :19.64
- bs4_lxml/html5lib :4.25
- bs4_html5lib/html5lib :17.50
- bs4_html5lib/bs4_lxml :4.12
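The averaged ratios can be reproduced from the three runs. The sketch below copies the summed times from the result blocks and divides mean by mean; the names `RUNS`, `mean`, and `ratio` are introduced here for illustration only.

```python
# Summed times in seconds for each run, copied from the three result blocks.
RUNS = {
    "re": [0.018010616302490234, 0.023047208786010742, 0.014002561569213867],
    "xpath": [0.19927382469177246, 0.18992280960083008, 0.18992352485656738],
    "lxml": [0.3410227298736572, 0.3522317409515381, 0.3783881664276123],
    "html5lib": [0.3842911720275879, 0.418229341506958, 0.39995455741882324],
    "bs4_lxml": [1.6482152938842773, 1.710503101348877, 1.751767873764038],
    "bs4_html5lib": [6.744122505187988, 7.1153998374938965, 7.1871068477630615],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def ratio(method, baseline):
    """Mean time of `method` divided by mean time of `baseline`."""
    return mean(RUNS[method]) / mean(RUNS[baseline])


for method in RUNS:
    if method != "re":
        print(f"{method}/re: {ratio(method, 're'):.2f}")
```

These are ratios of mean times, not means of the per-run ratios. Averaging the three printed xpath/re ratios instead would give about 10.96 rather than 10.52.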
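Comparison of the three extraction approaches
Aspect | re | xpath | bs4 |
---|---|---|---|
Installation | built-in | third-party | third-party |
Syntax | regular expressions | path matching | object-oriented |
Ease of use | difficult | fairly difficult | easy |
Performance | highest | moderate | lowest |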
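Conclusion
re > xpath > bs4
- re is about 10 times as fast as xpath.
  Although re far outperforms both xpath and bs4, it is much harder to use than either, and much harder to maintain afterwards.
- xpath is about 1.8 times as fast as bs4.
  Comparing extraction alone, xpath is about 1.8 times as fast as bs4 (1.85 against the lxml query-only timing). In practice bs4 also has to build its tree first. With that step included, xpath is about 8.8 times as fast with lxml and about 36 times as fast with html5lib, and the gap matters most for deeply nested, high-volume pages.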
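The maintenance cost of re can be illustrated with the content pattern from the test. This is a minimal sketch with made-up HTML fragments (`original` and `changed` are hypothetical): a small markup change that a structural query would typically tolerate makes the regex silently return nothing.

```python
import re

# Same content pattern as in the test spider.
CONTENT_PATTERN = re.compile(r'<span class="bjh-p">(.*?)</span>')

# Hypothetical markup in the shape the pattern was written for.
original = '<span class="bjh-p">first</span><span class="bjh-p">second</span>'

# Hypothetical markup after a small site change: an extra attribute on the
# first span and a line break inside the second one.
changed = '<span id="p1" class="bjh-p">first</span><span class="bjh-p">sec\nond</span>'

original_content = "".join(CONTENT_PATTERN.findall(original))
changed_content = "".join(CONTENT_PATTERN.findall(changed))

print(repr(original_content))  # 'firstsecond'
print(repr(changed_content))   # '' because neither span matches any more
```

The first span no longer matches because the pattern requires the exact opening tag text, and the second fails because `.` does not match a newline unless `re.DOTALL` is set. No exception is raised in either case, so the breakage is easy to miss.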
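Overall, xpath combined with scrapy-redis for distributed crawling is more than enough to meet performance requirements, so xpath is the recommended choice.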