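Python Web Scraping: Using the Beautiful Soup Parsing Library
Beautiful Soup - Introduction
A third-party Python library for extracting data from HTML or XML.
Official site: http://www.crummy.com/software/BeautifulSoup/
Install: pip install beautifulsoup4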
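Beautiful Soup - Syntax
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
First argument: the HTML document string
Second argument: the HTML parser
Third argument: the encoding of the HTML document (only used when the document is passed as bytes rather than as a str)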
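As a rough illustration of the third argument: from_encoding only matters when the markup is passed as bytes, because a str is already decoded. The sketch below assumes a GBK-encoded fragment made up for this example (the variable names are hypothetical) and uses the built-in html.parser so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a small HTML fragment encoded as GBK bytes
raw_bytes = '<p>你好, Beautiful Soup</p>'.encode('gbk')

# from_encoding tells Beautiful Soup how to decode the bytes
soup = BeautifulSoup(raw_bytes, 'html.parser', from_encoding='gbk')

# The text comes back as an ordinary Python str
decoded_text = soup.p.string
print(decoded_text)
```

If the document is already a str, the third argument can simply be left out.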
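Beautiful Soup - Usage
Tag selector operations
Note: a tag selector returns only the first matching tag; this is a defining characteristic of tag selectors.
Selecting elements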
from bs4 import BeautifulSoup

html_doc = '''
<div class="container">
<a href="/pc/home?sign=360_79aabe15" class="logo"></a>
<nav id="nnav" data-mod="nnav">
<div class="nnav-wrap">
<ul class="nnav-items" id="nnav_main">
<li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Auto-complete the HTML and return it in pretty-printed form
print(soup.prettify())
# Print the first a tag
print(soup.a)
# Print the first span tag
print(soup.span)
The output is as follows:
<html>
 <body>
  <div class="container">
   <a class="logo" href="/pc/home?sign=360_79aabe15">
   </a>
   <nav data-mod="nnav" id="nnav">
    <div class="nnav-wrap">
     <ul class="nnav-items" id="nnav_main">
      <li data-index="0">
       <a class="nnav-item" data-ch="youlike" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank">
        推荐
        <span>
        </span>
       </a>
      </li>
      <li data-index="1">
       <a class="nnav-item" data-ch="good_safe2toera" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank">
        新时代
        <span>
        </span>
       </a>
      </li>
      <li data-index="2">
       <a class="nnav-item" data-ch="fun" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank">
        娱乐
        <span>
        </span>
       </a>
      </li>
      <li data-index="3">
       <a class="nnav-item" href="/pc/home? data-index=">
       </a>
       <a class="nnav-item" data-ch="economy" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank">
        财经
        <span>
        </span>
       </a>
      </li>
     </ul>
    </div>
   </nav>
  </div>
 </body>
</html>
<a class="logo" href="/pc/home?sign=360_79aabe15"></a>
<span></span>
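To make the "only one tag" behaviour of tag selectors concrete, the following minimal sketch compares soup.a with find() and find_all(). The two-link HTML fragment is invented for this example, and html.parser is used here instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up fragment with two a tags
html_doc = '<ul><li><a href="/a">first</a></li><li><a href="/b">second</a></li></ul>'
soup = BeautifulSoup(html_doc, 'html.parser')

# The tag selector returns only the first a tag
first_link = soup.a
# find_all returns every a tag in the document
all_links = soup.find_all('a')
# soup.a gives back the same node as soup.find('a')
same_as_find = soup.a is soup.find('a')

print(first_link['href'])
print(len(all_links))
print(same_as_find)
```

When every matching tag is needed, find_all (covered under the standard selectors later on) is the better fit.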
Getting the name
Getting attributes
Getting the content
from bs4 import BeautifulSoup

html_doc = '''
<div class="container">
<a href="/pc/home?sign=360_79aabe15" class="logo"></a>
<nav id="nnav" data-mod="nnav">
<div class="nnav-wrap">
<ul class="nnav-items" id="nnav_main">
<li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Print the name of the first a tag
print(soup.a.name)
# Print the class attribute of the first a tag; both forms below work
print(soup.a.attrs['class'])
print(soup.a['class'])
# Print the content of the first a tag
print(soup.a.string)
The output is as follows:
a
['logo']
['logo']
None
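Two details of that output are worth a closer look: class comes back as a list because it is a multi-valued attribute, and .string is None because the first a tag has no single text child. The sketch below illustrates both on a made-up fragment, again with html.parser; it also shows .get(), which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Made-up fragment: an empty a tag and a p tag with mixed content
html_doc = '<a href="/pc/home" class="logo big"></a><p>Hello <b>world</b></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# class may hold several values, so it is returned as a list
class_values = soup.a['class']
# .get() returns None when the attribute does not exist
missing_id = soup.a.get('id')
# .string is None when a tag has no children ...
empty_string = soup.a.string
# ... and also when it has more than one child
mixed_string = soup.p.string
# get_text() joins all text found inside the tag
full_text = soup.p.get_text()

print(class_values, missing_id, empty_string, mixed_string, full_text)
```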
Nested selection
from bs4 import BeautifulSoup

html_doc = '''
<a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike"><span>推荐</span></a>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Select the span inside the first a tag and print its text
print(soup.a.span.string)
The output is as follows:
推荐
Child and descendant node operations
Getting all child nodes
from bs4 import BeautifulSoup

html = '''
<div class="bc">
<span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
<span class="fl" style="padding-top: 6px;">
<a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
<a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> > <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
</span>
<a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# Method 1: .contents returns the direct children as a list
print(soup.div.contents)
# Method 2: .children returns an iterator over the direct children
print(soup.div.children)
for i, child in enumerate(soup.div.children):
    print(i, child)
The output is as follows:
[' ', <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>, ' ', <span class="fl" style="padding-top: 6px;"> <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a> <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> > <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文 </span>, ' ', <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>, ' ']
<list_iterator object at 0x0000000002E498D0>
0
1 <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
2
3 <span class="fl" style="padding-top: 6px;"> <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a> <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> > <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文 </span>
4
5 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
6
Getting all descendant nodes
from bs4 import BeautifulSoup

html = '''
<div class="bc">
<span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
<span class="fl" style="padding-top: 6px;">
<a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
<a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> > <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
<a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# .descendants returns a generator over every nested node
print(soup.div.descendants)
for i, child in enumerate(soup.div.descendants):
    print(i, child)
The output is as follows:
<generator object descendants at 0x00000000028F5AF0>
0
1 <span class="fl" style="padding-top: 1px;"> <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
2
3 <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a>
4 <img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/>
5
6 <span class="fl" style="padding-top: 6px;"> <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a> <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> > <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
7
8 <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
9 四级
10
11 <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a>
12 新东方在线
13 >
14 <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a>
15 四级
16 >
17 <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a>
18 英语四级词汇
19 > 正文
20
21 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
22 <img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/>
23
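The difference between children and descendants is easier to see on a small tree. The following sketch uses a made-up fragment without whitespace between tags and filters on Tag so that text nodes are skipped; html.parser is assumed only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup, Tag

# Made-up fragment: div > span > a, plus a p next to the span
html = '<div><span><a href="/cet4">四级</a></span><p>正文</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# .children only walks one level down
child_tags = [node.name for node in soup.div.children if isinstance(node, Tag)]
# .descendants walks the whole subtree, depth first
descendant_tags = [node.name for node in soup.div.descendants if isinstance(node, Tag)]
# Text nodes count as descendants too: span, a, '四级', p, '正文'
descendant_count = len(list(soup.div.descendants))

print(child_tags)
print(descendant_tags)
print(descendant_count)
```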
Parent and ancestor node operations
Getting the parent and ancestor nodes
from bs4 import BeautifulSoup

html = '''
<div class="bc">
<span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
<span class="fl" style="padding-top: 6px;">
<a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
<a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> > <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
<a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)   # get the parent node
print(soup.a.parents)  # get the ancestor nodes (a generator)
The output is as follows:
<span class="fl" style="padding-top: 1px;"> <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
<generator object parents at 0x00000000028C5B48>
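Because .parents is a generator, printing it directly shows only the generator object. A minimal sketch of how the ancestors might be listed instead, using a shortened, made-up version of the fragment and html.parser (with lxml the chain would also include the body and html tags that lxml adds):

```python
from bs4 import BeautifulSoup

# Shortened, made-up fragment: div > span > a
html = '<div class="bc"><span class="fl"><a href="http://www.koolearn.com/">link</a></span></div>'
soup = BeautifulSoup(html, 'html.parser')

# .parent is the single enclosing tag
parent_name = soup.a.parent.name
# Iterating .parents walks upward until the document root is reached
ancestor_names = [ancestor.name for ancestor in soup.a.parents]

print(parent_name)
print(ancestor_names)
```

The last entry, '[document]', is the name of the BeautifulSoup object itself, which sits at the top of the tree.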
Sibling node operations
Getting sibling nodes
from bs4 import BeautifulSoup

html = '''
<div class="more_box" id="moreBox">
<h3>360识图</h3>
<a href="javascript:;" id="btnLoadMore" class="btn_loadmore">加载更多</a>
<p id="imgTotal" class="img_total">找到相关图片约 2637 张</p>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a.next_siblings)      # get the siblings that come after the node
print(soup.a.previous_siblings)  # get the siblings that come before the node
The output is as follows:
<generator object next_siblings at 0x0000000002885B48>
<generator object previous_siblings at 0x0000000002885B48>
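Both sibling attributes are generators as well, so they have to be iterated to see the actual nodes. The sketch below does this on a made-up version of the fragment written without whitespace between the tags (otherwise the whitespace text nodes would show up as siblings too); html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Made-up fragment: h3, a and p are siblings inside one div
html = ('<div id="moreBox"><h3>360识图</h3>'
        '<a id="btnLoadMore">加载更多</a>'
        '<p id="imgTotal">找到相关图片</p></div>')
soup = BeautifulSoup(html, 'html.parser')

# next_siblings yields the nodes that follow the a tag
following = [node.name for node in soup.a.next_siblings]
# previous_siblings yields the nodes that precede the a tag
preceding = [node.name for node in soup.a.previous_siblings]

print(following)
print(preceding)
```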
Python generators
l = [x * x for x in range(10)]  # list comprehension
g = (x * x for x in range(10))  # generator expression
print(l)
print(g)
The output is as follows:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
<generator object <genexpr> at 0x000000000251C468>
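l is a list and g is a generator. The most basic difference when creating them is that a list comprehension is written with [ ] while a generator expression is written with ( ).
To print the values one at a time, call next() to get the generator's next value:
g = (x * x for x in range(10))
for i in range(10):
    print(next(g))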
The output is as follows:
0
1
4
9
16
25
36
49
64
81
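One thing the next() loop hides is what happens once a generator runs out of values. The short sketch below, with variable names chosen only for this example, shows that a generator can be consumed just once and that next() raises StopIteration afterwards unless a default is supplied.

```python
squares = (x * x for x in range(3))

# Take the first value by hand
first = next(squares)
# list() consumes whatever is left
remaining = list(squares)

# The generator is now exhausted: next() raises StopIteration
try:
    next(squares)
    exhausted = False
except StopIteration:
    exhausted = True

# Passing a default to next() avoids the exception
after_exhaustion = next(squares, None)

print(first, remaining, exhausted, after_exhaustion)
```

This is why the generators returned by Beautiful Soup are usually consumed with a for loop or list() rather than with repeated next() calls.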
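Standard selector operations
find_all searches the document by tag name, attributes, and text content, and returns every matching result.
find_all(name, attrs, recursive, string, **kwargs)
(older versions of Beautiful Soup call the string argument text)
# Find all nodes whose tag is a
soup.find_all('a')
# Find all a nodes whose link has the form /view/123.htm (the regular expression version needs import re)
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
# Find all div nodes whose class is abc and whose text is python
soup.find_all('div', class_='abc', string='python')
Attributes:
# Get the tag name of the node that was found
node.name
# Get the href attribute of the a node that was found
node['href']
# Get the link text of the a node that was found
node.get_text()
find(name, attrs, recursive, string, **kwargs)
Searches the document by tag name, attributes, and text content. It is used in much the same way as find_all, but returns only the first matching result.
find_parents() find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
find_next_siblings() find_next_sibling()
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.
find_previous_siblings() find_previous_sibling()
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.
find_all_next() find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.
find_all_previous() find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.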
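The calls listed above can be tried out on a small document. The following is a minimal sketch rather than a complete tour: the HTML, the /view/... links, and the variable names are all made up for illustration, and html.parser is assumed as the parser.

```python
import re
from bs4 import BeautifulSoup

# Made-up document: one div plus a list of three links
html = ('<div class="abc">python</div>'
        '<ul>'
        '<li><a href="/view/123.htm">one</a></li>'
        '<li><a href="/view/456.htm">two</a></li>'
        '<li><a href="/about.htm">about</a></li>'
        '</ul>')
soup = BeautifulSoup(html, 'html.parser')

# Regular expression filter on an attribute value
view_links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
view_hrefs = [link['href'] for link in view_links]

# Combine tag name, class and text in one query
div_node = soup.find('div', class_='abc', string='python')

# Sibling and parent lookups starting from a found node
first_li = soup.find('li')
later_items = first_li.find_next_siblings('li')
enclosing_list = soup.find('a').find_parent('ul')

print(view_hrefs)
print(div_node.get_text())
print(len(later_items), enclosing_list.name)
```

Note that find() returns None when nothing matches, so real scraping code usually checks the result before accessing attributes on it.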
Test example:
import bs4
html_doc='''
<div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li><li data-index="5"><a class="nnav-item" href="/pc/home?ch=estate&sign=360_79aabe15" target="_blank" data-ch="estate">房产<span></span></a></li><li data-index="6"><a class="nnav-item" href="/pc/home?ch=car&sign=360_79aabe15" target="_blank" data-ch="car">汽车<span></span></a></li><li data-index="7"><a class="nnav-item" href="/pc/home?ch=sport&sign=360_79aabe15" target="_blank" data-ch="sport">体育<span></span></a></li><li data-index="8"><a class="nnav-item" href="/pc/home?ch=domestic&sign=360_79aabe15" target="_blank" data-ch="domestic">国内
'''
# Create the BeautifulSoup object
soup = bs4.BeautifulSoup(html_doc, 'html.parser')
# Get all links
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())
# Get the link whose href is /pc/home?sign=360_79aabe15
link_node = soup.find('a', href='/pc/home?sign=360_79aabe15')
print(link_node.name, link_node['href'], link_node.get_text())
The output is as follows:
a /pc/home?sign=360_79aabe15
a /pc/home?ch=youlike&sign=360_79aabe15 推荐
a /pc/home?ch=good_safe2toera&sign=360_79aabe15 新时代
a /pc/home?ch=fun&sign=360_79aabe15 娱乐
a /pc/home? data-index= 财经
a /pc/home?ch=economy&sign=360_79aabe15 财经
a /pc/home?ch=estate&sign=360_79aabe15 房产
a /pc/home?ch=car&sign=360_79aabe15 汽车
a /pc/home?ch=sport&sign=360_79aabe15 体育
a /pc/home?ch=domestic&sign=360_79aabe15 国内
a /pc/home?sign=360_79aabe15