Using Beautiful Soup
1. Introduction
In short, Beautiful Soup is an HTML/XML parsing library for Python that makes it easy to extract data from web pages. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to worry about encodings unless the document does not declare one, in which case you only need to state the original encoding.
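For example, if you fetch raw bytes whose encoding is not declared, you can state the original encoding via the from_encoding parameter. A minimal sketch (the byte string here is just a stand-in for a real response body):

from bs4 import BeautifulSoup

# hypothetical raw bytes with no declared charset
raw_bytes = b'<html><body><p>hello</p></body></html>'
soup = BeautifulSoup(raw_bytes, 'lxml', from_encoding='utf-8')  # state the original encoding explicitly
print(soup.original_encoding)  # the encoding Beautiful Soup used for this document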
2. Preparation
Install Beautiful Soup
a. Related links
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh
PyPI: https://pypi.python.org/pypi/beautifulsoup4
b. Installing with pip3
pip3 install beautifulsoup4
c. Installing from a wheel
Download the wheel file from PyPI,
then install the wheel file with pip, for example:
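(The file name below is hypothetical; substitute the wheel you actually downloaded.)
pip3 install beautifulsoup4-4.x.x-py3-none-any.whl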
3. Using Beautiful Soup
1. Basic usage
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="title" name="dromouse"><b>The story</b></p>
<p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">Elise</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
The output is as follows:
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Beautiful Suop
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The story
   </b>
  </p>
  <p class="story">
   once upon a time there were three title sisters;and their name were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elise
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Beautiful Suop
Here we first declare a variable html, which is an HTML string. Note that it is not a complete HTML document: the body and html tags are not closed. We then pass it as the first argument to BeautifulSoup, with the second argument specifying the parser type (lxml here); this completes the initialization of the Beautiful Soup object, which we assign to the variable soup. We can then call soup's methods and properties to parse this HTML.
First, we call the prettify() method, which prints the parsed string with standard indentation. Note that the output contains the body and html tags, which means Beautiful Soup automatically corrects non-standard HTML. This correction is not done by prettify() itself; it happens during initialization.
Then we call soup.title.string, which prints the text content of the title node: soup.title selects the title node from the HTML, and its string attribute gives us the text inside it.
2. Node selectors
Simply calling a node by its tag name selects that element, and calling string on it then returns the node's text. This style of selection is very fast; if the path to a single node is clear, it is a good choice.
♦ Selecting elements
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="title" name="dromouse"><b>The story</b></p>
<p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">Elise</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)
The output is as follows:
<title>The Beautiful Suop</title>
<class 'bs4.element.Tag'>
The Beautiful Suop
<head> <meta charset="utf-8"/> <title>The Beautiful Suop</title> </head>
<p class="title" name="dromouse"><b>The story</b></p>
Here we reuse the sample HTML from before. We first print the result of selecting the title node, then its type, <class 'bs4.element.Tag'>, which is an important data structure in Beautiful Soup, and then the text content of the title node.
Next we try the head and p nodes. When selecting the p node, only the content of the first p node is printed: when multiple nodes match, this approach returns only the first one and ignores the rest.
♦ Extracting information
How do we get a node's attribute values? And how do we get its name?
(1) Getting the name
Use the name attribute to get the name of a node:
print(soup.title.name)
Output:
title
(2) Getting attributes
A node can have multiple attributes, such as id and class. After selecting a node, call attrs to get all of its attributes:
print(soup.p.attrs)
Output:
{'class': ['title'], 'name': 'dromouse'}
As you can see, attrs returns a dictionary mapping each attribute name to its value. To get the name attribute, just index with that key, e.g. attrs['name']. There is an even simpler way: index the node element directly with the attribute name:
print(soup.p['name'])
print(soup.p['class'])
Output:
dromouse
['title']
Note that some results come back as strings while others come back as lists. For example, the name attribute has a single unique value, so a plain string is returned, whereas an element can have several classes, so class returns a list. You need to check which one you get in practice.
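A quick way to see the difference, reusing the soup built above (the variable names are just for illustration):

# the name attribute comes back as a single string, class as a list
name_value = soup.p['name']    # 'dromouse'
class_value = soup.p['class']  # ['title']
print(isinstance(name_value, str), isinstance(class_value, list))  # True True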
(3) Getting the content
Use string to get a node's text content:
print(soup.p.string)
Output:
The story
The p node here is the first p node.
♦ Nested selection
In the examples above, every step returns a bs4.element.Tag, so we can keep selecting nodes on the result:
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)
Output:
<title>The Beautiful Suop</title>
<class 'bs4.element.Tag'>
The Beautiful Suop
♦ Associated selection
First select a node, then use it as the starting point to select its parent, children, siblings, and so on.
(1) Children and descendants
Use the contents attribute to get the child nodes:
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elise</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
Output:
['once upon a time there were three title sisters;and their name were ',
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a>, ' ',
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' ',
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';
and they lived at the bottom of a well. ']
The p node contains both text and element nodes, so the result comes back as a list.
Using children gives the same result, but it returns an iterator rather than a list:
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Output:
<list_iterator object at 0x0000016B477884A8>
0 once upon a time there were three title sisters;and their name were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ; and they lived at the bottom of a well.
Use the descendants attribute to get the descendant nodes. It also returns a generator, and this time the output includes the span node: descendants recursively walks all children and yields every descendant.
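A minimal loop that would produce the output below, reusing the soup from the contents example above:

print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output: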
<generator object descendants at 0x0000029DA472D9E8>
0 once upon a time there were three title sisters;and their name were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a>
2
3 <span>Elise</span>
4 Elise
5
6
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 ; and they lived at the bottom of a well.
(2) Parent and ancestor nodes
Call parent to get a node's direct parent:
print(soup.a.parent)
Output:
<p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p>
Clearly, the direct parent of the a node is the p node, so the content of that p node is printed here.
Calling parents selects the ancestor nodes. The result is a generator; here we enumerate it into a list and print the index and content of each element. The elements of the list are the ancestors of the a node.
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))
Output:
<class 'generator'> [(0, <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p>),
(1, <body> <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> </body>),
(2, <html lang="en"> <head> <meta charset="utf-8"/> <title>The Beautiful Suop</title> </head> <body> <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> </body></html>),
(3, <html lang="en"> <head> <meta charset="utf-8"/> <title>The Beautiful Suop</title> </head> <body> <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> </body></html>)]
(3) Sibling nodes
To get nodes on the same level: next_sibling and previous_sibling return the next and the previous sibling of a node respectively, while next_siblings and previous_siblings return all of the following and all of the preceding siblings.
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elise</span>
    </a>
    hello
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
"""
soup = BeautifulSoup(html, 'lxml')
print("Next Sibling:", soup.a.next_sibling)
print("Prev Sibling:", soup.a.previous_sibling)
print("Next Siblings:", list(soup.a.next_siblings))
print("Prev Siblings:", list(soup.a.previous_siblings))
Output:
Next Sibling:  hello
Prev Sibling: once upon a time there were three title sisters;and their name were
Next Siblings: [' hello ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well. ']
Prev Siblings: ['once upon a time there were three title sisters;and their name were ']
(4) Extracting information
A single node can call properties such as string and attrs directly to get its text and attributes. For a generator over multiple nodes, convert it to a list first, pick out the node you need, and then call string, attrs, and so on to get that node's text and attributes.
from bs4 import BeautifulSoup

html = """
<html lang="en">
<body>
<p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elise</span>
    </a>
</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.a.next_sibling.string)
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
Output:
<p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> </p>
['story']
3. Method selectors
♦ find_all()
find_all() queries for all elements that match the given conditions. Pass in attributes or text and it returns every matching element; it is a very powerful method. Its signature is:
find_all(name,attrs,recursive,text,**kwargs)
(1) name
Query elements by node name:
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
Output:
[
<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li> <li class="element">Bar</li> </ul>
]
<class 'bs4.element.Tag'>
Here we call find_all() with name set to ul, which finds all ul nodes. It returns a list, and every element of the list is of type bs4.element.Tag. Since they are Tag objects, we can continue with nested queries, for example for the li nodes inside them:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Then iterate over every li and get its text content:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar
(2) attrs
Query by the attributes passed in:
print(soup.find_all(attrs={'id': 'list-1'}))
Output:
[<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>]
Common attributes such as id and class can be passed directly as keyword arguments without attrs. Since class is a Python keyword, append an underscore: class_='element'.
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
Output:
[<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
(3) text
The text parameter matches the text of nodes. It accepts a string or a regular expression object:
import re

print(soup.find_all(text=re.compile('F')))
Output:
['Foo', 'Foo']
♦ find()
The find() method returns a single element, namely the first one that matches.
print(soup.find(name='ul'))
print(soup.find(class_='list'))
Output:
<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>
<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>
There are many more similar methods (a short usage sketch follows the list below):
find_parent(): returns the direct parent node
find_parents(): returns all ancestor nodes
find_next_sibling(): returns the first following sibling node
find_next_siblings(): returns all following sibling nodes
find_previous_sibling(): returns the first preceding sibling node
find_previous_siblings(): returns all preceding sibling nodes
find_next(): returns the first node after this one that matches the criteria
find_all_next(): returns all nodes after this one that match the criteria
find_previous(): returns the first node before this one that matches the criteria
find_all_previous(): returns all nodes before this one that match the criteria
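A brief sketch of a few of these methods, reusing the soup built from the ul/li example above (the variable li is just for illustration):

li = soup.find(class_='element')    # the first <li class="element">Foo</li>
print(li.find_next_sibling())       # the next sibling tag: <li class="element">Bar</li>
print(li.find_parent(name='ul'))    # the enclosing <ul id="list-1"> node
print(li.find_all_next(name='li'))  # every li node that appears after it in the document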
4. CSS selectors
To use CSS selectors, simply call the select() method and pass in the corresponding CSS selector:
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
Output:
[<div class="panel-heading"> <h4>hello</h4> </div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
♦ Nested selection
Iterate over each ul node and select the li nodes inside it:
for ul in soup.select('ul'):
    print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
♦ Getting attributes
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Output:
list-1
list-1
list-2
list-2
♦ Getting text
To get the text, you can use get_text() in addition to string:
# get the text
for li in soup.select('li'):
    print(li.get_text())
    print(li.string)
Output:
Foo
Foo
Bar
Bar
Jay
Jay
Foo
Foo
The lxml parser is recommended.
Node selectors are limited in what they can express, but they are fast.
find() and find_all() are recommended for matching a single element or multiple elements.
If you are familiar with CSS, you can use select() for matching.