[Python Web Scraping Study Notes (3)] Beautiful Soup: A Summary of Key Points

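1. Introduction to Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It provides idiomatic ways to navigate, search, and modify the parse tree, which greatly reduces the time it takes to write a scraper.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only have to state the original encoding.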
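As a minimal sketch of stating the original encoding, the snippet below feeds Beautiful Soup some Latin-1 bytes that carry no encoding declaration and names the encoding through the from_encoding argument. The sample bytes and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Latin-1 bytes with no <meta charset> declaration (made-up sample).
raw = "<p>caf\u00e9</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                  # decoded to a Unicode string
utf8_bytes = soup.p.encode("utf-8")   # re-encoded as UTF-8 on the way out
print(utf8_bytes)
# b'<p>caf\xc3\xa9</p>'
```

from_encoding only matters when the input is bytes; a str is already decoded.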

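Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.

2. Installing Beautiful Soup

Beautiful Soup can be installed quickly with pip; the current major version is Beautiful Soup 4.

    $ pip install beautifulsoup4

After installation, import the class from the bs4 package and it is ready to use.

    from bs4 import BeautifulSoup

3. Creating a Beautiful Soup object

We use the following test document for the rest of these notes.

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

After importing, create a BeautifulSoup object as shown below. The argument can be a string holding the HTML of a fetched page, or an open file object for an HTML file saved locally, such as test.html.

    # Name the parser explicitly; "html.parser" ships with Python.
    soup = BeautifulSoup(html, "html.parser")
    soup = BeautifulSoup(open('test.html'), "html.parser")

Then check that the object was created correctly. Note: under Python 2 you sometimes need to append .encode('utf-8') to the result for the soup to print correctly.

    print(soup.prettify())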
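4. The four kinds of Beautiful Soup objects

Beautiful Soup has four main kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

4.1 Tag

The Tag object

A Tag is a tag in the HTML document together with the content it encloses. For example, the following is a Tag.

    <title>The Dormouse's story</title>

You can get the title Tag like this; the second line is the output.

    print(soup.title)
    # <title>The Dormouse's story</title>

Note: an object of type bs4.element.Tag supports further . operations, that is, everything the soup object itself supports. Make sure an object really is a bs4.element.Tag before operating on it further; for example, the tree-searching .find method introduced later exists only on Tag objects (on a NavigableString, .find is just the ordinary string method).

    print(type(soup.title))
    # <class 'bs4.element.Tag'>

Tag attributes

.name

A Tag's .name attribute gives the name of the tag itself.

    print(soup.title.name)
    # title

.attrs

A Tag's .attrs attribute gives a dictionary of all the attributes on the tag.

    print(soup.p.attrs)
    # {'class': ['title'], 'name': 'dromouse'}

A Tag supports dictionary-style operations on its attributes, such as reading, modifying, and deleting.

    print(soup.p['class'])        # read (option 1)
    # ['title']
    print(soup.p.get('class'))    # read (option 2)
    # ['title']

    soup.p['class'] = "newClass"  # modify
    print(soup.p)
    # <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

    del soup.p['class']           # delete
    print(soup.p)
    # <p name="dromouse"><b>The Dormouse's story</b></p>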
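The output above shows class as a list while name is a plain string. The short sketch below illustrates why: class is a multi-valued HTML attribute, so Beautiful Soup splits it on whitespace. It also shows that .get returns None for a missing attribute instead of raising KeyError. The markup and variable names here are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title lead" name="dromouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]            # multi-valued attribute: a list of strings
name = p["name"]                # ordinary attribute: a single string
missing = p.get("id")           # no such attribute: None rather than KeyError
has_name = p.has_attr("name")   # explicit membership test

print(classes, name, missing, has_name)
# ['title', 'lead'] dromouse None True
```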
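4.2 NavigableString

The text inside a tag is available through .string; it is an object of type bs4.element.NavigableString.

    print(soup.p.string)
    # The Dormouse's story

    print(type(soup.p.string))
    # <class 'bs4.element.NavigableString'>

4.3 BeautifulSoup

A BeautifulSoup object represents the whole document. Most of the time you can treat it as a Tag object; it is a special kind of Tag.

    print(type(soup.name))
    # <class 'str'>
    print(soup.name)
    # [document]
    print(soup.attrs)
    # {}  (an empty dict)

4.4 Comment

The first three types cover almost everything in an HTML or XML document, but Comment is one more type you need to watch for. Like CData, ProcessingInstruction, Declaration, and Doctype, it is a subclass of NavigableString. The following code gives a quick idea of how it behaves.

    markup = "<b><!--Hey, buddy. Want to buy a used parser--></b>"  # the tag's content is a comment
    soup = BeautifulSoup(markup, "html.parser")
    comment = soup.b.string
    type(comment)
    # <class 'bs4.element.Comment'>
    comment
    # 'Hey, buddy. Want to buy a used parser'
    print(soup.b.prettify())
    # <b>
    #  <!--Hey, buddy. Want to buy a used parser-->
    # </b>

Note: the content of the tag is actually a comment, but when we read it through .string the comment markers are already gone. This can cause unexpected trouble, so check the type before using the value or operating on it.

    from bs4.element import Comment

    if isinstance(soup.b.string, Comment):
        ...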
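One way to apply that type check is to filter comments out while collecting text. The helper below is a sketch only: visible_strings is a hypothetical name, not part of Beautiful Soup, and the sample markup is made up.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

def visible_strings(tag):
    """Return the text nodes under tag, leaving out HTML comments."""
    return [
        str(node)
        for node in tag.descendants
        if isinstance(node, NavigableString) and not isinstance(node, Comment)
    ]

soup = BeautifulSoup("<p>Hello<!-- hidden note --> world</p>", "html.parser")
comments = [str(n) for n in soup.p.descendants if isinstance(n, Comment)]

print(visible_strings(soup.p))
# ['Hello', ' world']
print(comments)
# [' hidden note ']
```

isinstance is used rather than comparing type() directly because Comment is itself a subclass of NavigableString.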
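5. Traversing the tree

5.1 Children and descendants

.contents

A Tag's .contents attribute gives its children as a list.

    print(soup.head.contents)
    # [<title>The Dormouse's story</title>]

Because it is a list, you can index into it directly.

    print(soup.head.contents[0])
    # <title>The Dormouse's story</title>

.children

A Tag's .children attribute gives an iterator over its children, which you can loop over to reach each one.

    for child in soup.body.children:
        print(child)

.descendants

Unlike .contents and .children, which cover only direct children, .descendants iterates over all descendants: it works down through the tags level by level, and you loop over it in the same way.

    for child in soup.descendants:
        print(child)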
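To make the difference concrete, here is a small sketch on an invented fragment. The label helper is a hypothetical function written for this example; it reports the tag name for tags and the text for strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

def label(node):
    """Tag name for tags, the text itself for strings."""
    return node.name if isinstance(node, Tag) else str(node)

direct = [label(n) for n in div.children]         # direct children only
everything = [label(n) for n in div.descendants]  # depth-first, document order

print(direct)
# ['p', 'p']
print(everything)
# ['p', 'one ', 'b', 'two', 'p', 'three']
```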
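5.2 Parent nodes

.parent

A Tag's .parent attribute gives its direct parent.

.parents

The .parents attribute iterates over all of an element's ancestors, from the nearest one up to the document itself.

    content = soup.head.title.string
    for parent in content.parents:
        print(parent.name)
    # title
    # head
    # html
    # [document]

5.3 Sibling nodes

.next_sibling and .next_siblings

.next_sibling gives the next node at the same level as the Tag, or None if there is none. .next_siblings iterates over all the siblings that follow the Tag.

.previous_sibling and .previous_siblings

.previous_sibling gives the previous node at the same level as the Tag, or None if there is none. .previous_siblings iterates over all the siblings that precede the Tag.

Note: whitespace and line breaks in an HTML document also count as nodes, so a sibling (or child) you get back may be a whitespace string rather than a Tag. Check the type, for example with isinstance, before taking the next step.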
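The whitespace pitfall can be sketched as follows, on an invented list with a line break between the items. Filtering with isinstance keeps only real tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul>\n<li>one</li>\n<li>two</li>\n</ul>", "html.parser")
first = soup.li

raw_next = first.next_sibling   # the newline between the two <li> tags
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next))
# '\n'
print([t.string for t in tag_siblings])
# ['two']
```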

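5.4 Next and previous elements

.next_element and .next_elements

Unlike .next_sibling and .next_siblings, these are not limited to siblings. They follow the order in which the document was parsed, regardless of nesting level, and give the next node and all following nodes respectively. Loop over .next_elements to read its results.

.previous_element and .previous_elements

These two attributes give, regardless of nesting level, the node parsed immediately before and all nodes parsed before. Loop over .previous_elements to read its results.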
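The contrast with sibling navigation is easiest to see on a tiny fragment. This is only a sketch on made-up markup: the next element of a tag is usually its own first child, while its next sibling is whatever follows its closing tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b>after</p>", "html.parser")
b = soup.b

sibling = b.next_sibling                       # follows </b> at the same level
element = b.next_element                       # first thing parsed after <b>
following = [str(e) for e in b.next_elements]  # everything parsed after <b>

after = b.next_sibling
prev_sibling_name = after.previous_sibling.name  # the <b> tag
prev_element = after.previous_element            # the string inside <b>

print(sibling, element, following)
# after bold ['bold', 'after']
print(prev_sibling_name, prev_element)
# b bold
```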

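5.5 Node content

.string

If a tag contains no further tags, only text, .string returns that text. If a tag's only child is another tag, .string returns the innermost text as well.

    print(soup.head.string)
    # The Dormouse's story
    print(soup.title.string)
    # The Dormouse's story

If a Tag contains more than one child, it is not clear which child .string should refer to, so the result is None.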
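A short sketch of both cases, using an invented fragment: a tag with a single nested child, and a tag that mixes text with another tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><span>only</span></div><p>Hello <b>world</b></p>", "html.parser"
)

nested = soup.div.string   # single child chain: reaches the inner text
mixed = soup.p.string      # two children (text and <b>): ambiguous, so None
full = soup.p.get_text()   # get_text() concatenates all the text instead

print(nested, mixed, full)
# only None Hello world
```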

.strings and .stripped_strings

When a Tag has more than one child, loop over .strings to get the text of all of them.

    for string in soup.strings:
        print(repr(string))
    # "The Dormouse's story"
    # ' '
    # "The Dormouse's story"
    # ' '
    # 'Once upon a time there were three little sisters; and their names were '
    # 'Elsie'
    # ', '
    # 'Lacie'
    # ' and '
    # 'Tillie'
    # '; and they lived at the bottom of a well.'
    # ' '
    # '...'
    # ' '
    # (whitespace-only entries come from the line breaks between tags;
    #  the exact whitespace depends on the document)

.stripped_strings yields the same text with leading and trailing whitespace removed, skipping strings that consist only of whitespace.

.get_text()

If you only want the text of a document or tag, use .get_text(). It returns all the text in the document or under the Tag as a single Unicode string.

    markup = '<a href="http://example.com/"> I linked to <i>example.com</i> </a>'
    soup = BeautifulSoup(markup, "html.parser")

    soup.get_text()
    # ' I linked to example.com '
    soup.i.get_text()
    # 'example.com'

You can pass a string to be used to join the pieces of text.

    soup.get_text("|")
    # ' I linked to |example.com| '

In addition, strip=True removes the whitespace at the beginning and end of each piece.

    soup.get_text("|", strip=True)
    # 'I linked to|example.com'

Or list the pieces of text with a list comprehension over .stripped_strings.

    [text for text in soup.stripped_strings]
    # ['I linked to', 'example.com']

6. Searching the tree

6.1 find_all(name, attrs, recursive, string, limit, **kwargs)

This method searches all descendants of the current Tag and returns a list of the objects that match the filters.

The name argument

1) A string

The simplest filter is a tag name passed as a string; the result is a list of all tags with that name.

    print(soup.find_all('a'))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
2) A regular expression

Pass a compiled regular expression to get the Tag objects whose names match it.

    import re
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)
    # body
    # b

3) A list

Pass a list of strings to get every Tag whose name matches any item in the list.

    soup.find_all(["a", "b"])
    # [<b>The Dormouse's story</b>,
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4) True

True matches every tag in the document, but not the text strings.

    for tag in soup.find_all(True):
        print(tag.name)
    # html
    # head
    # title
    # body
    # p
    # b
    # p
    # a
    # a
    # a
    # p

5) A function

Pass a function that takes a tag and returns True or False; the tags for which it returns True are the matches.

    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

    soup.find_all(has_class_but_no_id)
    # [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
    #  <p class="story">Once upon a time there were...</p>,
    #  <p class="story">...</p>]
Keyword arguments

You can pass one or more keyword arguments. Beautiful Soup treats each keyword as an attribute name and searches the tags under the current Tag for that attribute with the given value.

    soup.find_all(href=re.compile("elsie"), id='link1')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Special case: class is a reserved word in Python, so to filter on the class attribute write class_ instead.

    soup.find_all("a", class_="sister")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes, because they are not valid Python identifiers. Filter on them through the attrs dictionary instead.

    data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
    data_soup.find_all(attrs={"data-foo": "value"})
    # [<div data-foo="value">foo!</div>]

The string argument (called text in older versions)

This argument searches the document for strings instead of tags. Like name, it accepts a string, a list, a regular expression, or True.

    soup.find_all(string="Elsie")
    # ['Elsie']

    soup.find_all(string=["Tillie", "Elsie", "Lacie"])
    # ['Elsie', 'Lacie', 'Tillie']

    soup.find_all(string=re.compile("Dormouse"))
    # ["The Dormouse's story", "The Dormouse's story"]

The limit argument

Use this argument to cap the number of results. In the example, three nodes match but only two are returned.

    soup.find_all("a", limit=2)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive argument

Setting recursive=False restricts the search to the direct children of the current Tag, which can save a lot of search time.

    soup.html.find_all("title")
    # [<title>The Dormouse's story</title>]
    soup.html.find_all("title", recursive=False)
    # []
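6.2 find(name, attrs, recursive, string, **kwargs)

The only difference from find_all() is the return value: find_all() returns a list, even when it holds a single element, whereas find() returns the first match directly, or None if nothing matches.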
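The return-value difference matters most when nothing matches. A brief sketch on invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="x">first</a><a id="y">second</a>', "html.parser")

first = soup.find("a")                 # the first matching Tag
all_links = soup.find_all("a")         # a list of every match
no_match_one = soup.find("table")      # no match: None
no_match_all = soup.find_all("table")  # no match: an empty list

# find(...) behaves like taking the only item of find_all(..., limit=1).
same = soup.find_all("a", limit=1)[0] is first

print(first["id"], len(all_links), no_match_one, len(no_match_all), same)
# x 2 None 0 True
```

Because find() can return None, check the result before chaining further attribute access on it.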

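6.3 find_parents() and find_parent()

find_all() and find() search only downward, through the current node's children, grandchildren, and so on. find_parents() and find_parent() search upward through the current node's ancestors, using the same filters as an ordinary tag search.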
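A small sketch of the upward search, on a made-up nested fragment. Results come back nearest ancestor first.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="outer"><ul><li><a href="#">x</a></li></ul></div>', "html.parser"
)
link = soup.a

nearest_div = link.find_parent("div")                        # first matching ancestor
chain = [t.name for t in link.find_parents(["ul", "div"])]   # all matching ancestors
no_table = link.find_parent("table")                         # no such ancestor

print(nearest_div["id"], chain, no_table)
# outer ['ul', 'div'] None
```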

6.4 find_next_siblings() and find_next_sibling()

These two methods iterate, via .next_siblings, over the siblings that come after the current tag. find_next_siblings() returns all following siblings that match; find_next_sibling() returns only the first one.

6.5 find_previous_siblings() and find_previous_sibling()

These two methods iterate, via .previous_siblings, over the siblings that come before the current tag. find_previous_siblings() returns all preceding siblings that match; find_previous_sibling() returns only the first one.
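The four sibling searches can be sketched together on an invented three-item list. Note that the "previous" results are ordered nearest first, that is, in reverse document order.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>', "html.parser"
)
middle = soup.find(id="b")

after = middle.find_next_sibling("li")
before = middle.find_previous_sibling("li")
later = [t["id"] for t in soup.find(id="a").find_next_siblings("li")]
earlier = [t["id"] for t in soup.find(id="c").find_previous_siblings("li")]

print(after["id"], before["id"], later, earlier)
# c a ['b', 'c'] ['b', 'a']
```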

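6.6 find_all_next() and find_next()

These two methods iterate, via .next_elements, over the tags and strings that come after the current tag. find_all_next() returns all nodes that match; find_next() returns only the first one.

6.7 find_all_previous() and find_previous()

These two methods iterate, via .previous_elements, over the tags and strings that come before the current node. find_all_previous() returns all nodes that match; find_previous() returns only the first one.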
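Because these methods follow parse order rather than tree level, they can reach tags that a sibling search cannot. The sketch below, on made-up markup, compares find_all_next() with find_next_siblings() from the same starting tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2>Title</h2><p>one</p><div><p>two</p></div>", "html.parser")
heading = soup.h2

following = [p.string for p in heading.find_all_next("p")]          # any depth
siblings_only = [p.string for p in heading.find_next_siblings("p")]  # same level
first_after = heading.find_next("p")
back = soup.div.p.find_previous("h2")   # search backward from the nested <p>

print(following, siblings_only, first_after.string, back.string)
# ['one', 'two'] ['one'] one Title
```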

References:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#

When reposting, please credit the source:

http://www.cnblogs.com/wuwenyan/p/4773427.html
Original article: https://www.cnblogs.com/wuwenyan/p/4773427.html