A Brief Introduction to BeautifulSoup

Overview

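Working with data inevitably means dealing with HTML and XML documents. BeautifulSoup is a Python library for extracting data from HTML or XML. It is powerful and easy to use, which makes it a well-liked tool for data processing.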

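Installation

With pip, installation is no longer a problem. BeautifulSoup supports the HTML parser in the Python standard library as well as third-party parsers. I recommend the third-party lxml parser: I have used it to process XML files of several hundred megabytes each, and it stayed fast with no noticeable lag. The parser that ships with Python also works fine in most cases; the main difference is speed and efficiency.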
```
$ pip install beautifulsoup4
$ pip install lxml
```
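
Getting started
```
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<html>data</html>", "html.parser")  # Python's built-in parser: moderate speed, lenient
>>> soup = BeautifulSoup("<html>data</html>", "html5lib")  # parses the way a browser does: the most lenient (needs pip install html5lib)
>>> soup = BeautifulSoup("<html>data</html>", "lxml-xml")  # lxml's XML parser: fast
>>> soup = BeautifulSoup("<html>data</html>", "lxml")  # lxml's HTML parser: fast and lenient
```
If no parser is specified, BeautifulSoup picks one from the parsers available on the system and issues a warning, because the result can then differ from one machine to another.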
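To keep the parser choice explicit while still coping with a machine that lacks lxml, something like the following sketch may help. `make_soup` is a hypothetical helper, not part of BeautifulSoup; it assumes that bs4 raises `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Build a soup with the first parser in `parsers` that is installed.

    Hypothetical helper for illustration only.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p>data</p>")
print(soup.p.get_text())
```

Because "html.parser" is part of the standard library, the fallback should always succeed; the price is that different machines may end up using different parsers.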

Practical notes

All the examples below use the following HTML.
```
html_doc = """
<html>
<div id="My gift">
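<p class="intro short-text" align="left">One</p>
<p class="intro short-text" align="center">Two</p>
<p class="intro short-text" align="right">Three</p>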
</div>
<img class="photo" src="demo.jpg">
<div class="photo">
<a href="sdysit.com"><img src="logo.png"></a>
山东远思信息科技有限公司
</div>
</html>
"""
```
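- Text is a node too. We call these text nodes, for example One, Two, and Three inside the p tags.
- A node usually has more children than it appears to, because the line breaks, spaces, and tabs around the visible children are text-node children of that node as well.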
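
The second point is easy to check. The sketch below uses a cut-down copy of the sample document and the built-in "html.parser" (chosen here only so that it runs without lxml) and counts the text and element children of the div.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# A cut-down version of the sample document.
snippet = '<div id="My gift">\n<p>One</p>\n<p>Two</p>\n<p>Three</p>\n</div>'
soup = BeautifulSoup(snippet, "html.parser")

children = soup.div.contents
text_nodes = [c for c in children if isinstance(c, NavigableString)]
element_nodes = [c for c in children if isinstance(c, Tag)]

# The line breaks around the three <p> elements are children too.
print(len(children), len(text_nodes), len(element_nodes))
```

Here the div has seven children: three p elements and four whitespace-only text nodes.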

Node objects, names, and attributes

Build a BeautifulSoup object named soup with the lxml parser; node objects can then be reached by tag name:
```
>>> soup = BeautifulSoup(html_doc, 'lxml')
>>> tag = soup.html
>>> tag.name
'html'
>>> tag.p.name
'p'
```
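In fact, we need not care which parent a tag belongs to; we can get the node object straight from soup:
```
>>> soup.p.name
'p'
>>> soup.img['src']
'demo.jpg'
>>> soup.img.attrs
{'class': ['photo'], 'src': 'demo.jpg'}
>>> soup.p['class']
['intro', 'short-text']
>>> soup.div['id']
'My gift'
```
Clearly, a node obtained this way is always the first tag of that type in the HTML. The example above also shows how to get all of a node's attributes or one specific attribute. When the class attribute has several values, a list is returned, whereas id is not treated as multi-valued, so its value comes back as a single string even if it contains a space.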
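A small sketch of attribute access, including the case where an attribute is missing. The one-line document is made up for illustration, and "html.parser" is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="a b" class="intro short-text">One</p>', "html.parser")
p = soup.p

classes = p["class"]                   # multi-valued attribute: a list
identifier = p["id"]                   # single-valued attribute: one string
missing = p.get("align")               # .get() returns None instead of raising
fallback = p.get("align", "left")      # or a default of our choosing

print(classes, identifier, missing, fallback, p.has_attr("id"))
```

Indexing with `p["align"]` would raise a KeyError here, so `.get()` or `.has_attr()` is the safer choice when an attribute may be absent.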

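Text content of a node
There are many ways to get the text content of a node, for example:
```
>>> soup.p.text
'One'
>>> soup.p.getText()
'One'
>>> soup.p.get_text()
'One'
>>> soup.p.string
'One'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
```
When a node has only a text child, the first three methods behave identically. The fourth looks much the same but returns a NavigableString rather than a plain str.

When a node contains element children, the output may no longer be what we want, and .string returns None once a node has more than one child. In that case, use .strings or .stripped_strings (which drops blank lines and surrounding whitespace) to get an iterator, and loop over it to pick out the content we need.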
```
>>> soup.div.text
'
One
Two
Three
'
>>> soup.html.text
'

One
Two
Three




山东远思信息科技有限公司

'
>>> for item in soup.div.stripped_strings:
...     print(item)
...
One
Two
Three
```
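The differences between these accessors can be put side by side. The sketch below uses a made-up two-paragraph div and "html.parser"; `get_text()` with a separator and `strip=True` is one more option for collapsing the whitespace in a single call.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>\n<p>One</p>\n<p>Two</p>\n</div>", "html.parser")
div = soup.div

single = div.p.string                      # the <p> has exactly one text child
none_here = div.string                     # the <div> has several children
raw = div.get_text()                       # whitespace text nodes included
joined = div.get_text(" ", strip=True)     # stripped pieces joined by a space
pieces = list(div.stripped_strings)        # the same pieces as a list

print(repr(single), none_here, repr(raw), repr(joined), pieces)
```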
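
Child nodes

.contents, .children, and .descendants all give access to a node's children, but they are used differently:
- .contents and .children cover direct children only, whereas .descendants walks recursively through all descendants
- .contents returns a list of child nodes; .children and .descendants return iterators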
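
A short sketch of the three attributes on a made-up fragment, parsed with "html.parser". Text nodes are filtered out of the .descendants result so that only tag names remain.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(
    '<div><a href="#"><img src="logo.png"></a><p>One</p></div>', "html.parser"
)
div = soup.div

direct = [child.name for child in div.contents]      # direct children, from a list
iterated = [child.name for child in div.children]    # same nodes, from an iterator
nested = [d.name for d in div.descendants if isinstance(d, Tag)]  # recursive

print(direct, iterated, nested)
```

The img nested inside the a tag appears only in the .descendants result.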

Parent nodes

Use the .parent attribute to get an element's parent:
```
>>> soup.p.parent.name
'div'
```
The .parents attribute iterates over all of an element's ancestors:
```
>>> for parent in soup.p.parents:
...     print(parent.name)
...
div
body
html
[document]
```
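When only one particular ancestor is of interest, `find_parent()` saves the loop. A sketch on a made-up document, using "html.parser" (which, unlike lxml, does not add html and body tags, so they are written out here):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><body><div id="gift"><p>One</p></div></body></html>', "html.parser"
)
p = soup.p

ancestors = [parent.name for parent in p.parents]  # nearest ancestor first
nearest_div = p.find_parent("div")                 # first ancestor named div
no_table = p.find_parent("table")                  # None when nothing matches

print(ancestors, nearest_div["id"], no_table)
```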
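
Sibling nodes
- Use the .next_sibling and .previous_sibling attributes to get the next or previous sibling. Keep in mind that, besides the visible siblings, text-node siblings such as line breaks, spaces, and tabs may be mixed in.
- Use the .next_siblings and .previous_siblings attributes to iterate over all siblings after or before the current node.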
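
The whitespace caveat in the first point is where `find_next_sibling()` and `find_previous_sibling()` come in handy, since they skip over text nodes when given a tag name. A sketch on a made-up fragment, parsed with "html.parser":

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div>\n<p>One</p>\n<p>Two</p>\n<p>Three</p>\n</div>", "html.parser"
)
first = soup.p

raw_next = first.next_sibling                     # the line break after </p>
next_p = first.find_next_sibling("p")             # the next <p> element
later = [p.text for p in first.find_next_siblings("p")]
last = soup.find_all("p")[-1]
before_last = last.find_previous_sibling("p")

print(repr(raw_next), next_p.text, later, before_last.text)
```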

Searching for nodes
The usual tools are find(), which returns the first node that matches, and find_all(), which returns a list of all matching nodes.
```
>>> soup.find('p')
<p class="intro short-text" align="left">One</p>
>>> soup.find_all('img')
[<img class="photo" src="demo.jpg"/>, <img src="logo.png"/>]
```
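Two details are worth knowing when using these methods: find() returns None when nothing matches, and find_all() accepts a limit. A sketch on a made-up fragment, parsed with "html.parser":

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><img src="demo.jpg"><img src="logo.png"></div>', "html.parser"
)

first_img = soup.find("img")
missing = soup.find("table")                       # no such tag: None
sources = [img["src"] for img in soup.find_all("img")]
capped = soup.find_all("img", limit=1)             # stop after one match

print(first_img["src"], missing, sources, len(capped))
```

Checking for None before chaining attribute access on a find() result avoids an AttributeError on pages that lack the expected tag.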

Matching tag names with a regular expression

Search for tags whose names start with d:
```
>>> import re
>>> for tag in soup.find_all(re.compile("^d")):
...     print(tag.name)
...
div
div
```
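When the wanted names are known in advance, a list of names can stand in for the regular expression. A sketch on a made-up definition list, parsed with "html.parser":

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><dl><dt>term</dt><dd>definition</dd></dl><p>text</p></div>",
    "html.parser",
)

by_regex = [tag.name for tag in soup.find_all(re.compile("^d"))]
by_list = [tag.name for tag in soup.find_all(["dt", "p"])]

print(by_regex, by_list)
```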

Searching by attribute
```
>>> soup.find_all(id='My gift')[0].name  # nodes whose id is "My gift"
'div'
>>> soup.find_all(id=True)[0].name  # nodes that have an id attribute at all
'div'
>>> soup.find_all(attrs={"id":"My gift"})[0].name  # the same search, using attrs
'div'
>>> soup.find_all(attrs={"class":"intro short-text","align":"right"})[0].text  # several attributes at once
'Three'
>>> soup.find_all(attrs={"align":"right"})[0].text  # a single attribute via attrs
'Three'
```
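The attrs dictionary matters most for attribute names that are not valid Python keyword arguments, such as the data-* family. In the sketch below the `data-rank` attribute is invented for illustration; the fragment is parsed with "html.parser".

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="My gift">'
    '<p class="intro" align="center" data-rank="2">Two</p>'
    '<p class="intro" align="right" data-rank="3">Three</p>'
    "</div>",
    "html.parser",
)

by_id = soup.find(id="My gift")
by_align = [p.text for p in soup.find_all("p", align="right")]
by_data = [p.text for p in soup.find_all(attrs={"data-rank": "3"})]
with_align = [p.text for p in soup.find_all(align=True)]

print(by_id.name, by_align, by_data, with_align)
```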
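
Searching by CSS class
```
>>> soup.find_all("p", class_="intro")
[<p class="intro short-text" align="left">One</p>, <p class="intro short-text" align="center">Two</p>, <p class="intro short-text" align="right">Three</p>]
>>> soup.find_all("p", class_="intro short-text")
[<p class="intro short-text" align="left">One</p>, <p class="intro short-text" align="center">Two</p>, <p class="intro short-text" align="right">Three</p>]
```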
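One caveat with the second form: a class_ value containing a space is compared against the whole attribute string, so the order of the class names matters. If order should not matter, a CSS selector through `select()` is an alternative; it relies on the soupsieve package, which recent beautifulsoup4 releases install as a dependency. The fragment below is made up for illustration and parsed with "html.parser".

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="gift">'
    '<p class="intro short-text">One</p>'
    '<p class="short-text intro">Two</p>'
    '<p class="intro">Three</p>'
    "</div>",
    "html.parser",
)

one_class = [p.text for p in soup.find_all("p", class_="intro")]
exact_string = [p.text for p in soup.find_all("p", class_="intro short-text")]
both_classes = [p.text for p in soup.select("p.intro.short-text")]
first_child = soup.select_one("#gift > p").text

print(one_class, exact_string, both_classes, first_child)
```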

Searching by text content
```
>>> soup.find_all(string="Two")
['Two']
>>> soup.find_all(string=re.compile("Th"))
['Three']
```
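Note that a search by string alone returns the text nodes, not the tags around them. A sketch of two ways to get back to the tags, on a made-up fragment parsed with "html.parser":

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>One</p><p>Two</p><p>Three</p></div>", "html.parser")

strings = soup.find_all(string=re.compile("^T"))      # text nodes
owners = [s.parent.name for s in strings]             # climb to the enclosing tag
tags = soup.find_all("p", string=re.compile("^T"))    # or ask for the tags directly

print(strings, owners, [t.text for t in tags])
```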
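
Filtering with a function
```
>>> def justdoit(tag):
...     # .get() avoids a KeyError on tags that have no align attribute
...     return tag.parent.has_attr('id') and tag.get('align') == 'center'
...
>>> soup.find_all(justdoit)
[<p class="intro short-text" align="center">Two</p>]
```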
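The same idea as a self-contained sketch. The function name `centered_under_id` and the fragment are made up for illustration; the third p has no align attribute on purpose, to show why `.get()` is used rather than indexing.

```python
from bs4 import BeautifulSoup


def centered_under_id(tag):
    """True for tags aligned to the center whose parent carries an id."""
    return tag.parent.has_attr("id") and tag.get("align") == "center"


soup = BeautifulSoup(
    '<div id="My gift">'
    '<p align="left">One</p><p align="center">Two</p><p>Three</p>'
    "</div>",
    "html.parser",
)

matches = soup.find_all(centered_under_id)
print([tag.text for tag in matches])
```

find_all() calls the function once for every tag in the document and keeps those for which it returns a true value.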
Original article: https://www.cnblogs.com/yangmaosen/p/12395995.html
