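1. The BeautifulSoup module takes an HTML or XML string and parses it into a tree of objects. You can then use the methods it provides to quickly locate specific elements, which makes searching HTML or XML simple.
2. Install the BeautifulSoup module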
pip3 install beautifulsoup4
3. Usage
Create the HTML
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""
Create the BeautifulSoup object
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")  # soup represents the whole HTML document
print(soup.prettify())  # print the contents of soup as indented, formatted output
name: the tag name
(1) Find the first a tag through the soup object
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")  # soup represents the whole HTML document
tag = soup.find('a')  # find() returns the first a tag
print(tag)
Output:
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
(2) Get the name of the a tag
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")  # soup represents the whole HTML document
tag = soup.find('a')  # find the first a tag
name = tag.name  # get the tag's name
print(name)
Output:
a
(3) Rename the a tag by assigning to its name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")  # soup represents the whole HTML document
tag = soup.find('a')  # find the first a tag
name = tag.name  # get the tag's name
tag.name = 'span'  # rename the a tag to span
print(tag)
Output:
<span href="http://www.dongdong.com">东东<p>东东内容</p></span>
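Note that find returns only the first match. As a minimal sketch, assuming the same html_doc as above (repeated here so the snippet runs on its own), find_all can be used to collect every a tag; the variable names first_link and all_links are illustrative only.

```python
from bs4 import BeautifulSoup

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

first_link = soup.find('a')      # only the first a tag
all_links = soup.find_all('a')   # a list-like result with every a tag

for link in all_links:
    # print each tag's name and its attribute dictionary
    print(link.name, link.attrs)
```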
attrs: tag attributes
(1) Get the attributes of the a tag through attrs
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
attrs = tag.attrs  # dictionary of the tag's attributes
print(attrs)
Output:
{'href': 'http://www.dongdong.com'}
(2) Replace the attributes of the a tag through attrs
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
tag.attrs = {'href': 'http://www.nannan.com'}  # replace the whole attribute dictionary
print(tag)
Output:
<a href="http://www.nannan.com">东东<p>东东内容</p></a>
(3) Add the attribute love="石头" to the tag through attrs
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
tag.attrs['love'] = '石头'  # add a new attribute
print(tag)
Output:
<a href="http://www.dongdong.com" love="石头">东东<p>东东内容</p></a>
(4) Delete the href attribute from the a tag through attrs
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
del tag.attrs['href']  # remove the href attribute
print(tag)
Output:
<a>东东<p>东东内容</p></a>
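Besides going through attrs, a single attribute can also be read directly from the tag. The following is a small sketch, assuming the same html_doc as above (repeated so the snippet is standalone); it uses the dictionary-style access, get, and has_attr that Tag objects provide. The 'target' attribute and its '_self' fallback are made up for illustration.

```python
from bs4 import BeautifulSoup

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
link = soup.find('a')

href = link['href']                    # dictionary-style access; raises KeyError if absent
missing = link.get('id')               # get() returns None when the attribute is absent
fallback = link.get('target', '_self') # get() with a default (hypothetical attribute)
has_href = link.has_attr('href')       # True if the attribute exists

print(href, missing, fallback, has_href)
```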
Tags and content
(1) Use children to get all direct children of body (tags and text nodes)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').children  # iterator over the direct children
print(list(tags))
Output:
['\n', <div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>, '\n', <a id="xixi">西西</a>, '\n', <div>
<p>南南内容</p>
</div>, '\n', <p>北北内容</p>, '\n']
(2) Use children to get the direct children of body, then iterate over them and use type(tag) to tell tags apart from text
from bs4 import BeautifulSoup
from bs4.element import Tag
soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').children
# Walk the children and use type() to separate tags from text nodes
for tag in tags:
    if type(tag) == Tag:  # a Tag instance is an element
        print('Tag:', tag, type(tag))
    else:  # anything else here is a text node
        print('Text....')
Output:
Text....
Tag: <div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div> <class 'bs4.element.Tag'>
Text....
Tag: <a id="xixi">西西</a> <class 'bs4.element.Tag'>
Text....
Tag: <div>
<p>南南内容</p>
</div> <class 'bs4.element.Tag'>
Text....
Tag: <p>北北内容</p> <class 'bs4.element.Tag'>
Text....
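The same separation can be written with isinstance, which is the more common Python idiom. This is only a sketch, assuming the same html_doc as above (repeated so the snippet is standalone); child_tags and text_nodes are illustrative names.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

# Keep only element children, skipping the whitespace text nodes between them.
child_tags = [child for child in body.children if isinstance(child, Tag)]
# Keep only the text nodes (here they are the newlines between the tags).
text_nodes = [child for child in body.children if isinstance(child, NavigableString)]

print([tag.name for tag in child_tags])
print(len(text_nodes))
```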
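(3) Use descendants to get every descendant of body (children, grandchildren and so on, visited recursively one by one)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').descendants  # iterator over all nested nodes
print(list(tags))
Output: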
['\n', <div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>, '\n', <a href="http://www.dongdong.com">东东<p>东东内容</p></a>, '东东', <p>东东内容</p>, '东东内容', '\n', '\n', <a id="xixi">西西</a>, '西西', '\n', <div>
<p>南南内容</p>
</div>, '\n', <p>南南内容</p>, '南南内容', '\n', '\n', <p>北北内容</p>, '北北内容', '\n']
(4) Use clear to remove all children of the body tag (the tag itself is kept)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
tag.clear()  # empty the tag's contents
print(soup)
Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body></body>
</html>
(5) decompose recursively removes the tag and everything inside it
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
body.decompose()  # remove body and destroy its contents
print(soup)
Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
</html>
(6) extract removes the tag and everything inside it, and returns the removed tag
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.extract()  # v is the removed body tag
print(v)
Output:
<body>
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
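The difference between decompose and extract is easiest to see side by side. The sketch below assumes the same html_doc as above (repeated so the snippet is standalone) and parses it twice so that each call starts from an untouched tree; soup_a, soup_b, removed and result are illustrative names.

```python
from bs4 import BeautifulSoup

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

# extract(): the tag leaves the tree but is returned and stays usable.
soup_a = BeautifulSoup(html_doc, features="html.parser")
removed = soup_a.find('body').extract()
print(removed.name)                   # the detached tag can still be inspected
print(removed.find('a')['href'])      # and searched like any other tag

# decompose(): the tag leaves the tree and nothing is returned.
soup_b = BeautifulSoup(html_doc, features="html.parser")
result = soup_b.find('body').decompose()
print(result)
```

A reasonable rule of thumb is to use extract when the removed fragment is needed later (for example to move it elsewhere) and decompose when it is simply being thrown away.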
(7) decode converts the object to a string (including the current tag)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.decode()  # string that includes the <body> tag itself
print(v)
Output:
<body>
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
(8) decode_contents converts the object to a string (excluding the current tag)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.decode_contents()  # string of the contents only, without <body>
print(v)
Output:
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
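To make the relationship between the two methods concrete, here is a short sketch assuming the same html_doc as above (repeated so the snippet is standalone). It compares decode, decode_contents and the built-in str; outer and inner are illustrative names.

```python
from bs4 import BeautifulSoup

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

outer = body.decode()            # markup including <body> ... </body>
inner = body.decode_contents()   # markup of the children only

print(outer == str(body))        # str(tag) gives the same text as decode()
print(outer.startswith('<body>'), '<body>' in inner)
```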
(10) find returns the first matching tag
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body').find('p', recursive=False)  # recursive=False searches direct children only; the default, True, searches all descendants
print(tag)
Output:
<p>北北内容</p>
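The effect of recursive is clearer when all matches are listed rather than just the first. This sketch assumes the same html_doc as above (repeated so the snippet is standalone) and uses find_all, which accepts the same recursive argument; direct_paragraphs and all_paragraphs are illustrative names.

```python
from bs4 import BeautifulSoup

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

# Only p tags that are direct children of body.
direct_paragraphs = body.find_all('p', recursive=False)
# Every p tag anywhere below body (the default behaviour).
all_paragraphs = body.find_all('p')

print([p.get_text() for p in direct_paragraphs])
print([p.get_text() for p in all_paragraphs])
```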
(11) get_text returns the text inside a tag
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
print(tag)
v = tag.get_text()  # concatenated text of the tag and all its descendants
print(v)
Output:
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
东东东东内容
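Because get_text simply concatenates the nested strings, the pieces run together. As a small sketch, assuming the same html_doc as above (repeated so the snippet is standalone), the separator and strip arguments of get_text can keep them apart; stripped_strings yields the same pieces one by one.

```python
from bs4 import BeautifulSoup

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
link = soup.find('a')

joined = link.get_text()                            # strings run together
spaced = link.get_text(separator=' ', strip=True)   # strings separated by a space
pieces = list(link.stripped_strings)                # each string on its own

print(joined)
print(spaced)
print(pieces)
```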
(12) index returns the position of a tag among its parent's children
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
v = tag.index(tag.find('div'))  # position of the first div inside body
print(v)
Output:
1
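(13) enumerate lists each child of a tag together with its index position
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
for i, v in enumerate(tag):  # iterating over a tag yields its direct children
    print(i, v)
Output: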
0
1 <div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
2
3 <a id="xixi">西西</a>
4
5 <div>
<p>南南内容</p>
</div>
6
7 <p>北北内容</p>
8
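The even-numbered entries above are the newline text nodes between the tags, which is why the first div sits at index 1 rather than 0. The sketch below, assuming the same html_doc as above (repeated so the snippet is standalone), uses the contents list to show only the positions occupied by tags; tag_positions is an illustrative name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

# contents is a plain list of the direct children, so it can be indexed.
tag_positions = [
    (position, child.name)
    for position, child in enumerate(body.contents)
    if isinstance(child, Tag)  # skip the whitespace text nodes
]
print(tag_positions)
print(body.contents[1].name)
```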
(14) append adds a tag at the end of the current tag's contents
from bs4 import BeautifulSoup
from bs4.element import Tag
soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i', attrs={'id': 'it'})  # build a new <i id="it"> tag
obj.string = '我是一个新来的'
tag = soup.find('body')
tag.append(obj)  # add it as the last child of body
print(soup)
Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
<i id="it">我是一个新来的</i></body>
</html>
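An alternative to constructing Tag directly is the new_tag method on the soup object, which the Beautiful Soup documentation describes for creating tags. This is a minimal sketch, assuming the same html_doc as above (repeated so the snippet is standalone); new_item is an illustrative name.

```python
from bs4 import BeautifulSoup

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

# Attributes are passed as keyword arguments to new_tag.
new_item = soup.new_tag('i', id='it')
new_item.string = '我是一个新来的'

body = soup.find('body')
body.append(new_item)  # becomes the last child of body

print(new_item)
```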
(15) insert adds a tag at a given position inside the current tag
from bs4 import BeautifulSoup
from bs4.element import Tag
soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i', attrs={'id': 'it'})  # build a new <i id="it"> tag
obj.string = '我是一个新来的'
tag = soup.find('body')
tag.insert(2, obj)  # position 2 is right after the first div
print(soup)
Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div><i id="it">我是一个新来的</i>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
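Counting positions by hand is error-prone because the whitespace text nodes count too. When the goal is "right after this element", insert_after (and its counterpart insert_before) avoids the index entirely. A sketch, assuming the same html_doc as above (repeated so the snippet is standalone); new_item and first_div are illustrative names.

```python
from bs4 import BeautifulSoup

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

new_item = soup.new_tag('i', id='it')
new_item.string = '我是一个新来的'

first_div = soup.find('div')
first_div.insert_after(new_item)  # place the new tag directly after the first div

body = soup.find('body')
print(body.index(new_item))
```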
(16) replace_with replaces the current tag with the given tag
from bs4 import BeautifulSoup
from bs4.element import Tag
soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i', attrs={'id': 'it'})  # build a new <i id="it"> tag
obj.string = '我是一个新来的'
tag = soup.find('div')  # the first div
tag.replace_with(obj)  # swap the first div for the new tag
print(soup)
Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<i id="it">我是一个新来的</i>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
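replace_with also returns the element that was replaced, so it can be inspected or reused afterwards. The following sketch assumes the same html_doc as above (repeated so the snippet is standalone); new_item and old_div are illustrative names.

```python
from bs4 import BeautifulSoup

# Same document as in the tutorial, repeated so this snippet is standalone.
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

new_item = soup.new_tag('i', id='it')
new_item.string = '我是一个新来的'

# The return value is the div that was taken out of the tree.
old_div = soup.find('div').replace_with(new_item)

print(old_div.name)
print(len(soup.find_all('div')))  # only the second div is left in the tree
```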