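1 Preface
Beautiful Soup is a Python library for extracting data from HTML and XML files. Its extraction features are powerful enough that most people stop hunting for HTML nodes with regular expressions. Beautiful Soup can reach a node directly through its tag name or look it up with a search function, which greatly speeds up development. If you still cannot use Beautiful Soup after reading this article, not even the gods can help you. If you find Knowledge Seeker's articles interesting, a follow and a like are appreciated.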
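2 Basic usage of Beautiful Soup
Beautiful Soup supports the following parsers:
Parser | Usage example |
---|---|
Python standard library | BeautifulSoup(markup, "html.parser") |
lxml HTML parser | BeautifulSoup(markup, "lxml") |
lxml XML parser | BeautifulSoup(markup, "xml") |
html5lib | BeautifulSoup(markup, "html5lib") |
For this article you can use either the Python standard library parser or the lxml HTML parser. Keep in mind that whenever the text below says it gets a tag, what you actually get is a Tag object.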
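Since lxml is an optional third-party package, a script may want to fall back to the built-in parser when it is missing. The following is a minimal sketch of that idea; the helper name make_soup is made up for this illustration and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised when the requested parser is not installed;
        # html.parser ships with Python, so it is always available.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='note'>hello</p>")
print(soup.p.string)  # hello
```

The two parsers can build slightly different trees for broken HTML, so it is safer to settle on one parser for a given project.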
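A quick summary of the attributes:
Attribute | Meaning |
---|---|
soup.tag.name | Name of the tag |
soup.tag.string | Text content of the tag (None when the tag has more than one child node) |
soup.tag | First tag with that name |
soup.tag.attrs | All attributes of the tag, as a dict |
soup.tag.attrs['class'] | Value of the tag's class attribute |
soup.tag1.tag2 | First tag2 nested inside tag1 |
soup.tag.contents | All direct child nodes of the tag, as a list |
soup.tag.children | Direct child nodes, as an iterator |
soup.tag.descendants | All descendant nodes, as a generator |
soup.tag.parent | Direct parent node |
soup.tag.parents | All ancestor nodes, as a generator |
soup.tag.next_sibling | Next sibling node |
soup.tag.previous_sibling | Previous sibling node |
soup.tag.next_siblings | All following sibling nodes, as a generator |
soup.tag.previous_siblings | All preceding sibling nodes, as a generator |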
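To see several of these attributes side by side, here is a small sketch on a throwaway snippet. The markup is invented for this illustration and contains no whitespace between tags, so the sibling attributes return tags rather than newline text nodes.

```python
from bs4 import BeautifulSoup

# Made-up markup, written without whitespace between the tags
soup = BeautifulSoup(
    '<ul id="menu"><li class="first">A</li><li>B</li></ul>', "html.parser"
)

first = soup.li                 # first <li> in the document
print(first.name)               # li
print(first.string)             # A
print(first.attrs)              # {'class': ['first']}
print(first.parent.name)        # ul
print(first.next_sibling)       # <li>B</li>
print([child.name for child in soup.ul.children])  # ['li', 'li']
```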
2.1 Formatting HTML
- Create a BeautifulSoup instance, passing the HTML string and the parser name html.parser
- Call the prettify() method to format the HTML document
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
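# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
The output is shown below. It is neatly indented with a clear structure, and the parser has also added the missing closing tags </form> and </div>: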
<div class="filter-box d-flex align-items-center">
 <form action="" id="seeOriginal">
  <dl class="filter-sort-box d-flex align-items-center">
   <dt>
    排序:
   </dt>
   <dd>
    <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">
     默认
    </a>
   </dd>
   <dd>
    <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
     <svg aria-hidden="true" class="icon">
      <use xlink:href="#csdnc-rss">
      </use>
     </svg>
     RSS订阅
    </a>
   </dd>
  </dl>
 </form>
</div>
2.2 Getting a tag node
- soup.dt returns the first dt tag in the document as a Tag object.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
# Prints the node: <dt>排序:</dt>
print(soup.dt)
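2.3 Getting the text of a node
soup.dt.string returns the text contained in the dt tag. Note that .string is None when a tag has more than one child node.
# -*- coding: utf-8 -*-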
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
# Prints the text content: 排序:
print(soup.dt.string)
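One caveat is worth a quick sketch: .string only works when a tag has a single child. In the simplified, made-up snippet below, the a tag contains both an svg tag and a text node, so .string is None and get_text() is the safer way to read the text.

```python
from bs4 import BeautifulSoup

# Simplified markup for illustration: <a> has two children (svg and text)
soup = BeautifulSoup('<dd><a href="#"><svg></svg>RSS</a></dd>', "html.parser")
link = soup.a

print(link.string)                # None, because <a> has more than one child
print(link.get_text(strip=True))  # RSS
```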
2.4 Getting the name of a node
soup.dt.name returns the name of the dt tag.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
# Prints: dt
print(soup.dt.name)
2.5 Getting the type of a node object
Calling the built-in type() on a tag obtained this way shows that it is a <class 'bs4.element.Tag'> object.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html,'html.parser')
dt = soup.dt
# <class 'bs4.element.Tag'>
print(type(dt))
2.6 Getting all attributes
soup.a.attrs returns all attributes of the first a tag as a dict.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.attrs)
This prints all attributes of the first a tag in the document:
{'href': 'javascript:void(0);', 'data-report-query': '', 'class': ['btn-filter-sort', 'active'], 'target': '_self'}
2.7 Getting a specific attribute
soup.a.attrs['href'] returns the value of the href attribute of the first a tag.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
# Prints: javascript:void(0);
print(soup.a.attrs['href'])
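A Tag can also be indexed like a dict, which is a shorter way to read one attribute. The sketch below uses made-up markup and shows the difference between tag['href'], which raises KeyError for a missing attribute, and tag.get(), which returns None or a default instead.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
soup = BeautifulSoup('<a href="/rss" class="btn rss">RSS</a>', "html.parser")
link = soup.a

print(link["href"])               # /rss  (same as link.attrs['href'])
print(link["class"])              # ['btn', 'rss']  (class is multi-valued)
print(link.get("title"))          # None, the attribute is missing
print(link.get("title", "n/a"))   # n/a, an explicit default
```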
2.8 Getting a child node
soup.form.dd returns the first dd tag nested inside the form tag.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.form.dd)
Output:
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
2.9 Getting all direct child nodes
soup.form.contents returns all direct child nodes of form as a list, including whitespace text nodes.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.form.contents)
Output (the first element is the newline text node that follows the form tag):
['\n', <dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>]
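Because .contents mixes tags with whitespace text nodes, it is often useful to keep only the tags. The following sketch, on a simplified made-up snippet, filters with isinstance and shows an equivalent search call.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified markup for illustration; the newlines become text nodes
html = "<dl>\n<dt>Sort</dt>\n<dd>Default</dd>\n</dl>"
soup = BeautifulSoup(html, "html.parser")

print(len(soup.dl.contents))  # 5: newline, dt, newline, dd, newline

# Keep only the Tag objects among the direct children
tags_only = [child for child in soup.dl.contents if isinstance(child, Tag)]
print([tag.name for tag in tags_only])  # ['dt', 'dd']

# find_all(True, recursive=False) selects the same direct child tags
same = soup.dl.find_all(True, recursive=False)
print([tag.name for tag in same])  # ['dt', 'dd']
```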
2.10 Getting an iterator over direct child nodes
soup.svg.children returns an iterator over the direct child nodes of the svg tag.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
for index, child in enumerate(soup.svg.children):
    print(index, child)
Output (children 0 and 2 are the newline text nodes around the use tag):
0
1 <use xlink:href="#csdnc-rss"></use>
2
2.11 Getting a generator of all descendant nodes
soup.dl.descendants yields every descendant of the dl tag, not only its direct children.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
for index, child in enumerate(soup.dl.descendants):
    print(index, child)
Output (the indexes with nothing after them are newline text nodes):
0
1 <dt>排序:</dt>
2 排序:
3
4 <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
5 <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a>
6 默认
7
8 <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
9 <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
10
11 <svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>
12
13 <use xlink:href="#csdnc-rss"></use>
14
15 RSS订阅
16
17
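Most of the entries above are whitespace. If only the tags or only the visible text matter, the descendants can be filtered. This is a small sketch on simplified, made-up markup; stripped_strings is the Beautiful Soup generator that yields text nodes with surrounding whitespace removed and skips the empty ones.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified markup for illustration
html = "<dl>\n<dt>Sort:</dt>\n<dd><a href='#'>Default</a></dd>\n</dl>"
soup = BeautifulSoup(html, "html.parser")

# Only the tags among the descendants, in document order
tag_names = [node.name for node in soup.dl.descendants if isinstance(node, Tag)]
print(tag_names)  # ['dt', 'dd', 'a']

# Only the non-blank text
texts = list(soup.dl.stripped_strings)
print(texts)  # ['Sort:', 'Default']
```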
2.12 Getting the direct parent node
soup.a.parent returns the parent Tag object of the first a tag.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.parent)
Output:
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
2.13 Getting a generator of ancestor nodes
soup.a.parents yields every ancestor of the first a tag, from its direct parent up to the document itself.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
# Walk up from the first a tag to the document root
for node in soup.a.parents:
    print(node.name)
Output:
dd
dl
form
div
[document]
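When only one particular ancestor is needed, find_parent() saves walking the whole chain by hand. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
soup = BeautifulSoup(
    '<div><form id="f"><dl><dd><a href="#">x</a></dd></dl></form></div>',
    "html.parser",
)
link = soup.a

print([parent.name for parent in link.parents])
# ['dd', 'dl', 'form', 'div', '[document]']

print(link.find_parent("form")["id"])  # f
print(link.find_parent("table"))       # None, no such ancestor
```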
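2.14 Getting sibling nodes
Sibling nodes have a pitfall: the whitespace between two tags is itself a text node, so the nearest sibling is usually a newline rather than a tag.
# -*- coding: utf-8 -*-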
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
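# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.dt.next_sibling)
The output is blank, because the next sibling of dt is the newline text node that follows it. The other sibling attributes behave the same way on this document (they return whitespace text nodes, or None when there is no sibling), so they are not shown here.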
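To skip the whitespace and land on the next tag, the search methods find_next_sibling() and find_next_siblings() can be used instead of the plain attributes. A minimal sketch on simplified, made-up markup:

```python
from bs4 import BeautifulSoup

# Simplified markup for illustration; the newlines become text nodes
html = "<dl>\n<dt>Sort</dt>\n<dd>Default</dd>\n<dd>RSS</dd>\n</dl>"
soup = BeautifulSoup(html, "html.parser")

print(repr(soup.dt.next_sibling))        # '\n', the whitespace text node
print(soup.dt.find_next_sibling("dd"))   # <dd>Default</dd>

following = [dd.string for dd in soup.dt.find_next_siblings("dd")]
print(following)                         # ['Default', 'RSS']
```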
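3 Searching the document
Section 2 only laid the groundwork; this section is the important one.
Function | Meaning |
---|---|
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) | Returns all matching nodes |
find(name=None, attrs={}, recursive=True, text=None, **kwargs) | Returns the first matching node |
find_parent(name=None, attrs={}, **kwargs) | Returns the closest matching ancestor of the current node |
find_parents(name=None, attrs={}, **kwargs) | Returns all matching ancestors of the current node |
find_next_sibling(name=None, attrs={}, text=None, **kwargs) | Returns the first matching sibling after the current node |
find_next_siblings(name=None, attrs={}, text=None, **kwargs) | Returns all matching siblings after the current node |
find_previous_sibling(name=None, attrs={}, text=None, **kwargs) | Returns the first matching sibling before the current node |
find_previous_siblings(name=None, attrs={}, text=None, **kwargs) | Returns all matching siblings before the current node |
find_next(name=None, attrs={}, text=None, **kwargs) | Returns the first matching node after the current node in document order |
find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs) | Returns all matching nodes after the current node |
find_previous(name=None, attrs={}, text=None, **kwargs) | Returns the first matching node before the current node in document order |
find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs) | Returns all matching nodes before the current node |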
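- name is the tag name to match
- attrs matches tags by a dict of attribute values
- recursive controls whether all descendants are searched (the default, True); set it to False to search only direct children
- limit caps the number of results returned
- **kwargs lets you filter on an attribute by keyword, for example id='zszxz'
- text filters on the text content of nodes (Beautiful Soup 4.4.0 and later also accept it under the name string)
This section focuses on find_all. find takes the same arguments but returns only the first match (or None when nothing matches), so once you know one you know the other.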
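The difference in return values matters in practice: code that chains attribute access onto find() needs a None check. The sketch below uses made-up markup and also shows a text filter through the string keyword, which assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
soup = BeautifulSoup(
    "<dl><dt>Sort</dt><dd>Default</dd><dd>RSS</dd></dl>", "html.parser"
)

print(soup.find("dd"))               # <dd>Default</dd>
print(soup.find_all("dd", limit=1))  # [<dd>Default</dd>]

print(soup.find("table"))            # None when nothing matches
print(soup.find_all("table"))        # [] when nothing matches

# Filter on text content; the result holds text nodes, not tags
print(soup.find_all(string="RSS"))   # ['RSS']
```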
3.1 name parameter example
soup.find_all(name='dd') returns all dd Tag objects as a list.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(name='dd'))
Output:
[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>, <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>]
Note: soup.find_all(name='dd') is equivalent to soup.find_all('dd').
3.2 attrs parameter example
soup.find_all(attrs={'id':'seeOriginal'}) returns all Tag objects whose id attribute equals seeOriginal.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(attrs={'id':'seeOriginal'}))
Output:
[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl></form>]
3.3 recursive parameter example
soup.find_all('dl', recursive=False) searches only the direct children of soup. The dl tag is nested inside div and form, so with recursive set to False it is not found.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('dl', recursive=False))
The output is an empty list: []
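The result depends on which node the search starts from. As a rough sketch on simplified, made-up markup, the same call finds the dl tag once it is run on its direct parent:

```python
from bs4 import BeautifulSoup

# Simplified markup for illustration
html = "<div><form><dl><dt>Sort</dt><dd>Default</dd></dl></form></div>"
soup = BeautifulSoup(html, "html.parser")

# From the document root only <div> is a direct child
print(soup.find_all("dl", recursive=False))                       # []
print([t.name for t in soup.find_all("div", recursive=False)])    # ['div']

# From <form>, <dl> is a direct child and is found
print([t.name for t in soup.form.find_all("dl", recursive=False)])  # ['dl']

# True matches any tag: list the direct child tags of <dl>
print([t.name for t in soup.dl.find_all(True, recursive=False)])    # ['dt', 'dd']
```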
3.4 limit parameter example
soup.find_all('dd', limit=1) limits the result to a single match.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('dd', limit=1))
Output:
[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>]
3.5 kwargs example: matching an attribute
soup.find_all(id='seeOriginal') searches by the id attribute directly.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(id='seeOriginal'))
Output:
[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl></form>]
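3.6 kwargs example: matching with a regular expression
soup.find_all(href=re.compile("java.*?")) returns tags whose href attribute matches the regular expression. Beautiful Soup tests the pattern with search(), so this pattern matches any href that contains java; use "^java" to require that the value starts with java.
# -*- coding: utf-8 -*-
import re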
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(href=re.compile("java.*?")))
Output:
[<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a>]
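The effect of anchoring the pattern is easiest to see with two links that both contain java. In this sketch the markup and the example.com URL are made up for illustration; the last call uses a function as the filter, which Beautiful Soup calls with the attribute value.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup: both href values contain "java"
html = (
    '<a href="javascript:void(0);">Default</a>'
    '<a href="https://example.com/java-rss">RSS</a>'
)
soup = BeautifulSoup(html, "html.parser")

anywhere = soup.find_all(href=re.compile("java"))
at_start = soup.find_all(href=re.compile("^java"))
print(len(anywhere))  # 2, the pattern is searched anywhere in the value
print(len(at_start))  # 1, only the javascript: link

# A function filter receives the attribute value (None if it is absent)
https_links = soup.find_all(
    "a", href=lambda value: value is not None and value.startswith("https")
)
print([a.string for a in https_links])  # ['RSS']
```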
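3.7 Searching by CSS class
soup.find_all("a", class_="btn") finds a tags whose class attribute contains the class btn. The match is per class name, so the first a tag (whose class is btn-filter-sort) is not returned.
# -*- coding: utf-8 -*-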
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all("a", class_="btn"))
Output:
[<a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>]
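To make the per-class matching concrete, here is a short sketch on simplified, made-up markup. It also shows one way to require several classes at once, using a CSS selector through select().

```python
from bs4 import BeautifulSoup

# Simplified markup for illustration
html = (
    '<a class="btn-filter-sort active">Default</a>'
    '<a class="btn btn-sm rss">RSS</a>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class name in the attribute
by_btn = [a.string for a in soup.find_all("a", class_="btn")]
by_active = [a.string for a in soup.find_all("a", class_="active")]
print(by_btn)     # ['RSS']
print(by_active)  # ['Default']

# Require both classes, in any order, with a CSS selector
both = [a.string for a in soup.select("a.rss.btn")]
print(both)       # ['RSS']
```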
4 CSS selectors
Beautiful Soup also supports searching with CSS selectors directly through select(). Commonly used selectors are shown below.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""
# Initialize the soup
soup = BeautifulSoup(html, 'html.parser')
# Descendant selector: dt tags inside dl
dt_list = soup.select('dl dt')
print(dt_list)
dd_list = soup.select('dl dd')
print(dd_list[0])
# id selector
by_id = soup.select('#seeOriginal')
print(by_id)
# class selector
by_class = soup.select('.btn-filter-sort')
print(by_class[0])
The outputs are, in order:
soup.select('dl dt')
[<dt>排序:</dt>]
soup.select('dl dd')[0]
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
soup.select('#seeOriginal')
[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl></form>]
soup.select('.btn-filter-sort')[0]
<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a>
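A few more selector forms are handy alongside the ones above. This is a sketch on a trimmed-down version of the sample markup (the text is shortened for illustration); it assumes a Beautiful Soup version recent enough to bundle the soupsieve selector engine (4.7 or later).

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the sample markup, for illustration
html = """<form id="seeOriginal"><dl class="filter-sort-box">
<dt>Sort</dt>
<dd><a class="btn-filter-sort active" href="javascript:void(0);">Default</a></dd>
<dd><a class="btn rss" href="https://blog.csdn.net/youku1327/rss/list">RSS</a></dd>
</dl></form>"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when nothing matches
first_link = soup.select_one("dl dd a")
print(first_link.string)          # Default
print(soup.select_one("table"))   # None

# Child combinator: each step must be a direct child
child_links = [a.string for a in soup.select("#seeOriginal > dl > dd > a")]
print(child_links)                # ['Default', 'RSS']

# Attribute selector: href values that start with https
https_hrefs = [a["href"] for a in soup.select('a[href^="https"]')]
print(https_hrefs)                # ['https://blog.csdn.net/youku1327/rss/list']
```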