• Learn Beautiful Soup with Knowledge Seeker (知识追寻者): if you still can't learn it, I won't fight back or talk back


    1 Preface

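    Beautiful Soup is a Python library for extracting data from HTML and XML files. Its extraction capabilities are strong enough that Knowledge Seeker has given up locating HTML nodes with regular expressions. Beautiful Soup can fetch a node directly by its HTML tag name or through a search function, which saves developers a great deal of time. If you finish this article and still can't use Beautiful Soup, nothing will save you. If you find Knowledge Seeker's articles interesting, a follow and a like are appreciated.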

    2 Basic usage of Beautiful Soup

    Beautiful Soup supports the following parsers:

    Parser                   Example
    Python standard library  BeautifulSoup(markup, "html.parser")
    lxml HTML parser         BeautifulSoup(markup, "lxml")
    lxml XML parser          BeautifulSoup(markup, "xml")
    html5lib                 BeautifulSoup(markup, "html5lib")

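    For this article, readers can use either the Python standard library parser or the lxml HTML parser. Keep in mind that whenever the text below says it "gets a tag", what you actually get is a Tag object.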

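    A brief summary of the attributes:

    Attribute                   Meaning
    soup.tag.name               name of the tag
    soup.tag.string             text content of the tag (None when the tag has several children)
    soup.tag                    first tag with that name
    soup.tag.attrs              all attributes of the tag, as a dict
    soup.tag.attrs['class']     value of the tag's class attribute
    soup.tag1.tag2              first tag2 nested inside tag1
    soup.tag.contents           all direct children of the tag, as a list
    soup.tag.children           direct children, as an iterator
    soup.tag.descendants        all descendants (children, grandchildren, ...), as a generator
    soup.tag.parent             direct parent node
    soup.tag.parents            all ancestor nodes, as a generator
    soup.tag.next_sibling       next sibling node
    soup.tag.previous_sibling   previous sibling node
    soup.tag.next_siblings      all following sibling nodes, as a generator
    soup.tag.previous_siblings  all preceding sibling nodes, as a generator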
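
    The attributes above can be tried on a tiny document. This is a minimal sketch: the one-line markup is a shortened stand-in for the article's sample, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (illustrative assumption)
html = '<dl class="filter-sort-box"><dt>Sort:</dt><dd><a href="/rss" class="btn rss">RSS</a></dd></dl>'
soup = BeautifulSoup(html, "html.parser")

tag_name = soup.dt.name                              # name of the first <dt>
tag_text = soup.dt.string                            # its single text child
link_attrs = soup.a.attrs                            # dict of every attribute on the first <a>
nested_link = soup.dl.dd.a                           # chained access: <a> inside <dd> inside <dl>
child_names = [c.name for c in soup.dl.children]     # direct children only
ancestor_names = [p.name for p in soup.a.parents]    # nearest ancestor first
following_name = soup.dt.next_sibling.name           # no whitespace between the tags here

print(tag_name, tag_text, link_attrs)
print(nested_link["href"], child_names, ancestor_names, following_name)
```

    The markup is written on one line on purpose: with no whitespace between tags, the sibling and children attributes return tags rather than whitespace text nodes.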

    2.1 Formatting HTML

    1. Create a BeautifulSoup instance, passing the HTML and the parser name html.parser
    2. Call prettify() to get the document back as a formatted string
    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
    

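    The output is shown below. It is tidy and the structure is easy to read, and the missing closing tags </form> and </div> have been filled in by the parser: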

    <div class="filter-box d-flex align-items-center">
     <form action="" id="seeOriginal">
      <dl class="filter-sort-box d-flex align-items-center">
       <dt>
        排序:
       </dt>
       <dd>
        <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">
         默认
        </a>
       </dd>
       <dd>
        <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
         <svg aria-hidden="true" class="icon">
          <use xlink:href="#csdnc-rss">
          </use>
         </svg>
         RSS订阅
        </a>
       </dd>
      </dl>
     </form>
    </div>
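
    The closing tags come from the parser rather than from prettify() itself, which a minimal sketch can show. The shortened markup below is an assumption made for illustration, not the article's full sample.

```python
from bs4 import BeautifulSoup

# Shortened, deliberately unclosed markup (illustrative assumption)
html = "<div><form><dl><dt>Sort:</dt></dl>"
soup = BeautifulSoup(html, "html.parser")

# str(soup) serializes the parsed tree without re-indenting it
repaired = str(soup)
print(repaired)

# prettify() returns the same tree as an indented string
pretty = soup.prettify()
print(pretty)
```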
    

    2.2 Getting a tag node

    1. soup.dt returns the first dt Tag object found in the document.
    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    # Prints the node: <dt>排序:</dt>
    print(soup.dt)
    

    2.3 Getting a node's text

    soup.dt.string returns the text contained in the dt tag.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    # Prints the text: 排序:
    print(soup.dt.string)
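
    One caveat worth a quick sketch: .string only returns text when the tag has a single child. The shortened markup and variable names below are assumptions for illustration; get_text() is shown as one alternative when a tag mixes text with other tags.

```python
from bs4 import BeautifulSoup

# Shortened markup modeled on the sample (illustrative assumption)
html = '<dd><a href="/rss"><svg></svg>RSS</a></dd><dt>Sort:</dt>'
soup = BeautifulSoup(html, "html.parser")

single = soup.dt.string        # <dt> has exactly one text child
mixed = soup.a.string          # <a> has two children (<svg> and text), so this is None
all_text = soup.a.get_text()   # joins every text node found below <a>

print(single, mixed, all_text)
```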
    

    2.4 Getting a node's name

    soup.dt.name returns the name of the dt tag.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    # Prints: dt
    print(soup.dt.name)
    

    2.5 Getting the type of a node object

    Passing a tag to the built-in type() function shows that it is an instance of <class 'bs4.element.Tag'>.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    dt = soup.dt
    # Prints: <class 'bs4.element.Tag'>
    print(type(dt))
    

    2.6 Getting all attributes

    soup.a.attrs returns all attributes of the first a tag found, as a dict.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.a.attrs)
    

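    The output lists every attribute of the first a tag found; class can hold several values, so it comes back as a list: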

    {'href': 'javascript:void(0);', 'data-report-query': '', 'class': ['btn-filter-sort', 'active'], 'target': '_self'}
    

    2.7 Getting a specific attribute

    soup.a.attrs['href'] returns the value of the href attribute of the first a tag found.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    # Prints: javascript:void(0);
    print(soup.a.attrs['href'])
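
    A Tag can also be indexed like a dict, which is a little shorter than going through attrs. The sketch below assumes a single link copied from the sample; get() and has_attr() are shown as ways to avoid a KeyError when an attribute may be missing.

```python
from bs4 import BeautifulSoup

# One link taken from the sample document, with English text (illustrative assumption)
html = '<a href="javascript:void(0);" class="btn-filter-sort active" target="_self">Default</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

href = link["href"]                    # shorthand for link.attrs["href"]
classes = link["class"]                # multi-valued attribute, returned as a list
missing = link.get("id")               # None instead of KeyError when the attribute is absent
has_target = link.has_attr("target")   # explicit existence check

print(href, classes, missing, has_target)
```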
    

    2.8 Getting a child node

    soup.form.dd returns the first dd tag found anywhere below the form tag (it does not have to be a direct child).

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.form.dd)
    

    Output:

    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
    

    2.9 Getting all direct children

    soup.form.contents returns all direct children of form as a list, including the whitespace text nodes between tags.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.form.contents)
    

    Output:

    ['\n', <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS订阅</a>
    </dd>
    </dl>]
    

    2.10 Getting an iterator over direct children

    soup.svg.children returns an iterator over the direct children of the svg tag.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    for index, child in enumerate(soup.svg.children):
        print(index, child)
    

    Output:

    0 
    
    1 <use xlink:href="#csdnc-rss"></use>
    2 
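
    The empty entries in the output are whitespace text nodes. One possible way to keep only the tags, sketched on a shortened copy of the svg fragment (an assumption for illustration), is to filter on the Tag class or to call find_all(True, recursive=False):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the svg fragment from the sample (illustrative assumption)
html = """<svg class="icon">
    <use xlink:href="#csdnc-rss"></use>
</svg>"""
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.svg.children)                              # whitespace, <use>, whitespace
tag_children = [c for c in soup.svg.children if isinstance(c, Tag)] # drop the text nodes
same_result = soup.svg.find_all(True, recursive=False)              # True matches any tag

print(len(all_children))
print([t.name for t in tag_children])
print([t.name for t in same_result])
```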
    

    2.11 Getting a generator over all descendants

    soup.dl.descendants yields every descendant of the dl tag, not just its direct children: nested tags and text nodes are included.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    for index, child in enumerate(soup.dl.descendants):
        print(index, child)
    

    Output:

    0 
    
    1 <dt>排序:</dt>
    2 排序:
    3 
    
    4 <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
    5 <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a>
    6 默认
    7 
    
    8 <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS订阅</a>
    </dd>
    9 <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS订阅</a>
    10 
    
    11 <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>
    12 
    
    13 <use xlink:href="#csdnc-rss"></use>
    14 
    
    15 RSS订阅
    16 
    
    17 
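
    When only the text or only the tags are wanted from descendants, the whitespace nodes get in the way. A small sketch on shortened markup (an assumption for illustration) using stripped_strings and a Tag filter:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened version of the sample <dl>, keeping the line breaks (illustrative assumption)
html = """<dl>
<dt>Sort:</dt>
<dd><a href="#">Default</a></dd>
<dd><a href="/rss"><svg><use></use></svg>RSS</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed,
# skipping nodes that are whitespace only
texts = list(soup.dl.stripped_strings)

# Keep only Tag objects while walking every descendant in document order
tag_names = [node.name for node in soup.dl.descendants if isinstance(node, Tag)]

print(texts)
print(tag_names)
```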
    
    

    2.12 Getting the direct parent

    soup.a.parent returns the parent Tag object of the first a tag found.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.a.parent)
    

    Output:

    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
    

    2.13 Getting a generator over ancestors

    soup.a.parents yields every ancestor of the first a tag found, from its direct parent up to the document root.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    for node in soup.a.parents:
        if node is None:
            print(node)
        else:
            print(node.name)
    
    

    Output:

    dd
    dl
    form
    div
    [document]
    

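    2.14 Getting sibling nodes

    Sibling attributes have a pitfall: the whitespace between two tags is itself a text node, so next_sibling usually returns that whitespace rather than the next tag. They are therefore not covered in depth here.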

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.dt.next_sibling)
    

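    The output is blank, because the next sibling of dt is the newline text node that follows it. The other sibling attributes behave the same way on markup like this and return either whitespace or None, so they are not shown.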
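
    If the next tag is what you want, find_next_sibling() is one way around the whitespace, since it only returns matching tags. A minimal sketch on shortened markup (an illustrative assumption):

```python
from bs4 import BeautifulSoup

# Shortened markup with line breaks between the tags (illustrative assumption)
html = """<dl>
<dt>Sort:</dt>
<dd>first</dd>
<dd>second</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

raw = soup.dt.next_sibling                    # the newline text node after </dt>
next_tag = soup.dt.find_next_sibling("dd")    # skips text nodes, returns the first <dd>
later = [dd.get_text() for dd in soup.dt.find_next_siblings("dd")]

print(repr(raw))
print(next_tag)
print(later)
```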

    3 Searching the document

    Section 2 only laid the groundwork; this section is the important one.

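    Function                                                                        Meaning
    find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)  all matching nodes
    find(name=None, attrs={}, recursive=True, text=None, **kwargs)                  first matching node
    find_parent(name=None, attrs={}, **kwargs)                                      nearest matching ancestor of the current node
    find_parents(name=None, attrs={}, limit=None, **kwargs)                         all matching ancestors of the current node
    find_next_sibling(name=None, attrs={}, text=None, **kwargs)                     first matching sibling after the current node
    find_next_siblings(name=None, attrs={}, text=None, limit=None, **kwargs)        all matching siblings after the current node
    find_previous_sibling(name=None, attrs={}, text=None, **kwargs)                 first matching sibling before the current node
    find_previous_siblings(name=None, attrs={}, text=None, limit=None, **kwargs)    all matching siblings before the current node
    find_next(name=None, attrs={}, text=None, **kwargs)                             first matching node after the current node, in document order
    find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs)             all matching nodes after the current node
    find_previous(name=None, attrs={}, text=None, **kwargs)                         first matching node before the current node, in document order
    find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs)         all matching nodes before the current node
    1. name: the tag name to match
    2. attrs: a dict of attribute values to match
    3. recursive: whether to search all descendants (default True); with recursive=False only direct children are searched
    4. limit: the maximum number of results to return
    5. **kwargs: any other keyword is treated as an attribute filter, for example id='zszxz'
    6. text: filters on the text content of a node; Beautiful Soup 4.4.0 and later call this argument string and keep text as an older alias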
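
    The directional searches in the table can be sketched as follows. The compact markup and the variable names are assumptions for illustration; every call starts from the first a tag.

```python
from bs4 import BeautifulSoup

# Compact version of the sample structure (illustrative assumption)
html = ('<form id="f"><dl><dt>Sort:</dt>'
        '<dd><a href="#">Default</a></dd>'
        '<dd><a href="/rss">RSS</a></dd></dl></form>')
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

enclosing_form = first_link.find_parent("form")      # nearest <form> ancestor
ancestors = [p.name for p in first_link.find_parents(["dl", "form"])]  # nearest first
next_link = first_link.find_next("a")                # next <a> in document order
previous_dt = first_link.find_previous("dt")         # closest <dt> earlier in the document

print(enclosing_form["id"], ancestors)
print(next_link["href"], previous_dt.get_text())
```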

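    This section concentrates on find_all. find takes the same arguments but returns only the first match (or None when nothing matches), so once you know one you can use the other.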
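
    The practical difference between find and find_all shows up in the return value, especially when nothing matches. A short sketch, assuming a cut-down version of the sample markup; string is the newer name of the text argument (Beautiful Soup 4.4.0 and later):

```python
from bs4 import BeautifulSoup

# Cut-down version of the sample markup (illustrative assumption)
html = '<dl><dt>Sort:</dt><dd>Default</dd><dd>RSS</dd></dl>'
soup = BeautifulSoup(html, "html.parser")

first_dd = soup.find("dd")                # a single Tag: the first match
all_dd = soup.find_all("dd")              # a list-like ResultSet of every match
no_match = soup.find("table")             # None when nothing matches
empty = soup.find_all("table")            # an empty list when nothing matches
by_text = soup.find("dd", string="RSS")   # combine a tag name with a text filter

print(first_dd, len(all_dd), no_match, empty, by_text)
```

    Because find can return None, it is worth checking the result before chaining attribute access onto it.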

    3.1 The name argument

    soup.find_all(name='dd') returns a list of all dd Tag objects.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all(name='dd'))
    
    

    Output:

    [<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>, <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS订阅</a>
    </dd>]
    

    Note: soup.find_all(name='dd') is equivalent to soup.find_all('dd').
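
    Besides a single tag name, the name argument also accepts a list of names, True (any tag), or a function that receives each Tag. A brief sketch on shortened markup (an illustrative assumption; is_rss_item is a hypothetical helper):

```python
from bs4 import BeautifulSoup

# Shortened markup (illustrative assumption)
html = '<dl><dt>Sort:</dt><dd>Default</dd><dd>RSS</dd></dl>'
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names
by_list = [t.name for t in soup.find_all(["dt", "dd"])]

# True matches every tag and skips text nodes
every_tag = [t.name for t in soup.find_all(True)]

# A function is called with each Tag and keeps those for which it returns True
def is_rss_item(tag):
    return tag.name == "dd" and tag.string == "RSS"

by_function = soup.find_all(is_rss_item)

print(by_list, every_tag, by_function)
```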

    3.2 The attrs argument

    soup.find_all(attrs={'id':'seeOriginal'}) returns all Tag objects whose id attribute equals seeOriginal.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all(attrs={'id':'seeOriginal'}))
    
    

    Output:

    [<form action="" id="seeOriginal">
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS订阅</a>
    </dd>
    </dl></form>]
    

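    3.3 The recursive argument

    soup.find_all('dl') normally searches every descendant. With recursive=False only the direct children of soup are examined; dl sits inside div and form, so it is no longer found.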

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all('dl',recursive=False))
    

    The output is an empty list: []
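
    recursive=False is more useful when called on the tag that directly contains what you are after. A minimal sketch with shortened markup (an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Shortened nesting: div > form > dl (illustrative assumption)
html = '<div><form><dl><dt>Sort:</dt></dl></form></div>'
soup = BeautifulSoup(html, "html.parser")

top_level = soup.find_all("dl", recursive=False)        # only <div> is a direct child of soup
from_form = soup.form.find_all("dl", recursive=False)   # <dl> is a direct child of <form>
anywhere = soup.find_all("dl")                          # default: search all descendants

print(top_level, len(from_form), len(anywhere))
```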

    3.4 The limit argument

    soup.find_all('dd',limit=1) limits the result to a single match.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all('dd',limit=1))
    
    

    Output:

    [<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>]
    

    3.5 kwargs: matching an attribute

    soup.find_all(id='seeOriginal') searches by passing the id attribute directly as a keyword argument.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all(id='seeOriginal'))
    
    

    Output:

    [<form action="" id="seeOriginal">
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS订阅</a>
    </dd>
    </dl></form>]
    

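    3.6 kwargs: matching with a regular expression

    soup.find_all(href=re.compile("java.*?")) returns the tags whose href attribute matches the pattern. Beautiful Soup applies the pattern with search(), so it matches java anywhere in the value; use "^java" to require that the value starts with java.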

    # -*- coding: utf-8 -*-
    import re
    
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all(href=re.compile("java.*?")))
    
    

    Output:

    [<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a>]
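
    Because the pattern is applied with search(), anchoring matters. A small sketch; the second link's URL is a made-up example.com address chosen so that it contains java without starting with it:

```python
import re

from bs4 import BeautifulSoup

# Two links; the second URL is hypothetical and only exists for this illustration
html = ('<a href="javascript:void(0);">Default</a>'
        '<a href="https://example.com/java/rss">RSS</a>')
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches "java" anywhere in the href value
contains_java = soup.find_all(href=re.compile("java"))

# Anchored with ^: only href values that start with "java"
starts_with_java = soup.find_all(href=re.compile("^java"))

print([a.get_text() for a in contains_java])
print([a.get_text() for a in starts_with_java])
```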
    

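    3.7 Searching by CSS class

    soup.find_all("a", class_="btn") returns the a tags whose class attribute contains the class btn. class is a reserved word in Python, hence the trailing underscore.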

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all("a", class_="btn"))
    
    

    Output:

    [<a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS订阅</a>]
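
    class_ compares against whole class tokens, not substrings, which is why the btn-filter-sort link is missing from the result. A short sketch on two links modeled on the sample (shortened, illustrative); select() is shown as one way to require several classes at once:

```python
from bs4 import BeautifulSoup

# Two links modeled on the sample, with English text (illustrative assumption)
html = ('<a class="btn-filter-sort active">Default</a>'
        '<a class="btn btn-sm rss">RSS</a>')
soup = BeautifulSoup(html, "html.parser")

token = soup.find_all("a", class_="btn")           # "btn" must be one of the class tokens
partial = soup.find_all("a", class_="btn-filter")  # a substring of a token does not match
both = soup.select("a.btn.rss")                    # CSS selector: both classes, in any order

print([a.get_text() for a in token])
print(partial)
print([a.get_text() for a in both])
```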
    

    4 CSS selectors

    Beautiful Soup also supports searching with CSS selectors directly through select(). The examples below cover the most common cases.

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    html = """
        <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
        <dl class="filter-sort-box d-flex align-items-center">
        <dt>排序:</dt>
        <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
        <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    		<svg class="icon" aria-hidden="true">
    			<use xlink:href="#csdnc-rss"></use>
    		</svg>RSS订阅</a>
        </dd>
      </dl>"""
    
    # Initialize the soup
    soup = BeautifulSoup(html, 'html.parser')
    # Select the dt tags under the dl tag
    lt = soup.select('dl dt')
    print(lt)
    dd = soup.select('dl dd')
    print(dd[0])
    # Search with an id selector (by_id avoids shadowing the built-in id)
    by_id = soup.select('#seeOriginal')
    print(by_id)
    # Search with a class selector
    cla = soup.select('.btn-filter-sort')
    print(cla[0])
    
    

    The outputs are, in order:

    soup.select('dl dt')

    [<dt>排序:</dt>]
    

    soup.select('dl dd')[0]

    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
    

    soup.select('#seeOriginal')

    [<form action="" id="seeOriginal">
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS订阅</a>
    </dd>
    </dl></form>]
    

    soup.select('.btn-filter-sort')[0]

    <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a>
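
    A few more selector forms that often come in handy, sketched on a compact version of the sample markup (an assumption for illustration): select_one() for a single result, the child combinator, and an attribute prefix selector.

```python
from bs4 import BeautifulSoup

# Compact version of the sample markup with English text (illustrative assumption)
html = ('<form id="seeOriginal"><dl><dt>Sort:</dt>'
        '<dd><a class="btn-filter-sort active" href="javascript:void(0);">Default</a></dd>'
        '<dd><a class="btn rss" href="https://blog.csdn.net/youku1327/rss/list">RSS</a></dd>'
        '</dl></form>')
soup = BeautifulSoup(html, "html.parser")

first_link = soup.select_one("dl dd a")          # first match only, or None
direct_dd = soup.select("dl > dd")               # dd tags that are direct children of dl
https_links = soup.select('a[href^="https"]')    # href values starting with "https"
missing = soup.select_one("table")               # no match gives None, not an exception

print(first_link.get_text(), len(direct_dd))
print([a["href"] for a in https_links], missing)
```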
    
  • Original article: https://www.cnblogs.com/zszxz/p/12208673.html