• 13. CSS Selectors: BeautifulSoup4


    (1) Like lxml, Beautiful Soup is an HTML/XML parser; its main purpose is also to parse and extract data from HTML/XML documents.

    (2) lxml only traverses the document locally, whereas Beautiful Soup works on the HTML DOM: it loads the entire document and parses the whole DOM tree, so its time and memory overhead is much larger and its performance is lower than lxml's.

    (3) BeautifulSoup makes parsing HTML straightforward and its API is very user-friendly; it supports CSS selectors, the HTML parser from the Python standard library, and lxml's XML parser.

    Installation: `pip install beautifulsoup4`

    Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

    | Tool          | Speed   | Ease of use | Ease of installation |
    |---------------|---------|-------------|----------------------|
    | Regex         | Fastest | Hard        | None (built-in)      |
    | BeautifulSoup | Slow    | Easiest     | Easy                 |
    | lxml          | Fast    | Easy        | Moderate             |

    1. Example

    from bs4 import BeautifulSoup
    
    html = """
        <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create a Beautiful Soup object
    soup = BeautifulSoup(html, 'lxml')
    
    # Alternatively, create the object from a local HTML file
    # soup = BeautifulSoup(open('index.html'))
    
    # Pretty-print the contents of the soup object
    print(soup.prettify())

      Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>

      If no parser is specified explicitly, Beautiful Soup falls back to the best HTML parser available on the system (usually 'lxml'). Running the same code on another system, or in a different virtual environment, may pick a different parser and therefore behave differently.

      You can pin the lxml parser explicitly with `soup = BeautifulSoup(html, 'lxml')`.
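
      As a minimal sketch (assuming both the built-in `html.parser` and the third-party `lxml` package are installed), the snippet below shows how passing the parser name explicitly keeps behaviour stable across environments:

    from bs4 import BeautifulSoup
    
    html = "<html><body><p class='title'>Hello</p></body></html>"
    
    # Standard-library parser: no extra dependency, but slower and less lenient
    soup_std = BeautifulSoup(html, 'html.parser')
    
    # lxml parser: needs `pip install lxml`, usually the fastest option
    soup_lxml = BeautifulSoup(html, 'lxml')
    
    print(soup_std.p)   # <p class="title">Hello</p>
    print(soup_lxml.p)  # <p class="title">Hello</p>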

    2. The Four Object Types

      Beautiful Soup turns a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four classes: (1) Tag, (2) NavigableString, (3) BeautifulSoup, (4) Comment.

      2.1 Tag   

    <head><title>The Dormouse's story</title></head>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

        A Tag is an HTML tag (such as the `title`, `head`, `a`, and `p` tags in the code above) together with the content it encloses.

    from bs4 import BeautifulSoup
    
    html = """
        <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # Create a Beautiful Soup object
    soup = BeautifulSoup(html, 'lxml')
    
    # Alternatively, create the object from a local HTML file
    # soup = BeautifulSoup(open('index.html'))
    
    # Pretty-print the contents of the soup object
    # print(soup.prettify())
    
    print(soup.title)
    # <title>The Dormouse's story</title>
    
    print(soup.head)
    # <head><title>The Dormouse's story</title></head>
    
    print(soup.a)
    # <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    
    print(type(soup.a))
    # <class 'bs4.element.Tag'>
    

    # soup.<tag name> returns the content of that tag; these objects are of type bs4.element.Tag.
    # Note that this shortcut only returns the first matching tag in the whole document.



    # A Tag has two important attributes: name and attrs
    print(soup.name)
    # [document]  # the soup object itself is special; its name is [document]
    
    print(soup.head.name)
    # head  # for any other tag, the value is the tag's own name
    
    print(soup.p.attrs)
    # {'class': ['title'], 'name': 'dromouse'}
    # all attributes of the p tag, returned as a dictionary
    
    print(soup.p['class'])
    # ['title']  # get the value of a single attribute
    
    # equivalent to the get() method
    print(soup.p.get('class'))
    # ['title']
    
    # attributes can be modified ...
    soup.p['class'] = 'newClass'
    print(soup.p)
    # <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
    
    # ... and deleted
    del soup.p['class']
    print(soup.p)
    # <p name="dromouse"><b>The Dormouse's story</b></p>

      2.2 NavigableString

        Use the `.string` attribute to get the text inside a tag.

    print(soup.p.string)
    # The Dormouse's story
    
    print(type(soup.p.string))
    # <class 'bs4.element.NavigableString'>

      2.3 BeautifulSoup

        A BeautifulSoup object represents the content of an entire document. It can be treated as a special Tag object, and you can inspect its type, name, and attributes just like a Tag's.

    print(type(soup.name))
    # <class 'str'>
    
    print(soup.name)
    # [document]
    
    print(soup.attrs)
    # {}    the document itself has no attributes

      2.4 Comment

        A Comment object is a special kind of NavigableString; its output does not include the comment markers.

    print(soup.a)
    # <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    
    print(soup.a.string)
    #  Elsie
    
    print(type(soup.a.string))
    # <class 'bs4.element.Comment'>

        Note the difference between Comment and NavigableString: when a tag's `.string` content is a comment, the comment markers are dropped and the inner text is returned as a Comment object; when there is no comment, the inner text is returned as a NavigableString object.
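
        Because a Comment is also a NavigableString, code that treats them differently usually has to check the type explicitly. A small sketch (reusing the `html` document defined above):

    from bs4 import BeautifulSoup
    from bs4.element import Comment, NavigableString
    
    soup = BeautifulSoup(html, 'lxml')
    
    for a in soup.find_all('a'):
        s = a.string
        if isinstance(s, Comment):        # check Comment first: it subclasses NavigableString
            print('comment:', s)          # " Elsie " from <!-- Elsie -->
        elif isinstance(s, NavigableString):
            print('text:', s)             # "Lacie", "Tillie"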

    3. Traversing the Document Tree

      3.1 Direct children: the `.contents` and `.children` attributes

        (1) The `.contents` attribute

          A tag's `.contents` attribute returns the tag's child nodes as a list.

    print(soup.body.contents)
    # a tag's .contents attribute returns the tag's children as a list
    """
    ['\n', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
    """
    # The result is a list, so individual elements can be accessed by index
    print(soup.body.contents[1])
    # <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

        (2) The `.children` attribute

          A tag's `.children` attribute returns a list iterator over the same child nodes.

    print(soup.body.children)
    # <list_iterator object at 0x7f55adea9d68>
    
    for child in soup.body.children:
        print(child)
    # Output:
    """
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

      3.2 All descendants: the `.descendants` attribute

        The `.contents` and `.children` attributes only cover a tag's direct children. The `.descendants` attribute recursively iterates over all of a tag's descendants and, like `.children`, returns a generator object.

    print(soup.descendants)
    # <generator object descendants at 0x7f98e70050f8>
    
    for child in soup.descendants:
        print(child)
    """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    <head><title>The Dormouse's story</title></head>
    <title>The Dormouse's story</title>
    The Dormouse's story
    
    
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body>
    
    
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <b>The Dormouse's story</b>
    The Dormouse's story
    
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    Once upon a time there were three little sisters; and their names were
    
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
     Elsie 
    ,
    
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    Lacie
     and
    
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    Tillie
    ;
    and they lived at the bottom of a well.
    
    
    <p class="story">...</p>
    ...
    
    
    """

      3.3 Node content: the `.string` attribute

        If a tag has a single NavigableString child, the tag can use `.string` to get that child. If a tag has exactly one child node, the tag can also use `.string`, and the result is the same as calling `.string` on that only child.

    print(soup.head.string)
    # The Dormouse's story
    print(soup.title.string)
    # The Dormouse's story
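
        When a tag has several children, `.string` gives up and returns None; a small sketch of the difference, with `get_text()` as the usual fallback:

    print(soup.title.string)   # The Dormouse's story  (single text child)
    print(soup.body.string)    # None  (body has several children)
    
    # get_text() concatenates all nested text instead
    print(soup.body.get_text()[:40])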

    4. Searching the Document Tree

      4.1 find_all(name, attrs, recursive, text, **kwargs)

        4.1.1 The name parameter

          The name parameter finds all tags whose name matches it; string objects are ignored automatically.

          (1) Passing a string

            If you pass a string to a search method, Beautiful Soup looks for content that matches the string exactly.

    print(type(soup.find_all('p')))
    # <class 'bs4.element.ResultSet'>
    
    print(soup.find_all('p'))
    """
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>, <p class="story">...</p>]
    """

          (2) Passing a regular expression

            If you pass a regular expression, Beautiful Soup matches content using the pattern's match() method.

    # find all tags whose names start with b
    import re
    for tag in soup.find_all(re.compile('^b')):
        print(tag.name)
    # body
    # b

          (3) Passing a list

            If you pass a list, Beautiful Soup returns all content that matches any element of the list.

    print(soup.find_all(["a",'p']))
    """
    [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
    """

        4.1.2 Keyword arguments

    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

        4.1.3 The text parameter

          The text parameter searches the document's string content. Like the name parameter, it accepts a string, a regular expression, or a list.

    import re
    print(soup.find_all(text='Tillie'))
    
    print(soup.find_all(text=["Tillie","Elsie","Lacie"]))
    
    print(soup.find_all(text=re.compile("Dormouse")))
    """
    ['Tillie']
    ['Lacie', 'Tillie']
    ["The Dormouse's story", "The Dormouse's story"]
    """

      4.2 CSS Selectors

        As in CSS, tag names take no prefix, class names are prefixed with `.`, and id names are prefixed with `#`.

        Use soup.select(); it returns a list.

        4.2.1 Finding by tag name

    print(soup.select("title"))
    # [<title>The Dormouse's story</title>]
    print(soup.select("a"))
    """
    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    """
    print(soup.select('b'))
    # [<b>The Dormouse's story</b>]

        4.2.2 Finding by class name

    print(soup.select(".sister"))
    """
    [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    """

        4.2.3 Finding by id

    print(soup.select("#link1"))
    # [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

        4.2.4 Combined selectors

          Combined selectors work just like in a CSS file: tag names, class names, and id names are combined and separated by spaces.

    print(soup.select("p #link1"))
    # [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

          To match direct children only, separate the selectors with `>`.

    print(soup.select("head > title"))
    #[<title>The Dormouse's story</title>]

        4.2.5 Finding by attribute

          Selectors can also include attributes, written in square brackets. Note that an attribute belongs to the same node as its tag, so there must be no space between them, otherwise nothing will match.

    print(soup.select('a[class="sister"]'))
    #[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    print(soup.select('a[href="http://example.com/elsie"]'))
    #[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

          Attributes can likewise be combined with the selectors above: use a space between different nodes and no space within the same node.

    print(soup.select('p a[href="http://example.com/elsie"]'))
    #[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

        4.2.6 Getting the content

           The select method always returns a list, so you can iterate over the results and call get_text() to extract the text of each element.

    soup = BeautifulSoup(html,'lxml')
    print(type(soup.select("title")))
    # <class 'bs4.element.ResultSet'>
    print(soup.select('title')[0])
    # <title>The Dormouse's story</title>
    print(soup.select("title")[0].get_text())
    # The Dormouse's story
    
    for title in soup.select("title"):
        print(title.get_text())
    # The Dormouse's story
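
           The Tag objects returned by select() also support dictionary-style attribute access. A short follow-up sketch, reusing the soup object above:

    # select() returns Tag objects, so attributes can be read like a dict
    for a in soup.select('a.sister'):
        print(a['id'], a['href'])
    # link1 http://example.com/elsie
    # link2 http://example.com/lacie
    # link3 http://example.com/tillie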

       
