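• Installing and Using the BeautifulSoup Library


    Installing the BeautifulSoup library

    On Windows: open cmd with "Run as administrator"

    Run: pip install beautifulsoup4

    Demo HTML page URL: http://python123.io/ws//demo.html

    File name: demo.html

    Page source: HTML 5.0 markup

    A quick test of the BeautifulSoup installation: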
    >>> import requests
    >>> r = requests.get("http://python123.io/ws//demo.html")
    >>> r.text
    '<html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>'
    >>> demo = r.text
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo, 'html.parser')
    >>> print(soup.prettify())
    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>
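
The check above needs network access. As a minimal offline sketch, the installation can also be confirmed by importing bs4 and parsing a literal string. The inline HTML here is a made-up stand-in for demo.html, not the real page.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical stand-in for the demo page, so no network request is needed.
html = "<html><head><title>install check</title></head><body><p>ok</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# If both the import and the parse succeed, the library is installed and usable.
print("bs4 version:", bs4.__version__)
print("title text:", soup.title.string)
```

If the import fails with ModuleNotFoundError, the pip install step most likely ran against a different Python interpreter.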

    Basic elements of the Beautiful Soup library:

    Understanding the Beautiful Soup library:

    Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".

    <p>..</p> : a tag (Tag)

    Importing the Beautiful Soup library:

    from bs4 import BeautifulSoup

    import bs4

    Beautiful Soup parsers:

    soup = BeautifulSoup('<html>data</html>', 'html.parser')
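
'html.parser' is the parser that ships with Python's standard library. Beautiful Soup also accepts other parser names such as 'lxml', but those need an extra package to be installed. The sketch below assumes you would prefer lxml when it is available; make_soup is a hypothetical helper, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to the built-in one.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html>data</html>")
print(soup.get_text())
```

Different parsers may repair incomplete markup differently, so the resulting tree can vary slightly between them.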

    Basic elements of the BeautifulSoup class:

    <p class="title"> ... </p>

    Tag:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo, 'html.parser')
    >>> soup.title
    <title>This is a python demo page</title>
    >>> tag = soup.a
    >>> tag
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

    Any tag that exists in HTML syntax can be accessed as soup.<tag>. When the document contains several <tag> elements with the same name, soup.<tag> returns only the first one.
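
To reach the other matches, one option is find_all(), a bs4 method that the original text does not use. The sketch below works on a trimmed copy of the two links from demo.html and also shows that asking for a tag that is absent gives None.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two course links in demo.html.
html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a                   # only the first <a> in the document
all_links = soup.find_all("a")   # every <a>, in document order
missing = soup.table             # there is no <table>, so this is None

print(first["id"])
print([a["id"] for a in all_links])
print(missing)
```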

    Tag name:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo, 'html.parser')
    >>> soup.a.name
    'a'
    >>> soup.a.parent.name
    'p'
    >>> soup.a.parent.parent.name
    'body'

    Every <tag> has a name, available as <tag>.name; it is a string.
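
Because .name is a plain string, it can be collected and compared like any other str. The sketch below assumes find_all(True), which matches every tag and is not used in the original. The HTML is a shortened, made-up skeleton of the demo page.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical skeleton of the demo page.
html = "<html><head><title>t</title></head><body><p><a>x</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all(True) returns every Tag in document order.
names = [tag.name for tag in soup.find_all(True)]

print(names)
print(type(soup.a.name))
```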

    Tag attrs (attributes):

    >>> tag = soup.a
    >>> tag.attrs
    {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
    >>> tag.attrs['class']
    ['py1']
    >>> tag.attrs['href']
    'http://www.icourse163.org/course/BIT-268001'
    >>> type(tag.attrs)
    <class 'dict'>
    >>> type(tag)
    <class 'bs4.element.Tag'>

    A <tag> can have zero or more attributes; they are stored in a dict.
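
Since attributes live in a dict, a missing key raises KeyError. A small sketch of safer lookups, using the Tag methods get() and has_attr() (neither appears in the original) on a trimmed copy of the demo markup:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the demo markup: one tag without attributes, one with three.
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

href = tag["href"]               # shorthand for tag.attrs["href"]; KeyError if absent
title = tag.get("title")         # None when the attribute is absent
classes = tag.get("class", [])   # class is multi-valued, so bs4 returns a list
has_id = tag.has_attr("id")
no_attrs = soup.b.attrs          # a tag with no attributes has an empty dict

print(href, title, classes, has_id, no_attrs)
```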

    Tag NavigableString:

    >>> soup.a
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    >>> soup.a.string
    'Basic Python'
    >>> soup.p
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    >>> soup.p.string
    'The demo python introduces several python courses.'
    >>> type(soup.p.string)
    <class 'bs4.element.NavigableString'>

    The .string lookup can reach through several levels of nesting: soup.p.string above returns the text inside the nested <b>.
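
The reach-through only works when each level has a single child. As a sketch on a shortened, made-up version of the two demo paragraphs: when a tag has several children, .string is None, and get_text() (not used in the original) can be used to join the text instead.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the two <p> elements in demo.html.
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, course_p = soup.find_all("p")

single = title_p.string        # one child chain (<p> -> <b> -> text), so a string
multiple = course_p.string     # several children, so None
joined = course_p.get_text()   # concatenates all text below the tag

print(repr(single))
print(repr(multiple))
print(repr(joined))
```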

    Tag Comment:

    >>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
    >>> newsoup.b.string
    'This is a comment'
    >>> type(newsoup.b.string)
    <class 'bs4.element.Comment'>
    >>> newsoup.p.string
    'This is not a comment'
    >>> type(newsoup.p.string)
    <class 'bs4.element.NavigableString'>

    Comment is a special type: .string returns a comment's text without the <!-- --> markers, so check the type to tell it apart from ordinary text.
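
One way to make that type check explicit is isinstance(). In the sketch below, visible_text is a hypothetical helper that hides comment text; note that Comment is itself a subclass of NavigableString, so the Comment test has to come first.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)


def visible_text(tag):
    """Return tag.string, or None if it is an HTML comment (hypothetical helper)."""
    s = tag.string
    if isinstance(s, Comment):
        return None
    return s


print(visible_text(newsoup.b))
print(visible_text(newsoup.p))
print(isinstance(newsoup.b.string, NavigableString))
```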

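    Tag: <tag>

    Traversing HTML content with the bs4 library:

    Basic HTML structure:

    <>...</> pairs establish containment, which gives the tags a tree structure.

    Downward traversal of the tag tree:

    The BeautifulSoup object is the root node of the tag tree.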

    >>> soup = BeautifulSoup(demo, 'html.parser')
    >>> soup.head
    <head><title>This is a python demo page</title></head>
    >>> soup.head.contents
    [<title>This is a python demo page</title>]
    >>> soup.body.contents
    ['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
    >>> len(soup.body.contents)
    5
    >>> soup.body.contents[1]
    <p class="title"><b>The demo python introduces several python courses.</b></p>

    Iterating over child nodes:

    for child in soup.body.children:
        print(child)

    Iterating over descendant nodes:

    for child in soup.body.descendants:
        print(child)
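
Both loops above also yield the text nodes between tags, including bare newlines. A minimal sketch that keeps only Tag nodes, on a shortened, made-up version of the demo body:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of the demo page body.
html = (
    '<body>\n'
    '<p class="title"><b>bold</b></p>\n'
    '<p class="course">text <a id="link1">A</a></p>\n'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")

# .children covers one level only; .descendants walks the whole subtree.
n_children = len(list(soup.body.children))
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(n_children)
print(child_tags)
print(descendant_tags)
```

The three newline strings account for the difference between the five children and the two child tags.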
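
    Upward traversal of the tag tree:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(demo, 'html.parser')
    for parent in soup.a.parents:           # walk up the tag tree
        if parent is None:
            print(parent)
        else:
            print(parent.name)

    This iterates over every ancestor node, including soup itself (whose .name is '[document]'), so the loop distinguishes the two cases.

    Output:

    p
    body
    html
    [document]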
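
The same walk can be written as a list comprehension. This sketch uses a shortened, made-up skeleton of the demo page; it also shows that the root's own .parent is None, which is where the upward walk stops.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical skeleton of the demo page.
html = '<html><body><p><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents yields each ancestor in turn, ending with the BeautifulSoup object.
ancestor_names = [parent.name for parent in soup.a.parents]
root_parent = soup.parent

print(ancestor_names)
print(root_parent)
```

In this sketch the iterator ends once it reaches the root, so None itself is never yielded.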

    Sibling (parallel) traversal of the tag tree:

    Sibling traversal takes place among nodes that share the same parent.
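
The original does not list the attribute names for this kind of traversal. As a sketch, bs4 provides next_sibling, previous_sibling, next_siblings and find_next_sibling(); the HTML below is a shortened, made-up version of the course paragraph. Siblings can be text nodes as well as tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of the <p class="course"> paragraph.
html = (
    '<p class="course">Python courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

after = first.next_sibling         # the text node ' and ', not the next <a>
before = first.previous_sibling    # the text node before the first link
tag_siblings = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
next_link = first.find_next_sibling("a")

print(repr(after), repr(before))
print(tag_siblings, next_link["id"])
```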

    Checks during traversal:

    Displaying HTML content in a more "friendly" way:

    The prettify() method of the bs4 library:
    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> r = requests.get("http://python123.io/ws//demo.html")
    >>> demo = r.text
    >>> demo
    '<html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>'
    >>> soup = BeautifulSoup(demo, 'html.parser')
    >>> soup.prettify()
    '<html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>'
    >>> print(soup.prettify())
    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>

    .prettify() adds a '\n' after each <> tag and its content in the HTML text.

    .prettify() can also be called on a single tag: <tag>.prettify()

    >>> print(soup.a.prettify())
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
     Basic Python
    </a>
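
prettify() returns a new str and leaves the tree itself unchanged. A minimal sketch comparing the compact and the prettified form of one tag from the demo page:

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>The demo python introduces several python courses.</b></p>'
soup = BeautifulSoup(html, "html.parser")

compact = str(soup.p)        # the tag as written, on one line
pretty = soup.p.prettify()   # the same tag with line breaks and indentation added

print(repr(compact))
print(repr(pretty))

# The added whitespace becomes part of the text if the pretty form is parsed again.
reparsed_text = BeautifulSoup(pretty, "html.parser").b.string
print(repr(reparsed_text))
```

Because of that added whitespace, the prettified form is better suited to reading than to further processing.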
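
    Encoding in the bs4 library:

    bs4 converts any HTML input to Unicode and produces UTF-8 on output by default; Python 3.x strings are Unicode with UTF-8 as the default source encoding, so parsing works without trouble.

    >>> soup = BeautifulSoup("<p>中文</p>", 'html.parser')
    >>> soup.p.string
    '中文'
    >>> print(soup.p.prettify())
    <p>
     中文
    </p>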
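
The example above passes a str, so no decoding is involved. As a sketch of the byte-input case, the snippet below feeds GBK-encoded bytes to BeautifulSoup and names the encoding with the from_encoding argument; the choice of GBK is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Bytes in a non-UTF-8 encoding (GBK is an assumption chosen for this example).
raw = "<p>中文</p>".encode("gbk")

# from_encoding tells bs4 how to decode the bytes instead of letting it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                           # already a Unicode str
utf8_bytes = soup.p.encode(encoding="utf-8")   # serialize the tag as UTF-8 bytes

print(text)
print(utf8_bytes)
```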
  • Original article: https://www.cnblogs.com/zyh19980816/p/11851606.html