• python爬虫之BeautifulSoup的HTML解析


      BeautifulSoup是一个用于从HTML和XML文件中提取数据的python库,它提供一些简单的函数来处理导航、搜索、修改分析树等功能。BeautifulSoup能自动将文档转换成Unicode编码,输出文档转换为UTF-8编码。

      本例直接创建模拟HTML代码,进行美化:

    # 导入BeautifulSoup库
    from bs4 import BeautifulSoup
    
    # 创建模拟HTML代码的字符串
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    # 创建一个BeautifulSoup对象,获取页面正文
    soup = BeautifulSoup(html_doc, features="lxml")
    # 打印解析的HTML代码
    print('经BeautifulSoup美化的代码:',soup)
    print('===================================')
    print('通过prettify()方法进行代码的格式化处理:',soup.prettify())

    结果:

    经BeautifulSoup美化的代码: <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    ===================================
    通过prettify()方法进行代码的格式化处理: <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>

  • 相关阅读:
    MySQL
    Python练习-基于socket的FTPServer
    python练习-Socket实现远程cmd命令
    python概念-Socket到底有多骚
    Python练习-天已经亮了计算器也终于完成了
    python模块-logging的智商上限
    python正则表达式-re模块的爱恨情仇
    Python练习-os模块练习-还算是那么回事儿
    Python练习-time模块
    Python练习-跨目录调用模块
  • 原文地址:https://www.cnblogs.com/xiao02fang/p/12933835.html
Copyright © 2020-2023  润新知