• 路飞学城 (Luffycity) web assignment summary


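    BeautifulSoup is a Python module that takes an HTML or XML string and parses it into a tree of objects. You can then use the methods it provides to quickly locate specific elements, which makes searching HTML or XML documents simple.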

    from bs4 import BeautifulSoup
     
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    asdf
        <div class="title">
            <b>The Dormouse's story总共</b>
            <h1>f</h1>
        </div>
    <div class="story">Once upon a time there were three little sisters; and their names were
        <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</div>
    ad<br/>sf
    <p class="story">...</p>
    </body>
    </html>
    """
     
    soup = BeautifulSoup(html_doc, features="lxml")
    # Find the first <a> tag
    tag1 = soup.find(name='a')
    # Find all <a> tags
    tag2 = soup.find_all(name='a')
    # Find the tag with id="link2" (select returns a list of matches)
    tag3 = soup.select('#link2')
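
The three lookups above return different kinds of objects, which is easy to miss. The following is a minimal sketch on a shortened copy of the sample markup. It assumes Python's built-in `html.parser` instead of `lxml` so that it runs without an extra package, and `select_one` is shown as an addition that the original does not use.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="story">
    <a class="sister0" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</div>
"""

# Assumption: the stdlib parser is used here instead of lxml.
soup = BeautifulSoup(html_doc, features="html.parser")

first = soup.find(name="a")          # a single Tag, or None when nothing matches
all_links = soup.find_all(name="a")  # a list-like collection of Tags
by_id = soup.select("#link2")        # CSS selector, also returns a list
one = soup.select_one("#link2")      # first CSS match, or None

print(first["id"])           # link1
print(len(all_links))        # 3
print(by_id[0].get_text())   # Lacie
print(one["href"])           # http://example.com/lacie
```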

    Installation (the lxml parser used in these examples is a separate package: pip3 install lxml)

    pip3 install beautifulsoup4

    Usage example:

    from bs4 import BeautifulSoup
     
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
        ...
    </body>
    </html>
    """
     
    soup = BeautifulSoup(html_doc, features="lxml")

    1. name: the tag name

    tag = soup.find('a')
    name = tag.name  # get
    print(name)
    tag.name = 'span'  # set
    print(soup)
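
As a self-contained illustration of reading and renaming a tag, here is a small sketch. The one-line markup is made up for the example, and `html.parser` is assumed instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this illustration.
soup = BeautifulSoup('<p><a href="/x">link</a></p>', features="html.parser")

tag = soup.find("a")
old_name = tag.name   # read the current tag name
tag.name = "span"     # rename the tag in place; attributes and children are kept
result = str(soup)

print(old_name)  # a
print(result)    # <p><span href="/x">link</span></p>
```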

    2. attrs: the tag attributes

    tag = soup.find('a')
    attrs = tag.attrs  # get
    print(attrs)
    tag.attrs = {'ik': 123}  # set: replaces all existing attributes
    tag.attrs['id'] = 'iiiii'  # set: adds or updates a single attribute
    print(soup)
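
The difference between replacing the whole attribute dictionary and setting one key can be seen in a short sketch. The markup and the attribute name `data-ik` are illustrative assumptions, and `html.parser` stands in for `lxml`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister" id="link1">Elsie</a>', features="html.parser")
tag = soup.find("a")

# class is a multi-valued attribute, so its value is a list.
original_attrs = dict(tag.attrs)
print(original_attrs)  # {'class': ['sister'], 'id': 'link1'}

tag.attrs = {"data-ik": "123"}  # replaces every existing attribute
tag["id"] = "iiiii"             # adds one attribute, same as tag.attrs["id"] = ...

result = str(soup)
print(result)  # <a data-ik="123" id="iiiii">Elsie</a>
```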

    3. children: all direct children

    body = soup.find('body')
    v = body.children

    4. descendants: all descendants (children, grandchildren, and so on)

    body = soup.find('body')
    v = body.descendants
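
Both `children` and `descendants` are iterators, and both yield text nodes as well as tags. A minimal sketch of the difference, using made-up markup and assuming `html.parser` instead of `lxml`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<body><div><b>bold</b></div><p>para</p></body>", features="html.parser"
)
body = soup.find("body")

# Direct children only: the <div> and the <p>.
child_names = [c.name for c in body.children]

# Every nested node, depth first; text nodes are filtered out here.
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]

print(child_names)       # ['div', 'p']
print(descendant_names)  # ['div', 'b', 'p']
```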

    5. clear: remove all of a tag's children (the tag itself is kept)

    tag = soup.find('body')
    tag.clear()
    print(soup)
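
A small sketch of what `clear` leaves behind, on made-up markup and assuming `html.parser` instead of `lxml`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>one</p><p>two</p></body>", features="html.parser")

body = soup.find("body")
body.clear()  # drops both <p> elements, keeps the empty <body>

result = str(soup)
print(result)  # <body></body>
```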

    6. decode: convert to a string (including the current tag); decode_contents: the same, excluding the current tag

    body = soup.find('body')
    v = body.decode()
    v = body.decode_contents()
    print(v)
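
Because the snippet above reuses `v`, only the `decode_contents` result gets printed. The sketch below keeps the two results apart, on made-up markup and assuming `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>hi</p></body>", features="html.parser")
body = soup.find("body")

outer = body.decode()           # markup including the <body> tag itself
inner = body.decode_contents()  # only what is inside <body>

print(outer)  # <body><p>hi</p></body>
print(inner)  # <p>hi</p>
```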

    7. encode: convert to bytes (including the current tag); encode_contents: the same, excluding the current tag

    body = soup.find('body')
    v = body.encode()
    v = body.encode_contents()
    print(v)
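
The byte-oriented counterparts can be sketched the same way. The markup is made up, `html.parser` is assumed instead of `lxml`, and the example relies on UTF-8 being the default output encoding.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>caf\u00e9</p></body>", features="html.parser")
body = soup.find("body")

outer_bytes = body.encode()           # bytes, including the <body> tag
inner_bytes = body.encode_contents()  # bytes, contents of <body> only

print(type(outer_bytes).__name__)  # bytes
print(inner_bytes.decode("utf-8"))
```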

    10. find: get the first matching tag

    # tag = soup.find('a')
    # print(tag)
    # tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
    # tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
    # print(tag)
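
An uncommented version of the lookup above might look like the following sketch. It assumes `html.parser` instead of `lxml` and uses the argument name `string`, which newer Beautiful Soup releases prefer over the older spelling `text`.

```python
from bs4 import BeautifulSoup

html_doc = """
<a class="sister0" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, features="html.parser")

# All conditions must hold: tag name, class, and exact text.
tag = soup.find(name="a", class_="sister", string="Lacie")
print(tag["id"])  # link2

# find returns None when nothing matches, so check before using the result.
missing = soup.find(name="a", id="no-such-id")
print(missing)  # None
```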

    11. find_all: get all matching tags

    # tags = soup.find_all('a')
    # print(tags)
     
    # tags = soup.find_all('a',limit=1)
    # print(tags)
     
    # tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
    # # tags = soup.find_all(name='a', class_='sister', recursive=True, text='Lacie')
    # print(tags)
     
     
    # ####### List filters #######
    # v = soup.find_all(name=['a','div'])
    # print(v)
     
    # v = soup.find_all(class_=['sister0', 'sister'])
    # print(v)
     
    # v = soup.find_all(text=['Tillie'])
    # print(v, type(v[0]))
     
     
    # v = soup.find_all(id=['link1','link2'])
    # print(v)
     
    # v = soup.find_all(href=['link1','link2'])
    # print(v)
     
    # ####### Regular expression filters #######
    import re
    # rep = re.compile('p')
    # rep = re.compile('^p')
    # v = soup.find_all(name=rep)
    # print(v)
     
    # rep = re.compile('sister.*')
    # v = soup.find_all(class_=rep)
    # print(v)
     
    # rep = re.compile('http://www.oldboy.com/static/.*')
    # v = soup.find_all(href=rep)
    # print(v)
     
    # ####### Function filters #######
    # def func(tag):
    #     return tag.has_attr('class') and tag.has_attr('id')
    # v = soup.find_all(name=func)
    # print(v)
     
     
    # ## get: read a tag attribute
    # tag = soup.find('a')
    # v = tag.get('id')
    # print(v)
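
The commented-out filters above can be tried together in one runnable sketch. It uses a shortened copy of the sample markup, assumes `html.parser` instead of `lxml`, and picks patterns that match the `example.com` links in that sample. The helper name `has_class_and_id` is chosen for this illustration.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<div class="story">
<a class="sister0" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
<p class="story">...</p>
</div>
"""
soup = BeautifulSoup(html_doc, features="html.parser")

# List filter: a tag matches if it matches any item in the list.
names = [t.name for t in soup.find_all(name=["a", "p"])]
ids = [t["id"] for t in soup.find_all(id=["link1", "link3"])]

# Regex filter: the pattern is searched for in the attribute value.
exact_class = [t["id"] for t in soup.find_all("a", class_=re.compile(r"^sister$"))]
by_href = [t["id"] for t in soup.find_all(href=re.compile(r"^http://example\.com/"))]

# Function filter: called with each tag, keeps those returning True.
def has_class_and_id(tag):
    return tag.has_attr("class") and tag.has_attr("id")

both = [t["id"] for t in soup.find_all(has_class_and_id)]

# limit stops the search after the given number of matches.
limited = soup.find_all("a", limit=1)

print(names)        # ['a', 'a', 'a', 'p']
print(ids)          # ['link1', 'link3']
print(exact_class)  # ['link2', 'link3']
print(by_href)      # ['link2', 'link3']
print(both)         # ['link1', 'link2', 'link3']
print(len(limited)) # 1
```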

    12. has_attr: check whether the tag has the given attribute

    # tag = soup.find('a')
    # v = tag.has_attr('id')
    # print(v)
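
`has_attr` and `get` are often used together to avoid a `KeyError` on a missing attribute. A minimal sketch, on made-up markup and assuming `html.parser` instead of `lxml`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister0" id="link1">Elsie</a>', features="html.parser")
tag = soup.find("a")

has_id = tag.has_attr("id")          # attribute is present
has_href = tag.has_attr("href")      # attribute is absent

link_id = tag.get("id")              # value of an existing attribute
href = tag.get("href")               # None instead of raising KeyError
href_or_default = tag.get("href", "#")  # fallback value when absent

print(has_id, has_href)       # True False
print(link_id)                # link1
print(href, href_or_default)  # None #
```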

    13. get_text: get the text content inside the tag

    # tag = soup.find('a')
    # v = tag.get_text()  # the first positional argument is a separator, not an attribute name
    # print(v)
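
The first positional argument of `get_text` is a separator placed between the text fragments, which the following sketch makes visible. It reuses the first link of the sample markup and assumes `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1">Els<span>f</span>ie</a>', features="html.parser")
tag = soup.find("a")

plain = tag.get_text()                    # fragments joined with no separator
piped = tag.get_text("|")                 # fragments joined with "|"
spaced = tag.get_text(" ", strip=True)    # fragments stripped, joined with a space

print(plain)   # Elsfie
print(piped)   # Els|f|ie
print(spaced)  # Els f ie
```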
  • Original post: https://www.cnblogs.com/wlx97e6/p/9287948.html