• Luffycity (路飞学城) web assignment summary


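    BeautifulSoup is a Python library that takes an HTML or XML string and parses it into a tree. You can then use the methods it provides to quickly locate specific elements, which makes searching HTML or XML documents simple.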

    from bs4 import BeautifulSoup
     
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    asdf
        <div class="title">
            <b>The Dormouse's story总共</b>
            <h1>f</h1>
        </div>
    <div class="story">Once upon a time there were three little sisters; and their names were
        <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</div>
    ad<br/>sf
    <p class="story">...</p>
    </body>
    </html>
    """
     
    soup = BeautifulSoup(html_doc, features="lxml")
    # find the first <a> tag
    tag1 = soup.find(name='a')
    # find all <a> tags
    tag2 = soup.find_all(name='a')
    # find the tags with id=link2 (select() always returns a list)
    tag3 = soup.select('#link2')
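Note that `select` returns a list even when only one element matches. The sketch below contrasts it with `select_one` and `find(id=...)`, which return a single tag or `None`. It assumes the built-in `html.parser` backend (so that lxml is not required) and a shortened version of the sample document.

```python
from bs4 import BeautifulSoup

# Shortened sample document (assumption: trimmed from the example above)
html_doc = """
<div class="story">
    <a class="sister0" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</div>
"""
soup = BeautifulSoup(html_doc, features="html.parser")

matches = soup.select('#link2')     # CSS selector, always a list
first = soup.select_one('#link2')   # first match, or None
by_id = soup.find(id='link2')       # keyword filter on the id attribute
missing = soup.select_one('#nope')  # no match

print(len(matches), first['href'], by_id.get_text(), missing)
```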

    Installation

    pip3 install beautifulsoup4
    pip3 install lxml    # required for the features="lxml" parser used here

    Usage example:

    from bs4 import BeautifulSoup
     
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
        ...
    </body>
    </html>
    """
     
    soup = BeautifulSoup(html_doc, features="lxml")

    1. name: the tag name

    tag = soup.find('a')
    name = tag.name  # get
    print(name)
    tag.name = 'span'  # set
    print(soup)
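Assigning to `name` renames the tag in the tree itself, so the change shows up when the document is serialized again. A minimal sketch, assuming a one-link document and the built-in `html.parser` backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link1">Elsie</a></p>', features="html.parser")

tag = soup.find('a')
old_name = tag.name        # 'a'
tag.name = 'span'          # rename in place; attributes and children are kept
renamed = str(soup)

print(old_name, renamed)
print(soup.find('a'))      # the document no longer contains an <a> tag
```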

    2. attrs: the tag attributes

    tag = soup.find('a')
    attrs = tag.attrs    # get (a dict)
    print(attrs)
    tag.attrs = {'ik':123} # set: replaces all existing attributes
    tag.attrs['id'] = 'iiiii' # set a single attribute
    print(soup)
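Attributes can also be read and written directly on the tag with dictionary syntax. The sketch below, which assumes a single-link document with a made-up second class `big`, shows that `class` comes back as a list and that `get` returns `None` for a missing attribute where indexing would raise `KeyError`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/lacie" class="sister big" id="link2">Lacie</a>',
    features="html.parser",
)
tag = soup.find('a')

classes = tag['class']       # multi-valued attribute, returned as a list
link_id = tag['id']          # single-valued attribute, returned as a str
missing = tag.get('title')   # None instead of a KeyError

tag['id'] = 'renamed'        # set one attribute
del tag['href']              # delete one attribute
remaining = sorted(tag.attrs)

print(classes, link_id, missing, remaining)
```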

    3. children: all direct children

    body = soup.find('body')
    v = body.children

    4. descendants: all descendants (children, grandchildren, and so on)

    body = soup.find('body')
    v = body.descendants
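Both `children` and `descendants` are lazy iterables, and both include text nodes as well as tags. The following sketch, which assumes a small made-up body, filters for `Tag` objects to make the difference between the two visible.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(
    '<body><div><b>bold</b><h1>f</h1></div><p>tail</p></body>',
    features="html.parser",
)
body = soup.find('body')

# Direct children only: <div> and <p>
child_names = [c.name for c in body.children if isinstance(c, Tag)]

# Every nested tag, in document order
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]

# Text nodes are part of the iteration as well
all_descendants = list(body.descendants)

print(child_names)
print(descendant_names)
print(len(all_descendants))
```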

    5. clear: remove everything inside the tag (the tag itself is kept)

    tag = soup.find('body')
    tag.clear()
    print(soup)
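For comparison, `decompose` removes the tag itself along with its contents, whereas `clear` leaves an empty tag behind. A minimal sketch on two copies of a made-up fragment:

```python
from bs4 import BeautifulSoup

html = '<div><p id="x">one</p><p>two</p></div>'

# clear(): the <p id="x"> tag stays, but its contents are gone
soup_a = BeautifulSoup(html, features="html.parser")
soup_a.find(id='x').clear()
after_clear = str(soup_a)

# decompose(): the <p id="x"> tag is removed from the tree entirely
soup_b = BeautifulSoup(html, features="html.parser")
soup_b.find(id='x').decompose()
after_decompose = str(soup_b)

print(after_clear)
print(after_decompose)
```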

    6. decode: convert to a string (including the current tag); decode_contents: the same, excluding the current tag

    body = soup.find('body')
    v = body.decode()
    v = body.decode_contents()
    print(v)

    7. encode: convert to bytes (including the current tag); encode_contents: the same, excluding the current tag

    body = soup.find('body')
    v = body.encode()
    v = body.encode_contents()
    print(v)
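The four methods differ only in the return type (`str` or `bytes`) and in whether the enclosing tag is included. A short sketch, assuming a tiny body with non-ASCII text so that the byte encoding (UTF-8 by default) is visible:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>总共</p></body>', features="html.parser")
body = soup.find('body')

as_text = body.decode()                 # str, with the <body> tag
inner_text = body.decode_contents()     # str, without the <body> tag
as_bytes = body.encode()                # bytes, UTF-8 by default
inner_bytes = body.encode_contents()    # bytes, without the <body> tag

print(as_text, inner_text)
print(as_bytes, inner_bytes)
```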

    10. find: get the first matching tag

    # tag = soup.find('a')
    # print(tag)
    # tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
    # tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
    # print(tag)
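A runnable version of these calls might look like the sketch below. It assumes a shortened sample document and uses `string=`, the newer spelling of the `text=` argument in recent Beautiful Soup releases. It also shows that `find` returns `None` when nothing matches, and that `recursive=False` restricts the search to direct children.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<body><div class="story">'
    '<a class="sister0" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '</div></body>'
)
soup = BeautifulSoup(html_doc, features="html.parser")

first = soup.find('a')                                   # first <a> in the document
lacie = soup.find('a', class_='sister', string='Lacie')  # filter on class and text
direct = soup.body.find('a', recursive=False)            # <a> is nested in <div>, so None
absent = soup.find('table')                              # no such tag, so None

print(first['id'], lacie['id'], direct, absent)
```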

    11. find_all: get all matching tags

    # tags = soup.find_all('a')
    # print(tags)
     
    # tags = soup.find_all('a',limit=1)
    # print(tags)
     
    # tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
    # # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
    # print(tags)
     
     
    # ####### lists #######
    # v = soup.find_all(name=['a','div'])
    # print(v)
     
    # v = soup.find_all(class_=['sister0', 'sister'])
    # print(v)
     
    # v = soup.find_all(text=['Tillie'])
    # print(v, type(v[0]))
     
     
    # v = soup.find_all(id=['link1','link2'])
    # print(v)
     
    # v = soup.find_all(href=['link1','link2'])
    # print(v)
     
    # ####### regular expressions #######
    import re
    # rep = re.compile('p')
    # rep = re.compile('^p')
    # v = soup.find_all(name=rep)
    # print(v)
     
    # rep = re.compile('sister.*')
    # v = soup.find_all(class_=rep)
    # print(v)
     
    # rep = re.compile('http://www.oldboy.com/static/.*')
    # v = soup.find_all(href=rep)
    # print(v)
     
    # ####### filtering with a function #######
    # def func(tag):
    #     return tag.has_attr('class') and tag.has_attr('id')
    # v = soup.find_all(name=func)
    # print(v)
     
     
    # ## get: read a tag attribute
    # tag = soup.find('a')
    # v = tag.get('id')
    # print(v)
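The filter styles above (list, regular expression, function) can be tried together in one runnable sketch. It assumes a shortened sample document and the built-in `html.parser` backend; the `ids` helper is only there for illustration.

```python
import re
from bs4 import BeautifulSoup

html_doc = (
    '<div class="story">'
    '<a class="sister0" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '</div><p class="story">...</p>'
)
soup = BeautifulSoup(html_doc, features="html.parser")

def ids(tags):
    """Illustrative helper: collect the id attribute of each tag."""
    return [t.get('id') for t in tags]

# List: match any of several tag names
names = [t.name for t in soup.find_all(['a', 'p'])]

# Regular expression: matched against each class / href value
by_class = ids(soup.find_all(class_=re.compile('^sister')))
by_href = ids(soup.find_all(href=re.compile(r'^http://example\.com/')))

# Function: called once per tag, keep the tag when it returns True
by_func = ids(soup.find_all(lambda t: t.has_attr('class') and t.has_attr('id')))

# limit stops the search after the given number of matches
limited = soup.find_all('a', limit=2)

print(names, by_class, by_href, by_func, len(limited))
```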

    12. has_attr: check whether the tag has the given attribute

    # tag = soup.find('a')
    # v = tag.has_attr('id')
    # print(v)

    13. get_text: get the text content inside the tag

    # tag = soup.find('a')
    # v = tag.get_text()  # the first positional argument is a separator, not an attribute name
    # print(v)
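The first positional argument of `get_text` is a separator that is placed between the individual text nodes, and `strip=True` trims whitespace around each of them. A small sketch, assuming the first link of the sample document plus a made-up `<div>`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link1">Els<span>f</span>ie</a><div> a <b> b </b></div>',
    features="html.parser",
)

link = soup.find('a')
plain = link.get_text()         # text nodes joined with no separator
piped = link.get_text('|')      # text nodes joined with '|'

div = soup.find('div')
clean = div.get_text(' ', strip=True)   # strip each piece, then join with a space

print(plain, piped, clean)
```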
  • Original article: https://www.cnblogs.com/wlx97e6/p/9287948.html