• BeautifulSoup


    Basic usage of BeautifulSoup:

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    
    
    # Next, create the BeautifulSoup object. There are two ways to do this:
    # First: from a string
    soup = BeautifulSoup(html, 'lxml')
    # Second: from a file. Suppose the html string above is saved as index.html.
    # soup = BeautifulSoup(open('index.html'))
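    # A minimal sketch of the file-based variant (an assumption for illustration: an
    # index.html file exists next to the script; the with-block closes it after parsing):
    # with open('index.html', encoding='utf-8') as f:
    #     soup = BeautifulSoup(f, 'lxml')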
    
    # The document is converted to Unicode, and HTML entities are converted to Unicode characters.
    print(soup.prettify())
    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>

    The following example gives a quick feel for bs4 and shows how powerful it is:

    from bs4 import BeautifulSoup
    
    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())
    print(soup.title)
    print(soup.title.name)
    print(soup.title.string)
    print(soup.title.parent.name)
    print(soup.p)
    print(soup.p["class"])
    print(soup.a)
    print(soup.find_all('a'))
    print(soup.find(id='link3'))

    Result:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    <title>The Dormouse's story</title>
    title
    The Dormouse's story
    head
    <p class="title"><b>The Dormouse's story</b></p>
    ['title']
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    Tag selectors

    In the quick-start example above we add the following code:
    print(soup.title)
    print(type(soup.title))
    print(soup.head)
    print(soup.p)

    With soup.<tag name> we can get a tag and its content.
    Note that if the document contains several tags with the same name, this form returns only the first one. For example, soup.p above returns just the first of the document's several p tags; the sketch below illustrates the difference.
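
    A minimal, self-contained sketch of this behaviour (the two-paragraph html string here is made up purely for illustration):

    from bs4 import BeautifulSoup

    html = '<p class="title">first</p><p class="story">second</p>'
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p)              # <p class="title">first</p>  -- only the first match
    print(soup.find_all('p'))  # a list containing both <p> tags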



    Getting content with soup.title.string:

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    """
    .string, .strings and .stripped_strings are three related properties.
    .string behaves in a particular way: if a tag has no child tags, .string returns the tag's text. If a tag has exactly one child tag, .string returns the innermost text. If a tag contains more than one child, .string cannot tell which child's content it should return, so it returns None.
    """
    html = '''                             
    <html><head><title>The Dormouse's story
    <body>                                 
    <p class="title"><b>The Dormouse's stor
    
    <p class="story">Once upon a time there
    <a href="http://example.com/elsie" clas
    <a href="http://example.com/lacie" clas
    <a href="http://example.com/tillie" cla
    and they lived at the bottom of a well.
    <p class="story">...</p>               
    '''
    soup = BeautifulSoup(html, 'lxml',)
    # To get the text inside a tag, use .string
    print(soup.head.string)
    print(soup.title.string)
    print(soup.html.string)
    print('-' * 50)
    # .strings is useful when a tag contains several strings; it can be iterated over.
    for string in soup.strings:
        print(string)
    print('+' * 50)
    # .stripped_strings additionally strips whitespace and skips blank strings.
    for q in soup.stripped_strings:
        print(q)

    Result:

    The Dormouse's story
    
    The Dormouse's story
    
    None
    --------------------------------------------------
    The Dormouse's story
    
    
    
    The Dormouse's stor
    
    
    Once upon a time there
    
    ...
    
    
    ++++++++++++++++++++++++++++++++++++++++++++++++++
    The Dormouse's story
    The Dormouse's stor
    Once upon a time there
    ...

    Nested selection

    We can also chain tag access level by level, as in the line below (a fuller sketch follows):

    print(soup.head.title.string)
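
    A minimal, self-contained sketch of nested selection (the html string is a trimmed-down version of the sample document used above):

    from bs4 import BeautifulSoup

    html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.title)         # <title>The Dormouse's story</title>
    print(soup.head.title.string)  # The Dormouse's story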


    Getting the name: soup.title.name
     
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    """
    Tag: a Tag object corresponds to a tag in the original XML or HTML document, e.g. <title>The Dormouse's story</title> or <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>.
    Extract the title: print soup.title
    Extract an a: print soup.a
    Extract a p: print soup.p

    A Tag has two important attributes: name and attributes. Every Tag has its own name, which is accessed via .name.
    """
    
    from bs4 import BeautifulSoup
    
    html = '''                             
    <html><head><title>The Dormouse's story
    <body>                                 
    <p class="title"><b>The Dormouse's stor
    
    <p class="story">Once upon a time there
    <a href="http://example.com/elsie" clas
    <a href="http://example.com/lacie" clas
    <a href="http://example.com/tillie" cla
    and they lived at the bottom of a well.
    <p class="story">...</p>               
    '''
    
    # Create the BeautifulSoup object
    # First way: from a string
    soup = BeautifulSoup(html, 'lxml')
    print(soup.name)   # the soup object itself is special: its name is [document]; for other tags, .name is the tag's own name
    print(soup.title.name)
    print(soup.p.string)
    
    """
    Tag: besides reading name, you can also modify it; the change affects the HTML document generated by the current BeautifulSoup object.
    """
    soup.title.name = "cc"
    print(soup.title)
    print(soup.cc)  # the title tag has now been renamed to cc
    # Now for Tag attributes: <p class="title"><b>The Dormouse's story</b></p> has a "class" attribute whose value is "title". Tag attributes are read and written just like a dictionary.
    print(soup.p['class'])
    print(soup.p.get('class'))
    
    # You can also use .attrs to get all of a Tag's attributes at once, for example:
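    # (an illustrative addition: at this point soup.p is still the first <p class="title">,
    # so .attrs is expected to be a dict like {'class': ['title']})
    # print(soup.p.attrs)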
    
    
    # As with name, these attributes and contents can be modified.
    soup.p['class'] = 'cc'
    print(soup.p)

    Result:

    [document]
    title
    The Dormouse's stor


    None
    <cc>The Dormouse's story
    </cc>
    ['title']
    ['title']
    <p class="cc"><b>The Dormouse's stor
    
    </b></p>

     

    Getting attributes

    print(soup.p.attrs['name'])
    print(soup.p['name'])
    Both forms above retrieve the value of the p tag's name attribute.
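
    Note that bracket access raises a KeyError when the attribute does not exist, while .get() returns None or a chosen default. A minimal, self-contained sketch (the one-line html string is made up for illustration):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p class="title">text</p>', 'lxml')
    print(soup.p.get('class'))        # ['title']
    print(soup.p.get('name'))         # None -- attribute missing, but no exception
    print(soup.p.get('name', 'n/a'))  # n/a -- an explicit default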

    Parent and ancestor nodes

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    html = '''                             
    <html><head><title>The Dormouse's story
    <body>                                 
    <p class="title"><b>The Dormouse's stor
    
    <p class="story">Once upon a time there
    <a href="http://example.com/elsie" clas
    <a href="http://example.com/lacie" clas
    <a href="http://example.com/tillie" cla
    and they lived at the bottom of a well.
    <p class="story">...</p>               
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)
    print(soup.title.parent)  # the parent node
    # .parents recursively yields all of an element's ancestors; here it is used to walk from the <a> tag up to the root of the document.
    print(soup.a)
    for p in soup.a.parents:
        if p is None:
            print(p)
        else:
            print(p.name)

    Result:

    <title>The Dormouse's story
    </title>
    <head><title>The Dormouse's story
    </title></head>
    <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>
    p
    body
    html
    [document]

    Sibling nodes

    soup.a.next_siblings      gets all following sibling nodes
    soup.a.previous_siblings  gets all preceding sibling nodes
    soup.a.next_sibling       gets the next sibling tag
    soup.a.previous_sibling   gets the previous sibling tag

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    html = '''                             
    <html><head><title>The Dormouse's story
    <body>                                 
    <p class="title"><b>The Dormouse's stor
    
    <p class="story">Once upon a time there
    <a href="http://example.com/elsie" clas
    <a href="http://example.com/lacie" clas
    <a href="http://example.com/tillie" cla
    and they lived at the bottom of a well.
    <p class="story">...</p>               
    '''
    soup = BeautifulSoup(html, 'lxml')
    # Sibling nodes: from the soup.prettify() output we can see that <a> has many siblings. Siblings are nodes at the same level as the current node. .next_sibling returns the next sibling and .previous_sibling the previous one; if the sibling does not exist, None is returned.
    #
    
    print(soup.p.next_sibling)
    print('-' * 50)
    print(soup.p.previous_sibling)
    print('#' * 50)
    print(soup.p.next_sibling.next_sibling)
    for i in soup.p.next_siblings:
        print(repr(i))

    Result:

    <p class="story">Once upon a time there
    <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
    --------------------------------------------------


    ##################################################
    
    
    <p class="story">Once upon a time there
    <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
    '
    '
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    # For preceding and following nodes use .next_element and .previous_element. Unlike .next_sibling / .previous_sibling, they are not limited to siblings: they walk every node in document order, regardless of nesting. For example, in <head><title>The Dormouse's story</title></head> the node after <head> is <title>.
    html = '''                             
    <html><head><title>The Dormouse's story
    <body>                                 
    <p class="title"><b>The Dormouse's stor
    
    <p class="story">Once upon a time there
    <a href="http://example.com/elsie" clas
    <a href="http://example.com/lacie" clas
    <a href="http://example.com/tillie" cla
    and they lived at the bottom of a well.
    <p class="story">...</p>               
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head)
    print(soup.head.next_element)
    # To iterate over all following or preceding nodes, the .next_elements and .previous_elements iterators walk forwards or backwards through the parsed document.
    print('-' * 50)
    for element in soup.a.next_elements:
        print(repr(element))
    Result:
    <head><title>The Dormouse's story
    </title></head>
    <title>The Dormouse's story
    </title>
    --------------------------------------------------
    '...'
    '                \n'

    Child nodes and descendants:

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    html = '''                             
    <html><head><title>The Dormouse's story
    <body>                                 
    <p class="title"><b>The Dormouse's stor
    
    <p class="story">Once upon a time there
    <a href="http://example.com/elsie" clas
    <a href="http://example.com/lacie" clas
    <a href="http://example.com/tillie" cla
    and they lived at the bottom of a well.
    <p class="story">...</p>               
    '''
    # Child nodes: a Tag's .contents and .children are very important
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.contents)
    print(len(soup.head.contents))
    print(soup.head.contents[0].string)
    # A string has no .contents attribute, i.e. no child nodes.
    # .children returns a generator that can be looped over to visit the child nodes.
    for child in soup.head.children:
        print(child)
    print('-' * 50)
    # .contents and .children only include a Tag's direct children.
    # .descendants recursively iterates over all of a Tag's descendants.
    for c in soup.head.descendants:
        print(c)

    Result:

    [<title>The Dormouse's story
    </title>]
    1
    The Dormouse's story
    
    <title>The Dormouse's story
    </title>
    --------------------------------------------------
    <title>The Dormouse's story
    </title>
    The Dormouse's story

    Standard selectors

    find_all(name, attrs, recursive, text, **kwargs)

    find_all() searches the document by tag name, attributes or text content.

    Usage of name:

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('ul'))
    print('-' * 50)
    print(type(soup.find_all('ul')[0]))

    Result:

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    --------------------------------------------------
    <class 'bs4.element.Tag'>

    We can also call find_all() again on each result, to pull out all the li tags (a flattening sketch follows the snippet):

    for ul in soup.find_all('ul'):
        print(ul.find_all('li'))
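
    A minimal sketch that flattens the nested search into a single list of li texts (it assumes the soup object built from the panel html above):

    # collect the text of every <li> found inside each <ul>
    li_texts = [li.get_text() for ul in soup.find_all('ul') for li in ul.find_all('li')]
    print(li_texts)   # expected: ['Foo', 'Bar', 'Jay', 'Foo', 'Bar']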

    attrs lets you pass a dictionary of attributes to search by. class needs special handling because class is a reserved word in Python: either use the keyword argument class_='element' or pass a dict, e.g. soup.find_all('li', {"class": "element"}). Common attributes such as id can be passed directly as keyword arguments without attrs. A short sketch follows; the example after it uses the text parameter to search by text content instead.
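
    A minimal sketch of these attribute-based searches (it assumes the soup object built from the panel html above; the comments describe the expected matches):

    print(soup.find_all(attrs={'id': 'list-1'}))   # the first <ul>, matched via an attrs dict
    print(soup.find_all('li', class_='element'))   # all five <li> tags, class_ with an underscore
    print(soup.find_all(id='list-2'))              # the second <ul>; id can be a plain keyword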

    html='''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='Foo'))

    Result:

    ['Foo', 'Foo']

    Other usage:

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    import re
    """
    find_all() searches all Tag children of the current Tag and checks them against the filter conditions.
    find_all(name, attrs, recursive, text, **kwargs)

    The name parameter finds all tags with a matching name; string objects are skipped automatically. name can be a string, a regular expression, a list, True or a function. The simplest filter is a string: BeautifulSoup returns the content that matches the string exactly.
    """
    
    html = '''                             
    <html><head><title>The Dormouse's story
    <body>                                 
    <p class="title"><b>The Dormouse's stor
    
    <p class="story">Once upon a time there
    <a href="http://example.com/elsie" clas
    <a href="http://example.com/lacie" clas
    <a href="http://example.com/tillie" cla
    and they lived at the bottom of a well.
    <p class="story">...</p>               
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('b'))
    # If a regular expression is passed, BeautifulSoup matches tag names with the pattern's match() method.
    for tag in soup.find_all(re.compile('^b')):
        print(tag.name)
    print('*' * 50)
    # If a list is passed, BeautifulSoup returns everything that matches any element of the list.
    print(soup.find_all(['a', 'b']))
    print('@' * 50)
    # If True is passed, it matches every tag.
    for ti in soup.find_all(True):
        print(ti)
    
    print('#' * 50)
    # If no built-in filter fits, you can define a function that takes a single Tag argument and returns True if the tag matches, False otherwise.
    
    """
    def hasClass_id(tag):
        return tag.has_attr('class') and tag.has_attr('id')
    print(soup.find_all(hasClass_id))
    """
    
    # kwargs: a named argument that is not one of the built-in search parameters is treated as a tag attribute to filter on. Its value can be a string, a regular expression, a list or True. For example, passing id makes BeautifulSoup search every tag's 'id' attribute.
    print(soup.find_all(id='link2'))
    
    
    # Passing href makes BeautifulSoup search every Tag's 'href' attribute.
    print(soup.find_all(href=re.compile('elsie')))
    print(soup.find_all(id=True))
    
    # To filter on class, append an underscore (class_), because class is a Python keyword.
    print(soup.find_all('a', class_='sister'))
    print('c' * 50)
    # Several keyword arguments can be combined to filter on multiple attributes at once:
    print(soup.find_all(href=re.compile('elsie'), id='link1'))
    """
    # Some attribute names cannot be used as keyword arguments, e.g. HTML5 data-* attributes; pass them through attrs instead:
    data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
    
    data_soup.find_all(attrs={"data-foo": "value"})
    """

    Result:

    [<b>The Dormouse's stor
    
    </b>]
    body
    b
    **************************************************
    [<b>The Dormouse's stor
    
    </b>, <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>]
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    <html><head><title>The Dormouse's story
    </title></head><body>
    <p class="title"><b>The Dormouse's stor
    
    </b></p><p class="story">Once upon a time there
    <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
    </body></html>
    <head><title>The Dormouse's story
    </title></head>
    <title>The Dormouse's story
    </title>
    <body>
    <p class="title"><b>The Dormouse's stor
    
    </b></p><p class="story">Once upon a time there
    <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
    </body>
    <p class="title"><b>The Dormouse's stor
    
    </b></p>
    <b>The Dormouse's stor
    
    </b>
    <p class="story">Once upon a time there
    <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
    <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>
    ##################################################
    []
    [<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>]
    []
    []
    cccccccccccccccccccccccccccccccccccccccccccccccccc
    []
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    # find_all() returns every match, which can be slow on a large document tree. If you do not need them all, the limit parameter caps the number of results; the search stops as soon as the limit is reached.
    print(soup.find_all('a', limit=2))

    Result:

    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    # When find_all() is called on a Tag it searches all of the tag's descendants. To search only the direct children, pass recursive=False.
    print(soup.find_all('title'))
    print(soup.find_all('title', recursive=False))

    Result:

    [<title>The Dormouse's story</title>]
    []

    find

    find(name,attrs,recursive,text,**kwargs)
    find() returns only the first element that matches, or None if nothing matches.

    Some similar methods (a short sketch follows the list):
    find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
    find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
    find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
    find_all_next() returns all matching nodes after the current node; find_next() returns the first such node.
    find_all_previous() returns all matching nodes before the current node; find_previous() returns the first such node.
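
    A minimal, self-contained sketch of find() and two of these navigation helpers, using a stripped-down version of the three-sisters document (the comments show the expected results):

    from bs4 import BeautifulSoup

    html = ('<p class="story"><a id="link1">Elsie</a>, '
            '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
    soup = BeautifulSoup(html, 'lxml')

    first_a = soup.find('a')                      # only the first match
    print(first_a['id'])                          # link1
    print(first_a.find_next_sibling('a')['id'])   # link2
    print(first_a.find_parent('p')['class'])      # ['story']
    print(soup.find('b'))                         # None -- nothing matches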

    Recommendations: use the lxml parser, falling back to html.parser when necessary.
    Tag-attribute selection (soup.tag) is fast but its filtering power is weak.
    Use find() / find_all() to match a single result or multiple results.
    If you are familiar with CSS selectors, use select().
    Remember the common ways of getting attribute values and text.

    CSS selectors

    Passing a CSS selector straight to select() performs the selection.
    Anyone familiar with front-end work will recognise the syntax; it is the same as CSS:
    . selects by class, # selects by id
    tag1,tag2 finds all tag1 and all tag2 elements
    tag1 tag2 finds every tag2 nested inside a tag1
    [attr] finds all tags that have the given attribute
    [attr=value] e.g. [target=_blank] finds all tags with target="_blank"
    A short sketch of the grouping and attribute forms follows; the example after it covers class, id and descendant selectors.
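
    A minimal, self-contained sketch of the grouping and attribute-selector forms (the tiny html string here is made up purely for illustration):

    from bs4 import BeautifulSoup

    html = '<h4>Hi</h4><ul id="list-1"><li>Foo</li></ul><a target="_blank">link</a>'
    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('h4,li'))            # all h4 and li tags
    print(soup.select('[id]'))             # every tag that has an id attribute
    print(soup.select('[target=_blank]'))  # every tag whose target equals _blank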

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('.panel .panel-heading'))
    print(soup.select('ul li'))
    print(soup.select('#list-2 .element'))
    print(type(soup.select('ul')[0]))

    Result:

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    <class 'bs4.element.Tag'>

    Getting text content

    get_text() returns the text content of a tag.

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print(li.get_text())

    Result:

    Foo
    Bar
    Jay
    Foo
    Bar
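
    get_text() also accepts a separator and a strip flag, which helps when a tag contains nested markup. A minimal, self-contained sketch (the one-line html string is made up for illustration):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p> Hello <b>world</b> </p>', 'lxml')
    print(repr(soup.p.get_text()))                 # ' Hello world '
    print(repr(soup.p.get_text(strip=True)))       # 'Helloworld'
    print(repr(soup.p.get_text(' ', strip=True)))  # 'Hello world'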

    Getting attributes
    Attribute values can be read with [attribute name] or attrs[attribute name]

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])
        print(ul.attrs['id'])
    Result:
    list-1
    list-1
    list-2
    list-2
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    # Elements can also be located with CSS selectors. In CSS, tag names are written plainly, class names are prefixed with '.' and ids with '#'. soup.select() uses the same syntax and returns a list.
    # 1. Searching by tag name (directly, through several levels, as direct children, or via sibling relations)
    # direct search
    print(soup.select('title'))
    # search through multiple levels
    print(soup.select('html head title'))
    # direct children: the title tag directly under head
    print(soup.select('head > title'))
    # the tag with id='link1' under a p
    print(soup.select('p > #link1'))
    # sibling selectors
    # all siblings with class=sister that follow id='link1'
    print(soup.select('#link1 ~ .sister'))

    # the sibling with class=sister immediately after id='link1'
    print(soup.select('#link1 + .sister'))

    Result:

    [<title>The Dormouse's story</title>]
    [<title>The Dormouse's story</title>]
    [<title>The Dormouse's story</title>]
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
    [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('.sister'))
    print(soup.select('[class~=sister]'))

    # search by tag id
    print(soup.select('#link1'))
    print(soup.select('a#link2'))

    # search by the presence of an attribute
    print(soup.select('a[href]'))

    # search by attribute value
    print(soup.select('a[href="http://example.com/elsie"]'))
    print(soup.select('a[href^="http://example.com/"]'))
    print(soup.select('a[href*=".com/el"]'))

    Result:

    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
    [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
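
    Putting the pieces together, a minimal sketch that pulls both the href attribute and the link text out of each match (it assumes the soup object built just above):

    # iterate over the sister links and extract attribute values and text
    for a in soup.select('a.sister'):
        print(a['href'], a.get_text())
    # expected:
    # http://example.com/elsie Elsie
    # http://example.com/lacie Lacie
    # http://example.com/tillie Tillie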
     