• MOOC crawler notes (BeautifulSoup)


    https://www.crummy.com/software/BeautifulSoup/

    #!/usr/bin/python3
    # coding=utf-8

    from bs4 import BeautifulSoup
    import re

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    # from_encoding is only needed for bytes input; html_doc is already a str
    soup = BeautifulSoup(html_doc, 'html.parser')

    print('Get all links')
    links = soup.find_all('a')
    for link in links:
        print(link.name, link['href'], link.get_text())

    print('Get the link for Lacie')
    link_node = soup.find('a', href='http://example.com/lacie')
    print(link_node.name, link_node['href'], link_node.get_text())

    print('Regex match "ill"')
    # In a raw string r"", a backslash is written only once
    link_node = soup.find('a', href=re.compile(r"ill"))
    print(link_node.name, link_node['href'], link_node.get_text())

    print('Get the title paragraph text')
    # class is a Python keyword, so BeautifulSoup takes class_ instead
    p_node = soup.find('p', class_="title")
    print(p_node.name, p_node.get_text())
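    One subtle point: `from_encoding` only takes effect when BeautifulSoup is handed raw bytes (for example an undecoded HTTP response body); when the input is already a `str`, the argument is ignored and recent bs4 versions emit a warning. A minimal sketch of the intended use:

    ```python
    from bs4 import BeautifulSoup

    # from_encoding tells bs4 how to decode *bytes* input; with a str it is ignored
    raw = "<p class='title'><b>The Dormouse's story</b></p>".encode('utf-8')
    soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')
    print(soup.p.get_text())  # The Dormouse's story
    ```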

    Result:

    Get all links
    a http://example.com/elsie Elsie
    a http://example.com/lacie Lacie
    a http://example.com/tillie Tillie
    Get the link for Lacie
    a http://example.com/lacie Lacie
    Regex match "ill"
    a http://example.com/tillie Tillie
    Get the title paragraph text
    p The Dormouse's story
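    Besides `find`/`find_all`, the same queries can be written as CSS selectors with `select`/`select_one` (available in bs4, backed by the soupsieve package in 4.7+); a sketch against the same `html_doc`:

    ```python
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    soup = BeautifulSoup(html_doc, 'html.parser')

    # tag#id — equivalent to find('a', id='link2')
    print(soup.select_one('a#link2')['href'])
    # tag.class plus a descendant combinator — equivalent to find_all('a', class_='sister')
    print([a.get_text() for a in soup.select('p.story a.sister')])
    # equivalent to find('p', class_='title').b
    print(soup.select_one('p.title b').get_text())
    ```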
  • Original article: https://www.cnblogs.com/njczy2010/p/5551976.html