• Beautiful Soup


    Beautiful Soup: parsing HTML pages and extracting marked-up information

    Fetching the page source

    import requests
    from bs4 import BeautifulSoup
    
    kv = {'user-agent':'Mozilla/5.0'}
    url = "https://python123.io/ws/demo.html"
    r = requests.get(url,headers = kv)
    print(r.status_code)
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")  # parse the HTML
    print(soup.prettify())

    200
    <html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>

    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>

    Using BeautifulSoup

     

     

     Parsers supported by the BeautifulSoup library
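
A minimal sketch of choosing a parser, assuming the optional lxml and html5lib packages may or may not be installed. The helper name make_soup and the sample markup are hypothetical, not part of bs4; the parser names "lxml", "html5lib", and "html.parser" are the ones bs4 accepts. The helper falls back to the built-in html.parser when the others are missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, used only for illustration.
html = "<p class='title'><b>demo</b></p>"

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser found")

soup, parser_used = make_soup(html)
print(parser_used)
print(soup.b.string)
```

Because html.parser ships with Python, the last entry in the tuple should always succeed, which is why the examples in this post use it.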

      Basic elements of the BeautifulSoup class: Tag, Name, Attributes, NavigableString, and Comment

     

    https://python123.io/ws/demo.html

    import requests
    from bs4 import BeautifulSoup
    kv = {'user-agent':'Mozilla/5.0'}
    url = "https://python123.io/ws/demo.html"
    r = requests.get(url,headers = kv)
    print(r.status_code)
    demo = r.text
    soup = BeautifulSoup(demo,"html.parser")
    print(soup.title)
    tag = soup.a  # soup.a returns only the first <a> tag
    print(tag)

    200
    <title>This is a python demo page</title>
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

    print(soup.a.name)
    print(soup.a.parent.name)
    print(soup.a.parent.parent.name)

    a
    p
    body

    print(tag.attrs)          # all attributes of the tag
    print(tag.attrs['href'])
    print(type(tag.attrs))    # a dict
    print(type(tag))

    {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
    http://www.icourse163.org/course/BIT-268001
    <class 'dict'>
    <class 'bs4.element.Tag'>
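
A short sketch of the usual ways to read attributes, using the first link from the demo page as an inline string so that it runs without a network request. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = ('<a class="py1" href="http://www.icourse163.org/course/BIT-268001" '
        'id="link1">Basic Python</a>')
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]           # dict-style access; raises KeyError if absent
title = tag.get("title")     # .get() returns None for a missing attribute
classes = tag["class"]       # class is multi-valued, so bs4 returns a list
has_id = tag.has_attr("id")  # membership test without reading the value

print(href, title, classes, has_id)
```

Using .get() is the safer choice when scraping pages where an attribute may be missing on some tags.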



    tag = soup.a
    print(tag)
    print(tag.string)
    tag1 = soup.p
    print(tag1)
    print(tag1.string)
    tag2 = soup.b
    print(tag2)
    print(tag2.string)

    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    Basic Python
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    The demo python introduces several python courses.
    <b>The demo python introduces several python courses.</b>
    The demo python introduces several python courses.

    print(type(tag2.string))

    <class 'bs4.element.NavigableString'>
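
One caveat worth sketching: .string only works when a tag has a single child. The markup below is a hypothetical fragment, not the demo page. For a tag with mixed content, .string is None and get_text() is the usual alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> contains text, a tag, and more text.
html = '<p class="course">Learn <a id="link1">Basic Python</a> and more.</p>'
soup = BeautifulSoup(html, "html.parser")

single = soup.a.string     # <a> has exactly one string child
mixed = soup.p.string      # <p> has several children, so this is None
text = soup.p.get_text()   # concatenates every string under <p>

print(single)
print(mixed)
print(text)
```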

    soup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
    print(soup.b.string)
    print(type(soup.b.string))
    print(soup.p.string)
    print(type(soup.p.string))

    This is a comment
    <class 'bs4.element.Comment'>
    This is not a comment
    <class 'bs4.element.NavigableString'>
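
Since printing a Comment looks the same as printing ordinary text, a type check is needed to tell them apart. A minimal sketch follows; visible_string is a hypothetical helper name, not a bs4 function.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

def visible_string(tag):
    """Return the tag's string, or None if it is missing or is a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

print(visible_string(soup.b))
print(visible_string(soup.p))
```

Comment is a subclass of NavigableString, so the check has to test for Comment specifically.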

    Traversing HTML content with the bs4 library

    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>

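     The markup above forms a tag tree rooted at <html>, with <head> and <body> as its children.

    Three ways to traverse the tree

     Downward traversal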

    soup = BeautifulSoup(demo,"html.parser")
    print(soup.head)
    print(soup.head.contents)
    print(soup.body.contents)  # returns a list
    print(len(soup.body.contents))
    print(soup.body.contents[1])

    <head><title>This is a python demo page</title></head>
    [<title>This is a python demo page</title>]
    ['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
    5
    <p class="title"><b>The demo python introduces several python courses.</b></p>

    for child in soup.body.children:
        print(child)  # iterate over the direct children


    <p class="title"><b>The demo python introduces several python courses.</b></p>


    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
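
The blank lines in the output above come from the newline strings between tags, which count as children too. A small sketch of keeping only the Tag children, using a shortened, hypothetical version of the demo body:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the demo page body (assumption for illustration).
html = '<body>\n<p class="title">A</p>\n<p class="course">B</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.body.children)  # tags and newline strings
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

print(len(all_children))
print([t["class"] for t in tag_children])
```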

    for child in soup.body.descendants:
        print(child)  # iterate over all descendants


    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <b>The demo python introduces several python courses.</b>
    The demo python introduces several python courses.


    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    Basic Python
    and
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
    Advanced Python
    .
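
The output above visits nodes depth-first, in document order. A compact sketch that makes the order easier to see by separating tag names from strings; the markup is a hypothetical reduction of the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = ('<body><p class="title"><b>Intro</b></p>'
        '<p class="course">See <a id="link1">Basic</a></p></body>')
soup = BeautifulSoup(html, "html.parser")

# Tags in the order .descendants yields them (parent before its children).
names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]
# Strings in the same traversal order.
strings = [str(d) for d in soup.body.descendants
           if isinstance(d, NavigableString)]

print(names)
print(strings)
```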


    Upward traversal
    
    
    
    for parent in soup.a.parents:  # walk up through every ancestor of the first <a> tag
        if parent is None:
            print(parent)
        else:
            print(parent.name)

    p
    body
    html
    [document]
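
The same loop can be condensed to build a path from the root down to a tag. A sketch on a hypothetical minimal document; path is an illustrative name.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>x</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents yields the nearest ancestor first, ending at the soup object.
ancestors = [parent.name for parent in soup.a.parents]
# Reverse it to read the path from the top of the document downward.
path = " > ".join(reversed(ancestors))

print(ancestors)
print(path)
```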

    soup = BeautifulSoup(demo,"html.parser")
    tag = soup.a
    print(tag)
    print(tag.parent)

    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

    Note: the parent of soup.html is the BeautifulSoup object itself (soup, whose name is [document]), and soup.parent is None.
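
The top of the tree can be checked directly. A minimal sketch on a hypothetical one-line document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>x</p></body></html>", "html.parser")

html_parent_is_soup = soup.html.parent is soup  # <html> hangs off the soup object
soup_has_no_parent = soup.parent is None        # the soup object is the root
root_name = soup.name                           # special name of the root

print(html_parent_is_soup, soup_has_no_parent, root_name)
```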

    Sibling traversal

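    Sibling traversal happens only among nodes that share the same parent.

    The NavigableString objects between tags are also nodes in the tree, so a node's children or siblings may be NavigableString objects rather than tags.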

    soup = BeautifulSoup(demo,"html.parser")
    tag = soup.a
    print(tag.next_sibling)
    print(tag.next_sibling.next_sibling)
    print(tag.previous_sibling)
    print(tag.previous_sibling.previous_sibling)

    print(tag.parent)

     and 
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

    None
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
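
Because strings sit between sibling tags, stepping with next_sibling often lands on text. A sketch of two alternatives, iterating with next_siblings and jumping straight to the next tag with find_next_sibling, on a hypothetical fragment modeled on the demo paragraph:

```python
from bs4 import BeautifulSoup

html = ('<p>Courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

following = list(first.next_siblings)     # every later sibling: text, tag, text
next_link = first.find_next_sibling("a")  # skips the text, returns the next <a>
before = first.previous_sibling           # the leading text node
before_that = before.previous_sibling     # nothing precedes it, so None

print(following)
print(next_link["id"])
print(repr(before), before_that)
```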

    Formatted HTML output with the bs4 library

    soup = BeautifulSoup(demo, "html.parser")
    print(soup.prettify())  # prettify() puts each tag and string on its own line for readable output
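
A small sketch of what prettify() does to a hypothetical one-line fragment. It can be called on a single tag as well as on the whole soup, and it returns a string rather than printing.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>hi</b></p>", "html.parser")

pretty = soup.p.prettify()  # a str with one tag or string per line
lines = pretty.splitlines()

for line in lines:
    print(repr(line))       # repr() makes the leading indentation visible
```

The added whitespace changes the text content of the document, so prettify() is best kept for inspection rather than for producing markup to re-parse.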

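    bs4 converts every document it reads to Unicode and writes UTF-8 by default, so non-ASCII text such as Chinese needs no extra handling:
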
    soup = BeautifulSoup("<p>你好</p>","html.parser")
    print(soup.p.string)
    print(soup.p.prettify())
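
When the input arrives as bytes in another encoding, the encoding can be stated explicitly. A hedged sketch, assuming the bytes are known to be GBK; from_encoding is a BeautifulSoup constructor argument and encode() is a Tag method.

```python
from bs4 import BeautifulSoup

# Simulate a page served as GBK-encoded bytes (assumption for illustration).
raw = "<p>你好</p>".encode("gbk")

soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
text = soup.p.string                 # decoded to a Unicode string
utf8_bytes = soup.p.encode("utf-8")  # serialized back out as UTF-8 bytes

print(text)
print(utf8_bytes)
```

With requests, the alternative is to fix r.encoding before reading r.text, so that bs4 receives an already decoded string.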


  • Original article: https://www.cnblogs.com/tingtin/p/12907452.html