• Learning BeautifulSoup


    1. Download and installation

    http://www.crummy.com/software/BeautifulSoup

    (1) python setup.py build

    (2) python setup.py install

    (3) A small example

    from bs4 import BeautifulSoup
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
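    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html_doc, "html.parser")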
    
    print(soup.prettify())
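
Once the document is parsed, the same soup object can be queried. The following is a minimal sketch on a shortened copy of the sample above, assuming the built-in "html.parser"; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Text of the <title> tag
title_text = soup.title.string

# href of every <a> tag whose class is "sister"
links = [a["href"] for a in soup.find_all("a", class_="sister")]

# Look up a single tag by its id attribute
second_link = soup.find(id="link2").get_text()

print(title_text)
print(links)
print(second_link)
```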
    

    2. Usage example

    import re
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    # Crawl the given URL and map each category name to its link.
    def parserCategary(urlString):
        html_doc = urlopen(urlString).read()
        soup = BeautifulSoup(html_doc, "html.parser")
        lis = soup.find_all('li', {'class': 'category-item'})
        result = {}
        # Capture the category name that sits between "/category/" and "?"
        r = re.compile(r'/category/([a-zA-Z_]+)\?')
        for item in lis:
            # Assumes the first child of each <li> is the <a> tag carrying the href
            href = item.contents[0]['href']
            result[r.findall(href)[0]] = urlString + href
        return result

    if __name__ == '__main__':
        urlString = 'https://play.google.com/store'
        result = parserCategary(urlString)
        for item in result.items():
            print(item)
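
The extraction logic can also be tried without touching the network. The sketch below runs the same idea against an inline HTML snippet. The markup, the `parse_categories` name and the `https://example.com` base URL are all made up for illustration; the real page layout may well differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
SAMPLE_HTML = """
<ul>
<li class="category-item"><a href="/category/GAME?feature=top">Games</a></li>
<li class="category-item"><a href="/category/BOOKS_AND_REFERENCE?feature=top">Books</a></li>
<li class="other"><a href="/about">About</a></li>
</ul>
"""

def parse_categories(html, base_url):
    """Map each category name found in the href to a full link."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"/category/([a-zA-Z_]+)\?")
    result = {}
    for li in soup.find_all("li", class_="category-item"):
        # Search for the <a> tag instead of relying on contents[0]
        link = li.find("a", href=True)
        if link is None:
            continue
        match = pattern.search(link["href"])
        if match:
            result[match.group(1)] = base_url + link["href"]
    return result

categories = parse_categories(SAMPLE_HTML, "https://example.com")
for name, url in categories.items():
    print(name, url)
```

Looking up the `<a>` tag with `find` is a deliberate choice here: if whitespace precedes the link, `contents[0]` is a text node and indexing it with `['href']` fails.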
    

    A small problem:

    While parsing a passage of text, I got a node tree like this:

    [u'This can place a load on the CPU. You may feel slow indeed in the Android OS.', <br>We also exhibit the versions of other OS.<br>Downloads : <a href="https://www.google.com/url?q=http://wizapply.com/mp2mark/&amp;sa=D&amp;usg=AFQjCNFp2xblQHypa_Z4hvs_VRlXdqgUKw" target="_blank">http://wizapply.com/mp2mark/</a><p>+ Mobile GPU demonstration [GP2Mark]<br>Search "GP2Mark" in the Google Play!<p>We sell the "manual" and "MP2Mark C source code". If you are interested in is not please contact us.<p>If there is a request, we will receive improvement, etc.<p>tag bench,demo,tegra,multi,core,intel,amd,arm,snapdragon,samsung,exynos,cortex,ndk</p></p></p></br></p></br></br>]

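    This list actually contains only two items: the unicode string at the start, and a tree as the second item. But because the text inside the tree contains commas, the printed list, which really has just two items, wrongly looks as if it had many...

    Frustrating.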
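
One way to check what such a list really holds is to look at its length and at the type of each element rather than at the printed form. This is a minimal sketch on a made-up fragment, assuming the confusion comes only from reading the printed output; the HTML here is hypothetical and much simpler than the tree above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment: one leading string, then one tag whose text contains commas
html = "<div>Plain text first.<p>tag bench,demo,tegra,multi</p></div>"
soup = BeautifulSoup(html, "html.parser")

children = soup.div.contents

# The commas inside the <p> text do not create extra list items
count = len(children)

# Classify each child as a text node or a tag
kinds = ["text" if isinstance(c, NavigableString) else "tag" for c in children]

# The whole comma-separated text belongs to the single <p> element
tag_text = children[1].get_text()

print(count, kinds)
print(tag_text)
```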

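    When crawling a website with DFS, two things must be clear: what is the termination condition, and what problem does each step handle?

    I spent quite a lot of time writing the DFS today and ran into several problems:

    (1) I had not worked out what the termination condition was.

    (2) I had not defined the task of each step.

    (3) Although the traversal is depth-first, each function call still has to go through all the direct children one after another, breadth-wise.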
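
The three points above can be sketched as follows. This is only an illustration on a hypothetical in-memory "site" (the `SITE` dictionary, `get_links` and the `max_depth` limit are assumptions made for the example), not the crawler from this post. A real crawler would fetch each page and extract its links instead.

```python
# Hypothetical site structure: page -> links found on that page
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/a/1"],
    "/a/1": [],
}

def get_links(page):
    """Stand-in for downloading a page and extracting its links."""
    return SITE.get(page, [])

def dfs(page, visited, order, depth=0, max_depth=5):
    # (1) Termination condition: page already seen, or depth limit exceeded
    if page in visited or depth > max_depth:
        return
    visited.add(page)

    # (2) Task of this step: process the current page (here, just record it)
    order.append(page)

    # (3) Go through every direct child in turn, descending into each one
    for child in get_links(page):
        dfs(child, visited, order, depth + 1, max_depth)

order = []
dfs("/", set(), order)
print(order)
```

The `visited` set matters as much as the depth limit: pages link back to each other, and without it the recursion would not stop.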

    num-pagination-control

    num-pagination-content

    input   :    id="reviewUseAjaxUrl"
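
If the three notes above are read as two class names and an `id` attribute to look for on the page (an assumption, since the notes do not say), the lookups could be sketched like this. The HTML, including the `href` and `value` contents, is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup built around the names noted above
html = """
<div class="num-pagination-content">
  <a class="num-pagination-control" href="?page=1">1</a>
  <a class="num-pagination-control" href="?page=2">2</a>
</div>
<input type="hidden" id="reviewUseAjaxUrl" value="/reviews/ajax">
"""

soup = BeautifulSoup(html, "html.parser")

# All elements carrying the pagination-control class
controls = soup.find_all(class_="num-pagination-control")
page_links = [c.get("href") for c in controls]

# The single pagination container
content = soup.find(class_="num-pagination-content")

# The <input> identified by its id; find() returns None when it is missing
ajax_input = soup.find("input", id="reviewUseAjaxUrl")
ajax_value = ajax_input.get("value") if ajax_input is not None else None

print(page_links)
print(ajax_value)
```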

  • Original post: https://www.cnblogs.com/jilichuan/p/3107690.html