• Scraper_compare('NoneType' object has no attribute 'group')


    三种解析网页的方法各有所用,各有特点。通过,对比三种方式更能明白在什么情况之下采用什么方法。其中,运行代码时,可能会遇到一个bug(

    results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).group()
    AttributeError: 'NoneType' object has no attribute 'group'

    ),这其实是有一个参数没有对,如果删除之后,运行顺利!

    # 比较这三种解析的相对性能
    import urllib2
    import re
    from bs4 import BeautifulSoup
    import lxml.html
    import time
    def download(url, user_agent="wswp", num_retries=2):
        print "Download :", url
        headers = {"User_agent": user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print "Download Error :", e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, "code") and 500 <= e.code < 600:
                    return download(url, user_agent, num_retries-1)
        return html
    # 如果运行出现“results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).group()
    AttributeError: 'NoneType' object has no attribute 'group'”则只需删除FIELDS中的“iso”元素就行。
    FIELDS = ("area", "population", "iso", "country", "capital", "continent", "tld", "currency_code", "currency_name", "phone", "postal_code_format", "postal_code_regex", "languages", "neighbours")
    
    # 正则
    def re_scraper(html):
        results = {}
        for field in FIELDS:
            results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).groups()[0]
        return results
    
    # BeautifulSoup
    def bs_scraper(html):
        soup = BeautifulSoup(html, "html.parser", from_encoding="utf8")
        results = {}
        for field in FIELDS:
            # tr = soup.find(attrs={'id': 'places_%s__row' % field})
            # td = tr.find(attrs={'class':'w2p_fw'})
            # results[field] = td.text
            results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text
        return results
    
    # lxml模块
    def lxml_scraper(html):
        tree = lxml.html.fromstring(html)
        results = {}
        for field in FIELDS:
            results[field] = tree.cssselect('tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
        return results
    
    # 测试相对性能
    NUM_ITERATIONS = 1000
    html = download("http://example.webscraping.com/view/United-Kingdom-239")
    for name, scraper in [("Regular expressions", re_scraper), ("BeautifulSoup", bs_scraper), ("Lxml", lxml_scraper)]:
        start = time.time()
        for i in range(NUM_ITERATIONS):
            if scraper == re_scraper:
                # 清除正则表达式产生的搜索缓存,使结果更加公平
                re.purge()
            result = scraper(html)
            assert (result['area'] == '244,820 square kilometres')
    
        end = time.time()
        print "%s: %.2f seconds" % (name, end - start)

    输出结果是:

    Regular expressions: 3.82 seconds
    BeautifulSoup: 25.92 seconds
    Lxml: 4.33 seconds

    可以看出BeautifulSoup的性能最慢,但是,同时它也是最简单的。不过,在通常情况下,LXML是抓取数据的最好选择,这是因为该方法既快速又健壮,而正则表达式和BeautifulSoup只在某些特定的场景下有用。

  • 相关阅读:
    Xposed学习一:初探
    drozer浅析三:命令实现与交互
    drozer源码学习二:info+scanner
    drozer源码学习:app
    android dalvik浅析一:解释器及其执行
    智能汽车安全风险及防护技术分析
    几维安全“把手伸向金融交易系统”
    几维安全携手苏宁易购,创造企业安全建设新模式
    传输协议不安全,数据泄露谁之过?——流量劫持技术分析
    【一周安全热点】黑客“撞库”破解抖音百万账户密码两月获利上百万元|美国佛罗里达州向勒索软件运营商支付60万美元赎金
  • 原文地址:https://www.cnblogs.com/llhy1178/p/6844214.html
Copyright © 2020-2023  润新知