• [Python Crawler] Improving Performance


    Concurrency and Asynchronous IO

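    When writing a crawler, most of the running time is spent on IO requests. In single-process, single-threaded mode, every URL request blocks until its response arrives, so the waits add up and the whole run becomes slow.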

    import requests
    
    def fetch_async(url):
        response = requests.get(url)
        return response
    
    
    url_list = ['http://www.github.com', 'http://www.bing.com']
    
    for url in url_list:
        print(url,fetch_async(url))
    1-Synchronous execution
    from concurrent.futures import ThreadPoolExecutor
    import requests
    
    
    def fetch_async(url):
        response = requests.get(url)
        return response
    
    
    url_list = ['http://www.github.com', 'http://www.bing.com']
    pool = ThreadPoolExecutor(5)
    for url in url_list:
        pool.submit(fetch_async, url)
    
    pool.shutdown(wait=True)
    2-Multithreaded (thread pool) execution
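    The thread pool example above submits the requests but throws the responses away. A minimal sketch of how the results could be collected as they finish is given here. The names `fake_fetch` and `fetch_all` are hypothetical helpers introduced only for illustration, and `fake_fetch` simulates latency with `time.sleep` so the sketch runs without network access. With requests installed, something like `lambda url: requests.get(url).status_code` could presumably be passed as `fetch` instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url): sleeps, then returns a fake status."""
    time.sleep(0.1)  # simulated network latency
    return 200


def fetch_all(urls, fetch=fake_fetch, max_workers=5):
    """Run fetch(url) for every URL in a thread pool and return {url: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which URL each Future belongs to.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # as_completed yields each Future as soon as it finishes.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Store the error so one failed request does not stop the loop.
                results[url] = exc
    return results


if __name__ == "__main__":
    url_list = ['http://www.github.com', 'http://www.bing.com']
    start = time.time()
    print(fetch_all(url_list))
    print("consume time is:", time.time() - start)
```

    Because the waits overlap, the total time should be close to that of the slowest single request rather than the sum of all of them.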
    """concurrent.futures: thread pool"""
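    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urlparse
    import time
    import requests
    
    def task(url):
        response = requests.get(url)
        print(url, response.status_code)
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return {"url": url, "text": response.text}
        # any other status falls through and returns None
    
    def save_to_html(res, *args, **kwargs):
        # The callback receives a Future object, e.g.
        # <Future at 0x1ed4cf245c0 state=finished returned dict>;
        # result() unwraps the value returned by task().
        res = res.result()
        if res is None:    # task() returned nothing for a non-200 response
            return
        # Name the file after the host, e.g. "www.bing.com.html". Splitting the
        # whole URL on "." gives an invalid name when the URL contains a path.
        filename = urlparse(res['url']).netloc + ".html"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(res["text"])
        print(filename, "---> written successfully!")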
    
    def parse_html(res,*args,**kwargs):
        pass
    
    if __name__ == '__main__':
        start = time.time()
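        # Thread pool. If max_workers is omitted, the default is CPU count * 5
        # on Python 3.5-3.7 and min(32, CPU count + 4) on Python 3.8+.
        pool = ThreadPoolExecutor()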
        url_list = [
            'http://www.cnblogs.com/',
            'https://huaban.com/favorite/beauty/',
            'http://www.bing.com',
            'http://www.zhihu.com',
            'http://www.sina.com',
            'http://www.baidu.com',
            'http://www.autohome.com.cn',
        ]
        for url in url_list:
            v = pool.submit(task,url)
            v.add_done_callback(save_to_html)
            v.add_done_callback(parse_html)
    
        pool.shutdown(wait=True)
        print("consume time is:",time.time()-start)
    3-Multithreading + callback functions
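    A callback attached with `add_done_callback` always receives the finished Future, whether the task succeeded, raised, or returned nothing. The sketch below shows one defensive way to write such a callback. `task`, `run` and `on_done` here are hypothetical stand-ins that do no network access: URLs containing "bad" are assumed to fail, and a list takes the place of the files written to disk.

```python
from concurrent.futures import ThreadPoolExecutor


def task(url):
    """Hypothetical stand-in for the download step; no network access."""
    if "bad" in url:
        raise ConnectionError("cannot reach " + url)
    return {"url": url, "text": "<html></html>"}


def run(urls):
    """Submit every URL and sort the outcomes into (saved, failed) lists."""
    saved, failed = [], []

    def on_done(future):
        # Inspect the Future before touching its result.
        exc = future.exception()  # None if the task finished normally
        if exc is not None:
            failed.append(repr(exc))
            return
        res = future.result()
        if res is None:  # e.g. the task skipped a non-200 response
            failed.append("empty result")
            return
        saved.append(res["url"])  # a real callback would write the file here

    with ThreadPoolExecutor(max_workers=3) as pool:
        for url in urls:
            pool.submit(task, url).add_done_callback(on_done)
    # Leaving the with-block waits for all tasks, like pool.shutdown(wait=True).
    return saved, failed


if __name__ == "__main__":
    print(run(["http://www.bing.com", "http://bad.example"]))
```

    Checking `future.exception()` first keeps a single failed download from raising inside the callback, where the error would only be logged and easily missed.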
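    from concurrent.futures import ProcessPoolExecutor
    import requests
    
    
    def fetch_async(url):
        response = requests.get(url)
        return response
    
    
    if __name__ == '__main__':
        # The guard is required where worker processes are started with "spawn"
        # (Windows, and macOS by default), because they re-import this module.
        url_list = ['http://www.github.com', 'http://www.bing.com']
        pool = ProcessPoolExecutor(5)
        for url in url_list:
            pool.submit(fetch_async, url)
    
        pool.shutdown(wait=True)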
    4-Multiprocessing
    """concurrent.futures: process pool"""
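    from concurrent.futures import ProcessPoolExecutor
    from urllib.parse import urlparse
    import time
    import requests
    
    def task(url):
        response = requests.get(url)
        print(url, response.status_code)
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return {"url": url, "text": response.text}
        # any other status falls through and returns None
    
    def save_to_html(res, *args, **kwargs):
        # The callback runs in the parent process and receives a Future object, e.g.
        # <Future at 0x1ed4cf245c0 state=finished returned dict>;
        # result() unwraps the value returned by task().
        res = res.result()
        if res is None:    # task() returned nothing for a non-200 response
            return
        # Name the file after the host, e.g. "www.bing.com.html". Splitting the
        # whole URL on "." gives an invalid name when the URL contains a path.
        filename = urlparse(res['url']).netloc + ".html"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(res["text"])
        print(filename, "---> written successfully!")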
    
    def parse_html(res,*args,**kwargs):
        pass
    
    if __name__ == '__main__':
        start = time.time()
        # Process pool. If max_workers is omitted, it defaults to the number
        # of processors on the machine.
        pool = ProcessPoolExecutor()
        url_list = [
            'http://www.cnblogs.com/',
            'https://huaban.com/favorite/beauty/',
            'http://www.bing.com',
            'http://www.zhihu.com',
            'http://www.sina.com',
            'http://www.baidu.com',
            'http://www.autohome.com.cn',
        ]
        for url in url_list:
            v = pool.submit(task,url)
            v.add_done_callback(save_to_html)
            v.add_done_callback(parse_html)
    
        pool.shutdown(wait=True)
        print("consume time is:",time.time()-start)
    5-Multiprocessing + callback functions

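    All of the code above improves request performance. The drawback of multithreading and multiprocessing is that each thread or process sits idle while it is blocked on IO, which wastes it, so asynchronous IO is the preferred approach.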
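    As a rough illustration of the asynchronous style, the sketch below runs several simulated requests on a single thread with the standard-library `asyncio` module. `fake_fetch` and `crawl` are hypothetical names, and `asyncio.sleep` stands in for the network wait. A real crawler would need an asynchronous HTTP client (aiohttp, for example) in its place, because `requests.get` blocks and would stall the event loop.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for a non-blocking HTTP request (no real network)."""
    await asyncio.sleep(delay)  # hands control back to the event loop while "waiting"
    return url, 200


async def crawl(urls):
    """Start all requests at once on one thread and wait for them together."""
    # gather returns the results in the same order as the input URLs.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


if __name__ == "__main__":
    url_list = ['http://www.github.com', 'http://www.bing.com']
    start = time.time()
    for url, status in asyncio.run(crawl(url_list)):
        print(url, status)
    print("consume time is:", time.time() - start)
```

    No extra threads or processes are created here: while one coroutine waits, the event loop simply runs another, which is why nothing is left idle during IO.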

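    Supplement: coroutines + asynchronous IO (the references below also explain, with examples, concurrency, parallelism, synchronous, asynchronous, blocking and non-blocking)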

    Reference: https://blog.csdn.net/weixin_41207499/article/details/80657201

    Reference: https://www.cnblogs.com/ssyfj/p/9222342.html

    https://www.liaoxuefeng.com/wiki/1016959663602400/1017985577429536

  • Original article: https://www.cnblogs.com/XJT2018/p/11002526.html