• Ways to speed up a crawler


    Single thread + multi-task asynchronous coroutines

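    • Coroutine

      • A function defined with the async keyword (a "special function") does not run its body when called; the call returns a coroutine object instead
    • Task object

      • A task object is a higher-level wrapper around a coroutine object, so it still represents the special function
      • A task object must be registered with the event loop object
      • A callback can be bound to a task object; in a crawler this is where data parsing goes
    • Event loop

      • Think of it as a container that holds task objects
      • When the event loop object is started, the task objects stored in it run asynchronously
    • Inside a special function, do not use modules that lack async support, such as time or requests. No error is raised, but the code will not actually run asynchronously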

      • time.sleep -- asyncio.sleep
      • requests -- aiohttp
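
    The last point can be checked with a small timing experiment. The sketch below is not from the original post: blocking_job, cooperative_job, run_three and compare are hypothetical names, and the 0.2 second delay is an arbitrary stand-in for a network request.

```python
import asyncio
import time


async def blocking_job(delay):
    # time.sleep blocks the whole thread, so no other task can run meanwhile
    time.sleep(delay)


async def cooperative_job(delay):
    # asyncio.sleep hands control back to the event loop while waiting
    await asyncio.sleep(delay)


async def run_three(job, delay=0.2):
    """Run three copies of `job` concurrently and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(job(delay) for _ in range(3)))
    return time.perf_counter() - start


def compare():
    """Return (elapsed with time.sleep, elapsed with asyncio.sleep)."""
    blocking = asyncio.run(run_three(blocking_job))
    cooperative = asyncio.run(run_three(cooperative_job))
    return blocking, cooperative


if __name__ == '__main__':
    blocking, cooperative = compare()
    print(f'time.sleep:    {blocking:.2f}s')
    print(f'asyncio.sleep: {cooperative:.2f}s')
```

    The time.sleep variant should take roughly three times the delay because the jobs run one after another, while the asyncio.sleep variant should take roughly one delay.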
    import asyncio
    import time
    
    start_time = time.time()
    async def get_request(url):
        await asyncio.sleep(2)
        print(url, 'download finished!')
    
    urls = [
        'www.1.com',
        'www.2.com',
    ]
    
    task_lst = []  # list of task objects
    for url in urls:
        c = get_request(url)  # coroutine object
        task = asyncio.ensure_future(c)  # task object
        # task.add_done_callback(...)   # bind a callback
        task_lst.append(task)
    
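    loop = asyncio.get_event_loop()  # event loop object
    loop.run_until_complete(asyncio.wait(task_lst))  # register the tasks; asyncio.wait suspends until all of them finish
    print('total time:', time.time() - start_time)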
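
    Newer Python releases discourage fetching the loop by hand with asyncio.get_event_loop() when no loop is running. A minimal sketch of the same flow using asyncio.run, asyncio.create_task and asyncio.gather might look like the following; the main function, the returned strings and the shortened 0.2 second delay are assumptions made for illustration, not part of the original example.

```python
import asyncio
import time


async def get_request(url):
    await asyncio.sleep(0.2)  # stand-in for a real network request
    return f'{url} download finished!'


async def main(urls):
    # create_task wraps each coroutine in a Task and schedules it on the running loop
    tasks = [asyncio.create_task(get_request(url)) for url in urls]
    # gather waits for all tasks and returns their results in input order
    return await asyncio.gather(*tasks)


if __name__ == '__main__':
    start_time = time.time()
    # asyncio.run creates the event loop, runs main to completion, then closes the loop
    for line in asyncio.run(main(['www.1.com', 'www.2.com'])):
        print(line)
    print('total time:', time.time() - start_time)
```

    Here the event loop is created and closed by asyncio.run, so the task list only needs to exist inside main.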
    

    Thread pool + requests module

    # thread pool
    import time
    from multiprocessing.dummy import Pool
    
    start_time = time.time()
    url_list = [
        'www.1.com',
        'www.2.com',
        'www.3.com',
    ]
    def get_request(url):
        print('downloading...', url)
        time.sleep(2)
        print('download finished!', url)
    
    pool = Pool(3)
    pool.map(get_request,url_list)
    print('total time:', time.time() - start_time)
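
    The same idea can also be written with the standard library's concurrent.futures module instead of multiprocessing.dummy. This is an alternative sketch, not the author's code: download_all is a hypothetical helper, and the 0.2 second sleep stands in for a blocking download.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_request(url):
    time.sleep(0.2)  # stand-in for a blocking download
    return f'download finished! {url}'


def download_all(url_list, workers=3):
    # The with-block shuts the pool down once every submitted call has finished
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map runs get_request in worker threads and yields results in input order
        return list(pool.map(get_request, url_list))


if __name__ == '__main__':
    start_time = time.time()
    for line in download_all(['www.1.com', 'www.2.com', 'www.3.com']):
        print(line)
    print('total time:', time.time() - start_time)
```

    With three workers and three URLs, the total time should be close to a single delay rather than three.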
    

    Two ways to speed up a crawler

    Start a Flask server to test against

    from flask import Flask
    import time
    
    app = Flask(__name__)
    
    @app.route('/bobo')
    def index_bobo():
        time.sleep(2)
        return 'hello bobo!'
    
    @app.route('/jay')
    def index_jay():
        time.sleep(2)
        return 'hello jay!'
    
    @app.route('/tom')
    def index_tom():
        time.sleep(2)
        return 'hello tom!'
    
    if __name__ == '__main__':
        app.run(threaded=True)
    

    aiohttp module + single-thread multi-task async coroutines

    import asyncio
    import aiohttp
    import requests
    import time
    
    start = time.time()
    async def get_page(url):
        # page_text = requests.get(url=url).text
        # print(page_text)
        # return page_text
        async with aiohttp.ClientSession() as s:  # create a session object
            async with await s.get(url=url) as response:
                page_text = await response.text()
                print(page_text)
        return page_text
    
    urls = [
        'http://127.0.0.1:5000/bobo',
        'http://127.0.0.1:5000/jay',
        'http://127.0.0.1:5000/tom',
    ]
    tasks = []
    for url in urls:
        c = get_page(url)
        task = asyncio.ensure_future(c)
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    
    end = time.time()
    print(end-start)
    
    # Runs asynchronously!
    # hello tom!
    # hello bobo!
    # hello jay!
    # 2.0311079025268555
    
    '''
    Single thread + multi-task async coroutines with the aiohttp module,
    parsing the data with XPath
    '''
    import aiohttp
    import asyncio
    from lxml import etree
    import time
    
    start = time.time()
    # special function: send the request and capture the data
    # note the async with / await keywords
    async def get_request(url):
        async with aiohttp.ClientSession() as s:
            async with await s.get(url=url) as response:
                page_text = await response.text()
                return page_text        # return the page source
    
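    # callback: parse the data
    def parse(task):
        page_text = task.result()  # result() returns what get_request returned
        tree = etree.HTML(page_text)
        # This XPath expects a <ul> directly under <body>; the plain-text Flask routes above contain none, so it matches nothing there
        msg = tree.xpath('/html/body/ul//text()')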
        print(msg)
    
    urls = [
        'http://127.0.0.1:5000/bobo',
        'http://127.0.0.1:5000/jay',
        'http://127.0.0.1:5000/tom',
    ]
    tasks = []
    for url in urls:
        c = get_request(url)
        task = asyncio.ensure_future(c)
        task.add_done_callback(parse)  # bind the callback
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    
    end = time.time()
    print(end-start)
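
    To see the callback mechanism without a server or aiohttp, here is a minimal standard-library sketch. Everything in it is assumed for illustration: get_request fakes the response text, parse just extracts the name from it, and the parsed list and main function are hypothetical helpers.

```python
import asyncio

parsed = []  # filled in by the callback


async def get_request(path):
    await asyncio.sleep(0.1)  # stand-in for the aiohttp request
    return f"hello {path.strip('/')}!"  # made-up response text


def parse(task):
    # A done-callback receives the finished task; result() returns the coroutine's return value
    page_text = task.result()
    parsed.append(page_text.split()[-1].rstrip('!'))  # placeholder for real parsing


async def main(paths):
    tasks = []
    for path in paths:
        task = asyncio.create_task(get_request(path))
        task.add_done_callback(parse)  # parse runs once this task has finished
        tasks.append(task)
    # gather also hands back the raw results, in input order
    return await asyncio.gather(*tasks)


if __name__ == '__main__':
    pages = asyncio.run(main(['/bobo', '/jay', '/tom']))
    print(pages)
    print(parsed)
```

    The callback receives the task object itself rather than the page text, which is why parse has to call task.result().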
    

    requests module + thread pool

    import time
    import requests
    from multiprocessing.dummy import Pool
    
    start = time.time()
    urls = [
        'http://127.0.0.1:5000/bobo',
        'http://127.0.0.1:5000/jay',
        'http://127.0.0.1:5000/tom',
    ]
    def get_request(url):
        page_text = requests.get(url=url).text
        print(page_text)
        return page_text
    
    pool = Pool(3)
    pool.map(get_request, urls)
    end = time.time()
    print('total time:', end-start)
    
    # The requests run concurrently
    # hello jay!
    # hello bobo!
    # hello tom!
    # total time: 2.0467123985290527
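
    The example above discards what pool.map returns. A small sketch of collecting the returned pages for later parsing is shown below; fetch_all is a hypothetical helper, and the sleep and returned string stand in for requests.get(url=url).text so the snippet runs without a server.

```python
import time
from multiprocessing.dummy import Pool


def get_request(url):
    time.sleep(0.1)  # stand-in for requests.get(url=url).text
    return f'page for {url}'


def fetch_all(urls, workers=3):
    # Using the pool as a context manager releases its threads when the block exits
    with Pool(workers) as pool:
        # map blocks until every call returns and keeps the results in input order
        return pool.map(get_request, urls)


if __name__ == '__main__':
    pages = fetch_all(['/bobo', '/jay', '/tom'])
    for page_text in pages:
        print(page_text)  # parsing could happen here, one page at a time
```

    Although the threads may finish in any order, the list returned by map follows the order of the input URLs.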
    

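    Summary

    • Two ways to speed up a crawler have been covered so far:
      • aiohttp module + single-thread multi-task async coroutines
      • requests module + thread pool
    • Three modules for sending requests have come up:
      • requests
      • urllib
      • aiohttp
    • Also had a first look at starting a server with Flask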
  • Original article: https://www.cnblogs.com/straightup/p/13676391.html