• Advanced Python: A Look at Coroutines


    Starting with a crawler

      In the Python 2 era, coroutines were built on generators; Python 3.7 introduced a new approach based on asyncio and async / await. Let's start with a simple crawler. Its crawl_page function just sleeps for a few seconds, and the sleep time is determined by the number at the end of the url.
    import time
    
    def crawl_page(url):
        print('crawling {}'.format(url))
        sleep_time = int(url.split('_')[-1])
        time.sleep(sleep_time)
        print('OK {}'.format(url))
    
    def main(urls):
        for url in urls:
            crawl_page(url)
    
    %time main(['url_1', 'url_2', 'url_3', 'url_4'])
    
    ########## Output ##########
    
    crawling url_1
    OK url_1
    crawling url_2
    OK url_2
    crawling url_3
    OK url_3
    crawling url_4
    OK url_4
    Wall time: 10 s
      Since this runs synchronously, it takes 10 seconds.
      Let's try implementing it with coroutines:
    import asyncio
    import nest_asyncio
    nest_asyncio.apply()
    
    async def crawl_page(url):
        print('crawling {}'.format(url))
        sleep_time = int(url.split('_')[-1])
        await asyncio.sleep(sleep_time)
        print('OK {}'.format(url))
    
    async def main(urls):
        for url in urls:
            await crawl_page(url)
    
    %time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))
    
    ########## Output ##########
    
    crawling url_1
    OK url_1
    crawling url_2
    OK url_2
    crawling url_3
    OK url_3
    crawling url_4
    OK url_4
    Wall time: 10 s

      First, look at import asyncio: this library provides most of the tools we need to work with coroutines. The async keyword declares an asynchronous function, so crawl_page and main here both become async functions, and an async function is called with await. Awaiting behaves just like a normal Python call: the program blocks at that point, enters the awaited coroutine, and resumes only after it returns, which is exactly what "await" literally means. In this code, await asyncio.sleep(sleep_time) pauses for the given number of seconds, and await crawl_page(url) runs the crawl_page() coroutine.
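
      A quick aside (a minimal sketch of my own, not from the original article): calling an async function directly does not run it; it only builds a coroutine object, which does nothing until it is awaited or driven by the event loop. The greet function below is just an illustration.
    import asyncio
    
    async def greet():
        return 'hello'
    
    coro = greet()            # calling an async function only creates a coroutine object
    print(coro)               # <coroutine object greet at 0x...>; nothing has executed yet
    print(asyncio.run(coro))  # the event loop drives the coroutine and prints 'hello'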

      The code above still executes sequentially, so it again takes 10 seconds.

      To make it run concurrently, we can use the concept of a Task.

    import asyncio
    import nest_asyncio
    nest_asyncio.apply()
    
    async def crawl_page(url):
        print('crawling {}'.format(url))
        sleep_time = int(url.split('_')[-1])
        await asyncio.sleep(sleep_time)
        print('OK {}'.format(url))
    
    async def main(urls):
        tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
        for task in tasks:
            await task
    
    %time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))
    
    ########## Output ##########
    
    crawling url_1
    crawling url_2
    crawling url_3
    crawling url_4
    OK url_1
    OK url_2
    OK url_3
    OK url_4
    Wall time: 4.01 s

      Once create_task creates a task, the event loop schedules it to run right away. Without await task, the code does not block on it; here we want to wait until every task has finished before moving on, hence for task in tasks: await task. The total time now drops to about 4 seconds. The for loop can also be replaced with await asyncio.gather(*tasks):

    import asyncio
    import nest_asyncio
    nest_asyncio.apply()
    
    async def crawl_page(url):
        print('crawling {}'.format(url))
        sleep_time = int(url.split('_')[-1])
        await asyncio.sleep(sleep_time)
        print('OK {}'.format(url))
    
    async def main(urls):
        tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
        await asyncio.gather(*tasks)
    
    %time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))
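
      A small aside of my own (not in the original): asyncio.gather also accepts bare coroutines and wraps them into tasks itself, so the explicit create_task list can be collapsed into a single line, assuming the same crawl_page as above:
    async def main(urls):
        # gather wraps each coroutine in a task and runs them concurrently
        await asyncio.gather(*(crawl_page(url) for url in urls))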

    How should errors raised while coroutines are running be handled?

      Look at the code:

    import asyncio
    import nest_asyncio
    nest_asyncio.apply()
    
    async def worker_1():
        await asyncio.sleep(1)
        return 1
    
    async def worker_2():
        await asyncio.sleep(2)
        return 2 / 0
    
    async def worker_3():
        await asyncio.sleep(3)
        print('over worker_3')
        return 3
    
    async def main():
        task_1 = asyncio.create_task(worker_1())
        task_2 = asyncio.create_task(worker_2())
        task_3 = asyncio.create_task(worker_3())
    
        await asyncio.sleep(2)
        task_3.cancel()
    
        res = await asyncio.gather(task_1, task_2, task_3, return_exceptions=True)
        print(res)
    
    %time asyncio.run(main())
    
    ########## Output ##########
    
    # [1, ZeroDivisionError('division by zero'), CancelledError()]
    # Wall time: 2 s

      Note the return_exceptions parameter. In the code above, worker_2 raises a ZeroDivisionError and has no try...except to catch it, so normally the exception would propagate and abort the program. Because return_exceptions=True is set, the exception is returned in the result list instead and the other tasks are unaffected. The CancelledError() shows that task_3 was cancelled by cancel().
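
      For comparison, here is a minimal sketch of my own: without return_exceptions=True, gather re-raises the first exception into the awaiting coroutine, where an ordinary try...except can catch it (the other tasks are not cancelled and keep running):
    async def main():
        task_1 = asyncio.create_task(worker_1())
        task_2 = asyncio.create_task(worker_2())
        try:
            await asyncio.gather(task_1, task_2)
        except ZeroDivisionError as e:
            # worker_2's exception propagates out of gather; task_1 still ran to completion
            print('caught:', e)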

    Implementing the producer-consumer model with coroutines

      show the code:

    import asyncio
    import random
    
    async def consumer(queue, id):
        while True:
            val = await queue.get()
            print('{} get a val: {}'.format(id, val))
            await asyncio.sleep(1)
    
    async def producer(queue, id):
        for i in range(5):
            val = random.randint(1, 10)
            await queue.put(val)
            print('{} put a val: {}'.format(id, val))
            await asyncio.sleep(2)
    
    async def main():
        queue = asyncio.Queue()
    
        consumer_1 = asyncio.create_task(consumer(queue, 'consumer_1'))
        consumer_2 = asyncio.create_task(consumer(queue, 'consumer_2'))
    
        producer_1 = asyncio.create_task(producer(queue, 'producer_1'))
        producer_2 = asyncio.create_task(producer(queue, 'producer_2'))
        
        await asyncio.sleep(10)
        consumer_1.cancel()
        consumer_2.cancel()
        
        await asyncio.gather(consumer_1, consumer_2, producer_1, producer_2, return_exceptions=True)
    
    %time asyncio.run(main())
    
    ########## Output ##########
    
    # producer_1 put a val: 5
    # producer_2 put a val: 3
    # consumer_1 get a val: 5
    # consumer_2 get a val: 3
    # producer_1 put a val: 1
    # producer_2 put a val: 3
    # consumer_2 get a val: 1
    # consumer_1 get a val: 3
    # producer_1 put a val: 6
    # producer_2 put a val: 10
    # consumer_1 get a val: 6
    # consumer_2 get a val: 10
    # producer_1 put a val: 4
    # producer_2 put a val: 5
    # consumer_2 get a val: 4
    # consumer_1 get a val: 5
    # producer_1 put a val: 2
    # producer_2 put a val: 8
    # consumer_1 get a val: 2
    # consumer_2 get a val: 8
    # Wall time: 10 s
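
      As a variation (my own sketch, not from the original article): instead of sleeping a fixed 10 seconds and then cancelling, you can let the producers finish, wait for the queue to drain with queue.join(), and only then cancel the consumers. This requires each consumer to call queue.task_done() after handling an item; the producer coroutine is unchanged from the snippet above.
    async def consumer(queue, id):
        while True:
            val = await queue.get()
            print('{} get a val: {}'.format(id, val))
            await asyncio.sleep(1)
            queue.task_done()  # tell the queue this item has been fully processed
    
    async def main():
        queue = asyncio.Queue()
        consumers = [asyncio.create_task(consumer(queue, 'consumer_{}'.format(i))) for i in (1, 2)]
    
        # wait until both producers have put all of their items
        await asyncio.gather(producer(queue, 'producer_1'), producer(queue, 'producer_2'))
        await queue.join()  # blocks until every queued item has been marked task_done()
    
        for c in consumers:
            c.cancel()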

    In practice: crawling Douban's upcoming movie listings

    Synchronous version
    import requests
    from bs4 import BeautifulSoup
    
    def main():
        url = "https://movie.douban.com/cinema/later/beijing/"
        init_page = requests.get(url).content
        init_soup = BeautifulSoup(init_page, 'lxml')
    
        all_movies = init_soup.find('div', id="showing-soon")
        for each_movie in all_movies.find_all('div', class_="item"):
            all_a_tag = each_movie.find_all('a')
            all_li_tag = each_movie.find_all('li')
    
            movie_name = all_a_tag[1].text
            url_to_fetch = all_a_tag[1]['href']
            movie_date = all_li_tag[0].text
    
            response_item = requests.get(url_to_fetch).content
            soup_item = BeautifulSoup(response_item, 'lxml')
            img_tag = soup_item.find('img')
    
            print('{} {} {}'.format(movie_name, movie_date, img_tag['src']))
    
    %time main()
    
    ########## Output ##########
    九龙不败 07月02日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560169035.jpg
    善良的天使 07月02日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2558266159.jpg
    别岁 07月02日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558138041.jpg
    上海的女儿 07月02日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555602094.jpg
    爱宠大机密2 07月05日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555923582.jpg
    扫毒2天地对决 07月05日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560684734.jpg
    猪猪侠·不可思议的世界 07月05日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560664101.jpg
    他她他她 07月05日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2559292102.jpg
    狮子王 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2559658750.jpg
    命运之夜——天之杯II :迷失之蝶 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560749451.jpg
    宝莱坞机器人2.0:重生归来 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558657891.jpg
    素人特工 07月12日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2560447448.jpg
    机动战士高达NT 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558661806.jpg
    舞动吧!少年 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555119986.jpg
    嘿,蠢贼 07月16日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2560832388.jpg
    银河补习班 07月18日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560954373.jpg
    小小的愿望 07月18日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2560659129.jpg
    匠心 07月18日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2553935771.jpg
    猪八戒·传说 07月19日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2559590242.jpg
    刀背藏身 07月19日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2557644589.jpg
    为家而战 07月19日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2559337905.jpg
    Wall time: 22.1 s
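
      One caveat from my side (not in the original post): Douban now tends to reject requests that use the default requests User-Agent, so if the page comes back empty it usually helps to send a browser-like header; the exact string below is only illustrative.
    header = {'User-Agent': 'Mozilla/5.0'}
    init_page = requests.get(url, headers=header).content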
    Asynchronous version
    import asyncio
    import aiohttp
    
    from bs4 import BeautifulSoup
    
    header = {'User-Agent': 'Mozilla/5.0'}  # fetch_content below expects a header dict; this browser-like UA is illustrative
    
    async def fetch_content(url):
        async with aiohttp.ClientSession(
            headers=header, connector=aiohttp.TCPConnector(ssl=False)
        ) as session:
            async with session.get(url) as response:
                return await response.text()
    
    async def main():
        url = "https://movie.douban.com/cinema/later/beijing/"
        init_page = await fetch_content(url)
        init_soup = BeautifulSoup(init_page, 'lxml')
    
        movie_names, urls_to_fetch, movie_dates = [], [], []
    
        all_movies = init_soup.find('div', id="showing-soon")
        for each_movie in all_movies.find_all('div', class_="item"):
            all_a_tag = each_movie.find_all('a')
            all_li_tag = each_movie.find_all('li')
    
            movie_names.append(all_a_tag[1].text)
            urls_to_fetch.append(all_a_tag[1]['href'])
            movie_dates.append(all_li_tag[0].text)
    
        tasks = [fetch_content(url) for url in urls_to_fetch]
        pages = await asyncio.gather(*tasks)
    
        for movie_name, movie_date, page in zip(movie_names, movie_dates, pages):
            soup_item = BeautifulSoup(page, 'lxml')
            img_tag = soup_item.find('img')
    
            print('{} {} {}'.format(movie_name, movie_date, img_tag['src']))
    
    %time asyncio.run(main())
    
    ########## Output ##########
    
    九龙不败 07月02日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560169035.jpg
    善良的天使 07月02日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2558266159.jpg
    别岁 07月02日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558138041.jpg
    上海的女儿 07月02日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555602094.jpg
    爱宠大机密2 07月05日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555923582.jpg
    扫毒2天地对决 07月05日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560684734.jpg
    猪猪侠·不可思议的世界 07月05日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560664101.jpg
    他她他她 07月05日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2559292102.jpg
    狮子王 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2559658750.jpg
    命运之夜——天之杯II :迷失之蝶 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560749451.jpg
    宝莱坞机器人2.0:重生归来 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558657891.jpg
    素人特工 07月12日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2560447448.jpg
    机动战士高达NT 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558661806.jpg
    舞动吧!少年 07月12日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555119986.jpg
    嘿,蠢贼 07月16日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2560832388.jpg
    银河补习班 07月18日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2560954373.jpg
    小小的愿望 07月18日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2560659129.jpg
    匠心 07月18日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2553935771.jpg
    猪八戒·传说 07月19日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2559590242.jpg
    刀背藏身 07月19日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2557644589.jpg
    为家而战 07月19日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2559337905.jpg
    Wall time: 5.82 s
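
      One last practical note of my own (not from the original): a real crawler usually caps how many requests are in flight at once. Below is a minimal sketch using asyncio.Semaphore, reusing fetch_content from above; fetch_limited, main_limited and the limit of 5 are all illustrative names and values.
    async def fetch_limited(sem, url):
        async with sem:  # at most `limit` fetches run concurrently
            return await fetch_content(url)
    
    async def main_limited(urls_to_fetch, limit=5):
        sem = asyncio.Semaphore(limit)
        return await asyncio.gather(*(fetch_limited(sem, url) for url in urls_to_fetch))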

    References

      The GeekTime column《Python核心技术与实战》(Python Core Technology and Practice)
