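Python producer-consumer

Worth pulling out for a second look.

Crawling (production) and parsing (consumption) of web pages run side by side in multiple processes. It can serve as a starting point for extending future projects.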

from bs4 import BeautifulSoup
import requests
import time
import multiprocessing as mp
import re
from multiprocessing import Queue
# from multiprocessing import JoinableQueue as Queue

base_url = 'https://morvanzhou.github.io/'

def crawl(url):
    html = requests.get(url).text
    # Simulate 0.1 s of request latency
    time.sleep(0.1)
    return html

def parse(html):
    soup = BeautifulSoup(html,'lxml')
    all_anchors = soup.find_all('a',{'href':re.compile(r'^/.+?/$')})
#     title = soup.find('meta',{'property':'og:title'})
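    # Each href already starts with '/', so drop the trailing '/' of base_url to avoid '//'
    page_urls = {anchor.get_text().strip(): base_url.rstrip('/') + anchor['href'] for anchor in all_anchors}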
    main_url = soup.find('meta',{'property':'og:url'})['content']
    return main_url,page_urls
    
# print(html)

def main():
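    # unseen could hold more than one URL
    unseen = (base_url,)
    seen = ()

    # Use the producer-consumer pattern so that HTML crawling and HTML parsing can run side by side
    html_queue = Queue()
    # Start the process pools
    # Producer: crawls HTML
    crawl_pool = mp.Pool(2)
    # Consumer: parses HTML
    parse_pool = mp.Pool(2)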
    
    
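    # Submit every crawl task first so the pool can fetch pages in parallel;
    # calling .get() right after apply_async would block on each task in turn
    crawl_jobs = [crawl_pool.apply_async(crawl, args=(url,)) for url in unseen]
    for job in crawl_jobs:
        html_queue.put(job.get())
    # All pages are crawled: send the "production finished" signal,
    # otherwise the consumer loop below would block forever on get()
    html_queue.put(None)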
    
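    parse_jobs = []

    # Consume the crawled HTML and hand each page to the parser pool
    while True:
        html = html_queue.get()
        if html is None:
            # None is the "production finished" signal
            # html_queue.task_done()
            break
        parse_jobs.append(parse_pool.apply_async(parse, args=(html,)))

    # Collect results only after every page is submitted, so parsing runs in parallel
    results = [job.get() for job in parse_jobs]
    print(results)

    # Shut both pools down cleanly
    for pool in (crawl_pool, parse_pool):
        pool.close()
        pool.join()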
    
if __name__ == '__main__':
    main()
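
The script above feeds the queue from the main process. As a rough sketch of the same idea with dedicated producer and consumer processes, something like the following could work. It is not the author's code: `fake_crawl`, `fake_parse`, `producer`, `consumer`, and `run` are hypothetical names, and the network request and BeautifulSoup parsing are replaced by stand-ins so the example runs offline. The `None` sentinel mirrors the "production finished" signal used in the script.

```python
import multiprocessing as mp
import time

SENTINEL = None  # "no more work" marker, same role as the None signal in the script


def fake_crawl(url):
    """Hypothetical stand-in for requests.get(url).text."""
    time.sleep(0.1)  # pretend the request takes 0.1 s
    return "<html><title>%s</title></html>" % url


def fake_parse(html):
    """Hypothetical stand-in for the BeautifulSoup parsing step."""
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>")
    return html[start:end]


def producer(urls, html_queue, n_consumers):
    """Push the HTML of every URL, then one sentinel per consumer."""
    for url in urls:
        html_queue.put(fake_crawl(url))
    for _ in range(n_consumers):
        html_queue.put(SENTINEL)


def consumer(html_queue, result_queue):
    """Parse pages as they arrive and stop at the first sentinel."""
    while True:
        html = html_queue.get()
        if html is SENTINEL:
            break
        result_queue.put(fake_parse(html))
    result_queue.put(SENTINEL)  # tell the parent this consumer has finished


def run(urls, n_consumers=2):
    html_queue, result_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=producer, args=(urls, html_queue, n_consumers))]
    workers += [mp.Process(target=consumer, args=(html_queue, result_queue))
                for _ in range(n_consumers)]
    for worker in workers:
        worker.start()
    # Drain the result queue before join() so no child blocks on unsent data
    results, finished = [], 0
    while finished < n_consumers:
        item = result_queue.get()
        if item is SENTINEL:
            finished += 1
        else:
            results.append(item)
    for worker in workers:
        worker.join()
    return sorted(results)  # arrival order across consumers is not deterministic


if __name__ == "__main__":
    print(run(["/a/", "/b/", "/c/"]))
```

Here parsing can start as soon as the first page lands in the queue, while the producer is still fetching the rest. One sentinel is sent per consumer because each consumer stops at the first sentinel it reads.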

Original post: https://www.cnblogs.com/Frank99/p/10396697.html
