• Scrapy middleware


    Downloader Middleware

    Downloader middleware is a framework of hooks into Scrapy's request/response processing: a lightweight, low-level system for globally modifying Scrapy requests and responses.

    That official description sounds convoluted, but in plain terms it means: swapping proxy IPs, swapping cookies, swapping the User-Agent, and automatic retries.

    1 Write it in middlewares.py (the class name can be anything)

    2 Activate the downloader middleware

    To make the configuration take effect, enable it in settings.py (the number is the priority; lower numbers sit closer to the engine):
    SPIDER_MIDDLEWARES = {
        'cnblogs_crawl.middlewares.CnblogsCrawlSpiderMiddleware': 543,
    }
    DOWNLOADER_MIDDLEWARES = {
        'cnblogs_crawl.middlewares.CnblogsCrawlDownloaderMiddleware': 543,
    }


    Downloader middleware

    What downloader middleware is for:
    1. Inside process_request, implement a custom download instead of using Scrapy's downloader
    2. Post-process requests, for example:
        setting request headers
        setting cookies
        adding a proxy
    Scrapy's built-in proxy component:
    from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
    from urllib.request import getproxies
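
    Scrapy's built-in HttpProxyMiddleware picks its proxies up from the standard environment variables, which is what urllib.request.getproxies() reads. A quick stdlib-only check of what it would see (the proxy address below is a made-up example):

```python
import os
from urllib.request import getproxies

# HttpProxyMiddleware relies on the conventional *_proxy environment variables
os.environ["http_proxy"] = "http://127.0.0.1:8888"
os.environ["https_proxy"] = "http://127.0.0.1:8888"

# getproxies() scans the environment and returns a scheme -> proxy-URL mapping
print(getproxies())
```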

    # Called via every downloader middleware's process_request when a request is about to be downloaded
    - process_request: (runs on the way OUT, with the request)
    # return None: keep processing this request, move on to the next middleware
    # return Response: this request is done; hand the Response to the engine (you can fetch it yourself and wrap it as a Response)
    # return Request: hand the Request back to the engine, which reschedules it
    # raise an exception: process_exception is executed


    # Called when the response comes back from the downloader, on its way to the spider
    - process_response: (runs on the way BACK, with the response)
    # return a Response object: keep processing this Response through the remaining middlewares
    # return a Request object: hand it back to the engine for rescheduling
    # or raise IgnoreRequest: process_exception is executed


    # Called when the download handler or a process_request() (downloader middleware) raises an exception
    - process_exception: (runs when an EXCEPTION occurs)
    # return None: continue processing this exception
    # return a Response object: stops the process_exception() chain; the Response goes to the engine (then the spider)
    # return a Request object: stops the process_exception() chain; the Request goes to the engine (rescheduled)
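
    The process_request rules above can be sketched as a plain-Python simulation. The Request and Response classes here are stand-ins, not Scrapy's, and a returned Request is retried inline for simplicity, whereas real Scrapy hands it back to the scheduler:

```python
class Request:                 # stand-in, not scrapy.Request
    def __init__(self, url):
        self.url = url

class Response:                # stand-in, not scrapy.Response
    def __init__(self, url, body=b""):
        self.url, self.body = url, body

def run_process_request_chain(request, middlewares, download):
    """Walk the process_request hooks the way the engine does (simplified)."""
    for mw in middlewares:
        result = mw(request)
        if result is None:
            continue                      # keep going to the next middleware
        if isinstance(result, Response):
            return result                 # short-circuit: no download happens
        if isinstance(result, Request):
            # real Scrapy reschedules this; here we just rerun the chain
            return run_process_request_chain(result, middlewares, download)
    return download(request)              # nobody intervened: actually download

mw1 = lambda req: None                            # lets the request pass through
mw2 = lambda req: Response(req.url, b"cached")    # fabricates its own Response
resp = run_process_request_chain(
    Request("http://example.com"), [mw1, mw2],
    download=lambda req: Response(req.url, b"network"))
print(resp.body)  # b'cached' -- mw2 short-circuited the real download
```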
    

    Adding a proxy and changing the UA in middleware

    class TttDownloaderMiddleware(object):

        def get_proxy(self):
            # assumes a local proxy-pool service is listening on port 5010
            import requests
            res = requests.get('http://0.0.0.0:5010/get').json()['proxy']
            print(res)
            return res

        def process_request(self, request, spider):
            # 1 Add cookies (request.cookies holds the cookies sent to the site)
            # request.cookies = {'name': 'xxx'}

            # 2 Add a proxy
            request.meta['proxy'] = self.get_proxy()

            # 3 Change the UA
            from fake_useragent import UserAgent
            ua = UserAgent(verify_ssl=False)
            request.headers['User-Agent'] = ua.random

            return None

        def process_response(self, request, response, spider):
            return response

        def process_exception(self, request, exception, spider):
            pass
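
    fake_useragent downloads its UA database over the network and can fail at startup. A fallback (my own sketch, not from the original post) is to rotate over a small hard-coded pool:

```python
import random

# A small hand-picked pool of real-world UA strings; extend as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_ua():
    """Pick a random User-Agent string from the static pool."""
    return random.choice(USER_AGENTS)

# Inside process_request you would then do:
# request.headers['User-Agent'] = random_ua()
print(random_ua())
```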
    

    Spider Middleware

    Spider middleware methods

    from scrapy import signals

    class SpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # fired when this spider starts running
            return s

        def spider_opened(self, spider):
            # spider.logger.info('I am spider 1 sent by egon: %s' % spider.name)
            print('I am spider 1 sent by egon: %s' % spider.name)

        def process_start_requests(self, start_requests, spider):
            # Must return only requests (not items).
            print('start_requests1')
            for r in start_requests:
                yield r

        def process_spider_input(self, response, spider):
            # Called for each response passing through the spider middleware on its way into the spider

            # Return value: should return None or raise an exception.
            # 1. None: continue with the other middlewares' process_spider_input
            # 2. Raise an exception:
            #    once an exception is raised, no further process_spider_input methods run,
            #    and the errback bound to the request is triggered;
            #    the errback's return value is passed back through the middlewares' process_spider_output in reverse order;
            #    if no errback is found, the middlewares' process_spider_exception methods run in reverse order

            print("input1")
            return None

        def process_spider_output(self, response, result, spider):
            # Must return an iterable of Request, dict or Item objects.
            print('output1')

            # Yielding multiple times is equivalent to returning an iterable once.
            # Be careful with generators (a function containing yield returns a generator and does
            # not run immediately); the generator form can easily mislead you about middleware execution order.
            # for i in result:
            #     yield i
            return result

        def process_spider_exception(self, response, exception, spider):
            # Should return either None or an iterable of Response, dict
            # or Item objects.
            print('exception1')
    

    When the spider starts and when the initial requests are generated

    from scrapy import signals

    class SpiderMiddleware1(object):
        @classmethod
        def from_crawler(cls, crawler):
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # fired when this spider starts running
            return s

        def spider_opened(self, spider):
            print('I am spider 1: %s' % spider.name)

        def process_start_requests(self, start_requests, spider):
            # Must return only requests (not items).
            print('start_requests1')
            for r in start_requests:
                yield r


    class SpiderMiddleware2(object):
        @classmethod
        def from_crawler(cls, crawler):
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # fired when this spider starts running
            return s

        def spider_opened(self, spider):
            print('I am spider 2: %s' % spider.name)

        def process_start_requests(self, start_requests, spider):
            print('start_requests2')
            for r in start_requests:
                yield r


    # Step 3: analyze the output
    # 1. As soon as the crawler starts, this runs immediately:

    I am spider 1: baidu
    I am spider 2: baidu

    # 2. Then an initial request is generated and passes through spider middlewares 1 and 2 in turn:
    start_requests1
    start_requests2
    

    When process_spider_input returns None

    from scrapy import signals

    class SpiderMiddleware1(object):

        def process_spider_input(self, response, spider):
            print("input1")

        def process_spider_output(self, response, result, spider):
            print('output1')
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception1')


    class SpiderMiddleware2(object):

        def process_spider_input(self, response, spider):
            print("input2")
            return None

        def process_spider_output(self, response, result, spider):
            print('output2')
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception2')


    # Step 3: analyze the output

    # 1. When a response comes back, it passes through spider middlewares 1 then 2:
    input1
    input2

    # 2. After the spider finishes processing, it passes back through spider middlewares 2 then 1:
    output2
    output1
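
    The onion-style ordering above can be sketched without Scrapy. These are hypothetical stand-in hooks, not Scrapy's engine:

```python
def pass_through_spider_middlewares(middlewares, result):
    """middlewares is a list of (name, output_hook) pairs in ascending priority order."""
    order = []
    # process_spider_input runs in ascending order (1 -> 2) on the way in
    for name, _ in middlewares:
        order.append(f"input{name}")
    # process_spider_output runs in descending order (2 -> 1) on the way out
    for name, out in reversed(middlewares):
        order.append(f"output{name}")
        result = out(result)
    return order, result

middlewares = [("1", lambda r: r), ("2", lambda r: r)]
order, _ = pass_through_spider_middlewares(middlewares, [])
print(order)  # ['input1', 'input2', 'output2', 'output1']
```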
    

    When process_spider_input raises an exception

    from scrapy import signals

    class SpiderMiddleware1(object):

        def process_spider_input(self, response, spider):
            print("input1")

        def process_spider_output(self, response, result, spider):
            print('output1')
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception1')


    class SpiderMiddleware2(object):

        def process_spider_input(self, response, spider):
            print("input2")
            raise TypeError('input2 raised an exception')

        def process_spider_output(self, response, result, spider):
            print('output2')
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception2')


    class SpiderMiddleware3(object):

        def process_spider_input(self, response, spider):
            print("input3")
            return None

        def process_spider_output(self, response, result, spider):
            print('output3')
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception3')


    # Output
    input1
    input2
    exception3
    exception2
    exception1

    # Analysis:
    # 1. The response passes through middleware 1's process_spider_input, which returns None, so it continues to middleware 2's process_spider_input
    # 2. Middleware 2's process_spider_input raises an exception, so the remaining process_spider_input methods are skipped and the exception is handed to the request's errback in the Spider
    # 3. No errback is found, so the response is handled by neither the Spider's callback nor an errback; since the Spider did nothing with it, the process_spider_exception methods run in reverse order
    # 4. A process_spider_exception that returns None declines responsibility, leaving the exception unhandled and passing it to the next process_spider_exception; if all of them return None, the exception is finally raised by the Engine

    Specifying an errback

    # Step 1: spider.py
    import scrapy

    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.baidu.com']
        start_urls = ['http://www.baidu.com/']

        def start_requests(self):
            yield scrapy.Request(url='http://www.baidu.com/',
                                 callback=self.parse,
                                 errback=self.parse_err,
                                 )

        def parse(self, response):
            pass

        def parse_err(self, res):
            # res is the failure info; the exception has now been handled by this function,
            # so it is no longer propagated, and process_spider_output starts to run
            return [1, 2, 3, 4, 5]  # extract the useful data from the failure and return it as an
                                    # iterable, waiting to be picked up by process_spider_output


    # Step 2: uncomment in settings.py
    '''
    SPIDER_MIDDLEWARES = {
       'Baidu.middlewares.SpiderMiddleware1': 200,
       'Baidu.middlewares.SpiderMiddleware2': 300,
       'Baidu.middlewares.SpiderMiddleware3': 400,
    }
    '''

    # Step 3: middlewares.py

    from scrapy import signals

    class SpiderMiddleware1(object):

        def process_spider_input(self, response, spider):
            print("input1")

        def process_spider_output(self, response, result, spider):
            print('output1', list(result))
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception1')


    class SpiderMiddleware2(object):

        def process_spider_input(self, response, spider):
            print("input2")
            raise TypeError('input2 raised an exception')

        def process_spider_output(self, response, result, spider):
            print('output2', list(result))
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception2')


    class SpiderMiddleware3(object):

        def process_spider_input(self, response, spider):
            print("input3")
            return None

        def process_spider_output(self, response, result, spider):
            print('output3', list(result))
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception3')


    # Step 4: analyze the output
    input1
    input2
    output3 [1, 2, 3, 4, 5]  # parse_err's return value is put into the pipeline and can only be
                             # consumed once; inside output3 you could build a new request from the failure info
    output2 []
    output1 []
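
    The empty lists in output2 and output1 fall out of iterator semantics: the result flowing through the chain behaves like a one-shot iterable, so the first middleware that calls list(result) exhausts it. A minimal stand-alone model:

```python
def errback_result():
    # models parse_err's payload as it flows through the chain: a one-shot iterable
    yield from [1, 2, 3, 4, 5]

result = errback_result()
print('output3', list(result))  # output3 [1, 2, 3, 4, 5] -- consumes the iterable
print('output2', list(result))  # output2 [] -- already exhausted
print('output1', list(result))  # output1 []
```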
    
  • Original post: https://www.cnblogs.com/kai-/p/12682401.html