Downloader Middleware
The downloader middleware is a framework of hooks into Scrapy's request/response processing: a light, low-level system for globally modifying Scrapy requests and responses.
That description sounds convoluted, but in plain terms it boils down to: swapping proxy IPs, swapping cookies, swapping the User-Agent, and retrying automatically.
1. Write it in middlewares.py (the class name is up to you)
2. Activate the downloader middleware
For the configuration to take effect, enable it in settings.py:
```python
SPIDER_MIDDLEWARES = {
    'cnblogs_crawl.middlewares.CnblogsCrawlSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'cnblogs_crawl.middlewares.CnblogsCrawlDownloaderMiddleware': 543,
}
```
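For reference, a minimal sketch of what the downloader middleware class registered above might look like; the method bodies here are just the pass-through defaults, with the real logic filled in by the later examples:

```python
# middlewares.py -- a minimal pass-through sketch of the class registered above
class CnblogsCrawlDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # None lets the request continue through the middleware chain
        return None

    def process_response(self, request, response, spider):
        # must return a Response object (or a Request, or raise IgnoreRequest)
        return response

    def process_exception(self, request, exception, spider):
        # None lets other middlewares handle the exception
        pass
```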
Uses of the downloader middleware
1. Inside process_request, perform the download yourself instead of using Scrapy's downloader
2. Post-process the request, for example:
   - set request headers
   - set cookies
   - add a proxy
Scrapy's built-in proxy component:
```python
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from urllib.request import getproxies

# Called for every request that needs to be downloaded, passing through every
# downloader middleware.
# - process_request (on the way out):
#     return None:      keep processing this request, move on to the next middleware
#     return Response:  this request is done; hand the Response to the engine
#                       (you can download it yourself and wrap it as a Response)
#     return Request:   hand the Request back to the engine, which reschedules it
#     raise exception:  process_exception is executed
# Called once the spider side is done and the response travels back.
# - process_response (on the way back):
#     return a Response object: keep processing this Response through the remaining middlewares
#     return a Request object:  hand it back to the engine for rescheduling
#     raise IgnoreRequest:      process_exception is executed
# Called when the download handler or a downloader middleware's process_request()
# raises an exception.
# - process_exception (on an exception):
#     return None:              continue processing this exception
#     return a Response object: stops the process_exception() chain; the Response goes to the engine (then the spider)
#     return a Request object:  stops the process_exception() chain; the Request goes to the engine (rescheduled)
```
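To make the "return Response" branch above concrete, here is a hedged sketch of a middleware that downloads the page itself and wraps the result, bypassing Scrapy's own downloader. The use of the requests library and the hard-coded User-Agent are assumptions, not part of the original:

```python
import requests
from scrapy.http import HtmlResponse

class SelfDownloadMiddleware(object):
    # hypothetical middleware: returning a Response from process_request ends
    # the download phase; the engine treats it as the final response
    def process_request(self, request, spider):
        res = requests.get(request.url, headers={'User-Agent': 'Mozilla/5.0'})
        return HtmlResponse(url=request.url,
                            body=res.content,
                            encoding='utf-8',
                            request=request)
```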
Adding a proxy and changing the UA in a middleware
```python
class TttDownloaderMiddleware(object):
    def get_proxy(self):
        import requests
        # fetch a proxy address from a local proxy pool
        res = requests.get('http://0.0.0.0:5010/get').json()['proxy']
        print(res)
        return res

    def process_request(self, request, spider):
        # 1. add cookies (request.cookies are the cookies sent to the site)
        # request.cookies = {'name': 'value'}
        # 2. add a proxy
        request.meta['proxy'] = self.get_proxy()
        # 3. change the User-Agent
        from fake_useragent import UserAgent
        ua = UserAgent(verify_ssl=False)
        request.headers['User-Agent'] = ua.random
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass
```
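The automatic retry mentioned in the introduction can be hung off process_exception in the same way. A sketch, assuming the same local proxy pool as above (the class name is hypothetical):

```python
import requests

class ProxyRetryMiddleware(object):
    # hypothetical sketch: on a download error, swap in a fresh proxy and
    # return the Request, which stops the exception chain and reschedules it
    def get_proxy(self):
        return requests.get('http://0.0.0.0:5010/get').json()['proxy']

    def process_exception(self, request, exception, spider):
        request.meta['proxy'] = self.get_proxy()
        request.dont_filter = True  # let the retried URL through the dupe filter
        return request
```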
Spider Middleware
Overview of spider middleware methods
```python
from scrapy import signals

class SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        # spider_opened fires when this spider starts running
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        # spider.logger.info('I am crawler 1 sent by egon: %s' % spider.name)
        print('I am crawler 1 sent by egon: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        # Must return only requests (not items).
        print('start_requests1')
        for r in start_requests:
            yield r

    def process_spider_input(self, response, spider):
        # Called for every response that passes through the spider middleware
        # on its way into the spider.
        # Return value: should return None or raise an exception.
        # 1. None: continue with the other middlewares' process_spider_input
        # 2. raise an exception:
        #    once raised, the remaining process_spider_input methods are skipped
        #    and the errback bound to the request is triggered;
        #    the errback's return value is passed backwards through the
        #    middlewares' process_spider_output;
        #    if no errback is found, the middlewares' process_spider_exception
        #    methods run in reverse order instead
        print("input1")
        return None

    def process_spider_output(self, response, result, spider):
        # Must return an iterable of Request, dict or Item objects.
        print('output1')
        # yielding several times is equivalent to returning once;
        # if you are shaky on generators (a function containing yield returns a
        # generator and does not run immediately), the generator form can easily
        # mislead you about the execution order of the middlewares
        # for i in result:
        #     yield i
        return result

    def process_spider_exception(self, response, exception, spider):
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        print('exception1')
```
When the spider starts and the initial requests are generated
```python
from scrapy import signals

class SpiderMiddleware1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        # spider_opened fires when this spider starts running
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        print('I am crawler 1: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        # Must return only requests (not items).
        print('start_requests1')
        for r in start_requests:
            yield r

class SpiderMiddleware2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        # spider_opened fires when this spider starts running
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        print('I am crawler 2: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        print('start_requests2')
        for r in start_requests:
            yield r

# Step 3: analyse the output
# 1. Executed immediately when the spider starts:
#    I am crawler 1: baidu
#    I am crawler 2: baidu
# 2. Then an initial request is generated and passes through spider
#    middlewares 1, 2 in order:
#    start_requests1
#    start_requests2
```
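The classes above only take effect once they are registered; a sketch of the SPIDER_MIDDLEWARES setting this example assumes (the Baidu project name is borrowed from the errback example further down):

```python
# settings.py -- assumed registration for the two middlewares above
SPIDER_MIDDLEWARES = {
    'Baidu.middlewares.SpiderMiddleware1': 200,
    'Baidu.middlewares.SpiderMiddleware2': 300,
}
```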
When process_spider_input returns None
```python
from scrapy import signals

class SpiderMiddleware1(object):
    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')

class SpiderMiddleware2(object):
    def process_spider_input(self, response, spider):
        print("input2")
        return None

    def process_spider_output(self, response, result, spider):
        print('output2')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')

# Step 3: analyse the output
# 1. When a response comes back, it passes through spider middlewares 1, 2 in order:
#    input1
#    input2
# 2. After the spider is done, the result passes back through middlewares 2, 1:
#    output2
#    output1
```
When process_spider_input raises an exception
```python
from scrapy import signals

class SpiderMiddleware1(object):
    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')

class SpiderMiddleware2(object):
    def process_spider_input(self, response, spider):
        print("input2")
        raise TypeError('input2 raised an exception')

    def process_spider_output(self, response, result, spider):
        print('output2')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')

class SpiderMiddleware3(object):
    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')

# Output:
# input1
# input2
# exception3
# exception2
# exception1

# Analysis:
# 1. The response passes through middleware 1's process_spider_input, which returns
#    None, so it continues to middleware 2's process_spider_input.
# 2. Middleware 2's process_spider_input raises an exception, so the remaining
#    process_spider_input methods are skipped and the exception is handed to the
#    errback bound to that request in the spider.
# 3. No errback is found, so the response is handled by neither the spider's normal
#    callback nor an errback; the spider does nothing with it, and the
#    process_spider_exception methods then run in reverse order.
# 4. If a process_spider_exception returns None, it passes the buck: it does not
#    handle the exception but hands it on to the next process_spider_exception.
#    If they all return None, the exception is finally raised by the Engine.
```
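To make step 4 of the analysis concrete: a middleware can take responsibility for the exception by returning an iterable instead of None, which stops the exception chain. A minimal sketch (the class name is hypothetical):

```python
class RecoveringSpiderMiddleware(object):
    # hypothetical: returning an iterable (here, empty) marks the exception as
    # handled; the remaining middlewares' process_spider_output then run on it
    def process_spider_exception(self, response, exception, spider):
        spider.logger.warning('swallowed %r for %s', exception, response.url)
        return []
```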
Specifying an errback
```python
# Step 1: spider.py
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def start_requests(self):
        yield scrapy.Request(url='http://www.baidu.com/',
                             callback=self.parse,
                             errback=self.parse_err,
                             )

    def parse(self, response):
        pass

    def parse_err(self, res):
        # res is the failure information; the exception has now been handled by
        # this function, so it is not raised any further, and process_spider_output
        # runs next
        # extract whatever is useful from the failure and return it as an iterable,
        # to be consumed by process_spider_output
        return [1, 2, 3, 4, 5]

# Step 2: uncomment in settings.py
'''
SPIDER_MIDDLEWARES = {
    'Baidu.middlewares.SpiderMiddleware1': 200,
    'Baidu.middlewares.SpiderMiddleware2': 300,
    'Baidu.middlewares.SpiderMiddleware3': 400,
}
'''

# Step 3: middlewares.py
from scrapy import signals

class SpiderMiddleware1(object):
    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1', list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')

class SpiderMiddleware2(object):
    def process_spider_input(self, response, spider):
        print("input2")
        raise TypeError('input2 raised an exception')

    def process_spider_output(self, response, result, spider):
        print('output2', list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')

class SpiderMiddleware3(object):
    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3', list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')

# Step 4: analyse the output
# input1
# input2
# output3 [1, 2, 3, 4, 5]   <- parse_err's return value is handed on, and can only be
#                              consumed once; inside output3 you could build a new
#                              request from the failure information
# output2 []
# output1 []
```
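In real code the errback usually inspects the Twisted Failure it receives rather than returning placeholder values. A hedged sketch of such a method inside the spider class, along the lines of the Scrapy documentation:

```python
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

def parse_err(self, failure):
    # failure is a twisted.python.failure.Failure wrapping the original exception
    if failure.check(HttpError):
        # non-2xx responses raise HttpError; the response is attached to the failure
        self.logger.error('HttpError on %s', failure.value.response.url)
    elif failure.check(DNSLookupError):
        self.logger.error('DNSLookupError on %s', failure.request.url)
    elif failure.check(TimeoutError):
        self.logger.error('TimeoutError on %s', failure.request.url)
```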