Scrapy Framework Fundamentals: Twisted
Internally, Scrapy achieves concurrent crawling through an event-loop mechanism.
Before:

```python
import requests

url_list = ['http://www.baidu.com', 'http://www.baidu.com', 'http://www.baidu.com']
for item in url_list:
    response = requests.get(item)
    print(response.text)
```
Now:

```python
from twisted.web.client import getPage, defer
from twisted.internet import reactor

# Part 1: start handing out tasks
def callback(contents):
    print(contents)

deferred_list = []
url_list = ['http://www.bing.com', 'https://segmentfault.com/', 'https://stackoverflow.com/']
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

# Part 2: once every task has finished, stop the loop
dlist = defer.DeferredList(deferred_list)

def all_done(arg):
    reactor.stop()

dlist.addBoth(all_done)

# Part 3: start processing
reactor.run()
```
What is Twisted?

- Officially: an asynchronous, non-blocking module based on an event loop.
- In plain terms: one thread can issue HTTP requests to many targets at the same time.

Non-blocking: no waiting; all requests are fired off together. When connecting to requests A, B, and C, we don't wait for one connection to come back before starting the next; we send one and immediately send the next.
```python
import socket

sk1 = socket.socket()
sk1.setblocking(False)        # non-blocking mode
sk1.connect(('1.1.1.1', 80))  # returns immediately (raises BlockingIOError; the handshake continues in the background)

sk2 = socket.socket()
sk2.setblocking(False)
sk2.connect(('1.1.1.2', 80))

sk3 = socket.socket()
sk3.setblocking(False)
sk3.connect(('1.1.1.3', 80))
```
Asynchronous: callbacks. As soon as the results A, B, and C that callback_A, callback_B, and callback_C are waiting for arrive, they are notified automatically.

```python
def callback(contents):
    print(contents)
```

Event loop: keep cycling over the three socket tasks (requests A, B, and C), checking each one's state: has the connection succeeded, and has a result come back.
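The three ideas (non-blocking, callbacks, event loop) can be sketched without any networking at all. The toy loop below uses invented names for illustration only; a real event loop polls sockets via select/epoll rather than counting ticks. It keeps cycling over pending tasks, checks each one's state, and fires the callback as soon as a task reports ready:

```python
results = []

class Task:
    def __init__(self, name, ticks_until_ready, callback):
        self.name = name
        self.ticks = ticks_until_ready  # pretend network latency
        self.callback = callback

    def check(self):
        # state check: "is the connection done / has the response arrived?"
        self.ticks -= 1
        return self.ticks <= 0

def event_loop(tasks):
    pending = list(tasks)
    while pending:                         # keep looping over all tasks
        for task in list(pending):
            if task.check():               # poll the task's state
                task.callback(task.name)   # asynchronous: notify via callback
                pending.remove(task)

event_loop([
    Task('A', 1, results.append),
    Task('B', 3, results.append),
    Task('C', 2, results.append),
])
print(results)  # tasks finish in readiness order, not submission order
```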
How does it differ from requests?

requests is a Python module that can fake a browser to send HTTP requests:
- wraps sockets to send requests

Twisted is an asynchronous, non-blocking network framework based on an event loop:
- wraps sockets to send requests
- completes concurrent requests in a single thread

PS: three key terms
- non-blocking: no waiting
- asynchronous: callbacks
- event loop: continuously loop to check states
Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs for data mining, information processing, or storing historical data.

It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy's uses are broad: data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture looks roughly like this:
Scrapy's main components:

- Engine (Scrapy)

  Handles the data flow of the whole system and triggers events (the framework's core).

- Scheduler

  Accepts requests from the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses of pages to crawl) that decides what to fetch next while removing duplicate URLs.

- Downloader

  Downloads page content and hands it back to the spiders. (The downloader is built on Twisted's efficient asynchronous model.)

- Spiders

  The main workers: they extract the information needed from specific pages, i.e. the so-called Items. They can also extract links from pages so Scrapy continues to crawl the next page.

- Item Pipeline

  Processes the items the spiders extract from pages; its main jobs are persisting items, validating them, and stripping unneeded data. After a page is parsed by a spider, its items are sent to the pipeline and processed through several stages in a fixed order.

- Downloader Middlewares

  A layer between the engine and the downloader; it processes the requests and responses passing between them.

- Spider Middlewares

  A layer between the engine and the spiders; it processes the spiders' response input and request output.

- Scheduler Middlewares

  A layer between the engine and the scheduler; it processes requests and responses passing between them.
The Scrapy run flow is roughly:

- The engine finds the spider to run and calls its start_requests method, obtaining an iterator.
- Iterating yields Request objects; each Request wraps a URL to visit and a callback. All the Request objects (tasks) go into the scheduler's request queue, with duplicates removed.
- The downloader asks the engine for a task to download (a Request object); the engine asks the scheduler, the scheduler pops a Request off the queue and returns it to the engine, and the engine hands it to the downloader.
- When the download finishes, the downloader returns a Response object to the engine, which invokes the callback.
- Back in the spider's callback, the spider parses the Response:
- yield Item(): a parsed entity (Item) is handed to the item pipeline for further processing.
- yield Request(): a parsed link (URL) is handed to the scheduler to await crawling.
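The last two bullets, yield Item() vs. yield Request(), amount to a routing decision by the engine. This can be sketched in plain Python; Request, Item, and route here are simplified stand-ins for illustration, not Scrapy's real classes:

```python
# Stand-in classes: a Request wraps a URL + callback, an Item is just data.
class Request:
    def __init__(self, url, callback=None):
        self.url, self.callback = url, callback

class Item(dict):
    pass

def route(callback_output, scheduler, pipeline):
    # The engine iterates whatever the spider's callback yielded and
    # routes Requests to the scheduler, Items to the item pipeline.
    for obj in callback_output:
        if isinstance(obj, Request):
            scheduler.append(obj)   # waits in the request queue
        elif isinstance(obj, Item):
            pipeline.append(obj)    # handed to the item pipeline

def parse(response):
    yield Item(title='demo')
    yield Request(url='http://example.com/page2')

scheduler, pipeline = [], []
route(parse(response=None), scheduler, pipeline)
print(len(scheduler), len(pipeline))  # 1 1
```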
I. Basic commands and project structure

Basic commands
```shell
# Create a project: creates a project directory in the current directory (similar to Django)
scrapy startproject <project_name>

# Create a spider
cd <project_name>
scrapy genspider [-t template] <name> <domain>
scrapy genspider -t basic oldboy oldboy.com
scrapy genspider -t crawl weisuen sohu.com
# PS:
#   list available templates: scrapy genspider -l
#   show a template:          scrapy genspider -d <template_name>

# List the project's spiders
scrapy list

# Run a spider
scrapy crawl <spider_name>
scrapy crawl quotes
scrapy runspider quote
scrapy crawl lagou -s JOBDIR=job_info/001   # pause and resume

# Save the output to a file
scrapy crawl quotes -o quotes.json

# Test in the shell
scrapy shell 'http://scrapy.org' --nolog
```
For the full command reference, see: the detailed guide to Scrapy's command-line tool.
Project structure

```
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py
```
File descriptions:

- scrapy.cfg: the project's main configuration entry point. (The actual crawler-related settings live in settings.py.)
- items.py: data storage templates for structured data, like Django's Model.
- pipelines.py: data processing behavior, e.g. persisting structured data.
- settings.py: configuration, e.g. recursion depth, concurrency, download delay.
- spiders/: the spider directory, where you create files and write crawling rules.

Note: spider files are usually named after the target site's domain.
II. Writing spiders

Scrapy provides five kinds of spiders for building requests, parsing data, and returning items. The two commonly used ones are scrapy.Spider and scrapy.CrawlSpider. Below are the attributes and methods spiders use most often.
scrapy.Spider

| Attribute / method | Purpose | Notes |
| --- | --- | --- |
| name | the spider's name | used when launching the spider |
| start_urls | starting URLs | a list, consumed by the default start_requests |
| allowed_domains | simple URL filtering | when a requested URL matches nothing in allowed_domains you get a particularly nasty error; see my post on distributed crawling |
| start_requests() | the first requests | override it in your own spider to get past simple anti-crawling measures |
| custom_settings | per-spider settings | lets you customize the settings for each spider |
| from_crawler | instantiation entry point | the first thing executed in each Scrapy component's source |
With spiders you can customize start_requests, set custom_settings per spider, and configure request headers, proxies, and cookies. These basics were all covered in earlier posts; the extra thing worth discussing here is page parsing.

Page parsing comes in two flavors. One is article lists whose titles carry little information (news and feed sites): you must visit the article itself, so in your loop you parse the url out of each a tag's href attribute. The other is e-commerce-style sites, where the product list itself is information-rich: you need both the product info from the list and the user reviews from the detail page, so you loop over the concrete li tags, then parse the a tag inside each li to follow up to the next page.
scrapy.CrawlSpider

As the documentation explains, it uses rules you define to regex-match each requested URL, complementing allowed_domains; it is commonly used for whole-site crawls. You can also crawl a whole site by exploiting the site's own URL patterns: the Sina news crawler and the Dangdang books crawler on this blog both use URL-matching rules I defined myself.
1. start_urls

How it works internally

The Scrapy engine fetches the starting URLs from the spider:

1. call start_requests and take its return value
2. v = iter(return value)
3. req1 = v.__next__()
   req2 = v.__next__()
   req3 = v.__next__()
   ...
4. put all the requests into the scheduler
```python
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        # Option 1: return a generator
        for url in self.start_urls:
            yield Request(url=url)
        # Option 2: return a list
        # req_list = []
        # for url in self.start_urls:
        #     req_list.append(Request(url=url))
        # return req_list
```

- Customization: the starting URLs could instead be fetched from redis.
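The four internal steps listed above can be reproduced in plain Python. The engine simply drains whatever iterable start_requests returns; in this sketch plain URLs stand in for Request objects:

```python
def start_requests():
    # in Scrapy these would be Request objects
    for url in ['https://dig.chouti.com/', 'https://dig.chouti.com/all/hot/recent/1']:
        yield url

# steps 1-2: call start_requests and wrap the return value in an iterator
v = iter(start_requests())
scheduler = []

# steps 3-4: drain it with __next__ and put every request into the scheduler
while True:
    try:
        scheduler.append(v.__next__())
    except StopIteration:
        break

print(scheduler)
```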
2. The response

```python
# response wraps all data related to the response:
# - response.text
# - response.encoding
# - response.body
# - response.meta['depth']  # crawl depth
# - response.request        # the request that produced this response; a request
#                           # wraps the URL to visit and the callback to run
#                           # once the download finishes
```
3. Selectors

```python
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <ul>
        <li class="item-"><a id='i1' href="link.html">first item</a></li>
        <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
        <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
    </ul>
    <div><a href="llink2.html">second item</a></div>
</body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

# Each of the following can be inspected with print(hxs):
# hxs = Selector(response=response).xpath('//a')
# hxs = Selector(text=html).xpath('//a')
# hxs = Selector(response=response).xpath('//a[2]')
# hxs = Selector(response=response).xpath('//a[@id]')
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]')
# hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/text()').extract()
# hxs = Selector(response=response).xpath('//a[re:test(@id, "id+")]/@href').extract()
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or: v = item.xpath('a/span')
#     # or: v = item.xpath('*/a/span')
#     print(v)
```
response.css('...') returns a SelectorList (which supports further .css()/.xpath() calls)
response.css('...').extract() returns a list of strings
response.css('...').extract_first() returns the first element of that list
```python
def parse_detail(self, response):
    # Manual extraction with XPath selectors:
    # items = JobboleArticleItem()
    # title = response.xpath('//div[@class="entry-header"]/h1/text()')[0].extract()
    # create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace('·', '').strip()
    # praise_nums = int(response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first())
    # fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first()
    # try:
    #     if re.match('.*?(\d+).*', fav_nums).group(1):
    #         fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
    #     else:
    #         fav_nums = 0
    # except:
    #     fav_nums = 0
    # comment_nums = response.xpath('//a[contains(@href,"#article-comment")]/span/text()').extract()[0]
    # try:
    #     if re.match('.*?(\d+).*', comment_nums).group(1):
    #         comment_nums = int(re.match('.*?(\d+).*', comment_nums).group(1))
    #     else:
    #         comment_nums = 0
    # except:
    #     comment_nums = 0
    # content = response.xpath('//div[@class="entry"]').extract()[0]
    # tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
    # tag_list = [tag for tag in tag_list if not tag.strip().endswith('评论')]
    # tags = ",".join(tag_list)
    # items['title'] = title
    # try:
    #     create_date = datetime.datetime.strptime(create_date, '%Y/%m/%d').date()
    # except:
    #     create_date = datetime.datetime.now()
    # items['date'] = create_date
    # items['url'] = response.url
    # items['url_object_id'] = get_md5(response.url)
    # items['img_url'] = [img_url]
    # items['praise_nums'] = praise_nums
    # items['fav_nums'] = fav_nums
    # items['comment_nums'] = comment_nums
    # items['content'] = content
    # items['tags'] = tags
```
```python
    # The same extraction with CSS selectors:
    # title = response.css('.entry-header h1::text')[0].extract()
    # create_date = response.css('p.entry-meta-hide-on-mobile::text').extract()[0].strip().replace('·', '').strip()
    # praise_nums = int(response.css(".vote-post-up h10::text").extract_first())
    # fav_nums = response.css(".bookmark-btn::text").extract_first()
    # if re.match('.*?(\d+).*', fav_nums).group(1):
    #     fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
    # else:
    #     fav_nums = 0
    # comment_nums = response.css('a[href="#article-comment"] span::text').extract()[0]
    # if re.match('.*?(\d+).*', comment_nums).group(1):
    #     comment_nums = int(re.match('.*?(\d+).*', comment_nums).group(1))
    # else:
    #     comment_nums = 0
    # content = response.css('.entry').extract()[0]
    # tag_list = response.css('p.entry-meta-hide-on-mobile a::text')
    # tag_list = [tag for tag in tag_list if not tag.strip().endswith('评论')]
    # tags = ",".join(tag_list)
    # XPath-selector equivalents: /@href  /text()
```
```python
def parse_detail(self, response):
    img_url = response.meta.get('img_url', '')
    item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
    item_loader.add_css("title", ".entry-header h1::text")
    item_loader.add_value('url', response.url)
    item_loader.add_value('url_object_id', get_md5(response.url))
    item_loader.add_css('date', 'p.entry-meta-hide-on-mobile::text')
    item_loader.add_value("img_url", [img_url])
    item_loader.add_css("praise_nums", ".vote-post-up h10::text")
    item_loader.add_css("fav_nums", ".bookmark-btn::text")
    item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
    item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
    item_loader.add_css("content", "div.entry")
    items = item_loader.load_item()
    yield items
```
4. Issuing further requests

```python
yield Request(url='xxxx', callback=self.parse)

yield Request(url=parse.urljoin(response.url, post_url),
              meta={'img_url': img_url},
              callback=self.parse_detail)
```
5. Carrying cookies

Option 1: settings

Un-comment COOKIES_ENABLED = False in the settings file;
the cookie configured in the settings' default request headers is then used:

```python
COOKIES_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': '_ga=GA1.2.1597380101.1571015417; Hm_lvt_2207ecfb7b2633a3bc5c4968feb58569=1571015417; _emuch_index=1; _lastlogin=1; view_tid=11622561; _discuz_uid=13260353; _discuz_pw=bce925aa772ee8f7; last_ip=111.53.196.3_13260353; discuz_tpl=qing; _last_fid=189; _discuz_cc=29751646316289719; Hm_lpvt_2207ecfb7b2633a3bc5c4968feb58569=1571367951; _gat=1',
}
```
Option 2: DownloaderMiddleware

Set COOKIES_ENABLED = True in settings;
un-comment the downloader-middleware entry in settings;
then, in the middlewares file, find the downloader-middleware class and modify process_request to add request.cookies = {...}.
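A minimal sketch of that middleware method; the class name and cookie values are placeholders, and in a real project this lives in middlewares.py and is enabled in DOWNLOADER_MIDDLEWARES:

```python
class CookieDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # attach the cookies to every outgoing request
        request.cookies = {'sessionid': 'xxx', 'token': 'yyy'}  # placeholder values
        return None  # continue processing this request
```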
Option 3: override start_requests in the spider's main file

Approach 1: carry the cookies yourself

```python
cookie_dict = {}
cookie_jar = CookieJar()
cookie_jar.extract_cookies(response, response.request)
# Parse the cookies out of the jar into a dict
for k, v in cookie_jar._cookies.items():
    for i, j in v.items():
        for m, n in j.items():
            cookie_dict[m] = n.value
```

```python
yield Request(
    url='https://dig.chouti.com/login',
    method='POST',
    body='phone=8615735177116&password=zyf123&oneMonth=1',
    headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
    # cookies=cookie_obj._cookies,
    cookies=self.cookies_dict,
    callback=self.check_login,
)
```
Approach 2: meta

```python
yield Request(url=url, callback=self.login, meta={'cookiejar': True})
```
6. Passing values to a callback: meta

```python
def parse(self, response):
    yield scrapy.Request(url=parse.urljoin(response.url, post_url),
                         meta={'img_url': img_url},
                         callback=self.parse_detail)

def parse_detail(self, response):
    img_url = response.meta.get('img_url', '')
```
```python
from urllib.parse import urljoin

import scrapy
from scrapy import Request
from scrapy.http.cookies import CookieJar


class SpiderchoutiSpider(scrapy.Spider):
    name = 'choutilike'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookies_dict = {}

    def parse(self, response):
        # Take the cookies from the response headers; they live in a CookieJar object
        cookie_obj = CookieJar()
        cookie_obj.extract_cookies(response, response.request)
        # Parse the cookies out of the jar into a dict
        for k, v in cookie_obj._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookies_dict[m] = n.value
        # self.cookies_dict = cookie_obj._cookies
        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            body='phone=8615735177116&password=zyf123&oneMonth=1',
            headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            # cookies=cookie_obj._cookies,
            cookies=self.cookies_dict,
            callback=self.check_login,
        )

    def check_login(self, response):
        # print(response.text)
        yield Request(url='https://dig.chouti.com/all/hot/recent/1',
                      cookies=self.cookies_dict,
                      callback=self.good)

    def good(self, response):
        id_list = response.css('div.part2::attr(share-linkid)').extract()
        for id in id_list:
            url = 'https://dig.chouti.com/link/vote?linksId={}'.format(id)
            yield Request(
                url=url,
                method='POST',
                cookies=self.cookies_dict,
                callback=self.show,
            )
        pages = response.css('#dig_lcpage a::attr(href)').extract()
        for page in pages:
            url = urljoin('https://dig.chouti.com/', page)
            yield Request(url=url, callback=self.good)

    def show(self, response):
        print(response.text)
```
III. Persistence

1. Order of work

- a. Write the pipeline class first
- b. Write the Item class

```python
import scrapy


class ChoutiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
```

- c. Configure settings

```python
ITEM_PIPELINES = {
    # 'chouti.pipelines.XiaohuaImagesPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'chouti.pipelines.ChoutiPipeline': 300,
    # 'chouti.pipelines.Chouti2Pipeline': 301,
}
```

- d. In the spider, every yield of an Item object triggers one call to process_item.
2. Writing pipelines

How the source executes them:

```
1. Check whether the pipeline class (e.g. XdbPipeline) defines from_crawler:
   if yes: obj = XdbPipeline.from_crawler(...)
   if no:  obj = XdbPipeline()
2. obj.open_spider()
3. obj.process_item() is called once per item, again and again
4. obj.close_spider()
```
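Step 1 of that flow, preferring from_crawler when it exists, can be sketched as follows; the class names and the string crawler are illustrative stand-ins:

```python
class PlainPipeline(object):
    pass

class ConfiguredPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        obj = cls()
        obj.crawler = crawler  # e.g. read crawler.settings here
        return obj

def instantiate(pipeline_cls, crawler):
    # prefer the from_crawler factory when the class defines one
    if hasattr(pipeline_cls, 'from_crawler'):
        return pipeline_cls.from_crawler(crawler)
    return pipeline_cls()

a = instantiate(PlainPipeline, crawler='fake-crawler')
b = instantiate(ConfiguredPipeline, crawler='fake-crawler')
```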
```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class ChoutiPipeline(object):
    def __init__(self, conn_str):
        self.conn_str = conn_str

    @classmethod
    def from_crawler(cls, crawler):
        """Called at initialization time to create the pipeline object."""
        conn_str = crawler.settings.get('DB')
        return cls(conn_str)

    def open_spider(self, spider):
        """Called when the spider starts."""
        self.conn = open(self.conn_str, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'spiderchouti':
            self.conn.write('{} {}\n'.format(item['title'], item['href']))
        # hand the item to the next pipeline
        return item
        # to drop the item instead of passing it on:
        # raise DropItem()

    def close_spider(self, spider):
        """Called when the spider closes."""
        self.conn.close()
```
Note: pipelines are shared by all spiders; to customize behavior for one spider, branch on the spider argument yourself.
JSON files

```python
class JsonExporterPipeline(object):
    # use Scrapy's JsonItemExporter to export a json file
    def __init__(self):
        self.file = open('articleexpoter.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()  # start exporting

    def close_spider(self, spider):
        self.exporter.finish_exporting()  # stop exporting
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


class JsonWithEncodingPipeline(object):
    # export a json file by hand
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item

    def spider_closed(self):
        self.file.close()
```
存储图片
# -*- coding: utf-8 -*- from urllib.parse import urljoin import scrapy from ..items import XiaohuaItem class XiaohuaSpider(scrapy.Spider): name = 'xiaohua' allowed_domains = ['www.xiaohuar.com'] start_urls = ['http://www.xiaohuar.com/list-1-{}.html'.format(i) for i in range(11)] def parse(self, response): items = response.css('.item_list .item') for item in items: url = item.css('.img img::attr(src)').extract()[0] url = urljoin('http://www.xiaohuar.com',url) title = item.css('.title span a::text').extract()[0] obj = XiaohuaItem(img_url=[url],title=title) yield obj
class XiaohuaItem(scrapy.Item): img_url = scrapy.Field() title = scrapy.Field() img_path = scrapy.Field()
class XiaohuaImagesPipeline(ImagesPipeline): #调用scrapy提供的imagepipeline下载图片 def item_completed(self, results, item, info): if "img_url" in item: for ok,value in results: print(ok,value) img_path = value['path'] item['img_path'] = img_path return item def get_media_requests(self, item, info): # 下载图片 if "img_url" in item: for img_url in item['img_url']: yield scrapy.Request(img_url, meta={'item': item, 'index': item['img_url'].index(img_url)}) # 添加meta是为了下面重命名文件名使用 def file_path(self, request, response=None, info=None): item = request.meta['item'] if "img_url" in item:# 通过上面的meta传递过来item index = request.meta['index'] # 通过上面的index传递过来列表中当前下载图片的下标 # 图片文件名,item['carname'][index]得到汽车名称,request.url.split('/')[-1].split('.')[-1]得到图片后缀jpg,png image_guid = item['title'] + '.' + request.url.split('/')[-1].split('.')[-1] # 图片下载目录 此处item['country']即需要前面item['country']=''.join()......,否则目录名会变成u97e9u56fdu6c7du8f66u6807u5fd7xxx.jpg filename = u'full/{0}'.format(image_guid) return filename
ITEM_PIPELINES = { # 'chouti.pipelines.XiaohuaImagesPipeline': 300, 'scrapy.pipelines.images.ImagesPipeline': 1, }
MySQL database

```python
class MysqlPipeline(object):
    def __init__(self):
        self.conn = pymysql.connect('localhost', 'root', '0000', 'crawed',
                                    charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """insert into article(title,url,create_date,fav_nums) values (%s,%s,%s,%s)"""
        self.cursor.execute(insert_sql, (item['title'], item['url'], item['date'], item['fav_nums']))
        self.conn.commit()


class MysqlTwistePipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DB'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('pymysql', **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the MySQL insert asynchronously via Twisted
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error)  # handle exceptions

    def handle_error(self, failure):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        try:
            cursor.execute(insert_sql, params)
            print('insert succeeded')
        except Exception as e:
            print('insert failed')
```

```python
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '0000'
MYSQL_DB = 'crawed'
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
RANDOM_UA_TYPE = "random"
ES_HOST = "127.0.0.1"
```
IV. Deduplication rules

Scrapy's default dedup rule:

```python
from scrapy.dupefilter import RFPDupeFilter
```
```python
from __future__ import print_function
import os
import logging

from scrapy.utils.job import job_dir
from scrapy.utils.request import request_fingerprint


class BaseDupeFilter(object):

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        return False

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass


class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

    def request_fingerprint(self, request):
        return request_fingerprint(request)

    def close(self, reason):
        if self.file:
            self.file.close()

    def log(self, request, spider):
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
```
Custom dedup rules

1. Write the class

```python
# -*- coding: utf-8 -*-
"""
@Datetime: 2018/8/31
@Author: Zhang Yafei
"""
from scrapy.dupefilter import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class RepeatFilter(BaseDupeFilter):
    def __init__(self):
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)

    def open(self):  # can return deferred
        print('open')

    def close(self, reason):  # can return a deferred
        print('close')

    def log(self, request, spider):  # log that a request has been filtered
        pass
```

2. Configure it

```python
# replace the default dedup rule
# DUPEFILTER_CLASS = "chouti.duplication.RepeatFilter"
DUPEFILTER_CLASS = "chouti.dupeFilter.RepeatFilter"
```
3. Using it from a spider

```python
from urllib.parse import urljoin

import scrapy
from scrapy.http import Request

from ..items import ChoutiItem


class SpiderchoutiSpider(scrapy.Spider):
    name = 'spiderchouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # titles on the current page
        print(response.request.url)
        # news = response.css('.content-list .item')
        # for new in news:
        #     title = new.css('.show-content::text').extract()[0].strip()
        #     href = new.css('.show-content::attr(href)').extract()[0]
        #     item = ChoutiItem(title=title, href=href)
        #     yield item

        # all page links
        pages = response.css('#dig_lcpage a::attr(href)').extract()
        for page in pages:
            url = urljoin(self.start_urls[0], page)
            # hand the new URL to the scheduler
            yield Request(url=url, callback=self.parse)
```
Notes:
- write the logic in request_seen correctly
- dont_filter=False (a request created with dont_filter=True bypasses the filter)
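How the scheduler consults the filter, and how dont_filter bypasses it, can be sketched in plain Python; Request, RepeatFilter, and enqueue here are simplified stand-ins, not Scrapy's real enqueue logic:

```python
class Request:
    def __init__(self, url, dont_filter=False):
        self.url, self.dont_filter = url, dont_filter

class RepeatFilter:
    def __init__(self):
        self.visited = set()

    def request_seen(self, request):
        # fingerprinting simplified to the raw URL
        if request.url in self.visited:
            return True
        self.visited.add(request.url)
        return False

def enqueue(request, flt, queue):
    # dont_filter=True skips the dedup check entirely
    if not request.dont_filter and flt.request_seen(request):
        return False  # dropped as a duplicate
    queue.append(request)
    return True

flt, queue = RepeatFilter(), []
enqueue(Request('http://a.com'), flt, queue)                     # accepted
enqueue(Request('http://a.com'), flt, queue)                     # dropped (duplicate)
enqueue(Request('http://a.com', dont_filter=True), flt, queue)   # accepted anyway
print(len(queue))  # 2
```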
V. Middleware

Downloader middleware
```python
from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('md1.process_request', request)
        # 1. Return a Response
        # import requests
        # result = requests.get(request.url)
        # return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)
        # 2. Return a Request
        # return Request('https://dig.chouti.com/r/tec/hot/1')
        # 3. Raise an exception
        # from scrapy.exceptions import IgnoreRequest
        # raise IgnoreRequest
        # 4. Modify the request in place (*)
        # request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        pass

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
```
```python
DOWNLOADER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbDownloaderMiddleware': 543,
    # 'xdb.proxy.XdbProxyMiddleware': 751,
    'xdb.md.Md1': 666,
    'xdb.md.Md2': 667,
}
```
Applications: setting the user-agent; setting proxies.
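For example, a downloader middleware that rotates the user-agent per request; the class name and the USER_AGENTS list are illustrative, and the class only needs a process_request method:

```python
import random

# illustrative pool of user-agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15',
]

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # overwrite the header on every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # continue processing this request
```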
Spider middleware

```python
class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Runs only once, when the spider starts.
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r
```

```python
SPIDER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbSpiderMiddleware': 543,
    'xdb.sd.Sd1': 666,
    'xdb.sd.Sd2': 667,
}
```
Applications: tracking depth; adjusting priority.
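For example, depth limiting can be done in process_spider_output by dropping outgoing requests once the response is too deep. MAX_DEPTH and the duck-typed objects below are illustrative; Scrapy ships this kind of logic in its built-in DepthMiddleware:

```python
MAX_DEPTH = 2  # illustrative limit

class DepthLimitMiddleware(object):
    def process_spider_output(self, response, result, spider):
        depth = response.meta.get('depth', 0)
        for obj in result:
            # anything without a url attribute is treated as an item: always pass through
            if getattr(obj, 'url', None) is None:
                yield obj
            elif depth < MAX_DEPTH:
                yield obj  # the request is still shallow enough to follow
```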
```python
class SpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        """
        Runs after the download completes, before the response is handed to parse.
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called when the spider finishes processing and returns its results.
        :return: must be an iterable containing Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called on an exception.
        :return: None to let later middleware keep handling the exception;
                 or an iterable of Response or Item objects, handed to the
                 scheduler or pipeline
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts.
        :return: an iterable containing Request objects
        """
        return start_requests
```

```python
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called, via every downloader middleware's process_request, when a
        request needs to be downloaded.
        :return: None: continue to later middleware and the download;
                 Response object: stop process_request and start process_response;
                 Request object: stop the middleware chain and put the Request
                 back into the scheduler;
                 raise IgnoreRequest: stop process_request and start process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called on the way back, once the response has been downloaded.
        :return: Response object: passed to other middlewares' process_response;
                 Request object: stop the middleware chain, the request is
                 rescheduled for download;
                 raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or process_request() (in a downloader
        middleware) raises an exception.
        :return: None: let later middleware keep handling the exception;
                 Response object: stop later process_exception methods;
                 Request object: stop the middleware chain, the request will be
                 rescheduled for download
        """
        return None
```
Setting proxies

Set the proxy in os.environ ahead of time, when the spider starts:

```python
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        import os
        os.environ['HTTPS_PROXY'] = "http://root:woshiniba@192.168.11.11:9999/"
        os.environ['HTTP_PROXY'] = '19.11.2.32'
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)
```

Or via meta:

```python
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse,
                          meta={'proxy': 'http://root:woshiniba@192.168.11.11:9999/'})
```

Or in a custom downloader middleware:

```python
import base64
import random
from six.moves.urllib.parse import unquote
try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse
from scrapy.utils.python import to_bytes


class XdbProxyMiddleware(object):
    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)), encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:woshiniba@192.168.11.11:9999/",
            "http://root:woshiniba@192.168.11.12:9999/",
            "http://root:woshiniba@192.168.11.13:9999/",
            "http://root:woshiniba@192.168.11.14:9999/",
            "http://root:woshiniba@192.168.11.15:9999/",
            "http://root:woshiniba@192.168.11.16:9999/",
        ]
        url = random.choice(PROXIES)

        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds


class DdbProxyMiddleware(object):
    def process_request(self, request, spider):
        PROXIES = [
            {'ip_port': '111.11.228.75:80', 'user_pass': ''},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
        else:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
```
VI. Custom commands

Running a single spider: main.py

```python
import os
import sys

from scrapy.cmdline import execute

sys.path.append(os.path.dirname(__file__))

# execute(['scrapy', 'crawl', 'spiderchouti', '--nolog'])
# os.system('scrapy crawl spiderchouti')
# os.system('scrapy crawl xiaohua')
os.system('scrapy crawl choutilike --nolog')
```
Running all spiders:

- create a directory (any name, e.g. commands) alongside spiders
- inside it, create a crawlall.py file (this file name becomes the custom command name)
- add COMMANDS_MODULE = '<project_name>.<directory_name>' to settings.py
- run the command from the project directory: scrapy crawlall

```python
# -*- coding: utf-8 -*-
"""
@Datetime: 2018/9/1
@Author: Zhang Yafei
"""
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        print(type(self.crawler_process))
        from scrapy.crawler import CrawlerProcess
        # 1. Run the CrawlerProcess constructor
        # 2. The CrawlerProcess object (which holds the settings) and its spiders:
        #    2.1 create a Crawler for each spider
        #    2.2 run d = Crawler.crawl(...)
        #        d.addBoth(_done)
        #    2.3 CrawlerProcess object._active = {d,}
        # 3. dd = defer.DeferredList(self._active)
        #    dd.addBoth(self._stop_reactor)  # self._stop_reactor ==> reactor.stop()
        #    reactor.run()

        # find all spider names
        spider_list = self.crawler_process.spiders.list()
        # spider_list = ['choutilike', 'xiaohua']  # or crawl any subset
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
```
VII. Signals
Signals are hook points the framework reserves for you, letting you attach custom behavior at well-defined moments.
Built-in signals
# engine started / stopped
engine_started = object()
engine_stopped = object()
# spider opened
spider_opened = object()
# spider idle
spider_idle = object()
# spider closed
spider_closed = object()
# spider raised an exception
spider_error = object()
# request pushed to the scheduler
request_scheduled = object()
# request dropped
request_dropped = object()
# response received
response_received = object()
# response finished downloading
response_downloaded = object()
# item scraped
item_scraped = object()
# item dropped
item_dropped = object()
Custom extensions
from scrapy import signals


class MyExtend(object):
    def __init__(self, crawler):
        self.crawler = crawler
        # register handlers on the chosen signals (hang your code on the hooks)
        crawler.signals.connect(self.start, signals.engine_started)
        crawler.signals.connect(self.close, signals.engine_stopped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start(self):
        print('signals.engine_started start')

    def close(self):
        print('signals.engine_stopped close')
from scrapy import signals


class MyExtend(object):
    @classmethod
    def from_crawler(cls, crawler):
        self = cls()
        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)
        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')
Configuration
EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    'chouti.extensions.MyExtend': 200,
}
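The mechanism behind crawler.signals is just a registry mapping signal objects to callbacks: connect registers a callback, and the engine fires the signal at the reserved point. A stripped-down, pure-Python sketch of that idea (not Scrapy's actual SignalManager, which is built on pydispatcher; all names here are illustrative):

```python
engine_started = object()   # signals are just unique sentinel objects
engine_stopped = object()


class SignalManager:
    """Minimal stand-in for crawler.signals: connect() registers a callback,
    send_catch_log() fires every callback registered for that signal."""

    def __init__(self):
        self.receivers = {}

    def connect(self, receiver, signal):
        self.receivers.setdefault(signal, []).append(receiver)

    def send_catch_log(self, signal, **kwargs):
        return [receiver(**kwargs) for receiver in self.receivers.get(signal, [])]


signals = SignalManager()
log = []
signals.connect(lambda: log.append('started'), signal=engine_started)
signals.connect(lambda: log.append('stopped'), signal=engine_stopped)
signals.send_catch_log(engine_started)  # the engine would fire this on startup
signals.send_catch_log(engine_stopped)  # ...and this on shutdown
print(log)  # ['started', 'stopped']
```

This is why an extension only needs from_crawler plus connect calls: the engine does the firing at the reserved spots.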
VIII. Configuration Files
Scrapy's default settings file
""" This module contains the default values for all settings used by Scrapy. For more information about these settings you can read the settings documentation in docs/topics/settings.rst Scrapy developers, if you add a setting here remember to: * add it in alphabetical order * group similar settings without leaving blank lines * add its documentation to the available settings documentation (docs/topics/settings.rst) """ import sys from importlib import import_module from os.path import join, abspath, dirname import six AJAXCRAWL_ENABLED = False AUTOTHROTTLE_ENABLED = False AUTOTHROTTLE_DEBUG = False AUTOTHROTTLE_MAX_DELAY = 60.0 AUTOTHROTTLE_START_DELAY = 5.0 AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 BOT_NAME = 'scrapybot' CLOSESPIDER_TIMEOUT = 0 CLOSESPIDER_PAGECOUNT = 0 CLOSESPIDER_ITEMCOUNT = 0 CLOSESPIDER_ERRORCOUNT = 0 COMMANDS_MODULE = '' COMPRESSION_ENABLED = True CONCURRENT_ITEMS = 100 CONCURRENT_REQUESTS = 16 CONCURRENT_REQUESTS_PER_DOMAIN = 8 CONCURRENT_REQUESTS_PER_IP = 0 COOKIES_ENABLED = True COOKIES_DEBUG = False DEFAULT_ITEM_CLASS = 'scrapy.item.Item' DEFAULT_REQUEST_HEADERS = { 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', } DEPTH_LIMIT = 0 DEPTH_STATS = True DEPTH_PRIORITY = 0 DNSCACHE_ENABLED = True DNSCACHE_SIZE = 10000 DNS_TIMEOUT = 60 DOWNLOAD_DELAY = 0 DOWNLOAD_HANDLERS = {} DOWNLOAD_HANDLERS_BASE = { 'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler', 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', 'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler', 'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler', 's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler', 'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler', } DOWNLOAD_TIMEOUT = 180 # 3mins DOWNLOAD_MAXSIZE = 1024*1024*1024 # 1024m DOWNLOAD_WARNSIZE = 32*1024*1024 # 32m DOWNLOAD_FAIL_ON_DATALOSS = True DOWNLOADER = 'scrapy.core.downloader.Downloader' 
DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory' DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory' DOWNLOADER_CLIENT_TLS_METHOD = 'TLS' # Use highest TLS/SSL protocol version supported by the platform, # also allowing negotiation DOWNLOADER_MIDDLEWARES = {} DOWNLOADER_MIDDLEWARES_BASE = { # Engine side 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100, 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300, 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350, 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400, 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500, 'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550, 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560, 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580, 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590, 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600, 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700, 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750, 'scrapy.downloadermiddlewares.stats.DownloaderStats': 850, 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900, # Downloader side } DOWNLOADER_STATS = True DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter' EDITOR = 'vi' if sys.platform == 'win32': EDITOR = '%s -m idlelib.idle' EXTENSIONS = {} EXTENSIONS_BASE = { 'scrapy.extensions.corestats.CoreStats': 0, 'scrapy.extensions.telnet.TelnetConsole': 0, 'scrapy.extensions.memusage.MemoryUsage': 0, 'scrapy.extensions.memdebug.MemoryDebugger': 0, 'scrapy.extensions.closespider.CloseSpider': 0, 'scrapy.extensions.feedexport.FeedExporter': 0, 'scrapy.extensions.logstats.LogStats': 0, 'scrapy.extensions.spiderstate.SpiderState': 0, 'scrapy.extensions.throttle.AutoThrottle': 0, } FEED_TEMPDIR = None FEED_URI = 
None FEED_URI_PARAMS = None # a function to extend uri arguments FEED_FORMAT = 'jsonlines' FEED_STORE_EMPTY = False FEED_EXPORT_ENCODING = None FEED_EXPORT_FIELDS = None FEED_STORAGES = {} FEED_STORAGES_BASE = { '': 'scrapy.extensions.feedexport.FileFeedStorage', 'file': 'scrapy.extensions.feedexport.FileFeedStorage', 'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage', 's3': 'scrapy.extensions.feedexport.S3FeedStorage', 'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage', } FEED_EXPORTERS = {} FEED_EXPORTERS_BASE = { 'json': 'scrapy.exporters.JsonItemExporter', 'jsonlines': 'scrapy.exporters.JsonLinesItemExporter', 'jl': 'scrapy.exporters.JsonLinesItemExporter', 'csv': 'scrapy.exporters.CsvItemExporter', 'xml': 'scrapy.exporters.XmlItemExporter', 'marshal': 'scrapy.exporters.MarshalItemExporter', 'pickle': 'scrapy.exporters.PickleItemExporter', } FEED_EXPORT_INDENT = 0 FILES_STORE_S3_ACL = 'private' FTP_USER = 'anonymous' FTP_PASSWORD = 'guest' FTP_PASSIVE_MODE = True HTTPCACHE_ENABLED = False HTTPCACHE_DIR = 'httpcache' HTTPCACHE_IGNORE_MISSING = False HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' HTTPCACHE_EXPIRATION_SECS = 0 HTTPCACHE_ALWAYS_STORE = False HTTPCACHE_IGNORE_HTTP_CODES = [] HTTPCACHE_IGNORE_SCHEMES = ['file'] HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = [] HTTPCACHE_DBM_MODULE = 'anydbm' if six.PY2 else 'dbm' HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy' HTTPCACHE_GZIP = False HTTPPROXY_ENABLED = True HTTPPROXY_AUTH_ENCODING = 'latin-1' IMAGES_STORE_S3_ACL = 'private' ITEM_PROCESSOR = 'scrapy.pipelines.ItemPipelineManager' ITEM_PIPELINES = {} ITEM_PIPELINES_BASE = {} LOG_ENABLED = True LOG_ENCODING = 'utf-8' LOG_FORMATTER = 'scrapy.logformatter.LogFormatter' LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s' LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S' LOG_STDOUT = False LOG_LEVEL = 'DEBUG' LOG_FILE = None LOG_SHORT_NAMES = False SCHEDULER_DEBUG = False LOGSTATS_INTERVAL = 60.0 MAIL_HOST = 
'localhost' MAIL_PORT = 25 MAIL_FROM = 'scrapy@localhost' MAIL_PASS = None MAIL_USER = None MEMDEBUG_ENABLED = False # enable memory debugging MEMDEBUG_NOTIFY = [] # send memory debugging report by mail at engine shutdown MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0 MEMUSAGE_ENABLED = True MEMUSAGE_LIMIT_MB = 0 MEMUSAGE_NOTIFY_MAIL = [] MEMUSAGE_WARNING_MB = 0 METAREFRESH_ENABLED = True METAREFRESH_MAXDELAY = 100 NEWSPIDER_MODULE = '' RANDOMIZE_DOWNLOAD_DELAY = True REACTOR_THREADPOOL_MAXSIZE = 10 REDIRECT_ENABLED = True REDIRECT_MAX_TIMES = 20 # uses Firefox default setting REDIRECT_PRIORITY_ADJUST = +2 REFERER_ENABLED = True REFERRER_POLICY = 'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy' RETRY_ENABLED = True RETRY_TIMES = 2 # initial response + 2 retries = 3 requests RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408] RETRY_PRIORITY_ADJUST = -1 ROBOTSTXT_OBEY = False SCHEDULER = 'scrapy.core.scheduler.Scheduler' SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue' SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue' SCHEDULER_PRIORITY_QUEUE = 'queuelib.PriorityQueue' SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader' SPIDER_LOADER_WARN_ONLY = False SPIDER_MIDDLEWARES = {} SPIDER_MIDDLEWARES_BASE = { # Engine side 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50, 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500, 'scrapy.spidermiddlewares.referer.RefererMiddleware': 700, 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800, 'scrapy.spidermiddlewares.depth.DepthMiddleware': 900, # Spider side } SPIDER_MODULES = [] STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector' STATS_DUMP = True STATSMAILER_RCPTS = [] TEMPLATES_DIR = abspath(join(dirname(__file__), '..', 'templates')) URLLENGTH_LIMIT = 2083 USER_AGENT = 'Scrapy/%s (+https://scrapy.org)' % import_module('scrapy').__version__ TELNETCONSOLE_ENABLED = 1 TELNETCONSOLE_PORT = [6023, 6073] TELNETCONSOLE_HOST = '127.0.0.1' SPIDER_CONTRACTS = {} 
SPIDER_CONTRACTS_BASE = { 'scrapy.contracts.default.UrlContract': 1, 'scrapy.contracts.default.ReturnsContract': 2, 'scrapy.contracts.default.ScrapesContract': 3, }
1. Depth and priority
- Depth
    - the start URL has depth 0
    - each yielded request gets depth = parent request's depth + 1
    - setting: DEPTH_LIMIT caps the crawl depth
- Priority
    - the priority with which a request gets downloaded
    - priority -= depth * DEPTH_PRIORITY
    - setting: DEPTH_PRIORITY
def parse(self, response):
    # print the URL and crawl depth of the current page
    print(response.request.url, response.meta.get('depth', 0))
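The priority adjustment can be checked with plain arithmetic; this mirrors what DepthMiddleware does (request.priority -= depth * DEPTH_PRIORITY), written here outside of Scrapy with an illustrative helper name:

```python
DEPTH_PRIORITY = 1  # mirrors the Scrapy setting; positive favors breadth-first


def adjusted_priority(request_priority, depth):
    # DepthMiddleware lowers the priority of deeper requests:
    #   request.priority -= depth * DEPTH_PRIORITY
    return request_priority - depth * DEPTH_PRIORITY


print(adjusted_priority(0, 1))  # -1: one level deep
print(adjusted_priority(0, 3))  # -3: deeper pages wait longer
```

With DEPTH_PRIORITY = 1 deep requests sink in the priority queue (breadth-first); with DEPTH_PRIORITY = -1 they rise (depth-first).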
Annotated settings file
# -*- coding: utf-8 -*- # Scrapy settings for step8_king project # # For simplicity, this file contains only settings considered important or # commonly used. You can find more settings consulting the documentation: # # http://doc.scrapy.org/en/latest/topics/settings.html # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html # 1. 爬虫名称 BOT_NAME = 'step8_king' # 2. 爬虫应用路径 SPIDER_MODULES = ['step8_king.spiders'] NEWSPIDER_MODULE = 'step8_king.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent # 3. 客户端 user-agent请求头 # USER_AGENT = 'step8_king (+http://www.yourdomain.com)' # Obey robots.txt rules # 4. 禁止爬虫配置 # ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16) # 5. 并发请求数 # CONCURRENT_REQUESTS = 4 # Configure a delay for requests for the same website (default: 0) # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs # 6. 延迟下载秒数 # DOWNLOAD_DELAY = 2 # The download delay setting will honor only one of: # 7. 单域名访问并发数,并且延迟下次秒数也应用在每个域名 # CONCURRENT_REQUESTS_PER_DOMAIN = 2 # 单IP访问并发数,如果有值则忽略:CONCURRENT_REQUESTS_PER_DOMAIN,并且延迟下次秒数也应用在每个IP # CONCURRENT_REQUESTS_PER_IP = 3 # Disable cookies (enabled by default) # 8. 是否支持cookie,cookiejar进行操作cookie # COOKIES_ENABLED = True # COOKIES_DEBUG = True # Disable Telnet Console (enabled by default) # 9. Telnet用于查看当前爬虫的信息,操作爬虫等... # 使用telnet ip port ,然后通过命令操作 # engine.pause() 暂停 # engine.unpause() 重启 # TELNETCONSOLE_ENABLED = True # TELNETCONSOLE_HOST = '127.0.0.1' # TELNETCONSOLE_PORT = [6023,] # 10. 默认请求头 # Override the default request headers: # DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # 'Accept-Language': 'en', # } # Configure item pipelines # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html # 11. 
定义pipeline处理请求 # ITEM_PIPELINES = { # 'step8_king.pipelines.JsonPipeline': 700, # 'step8_king.pipelines.FilePipeline': 500, # } # 12. 自定义扩展,基于信号进行调用 # Enable or disable extensions # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html # EXTENSIONS = { # # 'step8_king.extensions.MyExtension': 500, # } # 13. 爬虫允许的最大深度,可以通过meta查看当前深度;0表示无深度 # DEPTH_LIMIT = 3 # 14. 爬取时,0表示深度优先Lifo(默认);1表示广度优先FiFo # 后进先出,深度优先 # DEPTH_PRIORITY = 0 # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue' # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue' # 先进先出,广度优先 # DEPTH_PRIORITY = 1 # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue' # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue' # 15. 调度器队列 # SCHEDULER = 'scrapy.core.scheduler.Scheduler' # from scrapy.core.scheduler import Scheduler # 16. 访问URL去重 # DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl' # Enable and configure the AutoThrottle extension (disabled by default) # See http://doc.scrapy.org/en/latest/topics/autothrottle.html """ 17. 自动限速算法 from scrapy.contrib.throttle import AutoThrottle 自动限速设置 1. 获取最小延迟 DOWNLOAD_DELAY 2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY 3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY 4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间 5. 用于计算的... 
AUTOTHROTTLE_TARGET_CONCURRENCY target_delay = latency / self.target_concurrency new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间 new_delay = max(target_delay, new_delay) new_delay = min(max(self.mindelay, new_delay), self.maxdelay) slot.delay = new_delay """ # 开始自动限速 # AUTOTHROTTLE_ENABLED = True # The initial download delay # 初始下载延迟 # AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies # 最大下载延迟 # AUTOTHROTTLE_MAX_DELAY = 10 # The average number of requests Scrapy should be sending in parallel to each remote server # 平均每秒并发数 # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: # 是否显示 # AUTOTHROTTLE_DEBUG = True # Enable and configure HTTP caching (disabled by default) # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings """ 18. 启用缓存 目的用于将已经发送的请求或相应缓存下来,以便以后使用 from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware from scrapy.extensions.httpcache import DummyPolicy from scrapy.extensions.httpcache import FilesystemCacheStorage """ # 是否启用缓存策略 # HTTPCACHE_ENABLED = True # 缓存策略:所有请求均缓存,下次在请求直接访问原来的缓存即可 # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy" # 缓存策略:根据Http响应头:Cache-Control、Last-Modified 等进行缓存的策略 # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy" # 缓存超时时间 # HTTPCACHE_EXPIRATION_SECS = 0 # 缓存保存路径 # HTTPCACHE_DIR = 'httpcache' # 缓存忽略的Http状态码 # HTTPCACHE_IGNORE_HTTP_CODES = [] # 缓存存储的插件 # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' """ 19. 
代理,需要在环境变量中设置 from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware 方式一:使用默认 os.environ { http_proxy:http://root:woshiniba@192.168.11.11:9999/ https_proxy:http://192.168.11.11:9999/ } 方式二:使用自定义下载中间件 def to_bytes(text, encoding=None, errors='strict'): if isinstance(text, bytes): return text if not isinstance(text, six.string_types): raise TypeError('to_bytes must receive a unicode, str or bytes ' 'object, got %s' % type(text).__name__) if encoding is None: encoding = 'utf-8' return text.encode(encoding, errors) class ProxyMiddleware(object): def process_request(self, request, spider): PROXIES = [ {'ip_port': '111.11.228.75:80', 'user_pass': ''}, {'ip_port': '120.198.243.22:80', 'user_pass': ''}, {'ip_port': '111.8.60.9:8123', 'user_pass': ''}, {'ip_port': '101.71.27.120:80', 'user_pass': ''}, {'ip_port': '122.96.59.104:80', 'user_pass': ''}, {'ip_port': '122.224.249.122:8088', 'user_pass': ''}, ] proxy = random.choice(PROXIES) if proxy['user_pass'] is not None: request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port']) encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass'])) request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass) print "**************ProxyMiddleware have pass************" + proxy['ip_port'] else: print "**************ProxyMiddleware no pass************" + proxy['ip_port'] request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port']) DOWNLOADER_MIDDLEWARES = { 'step8_king.middlewares.ProxyMiddleware': 500, } """ """ 20. Https访问 Https访问时有两种情况: 1. 要爬取网站使用的可信任证书(默认支持) DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory" DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory" 2. 
要爬取网站使用的自定义证书 DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory" DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory" # https.py from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate) class MySSLFactory(ScrapyClientContextFactory): def getCertificateOptions(self): from OpenSSL import crypto v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read()) v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read()) return CertificateOptions( privateKey=v1, # pKey对象 certificate=v2, # X509对象 verify=False, method=getattr(self, 'method', getattr(self, '_ssl_method', None)) ) 其他: 相关类 scrapy.core.downloader.handlers.http.HttpDownloadHandler scrapy.core.downloader.webclient.ScrapyHTTPClientFactory scrapy.core.downloader.contextfactory.ScrapyClientContextFactory 相关配置 DOWNLOADER_HTTPCLIENTFACTORY DOWNLOADER_CLIENTCONTEXTFACTORY """ """ 21. 
爬虫中间件 class SpiderMiddleware(object): def process_spider_input(self,response, spider): ''' 下载完成,执行,然后交给parse处理 :param response: :param spider: :return: ''' pass def process_spider_output(self,response, result, spider): ''' spider处理完成,返回时调用 :param response: :param result: :param spider: :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable) ''' return result def process_spider_exception(self,response, exception, spider): ''' 异常调用 :param response: :param exception: :param spider: :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline ''' return None def process_start_requests(self,start_requests, spider): ''' 爬虫启动时调用 :param start_requests: :param spider: :return: 包含 Request 对象的可迭代对象 ''' return start_requests 内置爬虫中间件: 'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50, 'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500, 'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700, 'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800, 'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900, """ # from scrapy.contrib.spidermiddleware.referer import RefererMiddleware # Enable or disable spider middlewares # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html SPIDER_MIDDLEWARES = { # 'step8_king.middlewares.SpiderMiddleware': 543, } """ 22. 
下载中间件 class DownMiddleware1(object): def process_request(self, request, spider): ''' 请求需要被下载时,经过所有下载器中间件的process_request调用 :param request: :param spider: :return: None,继续后续中间件去下载; Response对象,停止process_request的执行,开始执行process_response Request对象,停止中间件的执行,将Request重新调度器 raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception ''' pass def process_response(self, request, response, spider): ''' spider处理完成,返回时调用 :param response: :param result: :param spider: :return: Response 对象:转交给其他中间件process_response Request 对象:停止中间件,request会被重新调度下载 raise IgnoreRequest 异常:调用Request.errback ''' print('response1') return response def process_exception(self, request, exception, spider): ''' 当下载处理器(download handler)或 process_request() (下载中间件)抛出异常 :param response: :param exception: :param spider: :return: None:继续交给后续中间件处理异常; Response对象:停止后续process_exception方法 Request对象:停止中间件,request将会被重新调用下载 ''' return None 默认下载中间件 { 'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100, 'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300, 'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350, 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400, 'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500, 'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550, 'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580, 'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590, 'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600, 'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700, 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750, 'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830, 'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850, 'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900, } """ # from 
scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware # Enable or disable downloader middlewares # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html # DOWNLOADER_MIDDLEWARES = { # 'step8_king.middlewares.DownMiddleware1': 100, # 'step8_king.middlewares.DownMiddleware2': 500, # } #23 Logging日志功能 Scrapy提供了log功能,可以通过 logging 模块使用 可以修改配置文件settings.py,任意位置添加下面两行 LOG_FILE = "mySpider.log" LOG_LEVEL = "INFO" Scrapy提供5层logging级别: CRITICAL - 严重错误(critical) ERROR - 一般错误(regular errors) WARNING - 警告信息(warning messages) INFO - 一般信息(informational messages) DEBUG - 调试信息(debugging messages) logging设置 通过在setting.py中进行以下设置可以被用来配置logging: LOG_ENABLED 默认: True,启用logging LOG_ENCODING 默认: 'utf-8',logging使用的编码 LOG_FILE 默认: None,在当前目录里创建logging输出文件的文件名 LOG_LEVEL 默认: 'DEBUG',log的最低级别 LOG_STDOUT 默认: False 如果为 True,进程所有的标准输出(及错误)将会被重定向到log中。例如,执行 print "hello" ,其将会在Scrapy log中显示 settings
IX. Distributed crawling with scrapy-redis
1. Deduplication with scrapy-redis: backed by a Redis set
Option 1: fully custom
# dupeFilter.py
from redis import Redis, ConnectionPool
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class RedisFilter(BaseDupeFilter):
    def __init__(self):
        pool = ConnectionPool(host='127.0.0.1', port=6379)
        self.conn = Redis(connection_pool=pool)

    def request_seen(self, request):
        """
        Check whether this request has already been visited.
        :return: True if already visited, False otherwise
        """
        fd = request_fingerprint(request=request)
        # the key can be customized
        added = self.conn.sadd('visited_urls', fd)
        return added == 0


# append to settings.py
DUPEFILTER_CLASS = "lww.dupeFilter.RedisFilter"
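The request_seen contract can be demonstrated without Redis: sadd returns 1 when the member is new and 0 when it already exists, which is exactly what membership in a Python set gives you. A minimal in-memory sketch (the sha1-of-URL fingerprint is a simplification; Scrapy's real request_fingerprint also folds in the method, body, and canonicalized query string):

```python
import hashlib


class InMemoryDupeFilter:
    """Same contract as request_seen: True means 'drop, already visited'."""

    def __init__(self):
        self.fingerprints = set()

    def fingerprint(self, url):
        # simplified stand-in for scrapy.utils.request.request_fingerprint
        return hashlib.sha1(url.encode('utf-8')).hexdigest()

    def request_seen(self, url):
        fp = self.fingerprint(url)
        if fp in self.fingerprints:
            return True           # duplicate: like sadd returning 0
        self.fingerprints.add(fp) # first visit: record it
        return False


f = InMemoryDupeFilter()
print(f.request_seen('https://dig.chouti.com/'))  # False: first visit
print(f.request_seen('https://dig.chouti.com/'))  # True: duplicate
```

Swapping the set for a Redis set is what makes the record shared across crawler processes, which is the whole point of the scrapy-redis version.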
Option 2: rely entirely on scrapy-redis
REDIS_HOST = '127.0.0.1'  # host
REDIS_PORT = 6379         # port
# REDIS_PARAMS = {'password': 'beta'}  # connection kwargs; default: {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
REDIS_ENCODING = "utf-8"  # encoding, default 'utf-8'
# REDIS_URL = 'redis://user:pass@hostname:9001'  # connection URL (takes precedence over the settings above)

DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'  # redis key for the fingerprint set
# switch to the scrapy-redis dedup class
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
Option 3: subclass scrapy-redis and customize
# dupeFilter.py
from scrapy_redis import defaults
from scrapy_redis.connection import get_redis_from_settings
from scrapy_redis.dupefilter import RFPDupeFilter


class RedisDupeFilter(RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        server = get_redis_from_settings(settings)
        # Pro: the key can be customized.
        # Con: the spider is not available here, so customization is limited.
        key = defaults.DUPEFILTER_KEY % {'timestamp': 'test_scrapy_redis'}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)


# settings.py
# REDIS_HOST = '127.0.0.1'  # host
# REDIS_PORT = 6379         # port
# REDIS_PARAMS = {'password': '0000'}  # connection kwargs
# REDIS_ENCODING = "utf-8"
DUPEFILTER_CLASS = 'lww.dupeFilter.RedisDupeFilter'
2. The scheduler
Option 4: swap in the scrapy-redis scheduler
- settings
Redis connection:
    REDIS_HOST = '127.0.0.1'  # host
    REDIS_PORT = 6379         # port
    # REDIS_PARAMS = {'password': 'xxx'}  # connection kwargs; default: {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
    REDIS_ENCODING = "utf-8"  # encoding, default 'utf-8'

Deduplication:
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

Scheduler:
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    DEPTH_PRIORITY = 1    # breadth-first
    # DEPTH_PRIORITY = -1 # depth-first
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # default; others: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
    # breadth-first:
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
    # depth-first:
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

    SCHEDULER_QUEUE_KEY = '%(spider)s:requests'         # redis key under which pending requests are stored
    SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for queued requests (pickle by default)
    SCHEDULER_PERSIST = True          # keep the queue and dupefilter in redis on close (True = keep, False = flush)
    SCHEDULER_FLUSH_ON_START = False  # flush the queue and dupefilter on start (True = flush)
    # SCHEDULER_IDLE_BEFORE_CLOSE = 10  # max seconds to block when popping from an empty queue
    SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # redis key under which fingerprints are stored
    # DUPEFILTER_CLASS takes precedence; otherwise SCHEDULER_DUPEFILTER_CLASS is used
    # SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
- Execution flow
1. Run: scrapy crawl chouti --nolog

2. Find SCHEDULER = "scrapy_redis.scheduler.Scheduler" and instantiate the scheduler:
    - Scheduler.from_crawler runs
    - Scheduler.from_settings runs:
        - reads SCHEDULER_PERSIST (keep queue/dupefilter on close), SCHEDULER_FLUSH_ON_START (flush on start), SCHEDULER_IDLE_BEFORE_CLOSE (max wait when the queue is empty)
        - reads SCHEDULER_QUEUE_KEY ('%(spider)s:requests'), SCHEDULER_QUEUE_CLASS (e.g. 'scrapy_redis.queue.FifoQueue'), SCHEDULER_DUPEFILTER_KEY ('%(spider)s:dupefilter'), DUPEFILTER_CLASS ('scrapy_redis.dupefilter.RFPDupeFilter'), SCHEDULER_SERIALIZER ('scrapy_redis.picklecompat')
        - reads REDIS_HOST, REDIS_PORT, REDIS_PARAMS, REDIS_ENCODING
    - instantiates the Scheduler object

3. The spider starts with its start URLs, calling scheduler.enqueue_request():

    def enqueue_request(self, request):
        # Should this request be filtered, and is it already in the dedup record?
        # (request_seen also records requests it has not seen before.)
        if not request.dont_filter and self.df.request_seen(request):
            # already visited and filtering is on: drop it
            self.df.log(request, self.spider)
            return False
        if self.stats:
            self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
        # not visited yet: push it onto the scheduler queue
        self.queue.push(request)
        return True

4. The downloader fetches tasks from the scheduler via scheduler.next_request():

    def next_request(self):
        block_pop_timeout = self.idle_before_close
        request = self.queue.pop(block_pop_timeout)
        if request and self.stats:
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        return request
Note: the scheduler's enqueue_request method only runs when start_requests is triggered, so the spider must implement a start_requests method.
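Stripped of Redis and stats, the scheduler's two methods reduce to a dedup check plus a queue. A runnable sketch of that contract (MiniScheduler is a hypothetical stand-in, using URLs instead of Request objects):

```python
from collections import deque


class MiniScheduler:
    """Sketch of the scrapy-redis Scheduler contract: enqueue_request filters
    duplicates and pushes, next_request pops the next pending task."""

    def __init__(self):
        self.queue = deque()  # stands in for the redis-backed request queue
        self.seen = set()     # stands in for the dupefilter's fingerprint set

    def enqueue_request(self, url, dont_filter=False):
        if not dont_filter and url in self.seen:
            return False      # duplicate: dropped, like the real enqueue_request
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_request(self):
        # the real version blocks up to idle_before_close seconds on redis
        return self.queue.popleft() if self.queue else None


s = MiniScheduler()
print(s.enqueue_request('https://dig.chouti.com/'))  # True: enqueued
print(s.enqueue_request('https://dig.chouti.com/'))  # False: filtered out
print(s.next_request())  # 'https://dig.chouti.com/'
print(s.next_request())  # None: queue is empty
```

In scrapy-redis both the queue and the seen-set live in Redis, so any number of crawler processes can share them; the control flow is the same.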
- Data persistence
Persistence: when the spider yields an Item, RedisPipeline runs.
a. The redis key and the serializer used when persisting items are configurable:
    REDIS_ITEMS_KEY = '%(spider)s:items'
    REDIS_ITEMS_SERIALIZER = 'json.dumps'
b. Item data is stored in a redis list.
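What the pipeline does with each yielded item is essentially serialize-then-push. A sketch with a plain list standing in for the redis list (process_item and items_store are illustrative names, not scrapy-redis API):

```python
import json

items_store = []  # stands in for the redis list at key '%(spider)s:items'


def process_item(item):
    """Sketch of the pipeline step: serialize the item with the configured
    serializer (REDIS_ITEMS_SERIALIZER, json.dumps here) and push it onto
    the spider's item list; append stands in for redis RPUSH."""
    items_store.append(json.dumps(item))
    return item  # pipelines return the item so later pipelines can run


process_item({'spider': 'spiderchouti', 'title': 'hello'})
print(items_store[0])  # {"spider": "spiderchouti", "title": "hello"}
```

Any worker (or a separate consumer process) can then pop serialized items off the shared list, which is how the distributed setup decouples crawling from storage.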
- Start URLs
""" 起始URL相关 a. 获取起始URL时,去集合中获取还是去列表中获取?True,集合;False,列表 REDIS_START_URLS_AS_SET = False # 获取起始URL时,如果为True,则使用self.server.spop;如果为False,则使用self.server.lpop b. 编写爬虫时,起始URL从redis的Key中获取 REDIS_START_URLS_KEY = '%(name)s:start_urls' """ # If True, it uses redis' ``spop`` operation. This could be useful if you # want to avoid duplicates in your start urls list. In this cases, urls must # be added via ``sadd`` command or you will get a type error from redis. # REDIS_START_URLS_AS_SET = False # Default start urls key for RedisSpider and RedisCrawlSpider. # REDIS_START_URLS_KEY = '%(name)s:start_urls'
from scrapy_redis.spiders import RedisSpider


class SpiderchoutiSpider(RedisSpider):
    name = 'spiderchouti'
    allowed_domains = ['dig.chouti.com']
    # no start_urls needed: they are read from redis
from redis import Redis, ConnectionPool

pool = ConnectionPool(host='127.0.0.1', port=6379)
conn = Redis(connection_pool=pool)
conn.lpush('spiderchouti:start_urls', 'https://dig.chouti.com/')
- A few key points
1. What are depth-first and breadth-first?
   Think of a tree: depth-first finishes every node of one subtree before moving on to the next subtree; breadth-first finishes an entire level before moving down to the next one.
2. How does scrapy implement depth-first and breadth-first?
   With a stack or a queue:
       FIFO (first in, first out): breadth-first
       LIFO (last in, first out): depth-first
   With a sorted set, i.e. a priority queue:
       DEPTH_PRIORITY = 1   # breadth-first
       DEPTH_PRIORITY = -1  # depth-first
3. How do the scheduler, the queue, and the dupefilter relate?
   The scheduler decides which request to enqueue or hand out next.
   The queue stores the requests: FIFO (breadth-first), LIFO (depth-first), or a priority queue.
   The dupefilter is the visit record: the deduplication rules.
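The queue-discipline point can be verified with a toy link graph: the same traversal loop yields breadth-first order when it pops FIFO and depth-first order when it pops LIFO (crawl_order and the site graph are illustrative):

```python
from collections import deque


def crawl_order(graph, start, lifo):
    """Traverse a link graph with a single pending queue, popping either
    LIFO (stack, depth-first) or FIFO (queue, breadth-first); this mirrors
    how the scheduler's queue class decides the crawl order."""
    pending, seen, order = deque([start]), {start}, []
    while pending:
        url = pending.pop() if lifo else pending.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:  # the dupefilter's job
                seen.add(link)
                pending.append(link)
    return order


# a tiny site: / links to /a and /b, and /a links to /a1
site = {'/': ['/a', '/b'], '/a': ['/a1']}
print(crawl_order(site, '/', lifo=False))  # ['/', '/a', '/b', '/a1'] breadth-first
print(crawl_order(site, '/', lifo=True))   # ['/', '/b', '/a', '/a1'] depth-first
```

The breadth-first run exhausts the whole level ('/a', '/b') before descending to '/a1'; the depth-first run follows one branch down before backtracking.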
X. TinyScrapy
from queue import Queue

from twisted.internet import reactor    # event loop (ends once every socket has been removed)
from twisted.web.client import getPage  # socket object (removes itself from the loop when the download finishes)
from twisted.internet import defer      # defer.Deferred: a special "socket" that sends no request and must be removed manually


class Request(object):
    """Wraps the user's request info: the URL to fetch and the callback to parse it."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class HttpResponse(object):
    """
    Wraps the downloaded content together with the originating request, so the
    callback sees more than the raw content: the request, its URL, the decoded
    text, and so on.
    """
    def __init__(self, content, request):
        self.content = content
        self.request = request
        self.url = request.url
        self.text = str(content, encoding='utf-8')


class Scheduler(object):
    """
    Task scheduler:
    1. holds a queue of requests
    2. next_request: pop the next request from the queue
    3. enqueue_request: push a request onto the queue
    4. size: number of pending requests
    5. open: does nothing and returns None, so the engine's
       @defer.inlineCallbacks-decorated open_spider has something to yield
    """
    def __init__(self):
        self.q = Queue()

    def open(self):
        pass

    def next_request(self):
        try:
            req = self.q.get(block=False)
        except Exception:
            req = None
        return req

    def enqueue_request(self, req):
        self.q.put(req)

    def size(self):
        return self.q.qsize()


class ExecutionEngine(object):
    """
    The engine does all the scheduling:
    1. open_spider pushes every request from start_requests onto the
       scheduler's queue
    2. it runs the callback for each finished response
       (get_response_callback) and schedules the next request (_next_request)
    """
    def __init__(self):
        self._close = None
        self.scheduler = None
        self.max = 5
        self.crawlling = []

    def get_response_callback(self, content, request):
        self.crawlling.remove(request)
        response = HttpResponse(content, request)
        result = request.callback(response)
        import types
        if isinstance(result, types.GeneratorType):
            for req in result:
                self.scheduler.enqueue_request(req)

    def _next_request(self):
        """
        1. schedules the spider's pending requests
        2. termination condition for the event loop: the scheduler queue is
           empty and no request is in flight
        3. keeps at most self.max requests in flight (max concurrency)
        4. downloads each request and wires up its response callback
        5. reschedules itself for the next round of the event loop
        """
        if self.scheduler.size() == 0 and len(self.crawlling) == 0:
            self._close.callback(None)
            return
        # cap concurrency at self.max (5)
        while len(self.crawlling) < self.max:
            req = self.scheduler.next_request()
            if not req:
                return
            self.crawlling.append(req)
            d = getPage(req.url.encode('utf-8'))
            d.addCallback(self.get_response_callback, req)
            d.addCallback(lambda _: reactor.callLater(0, self._next_request))

    @defer.inlineCallbacks
    def open_spider(self, start_requests):
        """
        1. create a scheduler
        2. push every request from start_requests onto its queue
        3. kick off the first scheduling pass of the event loop
        Note: every @defer.inlineCallbacks-decorated function must yield
        something, even if it is None.
        """
        self.scheduler = Scheduler()
        yield self.scheduler.open()
        while True:
            try:
                req = next(start_requests)
            except StopIteration:
                break
            self.scheduler.enqueue_request(req)
        reactor.callLater(0, self._next_request)

    @defer.inlineCallbacks
    def start(self):
        """Sends no request; yields a Deferred that is fired manually, purely to keep the event loop alive."""
        self._close = defer.Deferred()
        yield self._close


class Crawler(object):
    """
    1. bundles a scheduler and an engine
    2. builds the spider object from its dotted class path
    3. has the engine open the spider, feed each of its requests to the
       scheduler, and drive them through the event loop
    """
    def _create_engine(self):
        return ExecutionEngine()

    def _create_spider(self, spider_cls_path):
        """
        :param spider_cls_path: e.g. spider.chouti.ChoutiSpider
        """
        module_path, cls_name = spider_cls_path.rsplit('.', maxsplit=1)
        import importlib
        m = importlib.import_module(module_path)
        cls = getattr(m, cls_name)
        return cls()

    @defer.inlineCallbacks
    def crawl(self, spider_cls_path):
        engine = self._create_engine()
        spider = self._create_spider(spider_cls_path)
        start_requests = iter(spider.start_requests())
        yield engine.open_spider(start_requests)  # enqueue the start requests; the engine schedules them
        yield engine.start()  # a bare Deferred that props the event loop open until stopped manually


class CrawlerProcess(object):
    """
    1. creates a Crawler per spider path
    2. collects the Deferred returned by each Crawler.crawl into a set
    3. starts the event loop, stopping it once every Deferred has fired
    """
    def __init__(self):
        self._active = set()

    def crawl(self, spider_cls_path):
        crawler = Crawler()
        d = crawler.crawl(spider_cls_path)
        self._active.add(d)

    def start(self):
        dd = defer.DeferredList(self._active)
        dd.addBoth(lambda _: reactor.stop())
        reactor.run()


class Command(object):
    """
    Entry point:
    1. builds a CrawlerProcess
    2. hands each spider's class path to crawl_process.crawl, which creates a
       Crawler that in turn creates an engine and the spider object
    3. the engine's open_spider creates the scheduler, enqueues the start
       requests, and _next_request drives the crawl until everything is done
    """
    def run(self):
        crawl_process = CrawlerProcess()
        spider_cls_path_list = ['spider.chouti.ChoutiSpider', 'spider.cnblogs.CnblogsSpider']
        for spider_cls_path in spider_cls_path_list:
            crawl_process.crawl(spider_cls_path)
        crawl_process.start()


if __name__ == '__main__':
    cmd = Command()
    cmd.run()