Concept: the same spider program can run on multiple machines at once, crawling one website's data cooperatively (distributed crawling).
Native Scrapy cannot implement distributed crawling on its own:
a. The scheduler cannot be shared: each process keeps its own in-memory request queue and dupe filter, so separate machines would re-crawl the same URLs.
b. The pipeline cannot be shared: each process would persist its scraped items separately.
The scrapy-redis component
A set of components developed specifically for Scrapy. By moving the scheduler and pipeline into a shared redis server, it lets Scrapy run distributed.
Install: pip install scrapy-redis
Distributed crawling workflow:
1. Configure the redis configuration file (redis.conf):
   a. Comment out bind 127.0.0.1 so that other machines can connect
   b. Set protected-mode no to turn off protected mode
2. Start the redis server based on that configuration file: redis-server ./redis.conf
3. After creating the scrapy project, create a spider file based on CrawlSpider
4. Import the RedisCrawlSpider class, then change the spider file to inherit from that class instead
5. Replace start_urls with redis_key = 'xxx' (the name of the scheduler queue in redis)
6. In the settings file, switch the pipeline to the one that ships with scrapy-redis
7. In the settings file, switch the scheduler to the one that ships with scrapy-redis
8. Run the spider program: scrapy runspider xxx.py
9. From a redis client, seed the queue: lpush <scheduler queue name> "starting url" (a Python version of this command is sketched below)
If the redis server is not on the local machine, add the following to settings.py:
REDIS_HOST = 'ip address of the redis server'
REDIS_PORT = 6379
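
Seeding the queue kicks off the whole crawl: every idle spider process starts pulling requests once the starting url appears. Besides redis-cli, this can be done with the redis Python package. A minimal sketch, assuming the redis server is local and the spider's redis_key is 'qiubaispider' (matching the spider code further down):

import redis

# Connect to the redis server that all spider processes share
conn = redis.Redis(host='127.0.0.1', port=6379)

# Seed the scheduler queue; the list name must match the spider's redis_key
conn.lpush('qiubaispider', 'https://www.qiushibaike.com/pic/')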
Spider code
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from redisPro.items import RedisproItem
from scrapy_redis.spiders import RedisCrawlSpider


class QiubaiSpider(RedisCrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['https://www.qiushibaike.com/pic/']
    # start_urls = ['https://www.qiushibaike.com/pic/']

    # Name of the scheduler queue in redis; plays the same role as start_urls
    redis_key = 'qiubaispider'

    # Follow the pagination links (note the \d+ in the pattern)
    link = LinkExtractor(allow=r'/pic/page/\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            img_url = 'https:' + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
            item = RedisproItem()
            item['img_url'] = img_url
            yield item
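
This same spider file is copied to every machine and started with scrapy runspider. A RedisCrawlSpider does not crawl immediately: each process idles, polling the redis list named by redis_key, until a starting url is pushed; from then on all processes pull requests from the shared scheduler queue, and the shared dupe filter keeps them from fetching the same page twice.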
Storing the parsed page data (items.py)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class RedisproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_url = scrapy.Field()
Pipeline (pipelines.py)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class RedisproPipeline(object):
    def process_item(self, item, spider):
        return item
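
This project pipeline stays as the generated stub because the settings below route items to scrapy_redis.pipelines.RedisPipeline instead, which serializes each item to JSON and pushes it onto a redis list. Assuming scrapy-redis's default key pattern of '%(spider)s:items', the results for this spider land in qiubai:items and can be read back with a sketch like this:

import json

import redis

conn = redis.Redis(host='172.20.10.9', port=6379)

# RedisPipeline stores one JSON document per scraped item
for raw in conn.lrange('qiubai:items', 0, -1):
    item = json.loads(raw)
    print(item['img_url'])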
Settings (settings.py)
BOT_NAME = 'redisPro'

SPIDER_MODULES = ['redisPro.spiders']
NEWSPIDER_MODULE = 'redisPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'redisPro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    # 'redisPro.pipelines.RedisproPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Use the deduplication filter that ships with scrapy-redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Use scrapy-redis's own scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Allow pause/resume: keep the redis queues when the spider stops
SCHEDULER_PERSIST = True

# If the redis server is not on this machine, configure it as follows
REDIS_HOST = '172.20.10.9'  # the redis server where the data is stored
REDIS_PORT = 6379
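
While the crawl runs, progress can be watched from any redis client. The key names below follow scrapy-redis's defaults ('%(spider)s:dupefilter' for seen request fingerprints, '%(spider)s:requests' for the pending priority queue); these defaults are assumptions here, not something configured above:

import redis

conn = redis.Redis(host='172.20.10.9', port=6379)

print(conn.llen('qiubai:items'))        # items scraped so far
print(conn.scard('qiubai:dupefilter'))  # request fingerprints already seen
print(conn.zcard('qiubai:requests'))    # requests still waiting in the queue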