• Distributed crawlers


    Concept: the same spider program runs on multiple machines so that a site's data is crawled in a distributed fashion.

    Native Scrapy cannot implement a distributed crawler by itself, because:

    a. the scheduler cannot be shared across machines

    b. the item pipeline cannot be shared across machines

    The scrapy-redis component

    A set of components developed specifically for Scrapy; it lets Scrapy run as a distributed crawler.

    Install: pip install scrapy-redis

    Distributed crawling workflow:

    1. Configure the Redis configuration file (redis.conf)

    2. Comment out bind 127.0.0.1 so that remote machines can connect

    3. Set protected-mode no to turn off protected mode

    4. Start the Redis server against that configuration file

    5. Create a Scrapy project, then create a CrawlSpider-based spider file

    6. Import the RedisCrawlSpider class and change the spider to inherit from it

    7. Replace start_urls with redis_key = 'xxx'

    8. In the settings file, switch the item pipeline to the one integrated in scrapy-redis

    9. In the settings file, switch the scheduler to the one integrated in scrapy-redis

    10. Run the spider: scrapy runspider xxx.py

    11. In a Redis client: lpush <scheduler queue name> "start url" (see the sketch after this list)
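
    Step 11 can also be done from Python with the redis-py client instead of redis-cli. A minimal sketch, assuming Redis runs on the local machine and that the queue name matches the spider's redis_key below ('qiubaispider'):

    import redis

    # connect to the Redis server that all spider instances share
    r = redis.Redis(host='127.0.0.1', port=6379)
    # push the start URL into the scheduler queue; every idle spider
    # instance polling this key picks it up and begins crawling
    r.lpush('qiubaispider', 'https://www.qiushibaike.com/pic/')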

    If the Redis server is not on the local machine, add the following to the settings file:
    REDIS_HOST = 'redis服务的ip地址'
    REDIS_PORT = 6379

    Spider code:

    # -*- coding: utf-8 -*-
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule
    from redisPro.items import RedisproItem
    
    from scrapy_redis.spiders import RedisCrawlSpider
    
    
    class QiubaiSpider(RedisCrawlSpider):
        name = 'qiubai'
        # allowed_domains = ['www.qiushibaike.com']
        # start_urls = ['https://www.qiushibaike.com/pic/']
    
        # name of the scheduler queue in Redis
        redis_key = 'qiubaispider'  # plays the same role as start_urls
        # extract pagination links such as /pic/page/2
        link = LinkExtractor(allow=r'/pic/page/\d+')
    
        rules = (
            Rule(link, callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            # one div per post on the page
            div_list = response.xpath('//div[@id="content-left"]/div')
            for div in div_list:
                # the image src is protocol-relative, so prepend the scheme
                img_url = 'https:' + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
                item = RedisproItem()
                item['img_url'] = img_url
    
                yield item

    Storing the parsed page data (items.py):

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class RedisproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        img_url = scrapy.Field()

    Pipeline (pipelines.py):

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class RedisproPipeline(object):
        def process_item(self, item, spider):
            return item
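
    Note that this local pipeline stays commented out in the settings below; with scrapy_redis.pipelines.RedisPipeline enabled, items are serialized to JSON and pushed into a Redis list instead. A minimal sketch of reading them back, assuming the component's default key pattern '%(spider)s:items':

    import json
    import redis

    r = redis.Redis(host='127.0.0.1', port=6379)
    # items from the 'qiubai' spider land in the list 'qiubai:items'
    for raw in r.lrange('qiubai:items', 0, -1):
        item = json.loads(raw)  # each entry is one serialized item
        print(item['img_url'])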

    Settings (settings.py):

    BOT_NAME = 'redisPro'
    
    SPIDER_MODULES = ['redisPro.spiders']
    NEWSPIDER_MODULE = 'redisPro.spiders'
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # USER_AGENT = 'redisPro (+http://www.yourdomain.com)'
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    ITEM_PIPELINES = {
        # 'redisPro.pipelines.RedisproPipeline': 300,
        'scrapy_redis.pipelines.RedisPipeline': 300,
    }
    
    
    # Use the scrapy-redis deduplication filter (request fingerprints shared via Redis)
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scheduler shipped with scrapy-redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Persist the Redis queues so a crawl can be paused and resumed
    SCHEDULER_PERSIST = True
    
    # If the Redis server is not on the local machine, configure its address
    REDIS_HOST = '172.20.10.9'  # the Redis server that stores the data
    REDIS_PORT = 6379
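
    While the crawl runs, the shared state can be inspected directly in Redis. A minimal sketch, assuming scrapy-redis's default key names ('<spider>:dupefilter' for the fingerprint set, '<spider>:requests' for the scheduler's priority queue, '<spider>:items' for the output list):

    import redis

    r = redis.Redis(host='172.20.10.9', port=6379)
    print(r.scard('qiubai:dupefilter'))  # deduplicated request fingerprints (a set)
    print(r.zcard('qiubai:requests'))    # pending requests (a sorted set by priority)
    print(r.llen('qiubai:items'))        # items collected so far (a list)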