scrapy-redis (distributed crawling)


    Distributed crawling: Scrapy itself is not designed as a distributed crawling framework, but the third-party library scrapy-redis extends it with distributed crawling support, and together they form a distributed Scrapy crawling framework. A distributed crawler needs some communication mechanism to coordinate the individual spiders so that each one knows:

      1. its current crawl task, i.e. downloading pages and extracting data (task allocation)
      2. whether the current task has already been handled by another spider (deduplication)
      3. where to store the scraped data (data storage)

    Preparation: install Redis and learn its basics (http://www.runoob.com/redis/redis-keys.html)
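
    Before going further, it helps to see how scrapy-redis handles the three points above: the request queue, the request fingerprints used for deduplication, and the start URLs all live in plain Redis keys named after the spider. The following is a minimal inspection sketch (not part of the project); it assumes the redis-py package is installed, Redis is reachable on localhost, and the key names are the scrapy_redis defaults for a spider named "books".

    import redis

    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    print(r.llen('books:start_urls'))    # start URLs pushed with lpush (a list)
    print(r.zcard('books:requests'))     # pending requests (a sorted set by default)
    print(r.scard('books:dupefilter'))   # request fingerprints used for deduplication (a set)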

    As usual, here is a screenshot of the crawl results first; now get moving! QAQ

    Start crawling:

    1. First, the overall file layout of the distributed crawler project

    Books
      Books
        spiders
          __init__.py
          books.py
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
      scrapy_redis  (third-party library, download from https://github.com/rmax/scrapy-redis)
        __init__.py
        connection.py
        defaults.py
        dupefilter.py
        picklecompat.py
        pipelines.py
        queue.py
        scheduler.py
        spiders.py
        utils.py
      scrapy.cfg

    2. It looks complex, but little changes compared with the earlier spider; you do not need to modify the scripts under scrapy_redis, only use them.
    books.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from Books.items import BooksItem
    from scrapy_redis.spiders import RedisSpider
    
    #class BooksSpider(scrapy.Spider):
    class BooksSpider(RedisSpider):  # the key change for distributed crawling: inherit from RedisSpider
        name = 'books'
        #allowed_domains = ['books.toscrape.com']
        #start_urls = ['http://books.toscrape.com/']  # comment out the start URL; it is pushed via redis-cli once the spiders are running

        def parse(self, response):
            sels = response.css('article.product_pod')
            for sel in sels:
                book = BooksItem()  # create a fresh item for every book
                book["name"] = sel.css('h3 a::attr(title)').extract()[0]
                book["price"] = sel.css('div.product_price p::text').extract()[0]
                yield book

            links = LinkExtractor(restrict_css='ul.pager li.next').extract_links(response)
            if links:  # the last page has no "next" link
                yield scrapy.Request(links[0].url, callback=self.parse)
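
    A small optional addition (a sketch, not in the original project): RedisSpider reads its start URLs from a Redis list whose key defaults to '<spider name>:start_urls'; you can make the key explicit with the redis_key attribute.

    class BooksSpider(RedisSpider):
        name = 'books'
        redis_key = 'books:start_urls'  # the Redis list fed via lpush in redis-cli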

    3. pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.item import Item
    import pymongo
    import redis
    
    class BooksPipeline(object):
        def process_item(self, item, spider):
            return item
    
    class PriceConverterPipeline(object):    # convert the extracted price with an exchange rate
        exchange_rate = 8.5309
        def process_item(self,item,spider):
            price = float(item['price'][1:])*self.exchange_rate
            item['price']= '$%.2f'%price
            return item
    class DuplicatesPipeline(object):    # filter out duplicate items
        def __init__(self):
            self.set= set()
        def process_item(self,item,spider):
            name = item["name"]
            if name in self.set:
                raise DropItem("Duplicate book found:%s"%item)
    
            self.set.add(name)
            return item
    class MongoDBPipeline(object):    # store items in MongoDB
        @classmethod
        def from_crawler(cls,crawler):
            cls.DB_URL = crawler.settings.get("MONGO_DB_URL",'mongodb://localhost:27017/')
            cls.DB_NAME = crawler.settings.get("MONGO_DB_NAME",'scrapy_data')
            return cls()
        def open_spider(self,spider):
            self.client = pymongo.MongoClient(self.DB_URL)
            self.db     = self.client[self.DB_NAME]
        def close_spider(self,spider):
            self.client.close()
    
        def process_item(self,item,spider):
            collection = self.db[spider.name]
            post = dict(item) if isinstance(item,Item) else item
            collection.insert_one(post)
    
            return item
    class RedisPipeline:    # store items in Redis
        def open_spider(self,spider):
            db_host = spider.settings.get("REDIS_HOST",'10.240.176.134')
            #db_host = spider.settings.get("REDIS_HOST",'localhost')
            db_port = spider.settings.get("REDIS_PORT",6379)
            db_index= spider.settings.get("REDIS_DB_INDEX",0)
            #db_passwd = spider.settings.get('REDIS_PASSWD','redisredis')
    
            #self.db_conn = redis.StrictRedis(host=db_host,port=db_port,db=db_index,password=db_passwd)
            self.db_conn = redis.StrictRedis(host=db_host,port=db_port,db=db_index)
            self.item_i = 0
    
        def close_spider(self,spider):
            self.db_conn.connection_pool.disconnect()
    
        def process_item(self,item,spider):
            self.insert_db(item)
            return item
    
        def insert_db(self,item):
            if isinstance(item,Item):
                item = dict(item)
    
            self.item_i += 1
            self.db_conn.hmset('books12:%s'%self.item_i,item)
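
    To verify what RedisPipeline stored, you can read the hashes back out of Redis. A minimal check, assuming the same host/port/db as in the pipeline above and the 'books12:<n>' key pattern it writes:

    import redis

    r = redis.StrictRedis(host='10.240.176.134', port=6379, db=0)
    for key in sorted(r.keys('books12:*'))[:5]:   # first few stored items (lexicographic order)
        print(key, r.hgetall(key))                # each item was written as a hash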

    4.1 settings.py

    (1) Adding a proxy: in middlewares.py

    class BooksSpiderMiddleware(object):
        # Despite the generated name, this class is registered in
        # DOWNLOADER_MIDDLEWARES (see (2) below), so process_request() runs
        # as a downloader middleware hook and sets a proxy for every request.

        def __init__(self, ip=''):
            self.ip = ip

        def process_request(self, request, spider):
            print('http://10.240.252.16:911')
            request.meta['proxy'] = 'http://10.240.252.16:911'
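
    If you have more than one proxy, the same idea extends to picking one per request. The sketch below is a hypothetical variant, not part of the original project; it assumes you add a PROXY_LIST setting to settings.py.

    import random

    class RandomProxyMiddleware(object):
        # Hypothetical downloader middleware: choose a proxy per request
        # from a PROXY_LIST setting (falls back to the single proxy above).
        def __init__(self, proxies):
            self.proxies = proxies

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.getlist('PROXY_LIST', ['http://10.240.252.16:911']))

        def process_request(self, request, spider):
            request.meta['proxy'] = random.choice(self.proxies)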

    (2) settings.py

    DOWNLOADER_MIDDLEWARES = {
        #'Books.middlewares.BooksDownloaderMiddleware': 543,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,
        'Books.middlewares.BooksSpiderMiddleware': 125,
    }
    ITEM_PIPELINES = {
        #'Books.pipelines.BooksPipeline': 300,
        'Books.pipelines.PriceConverterPipeline': 300,
        'Books.pipelines.DuplicatesPipeline': 350,
        #'Books.pipelines.MongoDBPipeline': 400,
        'Books.pipelines.RedisPipeline': 404,
    }
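
    Lower ITEM_PIPELINES values run first, so each item is price-converted (300), then checked for duplicates (350), and only then written to Redis (404).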

    4.2 Basic settings

    ROBOTSTXT_OBEY = False
    CONCURRENT_REQUESTS = 32
    DOWNLOAD_DELAY = 3
    COOKIES_ENABLED = False

    4.3 Storing to MongoDB

    MONGO_DB_URL = 'mongodb://localhost:27017/'
    MONGO_DB_NAME = 'eilinge'
    
    FEED_EXPORT_FIELDS = ['name','price']  # field order for exported feeds
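
    Note that FEED_EXPORT_FIELDS only takes effect when exporting a feed, e.g. a run such as scrapy crawl books -o books.csv writes the columns in this order.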

    4.4 Redis connection and storage

    REDIS_HOST = '10.240.176.134'
    #REDIS_HOST = 'localhost'
    REDIS_PORT = 6379
    REDIS_DB_INDEX = 0
    #REDIS_PASSWD = 'redisredis'
    REDIS_URL = 'redis://10.240.176.134:6379'    # the Redis instance the spiders connect to
    
    SCHEDULER = 'scrapy_redis.scheduler.Scheduler'     # replace Scrapy's default scheduler with the scrapy_redis scheduler (running this on FreeBSD raised an error for me; it needs core binding, and the core path differs on FreeBSD)
    
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'     # use scrapy_redis's RFPDupeFilter for request deduplication
    
    SCHEDULER_PERSIST = True    # keep the request queue and dupefilter set in Redis after the spider stops, instead of clearing them
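
    Because SCHEDULER_PERSIST keeps the queue and the dupefilter between runs, a later crawl of the same site will skip everything already fingerprinted. If you want a completely fresh crawl, delete the persisted keys by hand; a sketch, assuming the scrapy_redis default key names:

    import redis

    r = redis.StrictRedis(host='10.240.176.134', port=6379, db=0)
    r.delete('books:requests', 'books:dupefilter')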

    Points to note:

        1. If you have 3 servers that can crawl at the same time, copy the Books project to each of them with scp.
        2. Run the spider on all 3 hosts with the same command: scrapy crawl books
        3. The log output then stalls at:
        2018-09-03 12:30:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
        2018-09-03 12:31:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
        Because the start-URL list and the request queue in Redis are still empty, all 3 spiders pause and wait. From any host, use the Redis client to push the start URL:
        redis-cli -h 10.240.176.134
        10.240.176.134:6379>lpush books:start_urls "http://books.toscrape.com"
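
    If you prefer to seed the start URL from a script instead of redis-cli, the equivalent with redis-py is shown below (same host/port as in settings.py).

    import redis

    redis.StrictRedis(host='10.240.176.134', port=6379).lpush(
        'books:start_urls', 'http://books.toscrape.com')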

    Additional notes:

    The Redis configuration file redis.conf

      #bind 127.0.0.1

      bind 0.0.0.0 # accept connections from any IP

      #requirepass redisredis # require a password for remote connections

    Running the Redis service on different systems

    1. Ubuntu: sudo service redis-server restart

    2. Linux (Fedora): service redis restart

    3. FreeBSD: service redis onerestart
Original article: https://www.cnblogs.com/eilinge/p/9579135.html