• Distributed crawling of the Sunshine Hotline site (wz.sun0769.com)


    - Distributed crawling
    - Concept: set up a cluster of machines and run the same program on every machine in the cluster, so that they jointly crawl one website's data.
    - Why can't native Scrapy implement distributed crawling on its own?
    - Because its scheduler cannot be shared across machines
    - and its pipeline cannot be shared either
    - How is distributed crawling implemented, then?
    - With scrapy plus the scrapy-redis component
    - What does the scrapy-redis component provide?
    - A scheduler and a pipeline that can be shared by all machines in the cluster
    - Limitation: the scraped data can only be stored in a redis database.
    - Implementation steps:
    - 1. pip install scrapy-redis
    - 2. Create a project
    - 3. cd into the project directory
    - 4. Create a spider file (either a. a Spider-based one or b. a CrawlSpider-based one)
    - 5. Modify the spider class (the finished fbs.py is shown below):
    - Import: from scrapy_redis.spiders import RedisCrawlSpider
    - Change the spider class's parent class to RedisCrawlSpider
    - Delete allowed_domains and start_urls
    - Add a new attribute: redis_key = 'fbsQueue', the name of the shared scheduler queue
    - Write the rest of the spider class as usual
    - 6. Configure settings.py:
    - UA spoofing
    - Robots (ROBOTSTXT_OBEY = False)
    - Specify the pipeline:
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }
    - Specify the scheduler:
    # Add a dupefilter class that uses a Redis set to store request fingerprints,
    # making request deduplication persistent
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scheduler that ships with the scrapy-redis component
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Whether the scheduler state persists: when the crawl ends, should the request
    # queue and the dedup fingerprint set in Redis be kept? True keeps (persists)
    # the data; False clears it
    SCHEDULER_PERSIST = True

    - Point the project at the redis database:
    REDIS_HOST = 'IP address of the redis server'
    REDIS_PORT = 6379
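
    If the redis server requires authentication, scrapy-redis also accepts a REDIS_PARAMS dict that is passed through to the redis client; a sketch (the password value is a placeholder, not from the original project):

    # optional: extra connection parameters for the redis client
    REDIS_PARAMS = {'password': 'your-redis-password'}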

    - Edit the redis configuration file redis.windows.conf:
    - Disable the default binding (line 56): #bind 127.0.0.1
    - Disable protected mode (line 75): protected-mode no
    - Start the redis server and then the client:
    - redis-server.exe redis.windows.conf
    - redis-cli
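
    To confirm that a worker machine can actually reach the redis server, a quick check from Python (a sketch; it assumes the redis package is installed via pip install redis and uses the REDIS_HOST value from settings.py below):

    # connectivity check from a worker machine
    import redis

    r = redis.Redis(host='192.168.1.104', port=6379)
    print(r.ping())  # True if the server is reachable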

    - Start the program:
    scrapy runspider xxx.py
    - Push a start url into the scheduler's queue:
    - the queue lives in redis
    - open the redis client and run: lpush fbsQueue http://wz.sun0769.com/index.php/question/questionType?type=4&page=
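
    The same push can be done from Python instead of redis-cli; a minimal sketch, assuming the redis package and the host/port from settings.py:

    # push the start url into the shared scheduler queue
    import redis

    r = redis.Redis(host='192.168.1.104', port=6379)
    r.lpush('fbsQueue',
            'http://wz.sun0769.com/index.php/question/questionType?type=4&page=')
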
    1. Create the project: scrapy startproject project_name

    2. Enter the project: cd project_name

    3. Create the spider: scrapy genspider -t crawl spider_name www.xxx.com

    4. Import: from scrapy_redis.spiders import RedisCrawlSpider

    5. In the spider, write the extraction rule and parse the title and status

    6. Declare the title and status fields in items.py (a sketch follows the spider code below)

    7. Import the item class from items.py into the spider

    8. Edit the settings: UA spoofing, robots, LOG_LEVEL

    9. Enable the scrapy_redis.pipelines.RedisPipeline pipeline

    10. Specify the scheduler

    11. Point the project at the redis database

    12. Edit the redis.windows.conf file

    13. Start the redis server and client

    14. Start the program: scrapy runspider xxx.py

    15. Push a start url into the scheduler's queue (see the verification sketch after this list)

    16. In redis-cli, run keys * to list the keys created by the crawl

    17. Run llen fbs:items to count the stored items
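
    The fbs:items key is where RedisPipeline stores the scraped items (the default key pattern is <spider name>:items, and the spider below is named fbs). To read one item back for verification, a sketch in Python; it assumes items were serialized as JSON, which is the scrapy-redis default:

    # verify that the pipeline is writing items into redis
    import json
    import redis

    r = redis.Redis(host='192.168.1.104', port=6379)
    print(r.llen('fbs:items'))                          # number of stored items
    print(json.loads(r.lrange('fbs:items', 0, 0)[0]))   # first item as a dict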

    fbs.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_redis.spiders import RedisCrawlSpider
    from fbsPro1.items import Fbspro1Item


    class FbsSpider(RedisCrawlSpider):
        name = 'fbs'
        # allowed_domains = ['www.xxx.com']
        # start_urls = ['http://www.xxx.com/']
        redis_key = 'fbsQueue'  # name of the shared scheduler queue

        rules = (
            # \d+ matches the page number in the pagination links
            Rule(LinkExtractor(allow=r'type=4&page=\d+'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # parse the title and status of each row
            tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table/tr')
            for tr in tr_list:
                title = tr.xpath('./td[2]/a[2]/text()').extract_first()
                status = tr.xpath('./td[3]/span/text()').extract_first()
                # instantiate an item and populate it
                item = Fbspro1Item()
                item['title'] = title
                item['status'] = status
                yield item
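
    items.py

    A minimal items.py consistent with the spider above; the class name Fbspro1Item and the two fields are taken directly from the import and the assignments in fbs.py:

    # -*- coding: utf-8 -*-
    import scrapy


    class Fbspro1Item(scrapy.Item):
        title = scrapy.Field()   # complaint title
        status = scrapy.Field()  # processing status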

    settings.py

    # -*- coding: utf-8 -*-

    # Scrapy settings for fbsPro1 project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    # https://docs.scrapy.org/en/latest/topics/settings.html
    # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    # https://docs.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'fbsPro1'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    SPIDER_MODULES = ['fbsPro1.spiders']
    NEWSPIDER_MODULE = 'fbsPro1.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # USER_AGENT = 'fbsPro1 (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    CONCURRENT_REQUESTS = 3

    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    # DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    # CONCURRENT_REQUESTS_PER_DOMAIN = 16
    # CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    # COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    # TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    # DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
    # }

    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    # SPIDER_MIDDLEWARES = {
    # 'fbsPro1.middlewares.Fbspro1SpiderMiddleware': 543,
    # }

    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    # DOWNLOADER_MIDDLEWARES = {
    # 'fbsPro1.middlewares.Fbspro1DownloaderMiddleware': 543,
    # }

    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    # EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    # }

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    # ITEM_PIPELINES = {
    # 'fbsPro1.pipelines.Fbspro1Pipeline': 300,
    # }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    # AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    # AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    # AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    # AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    # HTTPCACHE_ENABLED = True
    # HTTPCACHE_EXPIRATION_SECS = 0
    # HTTPCACHE_DIR = 'httpcache'
    # HTTPCACHE_IGNORE_HTTP_CODES = []
    # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    # use the RedisPipeline from scrapy_redis.pipelines:
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }

    # Add a dupefilter class that uses a Redis set to store request fingerprints,
    # making request deduplication persistent
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scheduler that ships with the scrapy-redis component
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Whether the scheduler state persists: when the crawl ends, should the request
    # queue and the dedup fingerprint set in Redis be kept? True keeps (persists)
    # the data; False clears it
    SCHEDULER_PERSIST = True

    REDIS_HOST = '192.168.1.104'
    REDIS_PORT = 6379
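
    Because SCHEDULER_PERSIST = True keeps the request queue and fingerprint set between runs, re-crawling from scratch requires clearing them by hand. A sketch, assuming the scrapy-redis default key names (<spider name>:requests, <spider name>:dupefilter, <spider name>:items):

    # clear the persisted scheduler state for the 'fbs' spider
    import redis

    r = redis.Redis(host='192.168.1.104', port=6379)
    r.delete('fbs:requests', 'fbs:dupefilter', 'fbs:items')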
