• Advanced Python: scrapy-redis


    Contents:

    • The scrapy-redis component

    • scrapy-redis configuration example

    I. The scrapy-redis component

    1. Introduction to scrapy-redis:

    scrapy-redis is a Redis-based Scrapy component that makes it easy to build a simple distributed crawler. In essence it provides three pieces of functionality:

    • scheduler  - the scheduler
    • dupefilter - URL deduplication rules (used by the scheduler)
    • pipeline   - data persistence

    2. URL deduplication

    When multiple spiders crawl concurrently in a distributed setup, how do we make sure the same URL is not fetched twice? The request queue, the scheduler, and the deduplication rules all have to live in Redis.

    Component: scrapy-redis stores the deduplication rules and the scheduler in Redis.

    Flow: connect to Redis; when the scheduler is asked to enqueue a request, it calls the dupefilter's request_seen method to decide whether the request has already been seen.

    """
    Define the deduplication rules (called and applied by the scheduler)

        a. Internally, the following settings are used to connect to Redis

            # REDIS_HOST = 'localhost'                            # host name
            # REDIS_PORT = 6379                                   # port
            # REDIS_URL = 'redis://user:pass@hostname:9001'       # connection URL (takes precedence over the settings above)
            # REDIS_PARAMS  = {}                                  # Redis connection parameters. Default: REDIS_PARAMS = {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
            # REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient' # Python class used to connect to Redis. Default: redis.StrictRedis
            # REDIS_ENCODING = "utf-8"                            # redis encoding. Default: 'utf-8'

        b. Deduplication is implemented with a redis set, whose key is:

            key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
            Default setting:
                DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

        c. The dupefilter converts each URL into a unique fingerprint, then checks whether that fingerprint already exists in the redis set

            from scrapy.utils import request
            from scrapy.http import Request

            req = Request(url='http://www.cnblogs.com/wupeiqi.html')
            result = request.request_fingerprint(req)
            print(result)  # 8ea4fd67887449313ccc12e5b6b92510cc53675c

            PS:
                - Requests whose URLs differ only in query-parameter order get the same fingerprint;
                - Request headers are not part of the fingerprint by default; include_headers can name the headers to include
                Example:
                    from scrapy.utils import request
                    from scrapy.http import Request

                    req = Request(url='http://www.baidu.com?name=8&id=1', callback=lambda x: print(x), cookies={'k1': 'vvvvv'})
                    result = request.request_fingerprint(req, include_headers=['cookies', ])
                    print(result)

                    req = Request(url='http://www.baidu.com?id=1&name=8', callback=lambda x: print(x), cookies={'k1': 666})
                    result = request.request_fingerprint(req, include_headers=['cookies', ])
                    print(result)

    """
    # Ensure all spiders share same duplicates filter through redis.
    # DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
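    To make the check itself concrete, below is a minimal sketch of a redis-backed dupefilter in the spirit of scrapy_redis.dupefilter.RFPDupeFilter (it is not the library's actual implementation; the class name and the 'demo:dupefilter' key are illustrative): the fingerprint is added to a set with sadd, and a return value of 0 means the fingerprint was already present, i.e. the request is a duplicate.

    # Simplified, illustrative redis-backed dupefilter; in a real project you
    # enable the library's own class via DUPEFILTER_CLASS instead.
    import redis
    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint


    class SimpleRedisDupeFilter(object):
        def __init__(self, server, key):
            self.server = server   # redis connection
            self.key = key         # redis set holding the fingerprints

        def request_seen(self, request):
            fp = request_fingerprint(request)
            # sadd returns 1 if the member was added, 0 if it already existed
            return self.server.sadd(self.key, fp) == 0


    if __name__ == '__main__':
        server = redis.StrictRedis(host='localhost', port=6379)
        df = SimpleRedisDupeFilter(server, 'demo:dupefilter')
        req = Request(url='http://www.cnblogs.com/wupeiqi.html')
        print(df.request_seen(req))   # False: first time we see this request
        print(df.request_seen(req))   # True: fingerprint already in the set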

    P.S. For more on the crawl queue, the scheduler, and the deduplication rules, see:

    http://www.cnblogs.com/wangshuyang/p/7717263.html

    3. The scheduler

    """
    The scheduler stores requests in a PriorityQueue (sorted set), a FifoQueue (list) or a LifoQueue (list), and uses RFPDupeFilter to deduplicate URLs

        a. Scheduler settings
            SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'          # PriorityQueue (sorted set) is the default; alternatives: FifoQueue (list), LifoQueue (list)
            SCHEDULER_QUEUE_KEY = '%(spider)s:requests'                         # redis key under which the scheduler stores pending requests
            SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"                  # serializer for data stored in redis; pickle by default
            SCHEDULER_PERSIST = True                                            # keep the queue and the dedup records when the spider closes? True = keep, False = clear
            SCHEDULER_FLUSH_ON_START = True                                     # clear the queue and the dedup records on start? True = clear, False = keep
            SCHEDULER_IDLE_BEFORE_CLOSE = 10                                    # maximum time to wait when the queue is empty before giving up
            SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'                  # redis key under which the dedup fingerprints are stored
            SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'# class that implements the dedup rules


    """
    # Enables scheduling storing requests queue in redis.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Default requests serializer is pickle, but it can be changed to any module
    # with loads and dumps functions. Note that pickle is not compatible between
    # python versions.
    # Caveat: In python 3.x, the serializer must return strings keys and support
    # bytes as values. Because of this reason the json or msgpack module will not
    # work by default. In python 2.x there is no such issue and you can use
    # 'json' or 'msgpack' as serializers.
    # SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

    # Don't cleanup redis queues, allows to pause/resume crawls.
    # SCHEDULER_PERSIST = True

    # Schedule requests using a priority queue. (default)
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

    # Alternative queues.
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

    # Max idle time to prevent the spider from being closed when distributed crawling.
    # This only works if queue class is SpiderQueue or SpiderStack,
    # and may also block the same time when your spider start at the first time (because the queue is empty).
    # SCHEDULER_IDLE_BEFORE_CLOSE = 10
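    Conceptually, the default PriorityQueue is just a redis sorted set stored under SCHEDULER_QUEUE_KEY: pushing adds the serialized request with a score derived from its priority, and popping removes the member with the best score. The sketch below illustrates that idea with plain redis-py and plain strings instead of serialized Request objects (the 'demo:requests' key is illustrative; this is not the library's own code):

    # Illustrative priority queue on top of a redis sorted set.
    import redis

    server = redis.StrictRedis(host='localhost', port=6379)
    key = 'demo:requests'

    def push(data, priority=0):
        # Store the negated priority so that higher-priority entries
        # end up with the lowest score and are popped first.
        server.zadd(key, {data: -priority})

    def pop():
        # Fetch and remove the best-ranked member in one pipeline.
        pipe = server.pipeline()
        pipe.zrange(key, 0, 0)
        pipe.zremrangebyrank(key, 0, 0)
        results, _ = pipe.execute()
        return results[0] if results else None

    push('http://example.com/low', priority=0)
    push('http://example.com/high', priority=10)
    print(pop())  # b'http://example.com/high'
    print(pop())  # b'http://example.com/low'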

    4. Data persistence

    Define persistence: RedisPipeline runs whenever the spider yields an Item object

        a. When persisting items to redis, the key and the serialization function can be configured

            REDIS_ITEMS_KEY = '%(spider)s:items'
            REDIS_ITEMS_SERIALIZER = 'json.dumps'

        b. Item data is stored in a redis list
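    To use this, enable the pipeline in settings.py; any separate process can then drain the item list. A sketch under the settings above (the 'chouti:items' key assumes a spider named "chouti" with the default REDIS_ITEMS_KEY pattern):

    # settings.py: enable the scrapy-redis item pipeline
    # ITEM_PIPELINES = {
    #     'scrapy_redis.pipelines.RedisPipeline': 300,
    # }

    # Standalone consumer that drains the item list.
    import json
    import redis

    server = redis.StrictRedis(host='localhost', port=6379)
    key = 'chouti:items'   # '%(spider)s:items' for a spider named "chouti"

    while True:
        raw = server.lpop(key)
        if raw is None:
            break                  # nothing left in the list
        item = json.loads(raw)     # items were serialized with json.dumps
        print(item)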

    5. Start URLs

    """
    Start URLs

        a. Are start URLs fetched from a set or from a list? True = set; False = list
            REDIS_START_URLS_AS_SET = False    # if True, start URLs are fetched with self.server.spop; if False, with self.server.lpop
        b. When writing the spider, start URLs are read from this redis key
            REDIS_START_URLS_KEY = '%(name)s:start_urls'

    """
    # If True, it uses redis' ``spop`` operation. This could be useful if you
    # want to avoid duplicates in your start urls list. In this cases, urls must
    # be added via ``sadd`` command or you will get a type error from redis.
    # REDIS_START_URLS_AS_SET = False

    # Default start urls key for RedisSpider and RedisCrawlSpider.
    # REDIS_START_URLS_KEY = '%(name)s:start_urls'
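    In practice this is used with scrapy_redis.spiders.RedisSpider: the spider waits for URLs to appear under its start-URL key, and work is fed to the running cluster by pushing URLs into that key. A short sketch under those assumptions (the spider name, key and URL are examples):

    # Spider side: read start URLs from redis instead of start_urls.
    from scrapy_redis.spiders import RedisSpider

    class ChoutiRedisSpider(RedisSpider):
        name = 'chouti'
        redis_key = 'chouti:start_urls'   # matches REDIS_START_URLS_KEY = '%(name)s:start_urls'

        def parse(self, response):
            yield {'url': response.url}

    # Producer side: push start URLs into the key (lpush, because
    # REDIS_START_URLS_AS_SET = False means the key is a plain list).
    import redis

    server = redis.StrictRedis(host='localhost', port=6379)
    server.lpush('chouti:start_urls', 'http://www.chouti.com/')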

    II. scrapy-redis configuration example

    1. Example settings file

    # DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    #
    #
    # from scrapy_redis.scheduler import Scheduler
    # from scrapy_redis.queue import PriorityQueue
    # SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'          # PriorityQueue (sorted set) is the default; alternatives: FifoQueue (list), LifoQueue (list)
    # SCHEDULER_QUEUE_KEY = '%(spider)s:requests'                         # redis key under which the scheduler stores pending requests
    # SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"                  # serializer for data stored in redis; pickle by default
    # SCHEDULER_PERSIST = True                                            # keep the queue and the dedup records when the spider closes? True = keep, False = clear
    # SCHEDULER_FLUSH_ON_START = False                                    # clear the queue and the dedup records on start? True = clear, False = keep
    # SCHEDULER_IDLE_BEFORE_CLOSE = 10                                    # maximum time to wait when the queue is empty before giving up
    # SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'                  # redis key under which the dedup fingerprints are stored
    # SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'# class that implements the dedup rules
    #
    #
    #
    # REDIS_HOST = '10.211.55.13'                           # host name
    # REDIS_PORT = 6379                                     # port
    # # REDIS_URL = 'redis://user:pass@hostname:9001'       # connection URL (takes precedence over the settings above)
    # # REDIS_PARAMS  = {}                                  # Redis connection parameters. Default: REDIS_PARAMS = {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
    # # REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient' # Python class used to connect to Redis. Default: redis.StrictRedis
    # REDIS_ENCODING = "utf-8"                              # redis encoding. Default: 'utf-8'
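    Everything in the example file above is commented out. For reference, the sketch below shows the minimal set of settings that actually switches a project over to scrapy-redis (the pipeline entry and the localhost address are assumptions; adapt them to your own project):

    # settings.py: minimal scrapy-redis setup
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared dedup through redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # redis-backed scheduler
    SCHEDULER_PERSIST = True                                     # keep queue and dupefilter between runs

    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 300,             # store items in a redis list
    }

    REDIS_HOST = 'localhost'    # example value; point at the shared Redis instance
    REDIS_PORT = 6379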

    2. Spider file

    import scrapy


    class ChoutiSpider(scrapy.Spider):
        name = "chouti"
        allowed_domains = ["chouti.com"]
        start_urls = (
            'http://www.chouti.com/',
        )

        def parse(self, response):
            # Re-yield the same request several times; with scrapy-redis enabled,
            # the shared redis dupefilter drops the duplicates.
            for i in range(0, 10):
                yield scrapy.Request(url=response.url, callback=self.parse)
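    After a run with the scrapy-redis settings above, the shared state is visible directly in Redis. The snippet below is a small sketch for inspecting it (the key names follow the '%(spider)s:...' defaults for a spider named "chouti"):

    # Inspect what scrapy-redis left behind after a crawl.
    import redis

    server = redis.StrictRedis(host='localhost', port=6379)

    print(server.scard('chouti:dupefilter'))   # request fingerprints seen (set)
    print(server.zcard('chouti:requests'))     # pending requests (sorted set, PriorityQueue)
    print(server.llen('chouti:items'))         # items stored by RedisPipeline (list)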