• Scrapy Deduplication


    I. Built-in

    1. Module

    from scrapy.dupefilters import RFPDupeFilter

    2. RFPDupeFilter methods

    a. request_seen

    Core idea: each time the spider executes yield Request(...), the request_seen method runs once

    Purpose: deduplication; the same url is visited only once

    Implementation: the url is turned into a fixed-length, unique fingerprint. If the fingerprint is already in the set, return True to indicate the url has been visited; otherwise add the fingerprint to the set

    1) request_fingerprint

    Purpose: turns a request (its url) into a fixed-length, unique value. A plain md5 over the url string would treat the two urls below as different, even though they only differ in query-parameter order; request_fingerprint canonicalizes the url first, so both produce the same fingerprint (see the md5 check after the snippet below)

    Note: request_fingerprint() only accepts a Request object, not a bare url string

    from scrapy.utils.request import request_fingerprint
    from scrapy.http import Request
    
    # two urls that differ only in query-parameter order
    url1 = 'https://test.com/?a=1&b=2'
    url2 = 'https://test.com/?b=2&a=1'
    request1 = Request(url=url1)
    request2 = Request(url=url2)
    
    # request_fingerprint only accepts Request objects
    rfp1 = request_fingerprint(request=request1)
    rfp2 = request_fingerprint(request=request2)
    print(rfp1)
    print(rfp2)
    
    if rfp1 == rfp2:
        print('same url')
    else:
        print('different urls')
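
    The md5 claim can be verified directly: hashing the raw url strings gives two different digests, while the fingerprints above come out equal. A minimal sketch reusing the same two urls (hashlib is only used here for the comparison):

    import hashlib
    
    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint
    
    url1 = 'https://test.com/?a=1&b=2'
    url2 = 'https://test.com/?b=2&a=1'
    
    # md5 of the raw strings: parameter order changes the digest
    print(hashlib.md5(url1.encode()).hexdigest() ==
          hashlib.md5(url2.encode()).hexdigest())            # False
    
    # request_fingerprint canonicalizes the url first: fingerprints match
    print(request_fingerprint(Request(url1)) ==
          request_fingerprint(Request(url2)))                # True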

    2) request_seen

    def request_seen(self, request):
        # request_fingerprint maps request(url) -> a unique, fixed-length fingerprint
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True        # True means this request was already seen
        self.fingerprints.add(fp)
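
    A quick check of this behavior against the built-in filter (no path argument, so the fingerprints stay in memory only):

    from scrapy.dupefilters import RFPDupeFilter
    from scrapy.http import Request
    
    df = RFPDupeFilter()
    req = Request('https://test.com/?a=1&b=2')
    
    print(df.request_seen(req))   # None: first time, fingerprint gets recorded
    print(df.request_seen(req))   # True: duplicate, the scheduler would drop it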

    b. open

    A method inherited from the parent class BaseDupeFilter; executed when the spider starts

    def open(self):
        # executed when the spider starts
        pass

    c. close

    Executed when the spider finishes

    def close(self, reason):
        # executed when the spider is closed
        pass

    d. log

    Logs requests dropped by the filter (a concrete sketch follows the stub below)

    def log(self, request, spider):
        # log that a request has been filtered
        pass
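
    The built-in RFPDupeFilter logs each duplicate only when DUPEFILTER_DEBUG is enabled in settings.py; a minimal sketch of a concrete implementation along those lines (the logger name 'dupefilter' is an assumption):

    import logging
    
    logger = logging.getLogger('dupefilter')  # assumed logger name
    
    def log(self, request, spider):
        # record every request the filter dropped as a duplicate
        logger.debug('Filtered duplicate request: %(request)s',
                     {'request': request}, extra={'spider': spider})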

    e. from_settings

    Principle and purpose: the same as from_crawler in pipelines; Scrapy calls this factory classmethod to build the filter from the project settings (a sketch that actually reads a setting follows the stub below)

    @classmethod
    def from_settings(cls, settings):
        return cls()
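
    If the filter needs values from settings.py, from_settings is where to read them. A sketch with a hypothetical custom key MY_DUPEFILTER_DEBUG (not a real Scrapy setting), assuming __init__ accepts a debug flag:

    @classmethod
    def from_settings(cls, settings):
        # settings.getbool is the standard Settings API for boolean values
        debug = settings.getbool('MY_DUPEFILTER_DEBUG')  # hypothetical key
        return cls(debug)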

    II. Custom


    1. Configuration (settings.py)

    # built-in default
    # DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
    DUPEFILTER_CLASS = 'toscrapy.dupefilters.MyDupeFilter'

    2. Custom dedup filter class (inherits from BaseDupeFilter)

    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint
    
    
    class MyDupeFilter(BaseDupeFilter):
        def __init__(self):
            self.visited_fp = set()
    
        @classmethod
        def from_settings(cls, settings):
            return cls()
    
        def request_seen(self, request):
            # If the fingerprint of the current request is already in the set,
            # return True (already visited); otherwise record the fingerprint
            # and let the request through
            fp = request_fingerprint(request)
            if fp in self.visited_fp:
                return True
            self.visited_fp.add(fp)
    
        def open(self):  # can return deferred
            print('spider opened')
    
        def close(self, reason):  # can return a deferred
            print('spider closed')
    
        def log(self, request, spider):  # log that a request has been filtered
            pass
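
    One natural extension is persisting the fingerprints across runs, loosely mirroring what the built-in RFPDupeFilter does when it is given a path (via JOBDIR). A sketch, where the file name visited.txt is an assumption for illustration:

    import os
    
    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint
    
    
    class PersistentDupeFilter(BaseDupeFilter):
        def __init__(self, path='visited.txt'):  # assumed file name
            self.path = path
            self.visited_fp = set()
    
        @classmethod
        def from_settings(cls, settings):
            return cls()
    
        def request_seen(self, request):
            fp = request_fingerprint(request)
            if fp in self.visited_fp:
                return True
            self.visited_fp.add(fp)
    
        def open(self):
            # reload fingerprints saved by a previous run
            if os.path.exists(self.path):
                with open(self.path) as f:
                    self.visited_fp.update(line.strip() for line in f)
    
        def close(self, reason):
            # persist fingerprints so the next run resumes deduplication
            with open(self.path, 'w') as f:
                f.writelines(fp + '\n' for fp in self.visited_fp)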

    3. Precondition

    For the filter to take effect, the yielded Request object must leave filtering enabled:

    yield scrapy.Request(url=_next, callback=self.parse, dont_filter=False)

    dont_filter must not be True; it defaults to False, so simply omitting it also works. Setting dont_filter=True tells the scheduler to bypass the dupe filter for that request, as the sketch below shows.
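
    A minimal spider sketch (the spider name and urls are assumptions) showing both cases; both yielded requests reuse an already-requested url, but only the one with dont_filter=True gets through:

    import scrapy
    
    
    class DemoSpider(scrapy.Spider):
        name = 'demo'  # assumed spider name
        start_urls = ['https://test.com/?a=1&b=2']
    
        def parse(self, response):
            # goes through the dupe filter (default dont_filter=False):
            # same canonical url as the start url, so it gets dropped
            yield scrapy.Request(url='https://test.com/?b=2&a=1',
                                 callback=self.parse)
            # bypasses the dupe filter entirely
            yield scrapy.Request(url='https://test.com/?b=2&a=1',
                                 callback=self.parse, dont_filter=True)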
