• Scrapy anti-ban strategies


    Strategy 1: set download_delay

      Setting a download delay reduces the chance of being banned.

    Setting the DOWNLOAD_DELAY parameter in settings.py limits how frequently the crawler hits the site:

    DOWNLOAD_DELAY = 0.25    # 250 ms of delay

    With RANDOMIZE_DOWNLOAD_DELAY enabled (it is on by default), the actual wait between requests is randomized to between 0.5x and 1.5x DOWNLOAD_DELAY, which further lowers the chance of being blocked.

    download_delay can be set globally in settings.py, or per spider as a spider attribute.
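As a rough illustration (this is a sketch of the behaviour described above, not Scrapy's actual implementation), the randomized delay works like this:

```python
import random

def randomized_delay(download_delay=0.25):
    # With RANDOMIZE_DOWNLOAD_DELAY on (the default), Scrapy waits a
    # uniform random time between 0.5x and 1.5x of DOWNLOAD_DELAY
    # before each request to the same site.
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)
```

With the 0.25 s delay above, each wait falls somewhere between 0.125 s and 0.375 s.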

    Strategy 2: disable cookies

    Cookies are data (usually encrypted) that some sites store on the user's machine to identify them; disabling cookies denies sites that might use cookies to track a crawler's footprint.

      In settings.py, set COOKIES_ENABLED = False. This disables the cookies middleware, so no cookies are sent to the server.

    Strategy 3: use a user-agent pool

      The user agent is a string carrying browser, operating system, and other client information, part of the HTTP protocol. Servers use it to decide whether the visitor is a browser, a mail client, or a web crawler. You can inspect it in the request headers from the Scrapy shell:

    scrapy shell <url>

    request.headers

    Next, create rotate_useragent.py under the spiders directory.

    The code (updated for current Scrapy — the middleware now actually applies the chosen user agent to each request):

    # coding: utf-8
    import random

    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

    class RotateUserAgentMiddleware(UserAgentMiddleware):
        def __init__(self, user_agent=''):
            super().__init__(user_agent)
            # Pool of real browser user-agent strings to rotate through.
            # (Adjacent string literals are concatenated: each entry below
            # spans two source lines but is a single user agent.)
            self.user_agent_list = [
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
                "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
                "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
                "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
                "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
                "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
                "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
                "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
                "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
                "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
                "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
                "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
                "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
                "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
                "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
                "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
                "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
                "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
                "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
                "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
                "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            ]

        def process_request(self, request, spider):
            ua = random.choice(self.user_agent_list)
            if ua:
                spider.logger.debug('Current UserAgent: %s', ua)
                # Actually apply the chosen user agent to the outgoing request
                request.headers.setdefault('User-Agent', ua)

    In settings.py, disable the default user-agent middleware and enable the new one:

    # Disable the default UserAgentMiddleware, enable the rotating one
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'myProject.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
    }

    Strategy 4: use an IP (proxy) pool
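The original post does not flesh this strategy out. A minimal sketch of a proxy-rotating downloader middleware might look like the following; the proxy addresses are placeholders, to be replaced with your own pool:

```python
import random

# Placeholder proxies -- substitute addresses from your own pool.
PROXY_POOL = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
]

class RandomProxyMiddleware:
    """Downloader middleware that routes each request through a random proxy."""

    def process_request(self, request, spider):
        # Scrapy's HTTP downloader honours the 'proxy' key in request.meta.
        request.meta["proxy"] = random.choice(PROXY_POOL)
```

Like the user-agent middleware above, it must be registered in DOWNLOADER_MIDDLEWARES to take effect.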

    Strategy 5: use distributed crawling
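This strategy is also left as a heading in the original post. One common approach is the third-party scrapy-redis extension, which shares the request queue and duplicate filter through Redis so multiple crawler processes can cooperate. A sketch of the relevant settings.py fragment, assuming scrapy-redis is installed and a Redis server is reachable:

```python
# Share the scheduler queue and dedup fingerprints via Redis (scrapy-redis)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue between runs so crawls can be paused and resumed
SCHEDULER_PERSIST = True
# Address of the shared Redis instance (placeholder)
REDIS_URL = "redis://localhost:6379"
```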



  • Original post (Chinese): https://www.cnblogs.com/Garvey/p/6689138.html