• Simple usage of downloader middleware


    1. Usage

    Writing a downloader middleware is much like writing a pipeline: define a class, then enable it in settings.py.
    In most cases Scrapy generates a template for you.
    settings.py
    DOWNLOADER_MIDDLEWARES = {
        # 'projectname.middlewares.yoursDownloaderMiddleware': None,
        'projectname.middlewares.yoursMiddleware': 300,
    }

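    The number after each entry is its order value, which determines the sequence in which requests pass through the downloader middlewares: the smaller the number, the higher the priority (closer to the engine); the larger the number, the closer to the downloader.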
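    The ordering rule above can be illustrated with a small sketch. It does not import Scrapy; it only simulates how the order values are interpreted: process_request is called in increasing order, process_response in decreasing order, and an entry set to None is disabled. The middleware names and numbers are hypothetical.

```python
# Hypothetical middleware names and order values, for illustration only.
DOWNLOADER_MIDDLEWARES = {
    "projectname.middlewares.RandomUserAgentMiddleware": 300,
    "projectname.middlewares.ProxyMiddleware": 350,
    "projectname.middlewares.CheckUserAgentMiddleware": 543,
    "projectname.middlewares.UnusedMiddleware": None,  # None disables an entry
}


def request_order(middlewares):
    """Enabled middleware names, sorted by ascending order value."""
    enabled = {name: order for name, order in middlewares.items() if order is not None}
    return sorted(enabled, key=enabled.get)


def response_order(middlewares):
    """Responses travel back through the same chain in reverse."""
    return list(reversed(request_order(middlewares)))


for name in request_order(DOWNLOADER_MIDDLEWARES):
    print("process_request: ", name)
for name in response_order(DOWNLOADER_MIDDLEWARES):
    print("process_response:", name)
```

    So a middleware at 300 sees the request before one at 543, but sees the response after it.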


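    2.

    Default methods of a downloader middleware:
    process_request(self, request, spider):
    This method is called for every request that passes through the middleware.
    # This example sets a random User-Agent
        import random  # place this at the top of middlewares.py

        def process_request(self, request, spider):
            # Pick a random User-Agent from the list
            ua = random.choice(user_agent_list)
            # Print the chosen User-Agent for debugging;
            # this can usually be removed once everything works
            if ua:
                print(ua)

            # First way: set the header only if it is not already present
            # request.headers.setdefault('User-Agent', ua)
            # Second way: always overwrite the header
            request.headers['User-Agent'] = ua
    user_agent_list
    Here are a few, picked more or less at random: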
    user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.17) Gecko/20110123 (like Firefox/3.x) SeaMonkey/2.0.12",
    "Mozilla/5.0 (Windows NT 5.2; rv:10.0.1) Gecko/20100101 Firefox/10.0.1 SeaMonkey/2.7.1",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/532.8 (KHTML, like Gecko) Chrome/4.0.302.2 Safari/532.8",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.464.0 Safari/534.3",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.15 Safari/534.13",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.186 Safari/535.1",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.54 Safari/535.2",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Macintosh; U; Mac OS X Mach-O; en-US; rv:2.0a) Gecko/20040614 Firefox/3.0.0 ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.0.3) Gecko/2008092414 Firefox/3.0.3",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5",
     ]
    Maintaining a list like this feels tedious, though, so I usually use fake-useragent instead.

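    from fake_useragent import UserAgent

    def process_request(self, request, spider):
            # Building UserAgent() on every request is wasteful;
            # consider creating it once and reusing it
            ua = UserAgent()
            # ua.chrome returns a random Chrome User-Agent string
            agent = ua.chrome

            if agent:
                print(agent)

            # First way: set the header only if it is not already present
            # request.headers.setdefault('User-Agent', agent)
            # Second way: always overwrite the header
            request.headers['User-Agent'] = agent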
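    Both examples above mention two ways of setting the header. The difference is easy to miss, so here is a minimal sketch of it. A plain dict stands in for request.headers (Scrapy's Headers object supports both setdefault and item assignment), and the function names and the short USER_AGENTS pool are assumptions made for illustration.

```python
import random

# Hypothetical short pool; in practice use the full user_agent_list.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
]


def set_if_missing(headers, pool=USER_AGENTS):
    """First way: keep a User-Agent that is already set on the request."""
    headers.setdefault("User-Agent", random.choice(pool))
    return headers


def always_override(headers, pool=USER_AGENTS):
    """Second way: replace whatever User-Agent was there before."""
    headers["User-Agent"] = random.choice(pool)
    return headers


preset = {"User-Agent": "my-spider/1.0"}
print(set_if_missing(dict(preset))["User-Agent"])   # keeps my-spider/1.0
print(always_override(dict(preset))["User-Agent"])  # one of USER_AGENTS
```

    If a spider sets its own User-Agent on some requests and you want to respect that, the first way is probably the better fit; if every request must be randomized, use the second.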

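    You can also set a proxy in this same method, provided the middleware is enabled in settings.py (the project name is whatever you chose).

    # This example uses a single proxy; you can also use random to pick a different one for each request.

    # When using random proxies, first run all of them through a checker program to verify that they work, so the pool stays usable.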

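    class projectnameDownloaderMiddleware(object):
        # Middleware that enables a proxy. It runs on each request before the
        # request reaches the downloader. The underlying idea is aspect-oriented
        # programming (AOP): less duplicated code, better reuse, clear separation of concerns.

        def process_request(self, request, spider):
            # Add a 'proxy' field to the request's meta
            # Proxy format: scheme + IP address + port
            request.meta['proxy'] = 'https://180.97.33.249:80'
            return None

    # To let the request continue on to the downloader, either return None or omit the return statement.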
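    The random-proxy variant mentioned above might look like the following sketch. It is not tied to Scrapy's loading mechanism: the class name, the is_alive callable, and the proxy addresses (taken from a documentation-only IP range) are all assumptions, and wiring the proxy list in through settings is left out.

```python
import random
from types import SimpleNamespace


class RandomProxyMiddleware:
    """Sketch: pick a random proxy from a pre-validated pool for each request."""

    def __init__(self, proxies, is_alive=lambda proxy: True):
        # Keep only the proxies that pass the check. is_alive is a hypothetical
        # callable you supply, for example one that sends a test request.
        self.proxies = [proxy for proxy in proxies if is_alive(proxy)]

    def process_request(self, request, spider):
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let the request continue to the downloader


# Usage with a stand-in request object; the addresses are placeholders.
candidates = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]
mw = RandomProxyMiddleware(candidates, is_alive=lambda p: p.endswith("10:8080"))
request = SimpleNamespace(meta={})
mw.process_request(request, spider=None)
print(request.meta["proxy"])  # http://203.0.113.10:8080 (the only one that passed)
```

    Filtering once at start-up keeps process_request cheap, but proxies can die mid-crawl, so a longer-running spider would likely need to re-check the pool from time to time.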

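    3. process_response(self, request, response, spider):

    This method is called after the downloader has finished the HTTP request, while the response is being passed back to the engine. Use it to process the response.

    class Checkuseragent():
        def process_response(self, request, response, spider):
            # Debug output: inspect the request and the User-Agent that was sent
            print(dir(response.request))
            print(request.headers['User-Agent'])
            return response

    # This method must return either a response or a request.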
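    To show what returning a request from process_response can be used for, here is a minimal sketch that re-issues a request when the response looks blocked. The class name, the status codes, the retry cap, and the block_retries meta key are assumptions; request.replace(dont_filter=True) is the Scrapy call that copies a request while bypassing the duplicate filter. Scrapy also ships its own RetryMiddleware, which is usually the better choice in a real project.

```python
class RetryOnBlockMiddleware:
    """Sketch: return the request again when the response looks blocked."""

    MAX_RETRIES = 3                  # hypothetical cap
    BLOCKED_STATUSES = {403, 429}    # hypothetical "blocked" status codes

    def process_response(self, request, response, spider):
        if response.status not in self.BLOCKED_STATUSES:
            return response  # normal case: pass the response along

        retries = request.meta.get("block_retries", 0)
        if retries >= self.MAX_RETRIES:
            return response  # give up and let the spider see the response

        # Returning a request stops the response chain and reschedules the download.
        retry_request = request.replace(dont_filter=True)
        retry_request.meta["block_retries"] = retries + 1
        return retry_request
```

    Combined with a process_request that picks a new User-Agent or proxy, each retry would go out with a different identity.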

    The official documentation explains all of this in more detail and is worth reading: https://docs.scrapy.org/en/latest/

  • Original article: https://www.cnblogs.com/cheflone/p/13712629.html