• Learning scrapy-splash


    Requirements

    docker

    scrapy


    Pages rendered by JavaScript are awkward to scrape with Scrapy alone. Splash is a rendering service that returns the page after its JavaScript has run, and it can be hooked into Scrapy. Because both Splash and Scrapy support asynchronous processing, requests do not block one another; with Selenium, by contrast, each page is rendered and downloaded inside a Downloader Middleware, so the whole process is blocking: Scrapy waits for the render to finish before processing and scheduling other requests, which hurts crawl efficiency. Crawling with Splash is therefore much faster than with Selenium.
    First install Docker, then pull the image: docker pull scrapinghub/splash
    Start Splash: docker run -p 8050:8050 scrapinghub/splash
    Then check that you can reach it: curl http://localhost:8050


    If you have already taken care of the firewall (opened the port or disabled it), Splash is reachable from remote machines as well.
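    The same check can be scripted from Python. A minimal sketch that just builds the render.html request URL (the host/port match the docker run command above; example.com is a placeholder target):

```python
from urllib.parse import urlencode

def render_url(splash_base, target, wait=0.5):
    """Build the GET URL for Splash's render.html endpoint."""
    query = urlencode({"url": target, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"

print(render_url("http://localhost:8050", "https://example.com", wait=1))
```

    Fetch the resulting URL with any HTTP client (curl, requests, a browser) and you should get back the rendered HTML.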


    Next, configure Scrapy. Install the scrapy-splash package first (pip install scrapy-splash), then add the following to settings.py:

    # Add the Splash server URL and the Splash-aware dupe filter
    SPLASH_URL = 'http://192.168.99.100:8050'  
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'  
    # Enable the Splash downloader middlewares
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        #'splash_163.middlewares.Splash163DownloaderMiddleware': 543,
    }
    # Enable the Splash spider middleware
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    

    Then yield Splash requests from your spider:

    from scrapy_splash import SplashRequest
    ...
    ...
    yield SplashRequest(url, callback=self.parse_result,
        args={
            # optional; parameters passed to Splash HTTP API
            'wait': 0.5,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        endpoint='render.json', # optional; default is render.html
        splash_url='<url>',     # optional; overrides SPLASH_URL
    )
    
    # Alternatively, build a plain scrapy.Request and configure Splash through its meta attribute:
    import scrapy_splash  # needed for SlotPolicy below
    yield scrapy.Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
                # 'http_method' is set to 'POST' for POST requests
                # 'body' is set to request body for POST requests
            },
            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': '<url>',      # optional; overrides SPLASH_URL
            'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
            'splash_headers': {},       # optional; a dict with headers sent to Splash
            'dont_process_response': True, # optional, default is False
            'dont_send_headers': True,  # optional, default is False
            'magic_response': False,    # optional, default is True
        }
    })
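    Either form ends up as a JSON POST to the chosen Splash endpoint, with 'url' taken from the request and the args merged in. A rough sketch of that request body (the exact serialization is an internal detail of scrapy-splash; this is only to illustrate what reaches Splash):

```python
import json

def splash_body(target_url, args):
    # Splash endpoints accept their parameters as a JSON request body;
    # 'url' comes from the original Scrapy request URL.
    return json.dumps({"url": target_url, **args}, sort_keys=True)

print(splash_body("https://example.com", {"html": 1, "png": 1, "wait": 0.5}))
```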
    

    When you need to perform actions on the page, you reach for a Lua script. Lua scripts can do what Selenium does: wait for the page to load, simulate clicks, page through results, and so on.

    script = """
    function main(splash, args)
      -- args.url, args.wait and args.page come from the SplashRequest below
      splash.images_enabled = false
      assert(splash:go(args.url))
      assert(splash:wait(args.wait))
      -- type the page number into the Taobao pager and click the submit button
      local js = string.format("document.querySelector('#mainsrp-pager div.form > input').value=%d;document.querySelector('#mainsrp-pager div.form > span.btn.J_Submit').click()", args.page)
      splash:evaljs(js)
      assert(splash:wait(args.wait))
      return splash:png()
    end
    """
    from urllib.parse import quote

    from scrapy import Spider
    from scrapy_splash import SplashRequest

    class TaobaoSpider(Spider):
        name = 'taobao'
        allowed_domains = ['taobao.com']  # requests go to s.taobao.com, so allow the whole domain
        base_url = 'https://s.taobao.com/search?q='

        def start_requests(self):
            for keyword in self.settings.get('KEYWORDS'):
                for page in range(1, self.settings.get('PAGE_NUM') + 1):
                    url = self.base_url + quote(keyword)
                    yield SplashRequest(url, callback=self.parse, endpoint='execute',
                                        args={'lua_source': script, 'page': page, 'wait': 3})
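    KEYWORDS and PAGE_NUM above are custom project settings rather than built-in Scrapy ones, so they must be defined in settings.py; the values below are illustrative:

```python
# settings.py: custom keys read by the spider above
KEYWORDS = ['羽毛球']  # search keywords to crawl
PAGE_NUM = 5          # number of result pages per keyword
```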
    

    And here, for good measure, is a Lua script that issues a POST request:

    script = """
    function main(splash, args)
      local treat = require("treat")
      local json = require("json")
      -- http_post defaults to form-urlencoded, so declare the JSON body explicitly
      local response = splash:http_post{args.url,
                                        body=json.encode({keywords="园林"}),
                                        headers={["content-type"]="application/json"}}
      splash:wait(10)
      return {
        html = treat.as_string(response.body),
        url = response.url,
        status = response.status
      }
    end
    """
    
  • Original post: https://www.cnblogs.com/triangle959/p/12024355.html