• scrapy notes: splash in practice



    1. References

    https://github.com/scrapy-plugins/scrapy-splash#configuration

    Treat that page as the authoritative reference.

    See also: scrapy notes: splash installation (Splash, "A javascript rendering service").

    1. Start the Docker Quickstart Terminal
    2. Connect with PuTTY to the IP shown below, port 22, username/password: docker/tcuser
    3. Start the service:
      1.   sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
    4. Open http://192.168.99.100:8050/ in a browser
    docker is configured to use the default machine with IP 192.168.99.100
    For help getting started, check out the docs at https://docs.docker.com
    
    Start interactive shell
    
    win7@win7-PC MINGW64 ~
    $

    2. Practice

    2.1 Create a new project, then modify settings.py

    Change ROBOTSTXT_OBEY to False, and add the following:

    '''https://github.com/scrapy-plugins/scrapy-splash#configuration'''
    # 1.Add the Splash server address to settings.py of your Scrapy project like this:
    SPLASH_URL = 'http://192.168.99.100:8050'
    
    # 2.Enable the Splash middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file 
    # and changing HttpCompressionMiddleware priority:
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    # Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.
    # HttpCompressionMiddleware priority should be changed in order to allow advanced response processing; 
    # see https://github.com/scrapy/scrapy/issues/1895 for details.
    
    # 3.Enable SplashDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py:
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    # This middleware is needed to support cache_args feature; 
    # it allows to save disk space by not storing duplicate Splash arguments multiple times in a disk request queue. 
    # If Splash 2.1+ is used the middleware also allows to save network traffic by not sending these duplicate arguments to Splash server multiple times.
    
    # 4.Set a custom DUPEFILTER_CLASS:
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    
    # 5.If you use Scrapy HTTP cache then a custom cache storage backend is required. 
    # scrapy-splash provides a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    # If you use another cache storage backend, it is necessary to subclass it 
    # and replace all scrapy.util.request.request_fingerprint calls with scrapy_splash.splash_request_fingerprint.
    
    # Note
    # Steps (4) and (5) are necessary because Scrapy doesn't provide a way to override request fingerprints calculation algorithm globally; this could change in future.
    # There are also some additional options available. Put them into your settings.py if you want to change the defaults:
    # SPLASH_COOKIES_DEBUG is False by default. Set to True to enable debugging cookies in the SplashCookiesMiddleware. This option is similar to COOKIES_DEBUG for the built-in scrapy cookies middleware: it logs sent and received cookies for all requests.
    # SPLASH_LOG_400 is True by default - it instructs to log all 400 errors from Splash. They are important because they show errors occurred when executing the Splash script. Set it to False to disable this logging.
    # SPLASH_SLOT_POLICY is scrapy_splash.SlotPolicy.PER_DOMAIN by default. It specifies how concurrency & politeness are maintained for Splash requests, and specifies the default value for the slot_policy argument of SplashRequest, which is described below.
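    The three optional settings above can be written out as a settings.py fragment. The values shown are the documented defaults; the SlotPolicy import follows the scrapy-splash README:

```python
# Optional scrapy-splash settings (documented defaults shown):
from scrapy_splash import SlotPolicy

SPLASH_COOKIES_DEBUG = False                 # True logs cookies sent/received by SplashCookiesMiddleware
SPLASH_LOG_400 = True                        # False disables logging of 400 responses from Splash
SPLASH_SLOT_POLICY = SlotPolicy.PER_DOMAIN   # how concurrency/politeness slots are assigned
```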

    2.2 Write a basic spider

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy_splash import SplashRequest
    from scrapy.shell import inspect_response
    import base64
    from PIL import Image
    from io import BytesIO
    
    
    class CnblogsSpider(scrapy.Spider):
        name = 'cnblogs'
        allowed_domains = ['cnblogs.com']
        start_urls = ['https://www.cnblogs.com/']
        
        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})    
    
        def parse(self, response):
            inspect_response(response, self)  # open an interactive scrapy shell on this response

    Debugging tip: view(response) opens the page as a .txt file... save it as .html instead and open that in a browser.

    2.3 Write a screenshot spider

    Also see https://stackoverflow.com/questions/45172260/scrapy-splash-screenshots

        def start_requests(self):
            splash_args = {
                'html': 1,
                'png': 1,
                #'width': 1024,     # default viewport is 1024x768 (4:3)
                #'render_all': 1,   # capture the full page; without it only the first screen is rendered.
                #                   # requires a non-zero 'wait', otherwise Splash returns an error
                #'wait': 0.5,
            }

            for url in self.start_urls:
                yield SplashRequest(url, self.parse, endpoint='render.json', args=splash_args)

    http://splash.readthedocs.io/en/latest/api.html?highlight=wait#render-png

    render_all=1 requires non-zero wait parameter. This is an unfortunate restriction, but it seems that this is the only way to make rendering work reliably with render_all=1.
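    That constraint can be captured in a small helper used before building the request. A sketch only; the helper name is mine, not part of scrapy-splash or Splash:

```python
def full_page_png_args(wait=0.5):
    """Build Splash render.json arguments for a full-page PNG screenshot.

    Splash requires a non-zero 'wait' whenever 'render_all' is set.
    """
    if wait <= 0:
        raise ValueError("render_all=1 requires a non-zero 'wait' parameter")
    return {'html': 1, 'png': 1, 'render_all': 1, 'wait': wait}
```

    The returned dict can be passed directly as args= to SplashRequest with endpoint='render.json'.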

    https://github.com/scrapy-plugins/scrapy-splash#responses

    Responses

    scrapy-splash returns Response subclasses for Splash requests:

    • SplashResponse is returned for binary Splash responses - e.g. for /render.png responses;
    • SplashTextResponse is returned when the result is text - e.g. for /render.html responses;
    • SplashJsonResponse is returned when the result is a JSON object - e.g. for /render.json responses or /execute responses when script returns a Lua table.

    SplashJsonResponse provides extra features:

    • response.data attribute contains response data decoded from JSON; you can access it like response.data['html'].

    Show the image and save it to a file:

        def parse(self, response):
            # In [6]: response.data.keys()
            # Out[6]: [u'title', u'url', u'geometry', u'html', u'png', u'requestedUrl']

            imgdata = base64.b64decode(response.data['png'])  # the PNG comes back base64-encoded
            img = Image.open(BytesIO(imgdata))
            img.show()
            filename = 'some_image.png'
            with open(filename, 'wb') as f:
                f.write(imgdata)
            inspect_response(response, self)  # open an interactive scrapy shell on this response


  • Original post: https://www.cnblogs.com/my8100/p/splash_practice.html