• SpiderKeeper: A Visual Management Tool for Scrapy


    A finished Scrapy spider deployed to a server is usually run either with nohup or with scrapyd. With nohup, if the spider dies you may not even notice; you have to log in to the server to check, or set up extra e-mail notifications. scrapyd works well otherwise, but deployment is a bit cumbersome and the feature set is limited.

    SpiderKeeper is a spider-management tool, similar in spirit to scrapinghub's deployment features: it can deploy spiders to multiple servers, run them on a schedule, and show spider logs and execution status.
    Project page: https://github.com/DormyMo/SpiderKeeper

    I. Runtime environment

    • CentOS 7
    • Python 2.7
    • Python 3.6
      Note: supervisor runs on Python 2.7, while scrapyd runs on Python 3.6, which you need to compile and install yourself; consult a Python 3 installation guide for the details.

    II. Install the dependencies

    1. supervisor: pip install supervisor
    2. scrapyd: pip3 install scrapyd
    3. SpiderKeeper: pip3 install SpiderKeeper

    III. Configure scrapyd

    1. Create a configuration file for scrapyd (scrapyd reads /etc/scrapyd/scrapyd.conf, among other locations):

    [scrapyd]
    eggs_dir    = eggs
    logs_dir    = logs
    items_dir   =
    jobs_to_keep = 5
    dbs_dir     = dbs
    max_proc    = 0
    max_proc_per_cpu = 4
    finished_to_keep = 100
    poll_interval = 5.0
    bind_address = 0.0.0.0
    http_port   = 6800
    debug       = off
    runner      = scrapyd.runner
    application = scrapyd.app.application
    launcher    = scrapyd.launcher.Launcher
    webroot     = scrapyd.website.Root
    
    [services]
    schedule.json     = scrapyd.webservice.Schedule
    cancel.json       = scrapyd.webservice.Cancel
    addversion.json   = scrapyd.webservice.AddVersion
    listprojects.json = scrapyd.webservice.ListProjects
    listversions.json = scrapyd.webservice.ListVersions
    listspiders.json  = scrapyd.webservice.ListSpiders
    delproject.json   = scrapyd.webservice.DeleteProject
    delversion.json   = scrapyd.webservice.DeleteVersion
    listjobs.json     = scrapyd.webservice.ListJobs
    daemonstatus.json = scrapyd.webservice.DaemonStatus
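
Before restarting anything, the [scrapyd] section above can be sanity-checked with Python's standard configparser; a minimal sketch (the embedded string abbreviates the file above):

```python
from configparser import ConfigParser

# Abbreviated copy of the [scrapyd] section shown above.
SCRAPYD_CONF = """
[scrapyd]
bind_address = 0.0.0.0
http_port   = 6800
max_proc_per_cpu = 4
poll_interval = 5.0
debug       = off
"""

parser = ConfigParser()
parser.read_string(SCRAPYD_CONF)
section = parser["scrapyd"]

# Listening on all interfaces, port 6800 -- the address SpiderKeeper will talk to.
print(section["bind_address"], section.getint("http_port"))
```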
    

    IV. Configure supervisor

    1. Create the configuration directory and generate a default config file:

    mkdir /etc/supervisor
    echo_supervisord_conf > /etc/supervisor/supervisord.conf
    

    2. Edit the configuration file with vim /etc/supervisor/supervisord.conf and change

    ;[include]
    ;files = relative/directory/*.ini
    

    to

    [include]
    files = conf.d/*.conf
    

    3. Create the conf.d directory: mkdir /etc/supervisor/conf.d
    4. Add the scrapyd config file with vim /etc/supervisor/conf.d/scrapyd.conf (adjust the command path below to wherever your Python 3 scrapyd binary actually lives):

    [program:scrapyd]
    command=/usr/local/python3.5/bin/scrapyd
    directory=/opt/SpiderKeeper
    user=root
    stderr_logfile=/var/log/scrapyd.err.log
    stdout_logfile=/var/log/scrapyd.out.log
    

    5. Add the SpiderKeeper config file with vim /etc/supervisor/conf.d/spiderkeeper.conf

    [program:spiderkeeper]
    command=spiderkeeper --server=http://localhost:6800
    directory=/opt/SpiderKeeper
    user=root
    stderr_logfile=/var/log/spiderkeeper.err.log
    stdout_logfile=/var/log/spiderkeeper.out.log
    

    6. Start supervisor: supervisord -c /etc/supervisor/supervisord.conf
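
The two [program:...] blocks above share one shape; if you end up managing several scrapyd or SpiderKeeper instances, a small generator keeps them consistent. A sketch in Python (render_program is an illustrative helper, not part of supervisor):

```python
def render_program(name, command, directory, user="root"):
    """Render a supervisor [program:x] block like the ones shown above."""
    return "\n".join([
        f"[program:{name}]",
        f"command={command}",
        f"directory={directory}",
        f"user={user}",
        f"stderr_logfile=/var/log/{name}.err.log",
        f"stdout_logfile=/var/log/{name}.out.log",
    ])

# Reproduces the scrapyd.conf block from step 4.
print(render_program("scrapyd", "/usr/local/python3.5/bin/scrapyd", "/opt/SpiderKeeper"))
```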

    V. Usage

    1. Log in at http://localhost:5000
    2. Create a project
    3. Package the spider files:
    pip3 install scrapyd-client
    scrapyd-deploy --build-egg output.egg
    4. Upload the packaged spider egg file
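
Note that scrapyd-deploy must be run from the Scrapy project root, i.e. the directory containing scrapy.cfg. A minimal scrapy.cfg for reference (the module and project names are placeholders for your own project):

```ini
[settings]
default = myproject.settings

[deploy]
url = http://localhost:6800/
project = myproject
```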

    SpiderKeeper can talk to scrapyd on multiple servers; just pass additional --server options when starting it.
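
Under the hood, SpiderKeeper drives scrapyd through the same web API listed in the scrapyd config above (schedule.json and friends). A minimal sketch of building such a request with the standard library, assuming placeholder project and spider names (the actual POST is commented out because it needs a running scrapyd):

```python
from urllib.parse import urlencode

def build_schedule_request(server, project, spider):
    """Build the URL and POST body for scrapyd's schedule.json endpoint."""
    url = f"{server.rstrip('/')}/schedule.json"
    body = urlencode({"project": project, "spider": spider}).encode()
    return url, body

url, body = build_schedule_request("http://localhost:6800", "myproject", "myspider")
print(url)

# To actually start the spider (requires scrapyd listening on port 6800):
#   from urllib.request import urlopen
#   print(urlopen(url, data=body).read())
```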

  • Original article: https://www.cnblogs.com/ginponson/p/7638579.html