• Installing pyspider


    Operating system

    CentOS Linux release 7.0.1406 (Core)
    

    Python environment

    Installing Python

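    Install the build dependencies:
      yum install gcc # required to build Python
      yum install zlib # this and the next three packages are required by setuptools; if you install them after building Python, you must rebuild Python (run make again)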
      yum install zlib-devel
      yum install openssl
      yum install openssl-devel


      cd Python-2.7.13
      ./configure --prefix=/python2.7
      make
      make install

    Configure the PATH environment variable
    # vi ~/.bash_profile
      export PATH=/python2.7/bin:$PATH
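
    Because zlib and openssl support is only compiled in when the -devel packages are present at build time, it can help to confirm that the new interpreter can import the affected modules before going further. The snippet below is a minimal sketch, not part of the original guide; the function name missing_modules and the list of modules checked are illustrative assumptions.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that this interpreter cannot import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # zlib and ssl depend on zlib-devel / openssl-devel at build time;
    # sqlite3 is checked too because a missing _sqlite3 is a common build gap.
    print(missing_modules(["zlib", "ssl", "sqlite3"]))
```

    Run it with /python2.7/bin/python. An empty list means all three modules were built; any name that is printed points to a development package that was missing when Python was compiled.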

    Installing pip

       Dependency: setuptools

        setuptools in turn depends on: six-1.10.0.tar.gz packaging-16.8.tar.gz pyparsing-2.2.0.tar.gz appdirs-1.4.3.tar.gz

       cd pip-9.0.1

       # python setup.py install

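    Installing pyspider

    Download the latest pyspider from GitHub.

    System package dependencies:

      tcl  protobuf libcurl-devel libxslt-devel  libxml2

    Install them with yum install.

    cd pyspider
    # Install the Python dependencies, then pyspider itself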
    pip install -r requirements.txt
    python setup.py install
    

      The mysql-connector version listed in requirements.txt could not be downloaded, so install a different version of mysql-connector instead:

      pip install mysql-connector==2.1.4

    Installing the MySQL database

    After installing MySQL with yum, follow http://www.itnose.net/detail/6310643.html to finish setting up the database.

    # Restart MySQL
    service mysqld restart
    
    # mysql -u root
    # Change the root password
    mysql> use mysql;
    mysql> update user set password=password('123456') where user='root';
    
    
    # Create the databases and grant privileges
    mysql> create database taskdb;
    mysql> create database projectdb;
    mysql> create database resultdb;
    mysql> create user 'pyspider'@'%' identified by 'pyspider-pass';
    mysql> create user pyspider@'localhost' identified by 'pyspider-pass';
    mysql> grant select,insert,update,references,delete,create,drop,alter,index,trigger,create view,show view,execute,alter routine,create routine,create temporary tables,lock tables,event on taskdb.* to 'pyspider'@'%';
    mysql> grant select,insert,update,references,delete,create,drop,alter,index,trigger,create view,show view,execute,alter routine,create routine,create temporary tables,lock tables,event on projectdb.* to 'pyspider'@'%';
    mysql> grant select,insert,update,references,delete,create,drop,alter,index,trigger,create view,show view,execute,alter routine,create routine,create temporary tables,lock tables,event on resultdb.* to 'pyspider'@'%';
    mysql> flush privileges;
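
    The three grant statements above differ only in the database name, so they can be generated rather than typed by hand. The following is a small sketch under that assumption; the helper grant_statements is hypothetical and only builds the SQL text, it does not connect to MySQL.

```python
# Privilege list copied from the grant statements in this guide.
PRIVILEGES = (
    "select,insert,update,references,delete,create,drop,alter,index,"
    "trigger,create view,show view,execute,alter routine,"
    "create routine,create temporary tables,lock tables,event"
)


def grant_statements(databases, user="pyspider", host="%"):
    """Build one GRANT statement per database for the given account."""
    template = "grant {privs} on {db}.* to '{user}'@'{host}';"
    return [
        template.format(privs=PRIVILEGES, db=db, user=user, host=host)
        for db in databases
    ]


if __name__ == "__main__":
    for statement in grant_statements(["taskdb", "projectdb", "resultdb"]):
        print(statement)
```

    The printed statements can be pasted into the mysql prompt. Passing host="localhost" produces the equivalent grants for the 'pyspider'@'localhost' account, should that account need them as well.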
    
    Edit the configuration file (in preparation for clustering)
    vi /etc/my.cnf
    bind-address = 0.0.0.0
    
    # Restart the database
    service mysqld restart
    

       

    Installing Redis

    Download Redis and extract it under /root/training.

    Build and install Redis:

    cd /root/training/redis-2.8.12
    make
    make test
    make install
    
    # Prepare for clustering (use the directory of the Redis version you actually extracted)
    cd /root/training/redis-3.2.8
    cp redis.conf /etc/
    
    vi /etc/redis.conf
    bind 0.0.0.0 
    
    # Start Redis
    redis-server /etc/redis.conf &
    

      

    Sign of a successful start: The server is now ready to accept connections on port 6379
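
    Besides reading the startup message, you can check from another shell that something is accepting TCP connections on port 6379. This is a minimal sketch using only the standard library; the function name port_open is an assumption, and the check confirms only that the port is open, not that the listener is Redis.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True


if __name__ == "__main__":
    # 6379 is the Redis port used throughout this guide.
    print(port_open("127.0.0.1", 6379))
```

    The same function can be pointed at 3306 for MySQL or 5555 for the pyspider web UI later on.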

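    Firewall

    Check the firewall status:

    firewall-cmd --state

    Two rules I added myself:

    iptables -A INPUT -s 127.0.0.1 -p tcp --dport 6379 -j ACCEPT
    iptables -A INPUT -p tcp --dport 6379 -j DROP

    Disable firewalld:
    systemctl stop firewalld.service # stop firewalld
    systemctl disable firewalld.service # keep firewalld from starting at boot

    If you are not sure how to configure the firewall, the simplest option is to stop it.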

    Installing phantomjs

    Download: wget 'https://bbuseruploads.s3.amazonaws.com/fd96ed93-2b32-46a7-9d2b-ecbc0988516a/downloads/396e7977-71fd-4592-8723-495ca4cfa7cc/phantomjs-2.1.1-linux-x86_64.tar.bz2?Signature=guF7TAUW11qr9nZXcTBHu7dg1ds%3D&Expires=1488510600&AWSAccessKeyId=AKIAIVFPT2YJYYZY3H4A&versionId=null&response-content-disposition=attachment%3B%20filename%3D%22phantomjs-2.1.1-linux-x86_64.tar.bz2%22'
    

    Download phantomjs-2.1.1-linux-x86_64.tar.bz2 to /root and extract it.

    Copy the phantomjs binary from the phantomjs/bin directory to /python2.7/bin.

    Configuration file

    ====================================================================

    The pyspider configuration file looks like this:

    {
      "taskdb": "mysql+taskdb://pyspider:pyspider-pass@localhost:3306/taskdb",
      "projectdb": "mysql+projectdb://pyspider:pyspider-pass@localhost:3306/projectdb",
      "resultdb": "mysql+resultdb://pyspider:pyspider-pass@localhost:3306/resultdb",
      "message_queue": "redis://localhost:6379/db",
      "webui": {
        "port":5555,
        "username": "pyspider",
        "password": "pyspider-pass",
        "need-auth": true
      }
    }
    

      

     =========================================

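    # For safety, create an unprivileged user to own the configuration file
    useradd -md /pyspider pyspider
    # Save the configuration file as
    /pyspider/config.json
    # Set ownership and permissions
    chown -R pyspider:pyspider /pyspider
    chmod 400 /pyspider/config.json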
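
    The save and chmod steps can also be done in one go from Python, which avoids the window in which the file (and the passwords in it) is readable by other users. This is only a sketch; write_config is a hypothetical helper, and it still needs a chown afterwards if it is run as root.

```python
import json
import os
import tempfile


def write_config(path, config):
    """Write `config` as JSON, then restrict the file to owner read-only (0400)."""
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)
    os.chmod(path, 0o400)


if __name__ == "__main__":
    # Demonstration in a temporary directory instead of /pyspider.
    demo_path = os.path.join(tempfile.mkdtemp(), "config.json")
    write_config(demo_path, {"webui": {"port": 5555, "need-auth": True}})
    print(oct(os.stat(demo_path).st_mode & 0o777))
```

    Because the file ends up read-only, a non-root owner has to make it writable again (or remove it) before calling write_config on the same path a second time.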

    Starting pyspider

    Start pyspider:

    /python2.7/bin/pyspider -c /pyspider/config.json

    The output looks like this:

    # pyspider -c /pyspider/config.json 
    [W 170516 17:45:05 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [I 170516 17:45:05 result_worker:49] result_worker starting...
    [W 170516 17:45:06 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:06 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:06 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:06 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [I 170516 17:45:06 processor:211] processor starting...
    [W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [I 170516 17:45:07 tornado_fetcher:638] fetcher starting...
    [W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [I 170516 17:45:09 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
    [I 170516 17:45:09 scheduler:647] scheduler starting...
    phantomjs fetcher running on port 25555
    [I 170516 17:45:09 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
    [W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
    [I 170516 17:45:10 app:76] webui running on 0.0.0.0:5555
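
    The repeated warning in the log above says the Redis DB must be a zero-based numeric index, which suggests that the /db suffix of the message_queue URL is not being read as a number and that 0 is used instead. The sketch below shows how such a URL can be checked before starting pyspider; redis_db_index is a hypothetical helper and does not reproduce pyspider's own parsing.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7, as installed in this guide


def redis_db_index(url):
    """Return the numeric DB index of a redis:// URL, or None if it is not numeric."""
    path = urlparse(url).path.lstrip("/")
    return int(path) if path.isdigit() else None


if __name__ == "__main__":
    print(redis_db_index("redis://localhost:6379/db"))  # value used in this guide
    print(redis_db_index("redis://localhost:6379/0"))
```

    For the URL used in this guide the function returns None, while a URL ending in /0 returns 0. Presumably, writing the message_queue URL with a numeric index would make the warnings go away, though this has not been verified here.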
    

      

     Note: this part still has unresolved problems.

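    Installing supervisor to monitor all processes

    supervisor monitors the pyspider processes and restarts them immediately if they stop. Download supervisor-3.3.1 to /root and extract it.

    cd /root/supervisor-3.3.1
    python setup.py install

    Alternatively, install it with pip:

    pip install supervisor

    Create the default configuration file and edit it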

    # /python2.7/bin/echo_supervisord_conf > /python2.7/conf/supervisor.conf

    ; Sample supervisor config file.
    ;
    ; For more information on the config file, please see:
    ; http://supervisord.org/configuration.html
    ;
    ; Notes:
    ;  - Shell expansion ("~" or "$HOME") is not supported.  Environment
    ;    variables can be expanded using this syntax: "%(ENV_HOME)s".
    ;  - Comments must have a leading space: "a=b ;comment" not "a=b;comment".
    
    [unix_http_server]
    file=/tmp/supervisor.sock   ; (the path to the socket file)
    chmod=0700                 ; socket file mode (default 0700)
    chown=root:root       ; socket file uid:gid owner
    ;username=user              ; (default is no username (open server))
    ;password=123               ; (default is no password (open server))
    
    [inet_http_server]         ; inet (TCP) server disabled by default
    port=127.0.0.1:9001        ; (ip_address:port specifier, *:port for all iface)
    username=supervisor             ; (default is no username (open server))
    password=123               ; (default is no password (open server))
    
    [supervisord]
    logfile=/tmp/supervisord.log ; (main log file;default $CWD/supervisord.log)
    logfile_maxbytes=50MB        ; (max main logfile bytes b4 rotation;default 50MB)
    logfile_backups=10           ; (num of main logfile rotation backups;default 10)
    loglevel=info                ; (log level;default info; others: debug,warn,trace)
    pidfile=/tmp/supervisord.pid ; (supervisord pidfile;default supervisord.pid)
    nodaemon=false               ; (start in foreground if true;default false)
    minfds=1024                  ; (min. avail startup file descriptors;default 1024)
    minprocs=200                 ; (min. avail process descriptors;default 200)
    ;umask=022                   ; (process file creation umask;default 022)
    ;user=chrism                 ; (default is current user, required if root)
    ;identifier=supervisor       ; (supervisord identifier, default is 'supervisor')
    ;directory=/tmp              ; (default is not to cd during start)
    ;nocleanup=true              ; (don't clean up tempfiles at start;default false)
    ;childlogdir=/tmp            ; ('AUTO' child log dir, default $TEMP)
    ;environment=KEY="value"     ; (key value pairs to add to environment)
    ;strip_ansi=false            ; (strip ansi escape codes in logs; def. false)
    
    ; the below section must remain in the config file for RPC
    ; (supervisorctl/web interface) to work, additional interfaces may be
    ; added by defining them in separate rpcinterface: sections
    [rpcinterface:supervisor]
    supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
    
    [supervisorctl]
    serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL  for a unix socket
    ;serverurl=http://127.0.0.1:9001 ; use an http:// url to specify an inet socket
    username=supervisor              ; should be same as http_username if set
    password=123                ; should be same as http_password if set
    prompt=mysupervisor         ; cmd line prompt (default "supervisor")
    history_file=~/.sc_history  ; use readline history if available
    
    ; The below sample program section shows all possible program subsection values,
    ; create one or more 'real' program: sections to be able to control them under
    ; supervisor.
    
    ;[program:theprogramname]
    ;command=/bin/cat              ; the program (relative uses PATH, can take args)
    ;process_name=%(program_name)s ; process_name expr (default %(program_name)s)
    ;numprocs=1                    ; number of processes copies to start (def 1)
    ;directory=/tmp                ; directory to cwd to before exec (def no cwd)
    ;umask=022                     ; umask for process (default None)
    ;priority=999                  ; the relative start priority (default 999)
    ;autostart=true                ; start at supervisord start (default: true)
    ;startsecs=1                   ; # of secs prog must stay up to be running (def. 1)
    ;startretries=3                ; max # of serial start failures when starting (default 3)
    ;autorestart=unexpected        ; when to restart if exited after running (def: unexpected)
    ;exitcodes=0,2                 ; 'expected' exit codes used with autorestart (default 0,2)
    ;stopsignal=QUIT               ; signal used to kill process (default TERM)
    ;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
    ;stopasgroup=false             ; send stop signal to the UNIX process group (default false)
    ;killasgroup=false             ; SIGKILL the UNIX process group (def false)
    ;user=chrism                   ; setuid to this UNIX account to run the program
    ;redirect_stderr=true          ; redirect proc stderr to stdout (default false)
    ;stdout_logfile=/a/path        ; stdout log path, NONE for none; default AUTO
    ;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
    ;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
    ;stdout_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
    ;stdout_events_enabled=false   ; emit events on stdout writes (default false)
    ;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
    ;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
    ;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
    ;stderr_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
    ;stderr_events_enabled=false   ; emit events on stderr writes (default false)
    ;environment=A="1",B="2"       ; process environment additions (def no adds)
    ;serverurl=AUTO                ; override serverurl computation (childutils)
    
    ; The below sample eventlistener section shows all possible
    ; eventlistener subsection values, create one or more 'real'
    ; eventlistener: sections to be able to handle event notifications
    ; sent by supervisor.
    
    ;[eventlistener:theeventlistenername]
    ;command=/bin/eventlistener    ; the program (relative uses PATH, can take args)
    ;process_name=%(program_name)s ; process_name expr (default %(program_name)s)
    ;numprocs=1                    ; number of processes copies to start (def 1)
    ;events=EVENT                  ; event notif. types to subscribe to (req'd)
    ;buffer_size=10                ; event buffer queue size (default 10)
    ;directory=/tmp                ; directory to cwd to before exec (def no cwd)
    ;umask=022                     ; umask for process (default None)
    ;priority=-1                   ; the relative start priority (default -1)
    ;autostart=true                ; start at supervisord start (default: true)
    ;startsecs=1                   ; # of secs prog must stay up to be running (def. 1)
    ;startretries=3                ; max # of serial start failures when starting (default 3)
    ;autorestart=unexpected        ; autorestart if exited after running (def: unexpected)
    ;exitcodes=0,2                 ; 'expected' exit codes used with autorestart (default 0,2)
    ;stopsignal=QUIT               ; signal used to kill process (default TERM)
    ;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
    ;stopasgroup=false             ; send stop signal to the UNIX process group (default false)
    ;killasgroup=false             ; SIGKILL the UNIX process group (def false)
    ;user=chrism                   ; setuid to this UNIX account to run the program
    ;redirect_stderr=false         ; redirect_stderr=true is not allowed for eventlisteners
    ;stdout_logfile=/a/path        ; stdout log path, NONE for none; default AUTO
    ;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
    ;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
    ;stdout_events_enabled=false   ; emit events on stdout writes (default false)
    ;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
    ;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
    ;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
    ;stderr_events_enabled=false   ; emit events on stderr writes (default false)
    ;environment=A="1",B="2"       ; process environment additions
    ;serverurl=AUTO                ; override serverurl computation (childutils)
    
    ; The below sample group section shows all possible group values,
    ; create one or more 'real' group: sections to create "heterogeneous"
    ; process groups.
    
    ;[group:thegroupname]
    ;programs=progname1,progname2  ; each refers to 'x' in [program:x] definitions
    ;priority=999                  ; the relative start priority (default 999)
    
    ; The [include] section can just contain the "files" setting.  This
    ; setting can list multiple files (separated by whitespace or
    ; newlines).  It can also contain wildcards.  The filenames are
    ; interpreted as relative to this file.  Included files *cannot*
    ; include files themselves.
    
    ;[include]
    ;files = relative/directory/*.ini
    [group:pyspider]
    programs=pyspider-fetcher,pyspider-processor
    
    [program:pyspider-fetcher]
    command=/python2.7/bin/pyspider -c /pyspider/config.json fetcher
    autorestart=true
    autostart=true
    user=root
    group=pyspider
    stopasgroup=true
    
    [program:pyspider-processor]
    command=/python2.7/bin/pyspider -c /pyspider/config.json processor
    autorestart=true
    autostart=true
    user=root
    group=pyspider
    stopasgroup=true
    stderr_logfile=/var/Spider/Log/Process/spider_process_err.log
    stdout_logfile=/var/Spider/Log/Process/spider_process_out.log
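
    The [program:...] sections above follow one pattern per pyspider component, so they can be generated when more components are added. The following is a sketch under that assumption; program_section and group_section are hypothetical helpers that only build the text, and the default paths are the ones used in this guide.

```python
def program_section(component,
                    pyspider_bin="/python2.7/bin/pyspider",
                    config="/pyspider/config.json"):
    """Build a supervisor [program:...] section for one pyspider component."""
    lines = [
        "[program:pyspider-{0}]".format(component),
        "command={0} -c {1} {2}".format(pyspider_bin, config, component),
        "autorestart=true",
        "autostart=true",
        "stopasgroup=true",
    ]
    return "\n".join(lines) + "\n"


def group_section(components):
    """Build the [group:pyspider] section listing every generated program."""
    names = ",".join("pyspider-{0}".format(c) for c in components)
    return "[group:pyspider]\nprograms={0}\n".format(names)


if __name__ == "__main__":
    components = ["fetcher", "processor"]
    print(group_section(components))
    for component in components:
        print(program_section(component))
```

    The output can be appended to the supervisor configuration file. Log file settings such as stderr_logfile are left out here and would need to be added per program.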
    

      

     Starting supervisor

     # supervisord -c /python2.7/conf/supervisor.conf

    Note: after changing config.json, reload supervisor:

    # supervisorctl reload

    pyspider is now fully installed.

    Logging in to pyspider

    http://ip:5555/

      

    Troubleshooting:

    ImportError: pycurl: libcurl link-time ssl backend (nss) is different from compile-time ssl backend (none/other)

    # pip uninstall pycurl
    # export PYCURL_SSL_LIBRARY=nss
    # pip install pycurl

    ImportError: No module named _sqlite3

    # find / -name _sqlite*.so
    /usr/lib64/python2.7/lib-dynload/_sqlite3.so
    /usr/lib64/python2.7/site-packages/_sqlitecache.so

    # cp /usr/lib64/python2.7/lib-dynload/_sqlite3.so /python2.7/lib/python2.7/lib-dynload/

    
    
  • Original article: https://www.cnblogs.com/kongzhagen/p/6495130.html