• Installing Torque on Ubuntu


     This walkthrough follows (and is largely translated from) https://jabriffa.wordpress.com/2015/02/11/installing-torquepbs-job-scheduler-on-ubuntu-14-04-lts/ , together with https://linuxcluster.wordpress.com/2012/04/01/enabling-torque-for-email-notification/ for email notification.

    The procedure sets up the current machine as server, compute node, scheduler, and submission host all at once.

    Step 1: Install Torque from the Ubuntu repositories

    apt-get install torque-server torque-client torque-mom torque-pam
    

      This installs the older Torque 2.4.16 release. Answer Yes to every prompt.
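
Before moving on, it can help to confirm that the Torque binaries actually landed on PATH. The sketch below is only an illustration: the helper name `check_cmds` is hypothetical, and the command list assumes the standard Torque command names (pbs_server, pbs_mom, pbs_sched, qmgr, qsub, qstat).

```shell
# Hypothetical helper: report which of the given commands are on PATH.
# Returns nonzero if at least one of them is missing.
check_cmds() {
    local cmd missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed command names; adjust the list if your packages differ.
check_cmds pbs_server pbs_mom pbs_sched qmgr qsub qstat \
    || echo "some Torque commands are not installed"
```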

    Step 2: Stop the services that were started by default

    /etc/init.d/torque-mom stop
    /etc/init.d/torque-scheduler stop
    /etc/init.d/torque-server stop
    pbs_server -t create
    

      Then kill the pbs_server process that the last command left running:

    killall pbs_server
    

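      This step matters: without it, the changes made below are overwritten the next time pbs_server restarts.

    Step 3: Panther currently has only an IP address and no FQDN, so I picked panther.ncsu as its domain name.

    (Note: according to the referenced blog post, this has to be a two-part name of the form server.domain; otherwise problems may show up later.)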

    echo panther.ncsu > /etc/torque/server_name
    echo panther.ncsu > /var/spool/torque/server_priv/acl_svr/acl_hosts
    echo root@panther.ncsu > /var/spool/torque/server_priv/acl_svr/operators
    echo root@panther.ncsu > /var/spool/torque/server_priv/acl_svr/managers
    

      Also add this line to /etc/hosts:

    10.123.32.** panther.ncsu
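
A quick way to confirm the /etc/hosts entry is in place is to scan the file for the chosen name. This is a minimal sketch; the helper name `has_host_entry` is hypothetical, and it only checks that the name appears as a host or alias on a non-comment line, not that the IP address is correct.

```shell
# Hypothetical helper: succeed if hosts file $1 contains hostname $2
# as a name or alias on a non-comment line.
has_host_entry() {
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }            # skip full-line comments
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # stop at a trailing comment
                if ($i == name) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

if has_host_entry /etc/hosts panther.ncsu; then
    echo "panther.ncsu is listed in /etc/hosts"
else
    echo "panther.ncsu is not listed in /etc/hosts yet"
fi
```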
    

      

    Step 4: Register the machine itself as a compute node

    echo "panther.ncsu np=4" > /var/spool/torque/server_priv/nodes
    

      Adjust np (the number of processors on the node) to match the actual machine.
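
Rather than hard-coding np, the line can be generated from the machine's core count. The sketch below assumes GNU coreutils' `nproc` is available (it is on stock Ubuntu); the helper name `nodes_line` is hypothetical, and the script only prints the line instead of writing the nodes file.

```shell
# Hypothetical helper: print a nodes-file line "HOST np=N".
# N defaults to the number of processing units reported by nproc.
nodes_line() {
    local host="$1"
    local np="${2:-$(nproc)}"
    echo "$host np=$np"
}

# Preview only; as root, redirect the output into
# /var/spool/torque/server_priv/nodes to apply it.
nodes_line panther.ncsu
```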

    Tell pbs_mom (the MOM daemon) where the compute node is:

    echo panther.ncsu > /var/spool/torque/mom_priv/config
    

    Step 5: Restart the Torque services

    /etc/init.d/torque-server start
    /etc/init.d/torque-scheduler start
    /etc/init.d/torque-mom start
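
After the restart, a quick check that the three daemons are alive saves debugging later. This is a sketch under assumptions: the helper name `daemon_status` is hypothetical, the process names pbs_server, pbs_sched, and pbs_mom are the usual Torque daemon names, and `pgrep` comes from the procps package.

```shell
# Hypothetical helper: report whether each named process is running.
daemon_status() {
    local d
    for d in "$@"; do
        if pgrep -x "$d" >/dev/null 2>&1; then
            echo "$d: running"
        else
            echo "$d: not running"
        fi
    done
}

daemon_status pbs_server pbs_sched pbs_mom
# With the daemons up, `pbsnodes -a` should list panther.ncsu and its state.
```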
    

    Step 6: Set the PBS parameters

    qmgr -c 'set server scheduling = true'
    qmgr -c 'set server keep_completed = 1000' # keep finished jobs visible for 1000 seconds
    qmgr -c 'set server mom_job_sync = true'
    qmgr -c 'create queue std' # create the std queue
    qmgr -c 'set queue std queue_type = execution'
    qmgr -c 'set queue std started = true'
    qmgr -c 'set queue std enabled = true'
    qmgr -c 'set queue std resources_default.walltime = 10:00:00'
    qmgr -c 'set queue std resources_default.nodes = 1'
    qmgr -c 'set server default_queue = std'
    

     Also set the submission hosts:

    qmgr -c 'set server submit_hosts = panther'
    qmgr -c 'set server allow_node_submit = true'
    

     The domain name chosen above is panther.ncsu; here only its host part, panther, is used as the submission host.
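
Every queue command has to name the same queue, and a typo there makes qmgr reject the command. One way to keep the name consistent is to generate the commands from a single variable. The sketch below assumes the queue is called std, as in the create queue line; the helper name `queue_cmds` is hypothetical, and it only prints the commands.

```shell
# Hypothetical helper: print the qmgr commands for one execution queue,
# so the queue name is spelled identically in every command.
queue_cmds() {
    local q="$1"
    cat <<EOF
qmgr -c 'create queue $q'
qmgr -c 'set queue $q queue_type = execution'
qmgr -c 'set queue $q started = true'
qmgr -c 'set queue $q enabled = true'
qmgr -c 'set server default_queue = $q'
EOF
}

queue_cmds std            # dry run: prints the commands without running them
# queue_cmds std | sh     # as root: actually run them
# qmgr -c 'print server'  # afterwards: dump the resulting configuration
```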

    Step 7: Submit a test job

    #! /bin/bash
    
    #PBS -q std
    #PBS -m bea
    #PBS -M *****@foxmail.com
    #PBS -j oe
    #PBS -o oe.$PBS_JOBID
    #PBS -l walltime=1000:00:00
    #PBS -N arch-data
    
    echo -e "\n-----------------------------------------------------------------------"
    echo -e " Environment variables:"
    echo -e "-----------------------------------------------------------------------\n\n\n"
    printenv
    echo -e "\n-----------------------------------------------------------------------\n\n\n"
    
    #export OMP_NUM_THREADS=1
    
    #cd /lustre/or-hydra/cades-virtues/z8j/run/ZrH2/vasp/supercell/disps/disp-001/
    cd $PBS_O_WORKDIR
    
    date
    
    #mv fropho_calc/ /home/zjyx/data_drive/data-archived/ZrH2/
    #cp pbs.sh /home/zjyx/data_drive/data-archived/ZrH2/
    echo -e "\n-----------------------------------------------------------------------"
    echo -e " stdout + stderr:"
    echo -e "-----------------------------------------------------------------------\n\n\n"
    tar zcvf frophon.tgz fropho_calc
    echo -e "\n-----------------------------------------------------------------------\n\n\n"
    wait
    
    date
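
To run the test, save the script (here assumed to be named pbs.sh, as in the commented cp line) in a directory that contains fropho_calc and hand it to qsub. The wrapper below is a minimal sketch; the name `submit_job` is hypothetical, and the predicted output file name simply mirrors the `#PBS -o oe.$PBS_JOBID` directive.

```shell
# Hypothetical wrapper: submit a script with qsub and report the job ID
# together with the output file name implied by "#PBS -o oe.$PBS_JOBID".
submit_job() {
    local script="$1"
    local jobid
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found; is the Torque client installed?" >&2
        return 1
    fi
    jobid=$(qsub "$script") || return 1
    echo "submitted $jobid; output expected in oe.$jobid"
}

# Usage, from the directory holding the job script:
#   submit_job pbs.sh
#   qstat              # list queued and running jobs
```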
    

    Result:

    Appendix. Setting up email notification with ssmtp: https://help.ubuntu.com/community/EmailAlerts

    Errors and solutions:

    1. Errors:

     Unable to copy file /var/spool/torque/spool/15.panther.ncsu.OU to zjyx@Panther:/home/zjyx/work/tests/pbs/fdm/oe.15.panther.ncsu
    *** error from copy
    Host key verification failed.
    lost connection
    *** end error output
    Output retained on that host in: /var/spool/torque/undelivered/15.panther.ncsu.OU

    Solution (see http://torqueusers.supercluster.narkive.com/Ut2n70R1/host-key-verification-failed):

    Delete ~/.ssh/known_hosts, then ssh to every name Torque may use for the node so that the host keys are accepted again. In my case I ran ssh panther.ncsu, ssh localhost, ssh Panther, and ssh panther.
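
A gentler alternative to deleting the whole known_hosts file is to drop only the cached keys for the names involved, using OpenSSH's `ssh-keygen -R`. This is a sketch, not the method used above: the helper name `forget_cmds` is hypothetical, and it only prints the commands so they can be reviewed before being run by hand.

```shell
# Hypothetical helper: print one "ssh-keygen -R NAME" command per host name.
# Each command removes the cached keys for that name from ~/.ssh/known_hosts.
forget_cmds() {
    local h
    for h in "$@"; do
        echo "ssh-keygen -R $h"
    done
}

forget_cmds panther.ncsu localhost Panther panther
# After running the printed commands, ssh to each name once
# to accept the current host key.
```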

  • Original post: https://www.cnblogs.com/zjyx/p/8448279.html