This procedure follows (and is largely translated from) this post: https://jabriffa.wordpress.com/2015/02/11/installing-torquepbs-job-scheduler-on-ubuntu-14-04-lts/ and this post: https://linuxcluster.wordpress.com/2012/04/01/enabling-torque-for-email-notification/ .
It sets up the current machine as server, compute node, scheduler and submission host all at once.
Step 1: Install Torque from the Ubuntu repositories
apt-get install torque-server torque-client torque-mom torque-pam
Step 2: Stop the default services that are currently running
/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop
pbs_server -t create
killall pbs_server
Step 3: Panther currently has no FQDN, only an IP address, so I chose panther.ncsu as its domain name.
(Note: according to the referenced blog, a two-part name of the form server.domain is needed here; otherwise problems may show up in later steps.)
echo panther.ncsu > /etc/torque/server_name
echo panther.ncsu > /var/spool/torque/server_priv/acl_svr/acl_hosts
echo root@panther.ncsu > /var/spool/torque/server_priv/acl_svr/operators
echo root@panther.ncsu > /var/spool/torque/server_priv/acl_svr/managers
Then map the name to the machine's IP address in /etc/hosts:
10.123.32.** panther.ncsu
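The four echo commands of Step 3 all have to carry the same name, so they can be wrapped in one small function. This is only a sketch: the function name write_torque_identity and its optional prefix argument are hypothetical (not part of Torque), and the prefix exists only so the result can be inspected in a scratch directory before touching the real /etc and /var/spool.

```shell
#!/bin/bash
# Sketch: write the Torque server name and ACL files from Step 3.
# write_torque_identity and its prefix argument are hypothetical helpers.
write_torque_identity() {
    local name="$1"        # e.g. panther.ncsu
    local root="${2:-}"    # empty = real system paths; a scratch dir = dry run
    local acl="$root/var/spool/torque/server_priv/acl_svr"

    mkdir -p "$root/etc/torque" "$acl"
    echo "$name"      > "$root/etc/torque/server_name"
    echo "$name"      > "$acl/acl_hosts"
    echo "root@$name" > "$acl/operators"
    echo "root@$name" > "$acl/managers"
}
```

For a dry run, call write_torque_identity panther.ncsu /tmp/torque-dry-run and look at the files; as root, write_torque_identity panther.ncsu writes to the real locations.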
Step 4: Register the machine itself as a compute node
echo "panther.ncsu np=4" > /var/spool/torque/server_priv/nodes
Tell the MOM daemon where the compute node is:
echo panther.ncsu > /var/spool/torque/mom_priv/config
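The np=4 value in the nodes file is the number of processors Torque may schedule on the node. Assuming it is meant to match the machine's core count, the entry can be generated rather than typed. A minimal sketch; node_line is a hypothetical helper name, and nproc comes from GNU coreutils.

```shell
#!/bin/bash
# Sketch: build the nodes-file entry from the detected core count.
# node_line is a hypothetical helper, not a Torque command.
node_line() {
    local name="$1"
    local cores="${2:-$(nproc)}"   # default: processors available to this shell
    echo "$name np=$cores"
}

# Print the entry first; once it looks right, redirect it to
# /var/spool/torque/server_priv/nodes as in Step 4.
node_line panther.ncsu
```

Passing the count explicitly (node_line panther.ncsu 4) reproduces the line used above.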
Step 5: Restart the Torque services
/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start
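Before moving on it is worth checking that the three daemons actually came up. The sketch below assumes the usual Torque process names (pbs_server, pbs_sched, pbs_mom); check_torque_daemons is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: report whether each Torque daemon is running, then list the nodes.
# check_torque_daemons is a hypothetical helper, not a Torque command.
check_torque_daemons() {
    local daemon
    for daemon in pbs_server pbs_sched pbs_mom; do
        if pgrep -x "$daemon" > /dev/null 2>&1; then
            echo "$daemon: running"
        else
            echo "$daemon: NOT running"
        fi
    done

    # pbsnodes ships with the Torque client tools; skip it if not on PATH.
    if command -v pbsnodes > /dev/null; then
        pbsnodes -a
    fi
}

check_torque_daemons
```

On a healthy idle setup, pbsnodes -a normally reports the node with state = free; a node shown as down usually means pbs_mom cannot reach the server under the configured name.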
Step 6: Set the PBS parameters
qmgr -c 'set server scheduling = true'
qmgr -c 'set server keep_completed = 1000' # keep completed jobs listed for 1000 seconds
qmgr -c 'set server mom_job_sync = true'
qmgr -c 'create queue std' # create the std queue
qmgr -c 'set queue std queue_type = execution'
qmgr -c 'set queue std started = true'
qmgr -c 'set queue std enabled = true'
qmgr -c 'set queue std resources_default.walltime = 10:00:00'
qmgr -c 'set queue std resources_default.nodes = 1'
qmgr -c 'set server default_queue = std'
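The queue name has to be identical in every command of this step, otherwise the queue that gets created is never started or enabled. One way to rule that out is to generate the directives from a single variable and feed them to qmgr on standard input. A sketch only; queue_directives is a hypothetical helper name, and the attribute values are the ones used in this step.

```shell
#!/bin/bash
# Sketch: emit the qmgr directives for one execution queue.
# queue_directives is a hypothetical helper, not a Torque command.
queue_directives() {
    local q="$1"
    cat <<EOF
create queue $q
set queue $q queue_type = execution
set queue $q started = true
set queue $q enabled = true
set queue $q resources_default.walltime = 10:00:00
set queue $q resources_default.nodes = 1
set server default_queue = $q
EOF
}

# Review the directives; when they look right, pipe them into qmgr as root:
#   queue_directives std | qmgr
queue_directives std
```

Afterwards, qmgr -c 'print server' shows the resulting configuration, which makes a mismatched queue name easy to spot.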
Also set up the submission pool:
qmgr -c 'set server submit_hosts = panther'
qmgr -c 'set server allow_node_submit = true'
The domain name chosen above is panther.ncsu; here the short host name, panther, is what goes into the submission pool.
Step 8: Submit a test job
#! /bin/bash
#PBS -q std
#PBS -m bea
#PBS -M *****@foxmail.com
#PBS -j oe
#PBS -o oe.$PBS_JOBID
#PBS -l walltime=1000:00:00
#PBS -N arch-data

echo -e " -----------------------------------------------------------------------"
echo -e " Environment variables:"
echo -e "----------------------------------------------------------------------- "
printenv
echo -e " ----------------------------------------------------------------------- "

#export OMP_NUM_THREADS=1
#cd /lustre/or-hydra/cades-virtues/z8j/run/ZrH2/vasp/supercell/disps/disp-001/
cd $PBS_O_WORKDIR
date
#mv fropho_calc/ /home/zjyx/data_drive/data-archived/ZrH2/
#cp pbs.sh /home/zjyx/data_drive/data-archived/ZrH2/

echo -e " -----------------------------------------------------------------------"
echo -e " stdout + stderr:"
echo -e "----------------------------------------------------------------------- "
tar zcvf frophon.tgz fropho_calc
echo -e " ----------------------------------------------------------------------- "
wait
date
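The script above depends on a fropho_calc directory, so for a first check of the installation a much smaller job may be easier to debug. The sketch below writes one and submits it only if qsub is on the PATH; write_test_job and the file name hello.pbs are hypothetical, and the queue defaults to the std queue from Step 6.

```shell
#!/bin/bash
# Sketch: write a minimal test job and submit it if qsub is available.
# write_test_job and hello.pbs are hypothetical names.
write_test_job() {
    local file="$1" queue="${2:-std}"
    cat > "$file" <<EOF
#!/bin/bash
#PBS -q $queue
#PBS -N hello
#PBS -j oe
#PBS -l walltime=00:01:00
cd \$PBS_O_WORKDIR
hostname
date
EOF
}

write_test_job hello.pbs
if command -v qsub > /dev/null; then
    jobid=$(qsub hello.pbs)   # qsub prints the identifier of the new job
    qstat "$jobid"            # show the job's current state
fi
```

Submit as a regular user rather than root. When the job finishes, its combined stdout and stderr should be copied back to the directory it was submitted from.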
Appendix. Setting up email notification with ssmtp: https://help.ubuntu.com/community/EmailAlerts
Errors and solutions:
1. Errors:
Unable to copy file /var/spool/torque/spool/15.panther.ncsu.OU to zjyx@Panther:/home/zjyx/work/tests/pbs/fdm/oe.15.panther.ncsu
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/15.panther.ncsu.OU
Solutions (see http://torqueusers.supercluster.narkive.com/Ut2n70R1/host-key-verification-failed):
Delete ~/.ssh/known_hosts, then ssh once to every host name Torque may use for the node, so that the host keys are accepted again. In my case, I ran ssh panther.ncsu, ssh localhost, ssh Panther, and ssh panther.
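A narrower alternative to deleting the whole known_hosts file is to remove only the entries for the names involved, using ssh-keygen -R. This is a sketch, not the fix described in the linked thread; refresh_host_keys is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: drop stale host keys for selected names from a known_hosts file.
# refresh_host_keys is a hypothetical helper, not part of OpenSSH or Torque.
refresh_host_keys() {
    local known_hosts="$1"; shift
    local host
    for host in "$@"; do
        # ssh-keygen -R removes every key stored for $host in the given file;
        # a name that is not present is simply skipped.
        ssh-keygen -R "$host" -f "$known_hosts" > /dev/null 2>&1 || true
    done
}
```

For the names above, that would be refresh_host_keys ~/.ssh/known_hosts panther.ncsu localhost Panther panther, followed by one interactive ssh to each name to accept the current key.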