• RHEL 6 TORQUE Job Scheduling System Configuration Guide


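    System environment: RHEL 6 x86_64, iptables and SELinux disabled

    Hosts: 192.168.122.121 server21.example.com compute node

    192.168.122.173 server73.example.com compute node

    192.168.122.135 server35.example.com scheduler node (note: clocks must be synchronized across nodes)

    Required packages: icpi-64, torque-4.1.2.tar.gz

    Download URL: http://www.clusterresources.com/downloads/torque/


    Note: first set up passwordless SSH between the scheduler node and all compute nodes; for the procedure, see Rhel6-mpich2 hpc集群.pdf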
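As a rough sketch of that SSH prerequisite, the loop below prints the commands that would create a key pair and copy the public key to each compute node. The names `run`, `setup_ssh_keys`, and `DRY_RUN` are illustrative and do not come from the original guide. With `DRY_RUN=1` (the default here) nothing is executed.

```shell
#!/bin/bash
# Hypothetical helper: distribute root's public key to the compute nodes.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.
NODES="server21.example.com server73.example.com"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_ssh_keys() {
    # Create a key pair without a passphrase if none exists yet.
    [ -f "$HOME/.ssh/id_rsa" ] || run ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
    for node in $NODES; do
        run ssh-copy-id "root@$node"
    done
}

setup_ssh_keys
```

Once the printed commands look right, rerunning with `DRY_RUN=0` on server35 performs the actual key distribution; each `ssh-copy-id` call asks for the remote root password once.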


    # Install and configure TORQUE

    Perform the following steps on server35 (the scheduler node):

    [root@server35 kernel]# tar zxf torque-4.1.2.tar.gz

    [root@server35 kernel]# cd torque-4.1.2

    [root@server35 torque-4.1.2]# ./configure --with-rcp=scp --with-default-server=server35.example.com

    At this point configure may stop with the following errors, one at a time:

    (1)configure: error: no acceptable C compiler found in $PATH

    (2)configure: error: cannot find a make command

    (3)configure: error: TORQUE needs lib openssl-devel in order to build

    (4)configure: error: TORQUE needs libxml2-devel in order to build


    To resolve them, install the missing packages:

    [root@server35 torque-4.1.2]# yum install gcc -y

    [root@server35 torque-4.1.2]# yum install make -y

    [root@server35 torque-4.1.2]# yum install openssl-devel -y

    [root@server35 torque-4.1.2]# yum install libxml2-devel -y
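Before rerunning configure, it can help to check which build tools are still missing. The following is a minimal sketch; the function name `missing_commands` is an assumption made for illustration, and it only covers tools that are commands on `PATH` (errors 1 and 2).

```shell
#!/bin/bash
# Hypothetical pre-flight check: print every listed command that is not on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# gcc and make correspond to configure errors (1) and (2).
missing=$(missing_commands gcc make)
if [ -n "$missing" ]; then
    echo "missing build tools:" $missing
else
    echo "gcc and make are available"
fi
```

The two -devel packages are libraries rather than commands, so they are better checked with `rpm -q openssl-devel libxml2-devel`. All four packages can also be installed in one step with `yum install -y gcc make openssl-devel libxml2-devel`.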


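    [root@server35 torque-4.1.2]# make && make install (TORQUE's configuration directory: /var/spool/torque)

    [root@server35 torque-4.1.2]# make packages (builds the installer packages for the compute nodes, here server21 and server73; make sure all compute nodes and the server node share the same architecture)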

    torque-package-clients-linux-x86_64.sh

    torque-package-devel-linux-x86_64.sh

    torque-package-doc-linux-x86_64.sh

    torque-package-mom-linux-x86_64.sh

    torque-package-server-linux-x86_64.sh

    [root@server35 torque-4.1.2]# cd contrib/init.d/

    [root@server35 init.d]# cp pbs_server /etc/init.d/

    [root@server35 init.d]# cp pbs_sched /etc/init.d/

    [root@server35 init.d]# cp pbs_mom /etc/init.d/ (copy this only if the scheduler node also runs compute jobs)

     

    [root@server35 init.d]# scp pbs_mom 192.168.122.121:/etc/init.d/

    [root@server35 init.d]# scp pbs_mom 192.168.122.173:/etc/init.d/

     

    [root@server35 init.d]# cd /root/kernel/torque-4.1.2

    [root@server35 torque-4.1.2]# ./torque.setup root (sets root as the TORQUE administrator account)

    [root@server35 torque-4.1.2]# vim /var/spool/torque/server_priv/nodes

    server21.example.com

    server73.example.com (lists the compute nodes; the server node can be added here if it should also run jobs)
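The nodes file can also be generated rather than edited by hand. The sketch below writes one line per host and adds the optional `np=` attribute, which declares how many processor slots a node offers. The function name `write_nodes_file` and the scratch path are assumptions for illustration; on the scheduler node the real target is /var/spool/torque/server_priv/nodes.

```shell
#!/bin/bash
# Hypothetical helper: write a TORQUE nodes file, one "hostname np=N" line per node.
# NODES_FILE defaults to a scratch path so that the sketch can be tried safely.
NODES_FILE="${NODES_FILE:-/tmp/nodes.example}"

write_nodes_file() {
    : > "$NODES_FILE"                       # truncate or create the file
    for host in "$@"; do
        echo "$host np=1" >> "$NODES_FILE"  # np=1: one processor slot per node
    done
}

write_nodes_file server21.example.com server73.example.com
cat "$NODES_FILE"
```

Omitting `np=` altogether, as the guide does, leaves each node with a single slot, which is why pbsnodes reports np = 1 in the test results.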

    [root@server35 torque-4.1.2]# scp torque-package-clients-linux-x86_64.sh torque-package-mom-linux-x86_64.sh root@192.168.122.121:/root/kernel/

    [root@server35 torque-4.1.2]# scp torque-package-clients-linux-x86_64.sh torque-package-mom-linux-x86_64.sh root@192.168.122.173:/root/kernel/

    [root@server35 torque-4.1.2]# qterm -t quick (stop TORQUE)

    [root@server35 torque-4.1.2]# /etc/init.d/pbs_server start (start TORQUE)

    [root@server35 torque-4.1.2]# /etc/init.d/pbs_sched start (start the scheduler)

     

     

    Perform the following steps on server21 and server73 (all compute nodes):

    [root@server21 kernel]# ./torque-package-clients-linux-x86_64.sh --install

    [root@server21 kernel]# ./torque-package-mom-linux-x86_64.sh --install

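    Note: if a compute node's architecture differs from the server node's, build TORQUE from source on that node instead:

    tar zxf torque-4.1.2.tar.gz

    cd torque-4.1.2

    ./configure --with-rcp=rcp --with-default-server=server35.example.com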

    make

    make install_mom install_clients

    [root@server21 kernel]# vim /var/spool/torque/mom_priv/config

    $pbsserver server35.example.com

    $logevent 255
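When the same configuration has to be written on several compute nodes, a here-document is a possible alternative to editing the file in vim. This is a sketch only: the `MOM_CONFIG` variable and its scratch default are assumptions, and the real file is /var/spool/torque/mom_priv/config.

```shell
#!/bin/bash
# Hypothetical helper: write the pbs_mom configuration shown above.
# MOM_CONFIG defaults to a scratch path so that the sketch can be tried safely.
MOM_CONFIG="${MOM_CONFIG:-/tmp/mom_config.example}"

# The quoted 'EOF' keeps the shell from expanding $pbsserver and $logevent,
# which are TORQUE directives, not shell variables.
cat > "$MOM_CONFIG" <<'EOF'
$pbsserver server35.example.com
$logevent 255
EOF

cat "$MOM_CONFIG"
```

With an unquoted `EOF` the shell would replace `$pbsserver` and `$logevent` with empty strings, leaving a config file that pbs_mom cannot use.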

    [root@server21 kernel]# /etc/init.d/pbs_mom start

    [root@server21 kernel]# su - lmx

    [lmx@server21 ~]$ mpdboot -n 2 -f mpd.hosts


    Pre-test configuration:

    Note: TORQUE jobs must be submitted as a non-root user

    [root@server35 ~]# su - lmx

    [lmx@server35 ~]$ vim job1.pbs (serial job)

    #!/bin/bash

    #PBS -N job1

    #PBS -o job1.log

    #PBS -e job1.err

    #PBS -q batch

    cd /home/lmx

    echo Running on hosts `hostname`

    echo Time is `date`

    echo Directory is $PWD

    echo This job runs on the following nodes:

    cat $PBS_NODEFILE

    echo This job has allocated 1 node

    ./prog

    [lmx@server35 ~]$ vim job2.pbs (parallel job)

    #!/bin/bash

    #PBS -N job2

    #PBS -o job2.log

    #PBS -e job2.err

    #PBS -q batch

    #PBS -l nodes=2

    cd /home/lmx

    echo Time is `date`

    echo Directory is $PWD

    echo This job runs on the following nodes:

    cat $PBS_NODEFILE

    NPROCS=`wc -l < $PBS_NODEFILE`

    echo This job has allocated $NPROCS nodes

    mpiexec -machinefile $PBS_NODEFILE -np $NPROCS ./prog
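To see what `NPROCS` evaluates to without submitting a job, the node file can be simulated. In the sketch below a temporary file stands in for `$PBS_NODEFILE`, which TORQUE normally creates for each job; the `NHOSTS` variable is an addition for illustration and is not used by job2.pbs.

```shell
#!/bin/bash
# Simulate $PBS_NODEFILE (normally created by TORQUE for each job).
PBS_NODEFILE=$(mktemp)
printf '%s\n' server73.example.com server21.example.com > "$PBS_NODEFILE"

# One line per allocated processor slot, so the line count is the process count.
NPROCS=$(wc -l < "$PBS_NODEFILE")
NPROCS=$((NPROCS))   # normalise: some wc implementations pad with spaces

# Distinct hosts can differ from NPROCS when a node offers several slots.
NHOSTS=$(sort -u "$PBS_NODEFILE" | wc -l)
NHOSTS=$((NHOSTS))

echo "processes: $NPROCS, distinct hosts: $NHOSTS"
rm -f "$PBS_NODEFILE"
```

For the two single-slot nodes in this cluster both values are 2, which is the number passed to `mpiexec -np`.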

    [lmx@server35 ~]$ vim prog

    #!/bin/bash

    echo 1000000000 | ./icpi-64 (the icpi program ships with MPI; just copy it over)

    [lmx@server35 ~]$ chmod +x prog

     

    qsub jobx.pbs (submit a job)

    qstat (view jobs)

    pbsnodes (view nodes)
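For scripted checks, the one-letter job state can be pulled out of the qstat listing. The following is a sketch under the assumption that the default qstat column layout is in use, where the state (for example Q for queued, R for running, C for completed) is the second-to-last field; the function name `job_state` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read qstat output on stdin and print the state column
# (the S column, second-to-last field) for the given job id.
job_state() {
    awk -v id="$1" '$1 == id { print $(NF - 1) }'
}

# Sample listing for a running job; on a live system use:
#   qstat | job_state 10.server35
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
10.server35               job1             lmx                    0 R batch'

echo "$sample" | job_state 10.server35
```

The function prints nothing when the job id is not in the listing, so an empty result can be treated as "job unknown" by the caller.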




    Test results:

    [lmx@server35 ~]$ qsub job1.pbs (submit the serial job)

    10.server35.example.com

    [lmx@server35 ~]$ qstat

    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    10.server35               job1             lmx                    0 R batch

    [lmx@server35 ~]$ pbsnodes

    server21.example.com

    state = job-exclusive

    np = 1

    ntype = cluster

    jobs = 0/10.server35.example.com

    status = rectime=1375075596,varattr=,jobs=,state=free,netload=18001357,gres=,loadave=0.00,ncpus=1,physmem=285532kb,availmem=1196472kb,totmem=1301332kb,idletime=7413,nusers=0,nsessions=0,uname=Linux server21.example.com 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64,opsys=linux

    mom_service_port = 15002

    mom_manager_port = 15003

    gpus = 0


    server73.example.com

    state = free

    np = 1

    ntype = cluster

    status = rectime=1375075593,varattr=,jobs=,state=free,netload=18502638,gres=,loadave=0.00,ncpus=1,physmem=285532kb,availmem=1194920kb,totmem=1301332kb,idletime=12865,nusers=0,nsessions=0,uname=Linux server73.example.com 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64,opsys=linux

    mom_service_port = 15002

    mom_manager_port = 15003

    gpus = 0

    [lmx@server35 ~]$ cat job1.log (view the computation result)

    Running on hosts server21.example.com

    Time is Mon Jul 29 13:26:58 CST 2013

    Directory is /home/lmx

    This job runs on the following nodes:

    server21.example.com

    This job has allocated 1 node

    Enter the number of intervals: (0 quits) pi is approximately 3.1415926535899708, Error is 0.0000000000001776

    wall clock time = 31.147027

    Enter the number of intervals: (0 quits) No number entered; quitting



    [lmx@server35 ~]$ qsub job2.pbs (submit the parallel job)

    11.server35.example.com

    [lmx@server35 ~]$ qstat

    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    10.server35               job1             lmx             00:00:31 C batch
    11.server35               job2             lmx                    0 R batch

    [lmx@server35 ~]$ pbsnodes

    server21.example.com

    state = job-exclusive

    np = 1

    ntype = cluster

    jobs = 0/11.server35.example.com

    status = rectime=1375075821,varattr=,jobs=,state=free,netload=18314029,gres=,loadave=0.02,ncpus=1,physmem=285532kb,availmem=1196340kb,totmem=1301332kb,idletime=7638,nusers=1,nsessions=2,sessions=1209 2980,uname=Linux server21.example.com 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64,opsys=linux

    mom_service_port = 15002

    mom_manager_port = 15003

    gpus = 0


    server73.example.com

    state = job-exclusive

    np = 1

    ntype = cluster

    jobs = 0/11.server35.example.com

    status = rectime=1375075818,varattr=,jobs=,state=free,netload=18756208,gres=,loadave=0.00,ncpus=1,physmem=285532kb,availmem=1194860kb,totmem=1301332kb,idletime=13090,nusers=0,nsessions=0,uname=Linux server73.example.com 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64,opsys=linux

    mom_service_port = 15002

    mom_manager_port = 15003

    gpus = 0
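The pbsnodes listings are long, so a small filter that reduces them to one "node state" pair per line can make them easier to scan. This is a sketch, not part of TORQUE: the function name `node_states` is hypothetical, and the sample is abbreviated from the listing taken while job1 was running.

```shell
#!/bin/bash
# Hypothetical helper: condense pbsnodes output (stdin) to "node state" pairs.
node_states() {
    awk '
        NF == 1 && $1 !~ /=/ { node = $1 }           # a bare host name starts a record
        $1 == "state" && $2 == "=" { print node, $3 }
    '
}

sample='server21.example.com
     state = job-exclusive
     np = 1
server73.example.com
     state = free
     np = 1'

echo "$sample" | node_states
```

On a live system the same filter would be used as `pbsnodes | node_states`. The long status line is ignored because its first field is `status`, not `state`.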

    [lmx@server35 ~]$ cat job2.log (view the computation result)

    Time is Mon Jul 29 13:30:43 CST 2013

    Directory is /home/lmx

    This job runs on the following nodes:

    server73.example.com

    server21.example.com

    This job has allocated 2 nodes

    Enter the number of intervals: (0 quits) pi is approximately 3.1415926535900072, Error is 0.0000000000002141

    wall clock time = 16.151319

    Enter the number of intervals: (0 quits) No number entered; quitting
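The two wall clock times can be compared directly. The sketch below computes the speedup (serial time divided by parallel time) and the parallel efficiency (speedup divided by the number of processes) from the values in job1.log and job2.log; the function names are illustrative.

```shell
#!/bin/bash
# Speedup = serial wall time / parallel wall time.
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# Efficiency = speedup / number of processes, printed as a percentage.
efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN { printf "%.0f%%\n", 100 * t1 / tn / n }'
}

speedup 31.147027 16.151319        # prints 1.93
efficiency 31.147027 16.151319 2   # prints 96%
```

A speedup of about 1.93 on two nodes means the parallel run scaled almost linearly for this problem size.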



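    Notes:

    1. For testing, make sure the user lmx exists on every node.

    2. Start the mpd daemons as the lmx user ([lmx@server21 ~]$ mpdboot -n 2 -f mpd.hosts), because during scheduling the job connects to /tmp/mpd2.console_wxh on the compute nodes.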
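The first note can be verified with a short check run on each node. This is a minimal sketch; the function name `user_exists` is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the named account exists on this machine.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

if user_exists lmx; then
    echo "lmx exists on $(uname -n)"
else
    echo "lmx is missing on $(uname -n); create it, e.g. with useradd lmx"
fi
```

The account should also have the same UID on every node if home directories are shared, which `id -u lmx` shows on each machine.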

  • Original article: https://www.cnblogs.com/xautlmx/p/4381198.html