• Server Clusters (8): OpenLava (LSF) Job Limits


    OpenLava (LSF) Job Limits

    1. Limiting the number of jobs per host

    Add a policy for host rd2 to the lsb.hosts file.

    1) Before the change

    82 [fhu@rd2 11:38:04 ~]$ bhosts
    HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
    rd1                unavail         -      1      0      0      0      0      0
    rd2                ok              -      2      0      0      0      0      0
    rd3                ok              -      2      0      0      0      0      0
    

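    2) Add an entry for rd2 to lsb.hosts, setting MXJ (the maximum number of jobs on the host) to 0 and JL/U (the maximum number of jobs per user on the host) to 0

    # lsb.hosts file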
    
    Begin Host
    HOST_NAME     MXJ JL/U   r1m    pg    ls     tmp  DISPATCH_WINDOW  # Keywords
    #host0        1    1   3.5/4.5  15/   12/15  0      ()             # Example
    #host1       ()   2     3.5  15/18   12/    0/  (5:19:00-1:8:30 20:00-8:30)
    #host2        ()   ()   3.5/5   18    15     ()     ()             # Example
    default       !   ()     ()    ()    ()     ()     ()              # Example
    rd2           0   0     ()    ()    ()     ()     ()               # Example
    End Host
    

    3) After the change, run badmin reconfig to apply the configuration; rd2 is now closed and no longer accepts jobs

    84 [fhu@rd2 11:40:15 ~]$ bhosts
    HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
    rd1                unavail         -      1      0      0      0      0      0
    rd2                closed          0      0      0      0      0      0      0
    rd3                ok              -      2      0      0      0      0      0
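    The check in this step can also be scripted. The following is a minimal sketch, assuming the bhosts output has the column layout shown above; the sample text is copied from the listing, and the helper names parse_bhosts and hosts_accepting_jobs are made up for illustration.

```python
# Sample text copied from the bhosts listing above. On a live cluster the
# same text could come from running the bhosts command instead.
BHOSTS_OUTPUT = """\
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
rd1                unavail         -      1      0      0      0      0      0
rd2                closed          0      0      0      0      0      0      0
rd3                ok              -      2      0      0      0      0      0
"""


def parse_bhosts(text):
    """Return one dict per host, keyed by the column names in the header."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def hosts_accepting_jobs(rows):
    """Hosts whose STATUS is ok and whose MAX job count is not 0."""
    return [r["HOST_NAME"] for r in rows
            if r["STATUS"] == "ok" and r["MAX"] != "0"]


rows = parse_bhosts(BHOSTS_OUTPUT)
usable = hosts_accepting_jobs(rows)
print(usable)
```

    With MXJ set to 0, rd2 drops out of the list, which matches the closed status in the listing.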
    

    4) As the listing shows, every job in the RUN state is on rd3

    101 [fhu@rd2 11:42:55 ~]$ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    872     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
    873     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
    874     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 11:42
    875     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 11:42
    876     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 11:42
    

    5) As soon as the rd2 entry is commented out and badmin reconfig is run again, PEND jobs are dispatched to rd2

    102 [fhu@rd2 11:48:06 ~]$ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    872     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
    873     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
    874     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 11:42
    875     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 11:42
    876     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 11:42
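    The RUN and PEND counts in the two bjobs listings can be reproduced with a toy model. This is only a sketch: the dispatch function is hypothetical and fills hosts greedily, which is not how OpenLava's scheduler picks hosts, so only the counts are comparable with the listings, not the host each job lands on.

```python
def dispatch(job_ids, host_slots):
    """Greedy toy model: fill each host up to its MXJ, leave the rest pending.

    Illustration only; this is not OpenLava's scheduling algorithm.
    """
    free = dict(host_slots)  # remaining job slots per host
    placement = {}
    for job in job_ids:
        # Pick the first host that still has a free slot, if any.
        host = next((h for h, n in free.items() if n > 0), None)
        if host is not None:
            free[host] -= 1
        placement[job] = host  # None means the job stays in PEND
    return placement


jobs = [872, 873, 874, 875, 876]

# rd1 is unavail in the article, so it gets no usable slots here.
with_limit = dispatch(jobs, {"rd1": 0, "rd2": 0, "rd3": 2})
without_limit = dispatch(jobs, {"rd1": 0, "rd2": 2, "rd3": 2})

for job in jobs:
    print(job, with_limit[job] or "PEND", without_limit[job] or "PEND")
```

    With MXJ = 0 on rd2 the model leaves three jobs pending; once rd2 has its two slots back, only one job is left pending, as in the listings.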
    

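    2. Limiting the number of jobs per queue

    Set QJOB_LIMIT, the maximum number of jobs in the queue, for the normal queue in the lsb.queues file.

    1) Before the change: bqueues and bjobs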

    104 [fhu@rd2 13:40:37 ~]$ bqueues
    QUEUE_NAME     PRIO      STATUS      MAX  JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
    normal          30    Open:Active      -    -    -    -     0     0     0     0
    
    
    106 [fhu@rd2 13:42:52 ~]$ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    986     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
    987     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
    988     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
    989     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
    990     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:42
    

    2) Set QJOB_LIMIT = 1

    # lsb.queues file
    
    Begin Queue
    QUEUE_NAME   = normal
    PRIORITY     = 30
    NICE         = 20
    #QJOB_LIMIT   = 60              # maximum number of jobs in this queue
    QJOB_LIMIT   = 1               # job limit of the queue
    #UJOB_LIMIT   = 5               # maximum number of jobs per user
    #PJOB_LIMIT   = 2               # maximum number of jobs per processor
    #RUN_WINDOW   = 5:19:00-1:8:30 20:00-8:30
    #r1m         = 0.7/2.0        # loadSched/loadStop
    #r15m         = 1.0/2.5
    #pg           = 4.0/8
    #ut           = 0.2
    #io           = 50/240
    #CPULIMIT     = 180/apple      # CPU usage limit for a job
    #FILELIMIT    = 20000
    #MEMLIMIT     = 5000           # jobs bigger than this (5M) will be niced
    #DATALIMIT    = 20000          # jobs data segment limit
    #STACKLIMIT   = 2048
    #CORELIMIT    = 20000
    #PROCLIMIT    = 5              # job processor limit
    #USERS        = all            # users allowed to submit; UserGroup and User names from lsb.users may be used, separated by spaces
    #HOSTS        = all            # hosts that jobs may be sent to; HostGroup and Host names from lsb.hosts may be used
    #PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
    #POST_EXEC    = /usr/local/lsf/misc/testq_post |grep -v "Hey"
    #REQUEUE_EXIT_VALUES = 55 34 78
    #ROUND_ROBIN_POLICY = y
    #FAIRSHARE = USER_SHARES[[G1,1] [G2,1]]
    #HOSTS_SHARES = [all, 5]
    DESCRIPTION  = For normal low priority jobs, running only if hosts are \
    lightly loaded.
    End Queue
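    Because most lines in this block are commented out, it helps to check which parameters are actually active. The following is a rough sketch, not a full lsb.queues parser: the function name active_queue_params is made up, the sample is an abridged copy of the block above, and the comment handling assumes that values never contain a # character.

```python
# Abridged copy of the queue block above (comments in English).
QUEUE_SECTION = """\
Begin Queue
QUEUE_NAME   = normal
PRIORITY     = 30
NICE         = 20
#QJOB_LIMIT   = 60              # maximum number of jobs in this queue
QJOB_LIMIT   = 1               # job limit of the queue
#UJOB_LIMIT   = 5               # maximum number of jobs per user
DESCRIPTION  = For normal low priority jobs, running only if hosts are \\
lightly loaded.
End Queue
"""


def active_queue_params(text):
    """Return the uncommented KEY = value pairs of one Begin/End Queue block."""
    # Join lines that end with a backslash continuation.
    joined = text.replace("\\\n", "")
    params = {}
    for line in joined.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments (simplification)
        if "=" not in line:
            continue  # skips Begin Queue, End Queue, and blank lines
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


params = active_queue_params(QUEUE_SECTION)
for key, value in params.items():
    print(f"{key} = {value}")
```

    Only QJOB_LIMIT = 1 survives among the limit parameters; the commented-out UJOB_LIMIT line is ignored.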
    

    3) After badmin reconfig takes effect, only 1 job in the queue can be running at a time

    119 [fhu@rd2 13:44:39 ~]$ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    1095    fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:44
    1096    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    1097    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    1098    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    1099    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    1100    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
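    To confirm the limit without counting rows by eye, the bjobs output can be tallied by state. The sketch below assumes the column layout shown above, where the first four whitespace-separated fields are always JOBID, USER, STAT and QUEUE; the function name count_states is made up for illustration.

```python
from collections import Counter

# Sample text copied from the bjobs listing above.
BJOBS_OUTPUT = """\
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1095    fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:44
1096    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1097    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1098    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1099    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1100    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
"""


def count_states(text, queue):
    """Count jobs per STAT value for one queue."""
    counts = Counter()
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # EXEC_HOST is blank for PEND jobs, so only the first four
        # fields are read; they are present on every row.
        if len(fields) >= 4 and fields[3] == queue:
            counts[fields[2]] += 1
    return counts


QJOB_LIMIT = 1
counts = count_states(BJOBS_OUTPUT, "normal")
print(dict(counts), "within limit:", counts["RUN"] <= QJOB_LIMIT)
```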
    

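    3. Limiting the number of jobs per user

    Set MAX_JOBS for the user in the lsb.users file; this caps the number of that user's jobs across all hosts.

    1) Before the change, user fhu can run 4 jobs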

    106 [fhu@rd2 13:42:52 ~]$ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    986     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
    987     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
    988     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
    989     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
    990     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:42
    

    2) Set MAX_JOBS=1

    # lsb.users file
    
    Begin User
    USER_NAME       MAX_JOBS        JL/P
    #develop@        20              8
    #support         50              -
    fhu              1               -
    End User
    

    3) After badmin reconfig takes effect, user fhu has only 1 job running across all hosts

    119 [fhu@rd2 13:44:39 ~]$ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    1095    fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:44
    1096    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    1097    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    1098    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    1099    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    1100    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    1101    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
    

    4) Set MAX_JOBS=3 and run badmin reconfig; user fhu then has 3 jobs running across all hosts

    120 [fhu@rd2 14:13:20 ~]$ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    1308    fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 14:12
    1309    fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 14:12
    1310    fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 14:12
    1311    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
    1312    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
    1313    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
    1314    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
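    Taken together, the three experiments suggest that the number of running jobs is bounded by the tightest applicable limit. The following sketch expresses that reading as a small function. It is a simplification, assuming a single user, a single queue and one slot per job; expected_running and its parameter names are hypothetical, not OpenLava names.

```python
def expected_running(submitted, host_slots, user_max_jobs=None,
                     queue_job_limit=None):
    """Upper bound on RUN jobs: the tightest of the configured limits.

    host_slots lists the MXJ value of each usable host; a limit left
    as None is treated as not configured.
    """
    caps = [submitted, sum(host_slots)]
    if user_max_jobs is not None:
        caps.append(user_max_jobs)
    if queue_job_limit is not None:
        caps.append(queue_job_limit)
    return min(caps)


# Scenarios taken from the listings in this article (rd2 and rd3, 2 slots each).
no_limits = expected_running(5, host_slots=[2, 2])
rd2_closed = expected_running(5, host_slots=[0, 2])
queue_limit_1 = expected_running(6, host_slots=[2, 2], queue_job_limit=1)
user_limit_1 = expected_running(7, host_slots=[2, 2], user_max_jobs=1)
user_limit_3 = expected_running(7, host_slots=[2, 2], user_max_jobs=3)

print(no_limits, rd2_closed, queue_limit_1, user_limit_1, user_limit_3)
```

    Each value matches the number of RUN rows in the corresponding bjobs listing.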
    
  • Original article: https://www.cnblogs.com/linagcheng/p/16401102.html