Slurm Workload Manager - Overview
- https://slurm.schedmd.com/overview.html
- Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.
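- A minimal sketch mapping the three key functions to everyday commands (assumes an interactive test allocation on 2 nodes; the hostname command is just a stand-in for real work):
- $ salloc -N 2          # 1. allocate resources (2 compute nodes) for a period of time
- $ srun -N 2 hostname   # 2. start, execute, and monitor work on the allocated nodes
- $ squeue               # 3. inspect the queue of pending and running work
- $ exit                 # release the allocation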
Slurm Workload Manager - Quick Start User Guide
Slurm Workload Manager - Wikipedia
- https://en.wikipedia.org/wiki/Slurm_Workload_Manager
- The Slurm Workload Manager (formerly known as Simple Linux Utility for Resource Management or SLURM), or Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
- It provides three key functions:
- allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
- providing a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes, and
- arbitrating contention for resources by managing a queue of pending jobs.
- Slurm is the workload manager on about 60% of the TOP500 supercomputers.[1]
- Slurm uses a best fit algorithm based on Hilbert curve scheduling or fat tree network topology in order to optimize locality of task assignments on parallel computers.[2]
Slurm Workload Manager - sacct
sbatch - Submit a batch script to Slurm
- https://slurm.schedmd.com/sbatch.html
- $ sbatch mytestsbatch.sh
- Note: the second srun starts only after the first srun completes (consecutive srun calls in the script run sequentially), so the sleep is not required. For a concurrent variant, see the sketch after the scripts.
# =============================================================================
# mytestscript.sh
# =============================================================================
#!/bin/sh
date &

# =============================================================================
# mytestsbatch.sh
# =============================================================================
#!/bin/sh
#SBATCH -N 2
#SBATCH -n 10

srun -n10 -o testscript1.log mytestscript.sh
sleep 10; srun -n10 -o testscript2.log mytestscript.sh
wait
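- If the two steps should instead run concurrently, a common pattern is to background each srun and wait for both. A minimal sketch of a variant of mytestsbatch.sh, assuming the allocation requests enough tasks (here -n 20) for both steps at once; --exact is the flag on current Slurm, older releases used step-level --exclusive for the same purpose:
#!/bin/sh
#SBATCH -N 2
#SBATCH -n 20

# Background each step so both launch immediately; wait blocks until both finish.
# --exact limits each step to the resources it asks for, so the steps can share
# the allocation instead of the first step grabbing every CPU.
srun -n10 --exact -o testscript1.log mytestscript.sh &
srun -n10 --exact -o testscript2.log mytestscript.sh &
wait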
scancel - Used to signal jobs or job steps that are under the control of Slurm.
- https://slurm.schedmd.com/scancel.html
- $ scancel 123
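- A few more common scancel invocations (a sketch; 123, username, and myjobname are placeholders):
- $ scancel --signal=USR1 123                 # send SIGUSR1 to the job instead of terminating it
- $ scancel --user=username --state=PENDING   # cancel all of a user's pending jobs
- $ scancel --name=myjobname                  # cancel jobs by name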
scontrol - view or modify Slurm configuration and state.
- https://slurm.schedmd.com/scontrol.html
- $ scontrol show job 123
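- Other frequent scontrol uses (a sketch; 123 and node001 are placeholders; raising a job's time limit typically requires operator or admin privileges):
- $ scontrol hold 123                               # put a pending job on hold
- $ scontrol release 123                            # release a held job
- $ scontrol update JobId=123 TimeLimit=2:00:00     # change a job's time limit
- $ scontrol show node node001                      # show a node's state
- $ scontrol show partition                         # show partition definitions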
squeue - view information about jobs located in the Slurm scheduling queue.
- https://slurm.schedmd.com/squeue.html
- $ squeue
- $ squeue -u username
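- Useful filters and output formats (a sketch; username and 123 are placeholders, and the -o string is one example format):
- $ squeue -u username -t RUNNING                          # only that user's running jobs
- $ squeue -j 123                                          # a specific job
- $ squeue --start                                         # expected start times of pending jobs
- $ squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"    # explicit output columns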
srun - Run parallel jobs
- https://slurm.schedmd.com/srun.html
- $ cat testscript.sh
- #!/bin/sh
- python mytest.py --arg test
- $ chmod +x testscript.sh
- $ srun -N5 -n100 testscript.sh
- Run it on 5 nodes with 100 tasks
- $ srun -n5 --nodelist=host1,host2 -o testscript.log testscript.sh
- $ srun -n10 -o testscript.log --begin=now+2hour testscript.sh
- $ srun --begin=now+10 date &
Convenient SLURM Commands | FAS Research Computing
srun: error: --begin is ignored because nodes are already allocated.
- use sleep in lieu of --begin
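- For example, inside a batch script the nodes are already allocated and --begin is ignored, so the step can simply be delayed with sleep. A sketch reusing mytestscript.sh from above (7200 seconds stands in for the two-hour delay):
#!/bin/sh
#SBATCH -N 2
#SBATCH -n 10

# --begin would be ignored here, so delay the step explicitly (~2 hours)
sleep 7200
srun -n10 -o testscript.log mytestscript.sh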
- bash - Can you help me run tasks in parallel in Slurm? - Stack Overflow
srun: error: Unable to create job step: More processors requested than permitted
- In the submission script, you request resources with the #SBATCH directives, and you cannot use more resources than that in the subsequent calls to srun.
- slurm - Questions on alternative ways to run 4 parallel jobs - Stack Overflow
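- A sketch of how to avoid the "More processors requested than permitted" error when running steps in parallel: request enough tasks with #SBATCH and keep the concurrent srun calls within that total (4 tasks here; --exact is current Slurm, older releases used step-level --exclusive; the mytest.py arguments are placeholders):
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 4

# Four 1-task steps fit inside the 4 requested tasks, so all can run concurrently
srun -n1 --exact python mytest.py --arg test1 &
srun -n1 --exact python mytest.py --arg test2 &
srun -n1 --exact python mytest.py --arg test3 &
srun -n1 --exact python mytest.py --arg test4 &
wait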