Runtime environment: SLURM Basics
HPC 101 Shaheen II Training Workshop
Dr Samuel Kortas, Computational Scientist
KAUST Supercomputing [email protected]
19 September 2017
Outline
• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process, most often used SLURM commands
• How to place my program on allocated resources?
• SLURM interesting features
• → comprehensive example in the application talk!
Which types of resources are available?
� Computing power
� Fast access storage (Burst Buffer)
� Software licenses (commercial applications)
� Mass storage (Lustre)
� KSL expertise

One at a time… to guarantee performance and turnaround time:
� Computing power (1 full node per user)
� Fast access storage (reserved space)
� Software licenses (tokens)
� Mass storage (Lustre filesystem)
� KSL expertise (us!)

It's our job to help you… but we need input from you:
� Computing power = core × hour = project application
� Fast access storage (SSD) = Burst Buffer application
� Mass storage (Lustre) = in project application
� KSL expertise = in project application
Outline
• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process, most often used SLURM commands
• How to place my program on allocated resources?
• SLURM interesting features
• → comprehensive example in the application talk!
Why scheduling? … to optimize the use of shared resources.
[Figure: jobs of different sizes packed onto the machine's nodes over time; x-axis: TIME]
On Shaheen II and Dragon: SLURM. The same principles apply with PBS, LoadLeveler, Sun Grid Engine…
[Figure: a job requesting 4 nodes for 20 minutes, with 6,174 nodes available]
The concept of backfilling
[Figure: backfilling; y-axis: resources available (nodes, memory, disk space…), x-axis: TIME]
In order to run the big job, the scheduler has to make some room on the machine: some empty space appears.
[Figure: a gap opens in the schedule; jobs of 10 H, 18 H, and 24 H are considered, and only those short enough fit]
Only jobs small enough will fit in the gap…
It is in your best interest to set a realistic job duration.
� Shaheen II has:
� 6,174 compute nodes
� each node has 2 sockets
� each socket has 16 cores
� each core has 2 hyperthreads (the default choice)
� → 197,568 cores
� → twice as many threads = SLURM CPUs
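The core count above can be checked with a quick calculation (a sketch; the figures are the ones quoted on the slide):

```shell
# Recompute Shaheen II's core and SLURM-CPU counts from the node topology.
nodes=6174
sockets_per_node=2
cores_per_socket=16
threads_per_core=2
cores=$((nodes * sockets_per_node * cores_per_socket))
slurm_cpus=$((cores * threads_per_core))
echo "$cores cores, $slurm_cpus SLURM CPUs"   # 197568 cores, 395136 SLURM CPUs
```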
[Figure: 36 racks of compute nodes forming the workq partition]
Shaheen II’s Resources
Shaheen II’s Resources: Scheduler view
Slurm counts in threads: 1 Slurm CPU is a Thread
Outline
• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process, most often used SLURM commands
• How to place my program on allocated resources?
• SLURM interesting features
• → comprehensive example in the application talk!
Steps to run a job
1. Listing available resources: sinfo
2. Acquiring resources, i.e. asking SLURM to make available what you need: sbatch
3. Waiting for the resources to be scheduled: squeue
4. Gathering information on the resources: scontrol
5. Using the resources: srun
6. Releasing the resources: scancel
7. Tracking my resource use: sacct
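The seven steps above can be sketched as a single session. The commands need a SLURM cluster to actually run, so here they are collected into a file and only syntax-checked:

```shell
# Life cycle of an HPC computation under SLURM, step by step.
cat > lifecycle.sh <<'EOF'
#!/bin/bash
sinfo                               # 1. list available resources
jobid=$(sbatch --parsable job.sh)   # 2. acquire resources (prints the job ID)
squeue -u "$USER"                   # 3. watch the job wait in the queue
scontrol show job "$jobid"          # 4. gather information on the allocation
#    (step 5, srun, runs inside job.sh itself)
scancel "$jobid"                    # 6. release the resources early if needed
sacct -j "$jobid"                   # 7. track what the job actually used
EOF
bash -n lifecycle.sh && echo "lifecycle.sh syntax OK"
```

`sbatch --parsable` makes sbatch print just the job ID, convenient for capturing it in a variable.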
Listing resources: sinfo
A SLURM CPU is a hyperthread.
[Screenshot annotations: status of each node in the partition; name of the partition; job limits in size and duration]
Among all the nodes available, 30 of them will accept 1-node computation jobs for a duration of up to 72 hours.
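As a sketch of what sinfo prints (the column layout is standard SLURM; the partition names, node names, and counts here are illustrative, not Shaheen II's real output):

```shell
# Illustrative sinfo output; '*' marks the default partition.
cat > sinfo_example.txt <<'EOF'
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
workq*       up 1-00:00:00   6144   idle nid[00000-06143]
72hours      up 3-00:00:00     30   idle nid[06144-06173]
EOF
cat sinfo_example.txt
```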
Acquiring resources: sbatch
� sbatch job.sh
� Minimum request:
� 1 full node (the same node cannot be used by two different jobs)
� 1 minute

File job.sh:
#!/bin/bash
#SBATCH --partition=workq
#SBATCH --job-name=my_job
#SBATCH --output=out.txt
#SBATCH --error=err.txt
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=32
#SBATCH --time=hh:mm:ss (or dd-hh:mm)

echo hello            ← only executed on the first node
cd WORKING_DIR
srun -n 1024 a.out    ← parallel run on the compute nodes
…
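The distinction above (plain commands run only on the first allocated node, while srun spawns across all nodes) can be sketched like this; the script is written and syntax-checked here rather than submitted:

```shell
# Where each line of a job script executes (sketch; submit with: sbatch job.sh).
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00
hostname               # plain command: runs once, on the first allocated node
srun -n 1024 ./a.out   # srun: spawns 1024 tasks across all 32 nodes
EOF
bash -n job.sh && echo "job.sh syntax OK"
```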
Acquiring interactive resources: salloc
� salloc --ntasks=4 --time=10:00
� Used to get a resource allocation and use it interactively from your computer terminal
� → example given in the application talk
Monitoring resources :squeue
� squeue : lists all jobs queued (can be long…)� squeue –u <my_login> (only my jobs, much faster)� squeue –l (queue displayed in more detailed
format)
� squeue --start (display expected starting time)� squeue -i60� reports currently active jobs every 60 seconds
à warning 1 : 1 SLURM CPU = 1 Hyperthreadà warning 2: on the number of CPUs printed for a high number
: it’s truncated! 197, 568 CPUS will read 19758
Monitoring resources: scontrol show job
� scontrol show job jobID
Managing already allocated resources
scancel <job_id>: kills a given job
scancel --user=<login> --state=pending: kills all pending jobs of the user <login>
scancel --name=<job_name>: kills all jobs of the user <login> with a given name
sacct: reports accounting information (core-hours) for individual jobs
sb kxxx: reports accounting information for the whole project
Outline
• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process, most often used SLURM commands
• How to place my program on allocated resources?
• SLURM interesting features
• → comprehensive example in the application talk!
1 Shaheen II node
2 hyperthreads per core
1 node × 2 sockets × 16 cores × 2 hyperthreads = 64 hyperthreads
[Figure legend: one square = 1 hyperthread; a filled square = 1 MPI process]

srun → spawning my program on the resource
srun my_program.exe
→ 1 MPI process

srun --ntasks 32 my_program.exe
→ 32 MPI processes; by default SLURM counts hyperthreads as CPUs, so tasks may share physical cores

srun --hint=nomultithread --ntasks 32 my_program.exe
→ 32 MPI processes, one per physical core

srun → pinning my program to the resource
srun --hint=nomultithread --ntasks 8 my_program.exe
→ 8 MPI processes on physical cores

srun --hint=nomultithread --cpu-bind=cores --ntasks 8 my_program.exe
→ 8 MPI processes, each bound to its own core

srun → spreading my program on the resource
srun --hint=nomultithread --ntasks-per-socket=4 --ntasks 8 my_program.exe
→ 8 MPI processes spread 4 per socket, giving each process access to more of the node's memory (shown in the figure as 8 × 16 GB blocks holding the program and its variables)
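The placement flags from the frames above, gathered in one place (these are real srun options; the commands are collected into a file and syntax-checked, since srun needs a SLURM allocation to execute):

```shell
# Task placement with srun, from loosest to most constrained.
cat > placement.sh <<'EOF'
#!/bin/bash
srun --ntasks 32 my_program.exe                        # hyperthreads count as CPUs
srun --hint=nomultithread --ntasks 32 my_program.exe   # one task per physical core
srun --hint=nomultithread --cpu-bind=cores \
     --ntasks 8 my_program.exe                         # pin each task to a core
srun --hint=nomultithread --ntasks-per-socket=4 \
     --ntasks 8 my_program.exe                         # spread 4 tasks per socket
EOF
bash -n placement.sh && echo "placement.sh syntax OK"
```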
Job.sh
[Screenshot: the job script, annotated with the SLURM control parameters and the script commands]
• Name of the job in the queue
• Output file name for the job
• Error file name for the job
• Requested number of tasks
• Requested elapsed time
• Partition
• Group to be charged

Job.out
[Screenshot: the job's standard output file]

Job.err
[Screenshot: the job's standard error file]
Outline
• Resources available
• Why scheduling?
• What are the main steps of an HPC computation?
• The scheduling process, most often used SLURM commands
• How to place my program on allocated resources?
• SLURM interesting features
• → comprehensive example in the application talk!
Environment Variables
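The original slide appears to show a screenshot; as a hedged sketch, these are a few of the standard environment variables SLURM defines inside a job (they are unset outside an allocation):

```shell
# Print some standard SLURM environment variables from within a job script.
cat > show_env.sh <<'EOF'
#!/bin/bash
echo "job id: $SLURM_JOB_ID"
echo "nodes:  $SLURM_JOB_NUM_NODES"
echo "tasks:  $SLURM_NTASKS"
srun bash -c 'echo "task rank: $SLURM_PROCID"'   # SLURM_PROCID is set per task
EOF
bash -n show_env.sh && echo "show_env.sh syntax OK"
```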
SLURM’s interesting features
� Job arrays
� Job dependencies
� Email on job state transitions
� Interactive jobs with salloc
� Reservations (with the help of the sysadmins)
� Flexible scheduling: the job will run with the set of parameters with which it can start soonest
� sbatch --partition=debug,batch
� sbatch --nodes=16-32 ...
� sbatch --time=10:00:00 --time-min=4:00:00 ...
So much more…
� How to submit workflows?
� How to handle thousands of jobs?
� How to manage dependencies between jobs?
� How to tune scheduling?
� How to cope with hardware failures?
→ ask us or check our KSL monthly seminars!
Launching hundreds of jobs…
Submit and manage collections of similar jobs easily. To submit a 50,000-element job array:

sbatch --array=1-50000 -N1 -i my_in_%a -o my_out_%a job.sh

“%a” in a file name is mapped to the array task ID (1 to 50000)
Default standard output: slurm-<job_id>_<task_id>.out
Only supported for batch jobs
• squeue and scancel commands, plus some scontrol options, can operate on an entire job array or on selected task IDs
• the squeue -r option prints each task ID separately
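A job-array element script matching the sbatch line above might look like this (SLURM_ARRAY_TASK_ID is the real per-element variable; my_solver is a hypothetical program name, and the script is only written and syntax-checked here):

```shell
# One element of a 50,000-element job array; submit with: sbatch array_job.sh
cat > array_job.sh <<'EOF'
#!/bin/bash
#SBATCH --array=1-50000
#SBATCH -N 1
# Each element reads my_in_<task id> and writes my_out_<task id>.
./my_solver < my_in_"$SLURM_ARRAY_TASK_ID" > my_out_"$SLURM_ARRAY_TASK_ID"
EOF
bash -n array_job.sh && echo "array_job.sh syntax OK"
```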