Southgreen HPC system
2014
Concepts
Cluster : a compute farm, i.e. a collection of compute servers that are shared and accessed through a single “portal” (marmadais.cirad.fr)
Job Manager (SGE) : software that lets you initiate and/or send jobs to the cluster's compute servers (also known as compute hosts or nodes)
Architecture
Useful links
Available software :
http://gohelle.cirad.fr/cluster/doc_logiciels_cluster.xls
Software installation form :
http://gohelle.cirad.fr/cluster/logiciel.php
Do not run jobs on marmadais
Jobs or programs found running on marmadais may be killed.
Submit batch jobs (qsub) or, if really needed, work interactively on a cluster node (qrsh or qlogin).
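For example, instead of running a program directly on marmadais, you would do one of the following (the script name is hypothetical):

```shell
# Batch: hand the script to the scheduler; it runs on a compute node
qsub myscript.sh

# Interactive: open a shell on a compute node, then work there
qrsh
```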
Compute node characteristics
CPU x86_64
cc-adm-calcul-1, …, cc-adm-calcul-23 : 8 cores, 32 GB RAM
cc-adm-calcul-24, …, cc-adm-calcul-26 : 8 cores, 92 GB RAM
cc-adm-calcul-27 : 32 cores, 1 TB RAM
Type of jobs/programs
Sequential : uses only 1 core.
Multi-thread : uses multiple CPU cores via program threads on a single cluster node.
Parallel (MPI) : uses many CPU cores on many cluster nodes through a Message Passing Interface (MPI) environment.
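Each job type maps to a different resource request at submission time. A sketch, using the parallel environment names listed later in this document (the slot counts are illustrative, and the mapping of each environment to a placement policy is an assumption):

```shell
qsub job.sh                        # sequential: one slot on one node
qsub -pe parallel_smp 8 job.sh     # multi-thread: 8 cores on a single node
qsub -pe parallel_fill 16 job.sh   # MPI: 16 slots, possibly spread across nodes
```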
File systems
home2 : GPFS (on slow disks)
work : GPFS (on fast disks)
Everything else : NFS
Step by step
Preparation : data and script
Submission
Job computation
Checkout (inspect the results)
Using the cluster
All computing on the cluster is done by logging into marmadais.cirad.fr and submitting batch or interactive jobs.
Job submission : qsub, qrsh, qsh/qlogin
Job management : qmod, qdel, qhold, qrls, qalter
Cluster information display : qstat, qhost, qconf
Accounting : qacct
Interactive jobs
qrsh establishes a remote shell connection (ssh) on a cluster node that meets your requirements.
qlogin (bigmem.q) establishes a remote shell connection (ssh) with X display export on a cluster node that meets your requirements.
A first look at batch job submission
Create a shell script:

vi script1.sh

The shell script:

hostname                      # show the name of the cluster node this job runs on
ps -ef                        # see what is running on that node
sleep 30                      # sleep for 30 seconds
echo -e "\n --------- Done"

Submit the shell script with a given job name (note the use of the -N option):

qsub -N mytest script1.sh

Check the status of your job(s):

qstat

The output (stdout and stderr) goes to files in your home directory:

ls mytest.o* mytest.e*
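The qsub options can also be embedded in the script itself as #$ directive lines, which SGE reads at submission time. A minimal sketch of the same script1.sh in that style:

```shell
#!/bin/bash
#$ -N mytest        # job name, same as passing qsub -N mytest
#$ -cwd             # run in the directory where qsub was executed

hostname            # name of the cluster node running this job
ps -ef              # what is running on that node
sleep 30            # pretend to work for 30 seconds
echo -e "\n --------- Done"
```

With the directives in place, a plain `qsub script1.sh` is enough.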
A few options
-cwd : run the job in the current working directory (where qsub was executed) rather than the default (your home directory)
-o path/filename : send the standard output stream to a different file
-e path/filename : send the standard error stream to a different file
-j y : merge the standard error stream into the standard output
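Combined, the options might be used like this (the job and file names are hypothetical; the logs/ directory should exist before submission, otherwise the job cannot write its output there):

```shell
qsub -cwd -N align_run -o logs/align.log -j y myscript.sh
```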
Memory limitations
Resources are limited and shared …
so specify your resource requirements
Ask for the amount of memory that you expect your job will need so that it runs on a cluster node with sufficient memory available
… AND …
place a limit on the amount of memory your job can use so that you don’t accidentally crash a compute node and take down other users’ jobs along with yours:
qsub -cwd -l mem_free=6G,h_vmem=8G myscript.sh
qrsh -l mem_free=8G,h_vmem=10G
If you do not set these, default values are applied for you.
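The same memory requests can live in the script itself as #$ directives; a sketch, assuming a hypothetical program my_program:

```shell
#!/bin/bash
#$ -cwd
#$ -l mem_free=6G,h_vmem=8G   # schedule on a node with 6 GB free; hard limit of 8 GB virtual memory

./my_program
```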
Sequential job
Template : /usr/local/bioinfo/template/job_sequentiel.sh
Multi-thread job
Template : /usr/local/bioinfo/template/job_multithread.sh
MPI job
Template : /usr/local/bioinfo/template/job_parallel.sh
Parallel env
parallel_2, parallel_4, parallel_8, parallel_fill, parallel_rr, parallel_smp
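To check which parallel environments are actually configured, and to request one at submission (the slot count and script name are illustrative):

```shell
qconf -spl                          # list the configured parallel environments
qsub -pe parallel_smp 4 my_job.sh   # request 4 slots in the parallel_smp environment
```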
How many jobs can I submit ?
As many as you want …
BUT, only a limited number of “slots” will run. The rest will have the queue wait state ‘qw’ and will run as your other jobs finish.
In SGE, a slot generally corresponds to a single CPU core.
The maximum number of slots per user may change depending on the availability of cluster resources or special needs and requests.
Queues

| Queue | Nodes | Access | Runtime | Priority |
|---|---|---|---|---|
| normal.q | 32 GB RAM nodes | external users | maximum 48 hours | normal |
| urgent.q | any node | limited | maximum 24 hours | highest |
| hypermem.q | 1 TB RAM node | everybody | no limit | normal |
| bigmem.q | 92 GB RAM nodes | everybody | no limit | normal |
| bioinfo.q | 32 GB RAM nodes | internal users | maximum 48 hours | normal |
| long.q | 32 GB RAM nodes (for long-running jobs) | everybody | no limit | low |
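A specific queue can be requested with -q, e.g. to target the large-memory nodes (the script name and memory values are hypothetical):

```shell
qsub -q bigmem.q -l mem_free=40G,h_vmem=48G bigjob.sh
```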
qstat
Info for a given user:

qstat -u username (or just qstat, or qu, for your own jobs)

Full dump of queue and job status:

qstat -f

What do the column labels mean?
job-ID : a unique identifier for the job
name : the name of the job
state : the state of the job
  r : running
  s : suspended
  t : being transferred to an execution host
  qw : queued and waiting to run
  Eqw : an error occurred with the job

Why is my job in Eqw state?

qstat -j job-ID -explain E
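The state column lends itself to quick summaries. The sketch below counts jobs per state on an invented qstat-style listing (the job names and IDs are made up); on the cluster you would pipe real qstat output into the same awk/sort/uniq pipeline:

```shell
#!/bin/sh
# Sample qstat-style output (hypothetical jobs, for illustration only).
cat > /tmp/qstat_sample.txt <<'EOF'
job-ID  prior   name   user   state submit/start at     queue        slots
-----------------------------------------------------------------------------
  101  0.55500 mytest  alice  r     06/25/2022 10:00:00 normal.q@n1  1
  102  0.55500 align   alice  qw    06/25/2022 10:05:00              1
  103  0.55500 blast   alice  qw    06/25/2022 10:06:00              1
EOF

# Skip the two header lines, take the state column, count each state.
# On the cluster: qstat | awk 'NR>2 {print $5}' | sort | uniq -c
awk 'NR>2 {print $5}' /tmp/qstat_sample.txt | sort | uniq -c
```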
Visualisation
qdel
To terminate a job, first get its job-ID with qstat:

qstat (or qu)

Terminate the job:

qdel job-ID

Force termination of a running job (admins only):

qdel -f job-ID
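To clear all of your own jobs at once (a common cleanup step; this assumes the standard qstat layout where the job-ID is the first column, and the GNU xargs -r flag):

```shell
qstat -u "$USER" | awk 'NR>2 {print $1}' | xargs -r qdel
```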