Intro to Monsoon and Slurm


1/18/2017

Introductions

• Introduce yourself
  – Name
  – Department/Group
  – What project(s) do you plan to use monsoon for?
  – Linux or Unix experience
  – Previous cluster experience?

List of Topics

• Cluster education
  – What is a cluster, exactly?
  – Queues, scheduling and resource management

• Cluster Orientation
  – Monsoon cluster specifics
  – How do I use this cluster?
  – Group resource limits
  – Exercises
  – Question and answer

What is a cluster?

• A computer cluster is many individual computer systems (nodes) networked together locally to serve as a single resource

• Ability to solve problems on a large scale not feasible alone

Inside a Node

[Diagram: a single node with four CPU sockets, Socket 0 through Socket 3]

What is a queue?

• Normally thought of as a line, FIFO
• Queues on a cluster can be as basic as a FIFO, or far more advanced, with dynamic priorities taking many factors into consideration

What is scheduling?

• "A plan or procedure with a goal of completing some objective within some time frame"

• Scheduling for a cluster at the basic level is much the same: assigning work to computers to complete objectives within some time availability

• It is not exactly that easy, though. Many factors come into play when scheduling work on a cluster.

Scheduling

• A scheduler needs to know what resources are available on the cluster

• Assignment of work on a cluster is carried out most efficiently with scheduling and resource management working together

Resource Management

• Monitoring resource availability and health
• Allocation of resources
• Execution of resources
• Accounting of resources

Cluster scheduling goals

• Optimize quantity of work
• Optimize usage of resources
• Service all users and projects justly
• Make scheduling decisions transparent

Cluster Resources

• Nodes
• Memory
• CPUs
• GPUs
• Licenses

Many scheduling methods

• FIFO – simply first in, first out

• Backfill – runs smaller jobs with lower resource requirements while larger jobs wait for their higher resource requirements to become available

• Fairshare – prioritizes jobs based on users' recent resource consumption

Monsoon

• The Monsoon cluster is a resource available to the NAU research enterprise

• 32 systems (nodes) – cn[1-32]
• 884 cores
• 12 GPUs, NVIDIA Tesla K80
• Red Hat Enterprise Linux 6.7
• 12 TB memory – 128 GB/node min, 1.5 TB max
• 170 TB high-speed scratch storage
• 500 TB long-term storage
• High-speed interconnect: FDR InfiniBand

Monsoon scheduling

• Slurm (Simple Linux Utility for Resource Management)
• Excellent resource manager and scheduler
• Precise control over resource requests
• Developed at LLNL, continued by SchedMD
• Used everywhere from small clusters to the largest clusters:

  – Sunway TaihuLight (#1), 10.6M cores, 93 PF, ~15 MW – China
  – Titan (#3), 561K cores, 17.6 PF, ~8 MW – United States

Small cluster! Dual core?

Largest cluster! 10,649,600 cores

Monsoon scheduling

• Combination of scheduling methods
• Currently configured to utilize backfill along with a multifactor priority system to prioritize jobs

Factors contributing to priority

• Fairshare (predominant factor)
  – Priority points determined by a user's recent resource usage
  – Decay half-life of 1 day

• QOS (Quality of Service)
  – Some QOS have higher priority than others, for instance: debug

• Age – how long the job has sat pending
• Job size – the number of nodes/CPUs a job is requesting

Storage

• /home
  – 10 GB quota
  – Keep your scripts and executables here
  – Snapshotted twice a day: /home/.snapshot
  – Please do not write job output (logs, results) here!

• /scratch
  – 30-day retention
  – Very fast storage, capable of 11 GB/sec
  – Checkpoints, logs
  – Keep all temp/intermediate data here
  – Should be your default location to perform input/output

Storage

• /projects
  – Long-term storage project shares
  – Space is assigned to a faculty member for the group to share
  – Snapshots available
  – No backups today

• /common
  – Cluster support share
  – contrib: a place to put scripts/libs/confs/DBs for others to use

Data Flow

1. Keep scripts and executables in /home
2. Write temp/intermediate data to /scratch
3. Copy data to /projects/<group_project>, for group storage and reference in other projects
4. Clean up /scratch

**Remember, /scratch is a scratch filesystem, used for high-speed temporary and intermediate data
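
As a rough illustration only (NAUID, group_project, and the file names are placeholders), the four steps above map to commands like:

  # 1. the script itself lives in /home
  sbatch ~/myjob.sh                                        # 2. the job writes temp/intermediate data under /scratch/NAUID
  cp -av /scratch/NAUID/results /projects/group_project/   # 3. copy what matters to long-term group storage
  rm -rf /scratch/NAUID/results                            # 4. clean up /scratch when done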

Remote storage access

• scp
  – scp files nauid@monsoon.hpc.nau.edu:/scratch/nauid
  – WinSCP

• samba/cifs
  – \\nau.froot.nau.edu\cirrus (Windows)
  – smb://nau.froot.nau.edu/cirrus (Mac)

Groups

• NAU has a resource called Enterprise groups
• They are available to you on the cluster if you'd like to manage access to data

• Manage access to your files
• https://my.nau.edu
  – "Go to Enterprise Groups"
  – Take a look at our FAQ: nau.edu/hpc/faq

• If they are not working for you, contact the ITS help desk

Software

• ENVI/IDL
• Matlab
• Intel Compilers, and MKL
• R
• Qiime
• Anaconda Python
• OpenFOAM
• SOWFA
• Lots of bioinformatics programs
• Request additional software to be installed!

Modules

• Software environment management is handled by the modules package management system

• module avail – what modules are available
• module list – modules currently loaded
• module load <modulename> – load a package module
• module display <modulename> – detailed information, including environment variables affected
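
For example, a session using these commands might look like the following (module names and versions vary; run module avail to see what is actually installed):

  module avail
  module load R
  module list
  module display R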

MPI

• Quick note on MPI
• Message Passing Interface for parallel computing
• OpenMPI set as the default MPI
• MVAPICH2 also available
  – module unload openmpi
  – module load mvapich2

• Example MPI job script:
  – /common/contrib/examples/job_scripts/mpijob.sh
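
A minimal sketch of what an MPI job script can look like (the actual mpijob.sh may differ; the hellompi binary is the example program referenced in the MPI exercise later):

  #!/bin/bash
  #SBATCH --job-name=mpitest
  #SBATCH --output=/scratch/NAUID/mpitest.out   # NAUID is a placeholder
  #SBATCH --ntasks=4                            # 4 MPI ranks
  #SBATCH --time=10:00

  module load openmpi                           # OpenMPI is the default MPI on Monsoon
  srun /common/contrib/examples/mpi/hellompi    # srun launches one process per task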

Interacting with Slurm

• Decide what you need to accomplish
• What resources are needed?
  – 2 cpus, 12 GB memory, for 2 hours?
• What steps are required?
  – Run prog1, then prog2… etc
  – Are the steps dependent on one another?

• Can your work or project be broken up into smaller pieces? Smaller pieces can make the workload more agile.

• How long should your job run for?
• Is your software multithreaded, does it use OpenMP or MPI?

Example Jobscript

  #!/bin/bash
  #SBATCH --job-name=test
  #SBATCH --output=/scratch/nauid/output.txt
  #SBATCH --time=20:00                # shorter time = sooner start
  #SBATCH --workdir=/scratch/nauid

  # replace this module with software required in your workload
  module load python/3.3.4

  # example job commands
  # each srun command is a job step, so this job will have 2 steps
  srun sleep 300
  srun python -V
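
After submitting this script with sbatch, the two job steps show up in the accounting output, for example (the job id is whatever sbatch reports):

  sacct -j <jobid> -o jobid,jobname,elapsed,state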

Interactive/Debug Work

• Run your compiles and testing on the cluster nodes by:

  – srun -p all gcc hello.c -o a.out
  – srun -p all -c 12 make -j12
  – srun --qos=debug -c 12 make -j12
  – srun Rscript analysis.r
  – srun python analysis.py
  – srun cp -av /scratch/NAUID/lots_o_files /scratch-lt/NAUID/destination

Long Interactive Work

• salloc
  – Obtain a SLURM job allocation that you can work with interactively for an extended amount of time. This is useful for testing/debugging over an extended period.

[cbc@wind ~]$ salloc -c8 --time=2-00:00:00
salloc: Granted job allocation 33442
[cbc@wind ~]$ srun python analysis.py
[cbc@wind ~]$ exit
salloc: Relinquishing job allocation 33442
[cbc@wind ~]$ salloc -N2
salloc: Granted job allocation 33443
[cbc@wind ~]$ srun hostname
cn3.nauhpc
cn2.nauhpc
[cbc@wind ~]$ exit
salloc: Relinquishing job allocation 33443

Job Parameters

You want                                     Switches needed
More than one CPU for the job                --cpus-per-task=2, or -c 2
To specify an ordering of your jobs          --dependency=afterok:job_id, or -d job_id
Split up the output and errors               --output=result.txt --error=error.txt
To run your job at a particular time/day     --begin=16:00, --begin=now+1hour, --begin=2010-01-20T12:34:00
Add MPI tasks/ranks to your job              --ntasks=2, or -n 2
To control job failure options               --no-requeue, --requeue
To receive status email                      --mail-type=ALL

Constraints and Resources

You want                                         Switches needed
To choose a specific node feature (e.g. avx2)    --constraint=avx2
To use a generic resource (e.g. a GPU)           --gres=gpu:tesla:1
To reserve a whole node for yourself             --exclusive
To choose a partition                            --partition
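
Pulling a few of these switches together, a job script header might look like the following sketch (the constraint, GPU request, and program name are purely illustrative):

  #!/bin/bash
  #SBATCH --job-name=params_demo
  #SBATCH --output=result.txt        # normal output
  #SBATCH --error=error.txt          # errors split into their own file
  #SBATCH --cpus-per-task=2          # more than one cpu for the job
  #SBATCH --mail-type=ALL            # receive status email
  #SBATCH --constraint=avx2          # require a node feature
  #SBATCH --gres=gpu:tesla:1         # request a generic resource (one GPU)

  srun ./my_program                  # my_program is a placeholder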

Submit the script

[cbc@wind ~]$ sbatch jobscript.sh
Submitted batch job 85223

  – Slurm returns a job id for your job that you can use to monitor or modify constraints

Monitoring your job

• squeue – view information about jobs located in the SLURM scheduling queue.

• squeue --start
• squeue -u login
• squeue -o "%j %u …"
• squeue -p partitionname
• squeue -S sortfield
• squeue -t <state> (PD or R)

Cluster info

• sinfo – view information about SLURM nodes and partitions.

• sinfo -N -l
• sinfo -R
  – List reasons for downed nodes and partitions

Monitoring your job

• sprio – view the factors that comprise a job's scheduling priority

• sprio -l – list priority of users' jobs in the pending state

• sprio -o "%j %u …"
• sprio -w

Monitoring your job

• sstat – display various status information of a running job/step.

• sstat -j jobid
• sstat -o AveCPU,AveRSS
• Only works with jobs containing steps

Controlling your job

• scancel – used to signal jobs or job steps that are under the control of Slurm.

• scancel jobid
• scancel -n jobname
• scancel -u mylogin
• scancel -t pending (only yours)

Controlling your job

• scontrol – used to view and modify Slurm configuration and state.

• scontrol show job 85224

Job Accounting

• sacct
  – displays accounting data for your jobs and job steps from the SLURM job accounting log or SLURM database

• sacct -j jobid -o jobid,elapsed,maxrss
• sacct -N nodelist
• sacct -u mylogin

• Try our alias "jobstats"
  – jobstats
  – jobstats -j <jobid>

Job Accounting

• sshare – tool for listing the shares of associations to a cluster.

• sshare -l: view and compare your fairshare with other accounts

• sshare -a: view all users' fairshare
• sshare -A <account> -a: view all members in your account (group)

Account hierarchy

• Your user account belongs to a parent faculty account (group)
• Your user account shares resources that are provided for your group

• Example:
  – coffey
    • cbc
    • mkg52

• View the account structure you belong to with: "sshare -a -A <account>"

• Example:
  – sshare -a -A coffey

Limits on the account (group)

• Limits are in place to prevent intentional or unintentional misuse of resources, and to ensure quick and fair turnaround times on jobs for everyone.

• Groups are limited to a total number of CPU minutes in use at one time: 700,000

• This CPU resource limit mechanism is referred to as: "TRESRunMins"

TRESRunMins Limit

• What the heck is that!?
• A number which limits the total remaining CPU minutes that your running jobs can occupy.

• Enables flexible resource limiting
• Staggers jobs
• Increases cluster utilization
• Leads to more accurate resource requests

• Sum over jobs of (cpus * time limit remaining)

Examples

• 14,400 = 10 jobs, 1 cpu, 1 day in length
• 144,000 = 10 jobs, 10 cpus, 1 day in length
• 576,000 = 10 jobs, 10 cpus, 5 days in length
• 648,000 = 100 jobs, 1 cpu, ½ day in length
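
Working the first line through the formula above: 10 jobs × 1 cpu × 1,440 minutes (one day remaining) = 14,400 cpu-minutes, well under the 700,000 group limit.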

Questions?

• Check your group's CPU minute usage:
  – sshare -l

Exercise 1

Get to know monsoon and Slurm, on your own.

1. How many nodes make up monsoon?
  – Hint: use "sinfo"
2. How many nodes are in the "all" partition?
3. How many jobs are currently in the running state?
  – Hint: use "squeue -t R"
4. How many jobs are currently in the pending state? Why?
  – Hint: use "squeue -t PD"

Exercise 2

• Create a simple job in your home directory
• Example here: /common/contrib/examples/job_scripts/simplejob.sh (copy it if you like)

• Name your job: "exercise"
• Name your job's output: "exercise.out"
• Output should go to /scratch/<user>/exercise.out
• Load the module "workshop"
• Run the "date" command
• And additionally, the "secret" command
• Submit your job with sbatch, i.e. "sbatch simplejob.sh"
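
If you get stuck, a script meeting these requirements could look roughly like this sketch (assuming the "workshop" module provides the "secret" command; the provided simplejob.sh may differ):

  #!/bin/bash
  #SBATCH --job-name=exercise
  #SBATCH --output=/scratch/<user>/exercise.out   # replace <user> with your NAU id
  #SBATCH --time=5:00

  module load workshop    # provides the "secret" command
  srun date
  srun secret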

Exercise 3

• Make your job sleep for 5 minutes (sleep 300)
  – sleep is a command that creates a process that… sleeps

• Monitor your job
  – squeue -u your_nauid
  – squeue -t R
  – scontrol show job jobnum
  – sacct -j jobnum

• Inspect the steps

• Cancel your job
  – scancel jobnum

Exercise 4

• Copy the job script and edit it:
  – /common/contrib/examples/job_scripts/lazyjob.sh
• Submit the job, it will take 65 sec to complete
• Use sstat and monitor the job
  – sstat -j <jobid>
• Review the resources that the job used
  – jobstats -j <jobid>
• We are looking for "MaxRSS"; MaxRSS is the max amount of memory used
• Edit the job script, reduce the memory being requested in MB, and resubmit; edit "--mem=", e.g. --mem=600
• Review the resources that the optimized job utilized once again
  – jobstats -j <jobid>
• OK, memory looks good, but notice that the user CPU is the same as the elapsed time

  User CPU = number of utilized CPUs * elapsed time

• This is because the application we were running only used 1 of the 4 CPUs that we requested
• Edit the lazyjob script, comment out the first srun command, and uncomment the second srun command.
• Resubmit
• Rerun jobstats -j <jobid>; notice that user CPU is now a multiple of the elapsed time, in this case 4, because we were allocated 4 CPUs and used 4 CPUs.
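
For reference, the shape implied by this exercise is roughly the following (a sketch only; the real lazyjob.sh, its memory request, and its program names will differ):

  #!/bin/bash
  #SBATCH --job-name=lazyjob
  #SBATCH --output=/scratch/NAUID/lazyjob.out
  #SBATCH -c 4                      # 4 cpus are allocated
  #SBATCH --mem=2000                # memory request to be reduced during the exercise
  #SBATCH --time=5:00

  srun ./single_threaded_program    # first step: uses only 1 of the 4 cpus
  # srun ./multi_threaded_program   # second step: uncomment in the last part of the exercise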

Slurm Arrays!

Slurm Arrays Exercise

• From your scratch directory: "/scratch/nauid"
• tar xvf /common/contrib/examples/bigdata_example.tar
• cd bigdata
• Edit the file "job_array.sh" so that it works with your nauid, replacing all NAUID with yours

• Submit the script: "sbatch job_array.sh"
• Run "squeue", notice there are 5 jobs running, how did that happen!
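
The trick is Slurm's --array directive, which launches one task per array index from a single submission; a script along these lines produces the five jobs (a sketch only; the actual job_array.sh may differ):

  #!/bin/bash
  #SBATCH --job-name=array_demo
  #SBATCH --output=/scratch/NAUID/array_%A_%a.out   # %A = job id, %a = array index
  #SBATCH --array=1-5                               # five array tasks from one sbatch
  #SBATCH --time=10:00

  # each task sees its own index in SLURM_ARRAY_TASK_ID
  srun ./process_chunk input_${SLURM_ARRAY_TASK_ID}.dat   # program and input names are placeholders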

MPI Example

• Refer to the MPI example here:
  – /common/contrib/examples/job_scripts/mpijob.sh

• Edit it for your work areas, then experiment:
  – Change number of tasks, nodes… etc

• Also can run the example like this:
  – srun --qos=debug -n4 /common/contrib/examples/mpi/hellompi

Keep these tips in mind

• Know the software you are running
• Request resources accurately
• Supply an accurate time limit for your job
• Don't be lazy, it will affect you and your group negatively

Question and Answer

• More info here: http://nau.edu/hpc

• Linux shell help here:
  – http://linuxcommand.org/tlcl.php
  – Free book download

• And on the nauhpc listserv
  – nauhpc@lists.nau.edu