TORQUE Tutorial - Adaptive Computing · Installation Configuration Job Administration ... 1,200...

TORQUE Tutorial A Beginner's Guide Kenneth Nielson September 16, 2009


TORQUE Resource Manager

● What is TORQUE
● TORQUE's Role
● TORQUE Components
● Installation
● Configuration
● Job Administration
● Diagnostics
● MPI
● Multi-mom and Any mom
● Roadmap
● Q&A


What is TORQUE?

● Terascale Open-Source Resource and QUEue Manager

● TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original *PBS project and, with more than 1,200 patches, has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading edge HPC organizations.

● PBS – Portable Batch System


What is TORQUE

The Portable Batch System, PBS, is a batch job and computer system resource management package. It was developed with the intent to be conformant with the POSIX 1003.2d Batch Environment Standard. As such, it will accept batch jobs, a shell script and control attributes, preserve and protect the job until it is run, run the job, and deliver output back to the submitter. PBS may be installed and configured to support jobs run on a single system, or many systems grouped together. Because of the flexibility of PBS, the systems may be grouped in many fashions.


TORQUE's Role

● Provides job queuing facility
● Monitors resource configuration, utilization, and health
● Provides remote job execution and job management facilities
● Reports information to cluster scheduler
● Receives direction from cluster scheduler
● Handles user client requests


TORQUE Components

Commands

Job Server

Job Executor

Job Scheduler


TORQUE Components

Commands
● Three classes of commands
  ○ User – any authorized user can execute
  ○ Operator – special access privileges required
  ○ Manager – special access privileges required
● User commands
  ○ qsub, qstat, pbsnodes, qdel

● Operator and manager commands


TORQUE Components

Job Server
● pbs_server
● Central focus of TORQUE
● All commands and other daemons communicate with pbs_server via TCP/IP and UDP/IP
● Provides basic batch services
  ○ Job creation
  ○ Job modification
  ○ Job protection
  ○ Job execution


TORQUE Components

Job Executor

● pbs_mom
  ○ Daemon called MOM – Machine-Oriented Miniserver
  ○ Receives copy of jobs from pbs_server
  ○ Places jobs into execution
  ○ Creates new session similar to user login session
  ○ For parallel jobs a Mother Superior manages group of sister nodes
  ○ Returns output to pbs_server or Mother Superior


TORQUE Components

Job Scheduler

● Controls site policy
● TORQUE supports multiple schedulers
  ○ pbs_sched
    ■ Not supported by Adaptive Computing
  ○ Maui
    ■ Open source
    ■ User Group support only
  ○ Moab
    ■ Torque support included
    ■ For what Moab can do that Maui cannot, go to
      http://www.clusterresources.com/products/maui/docs/a.kmoabcomp.shtml


TORQUE Installation

Where to get it

svn (subversion)

svn://svn.clusterresources.com/torque

/trunk – currently 2.4 beta

/branches/2.3-fixes – snapshot build with latest fixes

/branches/2.3-multimom – allows multiple moms on a single node

www.clusterresources.com

http://www.clusterresources.com/downloads/torque/

torque-2.3.7.tar.gz is the latest released version


TORQUE Installation

Extract and build the distribution on the machine that will act as the TORQUE server.

> tar -xzvf torqueXXX.tar.gz
> cd torqueXXX
> ./configure
> make
> make install


TORQUE Installation

Torque Install Directory

● Default location /usr/local/
  ○ bin
    ● Contains client commands – qstat, pbsnodes, qsub, etc.
    ● Needed on server and login/submission hosts
  ○ sbin
    ● Contains server and node daemons – pbs_server, pbs_mom, pbs_demux, pbs_sched, momctl
  ○ lib
    ● Contains TORQUE libraries – libtorque.so.x


TORQUE Installation

Initial TORQUE Startup

pbs_server

As root type
pbs_server -t create
or
torque.setup <user>

Stop pbs_server before running in production
qterm


TORQUE Installation

root@ken-linuxBox:/usr/local/sbin# pbs_server -t create

Qmgr: p s
#
# Set server attributes.
#
set server acl_hosts = ken-linuxBox
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6


TORQUE Installation

ken@ken-linuxBox:~/dev/torque/2.3-fixes$ sudo ./torque.setup ken

create queue batch
#
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = ken-linuxBox
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300


TORQUE Configuration

TORQUE Home Directory

● Default /var/spool/torque -- $TORQUE_HOME, $PBS_HOME, etc.
  ○ /var/spool/torque
    ● server_name – Name of host where pbs_server resides. Can have multiple host names for high availability
    ○ server_priv
      ● jobs
      ● nodes
    ○ server_logs
      ● files of the form yyyymmdd (e.g. 20090916)
    ○ mom_priv
      ● jobs
      ● config
    ○ mom_logs
      ● files of the form yyyymmdd (e.g. 20090916)


TORQUE Configuration

pbs_server Configuration -- nodes file

● server_priv/nodes
  ○ contains list of mom host names and attributes
    ■ attributes
      ● np – number of processors
      ● note – administrator note
      ● properties – administrator's choice
● nodes file syntax
  ○ host np=X note=string property1 property2 ... propertyn
  ○ example:
    ■ host1 np=4 note=new intel_i7 data
    ■ host2 np=4 x86
    ■ host3 np=8 amd_64
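
As a rough sketch of the nodes file syntax above, the snippet below writes the three example entries to a scratch file (not the real server_priv/nodes) and sums the np values with awk. The scratch path and the helper name total_np are illustrative assumptions, not part of TORQUE.

```shell
#!/bin/sh
# Sketch only: writes the example entries to a scratch file, not to
# $TORQUE_HOME/server_priv/nodes. total_np is a hypothetical helper.
nodes_file="${TMPDIR:-/tmp}/nodes.example.$$"

cat > "$nodes_file" <<'EOF'
host1 np=4 note=new intel_i7 data
host2 np=4 x86
host3 np=8 amd_64
EOF

# Sum the np= values across all hosts listed in a nodes file.
total_np() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^np=[0-9]+$/) { sub(/^np=/, "", $i); total += $i }
    } END { print total + 0 }' "$1"
}

echo "hosts: $(grep -c . "$nodes_file")  total np: $(total_np "$nodes_file")"
rm -f "$nodes_file"
```

A check like this can be run on a candidate nodes file before copying it into place and restarting pbs_server.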


TORQUE Configuration

pbs_server node configuration

● Restart pbs_server
● Run pbsnodes

host1
state = down
np = 4
properties = intel_i7,data
ntype = cluster
note = new

host2
state = down
np = 4


TORQUE Configuration

pbs_server node configuration

● Dynamic node configuration
> qmgr -c "create node node003"

● Manually edit the nodes file
  ■ $TORQUE_HOME/server_priv/nodes
● Restart pbs_server daemon after change


TORQUE Configuration

● pbs_server queue configuration
  ○ Attributes
    ■ queue_type
      ● execution, route
    ■ resources_default
      ● default resource requirements for jobs (walltime, nodes)
    ■ enabled
      ● specifies whether queue accepts new jobs (Default FALSE)
    ■ started
      ● specifies whether jobs in queue are allowed to execute (Default FALSE)


TORQUE Configuration

● pbs_server queue configuration
  ○ default queue batch
  ○ create new queue
    ■ qmgr
      ● create queue reg
      ● set queue reg queue_type=Execution
      ● set queue reg resources_default.nodes=1
      ● set queue reg resources_default.walltime=01:00:00
      ● set queue reg enabled=True
      ● set queue reg started=True
  ○ setting default queue
    ■ qmgr -c "set server default_queue=reg"

Note: A queue is called a class in Moab
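
The queue directives above can be generated rather than typed by hand. The following is a minimal sketch that only prints the directives for review; the helper name queue_directives and its optional walltime argument are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: prints the qmgr directives for a new execution queue so they
# can be reviewed first. queue_directives is a hypothetical helper name.
queue_directives() {
    queue="$1"
    walltime="${2:-01:00:00}"
    cat <<EOF
create queue $queue
set queue $queue queue_type = Execution
set queue $queue resources_default.nodes = 1
set queue $queue resources_default.walltime = $walltime
set queue $queue enabled = True
set queue $queue started = True
EOF
}

queue_directives reg
# To apply on a live server (as a TORQUE manager), the output could be
# piped into qmgr, for example: queue_directives reg | qmgr
```

Printing first keeps the sketch safe to run on a machine without TORQUE installed.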


TORQUE Configuration

pbs_mom Configuration

● As root run pbs_mom
  ○ No special configuration needed to start
  ○ use mom_priv/config for options
● mom_priv/config
  ○ Allows custom configuration of mom node
  ○ Syntax
    ■ $<option> value
    ■ example
      $loglevel 3
      $usecp *.fte.com:/data /usr/local/data


TORQUE Configuration

● For shared filesystems use the $usecp parameter in the mom_priv/config file

$usecp *.fte.com:/data /usr/local/data

● For local, non-shared filesystems, rcp or scp must be configured to allow direct copy without prompting for passwords (key authentication, etc.)

http://www.clusterresources.com/products/torque/docs/6.1scpsetup.shtml


TORQUE Configuration

Scheduler Configuration

● Follow directions for scheduler of choice
● Moab configuration
  ○ http://www.clusterresources.com/products/mwm/docs/2.0installation.shtml


Advanced Configuration

Customizing the Install

Most recommended configure options have been selected as default. Some often used options:

--with-debug – for use with gdb
--prefix=<DIR> -- change install directory
--exec-prefix=<DIR> -- change only executable install directory
--disable-gcc-warnings – Use with care.

./configure --help will give all options


Advanced Configuration

● Configuring Job Submission Hosts
  ● Use acl_hosts
  ● Use torque.cfg submithosts, allowcomputehosts
  ● /etc/hosts.equiv
● Configuring TORQUE on a Multi-Homed Server
● Specifying Non-Root Administrators

> qmgr
Qmgr: set server managers += josh@*.fsc.com
Qmgr: set server operators += josh@*.fsc.com
Qmgr: set server log_level=3


Job Administration

Job Flow

● pbs_server receives new job
● Informs the scheduler
● When nodes available, scheduler sends instructions and nodes list to pbs_server
● pbs_server sends job to the first node in the node list
● The first node, or Mother Superior, launches the job and passes it to the rest of the nodes in the node list, or the Sister moms


Job Administration

qsub
● Batch and Interactive
● Requesting Resources
● Examples
  ● To ask for 2 processors on each of four nodes:
    ● qsub -l nodes=4:ppn=2
  ● The following job will wait until node01 is free with 200 MB of available memory:
    ● qsub -l nodes=node01,mem=200mb /home/user/script.sh
● Directives can be embedded into job script
  ● example on next page


Job Administration

#!/bin/sh

# Job name, target queue, resource request, and notification address
#PBS -N ds14FeedbackDefaults
#PBS -q testqueue
#PBS -l nodes=1:ppn=2,walltime=240:00:00
#PBS -M user@mydomain.com

source ~/.bashrc

# List the nodes allocated to the job, then print the job ID
cat $PBS_NODEFILE
echo $PBS_O_JOBID
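
Inside a job, the file named by $PBS_NODEFILE can be summarized instead of just printed. The sketch below assumes the file lists one host name per line, repeated once per allocated processor; the helper name slots_per_host and the sample host names are illustrative.

```shell
#!/bin/sh
# Sketch only: summarizes a node file with one host name per line, repeated
# once per allocated processor. slots_per_host is a hypothetical helper.
slots_per_host() {
    sort "$1" | uniq -c | awk '{ print $2 " " $1 }'
}

# Outside a job $PBS_NODEFILE is unset, so fall back to a sample file that
# mimics a nodes=2:ppn=2 allocation.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    slots_per_host "$PBS_NODEFILE"
else
    sample="${TMPDIR:-/tmp}/nodefile.example.$$"
    printf 'node01\nnode01\nnode02\nnode02\n' > "$sample"
    slots_per_host "$sample"
    rm -f "$sample"
fi
```

Each output line pairs a host with the number of processor slots the job holds on it.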


Job Administration

Manually Administrating Jobs

> qsub scatter

4807.ken-linuxbox

> qstat

Job id Name User Time Use S Queue

---------------- ---------------- ---------------- -------- - -----

4807 scatter user01 12:56:34 Q batch


Job Administration

Manually Administrating Jobs

> qrun 4807

> qstat

Job id Name User Time Use S Queue

---------------- ---------------- ---------------- -------- - -----

4807 scatter user01 12:56:34 R batch

> qstat

Job id Name User Time Use S Queue

---------------- ---------------- ---------------- -------- - -----

4807 scatter user01 12:56:34 C batch


Job Administration

Canceling Jobs

qdel

-W delay
Specify the delay between the sending of the SIGTERM and SIGKILL signals.

-p purge
Forcibly purge the job from the server. This option is only available to a batch operator or the batch administrator.

-m message
Specify a comment to be included in the email. The argument message specifies the comment to send. This option is only available to a batch operator or the batch administrator.

[all|ALL]
Delete all jobs in the queue


Job Administration

Automating Job Administration

Integrate with an external scheduler
Moab Workload Manager

Job Arrays
submit multiple jobs at once

Submit Filters

Job Preemption


Job Administration

● Job Arrays
  ○ TORQUE 2.3 and later
  ○ Allows single line submission of multiple jobs for a single script
  ○ Job can be monitored as a group

Example
> qsub -t 0-3 scatter
33.hostname
> qstat

Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
33-0 scatter-0 user01 12:56:34 R batch
33-1 scatter-1 user01 12:56:34 R batch
33-2 scatter-2 user01 12:56:34 R batch
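
A job array is most useful when each member works on different input. The sketch below shows one way a job script might select its input by array index. It assumes the batch system exports PBS_ARRAYID to each array member; the helper input_for_index and the input.N.dat naming are hypothetical.

```shell
#!/bin/sh
# Sketch only: a job script body that chooses its input by array index.
# Assumes the batch system exports PBS_ARRAYID for each array member;
# input_for_index and the input.N.dat naming are hypothetical.
input_for_index() {
    echo "input.${1}.dat"
}

index="${PBS_ARRAYID:-0}"   # default to 0 when run outside an array job
echo "array member $index would process $(input_for_index "$index")"
```

Defaulting the index lets the same script be tried interactively before it is submitted with qsub -t.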


Job Administration

Submit Filters

When submit filters exist, TORQUE sends the command file to the script/executable, which modifies the request based on site policies.

Submit filter designated in torque.cfg.
Found in /var/spool/torque
Keyword SUBMITFILTER

Example torque.cfg
SUBMITFILTER /home/user/submit_filter


Job Administration

Submit Filter Examples

/home/user/submit_filter

#!/bin/sh

# add default memory constraints and an e-mail notification address to all requests
# that did not specify it in user's script or command line

echo "#PBS -l mem=16MB"
echo "#PBS -M ken@adaptivecomputing.com"

# pass the submitted script through unchanged
while read i
do
  echo "$i"
done
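
The filter above emits its two directives for every job. A variant that adds a default only when the submitted script lacks one could look like the sketch below. It inspects only directives inside the script read from stdin, not qsub command-line options; the helper name add_defaults is hypothetical and the default values simply mirror the example.

```shell
#!/bin/sh
# Sketch only: a filter that adds defaults only when the submitted script
# lacks them. It reads the job script on stdin and writes it to stdout.
# add_defaults is a hypothetical helper; the values mirror the slide.
add_defaults() {
    tmp="${TMPDIR:-/tmp}/submit_filter.$$"
    cat > "$tmp"                       # buffer stdin so it can be scanned
    grep -q '^#PBS .*mem=' "$tmp" || echo "#PBS -l mem=16MB"
    grep -q '^#PBS .*-M ' "$tmp"  || echo "#PBS -M ken@adaptivecomputing.com"
    cat "$tmp"
    rm -f "$tmp"
}

# Demo input; a real filter would end with a bare call to add_defaults.
printf '#!/bin/sh\n#PBS -l mem=2gb\nls -alR /\n' | add_defaults
```

In the demo the script already requests memory, so only the mail directive is added ahead of the unchanged script text.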


Job Administration

Submit Filter Examples

listtest.sh

#!/bin/sh
ls -alR /

qsub listtest.sh
10.kmn.cridomain

cat /var/spool/torque/server_priv/jobs/10.kmn.cridomain.SC

#PBS -l mem=16MB
#PBS -M ken@adaptivecomputing.com
ls -alR /


TORQUE Administration

Job Preemption

Torque has three basic tools
Cancel – qdel
Requeue – qrerun
Checkpoint

The scheduler uses the basic tools to enable job preemption.
See Moab for more information

http://www.clusterresources.com/products/mwm/docs/8.4preemption.shtml


TORQUE Administration

Monitoring Resources

TORQUE reports a number of attributes broken into 3 major categories:

Configuration
Includes both detected hardware configuration, and specified batch attributes
Can report static ‘generic resources’ via specification in the mom config file

Utilization
Includes information regarding the amount of node resources currently available (in use) as well as information about who or what is consuming it
Can report dynamic ‘generic resources’ via specification of a ‘monitor script’ in the mom config file

State
Includes administrative status, general node health information, and general usage status


TORQUE Administration

Monitoring Resources

> pbsnodes

ken-linuxBox
state = free
np = 2
properties = bldg1,intel_i7
ntype = cluster
status = opsys=linux,uname=Linux ken-linuxBox 2.6.24-23-generic #1 SMP Wed Apr 1 21:47:28 UTC 2009 i686,sessions=4983 5873 6220 6331 6335 6360 6369 6402 6456 6460 6489 6582,nsessions=12,nusers=2,idletime=1,totmem=8123824kb,availmem=7584648kb,physmem=2067360kb,ncpus=2,loadave=0.05,netload=36957532,state=free,jobs=,varattr=,rectime=1252467787
note = backed_up
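
Output like this is easy to reduce to a per-node summary. The sketch below prints one "node state" pair per node from pbsnodes-style text. It assumes node names start in the first column and attribute lines are indented, as pbsnodes prints them on a terminal; node_states is a hypothetical helper and the sample text stands in for a live pbsnodes call.

```shell
#!/bin/sh
# Sketch only: prints "node state" pairs from pbsnodes-style text on stdin.
# Assumes node names start in column one and attributes are indented, as in
# pbsnodes output; node_states is a hypothetical helper.
node_states() {
    awk '
        /^[^ \t]/ { node = $1 }                # unindented line: node name
        /^[ \t]+state = / { print node, $3 }   # indented state attribute
    '
}

# Sample text stands in for a live "pbsnodes" call.
node_states <<'EOF'
ken-linuxBox
     state = free
     np = 2
     status = opsys=linux,state=free,jobs=

host2
     state = down
     np = 4
EOF
```

On a live system the same function could read from a pipe, for example pbsnodes | node_states.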


TORQUE Administration

Node States

States
down (down)
offline (drained)
job-exclusive (busy)
free (idle/running)
reserve
job-sharing
busy
time-shared
state-unknown

Changing node state
Offline
pbsnodes -o <nodename>
Online
pbsnodes -c <nodename>

Viewing nodes of a particular state
pbsnodes -l


TORQUE Administration

Node Properties

● Node Property Attributes
  ● Can apply multiple properties per node
  ● Properties are ‘opaque’
  ● Each property can be applied to multiple nodes
  ● Properties cannot be consumed
● Dynamically with qmgr
> qmgr -c "set node node001 properties = bigmem"
> qmgr -c "set node node001 properties += dualcore"
● Manually edit server_priv/nodes file
  ○ always restart pbs_server after modifying nodes file


TORQUE Administration

Accounting Records

● Torque maintains accounting records of jobs in server_priv/accounting
● file of the form yyyymmdd

Record Marker   Record Type   Description
D               delete        Job was deleted
E               exit          Job has exited (successfully or unsuccessfully)
Q               queue         Job has been submitted/queued
S               start         An attempt to start the job has been made (if the job fails to properly start, it may have multiple job start records)

● 09/08/2009 22:15:58;Q;9.ken-linuxbox;queue=batch
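
Accounting files in this format can be tallied with standard tools. The sketch below counts records by marker, assuming the semicolon-separated layout of the sample record (timestamp;marker;job id;attributes). The helper name count_record_types is hypothetical, and every input line after the first is a made-up sample.

```shell
#!/bin/sh
# Sketch only: counts accounting records by type. Assumes the
# semicolon-separated layout of the sample record (timestamp;type;job id;
# attributes). count_record_types is a hypothetical helper.
count_record_types() {
    awk -F';' 'NF >= 3 { n[$2]++ } END { for (t in n) print t, n[t] }' | sort
}

# The first line is the record from the slide; the rest are made-up samples.
count_record_types <<'EOF'
09/08/2009 22:15:58;Q;9.ken-linuxbox;queue=batch
09/08/2009 22:16:03;S;9.ken-linuxbox;user=user01 queue=batch
09/08/2009 22:20:41;E;9.ken-linuxbox;user=user01 queue=batch
09/08/2009 22:21:10;Q;10.ken-linuxbox;queue=batch
EOF
```

Pointing the function at a day's file under server_priv/accounting would give a quick count of submissions, starts, and exits for that day.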


Diagnostics

Log Files

pbs_server log files
/var/spool/torque/server_logs
qmgr: set server log_level=x

pbs_mom log files
/var/spool/torque/mom_logs
/var/spool/torque/mom_priv/config
$loglevel x


Diagnostics

MOM Diagnostics

momctl
○ Diagnoses mom configuration and communication with server
○ -d3 option
○ Output on next slide


Diagnostics

Host: ken-linuxBox/ken-linuxbox Version: 2.3.8 PID: 12792
Server[0]: ken-linuxBox (127.0.1.1:15001)
  Init Msgs Received: 0 hellos/1 cluster-addrs
  Init Msgs Sent: 1 hellos
  Last Msg From Server: 8 seconds (StatusJob)
  Last Msg To Server: 15 seconds
HomeDirectory: /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (110542371 blocks available)
NOTE: syslog enabled
MOM active: 153 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
TCP Timeout: 20 seconds
Prolog: /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: 127.0.1.1,127.0.0.1
Copy Command: /usr/bin/scp -rpB
job[12.ken-linuxbox] state=RUNNING sidlist=12830
Assigned CPU Count: 1

diagnostics complete


MPI

MPI (Message Passing Interface)

● Used for parallel jobs
● Augments communication between tasks distributed across cluster
● TORQUE can run with any MPI library
● TORQUE provides limited integration with some MPI libraries
● MPI packages
  ○ MPICH – Argonne National Lab
  ○ MPICH-VMI – NCSA
  ○ Open MPI


MPI

MPIExec Overview

● Replacement for mpirun script
● Initializes a parallel job from within a PBS batch or interactive environment
● Uses task manager library of PBS to spawn copies of executable on nodes
● TM interface faster than invoking separate rsh (mpirun)
● Resources used by spawned processes accounted correctly with mpiexec
● Tasks that exceed assigned limits (walltime, memory, disk space) are killed
● mpiexec can enforce a security policy. Obviates use of rsh or ssh

See mpiexec home page for more information.
http://www.osc.edu/~djohnson/mpiexec/index.php


Multi-Mom

● Multiple pbs_mom daemons on a single node
● Intended to enhance testing but possible to use in production
● Moms uniquely identified by name and ports
● Default pbs_mom ports
  ○ 15002
  ○ 15003
● Use alias in /etc/hosts
  ○ 192.168.0.10 myhost myhost1 myhost2
  ○ max alias names?


Multi-Mom

Invoking multi-mom
● syntax – pbs_mom -m -M 30002 -R 30003
● modify nodes file
  ○ node1 np=2
  ○ node2 np=2 mom_service_port=30002 mom_manager_port=30003
● stopping multi-mom
  ○ momctl -s -p 30003


Any-mom

● Enables any mom node to join a cluster without having an entry in the server_priv/nodes file.
● Syntax
  ● pbs_server -e
● Can dynamically add moms to cluster without restarting pbs_server
● Creates security issues
  ● cannot control who joins the cluster
  ● need outside security policy


TORQUE Roadmap

TORQUE 2.3.8
● Bug fixes only

TORQUE 2.4
● Complete 2.3-fixes merge
● CPU affinity (very basic implementation)
● Multi-mom
● Any mom

TORQUE 2.5
● TORQUE testing framework
● Eliminate need for privileged ports
● CPUsets improvements
● Improve TORQUE HA

TORQUE 3.0
● Alternate communication model between pbs_server, MOMs and sisters
● Scalability for super large systems with large MPI jobs (10,000+ nodes)


TORQUE Q&A