Post on 18-Dec-2015
An Overview of the Portable Batch System
Gabriel Mateescu
National Research Council Canada
I M S B
gabriel.mateescu@nrc.ca
www.sao.nrc.ca/~gabriel/presentations/sgi_pbs
Outline
• PBS highlights
• PBS components
• Resources managed by PBS
• Choosing a PBS scheduler
• Installation and configuration of PBS
• PBS scripts and commands
• Adding preemptive job scheduling to PBS
PBS Highlights
• Developed by Veridian / MRJ
• Robust, portable, effective, extensible batch job queuing and resource management system
• Supports different schedulers
• Supports heterogeneous clusters
• Open PBS - open source version
• PBS Pro - commercial version
Recent Versions of PBS
• PBS 2.2, November 1999:
– both the FIFO and SGI schedulers have bugs in enforcing resource limits
– poor support for stopping & resuming jobs
• OpenPBS 2.3, September 2000:
– better FIFO scheduler: resource limits enforced, backfilling added
• PBS Pro 5.0, September 2000:
– claims support for job stopping/resuming, better scheduling, IRIX cpusets
Resources managed by PBS
• PBS manages jobs, CPUs, memory, hosts, and queues
• PBS accepts batch jobs, enqueues them, runs them, and delivers output back to the submitter
• Resources describe attributes of jobs, queues, and hosts
• The scheduler chooses the jobs that fit within queue and cluster resources
Main Components of PBS
• Three daemons:
– pbs_server - the server
– pbs_sched - the scheduler
– pbs_mom - the job executor & resource monitor
• The server accepts commands and communicates with the daemons:
– qsub - submit a job
– qstat - view queue and job status
– qalter - change a job's attributes
– qdel - delete a job
Resource Examples
• ncpus - number of CPUs per job
• mem - resident memory per job
• pmem - per-process memory
• vmem - virtual memory per job
• cput - CPU time per job
• walltime - real time per job
• file - file size per job
Resource limits
• resources_max - per-job limit for a resource; determines whether a job fits in a queue
• resources_default - default amount of a resource assigned to a job
• resources_available - advice to the scheduler on how much of a resource can be used by all running jobs
Choosing a Scheduler (1)
• FIFO scheduler:
– First-fit placement: enqueues a job in the first queue whose per-job limits it satisfies, even if that queue currently lacks free resources while another queue could run the job immediately
– Supports per-job and (in version 2.3) per-queue resource limits: ncpus, mem
– Supports per-server limits on the number of CPUs and on memory (based on the server attribute resources_available)
Choosing a Scheduler (2)
• Algorithms in the FIFO scheduler:
– FIFO - sort jobs by queuing time, running the earliest job first
– Backfill - relax the FIFO rule for parallel jobs, as long as out-of-order jobs do not delay jobs that precede them in FIFO order
– Fair share - sort & schedule jobs based on past usage of the machine by the job owners
– Round-robin - pick a job from each queue in turn
– By key - sort jobs by a set of keys: shortest_job_first, smallest_memory_first
Choosing a Scheduler (3)
• The FIFO scheduler supports round-robin load balancing as of version 2.3
• FIFO scheduler
– decouples the job's requirement on the number of CPUs from its requirement on the amount of memory
– but its simple first-fit placement may force the user to name an execution queue explicitly, even when the job could fit in more than one queue
Choosing a Scheduler (4)
• SGI scheduler
– supports FIFO, fair share, and backfilling, and attempts to avoid job starvation
– supports both per-job and per-queue limits on the number of CPUs and on memory
– the per-server limit is the number of node cards
– makes a best effort in choosing a queue in which to run a job; a job without enough resources to run is kept in the submit queue
– ties the number of CPUs allocated to a job to the amount of memory allocated to it
Resource allocation
• The SGI scheduler allocates nodes:
node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ]
• The number of nodes N for a job is the smallest N such that
[ ncpus, mem ] <= [ N*PE_PER_NODE, N*MB_PER_NODE ]
where ncpus and mem are the job's CPU and memory limits, specified, e.g., with #PBS -l mem
• The job attributes Resource_List.{ncpus, mem} are set to
Resource_List.ncpus = N * PE_PER_NODE
Resource_List.mem = N * MB_PER_NODE
Queue and Server Limits
• FIFO scheduler:
– per-job limits (ncpus, mem) are defined by the resources_max queue attributes
– as of version 2.3, resources_max also defines per-queue limits
– per-server resource limits are enforced with the resources_available attributes
Queue and Server Limits
• SGI scheduler:
– per-job limits (ncpus, mem) are defined by the resources_max queue attributes
– resources_max also defines per-queue limits
– the per-server limit is given by the number of Origin node cards; unlike with the FIFO scheduler, resources_available limits are not enforced
Job enqueuing (1)
• The scheduler places each job in some queue; this involves several tests for resources
• Which queue a job is enqueued into depends on
– what limits are tested
– first-fit versus best-fit placement
• A job can fit in a queue if the resources requested by the job do not exceed the maximum values of the resources defined for the queue. For example, for the resource ncpus:
Resource_List.ncpus <= resources_max.ncpus
Job enqueuing (2)
• A job fits in a queue if the resources already assigned to the queue plus the requested resources do not exceed the queue's resource maxima. For example, for ncpus:
resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus
• A job fits in the system if the sum of all assigned resources plus the requested resources does not exceed the available resources. For example, for ncpus:
Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus
First fit versus best fit
• The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue
– if the job does not actually fit, it waits for the requested resources in the execution queue
• The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits, then dispatches the job to that queue
• If queues are defined with monotonically increasing resource limits (e.g., CPU time), then first fit carries no penalty
• However, if a job can fit in several queues, the SGI scheduler will find a better schedule
Limits on the number of running jobs
• Per-queue and per-server limits on the number of running jobs:
– max_running
– max_user_run, max_group_run - maximum number of running jobs per user or group
• Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per-queue basis
– it enforces MAX_JOBS from the scheduler config file as a substitute for max_running
SGI Origin Install (1)
• Source files are under OpenPBS_v2_3/src
• Consider the SGI scheduler
• Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware:
#define MB_PER_NODE ((size_t) 512*1024*1024)
#define PE_PER_NODE 2
• You may set PE_PER_NODE = 1 to allocate half-nodes, if MB_PER_NODE is set accordingly
SGI Origin Install (2)
• Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c
• Operator precedence bug (around line 198); the fixed test parenthesizes the mask:

for ( qptr = qlist; qptr != NULL; qptr = qptr->next ) {
    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {
        // without the parentheses, bad operator precedence bypasses this test
        if ( !schd_evaluate_system(...) ) {
            // DONT_START_JOB (0), so don't change allfull
            continue;
        }
        // ...
    }
}
SGI Origin Install (3)
• Fix of a logic bug in pack_queues.c: if a system limit is exceeded, the scheduler should not try to schedule the job

for ( qptr = qlist; qptr != NULL; qptr = qptr->next ) {
    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {
        if ( !schd_evaluate_system(...) ) {
            // DONT_START_JOB (0), so don't change allfull
            continue;
        }
        // ...
    }
}
for ( qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next ) {
    // if allfull is set, do not attempt to schedule
}
SGI Origin Install (4)
• Fix of a logic bug in user_limits.c, function user_running()
• This function counts the number of running jobs, so it must test for equality (==, not =) between the job state and 'R':

user_running( ... )
{
    for ( job = queue->jobs; job != NULL; job = job->next ) {
        if ( (job_state == 'R') && (!strcmp(job->owner, user)) )
            jobs_running++;
        // ...
    }
}
SGI Origin Install (5)
• The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array:

#define SGI_ZOMBIE_WRONG 1
int mom_over_limit( ... ) {
    // ...
#if !defined(SGI_ZOMBIE_WRONG)
    return (TRUE);
#endif
    // ...
}
SGI Origin Install (6)
Script to run the configure command
___________________________________________________
#!/bin/csh -f
set PBS_HOME=/usr/local/pbs
set PBS_SERVER_HOME=/usr/spool/pbs
# Select the SGI or the FIFO scheduler
set SCHED="--set-sched-code=sgi_origin --enable-nodemask"
#set SCHED="--set-sched-code=fifo --enable-nodemask"
$HOME/PBS/OpenPBS_v2_3/configure \
  --prefix=$PBS_HOME \
  --set-server-home=$PBS_SERVER_HOME \
  --set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \
  --set-sched=cc $SCHED --enable-array --enable-debug
SGI Origin Install (7)
___________________________________________________
# cd /usr/local/pbs
# makePBS          <- the configure script from the previous slide
# make
# make install
# cd /usr/spool/pbs
# ls sched_priv
config  decay_usage
Configuring for the SGI scheduler
• Queue types:
– one submit queue
– one or several execution queues
• Per-server limit on the number of running jobs
• Load control
• Fair-share scheduling
– past usage of the machine is used in ranking the jobs
– decayed past usage per user is kept in sched_priv/decay_usage
• Scheduler restart action
• PBS manager tool: qmgr
Queue definition
• File sched_priv/config:
SUBMIT_QUEUE submit
BATCH_QUEUES hpc,back
MAX_JOBS 256
ENFORCE_PRIME_TIME False
ENFORCE_DEDICATED_TIME False
SORT_BY_PAST_USAGE True
DECAY_FACTOR 0.75
SCHED_ACCT_DIR /usr/spool/pbs/server_priv/accounting
SCHED_RESTART_ACTION RESUBMIT
Load Control
• Load control for the SGI scheduler, in sched_priv/config:
TARGET_LOAD_PCT 90%
TARGET_LOAD_VARIANCE -15%,+10%
• Load control for the FIFO scheduler, in mom_priv/config:
$max_load 2.0
$ideal_load 1.0
PBS for the SGI scheduler
• Qmgr tool:
s server managers = bob@n0.bar.com
create queue submit
s q submit queue_type = Execution
s q submit resources_max.ncpus = 4
s q submit resources_max.mem = 1gb
s q submit resources_default.mem = 256mb
s q submit resources_default.ncpus = 1
s q submit resources_default.nice = 15
s q submit enabled = True
s q submit started = True
PBS for the SGI scheduler
create queue hpc
s q hpc queue_type = Execution
s q hpc resources_max.ncpus = 2
s q hpc resources_max.mem = 512mb
s q hpc resources_default.mem = 256mb
s q hpc resources_default.ncpus = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True
s q hpc enabled = True
s q hpc started = True
PBS for the SGI scheduler
• Server attributes:
set server default_queue = submit
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s server scheduling = True
s server query_other_jobs = True
PBS for the FIFO scheduler
• The scheduler config file is sched_config instead of config, and queues are not defined there
• The submit queue is a Route queue:
s q submit queue_type = Route
s q submit route_destinations = hpc
s q submit route_destinations += back
• Server attributes:
s server resources_available.mem = 1gb
s server resources_available.ncpus = 4
PBS Job Scripts
• Job scripts contain PBS directives and shell commands:
#PBS -l ncpus=2
#PBS -l walltime=12:20:00
#PBS -m ae
#PBS -c c=30
cd ${PBS_O_WORKDIR}
mpirun -np 2 foo.x
Basic PBS commands
• Jobs are submitted with qsub:
% qsub [-q hpc] foo.pbs
13.node0.bar.com
• Job status is queried with qstat [-f|-a], which shows the job owner, name, queue, status, session ID, # CPUs, and walltime:
% qstat -a 13
• Alter job attributes:
% qalter -l walltime=20:00:00 13
Job Submission and Tracking
• Find jobs in state R (running) or submitted by user bob:
% qselect -s R
% qselect -u bob
• Query queue status to find whether a queue is enabled/started, and the number of jobs in it:
% qstat [-f | -a] -Q
• Delete a job:
% qdel 13
Job Environment and I/O
• The job's current directory is the submitter's $HOME, which is also the default location for files created by the job; change it with cd in the script
• The standard output and error of the job are spooled to JobName.{o|e}JobID in the submitter's current directory. Override this with:
#PBS -o | -e pathname
Tips
• Trace the history of a job with tracejob, which gives a time-stamped sequence of the events affecting the job:
% tracejob 13
• Cron jobs for cleaning up daemon work files under mom_logs, sched_logs, server_logs:
# crontab -e
9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;
Sample PBS Front-End
[Figure: node0 (execution server) runs pbs_server, pbs_sched, and pbs_mom; node1 (submission server) runs only the commands qsub, qdel, ...]
PBS for clusters
• File staging - copy files (other than stdout/stderr) between a submission-only host and the server:
#PBS -W stagein=/tmp/bar@n1:/home/bar/job1
#PBS -W stageout=/tmp/bar/job1/*@n1:/home/bar/job1
PBS uses the directory /tmp/bar/job1 as a scratch directory
• File staging may precede the job start, which helps hide transfer latencies
Setting up a PBS Cluster
• Assume n1 runs the pbs_mom daemon
• $PBS_SERVER_HOME/server_priv/nodes:
n0 np=2 gaussian
n1 np=2 irix
• n0:$PBS_SERVER_HOME/mom_priv/config:
$clienthost n1
$ideal_load 1.5
$max_load 2.0
• n1:$PBS_SERVER_HOME/mom_priv/config:
$ideal_load 1.5
$max_load 2.0
Setting up a PBS Cluster
• Qmgr tool:
s server managers = bob@n0.bar.com
create queue hpc
s q hpc queue_type = Execution
s q hpc Priority = 100
s q hpc resources_max.ncpus = 2
s q hpc resources_max.nodect = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True
Setting up a PBS Cluster
• Server attributes:
set server default_node = n0
set server default_queue = hpc
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s s resources_default.nodect = 1
s s resources_default.nodes = 1
s s resources_default.neednodes = 1
set server max_user_run = 2
PBS features
• The job submitter can request a number of nodes with given properties
• For example:
– request a node with the property gaussian:
#PBS -l nodes=gaussian
– request two nodes with the property irix:
#PBS -l nodes=2:irix
PBS Security Features
• All files used by PBS are owned by root and can be written only by root
• Configuration files: sched_priv/config, mom_priv/config are readable only by root
• $PBS_HOME/pbs_environment defines $PATH; it is writable only by root
• pbs_mom daemon accepts connections from a privileged port on localhost or from a host listed in mom_priv/config
• The server accepts commands from selected hosts and users
Why preemptive scheduling?
• Resource reservation (CPU, memory) is needed to achieve high job throughput
• Static resource reservation may lead to low machine utilization, high job waiting times, and hence slow job turn-around
• An approach is needed to achieve both high job throughput and rapid job turn-around
Static Reservation Pitfall (1)
[Figure: a parallel computer or cluster of nodes (CPU + memory), statically partitioned between the Physics group and the Biotech group, with job requests arriving on each side of the partition boundary]
Static Reservation Pitfall (2)
• Physics Group’s Job 1 is assigned 3 nodes and dispatched
• Biotech Group’s Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: there is only 1 node available for the group
• However, there are enough resources for Job 3
Proposed Approach (1)
• Leverage the features of the Portable Batch System (PBS)
• Extend PBS with preemptive job scheduling
• All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted. These are the dedicated queues
• Define a queue for jobs that may be preempted: the background queue
Proposed Approach (2)
• Each user belongs to a group and each group is authorized to submit jobs to some dedicated queues as well as to the background queue
• The sum of the resources defined for the dedicated queues does not exceed the machine resources
• The resources assigned to jobs in a dedicated queue do not exceed the queue resource limits
Proposed Approach (3)
• Jobs fitting in a dedicated queue are dispatched, observing job owner’s access rights
• Jobs not fitting in a dedicated queue are dispatched to the background queue, if there are enough available resources in the system
• Jobs in the background queue borrow resources from the dedicated queues
Proposed Approach (4)
• If a job entering the system would fit in a dedicated queue provided resources lent to the background queue are reclaimed, job preemption is triggered
• Jobs from the background queue will be held to release the resources needed by a dedicated queue
• Held jobs are re-queued and will be dispatched along with the other pending jobs
Example (1)
Two queues, each with 4-CPU capacity

Job   Queue     #CPU   Submit time   CPU time
 1    Physics     1        0           4 h
 2    Biotech     2        0           4 h
 3    Physics     4        0           3 h
 4    Biotech     2       2 h          1 h
 5    Physics     2       2 h          1 h
Example (2)
Turn-around times

Job    with preemption    without
 1         4 h              4 h
 2         4 h              4 h
 3         4 h              7 h
 4         3 h              3 h
 5         3 h              3 h

Job 3's turn-around drops from 7 h to 4 h; its waiting time drops by 75%, from 4 h to 1 h
Key Points
• Provide guaranteed resources per user group and per job
• Allow resources not used by the dedicated queues to be borrowed by the background queue
• Provide a mechanism for reclaiming resources lent to the background queue
• Achieve low job waiting time and high job throughput
Benefits of the Approach
• Reduce job waiting time by harnessing resources not used by the dedicated queues
• Reduce job wall-time by reserving resources for all the jobs
• Pending jobs fitting in dedicated queues can reclaim resources from jobs that borrowed those resources and run in the background queue