Post on 18-Dec-2015
An Overview of the Portable Batch System
Gabriel Mateescu
National Research Council Canada
I M S B
gabriel.mateescu@nrc.ca
www.sao.nrc.ca/~gabriel/presentations/sgi_pbs
Outline
• PBS highlights
• PBS components
• Resources managed by PBS
• Choosing a PBS scheduler
• Installation and configuration of PBS
• PBS scripts and commands
• Adding preemptive job scheduling to PBS
PBS Highlights
• Developed by Veridian / MRJ
• Robust, portable, effective, extensible batch job queuing and resource management system
• Supports different schedulers
• Supports heterogeneous clusters
• Open PBS - open source version
• PBS Pro - commercial version
Recent Versions of PBS
• PBS 2.2, November 1999:
– both the FIFO and SGI schedulers have bugs in enforcing resource limits
– poor support for stopping & resuming jobs
• OpenPBS 2.3, September 2000:
– better FIFO scheduler: resource limits enforced, backfilling added
• PBS Pro 5.0, September 2000:
– claims support for job stopping/resuming, better scheduling, IRIX cpusets
Resources managed by PBS
• PBS manages jobs, CPUs, memory, hosts, and queues
• PBS accepts batch jobs, enqueues them, runs them, and delivers output back to the submitter
• Resources describe attributes of jobs, queues, and hosts
• The scheduler chooses the jobs that fit within queue and cluster resources
Main Components of PBS
• Three daemons:
– pbs_server - the server
– pbs_sched - the scheduler
– pbs_mom - the job executor & resource monitor
• The server accepts commands and communicates with the daemons:
– qsub - submit a job
– qstat - view queue and job status
– qalter - change a job's attributes
– qdel - delete a job
Resource Examples
• ncpus - number of CPUs per job
• mem - resident memory per job
• pmem - per-process memory
• vmem - virtual memory per job
• cput - CPU time per job
• walltime - real time per job
• file - file size per job
Resource limits
• resources_max - per-job limit for a resource; determines whether a job fits in a queue
• resources_default - default amount of a resource assigned to a job
• resources_available - advice to the scheduler on how much of a resource can be used by all running jobs
Choosing a Scheduler (1)
• FIFO scheduler:
– First-fit placement: enqueues a job in the first queue whose per-job limits it satisfies, even if that queue currently lacks free resources while another queue could run the job immediately
– Supports per-job and (in version 2.3) per-queue resource limits: ncpus, mem
– Supports per-server limits on the number of CPUs and on memory (based on the server attribute resources_available)
Choosing a Scheduler (2)
• Algorithms in the FIFO scheduler:
– FIFO - sort jobs by queuing time, running the earliest job first
– Backfill - relax the FIFO rule for parallel jobs, as long as out-of-order jobs do not delay jobs that precede them in FIFO order
– Fair share - sort & schedule jobs based on past usage of the machine by the job owners
– Round-robin - pick a job from each queue in turn
– By key - sort jobs by a set of keys: shortest_job_first, smallest_memory_first
Choosing a Scheduler (3)
• The FIFO scheduler supports round-robin load balancing as of version 2.3
• FIFO scheduler
– decouples the job's requirement on the number of CPUs from its requirement on the amount of memory
– but its simple first-fit placement may force the user to name an execution queue explicitly, even when the job could fit in more than one queue
Choosing a Scheduler (4)
• SGI scheduler
– supports FIFO, fair share, and backfilling, and attempts to avoid job starvation
– supports both per-job and per-queue limits on the number of CPUs and on memory
– the per-server limit is the number of node cards
– makes a best effort in choosing a queue in which to run a job; a job without enough resources to run is kept in the submit queue
– ties the number of CPUs allocated to a job to the amount of memory allocated to it
Resource allocation
• The SGI scheduler allocates nodes:
node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ]
• The number of nodes N for a job is the smallest N such that
[ ncpus, mem ] <= [ N*PE_PER_NODE, N*MB_PER_NODE ]
where ncpus and mem are the job's CPU and memory limits, specified, e.g., with #PBS -l mem
• The job attributes Resource_List.{ncpus, mem} are set to
Resource_List.ncpus = N * PE_PER_NODE
Resource_List.mem = N * MB_PER_NODE
Queue and Server Limits
• FIFO scheduler:
– per-job limits (ncpus, mem) are defined by the resources_max queue attributes
– as of version 2.3, resources_max also defines per-queue limits
– per-server resource limits are enforced with the resources_available attributes
Queue and Server Limits
• SGI scheduler:
– per-job limits (ncpus, mem) are defined by the resources_max queue attributes
– resources_max also defines per-queue limits
– the per-server limit is given by the number of Origin node cards; unlike with the FIFO scheduler, resources_available limits are not enforced
Job enqueuing (1)
• The scheduler places each job in some queue; this involves several tests for resources
• Which queue a job is enqueued into depends on
– what limits are tested
– first-fit versus best-fit placement
• A job can fit in a queue if the resources requested by the job do not exceed the maximum values of the resources defined for the queue. For example, for the resource ncpus:
Resource_List.ncpus <= resources_max.ncpus
Job enqueuing (2)
• A job fits in a queue if the resources already assigned to the queue plus the requested resources do not exceed the queue's resource maxima. For example, for ncpus:
resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus
• A job fits in the system if the sum of all assigned resources plus the requested resources does not exceed the available resources. For example, for ncpus:
Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus
First fit versus best fit
• The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue
– if the job does not actually fit, it waits for the requested resources in the execution queue
• The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits, then dispatches the job to that queue
• If queues are defined with monotonically increasing resource limits (e.g., CPU time), then first fit carries no penalty
• However, if a job can fit in several queues, the SGI scheduler will find a better schedule
Limits on the number of running jobs
• Per-queue and per-server limits on the number of running jobs:
– max_running
– max_user_run, max_group_run - maximum number of running jobs per user or group
• Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per-queue basis
– it enforces MAX_JOBS from the scheduler config file as a substitute for max_running
SGI Origin Install (1)
• Source files are under OpenPBS_v2_3/src
• Consider the SGI scheduler
• Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware:
#define MB_PER_NODE ((size_t) 512*1024*1024)
#define PE_PER_NODE 2
• You may set PE_PER_NODE = 1 to allocate half-nodes, if MB_PER_NODE is set accordingly
SGI Origin Install (2)
• Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c
• Operator precedence bug (around line 198); the fixed test parenthesizes the mask:

for ( qptr = qlist; qptr != NULL; qptr = qptr->next ) {
    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {
        // without the parentheses, bad operator precedence bypasses this test
        if ( !schd_evaluate_system(...) ) {
            // DONT_START_JOB (0), so don't change allfull
            continue;
        }
        // ...
    }
}
SGI Origin Install (3)
• Fix of a logic bug in pack_queues.c: if a system limit is exceeded, the scheduler should not try to schedule the job

for ( qptr = qlist; qptr != NULL; qptr = qptr->next ) {
    if ( ( qptr->queue->flags & QFLAGS_FULL ) == 0 ) {
        if ( !schd_evaluate_system(...) ) {
            // DONT_START_JOB (0), so don't change allfull
            continue;
        }
        // ...
    }
}
for ( qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next ) {
    // if allfull is set, do not attempt to schedule
}
SGI Origin Install (4)
• Fix of a logic bug in user_limits.c, function user_running()
• This function counts the number of running jobs, so it must test for equality (==, not =) between the job state and 'R':

user_running( ... )
{
    for ( job = queue->jobs; job != NULL; job = job->next ) {
        if ( (job_state == 'R') && (!strcmp(job->owner, user)) )
            jobs_running++;
        // ...
    }
}
SGI Origin Install (5)
• The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array:

#define SGI_ZOMBIE_WRONG 1
int mom_over_limit( ... ) {
    // ...
#if !defined(SGI_ZOMBIE_WRONG)
    return (TRUE);
#endif
    // ...
}
SGI Origin Install (6)
Script to run the configure command
___________________________________________________
#!/bin/csh -f
set PBS_HOME=/usr/local/pbs
set PBS_SERVER_HOME=/usr/spool/pbs
# Select the SGI or the FIFO scheduler
set SCHED="--set-sched-code=sgi_origin --enable-nodemask"
#set SCHED="--set-sched-code=fifo --enable-nodemask"
$HOME/PBS/OpenPBS_v2_3/configure \
  --prefix=$PBS_HOME \
  --set-server-home=$PBS_SERVER_HOME \
  --set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \
  --set-sched=cc $SCHED --enable-array --enable-debug
SGI Origin Install (7)
___________________________________________________
# cd /usr/local/pbs
# makePBS          <- the configure script from the previous slide
# make
# make install
# cd /usr/spool/pbs
# ls sched_priv
config  decay_usage
Configuring for the SGI scheduler
• Queue types:
– one submit queue
– one or several execution queues
• Per-server limit on the number of running jobs
• Load control
• Fair-share scheduling
– past usage of the machine is used in ranking the jobs
– decayed past usage per user is kept in sched_priv/decay_usage
• Scheduler restart action
• PBS manager tool: qmgr
Queue definition
• File sched_priv/config:
SUBMIT_QUEUE submit
BATCH_QUEUES hpc,back
MAX_JOBS 256
ENFORCE_PRIME_TIME False
ENFORCE_DEDICATED_TIME False
SORT_BY_PAST_USAGE True
DECAY_FACTOR 0.75
SCHED_ACCT_DIR /usr/spool/pbs/server_priv/accounting
SCHED_RESTART_ACTION RESUBMIT
Load Control
• Load control for the SGI scheduler, in sched_priv/config:
TARGET_LOAD_PCT 90%
TARGET_LOAD_VARIANCE -15%,+10%
• Load control for the FIFO scheduler, in mom_priv/config:
$max_load 2.0
$ideal_load 1.0
PBS for the SGI scheduler
• Qmgr tool:
s server managers = bob@n0.bar.com
create queue submit
s q submit queue_type = Execution
s q submit resources_max.ncpus = 4
s q submit resources_max.mem = 1gb
s q submit resources_default.mem = 256mb
s q submit resources_default.ncpus = 1
s q submit resources_default.nice = 15
s q submit enabled = True
s q submit started = True
PBS for the SGI scheduler
create queue hpc
s q hpc queue_type = Execution
s q hpc resources_max.ncpus = 2
s q hpc resources_max.mem = 512mb
s q hpc resources_default.mem = 256mb
s q hpc resources_default.ncpus = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True
s q hpc enabled = True
s q hpc started = True
PBS for the SGI scheduler
• Server attributes:
set server default_queue = submit
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s server scheduling = True
s server query_other_jobs = True
PBS for the FIFO scheduler
• The scheduler config file is sched_config instead of config, and queues are not defined there
• The submit queue is a Route queue:
s q submit queue_type = Route
s q submit route_destinations = hpc
s q submit route_destinations += back
• Server attributes:
s server resources_available.mem = 1gb
s server resources_available.ncpus = 4
PBS Job Scripts
• Job scripts contain PBS directives and shell commands:
#PBS -l ncpus=2
#PBS -l walltime=12:20:00
#PBS -m ae
#PBS -c c=30
cd ${PBS_O_WORKDIR}
mpirun -np 2 foo.x
Basic PBS commands
• Jobs are submitted with qsub:
% qsub [-q hpc] foo.pbs
13.node0.bar.com
• Job status is queried with qstat [-f|-a], which shows the job owner, name, queue, status, session ID, # CPUs, and walltime:
% qstat -a 13
• Alter job attributes:
% qalter -l walltime=20:00:00 13
Job Submission and Tracking
• Find jobs in state R (running) or submitted by user bob:
% qselect -s R
% qselect -u bob
• Query queue status to find whether a queue is enabled/started, and the number of jobs in it:
% qstat [-f | -a] -Q
• Delete a job:
% qdel 13
Job Environment and I/O
• The job's current directory is the submitter's $HOME, which is also the default location for files created by the job; change it with cd in the script
• The standard output and error of the job are spooled to JobName.{o|e}JobID in the submitter's current directory. Override this with:
#PBS -o | -e pathname
Tips
• Trace the history of a job with tracejob, which gives a time-stamped sequence of the events affecting the job:
% tracejob 13
• Cron jobs for cleaning up daemon work files under mom_logs, sched_logs, server_logs:
# crontab -e
9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;
Sample PBS Front-End
[Figure: node0 (execution server) runs pbs_server, pbs_sched, and pbs_mom; node1 (submission server) runs only the commands qsub, qdel, ...]
PBS for clusters
• File staging - copy files (other than stdout/stderr) between a submission-only host and the server:
#PBS -W stagein=/tmp/bar@n1:/home/bar/job1
#PBS -W stageout=/tmp/bar/job1/*@n1:/home/bar/job1
PBS uses the directory /tmp/bar/job1 as a scratch directory
• File staging may precede the job start, which helps hide transfer latencies
Setting up a PBS Cluster
• Assume n1 runs the pbs_mom daemon
• $PBS_SERVER_HOME/server_priv/nodes:
n0 np=2 gaussian
n1 np=2 irix
• n0:$PBS_SERVER_HOME/mom_priv/config:
$clienthost n1
$ideal_load 1.5
$max_load 2.0
• n1:$PBS_SERVER_HOME/mom_priv/config:
$ideal_load 1.5
$max_load 2.0
Setting up a PBS Cluster
• Qmgr tool:
s server managers = bob@n0.bar.com
create queue hpc
s q hpc queue_type = Execution
s q hpc Priority = 100
s q hpc resources_max.ncpus = 2
s q hpc resources_max.nodect = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True
Setting up a PBS Cluster
• Server attributes:
set server default_node = n0
set server default_queue = hpc
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s s resources_default.nodect = 1
s s resources_default.nodes = 1
s s resources_default.neednodes = 1
set server max_user_run = 2
PBS features
• The job submitter can request a number of nodes with given properties
• For example:
– request a node with the property gaussian:
#PBS -l nodes=gaussian
– request two nodes with the property irix:
#PBS -l nodes=2:irix
PBS Security Features
• All files used by PBS are owned by root and can be written only by root
• Configuration files: sched_priv/config, mom_priv/config are readable only by root
• $PBS_HOME/pbs_environment defines $PATH; it is writable only by root
• pbs_mom daemon accepts connections from a privileged port on localhost or from a host listed in mom_priv/config
• The server accepts commands from selected hosts and users
Why preemptive scheduling?
• Resource reservation (CPU, memory) is needed to achieve high job throughput
• Static resource reservation may lead to low machine utilization, high job waiting times, and hence slow job turn-around
• An approach is needed to achieve both high job throughput and rapid job turn-around
Static Reservation Pitfall (1)
[Figure: a parallel computer or cluster of nodes (CPU + memory), statically partitioned between the Physics group and the Biotech group, with job requests arriving on each side of the partition boundary]
Static Reservation Pitfall (2)
• Physics Group’s Job 1 is assigned 3 nodes and dispatched
• Biotech Group’s Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: there is only 1 node available for the group
• However, there are enough resources for Job 3
Proposed Approach (1)
• Leverage the features of the Portable Batch System (PBS)
• Extend PBS with preemptive job scheduling
• All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted. These are the dedicated queues
• Define a queue for jobs that may be preempted: the background queue
Proposed Approach (2)
• Each user belongs to a group and each group is authorized to submit jobs to some dedicated queues as well as to the background queue
• The sum of the resources defined for the dedicated queues does not exceed the machine resources
• The resources assigned to jobs in a dedicated queue do not exceed the queue resource limits
Proposed Approach (3)
• Jobs fitting in a dedicated queue are dispatched, observing job owner’s access rights
• Jobs not fitting in a dedicated queue are dispatched to the background queue, if there are enough available resources in the system
• Jobs in the background queue borrow resources from the dedicated queues
Proposed Approach (4)
• If a job entering the system would fit in a dedicated queue provided resources lent to the background queue are reclaimed, job preemption is triggered
• Jobs from the background queue will be held to release the resources needed by a dedicated queue
• Held jobs are re-queued and will be dispatched along with the other pending jobs
Example (1)
Two queues, each with 4-CPU capacity

Job   Queue     #CPU   Submit time   CPU time
 1    Physics     1        0           4 h
 2    Biotech     2        0           4 h
 3    Physics     4        0           3 h
 4    Biotech     2       2 h          1 h
 5    Physics     2       2 h          1 h
Example (2)
Turn-around times

Job    with preemption    without
 1         4 h              4 h
 2         4 h              4 h
 3         4 h              7 h
 4         3 h              3 h
 5         3 h              3 h

Job 3's turn-around drops from 7 h to 4 h; its waiting time drops by 75%, from 4 h to 1 h
Key Points
• Provide guaranteed resources per user group and per job
• Allow resources not used by the dedicated queues to be borrowed by the background queue
• Provide a mechanism for reclaiming resources lent to the background queue
• Achieve low job waiting time and high job throughput
Benefits of the Approach
• Reduce job waiting time by harnessing resources not used by the dedicated queues
• Reduce job wall-time by reserving resources for all the jobs
• Pending jobs fitting in dedicated queues can reclaim resources from jobs that borrowed those resources and run in the background queue