Managing Batch Jobs

Managing batch jobs for seismic data processing via queues provides the following benefits:

• sequential release of serially dependent jobs

• parallel release of groups of independent jobs

• optimized system performance by controlling resource allocation

• centralized management of system workload

Introduction to Batch Job Queues

Seismic data processing using the SeisSpace® or ProMAX® processing systems on an individual workstation or a Linux cluster can benefit from using a flexible batch queuing and resource management software package. Batch queuing software generally has three components: a server, a scheduler, and some sort of executor (mom). A generic diagram showing the relationship between the various components of the Torque queuing software is illustrated below.

(Diagram: the Torque Server, Torque Scheduler, and Torque Mom, with the ProMAX® system, qmgr commands, and the SeisSpace® UI interfacing to the Torque Server.)


Generic Queued Job Workflow

1. A job is submitted to the queuing system server via a command like "qsub".

2. The server communicates with the scheduler and requests the number of nodes the job needs.

3. The scheduler gathers current node or workstation resource utilization and reports back to the server which nodes to use.

4. The server communicates with the mom(s) to start the job on the node(s) allocated.

Note that a single Linux workstation has one mom daemon, as the diagram above shows, but a Linux cluster can have hundreds to thousands of compute nodes, each running one mom.

Torque and SGE (Sun Grid Engine) are typical of the available queuing packages. For this release we tested and documented batch job queuing using Torque. This package can be freely downloaded from http://www.clusterresources.com/downloads/torque.

ProMAX® Interface

The ProMAX® system defines an interface that communicates with the installed queuing system. It consists of a number of named functions, their arguments, and return values. For example, to submit a flow to a queue, the system calls $PROMAX_HOME/sys/exe/queup. This shell script is responsible for calling the routines specific to the queuing system and for querying, formatting, and returning the information required by the system. For a complete description of the options, see the Queue Interface Protocol section.


Torque Installation and Configuration Steps

1. Download and install Torque source code

2. Set torque configuration parameters

3. Compile and link the Torque source code

4. Install the Torque executables and libraries

5. Configure the Torque server and mom

6. Test Torque Queue Submission

7. Start Torque server, scheduler, and mom at boot

8. Build the Torque packages for use in installing Torque on cluster compute nodes, then install these packages

9. Integrate the ProMAX® and SeisSpace® systems with Torque

10. Recommendations for Torque queues

Download and Install Torque Source Code

Landmark does not distribute Torque, so you will have to download the latest source tar bundle, which looks similar to torque-xx.yy.zz, from the following URL:

http://www.clusterresources.com/downloads/torque

The latest version of Torque we tested is 2.3.3, on a RedHat 4 Update 5 system.

Note: PBS and Torque are used interchangeably throughout this document.

As the root user, untar the source code for building the Torque server, scheduler, and mom applications.

> mkdir <some path>/apps/torque

> cd <some path>/apps/torque

> tar -zxvf <downloaded tar file location>/torque-xx.yy.zz.tar.gz

> cd torque-xx.yy.zz

If you decide you want to build the Torque graphical queue monitoring utilities xpbs and xpbsmon (recommended), there are some requirements.

Make sure tcl, tclx, tk, and their devel RPMs are installed for the architecture type of your system, such as i386 or x86_64. Since the tcl-devel-8.*.rpm and tk-devel-8.*.rpm files may not be included with several of the RHEL distributions, you may need to download them. There may be other versions that work as well.

Here is an example of required RPMs from a RHEL 4.5 x86_64 installation:

[root@sch1 prouser]# rpm -qa | grep tcl-8

tcl-8.4.7-2

[root@sch1 prouser]# rpm -qa | grep tcl-devel-8

tcl-devel-8.4.7-2

[root@sch1 prouser]# rpm -qa | grep tclx-8

tclx-8.3.5-4

[root@sch1 prouser]# rpm -qa | grep tk-8

tk-8.4.7-2

[root@sch1 prouser]# rpm -qa | grep tk-devel-8

tk-devel-8.4.7-2

Here is an example of required RPMs from a RHEL 5.2 x86_64 installation:

> rpm -qa | grep libXau-dev

libXau-devel-1.0.1-3.1

> rpm -qa | grep tcl-devel-8

tcl-devel-8.4.13-3.fc6

> rpm -qa | grep xorg-x11-proto

xorg-x11-proto-devel-7.1-9.fc6

> rpm -qa | grep libX11-devel

libX11-devel-1.0.3-8.el5

> rpm -qa | grep tk-devel

tk-devel-8.4.13-3.fc6

> rpm -qa | grep libXdmcp-devel

libXdmcp-devel-1.0.1-2.1

> rpm -qa | grep mesa-libGL-devel

mesa-libGL-devel-6.5.1-7.2.el5


> rpm -qa | grep tclx-devel

tclx-devel-8.4.0-5.fc6

Set Torque Configuration Parameters

We will now compile and link the server, scheduler, and mom all at the same time, then later generate specific Torque "packages" to install on all compute nodes, which run just the moms. There are many ways to install and configure Torque queues; here we are presenting just one.

Torque queue setup for a single workstation is essentially the same as for the master node of a cluster, apart from some differences discussed later. You should be logged into the master node as root if you are installing on a Linux cluster, or logged into your workstation as root.

Here is the configure command for RHEL 4.5 x86_64:

> ./configure --enable-mom --enable-server --with-scp --with-server-default=<hostname of server> --enable-gui --enable-docs --with-tclx=/usr/lib64

Note that we pointed to /usr/lib64 for the 64-bit tclx libraries. This would be /usr/lib on 32-bit systems.

Here is the configure command for RHEL 5.2 x86_64:

> ./configure --enable-mom --enable-server --with-scp --with-server-default=<hostname of server> --enable-gui --enable-docs --with-tcl=/usr/lib64 --without-tclx

With the use of "--with-scp" we are selecting ssh for file transfers between the server and moms. This means that ssh needs to be set up such that no passwords are required in both directions between the server and moms for all users.
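One common way to satisfy this requirement is OpenSSH public key authentication. The following is only a sketch using the standard ssh-keygen and ssh-copy-id tools; the key type and paths are ordinary OpenSSH defaults, not something mandated by Torque or the ProMAX® system. Repeat for each user, and verify in both directions:

> ssh-keygen -t rsa

> ssh-copy-id <compute node hostname>

> ssh <compute node hostname> hostname

If the last command prints the remote hostname without prompting for a password, scp file delivery between the server and that mom should work for that user.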

Compile and Link the Torque Source Code

We will now compile and link the Torque binaries.

> make

Install the Torque Executables and Libraries

We will now install the Torque executables and libraries.

> make install


Configure the Torque Server and Mom

Instructions for installing and configuring Torque in this document treat a single workstation and the master node of a cluster the same, then discuss where the configuration of a cluster is different.

Let's go ahead and set up two example queues for our workstation or cluster. The first thing we will do is configure our master node or single workstation for the Torque server and mom daemons.

> cd /var/spool/torque/server_priv

Now let's define which nodes our queues will be communicating with. The first thing to do is to build the /var/spool/torque/server_priv/nodes file. This file states the nodes that are to be monitored and have jobs submitted to them, the type of node, the number of CPUs the node has, and any special node properties.

Here is an example nodes file:

master np=2 ntype=cluster promax

n1 np=2 ntype=cluster promax seisspace

n2 np=2 ntype=cluster promax seisspace

n3 np=2 ntype=cluster seisspace

...

nxx np=2 ntype=cluster seisspace

The promax and seisspace entries are called properties. It is possible to assign queue properties so that a queue only submits jobs to nodes with that same property. Instead of the entries n1, n2, and so on, you would enter your workstation's hostname or the hostnames of your compute nodes.
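For example, one way to tie a queue to a node property is to set the queue's default neednodes resource with qmgr. This is a sketch only; the queue and property names are illustrative, and the command would be run after the queues are created later in this chapter:

> /usr/local/bin/qmgr -c "set queue parallel resources_default.neednodes = seisspace"

With this setting, jobs submitted to the parallel queue are only placed on nodes carrying the seisspace property in the nodes file.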

Now let's initialize the pbs mom /var/spool/torque/mom_priv/config file. Here is an example of what one would look like:

# Log all but debug events, but 127 is good for normal logging.

$logevent 127

# Set log size and deletion parameters so we don’t fill /var

$log_file_max_size 1000

$log_file_roll_depth 5

# Make node unschedulable if load >4.0; continue when load drops <3.0

$ideal_load 3.0

$max_load 4.0


# Define server node

$pbsserver <server hostname>

# Use cp rather than scp or rcp for local (nfs) file delivery

$usecp *:/export /export

The $max_load and $ideal_load parameters will have to be tuned for your system over time, and are gauged against the current entry in the /proc/loadavg file. You can also use the "uptime" command to see what the current load average of the system is.

How many and what type of processes can the node handle before it is overloaded? For example, if you have a quad-core machine, then a $max_load of 4.0 and an $ideal_load of 3.0 would be just fine. For $pbsserver, be sure to put the hostname of your Torque server.

After a job is finished, the stdout and stderr files are copied back to the server so they can be viewed. The $usecp entry specifies the filesystems for which a simple "cp" command can be used rather than "scp" or "rcp". The output of the "df" command shows what should go into the $usecp entry. For example:

df

Filesystem 1K-blocks Used Available Use% Mounted on

sch1:/data 480721640 327473640 148364136 69% /data

The $usecp entry would be "$usecp *:/data /data"

Now let's start the Torque server so we can load its database with our new queue configuration.

> /usr/local/sbin/pbs_server -t create

Warning: if you have an existing set of Torque queues, the "-t create" option will erase the existing queue configuration.

Now we need to add and configure some queues. We have documented a simple script which should help automate this process. You can type these instructions in by hand, or build a script to run. Here is what this script looks like:

#!/bin/ksh

/usr/local/bin/qmgr -e <server name> << "EOF"

c q serial queue_type=execution

c q parallel queue_type=execution

s q serial enabled=true, started=true, max_user_run=1


s q parallel enabled=true, started=true

set server scheduling=true

s s scheduler_iteration=30

s s default_queue=interactive

s s managers="<username>@*"

s s node_pack=false

s s query_other_jobs=true

print server

EOF

When creating and configuring queues you typically are doing the following:

• Creating a queue and specifying its type: execution or route.

• Enabling and starting the queue.

• Defining any resource limitations, such as job runtime, or other properties for a queue.

• Defining properties of the server, such as who can manage queues.

To type these in by hand, start the Torque queue manager by typing:

> /usr/local/bin/qmgr
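The abbreviated "c q", "s q", and "s s" lines in the script above are shorthand for create queue, set queue, and set server. A sketch of the same two-queue setup typed interactively with the full command names, one attribute per command, looks like this:

create queue serial queue_type=execution

create queue parallel queue_type=execution

set queue serial enabled=true

set queue serial started=true

set queue serial max_user_run=1

set queue parallel enabled=true

set queue parallel started=true

set server scheduling=true

print server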

Now let's restart the Torque server, start the Torque scheduler and mom on the master node or single workstation, and test our installation.

> killall pbs_server

> /usr/local/sbin/pbs_server

> /usr/local/sbin/pbs_sched

> /usr/local/sbin/pbs_mom

Now let's start the Torque GUIs, xpbs and xpbsmon, to see the status of our queues and the Torque mom.

> /usr/local/bin/xpbs &

If you built the GUIs, an xpbs window showing the status of your queues should appear.


> /usr/local/bin/xpbsmon &

If you built the GUIs, an xpbsmon window showing the status of the nodes should appear.


Testing Torque Queue Submission

Before integrating the ProMAX® software with Torque, it is a good idea to test the Torque setup by submitting a job (script) to Torque from the command line. Here is an example script called pbs_queue_test:

#!/bin/ksh
#PBS -S /bin/ksh
#PBS -N pbs_queue_test
#PBS -j oe
#PBS -r y
#PBS -o <NFS mounted filesystem>/pbs_queue_output
#PBS -l nodes=1
######### End of Job ##########

hostname
echo ""
env
echo ""
cat $PBS_NODEFILE

You will need to modify the #PBS -o line of the script to direct the output to an NFS mounted filesystem which can be seen by the master node or single workstation. Submit the job to Torque as follows, using a non-root user:

> /usr/local/bin/qsub -q serial -m n <script path>/pbs_queue_test

If the job ran successfully, there should be a file called <NFS mounted filesystem>/pbs_queue_output containing the results of the script.
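While the job is queued or running, you can also watch it from the command line with qstat, the standard Torque/PBS status command; the job id used below is whatever qsub printed when you submitted the job:

> /usr/local/bin/qstat

> /usr/local/bin/qstat -f <job id>

> cat <NFS mounted filesystem>/pbs_queue_output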

Starting the Torque Server, Scheduler, and Mom at Boot

To start the Torque daemons when the machines boot up, use the following scripts for the master node and single workstation:

• pbs_server, pbs_sched, and pbs_mom

The following /etc/init.d/pbs_server script starts pbs_server for Linux:

#!/bin/sh

#

# pbs_server This script will start and stop the PBS Server

#

# chkconfig: 345 85 85


# description: PBS is a versatile batch system for SMPs and clusters

#

# Source the library functions

. /etc/rc.d/init.d/functions

BASE_PBS_PREFIX=/usr/local

ARCH=$(uname -m)

AARCH="/$ARCH"

if [ -d "$BASE_PBS_PREFIX$AARCH" ]

then

PBS_PREFIX=$BASE_PBS_PREFIX$AARCH

else

PBS_PREFIX=$BASE_PBS_PREFIX

fi

PBS_HOME=/var/spool/torque

# let's see how we were called

case "$1" in

start)

echo -n "Starting PBS Server: "

if [ -r $PBS_HOME/server_priv/serverdb ]

then

daemon $PBS_PREFIX/sbin/pbs_server

else

daemon $PBS_PREFIX/sbin/pbs_server -t create

fi

echo

;;

stop)


echo -n "Shutting down PBS Server: "

killproc pbs_server

echo

;;

status)

status pbs_server

;;

restart)

$0 stop

$0 start

;;

*)

echo "Usage: pbs_server {start|stop|restart|status}"

exit 1

esac

The following /etc/init.d/pbs_sched script starts pbs_sched for Linux:

#!/bin/sh

#

# pbs_sched This script will start and stop the PBS Scheduler

#

# chkconfig: 345 85 85

# description: PBS is a versatile batch system for SMPs and clusters

#

# Source the library functions

. /etc/rc.d/init.d/functions

BASE_PBS_PREFIX=/usr/local


ARCH=$(uname -m)

AARCH="/$ARCH"

if [ -d "$BASE_PBS_PREFIX$AARCH" ]

then

PBS_PREFIX=$BASE_PBS_PREFIX$AARCH

else

PBS_PREFIX=$BASE_PBS_PREFIX

fi

# let's see how we were called

case "$1" in

start)

echo -n "Starting PBS Scheduler: "

daemon $PBS_PREFIX/sbin/pbs_sched

echo

;;

stop)

echo -n "Shutting down PBS Scheduler: "

killproc pbs_sched

echo

;;

status)

status pbs_sched

;;

restart)

$0 stop

$0 start

;;


*)

echo "Usage: pbs_sched {start|stop|restart|status}"

exit 1

esac

The following /etc/init.d/pbs_mom script starts pbs_mom for Linux:

#!/bin/sh

#

# pbs_mom This script will start and stop the PBS Mom

#

# chkconfig: 345 85 85

# description: PBS is a versatile batch system for SMPs and clusters

#

# Source the library functions

. /etc/rc.d/init.d/functions

BASE_PBS_PREFIX=/usr/local

ARCH=$(uname -m)

AARCH="/$ARCH"

if [ -d "$BASE_PBS_PREFIX$AARCH" ]

then

PBS_PREFIX=$BASE_PBS_PREFIX$AARCH

else

PBS_PREFIX=$BASE_PBS_PREFIX

fi

# let's see how we were called

case "$1" in

start)


if [ -r /etc/security/access.conf.BOOT ]

then

cp -f /etc/security/access.conf.BOOT /etc/security/access.conf

fi

echo -n "Starting PBS Mom: "

daemon $PBS_PREFIX/sbin/pbs_mom -r

echo

;;

stop)

echo -n "Shutting down PBS Mom: "

killproc pbs_mom

echo

;;

status)

status pbs_mom

;;

restart)

$0 stop

$0 start

;;

*)

echo "Usage: pbs_mom {start|stop|restart|status}"

exit 1

esac

The following commands actually set up the scripts so the OS will start them at boot:


> chkconfig pbs_server on

> chkconfig pbs_sched on

> chkconfig pbs_mom on
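As a quick sanity check, you can confirm that the scripts are registered and that their status targets respond without waiting for a reboot:

> chkconfig --list pbs_server

> service pbs_server status

> service pbs_sched status

> service pbs_mom status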

Installing Torque On The Compute Nodes

Now that Torque seems to be working, let's install it on the compute nodes. To perform this we need to generate some Torque self-extracting scripts called "packages". In these packages we also need to include the Torque mom system startup (init.d) scripts, as well as the mom configuration information. Note that this step is not necessary for a single workstation.

> cd <some path>/apps/torque-xx.yy.zz

> mkdir pkgoverride;cd pkgoverride

> mkdir mom;cd mom

> tar -cvpf - /var/spool/torque/mom_priv/config | tar -xvpf -

> tar -cvpf - /etc/rc.d/init.d/pbs_mom | tar -xvpf -

> cd <some path>/apps/torque-xx.yy.zz;make packages

Although all of the packages are generated, we only need to install some of them on the compute nodes. Here is a list of the packages we need:

• torque-package-clients-linux-x86_64.sh

• torque-package-devel-linux-x86_64.sh

• torque-package-mom-linux-x86_64.sh

To install these packages you need to copy them to an NFS mounted filesystem if the directory where they are stored is not visible to all compute nodes. For example:

> cp *.sh <NFS mount filesystem>

Note that if you are using cluster management software such as XCAT, Warewulf, or RocksClusters, you are better off integrating the Torque mom files and configuration into the compute node imaging scheme.

Install the packages by hand on each node, or if you have some type of cluster management software such as XCAT, use that to install onto each node.

> psh compute <NFS mounted filesystem>/torque-package-clients-linux-x86_64.sh --install


> psh compute <NFS mounted filesystem>/torque-package-devel-linux-x86_64.sh --install

> psh compute <NFS mounted filesystem>/torque-package-mom-linux-x86_64.sh --install

> psh compute /sbin/chkconfig pbs_mom on

> psh compute /sbin/service pbs_mom start

The xpbsmon application should refresh shortly, showing the status of the compute nodes, which should be "green" if the nodes are ready to accept scheduled jobs.
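You can also check node state from the command line with pbsnodes, which is installed with the other Torque client commands (typically in /usr/local/bin). Nodes that are ready to accept work report state = free:

> /usr/local/bin/pbsnodes -a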

Connecting the ProMAX® system and Torque

The ProMAX® software by default is set to use Torque (PBS) queues. The $PROMAX_HOME/etc/qconfig_pbs file defines which Torque queues are available for use, the name associations, the function to be called in building a job execution script, and any variables which get passed to the function script. You should modify this file to conform with the Torque queues that you have created.

#

# PBS batch queues

#

name = serial

type = batch

description = "Serial Execution Batch Jobs"

function = pbs_submit

menu = que_res_pbs.menu

properties = local

machine = <torque_batch_server>

#

name = parallel

type = batch

description = "Parallel Execution Batch Jobs"

function = pbs_submit

properties = local

menu = que_res_pbs.menu

machine = <torque_batch_server>


With the configuration above, these two queues appear as submit options in the SeisSpace® job submit window.

If you have configured your queues for a cluster, and have confirmed that they are working properly, you need to do a couple of things to disable the master node from being used as a compute node.

1. Turn off the pbs_mom.

> service pbs_mom stop

2. Disable the pbs_mom from starting at boot.

> chkconfig pbs_mom off


3. Remove the master node from the /var/spool/torque/server_priv/nodes file.

Recommendations for Torque queues

Based on our batch job queue testing efforts we offer the following guidelines for configuring your Torque batch job queues.

• It is important that the queue does not release too many jobs at the same time. You specify the number of available nodes and CPUs per node in the /var/spool/torque/server_priv/nodes file. Each job is submitted to the queue with a request for a number of CPU units. The default for ProMAX® jobs is 1 node and 1 CPU, or 1 CPU unit. That is, to release a job, there must be at least one node that has 1 CPU unallocated.

• There can be instances when jobs do not quickly release from the queue although resources are available. It can take a few minutes for the jobs to release. You can change the scheduler_iteration setting with the Torque qmgr command (a one-line qmgr sketch follows this list). The default is 600 seconds (or 10 minutes). We suggest a value of 30 seconds. Even with this setting, dead time of up to 2 minutes has been observed. It can take some time before the loadavg begins to fall after the machine has been loaded.

• By default, Torque installs itself into the /var/spool/torque, /usr/local/bin, and /usr/local/sbin directories. Always address qmgr by its full name of /usr/local/bin/qmgr. The directory path /usr/local/bin is added to the PATH statement inside the queue management scripts by setting the PBS_BIN environment variable. If you are going to alter the PBS makefiles and have PBS installed in a location other than /usr/local, make sure you change the PBS_BIN environment setting in the ProMAX® sys/exe/pbs/* files, and in the SeisSpace® etc/SSclient script example.

• Run the xpbs and xpbsmon programs, generally located in the /usr/local/bin directory, to monitor how jobs are being released and how the CPUs are monitored for availability. Black boxes in the xpbsmon user interface indicate that the node CPU load is greater than what has been configured, and no jobs can be spawned there until the load average drops. It is normal for nodes to show as different colored boxes in the xpbsmon display; this means that the nodes are busy and not accepting any work. You can also modify the automatic update time in the xpbsmon display. However, testing has shown that the automatic updating of the xpbs display may not be functioning.

• Landmark suggests that you read the documentation for Torque. These documents include more information about the system and ways to customize the configuration, and can be found on the Torque website.

• Torque requires that you have the hostnames and IP addresses in the hosts files of all the nodes.

Note: hostname is the name of your machine; hostname.domainname can be found in /etc/hosts, and commonly ends with .com:

ip address hostname.domain.com hostname

For DHCP users, ensure that all of the processing and manager nodes always get the same IP address.
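As mentioned in the scheduler_iteration item above, that setting can be changed at any time with a one-line qmgr command run by a Torque manager; 30 is the value we suggest:

> /usr/local/bin/qmgr -c "set server scheduler_iteration = 30"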

We present one method of installing and configuring Torque job queues. There are many alternative methods that will be successful so long as the following conditions exist:

• Install Torque for all nodes of the cluster. The installation can be done on each machine independently, or you can use a common NFS mounted file system, or your cluster management software may contain a preconfigured image.

• Install all components, including the server and scheduler, on one node. This is known as the server node and serves the other main processing nodes. Normally this will be the cluster manager node. On a single workstation the server, scheduler, and mom daemons are all installed.

• The following files must be the same on all installations on all machines:

/var/spool/torque/server_name
/var/spool/torque/mom_priv/config

The following file is only used by the server and scheduler on the manager machine:

/var/spool/torque/server_priv/nodes

• The UID and GID for users must be consistent across the master and compute nodes.

• All application, data, and home directories must be mounted the same on the master and compute nodes.

Switching Between Torque and other Queues

Multiple queues can be installed and functioning on the system concurrently. However, you cannot interface with more than one at the same time. Under the default configuration, the executable queue functions point to the pbs or torque queues. This is accomplished via links in the $PROMAX_HOME/sys/exe/ directory for the various que* executables. The ProMAX® distribution automatically links these to the executables found in the $PROMAX_HOME/sys/exe/pbs/ subdirectory.

A perl script:

# $PROMAX_HOME/port/bin/Setqueues

is included with this installation. It facilitates switching back and forth between different queues. lp, nqs, and pbs queues are shown as examples in the script. You will need permission to alter the contents of the ProMAX® installation.

Make sure the PROMAX_HOME and PROMAX_ETC_HOME environment variables are set before you run this script. The script has a dry run and a verbose mode. For example, if you run Setqueues -dv you will see a log of what the script would do without actually performing the operations.


Queue Interface Protocol

The ProMAX® system calls a suite of small programs or shell scripts to implement the various capabilities that it expects from a batch queuing system.

The following are the executables that reside in the sys/exe directory. You can write the internals of these executables to:

• satisfy the arglist and any defined return requirements

• call any que programs or scripts needed to perform tasks

queup

Arguments: 'queue name' 'file name' 'job name'

• queue name = the system name of the queue

• file name = the batch job file

• job name = area/line/flow

Synopsis

When queup successfully runs, it returns a completion code of 0 and a request id (rid) on standard output. For example, the LPD version needs to pass a request id to the cancel command in the body of the quekill script to ensure that the ProMAX® flow script is canceled. Therefore, this rid is used in quekill.
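As an illustration of this contract, a minimal PBS/Torque queup could simply hand the job file to qsub and echo the job identifier that qsub prints as the rid. This is a sketch only, not the shipped $PROMAX_HOME/sys/exe/pbs implementation:

#!/bin/sh
# queup sketch for PBS/Torque (illustrative only)
# Arguments: 'queue name' 'file name' 'job name'
queue="$1"
jobfile="$2"
# PBS job names may not contain '/', so flatten the area/line/flow name
jobname=$(echo "$3" | tr '/' '_')
rid=$(/usr/local/bin/qsub -q "$queue" -N "$jobname" "$jobfile") || exit 1
# The rid printed here is what quekill, quedelete, and questat receive later
echo "$rid"
exit 0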

quekill

Arguments: 'queue name' 'pid' 'rid' 'killer pathname'

• queue name = the system name of the queue

• pid = the process number of the active task (deprecated)

• rid = request id. This is the output of queup.

• killer pathname = directory containing killer.exe(deprecated)

Synopsis

Try to use the rid instead of the pid, because the pid is often insufficient for cleaning up queued tasks.


The killer pathname is not used by the LPD or NQS versions of this script because cancel is used. However, it remains for tasks that need to know where the ProMAX® software is installed.
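Under Torque the rid returned by queup is an ordinary PBS job identifier, so a quekill for PBS can simply pass it to qdel. Again, this is a sketch of the protocol rather than the shipped script:

#!/bin/sh
# quekill sketch for PBS/Torque (illustrative only)
# Arguments: 'queue name' 'pid' 'rid' 'killer pathname'
rid="$3"
# qdel removes the job whether it is still queued or already running
/usr/local/bin/qdel "$rid"
exit $?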

quedelete

Arguments: 'queue name' 'rid'

• queue name = the system name of the queue

• rid = request id. This is the output of queup.

Synopsis

quedelete removes the queued task with request id rid from the queue. Unlike quekill, quedelete does not remove a running task. In some queue systems these two commands are identical. However, the system flags an error if you attempt to call quedelete on a flow which has established a connection to the UI.

questat

Arguments: 'queue name' 'pid' 'rid' 'type'

• queue name = the system name of the queue

• pid = the process number of the active task (deprecated)

• rid = request id. This is the output of queup.

• type = a number

Synopsis

Click MB2 or MB3 on the pid/rid to pass 2 or 3 as type. 1 is reserved for the interactive socket based inquiry. The status is displayed.

queclear

Arguments: 'queue name'

• queue name = the system name of the queue.


Synopsis

Clears all jobs owned by the current user from the named queue.

quepriority

Arguments: 'queue name' 'rid' 'priority'

• queue name = the system name of the queue.

• rid = request id. This is the output of queup.

• priority = the new priority.

Synopsis

Changes the priority of rid. Returns 0 on success.

quelist

Arguments: 'queue list'

• queue list = a list of queues.

Synopsis

Scans each queue in queue list for your list of jobs. Each job has the following format: 'rid' 'queue name' 'priority'

questate

Arguments: 'queue list' | 'queue name' 'state'

questate is called in one of the following ways:

• queue list = a list of queues

or

• queue name = the system name of the queue.

• state = the new state (up or down)

Synopsis

The queue list form outputs each queue name and its current state. The queue name and state form changes the state of the queue, returning 0 on success.


qconfig Options

The etc/qconfig file is used to declare batch queues. This section describes qconfig parameters and requirements.

qconfig parameters

Each queue is described by the following qconfig parameters:

• name is the name of the queue as known to the system. It is typically kept short and is the first descriptor.

• type is the type of queue (batch or plot).

• description is the name of the queue that you see in the UI.

• properties sets the arguments local, save, and button.

• button uses the first queue with this property for jobs executed using MB1 or MB2 when foreground job submission is disabled.

• function acts as a filter to help further manage jobs and queues. It is the name of an additional executable file.

• machine is the machine that accesses the queue. The default is the local machine.

• user is your user login name. This is necessary because quelist and questate occur during initialization.

• menu is any menu that prompts for additional queue input.

Parameter Descriptions

name matches the ProMAX® interface to the queuing system. For example, for an lp queue this must match the name returned by lpstat -a; for an NQS queue, the names returned by qstat -l.

type is the type of queue. batch is used for jobs executed with MB3.

plot is used for accessing the older mft software from SDI. This option is almost obsolete because most plotting software maintains its own queue control.


description, or desc, is the description of the queue that you see in queue popup menus.

properties, or props, is a list of keywords indicating specific behavior for the last named queue. The available keywords are:

• local tells the execution script to handle the disposition of the printout from the flow.

• save writes a copy of the output to $PROMAX_HOME/queues/<queuename>/<machinename>. Do not use this keyword if aar_promax is running, because aar_promax also collects a copy of job.output.

If you specify save, the following text files must exist:

$PROMAX_HOME/queues/<queuename>/<machinename>

For example, if you are setting up the sml queue on a machine named vader, create $PROMAX_HOME/queues/sml/vader. This file should have general read/write permissions. Type the following:

# touch <hostname>

# chmod 666 <hostname>

Any job submitted through a queue with the save property writes a copy of the job.output and any errors from the queue itself to the corresponding file in the queues directory. Review this file if the queues are not operating correctly.

• button is used only if foreground job submission is disabled by inclusion of the stanza all_batch: t in the $PROMAX_HOME/etc/config_file.

function passes additional information into the batch job script. Any output written to standard output (stdout) by function is written into the batch job script file.

These directives tell the type of shell to execute and where to send the output. If function is a relative pathname, it is taken relative to $PROMAX_HOME/sys/exe; otherwise it is a fully qualified pathname. function is called with three arguments:

• Real is the first argument (dry is no longer meaningful).

• Queue name is the name of the queue (sml or bsh).

• Output file writes error messages to this file.


The function terminates with the following error message if it fails:

exit [error number]

Note: Do not specify a function in the qconfig file when using lp queues. If a function is specified for an lp queue, you cannot view the output of the job until the job is complete.
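To make the mechanism concrete, here is a sketch of what a minimal PBS-style function might look like. The shipped pbs_submit function is more elaborate; the directives emitted here simply echo the conventions used in the pbs_queue_test example earlier:

#!/bin/sh
# qconfig 'function' sketch for a PBS queue (illustrative only)
# Arguments: 1 = real (dry is no longer meaningful)
#            2 = queue name
#            3 = file to which error messages are written
queue="$2"
errfile="$3"
if [ -z "$queue" ]; then
    echo "function: no queue name supplied" >> "$errfile"
    exit 1
fi
# Everything written to stdout is copied into the batch job script,
# so emit the shell and queue directives here.
echo "#PBS -S /bin/ksh"
echo "#PBS -q $queue"
echo "#PBS -j oe"
exit 0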

machine accesses remote machines by two basic methods:

• the queuing system is aware of the remote queue and provides it with a local handle.

• the ProMAX® system recognizes that a particular queue is remote and uses queup to submit a flow to it. If machine is present, then the last named queue is accessed.

user (if machine is present) provides the user id to use in the remote call. If not present, then the current user id is used.

Note: any remote execution is controlled using the remote_exec: stanza in the config_file.

menu is the port/menu or absolute pathname of a Lisp format menu file. A menu stores auxiliary input that is accessed as an X Windows routine.

Code single user and machine entries into the menu to override parameter values. Menu parameter names containing the string label are ignored.

submit has the format of the sample menu. (Refer to the sample menus in $PROMAX_PORT_MENU_HOME/promax/que_res.menu or que_res_nqs.menu.) If the submit parameter has the wrong format, the ProMAX® UI hangs. All other parameters are converted to environment variables that are accessed by function or queup.
