Post on 14-Dec-2015
Links:
Condor’s homepage: http://www.cs.wisc.edu/condor/
Condor manual (for the version currently used): http://www.cs.wisc.edu/condor/manual/v6.8/
Table of contents
Condor overview
Useful Condor commands
Vanilla universe
Macros
Standard universe
Java universe
Matlab in Condor
ClassAds
DagMan
Condor overview
Condor is a system for running lots of jobs on a (preferably large) cluster of computers.
Condor is a specialized workload management system for compute-intensive jobs.
Condor overview
Condor’s inner structure: Condor is built of several daemons:
condor_master: This daemon is responsible for keeping all the rest of the Condor daemons running
condor_startd: This daemon represents a given machine to the Condor pool. It advertises attributes about the machine it’s running on. Must run on machines accepting jobs.
condor_schedd: This daemon is responsible for submitting jobs to Condor. It manages the job queue (each submit machine has its own!). Must run on machines submitting jobs.
condor_collector: Runs only on the Condor server. This daemon is responsible for collecting all the information about the status of a Condor pool. All other daemons periodically send updates to the collector.
condor_negotiator: Runs only on the condor server. This daemon is responsible for all the match-making within the Condor system.
condor_ckpt_server: Runs only on the checkpointing server. This is the checkpoint server. It services requests to store and retrieve checkpoint files.
Condor overview
Condor uses user priorities to allocate machines to users in a fair manner. A lower numerical value for user priority means higher priority. Each user starts out with the best user priority, 0.5.
If the number of machines a user currently has is greater than his priority, then his user priority will worsen (numerically increase) over time.
If the number of machines a user currently has is lower than his priority, then his priority will improve (numerically decrease) over time.
Use condor_userprio [-allusers] to see user priorities.
Useful Condor commands
condor_status
Shows all of the computers connected to Condor (not all are accepting jobs). Useful arguments:
-claimed: shows only machines running Condor jobs (and who runs them).
-available: shows only machines which are willing to run jobs now.
-long: displays entire ClassAds (discussed later on).
-constraint <const.>: shows only resources matching the given constraint.
Useful Condor commands
condor_status Attributes
Arch: "INTEL" means a 32-bit Linux machine; "X86_64" means a 64-bit Linux machine.
Activity:
"Idle": there is no job activity.
"Busy": a job is busy running.
"Suspended": a job is currently suspended.
"Vacating": a job is currently checkpointing.
"Killing": a job is currently being killed.
"Benchmarking": the startd is running benchmarks.
Useful Condor commands
condor_status More attributes
State:
"Owner": The machine owner is using the machine, and it is unavailable to Condor.
"Unclaimed": The machine is available to run Condor jobs, but a good match is either not available or not yet found.
"Matched": The Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
"Claimed": The machine is claimed by a remote machine and is probably running a job.
"Preempting": A Condor job is being preempted (possibly via checkpointing) in order to clear the machine, either for a higher-priority job or because the machine owner wants the machine back.
Useful Condor commands
condor_q
Shows the state of jobs submitted from the calling computer (the one running condor_q). Useful arguments:
-analyze: performs schedulability analysis on jobs. Useful to see why a scheduled job isn't running, and whether it is ever going to run.
-dag: sorts DAG jobs under their DAGMan.
-constraint <const.>: shows only jobs matching the given constraint (ClassAds).
-global (-g): gets the global queue.
-run: gets information about running jobs.
Useful Condor commands
condor_rm
Removes a scheduled job from the queue (of the scheduling computer).
condor_rm cluster.proc: removes the given job.
condor_rm cluster: removes the given cluster of jobs.
condor_rm user: removes all jobs owned by the given user.
condor_rm -all: removes all jobs.
Vanilla universe jobs
Vanilla universe is used for running jobs without special needs and features.
In the Vanilla universe, Condor runs the job the same way it would run without Condor.
Start with a simple example.c:

#include <stdio.h>

int main() {
    printf("hello condor");
    return 0;
}

Compile as usual: gcc example.c -o example
Vanilla universe jobs
In order to submit the job to Condor we use the condor_submit command.
Usage: condor_submit <sub_file>
A simple submit file (sub_example):

Universe   = Vanilla
Executable = example
Log        = test.log
Output     = test.out
Error      = test.error
Queue
Notice that the submission commands are case insensitive.
Vanilla universe jobs
There are a few other useful commands:
arguments = arg1 arg2 ...
Runs the executable with the given arguments.
Input = <input file>
The file given is used as standard input.
environment = "<var1>=<value1> <var2>=<value2> ..."
Runs the job with the given environment variables. To include spaces in a value, surround the value with single-quote marks; to insert a quotation mark, double it. Example:
environment = "a=""quote"" b='a ''b'' c'"
Vanilla universe jobs
getenv = <True | False>
If getenv is set to True, then condor_submit will copy all of the user's current shell environment variables at the time of job submission into the job ClassAd. The job will therefore execute with the same set of environment variables that the user had at submit time. Defaults to False.
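As an illustrative sketch (the variable names and values are made up for this example), a submit file that passes explicit environment variables could look like:

```
Universe    = Vanilla
Executable  = example
# hypothetical variables, just for illustration
environment = "DATA_DIR=/tmp/data MODE=fast"
# alternatively, inherit the submitter's entire environment instead:
# getenv = True
Log         = test.log
Output      = test.out
Error       = test.error
Queue
```

Explicit environment lists keep the job reproducible; getenv = True is more convenient but makes the job depend on whatever shell environment happened to exist at submit time.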
Vanilla universe jobs
A more advanced submission:

Universe   = Vanilla
Executable = example
Log        = test.$(cluster).$(process).log
Output     = test.$(cluster).$(process).out
Error      = test.$(cluster).$(process).error
Queue 7
Here we see a use of predefined macros: the $(cluster) macro gives us the value of the ClusterId job ClassAd attribute, and the $(process) macro supplies the value of the ProcId job ClassAd attribute.
Macros
More on macros:
A macro is defined as follows: <macro_name> = string
It can then be used by writing $(macro_name).
$$(attribute) is used to get a ClassAd attribute from the machine running the job.
$ENV(variable) gives us the environment variable 'variable' from the submitting machine (it is expanded at submit time).
For more on macros see Condor's manual, condor_submit section.
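A small sketch combining these macros (the log path and argument are made up for illustration):

```
Executable = example
# $ENV() is expanded from the submitter's environment at submit time
Log        = $ENV(HOME)/logs/test.$(cluster).$(process).log
# $$() is resolved from the matched machine's ClassAd at run time
arguments  = --arch=$$(Arch)
Queue
```

This pattern is handy when the job needs to know something about the (not yet known) machine it will land on, such as its architecture.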
Standard universe
The Standard universe provides checkpointing and remote system calls.
Remote system calls: all system calls made by the job running in Condor are made on the submitting computer.
Checkpointing: saves a snapshot of the current state of the running job, so the job can be restarted from the saved state in case of:
Migration to another computer
Machine crash or failure
Standard universe
In order to execute a program in the Standard universe it must be relinked with Condor's library.
To do so use condor_compile with your usual link command. Example: condor_compile gcc example.c
To manually cause a checkpoint use condor_checkpoint hostname
There are some restrictions on jobs running in the standard universe:
Standard universe - restrictions
Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().
Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.
Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.
Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().
Standard universe - restrictions
Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().
File locks are allowed, but not retained between checkpoints. All files must be opened read-only or write-only. A file opened
for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.
Your job must be statically linked (On Digital Unix (OSF/1), HP-UX, and Linux, and therefore on our school).
Reading from or writing to files larger than 2 GB is not supported.
Java universe
Used to run Java programs. Example submit description file:

universe   = java
executable = Example.class
arguments  = Example
output     = Example.output
error      = Example.error
queue
Notice that the first argument is the main class of the job. The JVM must be informed when submitting jar files; this is done in the following way: jar_files = example.jar
To run on a machine with a specific Java version: Requirements = (JavaVersion == "1.5.0_01")
Options to the Java VM itself can be set in the submit description file: java_vm_args = -DMyProperty=Value -verbose:gc ... These options go after the java command but before the main class (usage: java [options] class [args...]). Do not use this to set the classpath (Condor handles that itself).
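Putting these pieces together, a sketch of a Java-universe submit file for a jar-based job (the class, jar, and property names are illustrative):

```
universe     = java
executable   = Example.class
# first argument = main class
arguments    = Example
jar_files    = example.jar
java_vm_args = -DMyProperty=Value
Requirements = (JavaVersion == "1.5.0_01")
output       = Example.output
error        = Example.error
queue
```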
Matlab Functions
Matlab functions/scripts are written in .m files.
Structure:

function {ret_var =} func_name(arg1, arg2, ...)
...
Running Matlab functions in condor
First method: Calling matlab What we want to do is run:
matlab -nodisplay -nojvm -nosplash -r 'func(arg1, arg2, ...)'
Instead of transferring the Matlab executable we'll write a script (run.csh):

#!/bin/csh -f
matlab -nodisplay -nojvm -nosplash -r "$*"
Running Matlab functions in condor
First method: Calling matlab The submission file:
executable = run.csh
log        = mat.log
error      = mat.error
output     = mat.output
universe   = vanilla
getenv     = True
arguments  = func(arg1, arg2, ...)
queue 1

Notice that in order to run Matlab we must set getenv = True.
Running Matlab functions in condor
Second method: Compiling the function
First, we compile our Matlab script, example.m, into an executable:
mcc -mv example.m
The -v option is not mandatory; it is used to show details of the compilation process.
The files required for running will be "example" and example.ctf.
The compiled function requires Matlab's shared libraries in order to run.
So, we'll send Condor a script which defines the necessary environment variables and then runs the executable.
Running Matlab functions in condor
Second method: Compiling the function The script:
#!/bin/tcsh
setenv LD_LIBRARY_PATH /usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/bin/glnx86:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/os/glnx86:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/java/jre/glnx86/jre1.4.2/lib/i386/client:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/java/jre/glnx86/jre1.4.2/lib/i386:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/opengl/lib/glnx86:
setenv XAPPLRESDIR /usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/X11/app-defaults
setenv LD_PRELOAD /lib/libgcc_s.so.1
./example $1 $2
ClassAds
ClassAds are a flexible mechanism for representing the characteristics and constraints of machines and jobs in the Condor system
Condor acts as a matchmaker for ClassAds. ClassAds are analogous to the classified advertising section
in a newspaper. All machines running Condor advertise their attributes. A
machine also advertises under what conditions it is willing to run a job, and what type of job it would prefer.
When submitting a job, you specify your requirements and preferences. These attributes are bundled up into a job ClassAd.
ClassAds
ClassAd expressions are formed by composing literals, attribute references and other sub-expressions with operators and functions.
Literals may be:
Integers (including TRUE = 1 and FALSE = 0)
Reals
Strings: a list of characters between two double-quote characters. Use \ to include the following character in the string, irrespective of what that character is.
The UNDEFINED keyword (case insensitive)
The ERROR keyword (case insensitive)
ClassAds
Attributes
A pair (name, expression) is called an attribute. The attribute name is case insensitive.
An optional scope-resolution prefix may be added: "MY." or "TARGET."
MY. refers to an attribute defined in the current ClassAd.
TARGET. refers to an attribute defined in the ClassAd against which the current ClassAd is evaluated.
If no scope prefix is given, first try "MY."; if not found, try "TARGET."; if not found, try the ClassAd environment; if still not found, the value is UNDEFINED.
If there is a circular dependency between two ClassAds (e.g. A uses B and B uses A) then the value is ERROR.
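As a small sketch of scope resolution (the threshold values are made up), a job's requirements expression is evaluated against a machine's ClassAd, so TARGET refers to the machine and MY to the job itself:

```
Requirements = (TARGET.Memory >= 64) && (TARGET.Disk >= MY.DiskUsage)
```

Here Memory and Disk are read from the machine's ad, while DiskUsage comes from the job's own ad; without the prefixes, Condor would resolve each name by trying MY first and then TARGET.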
ClassAds
Operators
The operators are similar to the C language, and operator precedence is similar to C as well. All operators are case insensitive for strings, with the following exceptions:
=?=: the "is identical to" operator (similar to ==)
=!=: the "is not identical to" operator (similar to !=)
ClassAds
Predefined functions
Examples:
Integer strcmp(AnyType Expr1, AnyType Expr2)
String strcat(AnyType Expr1 [, AnyType Expr2 ...])
Boolean isInteger(AnyType Expr)
Function names are case insensitive.
For a full list of the functions refer to the user manual, section 4.1.1.4.
ClassAds
When submitting a job, one gives requirements; only machines satisfying them may run the job.
One can also rank the machines available to run the job, so that the highest-ranked machine is chosen to run it.
This can be done using the Requirements and Rank commands in the submission file.
ClassAds submission commands
Requirements = <ClassAd Boolean Expression>
The job will run on a machine only if the requirements expression evaluates to TRUE on that machine.
Example: requirements = Memory >= 64 && Arch == "INTEL"
The running machine must have at least 64 MB of RAM and the INTEL architecture.
The computers in our school have two possible architecture names: "INTEL" if it's a 32-bit computer or "X86_64" if it's a 64-bit computer.
ClassAds submission commands
By default Condor adds the following to the requirements of a job:
Arch and OpSys the same as the submitting computer's.
Disk >= DiskUsage. The DiskUsage attribute is initialized to the size of the executable plus the size of any files specified in a transfer_input_files command.
(Memory * 1024) >= ImageSize. To ensure the target machine has enough memory to run your job.
If Universe is set to Vanilla, FileSystemDomain is set equal to the submit machine's FileSystemDomain.
In order to see a submitted job's requirements (along with everything else about the job) use condor_q -l.
ClassAds submission commands
rank = <ClassAd Float Expression>
Sorts all matching machines by the given expression; Condor will give the job the machine with the highest rank.
The expression is a numeric expression (boolean sub-expressions evaluate to 1.0 or 0.0).
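As a sketch, a submit file could require a minimum amount of memory and then, among the matching machines, prefer the one with the most memory (the 64 MB threshold is just an example value):

```
requirements = Memory >= 64
rank         = Memory
```

Since boolean sub-expressions evaluate to 1.0 or 0.0, one could also mix preferences, e.g. rank = Memory + (Arch == "X86_64") * 1000 to strongly prefer 64-bit machines.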
DagMan
Use a directed acyclic graph (DAG) to represent a set of jobs to be run in a certain order.
A basic DAG submit file:

JOB name1 submit_file1
JOB name2 submit_file2
...

If "DONE" is specified at the end of a JOB line, then that job is considered complete and is not submitted.
DagMan
Additional DAG commands:
SCRIPT: sets processing to be done before/after running the job. These "scripts" run on the submitting machine.
SCRIPT PRE job_name executable [arguments]
Runs the executable before job_name is submitted.
SCRIPT POST job_name executable [arguments]
Runs the executable after job_name has completed its execution under Condor.
DagMan
Additional DAG commands:
PARENT ... CHILD: used to describe the dependencies between the jobs.
PARENT p1 p2 ... CHILD c1 c2 ...
Makes all pi's parents of all ci's (i.e. the ci's will be submitted only after all pi's have completed their execution).
RETRY:
RETRY jobName NumOfRetries [UNLESS-EXIT value]
If the job fails, it runs again at most NumOfRetries times. If UNLESS-EXIT is specified and the value returned equals "value", then no further retries will be attempted.
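A sketch tying these commands together (the submit-file and script names are made up):

```
JOB    A a.submit
JOB    B b.submit
PARENT A CHILD B
# runs on the submit machine before A is submitted / after B completes
SCRIPT PRE  A setup.sh
SCRIPT POST B cleanup.sh
# retry B up to 3 times, but give up immediately if it exits with 1
RETRY  B 3 UNLESS-EXIT 1
```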
DagMan
Additional DAG commands:
VARS: defines macros that can be used in the submit description file of a job.
VARS jobName macroname = "string" [macroname2 = "string" ...]
ABORT-DAG-ON: aborts the entire DAG if a specific node returns a specific value. Stops all nodes within the DAG immediately, including nodes currently running.
ABORT-DAG-ON JobName AbortExitValue [RETURN DAGReturnValue]
By default the return value of the DAG is the value returned by the aborted node. If RETURN is specified, then the return value is DAGReturnValue.
DagMan
Example DAG file:

JOB A a.submit
JOB B b.submit
JOB C a.submit
PARENT A CHILD B C
RETRY C 3
ABORT-DAG-ON A 2
Submission of DAG’s is done with: condor_submit_dag file.dag
In order to specify the max number of jobs submitted by the DagMan add the argument: -maxjobs numOfJobs
If any node in a DAG fails, the DagMan continues to run the remainder of the nodes until no more forward progress can be made. Then it creates a rescue file (input_file.rescue) in which, for each node that completed its execution, the corresponding JOB line ends with DONE. Submitting this file continues the DAG execution.
DagMan
It is possible to create a visualization of the DAG:
Add a line to the DAG file with: DOT dot_file.dot
Submit the DAG
dot -Tps dot_file.dot -o dag.ps
A DAG inside a DAG:
Suppose you want to include inner.dag in outer.dag:
Execute condor_submit_dag -no_submit inner.dag
Include the following JOB line in outer.dag:
JOB jobName inner.dag.condor.sub
inner.dag.condor.sub is the submission file for inner.dag.