Using Condor: An Introduction
ICE 2008
http://www.cs.wisc.edu/condor
The Condor Project (Established ‘85)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students.
Definitions
› Job
  • The Condor representation of your work
› Machine
  • The Condor representation of computers that can perform the work
› Match Making
  • Matching a job with a machine “Resource”
Job
Jobs state their requirements and preferences:
› I need a Linux/x86 platform
› I need a machine with at least 500 MB of memory
› I prefer a machine with more memory
Machine
Machines state their requirements and preferences:
› Run jobs only when there is no keyboard activity
› I prefer to run Frieda’s jobs
› I am a machine in the econ department
› Never run jobs belonging to Dr. Smith
The Magic of Matchmaking
› Jobs and machines state their requirements and preferences
› Condor matches jobs with machines based on requirements and preferences
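The matchmaking idea can be sketched in a few lines of Python. This is only an illustration of "requirements filter, preferences rank", not Condor's ClassAd language or its negotiator. The `Machine` and `Job` classes, their field names, and the sample pool are all made up for the example, and the machine-side preference ("I prefer to run Frieda's jobs") is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Machine:
    # Hypothetical attributes; real machine ads carry many more.
    name: str
    opsys: str
    arch: str
    memory_mb: int
    keyboard_idle: bool                      # True when nobody is typing
    banned_owners: frozenset = frozenset()   # "Never run jobs belonging to ..."


@dataclass
class Job:
    owner: str
    requirements: Callable[[Machine], bool]  # must be True for a match
    rank: Callable[[Machine], float]         # higher value = more preferred


def machine_accepts(machine: Machine, job: Job) -> bool:
    """The machine's own requirements: idle keyboard, owner not banned."""
    return machine.keyboard_idle and job.owner not in machine.banned_owners


def match(job: Job, machines: list) -> Optional[Machine]:
    """Return the mutually acceptable machine the job ranks highest, or None."""
    candidates = [m for m in machines
                  if job.requirements(m) and machine_accepts(m, job)]
    return max(candidates, key=job.rank, default=None)


machines = [
    Machine("small", "LINUX", "INTEL", 256, True),
    Machine("medium", "LINUX", "INTEL", 1024, True),
    Machine("large", "LINUX", "INTEL", 4096, False),  # someone is typing
    Machine("picky", "LINUX", "INTEL", 8192, True, frozenset({"smith"})),
]


def needs_linux_x86_500mb(m: Machine) -> bool:
    return m.opsys == "LINUX" and m.arch == "INTEL" and m.memory_mb >= 500


# Both jobs need Linux/x86 with at least 500 MB and prefer more memory.
frieda_job = Job("frieda", needs_linux_x86_500mb, lambda m: m.memory_mb)
smith_job = Job("smith", needs_linux_x86_500mb, lambda m: m.memory_mb)
```

With this pool, `match(frieda_job, machines)` picks "picky" (the most memory among the machines that accept her), while Dr. Smith's job falls back to "medium" because "picky" refuses it and "large" is busy.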
Getting Started: Submitting Jobs to Condor
› Overview:
  • Choose a “Universe” for your job
  • Make your job “batch-ready”
  • Create a submit description file
  • Run condor_submit to put your job in the queue
1. Choose the “Universe”
› Controls how Condor handles jobs
› Choices include:
  • Vanilla
  • Standard
  • Grid
  • Java
  • Parallel
  • VM
Using the Vanilla Universe
• The Vanilla Universe:
  – Allows running almost any “serial” job
  – Provides automatic file transfer, etc.
  – Like vanilla ice cream
    • Can be used in just about any situation
2. Make your job batch-ready
Must be able to run in the background
• No interactive input
• No GUI/window clicks
• No music ;^)
Make your job batch-ready (continued)…
Job can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
Similar to UNIX shell:
    $ ./myprogram <input.txt >output.txt
3. Create a Submit Description File
› A plain ASCII text file
› Condor does not care about file extensions
› Tells Condor about your job:
  • Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)
› Can describe many jobs at once (a “cluster”), each with different input, arguments, output, etc.
Simple Submit Description File
    # Simple condor_submit input file
    # (Lines beginning with # are comments)
    # NOTE: the words on the left side are not
    #       case sensitive, but filenames are!
    Universe   = vanilla
    Executable = my_job
    Output     = output.txt
    Queue
4. Run condor_submit
› You give condor_submit the name of the submit file you have created:
    condor_submit my_job.submit
› condor_submit:
  • Parses the submit file, checks for errors
  • Creates a “ClassAd” that describes your job(s)
  • Puts job(s) in the Job Queue
The Job Queue
› condor_submit sends your job’s ClassAd(s) to the schedd
› The schedd (more details later):
  • Manages the local job queue
  • Stores the job in the job queue
    • Atomic operation, two-phase commit
    • “Like money in the bank”
› View the queue with condor_q
Example condor_submit and condor_q
    % condor_submit my_job.submit
    Submitting job(s).
    1 job(s) submitted to cluster 1.
    % condor_q
    -- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
     ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
     1.0     frieda    6/16 06:52  0+00:00:00 I  0   0.0  my_job
    1 jobs; 1 idle, 0 running, 0 held
    %
Input, output & error files
› Controlled by submit file settings
› You can define the job’s standard input, standard output and standard error:
  • Read job’s standard input from “input_file”:
    • Input = input_file
    • Shell equivalent: program <input_file
  • Write job’s standard output to “output_file”:
    • Output = output_file
    • Shell equivalent: program >output_file
  • Write job’s standard error to “error_file”:
    • Error = error_file
    • Shell equivalent: program 2>error_file
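The three shell equivalents above amount to running the program with its standard streams attached to files. A minimal Python sketch of that idea follows, using the standard `subprocess` module. It shows only the redirection, not how Condor launches jobs, and the `program` used here is a hypothetical stand-in that upper-cases its input.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_files(command, input_file, output_file, error_file):
    """Run `command` like `program <input_file >output_file 2>error_file`."""
    with open(input_file, "rb") as stdin, \
         open(output_file, "wb") as stdout, \
         open(error_file, "wb") as stderr:
        result = subprocess.run(command, stdin=stdin, stdout=stdout, stderr=stderr)
    return result.returncode


# Hypothetical stand-in for "program": copies stdin to stdout in upper case
# and writes a note to stderr. It never touches the keyboard or the screen.
program = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper()); "
    "sys.stderr.write('done\\n')",
]

workdir = Path(tempfile.mkdtemp())
(workdir / "input_file").write_text("hello condor\n")
exit_code = run_with_files(program,
                           workdir / "input_file",
                           workdir / "output_file",
                           workdir / "error_file")
```

Afterwards `output_file` holds the program's standard output and `error_file` its standard error, which is what the Input, Output and Error settings arrange for a batch job.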
Email about your job
• Condor sends email about job events to the submitting user
• Specify “notification” in your submit file to control which events:
    Notification = complete    ·Default
    Notification = never
    Notification = error
    Notification = always
Feedback on your job
› Create a log of job events
› Add to submit description file:
    log = sim.log
› Becomes the Life Story of a Job
  • Shows all events in the life of a job
  • Always have a log file
Sample Condor User Log
000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
(1) Normal termination (return value 0)
...
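Because the user log is plain text, the "life story" is easy to pull apart with a script. The sketch below parses only the event header lines in the layout shown in the sample above (event number, job ID in parentheses, date, time, message). The regular expression and the `LogEvent` type are assumptions made for this example, not an official log parser.

```python
import re
from typing import NamedTuple


class LogEvent(NamedTuple):
    code: int        # event number: 0, 1 and 5 appear in the sample above
    cluster: int
    process: int
    timestamp: str   # "MM/DD HH:MM:SS" exactly as written in the log
    message: str


# Matches header lines such as:
# 001 (0001.000.000) 05/25 19:12:17 Job executing on host: <...>
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_user_log(text):
    """Return one LogEvent per header line; detail and '...' lines are skipped."""
    events = []
    for line in text.splitlines():
        found = EVENT_HEADER.match(line.strip())
        if found:
            code, cluster, process, timestamp, message = found.groups()
            events.append(
                LogEvent(int(code), int(cluster), int(process), timestamp, message)
            )
    return events


SAMPLE_LOG = """\
000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
    (1) Normal termination (return value 0)
...
"""

events = parse_user_log(SAMPLE_LOG)
```

Here `events` holds three entries for job 1.0: submitted, executing, and terminated.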
Example Submit Description File With Logging
    # Example condor_submit input file
    # (Lines beginning with # are comments)
    # NOTE: the words on the left side are not
    #       case sensitive, but filenames are!
    Universe   = vanilla
    Executable = /home/frieda/condor/my_job.condor
    # Job log (from Condor)
    Log        = my_job.log
    # Program’s standard input
    Input      = my_job.in
    # Program’s standard output
    Output     = my_job.out
    # Program’s standard error
    Error      = my_job.err
    # Command line arguments
    Arguments  = -a1 -a2
    InitialDir = /home/frieda/condor/run
    Queue
Let’s run a job
› First, you need a terminal emulator
  • http://www.putty.org (or similar)
› Log in to chopin.cs.wisc.edu as cguserXX, with the given password
› source /scratch/ice08
Logged In?
› condor_q
› condor_status
Create submit file
› nano submit
    universe = vanilla
    executable = /bin/echo
    Arguments = hello world
    Should_transfer_files = always
    When_to_transfer_output = on_exit
    Output = out
    Log = log
    queue
And submit it…
› condor_submit submit
› (wait… remember the HTC bit?)
› condor_q xx
› cat out
“Clusters” and “Processes”
› If your submit file describes multiple jobs, we call this a “cluster”
› Each cluster has a unique “cluster number”
› Each job in a cluster is called a “process”
  • Process numbers always start at zero
› A Condor “Job ID” is the cluster number, a period, and the process number (e.g. 2.1)
  • A cluster can have a single process
    • Job ID = 20.0 ·Cluster 20, process 0
  • Or, a cluster can have more than one process
    • Job ID: 21.0, 21.1, 21.2 ·Cluster 21, process 0, 1, 2
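As a small illustration of the naming scheme, the following sketch splits a Job ID into its cluster and process numbers and lists the IDs in a cluster. `JobId`, `parse_job_id` and `cluster_job_ids` are hypothetical helper names invented here. They are not part of any Condor tool.

```python
from typing import NamedTuple


class JobId(NamedTuple):
    cluster: int
    process: int

    def __str__(self):
        # Cluster number, a period, and the process number.
        return f"{self.cluster}.{self.process}"


def parse_job_id(text):
    """Split a Job ID such as '2.1' into cluster 2, process 1."""
    cluster, _, process = text.partition(".")
    return JobId(int(cluster), int(process))


def cluster_job_ids(cluster, count):
    """Job IDs for a cluster of `count` processes; numbering starts at zero."""
    return [JobId(cluster, process) for process in range(count)]
```

For example, `cluster_job_ids(21, 3)` yields the IDs 21.0, 21.1 and 21.2 from the slide.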
Submit File for a Cluster
    # Example submit file for a cluster of 2 jobs
    # with separate input, output, error and log files
    Universe   = vanilla
    Executable = my_job
    # Job 2.0 (cluster 2, process 0)
    Arguments  = -x 0
    log        = my_job_0.log
    Input      = my_job_0.in
    Output     = my_job_0.out
    Error      = my_job_0.err
    Queue
    # Job 2.1 (cluster 2, process 1)
    Arguments  = -x 1
    log        = my_job_1.log
    Input      = my_job_1.in
    Output     = my_job_1.out
    Error      = my_job_1.err
    Queue
Submitting The Job
    % condor_submit my_job.submit-file
    Submitting job(s).
    2 job(s) submitted to cluster 2.
    % condor_q
    -- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
     ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
     1.0     frieda    4/15 06:52  0+00:02:11 R  0   0.0  my_job -a1 -a2
     2.0     frieda    4/15 06:56  0+00:00:00 I  0   0.0  my_job -x 0
     2.1     frieda    4/15 06:56  0+00:00:00 I  0   0.0  my_job -x 1
    3 jobs; 2 idle, 1 running, 0 held
    %
Organize your files and directories for big runs
› Create subdirectories for each “run”
  • run_0, run_1, … run_599
› Create input files in each of these
  • run_0/simulation.in
  • run_1/simulation.in
  • …
  • run_599/simulation.in
› The output, error & log files for each job will be created by Condor from your job’s output
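Setting up 600 directories by hand is tedious, so a short script helps. The sketch below creates run_0 … run_599 with a simulation.in in each. The contents of the input file (a single "seed" line) are purely hypothetical, since the slides do not say what the simulation reads, and the directories are created under a temporary location for the example.

```python
import tempfile
from pathlib import Path


def create_run_directories(base_dir, n_runs, make_input):
    """Create base_dir/run_<i>/simulation.in for i in 0 .. n_runs-1."""
    base = Path(base_dir)
    created = []
    for i in range(n_runs):
        run_dir = base / f"run_{i}"
        run_dir.mkdir(parents=True, exist_ok=True)
        (run_dir / "simulation.in").write_text(make_input(i))
        created.append(run_dir)
    return created


# Hypothetical input format: one "seed" line per run.
base = Path(tempfile.mkdtemp())
runs = create_run_directories(base, 600, lambda i: f"seed = {i}\n")
```

In a real run you would pass your own working directory as `base_dir` and generate whatever input each run needs.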
Submit Description File for 600 Jobs
    # Cluster of 600 jobs with different directories
    Universe   = vanilla
    Executable = sim
    Log        = simulation.log
    ...
    # Job 3.0 (Cluster 3, Process 0)
    # Log, input, output & error files -> run_0
    Arguments  = -x 0
    InitialDir = run_0
    Queue
    # Job 3.1 (Cluster 3, Process 1)
    # Log, input, output & error files -> run_1
    Arguments  = -x 1
    InitialDir = run_1
    Queue
    # Do this 598 more times…
Submit File for a Big Cluster of Jobs
› We just submitted 1 cluster with 600 processes
› All the input/output files will be in different directories
› The submit file is pretty unwieldy (over 1200 lines)
› Isn’t there a better way?
Submit File for a Big Cluster of Jobs (the better way) #1
› We can queue all 600 in 1 “Queue” command
  • Queue 600
› Condor provides $(Process) and $(Cluster)
  • $(Process) will be expanded to the process number for each job in the cluster
    • 0, 1, … 599
  • $(Cluster) will be expanded to the cluster number
    • Will be 4 for all jobs in this cluster
Submit File for a Big Cluster of Jobs (the better way) #2
› The initial directory for each job can be specified using $(Process)
  • InitialDir = run_$(Process)
  • Condor will expand these to “run_0”, “run_1”, … “run_599” directories
› Similarly, arguments can be variable
  • Arguments = -x $(Process)
  • Condor will expand these to “-x 0”, “-x 1”, … “-x 599”
Better Submit File for 600 Jobs
    # Example condor_submit input file that defines
    # a cluster of 600 jobs with different directories
    Universe   = vanilla
    Executable = my_job
    Log        = my_job.log
    Input      = my_job.in
    Output     = my_job.out
    Error      = my_job.err
    # -x 0, -x 1, … -x 599
    Arguments  = -x $(Process)
    # run_0 … run_599
    InitialDir = run_$(Process)
    # Jobs 4.0 … 4.599
    Queue 600
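To see what the expansion produces for each of the 600 jobs, here is a rough Python imitation. It handles only $(Process) and $(Cluster) by plain text substitution. The real condor_submit macro facility does more than this, and the `expand_macros` and `expand_queue` helpers are invented for the illustration.

```python
import re

# Macro names are matched case-insensitively here, so $(PROCESS)
# and $(Process) expand the same way.
MACRO = re.compile(r"\$\((cluster|process)\)", re.IGNORECASE)


def expand_macros(value, cluster, process):
    """Substitute $(Cluster) and $(Process) in one submit-file value."""
    numbers = {"cluster": str(cluster), "process": str(process)}
    return MACRO.sub(lambda m: numbers[m.group(1).lower()], value)


def expand_queue(settings, cluster, count):
    """Return one dict of expanded settings per job, as for 'Queue <count>'."""
    return [
        {key: expand_macros(value, cluster, process)
         for key, value in settings.items()}
        for process in range(count)
    ]


settings = {
    "Arguments": "-x $(Process)",
    "InitialDir": "run_$(Process)",
}
jobs = expand_queue(settings, cluster=4, count=600)
```

`jobs[0]` then carries "-x 0" and "run_0", and `jobs[599]` carries "-x 599" and "run_599", matching jobs 4.0 through 4.599.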
Now, we submit it…
    $ condor_submit my_job.submit
    Submitting job(s) ..............................
    Logging submit event(s) ..............................
    600 job(s) submitted to cluster 4.
And, Check the queue
    $ condor_q
    -- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.edu
     ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
     4.0     frieda    4/20 12:08  0+00:00:05 R  0   9.8  my_job -arg1 -x 0
     4.1     frieda    4/20 12:08  0+00:00:03 I  0   9.8  my_job -arg1 -x 1
     4.2     frieda    4/20 12:08  0+00:00:01 I  0   9.8  my_job -arg1 -x 2
     4.3     frieda    4/20 12:08  0+00:00:00 I  0   9.8  my_job -arg1 -x 3
    ...
     4.598   frieda    4/20 12:08  0+00:00:00 I  0   9.8  my_job -arg1 -x 598
     4.599   frieda    4/20 12:08  0+00:00:00 I  0   9.8  my_job -arg1 -x 599
    600 jobs; 599 idle, 1 running, 0 held
Removing jobs
› If you want to remove a job from the Condor queue, you use condor_rm
› You can only remove jobs that you own
› Privileged user can remove any jobs
  • “root” on UNIX
  • “administrator” on Windows
Removing jobs (continued)
› Remove an entire cluster:
    condor_rm 4      ·Removes the whole cluster
› Remove a specific job from a cluster:
    condor_rm 4.0    ·Removes a single job
› Or, remove all of your jobs with “-a”
    condor_rm -a     ·Removes all jobs / clusters
Submit cluster of 10 jobs
› nano submit
    universe = vanilla
    executable = /bin/echo
    Arguments = hello world $(PROCESS)
    Should_transfer_files = always
    When_to_transfer_output = on_exit
    Output = out.$(PROCESS)
    Log = log
    Queue 10
And submit it…
› condor_submit submit
› (wait…)
› condor_q xx
› cat log
› cat out.yy
My new jobs run for 20 days…
› What happens when a job is forced off its CPU?
  • Preempted by higher priority user or job
  • Vacated because of user activity
› How can I add fault tolerance to my jobs?
Condor’s Standard Universe to the rescue!
› Support for transparent process checkpoint and restart
› Remote system calls (remote I/O)
  • Your job can read / write files as if they were local
Remote System Calls in the Standard Universe
› I/O system calls are trapped and sent back to the submit machine
  • Examples: open a file, write to a file
› No source code changes typically required
› Programming language independent
Process Checkpointing in the Standard Universe
› Condor’s process checkpointing provides a mechanism to automatically save the state of a job
› The process can then be restarted from right where it was checkpointed
  • After preemption, crash, etc.
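The save-and-resume idea can be shown with a toy program that checkpoints itself. Note the difference: the Standard Universe does this transparently for the whole process, whereas the sketch below saves its own state by hand with `pickle`. The function name, the state layout and the `stop_after` "kill switch" are all assumptions made for the illustration.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, checkpoint_every,
                         stop_after=None):
    """Sum 0 .. total_steps-1, saving progress every `checkpoint_every` steps.

    `stop_after` simulates the job being killed after that many steps in
    this run; the function then returns None instead of a result.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    steps_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # killed: work since the last checkpoint is lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(state, f)
    return state["total"]


checkpoint_file = os.path.join(tempfile.mkdtemp(), "job.ckpt")
# First attempt is "killed" after 250 steps; checkpoints exist at 100 and 200.
first_try = run_with_checkpoints(1000, checkpoint_file, 100, stop_after=250)
# Second attempt resumes from step 200 rather than from zero.
second_try = run_with_checkpoints(1000, checkpoint_file, 100)
```

The second attempt still produces the full sum, and only the 50 steps done after the last checkpoint have to be repeated.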
Checkpointing: Process Starts
checkpoint: the entire state of a program, saved in a file
  • CPU registers, memory image, I/O
[Figure: process timeline (axis: time)]
Checkpointing: Process Checkpointed
[Figure: process timeline (axis: time) with checkpoints 1, 2 and 3]
Checkpointing: Process Killed
[Figure: process timeline (axis: time); the process is killed after checkpoint 3]
Checkpointing: Process Resumed
[Figure: process timeline (axis: time); the process resumes from checkpoint 3; segments labeled goodput, badput, goodput]
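One way to read the goodput/badput labels: work done up to the last checkpoint survives the kill, and work done after it has to be repeated. The sketch below computes that split. The function, its time units and the sample numbers are assumptions for illustration, not measurements from Condor.

```python
def goodput_and_badput(checkpoint_times, kill_time):
    """Split the run time before a kill into (goodput, badput).

    Goodput is the work preserved by the last checkpoint taken at or before
    `kill_time`; badput is the work done after it, which must be redone.
    With no usable checkpoint, everything is badput.
    """
    saved = max((t for t in checkpoint_times if t <= kill_time), default=0)
    return saved, kill_time - saved


# Hypothetical job: checkpoints at minutes 10, 20 and 30, killed at minute 37.
good, bad = goodput_and_badput(checkpoint_times=[10, 20, 30], kill_time=37)
```

Here 30 minutes of work are kept and 7 minutes are lost, which is why more frequent checkpoints reduce badput at the cost of more checkpoint overhead.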
When will Condor checkpoint your job?
› Periodically, if desired
  • For fault tolerance
› When your job is preempted by a higher priority job
› When your job is vacated because the execution machine becomes busy
› When you explicitly run the condor_checkpoint, condor_vacate, condor_off or condor_restart command
Making the Standard Universe Work
› The job must be relinked with Condor’s standard universe support library
› To relink, place condor_compile in front of the command used to link the job:
% condor_compile gcc -o myjob myjob.c
- OR -
% condor_compile f77 -o myjob filea.f fileb.f
- OR -
% condor_compile make -f MyMakefile
Limitations of the Standard Universe
› Condor’s checkpointing is not at the kernel level. In the Standard Universe, the job may not:
  • Call fork()
  • Use kernel threads
  • Use some forms of IPC, such as pipes and shared memory
› Must have access to source code to relink
› Many typical scientific jobs are OK
Submitting a Standard Universe job
    #include <stdio.h>

    int main(int argc, char **argv) {
        int i;
        /* Busy loop so the job runs long enough to watch */
        for (i = 0; i < 10000000; i++) {}
        return 0;
    }
And submit…
› condor_compile gcc -o foo foo.c
› condor_submit
My jobs have dependencies…
Can Condor help solve my dependency problems?
Condor Universes: Scheduler and Local
› Scheduler Universe
  • Plug in a meta-scheduler
  • Developed for DAGMan (more later)
  • Similar to Globus’s fork job manager
› Local
  • Very similar to vanilla, but jobs run on the local host
  • Has more control over jobs than scheduler universe
Frieda learns DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., “Don’t run job “B” until job “A” has completed successfully.”)
What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a “node” in the DAG.
› Each node can have any number of “parent” or “children” nodes – as long as there are no loops!
[Figure: example DAG with nodes Job A, Job B, Job C and Job D]
Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
› Each node will run the Condor job specified by its accompanying Condor submit file
[Figure: diamond-shaped DAG; Job A at the top, Jobs B and C in the middle, Job D at the bottom]
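To make the dependency rules concrete, the sketch below reads the diamond.dag text and works out an order in which the nodes could run. It understands only the Job and Parent … Child lines shown above, and it uses Python's standard `graphlib` module (Python 3.9 or later) for the ordering. This is an illustration of the DAG concept, not how DAGMan itself is implemented.

```python
from graphlib import CycleError, TopologicalSorter


def parse_dag(text):
    """Parse 'Job' and 'Parent ... Child ...' lines.

    Returns (submit_files, parents): node -> submit file, and
    node -> set of parent nodes.
    """
    submit_files = {}
    parents = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            name, submit_file = words[1], words[2]
            submit_files[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return submit_files, parents


def execution_waves(parents):
    """Group nodes into waves; a node's parents are always in earlier waves."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()  # raises CycleError if the graph has a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

submit_files, parents = parse_dag(DIAMOND)
waves = execution_waves(parents)
```

For the diamond, `waves` comes out as A first, then B and C together, then D. A graph with a loop is rejected with `CycleError`, which is the "no loops" rule from the previous slide.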
Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
    % condor_submit_dag diamond.dag
› The DAGMan daemon is run by the schedd
  • The DAGMan daemon itself is “watched” by Condor, so you don’t have to
Running a DAG
› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAGMan, the .dag file, and the Condor job queue, with jobs A, B, C and D]
Running a DAG (cont’d)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Figure: DAGMan and the Condor job queue, with jobs A, B, C and D]
Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Figure: DAGMan and the Condor job queue; one job is marked failed (X) and a rescue file is written]
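The "continue until no more progress, then remember the state" behavior can be sketched as follows. The function below is a simplified model: it is not DAGMan's algorithm and does not read or write the real rescue file format. The choice of node C as the failing job is hypothetical.

```python
def run_until_stuck(parents, failing, already_done=()):
    """Run every node whose parents all succeeded, until nothing else can run.

    parents:      node -> set of parent nodes
    failing:      nodes whose jobs fail when run
    already_done: nodes recorded as completed in an earlier attempt
    Returns (done, failed, not_run) as sets.
    """
    done, failed = set(already_done), set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:  # all parents completed successfully
                (failed if node in failing else done).add(node)
                progress = True
    not_run = set(parents) - done - failed
    return done, failed, not_run


diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First attempt: job C fails, so D can never start.
done, failed, not_run = run_until_stuck(diamond, failing={"C"})

# Later attempt, once C is fixed: the saved state lets A and B be skipped.
done_again, failed_again, not_run_again = run_until_stuck(
    diamond, failing=set(), already_done=done
)
```

After the first attempt A and B are done, C has failed and D has not run. The set of completed nodes plays the role of the rescue file here, so the second attempt only needs to run C and D.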
Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Figure: the rescue file, DAGMan, and the Condor job queue, with jobs A, B, C and D]
Recovering a DAG (cont’d)
› Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Figure: DAGMan and the Condor job queue, with jobs A, B, C and D]
Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Figure: DAGMan and the Condor job queue, with jobs A, B, C and D]
Additional DAGMan Features
› Provides other handy features for job management…
  • nodes can have PRE & POST scripts
  • failed nodes can be automatically retried a configurable number of times
  • job submission can be “throttled”
General User Commands
› condor_status        View Pool Status
› condor_q             View Job Queue
› condor_submit        Submit new Jobs
› condor_rm            Remove Jobs
› condor_prio          Intra-User Prios
› condor_history       Completed Job Info
› condor_submit_dag    Submit new DAG
› condor_checkpoint    Force a checkpoint
› condor_compile       Link Condor library
Thank you!
Check us out on the Web: http://www.condorproject.org
Email: [email protected]