Post on 25-Dec-2015
CONDOR DAGMan and Pegasus
Selim Kalayci, Florida International University
07/28/2009
Note: Slides are compiled from various TeraGrid documentation.
DAGMan
• Directed Acyclic Graph Manager
• DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
• (e.g., "Don't run job B until job A has completed successfully.")
What is a DAG?
• A DAG is the data structure used by DAGMan to represent these dependencies.
• Each job is a "node" in the DAG.
• Each node can have any number of "parent" or "child" nodes – as long as there are no loops!
[Diagram: diamond DAG – Job A is the parent of Jobs B and C, which are both parents of Job D]
Defining a DAG
• A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

   # diamond.dag
   Job A a.sub
   Job B b.sub
   Job C c.sub
   Job D d.sub
   Parent A Child B C
   Parent B C Child D

• Each node will run the Condor job specified by its accompanying Condor submit file.
[Diagram: diamond DAG – A is the parent of B and C, which are both parents of D]
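Each x.sub named in the .dag file is an ordinary Condor submit description. A minimal sketch of what a.sub might contain (the executable and file names here are illustrative, not from the original slides):

```
# a.sub -- hypothetical submit file for node A
universe   = vanilla
executable = a.out
output     = a.output
error      = a.error
log        = diamond.log
queue
```

The other nodes (b.sub, c.sub, d.sub) would look similar; DAGMan watches the jobs' log events to learn when each node finishes.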
Submitting a DAG
• To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

   % condor_submit_dag diamond.dag

• condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable.
• Thus the DAGMan daemon itself runs as a Condor job, so you don't have to baby-sit it.
Running a DAG
• DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor-G based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file and submits job A to the Condor-G job queue; B, C, and D are held back]
Running a DAG (cont'd)
• DAGMan holds & submits jobs to the Condor-G queue at the appropriate times.
[Diagram: A has completed, so DAGMan submits B and C to the Condor-G job queue; D is still held]
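The scheduling rule above can be sketched in a few lines of Python. This is an illustration of the idea, not DAGMan's actual implementation: a node becomes eligible for submission once every parent has completed.

```python
# Sketch of the meta-scheduler rule: a node may be submitted once all of
# its parents have completed (illustrative, not DAGMan's real code).

def ready_nodes(parents, done, in_queue):
    """Return nodes eligible for submission: not yet finished, not already
    queued, and with every parent already completed."""
    return sorted(
        node for node in parents
        if node not in done
        and node not in in_queue
        and all(p in done for p in parents[node])
    )

# The diamond DAG from the slides: A -> {B, C} -> D.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

print(ready_nodes(parents, done=set(), in_queue=set()))            # only A
print(ready_nodes(parents, done={"A"}, in_queue=set()))            # B and C
print(ready_nodes(parents, done={"A", "B"}, in_queue={"C"}))       # nothing yet
print(ready_nodes(parents, done={"A", "B", "C"}, in_queue=set()))  # D
```

Running the four queries reproduces the sequence shown in the diagrams: first A alone, then B and C together, then D once both of its parents are done.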
Running a DAG (cont'd)
• In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.
[Diagram: C fails; A and B are recorded as done in the rescue file, and D cannot run]
Recovering a DAG – fault tolerance
• Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Diagram: the rescue file marks A and B as done; DAGMan resubmits C to the Condor-G job queue]
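Conceptually, the rescue file is just a record of which nodes already completed. The exact format varies across Condor versions (older versions rewrite the whole DAG with DONE annotations; newer ones write a partial file), but for the failed diamond run it would record something like this hypothetical content:

```
# Rescue file written after node C failed
# (schematic only -- the real format depends on the Condor version)
DONE A
DONE B
```

On resubmission, DAGMan treats A and B as finished and starts directly with C.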
Recovering a DAG (cont'd)
• Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Diagram: C completes, so DAGMan submits D, the last remaining node]
Finishing a DAG
• Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Diagram: all four nodes A, B, C, and D have completed; the Condor-G queue is empty]
Additional DAGMan Features
• Provides other handy features for job management…
   – nodes can have PRE & POST scripts
   – failed nodes can be automatically re-tried a configurable number of times
   – job submission can be "throttled"
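The retry feature can be sketched as a simple loop. This is an illustration of the behavior, not DAGMan's implementation: a node that fails is re-run until it succeeds or its retry budget is exhausted.

```python
# Sketch of DAGMan-style node retry (illustrative only): a failed node is
# re-run up to a configurable number of extra times before the DAG gives
# up on it.

def run_with_retries(run, max_retries):
    """Run a node, retrying on failure; return (succeeded, attempts)."""
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if run():
            return True, attempts
    return False, attempts

# A node that fails twice, then succeeds on the third attempt.
outcomes = iter([False, False, True])
print(run_with_retries(lambda: next(outcomes), max_retries=3))  # (True, 3)
```

With `max_retries=1` the same node would be declared failed after two attempts, and DAGMan would then write the rescue file described earlier.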
HANDS-ON
• http://users.cs.fiu.edu/~skala001/DAGMan_Lab.htm
Ewa Deelman, deelman@isi.edu
www.isi.edu/~deelman, pegasus.isi.edu
Scientific Analysis Workflow Evolution
[Diagram: Construct the Analysis → Workflow Template → Select the Input Data → Abstract Workflow → Map the Workflow onto Available Resources → Concrete Workflow → Execute the Workflow → Tasks to be executed on Grid Resources]
Scientific Analysis Workflow Evolution – Execution Environment
[Diagram: the same workflow evolution annotated with its execution environment – a Library of Application Components supplies component characteristics to workflow construction, Data Catalogs supply data properties to input selection, and Information Services supply resource availability and characteristics to the mapping step; refinement ranges from user-guided to automated, and the concrete workflow's tasks execute on Grid Resources]
Pegasus: Planning for Execution in Grids
• Abstract Workflows – Pegasus input workflow description
   – workflow "high-level language"
   – only identifies the computations that a user wants to do
   – devoid of resource descriptions
   – devoid of data locations
• Pegasus (http://pegasus.isi.edu)
   – a workflow "compiler"
   – target language: DAGMan's DAG and Condor submit files
   – transforms the workflow for performance and reliability
   – automatically locates physical locations for both workflow components and data
   – finds appropriate resources to execute the components
   – provides runtime provenance
• DAGMan
   – a workflow executor
   – scalable and reliable execution of an executable workflow
Pegasus Workflow Management System
• A reliable, scalable workflow management system that an application or workflow composition service can depend on to get the job done
• Abstract Workflow – produced by a client tool with no special requirements on the infrastructure
• Pegasus mapper – a decision system that develops strategies for reliable and efficient execution in a variety of environments
• DAGMan – reliable and scalable execution of dependent tasks
• Condor Schedd – reliable, scalable execution of independent tasks (locally, across the network), priorities, scheduling
• Cyberinfrastructure: local machine, cluster, Condor pool, OSG, TeraGrid
Generating a Concrete Workflow
• Information
   – location of files and component instances
   – state of the Grid resources
• Select specific
   – resources
   – files
• Add jobs required to form a concrete workflow that can be executed in the Grid environment
   – data movement
   – data registration
• Each component in the abstract workflow is turned into an executable job
[Diagram: an abstract node "FFT filea" becomes a concrete chain – a data-transfer job (move filea from host1://home/filea to host2://home/file1), an executable job (/usr/local/bin/fft /home/file1), and a data-registration job]
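The expansion of one abstract node into transfer, execute, and register jobs can be sketched as follows. This is a hypothetical illustration of the idea, not Pegasus's real API; the function and field names are invented, and the replica catalog is a plain dictionary standing in for the RLS.

```python
# Illustrative sketch of abstract-to-concrete mapping: each abstract task
# gains a stage-in transfer for its inputs, the executable invocation, and
# a registration of its outputs (not Pegasus's actual implementation).

def concretize(task, replica_catalog, site):
    """Expand one abstract task into a list of concrete job descriptions."""
    jobs = []
    for logical in task["inputs"]:
        src = replica_catalog[logical]          # physical location from the RLS
        jobs.append(f"transfer {src} -> {site}:{logical}")
    args = " ".join(task["inputs"])
    jobs.append(f"execute {task['exe']} {args} at {site}")
    for out in task["outputs"]:
        jobs.append(f"register {site}:{out} in RLS")
    return jobs

# The FFT example from the slide, with an invented output file "fileb".
task = {"exe": "/usr/local/bin/fft", "inputs": ["filea"], "outputs": ["fileb"]}
rls = {"filea": "host1:/home/filea"}
for j in concretize(task, rls, "host2"):
    print(j)
```

The printed chain (transfer, execute, register) mirrors the concrete workflow in the diagram above.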
Information Components used by Pegasus
• Globus Monitoring and Discovery Service (MDS)
   – locates available resources
   – finds resource properties
      • dynamic: load, queue length
      • static: location of GridFTP server, RLS, etc.
• Globus Replica Location Service (RLS)
   – locates data that may be replicated
   – registers new data products
• Transformation Catalog
   – locates installed executables
Example Workflow Reduction
• Original abstract workflow: job d1 reads file a and produces file b; job d2 reads file b and produces file c.
• If "b" already exists (as determined by query to the RLS), the workflow can be reduced: d1 is pruned, and d2 runs directly from the existing b to produce c.
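The reduction step can be sketched with a simple prune rule. This is a hypothetical illustration, not Pegasus's algorithm: a job is dropped when all of its output files are already registered, so downstream jobs can start from the cached data.

```python
# Sketch of workflow reduction (illustrative only): drop any job whose
# outputs all exist already in the replica catalog.

def reduce_workflow(jobs, existing):
    """Return the jobs that still need to run."""
    return [j for j in jobs if not all(o in existing for o in j["outputs"])]

# The example from the slide: d1 produces b, d2 consumes b and produces c.
jobs = [
    {"name": "d1", "inputs": ["a"], "outputs": ["b"]},
    {"name": "d2", "inputs": ["b"], "outputs": ["c"]},
]
print([j["name"] for j in reduce_workflow(jobs, existing={"a"})])       # both run
print([j["name"] for j in reduce_workflow(jobs, existing={"a", "b"})])  # only d2
```

With b already registered, only d2 remains, matching the reduced workflow above.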
Mapping from abstract to concrete
• Query RLS, MDS, and TC; schedule computation and data movement:
   – move b from A to B
   – execute d2 at B
   – move c from B to U
   – register c in the RLS
Pegasus Research
• resource discovery and assessment
• resource selection
• resource provisioning
• workflow restructuring
   – tasks merged together or reordered to improve overall performance
• adaptive computing
   – workflow refinement adapts to the changing execution environment
Benefits of the workflow & Pegasus approach
• The workflow exposes
   – the structure of the application
   – the maximum parallelism of the application
• Pegasus can take advantage of the structure to
   – set a planning horizon (how far into the workflow to plan)
   – cluster a set of workflow nodes to be executed as one (for performance)
• Pegasus shields users from the Grid details
Benefits of the workflow & Pegasus approach (cont'd)
• Pegasus can run the workflow on a variety of resources
• Pegasus can run a single workflow across multiple resources
• Pegasus can opportunistically take advantage of available resources (through dynamic workflow mapping)
• Pegasus can take advantage of pre-existing intermediate data products
• Pegasus can improve the performance of the application