Post on 25-Dec-2015
CONDOR DAGMan and Pegasus
Selim Kalayci, Florida International University
07/28/2009
Note: Slides are compiled from various TeraGrid documentation.
DAGMan
• Directed Acyclic Graph Manager
• DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
• (e.g., "Don't run job B until job A has completed successfully.")
What is a DAG?
• A DAG is the data structure used by DAGMan to represent these dependencies.
• Each job is a "node" in the DAG.
• Each node can have any number of "parent" or "child" nodes – as long as there are no loops!
[Diagram: diamond DAG – Job A is the parent of Jobs B and C, which are both parents of Job D]
Defining a DAG
• A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

   # diamond.dag
   Job A a.sub
   Job B b.sub
   Job C c.sub
   Job D d.sub
   Parent A Child B C
   Parent B C Child D

• Each node will run the Condor job specified by its accompanying Condor submit file.
[Diagram: diamond DAG – A is the parent of B and C, which are both parents of D]
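Each x.sub named in the .dag file is an ordinary Condor submit description. A minimal sketch of what a.sub might contain (the executable and file names here are illustrative, not from the original slides):

```
# a.sub -- hypothetical submit file for node A
universe   = vanilla
executable = a.out
output     = a.output
error      = a.error
log        = diamond.log
queue
```

The other nodes (b.sub, c.sub, d.sub) would look similar; DAGMan watches the jobs' log events to learn when each node finishes.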
Submitting a DAG
• To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

   % condor_submit_dag diamond.dag

• condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable.
• Thus the DAGMan daemon itself runs as a Condor job, so you don't have to baby-sit it.
Running a DAG
• DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor-G based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file and submits job A to the Condor-G job queue; B, C, and D are held back]
Running a DAG (cont'd)
• DAGMan holds & submits jobs to the Condor-G queue at the appropriate times.
[Diagram: A has completed, so DAGMan submits B and C to the Condor-G job queue; D is still held]
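The scheduling rule above can be sketched in a few lines of Python. This is an illustration of the idea, not DAGMan's actual implementation: a node becomes eligible for submission once every parent has completed.

```python
# Sketch of the meta-scheduler rule: a node may be submitted once all of
# its parents have completed (illustrative, not DAGMan's real code).

def ready_nodes(parents, done, in_queue):
    """Return nodes eligible for submission: not yet finished, not already
    queued, and with every parent already completed."""
    return sorted(
        node for node in parents
        if node not in done
        and node not in in_queue
        and all(p in done for p in parents[node])
    )

# The diamond DAG from the slides: A -> {B, C} -> D.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

print(ready_nodes(parents, done=set(), in_queue=set()))            # only A
print(ready_nodes(parents, done={"A"}, in_queue=set()))            # B and C
print(ready_nodes(parents, done={"A", "B"}, in_queue={"C"}))       # nothing yet
print(ready_nodes(parents, done={"A", "B", "C"}, in_queue=set()))  # D
```

Running the four queries reproduces the sequence shown in the diagrams: first A alone, then B and C together, then D once both of its parents are done.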
Running a DAG (cont'd)
• In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.
[Diagram: C fails; A and B are recorded as done in the rescue file, and D cannot run]
Recovering a DAG – fault tolerance
• Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Diagram: the rescue file marks A and B as done; DAGMan resubmits C to the Condor-G job queue]
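Conceptually, the rescue file is just a record of which nodes already completed. The exact format varies across Condor versions (older versions rewrite the whole DAG with DONE annotations; newer ones write a partial file), but for the failed diamond run it would record something like this hypothetical content:

```
# Rescue file written after node C failed
# (schematic only -- the real format depends on the Condor version)
DONE A
DONE B
```

On resubmission, DAGMan treats A and B as finished and starts directly with C.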
Recovering a DAG (cont'd)
• Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Diagram: C completes, so DAGMan submits D, the last remaining node]
Finishing a DAG
• Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Diagram: all four nodes A, B, C, and D have completed; the Condor-G queue is empty]
Additional DAGMan Features
• Provides other handy features for job management…
   – nodes can have PRE & POST scripts
   – failed nodes can be automatically re-tried a configurable number of times
   – job submission can be "throttled"
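The retry feature can be sketched as a simple loop. This is an illustration of the behavior, not DAGMan's implementation: a node that fails is re-run until it succeeds or its retry budget is exhausted.

```python
# Sketch of DAGMan-style node retry (illustrative only): a failed node is
# re-run up to a configurable number of extra times before the DAG gives
# up on it.

def run_with_retries(run, max_retries):
    """Run a node, retrying on failure; return (succeeded, attempts)."""
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if run():
            return True, attempts
    return False, attempts

# A node that fails twice, then succeeds on the third attempt.
outcomes = iter([False, False, True])
print(run_with_retries(lambda: next(outcomes), max_retries=3))  # (True, 3)
```

With `max_retries=1` the same node would be declared failed after two attempts, and DAGMan would then write the rescue file described earlier.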
HANDS-ON
• http://users.cs.fiu.edu/~skala001/DAGMan_Lab.htm
Ewa Deelman, deelman@isi.edu
www.isi.edu/~deelman, pegasus.isi.edu
Scientific Analysis Workflow Evolution
[Diagram: Construct the Analysis → Workflow Template → Select the Input Data → Abstract Workflow → Map the Workflow onto Available Resources → Concrete Workflow → Execute the Workflow → Tasks to be executed on Grid Resources]
Scientific Analysis Workflow Evolution – Execution Environment
[Diagram: the same workflow evolution annotated with its execution environment – a Library of Application Components supplies component characteristics to workflow construction, Data Catalogs supply data properties to input selection, and Information Services supply resource availability and characteristics to the mapping step; refinement ranges from user-guided to automated, and the concrete workflow's tasks execute on Grid Resources]
Pegasus: Planning for Execution in Grids
• Abstract Workflows – Pegasus input workflow description
   – workflow "high-level language"
   – only identifies the computations that a user wants to do
   – devoid of resource descriptions
   – devoid of data locations
• Pegasus (http://pegasus.isi.edu)
   – a workflow "compiler"
   – target language: DAGMan's DAG and Condor submit files
   – transforms the workflow for performance and reliability
   – automatically locates physical locations for both workflow components and data
   – finds appropriate resources to execute the components
   – provides runtime provenance
• DAGMan
   – a workflow executor
   – scalable and reliable execution of an executable workflow
Pegasus Workflow Management System
• A reliable, scalable workflow management system that an application or workflow composition service can depend on to get the job done
• Abstract Workflow – produced by a client tool with no special requirements on the infrastructure
• Pegasus mapper – a decision system that develops strategies for reliable and efficient execution in a variety of environments
• DAGMan – reliable and scalable execution of dependent tasks
• Condor Schedd – reliable, scalable execution of independent tasks (locally, across the network), priorities, scheduling
• Cyberinfrastructure: local machine, cluster, Condor pool, OSG, TeraGrid
Generating a Concrete Workflow
• Information
   – location of files and component instances
   – state of the Grid resources
• Select specific
   – resources
   – files
• Add jobs required to form a concrete workflow that can be executed in the Grid environment
   – data movement
   – data registration
• Each component in the abstract workflow is turned into an executable job
[Diagram: an abstract node "FFT filea" becomes a concrete chain – a data-transfer job (move filea from host1://home/filea to host2://home/file1), an executable job (/usr/local/bin/fft /home/file1), and a data-registration job]
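The expansion of one abstract node into transfer, execute, and register jobs can be sketched as follows. This is a hypothetical illustration of the idea, not Pegasus's real API; the function and field names are invented, and the replica catalog is a plain dictionary standing in for the RLS.

```python
# Illustrative sketch of abstract-to-concrete mapping: each abstract task
# gains a stage-in transfer for its inputs, the executable invocation, and
# a registration of its outputs (not Pegasus's actual implementation).

def concretize(task, replica_catalog, site):
    """Expand one abstract task into a list of concrete job descriptions."""
    jobs = []
    for logical in task["inputs"]:
        src = replica_catalog[logical]          # physical location from the RLS
        jobs.append(f"transfer {src} -> {site}:{logical}")
    args = " ".join(task["inputs"])
    jobs.append(f"execute {task['exe']} {args} at {site}")
    for out in task["outputs"]:
        jobs.append(f"register {site}:{out} in RLS")
    return jobs

# The FFT example from the slide, with an invented output file "fileb".
task = {"exe": "/usr/local/bin/fft", "inputs": ["filea"], "outputs": ["fileb"]}
rls = {"filea": "host1:/home/filea"}
for j in concretize(task, rls, "host2"):
    print(j)
```

The printed chain (transfer, execute, register) mirrors the concrete workflow in the diagram above.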
Information Components used by Pegasus
• Globus Monitoring and Discovery Service (MDS)
   – locates available resources
   – finds resource properties
      • dynamic: load, queue length
      • static: location of GridFTP server, RLS, etc.
• Globus Replica Location Service (RLS)
   – locates data that may be replicated
   – registers new data products
• Transformation Catalog
   – locates installed executables
Example Workflow Reduction
• Original abstract workflow: job d1 reads file a and produces file b; job d2 reads file b and produces file c.
• If "b" already exists (as determined by query to the RLS), the workflow can be reduced: d1 is pruned, and d2 runs directly from the existing b to produce c.
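The reduction step can be sketched with a simple prune rule. This is a hypothetical illustration, not Pegasus's algorithm: a job is dropped when all of its output files are already registered, so downstream jobs can start from the cached data.

```python
# Sketch of workflow reduction (illustrative only): drop any job whose
# outputs all exist already in the replica catalog.

def reduce_workflow(jobs, existing):
    """Return the jobs that still need to run."""
    return [j for j in jobs if not all(o in existing for o in j["outputs"])]

# The example from the slide: d1 produces b, d2 consumes b and produces c.
jobs = [
    {"name": "d1", "inputs": ["a"], "outputs": ["b"]},
    {"name": "d2", "inputs": ["b"], "outputs": ["c"]},
]
print([j["name"] for j in reduce_workflow(jobs, existing={"a"})])       # both run
print([j["name"] for j in reduce_workflow(jobs, existing={"a", "b"})])  # only d2
```

With b already registered, only d2 remains, matching the reduced workflow above.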
Mapping from abstract to concrete
• Query RLS, MDS, and TC; schedule computation and data movement:
   – move b from A to B
   – execute d2 at B
   – move c from B to U
   – register c in the RLS
Pegasus Research
• resource discovery and assessment
• resource selection
• resource provisioning
• workflow restructuring
   – tasks merged together or reordered to improve overall performance
• adaptive computing
   – workflow refinement adapts to the changing execution environment
Benefits of the workflow & Pegasus approach
• The workflow exposes
   – the structure of the application
   – the maximum parallelism of the application
• Pegasus can take advantage of the structure to
   – set a planning horizon (how far into the workflow to plan)
   – cluster a set of workflow nodes to be executed as one (for performance)
• Pegasus shields users from the Grid details
Benefits of the workflow & Pegasus approach (cont'd)
• Pegasus can run the workflow on a variety of resources
• Pegasus can run a single workflow across multiple resources
• Pegasus can opportunistically take advantage of available resources (through dynamic workflow mapping)
• Pegasus can take advantage of pre-existing intermediate data products
• Pegasus can improve the performance of the application