Workflow Task Clustering for Best Effort Systems with Pegasus
Gurmeet Singh, Mei-Hui Su, Karan Vahi, Ewa Deelman, Gaurang Mehta
Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292
Bruce Berriman, John Good
Infrared Processing and Analysis Center, California Institute of Technology, Pasadena, CA 91125
Daniel S. Katz
Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803
*The full moon is 0.5 deg. sq. when viewed from Earth; the full sky is ~400,000 deg. sq.
Generating mosaics of the sky
Size of mosaic      Number of input   Number of            Number     Approx. execution   Total data
(degrees square)*   data files        intermediate files   of jobs    time (20 procs)     footprint
 1                  53                588                  232        40 mins             1.2 GB
 2                  212               3,906                1,444      49 mins             5.5 GB
 4                  747               13,061               4,856      1 hr 46 mins        20 GB
 6                  1,444             22,850               8,586      2 hrs 14 mins       38 GB
10                  3,722             54,434               20,652     6 hours             97 GB
[Figure: structure of a small Montage workflow. Project jobs reproject the input images (Image1, Image2, Image3); Diff and Fitplane jobs compare overlapping projections; a BgModel job computes the corrections applied by the Background jobs; a final Add job assembles the mosaic.]
Pegasus
Based on programming-language principles:
Leverage abstraction for workflow description to obtain ease of use, scalability, and portability
Provide a compiler to map from high-level descriptions to executable workflows (correct, performance-enhanced mappings)
Rely on a runtime engine to carry out the instructions in a scalable, reliable manner
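To make the abstraction concrete, here is a minimal hypothetical sketch of a resource-independent workflow description that such a compiler could later bind to concrete sites, executables, and file replicas. The Task and Workflow classes and the file names are illustrative assumptions, not the actual Pegasus API.

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        """One node of an abstract workflow: a logical transformation plus logical files."""
        name: str
        transformation: str            # logical name, e.g. "mProject"; no path or site yet
        inputs: list = field(default_factory=list)
        outputs: list = field(default_factory=list)

    @dataclass
    class Workflow:
        tasks: dict = field(default_factory=dict)
        edges: list = field(default_factory=list)   # (parent, child) dependencies

        def add_task(self, task):
            self.tasks[task.name] = task

        def add_dependency(self, parent, child):
            self.edges.append((parent, child))

    # A tiny abstract (resource-independent) Montage-like workflow.
    wf = Workflow()
    wf.add_task(Task("proj1", "mProject", ["image1.fits"], ["proj1.fits"]))
    wf.add_task(Task("proj2", "mProject", ["image2.fits"], ["proj2.fits"]))
    wf.add_task(Task("add",   "mAdd",     ["proj1.fits", "proj2.fits"], ["mosaic.fits"]))
    wf.add_dependency("proj1", "add")
    wf.add_dependency("proj2", "add")

    # The compilation step would bind each task to a site and executable, add data
    # stage-in/out and registration jobs, and emit something a runtime engine can run.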
DAGMan (Directed Acyclic Graph MANager)
Runs workflows that can be specified as Directed Acyclic Graphs
Enforces DAG dependencies
Progresses as far as possible in the face of failures
Provides retries, throttling, etc.
Runs on top of Condor (and is itself a Condor job)
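As an illustration of the input DAGMan consumes, the sketch below generates a minimal DAG file; the node and submit-file names are placeholders, but JOB, PARENT ... CHILD, and RETRY are standard DAGMan keywords.

    # Minimal sketch: emit a DAGMan input file for a two-level workflow.
    # proj1.sub, proj2.sub, add.sub stand in for real Condor submit files.
    dag_lines = [
        "JOB proj1 proj1.sub",
        "JOB proj2 proj2.sub",
        "JOB add   add.sub",
        "PARENT proj1 proj2 CHILD add",   # 'add' runs only after both projections finish
        "RETRY proj1 3",                  # retry a failed node up to 3 times
        "RETRY proj2 3",
        "RETRY add 3",
    ]
    with open("montage.dag", "w") as f:
        f.write("\n".join(dag_lines) + "\n")
    # Submitted with: condor_submit_dag montage.dag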
A view of the Rho Oph dark cloud constructed with Montage from deep exposures made with the Two Micron All Sky Survey (2MASS) Extended Mission
Pegasus Workflow Mapping
Original workflow: 15 compute nodes, devoid of resource assignment
Resulting workflow mapped onto 3 Grid sites:
11 compute nodes (4 reduced based on available intermediate data)
13 data stage-in nodes
8 inter-site data transfers
14 data stage-out nodes to long-term storage
14 data registration nodes (data cataloging)
60 jobs to execute
[Figure: the structure of a small Montage workflow]
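The drop from 15 to 11 compute nodes comes from pruning tasks whose outputs are already available. Here is a hedged sketch of that reduction idea, assuming a simple replica catalog represented as a set of existing files and ignoring requested final outputs that must still be staged out; the dictionaries and names are illustrative, not Pegasus internals.

    # Drop tasks whose outputs already exist in the catalog, then repeatedly drop
    # tasks whose children have all been dropped (nothing downstream needs their data).
    def reduce_workflow(tasks, children, catalog):
        """tasks: name -> (inputs, outputs); children: name -> list of child names;
        catalog: set of file names already available, e.g. from a previous run."""
        removable = set()
        for name, (_inputs, outputs) in tasks.items():
            if outputs and all(f in catalog for f in outputs):
                removable.add(name)                    # its results are already there
        changed = True
        while changed:                                 # propagate upward through the DAG
            changed = False
            for name in tasks:
                if name in removable:
                    continue
                kids = children.get(name, [])
                if kids and all(k in removable for k in kids):
                    removable.add(name)                # no remaining consumer of its output
                    changed = True
        return {n: t for n, t in tasks.items() if n not in removable}

    # Example: proj1's output is already catalogued, so proj1 is dropped; 'add' still runs.
    tasks = {
        "proj1": (["image1.fits"], ["proj1.fits"]),
        "proj2": (["image2.fits"], ["proj2.fits"]),
        "add":   (["proj1.fits", "proj2.fits"], ["mosaic.fits"]),
    }
    children = {"proj1": ["add"], "proj2": ["add"]}
    print(sorted(reduce_workflow(tasks, children, {"proj1.fits"})))   # ['add', 'proj2']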
Automatic Node clustering
[Figure: clustering options for a level-structured workflow: no clustering; level-based clustering with clustering factor 5; two clusters per level; two tasks per cluster]
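A sketch of level-based clustering, under the assumption that tasks on the same level of the DAG (same longest-path depth from the roots) are independent and can be grouped; the clustering factor here is the number of tasks per cluster, and all structures and names are illustrative rather than Pegasus code.

    from collections import defaultdict, deque

    def assign_levels(parents):
        """Longest-path depth of each task from the workflow roots (parents: task -> parent list)."""
        children = defaultdict(list)
        indegree = {t: len(ps) for t, ps in parents.items()}
        for t, ps in parents.items():
            for p in ps:
                children[p].append(t)
        level = {t: 0 for t, d in indegree.items() if d == 0}
        queue = deque(level)
        while queue:
            t = queue.popleft()
            for c in children[t]:
                level[c] = max(level.get(c, 0), level[t] + 1)
                indegree[c] -= 1
                if indegree[c] == 0:
                    queue.append(c)
        return level

    def cluster_by_level(parents, tasks_per_cluster):
        """Group tasks on the same level into clusters of at most tasks_per_cluster tasks;
        each cluster runs as a single job, amortising scheduling and queueing overhead."""
        by_level = defaultdict(list)
        for task, lvl in assign_levels(parents).items():
            by_level[lvl].append(task)
        clusters = []
        for lvl in sorted(by_level):
            group = by_level[lvl]
            for i in range(0, len(group), tasks_per_cluster):
                clusters.append(group[i:i + tasks_per_cluster])
        return clusters

    # Example: four independent projection tasks feeding one add task, two tasks per cluster.
    parents = {"p1": [], "p2": [], "p3": [], "p4": [], "add": ["p1", "p2", "p3", "p4"]}
    print(cluster_by_level(parents, tasks_per_cluster=2))
    # [['p1', 'p2'], ['p3', 'p4'], ['add']]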
[Chart: number of jobs and runtime in hours per week of the year, 2005-2006, for SCEC CyberShake workflows run using Pegasus and DAGMan on the TeraGrid and USC resources]
Cumulatively, the workflows consisted of over half a million tasks and used over 2.5 CPU years.
The largest CyberShake workflow contained on the order of 100,000 nodes and accessed 10 TB of data.
Support for LIGO on Open Science Grid. LIGO workflows: 185,000 nodes, 466,000 edges, 10 TB of input data, 1 TB of output data.
pegasus.isi.edu
[Diagram: Pegasus/DAGMan architecture, shown for a 1 degree² Montage workflow on the TeraGrid. On the local submit host (a community resource), Pegasus maps the abstract workflow (resource-independent) to an executable workflow (resources identified); DAGMan releases ready tasks into the Condor queue, which dispatches jobs to the national cyberinfrastructure and receives job information back.]
Pegasus
Can map portions of workflows at a time
Supports the range of just-in-time to full-ahead mappings
Can cluster workflow nodes to increase computational granularity
Can minimize the amount of space required for workflow execution through dynamic data cleanup (see the sketch after this list)
Can handle workflows on the order of 100,000 tasks
Supports a variety of fault-recovery techniques
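A sketch of the dynamic data cleanup idea referenced above, assuming a simple model in which each file can be deleted as soon as its last consumer has finished; the function and file names are illustrative, not Pegasus internals, and files with no consumers (final outputs) are left untouched.

    from collections import defaultdict

    def plan_cleanup(tasks):
        """tasks: dict task_name -> {"inputs": [...], "outputs": [...]} in execution order.
        Returns a mapping: task -> list of files that can be deleted once that task finishes."""
        last_consumer = {}
        for name, t in tasks.items():
            for f in t["inputs"]:
                last_consumer[f] = name        # later tasks overwrite earlier ones
        # A cleanup node deleting file f would be made a child of last_consumer[f].
        cleanup_after = defaultdict(list)
        for f, task in last_consumer.items():
            cleanup_after[task].append(f)
        return dict(cleanup_after)

    # Example: intermediate projections can be removed once 'add' has read them.
    tasks = {
        "proj1": {"inputs": ["image1.fits"], "outputs": ["proj1.fits"]},
        "proj2": {"inputs": ["image2.fits"], "outputs": ["proj2.fits"]},
        "add":   {"inputs": ["proj1.fits", "proj2.fits"], "outputs": ["mosaic.fits"]},
    }
    print(plan_cleanup(tasks))
    # {'proj1': ['image1.fits'], 'proj2': ['image2.fits'], 'add': ['proj1.fits', 'proj2.fits']}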