Resource Management of Large-Scale Applications on a Grid

Laukik Chitnis and Sanjay Ranka
(with Paul Avery, Jang-uk In and Rick Cavanaugh)

Department of CISE, University of Florida, Gainesville
[email protected]
352 392 6838
http://www.cise.ufl.edu/~ranka/

Slide 2: Overview

• High End Grid Applications and Infrastructure at the University of Florida
• Resource Management for Grids
• Sphinx Middleware for Resource Provisioning
• Grid Monitoring for better meta-scheduling
• Provisioning Algorithm Research for multi-core and grid environments

Slide 3: The Evolution of High-End Applications (and their system characteristics)

• 1980: Mainframe applications (central mainframes)
• 1990: Compute-intensive applications (large clusters, supercomputers)
• 2000: Data-intensive applications (geographically distributed datasets, high-speed storage, gigabit networks)

Slide 4: Some Representative Applications

HEP, Medicine, Astronomy, Distributed Data Mining

Slide 5: Representative Application: High Energy Physics

• 1-10 petabytes of data
• 1000+ physicists in 20+ countries

Slide 6: Representative Application: Tele-Radiation Therapy

[Architecture diagram of the RCET Center for Radiation Oncology: imaging devices (MAGNETOM), treatment planning systems, and film scanners feed DICOM, DICOM-RT, and RTOG data through NetSys readers into the RCET server and RCET database; web-based upload/download, electronic folder, and rapid review tools serve clinic and investigator PCs, with visualization (2D/3D), DVH, iso-dose, cut-plane, case information, review, structure modification, and annotation features; a SOANS external database is also connected.]

Slide 7: Representative Application: Distributed Intrusion Detection

NSF ITR Project: Middleware for Distributed Data Mining (PI: Ranka, joint with Kumar and Grossman)

[Architecture diagram: applications connected through data management services, data mining and scheduling services, and data transport services.]

Slide 8: Grid Infrastructure

Florida Lambda Rail and UF

Slide 9: Campus Grid (University of Florida)

NSF Major Research Instrumentation Project (PI: Ranka, Avery et al.)
• 20 Gigabit/sec network
• 20+ terabytes of storage
• 2-3 teraflops of compute
• 10 scientific and engineering applications

[Photos: an Infiniband-based cluster and a Gigabit Ethernet-based cluster.]

Slide 10: Grid Services

The software part of the infrastructure!

Slide 11: Services Offered in a Grid

• Resource management services
• Data management services
• Monitoring and information services
• Security services

Note that all the other services use the security services.

Slide 12: Resource Management Services

• Provide a uniform, standard interface to remote resources, including CPU, storage, and bandwidth
• The main component is the remote job manager
• Example: GRAM (Globus Resource Allocation Manager)

Slide 13: Resource Management on a Grid

[Diagram: a user submits jobs to the grid; GRAM at each site (Site 1 through Site n) hands them to a different local scheduler such as Condor, PBS, LSF, or fork.]

Narration: note the different local schedulers.

Slide 14: Scheduling your Application

Slide 15: Scheduling your Application

• An application can be run on a grid site as a job.
• The modules in the grid architecture (such as GRAM) give your job uniform access to the grid sites.
• But most applications can be "parallelized", and these separate parts can be scheduled to run simultaneously on different sites, thus utilizing the power of the grid.

Slide 16: Modeling an Application Workflow

• Many workflows can be modeled as a directed acyclic graph (DAG).
• The amount of resource required (in units of time) is known to a degree of certainty.
• There is a small probability of failure in execution (in a grid environment this can happen because resources are no longer available).

[Figure: a directed acyclic graph.]
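For concreteness, here is a minimal Python sketch of this model: tasks carry an estimated runtime and a small failure probability, and dependencies form a DAG whose consistency can be checked with a topological sort. The class and field names are illustrative assumptions, not part of any grid toolkit.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    est_runtime: float          # estimated resource need, in units of time
    fail_prob: float = 0.01     # small chance the task fails (e.g., resource loss)
    deps: list = field(default_factory=list)  # names of parent tasks

def topological_order(tasks):
    """Return tasks in an order that respects dependencies (Kahn's algorithm)."""
    by_name = {t.name: t for t in tasks}
    indegree = {t.name: len(t.deps) for t in tasks}
    ready = [n for n, d in indegree.items() if d == 0]
    order = []
    while ready:
        name = ready.pop()
        order.append(by_name[name])
        for t in tasks:
            if name in t.deps:
                indegree[t.name] -= 1
                if indegree[t.name] == 0:
                    ready.append(t.name)
    if len(order) != len(tasks):
        raise ValueError("cycle detected: not a DAG")
    return order

# A tiny diamond-shaped workflow: extract feeds two analyses, then a merge.
workflow = [
    Task("extract", est_runtime=10.0),
    Task("analyze_a", est_runtime=30.0, deps=["extract"]),
    Task("analyze_b", est_runtime=25.0, deps=["extract"]),
    Task("merge", est_runtime=5.0, deps=["analyze_a", "analyze_b"]),
]
print([t.name for t in topological_order(workflow)])
```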

Slide 17: Workflow Resource Provisioning

The core problem: executing multiple workflows over distributed and adaptive (faulty) resources while managing policies.

• Applications: large, multiple, data intensive, with precedence and time constraints
• Resources: distributed, heterogeneous, multi-core, faulty
• Policies: priority, access control, quota, multiple ownership

Slide 18: A Real Life Example from High Energy Physics

Merge two grids into a single multi-VO "Inter-Grid". How to ensure that:
• neither VO is harmed?
• both VOs actually benefit?
• there are answers to questions like: "With what probability will my job be scheduled and complete before my conference deadline?"

Clear need for a scheduling middleware!

[Map of participating sites: FNAL, Rice, UI, MIT, UCSD, UF, UW, Caltech, UM, UTA, ANL, IU, UC, LBL, SMU, OU, BU, BNL.]

Slide 19: Typical Scenario

[Diagram: a VDT client faces several VDT servers, unsure which one to use ("?").]

Slide 20: Typical Scenario

[Diagram: the same VDT client and servers, now drowning in garbled, conflicting information ("@#^%#%$@#").]

Slide 21: Some Requirements for Effective Grid Scheduling

Information requirements:
• Past and future dependencies of the application
• Persistent storage of workflows
• Resource usage estimation
• Policies (expected to vary slowly over time)
• Global views of job descriptions
• Request tracking and usage statistics (state information is important)
• Resource properties and status (expected to vary slowly with time)
• Grid weather (latency of measurement is important)
• Replica management

System requirements:
• Distributed, fault-tolerant scheduling
• Customisability
• Interoperability with other scheduling systems
• Quality of service

Slide 22: Incorporate Requirements into a Framework

Assume the GriPhyN Virtual Data Toolkit:
• Client (request/job submission): Globus clients, Condor-G/DAGMan, Chimera Virtual Data System
• Server (resource gatekeeper): MonALISA monitoring service, Globus services, RLS (Replica Location Service)

[Diagram: a VDT client and several VDT servers, with the scheduling question still open ("?").]

Slide 23: Incorporate Requirements into a Framework

Assume the Virtual Data Toolkit:
• Client (request/job submission): Clarens Web Service, Globus clients, Condor-G/DAGMan, Chimera Virtual Data System
• Server (resource gatekeeper): MonALISA monitoring service, Globus services, RLS (Replica Location Service)

Framework design principles:
• Information driven
• Flexible client-server model
• General, but pragmatic and simple
• Avoid adding middleware requirements on grid resources

[Diagram: a recommendation engine now mediates between the VDT client and the VDT servers.]

Slide 24: Related Provisioning Software

System       Notes                              Adaptive    Co-         Fault-    Policy-  QoS      Flexible
                                                scheduling  allocation  tolerant  based    support  interface
Nimrod-G     Economy-driven, deadline support   X           O           X         X        O        X
Maui/Silver  Priority-based, reservation        O           O           X         O        O        X
PBS          Batch job scheduling, queue-based  X           O           X         X        O        X
EZ-Grid      Policy-based                       X           O           X         O        X        O
Prophet      Parallel SPMD                      X           X           X         X        O        X
LSF          Interactive, batch modes           X           O           O         O        O        X

Slide 25: Innovative Workflow Scheduling Middleware

• Modular system: automated scheduling procedure built from modular services
• Robust and recoverable system: database infrastructure; fault-tolerant and recoverable from internal failures
• Platform-independent, interoperable system: XML-based communication protocols (SOAP, XML-RPC); supports heterogeneous service environments

Implementation: 60 Java classes, 24,000 lines of Java code; 50 test scripts, 1,500 lines of script code.

Slide 26: The Sphinx Workflow Execution Framework

[Architecture diagram: a Sphinx client (Chimera Virtual Data System) submits work through a Clarens web-service backbone to the Sphinx server, whose request processing, data warehouse, data management, and information gathering modules drive Condor-G/DAGMan submission to VDT server sites; the MonALISA monitoring service, Globus resources, and the Replica Location Service feed it information.]

Slide 27: Sphinx Workflow Scheduling Server

Functions as the nerve centre:
• Data warehouse: policies, account information, grid weather, resource properties and status, request tracking, workflows, etc.
• Control process: a finite state machine; different modules modify jobs, graphs, and workflows and change their state (a minimal sketch follows below)
• Flexible and extensible

Server modules: control process, message interface, job predictor, graph predictor, job admission control, graph admission control, job execution planner, graph data planner, graph reducer, graph tracker, data warehouse, data management, information gatherer.
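The slides describe the control process only as a finite state machine, so the state set and module names below are hypothetical, chosen to illustrate the pattern of modules advancing a job from state to state rather than Sphinx's actual implementation.

```python
from enum import Enum, auto

class JobState(Enum):            # hypothetical states, for illustration only
    SUBMITTED = auto()
    PREDICTED = auto()
    PLANNED = auto()
    EXECUTING = auto()
    FINISHED = auto()
    FAILED = auto()

# Hypothetical transition table: each module moves a job one state forward.
TRANSITIONS = {
    JobState.SUBMITTED: ("job_predictor", JobState.PREDICTED),
    JobState.PREDICTED: ("execution_planner", JobState.PLANNED),
    JobState.PLANNED: ("submitter", JobState.EXECUTING),
    JobState.EXECUTING: ("tracker", JobState.FINISHED),
}

class Job:
    def __init__(self, name):
        self.name = name
        self.state = JobState.SUBMITTED

def control_process_step(job):
    """One tick of the control process: dispatch the job to the module
    responsible for its current state, then advance the state."""
    if job.state in (JobState.FINISHED, JobState.FAILED):
        return False
    module, next_state = TRANSITIONS[job.state]
    print(f"{module} handles {job.name}: {job.state.name} -> {next_state.name}")
    job.state = next_state
    return True

job = Job("dag42.job7")
while control_process_step(job):
    pass
```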

Slide 28: SPHINX

Scheduling in Parallel for Heterogeneous Independent NetworXs

Slide 29: Policy-Based Scheduling

Sphinx provides "soft" QoS through time-dependent, global views of:
• Submissions (workflows, jobs, allocation, etc.)
• Policies
• Resources

It uses linear programming methods to:
• Satisfy constraints (policies, user requirements, etc.)
• Optimize an "objective" function
• Estimate the probability of meeting deadlines within policy constraints

(A small LP sketch follows after the citation and figure below.)

J. In, P. Avery, R. Cavanaugh, and S. Ranka, "Policy Based Scheduling for Simple Quality of Service in Grid Computing", in Proceedings of the 18th IEEE IPDPS, Santa Fe, New Mexico, April 2004.

[Figure: the policy space as a cube spanned by submissions, resources, and time.]
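To make the linear-programming formulation concrete, here is a minimal sketch, assuming scipy is available: fractional assignment variables x[job, site], an equality constraint that fully assigns each job, a per-site quota standing in for a policy, and total estimated runtime as the objective. The numbers and the quota are invented for illustration; Sphinx's actual formulation is richer than this.

```python
import numpy as np
from scipy.optimize import linprog

jobs, sites = 3, 2
# est_runtime[j][s]: estimated runtime of job j at site s (illustrative numbers).
est_runtime = np.array([[10.0, 14.0],
                        [20.0, 12.0],
                        [15.0, 15.0]])

# Variables: x[j, s] = fraction of job j assigned to site s, flattened row-major.
c = est_runtime.flatten()

# Each job must be fully assigned: sum over s of x[j, s] == 1.
A_eq = np.zeros((jobs, jobs * sites))
for j in range(jobs):
    A_eq[j, j * sites:(j + 1) * sites] = 1.0
b_eq = np.ones(jobs)

# Policy stand-in: site 0 may take at most 1.5 jobs' worth of this user's work.
A_ub = np.zeros((1, jobs * sites))
A_ub[0, 0::sites] = 1.0
b_ub = np.array([1.5])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, 1.0)] * (jobs * sites))
print(res.x.reshape(jobs, sites).round(2))  # recommended allocation fractions
```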

Slide 30: Ability to Tolerate Task Failures

[Chart: average DAG completion time in seconds (30 DAGs x 10 jobs/DAG) under four scheduling algorithms: # of CPUs based, round-robin, # of CPUs based without feedback, and round-robin without feedback.]

[Chart: number of timed-out jobs (120 DAGs x 10 jobs/DAG), log scale: completion-time based 125, queue-length based 386, # of CPUs based 327, round robin 154, # of CPUs based without feedback 2258.]

• Using feedback information has a significant impact.

Jang-uk In, Sanjay Ranka, et al., "SPHINX: A fault-tolerant system for scheduling in dynamic grid environments", in Proceedings of the 19th IEEE IPDPS, Denver, Colorado, April 2005.

Slide 31: Grid Enabled Analysis

SC|03

Slide 32: Distributed Services for Grid-Enabled Data Analysis

[Architecture diagram: a ROOT data analysis client talks through Clarens to the Sphinx scheduling service, the Chimera virtual data service, the Sphinx/VDT execution service, the MonALISA monitoring service, and the RLS replica location service; VDT resource services and file services at Fermilab, Caltech, Iowa, and Florida are reached via Globus and GridFTP.]

Slide 33: Evaluation of Information Gathered from Grid Monitoring Systems

[Scatter plots: turnaround time (sec) versus parameter value for AvgJobDelay, versus site rating value for queue_length, and versus cluster_load value.]

Correlation index with turnaround time:
• Queue length: -0.05818
• Cluster load: -0.20775
• Average Job Delay: 0.892542

Slide 34: Limitation of Existing Monitoring Systems for the Grid

Information aggregated across multiple users is not very useful in effective resource allocation.

An end-to-end parameter such as Average Job Delay - the average queuing delay experienced by a job of a given user at an execution site - is a better estimate for comparing the resource availability and response time for a given user.

It is also not very susceptible to monitoring latencies.
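As a sketch of how such an end-to-end, per-user metric could be computed from completed job records (the record layout and numbers are illustrative assumptions):

```python
from collections import defaultdict

# Each completed job record: (user, site, submit_time, start_time), in seconds.
job_log = [
    ("alice", "uf",      0.0,  40.0),
    ("alice", "uf",     10.0, 130.0),
    ("alice", "caltech", 5.0,  15.0),
    ("bob",   "uf",      0.0, 400.0),
]

def average_job_delay(log):
    """Average queuing delay (start - submit) per (user, site) pair.
    Computed per user, so one user's backlog does not distort another's view."""
    total = defaultdict(float)
    count = defaultdict(int)
    for user, site, submit, start in log:
        total[(user, site)] += start - submit
        count[(user, site)] += 1
    return {k: total[k] / count[k] for k in total}

delays = average_job_delay(job_log)
print(delays[("alice", "uf")])       # 80.0 seconds
print(delays[("alice", "caltech")])  # 10.0 seconds
```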

Slide 35: Effective DAG Scheduling

[Chart: average DAG completion time in seconds (120 DAGs x 10 jobs/DAG) for four algorithms: completion-time based, queue-length based, # of CPUs based, and round robin.]

The completion-time-based algorithm here uses the Average Job Delay parameter for scheduling. As the figure shows, it outperforms the algorithms driven by the other monitored parameters. (A site-selection sketch follows below.)
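A minimal sketch of completion-time-based site selection, assuming per-user Average Job Delay estimates like those above plus a crude per-site speed factor (both inputs are illustrative):

```python
def pick_site(user, est_cpu_seconds, avg_job_delay, site_speed):
    """Choose the site with the smallest estimated turnaround:
    expected queuing delay for this user plus scaled execution time."""
    def estimated_completion(site):
        return avg_job_delay.get((user, site), 0.0) + est_cpu_seconds / site_speed[site]
    return min(site_speed, key=estimated_completion)

site_speed = {"uf": 1.0, "caltech": 0.5}   # relative CPU speed (illustrative)
delays = {("alice", "uf"): 80.0, ("alice", "caltech"): 10.0}
print(pick_site("alice", 60.0, delays, site_speed))  # caltech: 10 + 120 < 80 + 60
```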

Slide 36: Work in Progress: Modeling Workflow Cost and Developing Efficient Provisioning Algorithms

1. Developing an objective measure of completion time that integrates the performance and reliability of workflow execution: P(time to complete >= T) <= epsilon.

2. Relating this measure to the properties of the longest path of the DAG, based on the mean and uncertainty of the time required for the underlying tasks, which arise from (1) variable time requirements due to different parameter values and (2) failures due to changes in the underlying resources.

3. Developing novel scheduling and replication techniques to optimize allocation based on these metrics.

[Figure: a directed acyclic graph.]
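For intuition, a minimal sketch of the tail estimate under a strong simplifying assumption: task times along the longest path are independent and their sum is approximately normal. The means and variances are invented.

```python
import math

def path_tail_probability(means, variances, deadline):
    """Approximate P(path completion time >= deadline) for a chain of tasks,
    treating the sum of task times as normal (central limit heuristic)."""
    mu = sum(means)
    sigma = math.sqrt(sum(variances))
    z = (deadline - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Longest path: three tasks with mean 100s each and growing uncertainty.
means = [100.0, 100.0, 100.0]
variances = [100.0, 400.0, 900.0]
for T in (300.0, 350.0, 400.0):
    print(f"P(completion >= {T:.0f}s) ~ {path_tail_probability(means, variances, T):.3f}")
```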

Slide 37: Work in Progress: Provisioning Algorithms for Multiple Workflows (Yield Management)

• Quality of service guarantees for each workflow
• Controlled (a cluster of multi-core processors) versus uncontrolled (a grid of multiple clusters owned by multiple units) environments

[Figure: multiple workflows (Dag 1 through Dag 5), each spanning levels 1 through 4.]

Slide 38: CHEPREO: Grid Education and Networking

• E/O center in the Miami area
• Tutorial for large-scale application development

Slide 39: Grid Education

• Developing a Grid tutorial as part of CHEPREO: grid basics, components of a grid, grid services, OGSA, ...
• OSG summer workshop, South Padre Island, Texas, July 11-15, 2005: http://osg.ivdgl.org/twiki/bin/view/SummerGridWorkshop/
• Lectures and hands-on sessions on building and maintaining a grid

Slide 40: Acknowledgements

• CHEPREO project, NSF
• GriPhyN/iVDgL, NSF
• Data Mining Middleware, NSF
• Intel Corporation

Slide 41: Thank You

May the Force be with you!

Slide 42: Additional Slides

Slide 43: Effect of Latency on Average Job Delay

[Scatter plots: turnaround time (sec) versus parameter value for AvgJobDelay with 10-minute and 5-minute added latencies.]

Latency is simulated in the system by deliberately retrieving old values for the parameter while making scheduling decisions.

The correlation indices with added latencies are comparable to, though (as expected) lower than, the correlation indices of the un-delayed Average Job Delay parameter. The correlation remains quite high.

Average Job Delay correlation index with turnaround time:

                  Added latency = 5 min   Added latency = 10 min
Site rank         0.688959                0.754222
Raw value         0.582685                0.777754
Learning period   29 jobs                 48 jobs

Slide 44: SPHINX Scheduling Latency

[Chart: average scheduling latency in seconds versus job arrival rate (0.5 to 17 jobs/minute) for 20, 40, 80, and 100 DAGs.]

Average scheduling latency for various numbers of DAGs (20, 40, 80 and 100) at different arrival rates per minute.

Slide 45: Demonstration at the Supercomputing Conference: Distributed Data Analysis in a Grid Environment

[Architecture diagram: a graphical data analysis interface (ROOT) connects through Clarens grid-enabled web services to the virtual data service (Chimera), the grid scheduling service (Sphinx), the grid-enabled execution service (VDT client), and the grid resource management service (VDT server), with MonALISA as the grid resource monitoring system and RLS as the replica location service.]

The architecture has been implemented and demonstrated at SC|03 and SC|04.

Slide 46: Scheduling DAGs: the Dynamic Critical Path Algorithm

The DCP algorithm executes the following steps iteratively (a sketch follows below):

1. Compute the absolute earliest possible start time (AEST) and the absolute latest possible start time (ALST) for all tasks on each processor.

2. Select a task that has the smallest difference between its ALST and AEST and no unscheduled parent task. If several tasks have the same difference, select the one with the smaller AEST.

3. Select the processor that gives the earliest start time for the selected task.
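A simplified sketch of the DCP loop, assuming uniform task costs across processors and zero communication cost, keeping only the mobility-based task selection and earliest-start processor choice described above:

```python
def dcp_schedule(tasks, cost, deps, num_procs):
    """tasks: list of names; cost[t]: runtime; deps[t]: list of parents."""
    children = {t: [u for u in tasks if t in deps[u]] for t in tasks}
    scheduled = {}                      # task -> (processor, start time)
    proc_free = [0.0] * num_procs       # earliest free time per processor
    while len(scheduled) < len(tasks):
        # Forward pass: earliest start times, honoring already-placed tasks.
        aest = {}
        for t in topo(tasks, deps):
            lb = max((aest[p] + cost[p] for p in deps[t]), default=0.0)
            aest[t] = max(lb, scheduled[t][1]) if t in scheduled else lb
        makespan = max(aest[t] + cost[t] for t in tasks)
        # Backward pass: latest starts that do not stretch the makespan.
        alst = {}
        for t in reversed(topo(tasks, deps)):
            alst[t] = min((alst[c] for c in children[t]), default=makespan) - cost[t]
        # Pick the unscheduled task with minimum mobility (ALST - AEST)
        # whose parents are all scheduled; break ties by smaller AEST.
        ready = [t for t in tasks if t not in scheduled
                 and all(p in scheduled for p in deps[t])]
        t = min(ready, key=lambda t: (alst[t] - aest[t], aest[t]))
        # Place it on the processor giving the earliest actual start.
        starts = [max(proc_free[p], aest[t]) for p in range(num_procs)]
        proc = min(range(num_procs), key=lambda p: starts[p])
        scheduled[t] = (proc, starts[proc])
        proc_free[proc] = starts[proc] + cost[t]
    return scheduled

def topo(tasks, deps):
    order, seen = [], set()
    def visit(t):
        if t in seen:
            return
        seen.add(t)
        for p in deps[t]:
            visit(p)
        order.append(t)
    for t in tasks:
        visit(t)
    return order

deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": 2.0, "b": 4.0, "c": 3.0, "d": 1.0}
print(dcp_schedule(list(deps), cost, deps, num_procs=2))
```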

Slide 47: Scheduling DAGs: ILP, a Novel Algorithm Supporting Heterogeneity (work supported by Intel Corporation)

There are two novel features (see the sketch after this list):

• Assign multiple independent tasks simultaneously: the cost of an assigned task depends on the processor available, and many tasks commence with small differences in start time.

• Iteratively refine the schedule, using the cost of the critical path computed from the assignment of the previous iteration.

[Figure: a directed acyclic graph.]
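A rough sketch of the iterative-refinement idea under heterogeneous costs; the ranking, batching, and convergence details here are simplifications for illustration, not the published algorithm:

```python
def icp_sketch(tasks, deps, cost, procs, iters=3):
    """cost[t][p]: runtime of task t on processor p (heterogeneous).
    Repeatedly: rank tasks by critical-path length under current cost
    estimates, list-schedule them, then refine the estimates from the
    processors actually chosen."""
    est = {t: min(cost[t][p] for p in procs) for t in tasks}  # optimistic start
    children = {t: [u for u in tasks if t in deps[u]] for t in tasks}
    for _ in range(iters):
        # Upward rank: longest path (by current estimates) from t to an exit.
        rank = {}
        def upward(t):
            if t not in rank:
                rank[t] = est[t] + max((upward(c) for c in children[t]), default=0.0)
            return rank[t]
        for t in tasks:
            upward(t)
        # List-schedule in decreasing rank, placing each task where it finishes
        # earliest; independent ready tasks thus spread across processors.
        free = {p: 0.0 for p in procs}
        finish, assign = {}, {}
        for t in sorted(tasks, key=lambda t: -rank[t]):
            ready_at = max((finish[p] for p in deps[t]), default=0.0)
            p = min(procs, key=lambda p: max(free[p], ready_at) + cost[t][p])
            start = max(free[p], ready_at)
            finish[t] = start + cost[t][p]
            free[p] = finish[t]
            assign[t] = p
        est = {t: cost[t][assign[t]] for t in tasks}  # refine for next round
    return assign, max(finish.values())

deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": {0: 2, 1: 3}, "b": {0: 4, 1: 2},
        "c": {0: 3, 1: 3}, "d": {0: 1, 1: 2}}
print(icp_sketch(list(deps), deps, cost, procs=[0, 1]))
```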

Slide 48: Comparison of Different Algorithms

[Chart: scheduling length for ICP (Th=3), ICP (Th=5), ICP (Th=7), DCP, and HEFT on 2000 tasks with 30 processors.]

[Chart: percentage of cases (0-100) in which each of ICP (Th=5), DCP, and HEFT produces the best schedule, for 1000 to 4000 tasks on 30 processors.]

Slide 49: Time for Scheduling

[Chart: scheduling time versus number of tasks (1000 to 4000) for ICP (Th=3), ICP (Th=5), ICP (Th=7), DCP, and HEFT.]

[Chart: scheduling time versus number of processors (10 to 80) for the same algorithms.]