1/36
ComplexHPC Spring School Day 2: KOALA Tutorial
The KOALA Scheduler
Nezih Yigitbasi
Delft University of Technology
May 10, 2011
2/36
Outline
• Koala Architecture
  • Job Model
  • System Components
• Support for different application types
  • Parallel Applications
  • Parameter Sweep Applications (PSAs)
  • Workflows
3/36
Introduction
• Developed in the DAS system
  • Deployed on the DAS-2 in September 2005
  • Ported to the DAS-3 in April 2007, and to the DAS-4 in April 2011
• Independent of grid middleware such as Globus
• Runs on top of local schedulers

• Objectives:
  • Data and processor co-allocation in grids
  • Support for different application types
  • Specialized job placement policies
4/36
Background (1): DAS-4

[Figure: the DAS-4 sites, connected by SURFnet6 10 Gb/s lambdas: VU (148 CPUs), TU Delft (64), Leiden (32), UvA (32), UvA/MultimediaN (72), Astron (46)]

• 1,600 cores (quad-core, 2.4 GHz CPUs)
• Accelerators
• 180 TB storage
• InfiniBand, Gb Ethernet
• Operational since October 2010
5/36
Background (2): Grid Applications
• Different application types with different characteristics:
  • Parallel applications
  • Parameter sweep applications
  • Workflows
  • Data-intensive applications

• Challenges:
  • Application characteristics and needs
  • Grid infrastructure is highly heterogeneous
  • Grid infrastructure configuration issues
  • Grid resources are highly dynamic
6/36
Koala Job Model

[Figure: a fixed job (job component placement fixed), a non-fixed job (scheduler decides on component placement), and a flexible job of the same total job size (scheduler decides on the split-up into job components and their placement)]
• A job consists of one or more job components
• A job component contains:
  • An executable name
  • The information necessary for scheduling
  • The information necessary for execution
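As an illustrative sketch (not Koala's actual data structures; field names are made up), the job model can be captured like this:

```python
from dataclasses import dataclass, field

@dataclass
class JobComponent:
    """One schedulable unit of a Koala job (illustrative sketch)."""
    executable: str      # the executable name
    num_processors: int  # information needed for scheduling
    input_files: list = field(default_factory=list)  # information needed for execution
    arguments: list = field(default_factory=list)

@dataclass
class Job:
    """A job is one or more components, possibly placed on different clusters."""
    components: list

    def total_size(self) -> int:
        return sum(c.num_processors for c in self.components)

# A two-component job, e.g. a candidate for co-allocation over two clusters:
job = Job([JobComponent("solver", 8), JobComponent("solver", 8)])
print(job.total_size())  # 16
```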
7/36
Koala Architecture (1)
8/36
Koala Architecture (2): A Closer Look

• PIP/NIP: information services
• RLS: replica location service
• CO: co-allocator
• PC: processor claimer
• RM: run monitor
• RL: runners listener
• DM: data mover
• Ri: runners
9/36
Scheduler
• Enforces scheduling policies:
  • Co-allocation policies: Worst Fit, Flexible Cluster Minimization, Communication-Aware, Close-to-Files
  • Malleability management policies: Favour Previously Started Malleable Applications, Equi Grow Shrink
  • Cycle scavenging policies: Equi-All, Equi-PerSite
  • Workflow scheduling policies: Single-Site, Multi-Site
10/36
Runners
• Extend support for different application types:
  • KRunner: Globus runner
  • PRunner: a simplified job runner
  • IRunner: Ibis applications
  • OMRunner: OpenMPI applications
  • MRunner: malleable applications based on the DYNACO framework
  • WRunner: workflows (Directed Acyclic Graphs) and BoTs
11/36
The Runners Framework
12/36
Support for Different Application Types
• Parallel Applications
  • MPI, Ibis, …
  • Co-allocation
  • Malleability

• Parameter Sweep Applications
  • Cycle scavenging
  • Run as low-priority jobs

• Workflows
13/36
Support for Co-Allocation
• What is co-allocation? (a reminder)
• Co-allocation Policies
• Experimental Results
14/36
Co-Allocation
• Simultaneous allocation of resources in multiple sites
  • Higher system utilization
  • Lower queue wait times

• Co-allocated applications might be less efficient due to the relatively slow wide-area communication
  • Parallel applications may have different communication characteristics
15/36
Co-Allocation Policies (1)
• Dictate where the components of a job go

• Policies for non-fixed jobs:
  • Load-aware: Worst Fit (WF), to balance load across clusters
  • Input-file-location-aware: Close-to-Files (CF), to reduce file-transfer times
  • Communication-aware: Cluster Minimization (CM), to reduce the number of wide-area messages
See: H.H. Mohamed and D.H.J. Epema, “An Evaluation of the Close-to-Files Processor and Data Co-Allocation Policy in Multiclusters,” IEEE Cluster 2004.
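A minimal sketch of the load-aware idea behind Worst Fit (assumed logic, not Koala's implementation): place each component on the cluster with the most idle processors, keeping load balanced.

```python
def worst_fit(components, idle):
    """Place each job component on the cluster that currently has the
    most idle processors (illustrative Worst Fit sketch).

    components: list of component sizes (processor counts)
    idle: dict mapping cluster name -> idle processor count
    Returns {component index: cluster}, or None if some component fits nowhere.
    """
    idle = dict(idle)  # work on a copy
    placement = {}
    for i, size in enumerate(components):
        best = max(idle, key=idle.get)  # emptiest cluster
        if idle[best] < size:
            return None  # this component does not fit anywhere
        placement[i] = best
        idle[best] -= size
    return placement

# Three 8-processor components over clusters of 16 processors each:
print(worst_fit([8, 8, 8], {"C1": 16, "C2": 16, "C3": 16}))
# {0: 'C1', 1: 'C2', 2: 'C3'}
```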
16/36
Co-Allocation Policies (2)
• Placement policies for flexible jobs:
  • Queue-time-aware: Flexible Cluster Minimization (FCM), i.e., CM plus reducing queue wait times
  • Communication-aware: Communication-Aware (CA), with decisions based on inter-cluster communication speeds
See: O.O.Sonmez, H.H. Mohamed and D.H.J. Epema, “Communication-aware Job Scheduling Policies for the Koala Grid Scheduler”, IEEE e-Science 2006.
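The flexible-job case can be sketched similarly; this toy version of the cluster-minimization idea splits a flexible job over as few clusters as possible by filling the emptiest clusters first (assumed logic, not Koala's code):

```python
def flexible_cluster_min(total_size, idle):
    """Split a flexible job over as few clusters as possible
    (illustrative sketch of the FCM idea).

    total_size: total number of processors the job needs
    idle: dict cluster -> idle processor count
    Returns {cluster: processors assigned}, or None if the job
    does not fit in the system at all.
    """
    remaining = total_size
    split = {}
    # Greedily use the emptiest clusters first, so fewer clusters
    # (and thus fewer wide-area links) are involved.
    for cluster in sorted(idle, key=idle.get, reverse=True):
        if remaining == 0:
            break
        take = min(idle[cluster], remaining)
        if take > 0:
            split[cluster] = take
            remaining -= take
    return split if remaining == 0 else None

# A flexible job of total size 24 over three 16-processor clusters:
print(flexible_cluster_min(24, {"C1": 16, "C2": 16, "C3": 16}))
# {'C1': 16, 'C2': 8}
```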
17/36
Co-Allocation Policies (3)
[Figure: a non-fixed job with three 8-processor components I, II, III placed by WF on clusters C1 (16), C2 (16), C3 (16), one component per cluster; the same job as a flexible job of total size 24 split by FCM into two components over C1 and C2]
18/36
Experimental Results: Co-Allocation vs. No Co-Allocation

• OpenMPI + DRMAA
• No co-allocation (left) vs. co-allocation (right)
• Workloads of real parallel applications ranging from computation-intensive (Prime) to very communication-intensive (Wave)

[Figure: average job response time (s) for Prime, Poisson, and Wave]

• Takeaway: co-allocation is disadvantageous for communication-intensive applications
19/36
Experimental Results: The Performance of the Policies

• Flexible Cluster Minimization vs. Communication-Aware
• Workloads of communication-intensive applications

[Figure: average job response time (s) for FCM vs. CA, without and with the Delft cluster]

• Takeaway: considering the network metrics improves co-allocation performance
20/36
Support for PSAs in Koala
• Background
• System Design
• Scheduling Policies
• Experimental Results
21/36
Parameter Sweep Application Model
• A single executable that runs for a large set of parameters
  • E.g., Monte Carlo simulations, bioinformatics applications, …
• PSAs may run in multiple clusters simultaneously
• We support OGF’s JSDL 1.0 (XML)
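In miniature, a PSA is one executable crossed with a parameter space; a sketch (the executable name and parameters are made up for illustration):

```python
from itertools import product

def sweep_tasks(executable, parameter_space):
    """Generate one task (command line) per point in the cartesian
    parameter space -- the PSA model in miniature.

    parameter_space: dict parameter name -> list of values
    """
    names = sorted(parameter_space)
    for values in product(*(parameter_space[n] for n in names)):
        args = [f"--{n}={v}" for n, v in zip(names, values)]
        yield [executable] + args

tasks = list(sweep_tasks("./simulate", {"seed": [1, 2, 3], "temp": [300, 350]}))
print(len(tasks))  # 3 * 2 = 6 tasks
```

Each generated task is independent, which is exactly why PSAs can run in several clusters at once.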
22/36
Motivation
• How to run thousands of tasks on the DAS?
• Issues:
  • The 15-minute rule!
  • Observational scheduling
  • Overload

• Run them as cycle scavenging applications!
  • Sets priority classes implicitly
  • No need to watch for empty clusters
23/36
Cycle Scavenging
• The technology behind volunteer computing projects
• Harnessing idle CPU cycles from desktops
• Download a piece of software (a screen saver)
• Receive tasks from a central server
• Execute a task when the computer is idle
• Preempt immediately when the user is active again
24/36
System Requirements
1. Unobtrusiveness: minimal delay for (higher-priority) local and grid jobs
2. Fairness: multiple cycle scavenging applications running concurrently should be assigned comparable CPU time
3. Dynamic resource allocation: cycle scavenging applications have to grow/shrink at runtime
4. Efficiency: as much use of dynamic resources as possible
5. Robustness and fault tolerance: in a long-running, complex system, problems will occur and must be dealt with
25/36
System Interaction
[Figure: the user submits PSA(s) in JDL; the CS-Runner registers with the Scheduler and receives grow/shrink messages; the CS-Runner submits Launchers to the cluster head nodes and deploys, monitors, and preempts tasks on the nodes; the KCM monitors and informs about idle/demanded resources]

• CS policies:
  • Equi-All: grid-wide basis
  • Equi-PerSite: per cluster

• Application-level scheduling:
  • Pull-based approach
  • Shrinkage policy
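The pull-based approach can be sketched as a launcher's main loop (hypothetical interfaces; the real Koala/Launcher protocol differs): an idle node keeps pulling tasks until the sweep is done or the node is reclaimed.

```python
import queue

def launcher_loop(tasks, run_task, preempted):
    """Pull-based scheduling sketch: a launcher on an idle node keeps
    pulling tasks from the runner's queue until the queue is empty or
    the node is reclaimed (preempted) by a higher-priority job.

    tasks: queue.Queue of task descriptions
    run_task: callable executing one task
    preempted: callable returning True when the node must be vacated
    Returns the number of tasks this launcher completed.
    """
    done = 0
    while not preempted():
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            break  # sweep finished
        run_task(task)
        done += 1
    return done

q = queue.Queue()
for i in range(5):
    q.put(i)
completed = launcher_loop(q, run_task=lambda t: None, preempted=lambda: False)
print(completed)  # 5
```

Pulling (rather than pushing) means faster nodes automatically take more tasks, and a preempted node simply stops pulling, which fits the unobtrusiveness requirement above.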
26/36
Cycle Scavenging Policies
1. Equipartition-All

[Figure: three CS users (CS User-1, -2, -3) each receiving an equal share of the grid-wide idle capacity of clusters C1 (12), C2 (12), C3 (24)]
27/36
Cycle Scavenging Policies
2. Equipartition-PerSite

[Figure: three CS users (CS User-1, -2, -3) each receiving an equal share within each of the clusters C1 (12), C2 (12), C3 (24)]
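Both policies can be sketched in a few lines (illustrative; real Koala accounting is more involved): Equi-All divides the grid-wide idle capacity equally among the CS users, while Equi-PerSite performs the division inside each cluster separately, so every user's share is also balanced across clusters.

```python
def equi_all(idle_per_cluster, users):
    """Equipartition over the grid-wide idle node pool (sketch)."""
    total = sum(idle_per_cluster.values())
    share, extra = divmod(total, len(users))
    # The first `extra` users get one node more when it does not divide evenly.
    return {u: share + (1 if i < extra else 0) for i, u in enumerate(users)}

def equi_per_site(idle_per_cluster, users):
    """Equipartition inside every cluster separately (sketch)."""
    shares = {u: 0 for u in users}
    for idle in idle_per_cluster.values():
        share, extra = divmod(idle, len(users))
        for i, u in enumerate(users):
            shares[u] += share + (1 if i < extra else 0)
    return shares

clusters = {"C1": 12, "C2": 12, "C3": 24}
users = ["U1", "U2", "U3"]
print(equi_all(clusters, users))       # {'U1': 16, 'U2': 16, 'U3': 16}
print(equi_per_site(clusters, users))  # {'U1': 16, 'U2': 16, 'U3': 16}
```

With these numbers the totals coincide; the policies differ in where each user's nodes are, and they diverge once remainders or per-cluster restrictions come into play.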
28/36
Experimental Results

• DAS-3; tested on a 32-node cluster
• Using Launchers vs. not, with 60 s dummy tasks

[Figure: makespan [s] vs. number of jobs, with and without Launchers; the gap is job startup overhead plus information delay]

• Equi-All vs. Equi-PerSite
  • 3 CS users submit the same application with the same parameter range
  • Non-CS workloads: WBlock, WBurst

[Figure: number of completed jobs per user under Equi-All and Equi-PerSite, for the WBlock and WBurst workloads]

• Takeaway: Equi-PerSite is fair and superior to Equi-All

See: O. Sonmez, B. Grundeken, H.H. Mohamed, A. Iosup, and D.H.J. Epema, "Scheduling Strategies for Cycle Scavenging in Multicluster Grid Systems," IEEE CCGrid 2009.
29/36
Support for Workflows in Koala
• Applications with dependencies
  • E.g., the Montage workflow
    • Astronomy application to generate mosaics of the sky
    • 4,500 tasks
    • Dependencies are file transfers

• Experience the WRunner in the hands-on session
30/36
Workflow Scheduling Policies (1/3)
1. Round Robin: submits the eligible tasks to the clusters in round-robin order
2. Single Cluster: maps every complete workflow to the least-loaded cluster at its submission
3. All Clusters: submits each eligible task to the least-loaded cluster
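The building block shared by these policies is the set of eligible tasks, i.e., tasks whose predecessors have all finished; a sketch with made-up data structures:

```python
def eligible_tasks(deps, finished):
    """Tasks whose dependencies have all completed and that have not
    run yet: the unit of work each policy dispatches.

    deps: dict task -> set of predecessor tasks
    finished: set of completed tasks
    """
    return sorted(t for t, pre in deps.items()
                  if t not in finished and pre <= finished)

def all_clusters(deps, finished, load):
    """'All Clusters' sketch: send each eligible task to the currently
    least-loaded cluster (load: dict cluster -> running task count)."""
    placement = {}
    for task in eligible_tasks(deps, finished):
        best = min(load, key=load.get)
        placement[task] = best
        load[best] += 1
    return placement

# Diamond workflow: A -> B, A -> C, (B, C) -> D
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(eligible_tasks(deps, finished={"A"}))           # ['B', 'C']
print(all_clusters(deps, {"A"}, {"C1": 0, "C2": 0}))  # {'B': 'C1', 'C': 'C2'}
```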
31/36
Workflow Scheduling Policies (2/3)

4. All Clusters File-Aware: submits each eligible task to the cluster that minimizes the transfer costs of the files on which it depends

5. Coarsening*: iteratively reduces the size of a graph by collapsing groups of nodes and their internal edges
   • We use the Heavy Edge Matching* technique to group tasks that are connected by heavy edges

*G. Karypis and V. Kumar, "Multilevel graph partitioning schemes," Int. Conf. on Parallel Processing, pages 113–122, 1995.
32/36
Workflow Scheduling Policies (3/3)

6. Cluster Minimization: submits as many eligible tasks as possible to a cluster before considering the next cluster

7. HEFT* (Heterogeneous Earliest-Finish-Time): selects the task with the highest upward-rank value and assigns it to the cluster that ensures its earliest completion time

*H. Topcuoglu, S. Hariri, and M. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing," IEEE TPDS, 13(3):260–274, 2002.

We will use Single Cluster and All Clusters in the hands-on session
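HEFT's upward rank can be sketched as a recursion over the task graph: a task's rank is its own average execution cost plus the most expensive (communication cost + rank) path through its successors. The weights below are made up for illustration:

```python
from functools import lru_cache

def upward_ranks(succ, w, c):
    """Compute HEFT upward ranks for every task (sketch).

    succ: dict task -> list of successor tasks
    w: dict task -> average computation cost
    c: dict (task, successor) -> average communication cost
    """
    @lru_cache(maxsize=None)
    def rank(t):
        if not succ[t]:
            return w[t]  # exit task: only its own cost
        return w[t] + max(c[(t, s)] + rank(s) for s in succ[t])
    return {t: rank(t) for t in succ}

# Diamond graph: A -> B, A -> C, (B, C) -> D
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
w = {"A": 2, "B": 3, "C": 1, "D": 2}
c = {("A", "B"): 1, ("A", "C"): 4, ("B", "D"): 1, ("C", "D"): 1}
ranks = upward_ranks(succ, w, c)
print(ranks)  # {'A': 10, 'B': 6, 'C': 4, 'D': 2}
```

HEFT then schedules tasks in decreasing rank order (here A, B, C, D), placing each on the resource giving the earliest finish time.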
33/36
Workflow Engine Architecture
[Figure: workflow engine architecture: the workflow description is submitted through DRMAA; files are staged via a DFS; execution uses SSH plus a custom protocol]
34/36
Cloud Integration
• Resource management issues
  • When and how many resources to allocate
  • What type of resource to allocate
  • Which availability zone

• Performance issues
  • No queueing, but resource acquisition/release overheads
  • Virtualization overhead
  • Wide-area file transfers

• Cost issues (for public clouds)
• Functionality issues (MapReduce)
35/36
Conclusion
• Koala supports multiple application types:
  • Sequential applications
  • Parallel applications that may need co-allocation
  • Parallel applications that can grow/shrink at runtime
  • Parameter sweep applications
  • Workflows

• Different scheduling policies for each application type
  • No one-size-fits-all policy
36/36
“[email protected]” http://www.st.ewi.tudelft.nl/~nezih/
More Information:
• Koala Project: http://st.ewi.tudelft.nl/koala
• PDS publication database: http://www.pds.twi.tudelft.nl
• DAS-4: http://www.cs.vu.nl/das4/