1/36
ComplexHPC Spring School Day 2: KOALA Tutorial
The KOALA Scheduler
Nezih Yigitbasi
Delft University of Technology
May 10, 2011
2/36
Outline
• Koala Architecture
  • Job Model
  • System Components
• Support for different application types
  • Parallel Applications
  • Parameter Sweep Applications (PSAs)
  • Workflows
3/36
Introduction
• Developed in the DAS system
  • Deployed on the DAS-2 in September 2005
  • Ported to the DAS-3 in April 2007, and to the DAS-4 in April 2011
• Independent of grid middleware such as Globus
• Runs on top of local schedulers

• Objectives:
  • Data and processor co-allocation in grids
  • Support for different application types
  • Specialized job placement policies
4/36
Background (1): DAS-4

[Figure: the DAS-4 sites, connected by SURFnet6 10 Gb/s lambdas: VU (148 CPUs), TU Delft (64), Leiden (32), UvA (32), UvA/MultimediaN (72), Astron (46)]

• 1,600 cores (quad-core, 2.4 GHz CPUs)
• Accelerators
• 180 TB storage
• InfiniBand, Gb Ethernet
• Operational since October 2010
5/36
Background (2): Grid Applications
• Different application types with different characteristics:
  • Parallel applications
  • Parameter sweep applications
  • Workflows
  • Data-intensive applications

• Challenges:
  • Application characteristics and needs
  • Grid infrastructure is highly heterogeneous
  • Grid infrastructure configuration issues
  • Grid resources are highly dynamic
6/36
Koala Job Model

[Figure: a fixed job (job component placement fixed), a non-fixed job (scheduler decides on component placement), and a flexible job of the same total job size (scheduler decides on the split-up into job components and their placement)]
• A job consists of one or more job components
• A job component contains:
  • An executable name
  • The information necessary for scheduling
  • The information necessary for execution
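As an illustrative sketch (not Koala's actual data structures; field names are made up), the job model can be captured like this:

```python
from dataclasses import dataclass, field

@dataclass
class JobComponent:
    """One schedulable unit of a Koala job (illustrative sketch)."""
    executable: str      # the executable name
    num_processors: int  # information needed for scheduling
    input_files: list = field(default_factory=list)  # information needed for execution
    arguments: list = field(default_factory=list)

@dataclass
class Job:
    """A job is one or more components, possibly placed on different clusters."""
    components: list

    def total_size(self) -> int:
        return sum(c.num_processors for c in self.components)

# A two-component job, e.g. a candidate for co-allocation over two clusters:
job = Job([JobComponent("solver", 8), JobComponent("solver", 8)])
print(job.total_size())  # 16
```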
7/36
Koala Architecture (1)
8/36
Koala Architecture (2): A Closer Look

• PIP/NIP: information services
• RLS: replica location service
• CO: co-allocator
• PC: processor claimer
• RM: run monitor
• RL: runners listener
• DM: data mover
• Ri: runners
9/36
Scheduler
• Enforces scheduling policies:
  • Co-allocation policies: Worst Fit, Flexible Cluster Minimization, Communication-Aware, Close-to-Files
  • Malleability management policies: Favour Previously Started Malleable Applications, Equi Grow Shrink
  • Cycle scavenging policies: Equi-All, Equi-PerSite
  • Workflow scheduling policies: Single-Site, Multi-Site
10/36
Runners
• Extend support for different application types:
  • KRunner: Globus runner
  • PRunner: a simplified job runner
  • IRunner: Ibis applications
  • OMRunner: OpenMPI applications
  • MRunner: malleable applications based on the DYNACO framework
  • WRunner: workflows (Directed Acyclic Graphs) and BoTs
11/36
The Runners Framework
12/36
Support for Different Application Types
• Parallel Applications
  • MPI, Ibis, …
  • Co-allocation
  • Malleability

• Parameter Sweep Applications
  • Cycle scavenging
  • Run as low-priority jobs

• Workflows
13/36
Support for Co-Allocation
• What is co-allocation? (a reminder)
• Co-allocation Policies
• Experimental Results
14/36
Co-Allocation
• Simultaneous allocation of resources in multiple sites
  • Higher system utilization
  • Lower queue wait times

• Co-allocated applications might be less efficient due to the relatively slow wide-area communication
  • Parallel applications may have different communication characteristics
15/36
Co-Allocation Policies (1)
• Dictate where the components of a job go

• Policies for non-fixed jobs:
  • Load-aware: Worst Fit (WF), to balance load across clusters
  • Input-file-location-aware: Close-to-Files (CF), to reduce file-transfer times
  • Communication-aware: Cluster Minimization (CM), to reduce the number of wide-area messages
See: H.H. Mohamed and D.H.J. Epema, “An Evaluation of the Close-to-Files Processor and Data Co-Allocation Policy in Multiclusters,” IEEE Cluster 2004.
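A minimal sketch of the load-aware idea behind Worst Fit (assumed logic, not Koala's implementation): place each component on the cluster with the most idle processors, keeping load balanced.

```python
def worst_fit(components, idle):
    """Place each job component on the cluster that currently has the
    most idle processors (illustrative Worst Fit sketch).

    components: list of component sizes (processor counts)
    idle: dict mapping cluster name -> idle processor count
    Returns {component index: cluster}, or None if some component fits nowhere.
    """
    idle = dict(idle)  # work on a copy
    placement = {}
    for i, size in enumerate(components):
        best = max(idle, key=idle.get)  # emptiest cluster
        if idle[best] < size:
            return None  # this component does not fit anywhere
        placement[i] = best
        idle[best] -= size
    return placement

# Three 8-processor components over clusters of 16 processors each:
print(worst_fit([8, 8, 8], {"C1": 16, "C2": 16, "C3": 16}))
# {0: 'C1', 1: 'C2', 2: 'C3'}
```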
16/36
Co-Allocation Policies (2)
• Placement policies for flexible jobs:
  • Queue-time-aware: Flexible Cluster Minimization (FCM), i.e., CM plus reducing queue wait times
  • Communication-aware: Communication-Aware (CA), with decisions based on inter-cluster communication speeds
See: O.O.Sonmez, H.H. Mohamed and D.H.J. Epema, “Communication-aware Job Scheduling Policies for the Koala Grid Scheduler”, IEEE e-Science 2006.
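The flexible-job case can be sketched similarly; this toy version of the cluster-minimization idea splits a flexible job over as few clusters as possible by filling the emptiest clusters first (assumed logic, not Koala's code):

```python
def flexible_cluster_min(total_size, idle):
    """Split a flexible job over as few clusters as possible
    (illustrative sketch of the FCM idea).

    total_size: total number of processors the job needs
    idle: dict cluster -> idle processor count
    Returns {cluster: processors assigned}, or None if the job
    does not fit in the system at all.
    """
    remaining = total_size
    split = {}
    # Greedily use the emptiest clusters first, so fewer clusters
    # (and thus fewer wide-area links) are involved.
    for cluster in sorted(idle, key=idle.get, reverse=True):
        if remaining == 0:
            break
        take = min(idle[cluster], remaining)
        if take > 0:
            split[cluster] = take
            remaining -= take
    return split if remaining == 0 else None

# A flexible job of total size 24 over three 16-processor clusters:
print(flexible_cluster_min(24, {"C1": 16, "C2": 16, "C3": 16}))
# {'C1': 16, 'C2': 8}
```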
17/36
Co-Allocation Policies (3)
[Figure: a non-fixed job with three 8-processor components I, II, III placed by WF on clusters C1 (16), C2 (16), C3 (16), one component per cluster; the same job as a flexible job of total size 24 split by FCM into two components over C1 and C2]
18/36
Experimental Results: Co-Allocation vs. No Co-Allocation

• OpenMPI + DRMAA
• No co-allocation (left) vs. co-allocation (right)
• Workloads of real parallel applications ranging from computation-intensive (Prime) to very communication-intensive (Wave)

[Figure: average job response time (s) for Prime, Poisson, and Wave]

• Takeaway: co-allocation is disadvantageous for communication-intensive applications
19/36
Experimental Results: The Performance of the Policies

• Flexible Cluster Minimization vs. Communication-Aware
• Workloads of communication-intensive applications

[Figure: average job response time (s) for FCM vs. CA, without and with the Delft cluster]

• Takeaway: considering the network metrics improves co-allocation performance
20/36
Support for PSAs in Koala
• Background
• System Design
• Scheduling Policies
• Experimental Results
21/36
Parameter Sweep Application Model
• A single executable that runs for a large set of parameters
  • E.g., Monte Carlo simulations, bioinformatics applications, …
• PSAs may run in multiple clusters simultaneously
• We support OGF’s JSDL 1.0 (XML)
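In miniature, a PSA is one executable crossed with a parameter space; a sketch (the executable name and parameters are made up for illustration):

```python
from itertools import product

def sweep_tasks(executable, parameter_space):
    """Generate one task (command line) per point in the cartesian
    parameter space -- the PSA model in miniature.

    parameter_space: dict parameter name -> list of values
    """
    names = sorted(parameter_space)
    for values in product(*(parameter_space[n] for n in names)):
        args = [f"--{n}={v}" for n, v in zip(names, values)]
        yield [executable] + args

tasks = list(sweep_tasks("./simulate", {"seed": [1, 2, 3], "temp": [300, 350]}))
print(len(tasks))  # 3 * 2 = 6 tasks
```

Each generated task is independent, which is exactly why PSAs can run in several clusters at once.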
22/36
Motivation
• How to run thousands of tasks on the DAS?
• Issues:
  • The 15-minute rule!
  • Observational scheduling
  • Overload

• Run them as cycle scavenging applications!
  • Sets priority classes implicitly
  • No need to watch for empty clusters
23/36
Cycle Scavenging
• The technology behind volunteer computing projects
• Harnessing idle CPU cycles from desktops
• Download a piece of software (a screen saver)
• Receive tasks from a central server
• Execute a task when the computer is idle
• Preempt immediately when the user is active again
24/36
System Requirements
1. Unobtrusiveness: minimal delay for (higher-priority) local and grid jobs
2. Fairness: multiple cycle scavenging applications running concurrently should be assigned comparable CPU time
3. Dynamic resource allocation: cycle scavenging applications have to grow/shrink at runtime
4. Efficiency: as much use of dynamic resources as possible
5. Robustness and fault tolerance: in a long-running, complex system, problems will occur and must be dealt with
25/36
System Interaction
[Figure: the user submits PSA(s) in JDL; the CS-Runner registers with the Scheduler and receives grow/shrink messages; the CS-Runner submits Launchers to the cluster head nodes and deploys, monitors, and preempts tasks on the nodes; the KCM monitors and informs about idle/demanded resources]

• CS policies:
  • Equi-All: grid-wide basis
  • Equi-PerSite: per cluster

• Application-level scheduling:
  • Pull-based approach
  • Shrinkage policy
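The pull-based approach can be sketched as a launcher's main loop (hypothetical interfaces; the real Koala/Launcher protocol differs): an idle node keeps pulling tasks until the sweep is done or the node is reclaimed.

```python
import queue

def launcher_loop(tasks, run_task, preempted):
    """Pull-based scheduling sketch: a launcher on an idle node keeps
    pulling tasks from the runner's queue until the queue is empty or
    the node is reclaimed (preempted) by a higher-priority job.

    tasks: queue.Queue of task descriptions
    run_task: callable executing one task
    preempted: callable returning True when the node must be vacated
    Returns the number of tasks this launcher completed.
    """
    done = 0
    while not preempted():
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            break  # sweep finished
        run_task(task)
        done += 1
    return done

q = queue.Queue()
for i in range(5):
    q.put(i)
completed = launcher_loop(q, run_task=lambda t: None, preempted=lambda: False)
print(completed)  # 5
```

Pulling (rather than pushing) means faster nodes automatically take more tasks, and a preempted node simply stops pulling, which fits the unobtrusiveness requirement above.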
26/36
Cycle Scavenging Policies
1. Equipartition-All

[Figure: three CS users (CS User-1, -2, -3) each receiving an equal share of the grid-wide idle capacity of clusters C1 (12), C2 (12), C3 (24)]
27/36
Cycle Scavenging Policies
2. Equipartition-PerSite

[Figure: three CS users (CS User-1, -2, -3) each receiving an equal share within each of the clusters C1 (12), C2 (12), C3 (24)]
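Both policies can be sketched in a few lines (illustrative; real Koala accounting is more involved): Equi-All divides the grid-wide idle capacity equally among the CS users, while Equi-PerSite performs the division inside each cluster separately, so every user's share is also balanced across clusters.

```python
def equi_all(idle_per_cluster, users):
    """Equipartition over the grid-wide idle node pool (sketch)."""
    total = sum(idle_per_cluster.values())
    share, extra = divmod(total, len(users))
    # The first `extra` users get one node more when it does not divide evenly.
    return {u: share + (1 if i < extra else 0) for i, u in enumerate(users)}

def equi_per_site(idle_per_cluster, users):
    """Equipartition inside every cluster separately (sketch)."""
    shares = {u: 0 for u in users}
    for idle in idle_per_cluster.values():
        share, extra = divmod(idle, len(users))
        for i, u in enumerate(users):
            shares[u] += share + (1 if i < extra else 0)
    return shares

clusters = {"C1": 12, "C2": 12, "C3": 24}
users = ["U1", "U2", "U3"]
print(equi_all(clusters, users))       # {'U1': 16, 'U2': 16, 'U3': 16}
print(equi_per_site(clusters, users))  # {'U1': 16, 'U2': 16, 'U3': 16}
```

With these numbers the totals coincide; the policies differ in where each user's nodes are, and they diverge once remainders or per-cluster restrictions come into play.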
28/36
Experimental Results

• DAS-3; tested on a 32-node cluster
• Using Launchers vs. not, with 60 s dummy tasks

[Figure: makespan [s] vs. number of jobs, with and without Launchers; the gap is job startup overhead plus information delay]

• Equi-All vs. Equi-PerSite
  • 3 CS users submit the same application with the same parameter range
  • Non-CS workloads: WBlock, WBurst

[Figure: number of completed jobs per user under Equi-All and Equi-PerSite, for the WBlock and WBurst workloads]

• Takeaway: Equi-PerSite is fair and superior to Equi-All

See: O. Sonmez, B. Grundeken, H.H. Mohamed, A. Iosup, and D.H.J. Epema, "Scheduling Strategies for Cycle Scavenging in Multicluster Grid Systems," IEEE CCGrid 2009.
29/36
Support for Workflows in Koala
• Applications with dependencies
  • E.g., the Montage workflow
    • Astronomy application to generate mosaics of the sky
    • 4,500 tasks
    • Dependencies are file transfers

• Experience the WRunner in the hands-on session
30/36
Workflow Scheduling Policies (1/3)
1. Round Robin: submits the eligible tasks to the clusters in round-robin order
2. Single Cluster: maps every complete workflow to the least-loaded cluster at its submission
3. All Clusters: submits each eligible task to the least-loaded cluster
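The building block shared by these policies is the set of eligible tasks, i.e., tasks whose predecessors have all finished; a sketch with made-up data structures:

```python
def eligible_tasks(deps, finished):
    """Tasks whose dependencies have all completed and that have not
    run yet: the unit of work each policy dispatches.

    deps: dict task -> set of predecessor tasks
    finished: set of completed tasks
    """
    return sorted(t for t, pre in deps.items()
                  if t not in finished and pre <= finished)

def all_clusters(deps, finished, load):
    """'All Clusters' sketch: send each eligible task to the currently
    least-loaded cluster (load: dict cluster -> running task count)."""
    placement = {}
    for task in eligible_tasks(deps, finished):
        best = min(load, key=load.get)
        placement[task] = best
        load[best] += 1
    return placement

# Diamond workflow: A -> B, A -> C, (B, C) -> D
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(eligible_tasks(deps, finished={"A"}))           # ['B', 'C']
print(all_clusters(deps, {"A"}, {"C1": 0, "C2": 0}))  # {'B': 'C1', 'C': 'C2'}
```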
31/36
Workflow Scheduling Policies (2/3)

4. All Clusters File-Aware: submits each eligible task to the cluster that minimizes the transfer costs of the files on which it depends

5. Coarsening*: iteratively reduces the size of a graph by collapsing groups of nodes and their internal edges
   • We use the Heavy Edge Matching* technique to group tasks that are connected by heavy edges

*G. Karypis and V. Kumar, "Multilevel graph partitioning schemes," Int. Conf. on Parallel Processing, pages 113–122, 1995.
32/36
Workflow Scheduling Policies (3/3)

6. Cluster Minimization: submits as many eligible tasks as possible to a cluster before considering the next cluster

7. HEFT* (Heterogeneous Earliest-Finish-Time): selects the task with the highest upward-rank value and assigns it to the cluster that ensures its earliest completion time

*H. Topcuoglu, S. Hariri, and M. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing," IEEE TPDS, 13(3):260–274, 2002.

We will use Single Cluster and All Clusters in the hands-on session
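HEFT's upward rank can be sketched as a recursion over the task graph: a task's rank is its own average execution cost plus the most expensive (communication cost + rank) path through its successors. The weights below are made up for illustration:

```python
from functools import lru_cache

def upward_ranks(succ, w, c):
    """Compute HEFT upward ranks for every task (sketch).

    succ: dict task -> list of successor tasks
    w: dict task -> average computation cost
    c: dict (task, successor) -> average communication cost
    """
    @lru_cache(maxsize=None)
    def rank(t):
        if not succ[t]:
            return w[t]  # exit task: only its own cost
        return w[t] + max(c[(t, s)] + rank(s) for s in succ[t])
    return {t: rank(t) for t in succ}

# Diamond graph: A -> B, A -> C, (B, C) -> D
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
w = {"A": 2, "B": 3, "C": 1, "D": 2}
c = {("A", "B"): 1, ("A", "C"): 4, ("B", "D"): 1, ("C", "D"): 1}
ranks = upward_ranks(succ, w, c)
print(ranks)  # {'A': 10, 'B': 6, 'C': 4, 'D': 2}
```

HEFT then schedules tasks in decreasing rank order (here A, B, C, D), placing each on the resource giving the earliest finish time.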
33/36
Workflow Engine Architecture
[Figure: workflow engine architecture: the workflow description is submitted through DRMAA; files are staged via a DFS; execution uses SSH plus a custom protocol]
34/36
Cloud Integration
• Resource management issues
  • When and how many resources to allocate
  • What type of resource to allocate
  • Which availability zone

• Performance issues
  • No queueing, but resource acquisition/release overheads
  • Virtualization overhead
  • Wide-area file transfers

• Cost issues (for public clouds)
• Functionality issues (MapReduce)
35/36
Conclusion
• Koala supports multiple application types:
  • Sequential applications
  • Parallel applications that may need co-allocation
  • Parallel applications that can grow/shrink at runtime
  • Parameter sweep applications
  • Workflows

• Different scheduling policies for each application type
  • No one-size-fits-all policy
36/36
“[email protected]” http://www.st.ewi.tudelft.nl/~nezih/
More Information:
• Koala Project: http://st.ewi.tudelft.nl/koala
• PDS publication database: http://www.pds.twi.tudelft.nl
• DAS-4: http://www.cs.vu.nl/das4/