Scheduling in distributed systems - Andrii Vozniuk

27
Scheduling In Distributed Systems Andrii Vozniuk EPFL July 4, 2012 Candidacy exam

description

My EPFL candidacy exam presentation: http://wiki.epfl.ch/edicpublic/documents/Candidacy%20exam/vozniuk_andrii_candidacy_writeup.pdf Here I present how schedulers work in three distributed data processing systems and their possible optimizations. I consider Gamma - a parallel database, MapReduce - a data-intensive system and Condor - a compute-intensive system. This talk is based on the following papers: 1) Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt 2) Improving MapReduce performance in heterogeneous environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica 3) Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt

Transcript of Scheduling in distributed systems - Andrii Vozniuk

Page 1: Scheduling in distributed systems - Andrii Vozniuk

Scheduling In Distributed Systems

Andrii Vozniuk EPFL July 4, 2012

Candidacy exam

Page 2: Scheduling in distributed systems - Andrii Vozniuk

Data explosion Processing gets more complicated

Big Data

2

Resources of many computers should be used

Generates: 40 TB/dayStores: 20 PB/year

Generates: 25 TB/dayStores: 10 PB/year

Page 3: Scheduling in distributed systems - Andrii Vozniuk

Typical Data Processing Pipeline

3

Log data

Clean data

Query data

Analyze data

Sensor data

No one-size-fits-all system currently exists

ETL-like batchprocessing

Efficient queryexecution

Using resources ofmany organizations

Particle found!

User model

Page 4: Scheduling in distributed systems - Andrii Vozniuk

Outline

4

Condor - compute-intensive system

Gamma - parallel database

MapReduce - data-intensive system

Ɣ

Conclusions

Future Research

Page 5: Scheduling in distributed systems - Andrii Vozniuk

Scheduling Policy: setting an ordering of tasks Assigning resources to tasks

Scheduling In Distributed Systems

5

Scheduling is challenging in distributed systems

How to match resources and tasks?

tasktask

task task

Page 6: Scheduling in distributed systems - Andrii Vozniuk

Matching Tasks With Resources

6

How scheduling is influenced by data and execution models?

Perspectives Data model Execution model

System/Perspective

Data model Execution model

Gamma Relational Multioperator

MapReduce Unconstrained MapReduce

Condor Unconstrained Unconstrained

Page 7: Scheduling in distributed systems - Andrii Vozniuk

Gamma Pioneering parallel database Data model: constrained

Relational data model Relations are horizontally partitioned

Execution model: constrained Multioperator queries Operators employ hash-based algorithms

7

Ɣ

Page 8: Scheduling in distributed systems - Andrii Vozniuk

Gamma: Scheduler

8

Scheduling is done at the operator level

Host Machine

Query Manager Catalog

Gamma Database

a-m n-z

OperatorProcess

Operator Process

SchedulerProcess

Optimizes queryCompiles plan

SELECT r FROM RWHERE r < ‘k’

Schedulesoperators

Execution onrelevant nodes

Node 1 Node 2

Ɣ

query

Page 9: Scheduling in distributed systems - Andrii Vozniuk

Gamma: Batch Scheduling

9

Exploit sharing by scheduling in a batch Example of selection sharing

Batch scheduling trades latency for throughput

σ1

A

Shared scanσ2

A

σ1

A

σ2

Ɣ

Reads of A can be shared applying predicates in turn Shared relation A is scanned only once

Page 10: Scheduling in distributed systems - Andrii Vozniuk

Gamma: Batch Scheduling Joins

10

Sharing among joins reduces total execution time

Several hash-joins in a batch of queries Hash table for the same relation can be shared Example assumes 100% selectivity of σ

σ σ

A Β

σ σ

A C

σ σ

B A

σ

C

Shared hash-table for A

Ɣ

Sharing reduces I/O and memory usage

Page 11: Scheduling in distributed systems - Andrii Vozniuk

Limitations Of Gamma Gamma offers

Efficient query execution Sharing in a batch of queries

Gamma operates on structured data Gamma is not suitable for

Unstructured data processing ETL type of workload Running on large scale

11

A different system for ETL processing is needed

Ɣ

Page 12: Scheduling in distributed systems - Andrii Vozniuk

MapReduce System for data-intensive applications Execution model: constrained

Job is a set of map and reduce tasks Tasks are independent

Data model: unconstrained Arbitrary data format Files are partitioned into chunks Each chunk is replicated several times

12

Page 13: Scheduling in distributed systems - Andrii Vozniuk

MapReduce: Scheduling

13

4 Map tasks

2 Reduce task

MapReduce job

Map1

Map2

Map3

Map4

Reduce

Reduce

Example:

Temp1Result

1

Chunk1

Temp2

Chunk2

Temp4Temp3

Chunk3 Result

2

Chunk4

Fine grain scheduling improves fault tolerance and elasticity

Tasks are scheduled close to data Execution is scalable and fault-tolerant Execution is elastic

Page 14: Scheduling in distributed systems - Andrii Vozniuk

MapReduce: Speculative Execution Nodes may become slow Speculative execution minimizes job’s response time Launch if progress is 20% less than average

14

Speculative execution works well in homogeneous environment

Normal node

Temporary slow node

backup

straggler

Page 15: Scheduling in distributed systems - Andrii Vozniuk

Emerging Heterogeneous Infrastructures Replacement of failed components Extending existing cluster with new machines Virtualized data centers of cloud providers

CPU and RAM are isolated Contention for disk and network

15

In many real-life cases the infrastructure is heterogeneous

1 2 3 4 5 6 70

10203040506070

VMs on Physical Host

IO P

erfo

rman

ce p

er

VM (M

B/s)

Page 16: Scheduling in distributed systems - Andrii Vozniuk

MapReduce: Heterogeneous Cluster

Performance degrades on heterogeneous cluster Slow nodes are wasted Backup tasks on slow nodes All straggling tasks are treated equally Thrashing due to excessive speculative execution

16

Speculative execution should be improved for heterogeneous cluster

Fast node

Slow node

Page 17: Scheduling in distributed systems - Andrii Vozniuk

MapReduce: LATE Scheduler Idea: back up the task with the largest estimated finish

time (Longest Approximate Time to End)

Thresholds Limit the number of backup tasks Launch backup tasks on fast nodes Backup only sufficiently slow tasks

17

progress score

execution timeprogress rate =

1 – progress score

progress rateestimated time left =

LATE looks forward to prioritize tasks to speculate

Page 18: Scheduling in distributed systems - Andrii Vozniuk

MapReduce: LATE Example Back up the task with Longest Approximate Time to End

18

2 min

Time (min)

Progress = 5.3%

Estimated time left:(1-0.05) / (1/1.9) = 1.8

Progress = 66%

LATE correctly identifies task which hurts the response time the most

3x slower

1.9x slower

1

2

3

1 task/min

Estimated time left:(1-0.66) / (1/3) = 1

improvement

Page 19: Scheduling in distributed systems - Andrii Vozniuk

Limitations Of MapReduce

19

MapReduce offers High scalability Good fault tolerance Handling of unstructured data

MapReduce is not suitable for Running on multi organization infrastructure Harvesting idle resources in organization

A different system for multi organization infrastructure is needed

Page 20: Scheduling in distributed systems - Andrii Vozniuk

Compute-intensive system harvesting idle resources Data model: arbitrary Execution model: arbitrary

jobjob

Condor

20

job

job

Increase resources utilization by scheduling jobs on idle machines

How to increase utilization and respect

the owners?

Page 21: Scheduling in distributed systems - Andrii Vozniuk

jobjob

Condor Scheduler: Centralized?

21

Scheduler

Efficient but not reliable, possible bottleneck

job

job

Page 22: Scheduling in distributed systems - Andrii Vozniuk

jobjob

Condor Scheduler: Distributed?

22

Scheduler

Reliable but inefficient

Scheduler

Scheduler

Scheduler

job

job

Page 23: Scheduling in distributed systems - Andrii Vozniuk

jobjob

Condor Scheduler: Hybrid!

23

Hybrid approach has the best of both worlds

Scheduler

Scheduler

Matchmaker

Scheduler

Information about nodesInformation about tasks

job

job

1

1

12

3

3

4

Page 24: Scheduling in distributed systems - Andrii Vozniuk

Job Description

[MyType=“Job”TargetType = “Machine“Department=“CompSci“Requirements = (other.OpSys==LINUX && other.Disk > 10000000)Rank=Memory]

Machine Description

[MyType=“Machine“TargetType=“Job“Machine=“nostos.cs.wisc.edu“OpSys=“LINUX“Disk=3076077Requirement = (LoadAvg <= 0.3) && (KeyboardIdle > (15*60))Rank = other.Department==self.Department]

ClassAds: Describing Jobs and Resources

24

Matchmaker is suitable for heterogeneous shared clusters

Requirements should be satisfied Candidate with the highest rank is returned

Page 25: Scheduling in distributed systems - Andrii Vozniuk

Scheduling done at different levels Gamma: operator level scheduling enables sharing MR and Condor: arbitrary code => sharing is hard Condor: matchmaking gives control on job placement

Hybrid approaches are promising for big data processing

Scheduling in heterogeneous deployments is challenging

Conclusions

25

Page 26: Scheduling in distributed systems - Andrii Vozniuk

26

Thank you for your attention!

Feedback & Question?

[email protected]

Page 27: Scheduling in distributed systems - Andrii Vozniuk

References Matchmaking: Distributed Resource Management for Hig

h Throughput Computing by Rajesh Raman, Miron Livny and Marvin Solomon.

Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt.

Improving MapReduce performance in heterogeneous environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica

Slides 14 and 18 exploit presentation ideas from the LATE slides for OSDI 2008 by Matei Zaharia

27