Decentralized Dynamic Scheduling across Heterogeneous Multi-core Desktop Grids
Jaehwan Lee, Pete Keleher, Alan Sussman
Department of Computer Science University of Maryland
Multi-core is not enough
• Multi-core CPUs are the current trend in desktop computing
• It is not easy to exploit multiple cores in a single machine for high-throughput computing ("Multicore Is Bad News for Supercomputers", S. Moore, IEEE Spectrum, 2008)
• We have proposed a decentralized solution for initial job placement in multi-core grids, but...

How can we increase performance for multi-core grids via dynamic scheduling?
Motivation and Challenges
• Why is dynamic scheduling needed?
  - Load information from neighbors may be stale
  - Unpredictable job completion times can change the current load situation
  - Probabilistic initial job assignment can be improved
• Challenges for decentralized dynamic scheduling in multi-core grids
  - Multiple resource requirements
  - Fully decentralized algorithm
  - No job starvation
Our Contribution
• New decentralized dynamic scheduling schemes for multi-core grids
  - Local scheduling
  - Internode scheduling
  - Queue balancing
• Experimental results via extensive simulation
  - Performance competitive with an online centralized scheduler
Outline
• Background
• Related work
• Our approach
• Experimental Results
• Conclusion & Future Work
Overall System Architecture
• P2P grids
[Figure: Overall P2P grid architecture. A client inserts Job J at an InjectionNode, which assigns the job a GUID and routes it through the peer-to-peer network (a CAN DHT) to its OwnerNode. The OwnerNode initiates matchmaking to find a suitable RunNode and sends Job J to that node's FIFO job queue; nodes exchange periodic heartbeats.]
Matchmaking Mechanism in CAN
[Figure: Matchmaking in CAN. Nodes A through I are mapped into a 2-D CAN whose dimensions are resource capacities (CPU and memory). A client inserts Job J, which requires CPU >= C_J and Memory >= M_J; the job is routed to its owner node at coordinates (C_J, M_J), and the owner pushes Job J to a run node whose capacities satisfy both constraints. The run node places J in its FIFO queue and reports status via heartbeats.]
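A minimal sketch of this matchmaking predicate (the slides give no code; the dictionary representation and names below are illustrative):

```python
def can_run(node_caps: dict, job_reqs: dict) -> bool:
    """A node can run a job iff it meets or exceeds every resource the job
    specifies (e.g., CPU >= C_J and Memory >= M_J); resources the job
    omits are "don't care"."""
    return all(node_caps.get(res, 0) >= need for res, need in job_reqs.items())

# Example: Job J needs CPU >= 1.5 GHz and memory >= 1 GB; disk is unspecified.
node = {"cpu_ghz": 2.0, "mem_gb": 2.0, "disk_gb": 100.0}
job_j = {"cpu_ghz": 1.5, "mem_gb": 1.0}
print(can_run(node, job_j))  # True
```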
Approach for Multi-core nodes (Lee:IPDPS'10)
• Two logical nodes represent one multi-core node (a sketch follows the figure below)
  - Max-node: maximum, static status
  - Residue-node: current, dynamic status
• Resource management & matchmaking
  - Dual-CAN: the two kinds of logical nodes form two separate CANs
  - Balloon Model: a Residue-node is represented as a simple structure attached to its Max-node in the existing CAN
[Figure: Dual-CAN vs. Balloon Model, for two machines A (1.5 GHz CPU, 3 GB memory) and B (2 GHz, 2 GB, queue length 1, running Job 1). Dual-CAN: the Max-nodes A and B form the primary CAN, while the free Residue-node B' (2 GHz, 1 GB free) lives in a secondary CAN. Balloon Model: the same Residue-node B' is attached as a "balloon" to Max-node B within the single existing CAN.]
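A minimal sketch of the two logical nodes, under simplifying assumptions (the core count for node B and the one-core-per-running-job accounting in residue_of are illustrative, not from the slides):

```python
from dataclasses import dataclass

@dataclass
class MaxNode:
    """Static logical node: the machine's maximum capability."""
    cpu_ghz: float
    mem_gb: float
    cores: int

@dataclass
class ResidueNode:
    """Dynamic logical node: resources currently free on the same machine."""
    cpu_ghz: float
    mem_gb: float
    free_cores: int

def residue_of(node: MaxNode, running_jobs) -> ResidueNode:
    """Derive the Residue-node from the Max-node and its running jobs
    (simplified accounting: one core plus some memory per running job)."""
    free_cores = node.cores - len(running_jobs)
    free_mem = node.mem_gb - sum(j["mem_gb"] for j in running_jobs)
    return ResidueNode(cpu_ghz=node.cpu_ghz if free_cores > 0 else 0.0,
                       mem_gb=max(free_mem, 0.0),
                       free_cores=max(free_cores, 0))

# Node B from the figure: 2 GHz, 2 GB, running Job 1 (assumed 1 GB)
# -> residue B' has 2 GHz and 1 GB free.
b = MaxNode(cpu_ghz=2.0, mem_gb=2.0, cores=2)
print(residue_of(b, [{"mem_gb": 1.0}]))
```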
Outline
• Background
• Related work
• Our approach
• Experimental Results
• Conclusion & Future Work
Backfilling
• Concept: move a later job forward to fill idle resources without delaying higher-priority waiting jobs (see figure below)
• Assumptions
  - Job running times must be given
  - Conservative backfilling: do NOT delay any other waiting job
  - EASY backfilling: do NOT delay the first job in the queue
  - Inaccurate job running time estimates directly degrade overall performance
[Figure: Backfilling concept, shown on a CPUs-vs-time chart with Jobs 1-4: a later job is started in an idle slot ahead of its queue position without delaying the earlier jobs.]
Previous approaches under multiple resource requirements
• Backfilling under multiple resource requirements (Leinberger:SC'99)
  - Backfilling within a single machine
  - Heuristic approaches: Backfill Lowest (BL), Backfill Balanced (BB)
  - Its assumption does not hold in our system: job running times are not given
• Job migration to balance K resources between nodes (Leinberger:HCW'00)
  - Reduces local load imbalance by exchanging jobs, but does not consider overall system load
  - No backfilling scheme
  - Assumes a near-homogeneous environment
Outline
• Background
• Related work
• Our approach
• Experimental Results
• Conclusion & Future Work
Dynamic Scheduling
• After initial job assignment, but before the job starts running, periodic dynamic scheduling is invoked
• Costs of dynamic scheduling
  - Communication cost: none for local scheduling; minimal for internode scheduling & queue balancing (a job profile is small)
  - CPU cost: none; scheduling is non-preemptive, so once a job starts running it is never stopped by dynamic scheduling
Local Scheduling
• An extension of backfilling under multiple resource requirements
• Backfilling Counter (BC)
  - Every job profile keeps this value, initialized to 0
  - It counts the number of other jobs that have bypassed this job in the waiting queue
  - Only a job whose BC is equal to or greater than the BC of the job at the head of the queue can be backfilled (see the sketch after the figure below)
  - This prevents job starvation and unbounded waiting times
[Figure: Local backfilling example. While job JR runs, the queue holds J1 (head) through J4, each with BC = 0; J3 fits the residual resources and is backfilled, incrementing the BC of the jobs it bypasses.]
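A minimal sketch of the BC rule, assuming a simple Job record (the names and the fits helper below are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    reqs: dict   # per-resource requirements; omitted resources = don't care
    bc: int = 0  # Backfilling Counter, initialized to 0

def fits(job: Job, free: dict) -> bool:
    """Multi-resource check against the node's residual resources."""
    return all(free.get(r, 0) >= v for r, v in job.reqs.items())

def backfill_candidates(queue: list, free: dict) -> list:
    """BC rule: only a job whose BC is >= the BC of the head-of-queue job
    may bypass it, so a frequently bypassed head cannot starve."""
    if not queue:
        return []
    head_bc = queue[0].bc
    return [j for j in queue[1:] if j.bc >= head_bc and fits(j, free)]

def start_backfilled(queue: list, job: Job) -> Job:
    """Dequeue `job` out of order; every job it bypasses has its BC incremented."""
    idx = queue.index(job)
    for bypassed in queue[:idx]:
        bypassed.bc += 1
    return queue.pop(idx)
```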
Which job should be backfilled?
• If multiple jobs in the queue can be backfilled, choose the best one for better utilization
  - Backfill Balanced (BB) algorithm (Leinberger:SC'99)
  - Choose the job with the minimum objective function (= BM × FM); a sketch follows
• Balance Measure (BM): computed from the normalized utilization of each resource and job j's normalized requirement for each resource; minimizing BM evens out utilization across the multiple resources
• Fullness Measure (FM): minimizing FM maximizes average utilization
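A sketch of BB selection. The slide states only the intent of BM and FM; the exact forms below (max over mean utilization for BM, 1 minus mean utilization for FM) are assumptions standing in for the definitions in Leinberger:SC'99:

```python
def bm(util_after: dict) -> float:
    """Balance Measure: penalizes uneven utilization across resources.
    Assumed form: max utilization divided by mean utilization."""
    vals = list(util_after.values())
    mean = sum(vals) / len(vals)
    return max(vals) / mean if mean > 0 else 0.0

def fm(util_after: dict) -> float:
    """Fullness Measure: BB minimizes BM x FM, so FM is assumed to
    fall as average utilization rises (here: 1 - mean utilization)."""
    vals = list(util_after.values())
    return 1.0 - sum(vals) / len(vals)

def choose_backfill_job(candidates, util: dict, caps: dict):
    """Pick the eligible job minimizing BM x FM of the post-start utilization."""
    def objective(job):
        after = {r: util.get(r, 0.0) + job.reqs.get(r, 0.0) / caps[r]
                 for r in caps}
        return bm(after) * fm(after)
    return min(candidates, key=objective, default=None)
```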
Internode Scheduling
• An extension of local scheduling across nodes
• Node Backfilling Counter (NBC)
  - The BC of the job at the head of the node's waiting queue
  - Only jobs whose BC is equal to or greater than the NBC of the target node can be migrated there (sketched after the figure below)
  - This prevents job starvation and unbounded waiting times
[Figure: Internode backfilling example. Three neighboring nodes run jobs JR, JA, and JB; each holds queued jobs J1-J3 with their BC values, and each advertises its NBC (here 2 and 0), the BC of its head-of-queue job. Job J4 may migrate only to a node whose NBC does not exceed J4's own BC.]
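A minimal sketch of the NBC rule, reusing the Job record from the local-scheduling sketch above:

```python
def nbc(queue: list) -> int:
    """Node Backfilling Counter: the BC of the head-of-queue job (0 if empty)."""
    return queue[0].bc if queue else 0

def can_migrate(job, target_queue: list) -> bool:
    """Internode rule: a job may move to a target node only if its own BC is
    >= the target's NBC, so migration never starves the target's head job."""
    return job.bc >= nbc(target_queue)
```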
Internode Scheduling – PUSH vs. PULL
• PUSH
  - The job sender initiates the process
  - The sender tries to match every job in its queue against the residual free resources of its neighbors
  - If a job can be sent to multiple nodes, the target is picked via an objective function that prefers the node with the fastest CPU
• PULL (a sketch of one round follows this list)
  - The job receiver initiates the process
  - The receiver sends a PULL-Request message to the potential sender with the maximum queue length
  - The potential sender checks whether it has a job that can be backfilled and that satisfies the BC condition
  - If multiple jobs qualify, it chooses the job with the minimum objective function (= BM × FM)
  - If no job can be found, it sends a PULL-Reject message, and the receiver continues looking for another potential sender among its neighbors
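A sketch of one PULL round, reusing fits, can_migrate, and choose_backfill_job from the earlier sketches; the node attributes (queue, free, util, caps) and the in-process "messages" are stand-ins for the real protocol:

```python
def pull_round(receiver, neighbors):
    """One PULL round: the receiver asks potential senders, longest queue
    first; a PULL-Reject (modeled as an empty result) moves it along."""
    for sender in sorted(neighbors, key=lambda n: len(n.queue), reverse=True):
        # PULL-Request: the sender looks for a job that fits the receiver's
        # residual resources and satisfies the BC/NBC condition.
        eligible = [j for j in sender.queue
                    if fits(j, receiver.free) and can_migrate(j, receiver.queue)]
        if eligible:
            # Among multiple candidates, take the minimum BM x FM objective.
            job = choose_backfill_job(eligible, receiver.util, receiver.caps)
            sender.queue.remove(job)
            receiver.queue.append(job)
            return job
        # PULL-Reject: try the next potential sender among the neighbors.
    return None
```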
Queue Balancing
• Local scheduling & internode scheduling find jobs that can start running immediately, using current residual resources
• Queue balancing instead migrates jobs proactively to balance queue load; a migrated job does not have to start immediately
• (Normalized) load of a node (Leinberger:HCW'00)
  - "Load" under multiple resources: for each resource, sum the requirements of all jobs in the queue
  - The load of a node is the maximum of these per-resource sums
• Both PUSH & PULL schemes are available (see the sketch below)
  - Minimize the total local load (TLL): the sum of the loads of all neighbors
  - Minimize the maximum local load (MLL) among neighbors
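A sketch of the load definition and an MLL-style push decision; using a single shared capacity dictionary for normalization is a simplification for a heterogeneous grid:

```python
def node_load(queue: list, caps: dict) -> float:
    """Normalized load (Leinberger:HCW'00): per resource, sum the queued
    jobs' requirements (normalized by capacity); load is the maximum sum."""
    if not caps:
        return 0.0
    return max(sum(j.reqs.get(r, 0.0) for j in queue) / caps[r] for r in caps)

def push_target(job, neighbors, caps):
    """MLL variant of queue balancing: push `job` to the neighbor that leaves
    the smallest maximum load among neighbors (TLL would minimize the sum)."""
    def mll_after(target):
        return max(node_load(n.queue + ([job] if n is target else []), caps)
                   for n in neighbors)
    return min(neighbors, key=mll_after)
```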
Outline
• Background
• Related work
• Our approach
• Experimental Results
• Conclusion & Future Work
Experimental Setup
• Event-driven simulations over a set of nodes and events (a workload sketch follows this list)
  - 1000 initial nodes and 5000 job submissions
  - Jobs are submitted as a Poisson process (with various mean interarrival times T)
  - A node has 1, 2, 4, or 8 cores
  - Job run times follow a uniform distribution (30-90 minutes)
• Node capabilities (and job requirements)
  - CPU, memory, disk, and the number of cores
  - A job's requirement for a resource can be omitted ("don't care")
  - Job Constraint Ratio: the probability that each resource type is specified for a job
• Steady-state experiments
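A sketch of how such a workload could be generated (parameter names are illustrative; the experiments' exact T values are not given here):

```python
import random

def generate_workload(num_jobs=5000, mean_interval=60.0, seed=0):
    """Poisson arrivals (exponential interarrival times with mean T =
    `mean_interval` seconds) and run times uniform in 30-90 minutes."""
    rng = random.Random(seed)
    t, jobs = 0.0, []
    for _ in range(num_jobs):
        t += rng.expovariate(1.0 / mean_interval)  # Poisson arrival process
        run_time = rng.uniform(30 * 60, 90 * 60)   # seconds
        jobs.append((t, run_time))
    return jobs
```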
Experimental Setup
• Performance metric: job turn-around time
[Figure: Turn-around time decomposition. From injection into the system, a job travels to its owner and then to its run node (matchmaking cost), waits in the queue until it starts execution (wait time), and runs until it finishes execution (running time).]
Comparison Models
• Centralized Scheduler (CENT)
  - An online, global scheduling mechanism with a single waiting queue
  - Not feasible in a complete P2P implementation; used as a comparison point
• Combinations of our approach
  - Vanilla: no dynamic scheduling (initial job assignment only)
  - L: local scheduling only
  - LI: local scheduling + internode scheduling
  - LIQ: all three methods (local scheduling + internode scheduling + queue balancing)
  - LI(Q)-PUSH/PULL: LI and LIQ each with PUSH or PULL options
[Figure: CENT. All jobs (e.g., Job J: CPU 2.0 GHz, Mem 500 MB, Disk 1 GB) are submitted to a single centralized scheduler with one global queue.]
Performance varying system loads
• Performance ordering, best to worst: LIQ-PULL > LI-PULL > LIQ-PUSH > LI-PUSH > L >= Vanilla
• Internode scheduling plays the major role
• PULL is better than PUSH
  - In an overloaded system, PULL spreads information better, owing to its aggressive search for a job (Demers:PODC'87)
• Local scheduling alone cannot guarantee better performance than Vanilla
  - Our Backfilling Counter does not ensure that other waiting jobs are never delayed (unlike conservative backfilling)
Overheads
• PULL has higher cost than PUSH
  - Active search involves many trials and errors
• The other schemes have overhead similar to Vanilla
  - Our schemes do not incur significant additional overhead
Performance varying Job Constraint Ratio
• LIQ-PULL is best
• LIQ == LI
• LIQ-PULL is competitive with CENT
• At an 80% Job Constraint Ratio, LIQ-PULL's performance degrades relative to the other ratios
  - Reason: it becomes difficult to find a capable neighbor for job migration because jobs are more highly constrained
Evaluation Summary
• Performance
  - LIQ-PULL is competitive with CENT
  - Internode scheduling has the major impact on performance
  - PULL is better than PUSH (more aggressive search)
  - Performance is consistently good across varying system loads and job constraint ratios
• Overheads
  - PULL > PUSH (more aggressive search)
  - Competitive with Vanilla
Conclusion and Future Work
• New decentralized dynamic scheduling for multi-core P2P grids
  - Local scheduling: backfilling within a single node
  - Internode scheduling: backfilling across neighbors
  - Queue balancing: proactive queue (load) adjustment among neighbors
• Evaluation via simulation
  - LIQ-PULL is the best, with performance competitive with CENT
  - PULL is better than PUSH, but PULL has higher cost
  - Low overhead (similar to Vanilla)
• Future work
  - Real deployment (in cooperation with the Astronomy Department)
  - Decentralized resource management for heterogeneous asymmetric multiprocessors