Decentralized Dynamic Scheduling across Heterogeneous Multi-core Desktop Grids
Jaehwan Lee, Pete Keleher, Alan Sussman
Department of Computer Science University of Maryland
Multi-core is not enough
• Multi-core CPUs are the current trend in desktop computing
• It is not easy to exploit multiple cores in a single machine for high-throughput computing ("Multicore Is Bad News for Supercomputers", S. Moore, IEEE Spectrum, 2008)
• We have proposed a decentralized solution for initial job placement in multi-core grids, but...

How can we increase performance for multi-core grids via dynamic scheduling?
Motivation and Challenges
• Why is dynamic scheduling needed?
  - Load information from neighbors may be stale
  - Unpredictable job completion times can change the current load situation
  - Probabilistic initial job assignment can be improved
• Challenges for decentralized dynamic scheduling in multi-core grids
  - Multiple resource requirements
  - Fully decentralized algorithm
  - No job starvation
Our Contribution
• New decentralized dynamic scheduling schemes for multi-core grids
  - Local scheduling
  - Internode scheduling
  - Queue balancing
• Experimental results via extensive simulation
  - Performance competitive with an online centralized scheduler
Outline
• Background
• Related work
• Our approach
• Experimental Results
• Conclusion & Future Work
Overall System Architecture
• P2P grids
[Figure: Overall P2P grid architecture. A client inserts Job J at an InjectionNode, which assigns the job a GUID and routes it through the peer-to-peer network (a CAN DHT) to its OwnerNode. The OwnerNode initiates matchmaking to find a suitable RunNode and sends Job J to that node's FIFO job queue; nodes exchange periodic heartbeats.]
Matchmaking Mechanism in CAN
[Figure: Matchmaking in CAN. Nodes A through I are mapped into a 2-D CAN whose dimensions are resource capacities (CPU and memory). A client inserts Job J, which requires CPU >= C_J and Memory >= M_J; the job is routed to its owner node at coordinates (C_J, M_J), and the owner pushes Job J to a run node whose capacities satisfy both constraints. The run node places J in its FIFO queue and reports status via heartbeats.]
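A minimal sketch of this matchmaking predicate (the slides give no code; the dictionary representation and names below are illustrative):

```python
def can_run(node_caps: dict, job_reqs: dict) -> bool:
    """A node can run a job iff it meets or exceeds every resource the job
    specifies (e.g., CPU >= C_J and Memory >= M_J); resources the job
    omits are "don't care"."""
    return all(node_caps.get(res, 0) >= need for res, need in job_reqs.items())

# Example: Job J needs CPU >= 1.5 GHz and memory >= 1 GB; disk is unspecified.
node = {"cpu_ghz": 2.0, "mem_gb": 2.0, "disk_gb": 100.0}
job_j = {"cpu_ghz": 1.5, "mem_gb": 1.0}
print(can_run(node, job_j))  # True
```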
Approach for Multi-core nodes (Lee:IPDPS'10)
• Two logical nodes represent one multi-core node (a sketch follows the figure below)
  - Max-node: maximum, static status
  - Residue-node: current, dynamic status
• Resource management & matchmaking
  - Dual-CAN: the two kinds of logical nodes form two separate CANs
  - Balloon Model: a Residue-node is represented as a simple structure attached to its Max-node in the existing CAN
[Figure: Dual-CAN vs. Balloon Model, for two machines A (1.5 GHz CPU, 3 GB memory) and B (2 GHz, 2 GB, queue length 1, running Job 1). Dual-CAN: the Max-nodes A and B form the primary CAN, while the free Residue-node B' (2 GHz, 1 GB free) lives in a secondary CAN. Balloon Model: the same Residue-node B' is attached as a "balloon" to Max-node B within the single existing CAN.]
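A minimal sketch of the two logical nodes, under simplifying assumptions (the core count for node B and the one-core-per-running-job accounting in residue_of are illustrative, not from the slides):

```python
from dataclasses import dataclass

@dataclass
class MaxNode:
    """Static logical node: the machine's maximum capability."""
    cpu_ghz: float
    mem_gb: float
    cores: int

@dataclass
class ResidueNode:
    """Dynamic logical node: resources currently free on the same machine."""
    cpu_ghz: float
    mem_gb: float
    free_cores: int

def residue_of(node: MaxNode, running_jobs) -> ResidueNode:
    """Derive the Residue-node from the Max-node and its running jobs
    (simplified accounting: one core plus some memory per running job)."""
    free_cores = node.cores - len(running_jobs)
    free_mem = node.mem_gb - sum(j["mem_gb"] for j in running_jobs)
    return ResidueNode(cpu_ghz=node.cpu_ghz if free_cores > 0 else 0.0,
                       mem_gb=max(free_mem, 0.0),
                       free_cores=max(free_cores, 0))

# Node B from the figure: 2 GHz, 2 GB, running Job 1 (assumed 1 GB)
# -> residue B' has 2 GHz and 1 GB free.
b = MaxNode(cpu_ghz=2.0, mem_gb=2.0, cores=2)
print(residue_of(b, [{"mem_gb": 1.0}]))
```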
Outline
• Background
• Related work
• Our approach
• Experimental Results
• Conclusion & Future Work
Backfilling
• Concept: move a later job forward to fill idle resources without delaying higher-priority waiting jobs (see figure below)
• Assumptions
  - Job running times must be given
  - Conservative backfilling: do NOT delay any other waiting job
  - EASY backfilling: do NOT delay the first job in the queue
  - Inaccurate job running time estimates directly degrade overall performance
[Figure: Backfilling concept, shown on a CPUs-vs-time chart with Jobs 1-4: a later job is started in an idle slot ahead of its queue position without delaying the earlier jobs.]
Previous approaches under multiple resource requirements
• Backfilling under multiple resource requirements (Leinberger:SC'99)
  - Backfilling within a single machine
  - Heuristic approaches: Backfill Lowest (BL), Backfill Balanced (BB)
  - Its assumption does not hold in our system: job running times are not given
• Job migration to balance K resources between nodes (Leinberger:HCW'00)
  - Reduces local load imbalance by exchanging jobs, but does not consider overall system load
  - No backfilling scheme
  - Assumes a near-homogeneous environment
Outline
• Background
• Related work
• Our approach
• Experimental Results
• Conclusion & Future Work
Dynamic Scheduling
• After initial job assignment, but before the job starts running, periodic dynamic scheduling is invoked
• Costs of dynamic scheduling
  - Communication cost: none for local scheduling; minimal for internode scheduling & queue balancing (a job profile is small)
  - CPU cost: none; scheduling is non-preemptive, so once a job starts running it is never stopped by dynamic scheduling
Local Scheduling
• An extension of backfilling under multiple resource requirements
• Backfilling Counter (BC)
  - Every job profile keeps this value, initialized to 0
  - It counts the number of other jobs that have bypassed this job in the waiting queue
  - Only a job whose BC is equal to or greater than the BC of the job at the head of the queue can be backfilled (see the sketch after the figure below)
  - This prevents job starvation and unbounded waiting times
[Figure: Local backfilling example. While job JR runs, the queue holds J1 (head) through J4, each with BC = 0; J3 fits the residual resources and is backfilled, incrementing the BC of the jobs it bypasses.]
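A minimal sketch of the BC rule, assuming a simple Job record (the names and the fits helper below are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    reqs: dict   # per-resource requirements; omitted resources = don't care
    bc: int = 0  # Backfilling Counter, initialized to 0

def fits(job: Job, free: dict) -> bool:
    """Multi-resource check against the node's residual resources."""
    return all(free.get(r, 0) >= v for r, v in job.reqs.items())

def backfill_candidates(queue: list, free: dict) -> list:
    """BC rule: only a job whose BC is >= the BC of the head-of-queue job
    may bypass it, so a frequently bypassed head cannot starve."""
    if not queue:
        return []
    head_bc = queue[0].bc
    return [j for j in queue[1:] if j.bc >= head_bc and fits(j, free)]

def start_backfilled(queue: list, job: Job) -> Job:
    """Dequeue `job` out of order; every job it bypasses has its BC incremented."""
    idx = queue.index(job)
    for bypassed in queue[:idx]:
        bypassed.bc += 1
    return queue.pop(idx)
```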
Which job should be backfilled?
• If multiple jobs in the queue can be backfilled, choose the best one for better utilization
  - Backfill Balanced (BB) algorithm (Leinberger:SC'99)
  - Choose the job with the minimum objective function (= BM × FM); a sketch follows
• Balance Measure (BM): computed from the normalized utilization of each resource and job j's normalized requirement for each resource; minimizing BM evens out utilization across the multiple resources
• Fullness Measure (FM): minimizing FM maximizes average utilization
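A sketch of BB selection. The slide states only the intent of BM and FM; the exact forms below (max over mean utilization for BM, 1 minus mean utilization for FM) are assumptions standing in for the definitions in Leinberger:SC'99:

```python
def bm(util_after: dict) -> float:
    """Balance Measure: penalizes uneven utilization across resources.
    Assumed form: max utilization divided by mean utilization."""
    vals = list(util_after.values())
    mean = sum(vals) / len(vals)
    return max(vals) / mean if mean > 0 else 0.0

def fm(util_after: dict) -> float:
    """Fullness Measure: BB minimizes BM x FM, so FM is assumed to
    fall as average utilization rises (here: 1 - mean utilization)."""
    vals = list(util_after.values())
    return 1.0 - sum(vals) / len(vals)

def choose_backfill_job(candidates, util: dict, caps: dict):
    """Pick the eligible job minimizing BM x FM of the post-start utilization."""
    def objective(job):
        after = {r: util.get(r, 0.0) + job.reqs.get(r, 0.0) / caps[r]
                 for r in caps}
        return bm(after) * fm(after)
    return min(candidates, key=objective, default=None)
```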
Internode Scheduling
• An extension of local scheduling across nodes
• Node Backfilling Counter (NBC)
  - The BC of the job at the head of the node's waiting queue
  - Only jobs whose BC is equal to or greater than the NBC of the target node can be migrated there (sketched after the figure below)
  - This prevents job starvation and unbounded waiting times
[Figure: Internode backfilling example. Three neighboring nodes run jobs JR, JA, and JB; each holds queued jobs J1-J3 with their BC values, and each advertises its NBC (here 2 and 0), the BC of its head-of-queue job. Job J4 may migrate only to a node whose NBC does not exceed J4's own BC.]
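A minimal sketch of the NBC rule, reusing the Job record from the local-scheduling sketch above:

```python
def nbc(queue: list) -> int:
    """Node Backfilling Counter: the BC of the head-of-queue job (0 if empty)."""
    return queue[0].bc if queue else 0

def can_migrate(job, target_queue: list) -> bool:
    """Internode rule: a job may move to a target node only if its own BC is
    >= the target's NBC, so migration never starves the target's head job."""
    return job.bc >= nbc(target_queue)
```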
Internode Scheduling – PUSH vs. PULL
• PUSH
  - The job sender initiates the process
  - The sender tries to match every job in its queue against the residual free resources of its neighbors
  - If a job can be sent to multiple nodes, the target is picked via an objective function that prefers the node with the fastest CPU
• PULL (a sketch of one round follows this list)
  - The job receiver initiates the process
  - The receiver sends a PULL-Request message to the potential sender with the maximum queue length
  - The potential sender checks whether it has a job that can be backfilled and that satisfies the BC condition
  - If multiple jobs qualify, it chooses the job with the minimum objective function (= BM × FM)
  - If no job can be found, it sends a PULL-Reject message, and the receiver continues looking for another potential sender among its neighbors
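A sketch of one PULL round, reusing fits, can_migrate, and choose_backfill_job from the earlier sketches; the node attributes (queue, free, util, caps) and the in-process "messages" are stand-ins for the real protocol:

```python
def pull_round(receiver, neighbors):
    """One PULL round: the receiver asks potential senders, longest queue
    first; a PULL-Reject (modeled as an empty result) moves it along."""
    for sender in sorted(neighbors, key=lambda n: len(n.queue), reverse=True):
        # PULL-Request: the sender looks for a job that fits the receiver's
        # residual resources and satisfies the BC/NBC condition.
        eligible = [j for j in sender.queue
                    if fits(j, receiver.free) and can_migrate(j, receiver.queue)]
        if eligible:
            # Among multiple candidates, take the minimum BM x FM objective.
            job = choose_backfill_job(eligible, receiver.util, receiver.caps)
            sender.queue.remove(job)
            receiver.queue.append(job)
            return job
        # PULL-Reject: try the next potential sender among the neighbors.
    return None
```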
Queue Balancing
• Local scheduling & internode scheduling find jobs that can start running immediately, using current residual resources
• Queue balancing instead migrates jobs proactively to balance queue load; a migrated job does not have to start immediately
• (Normalized) load of a node (Leinberger:HCW'00)
  - "Load" under multiple resources: for each resource, sum the requirements of all jobs in the queue
  - The load of a node is the maximum of these per-resource sums
• Both PUSH & PULL schemes are available (see the sketch below)
  - Minimize the total local load (TLL): the sum of the loads of all neighbors
  - Minimize the maximum local load (MLL) among neighbors
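A sketch of the load definition and an MLL-style push decision; using a single shared capacity dictionary for normalization is a simplification for a heterogeneous grid:

```python
def node_load(queue: list, caps: dict) -> float:
    """Normalized load (Leinberger:HCW'00): per resource, sum the queued
    jobs' requirements (normalized by capacity); load is the maximum sum."""
    if not caps:
        return 0.0
    return max(sum(j.reqs.get(r, 0.0) for j in queue) / caps[r] for r in caps)

def push_target(job, neighbors, caps):
    """MLL variant of queue balancing: push `job` to the neighbor that leaves
    the smallest maximum load among neighbors (TLL would minimize the sum)."""
    def mll_after(target):
        return max(node_load(n.queue + ([job] if n is target else []), caps)
                   for n in neighbors)
    return min(neighbors, key=mll_after)
```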
Outline
• Background
• Related work
• Our approach
• Experimental Results
• Conclusion & Future Work
Experimental Setup
• Event-driven simulations over a set of nodes and events (a workload sketch follows this list)
  - 1000 initial nodes and 5000 job submissions
  - Jobs are submitted as a Poisson process (with various mean interarrival times T)
  - A node has 1, 2, 4, or 8 cores
  - Job run times follow a uniform distribution (30-90 minutes)
• Node capabilities (and job requirements)
  - CPU, memory, disk, and the number of cores
  - A job's requirement for a resource can be omitted ("don't care")
  - Job Constraint Ratio: the probability that each resource type is specified for a job
• Steady-state experiments
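A sketch of how such a workload could be generated (parameter names are illustrative; the experiments' exact T values are not given here):

```python
import random

def generate_workload(num_jobs=5000, mean_interval=60.0, seed=0):
    """Poisson arrivals (exponential interarrival times with mean T =
    `mean_interval` seconds) and run times uniform in 30-90 minutes."""
    rng = random.Random(seed)
    t, jobs = 0.0, []
    for _ in range(num_jobs):
        t += rng.expovariate(1.0 / mean_interval)  # Poisson arrival process
        run_time = rng.uniform(30 * 60, 90 * 60)   # seconds
        jobs.append((t, run_time))
    return jobs
```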
Experimental Setup
• Performance metric: job turn-around time
[Figure: Turn-around time decomposition. From injection into the system, a job travels to its owner and then to its run node (matchmaking cost), waits in the queue until it starts execution (wait time), and runs until it finishes execution (running time).]
Comparison Models
• Centralized Scheduler (CENT)
  - An online, global scheduling mechanism with a single waiting queue
  - Not feasible in a complete P2P implementation; used as a comparison point
• Combinations of our approach
  - Vanilla: no dynamic scheduling (initial job assignment only)
  - L: local scheduling only
  - LI: local scheduling + internode scheduling
  - LIQ: all three methods (local scheduling + internode scheduling + queue balancing)
  - LI(Q)-PUSH/PULL: LI and LIQ each with PUSH or PULL options
[Figure: CENT. All jobs (e.g., Job J: CPU 2.0 GHz, Mem 500 MB, Disk 1 GB) are submitted to a single centralized scheduler with one global queue.]
Performance varying system loads
• Performance ordering, best to worst: LIQ-PULL > LI-PULL > LIQ-PUSH > LI-PUSH > L >= Vanilla
• Internode scheduling plays the major role
• PULL is better than PUSH
  - In an overloaded system, PULL spreads information better, owing to its aggressive search for a job (Demers:PODC'87)
• Local scheduling alone cannot guarantee better performance than Vanilla
  - Our Backfilling Counter does not ensure that other waiting jobs are never delayed (unlike conservative backfilling)
Overheads
• PULL has higher cost than PUSH
  - Active search involves many trials and errors
• The other schemes have overhead similar to Vanilla
  - Our schemes do not incur significant additional overhead
Performance varying Job Constraint Ratio
• LIQ-PULL is best
• LIQ == LI
• LIQ-PULL is competitive with CENT
• At an 80% Job Constraint Ratio, LIQ-PULL's performance degrades relative to the other ratios
  - Reason: it becomes difficult to find a capable neighbor for job migration because jobs are more highly constrained
Evaluation Summary
• Performance
  - LIQ-PULL is competitive with CENT
  - Internode scheduling has the major impact on performance
  - PULL is better than PUSH (more aggressive search)
  - Performance is consistently good across varying system loads and job constraint ratios
• Overheads
  - PULL > PUSH (more aggressive search)
  - Competitive with Vanilla
Conclusion and Future Work
• New decentralized dynamic scheduling for multi-core P2P grids
  - Local scheduling: backfilling within a single node
  - Internode scheduling: backfilling across neighbors
  - Queue balancing: proactive queue (load) adjustment among neighbors
• Evaluation via simulation
  - LIQ-PULL is the best, with performance competitive with CENT
  - PULL is better than PUSH, but PULL has higher cost
  - Low overhead (similar to Vanilla)
• Future work
  - Real deployment (in cooperation with the Astronomy Department)
  - Decentralized resource management for heterogeneous asymmetric multiprocessors