Task scheduling algorithm for multicore processor systems for minimizing recovery time in case of a single node fault
Shohei Gotoda†, Naoki Shibata‡, Minoru Ito†
†Nara Institute of Science and Technology  ‡Shiga University
Background
• Multicore processors: almost all processors designed recently are multicore processors
• A computing cluster consisting of 1800 nodes experiences about 1000 failures in the first year after deployment [1]
[1] "Google spotlights data center inner workings," cnet.com article, May 30, 2008
Objective of Research
• Fault tolerance: we assume a single fail-stop failure of a multicore processor
• Network contention: we model it to generate schedules reproducible on real systems
• Goal: devise a new scheduling method that minimizes recovery time, taking account of the above points
Task Graph
• A group of tasks that can be executed in parallel
• Vertex (task node): a task to be executed on a single CPU core
• Edge (task link): data dependence between tasks
(Figure: a task graph with task nodes and task links)
Processor Graph
• Topology of the computer network
• Vertex (processor node)
  CPU core (circle): has only one link
  Switch (rectangle): has more than 2 links
• Edge (processor link): communication path between processors
(Figure: a processor graph with processor nodes, processor links, and a switch)
Task Scheduling
• The task scheduling problem
  assigns a processor node to each task node
  minimizes total execution time
• An NP-hard problem
• One processor node is assigned to each task node
(Figure: a task graph mapped onto a processor graph)
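The two inputs above can be sketched as plain data structures. This is an illustrative layout, not the paper's implementation: the names `TASKS`, `LINKS`, `CORES`, and `schedule` are assumptions.

```python
# Task graph: node -> computation cost; edge (u, v) -> data sent from u to v.
TASKS = {1: 4.0, 2: 3.0, 3: 5.0}
LINKS = {(1, 3): 2.0, (2, 3): 2.0}   # tasks 1 and 2 feed task 3

# Processor graph (cores only; switches omitted for brevity).
CORES = ["A", "B", "C", "D"]

def predecessors(task):
    """Tasks whose output the given task depends on."""
    return [u for (u, v) in LINKS if v == task]

# A schedule assigns one processor node (core) to every task node.
schedule = {1: "A", 2: "B", 3: "A"}

assert all(t in schedule for t in TASKS)
print(predecessors(3))   # -> [1, 2]
```

A full scheduler would additionally order the tasks on each core and route their messages, which is where the NP-hardness comes from.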
Inputs and Outputs for Task Scheduling
• Inputs: a task graph and a processor graph
• Output: a schedule
  • which is an assignment of a processor node to each task node
• Objective function: minimize task execution time
(Figure: a task graph and a processor graph)
Network Contention Model
• Communication is delayed if a processor link is occupied by another communication
• We use an existing network contention model [2]
(Figure: two communications contending for the same processor link)
[2] O. Sinnen and L.A. Sousa, "Communication Contention in Task Scheduling," IEEE Trans. Parallel and Distributed Systems, vol. 16, no. 6, pp. 503-515, 2005.
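The core idea of such a link-contention model can be sketched as busy-interval bookkeeping per link: a message must wait until every link on its route is free. This is a minimal sketch of the idea, not Sinnen and Sousa's actual algorithm; all names are illustrative.

```python
busy = {}  # link -> list of (start, end) busy intervals

def earliest_free(link, t, duration):
    """Earliest time >= t at which `link` is idle for `duration`."""
    for (s, e) in sorted(busy.get(link, [])):
        if t + duration <= s:      # fits in the gap before this interval
            break
        if t < e:                  # overlaps: push the start past it
            t = e
    return t

def reserve(route, t, duration):
    """Delay a message until all links on its route are free, then book them."""
    changed = True
    while changed:                 # fixed point: same start time on every link
        changed = False
        for link in route:
            t2 = earliest_free(link, t, duration)
            if t2 != t:
                t, changed = t2, True
    for link in route:
        busy.setdefault(link, []).append((t, t + duration))
    return t

start  = reserve(["A-sw", "sw-B"], t=0.0, duration=2.0)  # starts at 0.0
start2 = reserve(["A-sw", "sw-C"], t=0.0, duration=1.0)  # contends on A-sw
```

Here the second message shares link `A-sw` with the first, so it is delayed until time 2.0 even though its destination link is idle.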
Multicore Processor Model
• Each core executes a task independently from the other cores
• Communication between cores finishes instantaneously
• One network interface is shared among all cores on a die
• If there is a failure, all cores on a die stop execution simultaneously
(Figure: a CPU die with two cores in the processor graph)
Influence of Multicore Processors
• Need for considering multicore processors in scheduling: the communication link among cores on a single die is high speed
• Existing schedulers try to utilize this high speed link
• As a result, many dependent tasks are assigned to cores on a single die
(Figure: dependent tasks assigned to cores on the same die)
Influence of Multicore Processors (cont.)
• In case of a fault, dependent tasks tend to be destroyed all at once
(Figure: a die failure destroying the results of all tasks assigned to it)
Related Work (1/2)
• Checkpointing [3]
  Node state is saved in each node
  A backup node is allocated
  Processing results are recovered from the saved state
  Multicore is not considered
  Network contention is not considered
(Figure: primary and backup nodes with input and output queues)
[3] Y. Gu, Z. Zhang, F. Ye, H. Yang, M. Kim, H. Lei, and Z. Liu, "An Empirical Study of High Availability in Stream Processing Systems," in Middleware '09: the 10th ACM/IFIP/USENIX International Conference on Middleware (Industrial Track), 2009.
Related Work (2/2)
• A task scheduling method [5] in which
  multiple task graph templates are prepared beforehand
  processors are assigned according to the templates
• This method is suitable for highly loaded systems
[5] J. Wolf et al., "SODA: An Optimizing Scheduler for Large-Scale Stream-Based Distributed Computer Systems," in ACM Middleware, 2008.
Our Contribution
• There is no existing scheduling method that takes account of both
  • multicore processor failure
  • network contention
• We propose a scheduling method taking account of network contention and multicore processor failure
Assumptions
• Only a single fail-stop failure of a multicore processor can occur
  A failed computing node automatically restarts after 30 sec.
• A failure can be detected in one second by interruption of heartbeat signals
• A checkpointing technique is used to recover from the saved state
• Network contention: the contention model is the same as Sinnen's model
Checkpointing and Recovery
• Each processor node saves its state to main memory when each task finishes
  The saved state is the data transferred to the succeeding processor nodes
  Only the output data from each task node is saved as state
  • This is much smaller than the complete memory image
  We assume saving the state finishes instantaneously, since this is just copying small data within memory
• Recovery
  Saved states unaffected by the failure are found in the ancestor task nodes
  Some tasks are executed again using the saved state
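The checkpoint/recovery scheme above can be sketched as a small bookkeeping routine: save each task's output data tagged with the die that produced it, and on a die failure re-run exactly the tasks whose state was lost. This is an illustrative sketch under the slide's assumptions; names (`saved`, `finish`, `recover`) are not from the paper, and the cascade is propagated only one level for brevity.

```python
saved = {}                # task -> (die, output_data)

def finish(task, die, output):
    """Checkpoint a task: keep only its output data, not a memory image."""
    saved[task] = (die, output)

def recover(failed_die, deps):
    """Return the set of tasks that must be re-executed after the failure."""
    lost = {t for t, (d, _) in saved.items() if d == failed_die}
    redo = set(lost)
    # A task must also be re-run if it never finished, or if a direct
    # dependency's state was lost (a full version would propagate transitively).
    for t, preds in deps.items():
        if t not in saved or redo & set(preds):
            redo.add(t)
    return redo

finish(1, "die0", b"out1")
finish(2, "die1", b"out2")
redo = recover("die0", {3: [1, 2]})   # task 3 depends on tasks 1 and 2
print(redo)   # -> {1, 3}: task 2's state survived on die1
```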
What the Proposed Method Tries to Do
• Reduce recovery time in case of a failure
  Minimize the worst-case total execution time
  • Worst case among all possible failure patterns
  • Each of the dies can fail
  Total = execution time before the failure + recovery time
Worst Case Scenario
• Critical path
  The path in the task graph from the first to the last task with the longest execution time
• The worst case scenario
  All tasks on the critical path are assigned to cores on one die
  The failure happens while the last task is being executed
  We then need twice the total execution time
(Figure: example task graph with its first and last tasks)
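The critical path used above is the longest path through the weighted task DAG; it can be computed with one memoized pass in topological direction. Costs and edges below are illustrative, not from the paper's benchmarks.

```python
import functools

COST = {1: 4.0, 2: 3.0, 3: 5.0, 4: 2.0}
PRED = {1: [], 2: [], 3: [1, 2], 4: [3]}   # task 4 is the "Last" task

@functools.lru_cache(maxsize=None)
def longest_to(task):
    """Length of the longest (critical) path ending at `task`."""
    best = max((longest_to(p) for p in PRED[task]), default=0.0)
    return best + COST[task]

critical_len = max(longest_to(t) for t in COST)
print(critical_len)   # 4 + 5 + 2 = 11.0
```

Communication costs on the edges are omitted here; with contention they would be added along the path, which is why distributing the path across dies has a price.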
Idea of Proposed Method
• We distribute tasks on the critical path over dies
  But there is communication overhead
  If we distribute too many tasks, there is too much overhead
• Usually, the last tasks in the critical path have larger influence
  We check tasks from the last task in the critical path
  We move the last k tasks in the critical path to other dies
  We find the best k
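The k-search described above can be sketched as a simple loop: for each k, move the last k critical-path tasks to other dies, score the resulting schedule, and keep the best k. `evaluate` below is a hypothetical stand-in for the real contention-aware, worst-case-failure schedule evaluator.

```python
def best_k(critical_path, evaluate, max_k=None):
    """Try moving the last k critical-path tasks to other dies; return best k."""
    if max_k is None:
        max_k = len(critical_path)
    best = (float("inf"), 0)
    for k in range(max_k + 1):
        moved = critical_path[len(critical_path) - k:]  # last k tasks
        worst = evaluate(moved)   # worst-case execution time incl. recovery
        best = min(best, (worst, k))
    return best[1]

# Toy cost model: moving some tasks helps, moving too many adds overhead.
toy = lambda moved: abs(len(moved) - 2) + 10.0
k = best_k([1, 2, 3, 4, 5], toy)
print(k)   # -> 2 under this toy cost model
```

The trade-off in the toy model mirrors the slide: too few moved tasks leaves the die a single point of failure, too many pays excessive communication overhead.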
Problem with Existing Method
• Task 1 is assigned to core A
• Task 2 is assigned to core B
• Task 3 is assigned to the same die, because of the high communication speed
(Figure: existing schedule and resulting execution timeline on cores A-D)
Problem with Existing Method (cont.)
• Suppose that a failure happens while Task 3 is being executed
• All results are lost
(Figure: existing schedule; the die failure destroys the results of all three tasks)
Problem with Existing Method (cont.)
• We need to execute all tasks again from the beginning, as 1', 2', 3', on another die
(Figure: resulting execution; tasks 1', 2', 3' re-executed on another die after the failure)
Improvement in Proposed Method
• Distribute influential tasks to other dies
  In this case, Task 3 is the most influential
(Figure: proposed schedule; Task 3 runs on another die, at the cost of some communication overhead)
Recovery in Proposed Method
• Suppose that a failure happens while Task 3 is being executed
• The results of Tasks 1 and 2 are saved
(Figure: proposed schedule and resulting execution timeline)
Recovery in Proposed Method (cont.)
• Execution can be continued from the saved state: only Task 3 is re-executed as 3'
(Figure: proposed schedule; Task 3 re-executed from the saved state)
Communication Overhead
• Communication overhead is imposed on the proposed method
(Figure: existing vs. proposed schedule timelines; the proposed schedule pays communication overhead)
Speed-up in Recovery
• The proposed method has a larger effect if computation time is longer than communication time
(Figure: recovery timelines with the existing vs. proposed schedule; the proposed schedule recovers faster)
Comparison of Schedules
(Figure: existing vs. proposed schedule timelines for a 13-node task graph on the processor graph)
Comparison of Recovery
(Figure: recovery timelines with the existing vs. proposed schedule; the failed die is not available, and the proposed schedule finishes earlier)
Evaluation
• Items to compare
  Recovery time in case of a failure
  Overhead in case of no failure
• Compared methods
  PROPOSED
  CONTENTION: Sinnen's method considering network contention
  INTERLEAVED: a scheduling algorithm that tries to spread tasks over all dies as much as possible
Test Environment
• Devices: 4 PCs with
  • Intel Core i7 920 (2.67GHz) (quad core)
  • Intel Network Interface Card: Intel Gigabit CT Desktop Adaptor (PCI Express x1)
  • 6.0GB memory
• Program to measure execution time
  • Windows 7 (64bit)
  • Java(TM) SE Runtime Environment (64bit)
  • Standard TCP sockets
Task Graph with Low Parallelism
Configuration
• Number of task nodes: 90
• Number of cores on a die: 2
• Number of dies: 2 to 4
• Robot control task graph [4]
(Figure: task graph and processor graph; each die has 2 cores connected via a switch)
[4] Standard Task Graph Set, http://www.kasahara.elec.waseda.ac.jp/schedule/index.html
Results with Robot Control Task
• We varied the number of dies
• In case of a failure, the proposed method reduced total execution time by 40%
• In case of no failure, there is up to 6% overhead
(Figure: execution time (sec) vs. number of dies for PROPOSED, CONTENTION, and INTERLEAVED, with and without a failure)
Task Graph with High Parallelism
Configuration
• Number of task nodes: 98
• Number of cores on a die: 4
• Number of dies: 2 to 4
• Sparse matrix solver task graph [4]
(Figure: task graph and processor graph; each die has 4 cores connected via a switch)
Results with Sparse Matrix Solver
• We varied the number of dies
• In case of a failure, execution time including recovery was reduced by up to 25%
• In case of no failure, there is up to 7% overhead
(Figure: execution time (sec) vs. number of dies for PROPOSED, CONTENTION, and INTERLEAVED, with and without a failure)
Simulation with Varied CCR
• CCR: the ratio between communication time and computation time
  A high CCR means long communication time
• Number of tasks: 50
• Number of cores on a die: 4
• Number of dies: 4
• Task graphs: 18 random graphs
(Figure: processor graph; each die has 4 cores connected via a switch)
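The CCR defined above can be computed directly from a task graph's edge and node weights: total communication cost divided by total computation cost. A minimal sketch with illustrative numbers:

```python
def ccr(comp_costs, comm_costs):
    """Communication-to-computation ratio of a task graph."""
    return sum(comm_costs) / sum(comp_costs)

# CCR = 10 (as in the slide) would mean a communication-heavy graph;
# the toy graph below is computation-heavy instead.
r = ccr(comp_costs=[4.0, 3.0, 5.0], comm_costs=[2.0, 2.0])
print(r)   # -> 4.0 / 12.0 = 0.333...
```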
Results with Varied CCR
• We varied the CCR
• INTERLEAVED has large overhead when CCR = 10 (communication heavy)
• PROPOSED has up to 30% overhead when there is no failure, but reduces execution time in case of a failure
(Figure: execution time (sec) for PROPOSED, CONTENTION, and INTERLEAVED, with and without a failure)
Effect of Parallelization of Proposed Scheduler
• The proposed algorithm is parallelized
• We compared times to generate schedules
  20 task graphs
  Multi-thread vs. single-thread
  Speed-up: up to 4x
• Environment
  • Intel Core i7 920 (2.67GHz)
  • Windows 7 (64bit)
  • Java(TM) SE 6 (64bit)
(Figure: time to generate a schedule, single thread vs. multi thread)
Conclusion
• Proposed a task scheduling method considering
  Network contention
  A single fail-stop failure
  Multicore processors
• Future work
  Evaluation on larger computer systems