Task scheduling algorithm for multicore processor systems for minimizing recovery time in case of a single node fault
Shohei Gotoda†, Naoki Shibata‡, Minoru Ito†
†Nara Institute of Science and Technology  ‡Shiga University
Background
• Multicore processors: almost all processors designed recently are multicore processors
• A computing cluster consisting of 1800 nodes experiences about 1000 failures in the first year after deployment [1]
[1] "Google spotlights data center inner workings," cnet.com article, May 30, 2008
Objective of Research
• Fault tolerance: we assume a single fail-stop failure of a multicore processor
• Network contention: we model it to generate schedules reproducible on real systems
• Goal: devise a new scheduling method that minimizes recovery time, taking account of the above points
Task Graph
• A group of tasks that can be executed in parallel
• Vertex (task node): a task to be executed on a single CPU core
• Edge (task link): data dependence between tasks
(Figure: a task graph with task nodes and task links)
Processor Graph
• Topology of the computer network
• Vertex (processor node)
  CPU core (circle): has only one link
  Switch (rectangle): has more than 2 links
• Edge (processor link): communication path between processors
(Figure: a processor graph with processor nodes, processor links, and a switch)
Task Scheduling
• The task scheduling problem
  assigns a processor node to each task node
  minimizes total execution time
• An NP-hard problem
• One processor node is assigned to each task node
(Figure: a task graph mapped onto a processor graph)
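The two inputs above can be sketched as plain data structures. This is an illustrative layout, not the paper's implementation: the names `TASKS`, `LINKS`, `CORES`, and `schedule` are assumptions.

```python
# Task graph: node -> computation cost; edge (u, v) -> data sent from u to v.
TASKS = {1: 4.0, 2: 3.0, 3: 5.0}
LINKS = {(1, 3): 2.0, (2, 3): 2.0}   # tasks 1 and 2 feed task 3

# Processor graph (cores only; switches omitted for brevity).
CORES = ["A", "B", "C", "D"]

def predecessors(task):
    """Tasks whose output the given task depends on."""
    return [u for (u, v) in LINKS if v == task]

# A schedule assigns one processor node (core) to every task node.
schedule = {1: "A", 2: "B", 3: "A"}

assert all(t in schedule for t in TASKS)
print(predecessors(3))   # -> [1, 2]
```

A full scheduler would additionally order the tasks on each core and route their messages, which is where the NP-hardness comes from.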
Inputs and Outputs for Task Scheduling
• Inputs: a task graph and a processor graph
• Output: a schedule
  • which is an assignment of a processor node to each task node
• Objective function: minimize task execution time
(Figure: a task graph and a processor graph)
Network Contention Model
• Communication is delayed if a processor link is occupied by another communication
• We use an existing network contention model [2]
(Figure: two communications contending for the same processor link)
[2] O. Sinnen and L.A. Sousa, "Communication Contention in Task Scheduling," IEEE Trans. Parallel and Distributed Systems, vol. 16, no. 6, pp. 503-515, 2005.
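The core idea of such a link-contention model can be sketched as busy-interval bookkeeping per link: a message must wait until every link on its route is free. This is a minimal sketch of the idea, not Sinnen and Sousa's actual algorithm; all names are illustrative.

```python
busy = {}  # link -> list of (start, end) busy intervals

def earliest_free(link, t, duration):
    """Earliest time >= t at which `link` is idle for `duration`."""
    for (s, e) in sorted(busy.get(link, [])):
        if t + duration <= s:      # fits in the gap before this interval
            break
        if t < e:                  # overlaps: push the start past it
            t = e
    return t

def reserve(route, t, duration):
    """Delay a message until all links on its route are free, then book them."""
    changed = True
    while changed:                 # fixed point: same start time on every link
        changed = False
        for link in route:
            t2 = earliest_free(link, t, duration)
            if t2 != t:
                t, changed = t2, True
    for link in route:
        busy.setdefault(link, []).append((t, t + duration))
    return t

start  = reserve(["A-sw", "sw-B"], t=0.0, duration=2.0)  # starts at 0.0
start2 = reserve(["A-sw", "sw-C"], t=0.0, duration=1.0)  # contends on A-sw
```

Here the second message shares link `A-sw` with the first, so it is delayed until time 2.0 even though its destination link is idle.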
Multicore Processor Model
• Each core executes a task independently from the other cores
• Communication between cores finishes instantaneously
• One network interface is shared among all cores on a die
• If there is a failure, all cores on a die stop execution simultaneously
(Figure: a CPU die with two cores in the processor graph)
Influence of Multicore Processors
• Need for considering multicore processors in scheduling: the communication link among cores on a single die is high speed
• Existing schedulers try to utilize this high speed link
• As a result, many dependent tasks are assigned to cores on a single die
(Figure: dependent tasks assigned to cores on the same die)
Influence of Multicore Processors (cont.)
• In case of a fault, dependent tasks tend to be destroyed all at once
(Figure: a die failure destroying the results of all tasks assigned to it)
Related Work (1/2)
• Checkpointing [3]
  Node state is saved in each node
  A backup node is allocated
  Processing results are recovered from the saved state
  Multicore is not considered
  Network contention is not considered
(Figure: primary and backup nodes with input and output queues)
[3] Y. Gu, Z. Zhang, F. Ye, H. Yang, M. Kim, H. Lei, and Z. Liu, "An Empirical Study of High Availability in Stream Processing Systems," in Middleware '09: the 10th ACM/IFIP/USENIX International Conference on Middleware (Industrial Track), 2009.
Related Work (2/2)
• A task scheduling method [5] in which
  multiple task graph templates are prepared beforehand
  processors are assigned according to the templates
• This method is suitable for highly loaded systems
[5] J. Wolf et al., "SODA: An Optimizing Scheduler for Large-Scale Stream-Based Distributed Computer Systems," in ACM Middleware, 2008.
Our Contribution
• There is no existing scheduling method that takes account of both
  • multicore processor failure
  • network contention
• We propose a scheduling method taking account of network contention and multicore processor failure
Assumptions
• Only a single fail-stop failure of a multicore processor can occur
  A failed computing node automatically restarts after 30 sec.
• A failure can be detected in one second by interruption of heartbeat signals
• A checkpointing technique is used to recover from the saved state
• Network contention: the contention model is the same as Sinnen's model
Checkpointing and Recovery
• Each processor node saves its state to main memory when each task finishes
  The saved state is the data transferred to the succeeding processor nodes
  Only the output data from each task node is saved as state
  • This is much smaller than the complete memory image
  We assume saving the state finishes instantaneously, since this is just copying small data within memory
• Recovery
  Saved states unaffected by the failure are found in the ancestor task nodes
  Some tasks are executed again using the saved state
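The checkpoint/recovery scheme above can be sketched as a small bookkeeping routine: save each task's output data tagged with the die that produced it, and on a die failure re-run exactly the tasks whose state was lost. This is an illustrative sketch under the slide's assumptions; names (`saved`, `finish`, `recover`) are not from the paper, and the cascade is propagated only one level for brevity.

```python
saved = {}                # task -> (die, output_data)

def finish(task, die, output):
    """Checkpoint a task: keep only its output data, not a memory image."""
    saved[task] = (die, output)

def recover(failed_die, deps):
    """Return the set of tasks that must be re-executed after the failure."""
    lost = {t for t, (d, _) in saved.items() if d == failed_die}
    redo = set(lost)
    # A task must also be re-run if it never finished, or if a direct
    # dependency's state was lost (a full version would propagate transitively).
    for t, preds in deps.items():
        if t not in saved or redo & set(preds):
            redo.add(t)
    return redo

finish(1, "die0", b"out1")
finish(2, "die1", b"out2")
redo = recover("die0", {3: [1, 2]})   # task 3 depends on tasks 1 and 2
print(redo)   # -> {1, 3}: task 2's state survived on die1
```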
What the Proposed Method Tries to Do
• Reduce recovery time in case of a failure
  Minimize the worst-case total execution time
  • Worst case among all possible failure patterns
  • Each of the dies can fail
  Total = execution time before the failure + recovery time
Worst Case Scenario
• Critical path
  The path in the task graph from the first to the last task with the longest execution time
• The worst case scenario
  All tasks on the critical path are assigned to cores on one die
  The failure happens while the last task is being executed
  We then need twice the total execution time
(Figure: example task graph with its first and last tasks)
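The critical path used above is the longest path through the weighted task DAG; it can be computed with one memoized pass in topological direction. Costs and edges below are illustrative, not from the paper's benchmarks.

```python
import functools

COST = {1: 4.0, 2: 3.0, 3: 5.0, 4: 2.0}
PRED = {1: [], 2: [], 3: [1, 2], 4: [3]}   # task 4 is the "Last" task

@functools.lru_cache(maxsize=None)
def longest_to(task):
    """Length of the longest (critical) path ending at `task`."""
    best = max((longest_to(p) for p in PRED[task]), default=0.0)
    return best + COST[task]

critical_len = max(longest_to(t) for t in COST)
print(critical_len)   # 4 + 5 + 2 = 11.0
```

Communication costs on the edges are omitted here; with contention they would be added along the path, which is why distributing the path across dies has a price.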
Idea of Proposed Method
• We distribute tasks on the critical path over dies
  But there is communication overhead
  If we distribute too many tasks, there is too much overhead
• Usually, the last tasks in the critical path have larger influence
  We check tasks from the last task in the critical path
  We move the last k tasks in the critical path to other dies
  We find the best k
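The k-search described above can be sketched as a simple loop: for each k, move the last k critical-path tasks to other dies, score the resulting schedule, and keep the best k. `evaluate` below is a hypothetical stand-in for the real contention-aware, worst-case-failure schedule evaluator.

```python
def best_k(critical_path, evaluate, max_k=None):
    """Try moving the last k critical-path tasks to other dies; return best k."""
    if max_k is None:
        max_k = len(critical_path)
    best = (float("inf"), 0)
    for k in range(max_k + 1):
        moved = critical_path[len(critical_path) - k:]  # last k tasks
        worst = evaluate(moved)   # worst-case execution time incl. recovery
        best = min(best, (worst, k))
    return best[1]

# Toy cost model: moving some tasks helps, moving too many adds overhead.
toy = lambda moved: abs(len(moved) - 2) + 10.0
k = best_k([1, 2, 3, 4, 5], toy)
print(k)   # -> 2 under this toy cost model
```

The trade-off in the toy model mirrors the slide: too few moved tasks leaves the die a single point of failure, too many pays excessive communication overhead.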
Problem with Existing Method
• Task 1 is assigned to core A
• Task 2 is assigned to core B
• Task 3 is assigned to the same die, because of the high communication speed
(Figure: existing schedule and resulting execution timeline on cores A-D)
Problem with Existing Method (cont.)
• Suppose that a failure happens while Task 3 is being executed
• All results are lost
(Figure: existing schedule; the die failure destroys the results of all three tasks)
Problem with Existing Method (cont.)
• We need to execute all tasks again from the beginning, as 1', 2', 3', on another die
(Figure: resulting execution; tasks 1', 2', 3' re-executed on another die after the failure)
Improvement in Proposed Method
• Distribute influential tasks to other dies
  In this case, Task 3 is the most influential
(Figure: proposed schedule; Task 3 runs on another die, at the cost of some communication overhead)
Recovery in Proposed Method
• Suppose that a failure happens while Task 3 is being executed
• The results of Tasks 1 and 2 are saved
(Figure: proposed schedule and resulting execution timeline)
Recovery in Proposed Method (cont.)
• Execution can be continued from the saved state: only Task 3 is re-executed as 3'
(Figure: proposed schedule; Task 3 re-executed from the saved state)
Communication Overhead
• Communication overhead is imposed on the proposed method
(Figure: existing vs. proposed schedule timelines; the proposed schedule pays communication overhead)
Speed-up in Recovery
• The proposed method has a larger effect if computation time is longer than communication time
(Figure: recovery timelines with the existing vs. proposed schedule; the proposed schedule recovers faster)
Comparison of Schedules
(Figure: existing vs. proposed schedule timelines for a 13-node task graph on the processor graph)
Comparison of Recovery
(Figure: recovery timelines with the existing vs. proposed schedule; the failed die is not available, and the proposed schedule finishes earlier)
Evaluation
• Items to compare
  Recovery time in case of a failure
  Overhead in case of no failure
• Compared methods
  PROPOSED
  CONTENTION: Sinnen's method considering network contention
  INTERLEAVED: a scheduling algorithm that tries to spread tasks over all dies as much as possible
Test Environment
• Devices: 4 PCs with
  • Intel Core i7 920 (2.67GHz) (quad core)
  • Intel Network Interface Card: Intel Gigabit CT Desktop Adaptor (PCI Express x1)
  • 6.0GB memory
• Program to measure execution time
  • Windows 7 (64bit)
  • Java(TM) SE Runtime Environment (64bit)
  • Standard TCP sockets
Task Graph with Low Parallelism
Configuration
• Number of task nodes: 90
• Number of cores on a die: 2
• Number of dies: 2 to 4
• Robot control task graph [4]
(Figure: task graph and processor graph; each die has 2 cores connected via a switch)
[4] Standard Task Graph Set, http://www.kasahara.elec.waseda.ac.jp/schedule/index.html
Results with Robot Control Task
• We varied the number of dies
• In case of a failure, the proposed method reduced total execution time by 40%
• In case of no failure, there is up to 6% overhead
(Figure: execution time (sec) vs. number of dies for PROPOSED, CONTENTION, and INTERLEAVED, with and without a failure)
Task Graph with High Parallelism
Configuration
• Number of task nodes: 98
• Number of cores on a die: 4
• Number of dies: 2 to 4
• Sparse matrix solver task graph [4]
(Figure: task graph and processor graph; each die has 4 cores connected via a switch)
Results with Sparse Matrix Solver
• We varied the number of dies
• In case of a failure, execution time including recovery was reduced by up to 25%
• In case of no failure, there is up to 7% overhead
(Figure: execution time (sec) vs. number of dies for PROPOSED, CONTENTION, and INTERLEAVED, with and without a failure)
Simulation with Varied CCR
• CCR: the ratio between communication time and computation time
  A high CCR means long communication time
• Number of tasks: 50
• Number of cores on a die: 4
• Number of dies: 4
• Task graphs: 18 random graphs
(Figure: processor graph; each die has 4 cores connected via a switch)
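The CCR defined above can be computed directly from a task graph's edge and node weights: total communication cost divided by total computation cost. A minimal sketch with illustrative numbers:

```python
def ccr(comp_costs, comm_costs):
    """Communication-to-computation ratio of a task graph."""
    return sum(comm_costs) / sum(comp_costs)

# CCR = 10 (as in the slide) would mean a communication-heavy graph;
# the toy graph below is computation-heavy instead.
r = ccr(comp_costs=[4.0, 3.0, 5.0], comm_costs=[2.0, 2.0])
print(r)   # -> 4.0 / 12.0 = 0.333...
```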
Results with Varied CCR
• We varied the CCR
• INTERLEAVED has large overhead when CCR = 10 (communication heavy)
• PROPOSED has up to 30% overhead when there is no failure, but reduces execution time in case of a failure
(Figure: execution time (sec) for PROPOSED, CONTENTION, and INTERLEAVED, with and without a failure)
Effect of Parallelization of Proposed Scheduler
• The proposed algorithm is parallelized
• We compared times to generate schedules
  20 task graphs
  Multi-thread vs. single-thread
  Speed-up: up to 4x
• Environment
  • Intel Core i7 920 (2.67GHz)
  • Windows 7 (64bit)
  • Java(TM) SE 6 (64bit)
(Figure: time to generate a schedule, single thread vs. multi thread)
Conclusion
• Proposed a task scheduling method considering
  Network contention
  A single fail-stop failure
  Multicore processors
• Future work
  Evaluation on larger computer systems