Computer Science 12 Embedded Systems Group

Automatic Extraction of Multi-Objective Aware Parallelism for Heterogeneous MPSoCs

Daniel Cordes, Michael Engel, Olaf Neugebauer, Peter Marwedel

TU Dortmund University Department of Computer Science 12

Dortmund Germany

MuCoCoS 2013 - 09/07/13

© Daniel Cordes 2013 1 / 19

Introduction

Complexity of embedded software increasing

Multiple cores available in MPSoCs

Manual parallelization error-prone & time-consuming

Heterogeneous MPSoCs can be more efficient than homogeneous

Cores behave differently on same parts of application

Efficient balancing of tasks very difficult

Restrictions of embedded architectures

Low computational power, small memories, battery-driven

Good trade-off between different objectives must be found

Multi-objective aware approach based on Genetic Algorithms to extract parallelism for heterogeneous MPSoCs presented here

Outline

1. Motivating Example

2. Parallelization Methodology

3. GA-based Parallelization Approach

4. Results

5. Conclusion & Future Work

big.LITTLE architecture (ARM)

Performance vs. power of ARM’s big.LITTLE architecture

Many trade-offs possible

[Figure: performance/power chart for Cortex-A7 and Cortex-A15, with the lowest and highest operating point of each core marked. Source: arm.com]

Example – Homogeneous architecture

[Figure: timeline over four slow processors. Phase 1: for (…) on one processor; Phase 2: convolve2d_4tasks, four tasks; Phase 3 and Phase 4: convolve2d_2tasks, two tasks each; Phase 5+6: for (...), four tasks. Marker: time to finish application.]

Timing example of parallelized edge-detect application (UTDSP benchmark suite)

Execution on homogeneous architecture

Well-balanced solution

Example – Heterogeneous architecture

[Figure: the same schedule on one slow, one medium, and two fast processors. Phase 1: for (…); Phase 2: convolve2d_4tasks; Phase 3 and Phase 4: convolve2d_2tasks; Phase 5+6: for (...). Annotation: "Unused performance and very unbalanced execution behavior". Marker: time to finish application.]

Same solution on heterogeneous architecture

Faster cores not well utilized

Unbalanced solution with a lot of wasted performance
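
The effect can be made concrete with a small sketch. It assumes a single parallel phase whose work is split across the processors; the speed factors, work amounts, and the helper name `finish_time` are invented for illustration and are not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Finish time of one parallel phase: processor p executes work[p]
 * abstract work units at speed[p] units per time step, and the phase
 * ends when the last processor is done. All values are illustrative. */
double finish_time(const double *work, const double *speed, size_t n)
{
    assert(n > 0);
    double latest = 0.0;
    for (size_t p = 0; p < n; ++p) {
        assert(speed[p] > 0.0);
        double t = work[p] / speed[p];
        if (t > latest)
            latest = t;
    }
    return latest;
}
```

With assumed speeds of 1, 2, 4, and 4 (slow, medium, fast, fast), an equal split of 44 work units (11 each) finishes only when the slow processor is done, whereas a split proportional to the speeds (4, 8, 16, 16) lets all four processors finish together.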

Example – Heterogeneous architecture

[Figure: timeline over one slow, one medium, and two fast processors with Phase 1, Phase 2, Phase 3, Phase 4, and Phase 5+6 redistributed across the cores. Annotation: "Reduced execution time due to optimized heterogeneous parallelization approach". Marker: time to finish application.]

Well-balanced solution

Increased performance of cores utilized

But: Other solutions might also be better for other objectives!

Question…

How can we extract multi-objective aware parallelism for heterogeneous MPSoCs automatically?

Input: Sequential C-Code

Output: Parallelized application

Used techniques:

Augmented Hierarchical Task Graph (AHTG)

Genetic Algorithms

High-Level Evaluation Models

Goal:

Extract Task and Pipeline Parallelism

Parallelization Methodology

ANSI C source code (input):

    int main() {
        for (int i = 0; i < NUMAV; ++i) {
            int index = i * DELTA;
            for (int j = 0; j < SLICE; ++j) {
                sample_real[j] = input_signal[index + j] * hamming[j];
                sample_imag[j] = zero;
            }
        }
    }

1. Extract ICD-C IR and Augmented Hierarchical Task Graph (AHTG)

2. Process nodes bottom-up

3. Apply parallelization approaches

4. Create a front of Pareto-optimal solution candidates with different approaches

5. Attach solution candidates and continue with other nodes

6. Select (best) solution candidate as final result

7. Annotate source code

8. Implement parallelism

[Figure: hierarchical task graph with Nodes 1 to 6 and In/Out communication nodes; solution candidates with tasks T1 and T2 are attached to the hierarchical nodes.]

Extraction techniques for homogeneous MPSoCs already exist:

• DATE’12: Multi-Objective Aware Extraction of Task-Level Parallelism Using Genetic Algorithms.

• CODES’12: Automatic extraction of multi-objective aware pipeline parallelism using genetic algorithms.
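
The Pareto-front step of the methodology can be sketched as a plain dominance filter. This is a minimal illustration, not the tool's implementation: it assumes two objectives that are both minimized (execution time and energy), and the type `Candidate` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One solution candidate with two objectives to minimize.
 * The field names are illustrative, not taken from the tool. */
typedef struct {
    double exec_time;
    double energy;
} Candidate;

/* a dominates b if a is no worse in both objectives and better in one. */
static bool dominates(const Candidate *a, const Candidate *b)
{
    return a->exec_time <= b->exec_time && a->energy <= b->energy &&
           (a->exec_time < b->exec_time || a->energy < b->energy);
}

/* Copies the non-dominated candidates of in[0..n) to out and returns
 * their number; out must have room for n entries. */
size_t pareto_front(const Candidate *in, size_t n, Candidate *out)
{
    assert(n == 0 || (in != NULL && out != NULL));
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        bool dominated = false;
        for (size_t j = 0; j < n && !dominated; ++j)
            dominated = (j != i) && dominates(&in[j], &in[i]);
        if (!dominated)
            out[kept++] = in[i];
    }
    return kept;
}
```

The quadratic filter is enough for a sketch; candidates that survive it are the trade-offs among which a final solution would later be selected.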

GA-Approach: Task-Level Chromosome

The chromosome consists of three parts:

• Node-to-Task Mapping (one gene per Node 1..n): T1 T1 T2 T3 ... T4

• Hierarchical Parallel Solution (one gene per Node 1..n): S1,4 S2,3 S3,2 S4,8 ... Sn,4

• Task-to-Processor-Class Mapping (one gene per Task 1..i): P1 P2 ... P1
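
One possible in-memory layout of this chromosome is sketched below. The struct, its field names, the fixed array bounds, and the helper `class_of_node` are assumptions made for illustration; the talk does not describe the actual data structures. Indices are 0-based here.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 16
#define MAX_TASKS 16

/* Hypothetical layout of the three chromosome parts. */
typedef struct {
    size_t num_nodes;
    size_t num_tasks;
    int node_to_task[MAX_NODES];   /* T genes: task index per node        */
    int child_solution[MAX_NODES]; /* S genes: chosen candidate per node  */
    int task_to_class[MAX_TASKS];  /* P genes: processor class per task   */
} TaskChromosome;

/* Processor class on which a node ends up: node -> task -> class. */
int class_of_node(const TaskChromosome *c, size_t node)
{
    assert(node < c->num_nodes);
    int task = c->node_to_task[node];
    assert(task >= 0 && (size_t)task < c->num_tasks);
    return c->task_to_class[task];
}
```

For example, with seven nodes grouped into four tasks and the class genes P1 P1 P2 P3, the two nodes of the third task resolve to the second processor class.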

Chromosome Representation: Task-to-Processor-Class Mapping

[Figure: example with genes P1 P1 P2 P3 for tasks T1, T2, T3, T4. Task graph: T1 (N1, N2), T2 (N3), T3 (N4, N5), T4 (N6, N7). Processor classes shown: ProcClass C1, ProcClass C2, ProcClass C2. Platform: 2 x fast, 1 x middle, 1 x slow.]

GA-Approach: Pipeline Chromosome

The chromosome consists of five parts:

• Node-to-Pipeline Mapping (one gene per Node 1..n): T1 T1 T2 T3 ... T4

• Sub-Tasks Used (Stage 1..s x Subtasks 1..t): U1,1 U1,2 ... Us,t

• Chunk Sizes of Sub-Tasks (Stage 1..s x Subtasks 1..t): C1,1 C1,2 ... Cs,t

• Sub-Task to Processor Class (Stage 1..s x Subtasks 1..t): P1 P2 ... P2

• Scheduling (single gene PT): 1

Chromosome Representation: Node-to-Pipeline Mapping

[Figure: example with genes T1 T1 T2 for nodes N1, N2, N3. Task graph: N1 and N2 in T1, N3 in T2.]

2 nodes in Stage 1, 1 node in Stage 2

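Chromosome Representation: Sub-Tasks Used

[Figure: example with genes U1,1 = 1, U1,2 = 1, U1,3 = 1, U2,1 = 1, U2,2 = 0, U2,3 = 0. Task graph: sub-tasks T1,1, T1,2, and T1,3 each contain N1 and N2; sub-task T2,1 contains N3; T2,2 and T2,3 are unused.]

3 sub-tasks of pipeline stage 1, 1 sub-task of pipeline stage 2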
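
How the "Sub-Tasks Used" genes switch sub-task slots on and off can be sketched as follows. The struct layout, the array bounds, and the helper `active_subtasks` are hypothetical and only mirror the chromosome parts named on the slide; indices are 0-based.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STAGES 4
#define MAX_SUBTASKS 4

/* Hypothetical layout of the pipeline chromosome (without the
 * node-to-pipeline part); all names are illustrative. */
typedef struct {
    size_t num_stages;
    size_t num_subtasks;                       /* sub-task slots per stage */
    int used[MAX_STAGES][MAX_SUBTASKS];        /* U genes: 1 = slot in use */
    int chunk[MAX_STAGES][MAX_SUBTASKS];       /* C genes: chunk sizes     */
    int proc_class[MAX_STAGES][MAX_SUBTASKS];  /* P genes: processor class */
    int scheduling;                            /* PT gene                  */
} PipelineChromosome;

/* Number of sub-task slots that the U genes enable for one stage. */
size_t active_subtasks(const PipelineChromosome *c, size_t stage)
{
    assert(stage < c->num_stages);
    size_t count = 0;
    for (size_t t = 0; t < c->num_subtasks; ++t)
        if (c->used[stage][t])
            ++count;
    return count;
}
```

With the gene values 1 1 1 for the first stage and 1 0 0 for the second, the helper reports three active sub-tasks and one active sub-task respectively. Keeping a fixed number of slots per stage and masking them with U genes keeps the chromosome length constant, which is one plausible reason for such an encoding.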

Chromosome Representation: Chunk Sizes of Sub-Tasks

[Figure: example with genes C1,1 = 6, C1,2 = 4, C1,3 = 2, C2,1 = 12, C2,2 = 2, C2,3 = 6 for sub-tasks T1,1, T1,2, T1,3, T2,1, T2,2, T2,3. T2,2 and T2,3 are marked unused.]

Chromosome Representation: Sub-Task to Processor Class Mapping

[Figure: example with genes P1,1 = 1, P1,2 = 2, P1,3 = 3, P2,1 = 1, P2,2 = 1, P2,3 = 3 and scheduling gene PT = 2. Task graph: sub-tasks T1,1, T1,2, and T1,3 each contain N1 and N2; sub-task T2,1 contains N3. Processor classes shown: ProcClass C1, ProcClass C2, ProcClass C2. Platform: 2 x fast, 1 x middle, 1 x slow.]

Iteration Scheduling: 0 = Chunk-Based, 1 = Interleaved, 2 = Fully Interleaved

Results – JPEG encoder

GA-Approach: Results - GA-Statistics

| Benchmark    | Time* | #Nodes | #Populations | #Individuals | #Mutations | #Crossover | #Sol. (T,P,M) |
|--------------|-------|--------|--------------|--------------|------------|------------|---------------|
| adpcm enc.   | 01:31 | 36     | 1,520        | 153,453      | 30,729     | 104,962    | 63            |
| bound. value | 01:17 | 12     | 644          | 83,973       | 17,900     | 54,964     | 34            |
| compress     | 10:59 | 289    | 8,936        | 706,266      | 141,170    | 464,112    | 30            |
| edge detect  | 02:43 | 105    | 2,872        | 200,371      | 40,002     | 133,096    | 118           |
| filterbank   | 03:24 | 7      | 412          | 50,779       | 12,229     | 44,164     | 47            |
| fir 256      | 00:30 | 13     | 388          | 29,889       | 6,629      | 21,515     | 44            |
| iir 4        | 03:02 | 13     | 852          | 105,224      | 22,631     | 79,206     | 63            |
| jpeg 2000    | 05:13 | 62     | 2,868        | 312,242      | 68,142     | 246,427    | 333           |
| latnrm 32    | 01:11 | 17     | 636          | 53,642       | 11,462     | 39,831     | 54            |
| mult 10      | 01:01 | 36     | 1,060        | 70,855       | 14,635     | 57,226     | 90            |
| spectral     | 02:25 | 51     | 2,260        | 213,230      | 44,696     | 158,624    | 114           |
| average      | 03:01 | 58     | 2,041        | 179,993      | 37,293     | 127,648    | 90            |

* AMD Opteron @ 2.4 GHz

213k created and evaluated individuals

Overall 2.5 minutes: 0.6 ms / individual

Number of solutions ranges from 43 to 333

Most benchmarks processed in 1-2 minutes
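
The per-individual cost can be checked against the table. The calculation below assumes that the "213k individuals" and "2.5 minutes" figures refer to the spectral row (213,230 individuals, time 02:25 read as minutes:seconds); that reading is inferred from the table and is not stated on the slide.

```latex
\[
\frac{2 \cdot 60\,\mathrm{s} + 25\,\mathrm{s}}{213{,}230\ \text{individuals}}
  = \frac{145\,\mathrm{s}}{213{,}230\ \text{individuals}}
  \approx 0.68\,\mathrm{ms}\ \text{per individual}
\]
```

Under that assumption the result agrees with the quoted 0.6 ms per individual when the value is truncated to one decimal place.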

Conclusions & Future Work

Conclusion

New multi-objective aware parallelization approaches for heterogeneous embedded systems presented

Huge optimization potential

High speedups with pipeline technique

Lower energy consumption with task-level technique

Good trade-offs with combination

Future Work

Integrate DFS/DVS in models

Integrate other objectives, e.g., memory consumption

Thank you for your attention!

Questions?