Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome...
Transcript of Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome...
![Page 1: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/1.jpg)
Computer Science 12 Embedded Systems Group
Automatic Extraction of Multi-Objective Aware Parallelism for Heterogeneous MPSoCs
Daniel Cordes, Michael Engel, Olaf Neugebauer, Peter Marwedel
TU Dortmund University Department of Computer Science 12
Dortmund Germany
![Page 2: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/2.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 1 / 19
Introduction
Complexity of embedded software increasing
Multiple cores available in MPSoCs
Manual parallelization error-prone & time-consuming
Heterogeneous MPSoCs can be more efficient than homogeneous
Cores behave differently on same parts of application
Efficient balancing of tasks very difficult
Restrictions of embedded architectures
Low computational power, small memories, battery-driven
Good trade-off between different objectives must be found
Multi-objective aware approach based on Genetic Algorithms to extract parallelism for heterogeneous MPSoCs presented here
![Page 3: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/3.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 2 / 19
Outline
1. Motivating Example
2. Parallelization Methodology
3. GA-based Parallelization Approach
4. Results
5. Conclusion & Future Work
![Page 4: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/4.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 3 / 19
big.LITTLE architecture (ARM)
Performance vs. power of ARM’s big.LITTLE architecture
Many trade-offs possible
Performance
Power
Highest Cortex-A7 Operating Point
Highest Cortex-A15 Operating Point
Lowest Cortex-A15 Operating Point
Lowest Cortex-A7 Operating Point
Cortex-A7Cortex-A15
[arm.com]
![Page 5: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/5.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 4 / 19
Example – Homogeneous architecture
Time
Slow Proc.
Slow Proc.
Slow Proc.
Slow Proc.
Phase 1: for (…)
Phase 5+6: for (...)
Phase 5+6: for (...)
Phase 5+6: for (...)
Phase 5+6: for (...)
Phase 3: convolve2d_2tasks
Phase 3: convolve2d_2tasks
Phase 4: convolve2d_2tasks
Phase 4: convolve2d_2tasks
Phase 2: convolve2d_4tasks
Phase 2: convolve2d_4tasks
Phase 2: convolve2d_4tasks
Phase 2: convolve2d_4tasks
Time to finish application
Timing example of parallelized edge-detect application (UTDSP benchmark suite)
Execution on homogeneous architecture
Well-balanced solution
![Page 6: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/6.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 5 / 19
Time
Slow Proc.
Medium Proc.
Fast Proc.
Fast Proc.
Phase 1: for (…)
5+6
5+6
Phase 5+6
Phase 5+6: for (...)
Phase 3
Phase 3
Phase 4
Phase 4: convolve2d_2tasks
Phase 2
Phase 2
Phase 2
Phase 2: convolve2d_4tasks
Time to finish application
Unused performance and very unbalanced execution behavior
Unused performance and very unbalanced execution behavior
Unused performance
and very unbalanced exec.
behav.
Example – Heterogeneous architecture
Same solution on heterogeneous architecture
Faster cores not well utilized
Unbalanced solution with a lot of wasted performance
![Page 7: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/7.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 6 / 19
Ph. 1
Time
Phase 5+6
Phase 5+6
Phase 5+6
Phase 5+6
Phase 3 Phase 3
Phase 4 Phase 4
Phase 2
Phase 2
Phase 2
Phase 2Slow Proc.
Medium Proc.
Fast Proc.
Fast Proc.
Time to finish application
Reduced execution time due to optimized heterogeneous parallelization approach
Example – Heterogeneous architecture
Well-balanced solution
Increased performance of cores utilized
But: Other solutions might also be better for other objectives!
![Page 8: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/8.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 7 / 19
Question…
How can we extract multi-objective aware parallelism for heterogeneous MPSoCs automatically?
Input: Sequential C-Code
Output: Parallelized application
Used techniques:
Annotated Hierarchical Task Graph (AHTG)
Genetic Algorithms
High-Level Evaluation Models
Goal:
Extract Task and Pipeline Parallelism
[FreeDigitalPhotos.net]
![Page 9: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/9.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 8 / 19
In
Out
In
Out
...
...
In
Out
...
...
...
...
...
...
In
Out
...
...
...
In
Out
...
int main() { for (i = 0; i < NUMAV; ++i) { int index = i * DELTA; for (int j = 0; j < SLICE; ++j) { sample_real[j] = input_signal[index + j] * hamming[j]; sample_imag[j] = zero; } } }
1.
3.
5.
7.
9.
Extract ICD-C IR and Augmented Hierarchical Task Graph (AHTG)
1.
Process nodesbottom-up
2.3.
Create a front of Pareto-optimal solution candidates with different approaches
4.
Attach solution candidates and continue with other nodes
5.
ANSI C
-
Sourc
e code
Node 5
Node 2
In
Out
T2
Node 1
Node 3
Node 5
Node 3
In
Out
T2
Node 3
Node 1
In
Out
Node 6
In
Out
Node 4
Node 2
In
Out
T1
Node 5
Node 3
In
Out
T2
Node 1
Node 3
In
Out
Node 2
Node 4
Node 6
Node 5
Implement parallelism
8.Annotate Source Code
7.Select (best) solution candidate as final result
6.
Apply parallelization approaches
Parallelization Methodology
Extraction techniques for homogeneous MPSoCs
already exist:
• DATE’12: Multi-Objective Aware Extraction of
Task-Level Parallelism Using Genetic
Algorithms.
• CODES’12: Automatic extraction of multi-
objective aware pipeline parallelism using
genetic algorithms.
![Page 10: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/10.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 9 / 19
GA-Approach: Task-Level Chromosome
T1 T1 T2 T3 ... T4 S1,4 S2,3 S3,2 S4,8 ... Sn,4
Node1..n Node1..n
Node-to-Task MappingHierarchical
Parallel SolutionP1 P2 ... P1
Task1..i
Task-to-Processor-Class Mapping
![Page 11: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/11.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 10 / 19
GA-Approach: Task-Level Chromosome
T1 T1 T2 T3 ... T4 S1,4 S2,3 S3,2 S4,8 ... Sn,4
Node1..n Node1..n
Node-to-Task MappingHierarchical
Parallel SolutionP1 P2 ... P1
Task1..i
Task-to-Processor-Class Mapping
In
Out
In
Out
...
...
In
Out
...
...
...
...
...
...
In
Out
...
...
...
In
Out
...
![Page 12: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/12.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 11 / 19
GA-Approach: Task-Level Chromosome
T1 T1 T2 T3 ... T4 S1,4 S2,3 S3,2 S4,8 ... Sn,4
Node1..n Node1..n
Node-to-Task MappingHierarchical
Parallel SolutionP1 P2 ... P1
Task1..i
Task-to-Processor-Class Mapping
ChromosomeRepresentation
Task to processor class Mapping
Task
-to
-P
roce
sso
r-C
lass
Map
pin
g P1
P1
P2
P3
T2
T3
T4
T1 ProcClass C1
T1
N2
N1 T2 N3
ProcClass C2
T3N4
N5
ProcClass C2
T4 N6
N7
2 x fast 1 x middle 1 x slow
![Page 13: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/13.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 12 / 19
GA-Approach: Pipeline Chromosome
T1 T1 T2 T3 ... T4 U1,1 U1,2 ... Us,t
Node1..n Stage1..s x Subtasks1..t
Node-to-Pipeline Mapping Sub-Tasks Used
C1,1 C1,2 ... Cs,t
Stage1..s x Subtasks1..t
Chunk Sizes of Sub-Tasks
P1 P2 ... P2
Stage1..s x Subtasks1..t
Sub-Task to Processor Class
PT
Sched-uling
1
T1
T2 N3
N2
N1
Chromosome Representation
Task Graph
T1
T1
T2
N2
N3
N1
No
de-
to-
Pip
elin
e 2 Nodes in Stage 1,1 Node in Stage 2
![Page 14: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/14.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 13 / 19
GA-Approach: Pipeline Chromosome
T1 T1 T2 T3 ... T4 U1,1 U1,2 ... Us,t
Node1..n Stage1..s x Subtasks1..t
Node-to-Pipeline Mapping Sub-Tasks Used
C1,1 C1,2 ... Cs,t
Stage1..s x Subtasks1..t
Chunk Sizes of Sub-Tasks
P1 P2 ... P2
Stage1..s x Subtasks1..t
Sub-Task to Processor Class
PT
Sched-uling
1
T1,1
N1
Chromosome Representation
Task Graph
Sub
-Tas
ks U
sed
1
1
1
1
0
0
U1,2
U1,3
U2,1
U2,2
U2,3
U1,1
T2,1
T1,3
N1
T1,2
N1
N3
N2N2N2
T2,2 T2,3
Unused
3 Subtasks of Pipeline 1,
1 Subtask of Pipeline 2
![Page 15: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/15.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 14 / 19
GA-Approach: Pipeline Chromosome
T1 T1 T2 T3 ... T4 U1,1 U1,2 ... Us,t
Node1..n Stage1..s x Subtasks1..t
Node-to-Pipeline Mapping Sub-Tasks Used
C1,1 C1,2 ... Cs,t
Stage1..s x Subtasks1..t
Chunk Sizes of Sub-Tasks
P1 P2 ... P2
Stage1..s x Subtasks1..t
Sub-Task to Processor Class
PT
Sched-uling
1
Chromosome Representation
Chunk Sizes
Ch
un
k Si
zes
of
Sub
-Tas
ks 6
4
2
12
2
6
C1,2
C1,3
C2,1
C2,2
C2,3
C1,1
...
T1,1
T1,2
T1,3
T2,1
T2,2
T2,3
Unused
Unused
![Page 16: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/16.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 15 / 19
GA-Approach: Pipeline Chromosome
T1 T1 T2 T3 ... T4 U1,1 U1,2 ... Us,t
Node1..n Stage1..s x Subtasks1..t
Node-to-Pipeline Mapping Sub-Tasks Used
C1,1 C1,2 ... Cs,t
Stage1..s x Subtasks1..t
Chunk Sizes of Sub-Tasks
P1 P2 ... P2
Stage1..s x Subtasks1..t
Sub-Task to Processor Class
PT
Sched-uling
1
Chromosome Representation
Sub
-Tas
k to
P
roce
sso
r C
lass
Map
pin
g ProcClass C1 ProcClass C2 ProcClass C2
2 x fast 1 x middle 1 x slow
1
2
3
1
1
3
P1,2
P1,3
P2,1
P2,2
P2,3
P1,1
2 PT
T2,1
N3
T1,1
N1
N2
T1,2
N1
N2
T1,3
N1
N2
Iteration Scheduling: 0 = Chunk-Based, 1 = Interleaved, 2 = Fully Interleaved
Processor Class Mapping
![Page 17: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/17.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 16 / 19
Results – JPEG encoder
![Page 18: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/18.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 17 / 19
Benchmark Time* #Nodes #Populations #Individuals #Mutations #Crossover #Sol.(T,P,M)
adpcm enc. 01:31 36 1,520 153,453 30,729 104,962 63
bound. value 01:17 12 644 83,973 17,900 54,964 34
compress 10:59 289 8,936 706,266 141,170 464,112 30
edge detect 02:43 105 2,872 200,371 40,002 133,096 118
filterbank 03:24 7 412 50,779 12,229 44,164 47
fir 256 00:30 13 388 29,889 6,629 21,515 44
iir 4 03:02 13 852 105,224 22,631 79,206 63
jpeg 2000 05:13 62 2,868 312,242 68,142 246,427 333
latnrm 32 01:11 17 636 53,642 11,462 39,831 54
mult 10 01:01 36 1,060 70,855 14,635 57,226 90
spectral 02:25 51 2,260 213,230 44,696 158,624 114
average 03:01 58 2,041 179,993 37,293 127,648 90
Benchmark Time* #Nodes #Populations #Individuals #Mutations #Crossover #Solutions
adpcm enc. 01:31 36 1,520 153,453 30,729 104,962 63
bound. value 01:17 12 644 83,973 17,900 54,964 34
compress 10:59 289 8,936 706,266 141,170 464,112 30
edge detect 02:43 105 2,872 200,371 40,002 133,096 118
filterbank 03:24 7 412 50,779 12,229 44,164 47
fir 256 00:30 13 388 29,889 6,629 21,515 44
iir 4 03:02 13 852 105,224 22,631 79,206 63
jpeg 2000 05:13 62 2,868 312,242 68,142 246,427 333
latnrm 32 01:11 17 636 53,642 11,462 39,831 54
mult 10 01:01 36 1,060 70,855 14,635 57,226 90
spectral 02:25 51 2,260 213,230 44,696 158,624 114
average 03:01 58 2,041 179,993 37,293 127,648 90
GA-Approach: Results - GA-Statistics
213k created and evaluated individuals
Overall 2,5 minutes: 0.6 milli sec. / individual
Number of solutions range from 43 - 333
* AMD Opteron @ 2.4 GHz
Most benchmarks processed in 1-2 minutes
![Page 19: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/19.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 18 / 19
Conclusions & Future Work
Conclusion
New multi-objective aware parallelization approaches for heterogeneous embedded systems presented
Huge optimization potential
High speedups with pipeline technique
Less energy consumption with task-level technique
Good trace-offs with combination
Future Work
Integrate DFS/DVS in models
Integrate other objectives, like, e.g., memory consumption
![Page 20: Automatic Extraction of Multi-Objective Aware …...Processor Class PT Sched-uling 1 Chromosome Representation S u b-T a s k t o P r o c e s s o r C l a s s M a p p i n g ProcCla ss](https://reader033.fdocuments.in/reader033/viewer/2022041512/5e28954e87c86a2ba37f6ec9/html5/thumbnails/20.jpg)
MuCoCoS 2013 - 09/07/13
© Daniel Cordes 2013 19 / 19
Thank you for your attention!
Questions?