Copyright 2013, Toshiba Corporation.
An Evaluation of
an Energy Efficient Many-Core SoC
with Parallelized Face Detection
Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and
Takashi Miyamori
Toshiba Corporation, Kawasaki, Japan
Executive Summary
• Future architecture will have many cores
• A key challenge : How to efficiently use them?
• We evaluated techniques to accelerate one
type of important application (face detection)
• Performance scales up to 64 cores
• Energy efficiency is 20x better than desktop
CPU
Outline
• Introduction
• Face Detection using Joint Haar-Like Features
• Architecture of Energy Efficient Many-Core SoC
• Issues in Implementing Parallelized Face Detection
• Implementation and Evaluation of Parallelized Face
Detection
– On the Single Cluster
– On the Dual Cluster
• Conclusion
Two Key Trends in Embedded Systems
• Trend 1 : New applications (e.g. image recognition)
need more computing power while keeping low power
• Trend 2 : New architecture can enable much more
parallelism than before
– Now : 500 GOPS, Heterogeneous Multi-Core (Visconti™2 [ISSCC '12])
– Future : More than 1 TOPS, Heterogeneous Many-Core (Toshiba Many-Core [VLSI Sympo. '12])
[Figure: block diagrams of the two chips, each built from CPUs and accelerators; the many-core has more of both]
Result : A need for efficient and scalable application
performance on many-core
Power and Performance Target of our Many-Core
[Figure: power consumption [W] (axis ticks 0.1 to 1000) versus performance [GOPS] (axis ticks 10 to 1000)]
• High Performance Multi & Many-Cores for HPC (GHz cores) : Tilera® Tile64, Intel® 80-Tile, Cell Broadband Engine™, Intel® SCC
• Energy Efficient Embedded Multi-Cores : Toshiba Multi-Core (Visconti™2), ARM® Cortex™-A5, Renesas RP-X
• Less than 3W is needed for embedded applications
• Our Target Area : Toshiba Energy Efficient Many-Core (64 Core)
Many-Core Scalability
[Figure: performance versus the number of cores, with an "Ideal" curve and an "Actual?" curve]
Can we achieve good performance scaling-up
on face detection?
Face Detection
• The image is scanned with a 25 x 25 pixel ROI (Region of Interest)
• For each ROI : Check if a face exists or not
– 1st ROI : 25 pixels x 25 pixels
– 2nd ROI : shifted by 2 pixels
– 3rd ROI : shifted by 4 pixels from the 1st ROI
– ...
– Nth ROI
– (N+1)th ROI : shifted by 2 pixels
– Target ROI
Joint Haar-Like Features [ICCV '05]
• Extension to the widely used Viola and Jones method
[CVPR '01] (using Haar-like features)
• Haar-like feature : Difference of image intensities between
blue and red rectangles
– Example : Eye is darker than cheek
• Each Haar-like feature in the ROI (Region Of Interest) is compared to its threshold
– If greater than : 1, otherwise : 0
• The resulting bits (e.g. 1 1 0) form the Joint Haar-Like Feature
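The feature computation described above can be sketched as follows. This is a minimal illustration, not the implementation used on the chip: the integral image (summed-area table) is the usual way to get rectangle sums for Haar-like features, but the slides do not say how the sums are computed, and the `Rect` layout, the function names and the toy image are all hypothetical.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical layout


def integral_image(img: List[List[int]]) -> List[List[int]]:
    """Summed-area table with one extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def rect_sum(sat: List[List[int]], r: Rect) -> int:
    """Sum of pixel intensities inside rectangle r, using four table lookups."""
    x, y, w, h = r
    return sat[y + h][x + w] - sat[y][x + w] - sat[y + h][x] + sat[y][x]


def haar_feature(sat: List[List[int]], a: Rect, b: Rect) -> int:
    """Difference of summed intensities between two rectangles."""
    return rect_sum(sat, a) - rect_sum(sat, b)


def joint_feature(sat: List[List[int]],
                  features: Sequence[Tuple[Rect, Rect]],
                  thresholds: Sequence[int]) -> int:
    """Binarize each Haar-like feature (1 if greater than its threshold,
    otherwise 0) and pack the bits into one integer."""
    index = 0
    for (a, b), t in zip(features, thresholds):
        bit = 1 if haar_feature(sat, a, b) > t else 0
        index = (index << 1) | bit
    return index


# Toy 4x4 image (made up): bright top-left, dark top-right, grey bottom half.
img = [[10, 10, 1, 1],
       [10, 10, 1, 1],
       [5, 5, 5, 5],
       [5, 5, 5, 5]]
sat = integral_image(img)
features = [((0, 0, 2, 2), (2, 0, 2, 2)),
            ((0, 0, 2, 2), (0, 2, 2, 2)),
            ((2, 0, 2, 2), (0, 0, 2, 2))]
print(bin(joint_feature(sat, features, [0, 0, 0])))  # 0b110
```

The three bits 1 1 0 mirror the example on the slide; with three features the joint feature takes one of eight values.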
Classifier using Joint Haar-Like Features
[Figure: each Joint Haar-Like Feature selects an entry in a table ("Face?"); the table outputs are accumulated to decide Face or Not Face]
• Table entry : Possibility of face or not face, weight of the feature
• Positions of features and tables are learned in advance
and stored in the dictionary
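The accumulation step can be sketched as a table lookup per joint feature followed by a sum. This is an assumption-laden sketch: the slide does not give the table format, so here each table entry is a hypothetical signed score (positive means face-like, magnitude acts as the weight), and `classify_roi`, `tables` and `threshold` are illustrative names only.

```python
from typing import Sequence


def classify_roi(joint_indices: Sequence[int],
                 tables: Sequence[Sequence[float]],
                 threshold: float = 0.0) -> bool:
    """Look up one score per joint feature, accumulate the scores,
    and report a face when the total exceeds the threshold."""
    score = 0.0
    for index, table in zip(joint_indices, tables):
        score += table[index]
    return score > threshold


# Two made-up tables for 2-bit joint features (four entries each).
tables = [[-1.0, 0.5, 0.2, 2.0],
          [0.3, -0.7, 1.0, -2.0]]
print(classify_roi([3, 1], tables))  # True  (2.0 - 0.7 > 0)
print(classify_roi([0, 3], tables))  # False (-1.0 - 2.0 <= 0)
```

In a real detector the tables would come from the learned dictionary mentioned on the slide.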
Characteristics of Face Detection
• Face detection for each
ROI can be executed in
parallel
• There are a lot of ROIs
in an image
– 3M ROIs when image size
is 4000x3200
• A lot of coarse grain thread parallelism based
on ROIs
– Overhead of thread scheduling can be minimized
Many-core is good for face detection !
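The "3M ROIs" figure can be checked with a short calculation. The sketch below assumes a single scan scale, a 25 x 25 pixel ROI, and a 2 pixel step in both directions; the horizontal step is shown on the earlier slides, while the vertical step and the helper name `count_rois` are assumptions.

```python
def count_rois(width: int, height: int, roi: int = 25, step: int = 2) -> int:
    """Number of ROI positions when a square window slides over the image."""
    cols = (width - roi) // step + 1   # window positions along one line
    rows = (height - roi) // step + 1  # number of ROI lines
    return cols * rows


print(count_rois(4000, 3200))  # 3156944, i.e. about 3M
```

Under these assumptions a 4000x3200 image gives 1988 positions per line and 1588 lines, which agrees with the order of magnitude quoted on the slide.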
Chip Micrograph and Features
Technology : 40nm LP Process
Interconnect : 8 metal (Cu)
Chip Size : 15.0mm x 14.0mm
Cluster Size : 7.4mm x 5.7mm
Transistors : 87.5 Million
Cluster Frequency : 333MHz, 1.1V
Package : 1369-pin FCBGA
[Chip micrograph: Cluster 0 and Cluster 1, each with cores and a 2MB L2 cache (Bank0 to Bank3), plus Reconfigurable Engines and two DDR3 I/F blocks]
Structure of Many-Core Cluster
• Tree-based NoC
– Leaf nodes: Core
– Root nodes: L2 cache banks
• Four L2 Cache Banks (Bank0 to Bank3), 512 KB each
• Cluster Control Module
– Interrupt Controller
– Hardware Semaphores
[Figure: tree of routers (labeled 1, 2 and 3) connecting groups of four cores (leaves) to the four L2 cache banks (roots)]
Many-Core SoC Architecture
[Block diagram with the following blocks]
• Many-Core Cluster 0 : Core x 32, L2 Cache 2MB
• Many-Core Cluster 1 : Core x 32, L2 Cache 2MB
• Reconfigurable Engine x 2
• ARM Cortex-A9 x 2
• Image Recognition & Processing Accelerators
• SRAM 512KB x 4
• DDR3 Controller x 2 (10.7GB/s)
• External Interface : PCIe x 1
• Video In x 4, Video Out, Peripherals
Core : Media Processing Block (MPB)
• 3-Way VLIW Processor
• L1 Instruction Cache: 32KB
• L1 Data Cache: 16KB
• 333 MHz
Exploits multi-grain parallelism
• Thread level by many cores
• Instruction level by VLIW architecture
• Data level by SIMD instructions
[Block diagram: 3-Way VLIW Processor (MPB) with a 32b RISC Core and a 2-way SIMD Coprocessor (64b), L1 Inst Cache 32KB, L1 Data Cache 16KB, Debug Module, NoC I/F, Address Protect Unit, Address Translate Unit]
Issues in implementing parallelized face detection
• High coarse-grain parallelism: Good for Parallelization
– There are enough ROIs for many cores to exploit
• Imbalanced workload: Bad for Processor Utilization
– The workload of an ROI where a face exists is higher than that of an
ROI without a face
Implementation of parallelized face-detection
– Minimize the number of threads in order to reduce synchronization cost
• Allocate one thread to one core
– Find a thread partitioning that balances the workload of threads
– Reduce data bandwidth (L1$-L2$ and L2$-DDR3)
Implementation on the Single Cluster
• We implemented the face detection with two
methods to allocate image to cores
– Allocating Cyclically
– Splitting Equally
(1) Allocating Cyclically
[Figure: lines of ROIs in the image assigned in turn to Core0, Core1, Core2, ..., Core31]
This way allocates lines to each core cyclically
Effective in balancing workload
(2) Splitting Equally
[Figure: image of height Height divided into bands of Height/32, one each for Core0, Core1, Core2, ..., Core31]
This way divides the image evenly
Effective to reduce data size read by each core
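The two allocation methods can be sketched as index arithmetic over ROI lines. This is a minimal sketch under the assumption that the unit of allocation is one line of ROIs; the function names are hypothetical and not taken from the slides.

```python
from typing import List


def allocate_cyclically(num_lines: int, num_cores: int) -> List[List[int]]:
    """Core c gets lines c, c + num_cores, c + 2 * num_cores, ..."""
    return [list(range(core, num_lines, num_cores)) for core in range(num_cores)]


def split_equally(num_lines: int, num_cores: int) -> List[List[int]]:
    """Core c gets one contiguous band of about num_lines / num_cores lines."""
    bounds = [core * num_lines // num_cores for core in range(num_cores + 1)]
    return [list(range(bounds[c], bounds[c + 1])) for c in range(num_cores)]


print(allocate_cyclically(8, 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(split_equally(8, 4))        # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With cyclic allocation, neighbouring lines go to different cores, so a face that spans several lines is spread over several cores; with equal splitting, each core reads only its own band.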
Images for Evaluation
• High Resolution Images (5.76-12.7Mp) including many faces
No. | Resolution | Number of Faces
0 | 4000x1440 | 30
1 | 3000x4082 | 37
2 | 4083x3062 | 78
3 | 4094x3107 | 148
4 | 3568x2568 | 9
5 | 3568x2568 | 10
[Figure: Face Detection Result of Image 4]
Evaluation Board
Many-Core SoC
(Fan-less Cooling)
I/O and switches for
evaluation
Relative Performance on Single Cluster
[Figure: relative performance (0 to 35) versus number of cores (1, 2, 4, 8, 16, 32) for img.0 to img.5 and ideal, one panel each for Allocating Cyclically and Splitting Equally; annotated speedups 15.5x, 30x, 11x and 21x]
Average Relative Performance on Single Cluster
[Figure: average relative performance (0 to 35) versus the number of cores (0 to 40) for Allocating Cyclically, Splitting Equally and ideal; annotated speedups 11x, 21x, 15.5x and 30x]
With Allocating Cyclically,
performance scales up to 32 cores
Execution Time of the Fastest and Slowest Cores
[Figure: execution time (sec, 0 to 45) of the fastest and slowest cores for images 0 to 5, one panel each for Allocating Cyclically and Splitting Equally; annotated ratios 11x and 1.1x]
Processor Utilization
• Allocating Cyclically : 90 ~ 95%
• Splitting Equally : 55 ~ 75%
Low processor utilization deteriorates
the performance of Splitting Equally
[Figure: processor utilization (%) per image number (0 to 5) for Allocating Cyclically and Splitting Equally]
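The effect of workload imbalance on utilization can be illustrated with a toy model. The slides do not define how processor utilization was measured; the sketch below assumes it is the average busy time of the cores divided by the busy time of the slowest core, and the per-line costs are made up (lines containing faces cost ten times more).

```python
from typing import Sequence


def utilization(line_costs: Sequence[int],
                assignment: Sequence[Sequence[int]]) -> float:
    """Average core busy time divided by the slowest core's busy time."""
    per_core = [sum(line_costs[i] for i in lines) for lines in assignment]
    return sum(per_core) / (len(per_core) * max(per_core))


# Toy workload (made up): 64 lines, lines 16..31 contain faces and cost 10x.
costs = [10 if 16 <= i < 32 else 1 for i in range(64)]
cores = 4
cyclic = [list(range(c, 64, cores)) for c in range(cores)]
bands = [list(range(c * 16, (c + 1) * 16)) for c in range(cores)]

print(utilization(costs, cyclic))  # 1.0
print(utilization(costs, bands))   # 0.325
```

In this toy case the cyclic assignment spreads the expensive lines over all cores, while the banded assignment leaves three cores idle for most of the run. The measured gap on the slide is smaller, since real images do not concentrate all faces in one band.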
Bandwidth of L2 Cache and DDR3
• L1-L2 bandwidth is nearly the same
– L1 cache is not large enough to store an ROI line
• For L2-DDR3, Allocating Cyclically is better
– All cores access the same small area at the same time
[Figure: relative amount of transferred data (0 to 1.2) for L1-L2 and L2-DDR3, Allocating Cyclically versus Splitting Equally]
Implementation on Dual Cluster
• Each cluster has its own L2 cache and shares DDR3
– Because DDR3 bandwidth is narrower than that of the L1 and L2 caches, reducing
traffic between L2 cache and DDR3 is important
• We implemented two methods
– Allocating Cyclically
– Bisection
[Figure: two clusters, each with MPB x 32 (every MPB has its own L1 cache) and one L2 cache, both connected to a shared DDR3]
(1) Allocating Cyclically
[Figure: lines of ROIs in the image assigned in turn to Core0, Core1, Core2, ..., Core31 of Cluster0 and Core0, Core1, Core2, ..., Core31 of Cluster1]
This way is the same as that of a single cluster
Effective in balancing workload
(2) Bisection
[Figure: image split at Height/2 into two blocks, one for Cluster0 (Core0, Core1, Core2, ..., Core31) and one for Cluster1 (Core0, Core1, Core2, ..., Core31)]
This way divides the image into two blocks
Each cluster processes one block
Effective to reduce data size read by each cluster
In each block, Allocating Cyclically is used
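The two dual-cluster methods can be sketched in the same style as the single-cluster case. This is an illustration only: the unit of allocation is assumed to be one line of ROIs, the mapping of the 64 cyclic slots onto the two clusters (first 32 slots to Cluster0) is an assumption, and the function names are hypothetical.

```python
from typing import List

Assignment = List[List[List[int]]]  # [cluster][core] -> list of line indices


def cyclic_dual(num_lines: int, cores_per_cluster: int = 32) -> Assignment:
    """Allocate lines cyclically over all cores of both clusters."""
    total = 2 * cores_per_cluster
    flat = [list(range(core, num_lines, total)) for core in range(total)]
    return [flat[:cores_per_cluster], flat[cores_per_cluster:]]


def bisection(num_lines: int, cores_per_cluster: int = 32) -> Assignment:
    """Give each cluster one half of the image, then allocate the lines of
    that half cyclically over the cluster's cores."""
    half = num_lines // 2
    blocks = [range(0, half), range(half, num_lines)]
    return [[list(block[core::cores_per_cluster])
             for core in range(cores_per_cluster)]
            for block in blocks]


print(cyclic_dual(8, 2))  # [[[0, 4], [1, 5]], [[2, 6], [3, 7]]]
print(bisection(8, 2))    # [[[0, 2], [1, 3]], [[4, 6], [5, 7]]]
```

With Bisection each cluster only touches its own half of the image, which is why it reads less data per cluster; with Allocating Cyclically both clusters touch every region, which balances the load when faces are unevenly distributed.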
Performance of Dual Cluster (64 Cores)
[Figure: relative performance (0 to 70) per image number (0 to 5) for Allocating Cyclically and Bisection, against Ideal (64x); annotated values 61x and 42x]
By Allocating Cyclically,
performance scales up to 64 cores
Execution Time of the Fastest and Slowest Cores
[Figure: execution time (sec, 0 to 18) of the fastest and slowest cores for images 0 to 5, one panel each for Allocating Cyclically and Bisection; annotated ratios 1.3x and 2.8x]
Processor Utilization
• Allocating Cyclically : 87 ~ 95%
• Bisection : 66 ~ 91%
Low processor utilization deteriorates
the performance of Bisection
[Figure: processor utilization (%) per image number (0 to 5) for Allocating Cyclically and Bisection]
DDR3 Bandwidth
Utilized bandwidth is 750MB/s (only 7% of the maximum, 10.7GB/s)
Memory bandwidth is not a bottleneck even when two clusters operate.
[Figure: DDR3 bandwidth (MB/s, 0 to 800) per image, Bandwidth in Allocating Cyclically]
Power Consumption
Typical Process, Room Temperature, using Allocating Cyclically
[Figure: power (mW, 0 to 2500) versus the number of cores (1, 2, 4, 8, 16, 32, 64), broken down into Clusters, Bus, DDR3, IO and Other]
• 2 Clusters : 1.18W
• SoC : 2.21W
Our many-core SoC achieves less than 3W
Comparison with Desk-Top CPU
• Compared with Desk-Top CPU
(Core™-i7-3820: 3.6GHz, 4 Cores, 8 Threads)
• TDP of Core™-i7-3820 (130W) is used for calculating energy
[Figure: relative performance (0.0 to 1.2), Many-core versus Core™-i7-3820]
[Figure: energy (J, 0 to 500), Many-core 16.1 versus Core™-i7-3820 443.5; annotated "20x better"]
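The energy figures follow from energy = power x time. The sketch below only does arithmetic on the plotted values (16.1 J, 443.5 J) and the stated TDP (130W); the helper name is hypothetical. The slide's headline is "20x better"; dividing the two plotted energies gives a larger number, so the sketch makes no attempt to reproduce the headline.

```python
def implied_runtime_s(energy_j: float, power_w: float) -> float:
    """Runtime implied by an energy figure at constant power (E = P * t)."""
    return energy_j / power_w


MANY_CORE_J = 16.1     # plotted energy of the many-core SoC
DESKTOP_J = 443.5      # plotted energy of the Core i7-3820
DESKTOP_TDP_W = 130.0  # TDP used on the slide for the energy calculation

print(round(implied_runtime_s(DESKTOP_J, DESKTOP_TDP_W), 2))  # 3.41 (seconds)
print(round(DESKTOP_J / MANY_CORE_J, 1))                      # 27.5 (energy ratio)
```

Because TDP is an upper bound rather than a measured power, the desktop energy is an estimate, as the slide itself notes.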
Conclusion
• Future architecture will have many cores
– A key challenge : How to efficiently use them?
• We evaluated the many-core SoC with parallelized face
detection
– Many-core is suited for the face detection because it
exploits ROI based coarse-grained parallelism efficiently
• Scale up by 30x (32 cores) to 60x (64 cores)
• Balancing workload is important
• Power consumption is only 2.21W under actual
workload : enables fan-less cooling
– Our many-core SoC is remarkably energy efficient in
image recognition applications
• 20x better than the desk-top CPU
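The scaling figures in the conclusion can be read as parallel efficiency, i.e. speedup divided by the number of cores. This is a standard metric rather than one named on the slides, and the helper name is hypothetical.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / cores


print(parallel_efficiency(30, 32))  # 0.9375
print(parallel_efficiency(60, 64))  # 0.9375
```

Both configurations stay at roughly 94% of ideal, which is in line with the 87 to 95% processor utilization reported for Allocating Cyclically.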
Thank you!