Nasibeh Teimouri, Hamed Tabkhi, Gunar Schirner
Summer 2014
Quantitative Evaluation of MPSoC with Many Accelerators: Challenges and Potential Solutions
Outline
• Heterogeneous MPSoCs
– Specialization is a growing trend
• Accelerator-rich MPSoC architecture
– MPSoCs with many accelerators
• Previous work
• Quantitative exploration of current accelerator-rich MPSoC
– Huge memory demand
– Immense communication traffic
– Overwhelming synchronization
• The proposed accelerator-centric architecture template
– Implementation
– Evaluation
Heterogeneous MPSoCs
[Figure: block diagram of a heterogeneous MPSoC. Four cores (0.8 GHz, interrupt lines INT0 to INT31, I-L1/D-L1 and L2 caches, running OS/drivers plus Alg2 and Alg3), a GPU running Alg5[0] to Alg5[63], two function-level processors running Alg1 and Alg4, DMA engines, SRAM, SDRAM controllers with SDRAM, low-performance peripherals, a bridge, a transducer and an IP component, attached as bus masters (M) and slaves (S).]
• Heterogeneous MPSoCs
– Integrated solutions for a group of evolving markets
• ILP (e.g. CPU, DSP, or even GPU)
+ Flexibility
- Power dissipation
• Custom-HW accelerators (ACCs) for compute-intensive kernels
+ Power efficiency
- Cost
- Inflexibility
What is the trend?
Specialization as an MPSoC trend
• Increasing demand for high-performance, low-power computing
– Market examples:
• Embedded vision
• Software Defined Radio (SDR)
• Cyber-Physical Systems (CPS)
– Tens of billions of operations per second
– Less than a few watts of power
• Trend: domain-specific specialization
– Proliferating number of ACCs in systems: ACC-rich MPSoC
[Figure: ACC-rich MPSoC. ILP 0 with its cache and ACC 0 to ACC N, each with its own DMA, attached to an ACC shared/LLC memory and a memory interface.]
Principles of current accelerator-rich MPSoC
• ILP + HW-ACC composition
– HW-ACC
• Executes compute-intensive kernels/apps
– ILP
• Executes the remaining applications
• Orchestrates the HW-ACCs and coordinates data movement
– On-chip scratchpad memory (SPM)
• Keeps data between ILP and ACCs on-chip
– Avoids costly off-chip memory accesses
Sequence for one job (participants: ILP, Input, DMA, ACC1, Output):
1. Input Done
2. DMA Start
3. DMA Done
4. DMA Start
5. DMA Done
6. ACC1 Start
7. ACC1 Done
8. DMA Start
9. DMA Done
10. DMA Start
11. DMA Done
12. Output Start
13. Output Done
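The 13-step sequence above can be replayed in a few lines of Python to see how much of it lands on the ILP. This is only a sketch: the event names mirror the slide, while the `Event` tuple and the rule that every "Done" event raises one ILP interrupt and every "Start" is one ILP command are assumptions made for illustration.

```python
from collections import namedtuple

# One entry per step of the single-accelerator sequence on the slide.
# The tuple layout is hypothetical; only the step order comes from the slide.
Event = namedtuple("Event", ["step", "source", "kind"])

SEQUENCE = [
    Event(1, "Input", "done"),
    Event(2, "DMA", "start"),
    Event(3, "DMA", "done"),
    Event(4, "DMA", "start"),
    Event(5, "DMA", "done"),
    Event(6, "ACC1", "start"),
    Event(7, "ACC1", "done"),
    Event(8, "DMA", "start"),
    Event(9, "DMA", "done"),
    Event(10, "DMA", "start"),
    Event(11, "DMA", "done"),
    Event(12, "Output", "start"),
    Event(13, "Output", "done"),
]


def ilp_interrupts(sequence):
    """Count 'done' events, assuming each one interrupts the ILP once."""
    return sum(1 for event in sequence if event.kind == "done")


def ilp_commands(sequence):
    """Count 'start' events, assuming the ILP issues each one itself."""
    return sum(1 for event in sequence if event.kind == "start")


if __name__ == "__main__":
    print("interrupts per job:", ilp_interrupts(SEQUENCE))
    print("commands per job:", ilp_commands(SEQUENCE))
```

Under these assumptions a single job through a single accelerator already costs the ILP seven interrupts and six commands, which is the synchronization load the later slides quantify.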
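[Figure: current architecture with one accelerator. ILP with cache, on-chip memory, DMAs, ACC1 with two SPMs, and input and output interfaces with one SPM each (stream in, stream out), all on the system fabric; interrupt and control lines connect to the ILP.]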
MPSoC with many accelerators
• Scratch Pad Memory (SPM)
– 2 per accelerator, 1 per I/O interface
– To hold the input job
• Control and interrupt lines
– ACC configuration
• Centralized vs. dedicated DMA
– Stream data transfer
[Figure: current architecture with many accelerators. ACC1 to ACCn with two SPMs each, input and output interfaces with one SPM each (stream in, stream out), on-chip memory, DMAs and the ILP with cache on the system fabric; interrupt and control lines connect to the ILP.]
Challenges with increasing number of interrupts
1- Memory requirement
- Two SPMs per ACC
- One SPM per interface
- Shared memory to hold the data handed over between accelerators
2- High volume of traffic over the system fabric
- No point-to-point connections between ACCs
- DMA data transfers are required
3- ILP synchronization
- Among accelerators, I/O interfaces and DMA transfers
We NEED to evaluate this architecture quantitatively!
Previous work on composing ACCs
• Composing bigger applications out of many accelerators, as in Accelerator-Rich CMPs [1] and CHARM [2]
– Imposes considerable traffic and requires considerable on-chip buffers for accelerator data exchange
– Loads the ILP with orchestrating the system composed of accelerators
1. J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman. Architecture support for accelerator-rich CMPs. In Proceedings of the 49th Annual Design Automation Conference, DAC '12, pages 843–849, 2012.
2. M. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks. The accelerator store framework for high-performance, low-power accelerator-based systems. Computer Architecture Letters, 9(2):53–56, Feb. 2010.
Quantitative exploration of accelerator-rich MPSoC: WHY and HOW
• Applicability of quantitative exploration
– Quantifying the potential challenges
– Exposing the ACC-rich bottlenecks as the # of ACCs increases
– Helping system architects properly size the system knobs (SPM sizes, # of ACCs, communication bandwidth)
– Motivating our proposed architecture-template solution
• Approaches to quantitative exploration
1- First-order mathematical analysis
2- Simulation-based analysis of ACC-rich MPSoC
Exploration overview
• Assumptions
– One HD-resolution frame as input
• Divided into smaller jobs
– Memory on chip
• Avoid off-chip memory for now
• Exploration steps
– Memory requirement as the # of ACCs increases
– Sizing the SPMs to satisfy the memory budget
– Interrupt load on the ILP
Memory size analysis (calculation based)
[Figure: memory size (KB, axis 0 to 18000) vs. # accelerators (0 to 40), one curve per job size: 1, 2, 4, 8, 16, 32, 64 and 128 KB.]
• Memory size = SPMs + shared memory
• Each SPM holds one job
• Job size determines the minimum size of the SPMs and the shared memory
• Shared memory holds all jobs exchanged among ACCs
• Sizing the job with respect to the memory budget
– More ACCs require more memory
– A bigger job needs more memory
– The memory budget is limited
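The first-order memory calculation can be written down as a short function. This is a sketch under stated assumptions, not the authors' exact model: it uses the two SPMs per ACC and one SPM per interface from the earlier slides, and it assumes two I/O interfaces and a shared memory that holds one job per hand-over (`num_accs + 1` jobs). The function and parameter names are made up for illustration.

```python
def memory_size_kb(num_accs, job_kb, num_interfaces=2):
    """First-order on-chip memory estimate in KB.

    Assumptions (not stated on the slide): two I/O interfaces, and a shared
    memory that holds one job per hand-over, i.e. num_accs + 1 jobs.
    """
    # Every SPM holds exactly one job: two SPMs per ACC, one per interface.
    spm_kb = (2 * num_accs + num_interfaces) * job_kb
    # Shared memory buffers the jobs exchanged among ACCs and interfaces.
    shared_kb = (num_accs + 1) * job_kb
    return spm_kb + shared_kb


if __name__ == "__main__":
    for accs in (1, 10, 20, 40):
        print(accs, "ACCs, 128 KB jobs:", memory_size_kb(accs, 128), "KB")
```

With these assumptions the estimate grows linearly in both the number of accelerators and the job size; for example 40 accelerators with 128 KB jobs need 15744 KB.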
[Figure: an input frame enters the MPSoC and an output frame leaves it. Inside, ACC1 and ACC2 (two SPMs each), the input and output interfaces (one SPM each) and the shared memory sit on the system fabric.]
Job sizing (calculation based)
• The smaller the memory, the smaller the job
• The more accelerators, the smaller the job
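Job sizing is the memory calculation turned around: given a fixed memory budget, find the largest job that still fits. The sketch below assumes two SPMs per ACC, one SPM per interface, two interfaces, a shared memory holding `num_accs + 1` jobs, and job sizes restricted to powers of two. The last three are illustrative assumptions, as are the function and parameter names.

```python
def job_size_kb(num_accs, memory_kb, num_interfaces=2):
    """Largest power-of-two job size (KB) that fits the memory budget.

    Assumed buffer count: two SPMs per ACC, one SPM per interface and
    num_accs + 1 job slots in shared memory, each holding one job.
    """
    buffers = 2 * num_accs + num_interfaces + (num_accs + 1)
    max_job_kb = memory_kb // buffers
    if max_job_kb < 1:
        return 0  # budget too small for even a 1 KB job
    # Round down to a power of two (an assumption of this sketch).
    return 1 << (max_job_kb.bit_length() - 1)


if __name__ == "__main__":
    for accs in (1, 4, 9, 18, 34, 60):
        print(accs, "ACCs, 0.5 MB:", job_size_kb(accs, 512), "KB")
```

Under these assumptions a 0.5 MB budget gives 64 KB jobs for one accelerator and 2 KB jobs for 60 accelerators, which agrees with the job-size table on the interrupt slide later in the deck.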
[Figure: job size (KB, axis 0 to 4500) vs. # accelerators (0 to 60), one curve per memory size: 0.5, 1, 4 and 16 MB.]
• A smaller job size issues more interrupts to the ILP
– The ILP is responsible for synchronizing the ACCs' transactions
– Count the number of interrupts
– Measure the ILP load for responding to interrupts
Simulation platform
• Using the SpecC SLDL to develop a simulation model
– Scalable # of ACCs
» Different/same data rates
– ILPs
– DMAs
– Memories (SPM, shared memory)
» On-chip and off-chip memory
• Generating the ACC-rich simulation model
– BFM of the AMBA-AHB communication fabric
– ARM 9 (ISA v6) for ILP execution
• Priority based
– Dedicated interrupt lines
– Centralized DMAs
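[Figure: simulation model after SCE refinement. A Stimulus feeds the input I/O interface through a first queue and a Monitor reads from the output I/O interface through a last queue; ACC0, ACC1, memory, DMA and the ILP share the system fabric.]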
# of interrupts when scaling the # of ACCs (simulation based)
• # of interrupts vs. the number of accelerators
– For different sizes of on-chip memory
[Figure: # interrupts (axis 0 to 200000) vs. # accelerators (0 to 60), one curve per memory size: 0.5, 1, 4 and 8 MB.]
• A smaller job size sends more interrupts to the ILP
– Significant utilization, or even oversaturation, of the ILP just from driving the accelerators
Job size by # of ACCs and memory size:
#ACC   0.5 MB   1 MB    4 MB    8 MB
1      64K      128K    512K    1M
4      32K      64K     256K    512K
9      16K      32K     128K    256K
18     8K       16K     64K     128K
34     4K       8K      32K     64K
60     2K       4K      16K     32K
Smaller memory or more ACCs -> smaller job
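To see why smaller jobs translate into more interrupts, the count can be estimated as jobs per frame times interrupts per job. The sketch below is a rough model, not the SpecC simulation: it assumes an HD frame of 1920 x 1080 bytes (one byte per pixel), one interrupt per "Done" event, and two DMA transfers for each of the `num_accs + 1` hand-overs through shared memory. For one accelerator that gives the seven "Done" events of the single-accelerator sequence earlier in the deck. All names are hypothetical.

```python
import math


def interrupts_per_frame(num_accs, job_bytes, frame_bytes=1920 * 1080):
    """Rough ILP interrupt count for one frame.

    Assumptions: 1 byte per pixel, one interrupt per completion event, and
    every hand-over goes SPM -> shared memory -> SPM (two DMA transfers).
    """
    jobs = math.ceil(frame_bytes / job_bytes)
    io_done = 2                       # input done + output done
    acc_done = num_accs               # one completion per accelerator
    dma_done = 2 * (num_accs + 1)     # two transfers per hand-over
    return jobs * (io_done + acc_done + dma_done)


if __name__ == "__main__":
    # Job sizes taken from the 0.5 MB column of the table above.
    for accs, job_kb in ((1, 64), (4, 32), (9, 16), (18, 8), (34, 4), (60, 2)):
        print(accs, "ACCs:", interrupts_per_frame(accs, job_kb * 1024))
```

In this model the count rises on two fronts at once: more accelerators mean more events per job, and the smaller jobs they force mean more jobs per frame.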
Communication overhead analysis (calculation based)
• Communication overhead = data exchanged through the system fabric
• More ACCs, heavier traffic on system fabric
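A first-order traffic estimate follows from counting how often the frame crosses the system fabric. The sketch assumes that every hand-over (input to ACC1, ACC to ACC, last ACC to output) goes through shared memory and therefore needs two DMA transfers of the full frame data. This routing assumption and the function name are illustrative, not taken from the slides.

```python
def fabric_traffic_mb(num_accs, frame_mb):
    """Data moved over the system fabric for one frame, in MB.

    Assumes num_accs + 1 hand-overs, each made of two DMA transfers
    (SPM -> shared memory, shared memory -> SPM) of the whole frame.
    """
    transfers = 2 * (num_accs + 1)
    return transfers * frame_mb


if __name__ == "__main__":
    for accs in (1, 16, 32, 64):
        print(accs, "ACCs:", fabric_traffic_mb(accs, 2.0), "MB")
```

The estimate is linear in the number of accelerators because no data can bypass the fabric; that is the cost of having no point-to-point links between ACCs.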
[Figure: communication traffic (MB, axis 0 to 300) vs. # accelerators (0 to 70).]
Exploration Summary
• Problems associated with the current accelerator-rich architecture
– On-chip memory requirements
– ILP synchronization load
– Heavy communication traffic on the system fabric
• Demand for an improved ACC-centric design
– Tackling the challenges of the current ACC-rich architecture
The goals of the proposed ACC-centric architecture
• The proposed solution
– An autonomous accelerator chain
• Relieves the ILP's synchronization load
– Point-to-point connections between accelerators
• No need for a larger SPM per accelerator
• No frequent DMA data transfers
• No heavy traffic on the system fabric
Simulation platform
• Modifying the developed SpecC model to support an autonomous chain of accelerators
– Gateways to manage the chain
• Creating another ACC-rich simulation model
– BFM of the AMBA-AHB communication fabric
– ARM 9 (ISA v6) for ILP execution
– Dedicated interrupt lines from the gateways to the ILP
– Centralized DMA
[Figure: simulation model after SCE refinement. A Stimulus supplies input data to the input interface and a Monitor collects output data from the output interface; the InGateway, the chain ACC0 to ACCn and the OutGateway sit between them, with memory and the ILP on the system fabric.]
The proposed accelerator-centric architecture template
• Gateways controlled by the ILP manage the whole chain of accelerators
– SPM to receive/send data from/to memory
– Control lines from the ILP to the gateways for configuration
– Interrupt lines from the gateways to the ILP
– Point-to-point connections in the chain with small buffers in between
– The chain works independently of the ILP
• Point-to-point accelerator connections
– Little memory required
– Few DMA data transfers
• Autonomous ACC chain
– Light ILP synchronization load no matter how many accelerators
[Figure: proposed architecture. Input gateway, ACC1, ACC2 and output gateway form a point-to-point chain (stream in, stream out); each gateway has an SPM plus interrupt and control lines to the ILP; the input and output interfaces (one SPM each), DMAs, memory and the ILP sit on the system fabric. The data path is annotated with steps 1 to 5.]
1. DMA brings data to the input gateway's SPM
2. Input gateway receives the data and starts to pass it through the chain
3. Chain works on the data
4. Output gateway gathers the data in its SPM
5. DMA brings the data to memory
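The five steps above map naturally onto a pipeline of Python generators, which makes it easy to see that the ILP load does not depend on the chain length. This is a behavioural sketch only: the `Ilp` class, the function names and the rule of one interrupt per job at each gateway are assumptions made for illustration, not the SpecC model.

```python
class Ilp:
    """Stand-in for the ILP; it only counts the interrupts it receives."""

    def __init__(self):
        self.interrupts = 0

    def interrupt(self, source):
        self.interrupts += 1


def input_gateway(jobs, ilp):
    """Steps 1-2: each job arrives in the gateway SPM and enters the chain."""
    for job in jobs:
        ilp.interrupt("input gateway")  # assumed: one interrupt per job
        yield job


def accelerator(stream, func):
    """Step 3: point-to-point stage, no ILP involvement at all."""
    for job in stream:
        yield func(job)


def output_gateway(stream, ilp):
    """Steps 4-5: gather results in the SPM and hand them to the DMA."""
    results = []
    for job in stream:
        results.append(job)
        ilp.interrupt("output gateway")  # assumed: one interrupt per job
    return results


def run_chain(jobs, funcs):
    """Build gateway -> ACCs -> gateway and return (results, interrupts)."""
    ilp = Ilp()
    stream = input_gateway(jobs, ilp)
    for func in funcs:
        stream = accelerator(stream, func)
    return output_gateway(stream, ilp), ilp.interrupts


if __name__ == "__main__":
    short = run_chain([1, 2, 3], [lambda x: x + 1])
    long = run_chain([1, 2, 3], [lambda x: x + 1] * 10)
    print(short, long)
```

Running three jobs through one stage or through ten stages produces the same interrupt count in this sketch, because only the two gateways ever signal the ILP.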
Evaluation
With more ACCs:
• # of interrupts
– Current arch: exponential growth in interrupts
– Proposed arch: the same number of interrupts
[Figure: # interrupts (axis 0 to 12000) vs. # accelerators (0 to 70), current vs. proposed arch.]
• Data communication traffic
– Current arch: heavier traffic
– Proposed arch: almost the same data traffic
[Figure: data communication traffic (MB, axis 0 to 300) vs. # accelerators (0 to 70), current vs. proposed arch.]
• Job size (4 MB memory)
– Current arch: smaller job
– Proposed arch: almost the same job
[Figure: job size (KB, axis 0 to 700) vs. # accelerators (0 to 70), current vs. proposed architecture.]
• Memory requirement (256 KB job)
– Current arch: linear growth in memory requirement
– Proposed arch: almost constant memory requirement
[Figure: memory (MB, axis 0 to 50) vs. # accelerators (0 to 70), current vs. proposed architecture.]
Summary
• Specialization as a growing trend in CMPs
– Accelerator-rich architectures
• Exploration of the challenges in the current accelerator-rich architecture
– Memory requirement
– Communication overhead
– Synchronization load
• The proposed accelerator-centric architecture template
– Autonomous accelerator chain
• No large memory requirement
• No heavy communication traffic
• No critical amount of required synchronization
Questions?
Again, thanks to Professor Schirner for all his support.
Thanks to Hamed for what I've been learning from him.
Thank you, all ESL members, for your attendance!