Post on 07-Feb-2016
Technion – Israel Institute of Technology
The Era of Many-Module SoC: Revisiting the NoC Mapping Problem
Isask’har (Zigi) Walter, Israel Cidon, Avinoam Kolodny, Daniel Sigalov
December, 2009
SoC Revolution
[Figure: bus-based system vs. NoC-based system, each with PE1-PE6; in the NoC-based system the PEs are connected through routers (R)]
SoC Evolution
[Figure: meshes of routers (R)]
Processor Evolution
[Figure: Single Core (CPU, Cache); Dual Core (CPU1, CPU2, with caches); Quad Core (CPU1-CPU4, with caches)]
The Era of Many-Module SoC
What would such chips be like?
Most likely:
Power still important
Highly parallel
IP reuse
Ease of design and verification
[Figure: certainty scale running between "High Certainty" and "Totally unknown", with labels: Large number of modules, NoC Interconnect, Applications]
Future SoCs - Observation #1
Special-purpose cores replace general-purpose processors
Power considerations
Processing pipes are getting longer
[Figure labels: General Purpose CPU; Pre.Proc., DSP, CPU, GPU; Task1-Task5; MEM; Memory]
Future SoCs - Observation #2
[Figure: meshes of routers (R), one of them marked "?"]
Large diversity: all modules are unique
Classes of standard modules (DSP, HW accelerators, cache banks, etc.)
Highly regular: replicated cores
The Era of Many-Module SoC
Observation #1: increased use of specialized cores; pipes are getting longer
Observation #2: replication of processing elements
How is the design flow affected?
This work: mapping of the NoC
Outline
The Era of Many-Module SoC
Revisiting the Mapping Problem
Cross-Entropy Optimization
Evaluation
NoC Mapping
Given:
Traffic pattern(s): a set (or sets) of pair-wise bandwidth requirements and timing constraints
Routing
Topology
Goal: find an efficient mapping of cores to tiles
[Figure: several alternative mappings of PE1-PE9 to tiles]
Mapping Optimization
An important design step: mapping affects power and performance!
A difficult problem! Heuristic algorithms are often used
Common optimization goals:
Minimize (dynamic) power
Minimize power + maximize performance
Minimize power subject to performance constraints
Modeling
Typical modeling: power and latency are proportional to distance
Cost function:
$Cost(\pi) = P \propto \sum_{l \in L} BW_l = \sum_{i,j \in N} b_{i,j} \cdot Dist(i,j)$
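Calculating Mapping Cost
$Cost(\pi) = P \propto \sum_{l \in L} BW_l = \sum_{i,j \in N} b_{i,j} \cdot Dist(i,j)$
$Cost(\pi) = bw(PE_2 \to PE_6) \cdot Dist(PE_2 \to PE_6) + bw(PE_4 \to PE_3) \cdot Dist(PE_4 \to PE_3)$
$Cost(\pi) = 30 \cdot Dist(PE_2 \to PE_6) + 100 \cdot Dist(PE_4 \to PE_3)$
Mapping π1: $Cost(\pi_1) = 30 \cdot 2 + 100 \cdot 3 = 360$
Mapping π2: $Cost(\pi_2) = 30 \cdot 2 + 100 \cdot 2 = 260$
[Figure: PE1-PE6 placed on tiles, with two flows of bandwidth 30 and 100]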
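The cost calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: distance is taken as the Manhattan hop count on a mesh, the two flows (PE2→PE6 with bandwidth 30, PE4→PE3 with bandwidth 100) are read from the worked example, π1 is assumed to be a row-major 2x3 placement, and `pi2` is a hypothetical placement chosen only because it reproduces the cost of 260; the names `dist` and `mapping_cost` are not from the talk.

```python
from typing import Dict, Tuple

Tile = Tuple[int, int]  # (row, column) of a tile on the mesh


def dist(a: Tile, b: Tile) -> int:
    """Manhattan hop count between two tiles (assumed distance metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mapping_cost(flows: Dict[Tuple[str, str], int],
                 placement: Dict[str, Tile]) -> int:
    """Sum of bandwidth * distance over all flows, for a given placement."""
    return sum(bw * dist(placement[src], placement[dst])
               for (src, dst), bw in flows.items())


# Flows read from the worked example: (source, destination) -> bandwidth.
flows = {("PE2", "PE6"): 30, ("PE4", "PE3"): 100}

# Assumed row-major 2x3 placement for pi1.
pi1 = {"PE1": (0, 0), "PE2": (0, 1), "PE3": (0, 2),
       "PE4": (1, 0), "PE5": (1, 1), "PE6": (1, 2)}

# Hypothetical pi2: PE1 and PE4 swapped relative to pi1.
pi2 = {"PE4": (0, 0), "PE2": (0, 1), "PE3": (0, 2),
       "PE1": (1, 0), "PE5": (1, 1), "PE6": (1, 2)}

print(mapping_cost(flows, pi1))  # 360
print(mapping_cost(flows, pi2))  # 260
```

Keeping the flows separate from the placement makes it easy to score many candidate placements against the same traffic pattern, which is what a mapping heuristic needs.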
Motivation - Example #1
Optimal mapping (π1):
[Figure: task graph of PE1-PE4, MEM1, and MEM2 with flows of bandwidth 1, and its mapping π1]
$Cost(\pi_1) = \sum_{i,j \in N} b_{i,j} \cdot Dist(i,j) = 9$
Motivation - Example #1 (cont.)
Let the mapping algorithm assign the flows!
Optimal mapping (π2):
[Figure: task graph of PE1-PE4 with a replicated memory (2*MEM), and the mappings π1 and π2 of PE1-PE4, MEM1, and MEM2]
Cost(π1) = 9, Cost(π2) = 7
Motivation - Example #1 (cont.)
[Figure: task graph of PE1-PE4, MEM1, and MEM2 with flows of bandwidth 1, and the mapping π2]
Cost(π2) = 7
The mapping algorithm should be aware of replicated modules!
Classic Performance Constraints
Pair-wise point-to-point requirements
For example, in a 4-module system:
        PE1  PE2  PE3  PE4
PE1      -    -    -    -
PE2      2    -    -    -
PE3      -    1    -    -
PE4      -    1    1    -
Motivation - Example #2
Stream ID   PEs                Timing Requirement
Stream 1    PE1→PE2→PE3→PE4    4
Stream 2    PE2→PE4            1
[Figure: PE1, PE2, PE3, and PE4 with Stream 1 and Stream 2]
Example #2 - Pair-wise req.
        PE1  PE2  PE3
PE2      2    -    -
PE3      -    1    -
PE4      -    1    1
[Figure: task graph of PE1-PE4 and candidate placements, with labels Req=2, Req=1, Req=1, Req=1]
No feasible mapping!
Application-Level Requirements
Stream ID   PEs                Requirement
Stream 1    PE1→PE2→PE3→PE4    4
Stream 2    PE2→PE4            1
[Figure: a mapping of PE1-PE4 with Stream 1 and Stream 2, with labels Req=1, Req=2, Req=1, Req=1]
A feasible mapping does exist!
It's better to work with the application-level requirements
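The contrast between pair-wise and application-level requirements can be checked by brute force. The sketch below is illustrative only and rests on assumptions that are not stated on the slides: the four PEs are placed on a 2x2 mesh, timing requirements are counted in Manhattan hops, and the pair-wise values (2, 1, 1, 1) are the per-hop split of the two stream requirements. The names `meets_p2p` and `meets_e2e` are hypothetical.

```python
from itertools import permutations

TILES = [(0, 0), (0, 1), (1, 0), (1, 1)]  # assumed 2x2 mesh
PES = ["PE1", "PE2", "PE3", "PE4"]


def dist(a, b):
    """Manhattan hop count between two tiles (assumed timing metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# Pair-wise (p2p) requirements: maximum allowed distance per pair.
P2P = {("PE1", "PE2"): 2, ("PE2", "PE3"): 1,
       ("PE3", "PE4"): 1, ("PE2", "PE4"): 1}

# Application-level (end-to-end) requirements: (path, maximum total distance).
STREAMS = [(["PE1", "PE2", "PE3", "PE4"], 4), (["PE2", "PE4"], 1)]


def meets_p2p(place):
    """True if every pair-wise requirement holds for this placement."""
    return all(dist(place[a], place[b]) <= req for (a, b), req in P2P.items())


def meets_e2e(place):
    """True if every stream's summed hop distance stays within its budget."""
    return all(
        sum(dist(place[a], place[b]) for a, b in zip(path, path[1:])) <= req
        for path, req in STREAMS
    )


# Enumerate all 4! placements of the four PEs on the four tiles.
placements = [dict(zip(PES, tiles)) for tiles in permutations(TILES)]
p2p_ok = [p for p in placements if meets_p2p(p)]
e2e_ok = [p for p in placements if meets_e2e(p)]
print(len(p2p_ok), len(e2e_ok))
```

Under these assumptions the pair-wise version has no solution because PE2, PE3, and PE4 would all have to be one hop from each other, which a mesh cannot provide, while the end-to-end version lets one hop of Stream 1 borrow slack from the others.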
This Work
Find efficient mappings by extending the formulation of the mapping problem
Adding degrees of freedom:
Degree of freedom #1: leverage the existence of replicated modules
Degree of freedom #2: replace p2p constraints with end-to-end, application-level requirements
Modifying the Formulation (1)
Leverage the existence of replicated modules
Allow the mapping algorithm to allocate flows to the best replicated module
Classic formulation:
Flow              BW    Time Req.
PE1→DSP3          100   3
PE2→DSP4          200   12
PE2→SRAM1         100   15
PE3→SRAM2         100   5
…                 …     …
Modified formulation:
Flow              BW    Time Req.
PE1→<ANY DSP>     100   3
PE2→<ANY DSP>     200   12
PE2→<ANY SRAM>    100   15
PE3→<ANY SRAM>    100   5
…                 …     …
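A minimal sketch of what allocating a flow to "the best replicated module" could look like for a given placement is shown below. It is not the talk's algorithm: it only resolves each <ANY class> destination to the nearest replica by Manhattan distance, ignores the Time Req. column and any load balancing between replicas, and uses a hypothetical 3x3 placement. The names `resolve_flows` and `flow_cost` are illustrative.

```python
def dist(a, b):
    """Manhattan hop count between two tiles (assumed distance metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def flow_cost(flows, placement):
    """Bandwidth * distance summed over flows with concrete destinations."""
    return sum(bw * dist(placement[s], placement[d]) for s, d, bw in flows)


def resolve_flows(any_flows, placement, classes):
    """Bind each (src, class, bw) flow to the nearest replica of that class."""
    resolved = []
    for src, cls, bw in any_flows:
        best = min(classes[cls], key=lambda m: dist(placement[src], placement[m]))
        resolved.append((src, best, bw))
    return resolved


# Hypothetical placement on a 3x3 mesh (not taken from the slides).
placement = {"PE1": (0, 0), "DSP4": (0, 1), "SRAM1": (0, 2),
             "SRAM2": (1, 0), "PE2": (1, 1), "PE3": (1, 2),
             "DSP3": (2, 1)}
classes = {"DSP": ["DSP3", "DSP4"], "SRAM": ["SRAM1", "SRAM2"]}

# Flows from the two tables: fixed destinations vs. <ANY class> destinations.
fixed = [("PE1", "DSP3", 100), ("PE2", "DSP4", 200),
         ("PE2", "SRAM1", 100), ("PE3", "SRAM2", 100)]
flexible = [("PE1", "DSP", 100), ("PE2", "DSP", 200),
            ("PE2", "SRAM", 100), ("PE3", "SRAM", 100)]

resolved = resolve_flows(flexible, placement, classes)
print(flow_cost(fixed, placement))     # 900
print(flow_cost(resolved, placement))  # 500
```

In a full mapper this resolution step would sit inside the cost evaluation, so that placement and flow allocation are optimized together rather than one after the other.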
Modifying the Formulation (2)
Replace p2p constraints with end-to-end, application-level requirements
P2P timing req. (triangular matrix, rows as extracted):
∞
∞ 2
4 4 3
∞ 3 1 3
∞ 4 7 7 3
∞ 3 ∞ 2 4 ∞
4 ∞ 7 2 3 ∞ 5
∞ ∞ 2 ∞ 1 5 6 ∞
7 7 3 ∞ ∞ 5 3 ∞ 1
∞ 3 ∞ 3 2 1 2 ∞ 3 1
E2E timing req.:
Stream ID   Stream's PEs                      E2E Req.
1           PE1→PE3→PE9→PE4→PE10              23
2           PE5→PE2→PE3→PE8→PE7→PE6→PE10      12
3           PE5→PE3→PE9                       15
4           PE7→PE8→PE2→PE3                   20
5           PE1→PE2                           2
…           …                                 …
In this paper: for synthetic task graphs
Did so for a real application too
Cross Entropy Optimization
Modern optimization heuristic
Good at combinatorial optimization problems
Akin to evolutionary algorithms
Generation of new solutions is based on sampling and estimation
Inherently a global search method
Reduced risk of getting trapped in a local minimum
Cross Entropy Optimization
1. Given an initial parameter vector v=v0, sample a random population of K solutions x1,x2,…,xK from the distribution given by f(x;v).
2. Evaluate the costs S(xi), i=1,…,K.
3. Using the ρK (0<ρ<1) elite (lowest cost) samples, obtain a new density function f(x;v) by calculating a new vector v via Maximum Likelihood (ML) estimation.
4. Repeat steps 1-3 with the new vector v unless the maximum number of iterations is reached or no improvement is obtained for a predefined number of iterations.
For example:
1. Generate 10 random mappings: π1, π2, …, π10
2. Find the 3 lowest cost mappings: π2, π5, π7
3. Examine those 3 best mappings:
A. For each tile, calculate the probability that core PEi is mapped to that tile
B. Update probabilities accordingly
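The four steps can be sketched as a small Python loop. This is a sketch under stated assumptions, not the authors' implementation: the parameter vector v is taken to be a tile-by-core probability matrix, one-to-one mappings are sampled tile by tile from the still-unplaced cores, the ML estimate is the elite frequency of each core on each tile, and a smoothing factor `alpha` (not mentioned on the slides) blends old and new probabilities. The demo cost uses a hypothetical 2x3 mesh with two flows.

```python
import random


def sample_mapping(prob, rng):
    """Draw a one-to-one mapping (index = tile, value = core) from prob[tile][core]."""
    free = list(range(len(prob)))
    mapping = []
    for tile in range(len(prob)):
        # A small epsilon keeps every still-free core selectable.
        weights = [prob[tile][c] + 1e-9 for c in free]
        core = rng.choices(free, weights=weights)[0]
        free.remove(core)
        mapping.append(core)
    return mapping


def cross_entropy(cost, n, K=50, rho=0.2, alpha=0.7,
                  max_iter=100, patience=10, seed=0):
    rng = random.Random(seed)
    prob = [[1.0 / n] * n for _ in range(n)]  # v0: uniform probabilities
    best, best_cost, stale = None, float("inf"), 0
    for _ in range(max_iter):
        # Steps 1-2: sample K mappings and rank them by cost.
        pop = sorted((sample_mapping(prob, rng) for _ in range(K)), key=cost)
        elite = pop[:max(1, int(rho * K))]
        # Step 3: frequency of each core on each tile among the elite samples.
        for tile in range(n):
            for core in range(n):
                freq = sum(m[tile] == core for m in elite) / len(elite)
                prob[tile][core] = alpha * freq + (1 - alpha) * prob[tile][core]
        # Step 4: stop after `patience` iterations without improvement.
        c = cost(pop[0])
        if c < best_cost:
            best, best_cost, stale = pop[0], c, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_cost


# Hypothetical demo: 6 cores on a 2x3 mesh, flows given by core index.
TILES = [(r, c) for r in range(2) for c in range(3)]
FLOWS = {(1, 5): 30, (3, 2): 100}


def demo_cost(mapping):
    pos = {core: TILES[tile] for tile, core in enumerate(mapping)}
    return sum(bw * (abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]))
               for (a, b), bw in FLOWS.items())


best, best_cost = cross_entropy(demo_cost, n=6)
print(best, best_cost)
```

The smoothing step is a common practical addition that keeps a single lucky generation from driving probabilities to 0 or 1 too early; setting `alpha=1.0` gives the plain frequency update.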
CE Example
Prob(TileA←PE1) = Prob(TileA←PE2) = Prob(TileA←PE3) = Prob(TileA←PE4) = 0.25
Prob(TileB←PE1) = Prob(TileB←PE2) = Prob(TileB←PE3) = Prob(TileB←PE4) = 0.25
Prob(TileC←PE1) = Prob(TileC←PE2) = Prob(TileC←PE3) = Prob(TileC←PE4) = 0.25
Prob(TileD←PE1) = Prob(TileD←PE2) = Prob(TileD←PE3) = Prob(TileD←PE4) = 0.25
[Figure: ten random mappings π1 to π10 of PE1-PE4 onto TileA-TileD]
Updating Probabilities
[Figure: three mappings of PE1-PE4 onto TileA-TileD, labeled π2, π5, π3]
Prob(TileA←PE1) = 1
Prob(TileB←PE2) = 2/3, Prob(TileB←PE4) = 1/3
Prob(TileC←PE3) = 2/3, Prob(TileC←PE4) = 1/3
Prob(TileD←PE2) = 1/3, Prob(TileD←PE3) = 1/3, Prob(TileD←PE4) = 1/3
The following iteration uses these updated probabilities
Gradually, probabilities converge to 0/1
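The probability update can be reproduced by counting how often each core lands on each tile among the elite mappings. In the sketch below the three elite mappings are an assumption: the figure itself is not recoverable, so they were chosen as the three one-to-one mappings that are consistent with the probabilities listed on the slide. The function name `update_probabilities` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

TILES = ["TileA", "TileB", "TileC", "TileD"]

# Three elite mappings (tile -> core), assumed to match the slide's numbers.
elite = [
    {"TileA": "PE1", "TileB": "PE2", "TileC": "PE3", "TileD": "PE4"},
    {"TileA": "PE1", "TileB": "PE2", "TileC": "PE4", "TileD": "PE3"},
    {"TileA": "PE1", "TileB": "PE4", "TileC": "PE3", "TileD": "PE2"},
]


def update_probabilities(elite):
    """For each tile, the fraction of elite mappings placing each core there."""
    probs = {}
    for tile in TILES:
        counts = Counter(m[tile] for m in elite)
        probs[tile] = {core: Fraction(n, len(elite)) for core, n in counts.items()}
    return probs


probs = update_probabilities(elite)
for tile in TILES:
    print(tile, {core: str(p) for core, p in probs[tile].items()})
```

Cores that never appear on a tile among the elite samples get probability 0 here, which is why repeated updates push the distribution toward 0/1 values.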
Evaluation
Scenario:
6x6 mesh NoC
Synthetic, randomized SoC
Task graphs (and task-to-core mapping)
Varying number of replicated modules
Varying timing constraints
(Real application in DATE10 paper)
Compare with best cost of classic mapping
Averaging multiple runs
Accounting for Replication
“Class”: a group of identical PEs
Total number of replicated cores = {Number of classes} * {class size}
[Figure: Cost Reduction [%] (savings) vs. Total number of Replicated Modules [%] (10, 20, 30, 40, 50), for One Class, Two Classes, and Four Classes]
Application-Level Requirements
SoCs with a pipeline data path and background P2P traffic
Varying pipeline slack
Different amounts of background constraints
[Figure: Cost Reduction [%] (savings) vs. Allowed Pipeline Slack (Tight, Medium, Loose, Very Loose), with one series per amount of extra constraints]
Conclusions and Future Work
We are entering the era of the “Many-Module SoC”
Extend the mapping to account for:
Classes of replicated modules
Application-level requirements
Meaningful power savings
But mapping is only one example: Routing? Task assignment? Link design? Topology selection?
Thank you!
Questions?
zigi@tx.technion.ac.il
QNoC Research Group