The Era of Many-Module SoC : Revisiting the NoC Mapping Problem

Post on 07-Feb-2016

26 views 0 download

Tags:

description

The Era of Many-Module SoC : Revisiting the NoC Mapping Problem. Technion – Israel Institute of Technology. Isask’har ( Zigi ) Walter, Israel Cidon , Avinoam Kolodny , Daniel Sigalov. December, 2009. SoC Revolution. PE1. PE2. PE3. R. R. R. PE1. PE2. PE3. R. R. R. PE4. - PowerPoint PPT Presentation

Transcript of The Era of Many-Module SoC : Revisiting the NoC Mapping Problem

Module

Module

Module

Module

Module

Module

Module

Module

Module Module Module

Module

Module

Module

R

R

R R R

R

RR R R R

R R

R

Module

R

R

R

Technion – Israel Institute of Technology

The Era of Many-Module SoC:Revisiting the NoC Mapping Problem

Isask’har (Zigi) Walter, Israel Cidon, Avinoam Kolodny, Daniel Sigalov

December, 2009

2

SoC Revolution

PE1

R

PE2

PE3

R R

PE4

R

PE5

PE6

R R

PE1

PE2

PE3

PE4

PE5

PE6

Bus-based system NoC-based system

3

SoC Evolution

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

RR

R RR

R RR

R

R R R R R

R R R R R

R R R R R

R R R R R

R R R R R

4

Processor Evolution

CPU

Cache

Single Core

CPU1

Cache

Dual Core

CPU2

Cache CPU1Cache

Quad Core

CPU3Cache

CPU2Cache

CPU4Cache

5

How would such chips be like?

Most likely Power still important Highly parallel IP reuse

Ease of design and verification

The Era of Many-Module SoC

High Certainty

Totally unknown

Large number of modules

NoC Interconne

ct

Applications

R

Special purpose cores replace general purpose processors Power considerations

Future SoCs - Observation#1

GeneralPurpose

CPU

Pre.Proc.

DSP CPU

Task1

Task2

MEM

Task3

Task4

MEM

Memory

Task1

Task2

MEM

Task3

Task4

MEM

Memory

Task5

Task5

GPU

6

Processing pipes are getting longer

7

Future SoCs - Observation#2

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

?R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

Large diversity All modules are

unique

Highly regular Classes of Replicated

cores standard modules (DSP,

HW accelerators, Cache banks, etc.)

8

Increased use of specialized cores Pipes are getting longer

Replication of processing elements

How is the design flow affected? This work – mapping of the NoC

The Era of Many-Module SoCObservation#

1

Observation#2

9

The Era of Many Module SoC Revisiting the Mapping Problem Cross-Entropy Optimization Evaluation

Outline

10

Given Traffic pattern(s)

a set (or sets) of pair-wise bandwidth requirements and timing constraints

Routing Topology

Goal Find efficient mapping of cores to tiles

NoC Mapping

PE4

PE1 PE2

PE5

PE3

PE6

PE7 PE8 PE9 PE4

PE1

PE2PE5

PE3

PE6PE7

PE8

PE9

PE4 PE1

PE2

PE5

PE3 PE6

PE7PE8

PE9

PE4

PE1

PE2

PE5

PE3

PE6

PE7

PE8

PE9

PE4

PE1PE2

PE5

PE3

PE6

PE7

PE8 PE9

11

An important design step Mapping affects power and performance!

A difficult problem! Often heuristic algorithms are used

Common optimization goals Minimize (dynamic) power Minimize power + maximize performance Minimize power subject to performance

constraints

Mapping Optimization

12

Typical modeling Power and latency proportional to

distance Cost function:

Modeling

1 ,

( ) ( , )l i jl L i j N

Cost P BW b Dist i j

13

Calculating Mapping Cost

1 ,

( ) ( , )l i jl L i j N

Cost P BW b Dist i j

2( ) 30 2 100 2 260Cost

PE1 PE2

PE4 PE5

PE3

PE6

1 2 6 4 3( ) 30 ( ) 100 ( )Cost Dist PE PE Dist PE PE

100

30

1 2 6 2 6 4 3 4 3( ) ( ) ( ) ( ) ( )Cost bw PE PE Dist PE PE bw PE PE bw PE PE

1( ) 30 2 100 3 360Cost

Mapping π1

Mapping π2

14

Motivation - Example #1

Optimal mapping (π1):

PE1

PE2

PE3

PE4

MEM1

MEM2

PE1 MEM2

PE2 PE3

PE4

MEM1

1 1 1

1 1 1 1

11 ,

( ) ( , ) 9i ji j N

Cost b Dist i j

15

Optimal mapping (π2):

Let the mapping algorithm assign the flows!

Motivation - Example #1 (cont.)

PE1

PE2

PE3

PE4

2*MEM

PE1

MEM1

PE2

PE3 PE4

MEM2 PE1 MEM1

PE2 PE3

PE4

MEM2

Cost(π1)=9Cost(π2)=7

16

Motivation - Example #1 (cont.)

PE1

PE2

PE3

PE4

MEM1

MEM2

1 1 1

1

1 11

PE1

MEM1

PE2

PE3 PE4

MEM2

Cost(π2)=7

The mapping algorithm should be aware of replicated modules!

17

Pair-wise point-to-point requirements For example, in a 4-module system:

Classic Performance Constraints

PE1

2PE2

1 PE3

1 1PE4

PE4 PE3 PE2 PE1

18

PE1 PE2 PE3

PE4

Motivation - Example #2

Timing Requirement

PEs Stream ID

4 PE1PE2PE3PE4 Stream 1

1 PE2PE4 Stream 2

Stream 1

Stream 2

19

Example #2 – Pair-wise req.

No feasible mapping!

PE1 PE2 PE3

PE4

PE2.

PE4.

PE1

PE3 PE2. PE4.

PE1PE3

PE2.PE4

PE1PE3

2 PE2

1 PE3.

1 1 PE4

PE3 PE2 PE1

Req=2 Req=1

Req=1Req=1

20

PE1 PE2 PE3

PE4

Application-Level Requirements

Requirement PEs Stream ID

4 PE1PE2PE3PE4 Stream 1

1 PE2PE4 Stream 2

PE2

PE4

PE1

PE3

Stream 1

Stream 2

A feasible mapping does exist!

Req=1Req=2

Req=1Req=1

It’s better to work with the application level requirements

21

Find efficient mappings by extending the formulation of the mapping problem Adding degrees of freedom

Degree of freedom #1 Leverage existence of replicated modules

Degree of freedom #2 Replace p2p constraints with end-to-end,

application-level requirements

This Work

22

Modifying the Formulation (1)

Time Req. BW Flow

3 100 PE1DSP3

12 200 PE2DSP4

15 100 PE2SRAM1

5 100 PE3SRAM2

… … …

Time Req. BW Flow

3 100 PE1<ANY DSP>

12 200 PE2<ANY DSP>

15 100 PE2<ANY SRAM>

5 100 PE3<ANY SRAM>

… … …

Leverage existence of replicated modules Allow the mapping algorithm to allocate

flows to the best replicated module

23

∞ 2

4 4 3

∞ 3 1 3

∞ 4 7 7 3

∞ 3 ∞ 2 4 ∞

4 ∞ 7 2 3 ∞ 5

∞ ∞ 2 ∞ 1 5 6 ∞

7 7 3 ∞ ∞ 5 3 ∞ 1

∞ 3 ∞ 3 2 1 2 ∞ 3 1

Modifying the Formulation (2)

E2E Req.

Stream’s PEs Stream ID

23 PE1PE3PE9PE4PE10

1

12 PE5PE2PE3PE8PE7PE6PE10

2

15 PE5PE3PE9 3

20 PE7PE8PE2PE3 4

2 PE1PE2 5

… … …

In this paper, for synthetic task graphs Did so for a real application too

P2P timing req. E2E timing req.

Replace p2p constraints with end-to-end, application-level requirements

24

The Era of Many Module SoC Revisiting the Mapping Problem Cross-Entropy Optimization Evaluation

Outline

25

Modern optimization heuristic Good at combinatorial optimization problems

Akin to evolutionary algorithms Generation of new solutions is based on

sampling and estimation Inherently a global search method

Reduced risk of getting trapped in a local minimum

Cross Entropy Optimization

26

Given an initial parameter vector v=v0, sample a random population of K solutions x1,x2,…,xk from the distribution given by f(x;v).

Evaluate the costs S(xi),i=1,…,K. Using the ρK (0<ρ<1) elite (lowest cost) samples, obtain a new density function

f(x;v) by calculating a new vector v via Maximum Likelihood (ML) estimation. Repeat steps 1-3 with the new vector v unless maximum number of iterations is

reached or no improvement is obtained for a predefined number of iterations.

Cross Entropy Optimization

1. Generate 10 random mappings: π1, π2, …, π10

2. Find 3 lowest cost mappings: π2, π5, π7

3. Examine those 3 best mappings:A. For each tile, calculate the probability

core PEi is mapped to that tile

B. Update probabilities accordingly

For example:

27

Prob(TileAPE1)= Prob(TileAPE2)= Prob(TileAPE3)=Prob(TileAPE4)=0.25

Prob(TileBPE1)= Prob(TileBPE2)= Prob(TileBPE3)=Prob(TileBPE4)=0.25

Prob(TileCPE1)= Prob(TileCPE2)= Prob(TileCPE3)=Prob(TileCPE4)=0.25

Prob(TileDPE1)= Prob(TileDPE2)= Prob(TileDPE3)=Prob(TileDPE4)=0.25

CE Example

PE3

PE1 PE2

PE4

TileC

TileA

TileD

TileB

π1

PE3

PE1

PE2

PE4

π2

PE3

PE1 PE2

PE4

π3

PE3

PE1PE2

PE4

π4

PE3

PE1 PE2

PE4

π5

PE3

PE1

PE2

PE4

π6

PE3 PE1

PE2PE4

π7

PE3

PE1

PE2

PE4

π8

PE3

PE1

PE2

PE4

π9

PE3

PE1 PE2

PE4

π10

28

Prob(TileAPE1)=1

Updating Probabilities

PE1 PE2

PE4PE3PE3

PE1

PE2

PE4 PE1 PE2

PE4 PE3

Prob(TileBPE2)=2/3 Prob(TileBPE4)=1/3

Prob(TileDPE2)=1/3 Prob(TileDPE3)=1/3 Prob(TileDPE4)=1/3

Prob(TileCPE3)=2/3 Prob(TileCPE4)=1/3

π2 π5 π3

Following iteration uses these updates probabilities Gradually, probabilities converge to 0/1

TileC

TileA

TileD

TileB

29

The Era of Many Module SoC Revisiting the Mapping Problem Cross-Entropy Optimization Evaluation

Outline

30

Scenario 6x6 mesh NoC Synthetic, randomized SoC

Task graphs (and task-to-core mapping) Varying number of replicated modules Varying timing constraints (Real application in DATE10 paper)

Compare with best cost of classic mapping Averaging multiple runs

Evaluation

31

“Class”: a group of identical PEs Total number of replicated cores=

{Number of classes}*{class size}

Accounting for Replication

Sa

vio

ngs

[%]

Replicated Modules [%]

One ClassTwo ClassesFour Classes

Cost

Reduct

ion

]%[

10 20

Total number of Replicated Modules]%[ 30 40 50

32

SoCs with a pipeline data path and background P2P traffic Varying pipeline slack Different amounts of background

constraints

Application-Level Requirements

Tight Medium Loose Very Loose

Sav

ings

[%]

Allowed Pipeline Slack

extra constraints extra constraints extra constraints extra constraints

Cost

Reduct

ion

]%[

Pipeline Slack

33

We are going into the era of “Many module SoC”

Extend the mapping to account for Classes of replicated modules Application-level requirements

Meaningful power savings

But mapping is an example Routing? Task assignment? Link design?

Topology selection?

Conclusions and Future Work

34

Thank you!

Questions?

zigi@tx.technion.ac.il

The Era of Many-Module SoC

M odule

M odule M odule

M odule M odule

M odule M odule

M odule

M odule

M odule

M odule

M odule

QNoCResearch

GroupGroup

ResearchQNoC