Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling...

28
Analytical Performance Modeling of Hierarchical Interconnect Fabrics Nikita Nikitin, Javier de San Pedro, Josep Carmona and Jordi Cortadella Universitat Politècnica de Catalunya Supported by Intel Corporation International Symposium on Networks-on-Chip (NOCS) 2012, Copenhagen, Denmark

Transcript of Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling...

Page 1: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Analytical Performance Modeling of Hierarchical Interconnect Fabrics

Nikita Nikitin, Javier de San Pedro, Josep Carmona and Jordi Cortadella

Universitat Politècnica de Catalunya

Supported by Intel Corporation

International Symposium on Networks-on-Chip (NOCS) 2012, Copenhagen, Denmark

Page 2: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Outline • Introduction

– Hierarchical Chip Multiprocessors (CMPs)

– Performance modeling for CMPs

– The cyclic dependency between latency and traffic

• Analytical performance modeling

– Modeling traffic

– Modeling latency

– Methods to resolve the dependency

• Results and conclusions

NOCS'12 Universitat Politècnica de Catalunya 2

Page 3: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

The trends in CMP design • Hundreds of computing units per chip

– Smaller, simpler, more power-efficient cores

• Advanced memory management – Larger on-chip cache

– Increasing interconnect (IC) bandwidth

• Tiled architecture

NOCS'12 Universitat Politècnica de Catalunya 3

R R R R

R R R R

R R R R

R R R R Mem

ory

Co

ntr

olle

r

Mem

ory

Co

ntr

olle

r

C

L2 R

L1

Page 4: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Hierarchical interconnects

4

C+L1

L2

C+L1

L2

L3

IC ( Bus / Ring )

NI

R

Dir

NOCS'12 Universitat Politècnica de Catalunya

Tiled CMP with hierarchical interconnect

R

R R

Mem

ory

Co

ntr

olle

r

Mem

ory

Co

ntr

olle

r

IC

R

IC

IC IC

R R

R

• Exploit locality of memory references*

* “Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs”, R.Das et al., HPCA, 2009

Page 5: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Design of CMP architecture • Goal: efficient use of chip resources

– Maximize performance

– Fit area/power/thermal budget

• Multidimensional exploration space

(#cores / cache size /

memory hierarchy / IC topologies /…)

• Means: automated design space exploration

– Analytical performance models are essential

NOCS'12 Universitat Politècnica de Catalunya 5

C C

L3

R

D

R

MC

MC

IC

R

IC

R

R

IC

R

IC

R

Page 6: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Contention modeling

• Contention impacts CMP performance

• Crucial evaluating hierarchical interconnects

– Is the required bandwidth sustainable?

NOCS'12 Universitat Politècnica de Catalunya 6

R

R R

Mem

ory

Co

ntr

olle

r

Mem

ory

Co

ntr

olle

r

IC

R

IC

IC IC

R R

R

# of wires? Router architecture?

Local IC topology?

Page 7: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Motivational example

NOCS'12 Universitat Politècnica de Catalunya 7

(a) 8x8 mesh (b) 4x4 mesh with bus clusters

(c) 2x2 mesh with bus clusters

Estimation w/o contention is very

inaccurate!

48 cores, 16 cache modules core cache IC Legend:

0

2

4

6

8

10

(a) (b) (c)

Thro

ugh

pu

t (I

PC

)

No contention

With contention

Page 8: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Analytical modeling of CMP performance

• Analytical models for ICs: – Latency L as a function of traffic λ

– λ defined by the workload

Emphasis: λ depends on L!

• This work: resolve the cyclic dependency of traffic and latency – Formulate λ as a function of L

– Add existing model for L(λ)

– Resolve the system efficiently

NOCS'12 Universitat Politècnica de Catalunya 8

L λ IPC

Core1

Corei

CoreN

Li λi

Memory subsystem

L L •••

(Throughput)

Page 9: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Outline • Introduction

– Hierarchical Chip Multiprocessors (CMPs)

– Performance modeling for CMPs

– The cyclic dependency between latency and traffic

• Analytical performance modeling

– Modeling traffic

– Modeling latency

– Methods to resolve the dependency

• Results and conclusions

NOCS'12 Universitat Politècnica de Catalunya 9

Page 10: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Modeling memory traffic

Traffic to memory (probability of a memory reference per cycle):

NOCS'12 Universitat Politècnica de Catalunya 10

Average latency of memory access Memory access penalty

Core L λ

Memory

subsystem

Parameters of core executing some workload: 1. - ideal Cycles Per Instruction

2. - # Memory references Per Instruction

Real performance of in-order core:

Page 11: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Modeling average memory latency • Average latency of memory requests for a core:

NOCS'12 Universitat Politècnica de Catalunya 11

0

0,05

0,1

0,15

0,2

0,25

0 5 10 M

iss

Rat

io

Cache size (Mb)

Latencies are calculated using - Cache latencies - Interconnect topology - Routing algorithm (XY)

Probabilities are calculated using - Miss ratio dependency on cache size

Application

15% miss in 64K L1

5% miss in 1M L2

0

0,1

0,2

0,3

0,4

0 5 10 M

iss

Rat

io

Cache size (Mb)

Application

Page 12: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Modeling contention latency

NOCS'12 Universitat Politècnica de Catalunya 12

CL

MC

MC

R

CL

R

CL

R

CL

R

R

C C

L3

NI

D

Mesh NoC Bus-based cluster

Delays in queues are defined by extending M/G/1 queuing model:

“An Analytical Approach for Network-on-Chip Performance Analysis”, Ogras et al., TCAD, 2010 (Best Paper Award)

Page 13: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

System of non-linear equations

• Solve using numerical methods

• General methods are very slow – 10x10 mesh (10K vars./eqns.) – MATLAB timeout after few hours

• Proposed methods: – Fixed-point iteration

– Bisection search for λ

The cyclic dependency of L and λ

NOCS'12 Universitat Politècnica de Catalunya 13

Any “black-box” model for L(λ)!

Analytical model for latency

Page 14: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Fixed-point iteration

+ Fast (10x10 mesh in several ms)

+ Converges to the exact solution

NOCS'12 Universitat Politècnica de Catalunya 14

0

10

20

30

40

50

0 0,05 0,1 0,15 0,2

L, a

vera

ge la

ten

cy (

cycl

es)

λ, average traffic rate (flits/cycle)

L(λ) λ (L)

Characteristic of the IC Characteristic of

the cores/workload

– May not converge for high λ

Hop-count latency

Page 15: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Bisection search for λ

– Fast, as fixed-point

– Always converges to an approximate solution

(good for homogeneous clusters)

NOCS'12 Universitat Politècnica de Catalunya 15

0

10

20

30

40

50

0 0,05 0,1 0,15 0,2

L, a

vera

ge la

ten

cy (

cycl

es)

λ, average traffic rate (flits/cycle)

L(λ) λ (L)

Characteristic of the IC Characteristic of

the cores/workload

λ=0 λ(Lhop-count)

Page 16: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Outline • Introduction

– Hierarchical Chip Multiprocessors (CMPs)

– Performance modeling for CMPs

– The cyclic dependency between latency and traffic

• Analytical performance modeling

– Modeling traffic

– Modeling latency

– Methods to resolve the dependency

• Results and conclusions

NOCS'12 Universitat Politècnica de Catalunya 16

Page 17: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Performance of analytical methods

Test Mesh Cont. lat. Num. of var./eqn.

Runtime (sec)

MATLAB Fixed-Point Bisection

T1 2 x 2 5% 236 0.023 0.001 0.001

T2 4 x 4 13% 1224 1.412 0.001 0.002

T3 6 x 6 8% 3108 30.831 0.002 0.003

T4 8 x 8 12% 6128 408.539 0.006 0.010

T5 10 x 10 23% 10260 Timeout (1hr) 0.010 0.012

T6 10 x 10 46% 10260 Timeout (1hr) 0.022 0.015

T7 10 x 10 55% 10260 Timeout (1hr) NA 0.016

NOCS'12 Universitat Politècnica de Catalunya 17

Page 18: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Case study: performance exploration

NOCS'12 Universitat Politècnica de Catalunya 18

Parameter Value

Chip area Core area Core IPC0

MPI L1 size L2 size Memory density Mesh dimensions MC latency

350 mm2

1.25 mm2

2.0 0.5 64, 128 Kb 64 Kb to 3 Mb 1 mm2 / Mb 2x2 to 16x16 100 cycles

0

0,05

0,1

0,15

0,2

0,25

0 2 4 6 8 10

Mis

s R

atio

Cache size (Mb)

1062 configurations explored

Cache Size 64K 128K 256K 512K 1M 2M 4M 8M

Area* (mm2) 0.063 0.125 0.25 0.5 1.0 2.0 4.0 8.0

Latency (cycles) 2 3 4 5 6 7 8 9

Page 19: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Simulation environment

NOCS'12 Universitat Politècnica de Catalunya 19

Network simulation

Global (mesh)

memory L3 cache

node

Bus Local (bus, ring, …)

Core

Memory

controller

• Verify model by simulation

• Cycle-accurate NoC simulator – On top of BookSim 2.0

• Extensions – Hierarchical networks

– Bus topologies

– Probabilistic state-machines

for cores and memories

Page 20: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Faithfulness of the model

20

0

5

10

15

20

25

30

35

1

52

10

3

15

4

20

5

25

6

30

7

35

8

40

9

46

0

51

1

56

2

61

3

66

4

71

5

76

6

81

7

86

8

91

9

97

0

10

21

Thro

ugh

pu

t (I

PC

)

Configurations sorted in descending order of throughput

Modeling

Simulation

• Average difference in throughput is about 10%

• Corresponds to the error of the latency model

NOCS’12 Universitat Politècnica de Catalunya

Page 21: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Best-throughput ordering

21

Simulation time: 5.5 hours Modeling time: 16.8 sec (>1000x faster)

0

200

400

600

800

1000

0 200 400 600 800 1000 B

est

co

nfi

gura

tio

ns

by

anal

ysis

th

at in

clu

de

N

Number of best config. by simulation (N)

Static latency

Full latency

Ideal (Simulation)

(1; 33)

(4; 44)

(1; 2) (4; 6)

(50; 64)

0

10

20

30

40

50

60

70

0 10 20 30 40 50 60 Be

st c

on

figu

rati

on

s b

y an

alys

is t

hat

incl

ud

e N

Number of best configurations by simulation (N)

Static latency

Full latency

Ideal (Simulation)

NOCS’12 Universitat Politècnica de Catalunya

No contention

With contention

No contention

With contention

Ideal (Simulation)

Page 22: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Conclusions

• Analytical modeling of contention in CMPs is essential

• There exists cyclic dependency between latency and traffic of memory requests

• This dependency can be efficiently resolved using numerical methods (fixed-point, bisection)

• Precision of the model is significantly improved

• Current work: out-of-order cores, heterogeneity

NOCS'12 Universitat Politècnica de Catalunya 22

Page 23: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Backup

NOCS'12 Universitat Politècnica de Catalunya 23

Page 24: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Sufficient for convergence of :

0

10

20

30

40

50

0 0,05 0,1 0,15 0,2

L, a

vera

ge la

ten

cy (

cycl

es)

λ, average traffic rate (flits/cycle)

L(λ) λ (L)

Fixed-point convergence issues

NOCS'12 Universitat Politècnica de Catalunya 24

Hop-count latency

Page 25: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Bisection search

NOCS'12 Universitat Politècnica de Catalunya 25

Latency model Traffic model

Page 26: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Average latency calculation

• Average Memory Access Time (AMAT):

NOCS'12 Universitat Politècnica de Catalunya 26

Page 27: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Best configuration

27

- 6x6 mesh, 36 clusters, 5 cores/cluster

- total 180 cores with 64K L1, 256K L2

- 68Mb total shared L3

Throughput = 30.81 IPC

R R R R R R

R R R R R R

R R R R R R

R R R R R R

R R R R R R

R R R R R R

Mem

ory

Co

ntr

olle

r

Mem

ory

Co

ntr

olle

r

Memory Controller

Memory Controller

C+L1

L2

C+L1

L2

C+L1

L2

L3

Bus

C+L1

L2

C+L1

L2

NI

R

Dir

NOCS'12 Universitat Politècnica de Catalunya

Page 28: Analytical Performance Modeling of Hierarchical ... · PDF fileAnalytical Performance Modeling of Hierarchical Interconnect Fabrics ... Tiled CMP with hierarchical interconnect R er

Runtime: Modeling vs Simulation

0,001

0,01

0,1

1

10

100

1000

0 100 200 300 400 500 600 700 800 900 1000 1100

Ru

nti

me

(se

con

ds)

Number of components (cores + memories) in CMP

Analytical

Simulation

Modeling a CMP with ~700 components in 1 second

28 NOCS’12 Universitat Politècnica de Catalunya