Design and Management of 3D CMP’s using Network-in-Memory Feihui Li et.al. Penn State University...

Design and Management of 3D CMP’s using Network-in-Memory

Feihui Li et al., Penn State University

(ISCA – 2006)

News..

Moral of the story…

• 3D technology helps in reducing wire delays
  – Exploit it in as many ways as you can!
  – They chose L2 caches
• Also, 3D leads to on-chip hotspots.
  – Arrange units intelligently, reduce localized hotspots.

Major Results/Contributions

• First 3D CMP design space exploration
• Proposal of 3D NUCA L2 caches for CMP’s.
  – Comparison with the existing 2D counterparts.
  – 3D works better even without data migration
• Proposal of NoC’s as a method of communication between L2 banks.
  – “Efficiently exploit fast vertical interconnects”

Basics…

Typical Network-on-Chip architecture

Major types of integration

Proposed: 3D Network-in-Memory

[Figure: a pillar node. A processing element (L2 cache bank or CPU) connects through its NIC to a single-stage router (R) over b-bit links. The router has input and output buffers and a NoC/bus interface to the b-bit dTDMA bus, which forms the communication pillar and runs orthogonal to the slide.]

dTDMA Bus (Dynamic Time-Division Multiple Access)

The dTDMA Bus as the Communication Pillar

[Figure: routers on stacked layers connected by a dTDMA bus with a bus arbiter; dimension labels 1500 um and 10~100 um.]

• Use dTDMA bus (VLSID 2006)
  – Efficient/fast bus
  – Small area/power overhead
• Do not use multi-hop for vertical communication
  – Vertical distance is so small
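A minimal sketch of the dynamic time-division idea behind the pillar bus, assuming the arbiter keeps one time slot per currently active transmitter and resizes the frame as layers become active or idle. The class and method names are hypothetical and are not taken from the slides or the paper.

```python
class DTDMAArbiter:
    """Toy model of a dynamic TDMA bus arbiter (illustrative only)."""

    def __init__(self):
        self.active = []      # layers with pending data, in slot order
        self.next_slot = 0    # index of the slot to be served next

    def request(self, layer):
        """A layer asks for the bus: the frame grows by one slot."""
        if layer not in self.active:
            self.active.append(layer)

    def release(self, layer):
        """A layer has nothing left to send: the frame shrinks by one slot."""
        if layer in self.active:
            idx = self.active.index(layer)
            self.active.remove(layer)
            if idx < self.next_slot:
                self.next_slot -= 1   # keep pointing at the same next owner

    def grant(self):
        """Return the layer that owns the current slot, or None if idle."""
        if not self.active:
            return None
        self.next_slot %= len(self.active)
        owner = self.active[self.next_slot]
        self.next_slot += 1
        return owner


arb = DTDMAArbiter()
arb.request(0)
arb.request(2)
first_frame = [arb.grant() for _ in range(4)]   # slots alternate between layers 0 and 2
arb.release(0)
second_frame = [arb.grant() for _ in range(2)]  # layer 2 now owns every slot
```

With a single active layer the frame has one slot, so that layer gets the bus every cycle. No slots are spent on idle layers.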

Proposals (1)

• Inter-die “communication pillars”
• Integration of dTDMA buses and NoC routers for a fast communication interface
  – A typical NoC fails due to
    • increased complexity
    • contention issues
    • increased power/area overhead
    • multi-hop vertical communication

3D Benefit: Increased Locality

[Figure: nodes within 1, 2, and 3 hops of a CPU, comparing the 2D vicinity with the 3D vicinity around a dTDMA pillar.]
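The locality benefit can be illustrated by counting how many nodes lie within a given hop budget. This is a sketch under simplifying assumptions: every node is treated as a pillar node, a layer change costs exactly one bus hop regardless of how many layers are crossed, and the mesh sizes are made up for illustration. In the proposed design only some nodes sit on a pillar, so real counts would be lower.

```python
def nodes_within_hops_2d(width, height, src, max_hops):
    """Count mesh nodes (excluding src) at Manhattan distance <= max_hops."""
    sx, sy = src
    return sum(
        1
        for x in range(width)
        for y in range(height)
        if 0 < abs(x - sx) + abs(y - sy) <= max_hops
    )


def nodes_within_hops_3d(width, height, layers, src, max_hops):
    """Same count for stacked meshes where any layer change is one bus hop."""
    sx, sy, sz = src
    count = 0
    for z in range(layers):
        vertical = 0 if z == sz else 1   # single-hop bus, any layer distance
        for x in range(width):
            for y in range(height):
                hops = abs(x - sx) + abs(y - sy) + vertical
                if 0 < hops <= max_hops:
                    count += 1
    return count


# 64 nodes either way: one 8x8 layer versus four stacked 4x4 layers
flat = [nodes_within_hops_2d(8, 8, (4, 4), h) for h in (1, 2)]
stacked = [nodes_within_hops_3d(4, 4, 4, (1, 1, 0), h) for h in (1, 2)]
```

For the same total node count, the stacked arrangement puts more nodes inside each hop budget, which is the effect the figure illustrates.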

Proposals (2)

• Cannot increase # of pillars arbitrarily
  – Depends on via density
  – Router complexity
• So, CPU’s share pillars
  – Stacking of CPU’s also has to be considered
• CPU placement algorithm
  – Stack CPU’s across dies so as to
    • Maintain decent access hop-count
    • Manage thermal profile

CPU placement example

Not stacking CPU’s directly on top of one another helps to solve the localized hotspot problem.
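A hypothetical greedy placement that captures the two goals above: keep each CPU one hop from a shared pillar, and never put two CPUs in the same vertical column. This is not the algorithm from the paper. The function name, the offset order, and the example mesh are all assumptions made for illustration, and the only thermal rule modeled is "no vertical overlap".

```python
# Planar offsets that keep a CPU one hop away from its pillar
OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def place_cpus(pillars, width, height, layers, num_cpus):
    """Spread CPUs over pillars and layers without vertical overlap.

    pillars: list of (x, y) pillar coordinates shared by all layers.
    Returns a list of (x, y, layer) CPU positions.
    """
    placements = []
    used_columns = set()   # (x, y) columns that already hold a CPU on some layer
    for cpu in range(num_cpus):
        px, py = pillars[cpu % len(pillars)]        # round-robin over pillars
        layer = (cpu // len(pillars)) % layers      # rotate through the layers
        for dx, dy in OFFSETS:
            x, y = px + dx, py + dy
            in_mesh = 0 <= x < width and 0 <= y < height
            if in_mesh and (x, y) not in used_columns:
                used_columns.add((x, y))
                placements.append((x, y, layer))
                break
        else:
            raise ValueError("no free column next to pillar (%d, %d)" % (px, py))
    return placements


# Example: 8 CPUs, 2 layers of a 6x6 mesh, 2 shared pillars
cpus = place_cpus([(1, 1), (4, 4)], width=6, height=6, layers=2, num_cpus=8)
```

Each pillar ends up serving four CPUs split across the two layers, and no two CPUs share an (x, y) column.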

3D L2 Caches

• Clusters
  – Cache banks + tag array
  – Some clusters have CPU’s, others don’t.

Cache Management

• Search
• Placement & Replacement
• Cache Line Migration

L2 Cache Management

Simulation Environment

• Simics + in-house NoC simulator
• All CPU’s issue in-order
  – 8 CPU’s, SPARC ISA
  – Directory based protocol for coherence between L1’s and the L2
• HS3D for temperature modeling
• 64 MB and 32 MB L2 caches

Performance

[Figure: IPC (y-axis, 0 to 3.5) for ammp, apsi, art, equake, fma3d, galgel, mgrid, swim, and wupwise under CMP-DNUCA, CMP-DNUCA-3D, and CMP-SNUCA-3D.]

Important Results

Important Results (2)

Impact of # of “pillars” on access latency

Important Results (3)

Final Word

• 3D is feasible & scalable… and has arrived.

• Localized hotspots can be solved by placing hotter units apart.

• Power savings + performance gain even without data migration
  – No numbers to support the claim(!)
  – Would that help the temperature issue as well?

Potential HPCA Submission

• An evaluation of temperature and IPC for a single core 3D processor
• Leverage clustered architectures for “temperature aware” processor designs.
  – Basic premise: Stacking cooler units (caches) on top of hotter units
    • Better thermal profile of processor

Proposals

[Figure: three candidate organizations, Arch 1, Arch 2, and Arch 3, showing the arrangement of cache banks and clusters.]

Proposals (2)

• Cache banks (both data and instruction) are
  – 2-way word-interleaved, or,
  – Replicated

• Present study done for 8-cluster architecture
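A small sketch of how 2-way word interleaving maps addresses to cache banks, in contrast to replication, where either bank can serve any address. The 8-byte word size and the function names are assumptions for illustration and are not stated on the slides.

```python
WORD_BYTES = 8   # assumed word size in bytes
NUM_BANKS = 2    # 2-way interleaving


def interleaved_bank(addr):
    """Return the bank holding the word that contains byte address addr."""
    return (addr // WORD_BYTES) % NUM_BANKS


def candidate_banks(addr, replicated):
    """Banks that can serve addr: all of them if replicated, else exactly one."""
    if replicated:
        return list(range(NUM_BANKS))
    return [interleaved_bank(addr)]


# Consecutive words alternate between the two banks
banks = [interleaved_bank(a) for a in (0, 8, 16, 24)]
```

With interleaving, a cluster whose local bank does not hold the requested word must fetch it from the other bank. Replication avoids that traffic at the cost of holding each word twice.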

Results (Performance)

2-way word interleaved caches

Results (Performance)

Replicated caches

Traffic Analysis

[Figure: number of accesses (y-axis, 0 to 25,000,000) per benchmark for Arch 1. Benchmarks: ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise. Legend text: RINGHOPCOUNT TOTALD2DHOPCOUNT INTERCLUSTER RINGHOP FOR CACHE.]

Traffic Analysis (2)

[Figure: number of accesses (y-axis, 0 to 25,000,000) per benchmark for Arch 2. Benchmarks: ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise. Legend text: RINGHOPCOUNT TOTALD2DHOPCOUNT INTERCLUSTER RINGHOP FOR CACHE.]

Results (Thermal)

[Figure: peak temperature of the hottest unit in C (y-axis, 0 to 400) for BASE, ARCH 1, and ARCH 2.]
