Design and Management of 3D CMPs using Network-in-Memory
Feihui Li et al., Penn State University
(ISCA 2006)
News..
Moral of the story…
• 3D technology helps in reducing wire delays
  – Exploit it in as many ways as you can!
  – They chose L2 caches
• Also, 3D leads to on-chip hotspots.
  – Arrange units intelligently, reduce localized hotspots.
Major Results/Contributions
• First 3D CMP design space exploration
• Proposal of 3D NUCA L2 caches for CMPs.
  – Comparison with the existing 2D counterparts.
  – 3D works better even without data migration
• Proposal of NoCs as a method of communication between L2 banks.
  – “Efficiently exploit fast vertical interconnects”
Basics…
Typical Network-on-Chip architecture
Major types of integration
Proposed: 3D Network-in-Memory
[Figure: pillar node. Labels: Processing Element (Cache Bank or CPU), NIC, Single-Stage Router (R), b-bit links, Input Buffer, Output Buffer, NoC, NoC/Bus Interface, b-bit dTDMA Bus (Communication Pillar, orthogonal to slide).]
dTDMA Bus (Dynamic Time-Division Multiple Access)
The dTDMA Bus as the Communication Pillar
[Figure: routers on stacked layers joined by a dTDMA bus with a dTDMA bus arbiter; dimension labels 1500 um and 10~100 um.]
• Use dTDMA bus (VLSID 2006)
  – Efficient/fast bus
  – Small area/power overhead
• Do not use multi-hop for vertical communication
  – Vertical distance is so small
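The slide does not spell out how the dTDMA arbiter allocates the bus. As a rough illustration of the "dynamic" part of dynamic TDMA, the sketch below gives one timeslot per frame to each layer that currently has data queued, so the frame shrinks when clients go idle. The class name `DTDMAArbiter`, its methods, and the one-slot-per-active-client policy are assumptions made for this example, not details taken from the paper.

```python
from collections import deque


class DTDMAArbiter:
    """Toy dynamic-TDMA arbiter (hypothetical): one slot per active client."""

    def __init__(self, num_layers):
        # One transmit queue per layer attached to the pillar.
        self.queues = [deque() for _ in range(num_layers)]

    def enqueue(self, layer, flit):
        self.queues[layer].append(flit)

    def active_clients(self):
        # A client is active only while it has something to send.
        return [i for i, q in enumerate(self.queues) if q]

    def run_frame(self):
        """Grant one slot to each active client; return (layer, flit) pairs."""
        return [(layer, self.queues[layer].popleft())
                for layer in self.active_clients()]


arb = DTDMAArbiter(num_layers=4)
arb.enqueue(0, "a0")
arb.enqueue(0, "a1")
arb.enqueue(2, "c0")
frame1 = arb.run_frame()  # two active clients, so the frame has two slots
frame2 = arb.run_frame()  # only layer 0 still has data, so one slot
frame3 = arb.run_frame()  # idle bus, so no slots at all
```

Compared with a fixed TDMA schedule, no slot is spent on an idle layer, which is one plausible reading of the "efficient/fast bus" claim above.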
Proposals (1)
• Inter-die “communication pillars”
• Integration of dTDMA buses and NoC routers for a fast communication interface
  – A typical NoC fails due to
    • increased complexity
    • contention issues
    • increased power/area overhead
    • multi-hop vertical communication
3D Benefit: Increased Locality
[Figure: 2D vicinity vs. 3D vicinity of a CPU. Legend: CPU, nodes within 1 hop, nodes within 2 hops, nodes within 3 hops, dTDMA pillar.]
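The locality benefit can be made concrete with a small counting exercise. The sketch below is a simplification, not the paper's model: it assumes an unbounded mesh, a pillar at every node (the deck later notes that pillars are limited in practice), and that any change of layer costs exactly one hop over the pillar. The function names are hypothetical.

```python
def nodes_within_2d(hops):
    """Nodes (excluding the origin) within `hops` on an unbounded 2D mesh."""
    return sum(1
               for x in range(-hops, hops + 1)
               for y in range(-hops, hops + 1)
               if 0 < abs(x) + abs(y) <= hops)


def nodes_within_3d(hops, layers):
    """Same count when a layer change costs one hop (assumed pillar everywhere)."""
    count = nodes_within_2d(hops)  # nodes on the CPU's own layer
    if hops >= 1:
        # On each other layer: one hop is spent on the pillar, the remaining
        # hops are horizontal; the "+ 1" is the node directly above or below.
        count += (layers - 1) * (nodes_within_2d(hops - 1) + 1)
    return count


for h in (1, 2, 3):
    print(h, nodes_within_2d(h), nodes_within_3d(h, layers=2))
```

Under these assumptions, stacking two layers raises the number of nodes reachable within three hops from 24 to 37.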
Proposals (2)
• Cannot increase # of pillars arbitrarily
  – Depends on via density
  – Router complexity
• So, CPUs share pillars
  – Stacking of CPUs also has to be considered
• CPU placement algorithm
  – Stack CPUs across dies so as to
    • Maintain decent access hop-count
    • Manage thermal profile
CPU placement example
Not stacking CPUs directly on top of one another helps to solve the localized hotspot problem.
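The placement algorithm itself is not given on the slides. The sketch below illustrates only the one constraint stated above, that no two CPUs occupy the same (x, y) column on different layers. It does not model hop count or temperature, and `place_cpus`, `has_stacked_cpus`, and the grid dimensions are hypothetical.

```python
def place_cpus(num_cpus, layers, width, height):
    """Assign each CPU a (layer, x, y) slot without reusing any (x, y) column."""
    columns = [(x, y) for y in range(height) for x in range(width)]
    if num_cpus > len(columns):
        raise ValueError("not enough distinct columns to avoid stacking")
    stride = len(columns) // num_cpus  # leave gaps between chosen columns
    # Alternate layers round-robin; each CPU gets its own column.
    return [(i % layers, *columns[i * stride]) for i in range(num_cpus)]


def has_stacked_cpus(placement):
    """True if two CPUs share an (x, y) column, i.e. sit on top of each other."""
    cols = [(x, y) for _, x, y in placement]
    return len(cols) != len(set(cols))


# Assumed example: 8 CPUs (as in the simulated system) on 2 layers of 4x4 tiles.
placement = place_cpus(num_cpus=8, layers=2, width=4, height=4)
```

A realistic algorithm would also weigh the distance from each CPU to its nearest pillar and the resulting thermal map, which this sketch leaves out.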
3D L2 Caches
• Clusters
  – Cache banks + tag array
  – Some clusters have CPUs, others don’t.
Cache Management
• Search
• Placement & Replacement
• Cache Line Migration
L2 Cache Management
Simulation Environment
• Simics + in-house NoC simulator
• All CPUs issue in-order
  – 8 CPUs, SPARC ISA
  – Directory-based protocol for coherence between the L1s and the L2
• HS3d for temperature modeling
• 64 MB and 32 MB L2 caches
Performance
[Figure: IPC (y-axis, 0 to 3.5) per benchmark (ammp, apsi, art, equake, fma3d, galgel, mgrid, swim, wupwise) for CMP-DNUCA, CMP-DNUCA-3D, and CMP-SNUCA-3D.]
Important Results
Important Results (2)
Impact of # of “pillars” on access latency
Important Results (3)
Final Word
• 3D is feasible & scalable… and has arrived.
• Localized hotspots can be solved by placing hotter units apart.
• Power savings + performance gain even without data migration
  – No numbers to support the claim(!)
  – Would that help the temperature issue as well?
Potential HPCA Submission
• An evaluation of temperature and IPC for a single-core 3D processor
• Leverage clustered architectures for “temperature aware” processor designs.
  – Basic premise: stacking cooler units (caches) on top of hotter units
    • Better thermal profile of processor
Proposals
[Figure: three candidate organizations, Arch 1, Arch 2, and Arch 3, showing the arrangement of cache banks and clusters.]
Proposals (2)
• Cache banks (both data and instruction) are
  – 2-way word-interleaved, or,
  – Replicated
• Present study done for 8-cluster architecture
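For readers unfamiliar with word interleaving, the sketch below shows the usual way an address is mapped to one of two banks: consecutive words alternate between banks. The 8-byte word size and the function name are assumptions; the slides give only "2-way word-interleaved".

```python
WORD_BYTES = 8  # assumed word size; not stated on the slides
NUM_BANKS = 2   # 2-way interleaving, as on the slide


def bank_for_address(addr):
    """Return the bank holding the word that contains byte address `addr`."""
    return (addr // WORD_BYTES) % NUM_BANKS


# Consecutive words land in alternating banks.
banks = [bank_for_address(a) for a in (0, 8, 16, 24)]
```

With replication, by contrast, each bank holds a full copy, so any bank can serve any address at the cost of capacity.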
Results (Performance)
2-way word interleaved caches
Results (Performance)
Replicated caches
Traffic Analysis
[Figure: Benchmarks, Arch 1. Number of accesses (y-axis, 0 to 25,000,000) per benchmark (ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise). Legend: RINGHOPCOUNT TOTALD2DHOPCOUNT INTERCLUSTER RINGHOP FOR CACHE.]
Traffic Analysis (2)
[Figure: Benchmarks, Arch 2. Number of accesses (y-axis, 0 to 25,000,000) per benchmark (ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise). Legend: RINGHOPCOUNT TOTALD2DHOPCOUNT INTERCLUSTER RINGHOP FOR CACHE.]
Results (Thermal)
[Figure: Peak Temp of Hottest Unit (C) (y-axis, 0 to 400) for BASE, ARCH 1, and ARCH 2.]