Intel’s Tera-scale computing project: 100 cores, >100 threads, Datacenter-on-a-chip
Intel’s Tera-scale computing project: 100 cores, >100 threads, Datacenter-on-a-chip
Sun’s Niagara 2 (T2): 8 cores, 64 threads
Key design issues: Architecture challenges and tradeoffs
- Packaging and off-chip memory bandwidth
- Software and runtime environment
Tera-Scale CMP
Many-Core CMPs – High-level View
[Diagram: many cores, each with private L1 I/D caches, plus L2]
What are the key architecture issues in a many-core CMP?
- On-die interconnect
- Cache organization & cache coherence
- I/O and memory architecture
The General Block Diagram
[Block diagram legend]
FFU: Fixed Function Unit; Mem C: Memory Controller; PCI-E C: PCI-Express Controller; R: Router; ShdU: Shader Unit; Sys I/F: System Interface; TexU: Texture Unit
On-Die Interconnect
2D Embedding of a 64-core 3D-mesh network
When the 3D mesh is embedded in 2D, the longest topological distance is extended from 9 hops to 18!
On-Die Interconnect
Must satisfy bandwidth and latency requirements within the power/area budget
- Ring or 2D mesh/torus are good candidate topologies: manageable wiring density, router complexity, and design complexity
- Multiple source/destination pairs can be switched together; avoiding stopped-and-buffered packets saves power and helps throughput
- Crossbars and general routers are power hungry
- Fault-tolerant interconnect: provide spare modules, allow fault-tolerant routing
- Partition for performance isolation
Performance Isolation in 2D mesh
Performance isolation in a 2D mesh with partitioning, e.g., 3 rectangular partitions
- Intra-partition communication is confined within the partition
- Traffic generated in one partition will not affect the others
Virtualization of network interfaces
- The interconnect becomes an abstraction presented to applications
- Allows programmers to fine-tune an application's inter-processor communication
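A small sketch of why rectangular partitions give isolation under dimension-order routing: an XY route between two nodes of an axis-aligned rectangle never leaves that rectangle. The mesh size and partition bounds below are illustrative assumptions, not values from the slides.

```python
# Sketch: with dimension-order (XY) routing, a packet sent between two nodes
# inside an axis-aligned rectangular partition never leaves that rectangle,
# which is the basis of the performance-isolation claim above.
# The partition bounds are assumed for illustration.

def xy_route(src, dst):
    """Yield every hop of an XY (X first, then Y) route on a 2D mesh."""
    x, y = src
    while x != dst[0]:                     # travel along X first
        x += 1 if dst[0] > x else -1
        yield (x, y)
    while y != dst[1]:                     # then along Y
        y += 1 if dst[1] > y else -1
        yield (x, y)

def inside(node, rect):
    x0, y0, x1, y1 = rect                  # inclusive bounds
    return x0 <= node[0] <= x1 and y0 <= node[1] <= y1

partition = (0, 0, 3, 5)                   # a 4x6 rectangular partition (assumed)
nodes = [(x, y) for x in range(4) for y in range(6)]

# Every hop of every intra-partition route stays inside the partition.
assert all(inside(hop, partition)
           for s in nodes for d in nodes
           for hop in xy_route(s, d))
print("All intra-partition XY routes stay within the partition")
```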
Many-Core CMPs
[Diagram: many cores, each with private L1 I/D caches, plus L2]
How should the on-die cache be organized with so many cores?
- Shared vs. private
- Cache capacity vs. accessibility
- Data replication vs. block migration
- Cache partitioning
CMP Cache Organization
Capacity vs. accessibility: a tradeoff
Capacity – favors a shared cache:
- No data replication, no cache coherence
- Longer access time, contention issues
- Flexible cache capacity sharing; fair sharing among cores requires cache partitioning
Accessibility – favors private caches:
- Fast local access with data replication, but capacity may suffer
- Need to maintain coherence among the private caches
- Equal partitioning, inflexible
Many works take advantage of both:
- Capacity sharing on private caches – cooperative caching
- Utility-based cache partitioning on a shared cache (a sketch follows this list)
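To make that last point concrete, here is a minimal sketch of utility-based partitioning: ways of a shared cache are handed out greedily to whichever core would gain the most additional hits from one more way. The per-core hit curves are invented numbers for illustration; real designs (e.g., Qureshi & Patt's UCP) derive them from per-way hit counters.

```python
# Greedy utility-based partitioning of a shared cache's ways between two cores.
# hits_if_given[c][w] = hits core c would get with w ways (made-up numbers).

TOTAL_WAYS = 16

hits_if_given = {
    "core0": [0, 50, 90, 120, 140, 150, 155, 158] + [160] * 9,   # saturates early
    "core1": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
              110, 120, 130, 140, 150, 160],                      # keeps improving
}

alloc = {c: 0 for c in hits_if_given}
for _ in range(TOTAL_WAYS):
    # marginal utility of one more way for each core
    gain = {c: hits_if_given[c][alloc[c] + 1] - hits_if_given[c][alloc[c]]
            for c in alloc}
    winner = max(gain, key=gain.get)       # give the way to the biggest gainer
    alloc[winner] += 1

print(alloc)   # {'core0': 5, 'core1': 11} – the partition reflects utility
```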
Analytical Data Replication Model
Reuse distance histogram f(x): number of accesses with reuse distance x; cache size S.
Total number of hits = area beneath the curve:
\int_0^S f(x)\,dx
Replication reduces the effective capacity by the replica size R, so cache misses increase by
\int_{S-R}^{S} f(x)\,dx
and the cache hits are now
\int_0^{S-R} f(x)\,dx
Local hits increase: a fraction R/S of the remaining hits go to replicas, and a fraction L of those replica hits are local:
local hit increase = L \cdot \frac{R}{S} \int_0^{S-R} f(x)\,dx
With P = miss penalty cycles and G = local-gain cycles, the net memory access cycle increase is
P \cdot (\text{cache miss increase}) - G \cdot (\text{local hit increase}) = P \int_{S-R}^{S} f(x)\,dx - G \cdot L \cdot \frac{R}{S} \int_0^{S-R} f(x)\,dx
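A minimal numeric version of this model, using SciPy for the integrals. The exponential f(x) and its KB units are assumptions taken from the fit on the next slide; P, G, and L match the values used on the later "Data Replication Effects" slide.

```python
# Sketch of the replication model above: net cycle increase as a function of
# cache size S and replica size R. f(x), its units, and the parameter values
# are assumptions for illustration.

from math import exp
from scipy.integrate import quad

def net_cycle_increase(f, S, R, P, G, L):
    """P * (cache miss increase) - G * (local hit increase)."""
    miss_increase, _ = quad(f, S - R, S)        # hits lost to reduced capacity
    remaining_hits, _ = quad(f, 0, S - R)
    local_hit_increase = L * (R / S) * remaining_hits
    return P * miss_increase - G * local_hit_increase

# Example: fitted OLTP histogram (next slide), 2 MB cache, 25% replication.
f = lambda x: 6.084e6 * exp(-2.658e-3 * x)      # x in KB (assumed)
print(net_cycle_increase(f, S=2048, R=512, P=400, G=15, L=0.5))
```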
Get Histogram f(x) for OLTP
[Plot: reuse distance histogram f(x) (x10^6) vs. reuse distance (MB), 0-8 MB, with the fitted curve]
Step 1: Stack simulation – collect the discrete reuse distance histogram
Step 2: Matlab curve fitting – find a math expression:
f(x) = A \exp(-Bx), where A = 6.084 \times 10^6 and B = 2.658 \times 10^{-3}
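The slides use Matlab for the fitting step; an equivalent fit in Python/SciPy might look like the following. The "measured" histogram here is synthetic stand-in data, since the real input would be the stack-simulation output from step 1.

```python
# Fit f(x) = A * exp(-B * x) to a reuse-distance histogram (synthetic data).

import numpy as np
from scipy.optimize import curve_fit

def model(x, A, B):
    return A * np.exp(-B * x)

# Assumed stand-in for the discrete reuse-distance histogram (step 1 output).
x = np.linspace(0, 8192, 200)                       # reuse distance in KB
y = 6.0e6 * np.exp(-2.7e-3 * x) * (1 + 0.05 * np.random.randn(x.size))

(A, B), _ = curve_fit(model, x, y, p0=(1e6, 1e-3))  # initial guess required
print(f"fitted A = {A:.3e}, B = {B:.3e}")
```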
Data Replication Effects
[Plots: average access time increase (cycles) vs. fraction of replication (0, 1/8, ..., 7/8) for S = 2M, 4M, and 8M, using the fitted f(x) with G = 15, P = 400, L = 0.5]
Model, plotted against R/S: P \int_{S-R}^{S} f(x)\,dx - G \cdot L \cdot \frac{R}{S} \int_0^{S-R} f(x)\,dx
Best replication fraction (R/S): S = 2M -> 0%, S = 4M -> 40%, S = 8M -> 65%
Data replication impacts vary with different cache sizes.
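A short sketch of how those three curves can be regenerated: sweep the replication fraction R/S through the model for each cache size and keep the minimum. The fitted constants and KB units are assumptions; with them, the sweep lands near the 0% / 40% / 65% optima quoted above.

```python
# Sweep the replication fraction R/S through the model for each cache size.
# Self-contained: redefines the fitted f(x) (x in KB, assumed) and the model.

from math import exp
from scipy.integrate import quad

P, G, L = 400, 15, 0.5
f = lambda x: 6.084e6 * exp(-2.658e-3 * x)

def delta(S, R):
    """Net cycle increase: P * extra misses - G * extra local hits."""
    miss_inc, _ = quad(f, S - R, S)
    hits, _ = quad(f, 0, S - R)
    return P * miss_inc - G * L * (R / S) * hits

for S in (2 * 1024, 4 * 1024, 8 * 1024):              # 2M, 4M, 8M in KB
    best = min((i / 8 for i in range(8)), key=lambda r: delta(S, r * S))
    print(f"S = {S // 1024}M: best replication fraction = {best:.3f}")
```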
Many-Core CMPs
[Diagram: many cores, each with private L1 I/D caches, plus L2]
How about cache coherence with so many cores and caches?
- Snooping bus: broadcast requests
- Directory-based: maintain per-memory-block sharing information
- Review Culler's book
Simplicity: Shared L2, Write-through L1
Existing designs: IBM Power4 & 5, Sun Niagara & Niagara 2
- Small number of cores; multiple L2 banks connected by a crossbar
- Still need L1 coherence! Inclusive L2 with the L2 directory recording L1 sharers in Power4 & 5; non-inclusive L2 with a shadow L1 directory in Niagara
- L2 (shared) coherence among multiple CMPs: a private L2 per CMP is assumed
Other Considerations
- Broadcast snooping bus: loading, speed, space, power, scalability, etc.
- Ring: slow traversal, ordering, scalability
- Memory-based directory: huge directory space; a directory cache adds an extra penalty
- Shadow L2 directory (a copy of all local L2 tags): aggregated associativity = cores x ways/core = 64 x 16 = 1024 ways; high power
Directory-Based Approach
- The directory maintains the state and location of all cached blocks
- The directory is checked when the data cannot be accessed locally, e.g., on a cache miss or a write to a shared block
- The directory may route the request to a remote cache to fetch the requested block
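A minimal sketch of that lookup-and-forward behavior for read and write misses. The states, message names, and data structures are simplified assumptions, not the protocol of any particular machine.

```python
# On a miss, the home directory either supplies the data, forwards the request
# to the current owner, or invalidates sharers before granting write permission.

from dataclasses import dataclass, field

@dataclass
class DirEntry:
    state: str = "I"                  # I (uncached), S (shared), M (modified)
    sharers: set = field(default_factory=set)

directory: dict[int, DirEntry] = {}   # block address -> directory entry

def read_miss(block: int, requester: int) -> str:
    e = directory.setdefault(block, DirEntry())
    if e.state == "M":
        owner = next(iter(e.sharers))
        action = f"forward read to owner core {owner}, downgrade to S"
    else:
        action = "supply data from memory/L2"
    e.state = "S"
    e.sharers.add(requester)
    return action

def write_miss(block: int, requester: int) -> str:
    e = directory.setdefault(block, DirEntry())
    invalidated = e.sharers - {requester}
    e.state, e.sharers = "M", {requester}
    return f"invalidate cores {sorted(invalidated)}, grant M to core {requester}"

print(read_miss(0x40, requester=3))   # supply data from memory/L2
print(read_miss(0x40, requester=5))   # block is now shared by cores 3 and 5
print(write_miss(0x40, requester=5))  # invalidate core 3, grant M to core 5
```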
Sparse Directory Approach
- Holds state for all cached blocks
- Low-cost set-associative design, no memory backup
Key issues:
- Centralized vs. distributed
- Indirect accesses
- Extra invalidations due to conflicts
- Presence bits vs. duplicated blocks
Conflict Issues in Coherence Directory
- The coherence directory must be a superset of all cached blocks
- An uneven distribution of cached blocks across directory sets causes invalidations
Potential solutions:
- High set associativity – costly
- Directory + victim directory
- Randomization and skewed associativity
- Bigger directory – costly
- Others?
Impact of Invalidation due to Directory Conflict
[Bar chart: valid blocks (%), 50-100%, for OLTP, Apache, SPECjbb, SPEC-2000, SPEC-2006 under Set-8w, Set-16w, Set-32w, Set-64w, and Set-2x-8w directories; callouts at 75%, 96%, 72%, 93%]
- 8-core CMP, 1MB 8-way private L2 per core (8MB total)
- Set-associative directory; number of directory entries = total number of cache blocks
- Each cached block occupies a directory entry
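A rough sketch of the effect being measured: when a directory set fills up, the victim entry's cached block must be invalidated even though the cache itself had room for it. The directory geometry and random address stream are illustrative assumptions, so the printed percentage will not match the chart.

```python
# Fill the cache with random blocks and track how many survive directory
# set conflicts in an equally sized set-associative directory.

import random

DIR_SETS, DIR_WAYS = 1024, 8                 # directory geometry (assumed)
CACHE_BLOCKS = DIR_SETS * DIR_WAYS           # directory entries = cache blocks

random.seed(1)
directory = [[] for _ in range(DIR_SETS)]    # each set holds block addresses
cached = set()
invalidations = 0

for _ in range(CACHE_BLOCKS):                # install random blocks
    block = random.randrange(1 << 20)
    s = block % DIR_SETS                     # set index from block address
    if len(directory[s]) == DIR_WAYS:        # directory conflict:
        victim = directory[s].pop(0)         #   evict the oldest entry and
        cached.discard(victim)               #   invalidate its cached copy
        invalidations += 1
    directory[s].append(block)
    cached.add(block)

print(f"valid blocks: {100 * len(cached) / CACHE_BLOCKS:.1f}% "
      f"({invalidations} conflict invalidations)")
```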
Presence Bits Issue in Directory
Presence bits (or not?)
- Extra space, useless for multiprogrammed workloads
- The coherence directory must cover all cached blocks (consider the no-sharing case)
Potential solutions (a storage sketch follows this list):
- Coarse-granularity presence bits: imprecise, not suitable for CMPs
- Sparse presence vectors – record core IDs
- Allow duplicated block addresses, each with a few core IDs per shared block; enables multiple hits on a directory search
- Others?
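A back-of-the-envelope comparison of the two options just listed, for a 64-core CMP. The tag width and the number of core-ID pointers are assumptions; the slides give no concrete directory formats.

```python
# Compare per-entry storage: a full presence-bit vector vs. a few core-ID
# pointers. All sizes are illustrative assumptions.

import math

CORES = 64
TAG_BITS = 30                                  # assumed address-tag width

def full_vector_entry():
    return TAG_BITS + CORES                    # one presence bit per core

def core_id_entry(ids_per_entry=4):
    return TAG_BITS + ids_per_entry * math.ceil(math.log2(CORES))

print("full presence vector :", full_vector_entry(), "bits/entry")   # 94
print("4 core-id pointers   :", core_id_entry(), "bits/entry")       # 54
# Widely shared blocks would need several duplicated entries (multiple
# directory hits), trading space for extra lookup work.
```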
Valid Blocks
[Bar chart: valid blocks (%), 50-100%, for OLTP, Apache, SPECjbb, SPEC-2000, SPEC-2006 under Set-8w, Set-8w-64v, Skew-8w, Set-10w-1/4, Set-8w-p, and Set-full directories]
- Presence bits: multiprogrammed -> no; multithreaded -> yes
- Skewed associativity and the 10w-1/4 configuration help
- No difference with 64v
Challenge in Memory Bandwidth
Off-chip memory bandwidth must increase to sustain chip-level IPC
- Need power-efficient, high-speed off-die I/O
- Need power-efficient, high-bandwidth DRAM access
Potential solutions:
- Embedded DRAM
- Integrated DRAM: GDDR inside the processor package
- 3D stacking of multiple DRAM/processor dies
- Many technology issues to overcome
Memory Bandwidth Fundamentals
BW = number of bits x bit rate
A typical DDR2 bus is 16 bytes (128 bits) wide and operates at 800 Mb/s per pin; the memory bandwidth of that bus is 16 bytes x 800 Mb/s = 12.8 GB/s.
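A quick check of that arithmetic; the helper name is just for illustration.

```python
# Peak bandwidth = bus width (bytes) x per-pin data rate (Mb/s), as above.

def memory_bandwidth_gbs(bus_bytes: int, rate_mbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a bus of the given width and data rate."""
    return bus_bytes * rate_mbps_per_pin / 1000.0   # MB/s -> GB/s

print(memory_bandwidth_gbs(16, 800))    # 12.8 GB/s, matching the slide
```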
Latency and Capacity
- On-chip SRAM (caches): fast, but small capacity
- Off-chip DRAM: slow, but large capacity
Memory Bus vs. System Bus Bandwidth
Scaling bus capability has usually involved increasing the bus width while simultaneously increasing the bus speed
Integrated CPU with Memory Controller
- Eliminates off-chip controller delay: fast, but harder to adopt new DRAM technologies
- The entire burden of pin count and interconnect speed needed to sustain memory bandwidth increases now falls on the CPU package alone
Challenge in Memory Bandwidth and Pin Count
Challenge in Memory Bandwidth
Historical trend in memory bandwidth demand: current generation 10-20 GB/s; next generation >100 GB/s, possibly approaching 1 TB/s
New Packaging