Efficient Data Access in Future Memory Hierarchies
1
Efficient Data Access in Future Memory Hierarchies
Rajeev Balasubramonian
School of Computing, Research Buffet, Fall 2010
2
Acks
• Terrific students in the Utah Arch group
• Collaborators at HP, Intel, IBM
• Prof. Al Davis, who re-introduced us to memory systems
3
Current Trends
• Continued device scaling
• Multi-core processors
• The power wall
• Pin limitations
• Problematic interconnects
• Need for high reliability
4
Anatomy of Future High-Perf Processors
[Figure: a hi-perf core and a lo-perf core]
• Designs well understood
• Combo of hi- and lo-perf
• Risky Ph.D.!!
5
Anatomy of Future High-Perf Processors
[Figure: core attached to an L2/L3 cache bank]
• Large shared L3
• Partitioned into many banks
• Assume one bank per core
6
Anatomy of Future High-Perf Processors
[Figure: 4x4 grid of cores (C), each with its own cache bank ($)]
• Many cores!
• Large distributed L3 cache
7
Anatomy of Future High-Perf Processors
[Figure: 4x4 grid of cores and cache banks connected by an on-chip network]
• On-chip network
• Includes routers and long wires
• Used for cache coherence
• Used for off-chip requests/responses
8
Anatomy of Future High-Perf Processors
[Figure: multi-core chip with a memory controller connected to several DIMMs]
• Memory controller handles off-chip requests to memory
9
Anatomy of Future High-Perf Processors
[Figure: multi-socket motherboard; each CPU is attached to its own set of DIMMs]
• Multi-socket motherboard
• Lots of cores, lots of memory, all connected
10
Anatomy of Future High-Perf Processors
[Figure: CPUs and DIMMs backed by non-volatile PCM, which is backed by disks]
• DRAM backed up by slower, higher-capacity emerging non-volatile memories
• Eventually backed up by disk… maybe
11
Life of a Cache Miss
Miss in L1 (core/L1) → on-chip network → look up L2/L3 bank → on- and off-chip network → wait in memory controller (MC) queue → access DRAM (DIMM) → access PCM (non-volatile) → access disk
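As a rough illustration of this walk down the hierarchy, the sketch below tallies latency stage by stage; every latency value is a made-up placeholder, not a measured number.

```python
# Illustrative sketch (not from the talk): latency of a request as it walks
# down the hierarchy. All cycle counts are placeholder assumptions.

# Ordered stages a request may traverse; each entry is (stage name, cycles).
MISS_PATH = [
    ("L1 lookup (miss)",        3),
    ("on-chip network",        20),
    ("L2/L3 bank lookup",      30),
    ("on-/off-chip network",   40),
    ("MC queueing",            60),
    ("DRAM access",           150),
    ("PCM access",            600),
    ("disk access",     5_000_000),
]

def miss_latency(hit_stage: str) -> int:
    """Total cycles until the request is satisfied at `hit_stage`."""
    total = 0
    for name, cycles in MISS_PATH:
        total += cycles
        if name == hit_stage:
            return total
    raise ValueError(f"unknown stage: {hit_stage}")

print(miss_latency("DRAM access"))  # 303 cycles with these placeholder numbers
```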
12
Research Topics
Miss in L1 (core/L1) → on-chip network → look up L2/L3 bank → on- and off-chip network → wait in memory controller (MC) queue → access DRAM (DIMM) → access PCM (non-volatile) → access disk
Not very hot topics!
13
Research Topics
Miss in L1 (core/L1) → on-chip network → look up L2/L3 bank → on- and off-chip network → wait in memory controller (MC) queue → access DRAM (DIMM) → access PCM (non-volatile) → access disk
[Figure: research topics 1–6 marked on the corresponding stages of this cache-miss path]
14
Problems with DRAM
• DRAM main memory contributes 1/3rd of total energy in datacenters
• Long latencies; high bandwidth needs
• Error resilience is very expensive
• DRAM is a commodity and chips have to be compliant with standards
• Initial designs instituted in the 1990s
• Innovations are evolutionary
• Traditional focus on density
15
Time for a Revolutionary Change?
• Energy is far more critical today
• Cost-per-bit perhaps not as relevant today
• Memory reliability is increasing in importance
• Multi-core access streams have poor locality
• Queuing delays are starting to dominate
• Potential migration to new memory technologies and interconnects
• Incandescent light bulb
  – Low purchase cost
  – High operating cost
  – Commodity
• Energy-efficient light bulb
  – Higher purchase cost
  – Much lower operating cost
  – Value-addition
16
Key Idea
It’s worth a small increase in capital costs to gain large reductions in operating costs
[Figure: $3.00 bulb at 13 W vs. $0.30 bulb at 60 W]
And not 10X, just 15-20%!
17
DRAM Basics
[Figure: on-chip memory controller connected over a memory bus/channel to a DIMM; each rank is a set of DRAM chips (devices); each chip contains banks; each bank contains arrays; each array holds 1/8th of the row buffer and supplies one word of data output]
18
DRAM Operation
[Figure: the RAS command activates a row into the row buffer of a bank in each of four DRAM chips (one bank shown per chip); the CAS command then selects the cache line from the row buffers]
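For readers who prefer code, here is a minimal sketch of the RAS/CAS flow; the address field widths and the bank-indexed row buffer are assumptions for illustration only.

```python
# Illustrative sketch (not from the talk): how a physical address might be split
# into DRAM coordinates, and how RAS/CAS use them. Field widths are assumptions.

ROW_BITS, BANK_BITS, COL_BITS, OFFSET_BITS = 15, 3, 10, 6  # 64 B lines assumed

def decode(addr: int):
    """Split an address into (row, bank, column, offset) fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    addr >>= OFFSET_BITS
    col = addr & ((1 << COL_BITS) - 1)
    addr >>= COL_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    return row, bank, col, offset

row_buffer = {}  # bank -> currently open row

def access(addr: int) -> str:
    """RAS activates the row into the bank's row buffer; CAS then reads the column."""
    row, bank, col, _ = decode(addr)
    if row_buffer.get(bank) == row:
        return f"row-buffer hit: CAS only (bank {bank}, col {col})"
    row_buffer[bank] = row  # RAS: activate the new row (the old row is precharged first)
    return f"row-buffer miss: RAS + CAS (bank {bank}, row {row})"

print(access(0x12345))  # first touch: row-buffer miss
print(access(0x12345))  # same line again: row-buffer hit
```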
19
New Design Philosophy
• Eliminate overfetch; activate a single chip and a single small array → much lower energy, slightly higher cost
• Provide higher parallelism
• Add features for error detection
[ Appears in ISCA’10 paper, Udipi et al. ]
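To make the overfetch argument concrete, here is a back-of-the-envelope sketch; the row sizes and per-bit energy are illustrative assumptions, not figures from the ISCA'10 paper.

```python
# Back-of-the-envelope sketch (all numbers are illustrative assumptions):
# bits sensed to fetch one 64 B line with conventional overfetch versus
# activating a single small subarray.

LINE_BYTES     = 64
ROW_PER_CHIP   = 8 * 1024   # assume an 8 KB row activated in each chip
CHIPS_PER_RANK = 8
SUBARRAY_ROW   = 512        # assume a 512 B row in one small subarray
PJ_PER_BIT     = 1.0        # assumed activation energy per bit sensed

conventional_bits = CHIPS_PER_RANK * ROW_PER_CHIP * 8
subarray_bits     = SUBARRAY_ROW * 8

print(f"bits sensed, conventional:    {conventional_bits}")
print(f"bits sensed, single subarray: {subarray_bits}")
print(f"useful bits in the line:      {LINE_BYTES * 8}")
print(f"activation energy ratio:      "
      f"{(conventional_bits * PJ_PER_BIT) / (subarray_bits * PJ_PER_BIT):.0f}x")
```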
20
Single Subarray Access (SSA)
[Figure: memory controller connected to a DIMM by an addr/cmd bus and eight 8-bit data buses; within one DRAM chip, a bank is divided into subarrays, each with its own bitlines and row buffer, feeding a global interconnect to I/O; a single subarray supplies the full 64-byte cache line]
21
SSA Operation
[Figure: the address selects one subarray in one DRAM chip, which supplies the entire cache line; the remaining chips and subarrays stay in sleep mode or serve other parallel accesses]
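A rough sketch of the data-layout difference, with assumed chip and subarray counts (not the paper's exact address mapping):

```python
# Illustrative sketch (assumptions, not the paper's exact mapping): where the
# bytes of one 64 B cache line live under a conventional layout vs. SSA.

CHIPS = 8
LINE_BYTES = 64
SUBARRAYS_PER_CHIP = 16  # assumed

def conventional_layout(line_addr: int):
    """Baseline: the line is striped across all 8 chips (8 bytes per chip)."""
    return {("chip", chip): LINE_BYTES // CHIPS for chip in range(CHIPS)}

def ssa_layout(line_addr: int):
    """SSA: the whole line sits in one subarray of one chip; address bits pick
    the chip and subarray, so other chips can sleep or serve other requests."""
    chip = line_addr % CHIPS
    subarray = (line_addr // CHIPS) % SUBARRAYS_PER_CHIP
    return {("chip", chip, "subarray", subarray): LINE_BYTES}

print(conventional_layout(0x1234))  # 8 bytes from every chip
print(ssa_layout(0x1234))           # all 64 bytes from one subarray of one chip
```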
22
Consequences
• Why is this good?
  – Minimal activation energy for a line
  – More circuits can be placed in low-power sleep
  – Can perform multiple operations in parallel
• Why is this bad?
  – Higher area and cost (roughly 5 – 10%)
  – Longer data transfer time
  – Not compatible with today’s standards
  – No opportunity for row buffer hits
23
Narrow Vs. Wide Buses?
• What provides higher utilization? 1 wide bus or 8 narrow buses?
• Must worry about load imbalance and long data transfer time in the latter
• Must worry about many bubbles in the former because of dependences
[ Ongoing work, Chatterjee et al. ]
[Figure: timing of back-to-back requests to the same bank. With one 64-bit wide bus, the short data transfers (DT) leave 64 bits idle between bank accesses; with eight 8-bit wide buses, each transfer is longer but only 8 bits sit idle]
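For intuition, a small utilization calculation under assumed timing (a fixed bank-access delay, double-data-rate transfers); the numbers are placeholders, not results from the ongoing work.

```python
# Rough utilization sketch (assumed numbers): one 64-bit bus vs. eight 8-bit
# buses moving 64 B lines, with a fixed bank-access time before each transfer.

LINE_BITS   = 64 * 8
BANK_ACCESS = 40  # assumed cycles of bank access before the data transfer

def transfer_cycles(bus_width_bits: int, double_data_rate: bool = True) -> int:
    beats_per_cycle = 2 if double_data_rate else 1
    return LINE_BITS // (bus_width_bits * beats_per_cycle)

wide   = transfer_cycles(64)  # 4 cycles  -> bus idles while the bank is busy
narrow = transfer_cycles(8)   # 32 cycles -> bus stays busier, but each line is slower

print(f"64-bit bus: {wide}-cycle transfer, "
      f"{wide / (BANK_ACCESS + wide):.0%} busy for back-to-back same-bank requests")
print(f"8-bit bus:  {narrow}-cycle transfer, "
      f"{narrow / (BANK_ACCESS + narrow):.0%} busy for back-to-back same-bank requests")
```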
24
Methodology
• Tested with simulators (Simics) and multi-threaded benchmark suites
• 8-core simulations
• DRAM energy and latency from Micron datasheets
25
Results – Energy
[Figure: relative DRAM energy consumption for blackscholes, canneal, fluidanimate, streamcluster, x264, cg, comparing baseline open row, baseline close row, SBA, and SSA]
• Moving to a close-page policy: 73% energy increase on average
• Compared to open page: 3X reduction with SBA, 6.4X with SSA
26
Results – Energy – Breakdown
[Figure: DRAM energy breakdown (termination resistors, global interconnect, bitlines, decoder + wordline + sense-amps) for the open-page FR-FCFS baseline, the closed-row FCFS baseline, SBA, and SSA]
27
Results – Performance
[Figure: memory access latency in cycles for blackscholes, bodytrack, canneal, ferret, fluidanimate, freqmine, streamcluster, vips, x264, stream, cg, is, and the average, comparing baseline open page, baseline close page, SBA, and SSA]
• Serialization/queuing delay balance in SSA: roughly a 30% latency decrease on 6 of 12 benchmarks and a 40% increase on the other 6
28
Results – Performance – Breakdown
[Figure: latency breakdown (queuing delay, command/address transfer, rank switching delay (ODT), DRAM core access, data transfer) for the open-page FR-FCFS baseline, the closed-row FCFS baseline, SBA, and SSA]
29
Error Resilience in DRAM
• Important to withstand not only a single error, but also an entire chip failure – referred to as chipkill
• DRAM chips do not include error correction features – error tolerance must be built on top
• Example: 8-bit ECC code for a 64-bit word; for chipkill correctness, each of the 72 bits must be read from a separate DRAM chip → significant overfetch!
[Figure: chips 0–71 each contribute one bit of the 72-bit word on every bus transfer]
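A quick back-of-the-envelope calculation of that overfetch, with an assumed per-chip row size:

```python
# Illustrative arithmetic (row size is an assumption): the overfetch cost of
# classic chipkill, where every 72-bit word is spread one bit per chip.

LINE_BYTES   = 64
CHIPS        = 72        # 64 data bits + 8 ECC bits, one bit per chip
ROW_PER_CHIP = 1024      # assume a 1 KB row activated in each chip

bits_needed    = LINE_BYTES * 8 + LINE_BYTES   # 512 data bits + 64 ECC bits per line
bits_activated = CHIPS * ROW_PER_CHIP * 8

print(f"bits actually needed: {bits_needed}")
print(f"bits activated:       {bits_activated}")
print(f"overfetch factor:     {bits_activated // bits_needed}x")
```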
30
Two-Tiered Solution
• Add a small (8-bit) checksum for every cache line
• Maintain one extra DRAM chip for parity across 8 DRAM chips
• When the checksum flags an error, use the parity to reconstruct the corrupted cache line
• Writes will require updates to parity as well
[Figure: data chips 0–7 plus one parity chip]
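A minimal sketch of the two-tier mechanism, assuming a simple additive checksum and a byte-wise XOR parity layout (the actual codes in the design may differ):

```python
# Minimal sketch (checksum form and parity layout are assumptions): a per-line
# checksum detects an error, and a RAID-4-style parity chip across 8 data chips
# rebuilds the failed chip's contribution. Writes must also update the parity.

from functools import reduce

def checksum(line: bytes) -> int:
    """Tiny 8-bit additive checksum stored alongside each cache line."""
    return sum(line) & 0xFF

def parity(chunks: list[bytes]) -> bytes:
    """Parity chip contents: byte-wise XOR across the data chips."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def reconstruct(chunks: list[bytes], parity_chunk: bytes, bad_chip: int) -> bytes:
    """Rebuild the data held by a failed chip from the other chips plus parity."""
    others = [c for i, c in enumerate(chunks) if i != bad_chip]
    return parity(others + [parity_chunk])

# 64 B line striped 8 B per chip across 8 chips
chips = [bytes([i] * 8) for i in range(8)]
p = parity(chips)
assert reconstruct(chips, p, bad_chip=3) == chips[3]
```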
31
Research Topics
Miss in L1 (core/L1) → on-chip network → look up L2/L3 bank → on- and off-chip network → wait in memory controller (MC) queue → access DRAM (DIMM) → access PCM (non-volatile) → access disk
[Figure: research topics 1–6 marked on the corresponding stages of this cache-miss path]
32
Topic 2 – On-chip Networks
• Conventional wisdom: buses are not scalable; need routed packet-switched networks
• But, routed networks require bulky energy-hungry routers
• Results:
  – Buses can be made scalable by having a hierarchy of buses and Bloom filters to stifle broadcasts (a minimal filter sketch follows below)
  – Low-swing buses can yield low energy, simpler coherence, and scalable operation
[ Appears in HPCA’10 paper, Udipi et al. ]
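As a rough illustration of the broadcast-stifling idea, a per-segment Bloom filter might look like the sketch below; the filter size, hash count, and usage are arbitrary assumptions, not the HPCA'10 configuration.

```python
# Illustrative sketch: each bus segment keeps a Bloom filter of block addresses
# cached below it; a broadcast is forwarded down a segment only if the filter
# might contain the block, stifling most useless snoops.

class BloomFilter:
    def __init__(self, bits: int = 4096, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits)

    def _positions(self, block_addr: int):
        for i in range(self.hashes):
            yield hash((block_addr, i)) % self.bits

    def insert(self, block_addr: int):
        for p in self._positions(block_addr):
            self.array[p] = 1

    def may_contain(self, block_addr: int) -> bool:
        # False means "definitely not cached below": the snoop can be stifled.
        return all(self.array[p] for p in self._positions(block_addr))

segment_filter = BloomFilter()
segment_filter.insert(0x4000)              # a core below this segment caches block 0x4000
print(segment_filter.may_contain(0x4000))  # True -> forward the broadcast
print(segment_filter.may_contain(0x8000))  # very likely False -> stifle it
```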
33
Topic 3: Data Placement in Caches, Memory
• In a large distributed NUCA cache, or in a large distributed NUMA memory, data placement must be controlled with heuristics that are aware of:
  – capacity pressure in the cache bank
  – distance between CPU and cache bank and DIMM
  – queuing delays at memory controller
  – potential for row buffer conflicts at each DIMM
A hypothetical cost function combining these factors is sketched below.
[ Appears in HPCA’09, Awasthi et al. and in PACT’10, Awasthi et al. (Best paper!) ]
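One way to picture such a heuristic is as a weighted cost over the factors listed above; the terms and weights here are hypothetical, not the published policies.

```python
# Hypothetical placement cost (weights and terms are assumptions): score each
# candidate cache bank / DIMM for a page, balancing proximity against pressure
# and contention. Lower is better.

def placement_cost(hops_to_cpu: int, bank_occupancy: float,
                   mc_queue_len: int, row_conflict_rate: float,
                   w_dist=1.0, w_cap=4.0, w_queue=0.5, w_row=3.0) -> float:
    return (w_dist * hops_to_cpu +        # distance between CPU and bank/DIMM
            w_cap * bank_occupancy +      # capacity pressure in the cache bank
            w_queue * mc_queue_len +      # queuing delay at the memory controller
            w_row * row_conflict_rate)    # likelihood of row-buffer conflicts

candidates = {
    "nearby_but_crowded_bank": placement_cost(1, 0.95, 12, 0.6),
    "distant_but_idle_bank":   placement_cost(4, 0.40,  3, 0.2),
}
print(min(candidates, key=candidates.get))  # pick the lowest-cost home for the page
```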
34
Topic 4: Silicon Photonic Interconnects
• Silicon photonics can provide abundant bandwidth and makes sense for off-chip communication
• Can help multi-cores overcome the bandwidth problem
• Problems: DRAM design that best interfaces with silicon photonics, protocols that allow scalable operation
[ On-going work, Udipi et al. ]
35
Topic 5: Memory Controller Design
• Problem: abysmal row buffer hit rates, quality of service
• Solutions:
  – Co-locate hot cache lines in the same page
  – Predict row buffer usage and “prefetch” row closure (a simple predictor is sketched below)
  – QoS policies that leverage multiple memory “knobs”
[ Appears in ASPLOS’10 paper, Sudan et al. and on-going work, Awasthi et al., Sudan et al. ]
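A toy version of the row-closure idea, with a deliberately simple predictor; the published mechanisms are more sophisticated than this sketch.

```python
# Illustrative sketch: keep a row open only while more hits to it are likely,
# otherwise precharge ("prefetch" the closure) so the next request to the bank
# does not pay a row-conflict penalty. The hit-count heuristic is an assumption.

class RowClosurePredictor:
    def __init__(self, max_expected_hits: int = 4):
        self.max_expected_hits = max_expected_hits
        self.hits_to_open_row = 0

    def on_access(self, same_row: bool) -> str:
        if same_row:
            self.hits_to_open_row += 1
            return "row-buffer hit: serve with CAS only"
        self.hits_to_open_row = 0
        return "row-buffer miss: precharge + activate + CAS"

    def should_close_now(self) -> bool:
        # Predict no further reuse once the typical hit count is reached, and
        # close the row early instead of waiting for a conflict.
        return self.hits_to_open_row >= self.max_expected_hits

mc = RowClosurePredictor()
for same_row in (True, True, True, True, False):
    mc.on_access(same_row)
    if mc.should_close_now():
        print("close the row now rather than waiting for a conflict")
```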
36
Topic 6: Non Volatile Memories
• Emerging memories (PCM):
  – can provide higher densities at smaller feature sizes
  – are based on resistance, not charge (hence non-volatile)
  – can serve as replacements for DRAM/Flash/disk
• Problem: when a cell is programmed to a given resistance, the resistance tends to drift over time → may require efficient refresh or error correction (a simple drift-model sketch follows below)
[ On-going work, Awasthi et al. ]
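To illustrate the drift problem, here is a small sketch using the commonly cited power-law drift model R(t) = R0 · (t/t0)^ν; the exponent, resistance levels, and refresh threshold are assumed values, not figures from the ongoing work.

```python
# Illustrative sketch of resistance drift and threshold-triggered refresh,
# using the common power-law model R(t) = R0 * (t/t0)**nu. All values assumed.

def resistance(r0: float, t_seconds: float, t0: float = 1.0, nu: float = 0.1) -> float:
    """Programmed resistance r0 drifts upward over time t."""
    return r0 * (t_seconds / t0) ** nu

def needs_refresh(r0: float, t_seconds: float, level_boundary: float) -> bool:
    """Re-program the cell before drift pushes it across the boundary of the
    next resistance level, which would flip the stored value."""
    return resistance(r0, t_seconds) >= level_boundary

# A cell programmed to 10 kOhm, with the next resistance level starting at 20 kOhm:
print(needs_refresh(10e3, t_seconds=60,   level_boundary=20e3))  # ~15.1 kOhm: False
print(needs_refresh(10e3, t_seconds=3600, level_boundary=20e3))  # ~22.7 kOhm: True -> refresh
```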