
Towards Scalable and Energy-Efficient Memory System Architectures

Rajeev Balasubramonian, Al Davis, Ani Udipi, Kshitij Sudan,

Manu Awasthi, Nil Chatterjee, Seth Pugsley, Manju Shevgoor

School of Computing, University of Utah


Convergence of Technology Trends

• Energy
• Reliability
• New memory technologies
• BW, capacity, and locality for multi-cores

Overhaul of main memory architecture!


High Level Approach

• Explore changes to memory chip microarchitecture
  – Must cause minimal disruption to density
• Explore changes to interfaces and standards
  – Major change appears inevitable!
• Explore system and memory controller innovations
  – Most attractive, but order-of-magnitude improvement unlikely

Design solutions that are technology-agnostic


Projects

Memory Chip

• Reduce overfetch

• Support reliability

• Handle PCM drift

• Promote read/write parallelism

Memory Interface

• Interface with photonics

• Organize channel for high capacity

Memory Controller

• Maximize use of row buffer

• Schedule for low latency and energy

• Exploit mini-ranks

[Figure: system diagram – CPU with on-chip memory controller (MC) connected to a chain of DIMMs]


Talk Outline

Mature work:
• SSA architecture – Single Subarray Access (ISCA’10)
• Support for reliability (ISCA’10)
• Interface with photonics (ISCA’11)
• Micro-pages – data placement for row buffer efficiency (ASPLOS’10)
• Handling multiple memory controllers (PACT’10)
• Managing resistance drift in PCM cells (NVMW’11)

Preliminary work:
• Handling read/write parallelism
• Enabling high capacity
• Handling DMA scheduling
• Exploiting rank subsetting for performance and thermals


Minimizing Overfetch with Single Subarray Access

Ani Udipi


Problem 1 - DRAM Chip Energy

• On every DRAM access, multiple arrays in multiple chips are activated

• Was useful when there was good locality in access streams – open-page policy
• Helped keep density high and reduce cost-per-bit
• With multi-threaded, multi-core, and multi-socket systems, there is much more randomness
  – “Mixing” of access streams as finally seen by the memory controller


Rethinking DRAM Organization

• Limited use for designs based on locality
• As much as 8 KB read in order to service a 64-byte cache line request
• Termed “overfetch”
  – Substantially increases energy consumption
• Need a new architecture that
  – Eliminates overfetch
  – Increases parallelism
  – Increases opportunity for power-down
  – Allows efficient reliability


Proposed Solution – SSA Architecture


[Figure: SSA organization – the memory controller drives a shared address/command bus and eight 8-bit data buses to the DIMM; within one DRAM chip, each bank is divided into subarrays with their own bitlines and row buffers, connected through a global interconnect to the I/O, and an entire 64-byte cache line is supplied by a single subarray]


SSA Basics

• Entire DRAM chip divided into small “subarrays”

• Width of each subarray is exactly one cache line

• Fetch entire cache line from a single subarray in a single DRAM chip – SSA

• Groups of subarrays combined into “banks” to keep peripheral circuit overheads low

• Close-page policy and “posted-RAS”

• Data bus to processor essentially split into 8 narrow buses
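To make the mapping concrete, here is a minimal Python sketch of how a physical address could be decomposed so that an entire cache line lands in one subarray of one chip; the field order and array sizes are illustrative assumptions, not the actual SSA layout.

```python
# Hypothetical address decomposition for an SSA-style organization:
# each 64-byte cache line maps entirely to one subarray on one DRAM chip.
# Field widths below are illustrative assumptions.

CACHE_LINE_BYTES   = 64
CHIPS_PER_DIMM     = 8      # one narrow 8-bit bus per chip
BANKS_PER_CHIP     = 8
SUBARRAYS_PER_BANK = 128    # assumed subarray count
ROWS_PER_SUBARRAY  = 1024   # assumed

def decode_ssa_address(phys_addr: int):
    """Split a physical address into SSA coordinates (chip, bank, subarray, row)."""
    line = phys_addr // CACHE_LINE_BYTES
    chip = line % CHIPS_PER_DIMM          # interleave lines across chips/buses
    line //= CHIPS_PER_DIMM
    bank = line % BANKS_PER_CHIP
    line //= BANKS_PER_CHIP
    subarray = line % SUBARRAYS_PER_BANK
    line //= SUBARRAYS_PER_BANK
    row = line % ROWS_PER_SUBARRAY
    return chip, bank, subarray, row

if __name__ == "__main__":
    print(decode_ssa_address(0x12345680))
```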


SSA Architecture Impact

• Energy reduction
  – Dynamic: fewer bitlines activated
  – Static: smaller activation footprint – more and longer spells of inactivity – better power-down
• Latency impact
  – Limited pins per cache line – serialization latency
  – Higher bank-level parallelism – shorter queuing delays
• Area increase
  – More peripheral circuitry and I/O at finer granularities – area overhead (< 5%)


Area Impact

• Smaller arrays – more peripheral overhead
• More wiring overhead in the on-chip interconnect between arrays and pin pads
• We did a best-effort area impact calculation using a modified version of CACTI 6.5
  – Analytical model, has its limitations
• More feedback in this specific regard would be awesome!
• More info on exactly where in the hierarchy overfetch stops would be great too


Support for Chipkill Reliability

Ani Udipi


Problem 2 – DRAM Reliability

• Many server applications require chipkill-level reliability – tolerating the failure of an entire DRAM chip
• One example of existing systems
  – Consider a baseline 64-bit word plus 8-bit ECC
  – Each of these 72 bits must be read out of a different chip, else a chip failure will lead to a multi-bit error in the 72-bit field – unrecoverable!
  – Reading 72 chips – significant overfetch!
• Chipkill is even more of a concern for SSA since the entire cache line comes from a single chip


Proposed Solution

Approach similar to RAID-5


[Figure: DIMM with nine DRAM devices – each cache line L0–L63 is stored, together with a local checksum C, entirely within one device, and global parity blocks P0–P7 are rotated RAID-5-style across the devices. L – cache line, C – local checksum, P – global parity]


Chipkill design

• Two-tier error protection (sketched below)
• Tier-1 protection – self-contained error detection
  – 8-bit checksum per cache line – 1.625% storage overhead
  – Every cache line read is now slightly longer
• Tier-2 protection – global error correction
  – RAID-like striped parity across 8+1 chips
  – 12.5% storage overhead
• Error-free access (common case)
  – 1-chip reads
  – 2-chip writes – leads to some bank contention – 12% IPC degradation
• Erroneous access
  – 9-chip operation
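A minimal sketch of the two-tier flow, assuming a simple additive 8-bit checksum for tier-1 detection and byte-wise XOR parity for the tier-2 RAID-like stripe; the actual codes used in the design may differ.

```python
# Sketch of the two-tier chipkill idea (illustrative, not the actual design).
NUM_DATA_CHIPS = 8
LINE_BYTES = 64

def checksum8(line: bytes) -> int:
    """Tier-1: simple 8-bit checksum stored alongside the line (assumed scheme)."""
    return sum(line) & 0xFF

def parity_line(lines):
    """Tier-2: byte-wise XOR parity across the data chips (RAID-5 style)."""
    parity = bytearray(LINE_BYTES)
    for line in lines:
        for i, b in enumerate(line):
            parity[i] ^= b
    return bytes(parity)

def read_line(chips, parity, idx):
    """Common case: read one chip; on checksum mismatch, rebuild from the other 8."""
    line, stored_ck = chips[idx]
    if checksum8(line) == stored_ck:
        return line                      # 1-chip read, error-free
    # Tier-2 recovery: XOR the surviving 7 data lines with the parity line.
    survivors = [l for i, (l, _) in enumerate(chips) if i != idx]
    return parity_line(survivors + [parity])

if __name__ == "__main__":
    data = [bytes([i] * LINE_BYTES) for i in range(NUM_DATA_CHIPS)]
    chips = [(l, checksum8(l)) for l in data]
    par = parity_line(data)
    chips[3] = (bytes(LINE_BYTES), 0xAB)         # simulate a failed chip
    assert read_line(chips, par, 3) == data[3]   # reconstructed via parity
```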


Questions

• What are the common failure modes in DRAM? PCM?
• Do entire chips fail?
• Do parts of chips fail?
  – Which parts? Bitlines? Wordlines? Capacitors?
  – Entire arrays?
  – Entire banks?
  – I/O?
• Should all these failures be handled the same way?


Designing Photonic Interfaces

Ani Udipi


Problem 3 – Memory interconnect

• Electrical interconnects are not scaling well
  – Where can photonics make an impact, both on energy and performance?
• Various levels in the DRAM interconnect
  – Memory cell to sense-amp – addressed by SSA
  – Row buffer to I/O – currently electrical (on-chip)
  – I/O pins to processor – currently electrical (off-chip)
• Photonic interconnects
  – Large static power component – laser/ring tuning
  – Much lower dynamic component – relatively unaffected by distance
• Electrical interconnects
  – Relatively small static component
  – Large dynamic component
• Cannot overprovision photonic bandwidth – use only where necessary


Consideration 1 – How much photonics on a die?


[Figure: electrical energy vs. photonic energy as the amount of photonics on the die is varied]


Consideration 2 - Increasing Capacity

• 3D stacking is imminent
• There will definitely be several dies on the channel
  – Each die has photonic components that are constantly burning static power
  – Need to minimize this!
• TSVs available within a stack; best of both worlds
  – Large bandwidth
  – Low static energy
  – Need to exploit this!


Proposed Design


[Figure: proposed design – the processor's memory controller connects over a photonic waveguide to a DIMM in which the DRAM chips are stacked on top of a photonic interface die that also houses the stack controller]


Proposed Design – Interface Die

• Exploit 3D die stacking to move all photonic components to a separate interface die, shared by several memory dies

– Use photonics where there is heavy utilization – the shared bus between the processor and the interface die, i.e., the off-chip interconnect

– Helps break pin barrier for efficient I/O, substantially improves socket-edge BW

– On-stack, where there is low utilization, use efficient low-swing interconnects and TSVs


Advantages of the proposed system

• Reduction in energy consumption
  – Fewer photonic resources, without loss in performance
  – Rings, couplers, trimming
• Industry considerations
  – Does not affect the design of commodity memory dies
  – The same memory die can be used with both photonic and electrical systems
  – The same interface die can be used with different kinds of memory dies – DRAM, PCM, STT-RAM, memristors


Problem 4 – Communication Protocol

• Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface

– Need to handle heterogeneous memory modules, each with its own maintenance requirements, which further complicates scheduling

– Very little interoperability – affects both consumers (too many choices!) and vendors (stock-keeping and manufacturing)

– Heavy pressure on address/command bus – several commands to micro-manage every operation of the DRAM

– Several independent banks – need to maintain large amounts of state to schedule requests efficiently

– Simultaneous arbitration for multiple resources (address bus, data bank, data bus) to complete a single transaction


Proposed Solution – Packet-based interface

• Release most of the tight control the memory controller holds today
• Move mundane tasks to the memory modules themselves (on the interface die) – make them more autonomous
  – Maintenance operations (refresh, scrub, etc.)
  – Routine operations (DRAM precharge, NVM wear handling)
  – Timing control (DRAM alone has almost 20 different timing constraints to be respected)
  – Coding and any other special requirements
• The only information the memory module needs is the address and a read/write identifier; time slots are reserved a priori for data return (sketched below)
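A rough sketch of what such a packet might carry; all field names and sizes here are assumptions for illustration, not a proposed standard.

```python
# Illustrative sketch of a packet-based memory interface (fields are assumptions).
from dataclasses import dataclass
from enum import Enum

class Op(Enum):
    READ = 0
    WRITE = 1

@dataclass
class MemPacket:
    address: int          # target cache-line address
    op: Op                # read/write identification
    return_slot: int      # data-bus time slot reserved a priori for the reply
    payload: bytes = b""  # write data (empty for reads)

def encode(pkt: MemPacket) -> bytes:
    """Pack the request into a byte stream for the channel (toy 16-byte header)."""
    header = (pkt.address.to_bytes(8, "little") +
              pkt.op.value.to_bytes(1, "little") +
              pkt.return_slot.to_bytes(7, "little"))
    return header + pkt.payload

# The interface die, not the host controller, would handle refresh, precharge,
# timing constraints, and wear management before servicing this packet.
if __name__ == "__main__":
    print(len(encode(MemPacket(0x1F4000, Op.READ, return_slot=12))))
```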


Advantages

• Better interoperability, plug and play
  – As long as the interface die has the necessary information, everything is interchangeable
• Better support for heterogeneous systems
  – Allows easier data movement between, for example, DRAM and NVM on the same channel
• Reduces memory controller complexity
• Allows innovation and value addition in the memory, without being constrained by processor-side support
• Reduces bit-transport energy on the address/command bus


Data Placement with Micro-Pages to Boost Row Buffer Utility

Kshitij Sudan


DRAM Access Inefficiencies
• Overfetch due to large row buffers
  – 8 KB read into the row buffer for a 64-byte cache line
  – Row-buffer utilization for a single request < 1%
• Diminishing locality in multi-cores
  – Increasingly randomized memory access stream
  – Row-buffer hit rates bound to go down
• Open-page policy and FR-FCFS request scheduling
  – Memory controller schedules requests to open row buffers first

Goal: Improve row-buffer hit rates for chip multi-processors


Key Observation
Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row

[Figure: cache block access pattern within OS pages]
For heavily accessed pages in a given time interval, accesses are usually to a few cache blocks


Basic Idea

[Figure: basic idea – 4 KB OS pages in DRAM are divided into 1 KB micro-pages; the hottest micro-pages are gathered into a reserved DRAM region, while the coldest stay in place]


Hardware Implementation (HAM)

[Figure: baseline vs. Hardware Assisted Migration (HAM) – a CPU memory request to physical address X is looked up in a mapping table of (old address → new address) entries; if present, the request is redirected to new address Y inside a 4 MB reserved DRAM region of the 4 GB main memory, otherwise it proceeds to the original page as in the baseline]
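A minimal sketch of the HAM redirection step, assuming a plain dictionary stands in for the hardware mapping table; the 1 KB micro-page size and 4 MB reserved region follow the slides, everything else (addresses, reserved-region base) is illustrative.

```python
# Sketch of Hardware Assisted Migration (HAM) address remapping (illustrative).
MICRO_PAGE = 1024                    # 1 KB micro-pages
RESERVED_BASE = 0xFFC00000           # assumed start of the 4 MB reserved region

class MappingTable:
    def __init__(self):
        self.remap = {}              # old micro-page number -> new micro-page number

    def migrate(self, old_addr: int, slot: int):
        """Copy a hot micro-page into a reserved-region slot and record the mapping."""
        self.remap[old_addr // MICRO_PAGE] = (RESERVED_BASE // MICRO_PAGE) + slot

    def translate(self, phys_addr: int) -> int:
        """Redirect accesses to migrated micro-pages; others pass through unchanged."""
        page, offset = divmod(phys_addr, MICRO_PAGE)
        return self.remap.get(page, page) * MICRO_PAGE + offset

if __name__ == "__main__":
    ham = MappingTable()
    ham.migrate(0x12345400, slot=0)                  # hot micro-page X -> slot 0 (Y)
    print(hex(ham.translate(0x12345410)))            # lands in the reserved region
    print(hex(ham.translate(0x55555000)))            # unmigrated address unchanged
```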


Results

[Figure: percent change in performance for the ROPS, HAM, and ORACLE schemes with a 5M-cycle epoch]

Apart from the average 9% performance gains, our schemes also save DRAM energy at the same time!


Conclusions

• On average, for applications with room for improvement and with our best performing scheme:
  – Average performance ↑ 9% (max. 18%)
  – Average memory energy consumption ↓ 18% (max. 62%)
  – Average row-buffer utilization ↑ 38%
• Hardware-assisted migration offers better returns due to fewer overheads from TLB shoot-downs and misses


Data Placement Across Multiple Memory Controllers

Kshitij Sudan


DRAM NUMA Latency

[Figure: four-socket system – each socket contains four cores and an on-chip memory controller (MC) driving a memory channel to its local DIMMs; sockets are connected by QPI links, so requests to remote DIMMs must cross socket boundaries]


Problem Summary

• Pin limitations → increasing queuing delay
  – Almost 8× increase in queuing delays from a single core/one thread to 16 cores/16 threads
• Multi-cores → increasing row-buffer interference
  – Increasingly randomized memory access stream
• Longer on- and off-chip wire delays → increasing NUMA factor
  – NUMA factor already at 1.5× today

Goal: Improve application performance by reducing queuing delays and NUMA latency


Policies to Manage Data Placement Among MCs

• Adaptive First Touch
  – Assign new virtual pages to a DRAM (physical) page belonging to the MC j that minimizes a cost function:
    cost_j = α × load_j + β × rowhits_j + λ × distance_j
• Dynamic Page Migration
  – Programs change phases → imbalance in MC load
  – Migrate pages between MCs at runtime, choosing the destination MC k by
    cost_k = Λ × distance_k + Γ × rowhits_k
• Integrating Heterogeneous Memory Technologies
    cost_j = α × load_j + β × rowhits_j + λ × distance_j + τ × LatencyDimmCluster_j + µ × Usage_j
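A small sketch of how the Adaptive First Touch cost function could be evaluated to pick an MC for a newly touched page; the weights and their signs are illustrative assumptions, not the values used in the PACT’10 work.

```python
# Minimal sketch of Adaptive First Touch MC selection (weights are assumptions).
ALPHA, BETA, LAM = 1.0, -0.5, 2.0   # load penalty, row-hit benefit (negative), distance penalty

def aft_cost(mc):
    """cost_j = alpha*load_j + beta*rowhits_j + lambda*distance_j"""
    return ALPHA * mc["load"] + BETA * mc["rowhits"] + LAM * mc["distance"]

def place_new_page(mcs):
    """Assign a newly touched virtual page to the MC with the lowest cost."""
    return min(range(len(mcs)), key=lambda j: aft_cost(mcs[j]))

if __name__ == "__main__":
    mcs = [
        {"load": 10, "rowhits": 0.6, "distance": 1},   # local, lightly loaded
        {"load": 40, "rowhits": 0.8, "distance": 1},   # local, congested
        {"load": 5,  "rowhits": 0.3, "distance": 3},   # remote socket
    ]
    print("Chosen MC:", place_new_page(mcs))
```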


Summary

• Multiple on-chip MCs will be common in future CMPs
  – Multiple cores sharing one MC, MCs controlling different types of memories
  – Intelligent data mapping needed
• Adaptive First Touch policy (AFT)
  – Increases performance by 6.5% in a homogeneous hierarchy and by 1.6% in a DRAM–PCM hierarchy
• Dynamic page migration, an improvement on AFT
  – Further improvement over AFT – 8.9% over baseline in the homogeneous case, and 4.4% in the best performing DRAM–PCM hierarchy


Managing Resistance Drift in PCM Cells

Manu Awasthi


Quick Summary

• Multi-level cells in PCM appear imminent
• A number of proposals exist to handle hard errors and lifetime issues of PCM devices
• Resistance drift is a less explored phenomenon
  – Will become increasingly significant as the number of levels per cell increases – the primary cause of “soft errors”
  – Naïve techniques based on DRAM-like refresh will be extremely costly in both latency and energy
  – Need to explore holistic solutions to counter drift


What is Resistance Drift?

[Figure: resistance vs. time for a 2-bit PCM cell – the four states 11, 10, 01, 00 span the range from crystalline (lowest resistance) to amorphous (highest); a cell programmed at time T0 drifts upward in resistance until, by time Tn, it crosses into the neighboring state's band and is read in error]


Resistance Drift Data

Cell Type           | Drift Time at Room Temperature (secs)
Median 11 cell      | 10^499
Worst Case 11 cell  | 10^15
Median 10 cell      | 10^24
Worst Case 10 cell  | 5.94
Median 01 cell      | 10^8
Worst Case 01 cell  | 1.81

(States span 11, 10, 01, 00 from lowest to highest resistance.)


Resistance Drift - Issues

• Programmed resistance drifts according to a power-law equation: R_drift(t) = R_0 × t^α
• R_0 and α usually follow a Gaussian distribution
• Time to drift (error) depends on
  – the programmed resistance (R_0), and
  – the drift coefficient (α)
  – and is highly unpredictable!
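A small numerical sketch of the power-law model, solving for the time at which a cell's resistance crosses into the next state's band; all parameter values here are made-up assumptions, not measured PCM numbers.

```python
# Power-law drift model: R_drift(t) = R0 * t**alpha (t in seconds, t >= 1).
# All numbers below are illustrative assumptions.

def time_to_error(r0, alpha, r_threshold):
    """Solve R0 * t**alpha >= r_threshold for t: the time until the cell's
    resistance drifts into the next state's band and is misread."""
    if alpha <= 0:
        return float("inf")          # a non-drifting cell never crosses the band
    return (r_threshold / r0) ** (1.0 / alpha)

if __name__ == "__main__":
    # A "median" cell (typical R0, small alpha) vs. a "worst-case" cell (high R0, high alpha).
    print(time_to_error(r0=1e4, alpha=0.02, r_threshold=1e5))   # astronomically long drift time
    print(time_to_error(r0=5e4, alpha=0.10, r_threshold=1e5))   # drifts to an error quickly
```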


Resistance Drift - How it happens

[Figure: distribution of cells across the resistance bands of states 11, 10, 01, 00, showing a median-case cell drifting within its band and a worst-case cell drifting across the band boundary into an error]

• Median-case cell: typical R_0, typical α – drift stays within the state's band
• Worst-case cell: high R_0, high α – drift crosses into the next band, causing an error

The scrub rate will be dictated by the worst-case R_0 and worst-case α
Naïve refresh/scrub will be extremely costly!


Architectural Solutions – Headroom
• Assumes support for Light Array Reads for Drift Detection (LARDDs) & ECC-N
• Headroom-h scheme – scrub is triggered if N−h errors are detected (sketched below)
  + Decreases the probability of errors slipping through
  – Increases the frequency of full scrubs and hence decreases lifetime
  – Gradual Headroom scheme: start with a large LARDD frequency, increase the frequency as errors increase

[Flowchart: periodically read the line and check for errors; if the error count is below N−h, re-check after N cycles, otherwise scrub the line]
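A minimal sketch of the Headroom-h policy built on LARDD checks; the values of N, h, and the drift model here are illustrative assumptions.

```python
# Sketch of the Headroom-h scrub policy on top of LARDD + ECC-N (illustrative).
import random

ECC_N = 8          # ECC can correct up to N errors per line (assumed)
HEADROOM_H = 2     # scrub once N - h errors are observed

def lardd_check(line_errors: int) -> bool:
    """A light array read reports the current number of drifted (erroneous) cells."""
    return line_errors >= ECC_N - HEADROOM_H

def monitor_line(drift_events):
    """Run periodic LARDD checks; scrub (rewrite) the line before errors exceed ECC reach."""
    errors = 0
    for interval, new_errors in enumerate(drift_events):
        errors += new_errors
        if lardd_check(errors):
            print(f"interval {interval}: {errors} errors -> scrub line")
            errors = 0               # scrub restores all cells to their programmed levels

if __name__ == "__main__":
    random.seed(0)
    monitor_line([random.choice([0, 0, 1, 2]) for _ in range(20)])
```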


Reducing Overheads with Circuit Level Solution

• Invoking ECC on every LARDD increases energy consumption
• A parity-like error detection circuit is used to signal the need for a full-fledged ECC error detect (sketched below)
  – The number of drift-prone states in each line is counted when the line is written into memory (a single bit represents odd/even)
  – At every LARDD, the parity is verified
• Reduces the need for an ECC read-compare at every LARDD cycle
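A sketch of the parity check, assuming the intermediate states (01, 10) are the drift-prone ones counted at write time; which states count as drift-prone is an assumption for illustration.

```python
# Sketch of the drift-parity idea (illustrative): count drift-prone states at write
# time, keep one odd/even bit, and flag a line only if that parity changes on a LARDD.
DRIFT_PRONE = {0b10, 0b01}    # assumed drift-prone states

def drift_parity(cells):
    """Parity (odd/even) of the number of drift-prone cells in the line."""
    return sum(1 for c in cells if c in DRIFT_PRONE) & 1

def lardd_needs_ecc(cells_now, stored_parity):
    """A full ECC check is invoked only when the lightweight parity no longer matches."""
    return drift_parity(cells_now) != stored_parity

if __name__ == "__main__":
    written = [0b11, 0b10, 0b01, 0b00, 0b10]
    parity_at_write = drift_parity(written)          # stored alongside the line
    drifted = [0b11, 0b10, 0b00, 0b00, 0b10]         # one 01 cell drifted into 00
    print(lardd_needs_ecc(drifted, parity_at_write)) # True -> run the ECC check
```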


More Solutions

• Precise Writes
  – More write iterations to program the state closer to the mean, reducing the chance of drift
  – Increases energy consumption and write time, and decreases lifetime!
• Non-Uniform Guardbanding
  – Baseline: the resistance range is equally distributed among all n states
  – Expand the resistance range for drift-prone states at the expense of non-drift-prone ones


Results

[Figure: number of errors vs. LARDD interval (seconds)]


Conclusions

• Resistance drift will worsen with MLC scaling
• Naïve solutions based on ECC support are costly for PCM
  – Increased write energy, decreased lifetimes
• Holistic solutions need to be explored to counter drift at the device, architectural, and system levels
  – 39% reduction in energy, 4× fewer errors, 102× increase in lifetime


Handling Read/Write Parallelism

Nil Chatterjee


The Problem
• Writes are not on the critical path for program execution, but they can slow down reads through resource contention
• In future chipkill-correct systems, each data write will necessitate an update of the ECC codes, and the impact of writes will be more evident
• In PCM, the problem is exacerbated by the significantly longer write times
• Abstracting the writes away improves read latency by 48% in non-ECC DRAM systems


Impact of Writes on Reads
• Write draining affects read latencies by
  – Increasing the queuing delay
  – Reducing the read stream's row-buffer locality


Bank Contention from Writes
• Reads are not scheduled in the middle of the write-queue (WQ) drain because that would require multiple bus turnarounds, incurring tWRT and tOST delays
• Underutilization of data bus bandwidth during WQ draining leads to performance loss
• However, opportunities to schedule read accesses to idle banks might exist in this interval


Solution: Increasing R/W Overlap
• During a WQ drain cycle, schedule partial reads to idle banks (sketched below)
  – Following a column read command, the data is fetched from the sense amplifiers into a small buffer (64 bytes) near the I/O pads
  – Data is streamed out only after the WQ reaches the low watermark – no turnaround delays
• Immediately following the WQ drain, a flurry of prefetched reads can occupy the data bus
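A toy scheduler sketch of the idea – issue writes from the WQ while opportunistically latching partial reads from idle banks into small buffers, then stream the buffered data after the drain; the queue structures and buffer count are assumptions.

```python
# Sketch of "partial reads during write-queue drain" scheduling (illustrative).
# Banks busy with write drains are skipped; idle banks get partial reads that latch
# data into per-chip buffers, and the buffered data is streamed after the drain.

def schedule_drain_cycle(write_queue, read_queue, busy_banks, partial_read_buffers):
    """One scheduling step during a WQ drain: issue a write, and opportunistically
    issue partial reads to banks the writes are not using (no bus turnaround yet)."""
    issued = []
    if write_queue:
        wr = write_queue.pop(0)
        busy_banks.add(wr["bank"])
        issued.append(("WRITE", wr))
    for rd in list(read_queue):
        if rd["bank"] not in busy_banks and len(partial_read_buffers) < 4:
            read_queue.remove(rd)
            partial_read_buffers.append(rd)      # data parked near the I/O pads
            issued.append(("PARTIAL_READ", rd))
    return issued

if __name__ == "__main__":
    wq = [{"bank": 0}, {"bank": 1}]
    rq = [{"bank": 2}, {"bank": 0}, {"bank": 3}]
    bufs, busy = [], set()
    while wq:
        print(schedule_drain_cycle(wq, rq, busy, bufs))
    print("stream after drain:", bufs)           # burst these once the bus turns around
```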


Impact
• A small pool of partial-read registers can help increase data bus utilization after writes
• In PCM systems, where writes are very expensive, partial reads can have an even higher impact
• The JEDEC standard must be augmented to support a partial read command


Organizing Channels for High Capacity

Kshitij Sudan


Increasing DRAM Capacity by Re-Architecting the Memory Channel
• Increase DRAM capacity while minimizing power
• Re-architect the CPU-to-DRAM channel
  – Study effects of bus width and protocol (serial vs. parallel)
• CMPs might have changed the playing field!


Increasing DRAM Capacity by Re-Architecting Memory Channel

• Organize modules as a binary tree, and move some MC functionality to a “buffer chip”
• Reduces module depth from O(n) to O(log n)
• Reduces worst-case latency and improves signal integrity
• The buffer chip manages low-level DRAM operations and channel arbitration
• Not limited by worst-case access latency like FB-DIMM
• NUMA-like DRAM access – leverage data mapping
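A back-of-the-envelope sketch contrasting the O(n) depth of a daisy-chained channel with the O(log n) depth of the proposed binary tree of buffer chips; the topology details are assumptions.

```python
# Sketch comparing worst-case hop count for a daisy-chained channel (FB-DIMM style)
# vs. a binary tree of buffer chips (illustrative).
import math

def daisy_chain_hops(num_modules: int) -> int:
    return num_modules                              # O(n): the last module sits n hops away

def binary_tree_hops(num_modules: int) -> int:
    return math.ceil(math.log2(num_modules + 1))    # O(log n) depth of the tree

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, daisy_chain_hops(n), binary_tree_hops(n))
```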


Handling DMA Scheduling

Kshitij Sudan


Handling DMA Scheduling
• Reduce conflicts between CPU-generated DRAM requests and DMA-generated DRAM requests


Handling DMA Scheduling
• Study interference from DMA requests on CPU-generated DRAM requests
  – With on-chip MCs, it is unclear how DMA requests compete with CPU requests
• Devise scheduling policies to minimize DMA and CPU access conflicts
• Infer how DMA and CPU requests are arbitrated at the MC
  – No CPU manufacturer documentation is available publicly!


Variable Rank Subsetting

Seth Pugsley


Motivation for Rank Subsetting

• Rank Subsetting
  – Split up a rank + data channel into multiple, smaller ranks + data channels
• Prior motivations: reduce dynamic energy and overfetch


Rank Size Options

Rank width           | Data buses | Banks | Row buffers | 64-byte line transfer | Parallelism
Standard 8-chip rank | 1×64-bit   | 2     | 1×8 KB      | 8 clock edges         | all transfers sequential
4-chip narrow rank   | 2×32-bit   | 4     | 2×4 KB      | 16 clock edges        | 2 cache lines in parallel
2-chip narrow rank   | 4×16-bit   | 8     | 4×2 KB      | 32 clock edges        | 4 cache lines in parallel
1-chip narrow rank   | 8×8-bit    | 16    | 8×1 KB      | 64 clock edges        | 8 cache lines in parallel
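The table's entries follow directly from splitting a 64-bit channel; a small sketch that derives them (the 2 banks and 8 KB row buffer per standard rank come from the table above, the rest is arithmetic).

```python
# Sketch of how the rank-subsetting configurations in the table are derived
# (assumes a 64-bit channel, 2 banks and an 8 KB row buffer per 8-chip rank).
CHANNEL_BITS = 64
BASE_BANKS = 2
BASE_ROW_BYTES = 8 * 1024
LINE_BYTES = 64

def rank_subset_config(chips_per_rank: int):
    subsets = 8 // chips_per_rank                  # independent narrow ranks on the channel
    bus_bits = CHANNEL_BITS // subsets             # width of each narrow data bus
    banks = BASE_BANKS * subsets                   # total banks visible to the controller
    row_bytes = BASE_ROW_BYTES // subsets          # each row buffer is narrower
    clock_edges = LINE_BYTES * 8 // bus_bits       # edges to move one 64-byte line
    return dict(buses=f"{subsets}x{bus_bits}-bit", banks=banks,
                row_buffer_kb=row_bytes // 1024, clock_edges=clock_edges,
                parallel_lines=subsets)

if __name__ == "__main__":
    for chips in (8, 4, 2, 1):
        print(chips, rank_subset_config(chips))
```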


Impact on Queuing Delay

[Figure: timing diagrams – with a single bank, each core access occupies the bank for 16 cycles but the data bus (DB) for only 4 cycles, giving a data bus utilization of 25%; with two banks, accesses from two cores overlap and data bus utilization rises to 50%]


Advantages of Rank Subsetting

• More open rows
  – Each open row is narrower (still OK hit rates)
• Reduced queuing delay
  – More banks available and better data bus utilization


Performance for Static Rank Subsetting

[Figure: performance results across static rank-subsetting configurations]

Variable Rank Subsetting

• Use a different size rank for each memory op
  – e.g., a 1-wide transaction on the data bus at the same time as 2-wide and 4-wide transactions
  – Scheduling can get pretty hairy
  – Many wasted data bus slots

[Figure: data bus lanes D0–D7 over time, with 1-wide, 2-wide, 4-wide, and 8-wide transactions packed together and many wasted slots]


More Sensible Variable Rank Subsetting

• Still can use a different size rank for each memory op
• Limit rank size to only 2 options
  – Software chooses the mode for newly allocated pages
  – Scheduling is much easier than in the previous example

[Figure: data bus lanes D0–D7 over time with only 4-wide and 8-wide transactions, leaving far fewer wasted slots]


Exploiting Rank Subsetting to Alleviate Thermal Constraints

Manju Shevgoor


The Problem – DRAM Is Getting Hot
• DRAM temperatures can rise up to 95 °C
• Refresh rate needs to double once DRAM crosses 85 °C
• Thermal emergencies due to elevated temperatures adversely affect performance
• Cooling systems are expensive

[Figures: full-DIMM heat spreader (Zhu et al., ITHERM'08); typical cooling system (Liu et al., HPCA'11)]


Current Thermal Throttling Techniques

• CPU throttling – reduces overall activity
• Thermal shutdown – stops all requests to over-heated chips
• Memory bandwidth throttling – lowers channel bandwidth to reduce DRAM activity

• All DRAM chips are affected by these techniques, irrespective of their temperature
• Even cool chips, which could otherwise be operating at optimal throughput, are also throttled


Refresh Overhead

[Figure: refresh overhead data from Elastic Refresh, Stuecheli et al., MICRO'10]
• As memory chips get denser, this problem only worsens
• Integer workloads can have up to 13% IPC degradation because of refresh
• Chips working in the extended temperature range will cause even larger IPC degradation


Temperature Profile along a DIMM

• Proximity to the hot processor results in unequal temperatures across the DIMM

• Position with respect to airflow also impacts the temperature

• Temperature difference between the hottest and coolest chips can be 10°C

[Figures: typical temperature profile along the RDIMM (Zhu et al., ITHERM'08); typical cooling system (Liu et al., HPCA'11)]


Baseline

• All chips are grouped into one rank
• Not all chips are 'HOT'
• Not all chips need to be throttled!

[Figure: baseline rank organization – all DRAM chips on the DIMM, behind the buffer, form a single rank (Rank 1)]


Proposed Solution

[Figure: proposed rank organization – the DIMM is statically split into multiple ranks based on temperature, from Rank 1 (the coolest rank) to Rank 3 (the warmest rank)]
• Statically split the DIMM into multiple ranks based on temperature
• Not all ranks are equally hot, so penalize only the hottest ranks
• Control the refresh rate at rank granularity
  – Only the hottest chips are refreshed every 32 ms; the rest can be refreshed every 64 ms


Fine-Grained DRAM Throttling

• Need a throttling mechanism that can be applied at a finer granularity
• Temperature-aware cache replacement (see the sketch below)
  – Modify LRU to preferentially evict lines belonging to cool ranks
  – Will reduce activity only in hot ranks

[Figure: cache sets ordered MRU to LRU, with each line tagged by its home rank (R1–R4); eviction preferentially targets lines from cool ranks so that activity decreases ONLY in hot ranks]
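A minimal sketch of the temperature-aware victim selection, assuming the controller exposes which ranks are currently hot and that the replacement logic may scan a few positions from the LRU end; both are illustrative assumptions.

```python
# Sketch of temperature-aware cache replacement (illustrative): among the least
# recently used lines, prefer to evict one whose home rank is cool, so hot ranks
# receive fewer subsequent fills.
HOT_RANKS = {3}                 # assumed: rank 3 is currently the hottest

def choose_victim(lru_ordered_lines, search_depth=4):
    """lru_ordered_lines: list of (tag, rank), index 0 = LRU end of the set.
    Scan a few positions from the LRU end and evict a cool-rank line if one exists;
    otherwise fall back to plain LRU."""
    for tag, rank in lru_ordered_lines[:search_depth]:
        if rank not in HOT_RANKS:
            return tag
    return lru_ordered_lines[0][0]   # plain LRU fallback

if __name__ == "__main__":
    cache_set = [("A", 3), ("B", 1), ("C", 3), ("D", 2)]   # LRU ... MRU
    print(choose_victim(cache_set))   # evicts "B" (cool rank 1) instead of LRU "A" (hot rank 3)
```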


Rank-wise Refresh

[Figure: DIMM split behind the buffer into Rank 1 (coolest) through Rank 3 (warmest); ranks in the extended temperature range and the normal temperature range are refreshed at different rates]
• Refresh only as fast as needed
• Only ranks operating in the extended temperature range are refreshed every 32 ms
• Ranks operating in the normal temperature range are refreshed every 64 ms
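A tiny sketch of the per-rank refresh decision, using the 85 °C extended-temperature threshold from the earlier slide; the per-rank temperatures are assumed inputs from a thermal model or sensors.

```python
# Sketch of rank-wise refresh (illustrative): pick each rank's refresh interval
# from its own temperature instead of refreshing the whole DIMM at the worst-case rate.
EXTENDED_TEMP_C = 85.0     # above this, DRAM needs 2x refresh (32 ms instead of 64 ms)

def refresh_interval_ms(rank_temp_c: float) -> int:
    return 32 if rank_temp_c > EXTENDED_TEMP_C else 64

if __name__ == "__main__":
    rank_temps = {1: 70.0, 2: 78.0, 3: 91.0, 4: 83.0}   # per-rank temperatures (assumed)
    for rank, t in rank_temps.items():
        print(f"Rank {rank}: {t} C -> refresh every {refresh_interval_ms(t)} ms")
```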


Summary

• Split the DIMM into mini-ranks
• Model the temperature of the chips
• Throttle the activity of hot ranks – keeps the chips from reaching high temperatures
• Increase the refresh rate of the hot rank only – maintains data integrity of chips once they get hot
• Penalize hot ranks ONLY!!


Summary

• Converging technology trends require an overhaul of main memory architectures

• Multi-pronged approach required for significant improvements: memory chip, controller, interface, OS

• Future memory chips must also optimize for energy and reliability, and not just latency and density

• Publications: http://www.cs.utah.edu/arch-research/


Acknowledgments

• Collaborators at HP Labs, IBM, Intel

• Funding from NSF, Intel, HP, University of Utah

• Thanks for hosting!