
Towards Scalable and Energy-Efficient Memory System Architectures

Rajeev Balasubramonian, Al Davis, Ani Udipi, Kshitij Sudan,

Manu Awasthi, Nil Chatterjee, Seth Pugsley, Manju Shevgoor

School of Computing, University of Utah


Convergence of Technology Trends

• Energy
• Reliability
• New memory technologies
• BW, capacity, and locality for multi-cores

Overhaul of main memory architecture!


High Level Approach

• Explore changes to memory chip microarchitecture
  – Must cause minimal disruption to density
• Explore changes to interfaces and standards
  – Major change appears inevitable!
• Explore system and memory controller innovations
  – Most attractive, but order-of-magnitude improvement unlikely

Design solutions that are technology-agnostic


Projects

Memory Chip

• Reduce overfetch

• Support reliability

• Handle PCM drift

• Promote read/write parallelism

Memory Interface

• Interface with photonics

• Organize channel for high capacity

Memory Controller

• Maximize use of row buffer

• Schedule for low latency and energy

• Exploit mini-ranks

[Figure: system diagram – CPU with on-chip memory controller (MC) connected to a chain of DIMMs]


Talk Outline

Mature work:
• SSA architecture – Single Subarray Access (ISCA’10)
• Support for reliability (ISCA’10)
• Interface with photonics (ISCA’11)
• Micro-pages – data placement for row buffer efficiency (ASPLOS’10)
• Handling multiple memory controllers (PACT’10)
• Managing resistance drift in PCM cells (NVMW’11)

Preliminary work:
• Handling read/write parallelism
• Enabling high capacity
• Handling DMA scheduling
• Exploiting rank subsetting for performance and thermals


Minimizing Overfetch with Single Subarray Access

Ani Udipi


Problem 1 - DRAM Chip Energy

• On every DRAM access, multiple arrays in multiple chips are activated

• Was useful when there was good locality in access streams – open-page policy
• Helped keep density high and reduce cost-per-bit
• With multi-threaded, multi-core, and multi-socket systems, there is much more randomness
  – “Mixing” of access streams as finally seen by the memory controller


Rethinking DRAM Organization

• Limited use for designs based on locality
• As much as 8 KB read in order to service a 64-byte cache line request
• Termed “overfetch”
  – Substantially increases energy consumption
• Need a new architecture that
  – Eliminates overfetch
  – Increases parallelism
  – Increases opportunity for power-down
  – Allows efficient reliability


Proposed Solution – SSA Architecture


[Figure: SSA organization – the memory controller drives a shared address/command bus and eight 8-bit data buses to the DIMM; within one DRAM chip, each bank is divided into subarrays with their own bitlines and row buffers, connected through a global interconnect to the I/O, and an entire 64-byte cache line is supplied by a single subarray]


SSA Basics

• Entire DRAM chip divided into small “subarrays”

• Width of each subarray is exactly one cache line

• Fetch entire cache line from a single subarray in a single DRAM chip – SSA

• Groups of subarrays combined into “banks” to keep peripheral circuit overheads low

• Close-page policy and “posted-RAS”

• Data bus to processor essentially split into 8 narrow buses
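To make the mapping concrete, here is a minimal Python sketch of how a physical address could be decomposed so that an entire cache line lands in one subarray of one chip; the field order and array sizes are illustrative assumptions, not the actual SSA layout.

```python
# Hypothetical address decomposition for an SSA-style organization:
# each 64-byte cache line maps entirely to one subarray on one DRAM chip.
# Field widths below are illustrative assumptions.

CACHE_LINE_BYTES   = 64
CHIPS_PER_DIMM     = 8      # one narrow 8-bit bus per chip
BANKS_PER_CHIP     = 8
SUBARRAYS_PER_BANK = 128    # assumed subarray count
ROWS_PER_SUBARRAY  = 1024   # assumed

def decode_ssa_address(phys_addr: int):
    """Split a physical address into SSA coordinates (chip, bank, subarray, row)."""
    line = phys_addr // CACHE_LINE_BYTES
    chip = line % CHIPS_PER_DIMM          # interleave lines across chips/buses
    line //= CHIPS_PER_DIMM
    bank = line % BANKS_PER_CHIP
    line //= BANKS_PER_CHIP
    subarray = line % SUBARRAYS_PER_BANK
    line //= SUBARRAYS_PER_BANK
    row = line % ROWS_PER_SUBARRAY
    return chip, bank, subarray, row

if __name__ == "__main__":
    print(decode_ssa_address(0x12345680))
```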


SSA Architecture Impact

• Energy reduction
  – Dynamic: fewer bitlines activated
  – Static: smaller activation footprint – more and longer spells of inactivity – better power-down
• Latency impact
  – Limited pins per cache line – serialization latency
  – Higher bank-level parallelism – shorter queuing delays
• Area increase
  – More peripheral circuitry and I/O at finer granularities – area overhead (< 5%)


Area Impact

• Smaller arrays – more peripheral overhead
• More wiring overhead in the on-chip interconnect between arrays and pin pads
• We did a best-effort area impact calculation using a modified version of CACTI 6.5
  – Analytical model, has its limitations
• More feedback in this specific regard would be awesome!
• More info on exactly where in the hierarchy overfetch stops would be great too


Support for Chipkill Reliability

Ani Udipi


Problem 2 – DRAM Reliability

• Many server applications require chipkill-level reliability – tolerating the failure of an entire DRAM chip
• One example of existing systems
  – Consider a baseline 64-bit word plus 8-bit ECC
  – Each of these 72 bits must be read out of a different chip, else a chip failure will lead to a multi-bit error in the 72-bit field – unrecoverable!
  – Reading 72 chips – significant overfetch!
• Chipkill is even more of a concern for SSA since the entire cache line comes from a single chip


Proposed Solution

Approach similar to RAID-5


[Figure: DIMM with nine DRAM devices – each cache line L0–L63 is stored, together with a local checksum C, entirely within one device, and global parity blocks P0–P7 are rotated RAID-5-style across the devices. L – cache line, C – local checksum, P – global parity]


Chipkill design

• Two-tier error protection (sketched below)
• Tier-1 protection – self-contained error detection
  – 8-bit checksum per cache line – 1.625% storage overhead
  – Every cache line read is now slightly longer
• Tier-2 protection – global error correction
  – RAID-like striped parity across 8+1 chips
  – 12.5% storage overhead
• Error-free access (common case)
  – 1-chip reads
  – 2-chip writes – leads to some bank contention – 12% IPC degradation
• Erroneous access
  – 9-chip operation
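A minimal sketch of the two-tier flow, assuming a simple additive 8-bit checksum for tier-1 detection and byte-wise XOR parity for the tier-2 RAID-like stripe; the actual codes used in the design may differ.

```python
# Sketch of the two-tier chipkill idea (illustrative, not the actual design).
NUM_DATA_CHIPS = 8
LINE_BYTES = 64

def checksum8(line: bytes) -> int:
    """Tier-1: simple 8-bit checksum stored alongside the line (assumed scheme)."""
    return sum(line) & 0xFF

def parity_line(lines):
    """Tier-2: byte-wise XOR parity across the data chips (RAID-5 style)."""
    parity = bytearray(LINE_BYTES)
    for line in lines:
        for i, b in enumerate(line):
            parity[i] ^= b
    return bytes(parity)

def read_line(chips, parity, idx):
    """Common case: read one chip; on checksum mismatch, rebuild from the other 8."""
    line, stored_ck = chips[idx]
    if checksum8(line) == stored_ck:
        return line                      # 1-chip read, error-free
    # Tier-2 recovery: XOR the surviving 7 data lines with the parity line.
    survivors = [l for i, (l, _) in enumerate(chips) if i != idx]
    return parity_line(survivors + [parity])

if __name__ == "__main__":
    data = [bytes([i] * LINE_BYTES) for i in range(NUM_DATA_CHIPS)]
    chips = [(l, checksum8(l)) for l in data]
    par = parity_line(data)
    chips[3] = (bytes(LINE_BYTES), 0xAB)         # simulate a failed chip
    assert read_line(chips, par, 3) == data[3]   # reconstructed via parity
```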


Questions

• What are the common failure modes in DRAM? PCM?
• Do entire chips fail?
• Do parts of chips fail?
  – Which parts? Bitlines? Wordlines? Capacitors?
  – Entire arrays?
  – Entire banks?
  – I/O?
• Should all these failures be handled the same way?


Designing Photonic Interfaces

Ani Udipi


Problem 3 – Memory interconnect

• Electrical interconnects are not scaling well
  – Where can photonics make an impact, both on energy and performance?
• Various levels in the DRAM interconnect
  – Memory cell to sense-amp – addressed by SSA
  – Row buffer to I/O – currently electrical (on-chip)
  – I/O pins to processor – currently electrical (off-chip)
• Photonic interconnects
  – Large static power component – laser/ring tuning
  – Much lower dynamic component – relatively unaffected by distance
• Electrical interconnects
  – Relatively small static component
  – Large dynamic component
• Cannot overprovision photonic bandwidth – use only where necessary


Consideration 1 – How much photonics on a die?


[Figure: electrical energy vs. photonic energy as the amount of photonics on the die is varied]


Consideration 2 - Increasing Capacity

• 3D stacking is imminent
• There will definitely be several dies on the channel
  – Each die has photonic components that are constantly burning static power
  – Need to minimize this!
• TSVs available within a stack; best of both worlds
  – Large bandwidth
  – Low static energy
  – Need to exploit this!


Proposed Design


[Figure: proposed design – the processor's memory controller connects over a photonic waveguide to a DIMM in which the DRAM chips are stacked on top of a photonic interface die that also houses the stack controller]


Proposed Design – Interface Die

• Exploit 3D die stacking to move all photonic components to a separate interface die, shared by several memory dies

– Use photonics where there is heavy utilization – the shared bus between the processor and the interface die, i.e., the off-chip interconnect

– Helps break pin barrier for efficient I/O, substantially improves socket-edge BW

– On-stack, where there is low utilization, use efficient low-swing interconnects and TSVs


Advantages of the proposed system

• Reduction in energy consumption
  – Fewer photonic resources, without loss in performance
  – Rings, couplers, trimming
• Industry considerations
  – Does not affect the design of commodity memory dies
  – The same memory die can be used with both photonic and electrical systems
  – The same interface die can be used with different kinds of memory dies – DRAM, PCM, STT-RAM, memristors


Problem 4 – Communication Protocol

• Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface

– Need to handle heterogeneous memory modules, each with its own maintenance requirements, which further complicates scheduling

– Very little interoperability – affects both consumers (too many choices!) and vendors (stock-keeping and manufacturing)

– Heavy pressure on address/command bus – several commands to micro-manage every operation of the DRAM

– Several independent banks – need to maintain large amounts of state to schedule requests efficiently

– Simultaneous arbitration for multiple resources (address bus, data bank, data bus) to complete a single transaction


Proposed Solution – Packet-based interface

• Release most of the tight control the memory controller holds today
• Move mundane tasks to the memory modules themselves (on the interface die) – make them more autonomous
  – Maintenance operations (refresh, scrub, etc.)
  – Routine operations (DRAM precharge, NVM wear handling)
  – Timing control (DRAM alone has almost 20 different timing constraints to be respected)
  – Coding and any other special requirements
• The only information the memory module needs is the address and a read/write identifier; time slots are reserved a priori for data return (sketched below)
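A rough sketch of what such a packet might carry; all field names and sizes here are assumptions for illustration, not a proposed standard.

```python
# Illustrative sketch of a packet-based memory interface (fields are assumptions).
from dataclasses import dataclass
from enum import Enum

class Op(Enum):
    READ = 0
    WRITE = 1

@dataclass
class MemPacket:
    address: int          # target cache-line address
    op: Op                # read/write identification
    return_slot: int      # data-bus time slot reserved a priori for the reply
    payload: bytes = b""  # write data (empty for reads)

def encode(pkt: MemPacket) -> bytes:
    """Pack the request into a byte stream for the channel (toy 16-byte header)."""
    header = (pkt.address.to_bytes(8, "little") +
              pkt.op.value.to_bytes(1, "little") +
              pkt.return_slot.to_bytes(7, "little"))
    return header + pkt.payload

# The interface die, not the host controller, would handle refresh, precharge,
# timing constraints, and wear management before servicing this packet.
if __name__ == "__main__":
    print(len(encode(MemPacket(0x1F4000, Op.READ, return_slot=12))))
```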


Advantages

• Better interoperability, plug and play
  – As long as the interface die has the necessary information, everything is interchangeable
• Better support for heterogeneous systems
  – Allows easier data movement between, for example, DRAM and NVM on the same channel
• Reduces memory controller complexity
• Allows innovation and value addition in the memory, without being constrained by processor-side support
• Reduces bit-transport energy on the address/command bus


Data Placement with Micro-Pages to Boost Row Buffer Utility

Kshitij Sudan


DRAM Access Inefficiencies
• Overfetch due to large row buffers
  – 8 KB read into the row buffer for a 64-byte cache line
  – Row-buffer utilization for a single request < 1%
• Diminishing locality in multi-cores
  – Increasingly randomized memory access stream
  – Row-buffer hit rates bound to go down
• Open-page policy and FR-FCFS request scheduling
  – Memory controller schedules requests to open row buffers first

Goal: Improve row-buffer hit rates for chip multi-processors


Key Observation
Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row

[Figure: cache block access pattern within OS pages]
For heavily accessed pages in a given time interval, accesses are usually to a few cache blocks


Basic Idea

[Figure: basic idea – 4 KB OS pages in DRAM are divided into 1 KB micro-pages; the hottest micro-pages are gathered into a reserved DRAM region, while the coldest stay in place]


Hardware Implementation (HAM)

[Figure: baseline vs. Hardware Assisted Migration (HAM) – a CPU memory request to physical address X is looked up in a mapping table of (old address → new address) entries; if present, the request is redirected to new address Y inside a 4 MB reserved DRAM region of the 4 GB main memory, otherwise it proceeds to the original page as in the baseline]
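A minimal sketch of the HAM redirection step, assuming a plain dictionary stands in for the hardware mapping table; the 1 KB micro-page size and 4 MB reserved region follow the slides, everything else (addresses, reserved-region base) is illustrative.

```python
# Sketch of Hardware Assisted Migration (HAM) address remapping (illustrative).
MICRO_PAGE = 1024                    # 1 KB micro-pages
RESERVED_BASE = 0xFFC00000           # assumed start of the 4 MB reserved region

class MappingTable:
    def __init__(self):
        self.remap = {}              # old micro-page number -> new micro-page number

    def migrate(self, old_addr: int, slot: int):
        """Copy a hot micro-page into a reserved-region slot and record the mapping."""
        self.remap[old_addr // MICRO_PAGE] = (RESERVED_BASE // MICRO_PAGE) + slot

    def translate(self, phys_addr: int) -> int:
        """Redirect accesses to migrated micro-pages; others pass through unchanged."""
        page, offset = divmod(phys_addr, MICRO_PAGE)
        return self.remap.get(page, page) * MICRO_PAGE + offset

if __name__ == "__main__":
    ham = MappingTable()
    ham.migrate(0x12345400, slot=0)                  # hot micro-page X -> slot 0 (Y)
    print(hex(ham.translate(0x12345410)))            # lands in the reserved region
    print(hex(ham.translate(0x55555000)))            # unmigrated address unchanged
```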


Results

[Figure: percent change in performance for the ROPS, HAM, and ORACLE schemes with a 5M-cycle epoch]

Apart from the average 9% performance gains, our schemes also save DRAM energy at the same time!


Conclusions

• On average, for applications with room for improvement and with our best performing scheme:
  – Average performance ↑ 9% (max. 18%)
  – Average memory energy consumption ↓ 18% (max. 62%)
  – Average row-buffer utilization ↑ 38%
• Hardware-assisted migration offers better returns due to fewer overheads from TLB shoot-downs and misses


Data Placement Across Multiple Memory Controllers

Kshitij Sudan


DRAM NUMA Latency

[Figure: four-socket system – each socket contains four cores and an on-chip memory controller (MC) driving a memory channel to its local DIMMs; sockets are connected by QPI links, so requests to remote DIMMs must cross socket boundaries]


Problem Summary

• Pin limitations → increasing queuing delay
  – Almost 8× increase in queuing delays from a single core/one thread to 16 cores/16 threads
• Multi-cores → increasing row-buffer interference
  – Increasingly randomized memory access stream
• Longer on- and off-chip wire delays → increasing NUMA factor
  – NUMA factor already at 1.5× today

Goal: Improve application performance by reducing queuing delays and NUMA latency


Policies to Manage Data Placement Among MCs

• Adaptive First Touch
  – Assign new virtual pages to a DRAM (physical) page belonging to the MC j that minimizes a cost function:
    cost_j = α × load_j + β × rowhits_j + λ × distance_j
• Dynamic Page Migration
  – Programs change phases → imbalance in MC load
  – Migrate pages between MCs at runtime, choosing the destination MC k by
    cost_k = Λ × distance_k + Γ × rowhits_k
• Integrating Heterogeneous Memory Technologies
    cost_j = α × load_j + β × rowhits_j + λ × distance_j + τ × LatencyDimmCluster_j + µ × Usage_j
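A small sketch of how the Adaptive First Touch cost function could be evaluated to pick an MC for a newly touched page; the weights and their signs are illustrative assumptions, not the values used in the PACT’10 work.

```python
# Minimal sketch of Adaptive First Touch MC selection (weights are assumptions).
ALPHA, BETA, LAM = 1.0, -0.5, 2.0   # load penalty, row-hit benefit (negative), distance penalty

def aft_cost(mc):
    """cost_j = alpha*load_j + beta*rowhits_j + lambda*distance_j"""
    return ALPHA * mc["load"] + BETA * mc["rowhits"] + LAM * mc["distance"]

def place_new_page(mcs):
    """Assign a newly touched virtual page to the MC with the lowest cost."""
    return min(range(len(mcs)), key=lambda j: aft_cost(mcs[j]))

if __name__ == "__main__":
    mcs = [
        {"load": 10, "rowhits": 0.6, "distance": 1},   # local, lightly loaded
        {"load": 40, "rowhits": 0.8, "distance": 1},   # local, congested
        {"load": 5,  "rowhits": 0.3, "distance": 3},   # remote socket
    ]
    print("Chosen MC:", place_new_page(mcs))
```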


Summary

• Multiple on-chip MCs will be common in future CMPs
  – Multiple cores sharing one MC, MCs controlling different types of memories
  – Intelligent data mapping needed
• Adaptive First Touch policy (AFT)
  – Increases performance by 6.5% in a homogeneous hierarchy and by 1.6% in a DRAM–PCM hierarchy
• Dynamic page migration, an improvement on AFT
  – Further improvement over AFT – 8.9% over baseline in the homogeneous case, and 4.4% in the best performing DRAM–PCM hierarchy


Managing Resistance Drift in PCM Cells

Manu Awasthi


Quick Summary

• Multi-level cells in PCM appear imminent
• A number of proposals exist to handle hard errors and lifetime issues of PCM devices
• Resistance drift is a less explored phenomenon
  – Will become increasingly significant as the number of levels per cell increases – the primary cause of “soft errors”
  – Naïve techniques based on DRAM-like refresh will be extremely costly in both latency and energy
  – Need to explore holistic solutions to counter drift


What is Resistance Drift?

[Figure: resistance vs. time for a 2-bit PCM cell – the four states 11, 10, 01, 00 span the range from crystalline (lowest resistance) to amorphous (highest); a cell programmed at time T0 drifts upward in resistance until, by time Tn, it crosses into the neighboring state's band and is read in error]


Resistance Drift Data

Cell Type           | Drift Time at Room Temperature (secs)
Median 11 cell      | 10^499
Worst Case 11 cell  | 10^15
Median 10 cell      | 10^24
Worst Case 10 cell  | 5.94
Median 01 cell      | 10^8
Worst Case 01 cell  | 1.81

(States span 11, 10, 01, 00 from lowest to highest resistance.)


Resistance Drift - Issues

• Programmed resistance drifts according to a power-law equation: R_drift(t) = R_0 × t^α
• R_0 and α usually follow a Gaussian distribution
• Time to drift (error) depends on
  – the programmed resistance (R_0), and
  – the drift coefficient (α)
  – and is highly unpredictable!
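A small numerical sketch of the power-law model, solving for the time at which a cell's resistance crosses into the next state's band; all parameter values here are made-up assumptions, not measured PCM numbers.

```python
# Power-law drift model: R_drift(t) = R0 * t**alpha (t in seconds, t >= 1).
# All numbers below are illustrative assumptions.

def time_to_error(r0, alpha, r_threshold):
    """Solve R0 * t**alpha >= r_threshold for t: the time until the cell's
    resistance drifts into the next state's band and is misread."""
    if alpha <= 0:
        return float("inf")          # a non-drifting cell never crosses the band
    return (r_threshold / r0) ** (1.0 / alpha)

if __name__ == "__main__":
    # A "median" cell (typical R0, small alpha) vs. a "worst-case" cell (high R0, high alpha).
    print(time_to_error(r0=1e4, alpha=0.02, r_threshold=1e5))   # astronomically long drift time
    print(time_to_error(r0=5e4, alpha=0.10, r_threshold=1e5))   # drifts to an error quickly
```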


Resistance Drift - How it happens

[Figure: distribution of cells across the resistance bands of states 11, 10, 01, 00, showing a median-case cell drifting within its band and a worst-case cell drifting across the band boundary into an error]

• Median-case cell: typical R_0, typical α – drift stays within the state's band
• Worst-case cell: high R_0, high α – drift crosses into the next band, causing an error

The scrub rate will be dictated by the worst-case R_0 and worst-case α
Naïve refresh/scrub will be extremely costly!


Architectural Solutions – Headroom
• Assumes support for Light Array Reads for Drift Detection (LARDDs) & ECC-N
• Headroom-h scheme – scrub is triggered if N−h errors are detected (sketched below)
  + Decreases the probability of errors slipping through
  – Increases the frequency of full scrubs and hence decreases lifetime
  – Gradual Headroom scheme: start with a large LARDD frequency, increase the frequency as errors increase

[Flowchart: periodically read the line and check for errors; if the error count is below N−h, re-check after N cycles, otherwise scrub the line]
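A minimal sketch of the Headroom-h policy built on LARDD checks; the values of N, h, and the drift model here are illustrative assumptions.

```python
# Sketch of the Headroom-h scrub policy on top of LARDD + ECC-N (illustrative).
import random

ECC_N = 8          # ECC can correct up to N errors per line (assumed)
HEADROOM_H = 2     # scrub once N - h errors are observed

def lardd_check(line_errors: int) -> bool:
    """A light array read reports the current number of drifted (erroneous) cells."""
    return line_errors >= ECC_N - HEADROOM_H

def monitor_line(drift_events):
    """Run periodic LARDD checks; scrub (rewrite) the line before errors exceed ECC reach."""
    errors = 0
    for interval, new_errors in enumerate(drift_events):
        errors += new_errors
        if lardd_check(errors):
            print(f"interval {interval}: {errors} errors -> scrub line")
            errors = 0               # scrub restores all cells to their programmed levels

if __name__ == "__main__":
    random.seed(0)
    monitor_line([random.choice([0, 0, 1, 2]) for _ in range(20)])
```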


Reducing Overheads with Circuit Level Solution

• Invoking ECC on every LARDD increases energy consumption
• A parity-like error detection circuit is used to signal the need for a full-fledged ECC error detect (sketched below)
  – The number of drift-prone states in each line is counted when the line is written into memory (a single bit represents odd/even)
  – At every LARDD, the parity is verified
• Reduces the need for an ECC read-compare at every LARDD cycle
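A sketch of the parity check, assuming the intermediate states (01, 10) are the drift-prone ones counted at write time; which states count as drift-prone is an assumption for illustration.

```python
# Sketch of the drift-parity idea (illustrative): count drift-prone states at write
# time, keep one odd/even bit, and flag a line only if that parity changes on a LARDD.
DRIFT_PRONE = {0b10, 0b01}    # assumed drift-prone states

def drift_parity(cells):
    """Parity (odd/even) of the number of drift-prone cells in the line."""
    return sum(1 for c in cells if c in DRIFT_PRONE) & 1

def lardd_needs_ecc(cells_now, stored_parity):
    """A full ECC check is invoked only when the lightweight parity no longer matches."""
    return drift_parity(cells_now) != stored_parity

if __name__ == "__main__":
    written = [0b11, 0b10, 0b01, 0b00, 0b10]
    parity_at_write = drift_parity(written)          # stored alongside the line
    drifted = [0b11, 0b10, 0b00, 0b00, 0b10]         # one 01 cell drifted into 00
    print(lardd_needs_ecc(drifted, parity_at_write)) # True -> run the ECC check
```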


More Solutions

• Precise Writes
  – More write iterations to program the state closer to the mean, reducing the chance of drift
  – Increases energy consumption and write time, and decreases lifetime!
• Non-Uniform Guardbanding
  – Baseline: the resistance range is equally distributed among all n states
  – Expand the resistance range for drift-prone states at the expense of non-drift-prone ones


Results

[Figure: number of errors vs. LARDD interval (seconds)]


Conclusions

• Resistance drift will worsen with MLC scaling
• Naïve solutions based on ECC support are costly for PCM
  – Increased write energy, decreased lifetimes
• Holistic solutions need to be explored to counter drift at the device, architectural, and system levels
  – 39% reduction in energy, 4× fewer errors, 102× increase in lifetime


Handling Read/Write Parallelism

Nil Chatterjee


The Problem
• Writes are not on the critical path for program execution, but they can slow down reads through resource contention
• In future chipkill-correct systems, each data write will necessitate an update of the ECC codes, and the impact of writes will be more evident
• In PCM, the problem is exacerbated by the significantly longer write times
• Abstracting the writes away improves read latency by 48% in non-ECC DRAM systems


Impact of Writes on Reads
• Write draining affects read latencies by
  – Increasing the queuing delay
  – Reducing the read stream's row-buffer locality


Bank Contention from Writes
• Reads are not scheduled in the middle of the write-queue (WQ) drain because that would require multiple bus turnarounds, incurring tWRT and tOST delays
• Underutilization of data bus bandwidth during WQ draining leads to performance loss
• However, opportunities to schedule read accesses to idle banks might exist in this interval


Solution: Increasing R/W Overlap
• During a WQ drain cycle, schedule partial reads to idle banks (sketched below)
  – Following a column read command, the data is fetched from the sense amplifiers into a small buffer (64 bytes) near the I/O pads
  – Data is streamed out only after the WQ reaches the low watermark – no turnaround delays
• Immediately following the WQ drain, a flurry of prefetched reads can occupy the data bus
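A toy scheduler sketch of the idea – issue writes from the WQ while opportunistically latching partial reads from idle banks into small buffers, then stream the buffered data after the drain; the queue structures and buffer count are assumptions.

```python
# Sketch of "partial reads during write-queue drain" scheduling (illustrative).
# Banks busy with write drains are skipped; idle banks get partial reads that latch
# data into per-chip buffers, and the buffered data is streamed after the drain.

def schedule_drain_cycle(write_queue, read_queue, busy_banks, partial_read_buffers):
    """One scheduling step during a WQ drain: issue a write, and opportunistically
    issue partial reads to banks the writes are not using (no bus turnaround yet)."""
    issued = []
    if write_queue:
        wr = write_queue.pop(0)
        busy_banks.add(wr["bank"])
        issued.append(("WRITE", wr))
    for rd in list(read_queue):
        if rd["bank"] not in busy_banks and len(partial_read_buffers) < 4:
            read_queue.remove(rd)
            partial_read_buffers.append(rd)      # data parked near the I/O pads
            issued.append(("PARTIAL_READ", rd))
    return issued

if __name__ == "__main__":
    wq = [{"bank": 0}, {"bank": 1}]
    rq = [{"bank": 2}, {"bank": 0}, {"bank": 3}]
    bufs, busy = [], set()
    while wq:
        print(schedule_drain_cycle(wq, rq, busy, bufs))
    print("stream after drain:", bufs)           # burst these once the bus turns around
```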


Impact
• A small pool of partial-read registers can help increase data bus utilization after writes
• In PCM systems, where writes are very expensive, partial reads can have an even higher impact
• The JEDEC standard must be augmented to support a partial read command


Organizing Channels for High Capacity

Kshitij Sudan


Increasing DRAM Capacity by Re-Architecting the Memory Channel
• Increase DRAM capacity while minimizing power
• Re-architect the CPU-to-DRAM channel
  – Study effects of bus width and protocol (serial vs. parallel)
• CMPs might have changed the playing field!


Increasing DRAM Capacity by Re-Architecting Memory Channel

• Organize modules as a binary tree, and move some MC functionality to a “buffer chip”
• Reduces module depth from O(n) to O(log n)
• Reduces worst-case latency and improves signal integrity
• The buffer chip manages low-level DRAM operations and channel arbitration
• Not limited by worst-case access latency like FB-DIMM
• NUMA-like DRAM access – leverage data mapping
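A back-of-the-envelope sketch contrasting the O(n) depth of a daisy-chained channel with the O(log n) depth of the proposed binary tree of buffer chips; the topology details are assumptions.

```python
# Sketch comparing worst-case hop count for a daisy-chained channel (FB-DIMM style)
# vs. a binary tree of buffer chips (illustrative).
import math

def daisy_chain_hops(num_modules: int) -> int:
    return num_modules                              # O(n): the last module sits n hops away

def binary_tree_hops(num_modules: int) -> int:
    return math.ceil(math.log2(num_modules + 1))    # O(log n) depth of the tree

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, daisy_chain_hops(n), binary_tree_hops(n))
```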


Handling DMA Scheduling

Kshitij Sudan


Handling DMA Scheduling
• Reduce conflicts between CPU-generated DRAM requests and DMA-generated DRAM requests


Handling DMA Scheduling
• Study interference from DMA requests on CPU-generated DRAM requests
  – With on-chip MCs, it is unclear how DMA requests compete with CPU requests
• Devise scheduling policies to minimize DMA and CPU access conflicts
• Infer how DMA and CPU requests are arbitrated at the MC
  – No CPU manufacturer documentation is available publicly!


Variable Rank Subsetting

Seth Pugsley


Motivation for Rank Subsetting

• Rank Subsetting
  – Split up a rank + data channel into multiple, smaller ranks + data channels
• Prior motivations: reduce dynamic energy and overfetch


Rank Size Options

Rank width           | Data buses | Banks | Row buffers | 64-byte line transfer | Parallelism
Standard 8-chip rank | 1×64-bit   | 2     | 1×8 KB      | 8 clock edges         | all transfers sequential
4-chip narrow rank   | 2×32-bit   | 4     | 2×4 KB      | 16 clock edges        | 2 cache lines in parallel
2-chip narrow rank   | 4×16-bit   | 8     | 4×2 KB      | 32 clock edges        | 4 cache lines in parallel
1-chip narrow rank   | 8×8-bit    | 16    | 8×1 KB      | 64 clock edges        | 8 cache lines in parallel
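The table's entries follow directly from splitting a 64-bit channel; a small sketch that derives them (the 2 banks and 8 KB row buffer per standard rank come from the table above, the rest is arithmetic).

```python
# Sketch of how the rank-subsetting configurations in the table are derived
# (assumes a 64-bit channel, 2 banks and an 8 KB row buffer per 8-chip rank).
CHANNEL_BITS = 64
BASE_BANKS = 2
BASE_ROW_BYTES = 8 * 1024
LINE_BYTES = 64

def rank_subset_config(chips_per_rank: int):
    subsets = 8 // chips_per_rank                  # independent narrow ranks on the channel
    bus_bits = CHANNEL_BITS // subsets             # width of each narrow data bus
    banks = BASE_BANKS * subsets                   # total banks visible to the controller
    row_bytes = BASE_ROW_BYTES // subsets          # each row buffer is narrower
    clock_edges = LINE_BYTES * 8 // bus_bits       # edges to move one 64-byte line
    return dict(buses=f"{subsets}x{bus_bits}-bit", banks=banks,
                row_buffer_kb=row_bytes // 1024, clock_edges=clock_edges,
                parallel_lines=subsets)

if __name__ == "__main__":
    for chips in (8, 4, 2, 1):
        print(chips, rank_subset_config(chips))
```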


Impact on Queuing Delay

[Figure: timing diagrams – with a single bank, each core access occupies the bank for 16 cycles but the data bus (DB) for only 4 cycles, giving a data bus utilization of 25%; with two banks, accesses from two cores overlap and data bus utilization rises to 50%]


Advantages of Rank Subsetting

• More open rows
  – Each open row is narrower (still OK hit rates)
• Reduced queuing delay
  – More banks available and better data bus utilization


Performance for Static Rank Subsetting

[Figure: performance results across static rank-subsetting configurations]

Variable Rank Subsetting

• Use a different size rank for each memory op
  – e.g., a 1-wide transaction on the data bus at the same time as 2-wide and 4-wide transactions
  – Scheduling can get pretty hairy
  – Many wasted data bus slots

[Figure: data bus lanes D0–D7 over time, with 1-wide, 2-wide, 4-wide, and 8-wide transactions packed together and many wasted slots]


More Sensible Variable Rank Subsetting

• Still can use a different size rank for each memory op
• Limit rank size to only 2 options
  – Software chooses the mode for newly allocated pages
  – Scheduling is much easier than in the previous example

[Figure: data bus lanes D0–D7 over time with only 4-wide and 8-wide transactions, leaving far fewer wasted slots]


Exploiting Rank Subsetting to Alleviate Thermal Constraints

Manju Shevgoor


The Problem – DRAM Is Getting Hot
• DRAM temperatures can rise up to 95 °C
• Refresh rate needs to double once DRAM crosses 85 °C
• Thermal emergencies due to elevated temperatures adversely affect performance
• Cooling systems are expensive

[Figures: full-DIMM heat spreader (Zhu et al., ITHERM'08); typical cooling system (Liu et al., HPCA'11)]


Current Thermal Throttling Techniques

• CPU throttling – reduces overall activity
• Thermal shutdown – stops all requests to over-heated chips
• Memory bandwidth throttling – lowers channel bandwidth to reduce DRAM activity

• All DRAM chips are affected by these techniques, irrespective of their temperature
• Even cool chips, which could otherwise be operating at optimal throughput, are also throttled


Refresh Overhead

[Figure: refresh overhead data from Elastic Refresh, Stuecheli et al., MICRO'10]
• As memory chips get denser, this problem only worsens
• Integer workloads can have up to 13% IPC degradation because of refresh
• Chips working in the extended temperature range will cause even larger IPC degradation


Temperature Profile along a DIMM

• Proximity to the hot processor results in unequal temperatures across the DIMM

• Position with respect to airflow also impacts the temperature

• Temperature difference between the hottest and coolest chips can be 10°C

[Figures: typical temperature profile along the RDIMM (Zhu et al., ITHERM'08); typical cooling system (Liu et al., HPCA'11)]


Baseline

• All chips are grouped into one rank
• Not all chips are 'HOT'
• Not all chips need to be throttled!

[Figure: baseline rank organization – all DRAM chips on the DIMM, behind the buffer, form a single rank (Rank 1)]


Proposed Solution

[Figure: proposed rank organization – the DIMM is statically split into multiple ranks based on temperature, from Rank 1 (the coolest rank) to Rank 3 (the warmest rank)]
• Statically split the DIMM into multiple ranks based on temperature
• Not all ranks are equally hot, so penalize only the hottest ranks
• Control the refresh rate at rank granularity
  – Only the hottest chips are refreshed every 32 ms; the rest can be refreshed every 64 ms


Fine-Grained DRAM Throttling

• Need a throttling mechanism that can be applied at a finer granularity
• Temperature-aware cache replacement (see the sketch below)
  – Modify LRU to preferentially evict lines belonging to cool ranks
  – Will reduce activity only in hot ranks

[Figure: cache sets ordered MRU to LRU, with each line tagged by its home rank (R1–R4); eviction preferentially targets lines from cool ranks so that activity decreases ONLY in hot ranks]
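A minimal sketch of the temperature-aware victim selection, assuming the controller exposes which ranks are currently hot and that the replacement logic may scan a few positions from the LRU end; both are illustrative assumptions.

```python
# Sketch of temperature-aware cache replacement (illustrative): among the least
# recently used lines, prefer to evict one whose home rank is cool, so hot ranks
# receive fewer subsequent fills.
HOT_RANKS = {3}                 # assumed: rank 3 is currently the hottest

def choose_victim(lru_ordered_lines, search_depth=4):
    """lru_ordered_lines: list of (tag, rank), index 0 = LRU end of the set.
    Scan a few positions from the LRU end and evict a cool-rank line if one exists;
    otherwise fall back to plain LRU."""
    for tag, rank in lru_ordered_lines[:search_depth]:
        if rank not in HOT_RANKS:
            return tag
    return lru_ordered_lines[0][0]   # plain LRU fallback

if __name__ == "__main__":
    cache_set = [("A", 3), ("B", 1), ("C", 3), ("D", 2)]   # LRU ... MRU
    print(choose_victim(cache_set))   # evicts "B" (cool rank 1) instead of LRU "A" (hot rank 3)
```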


Rank-wise Refresh

[Figure: DIMM split behind the buffer into Rank 1 (coolest) through Rank 3 (warmest); ranks in the extended temperature range and the normal temperature range are refreshed at different rates]
• Refresh only as fast as needed
• Only ranks operating in the extended temperature range are refreshed every 32 ms
• Ranks operating in the normal temperature range are refreshed every 64 ms
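A tiny sketch of the per-rank refresh decision, using the 85 °C extended-temperature threshold from the earlier slide; the per-rank temperatures are assumed inputs from a thermal model or sensors.

```python
# Sketch of rank-wise refresh (illustrative): pick each rank's refresh interval
# from its own temperature instead of refreshing the whole DIMM at the worst-case rate.
EXTENDED_TEMP_C = 85.0     # above this, DRAM needs 2x refresh (32 ms instead of 64 ms)

def refresh_interval_ms(rank_temp_c: float) -> int:
    return 32 if rank_temp_c > EXTENDED_TEMP_C else 64

if __name__ == "__main__":
    rank_temps = {1: 70.0, 2: 78.0, 3: 91.0, 4: 83.0}   # per-rank temperatures (assumed)
    for rank, t in rank_temps.items():
        print(f"Rank {rank}: {t} C -> refresh every {refresh_interval_ms(t)} ms")
```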


Summary

• Split the DIMM into mini-ranks
• Model the temperature of the chips
• Throttle the activity of hot ranks – keeps the chips from reaching high temperatures
• Increase the refresh rate of the hot rank only – maintains data integrity of chips once they get hot
• Penalize hot ranks ONLY!!


Summary

• Converging technology trends require an overhaul of main memory architectures

• Multi-pronged approach required for significant improvements: memory chip, controller, interface, OS

• Future memory chips must also optimize for energy and reliability, and not just latency and density

• Publications: http://www.cs.utah.edu/arch-research/


Acknowledgments

• Collaborators at HP Labs, IBM, Intel

• Funding from NSF, Intel, HP, University of Utah

• Thanks for hosting!