CSC 2224: Parallel Computer Architecture and Programming ...

78
CSC 2224: Parallel Computer Architecture and Programming Main Memory Fundamentals Prof. Gennady Pekhimenko University of Toronto Fall 2020 The content of this lecture is adapted from the slides of Vivek Seshadri, Donghyuk Lee, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU

Transcript of CSC 2224: Parallel Computer Architecture and Programming ...

Page 1: CSC 2224: Parallel Computer Architecture and Programming ...

CSC 2224: Parallel Computer Architecture and ProgrammingMain Memory Fundamentals

Prof. Gennady Pekhimenko

University of Toronto

Fall 2020

The content of this lecture is adapted from the slides of Vivek Seshadri, Donghyuk Lee, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU

Page 2: CSC 2224: Parallel Computer Architecture and Programming ...

Review #4

• RowClone: Fast and Energy-Efficient

In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri et al., MICRO 2013

2

Page 3: CSC 2224: Parallel Computer Architecture and Programming ...

Why Is Memory So Important?

(Especially Today)

Page 4: CSC 2224: Parallel Computer Architecture and Programming ...

The Performance Perspective• “It’s the Memory, Stupid!” (Richard Sites, MPR, 1996)

Mutlu+, “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors,” HPCA 2003.

Page 5: CSC 2224: Parallel Computer Architecture and Programming ...

The Energy Perspective

5

Dally, HiPEAC 2015

Page 6: CSC 2224: Parallel Computer Architecture and Programming ...

The Energy Perspective

6

Dally, HiPEAC 2015

A memory access consumes ~1000X the energy of a complex addition

Page 7: CSC 2224: Parallel Computer Architecture and Programming ...

The Reliability Perspective• Data from all of Facebook’s servers worldwide• Meza+, “Revisiting Memory Errors in Large-Scale Production Data Centers,” DSN’15.

7

Page 8: CSC 2224: Parallel Computer Architecture and Programming ...

The Security Perspective

8

Page 9: CSC 2224: Parallel Computer Architecture and Programming ...

Why Is DRAM So Slow?

Page 10: CSC 2224: Parallel Computer Architecture and Programming ...

Motivation (by Vivek Seshadri)

Conversation with a friend from Stanford

Why is DRAM so slow?!

Really?

50 nanoseconds to serve one request? Is that a fundamental limit to DRAM’s access latency?

Him Vivek

10

Page 11: CSC 2224: Parallel Computer Architecture and Programming ...

Understanding DRAM

What is in here?

11

DRAM

Problems related to today’s DRAM design

Solutions proposed by our research

Page 12: CSC 2224: Parallel Computer Architecture and Programming ...

Outline

1. What is DRAM?

2. DRAM Internal Organization

3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013,

Adaptive-Latency DRAM, HPCA 2015)

– Parallelism (Subarray-level Parallelism, ISCA 2012)

12

Page 13: CSC 2224: Parallel Computer Architecture and Programming ...

What is DRAM?

DRAM – Dynamic Random Access Memory

3455434

Array of Values

43543

98734

0

847

42

873909

1729

WRITE address, value

READ address

Accessing any location takes the same amount of time

Data needs to be constantly refreshed

13

Page 14: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM in Today’s Systems

14

Why DRAM? Why not some other memory?

Page 15: CSC 2224: Parallel Computer Architecture and Programming ...

Von Neumann Model

Processor Memory

Program Data

T1 = Read Data[1]

T2 = Read Data[2]

T3 = T1 + T2

Write Data[3], T3

3

4

8

2

T1 = Read Data[1]

44 instruction accesses +3 data accesses

Memory performance is important15

Page 16: CSC 2224: Parallel Computer Architecture and Programming ...

Factors that Affect Choice of Memory

1. Speed

– Should be reasonably fast compared to processor

2. Capacity

– Should be large enough to fit programs and data

3. Cost

– Should be cheap

16

Page 17: CSC 2224: Parallel Computer Architecture and Programming ...

Why DRAM?

Access Latency

Co

stFlip-flops

SRAM

DRAM

FlashDisk

Higher Cost

Higher access latency

Favorable point in the trade-off spectrum

17

Page 18: CSC 2224: Parallel Computer Architecture and Programming ...

Is DRAM Fast Enough?

Processor Commodity DRAM

Request

3 GHz, 2 Instructions / cycle

300 Instructions

300 Instructions

300 Instructions

300 Instructions

Independent programs

Request

Request

Request

Served in parallel?

Core Core

Core Core

18

Latency

Parallelism

50ns

Page 19: CSC 2224: Parallel Computer Architecture and Programming ...

Outline

1. What is DRAM?

2. DRAM Internal Organization

3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013,

Adaptive-Latency DRAM, HPCA 2015)

– Parallelism (Subarray-level Parallelism, ISCA 2012)

19

Page 20: CSC 2224: Parallel Computer Architecture and Programming ...

processor

main memory

peripheral logic

bank

mat

high latency

DRAM Organization

Page 21: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Cell Array: Mat

peripheral logic

cell

matmat

sense amplifierw

ord

line

driver

cell

wordline

bitline

Page 22: CSC 2224: Parallel Computer Architecture and Programming ...

Memory Element (Cell)

0 1

Component that can be in at least two states

Can be electrical, magnetic, mechanical, etc.

DRAM → Capacitor22

Page 23: CSC 2224: Parallel Computer Architecture and Programming ...

Capacitor – Bucket of Electric Charge

Charge

“Charged” “Discharged”

0 0

Vmax

0Voltage level indicator. Not part of the actual circuit

23

Page 24: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Chip

Contains billions of capacitors (also called cells)

0

?

24

DRAM Chip

Page 25: CSC 2224: Parallel Computer Architecture and Programming ...

Divide and Conquer

DRAM Bank

?

25

1. Weak

2. Reading destroys state

Challenges

Page 26: CSC 2224: Parallel Computer Architecture and Programming ...

Sense-Amplifier

Vmax

0

Vx

Vy

Vx > Vy

Vx

Vy

Vx+δ

Vx+δ

Vy-δ

Amplify the difference Stable state

26

Page 27: CSC 2224: Parallel Computer Architecture and Programming ...

Outline

1. What is DRAM?

2. DRAM Internal Organization

– DRAM Cell

– DRAM Array

– DRAM Bank

3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013)

– Parallelism (Subarray-level Parallelism, ISCA 2012)27

Page 28: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Cell Read Operation

Vmax/2

Vmax/2

0

Vmax/2 + δ

0

VmaxVmaxVmax/2 + δ

DRAMCell

Sense-Amplifier

Cell loses charge

Amplify the differenceRestore

Cell Data

28

Access data

12

Page 29: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Cell Read Operation

0

DRAMCell

Sense-Amplifier

Control Signal

DRAMCell

Address

29

Page 30: CSC 2224: Parallel Computer Architecture and Programming ...

Outline

1. What is DRAM?

2. DRAM Internal Organization– DRAM Cell

– DRAM Array

– DRAM Bank

3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013;

Adaptive-Latency DRAM, HPCA 2015)

– Parallelism (Subarray-level Parallelism, ISCA 2012)30

Page 31: CSC 2224: Parallel Computer Architecture and Programming ...

Problem

Sense-Amplifier

Control Signal

DRAMCell

Address

Much larger than a cell

31

Page 32: CSC 2224: Parallel Computer Architecture and Programming ...

Cost AmortizationBitline

Wordline

32

Page 33: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Array Operation

1 0 0 1 1 0 1 0 0 1 0 1

½ ½ ½ ½ ½ ½ ½ ½ ½ ½ ½ ½

Ro

w D

eco

der

1

Ro

w A

dd

ress

? ? ? ? ? ? ? ? ? ? ??

½↑

½↑

½↑

½↑

½↑

½½↓

½↓

½↓

½↓

½↓

½↓

m

1 0 0 1 1 0 1 0 0 1 0 1

Sensing & Amplification

Charge RestorationWordlineDisableCharge SharingWordline Enable

Access Data (column n)

Restore Sense-Amplifier

33

m

n

Sense-Amplifiers

Page 34: CSC 2224: Parallel Computer Architecture and Programming ...

Outline

1. What is DRAM?

2. DRAM Internal Organization– DRAM Cell

– DRAM Array

– DRAM Bank

3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013

Adaptive-Latency DRAM, HPCA 2015)

– Parallelism (Subarray-level Parallelism, ISCA 2012)34

Page 35: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Bank

35

How to build a DRAM bank from a DRAM array?

Page 36: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Bank: Single DRAM Array?

36

10

00

0+

row

s

Ro

w D

eco

der

1 0 0 1 1 0 0 1

Long bitlineDifficult to sense data

in far away cells

Ro

w A

dd

ress

Page 37: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Bank: Collection of Arrays

37

Ro

w D

eco

der

1 0 0 1 1 0 0 1

Ro

w D

eco

der

Row Address Column Read/Write

Data

Subarray

Page 38: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Operation: Summary

38

Ro

w D

eco

der

1 0 0 1 1 0 0 1

Ro

w D

eco

der

Row Address Column Read/Write

Row m, Col n

m

1. Enable row m

2. Access col n

3. Close row

n

m

Page 39: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Chip Hierarchy

39

Ro

w D

eco

der

1 0 0 1 1 0 0 1

Ro

w D

eco

der

Row Address Column Read/Write

Collection of Banks

Collection of Subarrays

Page 40: CSC 2224: Parallel Computer Architecture and Programming ...

Outline

1. What is DRAM?

2. DRAM Internal Organization

3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013;

Adaptive-Latency DRAM, HPCA 2015)

– Parallelism (Subarray-level Parallelism, ISCA 2012)

40

Page 41: CSC 2224: Parallel Computer Architecture and Programming ...

Factors That Affect Performance

1. Latency

– How fast can DRAM serve a request?

2. Parallelism

– How many requests can DRAM serve in parallel?

41

Page 42: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Chip Hierarchy

42

Ro

w D

eco

der

1 0 0 1 1 0 0 1

Ro

w D

eco

der

Row Address Column Read/Write

Collection of Banks

Collection of Subarrays

Latency

Parallelism

Page 43: CSC 2224: Parallel Computer Architecture and Programming ...

Outline

1. What is DRAM?

2. DRAM Internal Organization

3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013;

Adaptive-Latency DRAM, HPCA 2015)

– Parallelism (Subarray-level Parallelism, ISCA 2012)

43

Page 44: CSC 2224: Parallel Computer Architecture and Programming ...

Subarray Size: Rows/Subarray

44

Number of rows in subarray

Latency Chip Area

Page 45: CSC 2224: Parallel Computer Architecture and Programming ...

Subarray Size vs. Access Latency

45

Smaller subarrays => lower access latency

Shorter Bitlines => Faster access

Page 46: CSC 2224: Parallel Computer Architecture and Programming ...

Subarray Size vs. Chip Area

Large Subarray Smaller Subarrays

46

AreaCost

Smaller subarrays => larger chip area

Page 47: CSC 2224: Parallel Computer Architecture and Programming ...

Chip Area vs. Access Latency

0

1

2

3

4

0 10 20 30 40 50 60

No

rmal

ize

d C

hip

Are

a

Access Latency (ns)

Commodity DRAM(512 rows/subarray)

32 rows/subarray

47

Why is DRAM so slow?

Page 48: CSC 2224: Parallel Computer Architecture and Programming ...

Chip Area vs. Access Latency

0

1

2

3

4

0 10 20 30 40 50 60

No

rmal

ize

d C

hip

Are

a

Access Latency (ns)

Commodity DRAM(512 rows/subarray)

32 rows/subarray

?

How to enable low latency without high area overhead?

48

Page 49: CSC 2224: Parallel Computer Architecture and Programming ...

New Proposal

Large Subarray Small SubarrayOur Proposal49

Low area cost

Page 50: CSC 2224: Parallel Computer Architecture and Programming ...

Tiered-Latency DRAM

Near Segment

Far Segment

+ Lower access latency+ Lower energy/access

- Higher access latency- Higher energy/access

Map frequently accessed data to near segment

50

Page 51: CSC 2224: Parallel Computer Architecture and Programming ...

Results Summary

51

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Performance

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Energy Consumption

12%23%

Commodity DRAM Tiered-Latency DRAM

Page 52: CSC 2224: Parallel Computer Architecture and Programming ...

Tiered-Latency DRAM:A Low Latency and Low Cost DRAM

Architecture

Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

Published in the proceedings of 19th IEEE International Symposium on

High Performance Computer Architecture 2013

52

Page 53: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Stores Data as Charge

1. Sensing2. Restore3. Precharge

DRAM cell

Sense amplifier

Three steps of charge movement

Page 54: CSC 2224: Parallel Computer Architecture and Programming ...

Sensing RestoreTiming Parameters

Data 0

Data 1

cell

time

char

ge

Sense amplifier

DRAM Charge over Time

Why does DRAM need the extra timing margin?

In theory margin

cell

Sense amplifier

In practice

Page 55: CSC 2224: Parallel Computer Architecture and Programming ...

1. Process Variation – DRAM cells are not equal

– Leads to extra timing margin for cell that can store small amount of charge

2. Temperature Dependence – DRAM leaks more charge at higher temperature

– Leads to extra timing margin when operating at low temperature

Two Reasons for Timing Margin

1. Process Variation – DRAM cells are not equal

– Leads to extra timing margin for cell that can store small amount of charge;

1. Process Variation – DRAM cells are not equal

– Leads to extra timing margin for cells that can store large amount of charge

Page 56: CSC 2224: Parallel Computer Architecture and Programming ...

Same size →Same charge →

Different size →Different charge →

Same latency Different latency

Large variation in cell size →Large variation in charge →Large variation in access latency

DRAM Cells are Not EqualRealIdeal

Largest cell

Smallest cell

Page 57: CSC 2224: Parallel Computer Architecture and Programming ...

1. Process Variation – DRAM cells are not equal

– Leads to extra timing margin for cells that can store large amount of charge

2. Temperature Dependence – DRAM leaks more charge at higher temperature

– Leads to extra timing margin when operating at low temperature

Two Reasons for Timing Margin

2. Temperature Dependence – DRAM leaks more charge at higher temperature

– Leads to extra timing margin when operating at low temperature

2. Temperature Dependence – DRAM leaks more charge at higher temperature

– Leads to extra timing margin when operating at low temperature

Page 58: CSC 2224: Parallel Computer Architecture and Programming ...

Cells store small charge at high temperature and large charge at low temperature → Large variation in access latency

Charge Leakage ∝ Temperature

Room Temp. Hot Temp. (85°C)

Small leakage Large leakage

Page 59: CSC 2224: Parallel Computer Architecture and Programming ...

DRAM Timing Parameters

• DRAM timing parameters are dictated by the worst case

– The smallest cell with the smallest charge in all DRAM products

– Operating at the highest temperature

• Large timing margin for the common case

→ Can lower latency for the common case

Page 60: CSC 2224: Parallel Computer Architecture and Programming ...

TemperatureController

PC

HeaterFPGAs FPGAs

DRAM Testing Infrastructure

Page 61: CSC 2224: Parallel Computer Architecture and Programming ...

Typical DIMM at Low Temperature

Obs 1. Faster Sensing

More charge

Strong chargeflow

Faster sensing

Typical DIMM at Low Temperature→More charge → Faster sensing

Timing(tRCD)

17% ↓No Errors

115 DIMM characterization

Page 62: CSC 2224: Parallel Computer Architecture and Programming ...

Obs 2. Reducing Restore Time

Larger cell & Less leakage ➔Extra charge

No need to fullyrestore charge

Typical DIMM at lower temperature→More charge → Restore time reduction

Read (tRAS)

37% ↓Write (tWR)

54% ↓No Errors

115 DIMM characterization

Typical DIMM at Low Temperature

Page 63: CSC 2224: Parallel Computer Architecture and Programming ...

Empty (0V)

Full (Vdd)

Half

Obs 3. Reducing Precharge Time

Bit

line

Sense amplifier

Sensing Precharge

Precharge ? – Setting bitline to half-full charge

Typical DIMM at Low Temperature

Page 64: CSC 2224: Parallel Computer Architecture and Programming ...

Empty (0V) Full (Vdd)

Half

bitline

Not fully precharged

More charge→ strong sensing

Access empty cell Access full cell

Timing(tRP)

35% ↓No Errors

115 DIMM characterization

Typical DIMM at Lower Temperature→More charge → Precharge time reduction

Obs 3. Reducing Precharge Time

Page 65: CSC 2224: Parallel Computer Architecture and Programming ...

Adaptive-Latency DRAM

• Key idea– Optimize DRAM timing parameters online

• Two components– DRAM manufacturer profiles multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM– System monitors DRAM temperature & uses appropriate

DRAM timing parameters

reliable DRAM timing parameters

DRAM temperature

Page 66: CSC 2224: Parallel Computer Architecture and Programming ...

0%5%

10%15%20%25%

sop

lex

mcf

milc

libq

lbm

gem

s

cop

y

s.cl

ust

er

gup

s

no

n-i

nte

nsi

ve

inte

nsi

ve

all-

wo

rklo

ads

Single Core Multi Core

0%5%

10%15%20%25%

sop

lex

mcf

milc

libq

lbm

gem

s

cop

y

s.cl

ust

er

gup

s

no

n-i

nte

nsi

ve

inte

nsi

ve

all-

wo

rklo

ads

Single Core Multi Core

0%5%

10%15%20%25%

sop

lex

mcf

milc

libq

lbm

gem

s

cop

y

s.cl

ust

er

gup

s

no

n-i

nte

nsi

ve

inte

nsi

ve

all-

wo

rklo

ads

Single Core Multi Core

14.0%

2.9%0%5%

10%15%20%25%

sop

lex

mcf

milc

libq

lbm

gem

s

cop

y

s.cl

ust

er

gup

s

no

n-i

nte

nsi

ve

inte

nsi

ve

all-

wo

rklo

ads

Single Core Multi-Core

10.4%

Real System Evaluation

AL-DRAM provides high performance improvement, greater for multi-core workloads

Perf

orm

ance

Imp

rove

me

nt Average

Improvement

all-

35

-wo

rklo

ad

Page 67: CSC 2224: Parallel Computer Architecture and Programming ...

Summary: AL-DRAM

• Observation– DRAM timing parameters are dictated by the worst-case cell

(smallest cell at highest temperature)

• Our Approach: Adaptive-Latency DRAM (AL-DRAM) – Optimizes DRAM timing parameters for the common case

(typical DIMM operating at low temperatures)

• Analysis: Characterization of 115 DIMMs– Great potential to lower DRAM timing parameters (17 –

54%) without any errors

• Real System Performance Evaluation – Significant performance improvement (14% for memory-

intensive workloads) without errors (33 days)

Page 68: CSC 2224: Parallel Computer Architecture and Programming ...

Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case

Donghyuk Lee, Yoongu Kim,

Gennady Pekhimenko, Samira Khan, VivekSeshadri, Kevin Chang, and Onur Mutlu

Published in the proceedings of 21st

International Symposium on High Performance Computer Architecture 2015

68

Page 69: CSC 2224: Parallel Computer Architecture and Programming ...

Outline

1. What is DRAM?

2. DRAM Internal Organization

3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013;

Adaptive-Latency DRAM, HPCA 2015)

– Parallelism (Subarray-level Parallelism, ISCA 2012)

69

Page 70: CSC 2224: Parallel Computer Architecture and Programming ...

Parallelism: Demand vs. Supply

Demand Supply

Out-of-order Execution

Multi-cores

Prefetchers

70

MultipleBanks

Page 71: CSC 2224: Parallel Computer Architecture and Programming ...

Increasing Number of Banks?

How to improve available parallelism within DRAM?

Adding more banks → Replication of shared structures

Replication → Cost

71

Page 72: CSC 2224: Parallel Computer Architecture and Programming ...

Our Observation

72

Time

1. Wordline enable

2. Charge sharing

3. Sense amplify

4. Charge restoration

5. Wordline disable

6. Restore sense-amps

Data Access

Local to a subarray

Page 73: CSC 2224: Parallel Computer Architecture and Programming ...

Subarray-Level Parallelism

73

Ro

w D

eco

der

1 0 0 1 1 0 0 1

Ro

w D

eco

der

Row Address Column Read/Write

Ro

w A

dd

ress

Ro

w A

dd

ress

Replicate

Time share

Page 74: CSC 2224: Parallel Computer Architecture and Programming ...

Subarray-Level Parallelism: Benefits

74

Time

Data Access

Data Access

Data Access

Data Access Saved Time

Commodity DRAM

Subarray-Level Parallelism

Two requests to different subarraysin same bank

Page 75: CSC 2224: Parallel Computer Architecture and Programming ...

Results Summary

75

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Performance

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Energy Consumption

Commodity DRAM Subarray-Level Parallelism

17%

19%

Page 76: CSC 2224: Parallel Computer Architecture and Programming ...

A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM

Yoongu Kim, Vivek Seshadri, Donghyuk Lee,Jamie Liu, Onur Mutlu

Published in the proceedings of 39th

International Symposium on Computer Architecture 2012

76

Page 77: CSC 2224: Parallel Computer Architecture and Programming ...

Review #4

• RowClone: Fast and Energy-Efficient

In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri et al., MICRO 2013

77

Page 78: CSC 2224: Parallel Computer Architecture and Programming ...

CSC 2224: Parallel Computer Architecture and ProgrammingMain Memory Fundamentals

Prof. Gennady Pekhimenko

University of Toronto

Fall 2020

The content of this lecture is adapted from the slides of Vivek Seshadri, Donghyuk Lee, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU