PRACTICAL NONVOLATILE MULTILEVEL-CELL PHASE CHANGE MEMORY Jichuan Chang, Robert S. Schreiber, Norman P. Jouppi Hewlett-Packard Labs Doe Hyun Yoon IBM T. J. Watson Research Center

Page 1

PRACTICAL NONVOLATILE MULTILEVEL-CELL PHASE CHANGE MEMORY

Jichuan Chang,

Robert S. Schreiber,

Norman P. Jouppi

Hewlett-Packard Labs

Doe Hyun Yoon

IBM T. J. Watson Research Center

Page 2

MEMORY CAPACITY CHALLENGE IN HPC

• DRAM as main memory

– Scaling is slowing down

• Hard to meet ever-increasing capacity demand

• Byte-addressable nonvolatile memory

– Phase change memory (PCM), memristor, …

– Scales better than DRAM

– Multilevel-cell (MLC) capability

– Nonvolatility

• Checkpoint, in-situ post processing

• High-performance file system

• NV MLC PCM for continued capacity scaling

Page 3

MAJOR CHALLENGE: RESISTANCE DRIFT

• Conventional 4-Level-Cell (4LC) Designs

– Naïve 4LC is useless

– Optimized 4LC is only barely usable

– Still needs refresh, so it is effectively volatile memory

• Observation: Most errors in 4LC occur in one cell state

• Proposal: 3-Level-Cell (3LC) PCM

– Simple, genuinely nonvolatile (>10 years retention)

– 3-ON-2 and mark-and-spare

• Low-cost wearout tolerance for 3LC

– 1.41 bits/cell (vs. 1.52 in 4LC)

• Only 7% lower capacity than (volatile) 4LC

Page 4

PCM AND RESISTANCE DRIFT

Page 5

PHASE CHANGE MEMORY

• Best of DRAM and Flash

– Higher capacity, better scaling (vs. DRAM)

– Faster, byte-addressable NVM (vs. Flash)

• MLC (Multilevel-Cell) capability

– Store more than 1 bit per cell

• E.g., 2 bits per cell

• Caveats:

– Slow, low-bandwidth write

– Finite write endurance

– Resistance drift

Callout: common problems in both SLC and MLC

Page 6

RESISTANCE DRIFT

• PCM cell resistance increases over time

– $R(t)$: cell resistance at time $t$ ($t > 0$)

• A cell is programmed at $t = 0$

• Sensed as $R_0$ at time $t_0$ ($> 0$)

• $\alpha$: drift rate ($0 < \alpha < 1$)

• Drift errors

– Negligible in SLC PCM

– Major reliability problem in MLC PCM

$$R(t) = R_0 \left(\frac{t}{t_0}\right)^{\alpha}$$
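The drift law on this slide can be evaluated directly. The snippet below is a minimal sketch, assuming the power-law form with drift rate `alpha`; the function names and the example values in the comments are illustrative and do not come from the slides.

```python
import math


def drifted_resistance(r0, t, alpha, t0=1.0):
    """Resistance at time t for a cell sensed as r0 at time t0.

    Assumes the power-law drift model R(t) = R0 * (t / t0) ** alpha.
    """
    if t <= 0 or t0 <= 0:
        raise ValueError("t and t0 must be positive")
    return r0 * (t / t0) ** alpha


def drifted_log10_resistance(log10_r0, t, alpha, t0=1.0):
    """Same model in the log domain: drift adds alpha * log10(t / t0)."""
    if t <= 0 or t0 <= 0:
        raise ValueError("t and t0 must be positive")
    return log10_r0 + alpha * math.log10(t / t0)


# Illustrative values only: a cell sensed at 10 kOhm at t0 = 1 s.
r_later = drifted_resistance(1e4, t=100.0, alpha=0.5)
```

In the log domain the drift is a linear shift, which is why the state distributions on the following slides are drawn on a log10 R axis.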

Page 7

DRIFT ERRORS IN 4LC PCM

• 4 cell states: S1, S2, S3, S4

– PDF is truncated Gaussian

• $\pm 2.75\sigma$ around mean values

• Mean resistance values: $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$

– Threshold between states: $\tau_1$, $\tau_2$, $\tau_3$

• Drift rate ($\alpha$) increases with cell resistance

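[Figure: resistance distributions of states S1, S2, S3, S4 on a log10 R axis (2.5 to 6.5), with the four state means and the three thresholds between states marked]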

Page 8

DRIFT ERROR RATES

• Monte-Carlo simulation

• Errors only in S2 and S3

[Figure: fraction of cells with an error (1E-10 to 1E+00, log scale) vs. time (2 s to 34865 years) for states S2 and S3]
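A Monte-Carlo estimate of the drift error rate for one state can be sketched as follows. This is only an illustration of the technique: the truncated-Gaussian initial resistance follows the earlier slide, but the normally distributed drift rate, the clamping of negative drift rates to zero, and every numeric parameter below (mean, sigma, threshold, drift-rate statistics) are assumptions chosen for the example, not values from the slides.

```python
import math
import random


def sample_truncated_gauss(rng, mean, sigma, k=2.75):
    """Rejection-sample a Gaussian truncated to mean +/- k * sigma."""
    while True:
        x = rng.gauss(mean, sigma)
        if abs(x - mean) <= k * sigma:
            return x


def drift_error_rate(mean_log_r, sigma, threshold, alpha_mean, alpha_sigma,
                     t, t0=1.0, trials=100_000, seed=0):
    """Fraction of cells whose log10 resistance drifts above `threshold` by time t."""
    rng = random.Random(seed)
    log_ratio = math.log10(t / t0)
    errors = 0
    for _ in range(trials):
        log_r0 = sample_truncated_gauss(rng, mean_log_r, sigma)
        # Assumption: per-cell drift rate is Gaussian, clamped at zero
        # because drift only increases resistance.
        alpha = max(0.0, rng.gauss(alpha_mean, alpha_sigma))
        if log_r0 + alpha * log_ratio > threshold:
            errors += 1
    return errors / trials


# Hypothetical parameters for an intermediate state (log10 ohms).
rate = drift_error_rate(mean_log_r=5.0, sigma=1 / 6, threshold=5.5,
                        alpha_mean=0.06, alpha_sigma=0.024,
                        t=2 ** 10, trials=20_000)
```

The highest-resistance state has no upper threshold to cross, and in this model a state with a small drift rate rarely reaches its threshold, which is consistent with errors appearing only in the intermediate states.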

Page 9

REFRESH

• Refresh before cells lose their data

– Consumes already limited PCM write BW

– Too-frequent refresh will make PCM unavailable to users

• What PCM refresh interval is acceptable?

– At least 50% of write BW should be available to users

– Refresh interval >17 minutes

• Caveat: PCM w/ refresh is no longer nonvolatile

Page 10

CELL ERROR RATE

• What cell error rate is tolerable?

– Goal: 10-year device MTBF

• Fewer than 1 erroneous 64B block in a 16GB device for 10 years

– CER >1e-2

• Impossible to achieve the goal even with unrealistically strong ECC

– CER ~1e-3 @ 17min refresh

• Barely meets the goal with BCH-10

• More analysis in the paper
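The link between cell error rate, ECC strength, and the reliability goal can be sketched with a binomial tail. This is a rough back-of-the-envelope model, not the paper's analysis: it assumes independent cell errors, treats each erroneous cell as one bit error for a t-error-correcting code, and uses a hypothetical block of 306 cells (256 data cells plus 50 ECC cells).

```python
import math


def block_failure_prob(n_cells, cer, t_correctable):
    """Probability that more than t_correctable of n_cells are in error.

    Sums the binomial tail directly so that very small probabilities
    are not lost to floating-point cancellation.
    """
    return sum(
        math.comb(n_cells, k) * cer ** k * (1.0 - cer) ** (n_cells - k)
        for k in range(t_correctable + 1, n_cells + 1)
    )


def expected_bad_blocks(n_blocks, n_intervals, p_block):
    """Expected number of uncorrectable blocks over all refresh intervals."""
    return n_blocks * n_intervals * p_block


# 16GB of 64B blocks, refreshed every 17 minutes for 10 years.
N_BLOCKS = (16 * 2 ** 30) // 64
N_INTERVALS = (10 * 365 * 24 * 60) // 17

for cer in (1e-2, 1e-3):
    p = block_failure_prob(306, cer, 10)
    print(f"CER={cer:g}: expected bad blocks ~ {expected_bad_blocks(N_BLOCKS, N_INTERVALS, p):.3g}")
```

The goal on the slide corresponds to an expected count below one over the device lifetime.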


Page 11

BASELINE 4LC PCM

Page 12

NAÏVE DESIGN: 4LCn

• Equal probability for all 4 states

• 17min refresh caps CER at ~1e-2

[Figure: fraction of cells with an error (1E-10 to 1E+00, log scale) vs. refresh interval (2 s to 34865 years) for 4LCn. Annotations: refresh interval > 17 min; CER ~1e-2, unacceptable]

Page 13

OPTIMAL STATE MAPPING

• Drift only increases cell resistance

• Optimize $\mu_2$, $\mu_3$, $\tau_1$, $\tau_2$, $\tau_3$ to minimize CER

– minimize $\mathrm{CER}(\mu_2, \mu_3, \tau_1, \tau_2, \tau_3)$

– subject to $\mu_i + 2.75\sigma + \delta < \tau_i < \mu_{i+1} - 2.75\sigma - \delta$

– for $i = 1, 2, 3$

[Figure: pdf of cell resistance (0 to 2.5) on a log10 R axis (2.5 to 6.5) for states S1, S2, S3, S4 under the simple mapping and the optimal mapping, with the state means, the thresholds, and the minimum spacing marked]

Page 14

OPTIMAL STATE MAPPING: 4LCo

• CER ~1e-3 @ 17-min refresh

• With BCH-10, it meets the goal

[Figure: fraction of cells with an error (1E-10 to 1E+00, log scale) vs. refresh interval (2 s to 34865 years) for 4LCn and 4LCo. Annotations: refresh interval > 17 min; CER ~1e-3, barely usable with 10-bit correcting ECC]

Page 15

PROPOSAL: 3LC PCM

Page 16

PROPOSAL: 3LC PCM

• Observation:

– Most errors occur in one state (S3)

• DO NOT USE IT

– Wide margin for S2

• Simple and optimal mapping (3LCn & 3LCo)

[Figure: two plots of the pdf of cell resistance (0 to 2.5) on a log10 R axis (2.5 to 6.5). One highlights state S3 and its threshold in 4LC; the other shows the 3LC states S1, S2, S4 under the simple and optimal mappings, with a wide margin above S2]

Page 17

3LC DESIGNS (3LCn AND 3LCo)

• Reliable for >10 years w/o ECC & refresh

• Genuinely nonvolatile

[Figure: fraction of cells with an error (1E-10 to 1E+00, log scale) vs. refresh interval (2 s to 34865 years) for 4LCn, 4LCo, 3LCn, and 3LCo. Annotation: CER ~1e-9 at 16 years, no ECC, no refresh]

Page 18

3LC PCM DESIGN ISSUES

• How to store information?

– Binary information in ternary cells

• What about wearout failures?

• How to compensate for the reduced cell density?

– 3LC’s ideal capacity is 1.58 bits/cell ($\log_2 3$)

– vs. 2 bits/cell in 4LC

Page 19

HOW TO STORE BINARY INFO IN TERNARY CELLS?

• 3-ON-2

– Store three bits in two ternary cells

– 64B (512-bit) data block in 342 cells

• 9 states in 2 ternary cells

• 8 states for 3-bit data

• INVALID state

– (S4, S4)

– Use this for tolerating wearout failures

First cell Second cell 3-bit data

S1 S1 000

S1 S2 001

S1 S4 010

S2 S1 011

S2 S2 100

S2 S4 101

S4 S1 110

S4 S2 111

S4 S4 INVALID
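The mapping table above is equivalent to writing the 3-bit value as a two-digit base-3 number over the states (S1, S2, S4). The following is a minimal sketch of that encoding; the function names are hypothetical and chosen only for illustration.

```python
import math

STATES = ("S1", "S2", "S4")   # the three states kept in 3LC (S3 is unused)
INVALID = ("S4", "S4")        # ninth combination, reserved as a failure mark


def encode_3on2(value):
    """Map a 3-bit value (0..7) to a pair of ternary cell states."""
    if not 0 <= value < 8:
        raise ValueError("value must fit in 3 bits")
    return STATES[value // 3], STATES[value % 3]


def decode_3on2(pair):
    """Map a pair of cell states back to 3 bits, or None for INVALID."""
    if tuple(pair) == INVALID:
        return None
    return STATES.index(pair[0]) * 3 + STATES.index(pair[1])


def cells_for_bits(n_bits):
    """Ternary cells needed to hold n_bits with 3-ON-2 (two cells per 3 bits)."""
    return 2 * math.ceil(n_bits / 3)


print(encode_3on2(0b101))      # ('S2', 'S4')
print(cells_for_bits(512))     # 342
```

A 512-bit block needs 171 pairs, which gives the 342 cells quoted on the slide.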

Page 20

TOLERATING WEAROUT FAILURES IN 3LC

• PCM has only finite write endurance

– ~$10^8$ writes per cell

• Mark-and-spare

– A low-cost wearout failure tolerance mechanism for 3LC

– Use 3LC’s INVALID state to mark a cell pair with a failure

– No need to store the failed-cell location

– 2 spare cells per failure

• cf. ECP [Schechter+ ISCA’10]

– Needs a pointer and a spare cell for each failure

– 5 cells per failure with a 512-bit data block and 4LC

Page 21

MARK-AND-SPARE EXAMPLE

• Use INVALID (S4, S4) to mark a cell pair w/ failure

– A stuck-at cell can be revived by applying reverse current [Goux+ IEEE TED’09]

• Need a spare pair for tolerating a failure

[Figure: a block of ternary cells grouped into cell pairs (3 bits per pair), with data pairs D0 to D7 and spare pairs S0 and S1; a pair with a wearout failure is marked]
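One way to picture mark-and-spare is as a skip-over scheme: a failed pair is written as INVALID and the remaining data shifts toward the spare pairs, so a reader only has to skip INVALID pairs. The sketch below illustrates that idea under simplifying assumptions. Pairs are modeled as 3-bit integers, `INVALID` is a stand-in for the (S4, S4) combination, and the shifting policy and function names are hypothetical rather than taken from the slides.

```python
INVALID = None  # stand-in for the (S4, S4) mark


def write_block(values, failed_pairs, num_pairs):
    """Lay out 3-bit values over num_pairs physical pairs, skipping failed ones.

    failed_pairs: set of physical pair indices (each < num_pairs) known to
    contain a worn-out cell. Unused spare pairs are filled with 0.
    """
    if len(values) + len(failed_pairs) > num_pairs:
        raise ValueError("not enough spare pairs for this many failures")
    remaining = iter(values)
    pairs = []
    for index in range(num_pairs):
        if index in failed_pairs:
            pairs.append(INVALID)             # mark the failed pair
        else:
            pairs.append(next(remaining, 0))  # next data value, or filler
    return pairs


def read_block(pairs, num_values):
    """Recover the data by dropping INVALID pairs; no failure pointers needed."""
    valid = [p for p in pairs if p is not INVALID]
    return valid[:num_values]


# Eight data pairs plus two spares, with failures at physical pairs 2 and 5.
data = list(range(8))
stored = write_block(data, failed_pairs={2, 5}, num_pairs=10)
assert read_block(stored, len(data)) == data
```

Because the mark lives in the failed pair itself, the cost per failure is the two cells of one spare pair.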

Page 22

HOW TO CORRECT WEAROUT FAILURES?

Page 23

CAPACITY: 3LC VS. 4LC

• 64B (512-bit) block

• 3LC needs fewer bits than 4LC for error correction

– 6 wearout failures: Mark-and-spare (2 cells/failure) vs. ECP (5 cells/failure)

– Drift errors: BCH-1 vs. BCH-10

• 3LC: 1.41 bits/cell, 4LC: 1.52 bits/cell (3LC is 7% lower)

• Besides, 3LC is nonvolatile

Number of cells per 64B block:

Design   Data   Wearout failure correction   Drift error correction   Total
3LC      342    12                           10                       364
4LC      256    31                           50                       337
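The bits-per-cell figures follow from dividing the 512 data bits by the total cell count of each design (data cells plus wearout and drift correction cells). As a worked check:

```latex
\begin{align*}
\text{3LC:}\quad & \frac{512}{342 + 12 + 10} = \frac{512}{364} \approx 1.41 \text{ bits/cell} \\
\text{4LC:}\quad & \frac{512}{256 + 31 + 50} = \frac{512}{337} \approx 1.52 \text{ bits/cell} \\
\text{Capacity gap:}\quad & 1 - \frac{1.41}{1.52} \approx 7\%
\end{align*}
```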

Page 24

CAPACITY VS. # WEAROUT FAILURES

• MLC has worse endurance than SLC

• May need to tolerate more than 6 wearout failures

[Figure: bits/cell (0.0 to 1.8) vs. number of wearout failures tolerated (0 to 20) for 4LC and 3LC]
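The trend in this plot can be approximated with a simple overhead model. The sketch below is an assumption-laden illustration: it keeps the drift-correction overhead fixed at 10 cells (3LC) and 50 cells (4LC), charges 2 cells per failure for mark-and-spare, and charges 5 cells per failure plus one extra cell for ECP in 4LC. The extra cell is chosen only so that six failures reproduce the 31 cells quoted in the deck; the exact ECP accounting may differ.

```python
BLOCK_BITS = 512


def bits_per_cell_3lc(failures):
    """3LC: 342 data cells, 2 spare cells per failure, 10 cells of drift ECC."""
    return BLOCK_BITS / (342 + 2 * failures + 10)


def bits_per_cell_4lc(failures):
    """4LC: 256 data cells, assumed 5 * failures + 1 ECP cells, 50 cells of drift ECC."""
    return BLOCK_BITS / (256 + 5 * failures + 1 + 50)


for failures in range(0, 21, 2):
    print(f"{failures:2d} failures: "
          f"3LC {bits_per_cell_3lc(failures):.2f}, "
          f"4LC {bits_per_cell_4lc(failures):.2f} bits/cell")
```

Under these assumptions the 4LC advantage shrinks as more failures must be tolerated, and the two designs reach the same density at 15 failures.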

Page 25

COMPARISON TO TRI-LEVEL-CELL PCM

• Recent work on MLC drift errors [ISCA’13]

– Same observation

• Most errors occur in the S3 state

– Same solution

• Use 3 levels instead of 4 levels

• TLC paper does not address

– Wearout failures

– Optimal resistance/threshold mapping

• Baseline 4LC is overly pessimistic (not usable at all)

• Unique feature in TLC paper

– Bandwidth-enhanced writes

Page 26

MLC PCM FOR CONTINUED CAPACITY SCALING

• Major challenge: resistance drift

• Conventional 4LC PCM is not practical

– Strong ECC and frequent refresh:

• Performance/power penalty

• Loses nonvolatility

• Proposal: 3LC PCM

– Simple, genuinely nonvolatile

– 3-ON-2 & Mark-and-spare

• Low-cost wearout tolerance mechanism for 3LC

– Only 7% lower capacity than (volatile) 4LC

• Generalized non-power-of-two level cells

– 5LC, 6LC, …