Improving Read Performance of PCM via Write Cancellation and Write Pausing

© 2007 IBM Corporation

HPCA – 2010


Moinuddin Qureshi, Michele Franceschini, and Luis Lastras

IBM T. J. Watson Research Center, Yorktown Heights, NY


Introduction

More cores in system → more concurrency → larger working set

DRAM-based memory system hitting: power, cost, scaling wall

Phase Change Memory (PCM): Emerging technology, projected to be more scalable, higher density, power-efficient


PCM Operation

Switching by heating using electrical pulses

RESET state: amorphous (high resistance)
SET state: crystalline (low resistance)

Read latency 2x-4x of DRAM. Write latency much higher

[Figure: temperature vs. time for the RESET and SET pulses, relative to Tmelt and Tcryst]

[Figure: PCM cell with access device and memory element; labels: Large Current / SET, low resistance and Small Current / RESET, high resistance. Photo courtesy: Bipin Rajendran, IBM]


Problem of Contention from Slow Writes

PCM writes are 4x-8x slower than reads. Writes are not latency critical.

Typical response: use large buffers and intelligent scheduling.

But once write is scheduled to a bank, later arriving read waits

Write request causes contention for reads → increased read latency


Outline

• Introduction
• Quantifying the Problem
• Adaptive Write Cancellation
• Write Pausing
• Combining Cancellation & Pausing
• Summary


Configuration: Hybrid Memory

Processor Chip → DRAM Cache (256MB) → PCM-Based Main Memory

Each bank has a separate RDQ and WRQ (32-entry)

Baseline uses read-priority scheduling while the WRQ is less than 80% full. If the WRQ is more than 80% full, it switches to an oldest-first policy, a “forced write” (rare, <0.1%)
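The baseline policy can be sketched in a few lines of Python. This is only an illustration, not the simulator used in the talk: `next_request`, the `(arrival_time, address)` tuple layout, and the reading of "oldest-first" as "oldest request across both queues" are assumptions.

```python
from collections import deque

WRQ_CAPACITY = 32             # entries per bank, from the slide
FORCED_WRITE_FRACTION = 0.8   # occupancy above which read priority is dropped


def next_request(rdq, wrq):
    """Pick the next request for an idle bank.

    rdq and wrq are deques of (arrival_time, address) tuples in arrival order.
    Returns (kind, request), or None when both queues are empty.
    """
    if len(wrq) > FORCED_WRITE_FRACTION * WRQ_CAPACITY:
        # Assumed meaning of "oldest-first": serve the oldest request overall.
        # A write chosen in this mode is a forced write.
        if rdq and rdq[0][0] <= wrq[0][0]:
            return "read", rdq.popleft()
        return "forced_write", wrq.popleft()
    if rdq:
        # Read priority: any pending read goes ahead of pending writes.
        return "read", rdq.popleft()
    if wrq:
        return "write", wrq.popleft()
    return None
```

With a 32-entry WRQ, the forced mode in this sketch starts at 26 pending writes, since 26 is the first occupancy above 80% of 32.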


Problem

Writes significantly increase read latency (Problem only for asymmetric memories)

Read Latency = 1K cycles, Write Latency = 8K cycles (sensitivity in paper)

12 workloads: each with 8 benchmarks from SPEC06

[Figure: Effective Read Latency (cycles) for four configurations: Baseline, No Read Priority, Write Latency=1K, Write Latency=0]

[Figure: Normalized Execution Time for the same four configurations]


Outline

• Introduction
• Problem: Writes Delaying Reads
• Adaptive Write Cancellation
• Write Pausing
• Combining Cancellation & Pausing
• Summary


Write Cancellation

Write Cancellation: “abort” an ongoing write to improve read latency

The cancelled line is left in a non-deterministic state: a read to that line is serviced from the matching entry in the WRQ

Perform write cancellation as soon as a read request arrives at a bank (as long as the write is not done in forced-mode)


Write Cancellation with Static Threshold

WCST: Cancel write request only if less than K% service done

Canceling a write request close to completion is wasteful and causes episodes of forced-writes (low performance)
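The static-threshold rule amounts to a single comparison. The sketch below is illustrative only: `should_cancel_static` and its parameters are hypothetical names, and progress is measured as the fraction of the write latency already spent.

```python
def should_cancel_static(cycles_done, write_latency, k_percent, forced=False):
    """Decide whether an in-flight write is cancelled for an arriving read.

    cycles_done   : cycles the write has already been serviced
    write_latency : total cycles the write needs (8K in the baseline setup)
    k_percent     : static threshold K, in percent
    forced        : True if the write was scheduled in forced mode
    """
    if forced:
        # Forced writes are never cancelled.
        return False
    percent_done = 100.0 * cycles_done / write_latency
    # Cancel only if less than K% of the service is done.
    return percent_done < k_percent
```

Under this rule K=0% never cancels and K=100% cancels any write that has not yet finished, which matches the two extremes of the threshold sweep.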

[Figure: Effective Read Latency (cycles) for WCST with K=0% (Never Cancel), K=50%, K=65%, K=75%, K=90%, and K=100% (Always Cancel); a bar exceeding the axis is labeled 2365]


Adaptive Write Cancellation

Best threshold depends on the number of pending entries in the WRQ.
• Fewer entries → higher threshold (best read latency)
• More entries → lower threshold (reduces forced writes)

Write Cancellation with Adaptive Threshold (WCAT):
Threshold = 100 - (4*NumEntriesInWRQ)
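As a worked example of the threshold formula, the sketch below evaluates it for a few queue occupancies. The function names are hypothetical, and clamping the threshold at 0% is an assumption for occupancies above 25 entries, where the formula would otherwise go negative.

```python
def wcat_threshold(num_entries_in_wrq):
    """Threshold (%) = 100 - 4 * NumEntriesInWRQ, clamped at 0 (assumed)."""
    return max(0, 100 - 4 * num_entries_in_wrq)


def should_cancel_adaptive(cycles_done, write_latency, num_entries_in_wrq):
    """Cancel the in-flight write only if its progress is below the threshold."""
    percent_done = 100.0 * cycles_done / write_latency
    return percent_done < wcat_threshold(num_entries_in_wrq)


if __name__ == "__main__":
    for entries in (0, 10, 20, 25):
        print(entries, "entries ->", wcat_threshold(entries), "% threshold")
```

An empty WRQ gives a 100% threshold (always cancel), 10 entries give 60%, and 25 entries give 0% (never cancel), just below the 80% occupancy at which a 32-entry WRQ switches to forced writes.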

[Figure: Threshold (0% to 100%) vs. Num Entries in WRQ (10, 20, 30): high threshold at low occupancy, falling to a low threshold near the forced-writes region]


Adaptivity of WCAT

Num Entries in WRQ   Low (0-1)   Med (2-13)   High (14-25)   Forced (26+)
WCST (K=75%)         61.4%       29.8%        7.4%           1.43%
WCAT                 58.2%       35.4%        5.6%           0.72%

WCAT uses a higher threshold initially, when the WRQ is empty, but a lower threshold later, which reduces the episodes of forced writes

We sampled all WRQ every 2M cycles to measure occupancy


Results for WCAT

[Figure: Average Read Latency (cycles) for Write Cancellation, WCST (K=75%), and WCAT. Baseline: 2365 cycles; Ideal: 1K cycles]

[Figure: Extra Write Cycles (%) for Write Cancellation, WCST (K=75%), and WCAT]

Adaptive threshold reduces latency and incurs half the overhead





Iterative Write in PCM devices

In Multi-Level Cells (MLC), the programming precision requirement increases linearly with the number of levels

PCM cells respond differently to same programming pulse

Acknowledged solution to address uncertainty: Iterative writes

Each iteration consists of steps of: write-read-verify

[Figure: loop of Write → Read → Verify; "Not done" returns to Write, "Done" exits]


Model for Iterative Writes

We develop an analytical model to capture the number of iterations, in terms of bits/cell, number of levels written in one shot, and learning

Time required to write a line is worst-case of all cells in line

Avg number of iterations: 8.3 (consistent with MLC literature)

MLC: 3 bits/cell
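The write-read-verify loop and the worst-case rule for a line can be illustrated with a toy model. This is not the analytical model from the talk: the per-cell `gain` (fraction of the remaining error corrected by each pulse), the `tolerance`, and the function names are all hypothetical.

```python
def write_cell(target, gain, tolerance=0.05, max_iters=64):
    """Program one cell iteratively and return the number of iterations used.

    `gain` models how strongly this particular cell responds to a pulse;
    different cells respond differently to the same programming pulse.
    """
    level = 0.0
    for iteration in range(1, max_iters + 1):
        level += gain * (target - level)          # write: apply one pulse
        readback = level                          # read: sense the cell
        if abs(target - readback) <= tolerance:   # verify: close enough?
            return iteration
    return max_iters


def write_line(targets, gains):
    """A line is done only when its slowest cell is done (worst case)."""
    return max(write_cell(t, g) for t, g in zip(targets, gains))
```

In this toy model a cell with gain 0.9 converges in 2 iterations and a cell with gain 0.5 needs 5, so a line containing both takes 5 iterations.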


Concept of Write Pausing

Iterative writes can be paused to service pending read requests

Reads can be performed at the end of each iteration (potential pause point)

[Figure: write timeline Iter 1, Iter 2, Iter 3, Iter 4 with potential pause points at the iteration boundaries; a read (Rd X) is serviced at the pause point between Iter 2 and Iter 3]

Better read latency with negligible write overhead

We extend the iterative write algorithm of Nirschl et al. [IEDM’07] to support Write Pausing
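The effect of pause points on a waiting read can be shown with simple arithmetic. The sketch below is an illustration under stated assumptions: all iterations take the same number of cycles, and `read_start_time` and its parameters are hypothetical names, not part of the algorithm of Nirschl et al.

```python
def read_start_time(arrival, write_start, num_iters, iter_cycles, pausing):
    """Cycle at which a read arriving at `arrival` can begin service.

    The bank is busy with one iterative write that starts at `write_start`
    and runs `num_iters` iterations of `iter_cycles` cycles each.
    """
    write_end = write_start + num_iters * iter_cycles
    if not (write_start <= arrival < write_end):
        return arrival                       # bank is not busy with this write
    if not pausing:
        return write_end                     # read waits for the whole write
    elapsed = arrival - write_start
    # Round up to the next iteration boundary (the next pause point).
    boundaries_passed = (elapsed + iter_cycles - 1) // iter_cycles
    return write_start + boundaries_passed * iter_cycles
```

For a write of 8 iterations of 1000 cycles starting at cycle 0, a read arriving at cycle 2300 starts at cycle 3000 with pausing and at cycle 8000 without it.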


Results for Write Pausing

[Figure: Effective Read Latency (cycles) for Baseline, Write Pause, and Anytime Pause]

Write Pausing at end of iteration gets 85% of benefit of “Anytime” Pause





Write Pausing + WCAT

[Figure: three write timelines over Iter 1 to Iter 4, each with a read (Rd X) arriving during the write; in the third timeline Rd X is serviced immediately and the interrupted iteration is labeled "Iter 2 Cancelled"]

Only one iteration is cancelled: this “micro-cancellation” has low overhead
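One way to read the combination is that the adaptive threshold is applied to the progress of the current iteration rather than of the whole write. The sketch below follows that reading; it is an assumption, as are the equal-length iterations and the name `service_read`. It expects a read that arrives while the write is in progress.

```python
def service_read(arrival, write_start, iter_cycles, wrq_entries):
    """Return (read_start, wasted_write_cycles) for a read arriving mid-write.

    Assumes arrival >= write_start and that the write is still in progress.
    """
    elapsed_in_iter = (arrival - write_start) % iter_cycles
    percent_done = 100.0 * elapsed_in_iter / iter_cycles
    threshold = max(0, 100 - 4 * wrq_entries)   # adaptive threshold, clamped at 0
    if percent_done < threshold:
        # Micro-cancellation: abort only the current iteration and redo it
        # later; the cycles already spent on it are the overhead.
        return arrival, elapsed_in_iter
    # Otherwise let the iteration finish and pause at its boundary.
    return arrival + (iter_cycles - elapsed_in_iter), 0
```

With 1000-cycle iterations and 5 entries in the WRQ (threshold 80%), a read arriving 300 cycles into an iteration is served at once at a cost of 300 redone cycles, while a read arriving 900 cycles in waits the remaining 100 cycles and wastes nothing.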


Results

[Figure: Effective Read Latency (cycles) for Write Pause, Write Pause + Micro Cancellation, and Anytime Pause]

[Figure: Speedup (wrt Baseline) for Write Pause, Write Pause + Micro Cancellation, and Anytime Pause]

Baseline: 2365 cycles; Ideal: 1K cycles

Write Pause + Micro Cancellation is very close to Anytime Pause (re-execution overhead of micro cancellation: <4% extra iterations)


Impact of Write Queue Size

We will need large buffers to best exploit the benefit of Pausing

[Figure: Speedup wrt Baseline (32-entry) vs. Number of Entries in Each WRQ (8, 16, 32, 64, 128, 256, 512) for Baseline and Pause + Micro Cancellation]





Summary

Slow writes increase the effective read latency (2.3x)

Write Cancellation: cancel ongoing write to service read
• Threshold-based write cancellation
• Adaptive Threshold: better performance, half the overhead

Write Pausing exploits iterative writes to service pending reads
• Write Pausing + Micro Cancellation close to optimal pause
• Effective read latency: from 2365 to 1330 cycles (1.45x speedup)

We will need large write buffers to exploit the benefit of Pausing


Questions


Write Pausing in Iterative Algorithms

(Nirschl+ IEDM’07)


Workloads and Figure of Merit

12 memory-intensive workloads from SPEC 2006:
• 6 rate-mode (eight copies of same benchmark)
• 6 mix-mode (two copies of four benchmarks)

Key metric: Effective Read Latency

Tin = Time at which read request enters RDQ
Tout = Time at which read request finishes service at memory

Effective Read Latency = Tout – Tin (average reported)
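The metric is a plain average over all read requests. A minimal sketch, where `effective_read_latency` and the `(t_in, t_out)` pair layout are hypothetical and the sample numbers are made up for illustration:

```python
def effective_read_latency(requests):
    """Average of Tout - Tin over an iterable of (t_in, t_out) pairs, in cycles."""
    latencies = [t_out - t_in for t_in, t_out in requests]
    return sum(latencies) / len(latencies)


if __name__ == "__main__":
    # One read served with no contention, one delayed behind a write.
    print(effective_read_latency([(0, 1000), (100, 2500)]))
```

Because Tin is taken when the request enters the RDQ, any time spent waiting behind a write counts toward the latency, which is why the metric is sensitive to write contention.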


Sensitivity to Write Latency

At WriteLatency=4K, the speedup is 1.35x instead of 1.45x (at 8K latency)
