Improving Read Performance of PCM via Write Cancellation and Write Pausing
© 2007 IBM Corporation
HPCA – 2010
Moinuddin Qureshi, Michele Franceschini, and Luis Lastras
IBM T. J. Watson Research Center, Yorktown Heights, NY
Introduction
More cores in system → more concurrency → larger working set
DRAM-based memory system hitting: power, cost, scaling wall
Phase Change Memory (PCM): Emerging technology, projected to be more scalable, higher density, power-efficient
PCM Operation
[Figure: temperature vs. time for the RESET and SET pulses, relative to Tmelt and Tcryst.]
Switching by heating using electrical pulses
RESET state: amorphous (high resistance)
SET state: crystalline (low resistance)
[Figure: cell with access device and memory element; labels: "Large current" with SET (low resistance), "Small current" with RESET (high resistance). Photo courtesy: Bipin Rajendran, IBM]
Read latency 2x-4x of DRAM. Write latency much higher
Problem of Contention from Slow Writes
PCM writes are 4x-8x slower than reads.
Writes are not latency critical. Typical response: use large buffers and intelligent scheduling.
But once a write is scheduled to a bank, a later-arriving read must wait.
Write requests cause contention for reads → increased read latency
Outline
Introduction
Quantifying the Problem
Adaptive Write Cancellation
Write Pausing
Combining Cancellation & Pausing
Summary
Configuration: Hybrid Memory
Processor Chip → DRAM Cache (256MB) → PCM-Based Main Memory
Baseline uses read-priority scheduling if the WRQ is < 80% full. If the WRQ is > 80% full, the oldest-first policy applies (“forced write”; rare, <0.1%).
Each bank has a separate RDQ and WRQ (32-entry)
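The baseline policy on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' simulator: the class `BankQueues`, the method `next_request`, and the string request labels are hypothetical, and "oldest-first" is interpreted here (an assumption) as issuing the oldest queued write ahead of reads. Only the 32-entry queue size and the 80% threshold come from the slide.

```python
from collections import deque

WRQ_CAPACITY = 32             # entries per bank (from the slide)
FORCED_WRITE_FRACTION = 0.8   # WRQ occupancy above which writes are forced


class BankQueues:
    """Per-bank read queue (RDQ) and write queue (WRQ); hypothetical helper."""

    def __init__(self):
        self.rdq = deque()
        self.wrq = deque()

    def next_request(self):
        """Pick the next request to issue to this bank.

        Returns (kind, request), where kind is "read", "write",
        "forced_write", or None when both queues are empty.
        """
        if len(self.wrq) > FORCED_WRITE_FRACTION * WRQ_CAPACITY:
            # WRQ is more than 80% full: issue the oldest write first
            # (assumed reading of the "oldest-first" forced-write policy).
            return "forced_write", self.wrq.popleft()
        if self.rdq:
            # Read priority: a pending read goes ahead of queued writes.
            return "read", self.rdq.popleft()
        if self.wrq:
            return "write", self.wrq.popleft()
        return None, None
```

With a 32-entry WRQ, the 80% rule in this sketch triggers at 26 or more queued writes, which lines up with the "Forced (26+)" occupancy bucket used later in the deck.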
Problem
Writes significantly increase read latency (Problem only for asymmetric memories)
Read Latency = 1K cycles, Write Latency = 8K cycles (sensitivity in paper)
12 workloads: each with 8 benchmarks from SPEC06
[Figure: two bar charts, Effective Read Latency (Cycles) and Norm. Execution Time, for four configurations: Baseline, No Read Priority, Write Latency=1K, Write Latency=0.]
Write Cancellation
Write Cancellation: “abort” an ongoing write to improve read latency
The line is left in a non-deterministic state: a read request that matches it is serviced from the WRQ
Perform write cancellation as soon as a read request arrives at a bank (as long as the write is not done in forced-mode)
Write Cancellation with Static Threshold
WCST: Cancel write request only if less than K% service done
Canceling a write request close to completion is wasteful and causes episodes of forced-writes (low performance)
[Figure: Effective Read Latency (Cycles) for K=0% (Never Cancel), K=50%, K=65%, K=75%, K=90%, and K=100% (Always Cancel); the y-axis spans 1000 to 1600 cycles, with one off-scale bar labeled 2365.]
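The static-threshold rule (WCST) can be sketched as a single decision function. The function name and parameters are hypothetical; the sketch assumes that "service done" is measured as elapsed cycles over the total write latency, and it follows the earlier slide in never cancelling a forced write.

```python
def should_cancel_static(elapsed_cycles, write_latency_cycles,
                         k_percent, forced=False):
    """Decide whether an in-flight write is cancelled for an arriving read.

    Cancels only if less than k_percent of the write has been serviced.
    Writes issued in forced mode are never cancelled.
    """
    if forced:
        return False
    percent_done = 100.0 * elapsed_cycles / write_latency_cycles
    return percent_done < k_percent
```

Under this rule K=0% never cancels and K=100% cancels any write that has not yet finished, matching the "Never Cancel" and "Always Cancel" endpoints in the chart.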
Adaptive Write Cancellation
Best threshold depends on the number of pending entries in the WRQ.
Fewer entries → higher threshold (best read latency)
More entries → lower threshold (reduces forced writes)
Write Cancellation with Adaptive Threshold (WCAT): Threshold = 100 - (4 * NumEntriesInWRQ)
[Figure: Threshold (0% to 100%) vs. Num Entries in WRQ (10, 20, 30): high at low occupancy, low near the forced-writes region.]
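The WCAT formula can be written out directly. In this sketch the function names are hypothetical, and clamping the threshold at 0 for occupancies above 25 entries is an assumption (the slide gives only the linear formula).

```python
def wcat_threshold(num_entries_in_wrq):
    """Cancellation threshold in percent: 100 - 4 * NumEntriesInWRQ.

    Clamped at 0 (an assumption) once the formula would go negative.
    """
    return max(0, 100 - 4 * num_entries_in_wrq)


def should_cancel_adaptive(elapsed_cycles, write_latency_cycles,
                           num_entries_in_wrq, forced=False):
    """Cancel an in-flight write only if its progress is below the
    occupancy-dependent threshold; forced writes are never cancelled."""
    if forced:
        return False
    percent_done = 100.0 * elapsed_cycles / write_latency_cycles
    return percent_done < wcat_threshold(num_entries_in_wrq)
```

For example, an empty WRQ gives a threshold of 100%, 10 entries give 60%, and 25 entries give 0%, so cancellation stops just as the queue reaches the forced-write region.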
Adaptivity of WCAT
Num Entries in WRQ   Low (0-1)   Med (2-13)   High (14-25)   Forced (26+)
WCST (K=75%)         61.4%       29.8%        7.4%           1.43%
WCAT                 58.2%       35.4%        5.6%           0.72%
WCAT uses a higher threshold initially, when the WRQ is empty, but a lower threshold later, which reduces the episodes of forced writes
We sampled all WRQ every 2M cycles to measure occupancy
Results for WCAT
[Figure: Average Read Latency (y-axis 1000 to 1550 cycles) for Write Cancellation, WCST (K=75%), and WCAT. Baseline: 2365 cycles; Ideal: 1K cycles.]
[Figure: Extra Write Cycles (%) (y-axis 0 to 45) for Write Cancellation, WCST (K=75%), and WCAT.]
Adaptive threshold reduces latency and incurs half the overhead
Iterative Write in PCM devices
In Multi-Level Cells (MLC), the programming precision requirement increases linearly with the number of levels
PCM cells respond differently to the same programming pulse
Acknowledged solution to address uncertainty: iterative writes
Each iteration consists of the steps write-read-verify
[Figure: loop of Write, Read, Verify; repeat if not done, exit when done.]
Model for Iterative Writes
We develop an analytical model to capture the number of iterations in terms of bits/cell, the number of levels written in one shot, and learning
Time required to write a line is the worst case over all cells in the line
Avg number of iterations: 8.3 (consistent with MLC literature)
MLC: 3 bits/cell
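The write-read-verify loop and the worst-case rule for a line can be illustrated with a small sketch. It is not the analytical model from the talk: the callbacks `apply_pulse` and `read_cell`, the `tolerance` parameter, and the `max_iters` cap are all hypothetical stand-ins for device behavior.

```python
def iterative_write(target, apply_pulse, read_cell, tolerance, max_iters=32):
    """Program one cell with write-read-verify; return iterations used.

    apply_pulse(target) and read_cell() are caller-supplied (hypothetical)
    hooks standing in for the device.
    """
    for iteration in range(1, max_iters + 1):
        apply_pulse(target)                    # write
        value = read_cell()                    # read
        if abs(value - target) <= tolerance:   # verify
            return iteration
    raise RuntimeError("cell did not converge within max_iters")


def line_write_iterations(cell_iterations):
    """A line is done only when its slowest cell is done (worst case)."""
    return max(cell_iterations)
```

Because the line takes as long as its slowest cell, a line whose cells need 3, 5, and 9 iterations is written in 9 iterations under this rule.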
Concept of Write Pausing
Iterative writes can be paused to service pending read requests
Reads can be performed at the end of each iteration (potential pause point)
[Figure: timeline of a four-iteration write (Iter 1 to Iter 4) with potential pause points between iterations; in the paused version, read Rd X is serviced between Iter 2 and Iter 3.]
Better read latency with negligible write overhead
We extend the iterative write algorithm of Nirschl et al. [IEDM’07] to support Write Pausing
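The effect of pausing on when a read can start is easy to sketch. The function below is illustrative only: its name and parameters are hypothetical, and it assumes all iterations have equal length, which the slides do not state.

```python
def read_start_time(arrival, write_start, iter_cycles, num_iters,
                    pausing=True):
    """Cycle at which a read arriving at `arrival` can begin service.

    Assumes (hypothetically) equal-length write iterations. Without
    pausing the read waits for the whole write; with pausing it waits
    only until the next iteration boundary.
    """
    write_end = write_start + iter_cycles * num_iters
    if arrival <= write_start or arrival >= write_end:
        return arrival  # the bank is not busy with this write
    if not pausing:
        return write_end
    elapsed = arrival - write_start
    boundaries_passed = -(-elapsed // iter_cycles)  # ceiling division
    return write_start + boundaries_passed * iter_cycles
```

With made-up numbers (8 iterations of 1000 cycles each), a read arriving at cycle 2300 starts at cycle 3000 with pausing and at cycle 8000 without it.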
Results for Write Pausing
[Figure: Effective Read Latency (y-axis 1000 to 2400) for Baseline, Write Pause, and Anytime Pause.]
Write Pausing at end of iteration gets 85% of benefit of “Anytime” Pause
Write Pausing + WCAT
[Figure: timelines of a four-iteration write (Iter 1 to Iter 4) and a read Rd X, including one where Rd X is serviced between Iter 2 and Iter 3 and one labeled “Iter2 Cancelled”.]
Only one iteration is cancelled: “micro-cancellation” has low overhead
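One way to picture the combination is to apply a cancellation threshold to the current iteration rather than to the whole write. The sketch below assumes exactly that, reusing the WCAT formula at iteration granularity; the function name, its parameters, and the clamp at 0 are assumptions, not details confirmed by the slides.

```python
def service_read(arrival, iter_start, iter_cycles, num_entries_in_wrq):
    """Return (read_start, wasted_write_cycles) for a read that arrives
    while a write iteration is in progress.

    Assumes iter_start <= arrival < iter_start + iter_cycles.
    """
    threshold = max(0, 100 - 4 * num_entries_in_wrq)  # WCAT formula, clamped
    elapsed = arrival - iter_start
    percent_done = 100.0 * elapsed / iter_cycles
    if percent_done < threshold:
        # Micro-cancellation: abort only this iteration; its elapsed
        # cycles are redone later.
        return arrival, elapsed
    # Otherwise pause at the end of the current iteration.
    return iter_start + iter_cycles, 0
```

The wasted work is bounded by a single iteration, which is why this form of cancellation costs far less than aborting an entire multi-iteration write.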
Results
[Figure: Effective Read Latency (y-axis 1000 to 1500) for Write Pause, Write Pause + Micro Cancellation, and Anytime Pause.]
Write Pause + Micro Cancellation is very close to Anytime Pause (re-execution overhead of micro-cancellation: <4% extra iterations)
[Figure: Speedup (wrt Baseline) (y-axis 1 to 1.5) for Write Pause, Write Pause + Micro Cancellation, and Anytime Pause.]
Baseline: 2365 cycles; Ideal: 1K cycles
Impact of Write Queue Size
We will need large buffers to best exploit the benefit of Pausing
[Figure: Speedup wrt Baseline (32-entry) vs. Number of Entries in Each WRQ (8, 16, 32, 64, 128, 256, 512) for Baseline and Pause + Micro Cancellation; y-axis 0 to 1.6.]
Summary
Slow writes increase the effective read latency (2.3x)
Write Cancellation: cancel an ongoing write to service a read
• Threshold-based write cancellation
• Adaptive threshold: better performance, half the overhead
Write Pausing exploits iterative writes to service pending reads
• Write Pausing + Micro Cancellation is close to optimal pause
• Effective read latency: from 2365 to 1330 cycles (1.45x speedup)
We will need large write buffers to exploit the benefit of Pausing
Questions
Write Pausing in Iterative Algorithms
(Nirschl+ IEDM’07)
Workloads and Figure of Merit
12 memory-intensive workloads from SPEC 2006:
• 6 rate-mode (eight copies of same benchmark)
• 6 mix-mode (two copies of four benchmarks)
Key metric: Effective Read Latency
Tin = Time at which read request enters RDQ
Tout = Time at which read request finishes service at memory
Effective Read Latency = Tout – Tin (average reported)
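The metric is a plain average of per-request queueing plus service time. A minimal sketch, with a hypothetical function name and a hypothetical input format of (Tin, Tout) pairs in cycles:

```python
def effective_read_latency(requests):
    """Average of (Tout - Tin) over all read requests.

    `requests` is a non-empty iterable of (t_in, t_out) pairs in cycles.
    """
    latencies = [t_out - t_in for t_in, t_out in requests]
    return sum(latencies) / len(latencies)
```

For instance, three reads with (Tin, Tout) of (0, 1000), (500, 3000), and (2000, 3000) have latencies of 1000, 2500, and 1000 cycles, for an effective read latency of 1500 cycles.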
Sensitivity to Write Latency
At WriteLatency=4K, the speedup is 1.35x instead of 1.45x (at 8K latency)