
DWC: Dynamic Write Consolidation for Phase Change Memory Systems

Fei Xia*† Dejun Jiang* Jin Xiong* Mingyu Chen* Lixin Zhang* Ninghui Sun*

*SKL Computer Architecture, ICT, CAS, Beijing, China
†University of Chinese Academy of Sciences, Beijing, China

{xiafei2011, jiangdejun, xiongjin, cmy, zhanglixin, snh}@ict.ac.cn

ABSTRACT

Phase change memory (PCM) is promising to become an alternative main memory thanks to its better scalability and lower leakage than DRAM. However, the long write latency of PCM puts it at a severe disadvantage against DRAM. In this paper, we propose a Dynamic Write Consolidation (DWC) scheme to improve PCM memory system performance while reducing energy consumption. This paper is motivated by the observation that a large fraction of a cache line being written back to memory is not actually modified. DWC exploits the unnecessary burst writes of unmodified data to consolidate multiple writes targeting the same row into one write. By doing so, DWC enables multiple writes to be sent within one. DWC incurs low implementation overhead and shows significant efficiency. The evaluation results show that DWC achieves up to 35.7% performance improvement, and 17.9% on average. The effective write latency is reduced by up to 27.7%, and 16.0% on average. Moreover, DWC reduces energy consumption by up to 35.3%, and 13.9% on average.

Categories and Subject Descriptors

C.0 [Computer Systems Organization]: General—System architectures

Keywords

Phase Change Memory; Write consolidation; Performance optimization

1. INTRODUCTION

DRAM has been the choice for main memory for decades. However, it is becoming difficult for DRAM to keep scaling down to smaller cells [4] due to limitations such as capacitor placement, device leakage, and charge sensing. Moreover, larger DRAMs result in increasing energy consumption, which accounts for 20% to 40% of total server energy [17, 22, 33]. Recently, emerging non-volatile memories (NVMs) have shown great potential to become promising candidates for main memory [14, 37, 31] due to their better scalability and lower leakage. Phase Change Memory (PCM), which is already commercially available [28], is one such NVM.

ITRS [4] projects that PCM will have write energy and write parallelism similar to DRAM in the near future. However, PCM still has a major weakness: long write latency, which raises a challenging issue for its adoption in main memory [4]. Several approaches have been proposed to reduce the impact of PCM's long write latency on memory system performance. For instance, PreSET [24] and two-stage-write [35] exploit PCM's asymmetries between writing bit one and writing bit zero to reduce the time to write the PCM array. These studies have shown that reducing the unexpected impact of long PCM write latency is critical to improving system performance.

PCM employs burst accesses to memory chips. The PCM memory controller uses one write command and multiple bursts (the number of bursts is called the Burst Length, BL) to transfer a cache line to PCM chips. It has been observed by many that a large fraction of the data written to main memory is not actually modified. We also observe from our study that 54% of eight-byte words written back to memory from the caches are unmodified. In a conventional system, these unmodified data are written to memory as part of their respective cache lines. We argue that the bursts used to send unmodified data can be saved to send modified data of other write commands.

The silent store scheme [19] reduces unnecessary writes to memory by not sending cache lines that are entirely unmodified, but it cannot handle partially modified cache lines. Lee et al. use partial writes to the PCM array by tracking dirty data from the L1 cache to the memory banks at different granularities [14]. Their approach does not reduce the time to transfer cache line data to memory. To the best of our knowledge, there is no prior work that exploits the unnecessary bursts of unmodified data blocks to process other writes.

In this paper, we propose a new write scheme, Dynamic Write Consolidation (DWC), to improve PCM main memory performance while reducing energy consumption. The main idea of DWC is to consolidate multiple write commands into BL bursts by utilizing the unnecessary burst writes of unmodified data blocks. Figure 1 shows the timing diagrams of the conventional write scheme, the silent store scheme, and DWC. There are three consecutive write commands targeting the same row.


[Figure 1: Timing diagram for different PCM write schemes (conventional PCM write, silent-store scheme, and DWC), showing the DQ bus from t0 to t12. The numbers of modified data blocks of W1, W2, and W3 are 4, 0, and 4. The gray hexagon represents transferring an unmodified data block in burst write mode.]

A cache line has eight eight-byte words¹ and BL is eight. W1 and W3 each have four modified data blocks, and all data blocks of W2 are unmodified. The conventional write scheme completes the three writes at time t12, regardless of whether the whole cache line or only a few words are modified. The silent store scheme does not write unmodified cache lines, so W2 is eliminated and the writes complete at time t8. In DWC, unmodified words are not written back to memory, so W2 is also unnecessary. Moreover, W1 and W3 are consolidated to write their modified data blocks within BL bursts. Thus DWC completes at time t4. To apply DWC in real PCM memory systems, we need to address the following three challenges:

1. How to let the memory controller get the modificationinformation of evicted cache line data.

2. How to find write commands that can be consolidated with a write command selected to be issued. Write commands that can be consolidated must satisfy three conditions: (i) they access the same row of the same bank; (ii) there is no read command between them that accesses the same column; (iii) the sum of the modified data blocks of these write commands is less than or equal to BL.

3. What modifications to PCM chips are needed to support DWC. Specifically, a PCM chip must be able to handle two or more write commands within one BL burst.

We identify whether a data block is modified in the cache hierarchy and propagate this information to the memory controller. Within the memory controller, we propose a two-stage, low-overhead search mechanism to find write commands that can be consolidated. To support DWC, we propose adding a few extra column address buffers and column decoders to each PCM chip. This paper makes the following contributions:

• We propose a new write scheme, DWC, that allows PCM main memory to reduce the impact of its long write latency on overall system performance. DWC fully utilizes the burst writes of unmodified data and consolidates multiple writes into one write command, which improves performance.

• We quantitatively evaluate the latency, area, and power overheads of implementing DWC with CACTI 6.0 and by synthesizing RTL code. The results show that DWC incurs low implementation overhead.

• We conduct an extensive evaluation to show the effectiveness and efficiency of DWC. The experimental results show that DWC achieves up to 35.7% performance improvement, and 17.9% on average. DWC reduces energy consumption by up to 35.3%, and 13.9% on average. We also show the sensitivity of DWC under varying queue depths, page sizes, LLC sizes, and replacement policies.

¹An eight-byte word is also called a data block in this paper.

[Figure 2: Conventional 64B cache line and the burst write process. (a) A 64B cache line divided into eight 8B words, D0–D7. (b) DDR3 burst write timing: the write command and column address are issued at t0; after WL = 4 cycles, the eight data blocks are transferred on DQ at double data rate over BL/2 = 4 cycles.]

2. BACKGROUND AND MOTIVATION

2.1 Burst reads/writes

PCM is designed to adopt interface specifications similar to DRAM's, in which the memory controller sends read and write requests in burst mode. Burst reads prefetch more than one column of data and transfer the data through the external data bus in a pipelined fashion. Similarly, one write command triggers burst writes by sending only the command and a starting column address; the memory controller then transfers the data of consecutive columns after the starting address. Figure 2 shows an example of burst writes for a 64B cache line. Assuming the data bus width is 64 bits (8B), a cache line can be divided into eight eight-byte words, as shown in Figure 2(a). Figure 2(b) shows the burst write process of DDR3. At time t0, the memory controller issues a write command and sends the column address on the address bus.


[Figure 3: Unmodified data proportions at different granularities: per workload, the fraction of unmodified data at 64B cache-line granularity and at 8B-word granularity among modified cache lines.]

After WL² cycles, the memory controller starts to transfer the cache line data through the data bus in continuous burst mode, at time t4. A cache line needs eight burst writes. The data blocks are transferred at both the rising and falling edges of the external bus clock. Therefore, the memory controller completes the data transfer at time t8.
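To make the timing concrete, here is a minimal C sketch that reproduces the burst-write schedule of Figure 2(b) under the parameters stated above (WL = 4, BL = 8, double data rate). The constants come from the text; everything else is illustrative.

    #include <stdio.h>

    /* Sketch of the DDR3 burst-write timing of Figure 2(b): a write
     * command and column address at t0, data on the bus from t4 to t8. */
    int main(void) {
        const int WL = 4;        /* cycles between write command and first data */
        const int BL = 8;        /* bursts per 64B cache line (8B per burst)    */
        const int data_cycles = BL / 2;  /* DDR: two bursts per bus cycle       */

        printf("data starts at t%d, transfer completes at t%d\n",
               WL, WL + data_cycles);    /* prints t4 and t8 */
        return 0;
    }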

2.2 Motivation

The key observation that motivates our work is that a large fraction of the cache line data written back to memory is not actually modified. Figure 3 shows the proportions of unmodified data at 64B (cache line) and 8B (word) granularity. On average, 44.8% of the data is not modified at cache line granularity, which means these cache lines do not need to be written back to memory even though they are dirty.

Among modified cache lines, 54.1% of the words are still unmodified on average. This indicates that more than half of the burst writes within one write command are wasted on transferring unmodified data. This motivates us to propose dynamic write consolidation, namely DWC. By avoiding unnecessary burst writes and consolidating multiple writes into BL bursts, DWC can effectively improve PCM memory performance.

3. DYNAMIC WRITE CONSOLIDATION

In this section, we present the design details of DWC. Figure 4 shows the architecture overview of DWC. DWC requires the collaboration of the cache hierarchy, the memory controller, and the memory chips. DWC first needs to identify the unmodified data blocks within a cache line. Thus, the cache hierarchy records the data block modification information of the last level cache and propagates it to the memory controller. When the memory controller issues a write command, it needs to find other write commands in the command queue that can be consolidated, based on the data block modification information. To support DWC, the peripheral circuitry of PCM chips also needs to be modified. For each part, we discuss the implementation overhead.

Although DWC targets PCM memory systems, one can also apply it to DRAM memory systems. However, reducing the unexpected impact of long write latency is more critical for PCM in the exploration of next-generation memory systems using advanced features of PCM chips. Thus, we concentrate our attention and discussion on PCM main memory in this paper.

²WL denotes the time interval between a column access command and the start of data on the data bus.

[Figure 4: Overview of the DWC architecture. The last-level cache stores per-line data block modification information alongside the tag and data, and delivers it to the memory controller when a line is replaced and written back; the memory controller performs write consolidation over its command queue; the PCM chips' peripheral circuitry is modified to support consolidated writes to the PCM array.]


3.1 Cache hierarchy

We first present the modifications to the cache hierarchy to support DWC. A cache line can be divided into multiple data blocks. In DWC, only the modified data blocks of a cache line are written back to memory. Therefore, DWC needs to distinguish the unmodified data blocks from the modified ones in the cache hierarchy. DWC employs a Modified Block Vector (MBV) to record the modification information; a similar technique is proposed in [16]. When data in the L2 cache is evicted and written back to the LLC, the old data is read from the LLC. Then, an added comparator compares the old data with the new data at data bus granularity. The comparison results are stored in the MBV, which is propagated to the memory controller when the last level cache line is evicted to main memory.
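As a concrete illustration, the following minimal C sketch shows one way the MBV could be computed during the writeback comparison described above. The function name and data layout are our own illustration, not the paper's RTL.

    #include <stdint.h>

    #define WORDS_PER_LINE 8   /* 64B cache line / 64-bit data bus */

    /* Build a Modified Block Vector: compare the old line (read back
     * from the LLC) with the incoming line at data-bus (8B) granularity;
     * bit i is set iff word i changed. */
    static uint8_t build_mbv(const uint64_t old_line[WORDS_PER_LINE],
                             const uint64_t new_line[WORDS_PER_LINE]) {
        uint8_t mbv = 0;
        for (int i = 0; i < WORDS_PER_LINE; i++)
            if (old_line[i] != new_line[i])   /* the added 64-bit comparator */
                mbv |= (uint8_t)(1u << i);
        return mbv;
    }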

Overheads. To identify the modification information of data blocks in the LLC, one needs to read the existing old data from the LLC. The read can be implemented using an atomic cache operation called read-modify-write (RMW) [18], which is used in ECC-protected caches. The read operation does not incur extra latency overhead because it can be overlapped with the tag matching of the write operation [16].

The cache line size is 64B and the data bus width is 64 bits in our system. Thus, an 8-bit MBV is needed to store the modification information for a 64B cache line. A 64-bit comparator is also required to compare the old data with the new data at each bus transmission. We use CACTI 6.0 [23] to evaluate the total area and power overheads of the modified LLC hierarchy at a 32nm technology process. We evaluate the overheads of the comparator and MBV by synthesizing RTL code using a 90nm process library and scaling the results to 32nm. The evaluation shows that our modification to the LLC incurs 2.2% area overhead and 2.8% power overhead, compared to the whole LLC architecture³.

3.2 Memory controller

In the conventional DDRx write scheme, the memory controller sends the data of write commands in burst mode.

³Detailed cache configuration parameters are shown in Section 4.


[Figure 5: (a) The MBVs of W1, W2, and W3: "1" means the corresponding data block of the cache line is modified and "0" means it is not (W1: 01111000, W2: 01111110, W3: 10000111). (b) The timing of DWC consolidating W3 with W1: over BL/2 = 4 cycles the command/address bus carries W1 and W3 together with the column addresses of their modified blocks (m+1 to m+4 for W1; n, n+5, n+6, n+7 for W3) at double data rate, and after WL = 4 cycles the DQ bus transfers the eight modified data blocks in BL = 8 bursts, completing at t8.]

After sending a write command and its starting column address, the memory controller uses BL bursts to send the consecutive data blocks starting from the target column address. Each burst sends one data block at the data bus granularity. Unlike the conventional write scheme, DWC identifies the unmodified data blocks and does not write them back to memory. Thus, the bursts used to send unmodified data can be saved to send the modified data of other write commands; this is the dynamic write consolidation of our proposed scheme. However, dynamic write consolidation is non-trivial. One cannot consolidate any two write commands arbitrarily. An effective write consolidation requires the following three conditions (a code sketch of the resulting eligibility check follows the list).

1. The consolidated write commands access the same row of the same bank. Otherwise, the data in the row buffer would first have to be written back to the memory array after one write command finishes; in that case, the successive write command cannot be issued before an ACTIVATE command activates the corresponding row.

2. There is no read command in the command queue that accesses the same column of the same row ahead of the write commands to be consolidated. This is essential to guarantee read correctness. Otherwise, the read command would get the data of the consolidated write command, introducing application errors.

3. The sum of the modified data blocks of the consolidated write commands is less than or equal to BL. This guarantees that the modified data blocks can be transferred to the memory chip within BL bursts, so the DDRx timing is preserved.
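A minimal C sketch of the resulting eligibility check, covering conditions (1) and (3); the WriteCmd layout is our own illustration, and condition (2) is deferred to the queue-scanning caller shown after the next paragraph.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        unsigned rank, bank, row;
        uint8_t  mbv;            /* Modified Block Vector: one bit per 8B word */
    } WriteCmd;

    /* Number of modified data blocks = popcount of the 8-bit MBV. */
    static int modified_blocks(uint8_t mbv) {
        int n = 0;
        for (; mbv; mbv &= (uint8_t)(mbv - 1)) n++;
        return n;
    }

    /* Conditions (1) and (3): same row of the same bank, and the combined
     * modified blocks fit within BL bursts. Condition (2) -- no intervening
     * read to the same column -- must be checked against the command queue. */
    static bool may_consolidate(const WriteCmd *a, const WriteCmd *b, int bl) {
        return a->rank == b->rank && a->bank == b->bank && a->row == b->row
            && modified_blocks(a->mbv) + modified_blocks(b->mbv) <= bl;
    }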

To find write commands that can be consolidated with the issuing write command, the intuitive approach is to search the command queue sequentially and determine whether each write command can be consolidated. This approach searches to the end of the queue until BL modified data blocks of consolidatable writes are found. Such a search suffers from a large latency overhead, which can unexpectedly degrade system performance.

Instead, DWC employs a two-stage search with low latency overhead. In the first stage, DWC identifies write commands accessing the same row by comparing their addresses. In the second stage, DWC searches for write commands that can be consolidated: for each write command identified in the first stage, DWC adds its number of modified data blocks to that of the issuing write. If the sum is less than or equal to BL, DWC considers the corresponding write command suitable for consolidation. DWC then chooses the oldest write command among all consolidatable writes and consolidates it with the currently issuing write. Although it would be ideal to consolidate as many writes as possible with the issuing write, doing so introduces large latency overhead and circuit complexity, which is undesirable for a cost-effective design. We show in Section 5 that the performance speedup achieved by DWC is close to that of the ideal scheme.
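Building on the eligibility sketch above, the two-stage search might be expressed as follows. Scanning oldest-first and returning the first eligible entry implements the "oldest write wins" choice; has_conflicting_read() is an assumed hook for condition (2), not an interface from the paper.

    #include <stdbool.h>
    /* Reuses WriteCmd and may_consolidate() from the previous sketch. */

    static int find_consolidation_victim(const WriteCmd queue[], int depth,
                                         const WriteCmd *issuing, int bl,
                                         bool (*has_conflicting_read)(int)) {
        for (int i = 0; i < depth; i++) {            /* oldest entry first */
            if (&queue[i] == issuing)
                continue;                            /* skip the issuing write */
            if (!may_consolidate(issuing, &queue[i], bl))
                continue;                            /* conditions (1) and (3) */
            if (has_conflicting_read(i))
                continue;                            /* condition (2) */
            return i;      /* oldest eligible write; stop searching here */
        }
        return -1;         /* nothing found; issue the write unconsolidated */
    }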

Note that DWC can use the First-Ready, First-Come-First-Served (FR-FCFS) [29] memory scheduling policy to find write commands that access the same row. However, DWC imposes more constraints on the scheduling process and aims to perform the possible write consolidations after scheduling.

Double Data Rate Command/Address Bus. In a conventional DDRx memory system, the frequency of the command/address bus is lower than that of the data bus: the memory controller only needs to send the starting column address in one cycle, while the data bursts of one write command need four cycles. However, DWC requires a double data rate command/address bus [20, 34], as the modified data blocks may not occupy contiguous columns. Thus, the column address of each modified data block must be sent to the memory chips at the frequency of the data bus.

Figure 5 shows the timing of DWC when consolidating two write commands. Assume write command W3 can be consolidated with write command W1. Figure 5(a) gives the modification information of W1, W2, and W3. We use the DDR3 protocol, so BL is 8. Figure 5(b) shows the timing of the consolidation of W1 and W3. In the first two cycles, the memory controller sends W1 and the column addresses of its modified data blocks. The remaining two cycles are used to process W3 similarly. The data bus transfers the modified data blocks of W1 and W3 using burst writes as before.
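The per-burst address sequence of Figure 5(b) follows mechanically from the MBVs. The small C sketch below derives it for W1 and W3, keeping the base columns m and n symbolic; the MBV bit encoding and output format are illustrative only.

    #include <stdio.h>
    #include <stdint.h>

    /* Emit one (command, column) pair per modified block: block i of a
     * write with base column b maps to column b+i, as in Figure 5(b). */
    static void emit_addresses(const char *cmd, uint8_t mbv, const char *base) {
        for (int i = 0; i < 8; i++)
            if (mbv & (1u << i)) {
                if (i) printf("%s col %s+%d\n", cmd, base, i);
                else   printf("%s col %s\n", cmd, base);
            }
    }

    int main(void) {
        emit_addresses("W1", 0x1E, "m");  /* MBV 01111000: blocks 1-4   */
        emit_addresses("W3", 0xE1, "n");  /* MBV 10000111: blocks 0,5-7 */
        return 0;   /* prints m+1..m+4, then n, n+5, n+6, n+7 */
    }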

Overheads. We set the command queue depth to 64 for the overhead evaluation. We evaluate the area and power overheads of the two-stage search by implementing it in Verilog HDL and synthesizing the design with Synopsys Design Compiler using a 90nm TSMC technology library. The synthesis results show that the search can be implemented in 2 cycles at 400 MHz, which is much less than the write latency of PCM. We include this latency overhead in our evaluation. The power and area overheads are 0.94 mW and 0.0105 mm², respectively.

3.3 Memory chips

Assuming the PCM chip is organized similarly to a conventional DRAM chip, the column address is held in the column address buffer and then sent to the column decoder. The column decoder selects multiple successive columns to store the data buffered in the data-in buffer.


[Figure 6: Modification to the PCM chip. The dashed frame marks the added column address buffers and column decoders, which operate on the address bus alongside the original ones; the row address buffer, row decoder, row select, sense amp array, and data-in buffer are unchanged.]

To support DWC, we need to modify the peripheral circuitry of the PCM chip. The dashed frame in Figure 6 shows our modification to a simplified memory chip structure. Since DWC requires the memory controller to send column addresses at double data rate, multiple column address buffers are needed to buffer them. Multiple column decoders are also needed to select different columns of the row buffer in parallel. The total number of column address buffers equals BL, as does the number of column decoders, because the memory controller sends BL column addresses to the memory chip with one write command, as Figure 5(b) shows. DWC does not require any modification to the row address buffer or row decoder, because write consolidation targets write commands accessing the same row of the same bank. Thus, DWC only needs to add (BL - 1) column address buffers and column decoders per PCM chip.

Overheads. The modification to the PCM chip does not incur extra latency overhead, because the added decoders work in parallel with the original decoder. We estimate the area and power overheads of the added column address buffers and column decoders with CACTI 6.0 and by synthesizing RTL code. We use a 2Gb DDR3 memory chip in our evaluation; therefore, 7 column address buffers and 7 column decoders are needed per PCM chip. At a 32nm technology process, our modification incurs 0.04% area overhead and 0.47% power overhead compared to an original memory chip.

4. EVALUATION METHODOLOGY

We evaluate DWC using the multiprocessor full-system simulator gem5 [8] and the cycle-accurate memory system simulator DRAMSim2 [30]. We model both the latency and the energy consumption of PCM by enhancing the memory module of DRAMSim2.

4.1 System configurations

We simulate an 8-core out-of-order multiprocessor at 2GHz. Table 1 shows the detailed baseline system configuration. The baseline has a 32MB last level cache to filter frequent accesses to PCM. The memory controller schedules read requests from the read queue (RDQ) with high priority until the write queue (WRQ) is more than 80% full, at which point write requests are issued.

Table 1: Baseline system configurations

Processor:          8 cores, out-of-order, 2GHz
L1 caches:          32KB I-cache, 64KB D-cache, 2-way associative, 64B cache line, 1-cycle latency
L2 caches:          512KB, 8-way associative, 64B cache line, 6-cycle latency, write-back
LLC:                32MB, 16-way associative, 64B cache line, 30-cycle latency, write-back, LRU replacement policy
Memory controller:  64-entry RDQ and WRQ, 64-entry command queues per bank, channel/rank/bank/row/col address mapping, FCFS, read-priority scheduling (unless the WRQ is more than 80% full)
Main memory:        64-bit-wide DDR3-800, 8 x8 2Gb PCM chips, 2 ranks per channel, 8 banks per rank, 32768 rows/bank, 1024 columns/row; tRCD: 55ns, tRP: 150ns, tCL: 12.5ns, tWR: 15ns; array read: 2.47pJ/bit, array write: 16.82pJ/bit, buffer read: 0.93pJ/bit, buffer write: 1.02pJ/bit
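As a back-of-the-envelope illustration of why skipping unmodified blocks saves energy, the following C sketch applies Table 1's array write energy to a full 64B line versus only four modified 8B blocks. The per-bit figure is from Table 1; the calculation itself is ours and ignores buffer and background energy.

    #include <stdio.h>

    int main(void) {
        const double array_write_pj_per_bit = 16.82;  /* Table 1 */
        const int bits_per_block = 64;                /* one 8B data-bus word */

        double full_line   = 8 * bits_per_block * array_write_pj_per_bit;
        double four_blocks = 4 * bits_per_block * array_write_pj_per_bit;

        printf("full 64B line:     %.1f pJ\n", full_line);    /* 8611.8 pJ */
        printf("4 modified blocks: %.1f pJ\n", four_blocks);  /* 4305.9 pJ */
        return 0;
    }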

Table 2: Workload characteristics

Workload       Description                               MPKI    R/W ratio
gcc_m          8 copies of gcc                           1.92    1.44
lbm_m          8 copies of lbm                           23.34   1.41
libquantum_m   8 copies of libquantum                    2.17    2.89
mcf_m          8 copies of mcf                           17.76   2.92
omnetpp_m      8 copies of omnetpp                       1.05    1.12
linkedlist_m   8 copies of linkedlist                    15.67   1.08
mix_1          2bzip2-2cactusADM-2lbm-2libquantum        12.13   1.62
mix_2          2bzip2-2hmmer-2lbm-2sjeng                 11.07   4.48
mix_3          2gcc-2libquantum-2mcf-2omnetpp            5.69    1.78
mix_4          2gcc-2leslie3d-2wrf-2omnetpp              13.08   2.15
mix_5          2astar-2libquantum-2linkedlist-2omnetpp   10.70   5.32
mix_6          2astar-2lbm-2linkedlist-2mcf              13.52   2.72

The timing and energy parameters of PCM are derived from the widely used configuration in [14].
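A minimal sketch of the baseline scheduling decision described above (reads prioritized until the WRQ passes 80% occupancy); the queue representation and the exact threshold comparison are illustrative, not DRAMSim2's internals.

    #include <stdbool.h>

    #define WRQ_DEPTH 64   /* matches the 64-entry WRQ in Table 1 */

    typedef struct { int count; } Queue;

    /* Reads are served first; writes are drained once the WRQ exceeds
     * 80% occupancy, or opportunistically when no reads are pending. */
    static bool should_issue_write(const Queue *rdq, const Queue *wrq) {
        if (wrq->count * 100 > WRQ_DEPTH * 80)
            return true;                             /* write-drain mode */
        return rdq->count == 0 && wrq->count > 0;    /* otherwise reads first */
    }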

4.2 Workloads

We choose a set of workloads from the SPEC CPU2006 benchmark suite [2] and the micro-benchmark Linked List [1]. Since multiple workloads usually run on the same physical server simultaneously, we also provide 6 mixed workloads to evaluate the efficiency of DWC under such memory access patterns. We execute the applications in multiprogrammed mode on gem5. For each workload, we execute 10 million memory access instructions after warming up the last level cache.

Table 2 summarizes the characteristics of these workloads, including the LLC misses per kilo-instruction (MPKI) and the read-to-write ratio (R/W ratio). Note that the write counts exclude writebacks of entirely unmodified cache lines. These workloads have varying MPKIs, ranging from 1.05 to 23.34, covering both less intensive and intensive workloads.


[Figure 7: Execution time speedup of DWC over the baseline, per workload.]

[Figure 8: Consolidation ratio, per workload.]

5. EVALUATION RESULTS

In this section, we first evaluate the performance, effective latency, and energy consumption of DWC, using the silent store scheme as the baseline system. Second, we compare DWC with the ideal scheme to evaluate the tradeoff between improvement and overhead. We also compare DWC with the FR-FCFS memory scheduling policy. Then, we measure the command/address bus utilization to evaluate the impact of DWC on bank-access parallelism. Since memory system behavior varies under different system configurations, such as queue depth, page size, LLC size, and LLC replacement policy, we finally apply a sensitivity analysis to DWC to show its efficiency under varying configurations.

5.1 Performance improvement

We first evaluate the performance improvement of DWC. Figure 7 shows the execution time speedup⁴ of DWC against the baseline system. DWC improves performance by 17.9% on average; in particular, it speeds up the workload linkedlist_m by up to 35.7%. The performance improvement of DWC mainly results from two aspects. First, DWC reduces writes to the PCM array and data transfer time through write consolidation. Second, DWC reduces the additional reads to the row buffer before writes. We further record the actually consolidated writes and the total writes for each workload, calculate the consolidation ratios, and plot the results in Figure 8. The consolidation ratio of linkedlist_m is 26.1%, which explains its high performance improvement.

One factor that affects the performance improvement of DWC is memory access intensity. For instance, lbm_m and libquantum_m have similar consolidation ratios, but the memory accesses of lbm_m are more intensive, which allows the overall execution time of lbm_m to benefit more from write consolidation. The R/W ratio also has a significant impact. For instance, the consolidation ratio of mix_5 is larger than that of linkedlist_m. Nevertheless, the R/W ratio of linkedlist_m is 1.08, much lower than the 5.32 of mix_5; a low R/W ratio indicates more write commands that can potentially be consolidated. As a result, DWC achieves a larger performance speedup for linkedlist_m than for mix_5. The consolidation ratios of gcc_m and mix_1 are below 10%, which explains their small performance speedups.

The major factor that affects the consolidation ratio is the unmodified data proportion. The low unmodified data proportions of gcc_m and mix_1 (as shown in Figure 3) result in their low consolidation ratios. We observe that the average unmodified data proportion of omnetpp_m is lower than that of gcc_m, while the consolidation ratio of omnetpp_m is higher. This is probably due to a nonuniform distribution of the unmodified data proportion across writes; we leave this exploration for future work.

⁴Speedup = ExecTime_Baseline / ExecTime_DWC.
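For reference, the two metrics used in this subsection, written out as a trivial C sketch (the names are ours):

    /* Speedup as defined in footnote 4, and the consolidation ratio:
     * the fraction of all writes that were consolidated into another write. */
    static double speedup(double exec_time_baseline, double exec_time_dwc) {
        return exec_time_baseline / exec_time_dwc;
    }

    static double consolidation_ratio(unsigned long consolidated_writes,
                                      unsigned long total_writes) {
        return (double)consolidated_writes / (double)total_writes;
    }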

[Figure 9: Effective read and write latency of DWC, normalized to the baseline (Baseline-Read, DWC-Read, Baseline-Write, DWC-Write), per workload.]


5.2 Effective latency reduction

In addition to reducing application execution time, DWC also reduces the effective read and write latency. Figure 9 shows the effective latency reduction achieved by DWC for all workloads, normalized to the baseline system. DWC reduces the effective read and write latency by 12.8% and 16.0% on average, respectively, and by up to 23.2% and 27.7% for linkedlist_m. DWC reduces the command queuing time of consolidated writes, which in turn reduces the queuing time of read commands and unconsolidated write commands. The reduced queuing time in the command queue further reduces the number of stalled requests, and thus the queuing time in the transaction queue. As a result, the effective latency observed by the processor is reduced. Moreover, DWC writes the PCM array only once for a set of consolidated write commands instead of once per write command, which also contributes to the reduction of effective write latency.

5.3 Energy saving

DWC also reduces memory system energy consumption by consolidating multiple writes. Since PCM is projected to have write energy similar to DRAM's, we do not use a differential write scheme. Figure 10 shows the memory energy consumption of DWC normalized to the baseline. DWC saves up to 35.3% of the energy consumption for the workload libquantum_m, and 13.9% on average across all workloads. Since DWC avoids unnecessary writes to PCM by not sending unmodified data, it effectively reduces the write energy consumption of PCM. Furthermore, DWC consolidates multiple writes and writes their data into the PCM array together. To serve consolidated writes, PCM needs to read the addressed data into the row buffer only once, avoiding the multiple row-buffer reads that the conventional write scheme performs to serve multiple writes.


[Figure 10: Memory energy consumption of DWC normalized to the baseline, per workload.]

[Figure 11: Execution time speedup of DWC and the Ideal scheme over the baseline, per workload.]

[Figure 12: ED² of DWC normalized to FR-FCFS, per workload.]

[Figure 13: Command/address bus utilization of the conventional memory system (baseline) and DWC, per workload.]


5.4 Comparison to the Ideal

So far, we have shown that DWC effectively reduces application execution time, effective latency, and energy consumption. To balance the tradeoff between performance improvement and implementation overhead, DWC stops searching the command queue once it finds the oldest write that can be consolidated. The ideal scheme, by contrast, consolidates as many writes as possible within BL bursts. We implement the ideal scheme in our simulator and compare it with DWC. Figure 11 shows the performance speedups of DWC and Ideal normalized to the baseline. Ideal achieves less than 5% additional improvement over DWC on average. To consolidate as many writes as possible, the ideal scheme must in most cases search the command queue to the end and determine whether any write can be consolidated. This process incurs a large search overhead, which limits the improvement, and requires complex circuitry. Thus, we argue that DWC is much more practical than the ideal scheme, with a reasonable tradeoff between performance improvement and implementation overhead.

5.5 Comparison to FR-FCFS

DWC can be implemented on top of the FR-FCFS memory scheduling policy, so we also compare DWC with FR-FCFS. Figure 12 shows the energy-delay-squared product (ED²) of DWC normalized to FR-FCFS for all workloads. Compared to FR-FCFS, DWC reduces ED² by 15.1% on average; for workload mix_5, DWC achieves an ED² reduction of as much as 36.8%. Unlike FR-FCFS, DWC consolidates multiple writes targeting the same row in addition to reordering them, and fully exploits the burst slots of unmodified data to send consolidated writes instead. Thus, DWC reduces the average transfer time per write command compared to FR-FCFS. To avoid bank access starvation, FR-FCFS schedules at most 4 commands before switching to other bank accesses. DWC, however, consolidates multiple writes into one write command at every FR-FCFS scheduling decision, which in turn allows more commands to be scheduled before switching to other bank accesses. This also enables DWC to outperform FR-FCFS in terms of energy and delay.
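The metric in Figure 12 is the standard energy-delay-squared product; a trivial C sketch of the normalization (the names are ours):

    /* ED^2 = energy x delay^2; values below 100% favor DWC over FR-FCFS. */
    static double ed2(double energy, double delay) {
        return energy * delay * delay;
    }

    static double normalized_ed2_percent(double ed2_dwc, double ed2_frfcfs) {
        return 100.0 * ed2_dwc / ed2_frfcfs;
    }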

5.6 Command/address bus utilization

In conventional DDRx memory systems, the command/address bus can be used to transfer the ACTIVATE commands of other banks immediately after a write command. In DWC, however, the command/address bus is used to send additional column addresses for BL/2 cycles after a write command, and other commands cannot be transferred during this period. Thus, we finally evaluate the impact of DWC on bank-access parallelism by measuring the command/address bus utilization of DWC and of the conventional memory system.

As shown in Figure 13, the average command/address bus utilization of the conventional system is only 21.3%, so the bus is able to carry more commands and addresses without affecting bank parallelism. Even with DWC applied, the average command/address bus utilization only increases to 44.0%, which means the bus still has idle cycles. Therefore, DWC can improve application performance without degrading bank parallelism.


[Figure 14: Impact of command queue depth (8 to 256 entries) on the average execution time speedup (solid line) and consolidation ratio (dashed line).]

[Figure 15: Impact of PCM page size (2KB to 32KB) on the average execution time speedup (solid line) and consolidation ratio (dashed line).]


5.7 Sensitivity analysis

The behavior of a memory system usually varies under different configurations. To evaluate the impact of different system configurations on the efficiency of DWC, we conduct a sensitivity analysis in this section. We cover two performance-critical factors, queue depth and page size. In addition, the LLC size and LLC replacement policy affect the memory access behavior seen by the PCM main memory, so we evaluate their impact on the efficiency of DWC as well.

5.7.1 Impact of queue depth

Figure 14 shows the average execution time speedup over all workloads as the command queue depth varies from 8 to 256. The solid line shows the execution time speedup, and the dashed line shows the consolidation ratio. As the command queue depth increases, so does the speedup, because a deeper command queue holds more writes and correspondingly increases the chance of finding writes that can be consolidated.

However, the speedup does not increase linearly with the command queue depth. The performance improvement is significant as the queue depth increases from 8 to 64, but slows down beyond 64 because the latency overhead of finding consolidatable write commands grows significantly, which offsets the benefit of DWC. More importantly, a large queue suffers from large area and energy overheads. Therefore, DWC uses a 64-entry command queue.

5.7.2 Impact of page size

Scalability is one important advantage of PCM over DRAM, so we can expect the capacity of PCM chips to increase with the development of materials and manufacturing technology. The increased capacity in turn affects the row and column organization of PCM chips. We therefore evaluate the efficiency of DWC as the page size of the PCM chip increases. Figure 15 shows the average speedup over all workloads as the PCM page size increases from 2KB to 32KB, measuring both the execution time speedup and the consolidation ratio. The speedups increase almost linearly with page size, because a larger page yields a higher probability that multiple writes access the same row. Thus, we consider DWC a desirable write scheme for future PCM main memories with large pages.

[Figure 16: Impact of LLC size (8MB to 64MB) on execution time speedup, per workload.]


5.7.3 Impact of LLC size

In this section, we evaluate the impact of LLC size on the efficiency of DWC. Figure 16 shows the execution time speedup as the LLC size varies from 8MB to 64MB, with the associativity kept constant at 16 ways. The speedup decreases with increasing LLC size for some workloads, such as gcc_m, libquantum_m, and mix_3, because a larger LLC results in fewer evictions to PCM and fewer unmodified blocks in evicted cache lines. For other workloads, such as linkedlist_m and mix_2, varying the LLC size barely affects the speedup. An exception is omnetpp_m, whose speedup increases with LLC size: many evicted cache lines are clean when the LLC is small, so the R/W ratio is high, which results in a low speedup. Overall, DWC still achieves an 11.8% speedup over the baseline with a 64MB LLC, so it remains effective with large LLCs.

5.7.4 Impact of LLC replacement policy

So far, we have shown the efficiency of DWC using the common LRU replacement policy for the LLC. Recently, a few works have proposed optimized LLC replacement policies that reduce writes to PCM memory, such as CLP-N [11]. We therefore evaluate the impact of the LLC replacement policy on the efficiency of DWC. CLP-N evicts the oldest clean cache line among the N least recently used cache lines; if no such clean line is found, the LRU cache line is evicted. Among the different configurations of N, we use N = 7, which is reported to perform best in [11], and N = 11, which further extends the selection range.


[Figure 17: Impact of the LLC replacement policy (LRU, CLP-7, CLP-11) on execution time speedup with a 32MB LLC, per workload.]


Figure 17 shows the execution time speedups under the different LLC replacement policies with a 32MB LLC. The speedups of all workloads are almost identical across the LRU, CLP-7, and CLP-11 policies. This is because all cache lines of an LLC set are dirty most of the time, so there is little opportunity to find a clean cache line when the LRU line is dirty. Thus, the CLP-N replacement policy barely affects the efficiency of DWC.

6. RELATED WORK

6.1 PCM main memory

The long write latency of PCM is a major concern when adopting PCM in main memory. Several works exploit PCM's asymmetric latency between writing bit one (SET) and writing bit zero (RESET) to improve PCM performance. By proactively SETting all the bits in a given memory line as soon as the cache line becomes dirty, PreSET [24] allows the actual writes to take less time. Similarly, two-stage-write [35] separates writing zeros from writing ones by dividing a write into two stages: in the zero stage, all zeros are written at an accelerated speed; in the one stage, all ones are written with increased parallelism without violating the power constraint. Thus, two-stage-write effectively reduces the write latency to the PCM array. Since PCM requires multiple writes to flush a row buffer back to the array, partial writes [14] write only the modified data in the row buffer to the PCM array to reduce write latency. Flip-N-Write [10] guarantees that each flush writes at most half of the row buffer by flipping data. Similarly, DWC reduces the average write latency to the memory array; moreover, instead of optimizing the write latency from the row buffer to the PCM array, DWC mainly reduces queuing time and transfer time. To reduce the impact of long write latency on read requests, write cancellation and write pausing [26] serve read requests preemptively: an ongoing write operation is cancelled or paused when a newly arriving read request accesses the same bank. Similarly, DWC also reduces read latency, by reducing the queuing time of read commands through write consolidation. A few research efforts focus on write latency optimization for MLC PCM, which employs an iterative write scheme [7] and thus incurs even longer write latency. Morphable memory system [25], Mercury [13], and write truncation [12] are proposed to reduce the write latency of MLC PCM. These techniques can be complementary to DWC when DWC is applied to MLC PCM-based main memory.

Another related topic is hiding the long write latency of PCM. Hybrid memory systems consisting of DRAM and PCM reduce writes to PCM by placing frequently written data in DRAM [14, 15, 27]. OptiPCM [21] uses photonic links to connect a large number of PCM chips, improving the bandwidth and performance of PCM memory. This paper, however, focuses on consolidation techniques for PCM main memory.

6.2 Memory access scheduling

Memory scheduling policies, such as FR-FCFS [29] and burst scheduling [32], are closely related to our work. FR-FCFS prioritizes memory commands that access the same row. With burst scheduling, memory accesses to the same row of the same bank are clustered into bursts to maximize bus utilization, and read requests are allowed to preempt ongoing writes to reduce read latency. There are two differences between DWC and these techniques. First, FR-FCFS and burst scheduling do not identify unmodified cache line data, while DWC identifies such data, does not send it, and exploits the saved bus time to send consolidated writes. Second, DWC consolidates multiple writes into one write after reordering the write commands, whereas FR-FCFS and burst scheduling perform no consolidation.

6.3 Memory access granularity

The access granularity is the product of the burst length and the data width of a channel, and usually equals the size of a cache line. A few works improve memory system performance by changing the memory access granularity. To be compatible with DDR2, DDR3 supports the BC4 (Burst Chop 4) mode [3], in which only half of a cache line is transferred in four bursts and the other half is masked. Fine-grained access (FG) for read and write requests has been proposed using the idea of sub-ranked memory, as in Convey's S/G DIMM [9], HP's MC-DIMM [5, 6], and Mini-Rank [36]. FG is based on the observation that only a fraction of the data transferred between the cache and memory is actually used, and it utilizes the remaining bandwidth to serve other requests in parallel. BC4, FG, and DWC all change the number of data blocks transferred in one command. However, DWC differs from them in two aspects. First, BC4 and FG do not identify whether cache line data is modified when it is written back to memory, while DWC writes only the modified data. Second, BC4 and FG change the memory access granularity of one command, while DWC preserves the access granularity by consolidating multiple writes within BL bursts.

7. CONCLUSION

With the recent development of emerging non-volatile memories, PCM is promising to become an alternative main memory. However, its long write latency raises a challenging issue for the adoption of PCM in main memory. In this paper, we propose dynamic write consolidation, namely DWC, to improve PCM memory system performance while reducing its energy consumption. DWC identifies unmodified data blocks in a cache line and avoids sending them. By exploiting the burst slots of the unmodified data, DWC consolidates multiple writes into one write command. By doing so, DWC can process multiple writes within BL bursts, which effectively reduces queuing time for both read and write requests. Moreover, DWC reduces the average data transfer time as well as the average write latency from the row buffer to the memory array. We conduct an extensive evaluation of DWC to show its efficiency, including under varying memory system configurations such as queue depth, page size, LLC size, and LLC replacement policy. Although DWC requires changes to the cache hierarchy, the memory controller, and the memory chips, we show that it incurs low implementation overhead in terms of area and power.

8. ACKNOWLEDGMENTS

We are grateful to the anonymous reviewers for their valuable comments. This work is supported in part by the National Basic Research Program of China under grant No. 2011CB302502, the National Science Foundation of China under grants No. 61379042, 61221062, and 61202063, Huawei Research Program YB2013090048, and the Strategic Priority Research Program of the Chinese Academy of Sciences under grant No. XDA06010401.

9. REFERENCES

[1] Linked list traversal micro-benchmark. http://www.cs.illinois.edu/homes/zilles/llubenchmark.html.
[2] Standard Performance Evaluation Corporation. SPEC CPU 2006. http://www.spec.org/cpu2006/.
[3] DDR3 SDRAM Standard JESD79-3F, 2010. http://www.jedec.org/standards-documents/docs/jesd-79-3d.
[4] PIDS, ITRS, 2012. http://www.itrs.net/Links/2012ITRS/Home2012.htm.
[5] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber. Future scaling of processor-memory interfaces. In SC, 2009.
[6] J.-H. Ahn, J. Leverich, R. Schreiber, and N. Jouppi. Multicore DIMM: an energy efficient memory module with independently controlled DRAMs. Computer Architecture Letters, 2009.
[7] F. Bedeschi, R. Fackenthal, C. Resta, E. M. Donze, M. Jagasivamani, E. C. Buda, F. Pellizzer, D. W. Chow, A. Cabrini, G. Calvi, et al. A bipolar-selected phase change memory featuring multi-level cell storage. IEEE Journal of Solid-State Circuits, 2009.
[8] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 2011.
[9] T. Brewer. Instruction set innovations for the Convey HC-1 computer. IEEE Micro, 2010.
[10] S. Cho and H. Lee. Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance. In MICRO, 2009.
[11] A. Ferreira, M. Zhou, S. Bock, B. Childers, R. Melhem, and D. Mosse. Increasing PCM main memory lifetime. In DATE, 2010.
[12] L. Jiang, B. Zhao, Y. Zhang, J. Yang, and B. Childers. Improving write operations in MLC phase change memory. In HPCA, 2012.
[13] M. Joshi, W. Zhang, and T. Li. Mercury: A fast and energy-efficient multi-level cell based phase change memory system. In HPCA, 2011.
[14] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting phase change memory as a scalable DRAM alternative. In ISCA, 2009.
[15] H. G. Lee, S. Baek, C. Nicopoulos, and J. Kim. An energy- and performance-aware DRAM cache architecture for hybrid DRAM/PCM main memory systems. In ICCD, 2011.
[16] Y. Lee, S. Kim, S. Hong, and J. Lee. Skinflint DRAM system: Minimizing DRAM chip writes for low power. In HPCA, 2013.
[17] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller. Energy management for commercial servers. Computer, 2003.
[18] K. Lepak and M. Lipasti. Silent stores for free. In MICRO, 2000.
[19] K. M. Lepak and M. H. Lipasti. On the value locality of store instructions. In ISCA, 2000.
[20] S. Li, D. H. Yoon, K. Chen, J. Zhao, J. H. Ahn, J. B. Brockman, Y. Xie, and N. P. Jouppi. MAGE: adaptive granularity and ECC for resilient and power efficient memory systems. In SC, 2012.
[21] Z. Li, R. Zhou, and T. Li. Exploring high-performance and energy proportional interface for phase change memory systems. In HPCA, 2013.
[22] K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt. Understanding and designing new server architectures for emerging warehouse-computing environments. In ISCA, 2008.
[23] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. CACTI 6.0: A tool to model large caches. Technical Report HPL-2009-85, HP Laboratories, 2009.
[24] M. K. Qureshi, M. M. Franceschini, A. Jagmohan, and L. A. Lastras. PreSET: Improving performance of phase change memories by exploiting asymmetry in write times. In ISCA, 2012.
[25] M. K. Qureshi, M. M. Franceschini, L. A. Lastras-Montano, and J. P. Karidis. Morphable memory system: a robust architecture for exploiting multi-level phase change memories. In ISCA, 2010.
[26] M. K. Qureshi, M. M. Franceschini, and L. A. Lastras-Montano. Improving read performance of phase change memories via write cancellation and write pausing. In HPCA, 2010.
[27] L. E. Ramos, E. Gorbatov, and R. Bianchini. Page placement in hybrid memory systems. In ICS, 2011.
[28] J. Rice. Micron announces availability of phase change memory for mobile devices, 2012. http://investors.micron.com/releasedetail.cfm?ReleaseID=692563.
[29] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In ISCA, 2000.
[30] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A cycle accurate memory system simulator. Computer Architecture Letters, 2011.
[31] S. Sardashti and D. A. Wood. UniFI: Leveraging non-volatile memories for a unified fault tolerance and idle power management technique. In ICS, 2012.
[32] J. Shao and B. Davis. A burst scheduling access reordering mechanism. In HPCA, 2007.
[33] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis, and N. P. Jouppi. Rethinking DRAM design and organization for energy-constrained multi-cores. In ISCA, 2010.
[34] D. H. Yoon, M. K. Jeong, and M. Erez. Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput. In ISCA, 2011.
[35] J. Yue and Y. Zhu. Accelerating write by exploiting PCM asymmetries. In HPCA, 2013.
[36] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu. Mini-rank: Adaptive DRAM architecture for improving memory power efficiency. In MICRO, 2008.
[37] P. Zhou, B. Zhao, J. Yang, and Y. Zhang. A durable and energy efficient main memory using phase change memory technology. In ISCA, 2009.
