Transcript of "Big Memories," Bruce Jacob, University of Maryland (ISC 2012). Source: ece.umd.edu/~blj/talks/ISC-2012.pdf

Slide 1: Big Memories
Prof. Dr. Bruce Jacob, University of Maryland

OUTLINE
• The Capacity Problem
• Solution I: BOB Memory Systems
• Solution II: Hybrid Memory Cube
• Solution III: Non-volatile Main Memories

Slide 2: The Capacity Problem

[Image: two DDR2-400 DIMMs => four DDR2-400 DIMMs. Source: Steve Woo, "DRAM and Memory System Trends," October 2004.]

Slide 3: The Capacity Problem (…but wait, there's more)

[Figure: "Release of Increasing DIMM Capacities": DIMM capacity (GB) vs. release year, 2000 to 2012, growing from 256 MB through 1 GB, 4 GB, and 8 GB to 16 GB.]

Slide 4: Attempts at a Solution

• Highly engineered DIMMs (can cost $1000+ per DIMM)
• Fully-Buffered DIMM (pushes the power envelope)

[Diagram: "Problem: Capacity." Memory controllers (MC) with JEDEC DDRx: ~10 W/DIMM, ~20 W total, versus FB-DIMM: ~10 W/DIMM, ~300 W total.]
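The slide's power totals follow from simple per-DIMM arithmetic. The sketch below uses the slide's ~10 W/DIMM figure for both technologies; the DIMM counts are illustrative assumptions (DDRx controllers reach only a few DIMMs per channel, while FB-DIMM's daisy-chained channels can host many more, which is exactly what blows up the total):

```python
# Back-of-the-envelope memory-system power, using the slide's ~10 W/DIMM
# figure. DIMM counts are illustrative assumptions, not measured systems.

WATTS_PER_DIMM = 10

def system_power(num_dimms, watts_per_dimm=WATTS_PER_DIMM):
    """Aggregate DIMM power for a memory system."""
    return num_dimms * watts_per_dimm

ddrx_total = system_power(2)     # a couple of DIMMs reachable per DDRx controller
fbdimm_total = system_power(30)  # dozens of FB-DIMMs across daisy-chained channels

print(ddrx_total, fbdimm_total)  # 20 300 -- matches the slide's ~20 W vs ~300 W
```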

Slide 5: Observations

• Cannot increase power significantly (e.g., to CPU scale)
• Cannot sacrifice aggregate bandwidth
• Need to approach commodity pricing
• A future-proof design would be highly desirable

Slide 6: Solution I: BOB (Buffer On (mother-)Board)

Examples: AMD G3MX, Intel SMI/SMB, IBM Power 795.

Slides 7-8: Solution I: BOB (Buffer On (mother-)Board), continued.

Slide 9: Solution II: Micron HMC

Slide 10: Solution II: Micron HMC (a single-chip BOB system)

Slides 11-13: Solution II: Micron HMC, continued.

Slide 14: Solution III: Non-Volatiles

Obvious Conclusions II
• Flash/NV is inexpensive, fast (relative to disk), and has a better capacity roadmap than DRAM
• Make it a first-class citizen in the memory hierarchy
• Access it via a load/store interface; use DRAM to buffer writes; manage it in software
• Probably reduces capacity pressure on the DRAM system

Can have TB-scale DIMMs today.
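The "use DRAM to buffer writes" point can be sketched as a small write-back buffer sitting in front of a load/store-addressable non-volatile store. Everything below (the class, the flush threshold, the dict-based stores) is a hypothetical simplification for illustration, not the talk's actual design:

```python
# Minimal sketch of a DRAM write buffer in front of NV main memory:
# stores are absorbed in fast DRAM and written back to slow flash/NV
# in batches. FLUSH_THRESHOLD and the flush policy are assumptions.

FLUSH_THRESHOLD = 4  # flush once this many buffered writes accumulate

class HybridMemory:
    def __init__(self):
        self.nv = {}           # large, slow non-volatile backing store
        self.dram_buffer = {}  # small, fast write buffer

    def store(self, addr, value):
        self.dram_buffer[addr] = value           # absorb the write in DRAM
        if len(self.dram_buffer) >= FLUSH_THRESHOLD:
            self.nv.update(self.dram_buffer)     # batch write-back to NV
            self.dram_buffer.clear()

    def load(self, addr):
        # The newest copy of a line may still be in the DRAM buffer.
        if addr in self.dram_buffer:
            return self.dram_buffer[addr]
        return self.nv[addr]

mem = HybridMemory()
for a in range(5):
    mem.store(a, a * 10)
print(mem.load(3), mem.load(4))  # 30 40
```

Batching writes this way hides flash's slow program latency from the CPU's store path, which is the point of the DRAM buffer in the hybrid design.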

Slide 15: Solution III: Non-Volatiles

…156 MB/s. However, this was not considered a major drawback, as transfer times to such cards were less important than their capacity. Until recently, NAND flash chips utilized a 40 MHz asynchronous 8-bit interface that was capable of 40 MB/s. This was also acceptable for some time, as the access latency of flash was still faster than that of other external storage media of the time, and this was not the bottleneck in the applications that utilized it. However, as flash has taken on a new role with the introduction of SSDs, its transfer times have begun to matter.

One major problem with flash devices was that each manufacturer had their own interface standard. This made designing SSD hardware difficult and expensive, as it had to be tailored to a specific manufacturer's standard. To foster easier integration of flash devices and drive SSD adoption, the NAND flash industry developed the ONFi 1.0 standard [3].

Another observation about flash devices is that the array of flash cells within the chip is actually capable of producing data at a rate of 330 MB/s without any modifications [12]. Realizing that the asynchronous interface was the primary bottleneck in flash performance, manufacturers have developed synchronous standards such as ONFi 2.1 and Toggle Mode DDR. These new standards enable much faster transfers by running at higher frequencies than was possible with an asynchronous approach. As a result, newer flash chips are capable of bandwidths of up to 200 MB/s. Furthermore, a newer standard, ONFi 3.0, has recently been defined, which will allow for bandwidths of up to 400 MB/s. Therefore, the full bandwidth potential of the flash array will soon be utilized to provide faster data transfers and improve overall performance when accessing flash. As a result of this additional bandwidth, the host interface and software likely need to evolve in order to fully expose the improved performance of the flash devices.
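To see what those interface generations mean in practice, the sketch below computes the bus transfer time of a single flash page at each bandwidth from the text. The 4 KiB page size and decimal megabytes (MB = 10^6 bytes) are illustrative assumptions; real page sizes vary by device:

```python
# Transfer time of one 4 KiB flash page over successive NAND interface
# generations. Bandwidths are from the text; the page size is assumed.

PAGE_BYTES = 4096

interfaces_mb_s = {
    "async 8-bit (40 MHz)": 40,
    "ONFi 2.1 / Toggle Mode DDR": 200,
    "ONFi 3.0": 400,
}

def transfer_us(num_bytes, mb_per_s):
    """Bus transfer time in microseconds (MB = 10**6 bytes)."""
    return num_bytes / (mb_per_s * 1e6) * 1e6

for name, bw in interfaces_mb_s.items():
    print(f"{name}: {transfer_us(PAGE_BYTES, bw):.2f} us per 4 KiB page")
```

At 40 MB/s the bus alone costs over 100 µs per page, which is comparable to the flash array's read latency; at 400 MB/s it drops by an order of magnitude, which is why the interface stopped being acceptable once SSDs made transfer time visible.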

3. Hybrid Main Memory Overview

3.1. Current State of the Art - SSD Design

A block diagram of a typical flash-based solid state drive is shown in Figure 1. The system consists of three main components: a host interface, an SSD controller, and a set of NAND flash devices. The host interface is typically SATA, although recently PCIe interfaces have become available for enterprise applications. The SSD controller is the core of the system; it creates the abstractions necessary for turning raw NAND flash devices into a useful storage system. It performs tasks such as memory mapping, garbage collection, wear leveling, error correction, and access scheduling. The SSD controller also typically has a small amount of memory, either SRAM or DRAM, to cache metadata and buffer writes [6].
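The "memory mapping" task in that list can be illustrated with a toy page-level translation table. Real controllers combine this mapping with wear leveling, garbage collection, and error correction; everything below is a hypothetical simplification, not any vendor's actual FTL:

```python
# Toy flash translation layer (FTL): logical pages map to physical
# pages, and every write goes to a fresh physical page (out-of-place
# update), marking the old copy stale for later garbage collection.

class ToyFTL:
    def __init__(self, num_physical_pages):
        self.free = list(range(num_physical_pages))  # free physical pages
        self.map = {}       # logical page -> physical page
        self.stale = set()  # physical pages awaiting block erase (GC)

    def write(self, logical_page):
        phys = self.free.pop(0)                      # always write to a fresh page
        if logical_page in self.map:
            self.stale.add(self.map[logical_page])   # old copy is now stale
        self.map[logical_page] = phys
        return phys

    def read(self, logical_page):
        return self.map[logical_page]

ftl = ToyFTL(num_physical_pages=8)
ftl.write(0)                    # logical 0 -> physical 0
ftl.write(0)                    # overwrite: logical 0 -> physical 1
print(ftl.read(0), ftl.stale)   # 1 {0}
```

Out-of-place updates are forced by NAND's erase-before-write constraint: a page cannot be rewritten until its whole block is erased, so the controller redirects writes and reclaims stale pages in the background.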

[Figure 1: System design for SSD (top) and hybrid memory (bottom). Top: a Core i7 CPU (four x86 cores with a shared last-level cache) connects over PCIe lanes to a PCIe solid state drive (PCIe controller, SSD controller, DRAM, and ONFi channels to NAND devices), with DRAM DIMMs attached to the memory controller via DDR3 channels. Bottom: the same CPU, with a hybrid memory controller driving DRAM DIMMs over DDR3 channels and, over a buffer channel, an NV DIMM containing an NV controller and ONFi channels to NAND devices.]

The NAND flash devices are where the data is stored on the drive. SSDs leverage multiple devices to achieve high throughput; these are typically organized into parallel channels with one or more devices per channel. Internally, the NAND devices are organized into planes, blocks, and pages. Planes are functionally independent units that allow for concurrent operations on the device. Each plane has a set of registers that allow for interleaved accesses. Blocks form the physical granularity at which erase operations occur. Finally, each block consists of multiple pages, which are the physical granularity at which read and write operations occur.
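The plane/block/page hierarchy amounts to an address decomposition. The geometry constants below are assumed for illustration (real devices vary widely):

```python
# Decompose a flat physical page number into (plane, block, page)
# coordinates for a NAND device. Geometry is an assumed example:
# 2 planes, 1024 blocks per plane, 128 pages per block.

PLANES = 2
BLOCKS_PER_PLANE = 1024
PAGES_PER_BLOCK = 128

def decompose(page_number):
    page = page_number % PAGES_PER_BLOCK
    block = (page_number // PAGES_PER_BLOCK) % BLOCKS_PER_PLANE
    plane = page_number // (PAGES_PER_BLOCK * BLOCKS_PER_PLANE)
    return plane, block, page

# Reads/writes address single pages; erases operate on whole blocks.
print(decompose(0))       # (0, 0, 0)
print(decompose(131200))  # (1, 1, 0): second plane, second block
```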

In terms of computer system performance, the delay for an operation to a solid state drive starts when the user application issues a request for some data that triggers a page fault, and ends when the operating system returns control to the user application after the request has completed. At the hardware level, the SSD controller receives an access for a particular address, and later the controller raises an interrupt request (IRQ) on the CPU to tell the operating system the data is ready. A typical access to an SSD is shown in Figure 2. The time from point B to point C is the amount of time needed for the disk to process the request. The time from point A to point D is the total amount of time spent waiting for the request, from the perspective of the application that made the request.
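The A-to-D versus B-to-C intervals can be made concrete with a toy breakdown. All timestamps below are invented for illustration; only the meaning of points A through D comes from the text:

```python
# Hypothetical breakdown of one SSD access, following the text:
# A = application issues the request, B = request reaches the drive,
# C = drive finishes processing, D = control returns to the application.
timestamps_us = {"A": 0.0, "B": 9.0, "C": 84.0, "D": 100.0}

device_time = timestamps_us["C"] - timestamps_us["B"]  # disk processing (B -> C)
total_wait = timestamps_us["D"] - timestamps_us["A"]   # application-visible (A -> D)
software_overhead = total_wait - device_time           # OS + interface layers

print(device_time, total_wait, software_overhead)  # 75.0 100.0 25.0
```

Separating the two intervals matters because, as the device gets faster, the fixed software and interface overhead (A-to-B plus C-to-D) dominates the application-visible latency.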

There are many intermediate software and hardware layers involved in an SSD access. The software side on a Linux-based system includes the virtual memory system, the virtual file system, the specific file system for the partition that holds the data (e.g., NTFS or ext3), the block device driver for the disk, and the device driver for the host interface, such as the Advanced Host Controller Interface (AHCI) for Serial ATA (SATA) drives [11]. At the hardware level, the interfaces involved include the host interface to the drive, the direct memory access (DMA) engine, and the SSD internals. The host interface is typically a SATA interface, which resides on the southbridge for modern Intel processors. This means that the request must first cross the Intel Direct Media Interface (DMI) or equivalent before crossing the SATA interface. However, our model for this paper assumes a pure PCIe 3.0 NVM Express interface, and we utilize 16 lanes, which makes the model

Slide 16: Solution III: Non-Volatiles

Performance normalized to that of a TB-sized DRAM system.

[Figure 9: System performance when combining all techniques. Normalized IPC (scale 0 to 1.2) for Un-Optimized Hybrid SLC, SSD MLC, Hybrid MLC, SSD SLC, and Hybrid SLC. The IPC is normalized to the ideal case with enough DRAM to store the entire working set.]

…many realistic workloads, we show that the hybrid memory design can provide significant performance improvements compared to an enterprise-class solid state drive. We believe this design space is worth investigating further, as our paper is only an initial glimpse into using hybrid memories as a faster storage system. In particular, there is much work to be done optimizing both the flash system and the operating system to deal with this new design. We intend to investigate both of these areas in future work.

References
[1] "Hybrid Memory Cube Consortium." [Online]. Available: http://hybridmemorycube.org
[2] "Linux Kernel Documentation for tmpfs file system." [Online]. Available: http://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt
[3] "Open NAND Flash Interface Specification Revision 1.0," December 2006. [Online]. Available: http://onfi.org/wp-content/uploads/2009/02/onfi_1_0_gold.pdf
[4] "Fusion IO," 2012. [Online]. Available: http://www.fusionio.com
[5] "Intel Solid-State Drive 910 Series: Product Specification," 2012. [Online]. Available: http://www.intel.com/content/www/us/en/solid-state-drives/ssd-910-series-specification.html
[6] "Marvell Unveils Third-Generation SSD 6Gb/s SATA Controller," March 2012. [Online]. Available: http://www.marvell.com/company/news/pressDetail.do?releaseID=2176
[7] "NVM Express Revision 1.0c Specification," February 2012. [Online]. Available: http://www.nvmexpress.org
[8] "PCI Express OCZ Technology," 2012. [Online]. Available: http://www.ocztechnology.com/products/solid_state_drives/pci-e_solid_state_drives
[9] N. Agrawal et al., "Design Tradeoffs for SSD Performance," in Proceedings of the 2008 USENIX Technical Conference (USENIX '08), 2008.
[10] A. Badam and V. S. Pai, "SSDAlloc: Hybrid SSD/RAM Memory Management Made Easy," in Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI '11), 2011.
[11] D. P. Bovet and M. Cesati, Understanding the Linux Kernel, 3rd ed. O'Reilly Media, 2005.
[12] J. Cooke, "Choosing the Right NAND for Your Application," Micron, 2009.
[13] E. Cooper-Balis, P. Rosenfeld, and B. Jacob, "Buffer On Board memory systems," in Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12), 2012.
[14] C. Dirik and B. Jacob, "The performance of PC Solid-State Disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization," in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009, pp. 279-289.
[15] E. Harari, "The Non-Volatile Memory Industry: A Personal Journey," in 3rd IEEE International Memory Workshop (IMW), May 2011, pp. 1-4.
[16] E. Harari, "Flash Memory: The Great Disruptor!" in International Solid-State Circuits Conference (ISSCC), Feb. 2012, pp. 10-15.
[17] J. Jex, "Flash memory BIOS for PC and notebook computers," in IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 2, 1991, pp. 692-695.
[18] S. Jiang and X. Zhang, "Token-ordered LRU: an effective page replacement policy and its implementation in Linux systems," Performance Evaluation, vol. 60, no. 1-4, pp. 5-29, May 2005.
[19] T. Kgil and T. Mudge, "FlashCache: a NAND Flash Memory File Cache for Low Power Web Servers," in Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES '06), 2006, pp. 103-112.
[20] E. Koldinger, J. Chase, and S. Eggers, "Architectural Support for Single Address Space Operating Systems," in 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), vol. 27, no. 9, 1992, pp. 175-186.
[21] B. C. Lee et al., "Architecting Phase Change Memory as a Scalable DRAM Alternative," in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009, pp. 2-13.
[22] A. Patel et al., "MARSSx86: A Full System Simulator for x86 CPUs," in Design Automation Conference 2011 (DAC '11), 2011.
[23] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009, pp. 24-33.
[24] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "DRAMSim2: A Cycle Accurate Memory System Simulator," Computer Architecture Letters, vol. 10, no. 1, pp. 16-19, Jan.-June 2011.

Slide 17: Bottom Line

• All three solutions are composable (this is GOOD)
• Power problem: solvable
• Bandwidth problem: solvable
• Cost problem: solvable
• An HMC-style generic interface is future-proof by definition

Slide 18: Thank You!

Prof. Dr. Bruce Jacob, University of Maryland
[email protected] · ece.umd.edu/~blj