Disaggregated Memory for Expansion and Sharing in Blade Servers
Transcript of Disaggregated Memory for Expansion and Sharing in Blade Servers
Kevin Lim*, Jichuan Chang+, Trevor Mudge*, Parthasarathy Ranganathan+, Steven K. Reinhardt*†, Thomas F. Wenisch*
June 23, 2009
* University of Michigan + HP Labs † AMD
Motivation: The memory capacity wall
Memory capacity per core drops ~30% every 2 years
[Figure: number of cores and GB of DRAM over 2003–2017, log scale, showing the growing gap labeled the capacity wall]
Opportunity: Optimizing for the ensemble
Dynamic provisioning across ensemble enables cost & power savings
[Figure: intra-server variation, memory footprint per TPC-H query Q1–Q12 on a log scale from 0.1 MB to 100 GB; inter-server variation, memory usage over time in a rendering farm]
Contributions

Goal: Expand capacity & provision for typical usage
• New architectural building block: memory blade
−Breaks traditional compute-memory co-location
• Two architectures for transparent memory expansion
• Capacity expansion:
−8x performance over provisioning for median usage
−Higher consolidation
• Capacity sharing:
−Lower power and costs
−Better performance / dollar
Outline
• Introduction
• Disaggregated memory architecture
−Concept
−Challenges
−Architecture
• Methodology and results
• Conclusion
Disaggregated memory concept
Break CPU-memory co-location
Leverage fast, shared communication fabrics
[Diagram: conventional blade systems, each pairing CPUs with local DIMMs, vs. blade systems with disaggregated memory, where a memory blade of DIMMs is shared across the backplane]
What are the challenges?
• Transparent expansion to app., OS
−Solution 1: Leverage coherency
−Solution 2: Leverage hypervisor
• Commodity-based hardware
• Match right-sized, conventional systems
−Performance
−Cost

[Diagram: compute blade (CPUs, DIMMs, hypervisor, OS, and app software stack) connected over the backplane to the memory blade]
General memory blade design
[Diagram: memory blade (enlarged), attached to the backplane, containing a protocol engine, memory controller, address mapping logic, and rows of DIMMs; compute blades with CPUs and local DIMMs connect across the backplane]

Design driven by key challenges:
• Transparency: enforces allocation, isolation, and mapping
• Cost: handles dynamic memory partitioning; leverages the sweet spot of RAM pricing (plus other optimizations)
• Performance: accessed as memory, not swap space
• Commodity: connected via PCIe or HT
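The allocation, isolation, and mapping duties of the memory blade can be sketched in code. This is an illustrative Python sketch, not the paper's implementation; the names (`MemoryBlade`, `allocate`, `translate`) and the page-map data structure are my own assumptions.

```python
# Hypothetical sketch of the memory blade's address-mapping layer.
# Each client blade gets its own map of remote page numbers to
# blade-local DIMM frames; repartitioning capacity between clients
# is just growing or shrinking a client's map.

class MemoryBlade:
    PAGE = 4096  # bytes per page

    def __init__(self, total_pages):
        self.free_frames = list(range(total_pages))  # blade-local DIMM frames
        self.page_maps = {}  # client_id -> {remote page number: local frame}

    def allocate(self, client_id, n_pages):
        """Dynamic partitioning: grant n_pages of blade capacity to a client."""
        if n_pages > len(self.free_frames):
            raise MemoryError("blade capacity exhausted")
        pmap = self.page_maps.setdefault(client_id, {})
        base = len(pmap)
        for i in range(n_pages):
            pmap[base + i] = self.free_frames.pop()

    def translate(self, client_id, remote_addr):
        """Map a client's remote address to a blade-local address.
        Isolation: a client can only reach frames in its own map."""
        page, offset = divmod(remote_addr, self.PAGE)
        frame = self.page_maps[client_id].get(page)
        if frame is None:
            raise PermissionError("access outside client's allocation")
        return frame * self.PAGE + offset
```

An access that falls outside a client's allocation is rejected, which is how the blade enforces isolation between the blades sharing it.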
Fine-grained remote access (FGRA)
[Diagram: compute blade (CPUs, DIMMs, OS/app software stack) with a coherence filter (CF), connected over a coherent HyperTransport fabric across the backplane to the memory blade]

• Connected via coherent fabric (e.g., HyperTransport™) to the memory blade; extends the coherency domain
• Adds minor hardware: a coherence filter that filters unnecessary traffic, since the memory blade doesn't need all coherence traffic
• On access: data transferred at cache-block granularity
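The coherence filter's job can be illustrated with a minimal sketch. This is my own illustration of the filtering idea; the function name, request shape, and the address-range test are assumptions, not the paper's design details.

```python
# Illustrative sketch of an FGRA-style coherence filter: forward to the
# memory blade only the coherence requests that target the remote
# (memory-blade) address range. Probes for local-memory lines are
# dropped, because the blade holds no cached copies of those lines
# and so can never be a sharer.

REMOTE_BASE = 1 << 36   # assumed start of the memory-blade address range
REMOTE_SIZE = 1 << 35   # assumed size of that range

def coherence_filter(request):
    """Return True if the request must cross the fabric to the memory blade."""
    addr = request["addr"]
    return REMOTE_BASE <= addr < REMOTE_BASE + REMOTE_SIZE
```

Everything the filter drops stays on the compute blade's local coherence fabric, which is what keeps the extended coherency domain from flooding the link.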
Page-swapping remote memory (PS)
[Diagram: compute blade (CPUs, DIMMs, hypervisor, OS/app software stack) connected through a bridge over PCI Express across the backplane to the memory blade]

• Connected via commodity fabric (PCI Express) to the memory blade
• Uses indirection from the hypervisor, leveraging the existing remapping between OS and hypervisor
• On access: data transferred at page (4KB) granularity; the local data page is swapped with the remote data page
• Performance dominated by transfer latency; insensitive to small changes
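A minimal sketch of the swap step, assuming a hypervisor that traps accesses to pages currently homed on the memory blade; all names and the dictionary-based page tables are hypothetical, not the paper's mechanism.

```python
# Sketch of PS-style page swapping. The hypervisor keeps an indirection
# table; on a trapped access to a remote page, 4 KB comes in over PCIe,
# a local victim's 4 KB goes out, and the mapping is updated so the
# guest OS never observes the move.

PAGE = 4096

local_pages = {}    # virtual page -> data resident in local DRAM
remote_pages = {}   # virtual page -> data homed on the memory blade

def access(vpage, victim_vpage):
    """Swap a remote page in, pushing a local victim page out."""
    assert vpage in remote_pages and victim_vpage in local_pages
    incoming = remote_pages.pop(vpage)        # DMA in: blade -> local DRAM
    outgoing = local_pages.pop(victim_vpage)  # DMA out: local DRAM -> blade
    local_pages[vpage] = incoming
    remote_pages[victim_vpage] = outgoing
    return local_pages[vpage]
```

After the swap, subsequent accesses to the page hit local DRAM at full speed, which is why PS can lean on locality despite the slow commodity link.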
Summary: Addressing the challenges
Challenge               FGRA                      PS
Transparent expansion   Extends coherency         Hypervisor indirection
Commodity HW            HyperTransport            PCI Express
High performance        Direct access             Leverages locality
Cost comparable         Shared memory blade infrastructure; right-provisioned memory
Outline
• Introduction
• Disaggregated memory architecture
• Methodology and results
−Performance
−Performance-per-cost
• Conclusion
Methodology
• Trace-based
−Memory traces from detailed simulation: Web 2.0, compute-intensive, server
−Utilization traces from live data centers: animation, VM consolidation, Web 2.0
• Two baseline memory sizes
−M-max: sized to the largest workload
−M-median: sized to the median of the workloads

Simulator parameters:
  Remote DRAM      120 ns, 6.4 GB/s
  PCIe             120 ns, 1 GB/s
  HyperTransport   60 ns, 4 GB/s
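The link parameters above admit a simple average-latency model that shows why locality dominates performance. This back-of-the-envelope sketch is my own illustration, not from the talk; the 60 ns local-DRAM latency is an assumed value, and the remote fractions are made-up examples.

```python
# Rough model: average access latency as a mix of local DRAM accesses
# and remote memory-blade accesses, where a remote access pays the link
# latency, the remote DRAM latency, and the data transfer time.
# Note: 1 GB/s moves 1 byte per ns, so bytes / (GB/s) gives ns.

def avg_latency_ns(remote_frac, link_latency_ns, link_gbps, xfer_bytes,
                   local_ns=60.0, remote_dram_ns=120.0):
    transfer_ns = xfer_bytes / link_gbps
    remote_ns = link_latency_ns + remote_dram_ns + transfer_ns
    return (1 - remote_frac) * local_ns + remote_frac * remote_ns

# FGRA-style access: 64 B cache block over HyperTransport (60 ns, 4 GB/s)
fgra = avg_latency_ns(0.05, 60, 4, 64)     # ~66.8 ns at 5% remote accesses
# PS-style miss: 4 KB page over PCIe (120 ns, 1 GB/s), amortized over the
# many local hits to the page after it is swapped in
ps = avg_latency_ns(0.001, 120, 1, 4096)   # ~64.3 ns at 0.1% remote accesses
```

Even though a single PS page transfer costs far more than a single FGRA block transfer, amortizing it over subsequent local hits keeps the average low, which matches the "locality is most important" takeaway in the results.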
Performance

[Figure: performance normalized to the M-median local + disk baseline, log scale, for zeusmp, perl, gcc, bwaves, nutch4p, tpch, mix, mcf, pgbench, indexer, specjbb, spec4p, and harmonic means, comparing M-max (ideal), PS, and FGRA; annotations mark 8X and 2X and the workloads whose footprints exceed M-median]

• Performance is 8X higher, close to ideal
• FGRA is slower on these memory-intensive workloads; locality is most important to performance
Performance / Cost

[Figure: performance per dollar normalized to the M-max local + disk baseline, scale 0 to 3, for the same workloads, comparing M-median, PS, and FGRA; annotations mark 1.3X and 1.4X and the workloads whose footprints exceed M-median]

• PS is able to provide consistently high performance / $
• M-median has a significant drop-off on large workloads
Conclusions
• Motivation: impending memory capacity wall
• Opportunity: optimizing for the ensemble
• Solution: memory disaggregation
−Transparent, commodity HW, high performance, low cost
−Dedicated memory blade for expansion and sharing
−PS and FGRA provide transparent support
• Please see the paper for more details!