Disaggregated Memory for Expansion and Sharing in Blade Servers

Transcript of Disaggregated Memory for Expansion and Sharing in Blade Servers

Page 1: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Kevin Lim*, Jichuan Chang+, Trevor Mudge*, Parthasarathy Ranganathan+, Steven K. Reinhardt*†, Thomas F. Wenisch*
June 23, 2009

Disaggregated Memory for Expansion and Sharing in Blade Servers

* University of Michigan + HP Labs † AMD

Page 2: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Motivation: The memory capacity wall

Memory capacity per core drops ~30% every 2 years

[Figure: # cores and GB of DRAM (log scale, 1 to 1000) vs. year, 2003-2017; the widening gap between the two curves is the capacity wall.]
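To put the trend in perspective, a quick back-of-the-envelope calculation (a sketch in Python, assuming the ~30% drop per two years compounds steadily):

    # If memory capacity per core falls ~30% every two years (the trend
    # above), roughly a 6x shortfall accumulates over a decade.
    rate_per_2yr = 0.30
    years = 10
    remaining = (1 - rate_per_2yr) ** (years // 2)
    print(f"Capacity per core after {years} years: {remaining:.2f}x of today "
          f"(about {1/remaining:.0f}x less)")   # ~0.17x, i.e. ~6x less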

2

Page 3: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Opportunity: Optimizing for the ensemble

Dynamic provisioning across the ensemble enables cost & power savings

[Figure: Intra-server variation: per-query memory footprint for TPC-H queries Q1-Q12, log scale from 0.1 MB to 100 GB. Inter-server variation: memory utilization over time across a rendering farm.]

3

Page 4: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Contributions

Goal: Expand capacity & provision for typical usage

• New architectural building block: memory blade
− Breaks traditional compute-memory co-location

• Two architectures for transparent mem. expansion

• Capacity expansion:
− 8x performance over provisioning for median usage
− Higher consolidation

• Capacity sharing:
− Lower power and costs
− Better performance / dollar

4

Page 5: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Outline

• Introduction

• Disaggregated memory architecture
− Concept
− Challenges
− Architecture

• Methodology and results

• Conclusion

5

Page 6: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Disaggregated memory concept

Break CPU-memory co-location

Leverage fast, shared communication fabrics

[Diagram: Conventional blade systems, each pairing CPUs with their own DIMMs, vs. blade systems with disaggregated memory, where compute blades reach a shared memory blade of DIMMs across the backplane.]

6

Page 7: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

What are the challenges?

• Transparent expansion to app., OS
− Solution 1: Leverage coherency
− Solution 2: Leverage hypervisor

• Commodity-based hardware

• Match right-sized, conventional systems
− Performance
− Cost

[Diagram: Compute blade (software stack: app, OS, hypervisor; CPUs and DIMMs) connected over the backplane to the memory blade.]

7

Page 8: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

General memory blade design

Design driven by key challenges:

• Transparency: enforces allocation, isolation, and mapping
• Cost: leverages the sweet spot of RAM pricing; handles dynamic memory partitioning
• Performance: accessed as memory, not swap space
• Commodity: connected via PCIe or HT

[Diagram: Enlarged memory blade with a protocol engine, memory controller, address-mapping logic, other optimizations, and banks of DIMMs, attached through the backplane to the compute blades (CPUs + DIMMs).]
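To make the role of the address-mapping logic concrete, here is a minimal software sketch (a hypothetical Python model; the class and method names are assumptions, not the actual hardware interface) of how the blade could enforce allocation, isolation, and dynamic partitioning per client blade:

    PAGE_SIZE = 4096  # bytes

    class MemoryBladeMap:
        """Toy model of the memory blade's allocation and address-mapping
        logic: each client blade gets its own remap table, carved out of a
        shared pool of DRAM pages (illustrative sketch only)."""

        def __init__(self, total_pages):
            self.free_pages = list(range(total_pages))  # shared DRAM pool
            self.maps = {}  # blade_id -> {remote page number: local page number}

        def allocate(self, blade_id, n_pages):
            """Dynamically grow one blade's partition from the shared pool."""
            if n_pages > len(self.free_pages):
                raise MemoryError("memory blade capacity exhausted")
            table = self.maps.setdefault(blade_id, {})
            base = len(table)
            for i in range(n_pages):
                table[base + i] = self.free_pages.pop()
            return base  # first remote page number granted to this blade

        def translate(self, blade_id, remote_addr):
            """Map (requesting blade, remote address) to a local DRAM address,
            rejecting accesses outside that blade's own allocation."""
            page, offset = divmod(remote_addr, PAGE_SIZE)
            table = self.maps.get(blade_id, {})
            if page not in table:
                raise PermissionError("access outside this blade's partition")
            return table[page] * PAGE_SIZE + offset

Returning pages to the shared pool when a blade's demand drops would follow the same pattern, which is what makes repartitioning across the ensemble cheap.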

8

Page 9: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Fine-grained remote access (FGRA)

• Connected via coherent fabric to the memory blade (e.g., HyperTransport™); extends the coherency domain
• Adds minor hardware: a coherence filter (CF) that filters unnecessary traffic, since the memory blade doesn't need all coherence traffic
• On access: data transferred at cache-block granularity

[Diagram: Compute blade (software stack: app, OS; CPUs and DIMMs) connected through the coherence filter and HyperTransport over the backplane to the memory blade.]
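A rough sketch of the filtering decision (hypothetical Python logic; the address range and message fields are assumptions, and the real filter is a small piece of hardware on the coherent fabric):

    REMOTE_BASE = 1 << 40     # assumed start of the remote-memory address range
    REMOTE_SIZE = 256 << 30   # assumed size of the remote region (256 GB)

    def must_cross_backplane(msg_type, addr):
        """Return True only for coherence messages the memory blade
        actually needs to see (illustrative only)."""
        in_remote = REMOTE_BASE <= addr < REMOTE_BASE + REMOTE_SIZE
        if not in_remote:
            return False   # traffic about local memory stays on the compute blade
        if msg_type == "snoop":
            return False   # the memory blade caches nothing, so snoops are unnecessary
        return True        # remote reads/writes do go out, at cache-block granularity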

9

Page 10: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Page-swapping remote memory (PS)

• Connected via commodity fabric to the memory blade (PCI Express bridge)
• Uses indirection from the hypervisor: leverages the existing remapping between OS and hypervisor
• On access: data transferred at page (4KB) granularity; the local data page is swapped with the remote data page
• Performance dominated by transfer latency; insensitive to small changes

[Diagram: Compute blade (software stack: app, OS, hypervisor; CPUs and DIMMs) connected through a PCI Express bridge over the backplane to the memory blade.]

10
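A minimal sketch of the swap on access (hypothetical Python pseudocode; the interfaces pick_victim, read_frame, write_frame, read_page, and write_page are assumptions standing in for hypervisor and memory-blade operations):

    PAGE = 4096  # bytes; transfers happen at page granularity over PCIe

    def handle_remote_access(vm, guest_page, p2m, memory_blade):
        """On a fault to a guest page currently held on the memory blade,
        swap it with a local victim page and fix up the hypervisor's
        indirection table so the OS and application never notice."""
        kind, remote_page = p2m[guest_page]
        assert kind == "remote"

        # 1. Choose a local page to send away (eviction policy elided).
        victim_guest_page, victim_frame = vm.pick_victim()

        # 2. Exchange the two 4KB pages across the PCIe link.
        incoming = memory_blade.read_page(remote_page)
        outgoing = vm.read_frame(victim_frame)
        memory_blade.write_page(remote_page, outgoing)
        vm.write_frame(victim_frame, incoming)

        # 3. Update the guest-to-machine remapping in the hypervisor.
        p2m[guest_page] = ("local", victim_frame)
        p2m[victim_guest_page] = ("remote", remote_page)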

Page 11: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Summary: Addressing the challenges

FGRA vs. PS:

• Transparent expansion: FGRA extends coherency; PS uses hypervisor indirection
• Commodity HW: FGRA uses HyperTransport; PS uses PCI Express
• High performance: FGRA via direct access; PS by leveraging locality
• Cost comparable: both share the memory blade infrastructure and right-provision memory

11

Page 12: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Outline

• Introduction

• Disaggregated memory architecture

• Methodology and results
− Performance
− Performance-per-cost

• Conclusion

12

Page 13: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Methodology

• Trace-based
− Memory traces from detailed simulation: Web 2.0, compute-intensive, server
− Utilization traces from live data centers: animation, VM consolidation, Web 2.0

• Two baseline memory sizes
− M-max: sized to the largest workload
− M-median: sized to the median of the workloads

Simulator parameters:

Remote DRAM      120 ns, 6.4 GB/s
PCIe             120 ns, 1 GB/s
HyperTransport   60 ns, 4 GB/s
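These parameters also make the per-access costs of the two designs easy to estimate; a quick sketch using only the numbers above (protocol and software overheads ignored):

    # Per-access time = link latency + payload / bandwidth (1 GB/s = 1 byte/ns).
    def transfer_ns(latency_ns, bw_gb_per_s, payload_bytes):
        return latency_ns + payload_bytes / bw_gb_per_s

    # FGRA: one 64-byte cache block over HyperTransport (60 ns, 4 GB/s)
    print(transfer_ns(60, 4, 64))      # ~76 ns per block
    # PS: one 4KB page over PCIe (120 ns, 1 GB/s)
    print(transfer_ns(120, 1, 4096))   # ~4216 ns per page, amortized across the page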

13

Page 14: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Performance

Baseline: M-median local + disk

[Figure: Normalized performance (log scale, 1 to 1000) for zeusmp, perl, gcc, bwaves, nutch4p, tpchmix, mcf, pgbench, indexer, specjbb, spec4p, and their harmonic means, comparing M-max (ideal), PS, and FGRA; annotations of 8X and 2X mark the workloads with footprint > M-median.]

Performance is 8X higher, close to ideal.
FGRA is slower on these memory-intensive workloads: locality is most important to performance.

14

Page 15: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Performance / Cost

Baseline: M-max local + disk

[Figure: Normalized performance / $ (0 to 3) for zeusmp, perl, gcc, bwaves, nutch4p, tpchmix, mcf, pgbench, indexer, specjbb, spec4p, and their harmonic means, comparing M-median, PS, and FGRA; gains of roughly 1.3X and 1.4X are marked for the workloads with footprint > M-median.]

PS is able to provide consistently high performance / $.
M-median has a significant drop-off on large workloads.

15

Page 16: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Conclusions

• Motivation: Impending memory capacity wall

• Opportunity: Optimizing for the ensemble

• Solution: Memory disaggregation
− Transparent, commodity HW, high perf., low cost
− Dedicated memory blade for expansion, sharing
− PS and FGRA provide transparent support

• Please see paper for more details!

16

Page 17: Disaggregated Memory  for Expansion and Sharing  in Blade Servers

Thank you!

Any questions?

[email protected]

17