Disaggregated Memory for Expansion and Sharing in Blade Servers
Transcript of Disaggregated Memory for Expansion and Sharing in Blade Servers
Kevin Lim*, Jichuan Chang+, Trevor Mudge*, Parthasarathy Ranganathan+, Steven K. Reinhardt*†, Thomas F. Wenisch*
June 23, 2009
* University of Michigan + HP Labs † AMD
Motivation: The memory capacity wall
Memory capacity per core drops ~30% every 2 years
[Figure: number of cores and GB of DRAM over 2003–2017, log scale, showing the growing gap labeled the capacity wall]
Opportunity: Optimizing for the ensemble
Dynamic provisioning across ensemble enables cost & power savings
[Figure: intra-server variation, memory footprint per TPC-H query Q1–Q12 on a log scale from 0.1 MB to 100 GB; inter-server variation, memory usage over time in a rendering farm]
Contributions

Goal: Expand capacity & provision for typical usage
• New architectural building block: memory blade
−Breaks traditional compute-memory co-location
• Two architectures for transparent memory expansion
• Capacity expansion:
−8x performance over provisioning for median usage
−Higher consolidation
• Capacity sharing:
−Lower power and costs
−Better performance / dollar
Outline
• Introduction
• Disaggregated memory architecture
−Concept
−Challenges
−Architecture
• Methodology and results
• Conclusion
Disaggregated memory concept
Break CPU-memory co-location
Leverage fast, shared communication fabrics
[Diagram: conventional blade systems, each pairing CPUs with local DIMMs, vs. blade systems with disaggregated memory, where a memory blade of DIMMs is shared across the backplane]
What are the challenges?
• Transparent expansion to app., OS
−Solution 1: Leverage coherency
−Solution 2: Leverage hypervisor
• Commodity-based hardware
• Match right-sized, conventional systems
−Performance
−Cost

[Diagram: compute blade (CPUs, DIMMs, hypervisor, OS, and app software stack) connected over the backplane to the memory blade]
General memory blade design
[Diagram: memory blade (enlarged), attached to the backplane, containing a protocol engine, memory controller, address mapping logic, and rows of DIMMs; compute blades with CPUs and local DIMMs connect across the backplane]

Design driven by key challenges:
• Transparency: enforces allocation, isolation, and mapping
• Cost: handles dynamic memory partitioning; leverages the sweet spot of RAM pricing (plus other optimizations)
• Performance: accessed as memory, not swap space
• Commodity: connected via PCIe or HT
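The allocation, isolation, and mapping duties of the memory blade can be sketched in code. This is an illustrative Python sketch, not the paper's implementation; the names (`MemoryBlade`, `allocate`, `translate`) and the page-map data structure are my own assumptions.

```python
# Hypothetical sketch of the memory blade's address-mapping layer.
# Each client blade gets its own map of remote page numbers to
# blade-local DIMM frames; repartitioning capacity between clients
# is just growing or shrinking a client's map.

class MemoryBlade:
    PAGE = 4096  # bytes per page

    def __init__(self, total_pages):
        self.free_frames = list(range(total_pages))  # blade-local DIMM frames
        self.page_maps = {}  # client_id -> {remote page number: local frame}

    def allocate(self, client_id, n_pages):
        """Dynamic partitioning: grant n_pages of blade capacity to a client."""
        if n_pages > len(self.free_frames):
            raise MemoryError("blade capacity exhausted")
        pmap = self.page_maps.setdefault(client_id, {})
        base = len(pmap)
        for i in range(n_pages):
            pmap[base + i] = self.free_frames.pop()

    def translate(self, client_id, remote_addr):
        """Map a client's remote address to a blade-local address.
        Isolation: a client can only reach frames in its own map."""
        page, offset = divmod(remote_addr, self.PAGE)
        frame = self.page_maps[client_id].get(page)
        if frame is None:
            raise PermissionError("access outside client's allocation")
        return frame * self.PAGE + offset
```

An access that falls outside a client's allocation is rejected, which is how the blade enforces isolation between the blades sharing it.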
Fine-grained remote access (FGRA)
[Diagram: compute blade (CPUs, DIMMs, OS/app software stack) with a coherence filter (CF), connected over a coherent HyperTransport fabric across the backplane to the memory blade]

• Connected via coherent fabric (e.g., HyperTransport™) to the memory blade; extends the coherency domain
• Adds minor hardware: a coherence filter that filters unnecessary traffic, since the memory blade doesn't need all coherence traffic
• On access: data transferred at cache-block granularity
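The coherence filter's job can be illustrated with a minimal sketch. This is my own illustration of the filtering idea; the function name, request shape, and the address-range test are assumptions, not the paper's design details.

```python
# Illustrative sketch of an FGRA-style coherence filter: forward to the
# memory blade only the coherence requests that target the remote
# (memory-blade) address range. Probes for local-memory lines are
# dropped, because the blade holds no cached copies of those lines
# and so can never be a sharer.

REMOTE_BASE = 1 << 36   # assumed start of the memory-blade address range
REMOTE_SIZE = 1 << 35   # assumed size of that range

def coherence_filter(request):
    """Return True if the request must cross the fabric to the memory blade."""
    addr = request["addr"]
    return REMOTE_BASE <= addr < REMOTE_BASE + REMOTE_SIZE
```

Everything the filter drops stays on the compute blade's local coherence fabric, which is what keeps the extended coherency domain from flooding the link.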
Page-swapping remote memory (PS)
[Diagram: compute blade (CPUs, DIMMs, hypervisor, OS/app software stack) connected through a bridge over PCI Express across the backplane to the memory blade]

• Connected via commodity fabric (PCI Express) to the memory blade
• Uses indirection from the hypervisor, leveraging the existing remapping between OS and hypervisor
• On access: data transferred at page (4KB) granularity; the local data page is swapped with the remote data page
• Performance dominated by transfer latency; insensitive to small changes
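A minimal sketch of the swap step, assuming a hypervisor that traps accesses to pages currently homed on the memory blade; all names and the dictionary-based page tables are hypothetical, not the paper's mechanism.

```python
# Sketch of PS-style page swapping. The hypervisor keeps an indirection
# table; on a trapped access to a remote page, 4 KB comes in over PCIe,
# a local victim's 4 KB goes out, and the mapping is updated so the
# guest OS never observes the move.

PAGE = 4096

local_pages = {}    # virtual page -> data resident in local DRAM
remote_pages = {}   # virtual page -> data homed on the memory blade

def access(vpage, victim_vpage):
    """Swap a remote page in, pushing a local victim page out."""
    assert vpage in remote_pages and victim_vpage in local_pages
    incoming = remote_pages.pop(vpage)        # DMA in: blade -> local DRAM
    outgoing = local_pages.pop(victim_vpage)  # DMA out: local DRAM -> blade
    local_pages[vpage] = incoming
    remote_pages[victim_vpage] = outgoing
    return local_pages[vpage]
```

After the swap, subsequent accesses to the page hit local DRAM at full speed, which is why PS can lean on locality despite the slow commodity link.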
Summary: Addressing the challenges
Challenge               FGRA                      PS
Transparent expansion   Extends coherency         Hypervisor indirection
Commodity HW            HyperTransport            PCI Express
High performance        Direct access             Leverages locality
Cost comparable         Shared memory blade infrastructure; right-provisioned memory
Outline
• Introduction
• Disaggregated memory architecture
• Methodology and results
−Performance
−Performance-per-cost
• Conclusion
Methodology
• Trace-based
−Memory traces from detailed simulation: Web 2.0, compute-intensive, server
−Utilization traces from live data centers: animation, VM consolidation, Web 2.0
• Two baseline memory sizes
−M-max: sized to the largest workload
−M-median: sized to the median of the workloads

Simulator parameters:
  Remote DRAM      120 ns, 6.4 GB/s
  PCIe             120 ns, 1 GB/s
  HyperTransport   60 ns, 4 GB/s
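The link parameters above admit a simple average-latency model that shows why locality dominates performance. This back-of-the-envelope sketch is my own illustration, not from the talk; the 60 ns local-DRAM latency is an assumed value, and the remote fractions are made-up examples.

```python
# Rough model: average access latency as a mix of local DRAM accesses
# and remote memory-blade accesses, where a remote access pays the link
# latency, the remote DRAM latency, and the data transfer time.
# Note: 1 GB/s moves 1 byte per ns, so bytes / (GB/s) gives ns.

def avg_latency_ns(remote_frac, link_latency_ns, link_gbps, xfer_bytes,
                   local_ns=60.0, remote_dram_ns=120.0):
    transfer_ns = xfer_bytes / link_gbps
    remote_ns = link_latency_ns + remote_dram_ns + transfer_ns
    return (1 - remote_frac) * local_ns + remote_frac * remote_ns

# FGRA-style access: 64 B cache block over HyperTransport (60 ns, 4 GB/s)
fgra = avg_latency_ns(0.05, 60, 4, 64)     # ~66.8 ns at 5% remote accesses
# PS-style miss: 4 KB page over PCIe (120 ns, 1 GB/s), amortized over the
# many local hits to the page after it is swapped in
ps = avg_latency_ns(0.001, 120, 1, 4096)   # ~64.3 ns at 0.1% remote accesses
```

Even though a single PS page transfer costs far more than a single FGRA block transfer, amortizing it over subsequent local hits keeps the average low, which matches the "locality is most important" takeaway in the results.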
Performance

[Figure: performance normalized to the M-median local + disk baseline, log scale, for zeusmp, perl, gcc, bwaves, nutch4p, tpch, mix, mcf, pgbench, indexer, specjbb, spec4p, and harmonic means, comparing M-max (ideal), PS, and FGRA; annotations mark 8X and 2X and the workloads whose footprints exceed M-median]

• Performance is 8X higher, close to ideal
• FGRA is slower on these memory-intensive workloads; locality is most important to performance
Performance / Cost

[Figure: performance per dollar normalized to the M-max local + disk baseline, scale 0 to 3, for the same workloads, comparing M-median, PS, and FGRA; annotations mark 1.3X and 1.4X and the workloads whose footprints exceed M-median]

• PS is able to provide consistently high performance / $
• M-median has a significant drop-off on large workloads
Conclusions
• Motivation: impending memory capacity wall
• Opportunity: optimizing for the ensemble
• Solution: memory disaggregation
−Transparent, commodity HW, high performance, low cost
−Dedicated memory blade for expansion and sharing
−PS and FGRA provide transparent support
• Please see the paper for more details!