Practical Memory Disaggregation

A Case Study in Network-Informed Data Systems Design
Mosharaf Chowdhury
November 2020

Page 1: Practical Memory Disaggregation

1

Practical Memory Disaggregation
A Case Study in Network-Informed Data Systems Design

Mosharaf Chowdhury
November 2020

Page 2: Practical Memory Disaggregation

Five Years Ago…

3

The volume of data businesses want to make sense of is increasing

2015 2016 2017 2018

Page 3: Practical Memory Disaggregation

1. Data Volume Will Keep Increasing

4

2015 2016 2017 2018

Page 4: Practical Memory Disaggregation

Data Systems

5

Big Data Analytics
AI/ML Tools

Massive data
High parallelism
GPU clusters
Distributed…

Page 5: Practical Memory Disaggregation

2. Deployed in Diverse Networks

Within a Rack: < 10 µs
Within a Datacenter: ~ 1 ms
Over the World: > 100 ms

6

Page 6: Practical Memory Disaggregation

Network-Informed Data Systems Design

7

I. Network-adaptive Big Data and AI/ML systems

II. Tailoring data systems to extreme networks
    I. Computation over the Internet
    II. Leveraging high-speed networks

Timeline: 2016, 2017, 2018, 2019, 2020

Projects: HUG, CODA, Hermes, EC-Cache, Carbyne, QOOP, Tiresias, Salus, AlloX, NOCS, Sol, Pando, Infiniswap, DSLR, Leap, NetLock, CellScope

Page 7: Practical Memory Disaggregation

Practical Memory Disaggregation

8

Page 8: Practical Memory Disaggregation

Memory is King!

9

Page 9: Practical Memory Disaggregation

Perform Great!

[Figure: TPC-C on VoltDB. Throughput in TPS (thousands) vs. in-memory working set (100%, 75%, 50%); 36.18 at 100%.]

10

Page 10: Practical Memory Disaggregation

Perform Great Until Memory Runs Out

[Figure: TPC-C on VoltDB. Throughput in TPS (thousands) vs. in-memory working set (100%, 75%, 50%); 36.18 at 100%.]

11

Page 11: Practical Memory Disaggregation

Perform Great Until Memory Runs Out

[Figure, two panels vs. in-memory working set (100%, 75%, 50%):
TPC-C on VoltDB, TPS (thousands): 36.18 at 100%
FB Workload on Memcached, Ops (thousands): 95.8, 44.9, 23.8]

12

Page 12: Practical Memory Disaggregation

Perform Great Until Memory Runs Out

[Figure, three panels vs. in-memory working set (100%, 75%, 50%):
TPC-C on VoltDB, TPS (thousands): 36.18 at 100%
FB Workload on Memcached, Ops (thousands): 95.8, 44.9, 23.8
PageRank on PowerGraph, completion time (s): 57, 67.5, 453.4]

13

Page 13: Practical Memory Disaggregation

50% Less Memory Causes Slowdown of …

[Figure, three panels vs. in-memory working set (100%, 75%, 50%):
TPC-C on VoltDB, TPS (thousands): 36.18 at 100%
FB Workload on Memcached, Ops (thousands): 95.8, 44.9, 23.8
PageRank on PowerGraph, completion time (s): 57, 67.5, 453.4]

14

Page 14: Practical Memory Disaggregation

Between a Rock and a Hard Place

Overallocation
Leads to underutilization
30-50% in Google, Alibaba, and Facebook

VS.

Underallocation
Leads to severe performance loss

15

Page 15: Practical Memory Disaggregation

Memory Disaggregation

[Diagram: Machine 1, Machine 2, Machine 3, …, Machine N, each split into used memory and free memory. The free memory of the machines forms the disaggregated memory, accessed as remote memory.]

Page 16: Practical Memory Disaggregation

Network is Getting Faster!

Time to access a 4KB memory page:

TCP/IP: Hundreds of µsec
DPDK: Tens of µsec
RDMA: Single-digit µsec
DRAM: Hundreds of nsec

Page 17: Practical Memory Disaggregation

What is Practical Memory Disaggregation?

1. Applicability
2. Scalability
3. Efficiency
4. Performance
5. Isolation
6. Resilience
7. Security
8. Generality
9. …

18

Page 18: Practical Memory Disaggregation

What is Practical Memory Disaggregation?

1. Applicability
2. Scalability
3. Efficiency
4. Performance
5. Isolation
6. Resilience
7. Security
8. Generality
9. …

Page 19: Practical Memory Disaggregation

Infiniswap
Efficient Memory Disaggregation

w/ Juncheng Gu and many others
NSDI’17

20

Page 20: Practical Memory Disaggregation

Applicability

[Figure: remote-memory interfaces (Key-Value, File, Disk, Paging) placed by performance (Poor, Very Good, Best) against required application changes (Major, Minor, No).]

How can we enable any application to leverage disaggregated memory without sacrificing performance?

Page 21: Practical Memory Disaggregation

Remote Memory Paging

22

Exposes memory across server boundaries
• Scalable
• Efficient
• Fault-tolerant

No changes to
• applications,
• operating systems, or
• hardware

Page 22: Practical Memory Disaggregation

Core Idea

23

Exposes free remote memory as swap devices in a decentralized manner without affecting remote processes

1. Infiniswap Block Device
Finds free remote memory, maps pages, and provides fault tolerance without any central coordination

2. Infiniswap Daemon
Proactively evicts remote pages to ensure transparent, best-effort service

Page 23: Practical Memory Disaggregation

Infiniswap in One Slide

[Diagram: Machine-1 through Machine-N. In user space, each machine runs containers (Container 1 … Container N, Container A) and an Infiniswap Daemon. In kernel space, the Virtual Memory Manager (VMM) hands individual pages to the Infiniswap Block Device on a page fault; the block device writes asynchronously to the local disk and synchronously to remote memory through the RNIC. A block-device region labeled X is mapped to the memory of Machine-X.]

24

Page 24: Practical Memory Disaggregation

Scalability via Decentralization

How to find free remote memory in a large cluster?
• Problem: Centralized solution can be slow and expensive
• Solution: Power of two choices

How to evict mapped memory?
• Problem: LRU/LFU is hard because one-sided RDMA bypasses CPU
• Solution: Power of many choices
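
The two decentralized decisions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Infiniswap's actual code: the function names `choose_host` and `choose_victims`, the dictionaries of free memory and slab activity, and the `extra` parameter are all hypothetical. The point is that each decision probes a small random sample instead of consulting a central coordinator.

```python
import random
from typing import Dict, List


def choose_host(free_gb: Dict[str, float], rng: random.Random) -> str:
    """Power of two choices: probe two random machines and keep the one
    that reports more free memory (ties go to the first probe)."""
    first, second = rng.sample(sorted(free_gb), 2)
    return first if free_gb[first] >= free_gb[second] else second


def choose_victims(activity: Dict[str, float], evict: int, extra: int,
                   rng: random.Random) -> List[str]:
    """Power of many choices: probe (evict + extra) random mapped slabs and
    evict the `evict` least active ones among those probed."""
    sample_size = min(len(activity), evict + extra)
    probed = rng.sample(sorted(activity), sample_size)
    return sorted(probed, key=lambda slab: activity[slab])[:evict]


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical free memory (GB) reported by four machines.
    free = {"m1": 1.0, "m2": 8.0, "m3": 4.0, "m4": 2.0}
    print("host for next slab:", choose_host(free, rng))
    # Hypothetical per-slab activity (e.g. recent page I/O rate).
    activity = {"s1": 9.0, "s2": 1.0, "s3": 5.0, "s4": 0.5}
    print("slabs to evict:", choose_victims(activity, evict=1, extra=2, rng=rng))
```

Because two distinct machines are probed, the machine with the least free memory is never chosen, which is where the load-balancing benefit comes from. Probing more candidates for eviction trades a few extra messages for a better chance of evicting slabs that are rarely used.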

25

Page 25: Practical Memory Disaggregation

Higher Efficiency & Better Load Balancing

Higher Utilization

[Figure: memory utilization (%) by rank across 32 machines, Infiniswap vs. w/o Infiniswap.]

26

Page 26: Practical Memory Disaggregation

Even on 50% Memory, Slowdown is <2X, ≈1X, and ≈1X

[Figure, three panels vs. in-memory working set (100%, 75%, 50%):
TPC-C on VoltDB, TPS (thousands): 35.89, 27.74, 19.33 (<2X)
FB Workload on Memcached, Ops (thousands): 99.1, 100.4, 91.3 (≈1X)
PageRank on PowerGraph, completion time (s): 56.1, 63.9, 64.2 (≈1X)]

Page 27: Practical Memory Disaggregation

What is Practical Memory Disaggregation?

1. Applicability
2. Scalability
3. Efficiency
4. Performance
5. Isolation
6. Resilience
7. Security
8. Generality
9. …

Page 28: Practical Memory Disaggregation

Leap
Effectively Prefetching Remote Memory

w/ Hasan Al Maruf
ATC’20 Best Paper

Page 29: Practical Memory Disaggregation

Life of a Page

[Diagram: a page fault from a user-space process (Process 1, Process 2, …, Process N) enters the kernel through the Memory Management Unit (MMU) and the page cache (cache hit or cache miss). On a miss, the request passes through the Device Mapping Layer, the Generic Block Layer (bio), the I/O Scheduler request queue (insertion, merging, sorting, staging, and dispatch), the dispatch queue, and the Block Device Driver before reaching remote memory. Annotated latencies: 0.27 us, 2.1 us, 10.04 us, 21.88 us, and RDMA: 4.3 us.]

30

Page 30: Practical Memory Disaggregation

Where Does the Time Go?

[Flowchart: Page Request → In Page Cache? If yes, Update Page Table & End I/O (fast path). If no, Allocate Cache for Page → Read Request? → Prepare for I/O → Queue and Batch Requests → Execute I/O (slow path). Annotated latencies: 0.12 µs, 2.1 µs, 10.04 µs, 21.88 µs, RDMA: 4.3 µs, 0.15 µs.]

31

Page 31: Practical Memory Disaggregation

We Need to

32

1. Increase cache hits
• Faster path serves more page faults

2. Reduce the latency of the slow path
• Remove block-layer operations unnecessary for RDMA
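
A rough model shows why both goals matter. The sketch below weights the fast and slow paths by the page-cache hit ratio. It rests on loud assumptions: that the latencies annotated on the earlier slides (0.27, 2.1, 10.04, 21.88, and 4.3 µs) can be read as a 0.27 µs hit path and as additive stages of the miss path, and that removing block-layer work eliminates the 10.04 µs and 21.88 µs stages. The slides do not confirm this mapping, so the output is illustrative only.

```python
def expected_fault_latency_us(hit_ratio: float, hit_us: float, miss_us: float) -> float:
    """Average page-fault latency when a fraction `hit_ratio` of faults is
    served from the page cache and the rest take the slow path."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_us + (1.0 - hit_ratio) * miss_us


# Assumption: cache-hit (fast path) latency from the slide annotations.
HIT_US = 0.27
# Assumption: the slow path is the sum of the annotated stage latencies.
SLOW_PATH_US = 2.1 + 10.04 + 21.88 + 4.3
# Assumption: without block-layer operations only these two stages remain.
TRIMMED_SLOW_PATH_US = 2.1 + 4.3

if __name__ == "__main__":
    for hit_ratio in (0.0, 0.5, 0.9):
        before = expected_fault_latency_us(hit_ratio, HIT_US, SLOW_PATH_US)
        after = expected_fault_latency_us(hit_ratio, HIT_US, TRIMMED_SLOW_PATH_US)
        print(f"hit ratio {hit_ratio:.0%}: {before:.2f} us -> {after:.2f} us")
```

Under these assumptions, raising the hit ratio and trimming the slow path compound: each shrinks a different term of the same weighted sum.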

Page 32: Practical Memory Disaggregation

Life of a Page

[Diagram: a page fault from a user-space process (Process 1, Process 2, …, Process N) enters the kernel through the Memory Management Unit (MMU) and the page cache (cache hit or cache miss). On a miss, the request passes through the Device Mapping Layer, the Generic Block Layer (bio), the I/O Scheduler request queue (insertion, merging, sorting, staging, and dispatch), the dispatch queue, and the Block Device Driver before reaching remote memory. Annotated latencies: 0.27 us, 2.1 us, 10.04 us, 21.88 us, and RDMA: 4.3 us.]

33

Page 33: Practical Memory Disaggregation

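Life of a Page w/ Leap

[Diagram: a page fault from a user-space process (Process 1, Process 2, …, Process N) enters the kernel through the Memory Management Unit (MMU) and the page cache (cache hit or cache miss); the block-layer stages no longer appear between the page cache and remote memory. Leap components: Process Specific Page Access Tracker, Trend Detection, Prefetch Candidate Generation, Prefetcher, and Eager Cache Eviction. Annotated latencies: RDMA: 4.3 us, 0.27 us, 2.1 us, 0.34 us.]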

Page 34: Practical Memory Disaggregation

Prefetching in Linux

Reads ahead pages sequentially

Based only on the last page access

Does not distinguish between processes
Cannot detect thread-level access irregularities

Too aggressive on sequential access: cache pollution

Too conservative off sequential access: brings nothing

35

Page 35: Practical Memory Disaggregation

Trend Detection in Leap

1. Start with a smaller window of Access History
2. Run Boyer-Moore on the window
3. Majority found? If yes, return majority ∆maj
4. If no, max. window size reached? If yes, no trend found; if no, double the window size and go back to step 2

Identifies the majority element in access history

Regular trends can be found within recent accesses

Resilient to short term irregularity
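
The loop described on this slide can be sketched as follows. This is a minimal illustration, not Leap's kernel implementation: it assumes the access history is kept as a list of deltas between consecutive page offsets, and the names `find_trend`, `initial_window`, and `max_window` are hypothetical. The Boyer-Moore majority vote finds a candidate in one pass; a second pass confirms that the candidate really occupies more than half of the window.

```python
from typing import Optional, Sequence


def majority_delta(window: Sequence[int]) -> Optional[int]:
    """Boyer-Moore majority vote with a verification pass.
    Returns the delta that fills more than half of `window`, or None."""
    candidate, count = None, 0
    for delta in window:
        if count == 0:
            candidate, count = delta, 1
        elif delta == candidate:
            count += 1
        else:
            count -= 1
    if candidate is not None and 2 * sum(d == candidate for d in window) > len(window):
        return candidate
    return None


def find_trend(deltas: Sequence[int], initial_window: int = 4,
               max_window: int = 32) -> Optional[int]:
    """Look for a majority delta in the most recent accesses, doubling the
    window until a trend is found or the maximum window size is exceeded."""
    window = initial_window
    while window <= max_window:
        trend = majority_delta(deltas[-window:])
        if trend is not None:
            return trend
        window *= 2
    return None


if __name__ == "__main__":
    # A stride of 3 pages, disturbed by a few irregular recent accesses.
    history = [3, 3, 3, 3, 3, 3, 9, -4, 3, 7]
    print(find_trend(history, initial_window=4, max_window=8))
```

In the usage example, the four most recent deltas have no majority, so the window doubles to eight, where the stride of 3 dominates. That is the sense in which a small starting window reacts quickly while a larger one stays resilient to short-term irregularity.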

36

Page 36: Practical Memory Disaggregation

Lowers Remote Page Access Latency by…

Sequential Access: 4X
Stride Access: 104X

[Figure: CDFs of remote page access latency (us, log scale from 0.01 to 10000) for Infiniswap vs. Infiniswap+Leap, one panel per access pattern.]

Page 37: Practical Memory Disaggregation

Performs Great Even After Memory Runs Out

[Figure: TPC-C on VoltDB, TPS (thousands) vs. in-memory working set (100%, 75%, 50%, 25%):
Infiniswap: 37.00, 27.74, 19.33, 1.5
Infiniswap + Leap: 37, 36.3, 35.6, 15.6]

Slowdown: <2X with Infiniswap, ≈1X with Infiniswap + Leap

Page 38: Practical Memory Disaggregation

Performs Great Even After Memory Runs Out

[Figure: TPC-C on VoltDB, TPS (thousands) vs. in-memory working set (100%, 75%, 50%, 25%):
Infiniswap: 37.00, 27.74, 19.33, 1.5
Infiniswap + Leap: 37, 36.3, 35.6, 15.6]

Slowdown with Infiniswap + Leap: 2.4X
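
The slowdown factors quoted on these slides can be reproduced from the bar values with a short calculation. The sketch assumes the TPC-C on VoltDB throughput values are read correctly off the chart (in particular, 1.5 thousand TPS for Infiniswap at a 25% working set) and defines slowdown as throughput at 100% divided by throughput at the smaller working set; the helper name `slowdowns` is hypothetical.

```python
from typing import Dict

# Throughput in thousands of TPS, keyed by in-memory working set (%).
# Assumption: values as read off the slide's bar chart.
INFINISWAP: Dict[int, float] = {100: 37.00, 75: 27.74, 50: 19.33, 25: 1.5}
INFINISWAP_LEAP: Dict[int, float] = {100: 37.0, 75: 36.3, 50: 35.6, 25: 15.6}


def slowdowns(tps_by_working_set: Dict[int, float]) -> Dict[int, float]:
    """Slowdown relative to the fully in-memory (100%) configuration."""
    baseline = tps_by_working_set[100]
    return {ws: baseline / tps for ws, tps in tps_by_working_set.items()}


if __name__ == "__main__":
    for name, data in (("Infiniswap", INFINISWAP),
                       ("Infiniswap + Leap", INFINISWAP_LEAP)):
        factors = slowdowns(data)
        print(name, {ws: round(f, 2) for ws, f in factors.items()})
```

With these inputs, the 50% working set gives a factor just under 2 for Infiniswap and close to 1 for Infiniswap + Leap, and the 25% working set gives roughly 2.4 for Infiniswap + Leap, in line with the annotations.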

Page 39: Practical Memory Disaggregation

Applicability & Performance

[Figure: File, Disk, Key-Value, Infiniswap, and Infiniswap + Leap placed by performance (Poor, Very Good, Best) against required application changes (Major, Minor, No).]

Page 40: Practical Memory Disaggregation

What is Practical Memory Disaggregation?

1. Applicability
2. Scalability
3. Efficiency
4. Performance
5. Isolation
6. Resilience
7. Security
8. Generality
9. …

Memtrade
Justitia
Hydra

Page 41: Practical Memory Disaggregation

What is Practical Memory Disaggregation?

1. Applicability
2. Scalability
3. Efficiency
4. Performance
5. Isolation
6. Resilience
7. Security
8. Generality
9. …

Page 42: Practical Memory Disaggregation

NetLock
Lock Management with Programmable Switches

w/ Zhuolong Yu, Yiwen Zhang and others
SIGCOMM’20

Page 43: Practical Memory Disaggregation

Transactions

44

Transaction processing needs
• High throughput;
• Low latency; and
• Policy support

Existing approaches
• Centralized: low throughput
• Decentralized: limited policy support

Page 44: Practical Memory Disaggregation

Network-Assisted Lock Management

45

Transaction processing needs
• High throughput;
• Low latency; and
• Policy support

Programmable Switch

Challenges
• Limited memory to store the locks
• Limited functionalities to process the locks and realize the policies

Page 45: Practical Memory Disaggregation

NetLock Architecture

NetLock processes lock requests with a combination of switch and servers
• The switch only stores and processes the requests on hot locks
• Servers do the rest

Implemented on a 6.5Tbps Barefoot Tofino switch

[Diagram: clients reach a ToR switch that performs L2/L3 routing and holds the NetLock lock table; lock table servers and database servers sit behind the switch.]

Page 46: Practical Memory Disaggregation

Switch Memory Disaggregation

47

Determine how much switch memory is needed for a target throughput
• Formulated as a fractional knapsack problem
• Depends on expected contention

Handling overflow
• Move locks back and forth between switch and servers
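
A fractional knapsack is solved greedily, which the sketch below illustrates for switch memory. It is a simplified reading of the slide, not NetLock's actual allocator: it assumes each lock has an expected request rate and a number of switch memory slots it needs under expected contention, and it ranks locks by rate per slot. The function name `allocate_switch_slots` and the example numbers are hypothetical.

```python
from typing import Dict, Tuple


def allocate_switch_slots(locks: Dict[str, Tuple[float, int]],
                          capacity: int) -> Dict[str, int]:
    """Greedy fractional knapsack over switch memory slots.

    `locks` maps a lock name to (request_rate, slots_needed). Locks are
    served in decreasing order of rate per slot; the last lock that fits
    only partially receives the remaining slots, and its other requests
    are left to the lock servers.
    """
    allocation: Dict[str, int] = {}
    remaining = capacity
    ranked = sorted(locks.items(),
                    key=lambda item: item[1][0] / item[1][1],
                    reverse=True)
    for name, (_rate, slots_needed) in ranked:
        if remaining <= 0:
            break
        granted = min(slots_needed, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


if __name__ == "__main__":
    # Hypothetical locks: (requests per second, slots needed under contention).
    locks = {"a": (100.0, 2), "b": (90.0, 10), "c": (30.0, 1)}
    print(allocate_switch_slots(locks, capacity=5))
```

In the usage example, lock `a` and lock `c` fit entirely on the switch, while lock `b` receives only the two slots that remain. The partially placed lock is the case that "handling overflow" has to cover.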

Page 47: Practical Memory Disaggregation

48

Single µs Latency

20X lower latency for TPC-C over DSLR

Page 48: Practical Memory Disaggregation

Billions of Locks/Sec

49

18X higher throughput for TPC-C over DSLR

Page 49: Practical Memory Disaggregation

What is Practical Memory Disaggregation?

1. Applicability
2. Scalability
3. Efficiency
4. Performance
5. Isolation
6. Resilience
7. Security
8. Generality
9. …

Page 50: Practical Memory Disaggregation

51

Memory Disaggregation

Wide-Area Computing

AI/ML Systems

Big Data Systems

Page 51: Practical Memory Disaggregation

Network-Informed Data Systems Design

52

PhD Students
Juncheng Gu, Jie You, Yiwen Zhang, Fan Lai, Jiachen Liu, Hasan Al Maruf, Peifeng Yu

Undergraduate & Master’s
Chris Chen, Yinwei Dai, Shuoren Fu, Songyuan Guan, Jack Kosaian, Qinye Li, Yang Liu, Yuze Lou, Alexander Neben, Wenting Tan, Yue Tan, Kaiwei Tu, Yuchen Wang, Yujia Xie, Yilei Xu, Jiaxing Yang, Yiwei Zhang, Jiangchen Zhu, Jingyuan Zhu, Xiangfeng Zhu

Collaborators
Aditya Akella, Ganesh Ananthanarayanan, Wei Bai, Vladimir Braverman, Shuchi Chawla, Kai Chen, Li Chen, Asaf Cidon, Yanhui Geng, Ali Ghodsi, Ayush Goel, Robert Grandl, Chuanxiong Guo, Matan Hamilis, Anthony Huang, Anand P. Iyer, Myeongjae Jeon, Xin Jin, Samir Khuller, Tan N. Le, Youngmoon Lee, Li Erran Li, Hongqiang Liu, Zhenhua Liu, Harsha V. Madhyastha, Kshiteej Mahajan, Barzan Mozafari, Linh Nguyen, Aurojit Panda, Manish Purohit, Junjie Qian, Kannan Ramchandran, K. V. Rashmi, Kang G. Shin, Scott Shenker, Brent Stephens, Ion Stoica, Xiao Sun, Muhammed Uluyol, Shivaram Venkataraman, Carl Waldspurger, Hongyi Wang, Jingfeng Wu, Sheng Yang, Bairen Yi, Dong Young Yoon, Zhuolong Yu, Hong Zhang, Junxue Zhang, Yuhong Zhong, Yibo Zhu