
1

Providing Atomic Sector Updates in Software for Persistent Memory

Vishal Verma

[email protected]

Vault 2015

2

Introduction

The Block Translation Table

Read and Write Flows

Synchronization

Performance/Efficiency

BTT vs. DAX

3

NVDIMMs and Persistent Memory

● NVDIMMs are byte-addressable

● We won't talk of “Total System Persistence”

● But using persistent memory DIMMs for storage

● Drivers to present this as a block device - “pmem”

[Figure: storage hierarchy: CPU caches, DRAM, Persistent Memory, Traditional Storage. Speed increases toward the top of the hierarchy, capacity toward the bottom]

4

Problem Statement

• Byte addressability is great

– But not for writing a sector atomically

[Figure: a userspace write() goes to the 'pmem' driver (/dev/pmem0), which memcpy()s the data into blocks on NVDIMM0]
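To make the problem concrete, here is a sketch of what the pmem write path boils down to (illustrative only, not the driver's actual code; dst stands in for a mapping of the device):

```c
/* Sketch only: why memcpy() to persistent memory is not sector-atomic.
 * 'dst' stands for a hypothetical mapping of the pmem device. */
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

void pmem_write_sector(uint8_t *dst, const uint8_t *src)
{
        /* This compiles to a long series of ordinary stores. Power can
         * fail between any two of them, leaving the sector part old data
         * and part new data, while the NVDIMM's ECC stays valid the whole
         * time, so nothing flags the sector as torn on the next read. */
        memcpy(dst, src, SECTOR_SIZE);
}
```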

5

Problem Statement

• On a power failure, there are three possibilities

1. No blocks are torn (common on modern drives)

2. A block was torn, but reads back with an ECC error

3. A block was torn, but reads back without an ECC error (very rare on modern drives)

• With pmem, we use memcpy()

– ECC is correct between two stores

– Torn sectors will almost never trigger ECC on the NVDIMM

– Case 3 becomes most common!

– Only file systems with data checksums will survive this case

6

Naive solution

• Full Data Journaling

• Write every block to the journal first

• 2x latency

• 2x media wear

7

Slightly better solution

• Maintain an 'on-disk' indirection table and an in-memory free block list

• The map/indirection table has LBA -> actual block offset mappings

• New writes grab a block from free list

• On completing the write, atomically swap the free list entry and the map entry

[Figure: NVDIMM state before the write]
Map: 0 → 42, 1 → 5050, 2 → 314, 3 → 3
Free List: 0, 2, 12
Data blocks: 42 = LBA 0, 3 = LBA 3, 314 = LBA 2, 0 = free
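A minimal sketch of this scheme (illustrative userspace code, not the driver; media_write() is an assumed helper that copies one block to the media):

```c
#include <stdint.h>

#define NBLOCKS 4096              /* illustrative sizes */
#define NFREE   64

static uint64_t map[NBLOCKS];     /* on-media: LBA -> actual block offset */
static uint64_t free_list[NFREE]; /* in-memory reserve of free blocks */
static unsigned int free_top;     /* free-list slot for this write
                                     (selection/management elided) */

void media_write(uint64_t blk, const void *buf);   /* assumed helper */

void indirect_write(uint64_t lba, const void *buf)
{
        uint64_t new_blk = free_list[free_top];  /* grab a free block */
        uint64_t old_blk = map[lba];             /* current mapping */

        media_write(new_blk, buf);               /* data to the free block */

        /* The single 64-bit map store is the commit point: before it,
         * readers see the old block; after it, the new one. */
        __atomic_store_n(&map[lba], new_blk, __ATOMIC_RELEASE);

        free_list[free_top] = old_blk;  /* swap: displaced block is now free */
}
```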

8

Slightly better solution

• Maintain an 'on-disk' indirection table and an in-memory free block list

• The map/indirection table has LBA -> actual block offset mappings

• New writes grab a block from free list

• On completing the write, atomically swap the free list entry and the map entry

[Figure: NVDIMM state during write( to LBA 3 ): the new data is being written to block 0, taken from the free list]
Map: 0 → 42, 1 → 5050, 2 → 314, 3 → 3 (unchanged so far)
Free List: 0, 2, 12
Data blocks: 42 = LBA 0, 314 = LBA 2, 3 = LBA 3 (old data), 0 = free (new data landing here)

9

Slightly better solution

• Maintain an 'on-disk' indirection table and an in-memory free block list

• The map/indirection table has LBA -> actual block offset mappings

• New writes grab a block from free list

• On completing the write, atomically swap the free list entry and the map entry

[Figure: NVDIMM state after the atomic swap: map[3] now points at block 0, and block 3 has moved to the free list]
Map: 0 → 42, 1 → 5050, 2 → 314, 3 → 0
Free List: 3, 2, 12
Data blocks: 42 = LBA 0, 3 = free, 314 = LBA 2, 0 = LBA 3

10

Slightly better solution

• Easy enough to implement

• Should be performant

• Caveat:

– The only way to recreate the free list is to read the entire map

– Consider a 512 GB volume with bs=512 => 1073741824 (2^30) map entries to read

– Map entries have to be 64-bit, so we end up reading 8 GB at startup

– Could save the free list to media on clean shutdown

– But...clunky at best
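The arithmetic, spelled out (a trivial sketch):

```c
#include <stdio.h>

int main(void)
{
        unsigned long long volume  = 512ULL << 30;  /* 512 GB volume */
        unsigned long long bs      = 512;           /* 512 B blocks */
        unsigned long long entries = volume / bs;   /* map entries to scan */
        unsigned long long bytes   = entries * 8;   /* 64-bit map entries */

        /* prints: 1073741824 map entries, 8 GB read at startup */
        printf("%llu map entries, %llu GB read at startup\n",
               entries, bytes >> 30);
        return 0;
}
```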

11

Introduction

The Block Translation Table

Read and Write Flows

Synchronization

Performance/Efficiency

BTT vs. DAX

12

The Block Translation Table

• nfree: The number of free blocks in reserve

• Flog: Portmanteau of free list + log

– Has nfree entries

– Each entry has two 'slots' that 'flip-flop'

– Each slot has: the block being written, the old mapping, the new mapping, and a sequence number

• Info block: Info about an arena - offsets, lbasize, etc.

• External LBA: LBA as visible to upper layers

• ABA: Arena Block Address - block offset within an arena

• Premap/Postmap ABA: The block offset into the data area as seen prior to / after indirection through the map

[Figure: the backing store is divided into 512 GB arenas (Arena 0, Arena 1, ...). Each arena contains: Arena Info Block (4K), Data Blocks (including nfree reserved blocks), BTT Map, BTT Flog (8K), Info Block Copy (4K)]
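Put into a struct, a flog slot looks roughly like this (a sketch; field widths are illustrative, the BTT on-media spec defines the exact layout):

```c
#include <stdint.h>

/* One flog slot, holding the fields described above (sketch) */
struct flog_entry {
        uint32_t lba;      /* premap ABA being written */
        uint32_t old_map;  /* old postmap ABA: the block being displaced */
        uint32_t new_map;  /* new postmap ABA: the block just written */
        uint32_t seq;      /* sequence number, written last */
};

/* Each of the nfree flog entries has two slots that flip-flop;
 * the slot with the newer sequence number is the current one */
struct flog {
        struct flog_entry slots[2];
};
```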

13

What's in a lane?

• The idea of “lanes” is purely logical

• num_lanes = min(num_cpus, nfree)

• lane = cpu % num_lanes

• If num_cpus > num_lanes, we need locking on lanes

– But if not, we can simply preempt_disable() and need not take a lock

[Figure: CPU 0, CPU 1, and CPU 2 each call get_lane() and are assigned Lane 0, Lane 1, and Lane 2]

Free List (indexed by lane):
lane   blk   seq    slot
0      2     0b10   0
1      6     0b10   1
2      14    0b01   0

Flog (two slots per lane; XX = stale slot):
lane   slot 0 (LBA, old, new, seq)    slot 1 (LBA, old, new, seq)
0      5, 32, 2, 0b10                 XX, XX, XX, XX
1      XX, XX, XX, XX                 8, 38, 6, 0b10
2      42, 42, 14, 0b01               XX, XX, XX, XX

Map: 5 → 2, 8 → 6, 42 → 14
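A sketch of the lane logic in userspace terms (pthread spinlocks stand in for the kernel's per-lane locks and preempt_disable(); the struct is an assumption, not the driver's):

```c
#include <pthread.h>

struct btt {
        unsigned int num_cpus;
        unsigned int num_lanes;          /* min(num_cpus, nfree) */
        pthread_spinlock_t *lane_locks;  /* one per lane */
};

static unsigned int get_lane(struct btt *btt, unsigned int cpu)
{
        unsigned int lane = cpu % btt->num_lanes;

        /* Two CPUs can land on the same lane only when there are more
         * CPUs than lanes; only then is the lock needed. (With one lane
         * per CPU, disabling preemption is enough.) */
        if (btt->num_cpus > btt->num_lanes)
                pthread_spin_lock(&btt->lane_locks[lane]);

        return lane;
}

static void put_lane(struct btt *btt, unsigned int lane)
{
        if (btt->num_cpus > btt->num_lanes)
                pthread_spin_unlock(&btt->lane_locks[lane]);
}
```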

14

Introduction

The Block Translation Table

Read and Write Flows

Synchronization

Performance/Efficiency

BTT vs. DAX

15

BTT – Reading a block

• Convert external LBA to Arena number + pre-map ABA

• Get a lane (and take lane_lock if needed)

• Read map to get the mapping

• If ZERO flag is set, return zeroes

• If ERROR flag is set, return an error

• Read data from the block that the map points to

• Release lane (and lane_lock)

[Figure: read() of LBA 5 on CPU 0 / Lane 0: the map gives 5 → 10, data is read from block 10, then Lane 0 is released]
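The read flow as a compilable sketch (get_lane()/put_lane(), map_read(), media_read(), and the flag-bit encoding are assumptions standing in for the driver's internals):

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MAP_ZERO     (1u << 31)    /* illustrative flag bits in a map entry */
#define MAP_ERROR    (1u << 30)
#define MAP_ABA_MASK 0x3fffffffu

extern uint32_t map_read(uint32_t premap_aba);            /* assumed */
extern void media_read(uint32_t postmap_aba, void *buf);  /* assumed */
extern unsigned int get_lane(unsigned int cpu);           /* assumed */
extern void put_lane(unsigned int lane);

int btt_read(unsigned int cpu, uint32_t premap_aba, void *buf, size_t len)
{
        unsigned int lane = get_lane(cpu);   /* takes lane_lock if needed */
        uint32_t entry = map_read(premap_aba);
        int ret = 0;

        if (entry & MAP_ZERO)
                memset(buf, 0, len);         /* block was never written */
        else if (entry & MAP_ERROR)
                ret = -EIO;                  /* block is known bad */
        else
                media_read(entry & MAP_ABA_MASK, buf);  /* follow the map */

        put_lane(lane);
        return ret;
}
```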

16

BTT – Writing a block

• Convert external LBA to Arena number + pre-map ABA

• Get a lane (and take lane_lock if needed)

• Use lane to index into free list, write data to this free block

• Read map to get the existing mapping

• Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq]

• Write new post-map ABA into map.

• Write old post-map entry into the free list

• Calculate next sequence number and write into the free list entry

• Release lane (and lane_lock)

[Figure: write() of LBA 5 on CPU 0 / Lane 0. Starting state: Free List[0] = {blk 2, seq 0b10, slot 0}; old map: 5 → 10. Sequence: write data to block 2; flog[0][0] = {5, 10, 2, 0b10}; map[5] = 2 (map is now 5 → 2); free[0] = {10, 0b11, 1}; release Lane 0]
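And the write flow as a sketch under the same assumptions (helper names and struct layout are illustrative, not the driver's internals):

```c
#include <stdint.h>

struct free_entry {
        uint32_t blk;    /* this lane's free block */
        uint32_t seq;    /* seq to use for the next flog write */
        uint32_t slot;   /* which flog slot to write next */
};

extern struct free_entry free_list[];                      /* per lane */
extern uint32_t map_read(uint32_t premap_aba);             /* assumed */
extern void map_write(uint32_t premap_aba, uint32_t aba);  /* assumed */
extern void media_write(uint32_t postmap_aba, const void *buf);
extern void flog_write(unsigned int lane, uint32_t slot, uint32_t lba,
                       uint32_t old_map, uint32_t new_map, uint32_t seq);
extern unsigned int get_lane(unsigned int cpu);
extern void put_lane(unsigned int lane);

/* seq cycles 01 -> 10 -> 11 -> 01; 00 never appears in a valid entry */
static uint32_t next_seq(uint32_t seq)
{
        return (seq == 3) ? 1 : seq + 1;
}

void btt_write(unsigned int cpu, uint32_t premap_aba, const void *buf)
{
        unsigned int lane = get_lane(cpu);  /* takes lane_lock if needed */
        struct free_entry *fe = &free_list[lane];
        uint32_t new_blk = fe->blk;
        uint32_t old_blk;

        media_write(new_blk, buf);          /* data goes to the free block */
        old_blk = map_read(premap_aba);     /* the mapping being replaced */

        /* Log first, then commit: on recovery the flog is trusted
         * because it is always written before the map. */
        flog_write(lane, fe->slot, premap_aba, old_blk, new_blk, fe->seq);
        map_write(premap_aba, new_blk);     /* the commit point */

        /* The displaced block becomes this lane's free block */
        fe->blk = old_blk;
        fe->seq = next_seq(fe->seq);
        fe->slot ^= 1;                      /* flip-flop between flog slots */
        put_lane(lane);
}
```

Running this against the slide's example reproduces it exactly: starting from free[0] = {2, 0b10, 0}, a write to LBA 5 yields flog[0][0] = {5, 10, 2, 0b10}, map[5] = 2, and free[0] = {10, 0b11, 1}.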

17

BTT – Analysis of a write

[Figure: the write sequence from the previous slide, annotated with arrows at each point where it could be interrupted]

• Opportunities for interruption/power failure (the following slides examine them case by case)

18

BTT – Analysis of a write

[Figure: same write sequence; power fails before or during the initial data write to the free block]

• On reboot:

– No on-disk change had happened, everything comes back up as normal

19

BTT – Analysis of a write

[Figure: same write sequence; power fails after the data write but before the flog entry is written]

• On reboot:

– Map hasn't been updated

– Reads will continue to get the 5 → 10 mapping

– Flog will still show '2' as free and ready to be written to

20

BTT – Analysis of a write

[Figure: same write sequence; power fails after the flog write but before the map update]

• On reboot:

– Read flog[0][0] = {5, 10, 2, 0b10}

– Flog claims map[5] should have been '2', but map[5] is still '10' (== flog.old)

– Since flog and map disagree, recovery routine detects an incomplete transaction

– Flog is assumed to be “true” since it is always written before the map

– Recovery routine completes the transaction by updating map[5] = 2; free[0] = 10
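A sketch of that recovery check (newest_flog_entry() and the map/free-list helpers are assumed names, not the driver's):

```c
#include <stdint.h>

struct flog_entry { uint32_t lba, old_map, new_map, seq; };

extern struct flog_entry *newest_flog_entry(unsigned int lane);  /* assumed */
extern uint32_t map_read(uint32_t premap_aba);                   /* assumed */
extern void map_write(uint32_t premap_aba, uint32_t aba);
extern void free_list_init(unsigned int lane, uint32_t blk);

void btt_recover_lane(unsigned int lane)
{
        struct flog_entry *e = newest_flog_entry(lane);

        /* If the map still shows the old mapping, power failed between
         * the flog write and the map write: roll the transaction forward. */
        if (map_read(e->lba) == e->old_map)
                map_write(e->lba, e->new_map);

        /* Either way, the displaced block is this lane's free block */
        free_list_init(lane, e->old_map);
}
```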

21

BTT – Analysis of a write

[Figure: same write sequence; the flog write itself is torn by the power failure]

• Special case: the flog write is torn

• On reboot:

– Read flog[0][0] = {5, 10, X, 0b11}; flog[0][1] = {X, X, X, 0b01}

– Since seq is written last, the half-written flog entry does not show up as “new”

– Free list is reconstructed using the newest non-torn flog entry (flog[0][1] in this case)

– map[5] remains '10', and '2' remains free.

Bit sequence for flog.seq: 01 → 10 → 11 → 01 (a slot's position in this cycle tells old from new)
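A small helper, as a sketch, for picking the newer of a flog pair's two slots under this sequence-number cycle:

```c
#include <stdint.h>

/* Return 0 or 1: the slot whose seq is newer in the 01 -> 10 -> 11 -> 01
 * cycle. seq 00 marks an unused slot and never wins. (Sketch only.) */
static int newer_slot(uint32_t seq0, uint32_t seq1)
{
        static const uint32_t next[] = { 0, 2, 3, 1 };  /* cycle successor */

        if (seq0 == 0)
                return 1;
        if (seq1 == 0)
                return 0;
        /* slot 1 is newer exactly when its seq is the successor of slot 0's */
        return (seq1 == next[seq0 & 3]) ? 1 : 0;
}
```

For the slide's example, newer_slot(0b11, 0b01) returns 1: seq 0b01 follows 0b11 in the cycle, so flog[0][1] is the entry to trust.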

22

BTT – Analysis of a write

[Figure: same write sequence; power fails after both the flog and the map have been updated]

• On reboot:

– Since both flog and map were updated, free list reconstruction will happen as usual

23

Introduction

The Block Translation Table

Read and Write Flows

Synchronization

Performance/Efficiency

BTT vs. DAX

24

Let's Race! Write vs. Write

CPU 1                          CPU 2
write LBA 0                    write LBA 0
get-free[1] = 5                get-free[2] = 6
write data - postmap ABA 5     write data - postmap ABA 6
...                            ...
read old_map[0] = 10           read old_map[0] = 10
write log 0/10/5/xx            write log 0/10/6/xx
write map = 5                  write map = 6
write free[1] = 10             write free[2] = 10


26

Let's Race! Write vs. Write

CPU 1                          CPU 2
write LBA 0                    write LBA 0
get-free[1] = 5                get-free[2] = 6
write data - postmap ABA 5     write data - postmap ABA 6
...                            ...
read old_map[0] = 10           read old_map[0] = 10       \
write log 0/10/5/xx            write log 0/10/6/xx         | Critical
write map = 5                  write map = 6               | section
write free[1] = 10             write free[2] = 10         /

Both CPUs read old_map[0] = 10, so both free lists end up holding block 10: one of the freshly written blocks leaks, and block 10 is handed out twice.

27

Let's Race! Write vs. Write

● Solution: An array of map_locks, indexed by a hash of the premap ABA

CPU 1                                             CPU 2
write LBA 0; get-free[1] = 5; write_data to 5     write LBA 0; get-free[2] = 6; write_data to 6
lock map_lock[0 % nfree]
read old_map[0] = 10
write log 0/10/5/xx; write map = 5; free[1] = 10
unlock map_lock[0 % nfree]                        lock map_lock[0 % nfree]
                                                  read old_map[0] = 5
                                                  write log 0/5/6/xx; write map = 6; free[2] = 5
                                                  unlock map_lock[0 % nfree]
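A sketch of this locking scheme (the lock array size and helper names are assumptions, not the driver's):

```c
#include <pthread.h>
#include <stdint.h>

#define NFREE 256   /* illustrative; the slide hashes by premap ABA % nfree */

static pthread_mutex_t map_locks[NFREE];

extern uint32_t map_read(uint32_t premap_aba);             /* assumed */
extern void map_write(uint32_t premap_aba, uint32_t aba);  /* assumed */
extern void flog_write(unsigned int lane, uint32_t lba,
                       uint32_t old_map, uint32_t new_map);
extern void free_list_set(unsigned int lane, uint32_t blk);

/* The metadata phase of a write, serialized per premap ABA */
void btt_write_meta(unsigned int lane, uint32_t premap_aba, uint32_t new_blk)
{
        pthread_mutex_t *ml = &map_locks[premap_aba % NFREE];

        pthread_mutex_lock(ml);
        /* Re-reading the map under the lock is what closes the race:
         * the second writer now sees the first writer's new mapping. */
        uint32_t old_blk = map_read(premap_aba);

        flog_write(lane, premap_aba, old_blk, new_blk);
        map_write(premap_aba, new_blk);
        free_list_set(lane, old_blk);
        pthread_mutex_unlock(ml);
}
```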

28

Let's Race! Read vs. Write

CPU 1 (Reader)                     CPU 2 (Writer)
read LBA 0                         write LBA 0
...                                get-free[2] = 6
read map[0] = 5                    write data to postmap block 6
start reading postmap block 5      write meta: map[0] = 6, free[2] = 5
...                                another write LBA 12
...                                get-free[2] = 5
...                                write data to postmap block 5
finish reading postmap block 5

BUG! - writing a block that is being read from

● This doesn't corrupt the on-disk layout, but the read appears torn

29

Let's Race! Read vs. Write

CPU 1 (Reader)                     CPU 2 (Writer)
read LBA 0                         write LBA 0
read map[0] = 5                    get-free[2] = 6; write data
write rtt[1] = 5                   write meta: map[0] = 6, free[2] = 5
start reading postmap block 5      another write LBA 12
...                                get-free[2] = 5
...                                scan RTT - '5' is present - wait!
finish reading postmap block 5     ...
clear rtt[1]                       ...
                                   write data to postmap block 5

● Solution: A Read Tracking Table (RTT) indexed by lane, tracking in-progress reads
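A sketch of the RTT idea (sizes, names, and the spin-wait are illustrative, not the driver's implementation):

```c
#include <stdint.h>

#define RTT_INVALID 0xffffffffu
#define NUM_LANES   64          /* illustrative */

/* One entry per lane; assume all entries start out as RTT_INVALID
 * (initialization elided in this sketch) */
static volatile uint32_t rtt[NUM_LANES];

/* Reader: publish the postmap block before reading it, clear when done */
static void rtt_begin_read(unsigned int lane, uint32_t postmap_aba)
{
        rtt[lane] = postmap_aba;
}

static void rtt_end_read(unsigned int lane)
{
        rtt[lane] = RTT_INVALID;
}

/* Writer: before writing data into a newly allocated free block, wait
 * until no in-flight read is still using that block */
static void rtt_wait_for_readers(uint32_t free_blk)
{
        for (unsigned int i = 0; i < NUM_LANES; i++)
                while (rtt[i] == free_blk)
                        ;   /* spin until the reader clears its entry */
}
```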

30

Introduction

The Block Translation Table

Read and Write Flows

Synchronization

Performance/Efficiency

BTT vs. DAX

31

That's Great...but is it Fast?

● Overall, the BTT introduces a ~10% performance overhead

● We think there is still room for improvement

                        512B block size    4K block size
Write Amplification     ~4.6% [536B]       ~0.5% [4120B]
Capacity Overhead       ~0.8%              ~0.1%

32

Introduction

The Block Translation Table

Read and Write Flows

Synchronization

Performance/Efficiency

BTT vs. DAX

33

BTT vs. DAX

● DAX stands for Direct Access

● Patchset by Matthew Wilcox, merged into 4.0-rc1

● Allows mapping a pmem range directly into userspace via mmap

● DAX is fundamentally incompatible with the idea of BTT

● If the application is aware of persistent, byte-addressable memory, and can use it to its advantage, DAX is the best path for it

● If the application relies on atomic sector update semantics, it must use the BTT

– It may not know that it relies on this

● XFS relies on journal updates being sector atomic

– For xfs-dax, we'd need to use logdev=/dev/[btt-partition]

34

Resources

● http://pmem.io - General persistent memory resources. Focuses on the NVML, a library to make persistent memory programming easier

● The 'pmem' driver on github: https://github.com/01org/prd

● linux-nvdimm mailing list: https://lists.01.org/mailman/listinfo/linux-nvdimm

● linux-nvdimm patchwork: https://patchwork.kernel.org/project/linux-nvdimm/list/

● #pmem on OFTC

Q & A