File systems for persistent memory

File systems for persistent memory CS 839 - Persistence

Transcript of File systems for persistent memory

Page 1: File systems for persistent memory

File systems for persistent memory (CS 839 - Persistence)

Page 2: File systems for persistent memory

Questions on homework?

• Can we shift the schedule and do BPFS on Thursday and Nova on Monday? Drop Aerie or SplitFS.

Page 3: File systems for persistent memory

Learning outcomes

• Understand how disk-based file systems update metadata and handle consistency

• Understand the properties of NVM that can change file system design

• Understand the key ordering requirements for file systems

• Understand BPFS software and hardware mechanisms and their limitations

Page 4: File systems for persistent memory

Background story

• PCM is becoming popular, first for main memory

• Obvious approach seems to be to use it for file systems too

• Question: how do you optimize?

Page 5: File systems for persistent memory

Background: normal file systems

• Use page cache to buffer data in DRAM

• Access SSD through block layer

• Use logging for consistency

Page 6: File systems for persistent memory

Background: FS data structures

• Standard FS data structures

  • Superblock: describes FS parameters, location of root inode

  • Inode: metadata for a single file

    • Attributes, size, location of data blocks

    • Inode number

  • Data block: holds file or directory contents

  • Directory entry: string name and inode number

  • Inode and data block bitmaps: track free/used locations on storage

  • Indirect block: location of other data blocks or indirect blocks

Page 7: File systems for persistent memory

Background: FS consistency

• What gets updated when appending to a file?

  • Allocate block from data bitmap

  • Write data to data block

  • Write block address to inode or indirect block

  • Update file length & modification time in inode

• What happens if the system crashes in the middle?

Page 8: File systems for persistent memory

Background: FS consistency mechanisms

• Journaling: write metadata (and/or data) to a journal before writing it in place – redo logging

  • Write journal, force to storage

  • Later checkpoint – write metadata/data in real place

  • Can skip data journaling for performance

• Shadow updates: write data/metadata updates to a new location (used in BPFS)

  • Basically copy-on-write data structures
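The redo-logging scheme above can be sketched in a few lines. This is a toy model, not any real journal format: `storage` stands for the durable medium, and a crash is simulated by skipping the in-place checkpoint after the journal record is forced.

```python
# Toy redo-logging model: write the journal record first, force it,
# then checkpoint the update in place. After a crash, replaying the
# journal reproduces any in-place writes that were lost.

storage = {"journal": [], "blocks": {}}

def journaled_update(block, data, crash_before_checkpoint=False):
    storage["journal"].append((block, data))  # write journal, force to storage
    if crash_before_checkpoint:
        return                                # crash: in-place write never happens
    storage["blocks"][block] = data           # checkpoint: write in real place

def recover():
    for block, data in storage["journal"]:    # redo: replay committed records
        storage["blocks"][block] = data
    storage["journal"].clear()

journaled_update("A", "new-A")
journaled_update("B", "new-B", crash_before_checkpoint=True)
recover()
```

After recovery, the lost checkpoint of B is replayed from the journal, which is why the scheme is reliable even though all data is written twice.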

Page 9: File systems for persistent memory

Review 1: Journaling

• Write to journal, then write to file system

[Figure: blocks A’ and B’ are written to the journal first, then copied to A and B in the file system]

• Reliable, but all data is written twice

Page 10: File systems for persistent memory

Review 2: Shadow Paging

• Use copy-on-write up to root of file system

[Figure: blocks A and B are copied to new blocks A’ and B’, then the file’s root pointer is switched to the copies]

• Any change requires bubbling to the FS root

• Small writes require large copying overhead
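A minimal sketch of the copy-on-write idea, using a two-level dict as a stand-in for the FS tree (the node names are made up for illustration): the leaf is copied, every ancestor up to the root is copied, and installing the new root is the single commit point.

```python
# Toy shadow-paging update: never modify a live node. Copy the leaf,
# copy every ancestor up to the root, then commit by installing the
# new root pointer. Untouched subtrees are shared, not copied.

old_root = {"L": {"A": "a0", "B": "b0"}, "R": {"C": "c0"}}

def cow_update(root, subtree, key, value):
    new_subtree = dict(root[subtree])  # copy the leaf node
    new_subtree[key] = value
    new_root = dict(root)              # the copy bubbles up to the root
    new_root[subtree] = new_subtree
    return new_root                    # committing = installing this root

new_root = cow_update(old_root, "L", "A", "a1")
```

Note that `old_root` is untouched (a crash mid-update leaves the old tree intact), while the unmodified right subtree is shared between both roots, which is exactly why small writes still pay the copying cost only along one root-to-leaf path.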

Page 11: File systems for persistent memory

Atomicity requirements

• What happens when you crash while writing data to a file?

  1. Entire write takes place or none takes place

  2. Some blocks may be written entirely but not all

  3. Arbitrary bytes of the file may be replaced

• What do normal file systems do?

  • “Torn write” – partially written block

  • Data vs. metadata journaling

Page 12: File systems for persistent memory

Basic idea: RAM disk

• Idea 1: RAM disk

  • Make a block device that accesses NVM instead of going to a device

  • BTT: block translation table, uses shadow updates to allow atomic block-sized writes

• Problems:

  • Still copy data to DRAM – inefficient

  • All writes are block sized – inefficient
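The BTT's shadow-update trick can be sketched as follows. This mirrors the idea, not the real Linux BTT on-media format: a logical write goes to a free physical block, and the (assumed-atomic) table-entry update is the commit point, so a torn physical write never becomes visible.

```python
# Toy block-translation-table (BTT) model: writes go to a free
# physical block; flipping the logical->physical mapping commits the
# write atomically, so readers never see a torn block.

physical = ["old-0", "old-1", None, None]  # physical blocks on NVM
table = {0: 0, 1: 1}                       # logical block -> physical block
free_list = [2, 3]

def btt_write(logical, data):
    target = free_list.pop()               # pick a free physical block
    physical[target] = data                # may tear; not yet visible
    old = table[logical]
    table[logical] = target                # atomic commit of the mapping
    free_list.append(old)                  # old block is recycled

btt_write(0, "new-0")
```

A crash before the mapping update leaves the old block mapped; a crash after it leaves the new block mapped; neither exposes a partial write.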

Page 13: File systems for persistent memory

What changes with NVM/SCM/PMem?

Page 14: File systems for persistent memory

What changes with NVM/SCM/PMem?

• Fine-grained writes

  • Don’t have to write entire blocks when updating a single value

• Fast random access

  • Don’t need to optimize metadata for sequential extents

• No buffering

  • Can serve data directly from memory

• But:

  • Loss of ordering

Page 15: File systems for persistent memory

Short-Circuit Shadow Paging

• Uses byte-addressability and atomic 64-bit writes

[Figure: blocks A and B are copied to A’ and B’ under the file’s root pointer]

• Inspired by shadow paging

  • Optimization: in-place update when possible

Page 16: File systems for persistent memory

Opt. 1: In-Place Writes

• Aligned 64-bit writes are performed in place

  • Data and metadata

[Figure: an in-place write to a data block, with no change to the file’s root pointer]

Page 17: File systems for persistent memory

• Appends committed by updating file size

[Figure: an in-place append past end-of-file, committed by a single update of the file size through the file’s root pointer]

Opt. 2: Exploit Data-Metadata Invariants
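The invariant being exploited is that bytes past the recorded file size are garbage by definition, so they can be written in place without being visible. A toy sketch (the `File` class is invented for illustration; the real commit is an atomic 64-bit store of the size field):

```python
# Toy append commit: bytes past the recorded file size are invisible,
# so they can be written in place; the single size update then commits
# the whole append atomically.

class File:
    def __init__(self):
        self.blocks = bytearray(16)  # pre-allocated data space
        self.size = 0                # the only commit point

    def append(self, data, crash_before_size_update=False):
        end = self.size + len(data)
        self.blocks[self.size:end] = data  # in-place write past EOF
        if crash_before_size_update:
            return                         # crash: append never happened
        self.size = end                    # atomic size update commits

    def read(self):
        return bytes(self.blocks[:self.size])

f = File()
f.append(b"abcd")
f.append(b"WXYZ", crash_before_size_update=True)
```

After the simulated crash, the second append's bytes are physically present but logically invisible, so the file reads back exactly as before the crashed append.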

Page 18: File systems for persistent memory

BPFS Example

[Figure: BPFS tree – root pointer, inodes, indirect blocks, and directory/file data blocks, annotated with “add entry” and “remove entry” operations]

• Cross-directory rename bubbles to common ancestor

Page 19: File systems for persistent memory

What happens if you memory-map a file?

Page 20: File systems for persistent memory

Consistent updates

• Rely on hardware for 1-word atomic update

➢ CPU cache may reorder writes to NVM

  • Breaks “crash-consistent” update protocols

[Figure: after STORE value = 0xC02; STORE valid = 1, the write-back cache holds value = 0xC02 and valid = 1, but NVM may still hold value = 0xDEADBEEF when valid = 1 arrives]

Page 21: File systems for persistent memory

Primitive operation: ordering writes

• Why?

  • Ensures ability to commit a change

• How?

  • Flush – MOVNTQ/CLFLUSH

  • Fence – MFENCE

• Inefficiencies:

  • Removes recent data from cache

[Figure: STORE value = 0xC02; FLUSH(&value); FENCE; STORE valid = 1 – the flush and fence force value to NVM before valid is set]
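The hazard and the fix can be simulated. This is a toy write-back cache, not real CLFLUSH/MFENCE semantics: a "crash" writes back an arbitrary subset of dirty lines, and the flush-and-fence step forces `value` to NVM before `valid` is ever stored.

```python
# Toy write-back cache: dirty lines may reach NVM in any order, so the
# commit protocol must flush `value` and fence before setting `valid`.
import random

def run(use_flush):
    nvm = {"value": 0xDEADBEEF, "valid": 0}
    cache = {}

    def store(key, val):
        cache[key] = val               # store lands in the write-back cache

    def flush(key):
        nvm[key] = cache.pop(key)      # force one cache line to NVM

    store("value", 0xC02)
    if use_flush:
        flush("value")                 # FLUSH(&value); FENCE
    store("valid", 1)

    # Crash: an arbitrary subset of dirty lines gets written back.
    for key in random.sample(list(cache), random.randint(0, len(cache))):
        nvm[key] = cache[key]
    return nvm

# Without flush+fence, valid=1 can persist while value is still stale.
bad_possible = any(run(False) == {"value": 0xDEADBEEF, "valid": 1}
                   for _ in range(500))

# With flush+fence, valid=1 implies value already reached NVM.
safe = True
for _ in range(500):
    state = run(True)
    if state["valid"] == 1 and state["value"] != 0xC02:
        safe = False
```

The inefficiency the slide points out is visible here too: the flush evicts `value` from the cache even though it was just written and is likely to be read again.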

Page 22: File systems for persistent memory

Ordering in BPFS

[Figure: copy-on-write updates followed by a commit, flowing from the L1/L2 caches to BPRAM – the commit must not reach BPRAM before the CoW writes]

Page 23: File systems for persistent memory

Atomicity in BPFS

[Figure: copy-on-write updates followed by a commit in the L1/L2 caches and BPRAM – the commit itself is a single atomic write]

Page 24: File systems for persistent memory

Enforcing Ordering and Atomicity

• Ordering

  • Solution: epoch barriers to declare constraints

  • Faster than write-through

  • Important hardware primitive (cf. SCSI TCQ)

• Atomicity

  • Solution: capacitor on DIMM

  • Simple and cheap!

Page 25: File systems for persistent memory

Intel x86 flush mechanism

[Figure: a store to A is persisted by ST A; CLWB A; SFENCE – the SFENCE commits only after the ACK that A’s cache line reached memory]

ST A
ST B
CLWB A
CLWB B
SFENCE
ST C
CLWB C
SFENCE

Page 26: File systems for persistent memory

Intel x86 flush mechanism


Drawback 1: No distinction between ordering and durability

Drawback 2: Ordering introduces stalls


ST A

ST B

CLWB A

CLWB B

SFENCE

ST C

CLWB C

SFENCE

Page 27: File systems for persistent memory

Epoch ordering

• Goal:

  • No software flushes – too expensive/complex

  • Ordering is asynchronous – too expensive to stall

• Solution:

  • Persist barriers

Page 28: File systems for persistent memory

Persist barriers: Ordering Fence

• Orders stores preceding the barrier before later stores

[Figure: Thread 1 issues ST A=1; barrier; ST B=2 – in volatile memory order the stores may be reordered, but in persistence order ST A=1 happens before ST B=2]
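The barrier semantics can be made concrete with a small enumeration. This is a toy model under one assumption: stores within an epoch may persist in any order, but no store may persist before any store from an earlier epoch.

```python
# Toy epoch model: enumerate every persistence order the barrier
# semantics allow. Stores within an epoch are unordered; a barrier
# between epochs forbids any cross-epoch reordering.
import itertools

def legal_persistence_orders(epochs):
    """epochs is a list of epochs, each a list of stores; barriers sit
    between consecutive epochs."""
    per_epoch = [itertools.permutations(e) for e in epochs]
    return [sum((list(p) for p in combo), [])
            for combo in itertools.product(*per_epoch)]

# ST A=1; ST B=2; BARRIER; ST C=3
orders = legal_persistence_orders([["A=1", "B=2"], ["C=3"]])
```

With the barrier, only two orders are legal and both persist C=3 last; without it, all six permutations would be possible, including ones where C=3 persists before A=1.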

Page 29: File systems for persistent memory

Ordering Epochs without Flushing

[Figure: CPU 1 with local timestamp 25, incremented to 26 by the barrier; for the sequence ST A=1; ST B=1; LD R1=A; BARRIER; ST A=2, the L1 cache tags A=1 and B=1 with epoch 25 and A=2 with epoch 26]

Page 30: File systems for persistent memory

Ordering and Atomicity with Epoch Barriers

[Figure: CoW writes tagged epoch 1, a barrier, then the commit tagged epoch 2, in the L1/L2 caches above BPRAM – the epoch-2 commit line is ineligible for eviction until epoch 1 has been written back]

Page 31: File systems for persistent memory

Epoch ordering complexity

• When is it safe to let something leave the cache?

  • When all writes from preceding epochs have left already

• What happens if you overwrite something from a preceding epoch?

  • Must flush the earlier epoch first – can’t store multiple versions

• What happens when you access something from another core?

  • Can’t track ordering across cores (epoch numbers across cores aren’t ordered)

  • Old data must be flushed

• How do you implement it efficiently?

  • Store an 8-bit pointer in each cache line to registers holding 8 in-flight epochs

Page 32: File systems for persistent memory

Considerations for epoch ordering

• How complex is it?

• How easy to use is it?

Page 33: File systems for persistent memory

Considerations for epoch ordering

• How complex is it?

  • Need hardware walkers to evict cache lines during cache replacement

• How easy to use is it?

  • Dependencies across volatile variables are not recorded

• Example (two critical sections, on different cores, ordered only by a volatile lock):

  Acquire(vol_lock); X = 1; Y = 2; Release(vol_lock);

  Acquire(vol_lock); A = 4; B = 5; Release(vol_lock);

• Could reboot with Y=2, A=4
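The anomaly can be checked by enumeration. This is a toy crash model under the slide's assumptions: each core's stores sit in one epoch, epochs on different cores are mutually unordered (the volatile lock records nothing about persistence), and within an epoch any subset of stores may have persisted at the crash.

```python
# Toy cross-core crash model: since epoch numbers aren't ordered
# across cores, a crash may persist any subset of each core's
# epoch, independently of the lock ordering.
import itertools

core1 = ["X=1", "Y=2"]  # Acquire(vol_lock); X = 1; Y = 2; Release(vol_lock);
core2 = ["A=4", "B=5"]  # Acquire(vol_lock); A = 4; B = 5; Release(vol_lock);

def possible_crash_states(c1, c2):
    states = set()
    for n1 in range(len(c1) + 1):
        for s1 in itertools.combinations(c1, n1):
            for n2 in range(len(c2) + 1):
                for s2 in itertools.combinations(c2, n2):
                    states.add(frozenset(s1) | frozenset(s2))
    return states

states = possible_crash_states(core1, core2)
```

The enumeration includes the state where only Y=2 and A=4 persisted, even though the lock guaranteed core 2 ran entirely after core 1 in volatile order.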

Page 34: File systems for persistent memory

Microbenchmarks

[Charts: random n-byte write throughput (thousands of operations) and append-n-bytes time (seconds) for n = 8, 64, 512, 4096, comparing NTFS - Disk (durable), NTFS - RAM (not durable), and BPFS - RAM (durable)]

Page 35: File systems for persistent memory

Notes from reviews

• How much performance improvement should we expect?

• How important is using real PCM (or real PCM latency) in evaluation?

• Could we have systems with just PMem and no SSD?

• What journaling mode does NTFS use?

  • Ordered journaling

• Is modifying HW ok?

• Using volatile structures:

  • Free blocks, freed & allocated inode numbers

  • Data freed by CoW operations

  • Dentry cache

Page 36: File systems for persistent memory

How well does it perform?

• Evaluation:

  • Implement in Windows & run over DRAM (no epoch barrier delays)

  • Implement in user mode & run in a simulator

  • Analytical model

• Workloads