
ZEA, A Data Management Approach for SMR

Adam Manzanares


Co-Authors

• Western Digital Research
  – Cyril Guyot, Damien Le Moal, Zvonimir Bandic

• University of California, Santa Cruz
  – Noah Watkins, Carlos Maltzahn

©2016 Western Digital Corporation or affiliates. All rights reserved.


Why SMR?

• HDDs are not going away
  – Exponential growth of data still exists
  – $/TB vs. Flash is still much lower
  – We want to continue this trend!

• Traditional Recording (PMR) is reaching scalability limits
  – SMR is a density enhancing technology being shipped right now.

• Future recording technologies may behave like SMR
  – Write constraint similarities
  – HAMR


Flavors of SMR

• SMR Constraint
  – Writes must be sequential to avoid data loss

• Drive Managed
  – Transparent to the user
  – Comes at the cost of predictability and additional drive resources

• Host Aware
  – Host is aware of SMR working model
  – If user does something “wrong” the drive will fix the problem internally

• Host Managed
  – Host is aware of SMR working model
  – If user does something “wrong” the drive will reject the IO request


SMR Drive Device Model

• New SCSI standard Zoned Block Commands (ZBC)
  – SATA equivalent ZAC

• Drive described by zones and their restrictions

Type                  Write Restriction  Intended Use              Con                   Abbreviation
Conventional          None               In-place updates          Increased Resources   CZ
Sequential Preferred  None               Mostly sequential writes  Variable Performance  SPZ
Sequential Required   Sequential Write   Only sequential writes                          SRZ

• Our user space library (libzbc) queries zone information from the drive
  – https://github.com/hgst/libzbc
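
The three zone types above differ mainly in how they react to a write that does not land on the zone's write pointer. The following is a minimal sketch of that behavior as a toy in-memory model. It does not use libzbc, and the names `ZoneType`, `Zone`, and `write` are illustrative assumptions, not part of any real API.

```python
from dataclasses import dataclass
from enum import Enum


class ZoneType(Enum):
    CZ = "conventional"
    SPZ = "sequential write preferred"
    SRZ = "sequential write required"


@dataclass
class Zone:
    ztype: ZoneType
    start: int              # first LBA of the zone
    length: int             # zone size in blocks
    write_pointer: int = 0  # zone-relative offset of the next sequential write

    def write(self, offset: int, nblocks: int) -> str:
        """Model how a zone of each type reacts to a write at `offset`."""
        if offset < 0 or offset + nblocks > self.length:
            raise ValueError("write outside zone")
        if self.ztype is ZoneType.CZ:
            return "ok"                      # in-place updates are allowed
        if offset == self.write_pointer:
            self.write_pointer += nblocks    # sequential write advances the pointer
            return "ok"
        if self.ztype is ZoneType.SPZ:
            # Accepted, but the drive has to handle it internally (not modeled).
            return "ok (drive fixes it up internally)"
        return "rejected"                    # SRZ: the drive refuses the I/O


zone = Zone(ZoneType.SRZ, start=0, length=8)
print(zone.write(0, 4))   # sequential
print(zone.write(0, 1))   # rewrite behind the write pointer
```

In this model only the sequential-required zone ever returns "rejected", which mirrors the Host Managed behavior described earlier.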


Why Host Managed SMR?

• We are chasing predictability
  – Large systems with complex data flows
    • Demand relatively tight latency windows from storage devices
    • Translates into latency guarantees from the larger system

• Drive managed HDDs
  – When is garbage collection (GC) triggered?
  – How long does GC take?

• Host Aware HDDs
  – Seem ok if you do the “right” thing
  – Degrade to drive managed when you don’t

• Host Managed
  – Complexity and variance of drive-internal schemes are gone


Host Managed SMR seems great, but …

[Figure: The Linux Storage Stack Diagram, version 4.0, 2015-06-01; outlines the Linux storage stack as of Kernel version 4.0. Created by Werner Fischer and Georg Schönberger. License: CC-BY-SA 3.0, see http://creativecommons.org/licenses/by-sa/3.0/. Source: http://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram]


• All of these layers must be SMR compatible
  – Any IO reordering causes a problem
    • Must not happen at any point between the user and device

• What about my new SMR FS?
  – What is your GC policy?
  – Is it tunable?
  – Does your FS introduce latency at your discretion?

• What about my new KV that is SMR compatible?
  – See above
  – In addition, is it built on top of a file system?
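
To illustrate why reordering anywhere in the stack is a problem, here is a small sketch that replays a list of writes against a single modeled sequential-required zone. The function `submit` and the write tuples are hypothetical and only stand in for the drive's write-pointer check.

```python
def submit(writes, zone_len=8):
    """Apply (offset, nblocks) writes to one sequential-required zone.

    Returns the list of writes the modeled drive rejects.
    """
    write_pointer = 0
    rejected = []
    for offset, nblocks in writes:
        if offset == write_pointer and offset + nblocks <= zone_len:
            write_pointer += nblocks          # accepted, pointer advances
        else:
            rejected.append((offset, nblocks))
    return rejected


in_order = [(0, 2), (2, 2), (4, 2)]
reordered = [(0, 2), (4, 2), (2, 2)]   # a lower layer swapped the last two

print(submit(in_order))    # []
print(submit(reordered))   # [(4, 2)]
```

The application issued the same three writes in both cases. Only the order in which they reached the zone changed, and that alone is enough for one of them to fail.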


What Did We Do?

• Introduce a zone- and block-based extent allocator [ZEA]

• Write Logical Extent [Zone Block Address]
  – Return Drive LBA

• Read Extent [Logical Block Address]
  – Return data if extent is valid

• Iterators over extent metadata
  – Allows one to build ZBA -> LBA mapping
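
The three operations above can be sketched as a toy in-memory allocator. This is one interpretation of the slide, not the actual ZEA code: the class name, method names, and the choice to keep per-write metadata in a separate dictionary are all assumptions made for illustration.

```python
class ZoneExtentAllocator:
    """Toy model of a zone-based extent allocator (illustrative names only)."""

    def __init__(self, num_zones: int, zone_blocks: int):
        self.zone_blocks = zone_blocks
        self.blocks = {}      # LBA -> block payload
        self.headers = {}     # extent start LBA -> per-write metadata
        self.write_pointers = [z * zone_blocks for z in range(num_zones)]
        self.current = 0      # zone currently being filled

    def write_extent(self, zba: int, data: list) -> int:
        """Append `data` (a list of blocks) and return the LBA it landed at."""
        n = len(data)
        if n == 0 or n > self.zone_blocks:
            raise ValueError("extent must fit in one zone")
        # Move to the next zone when the extent does not fit in the current one.
        while self.current < len(self.write_pointers):
            zone_end = (self.current + 1) * self.zone_blocks
            lba = self.write_pointers[self.current]
            if lba + n <= zone_end:
                break
            self.current += 1
        else:
            raise RuntimeError("no zone has room for this extent")
        for i, block in enumerate(data):
            self.blocks[lba + i] = block
        self.headers[lba] = {"zba": zba, "nblocks": n, "valid": True}
        self.write_pointers[self.current] = lba + n   # writes stay sequential
        return lba

    def read_extent(self, lba: int):
        """Return the extent's blocks, or None if no valid extent starts at `lba`."""
        header = self.headers.get(lba)
        if header is None or not header["valid"]:
            return None
        return [self.blocks[lba + i] for i in range(header["nblocks"])]

    def extents(self):
        """Iterate over per-write metadata in LBA order."""
        for lba in sorted(self.headers):
            yield lba, self.headers[lba]


zea = ZoneExtentAllocator(num_zones=2, zone_blocks=4)
a = zea.write_extent(zba=100, data=["a0", "a1", "a2"])
b = zea.write_extent(zba=200, data=["b0", "b1"])   # does not fit in zone 0
mapping = {h["zba"]: lba for lba, h in zea.extents() if h["valid"]}
print(a, b, mapping)
```

The last line rebuilds a ZBA -> LBA mapping purely from the iterator, which is the role the slide assigns to the extent metadata.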


[Figure: ZEA + SMR Drive, with Per Write Metadata]


ZEA Performance vs. Block Device


ZEA Is Just An Extent Allocator, What Next?

• ZEA + LevelDB

• LevelDB is a KV store library that is widely used
  – LevelDB backend API is good for SMR
    • Write file append only, read randomly, create & delete files, flat directory
  – Let's build a simple FS compatible with LevelDB
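
The file operations listed above (append-only writes, random reads, create and delete, flat directory) are few enough to sketch in a handful of lines. The class below is a hypothetical toy, not the file system from the talk: an in-memory append-only list stands in for the extent allocator, and every name is an assumption.

```python
class FlatAppendOnlyFS:
    """Toy flat-namespace file system over an append-only extent log."""

    def __init__(self):
        self.log = []      # stand-in for the extent allocator: append-only
        self.files = {}    # file name -> list of indices into self.log

    def create(self, name: str) -> None:
        if name in self.files:
            raise FileExistsError(name)
        self.files[name] = []

    def append(self, name: str, data: bytes) -> None:
        self.log.append(data)                       # never overwrites in place
        self.files[name].append(len(self.log) - 1)

    def read(self, name: str, offset: int, length: int) -> bytes:
        """Random read: gather the requested byte range across extents."""
        out = bytearray()
        pos = 0
        for idx in self.files[name]:
            extent = self.log[idx]
            lo = max(offset, pos)
            hi = min(offset + length, pos + len(extent))
            if lo < hi:
                out += extent[lo - pos:hi - pos]
            pos += len(extent)
        return bytes(out)

    def delete(self, name: str) -> None:
        del self.files[name]   # extents become garbage for a later GC pass

    def listdir(self) -> list:
        return sorted(self.files)


fs = FlatAppendOnlyFS()
fs.create("000001.log")
fs.append("000001.log", b"hello ")
fs.append("000001.log", b"world")
print(fs.read("000001.log", 4, 4))   # b'o wo'
```

Nothing in this interface needs an in-place update, which is why it fits a sequential-write device.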


[Figure: ZEA + LevelDB Architecture]


ZEA + LevelDB Performance


Lessons Learned

• ZEA is a lightweight abstraction
  – Hides sequential write constraint from application
  – Low overhead vs. a block device when extent size is reasonable
  – Provides addressing flexibility to application
    • Useful for GC

• LevelDB integration opens up usability of Host Managed SMR HDD

• Unfortunately LevelDB is not so great for large objects
  – Ideal use case for SMR drive would be large static blobs
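
One way to read the "useful for GC" point: because the application holds the logical-to-physical mapping, a collector can move live extents to another zone and only has to update that mapping. The sketch below assumes a hypothetical layout where each zone is a list of (key, payload) extents and `mapping` records the live copy of each key. It is not a policy from the talk.

```python
def collect_zone(zones, mapping, victim, target):
    """Copy live extents from zone `victim` to the tail of `target`, then reset `victim`.

    zones:   dict zone_id -> list of (key, payload) extents in write order
    mapping: dict key -> (zone_id, index) of the live copy of that key
    Returns the number of extents that were moved.
    """
    moved = 0
    for index, (key, payload) in enumerate(zones[victim]):
        if mapping.get(key) == (victim, index):      # still the live copy
            zones[target].append((key, payload))     # sequential append only
            mapping[key] = (target, len(zones[target]) - 1)
            moved += 1
    zones[victim] = []                               # models a zone reset
    return moved


zones = {0: [("k1", "v1"), ("k2", "v2"), ("k1", "v1b")], 1: []}
mapping = {"k1": (0, 2), "k2": (0, 1)}   # the first k1 extent is stale
print(collect_zone(zones, mapping, victim=0, target=1), mapping)
```

The stale copy of `k1` is dropped during collection, and the two live extents end up in zone 1 in write order.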


What Is Left?

• What is a “good” interface above ZEA?

• Garbage collection policies
  – When and how

• How to use multiple zones efficiently
  – Allocation
  – Garbage collection

• Thanks for listening
