ZEA, A Data Management Approach for SMR
Adam Manzanares

Co-Authors
• Western Digital Research – Cyril Guyot, Damien Le Moal, Zvonimir Bandic
• University of California, Santa Cruz – Noah Watkins, Carlos Maltzahn

©2016 Western Digital Corporation or affiliates. All rights reserved.
Why SMR?
• HDDs are not going away
  – Exponential growth of data still exists
  – $/TB vs. Flash is still much lower
  – We want to continue this trend!
• Traditional recording (PMR) is reaching scalability limits
  – SMR is a density-enhancing technology being shipped right now
• Future recording technologies may behave like SMR
  – Write constraint similarities
  – HAMR
Flavors of SMR
• SMR constraint
  – Writes must be sequential to avoid data loss
• Drive Managed
  – Transparent to the user
  – Comes at the cost of predictability and additional drive resources
• Host Aware
  – Host is aware of the SMR working model
  – If the user does something “wrong,” the drive fixes the problem internally
• Host Managed
  – Host is aware of the SMR working model
  – If the user does something “wrong,” the drive rejects the IO request
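The three flavors above differ only in who absorbs a non-sequential write. A minimal model of that difference (zone sizes, return strings, and the remap behavior are illustrative assumptions, not the ZBC/ZAC specification):

```python
# Minimal model of how each SMR flavor reacts to a non-sequential write.
# All names and return values are illustrative assumptions.

class Zone:
    def __init__(self, start, size):
        self.start = start          # first LBA of the zone
        self.write_pointer = start  # next LBA that may be written sequentially
        self.size = size

def handle_write(model, zone, lba):
    """Return what the drive does with a write at `lba`."""
    if lba == zone.write_pointer:        # sequential: always fine
        zone.write_pointer += 1
        return "accepted"
    if model in ("drive-managed", "host-aware"):
        # Drive fixes it internally (indirection/staging), at the cost
        # of predictability -- GC may run at some later, unknown time.
        return "accepted (internally remapped)"
    # Host managed: the "wrong" IO is rejected outright.
    return "rejected"

z = Zone(start=0, size=256)
handle_write("host-managed", z, 0)            # sequential write: accepted
print(handle_write("host-managed", z, 100))   # prints "rejected"
```

The host-managed rejection is the property the rest of the talk builds on: the drive never hides work from the host.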
SMR Drive Device Model
• New SCSI standard: Zoned Block Commands (ZBC)
  – SATA equivalent: ZAC
• Drive is described by zones and their restrictions
Type                 | Write Restriction | Intended Use             | Con                  | Abbreviation
---------------------|-------------------|--------------------------|----------------------|-------------
Conventional         | None              | In-place updates         | Increased resources  | CZ
Sequential Preferred | None              | Mostly sequential writes | Variable performance | SPZ
Sequential Required  | Sequential write  | Only sequential writes   |                      | SRZ
• Our user space library (libzbc) queries zone information from the drive– https://github.com/hgst/libzbc
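The zone report that libzbc returns can be pictured as a list of per-zone records like the table above. A hypothetical in-memory model (the real libzbc C API differs; names, sizes, and record layout here are illustrative assumptions):

```python
# Hypothetical in-memory model of a ZBC zone report, mirroring the zone
# table above. Not the real libzbc API -- names and sizes are ours.

from collections import namedtuple

Zone = namedtuple("Zone", "type start length write_pointer")

# CZ = conventional, SPZ = sequential preferred, SRZ = sequential required
zones = [
    Zone("CZ",  start=0,      length=65536, write_pointer=None),
    Zone("SPZ", start=65536,  length=65536, write_pointer=65536),
    Zone("SRZ", start=131072, length=65536, write_pointer=131072),
]

def write_allowed(zone, lba):
    """Apply the per-type write restriction from the table above."""
    if zone.type in ("CZ", "SPZ"):    # no hard restriction (SPZ may be slow)
        return True
    return lba == zone.write_pointer  # SRZ: only at the write pointer

assert write_allowed(zones[0], 100)          # in-place update OK in CZ
assert not write_allowed(zones[2], 131073)   # non-sequential rejected in SRZ
```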
Why Host Managed SMR?
• We are chasing predictability
  – Large systems with complex data flows
    • Demand relatively tight latency windows from storage devices
    • Translates into latency guarantees from the larger system
• Drive Managed HDDs
  – When is GC triggered?
  – How long does GC take?
• Host Aware HDDs
  – Seem OK if you do the “right” thing
  – Degrade to drive managed behavior when you don’t
• Host Managed
  – Complexity and variance of drive-internal schemes are now gone
Host Managed SMR seems great, but …
[Figure: The Linux Storage Stack Diagram, version 4.0 (2015-06-01), outlining the Linux storage stack as of kernel 4.0, from applications through the VFS, file systems, page cache, block layer, I/O schedulers, device mapper, and SCSI layers down to physical devices. Created by Werner Fischer and Georg Schönberger, license CC-BY-SA 3.0, http://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram]
• All of these layers must be SMR compatible
  – Any IO reordering causes a problem
    • Must not happen at any point between the user and the device
• What about my new SMR FS?
  – What is your GC policy?
  – Is it tunable?
  – Does your FS introduce latency at its own discretion?
• What about my new KV store that is SMR compatible?
  – See above
  – In addition, is it built on top of a file system?
What Did We Do?
• Introduced ZEA, a zone and block based extent allocator
• Write Logical Extent [Zone Block Address]
  – Returns the drive LBA
• Read Extent [Logical Block Address]
  – Returns data if the extent is valid
• Iterators over extent metadata
  – Allow one to build the ZBA -> LBA mapping
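The interface above can be sketched as a toy allocator: extents are appended sequentially within a zone, each carrying per-write metadata, so the ZBA -> LBA map can be rebuilt by iterating that metadata. Class and method names are ours for illustration, not ZEA's actual API:

```python
# Toy sketch of the ZEA interface described above. Extents append
# sequentially; per-write metadata lets us rebuild the ZBA -> LBA map.
# All names and layouts are illustrative assumptions, not ZEA's real API.

class ZeaZone:
    def __init__(self, start_lba):
        self.write_pointer = start_lba
        self.extents = []    # (zba, lba, length, data): per-write metadata

    def write_extent(self, zba, data):
        """Append an extent for logical address `zba`; return the drive LBA."""
        lba = self.write_pointer
        self.extents.append((zba, lba, len(data), data))
        self.write_pointer += len(data)    # writes stay sequential
        return lba

    def read_extent(self, lba):
        """Return data if a valid extent starts at `lba`, else None."""
        for _, elba, _, data in self.extents:
            if elba == lba:
                return data
        return None

    def iter_metadata(self):
        """Iterate (zba, lba) pairs to rebuild the ZBA -> LBA map."""
        for zba, lba, _, _ in self.extents:
            yield zba, lba

zone = ZeaZone(start_lba=0)
lba = zone.write_extent(zba=42, data=b"hello")
assert zone.read_extent(lba) == b"hello"
assert dict(zone.iter_metadata()) == {42: 0}
```

Because the application addresses extents by ZBA, not raw LBA, the allocator is free to relocate data later; that is the addressing flexibility the Lessons Learned slide calls useful for GC.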
ZEA + SMR Drive
Per Write Metadata
ZEA Performance vs. Block Device
ZEA Is Just An Extent Allocator, What Next?
• ZEA + LevelDB
• LevelDB is a widely used KV store library
  – The LevelDB backend API is a good fit for SMR
    • Files are written append-only and read randomly
    • Create & delete files, flat directory
  – Let's build a simple FS compatible with LevelDB
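The "simple FS" only needs the handful of operations listed above. A minimal sketch, assuming an in-memory flat namespace (class and method names are ours, not LevelDB's backend API):

```python
# Minimal sketch of the simple FS the slide describes: only what
# LevelDB's backend needs -- append-only writes, random reads,
# create/delete, and a flat directory. Names are illustrative.

class FlatAppendFS:
    def __init__(self):
        self.files = {}                      # flat namespace: name -> bytes

    def create(self, name):
        self.files[name] = b""

    def append(self, name, data):            # write path: append-only
        self.files[name] += data

    def read(self, name, offset, length):    # read path: random access
        return self.files[name][offset:offset + length]

    def delete(self, name):
        del self.files[name]

    def list_dir(self):                      # flat directory listing
        return sorted(self.files)

fs = FlatAppendFS()
fs.create("000001.log")
fs.append("000001.log", b"put k1=v1;")
assert fs.read("000001.log", 4, 2) == b"k1"
assert fs.list_dir() == ["000001.log"]
```

Append-only writes map directly onto ZEA extents in a sequential zone, which is why this narrow interface sidesteps the SMR write constraint.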
ZEA + LevelDB Architecture
ZEA + LevelDB Performance
Lessons Learned
• ZEA is a lightweight abstraction
  – Hides the sequential write constraint from the application
  – Low overhead vs. a block device when the extent size is reasonable
  – Provides addressing flexibility to the application
    • Useful for GC
• LevelDB integration opens up the usability of Host Managed SMR HDDs
• Unfortunately, LevelDB is not so great for large objects
  – The ideal use case for an SMR drive would be large static blobs
What Is Left?
• What is a “good” interface above ZEA?
• Garbage collection policies
  – When and how?
• How to use multiple zones efficiently
  – Allocation
  – Garbage collection
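One candidate policy for the open GC question is a greedy cleaner: pick the zone holding the fewest live extents, copy those extents elsewhere via the ZBA -> LBA map, and reset the zone. This is our sketch of one possibility, not a policy the talk commits to; all names are illustrative:

```python
# Hedged sketch of one possible multi-zone GC policy (greedy cleaning).
# ZEA's addressing flexibility -- extents are found via the ZBA -> LBA
# map, not fixed LBAs -- is what lets live data move. Names are ours.

def gc_pick_victim(candidates):
    """candidates: dict zone_id -> {'live': [...], 'dead': int}.
    Greedy: clean the zone with the fewest live extents to relocate."""
    return min(candidates, key=lambda z: len(candidates[z]["live"]))

def gc_clean(zones, victim, target):
    """Copy live extents out of `victim` into `target`, then reset `victim`."""
    zones[target]["live"].extend(zones[victim]["live"])
    zones[victim] = {"live": [], "dead": 0}   # models a zone reset

zones = {
    0: {"live": ["a", "b"], "dead": 6},
    1: {"live": ["c"],      "dead": 7},   # mostly dead: cheapest to clean
    2: {"live": [],         "dead": 0},   # empty zone used as the GC target
}
victim = gc_pick_victim({z: zones[z] for z in (0, 1)})  # zones with dead data
assert victim == 1
gc_clean(zones, victim, target=2)
assert zones[2]["live"] == ["c"] and zones[1]["live"] == []
```

The "when" half of the question (trigger GC eagerly, lazily, or on a latency budget) is exactly the predictability knob the talk argues should live in the host, not the drive.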
• Thanks for listening