P16 Slides

Transcript of P16 Slides

  • Slide 1/36

    Porting the ZFS file system to FreeBSD

    Paweł Jakub Dawidek

  • Slide 2/36

    The beginning...

    ZFS released by Sun under the CDDL license

    available in Solaris / OpenSolaris

    the only ongoing port is the Linux port for the FUSE framework (userland); started as a SoC project

  • Slide 3/36

    Features...

    ZFS has many very interesting features, which make it one of the most wanted file systems

  • Slide 4/36

    Features...

    dynamic striping: use the entire bandwidth available,

    RAID-Z (RAID-5 without the write hole; more like RAID-3, actually),

    RAID-1,

    128 bits (POSIX limits file systems to 64 bits)...

    (think about 65 bits)
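
    For illustration, the layouts above are set up with zpool(8); a minimal sketch of creating a RAID-Z pool over three disks (disk and pool names are hypothetical):

    # zpool create tank raidz da0 da1 da2
    # zpool status tank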

  • Slide 5/36

    Features...

    pooled storage: no more volumes/partitions; does for storage what VM did for memory

    copy-on-write model: transactional operation

    always consistent on disk: no fsck, no journaling

  • Slide 6/36

    Features...

    snapshots: very cheap, because of the COW model

    clones: writable snapshots

    snapshot rollback

    always consistent on disk

    end-to-end data integrity: detects and corrects silent data corruption caused by any defect in disk, cable, controller, driver or firmware
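
    A minimal sketch of the snapshot, clone and rollback operations mentioned above (dataset and snapshot names are hypothetical):

    # zfs snapshot tank/home@backup
    # zfs clone tank/home@backup tank/home-test
    # zfs rollback tank/home@backup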

  • Slide 7/36

    Features...

    built-in compression: lzjb, gzip (as a patch)

    self-healing

    endian-independent: always writes in native endianness

    simplified administration

    per-filesystem encryption soon
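
    As an illustration of the simplified administration, compression is toggled per file system with a single property (dataset name is hypothetical):

    # zfs set compression=lzjb tank/home
    # zfs get compression tank/home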

  • Slide 8/36

    [Diagram: in the traditional model each FS sits on its own Volume; with ZFS, several ZFS file systems share one Storage Pool]

    Traditional Volumes: abstraction is a virtual disk; a volume/partition for each FS; grow/shrink by hand; each FS has limited bandwidth; storage is fragmented

    ZFS Pooled Storage: abstraction is malloc/free; no partitions to manage; grow/shrink automatically; all bandwidth always available; all storage in the pool is shared

    FS/Volume model vs. ZFS
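
    A rough sketch of the pooled model (pool, disk and dataset names are hypothetical): file systems are created inside a pool without sizing any partitions, and they all draw from the same shared space.

    # zpool create tank mirror da0 da1
    # zfs create tank/home
    # zfs create tank/ports
    # df -h /tank /tank/home /tank/ports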

  • Slide 9/36

    ZFS Self-Healing

  • Slide 10/36

    [Diagram, three panels, each showing Application, File System and xVM mirror:]

    1. Application issues a read. Mirror reads the first disk, which has a corrupt block. It can't tell...

    2. Volume manager passes the bad block to the file system. If it's a metadata block, the system panics. If not...

    3. File system returns bad data to the application...

    Traditional mirroring

  • Slide 11/36

    [Diagram, three panels, each showing Application and a ZFS mirror:]

    1. Application issues a read. ZFS mirror tries the first disk. Checksum reveals that the block is corrupt on disk.

    2. ZFS tries the second disk. Checksum indicates that the block is good.

    3. ZFS returns good data to the application and repairs the damaged block.

    Self-Healing data in ZFS

  • Slide 12/36

    Porting...

    very portable code (started to work after 10 days of porting)

    few ugly Solaris-specific details

    few ugly FreeBSD-specific details (VFS, buffer cache)

    ZPL (ZFS POSIX Layer) was hell; yes, this is the thing which VFS talks to

  • Slide 13/36

    Solaris compatibility layer

    contrib/opensolaris/ - userland code taken from OpenSolaris used by ZFS (ZFS control utilities, libraries, test tools)

    compat/opensolaris/ - userland API compatibility layer (Solaris-specific functions missing in FreeBSD)

    cddl/ - Makefiles used to build userland libraries and utilities

    sys/contrib/opensolaris/ - kernel code taken from OpenSolaris used by ZFS

    sys/compat/opensolaris/ - kernel API compatibility layer

    sys/modules/zfs/ - Makefile for building the ZFS kernel module

  • Slide 14/36

    ZFS connection points in the kernel

    [Diagram: ZFS connects to the rest of the kernel at four points:]

    GEOM (ZVOL)

    VFS (ZFS file systems)

    /dev/zfs (userland communication)

    GEOM (VDEV)

  • Slide 15/36

    How does it look exactly...

    [Diagram of the layering: at the top, ZVOL exposes GEOM providers only and the ZPL talks to VFS; inside, ZFS with many other layers; at the bottom, VDEV_GEOM (consumers only), VDEV_FILE and VDEV_DISK, with GEOM underneath; note: use mdconfig(8)]

  • Slide 16/36

    Snapshots

    a snapshot contains @ in its name:

    # zfs list
    NAME                  USED  AVAIL  REFER  MOUNTPOINT
    tank                 50,4M  73,3G  50,3M  /tank
    tank@monday              0      -  50,3M  -
    tank@tuesday             0      -  50,3M  -
    tank/freebsd         24,5K  73,3G  24,5K  /tank/freebsd
    tank/freebsd@tuesday     0      -  24,5K  -

    mounted on first access under /mountpoint/.zfs/snapshot/

    hard to NFS-export: separate file systems have to be visible when their parent is NFS-mounted

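
    A minimal sketch of taking a snapshot and reaching it through the .zfs directory (dataset and snapshot names are hypothetical):

    # zfs snapshot tank/freebsd@wednesday
    # ls /tank/freebsd/.zfs/snapshot/wednesday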
  • Slide 17/36

    NFS is easy

    # mountd /etc/exports /etc/zfs/exports
    # zfs set sharenfs=ro,maproot=0,network=192.168.0.0,mask=255.255.0.0 tank
    # cat /etc/zfs/exports
    # !!! DO NOT EDIT THIS FILE MANUALLY !!!
    /tank -ro -maproot=0 -network=192.168.0.0 -mask=255.255.0.0
    /tank/freebsd -ro -maproot=0 -network=192.168.0.0 -mask=255.255.0.0

    we translate the options to exports(5) format and send SIGHUP to the mountd(8) daemon
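
    From a client on the exported network, the shared file systems can then be mounted like any other NFS export (server address below is hypothetical):

    # mount -t nfs 192.168.0.10:/tank/freebsd /mnt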

  • Slide 18/36

    Missing bits in FreeBSD needed by ZFS

  • Slide 19/36

    Sleepable mutexes

    no sleeping while holding a mutex(9)

    Solaris mutexes implemented on top of sx(9) locks

    condvar(9) version that operates on sx(9) locks

  • Slide 20/36

    GFS (Generic Pseudo-Filesystem)

    allows creating virtual objects (not stored on disk)

    in ZFS we have:

    .zfs/

    .zfs/snapshot/

    .zfs/snapshot/<snapshot>/
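
    As a rough illustration of these virtual objects (assuming the default snapdir=hidden setting), the .zfs directory does not show up in a listing, yet it can be entered because its nodes are generated on the fly:

    # ls -a /tank
    # ls /tank/.zfs/snapshot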

  • Slide 21/36

    VPTOFH

    translates a vnode to a file handle

    VFS_VPTOFH(9) replaced with VOP_VPTOFH(9) to support NFS exporting of GFS vnodes

    it's just better that way, ask Kirk for the story :)

  • Slide 22/36

    lseek(2) SEEK_{DATA,HOLE}

    SEEK_HOLE returns the offset of the next hole

    SEEK_DATA returns the offset of the next data

    helpful for backup software

    not ZFS-specific

  • Slide 23/36

    Testing correctness

    ztest (libzpool): a product is only as good as its test suite; runs most of the ZFS code in userland; probably more abuse in 20 seconds than you'd see in a lifetime

    fstest regression test suite: 3438 tests in 184 files

    # prove -r /usr/src/tools/regression/fstest/tests

    tests: chflags(2), chmod(2), chown(2), link(2), mkdir(2), mkfifo(2), open(2), rename(2), rmdir(2), symlink(2), truncate(2), unlink(2)

  • Slide 24/36

    Performance

  • Slide 25/36

    Before showing the numbers...

    a lot to do in this area: bypass the buffer cache, use the new sx(9) locks implementation, use the name cache

    on the other hand... ZFS on FreeBSD is MPSAFE

  • Slide 26/36

    Untarring src.tar four times, one by one

    [Bar chart: time in seconds (less is better), y-axis 0 to 275; bars for UFS, UFS+SU, UFS+GJ+AS and ZFS]

  • Slide 27/36

    Removing four src directories one by one

    [Bar chart: time in seconds (less is better), y-axis 0 to 250; bars for UFS, UFS+SU, UFS+GJ+AS and ZFS]

  • Slide 28/36

    Untarring src.tar four times in parallel

    [Bar chart: time in seconds (less is better), y-axis 0 to 350; bars for UFS, UFS+SU, UFS+GJ+AS and ZFS]

  • Slide 29/36

    Removing four src directories in parallel

    [Bar chart: time in seconds (less is better), y-axis 0 to 375; bars for UFS, UFS+SU, UFS+GJ+AS and ZFS]

  • Slide 30/36

    dd if=/dev/zero of=/fs/zero bs=1m count=5000

    [Bar chart: time in seconds (less is better), y-axis 0 to 200; bars for UFS, UFS+SU, UFS+GJ+AS and ZFS]

  • Slide 31/36

    Future directions

  • Slide 32/36

    Access Control Lists

    we currently have support for POSIX.1e ACLs

    ZFS natively operates on NFSv4-style ACLs

    not implemented yet in my port

  • Slide 33/36

    iSCSI support

    iSCSI daemon (target mode) only in the ports collection

    once in the base system, ZFS will start to use it to export ZVOLs, just like it exports file systems via NFS
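
    For reference, a ZVOL is created like a file system but with a fixed size; the device path below reflects how the FreeBSD port is assumed to expose it as a GEOM provider:

    # zfs create -V 10g tank/vol0
    # newfs /dev/zvol/tank/vol0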

  • Slide 34/36

    Integration with jails

    ZFS nicely integrates with zones on Solaris, so why not use it with FreeBSD's jails?

    pools can only be managed from outside a jail

    ZFS file systems can be managed from within a jail

  • Slide 35/36

    Integration with jails

    main# zpool create tank mirror da0 da1
    main# zfs create tank/jail
    main# jail hostname /jail/root 10.0.0.1 /bin/tcsh
    main# zfs jail -i tank/jail

    jail# zfs create tank/jail/home
    jail# zfs create tank/jail/home/pjd
    jail# zfs snapshot tank/jail/home@today

  • Slide 36/36

    Proofs...