8/3/2019 P16 Slides
Porting the ZFS file system to FreeBSD
Paweł Jakub Dawidek
[email protected]
The beginning...
- ZFS released by Sun under the CDDL license
- available in Solaris / OpenSolaris only
- ongoing Linux port for the FUSE framework (userland); started as a SoC project
Features...
ZFS has many very interesting features, which make it one of the most wanted file systems.
Features...
- dynamic striping: uses the entire available bandwidth
- RAID-Z (RAID-5 without the write hole; more like RAID-3, actually)
- RAID-1
- 128-bit file system (POSIX limits FS to 64 bits... think about 65 bits)
Features...
- pooled storage: no more volumes/partitions; does for storage what VM did for memory
- copy-on-write model: transactional operation
- always consistent on disk: no fsck, no journaling
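The copy-on-write idea above can be sketched in a few lines. This is an illustrative toy, not ZFS code: live blocks are never overwritten; a new block is written first and the root pointer is flipped last, so the on-disk tree is consistent at every instant.

```python
# Toy sketch of a copy-on-write store (illustrative, not ZFS code).
class COWStore:
    def __init__(self):
        self.blocks = {}          # block id -> data ("the disk")
        self.next_id = 0
        self.root = None          # id of the current root block

    def _write(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data   # always a fresh block, never overwrite
        return bid

    def commit(self, data):
        new_root = self._write(data)   # old root is still intact here;
        self.root = new_root           # the pointer switch is the commit

    def read(self):
        return self.blocks[self.root]

store = COWStore()
store.commit("state-1")
old_root = store.root
store.commit("state-2")
# The previous version is still on "disk" -- this is also why
# snapshots are cheap: just keep the old root pointer around.
assert store.blocks[old_root] == "state-1"
assert store.read() == "state-2"
```

A crash before the pointer switch leaves the old tree; a crash after leaves the new one. Either way the on-disk state is consistent, which is why no fsck or journal is needed.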
Features...
- snapshots: very cheap, because of the COW model
- clones: writable snapshots
- snapshot rollback
- always consistent on disk
- end-to-end data integrity: detects and corrects silent data corruption caused by any defect in disk, cable, controller, driver or firmware
Features...
- built-in compression: lzjb, gzip (as a patch)
- self-healing
- endian-independent: always writes in native endianness
- simplified administration
- per-filesystem encryption soon
FS/Volume model vs. ZFS

[diagram: three Volume+FS stacks vs. three ZFS file systems sharing one Storage Pool]

Traditional volumes:
- abstraction: virtual disk
- a volume/partition for each FS
- grow/shrink by hand
- each FS has limited bandwidth
- storage is fragmented

ZFS pooled storage:
- abstraction: malloc/free
- no partitions to manage
- grow/shrink automatically
- all bandwidth always available
- all storage in the pool is shared
ZFS Self-Healing
Traditional mirroring

[diagram: Application → File System → xVM mirror, three panels]

1. Application issues a read. Mirror reads the first disk, which has a corrupt block. It can't tell...
2. Volume manager passes the bad block to the file system. If it's a metadata block, the system panics. If not...
3. File system returns bad data to the application...
Self-Healing data in ZFS

[diagram: Application → ZFS mirror, three panels]

1. Application issues a read. ZFS mirror tries the first disk. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the second disk. Checksum indicates that the block is good.
3. ZFS returns good data to the application and repairs the damaged block.
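The three-step read above can be sketched in miniature. This is an illustrative toy, not ZFS code: each mirror side is verified against a checksum stored separately from the data itself, and any side that fails verification is rewritten from a good copy.

```python
# Toy sketch of a self-healing mirror read (illustrative, not ZFS code).
import hashlib

def checksum(block):
    return hashlib.sha256(block).hexdigest()

def mirror_read(disks, expected_sum):
    """Return a verified copy of the block, repairing bad sides."""
    for block in disks:
        if checksum(block) == expected_sum:
            # Found a good copy: heal every side that fails verification.
            for i in range(len(disks)):
                if checksum(disks[i]) != expected_sum:
                    disks[i] = block
            return block
    raise IOError("all mirror copies corrupt")

good = b"important data"
disks = [b"garbage!!!", good]       # first side silently corrupted
data = mirror_read(disks, checksum(good))
assert data == good                 # application sees only good data
assert disks[0] == good             # damaged block was repaired in place
```

The key difference from traditional mirroring is that the checksum lives in the parent block, not next to the data, so a side that returns plausible-looking garbage is still caught.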
Porting...
- very portable code (started to work after 10 days of porting)
- few ugly Solaris-specific details
- few ugly FreeBSD-specific details (VFS, buffer cache)
- ZPL (ZFS POSIX Layer) was hell; yes, this is the thing which VFS talks to
Solaris compatibility layer
- contrib/opensolaris/ - userland code taken from OpenSolaris used by ZFS (ZFS control utilities, libraries, test tools)
- compat/opensolaris/ - userland API compatibility layer (Solaris-specific functions missing in FreeBSD)
- cddl/ - Makefiles used to build userland libraries and utilities
- sys/contrib/opensolaris/ - kernel code taken from OpenSolaris used by ZFS
- sys/compat/opensolaris/ - kernel API compatibility layer
- sys/modules/zfs/ - Makefile for building the ZFS kernel module
ZFS connection points in the kernel
- VFS (ZFS file systems)
- GEOM (ZVOL)
- GEOM (VDEV)
- /dev/zfs (userland communication)
How does it look exactly...
[diagram: ZPL and many other ZFS layers connect to VFS; ZVOL acts as a GEOM provider only; VDEV_GEOM acts as a GEOM consumer only; VDEV_FILE (use mdconfig(8)), VDEV_DISK]
Snapshots
contain @ in their name:

# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
tank                 50,4M  73,3G  50,3M  /tank
tank@monday              0      -  50,3M  -
tank@tuesday             0      -  50,3M  -
tank/freebsd         24,5K  73,3G  24,5K  /tank/freebsd
tank/freebsd@tuesday     0      -  24,5K  -

- mounted on first access under /mountpoint/.zfs/snapshot/
- hard to NFS-export: separate file systems have to be visible when their parent is NFS-mounted
NFS is easy
# mountd /etc/exports /etc/zfs/exports
# zfs set sharenfs=ro,maproot=0,network=192.168.0.0,mask=255.255.0.0 tank
# cat /etc/zfs/exports
# !!! DO NOT EDIT THIS FILE MANUALLY !!!
/tank -ro -maproot=0 -network=192.168.0.0 -mask=255.255.0.0
/tank/freebsd -ro -maproot=0 -network=192.168.0.0 -mask=255.255.0.0

we translate options to exports(5) format and SIGHUP the mountd(8) daemon
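The translation step can be sketched roughly as follows. This is a hypothetical helper, not the actual port's code: each comma-separated sharenfs option becomes a dash-prefixed exports(5) option, producing one line per mountpoint like those shown above.

```python
# Rough sketch (hypothetical helper, not the actual ZFS port code) of
# translating a sharenfs property string into an exports(5)-style line.
def sharenfs_to_exports(mountpoint, sharenfs):
    opts = []
    for opt in sharenfs.split(","):
        if "=" in opt:
            key, val = opt.split("=", 1)
            opts.append("-%s=%s" % (key, val))   # e.g. maproot=0
        else:
            opts.append("-" + opt)               # bare flags like "ro"
    return "%s %s" % (mountpoint, " ".join(opts))

line = sharenfs_to_exports(
    "/tank", "ro,maproot=0,network=192.168.0.0,mask=255.255.0.0")
assert line == "/tank -ro -maproot=0 -network=192.168.0.0 -mask=255.255.0.0"
```

After regenerating the file, the real implementation signals mountd(8) with SIGHUP so it re-reads its export lists.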
Missing bits in FreeBSD needed by ZFS
Sleepable mutexes
- no sleeping while holding a mutex(9)
- Solaris mutexes implemented on top of sx(9) locks
- condvar(9) version that operates on sx(9) locks
GFS (Generic Pseudo-Filesystem)
- allows creating virtual objects (not stored on disk)
- in ZFS we have: .zfs/, .zfs/snapshot, .zfs/snapshot//
VPTOFH
- translates a vnode to a file handle
- VFS_VPTOFH(9) replaced with VOP_VPTOFH(9) to support NFS exporting of GFS vnodes
- it's just better that way; ask Kirk for the story :)
lseek(2) SEEK_{DATA,HOLE}
- SEEK_HOLE returns the offset of the next hole
- SEEK_DATA returns the offset of the next data
- helpful for backup software
- not ZFS-specific
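The same interface is exposed in Python as os.SEEK_HOLE/os.SEEK_DATA (on platforms that support it), which makes the backup-software use case easy to demonstrate: a tool can jump over holes instead of reading long runs of zeros.

```python
# Demonstration of SEEK_HOLE / SEEK_DATA via os.lseek.
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 4096)        # 4 KiB of real data...
    os.ftruncate(fd, 1 << 20)        # ...followed by a ~1 MiB hole
    data = os.lseek(fd, 0, os.SEEK_DATA)   # first data at/after offset 0
    hole = os.lseek(fd, 0, os.SEEK_HOLE)   # first hole at/after offset 0
    assert data == 0
    # The exact hole offset is filesystem-dependent (block granularity;
    # filesystems without hole support report the hole at end of file).
    assert 4096 <= hole <= (1 << 20)
finally:
    os.close(fd)
    os.remove(path)
```

A backup program loops: seek to the next data, copy until the next hole, repeat, so only the 4 KiB of real data is ever read.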
Testing correctness
- ztest (libzpool)
  - a product is only as good as its test suite
  - runs most of the ZFS code in userland
  - probably more abuse in 20 seconds than you'd see in a lifetime
- fstest regression test suite
  - 3438 tests in 184 files
  - # prove -r /usr/src/tools/regression/fstest/tests
  - tests: chflags(2), chmod(2), chown(2), link(2), mkdir(2), mkfifo(2), open(2), rename(2), rmdir(2), symlink(2), truncate(2), unlink(2)
Performance
Before showing the numbers...
- a lot to do in this area:
  - bypass the buffer cache
  - use the new sx(9) locks implementation
  - use the name cache
- on the other hand... ZFS on FreeBSD is MPSAFE
Untarring src.tar four times, one by one

[bar chart: time in seconds for UFS, UFS+SU, UFS+GJ+AS and ZFS (less is better)]
Removing four src directories one by one

[bar chart: time in seconds for UFS, UFS+SU, UFS+GJ+AS and ZFS (less is better)]
Untarring src.tar four times in parallel

[bar chart: time in seconds for UFS, UFS+SU, UFS+GJ+AS and ZFS (less is better)]
Removing four src directories in parallel

[bar chart: time in seconds for UFS, UFS+SU, UFS+GJ+AS and ZFS (less is better)]
dd if=/dev/zero of=/fs/zero bs=1m count=5000

[bar chart: time in seconds for UFS, UFS+SU, UFS+GJ+AS and ZFS (less is better)]
Future directions
Access Control Lists
- we currently have support for POSIX.1e ACLs
- ZFS natively operates on NFSv4-style ACLs
- not implemented yet in my port
iSCSI support
- iSCSI daemon (target mode) only in the ports collection
- once in the base system, ZFS will start to use it to export ZVOLs, just like it exports file systems via NFS
Integration with jails
- ZFS nicely integrates with zones on Solaris, so why not use it with FreeBSD's jails?
- pools can only be managed from outside a jail
- ZFS file systems can be managed from within a jail
Integration with jails
main# zpool create tank mirror da0 da1
main# zfs create tank/jail
main# jail hostname /jail/root 10.0.0.1 /bin/tcsh
main# zfs jail -i tank/jail

jail# zfs create tank/jail/home
jail# zfs create tank/jail/home/pjd
jail# zfs snapshot tank/jail/home@today
Proofs...