Tolerating File-System Mistakes with EnvyFS
Lakshmi N. Bairavasundaram, NetApp, Inc.
Swaminathan Sundararaman, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
University of Wisconsin-Madison
File Systems in Today's World
• Modern file systems are complex
  – Tens of thousands of lines of code (e.g., XFS 45K LOC)
• Storage stack is also getting deeper
  – Hypervisor, network, logical volume manager
• Need to handle a gamut of failures
  – Memory allocation, disk faults, bit flips, system crashes
• Must preserve the integrity of metadata and user data
04/21/23 2Tolerating File-System Mistakes with EnvyFS
File System Bugs
• Bug reports for Linux 2.6 series from Bugzilla
  – ext3: 64, JFS: 17, ReiserFS: 38
  – Some are FS corruption causing permanent data loss
• FS bugs broadly classified into two categories
  – "fail-stop": System immediately crashes
    • Solutions: Nooks [Swift04], CuriOS [David08]
  – "fail-silent": Accidentally corrupt on-disk state
    • Many such bugs uncovered [Prabhakaran05, Gunawi08, Yang04, Yang06b]
Bugs are inevitable in file systems
Challenge: how to cope with them?
N-Version File Systems
• Based on N-version programming [Avizienis77]
  – NFS servers [Rodrigues01], databases [Vandiver07], security [Cox06]
• EnvyFS: Simple software layer
  – Store data in N child file systems
  – Operations performed on all children
• Rely on a simple software layer
• Challenge: reducing overheads while retaining reliability
  – SubSIST: Novel Single Instance Store
[Figure: architecture stack. Application, EnvyFS layer, Child 1, Child 2, …, Child N, SIS layer, disk driver, disk.]
Results
• Robustness
  – Traditional file systems handle few corruptions (< 4%)
  – EnvyFS3 tolerates 98.9% of single file system mistakes
• Performance
  – Desktop workloads: EnvyFS3 has comparable performance
  – I/O intensive workloads:
    • Normal mode: EnvyFS3 + SubSIST has acceptable performance
    • Under memory pressure: EnvyFS3 + SubSIST has large overheads
• Potential as a debugging tool for FS developers
  – Pinpoint the source of a "fail-silent" bug in ext3
Outline
• Introduction
• Building reliable file systems
• Reducing overheads with SubSIST
• Evaluation
• Conclusion
N-Version Systems
Development process:
1. Producing the specification of software
2. Implementing N versions of the software
3. Creating N-version layer
  – Executes different versions
  – Determines the consensus result
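The role of the N-version layer (execute every version, then determine the consensus result) can be sketched in a few lines. This is a minimal illustration, not EnvyFS code: the function name `n_version_call` and the convention of representing each outcome as a `(status, value)` pair are assumptions made for the example.

```python
from collections import Counter


def n_version_call(versions, *args):
    """Run every version on the same input and return the majority outcome.

    Each outcome is recorded as ("ok", result) or ("error", exception name),
    so a version that fails still casts a vote. Results must be hashable.
    """
    outcomes = []
    for version in versions:
        try:
            outcomes.append(("ok", version(*args)))
        except Exception as exc:  # a failing version is just another vote
            outcomes.append(("error", type(exc).__name__))

    winner, votes = Counter(outcomes).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return winner
```

With three versions, a single faulty version is outvoted by the other two; if all three disagree, the layer reports that no consensus exists instead of picking one arbitrarily.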
1. Producing Specification
• Our own specification?
  – Impractical: Requires wide-scale changes to file systems
  – Specifications take years to get accepted
• Can we leverage an existing specification?
  – Yes, can leverage VFS, but there are some issues
• VFS not precise for N-versioning purposes
  – Needs to handle cases where specification is not precise
  – e.g., Ordering directory entries, inode number allocation
Imprecise VFS Specification
Ordering directory entries
• Issue:
  – No specified return order
  – Can't blindly compare entries
• Solution:
  – Read all entries from a directory (dir: test in our case) from all FSes
  – Match entries from FSes
  – Return majority results
[Figure: Readdir on Dir: test through the EnvyFS layer. FS X, FS Y and FS Z hold the entries File 1, File 2 and File 3 but return them in different orders; one panel shows a child returning no entries. EnvyFS matches the entries and returns File 1, File 2, File 3.]
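The directory-ordering solution (read all entries from every child, match them, return the majority) can be sketched as an order-insensitive vote over entry names. This is an illustrative sketch only; `merge_readdir` is a hypothetical name, and returning the names in sorted order is a choice made for the example, since no return order is specified.

```python
from collections import Counter


def merge_readdir(listings):
    """Return the entry names reported by a majority of child file systems.

    listings: one list of entry names per child, in whatever order that
    child happened to return them.
    """
    majority = len(listings) // 2 + 1
    votes = Counter()
    for entries in listings:
        votes.update(set(entries))  # each child votes at most once per name
    return sorted(name for name, count in votes.items() if count >= majority)
```

Because the vote is per name and not per position, two children that return the same entries in different orders still agree, and a child that returns nothing is simply outvoted.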
Imprecise VFS Specification (cont)
• Inode number allocation
  – Inode numbers returned through system calls
  – Each child file system issues different inode numbers
  – Possible solution: Force file systems to use same algorithm?
  – Our solution: Issue inode numbers at EnvyFS layer
[Figure: Stat: File 1 through the EnvyFS layer. FS X, FS Y and FS Z each hold Dir: test but assign File 1 different inode numbers (10, 36 and 65). The inode mapping table maps virtual inode number 15 to child inode numbers 10, 36 and 65, and EnvyFS returns 15 to the caller.]
• Inode mapping table not persistently stored
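A table that issues inode numbers at the EnvyFS layer might look like the sketch below. It is an assumption-laden illustration: the class name `InodeMappingTable`, its methods, and the sequential allocation policy are all hypothetical. The table lives only in memory, matching the note that it is not persistently stored.

```python
import itertools


class InodeMappingTable:
    """In-memory map between a virtual inode number and the child inode numbers."""

    def __init__(self, first_virtual=1):
        self._next_virtual = itertools.count(first_virtual)
        self._by_children = {}  # tuple of child inode numbers -> virtual number
        self._by_virtual = {}   # virtual number -> tuple of child inode numbers

    def virtual_for(self, child_inodes):
        """Return the virtual inode number for a file, allocating one if needed."""
        key = tuple(child_inodes)
        if key not in self._by_children:
            virtual = next(self._next_virtual)
            self._by_children[key] = virtual
            self._by_virtual[virtual] = key
        return self._by_children[key]

    def children_for(self, virtual):
        """Translate a virtual inode number back to the per-child numbers."""
        return self._by_virtual[virtual]
```

Applications only ever see the virtual number, so the children are free to allocate inode numbers however they like.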
2. Implementing N versions of FS
• Painful process
  – High cost of development, long time delays
• Lucky! Hard work already done for us
  – 30 different disk-based file systems in Linux 2.6
• Which file systems to use?
  – ext3, JFS, ReiserFS in a three-version FS
  – Others should work without modifications
3. Creating N-Version Layer
• N-Version layer (EnvyFS)
  – Inserted beneath VFS
  – Simple design to avoid bugs
• Example: Reading a file
  – Allocate N data buffers
  – Read data block from the disk
  – Compare: data, return code, file position
  – Return: data, return code
• Issues:
  – Allocate memory for each read operation
  – Extra copy from allocated buffer to application
  – Comparison overheads
[Figure: read path. The application issues Read (file, 1 block) to the VFS layer, which passes it to the EnvyFS layer (comparators, wrappers, inode mapping table). EnvyFS issues the read to ext3, JFS and ReiserFS; each reads from disk and returns data D, an error code and file position x. EnvyFS returns D and the error code to the application.]
Reading a File in EnvyFS
• Solution:
  – Same application buffer for all FS
  – TCP-like checksums for data comparison
  – Compare: checksums, return code, file position
  – Read data until majority
[Figure: read path with checksums. The EnvyFS layer (comparators, wrappers, inode mapping table) reads from ext3, JFS and ReiserFS into the same application buffer and records one checksum per child (FS 1: 435, FS 2: 435, …, FS N: 436), together with the error code and file position x. Data D and the error code are returned once a majority agrees.]
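The optimized read can be sketched as follows: compare a small checksum instead of the full data, and stop reading children as soon as a majority agrees. This is a sketch under stated assumptions. The 16-bit ones' complement sum below is the standard Internet checksum, used here as a stand-in for the "TCP-like" checksum on the slide; `envy_read` and the shape of the child callables, each returning `(data, return_code, new_position)`, are hypothetical. Python does not model the shared application buffer directly, so the function returns the data from the child whose read completed the majority, which is what the shared buffer would contain at that point.

```python
def tcp_like_checksum(data: bytes) -> int:
    """16-bit ones' complement sum of the data (Internet-checksum style)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


def envy_read(children, path, offset, length):
    """Read from children one at a time until a majority agrees."""
    majority = len(children) // 2 + 1
    votes = {}
    for read in children:
        data, return_code, position = read(path, offset, length)
        key = (tcp_like_checksum(data), return_code, position)
        votes[key] = votes.get(key, 0) + 1
        if votes[key] >= majority:
            # The child just read belongs to the majority, so its data is valid.
            return data, return_code
    raise OSError("no majority among child file systems")
```

When the first two of three children agree, the third is never read, which is where the saving over the naive allocate-N-buffers approach comes from.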
Outline
• Introduction
• Building reliable file systems
• Reducing overheads with SubSIST
• Evaluation
• Conclusion
Case for Single Instance Storage (SIS)
• Ideal: One disk per FS
• Practical: One disk for all FS
• Overheads
  – Effective storage space: 1/N
  – N times more I/O (Read/write)
• Challenge: Maintain diversity while minimizing overheads
[Figure: ideally FS 1, FS 2, …, FS N each use their own disk (Disk 1 … Disk N); in practice they use partitions (Part 1 … Part N) of one disk. Below the application, VFS layer and EnvyFS layer, each child places its own copy of a request in the disk request queue, so one operation produces N requests.]
SubSIST: Single Instance Store
• Variant of a Single Instance Store
  – Selectively merges data blocks
• Block-addressable SIS
  – Exports virtual disks to FSes
  – Manages mapping, free space info.
  – Not persistently stored on disk
• EnvyFS writes through N file systems
  – N data blocks merged to 1 data block
  – Content hashes not stored persistently
  – Meta-data blocks not merged
  – Merges blocks across file systems (inter-FS), not within one (intra-FS)
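[Figure: SubSIST sits between the child file systems (FS 1, FS 2, …, FS N) and the disk, exporting Vdisk 1, Vdisk 2, …, Vdisk N. Its components are a read cache, a CHash layer and free space management. Data blocks (D) written by the N file systems are merged into one block on disk; metadata blocks (M) are not merged.]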
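The merging rules above can be sketched as a small content-addressed block store. Everything here is illustrative: the class `SubSISTSketch`, its method names, and the use of SHA-1 as the content hash are assumptions, and the real system's read cache and free space management are left out. The sketch keeps the hash index in memory only, leaves metadata blocks unmerged, and merges identical data blocks only when they come from different virtual disks.

```python
import hashlib


class SubSISTSketch:
    """Block-addressable store that merges identical data blocks across vdisks."""

    def __init__(self):
        self.blocks = {}    # physical block number -> data
        self.owners = {}    # physical block number -> {(vdisk, vblock), ...}
        self.by_hash = {}   # content hash -> physical block number (memory only)
        self.mapping = {}   # (vdisk, vblock) -> physical block number
        self._next_free = 0

    def write(self, vdisk, vblock, data, is_metadata=False):
        self._unmap(vdisk, vblock)
        digest = hashlib.sha1(data).digest()
        target = None if is_metadata else self.by_hash.get(digest)
        # Merge only across file systems: skip a block this vdisk already owns.
        if target is not None and any(v == vdisk for v, _ in self.owners[target]):
            target = None
        if target is None:
            target = self._next_free
            self._next_free += 1
            self.blocks[target] = data
            self.owners[target] = set()
            if not is_metadata:
                self.by_hash.setdefault(digest, target)
        self.owners[target].add((vdisk, vblock))
        self.mapping[(vdisk, vblock)] = target

    def read(self, vdisk, vblock):
        return self.blocks[self.mapping[(vdisk, vblock)]]

    def _unmap(self, vdisk, vblock):
        old = self.mapping.pop((vdisk, vblock), None)
        if old is None:
            return
        self.owners[old].discard((vdisk, vblock))
        if not self.owners[old]:  # last reference gone: free the block
            del self.owners[old], self.blocks[old]
            self.by_hash = {h: p for h, p in self.by_hash.items() if p != old}
```

If one child writes different content for the same logical block, its hash does not match, so that block is stored on its own while the copies from the other children still share a single physical block.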
Handling Data Block Corruptions
✓ Corruption to data in a single FS
  – Due to bugs, bit flips, storage stack
  – Corrupt data blocks not merged
  – All other N-1 data blocks merged
  – Corrupt data block fixed at next read
× Corruption to data block inside disk
• Single copy of data
  – Different code paths
  – Different on-disk structures
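[Figure: SubSIST (read cache, CHash layer, free space management) below the application, VFS layer, EnvyFS layer and child file systems, exporting Vdisk 1, Vdisk 2, …, Vdisk N. The corrupt data block from one file system is stored separately, while the matching data blocks from the other file systems are merged into one block.]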
Outline
• Introduction
• Building reliable file systems
• Reducing overheads with SubSIST
• Evaluation
  – Reliability
  – Performance
• Conclusion
Reliability Evaluation: Fault Injection
Robustness of EnvyFS in recovering from a child file system's mistake?
• Corruption: bugs in FS / storage stack
• Types of disk blocks
  – superblock, inode, block bitmap, file data, …
• Perform different file ops
  – mount, stat, creat, unlink, read, …
• Report user visible results
• All results are applicable with SubSIST except corruption to data blocks
• Type-aware fault injection [Prabhakaran05]
[Figure: fault-injection setup. Components shown: VFS, EnvyFS layer, ext3, JFS, ReiserFS, pseudo device driver, block driver, disk. A block B is corrupted for one child file system.]
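The idea of type-aware fault injection (corrupt only blocks of a chosen type, then observe what the user sees) can be illustrated with a toy pseudo device. This is a hedged sketch and not the tool from [Prabhakaran05]: the class `FaultInjectingDevice`, the block-type map and the bit-flipping corruption are all assumptions made for the example.

```python
class FaultInjectingDevice:
    """Toy pseudo device that corrupts reads of one block type."""

    def __init__(self, blocks, block_types, target_type):
        self.blocks = blocks            # block number -> bytes
        self.block_types = block_types  # block number -> type name, e.g. "inode"
        self.target_type = target_type  # the block type to corrupt

    def read(self, block_no):
        data = self.blocks[block_no]
        if self.block_types.get(block_no) == self.target_type:
            return bytes(b ^ 0xFF for b in data)  # flip every bit of the block
        return data
```

Placing such a device under exactly one child file system models a mistake confined to that child, which is the scenario the result matrices report on.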
Result Matrix
[Figure: result matrix for ext3. Columns (operations): path traversal, SET-1 (stat, …), SET-2 (chmod), read, readlink, getdirentries, creat, link, mkdir, rename, symlink, write, truncate, rmdir, unlink, mount, SET-3 (fsync), umount. Rows (corrupted block type): INODE, DIR, BMAP, IMAP, INDIRECT, DATA, SUPER, JSUPER, GDESC. Legend: Normal, Data loss, N/A, Cannot mount, Ops fail, Data corrupt, Crash, Read-only, Depends.]
[Figure: ext3 result matrix, same operations and block types, highlighting the Data loss, N/A and Cannot mount outcomes.]
Ext3 stores many superblock copies but does not handle superblock corruption
[Figure: ext3 result matrix, same operations and block types, highlighting the Data loss, N/A, Cannot mount, Ops fail and Crash outcomes.]
• In addition to operations failing, inode corruption leads to data loss
• Unlink: system crash during unmount
[Figure: ext3 result matrix repeated with the full legend.]
Kernel panic in
[Figure: result matrix for EnvyFS3 (children: ext3, ReiserFS, JFS) with faults injected into ext3 blocks; same operations and block types as the ext3 matrix. Legend: Normal, N/A.]
EnvyFS3 works in every scenario
Potential for Bug Isolation
• ext3 (over time)
  – Unlink on corrupt inode: ext3_lookup (bug), then ext3_unlink
  – Unmount (panic)
• EnvyFS3 (over time)
  – Unlink on corrupt inode: ext3_lookup (bug)
  – ext3 inode does not match others
  – Further ops not issued
• In typical use, a problem is noticed only on panic
• In EnvyFS3, a problem is noticed the first time a child file system returns wrong results
[Figure: result matrix for JFS. Columns (operations): path traversal, SET-1, SET-2, read, readlink, getdirentries, creat, link, mkdir, rename, symlink, write, truncate, rmdir, unlink, mount, SET-3, umount. Rows (corrupted block type): INODE, DIR, BMAP, IMAP, INTERNAL, DATA, SUPER, JSUPER, JDATA, AGGR-INODE, IMAPDESC, IMAPCNTL. Legend: Normal, Data loss, N/A, Cannot mount, Ops fail, Data corrupt, Crash, Read-only, Depends.]
[Figure: result matrix for EnvyFS3 (children: ext3, ReiserFS, JFS) with faults injected into JFS blocks; same operations and block types as the JFS matrix. Legend: Normal, N/A, Crash.]
Kernel panic in EnvyFS3
Performance Evaluation
• Experimental setup
  – AMD Opteron 2.2 GHz processor
  – 2 GB RAM
  – 80 GB Hitachi Deskstar 7200-rpm SATA disk
  – Linux 2.6.12
  – 4 GB disk partition for each file system
OpenSSH Benchmark
• CPU intensive
• OpenSSH 4.5: copy, untar and make
[Figure: bar chart. Y axis: elapsed time (in seconds). X axis: file systems.]
Performance of EnvyFS3 is comparable to a single file system
3% overhead
Postmark Benchmark
• I/O intensive
  – Mimics busy mail server workload
  – Transaction: creates, deletes, reads, appends, …
• Postmark configuration
  – 2500 files
  – File size: 4 KB – 40 KB
  – No. of transactions: 10K and 100K
[Figure: bar chart. Y axis: elapsed time (in seconds). Callouts on the chart:]
• EnvyFS3: 3.3x; + SubSIST: -32%
• EnvyFS3: 8x; + SubSIST: 4x
• EnvyFS3: 1.7x; + SubSIST: 11.5%
Summary of Results
• Robustness
  – Traditional file systems vulnerable to corruptions
  – EnvyFS3 tolerates almost all mistakes in one FS
• Performance
  – Desktop workloads: EnvyFS3 has comparable performance
  – I/O intensive workloads:
    • Regular operations: EnvyFS3 + SubSIST has acceptable performance
    • Memory pressure: EnvyFS3 + SubSIST has large overhead
Outline
• Introduction
• Building reliable file systems
• Reducing overheads with SubSIST
• Evaluation
• Conclusion
Conclusion
• Bugs/mistakes are inevitable in any software
  – Must cope, not just hope to avoid
• EnvyFS: N-version approach to tolerating FS bugs
  – Built using existing specification and file systems
• SubSIST: single instance store
  – Decreases overheads while retaining reliability
Thank You!
Advanced Systems Lab (ADSL)
University of Wisconsin-Madison
http://www.cs.wisc.edu/adsl