More on Disks and File Systems
Transcript of More on Disks and File Systems
CS-3013 & CS-502, Summer 2006
More on File Systems 1
CS-3013 & CS-502 Operating Systems
Additional Topics
• Mapping files to VM
• RAID – Redundant Array of Inexpensive Disks
• Stable Storage
• Log-Structured File Systems
Reading Assignment(s)
• RAID – Tanenbaum §5.4.1
• Stable Storage – Tanenbaum §5.4.5
• Log-Structured File System – Tanenbaum §6.3.8

These topics will be included on the exam next week regardless of whether we complete them this evening.
Mapping files to VM
• Instead of “reading” from disk into virtual memory, why not simply use file as the swapping storage for certain VM pages?
• Called mapping
• Page tables in kernel point to disk blocks of the file
Memory-Mapped Files
• Memory-mapped file I/O allows file I/O to be treated as routine memory access by mapping a disk block to a page in memory
• A file is initially read using demand paging. A page-sized portion of the file is read from the file system into a physical page. Subsequent reads/writes to/from the file are treated as ordinary memory accesses.
• Simplifies file access by allowing the application to simply access memory rather than being forced to use read() & write() calls to the file system
Memory-Mapped Files (continued)
• A tantalizingly attractive notion, but …
• Cannot use C/C++ pointers within the mapped data structure – a stored pointer is only valid if the file is mapped at the same address every time
• Corrupted data structures more likely to persist in file
• Don’t really save anything in terms of
  – Programming energy
  – Thought processes
  – Storage space & efficiency
Memory-Mapped Files (continued)
Nevertheless, the idea has its uses:
1. Simpler implementation of file operations
   – read(), write() are memory-to-memory operations
   – seek() is simply changing a pointer, etc.
   – Called memory-mapped I/O
2. Shared Virtual Memory among processes
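As a concrete illustration of memory-mapped I/O, here is a minimal Python sketch using the standard mmap module; the file path and one-page size are arbitrary choices for this example:

```python
# Memory-mapped file I/O sketch: reads and writes become memory accesses.
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.dat")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)          # pre-size the file to one page

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        m[0:5] = b"hello"            # a "write()" is just a memory store
        data = m[0:5]                # a "read()" is just a memory load
        # a "seek()" is simply choosing a different index into m

with open(path, "rb") as f:
    assert f.read(5) == b"hello"     # the store reached the file
```

Note that the kernel pages the file in on demand, exactly as the slide describes: the first access faults a page-sized portion of the file into memory.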
Shared Virtual Memory

[Diagram: processes sharing a memory-mapped region of the same file]
Shared Virtual Memory (continued)
• Supported in
  – Windows XP
  – Apollo DOMAIN
  – Linux??
• Synchronization is the responsibility of the sharing applications
  – OS retains no knowledge
Questions?
Problem
• Question:
  – If the mean time to failure of a disk drive is 100,000 hours,
  – and if your system has 100 identical disks,
  – what is the mean time between drive replacements?
• Answer:
  – 1,000 hours (i.e., 41.67 days ≈ 6 weeks)
• I.e.:
  – You lose 1% of your data every 6 weeks!
• But don’t worry – you can restore most of it from backup!
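The arithmetic above can be checked directly; this is just the simple MTTF-divided-by-disk-count estimate, not a full reliability model:

```python
# Worked version of the slide's arithmetic: with n identical disks,
# the expected time between drive replacements is MTTF / n.
mttf_hours = 100_000
n_disks = 100
hours_between_failures = mttf_hours / n_disks      # 1000 hours
days = hours_between_failures / 24                 # ~41.67 days
weeks = days / 7                                   # ~6 weeks
print(hours_between_failures, round(days, 2), round(weeks, 1))
```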
Can we do better?
• Yes, mirrored
  – Write every block twice, on two separate disks
  – Mean time between simultaneous failure of both disks is ~57,000 years
• Can we do even better?
  – E.g., use fewer extra disks?
  – E.g., get more performance?
RAID – Redundant Array of Inexpensive Disks
• Distribute a file system intelligently across multiple disks to
  – Maintain high reliability and availability
  – Enable fast recovery from failure
  – Increase performance
“Levels” of RAID
• Level 0 – non-redundant striping of blocks across disks
• Level 1 – simple mirroring
• Level 2 – striping of bytes or bits with ECC
• Level 3 – Level 2 with parity, not ECC
• Level 4 – Level 0 with a parity block
• Level 5 – Level 4 with distributed parity blocks
RAID Level 0 – Simple Striping
• Each stripe is one block or a group of contiguous blocks
• Block/group i is on disk (i mod n)
• Advantage
  – Read/write n blocks in parallel; n times the bandwidth
• Disadvantage
  – No redundancy at all. System MTBF is 1/n of a single disk’s MTBF!
[Diagram: four disks; stripes placed round-robin – disk 0 holds stripes 0, 4, 8; disk 1 holds 1, 5, 9; disk 2 holds 2, 6, 10; disk 3 holds 3, 7, 11]
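The Level 0 placement rule above can be sketched in a few lines of Python (the helper name raid0_location is made up for this example):

```python
# RAID 0 block placement: stripe i lives on disk (i mod n),
# at offset (i div n) on that disk. n = 4 matches the diagram.
def raid0_location(stripe: int, n: int):
    """Return (disk index, offset on that disk) for a stripe number."""
    return stripe % n, stripe // n

n = 4
layout = {d: [s for s in range(12) if raid0_location(s, n)[0] == d]
          for d in range(n)}
print(layout)   # disk 0 holds stripes 0, 4, 8; disk 1 holds 1, 5, 9; ...
```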
RAID Level 1 – Striping and Mirroring

• Each stripe is written twice
• Two separate, identical disks
• Block/group i is on disks (i mod 2n) & ((i + n) mod 2n)
• Advantages
  – Read/write n blocks in parallel; n times the bandwidth
  – Redundancy: System MTBF = (Disk MTBF)² at twice the cost
  – A failed disk can be replaced by copying
• Disadvantage
  – A lot of extra disks for much more reliability than we need
[Diagram: eight disks; disks 0–3 hold stripes 0–11 striped as in RAID 0, and disks 4–7 hold an identical mirrored copy]
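The slide's RAID 1 placement rule can be sketched the same way (raid1_disks is a hypothetical helper; with 2n disks, each stripe lands on a disk and its mirror partner n disks away):

```python
# RAID 1 placement per the slide: stripe i goes to disks
# (i mod 2n) and ((i + n) mod 2n) -- a disk and its mirror partner.
def raid1_disks(i: int, n: int):
    return i % (2 * n), (i + n) % (2 * n)

# With n = 4 (eight disks), stripes 0..3 pair disks 0-4, 1-5, 2-6, 3-7,
# matching the mirrored diagram above.
pairs = [raid1_disks(i, 4) for i in range(4)]
print(pairs)
```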
RAID Levels 2 & 3
• Bit- or byte-level striping
• Requires synchronized disks
• Requires fancy electronics for ECC calculations
• Highly impractical; not used – academic interest only
• See Silberschatz, §12.7.3 (pp. 471–472)
Observation
• When a disk or stripe is read incorrectly, we know which one failed!
• Conclusion:
  – A simple parity disk can provide very high reliability
    • (unlike simple parity in memory)
RAID Level 4 – Parity Disk
• parity 0-3 = stripe 0 xor stripe 1 xor stripe 2 xor stripe 3
• n stripes plus parity are written/read in parallel
• If any disk/stripe fails, it can be reconstructed from the others
  – E.g., stripe 1 = stripe 0 xor stripe 2 xor stripe 3 xor parity 0-3
• Advantages
  – n times the read bandwidth
  – System MTBF = (Disk MTBF)² at 1/n additional cost
  – Failed disk can be reconstructed “on the fly” (hot swap)
  – Hot expansion: simply add n + 1 disks, all initialized to zeros
• However
  – Writing requires a read-modify-write of the parity stripe – only 1× write bandwidth
[Diagram: five disks; disks 0–3 hold stripes 0–11 striped as in RAID 0, and disk 4 holds parity 0-3, parity 4-7, parity 8-11]
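The parity and reconstruction equations above can be demonstrated with XOR over byte strings (the stripe values here are arbitrary test data):

```python
# RAID 4 parity: parity = XOR of the data stripes, and a lost stripe
# is the XOR of the surviving stripes plus the parity.
from functools import reduce

def xor_blocks(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

stripes = [b"\x0f" * 4, b"\xf0" * 4, b"\x55" * 4, b"\xaa" * 4]
parity = xor_blocks(stripes)

# Simulate losing stripe 1 and rebuilding it from the rest + parity.
rebuilt = xor_blocks([stripes[0], stripes[2], stripes[3], parity])
assert rebuilt == stripes[1]
```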
RAID Level 5 – Distributed Parity
• Parity calculation is the same as RAID Level 4
• Advantages & disadvantages
  – Same as RAID Level 4
• Additional advantage: avoids beating up on the parity disk
• Writing individual stripes (RAID 4 & 5):
  – Read the existing stripe and existing parity
  – Recompute the parity
  – Write the new stripe and new parity
[Diagram: five disks with parity rotated across them – parity 0-3 on disk 4, parity 4-7 on disk 3, parity 8-11 on disk 2, parity 12-15 on disk 1; data stripes 0–15 fill the remaining slots]
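The small-write (read-modify-write) sequence above can be sketched as an incremental parity update: the new parity is the old parity XOR the old data XOR the new data, which matches recomputing parity from scratch. The helper name and byte values are made up for this example:

```python
# Incremental parity update used by RAID 4/5 small writes:
# new_parity = old_parity XOR old_data XOR new_data
def update_parity(old_parity, old_data, new_data):
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

old_data   = b"\x12\x34"                 # stripe being overwritten
other_data = b"\xab\xcd"                 # an untouched stripe in the group
old_parity = bytes(a ^ b for a, b in zip(old_data, other_data))

new_data = b"\xff\x00"
new_parity = update_parity(old_parity, old_data, new_data)

# The incrementally updated parity matches a full recomputation.
assert new_parity == bytes(a ^ b for a, b in zip(new_data, other_data))
```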
RAID 4 & 5
• Very popular in data centers
  – Corporate and academic servers
• Built-in support in Windows XP and other systems
  – Connect a group of disks to a fast SCSI port (320 MB/sec bandwidth)
  – OS RAID support does the rest!
New Topic
• Problem – how to protect against disk write operations that don’t complete
  – Power or CPU failure in the middle of a block
  – Related series of writes interrupted in the middle
• Examples:
  – Database update of charge and credit
  – RAID 1, 4, 5 failure between redundant writes
Solution (part 1) – Stable Storage
• Write everything twice (on separate disks)
• Be sure the 1st write does not invalidate the previous 2nd copy
  – RAID 1 is okay; RAID 4/5 are not okay!
• Read blocks back to validate; then report completion

• Reading both copies:
  – If the 1st copy is okay, use it – i.e., the newest value
  – If the 2nd copy differs, update it with the 1st copy
  – If the 1st copy has an error, use the 2nd copy – i.e., the old value
Stable Storage (continued)
• Crash recovery:
  – Scan disks, comparing corresponding blocks
  – If one is bad, replace it with the good one
  – If both are good but different, replace the 2nd with the 1st copy
• Result:
  – If the 1st block is good, it contains the latest value
  – If not, the 2nd block still contains the previous value
• An abstraction of an atomic disk write of a single block
  – Uninterruptible by power failure, etc.
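A toy version of the stable-storage read rules, assuming each copy carries a checksum so a bad read is detectable (zlib.crc32 stands in here for the disk's own error detection; read_stable is a hypothetical helper):

```python
# Stable-storage read: prefer copy 1 (the newest value); fall back to
# copy 2 (the old value) only if copy 1 fails its checksum.
import zlib

def read_stable(copy1, copy2, crc1, crc2):
    ok1 = zlib.crc32(copy1) == crc1
    ok2 = zlib.crc32(copy2) == crc2
    if ok1:
        return copy1            # newest value; recovery would also
                                # rewrite copy 2 from copy 1 if they differ
    if ok2:
        return copy2            # copy 1 bad: fall back to the old value
    raise IOError("both copies bad")

good = b"new value"
old  = b"old value"
assert read_stable(good, old, zlib.crc32(good), zlib.crc32(old)) == good
assert read_stable(b"garbage", old, zlib.crc32(good), zlib.crc32(old)) == old
```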
What about more complex disk operations?
• E.g., a file create operation involves:
  – Allocating free blocks
  – Constructing and writing the i-node
    • Possibly multiple i-node blocks
  – Reading and updating the directory
• What if the system crashes with the sequence only partly completed?
• Answer: inconsistent data structures on disk
Solution (Part 2) – Log-Structured File System

• Make changes to cached copies in memory
• Collect together all changed blocks
• Write them to the log file
  – A circular buffer on disk
  – Fast, contiguous write
• Update the log-file pointer in stable storage
• Offline: play back the log file to actually update directories, i-nodes, free list, etc.
• Update the playback pointer in stable storage
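A toy sketch of the logging scheme above: changed blocks are appended to a circular log, with head (write) and tail (playback) pointers that would live in stable storage. TinyLog and its methods are invented names for this illustration:

```python
# Batched changes are appended to a circular log, then replayed
# ("played back") to the blocks' real locations on disk.
class TinyLog:
    def __init__(self, size):
        self.buf = [None] * size      # circular log "on disk"
        self.head = 0                 # write pointer (kept in stable storage)
        self.tail = 0                 # playback pointer

    def append(self, changes):        # changes: {block_no: data}
        for blk, data in changes.items():
            self.buf[self.head % len(self.buf)] = (blk, data)
            self.head += 1            # one fast, contiguous write

    def playback(self, disk):         # offline: apply log entries in order
        while self.tail < self.head:
            blk, data = self.buf[self.tail % len(self.buf)]
            disk[blk] = data
            self.tail += 1

disk = {}
log = TinyLog(8)
log.append({7: b"inode", 3: b"dir"})  # e.g., an i-node and a directory block
log.playback(disk)
assert disk == {7: b"inode", 3: b"dir"}
```

After a crash, replaying the log from the playback pointer to the log pointer reapplies any updates that never reached their home locations.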
Transaction Data Base Systems
• Similar techniques
  – Every transaction is recorded in the log before being recorded on disk
  – Stable-storage techniques for managing log pointers
  – Once the log write is confirmed, the disk can be updated in place
  – After a crash, replay the log to redo disk operations
Unix LFS
• Tanenbaum, §6.3.8, pp. 428–430
• Everything is written to the log
  – i-nodes point to updated blocks in the log
  – The in-memory i-node cache is updated whenever an i-node is written
  – A cleaner daemon follows behind to compact the log
• Advantages:
  – LFS is always consistent
  – LFS performance
    • Much better than Unix FS for small writes
    • At least as good for reads and large writes