Yashwant K Malaiya Fall 2015
CS370 Operating Systems
Slides based on
• Text by Silberschatz, Galvin, Gagne
• Berkeley Operating Systems group
• S. Pallikara
• Other sources

Chap 12: File System Implementation
• Implementing local file systems and directory structures
• Implementation of remote file systems
• Block allocation and free-block algorithms and trade-offs

Chap 12: File System Implementation
• File-System Structure
• File-System Implementation
• Directory Implementation
• Allocation Methods
• Free-Space Management
• Efficiency and Performance
• Recovery

Questions from last time
• How do you access data when one drive in RAID 0 fails?
  – RAID 0: non-redundant, striping
  – If a drive fails, the data is lost.
• Upcoming deadlines: see Schedule on website
  – HW2 Dec 1
  – Slides Dec 3 on Piazza
  – Peer reviews Dec 8 (it is a homework)
  – Final project report Dec 9
  – Final test Dec 15
• Done: File System Interface. Now implementation.

File-System Structure
• File structure
  – Logical storage unit
  – Collection of related information
• File system resides on secondary storage (disks/SSD)
  – Provides user interface to storage, mapping logical to physical
  – Provides efficient and convenient access to disk by allowing data to be stored, located, and retrieved easily
  – Can be on other media (flash etc.), with a different file system
• Disk provides in-place rewrite and random access
  – I/O transfers performed in blocks of sectors (usually 512 bytes)
• File control block: storage structure holding information about a file (inode in Linux), including the location of its data
• Device driver controls the physical device

Layered File System
[Figure: layered file system. Labels: Files, metadata; File system; Logical blocks to physical blocks; Device drivers; Linear array of blocks]

Layered File System
[Figure: path of a request through the file-system layers, top to bottom]
• Processes: fd = open(afilename, ..); read(fd, buf, size); write(fd, buf, size); close(fd)
• Logical File System Layer: search directory, find file location, determine which file blocks will be used
  – Passes down: logical block number(s); file_start block on disk
• File Organization Layer: map file blocks (logical blocks) to disk blocks (physical blocks); disk allocation
  – Passes down: physical block numbers
• Basic File System Layer: commands to device driver, buffering of disk data, caching of disk blocks
  – In cache? If not, get block
• Disk Driver
• Disk Controller: cylinder, track, sector, R/W

File System Layers (from bottom)
• Device drivers manage I/O devices at the I/O control layer
  – Given a command like “read drive1, cylinder 72, track 2, sector 10, into memory location 1060”, the driver outputs low-level hardware-specific commands to the hardware controller
• Basic file system, given a command like “retrieve block 123”, translates it for the device driver
  – Also manages memory buffers and caches (allocation, freeing, replacement)
    • Buffers hold data in transit
    • Caches hold frequently used data
• File organization module understands files, logical addresses, and physical blocks
  – Translates logical block # to physical block #
  – Manages free space, disk allocation

File System Layers (Cont.)
• Logical file system manages metadata information
  – Translates file name into file number, file handle, and location by maintaining file control blocks (inodes in UNIX)
  – Directory management
  – Protection

File Systems
• Many file systems, sometimes several within an operating system
  – Each with its own format
    • Windows has FAT (1977), FAT32 (1996), NTFS (1993)
    • Linux has more than 40 types, with extended file system (1992), ext2 (1993), ext3 (2001), ext4 (2008)
    • Plus distributed file systems
    • Floppy, CD, DVD, Blu-ray
  – New ones still arriving: ZFS, GoogleFS, Oracle ASM, FUSE, exFAT

File Systems
File System   Max File Size   Max Partition Size   Journaling   Notes
FAT32         4 GiB           8 TiB                No           Commonly supported
exFAT         128 PiB         128 PiB              No           Optimized for flash
NTFS          2 TiB           256 TiB              Yes          For Windows compatibility
ext2          2 TiB           32 TiB               No           Legacy
ext3          2 TiB           32 TiB               Yes          Standard Linux filesystem for many years. Best choice for super-standard installation.
ext4          16 TiB          1 EiB                Yes          Modern iteration of ext3. Best choice for new installations where super-standard isn't necessary.

File-System Implementation
• Based on several on-disk and in-memory structures.
• On-disk
  – Boot control block (per volume)
  – Volume control block (per volume)
  – Directory structure (per file system)
  – File control block (per file)
• In-memory
  – Mount table
  – Directory structure cache
  – The open-file table (system-wide and per process)
  – Buffers of the file-system blocks
Volume: logical disk drive, perhaps a partition

On-disk File-System Structures
1. Boot control block contains info needed by the system to boot the OS from that volume
  – Needed if the volume contains an OS; usually the first block of the volume
2. Volume control block (superblock in UFS, master file table in NTFS) contains volume details
  – Total # of blocks, # of free blocks, block size, free-block pointers or array
3. Directory structure organizes the files
  – File names and inode numbers in UFS, master file table in NTFS
[Figure: on-disk layout. Labels: Boot block; Super block; Directory, FCBs; File data blocks]

File-System Implementation (Cont.)
4. Per-file File Control Block (FCB or “inode”) contains many details about the file
  – UFS: indexed by inode number; holds permissions, size, dates
  – NTFS: stored in the master file table, using relational DB structures

In-Memory File System Structures
• An in-memory mount table contains information about each mounted volume.
• An in-memory directory-structure cache holds the directory information of recently accessed directories.
• The system-wide open-file table contains a copy of the FCB of each open file, as well as other information.
• The per-process open-file table contains a pointer to the appropriate entry in the system-wide open-file table.
• Plus, buffers hold data blocks from secondary storage.
• Open returns a file handle (file descriptor) for subsequent use.
• Data from read is eventually copied to the specified user process memory address.

Create a file
• A program calls the logical file system.
• The logical file system knows the format of the directory structures, and allocates a new FCB.
• The system then reads the appropriate directory into memory, updates it with the new file name and FCB, and writes it back to the disk.

Open a File
• The file must be opened.
  – The open() call passes a file name to the logical file system.
• The open() call first searches the system-wide open-file table to see if the file is already in use by another process.
  – If yes: a per-process open-file table entry is created, pointing to the existing system-wide entry.
  – If no: the directory structure is searched for the given file name; once the file is found, the FCB is cached in the system-wide open-file table in memory.
• This table stores the FCB as well as the number of processes that have the file open.

Open a file, Read From a File
• The open() call returns an index to the appropriate entry in the per-process open-file table.
• This index is called a file descriptor in Unix and a file handle in Windows.
• All file read/write operations are then performed via this index.

Close a File
• When a process closes the file:
  – The per-process table entry is removed.
  – The system-wide entry's open count is decremented.
• When all users that have opened the file close it, any updated metadata is copied back to the disk-based directory structure, and the system-wide open-file table entry is removed.

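The open and close steps on the last few slides can be sketched as a toy model. This is only an illustration under simplifying assumptions: the class name `OpenFileTables`, the dict-based "directory", and the use of the file name as the system-wide key are all made up for this sketch and are not how a real kernel stores these tables.

```python
class OpenFileTables:
    """Toy model of the system-wide and per-process open-file tables."""

    def __init__(self, directory):
        self.directory = directory   # file name -> FCB (a dict standing in for on-disk metadata)
        self.system_wide = {}        # file name -> {"fcb": ..., "open_count": n}
        self.per_process = {}        # pid -> list of file names, indexed by file descriptor

    def open(self, pid, name):
        entry = self.system_wide.get(name)
        if entry is None:
            # Not open anywhere yet: search the directory and cache the FCB.
            entry = {"fcb": self.directory[name], "open_count": 0}
            self.system_wide[name] = entry
        entry["open_count"] += 1
        table = self.per_process.setdefault(pid, [])
        table.append(name)
        return len(table) - 1        # the descriptor is an index into the per-process table

    def close(self, pid, fd):
        name = self.per_process[pid][fd]
        self.per_process[pid][fd] = None     # remove the per-process entry
        entry = self.system_wide[name]
        entry["open_count"] -= 1
        if entry["open_count"] == 0:
            # Last close: copy metadata back and drop the system-wide entry.
            self.directory[name] = entry["fcb"]
            del self.system_wide[name]
```

Two processes opening the same file share one system-wide entry whose count goes to 2; the entry disappears only after both have closed it.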
Partitions and Mounting
• A partition can be a volume containing a file system, or raw: just a sequence of blocks with no file system
• Boot block can point to the boot volume, or to a boot loader: a set of blocks that contain enough code to know how to load the kernel from the file system
• Root partition contains the OS and is mounted at boot time
  – Other partitions can hold other OSes, other file systems, or be raw
  – Other partitions can mount automatically or manually
• At mount time, file system consistency is checked
  – Is all metadata correct?
    • If not, fix it, try again
    • If yes, add to mount table, allow access

Virtual File Systems
• Virtual File Systems (VFS) on Unix provide an object-oriented way of implementing file systems
• VFS allows the same system call interface (the API) to be used for different types of file systems
• The API (POSIX system calls) is to the VFS interface, rather than to any specific type of file system
[Figure: virtual to specific FS interface]

Virtual File Systems
• The VFS layer serves two important functions:
  1. It separates file-system-generic operations from their implementation, and allows transparent access to different types of file systems mounted locally.
  2. It provides a mechanism for uniquely representing a file throughout a network.
• The VFS is based on a structure called a vnode.
  – Contains a numerical designator for a network-wide unique file.
  – Unix inodes are unique within only a single file system.
  – The kernel maintains one vnode structure for each active node (file or directory).

Virtual File System Implementation
• VFS defines a set of operations on the objects that must be implemented
• Every object has a pointer to a function table.
  – The function table has addresses of routines that implement each operation on that object.
• For example:
  – int open(. . .): open a file
  – int close(. . .): close an already-open file
  – ssize_t read(. . .): read from a file
  – ssize_t write(. . .): write to a file
  – int mmap(. . .): memory-map a file

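The function-table idea can be sketched in a few lines. Everything named here (`ext_read`, `nfs_read`, `Vnode`, `vfs_read`) is hypothetical and stands in for real kernel structures; the point is only that the generic layer dispatches through the table without knowing which file system it is talking to.

```python
# Each hypothetical file-system type supplies its own routines.
def ext_read(path):
    return f"ext: read {path}"

def nfs_read(path):
    return f"nfs: read {path} over the network"

# The function tables: operation name -> routine for that file system.
EXT_OPS = {"read": ext_read}
NFS_OPS = {"read": nfs_read}

class Vnode:
    """A file object that carries a pointer to its file system's function table."""

    def __init__(self, path, ops):
        self.path = path
        self.ops = ops

def vfs_read(vnode):
    # The generic layer never names a concrete file system; it looks up the routine.
    return vnode.ops["read"](vnode.path)
```

Calling `vfs_read` on a `Vnode` built with `EXT_OPS` and on one built with `NFS_OPS` runs different routines behind the same call.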
Directory Implementation
• Linear list of file names with pointers to the data blocks
  – Simple to program
  – Time-consuming to execute
    • Linear search time
    • Could keep ordered alphabetically via linked list, or use a B+ tree
• Hash table: linear list with a hash data structure
  – Decreases directory search time
  – Collisions: situations where two file names hash to the same location
    • Use chained-overflow method
    • Each hash entry can be a linked list instead of an individual value.

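A minimal sketch of the chained-overflow idea follows. The class name `HashedDirectory` and the byte-sum hash are assumptions chosen so that collisions are easy to produce; a real file system would use a better hash and an on-disk layout.

```python
class HashedDirectory:
    """Directory as a hash table; each bucket is a chain of (name, inode) pairs."""

    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]

    def _bucket(self, name):
        # Simple deterministic hash (sum of byte values), for illustration only.
        return self.buckets[sum(name.encode()) % len(self.buckets)]

    def add(self, name, inode):
        chain = self._bucket(name)
        for i, (existing, _) in enumerate(chain):
            if existing == name:
                chain[i] = (name, inode)    # replace an existing entry
                return
        chain.append((name, inode))         # empty bucket or collision: extend the chain

    def lookup(self, name):
        for existing, inode in self._bucket(name):
            if existing == name:
                return inode
        return None
```

With this hash, the names "ab" and "ba" land in the same bucket, so both end up on one chain and lookups walk that chain to tell them apart.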
Questions from last time
• Which file structure is the fastest?
  – Probably the newer ones?
• How do device drivers work?
  – Complex. They translate the logical view to controller commands.
• What is mounting?
  – Making a file structure on a storage device accessible to the computer's filing structure.
• How does a flash memory work?
  – By trapping or not trapping electrons
• On-disk file system structures

Allocation Methods – i. Contiguous
An allocation method refers to how disk blocks are allocated for files:
• Contiguous (not common now)
• Linked (e.g. FAT32)
• Indexed (e.g. ext2)
i. Contiguous allocation: each file occupies a set of contiguous blocks
  – Simple: only the starting location (block #) and length (number of blocks) are required
  – First fit / best fit / worst fit
  – Problems include finding space for a file, knowing the file size, external fragmentation, and the need for compaction off-line (downtime) or on-line

Contiguous Allocation
• Mapping logical byte address LA to physical (assume block size = 512):
  – LA / 512 gives quotient Q and remainder R
  – Block to be accessed = Q + starting block number (address)
  – Displacement into block = R
[Figure: directory example. File tr: 3 blocks, starting at block 14]

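The mapping is a single division. A small sketch, assuming the slide's 512-byte blocks; the function name and the bounds check are additions for illustration.

```python
BLOCK_SIZE = 512  # bytes per block, as assumed on the slide

def contiguous_lookup(la, start_block, length_blocks):
    """Map logical byte address `la` to (physical block, offset) for a contiguous file."""
    q, r = divmod(la, BLOCK_SIZE)
    if q >= length_blocks:
        raise ValueError("address lies beyond the end of the file")
    return start_block + q, r
```

For the file tr (3 blocks starting at block 14), byte 1000 gives Q = 1 and R = 488, so it lives at offset 488 of block 15.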
Extent-Based Systems
• Many newer file systems (e.g., Veritas File System) use a modified contiguous allocation scheme
  – Note: Veritas File System actually dates from 1991
• Extent-based file systems allocate disk blocks in extents
• An extent is a contiguous set of disk blocks
  – Extents are allocated for file allocation
  – A file consists of one or more extents

Allocation Methods - Linked
ii. Linked allocation: each file is a linked list of blocks
  – Each block contains a pointer to the next block
  – File ends at a null pointer
  – No external fragmentation, no compaction
  – Free-space management system is called when a new block is needed
  – Locating a block can take many I/Os and disk seeks
  – Efficiency improves by clustering blocks into groups, but this increases internal fragmentation
  – Reliability can be a problem, since every block in a file is linked: one lost or damaged pointer cuts off the rest of the file

Allocation Methods – Linked (Cont.)
• FAT (File Allocation Table) variation
  – Beginning of volume has a table, indexed by block number
  – Much like a linked list, but faster on disk and cacheable
  – New block allocation is simple
• Each FAT entry corresponds to one block of storage and holds the number of the next block in the file. Free-block entries are also linked.

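Following a file through the FAT can be sketched as below. The table is modelled as a plain dict, and the `END_OF_FILE` sentinel is an assumption for this sketch (real FAT variants reserve specific entry values for end-of-chain and for free blocks).

```python
END_OF_FILE = -1  # sentinel chosen for this sketch only

def fat_chain(fat, start_block):
    """Follow FAT entries from `start_block` and return the file's blocks in order."""
    blocks = []
    block = start_block
    while block != END_OF_FILE:
        blocks.append(block)
        block = fat[block]     # the entry for a block names the next block of the file
    return blocks
```

Because the whole table sits at the start of the volume and can be cached, walking the chain needs no reads of the data blocks themselves, which is what makes this faster than plain linked allocation.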
Linked Allocation
• Each file is a linked list of disk blocks; blocks may be scattered anywhere on the disk
  – block = pointer + data
• Mapping logical byte address LA (assuming pointer size is 1 byte, so each 512-byte block holds 511 bytes of data):
  – LA / 511 gives quotient Q and remainder R
  – Block to be accessed is the Qth block in the linked chain of blocks representing the file
  – Displacement into block = R + 1

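A sketch of this mapping, under the slide's assumptions of 512-byte blocks with a 1-byte pointer stored at the start of each block. The `next_block` dict is a hypothetical in-memory stand-in for reading each block's pointer from disk.

```python
BLOCK_SIZE = 512
POINTER_SIZE = 1                              # the slide's simplifying assumption
DATA_PER_BLOCK = BLOCK_SIZE - POINTER_SIZE    # 511 bytes of file data per block

def linked_lookup(la, first_block, next_block):
    """Return (physical block, offset) for logical byte `la` of a linked file."""
    q, r = divmod(la, DATA_PER_BLOCK)
    block = first_block
    for _ in range(q):             # on a real disk, each hop costs a block read
        block = next_block[block]
    return block, r + POINTER_SIZE # skip the pointer at the start of the block
```

The loop is the reason linked allocation is poor for random access: reaching the Qth block means visiting every block before it.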
Allocation Methods - Indexed
• Indexed allocation
  – Each file has its own index block(s) of pointers to its data blocks
• Logical view
[Figure: index table with pointers to data blocks]

Indexed Allocation (Cont.)
• Need index table
• Random access
• Dynamic access without external fragmentation, but with the overhead of an index block even for a small file
• Mapping from logical to physical in a file of maximum size 512 x 512 = 256K bytes with a block size of 512 bytes: only 1 block is needed for the index table (assuming pointer size is 1 byte, block is 512 bytes)
  – LA / 512 gives quotient Q and remainder R
  – Q = displacement into index table
  – R = displacement into block
• Larger files? Coming up

Indexed Allocation – Mapping (Cont.)
• Large file?
  – Linked scheme
  – Multi-level index
  – Combined scheme
• Mapping a file of unbounded length from logical to physical (block size of 512 words)
• Linked scheme: link blocks of a multi-block index table (no limit on size)
  – LA / (512 x 511) gives quotient Q1 and remainder R1
  – Q1 = block of index table
  – R1 is used as follows: R1 / 512 gives quotient Q2 and remainder R2
  – Q2 = displacement into block of index table
  – R2 = displacement into block of file

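Indexed Allocation – Mapping (Cont.)
• Two-level index (4K blocks could store 1,024 four-byte pointers in the outer index -> 1,048,576 data blocks and a file size of up to 4GB)
  – LA / (1024 x 1024) gives quotient Q1 and remainder R1
  – Q1 = displacement into outer index
  – R1 is used as follows: R1 / 1024 gives quotient Q2 and remainder R2
  – Q2 = displacement into block of index table
  – R2 = displacement into block of file
  – These divisors match a 4K block only if LA counts four-byte words (1,024 per block); for byte addresses the divisors would be 1024 x 4096 and 4096
• 1024 x 1024 blocks, each 4K
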
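The two divisions can be written as one small function. This is a sketch with made-up parameter names; `la` must be expressed in the same units as `units_per_block` (words or bytes).

```python
def two_level_lookup(la, units_per_block=1024, pointers_per_block=1024):
    """Split `la` into (outer index slot, inner index slot, offset in data block)."""
    q1, r1 = divmod(la, pointers_per_block * units_per_block)
    q2, r2 = divmod(r1, units_per_block)
    return q1, q2, r2
```

With 512-unit blocks and 512 pointers per index block, `two_level_lookup(788505, 512, 512)` returns `(3, 4, 25)`, since 788505 = 3(512 x 512) + 4(512) + 25. For a byte address with 4K blocks, pass `units_per_block=4096`.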
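Combined Scheme: UNIX UFS
• 4K bytes per block, 32-bit addresses
• More index blocks than can be addressed with a 32-bit file pointer
[Figure: inode (file control block) with its block pointers. Annotation: the volume block's table of file names points to this inode]
• Common: 12 direct pointers + 3 indirect pointers (single, double, triple). An indirect block could contain 1024 pointers.
• Max file size: k x k x k x 4K + ... (k = 1024)
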
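The "k x k x k x 4K + ..." figure can be spelled out term by term. This sketch assumes the 12 direct + single + double + triple indirect layout with 1024 pointers per indirect block, and it ignores the 32-bit file-pointer limit mentioned on the slide.

```python
BLOCK = 4 * 1024   # 4K bytes per block
K = 1024           # pointers per indirect block
DIRECT = 12        # direct pointers in the inode

def max_file_blocks():
    # direct + single indirect + double indirect + triple indirect
    return DIRECT + K + K**2 + K**3

def max_file_bytes():
    return max_file_blocks() * BLOCK
```

The four terms contribute 48 KiB, 4 MiB, 4 GiB, and 4 TiB respectively, so the triple-indirect term dominates.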
Performance
• Best method depends on file access type
  – Contiguous is great for sequential and random
  – Linked is good for sequential, not random
• Declare access type at creation -> select either contiguous or linked
• Indexed is more complex
  – Single block access could require 2 index block reads, then a data block read
  – Clustering can help improve throughput, reduce CPU overhead
Cluster: set of contiguous sectors

Performance (Cont.)
• Adding instructions to the execution path to save one disk I/O is reasonable
  – Intel Core i7 Extreme Edition 990x (2011) at 3.46GHz = 159,000 MIPS
    • http://en.wikipedia.org/wiki/Instructions_per_second
  – Typical disk drive at 250 I/Os per second
    • 159,000 MIPS / 250 = 636 million instructions during one disk I/O
  – Fast SSD drives provide 60,000 IOPS
    • 159,000 MIPS / 60,000 = 2.65 million instructions during one disk I/O

Questions from last time
• Locating a byte address with a 2-level directory structure
  – Use inner directory, then block, then offset within block
  – LA = 788505 = 3(512x512) + 4(512) + 25
  – Disadvantage: delay. Advantage: large files (512x512x512x4)
• Page vs block
  – Page/frame: virtual addressing units
  – Block: one or more sectors on a disk
• Indexed: good for random access (not seq)

Free-Space Management
• File system maintains a free-space list to track available blocks/clusters
  – (Using term “block” for simplicity)
• Approaches: i. Bit vector ii. Linked list iii. Grouping iv. Counting
• i. Bit vector or bit map (n blocks, numbered 0, 1, 2, ..., n-1)
  – bit[i] = 1 ⇒ block[i] free
  – bit[i] = 0 ⇒ block[i] occupied
  – Example: 00000000 00000000 00111110 ..
• Block number calculation for the first free block:
  – (number of bits per word) * (number of 0-value words) + offset of first 1 bit
  – CPUs have instructions to return the offset within a word of the first “1” bit

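The block-number calculation can be sketched directly. This assumes 8-bit words to match the grouping in the slide's example, and numbers bits left to right within a word as drawn; `int.bit_length()` plays the role of the CPU's find-first-set instruction.

```python
BITS_PER_WORD = 8   # matches the 8-bit groups in the slide's example

def first_free_block(words):
    """Return the number of the first free block, or None if no bit is 1."""
    for index, word in enumerate(words):
        if word != 0:
            # Position of the leftmost 1 bit, counting from the left end of the word.
            offset = BITS_PER_WORD - word.bit_length()
            return BITS_PER_WORD * index + offset
    return None
```

For the example map 00000000 00000000 00111110, two words are all zero and the first 1 bit is at offset 2 of the third word, so the first free block is 8 * 2 + 2 = 18.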
Free-Space Management (Cont.)
• Bit map requires extra space
  – Example:
    block size = 4KB = 2^12 bytes
    disk size = 2^40 bytes (1 terabyte)
    blocks: n = 2^40 / 2^12 = 2^28 bits (or 32MB) for the map
    if clusters of 4 blocks -> 8MB of memory
• Easy to get contiguous files if desired

Linked Free Space List on Disk
• ii. Linked list (free list)
  – Cannot get contiguous space easily
  – No waste of space
  – No need to traverse the entire list (if # free blocks recorded)
• Superblock can hold a pointer to the head of the linked list

Free-Space Management (Cont.)
• iii. Grouping
  – Modify the linked list to store the addresses of the next n-1 free blocks in the first free block, plus a pointer to the next block that contains free-block pointers (like this one)
• iv. Counting
  – Because space is frequently used and freed contiguously, with contiguous allocation, extents, or clustering
    • Keep the address of the first free block and a count of the following free blocks
    • Free-space list then has entries containing addresses and counts

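To see what a counting-style free list looks like, the sketch below converts a per-block free map into (address, count) entries. The function name is made up, and the count here includes the first block of each run, which is a choice made for this sketch.

```python
def to_counted_runs(free_bits):
    """Convert a per-block free map (1 = free) into (first block, count) entries."""
    runs = []
    start = None
    for block, bit in enumerate(free_bits):
        if bit and start is None:
            start = block                        # a new run of free blocks begins
        elif not bit and start is not None:
            runs.append((start, block - start))  # the run ended just before this block
            start = None
    if start is not None:                        # run extends to the end of the map
        runs.append((start, len(free_bits) - start))
    return runs
```

When free space comes in long contiguous runs, the list of entries is much shorter than one entry per free block.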
Efficiency and Performance
• Efficiency dependent on:
  – Disk allocation and directory algorithms
  – Types of data kept in a file’s directory entry
  – Pre-allocation or as-needed allocation of metadata structures
  – Fixed-size or varying-size data structures

Efficiency and Performance (Cont.)
• Performance impacted by
  – Keeping data and metadata close together on disk
  – Buffer cache (disk cache): separate section of main memory for frequently used blocks
  – Synchronous writes, sometimes requested by apps or needed by the OS
    • No buffering / caching: writes must hit disk before acknowledgement
    • Asynchronous writes are more common, buffer-able, faster
  – Free-behind and read-ahead: techniques to optimize sequential access
    • Read-ahead: read blocks into memory in anticipation; free-behind: free blocks after access

Memory mapped files
• I/O using open(), read(), write()
  – Requires system calls and disk access
• I/O using memory mapping
  – Maps a disk block to a page (or pages) in memory
  – Manipulate files through memory
  – Writes to files in memory are not necessarily immediate

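A small sketch of manipulating a file through memory, using Python's standard `mmap` module. The helper names and the temporary file are illustrative; the explicit `flush()` reflects the point that mapped writes are not necessarily written back immediately.

```python
import mmap
import os
import tempfile

def uppercase_first_bytes(path, count):
    """Modify the first `count` bytes of a file in place through a memory mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapped:       # length 0 maps the whole file
            mapped[:count] = mapped[:count].upper()    # plain slice assignment, no write() call
            mapped.flush()                             # request write-back of the change now

def demo():
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"hello mapped file")
        uppercase_first_bytes(path, 5)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

After `uppercase_first_bytes` runs, reading the file back through ordinary `read()` shows the modified bytes.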
Page Cache vs “buffer cache”
• A page cache caches pages rather than disk blocks, using virtual memory techniques and addresses
• Memory-mapped I/O uses a page cache
• Routine I/O through the file system uses the buffer (disk) cache
• This leads to double caching (following figure)

Unified Buffer Cache
• A unified buffer cache uses the same page cache to cache both memory-mapped pages and ordinary file system I/O, to avoid double caching
• But which caches get priority, and what replacement algorithms to use?

Recovery
• Consistency checking: compares data in the directory structure with data blocks on disk, and tries to fix inconsistencies
  – Can be slow and sometimes fails
• Sometimes metadata is duplicated
• Use system programs to back up data from disk to another storage device (magnetic tape, other magnetic disk, optical)
• Recover a lost file or disk by restoring data from backup

Log Structured File Systems
• Log structured (or journaling) file systems record each metadata update to the file system as a transaction
• All transactions are written to a log
  – A transaction is considered committed once it is written to the log (sequentially)
  – Sometimes to a separate device or section of disk
  – However, the file system may not yet be updated
• The transactions in the log are asynchronously written to the file system structures
  – When the file system structures are modified, the transaction is removed from the log
• If the file system crashes, all remaining transactions in the log must still be performed
• Faster recovery from crash, removes chance of inconsistency of metadata

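The commit / apply / replay cycle can be sketched with two in-memory structures. This is a toy: the class `JournaledMetadata` and its dict of "metadata" are assumptions for illustration, and a real journal must also survive the crash itself by living on stable storage.

```python
class JournaledMetadata:
    """Toy journal: updates are committed to a log first and applied later."""

    def __init__(self):
        self.log = []        # committed transactions not yet applied
        self.metadata = {}   # stands in for the on-disk file-system structures

    def commit(self, key, value):
        self.log.append((key, value))    # committed as soon as it is in the log

    def apply_one(self):
        key, value = self.log[0]
        self.metadata[key] = value       # update the real structures first...
        self.log.pop(0)                  # ...then remove the transaction from the log

    def recover(self):
        # After a crash, perform every transaction still in the log.
        while self.log:
            self.apply_one()
```

If a crash happens after two commits but only one `apply_one()`, calling `recover()` replays the remaining entry, so the metadata ends up consistent with everything that was committed.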
UNIX directory structure
• Contains only file names and the corresponding inode numbers
• Use ls -i to retrieve inode numbers of the files in the directory
• Looking up path names in UNIX
  – Example: /usr/tom/mbox
  – Look up the inode for /, then for usr, then for tom, then for mbox

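The component-by-component lookup can be sketched with directories modelled as name-to-inode tables. The inode numbers and the `DIRECTORIES` table below are invented for illustration.

```python
ROOT_INODE = 2   # illustrative value; all numbers below are made up

# inode number of a directory -> its contents (name -> inode number)
DIRECTORIES = {
    2: {"usr": 11},
    11: {"tom": 27},
    27: {"mbox": 60},
}

def lookup(path):
    """Resolve an absolute path to an inode number, one component at a time."""
    inode = ROOT_INODE
    for name in path.strip("/").split("/"):
        if name == "":
            continue                        # the path was just "/"
        inode = DIRECTORIES[inode][name]    # read this directory, find the next inode
    return inode
```

Resolving /usr/tom/mbox visits the root directory, then usr, then tom, and returns the inode number stored under mbox.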
Advantages of directory entries that have name and inode information
• Changing a filename only requires changing the directory entry
• Only 1 physical copy of the file needs to be on disk
  – File may have several names (or the same name) in different directories
• Directory entries are small
  – Most file info is kept in the inode

Two hard links to the same file
• Directory entry in /dirA: ..[12345 filename1]..
• Directory entry in /dirB: ..[12345 filename2]..
• Both refer to the same inode
• To create a hard link:
  ln /dirA/filename1 /dirB/filename2
• To create a symbolic link:
  ln -s /dirA/filename1 /dirB/filename3
  – filename3 just contains a pointer (the path of the target)

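The same experiment can be run from Python with the standard `os` module. This sketch assumes a POSIX-style file system where hard and symbolic links are permitted; the file names mirror the slide, and the temporary directory is only there to keep the demo self-contained.

```python
import os
import tempfile

def link_demo():
    """Create a file, a hard link, and a symbolic link; report what each one is."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "filename1")
        hard = os.path.join(d, "filename2")
        soft = os.path.join(d, "filename3")
        with open(original, "w") as f:
            f.write("data")
        os.link(original, hard)       # like: ln filename1 filename2
        os.symlink(original, soft)    # like: ln -s filename1 filename3
        return {
            "same_inode": os.stat(original).st_ino == os.stat(hard).st_ino,
            "link_count": os.stat(original).st_nlink,
            "symlink_has_own_inode": os.lstat(soft).st_ino != os.stat(original).st_ino,
            "symlink_target": os.readlink(soft),
        }
```

The hard link shares the original's inode and raises its link count, while the symbolic link is a separate small file whose content is the target path.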
File system based on inodes
• Limitations
  – File must fit in a single disk partition
  – Partition size and number of files are fixed when the system is set up
• inode preallocation and distribution
  – inodes are preallocated on a volume
    • Even on empty disks, a percentage of space is lost to inodes
  – Preallocating inodes and spreading them improves performance
  – Keeping a file’s data blocks close to its inode reduces seek times

Checking up on the inodes
• Command: df -i
  – Gives inode statistics for a given system: total, free and used inodes
• Command: ls -i
• Command: stat *.*

Output of df -i:
Filesystem     512-blocks  Used       Available  Capacity  iused     ifree     %iused  Mounted on
/dev/disk0s2   488555536   126143120  361900416  26%       15831888  45237552  26%     /
devfs          361         361        0          100%      626       0         100%    /dev
map -hosts     0           0          0          100%      0         0         100%    /net
map auto_home  0           0          0          100%      0         0         100%    /home

Output of ls -i:
211655579 exfile.txt
211655593 exfile2.txt

Output of stat *.*:
234881026 211655579 -rw-r--r-- 1 ymalaiya staff 0 25 "Dec 3 18:11:02 2015" "Dec 3 18:11:00 2015" "Dec 3 18:11:00 2015" "Dec 3 18:11:00 2015" 4096 8 0 exfile.txt
234881026 211655593 -rw-r--r-- 1 ymalaiya staff 0 0 "Dec 3 18:11:46 2015" "Dec 3 18:11:46 2015" "Dec 3 18:11:46 2015" "Dec 3 18:11:46 2015" 4096 0 0 exfile2.txt