Operating Systems: Filesystems
Operating Systems:
Filesystems
Shankar
April 25, 2015
Outline fs interface
Filesystem Interface
Persistent Storage Devices
Filesystem Implementation
FAT Filesystem
FFS: Unix Fast Filesystem
NTFS: Microsoft Filesystem
Copy-On-Write: Slides from OS:P&P text
Overview: Filesystem Interface fs interface
Persistent structure of directories and files
File: named sequence of bytes
Directory: zero or more pointers to directories/files
Persistent: non-volatile storage device: disk, SSD, ...
Structure organized as a tree or acyclic graph
nodes: directories and files
root directory: "/"
path to a node: "/a/b/..."
acyclic: more than one path to a file (but not directory)
problematic for disk utilities, backup
Node metadata: owner, creation time, allowed access, ...
Users can create/delete nodes, modify content/metadata
Examples: FAT, UFS, NTFS, ZFS, ..., PFAT, GOSFS, GSFS2
Node fs interface
Name(s)
a name is the last entry in a path to the node
eg, path "/a/b" vs name "b"
subject to size and char limits
Type: directory or file or device or ...
Size: subject to limit
Directory may have a separate limit on number of entries.
Time of creation, modification, last access, ...
Content type (if file): eg, text, pdf, jpeg, mpeg, executable, ...
Owner
Access for owner, other users, ...: eg, r, w, x, setuid, ...
Operations on Filesystems fs interface
Format(dev)
create an empty filesystem on device dev (eg, disk, flash, ...)
Mount(fstype, dev)
attach (to computer) filesystem of type fstype on device dev
returns a path to the filesystem (eg, volume, mount point, ...)
after this, processes can operate on the filesystem
Unmount(path)
detach (from computer) filesystem at path // finish all io
after this, the filesystem is inert in its device, inaccessible
Operations on Attached Filesystem – 1 fs interface
Create(path), CreateDir(path)
create a file/directory node at given path
Link(existingPath, newPath)
create a (hard) link to an existing file node
Delete(path) // aka Unlink(path)
delete the given path to the file node at path
delete the file node if no more links to it
DeleteDir(path)
delete the directory node at path // must be empty
Change attributes (name, metadata) of node at path
eg, stat, touch, chown/chgrp, chmod, rename/mv
Operations on Attached Filesystem – 2 fs interface
Open(path, access), OpenDir(path, access)
open the node at path with given access (r, w, ...)
returns a file descriptor // prefer: node descriptor
after this, node can be operated on
Close(nd), CloseDir(nd)
close the node associated with node descriptor nd
Read(nd, file range, buffer), ReadDir(nd, dir range, buffer)
read the given range from open node nd into given buffer
returns number of bytes/entries read
Write(nd, file range, buffer)
write buffer contents into the given range of open file node nd
returns number of bytes written
Operations on Attached Filesystem – 3 fs interface
Seek(nd, file location), SeekDir(nd, entry)
move "r/w head" to given location/entry
MemMap(nd, file range, mem range)
map the given range of file nd to given range of memory
MemUnmap(nd, file range, mem range)
unmap the given range of file nd from given range of memory
Sync(nd)
complete all pending io for open node nd
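These operations map closely onto POSIX-style system calls; a minimal sketch using Python's os module (the file name demo.txt is made up for the example):

```python
import os

# Create and open a file node, getting back a descriptor (nd).
fd = os.open("demo.txt", os.O_CREAT | os.O_RDWR, 0o644)

os.write(fd, b"hello filesystem")   # Write(nd, range, buffer) -> bytes written
os.lseek(fd, 6, os.SEEK_SET)        # Seek(nd, file location): move the r/w offset
data = os.read(fd, 10)              # Read(nd, range, buffer) -> bytes read
os.fsync(fd)                        # Sync(nd): complete all pending io
os.close(fd)                        # Close(nd)

print(data)                         # b'filesystem'
```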
Consistency of Shared Files fs interface
Shared file
one that is opened by several processes concurrently
Consistency of the shared file:
when does a write by one process become visible
in the reads by the other processes
Various types of consistency (from strong to weak)
visible when write returns
visible when sync returns
visible when close returns
Single-processor system
all types of consistency easily achieved
Multi-processor system
strong notions are expensive/slow to achieve
Reliability fs interface
Disks, though non-volatile, are susceptible to failures
magnetic / mechanical / electronic parts wear out
power loss in the middle of a filesystem operation
Filesystem should work in spite of failures, to some extent
Small disk-area failures
no loss/degradation
Large disk-area failures
filesystem remains a consistent subtree
eg, files may be lost, but no undetected corrupted files/links
Power outage
each ongoing filesystem operation remains atomic
ie, either completes or has no effect on filesystem
Outline storage devices
Filesystem Interface
Persistent Storage Devices
Filesystem Implementation
FAT Filesystem
FFS: Unix Fast Filesystem
NTFS: Microsoft Filesystem
Copy-On-Write: Slides from OS:P&P text
Disk Drive: Geometry storage devices
Platter(s) fixed to rotating spindle
Spindle speed
Platter has two surfaces
Surface has concentric tracks
Track has sectors
more in outer than inner
Sector: fixed capacity
Movable arm with fixed rw heads, one per surface
Buffer memory
Eg, laptop disk (2011)
1 or 2 platters
4200–15000 rpm, 15–4 ms/rotation
diameter: 2.5 in
track width: < 1 micron
sector: 512 bytes
buffer: 16 MB
Disk Access Time storage devices
Disk access time = seek + rotation + transfer
Seek time: (moving rw head) + (electronic settling)
minimum: target is next track; settling only
maximum: target is at other end
rotation delay: half-rotational delay on avg
transfer time: platter ↔ buffer
transfer time: buffer ↔ host memory
Eg, laptop disk (2011)
min seek: 0.3–1.5 ms
max seek: 10–20 ms
rotation delay: 7.5–2 ms
platter ↔ buffer: 50 (inner) – 100 (outer) MB/s
buffer ↔ host memory: 100–300 MB/s
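Plugging numbers like the ones above into the access-time formula gives a feel for the cost of one random sector read; the specific values chosen here are illustrative midpoints, not measurements:

```python
# Disk access time = seek + rotation + transfer (platter -> buffer).
seek_ms     = 10.0                 # a mid-range seek, between min and max above
rotation_ms = 7.5                  # avg rotational delay at 4200 rpm (half of 15 ms)
sector_kb   = 0.5                  # one 512-byte sector
transfer_ms = sector_kb / (50 * 1000) * 1000   # platter->buffer at 50 MB/s

access_ms = seek_ms + rotation_ms + transfer_ms
print(round(access_ms, 3))         # 17.51 -- dominated by seek + rotation
```

The transfer term is only ~0.01 ms, which is why mechanical delays dominate random access.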
Disk IO storage devices
Disk IO is in blocks (sectors): slower, more bursty (wrt memory)
Causes the filesystem to be implemented in blocks also
Disk geometry strongly affects filesystem layout
Disk blocks usually become corrupted/unusable over time
Filesystem layout must compensate for this
redundancy (duplication)
error-correcting codes
migration without stopping
Disk Scheduling storage devices
FIFO: terrible: lots of head movement
SSTF (shortest seek time first)
favors "middle" requests; can starve "edge" requests
SCAN (elevator)
sweep from inner to outer, until no requests in this direction
sweep from outer to inner, until no requests in this direction
CSCAN: like SCAN but in only one direction
fairer, less chance of sparsely-requested track
R-SCAN / R-CSCAN
allow minor deviations in direction to exploit rotational delays
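The elevator sweep can be sketched as follows (the track numbers and request queue are invented for the example):

```python
def scan(requests, head, ascending=True):
    """SCAN (elevator): service all requests in the current direction
    until none remain there, then reverse and sweep back."""
    up   = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    return up + down if ascending else down + up

print(scan([98, 183, 37, 122, 14, 124, 65, 67], head=53))
# [65, 67, 98, 122, 124, 183, 37, 14]
```

CSCAN would instead jump back to the start after the first sweep and service the remaining requests in the same ascending order.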
Flash Devices storage devices
RAM with a floating insulated gate
No moving parts
more robust than disk to impacts
consumes less energy
uniform access time // good for random access
Wears out with repeated usage
Unused device can lose charge over time (years?)
Read/write operations
done in blocks (12KB to 128KB)
tens of microseconds // if erased area available for writes
Write requires erased area
Erase operation
erasure block (few MB)
tens of milliseconds
so maintain a large erased area
Outline fs implementation
Filesystem Interface
Persistent Storage Devices
Filesystem Implementation
FAT Filesystem
FFS: Unix Fast Filesystem
NTFS: Microsoft Filesystem
Copy-On-Write: Slides from OS:P&P text
Overview of Fs Implementation – 1 fs implementation
Define a mapping of filesystem to device blocks
and implement filesystem operations
performance: random and sequential data access
reliability: in spite of device failures
accommodate different devices (blocks, size, speed, geometry, ...)
The term "filesystem" refers to two distinct entities
1. filesystem interface, ie, filesystem seen by users
2. filesystem implementation
Here are some abbreviations (used as qualifiers)
fs: filesystem
fs-int: filesystem interface // eg, fs-int node is directory or file
fs-imp: filesystem implementation
Overview of Fs Implementation – 2 fs implementation
Fs implementation usually starts at the second block in the device
[The first device block is usually reserved for the boot sector]
Fs implementation organizes the device in fs blocks 0, 1, · · ·
Fs block = one or more device blocks, or vice versa
henceforth, "block" without qualifier means "fs block"
Fs implementation is a graph over blocks, rooted at a special block (the "superblock")
[In contrast, fs interface is a graph over files and directories]
Each file or directory x maps to a subgraph of the fs-imp graph,
attached to the fs-imp graph at a "root" block
subgraph's blocks hold
x's data, x's metadata, pointers to subgraph's blocks
pointers to the roots of other subgraphs if x is a directory
Overview of Fs Implementation – 3 fs implementation
[Figure: device blocks (boot sector, then fs blocks 0, 1, 2, ...) and the fs implementation as a graph over fs blocks, rooted at the superblock, with one subgraph per fs-interface node (eg, "/.subgraph", "a.subgraph"); each fs block contains metadata, data, pointers to data, and, for a dir-node subgraph, pointers to other subgraphs. Alongside, the fs interface is shown as a graph (tree or acyclic) over nodes /, a, b, c, d, e, f, g, h, whose operations (mount, unmount, create, link, unlink, delete, open, read, write, sync, close) map to their implementations.]
Superblock fs implementation
First fs block read by OS when mounting a filesystem
Usually in the first few device blocks after the boot sector
Identifies the filesystem type and the info to mount it
magic number, indicating filesystem type
fs blocksize (vs device blocksize)
size of disk (in fs blocks)
pointer (block #) to the root of "/" directory's subgraph
pointer to list of free blocks
(perhaps) pointer to array of roots of subgraphs
...
Subgraph for fs-int node (file or directory) fs implementation
Node name(s) // multiple names/paths due to links
Node metadata
size
owner, access, ...
creation time, ...
...
Pointers to fs blocks containing node's data
If node is a file, the data is the file's data
If node is a directory, the data is a table of directory entries
table may be unordered, sorted, hashed, ...
depends on number and size of entries, desired performance, ...
A directory entry points to the entry's subgraph
Data Pointers Organization – 1 fs implementation
Organization of the pointers from subgraph root to data blocks
Contiguous
data is placed in consecutive fs blocks
subgraph root has starting fs block and data size
pros: random access
cons: external fragmentation; poor tolerance to bad blocks
Linked-list
each fs block of data has a pointer to the next fs block of data
(or nil, if no more data)
subgraph root has starting fs block and data size
pros: no fragmentation; good tolerance to bad blocks
cons: poor random access
Data Pointers Organization – 2 fs implementation
Separate linked-list
keep all the pointers (w/o data) in one or more fs blocks
first load those fs blocks into memory
then the fs block for any data chunk is located
by traversing a linked-list in memory (rather than disk)
pros: no fragmentation; good tolerance to bad blocks; good random access (as long as file size is not too big)
Indexed
have an array of pointers
have multiple levels of pointers for accommodating large files
ie, a pointer points to a fs block of pointers
pros: no fragmentation; good tolerance to bad blocks; good random access (logarithmic cost in data size)
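To see the indexed scheme concretely, here is a sketch of deciding how many levels of indirection reach a given file block, assuming 12 direct pointers and 1024 pointers per indirect block (illustrative numbers, echoing the FFS layout later in these slides):

```python
def pointer_level(block_no, direct=12, per_block=1024):
    """Return the levels of indirection needed to reach file block block_no:
    0 = direct pointer, 1 = single-indirect, 2 = double, 3 = triple."""
    if block_no < direct:
        return 0
    block_no -= direct
    for level, span in ((1, per_block), (2, per_block**2), (3, per_block**3)):
        if block_no < span:
            return level
        block_no -= span
    raise ValueError("file block beyond maximum file size")

print(pointer_level(5))          # 0  (reached by a direct pointer)
print(pointer_level(12))         # 1  (first single-indirect block)
print(pointer_level(12 + 1024))  # 2  (first double-indirect block)
```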
For Good Performance fs implementation
Want large numbers of consecutive reads or writes
So put related info in nearby blocks
Large buffers (to minimize read misses)
Batched writes: need a large queue of writes
Reliability: overcoming block failures fs implementation
Redundancy within a disk
error-detection/correction code (EDC/ECC) in each disk block
map each fs block to multiple disk blocks
dynamically remap within a disk to bypass failing areas
Redundancy across disks
map each fs block to blocks in different disks
EDC/ECC in each disk block
Or EDC/ECC for a fs block across disks // eg, RAID
· · ·
Reliability: Atomicity via Copy-On-Write fs implementation
When a user modifies a file or directory f
identify the part, say X, of the fs-imp graph to be modified
write the new value of X in fresh blocks, say Y
attach Y to the fs-imp graph // step with low prob of failure
detach X and garbage collect its blocks
note that X typically includes part of the path(s) from the
fs-imp root to f's subgraph
Consider a path a0, · · · , aN, where a0 is the fs-imp root
and aN holds data modified by user
Then for some point aJ in the path
the new values of the suffix aJ, · · · , aN are in new blocks
the new values of the prefix a0, · · · , aJ−1 are in the old blocks
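A user-level analog of this write-new-then-swing-the-pointer idea is writing a whole new copy and atomically renaming it over the old one (os.replace is atomic on POSIX). This only illustrates the COW principle, not the in-kernel mechanism; the file name is made up:

```python
import os

def cow_update(path, new_bytes):
    """Write the new value into fresh blocks (a temp file), then attach it
    with one low-probability-of-failure step (an atomic rename)."""
    tmp = path + ".new"                 # the fresh blocks Y
    fd = os.open(tmp, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    os.write(fd, new_bytes)
    os.fsync(fd)                        # make Y durable before attaching it
    os.close(fd)
    os.replace(tmp, path)               # attach Y / detach X in one step

cow_update("counter.txt", b"42")
print(open("counter.txt", "rb").read())   # b'42'
```

A crash before the rename leaves the old value intact; a crash after leaves the new value: either way the update is atomic.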
Reliability: Atomicity via Logging fs implementation
Maintain log of requested operations
When user issues a sequence of disk operations
add records (one for each operation) to log
add "commit" record after last operation
asynchronously write them to disk
erase from log after completion
Upon recovery from crash, (re)do all operations in log
operations are idempotent (repeating is ok)
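A toy redo log shows why idempotent operations make recovery simple: replaying committed records is harmless even if some already reached the store. The record format and the commit-boundary handling here are invented for the illustration:

```python
def apply_log(store, log):
    """Redo every committed operation; 'set' is idempotent, so replaying
    records that already took effect leaves the store unchanged."""
    pending = []
    for rec in log:
        if rec == ("commit",):
            for key, val in pending:
                store[key] = val
            pending.clear()
        else:
            _, key, val = rec            # a ("set", key, val) record
            pending.append((key, val))
    return store

log = [("set", "a", 1), ("set", "b", 2), ("commit",), ("set", "a", 9)]
print(apply_log({}, log))   # {'a': 1, 'b': 2} -- the uncommitted 'set' is ignored
```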
Hierarchy in Filesystem fs implementation
Virtual filesystem: optional
memory-only framework on which to mount real filesystems
Mounted Filesystem(s)
real filesystems, perhaps of different types
Block cache
cache filesystem blocks: performance, sharing, ...
Block device
wrapper for the various block devices with filesystems
Device drivers for the various block devices
GeekOS: Hierarchy in Filesystem fs implementation
Outline FAT
Filesystem Interface
Persistent Storage Devices
Filesystem Implementation
FAT Filesystem
FFS: Unix Fast Filesystem
NTFS: Microsoft Filesystem
Copy-On-Write: Slides from OS:P&P text
FAT32 Overview – 1 FAT
FAT: MS-DOS filesystem
simple, but not good for scalability, hard links, reliability, ...
currently used only on simple storage devices: floppy, flash, ...
Disk divided into the following regions
Boot sector: device block 0
BIOS parameter block
OS boot loader code
Filesystem info sector: device block 1
signatures, fs type, pointers to other sections
fs blocksize, # free fs blocks, # last allocated fs block
...
FAT: fs blocks 0 and 1; corresponds to the superblock
Data region: rest of the disk, organized as an array of fs blocks
holds the data of the fs-int nodes (ie, files and directories)
FAT32 Overview – 2 FAT
Each block in the data region is either
free or bad or holds data (of a file or directory)
FAT: array with an entry for each block in data region
entries j0, j1, · · · form a chain iff
blocks j0, j1, · · · hold successive data (of a fs-int node)
Entry n contains
constant, say FREE, if block n is free
constant, say BAD, if block n is bad (ie, unusable)
32-bit number, say x, if block n holds data (of a fs-int node)
and block x holds the succeeding data (of the fs-int node)
constant, say END, if block n holds the last data chunk
Root directory table: typically at start of data region (block 2)
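The chain structure can be sketched with an in-memory FAT; the entry values are invented, with FREE/END/BAD standing in for the reserved constants:

```python
FREE, END, BAD = "FREE", "END", "BAD"

def chain(fat, start):
    """Follow a FAT chain from its starting block to END, returning the
    block numbers that hold the node's successive data chunks."""
    blocks, n = [], start
    while n != END:
        blocks.append(n)
        n = fat[n]
    return blocks

# Block 2 -> 5 -> 9 -> END: a three-block file; block 3 is free, 4 is bad.
fat = {2: 5, 3: FREE, 4: BAD, 5: 9, 9: END}
print(chain(fat, 2))     # [2, 5, 9]
```

Note that random access to the k-th block requires walking k entries of the chain, which is why FAT keeps the whole table cached in memory.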
FAT32 Overview – 3 FAT
Directory entry: 32 bytes
name (8)
extension (3)
attributes (1)
read-only, hidden, system, volume label,
subdirectory, archive, device
reserved (10)
last modification time (2)
last modification date (2)
fs block # of starting fs block of the entry's data
size of entry's data (4)
Hard links??
Outline FFS
Filesystem Interface
Persistent Storage Devices
Filesystem Implementation
FAT Filesystem
FFS: Unix Fast Filesystem
NTFS: Microsoft Filesystem
Copy-On-Write: Slides from OS:P&P text
FFS Layout FFS
Boot blocks // few blocks at start
Superblock // after boot blocks
magic number
filesystem geometry (eg, locations of groups)
filesystem statistics/tuning params
Groups, each consisting of // cylinder groups
backup copy of superblock
header with statistics
free space bit map
...
array of inodes
holds metadata and data pointers of fs-int nodes
# inodes fixed at format time
array of data blocks
FFS Inodes FFS
Inodes are numbered sequentially starting at 0
inodes 0 and 1 are reserved
inode 2 is the root directory's inode
An inode is either free or not
Non-free inode holds metadata and data pointers of a fs-int node
owner id
type (directory, file, device, ...)
access modes
reference count // # of hard links
size and # blocks of data
15 pointers to data blocks
12 direct
1 single-indirect, 1 double-indirect, 1 triple-indirect
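With these 15 pointers, the maximum file size follows directly; here assuming 4 KB fs blocks and 4-byte block pointers (typical illustrative values, not fixed by the slide):

```python
block = 4096                 # fs block size in bytes (assumed)
ptrs  = block // 4           # pointers per indirect block: 4096/4 = 1024

direct = 12 * block          # 12 direct pointers
single = ptrs * block        # 1 single-indirect block of pointers
double = ptrs**2 * block     # 1 double-indirect tree
triple = ptrs**3 * block     # 1 triple-indirect tree

max_bytes = direct + single + double + triple
print(f"max file size = {max_bytes / 2**40:.2f} TiB")   # 4.00 TiB
```

The triple-indirect tree dominates; the direct pointers exist so that small files (the common case) need no indirect blocks at all.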
FFS: fs-int node → asymmetric tree (max depth 4) FFS
[Figure: an inode (metadata plus 15 pointers) fanning out through indirect blocks to data blocks.]
FFS Directory entries FFS
The data blocks of a directory node hold directory entries
A directory entry is not split across data blocks
Directory entry −→ inode −→ data/pointer blocks
Directory entry for a fs-int node
# of the inode of the fs-int node // hard link
size of node data
length of node name (up to 255 bytes)
entry name
Multiple directory entries can point to the same inode
User Ids FFS
Every user account has a user id (uid)
Root user (aka superuser, admin) has uid of 0
Processes and filesystem entries have associated uids
indicates owners
determines access processes have to filesystem entries
determines which processes can be signalled by a process
Process Ids FFS
Every process has two associated uids
effective user id (euid)
uid of user on whose behalf it is currently executing
determines its access to filesystem entries
real uid (ruid)
uid of the process's owner
determines which processes it can signal:
x can signal y only if x is superuser or x.ruid = y.ruid
Process Ids FFS
Process is created: ruid/euid ← creating process's euid
Process with euid 0 executes SetUid(z): ruid/euid ← z
no effect if process has non-zero euid
Example SetUid usage
login process has euid 0 (to access auth info files)
upon successful login, it starts a shell process (with euid 0)
shell executes SetUid(authenticated user's uid)
When a process executes a file f with "setuid bit" set:
its euid is set to f's owner's uid while it is executing f.
Upon bootup, the first process ("init") runs with uid of 0
it spawns all other processes directly or indirectly
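The signalling rule above can be stated as a small predicate (the uids used in the calls are invented):

```python
ROOT = 0   # the superuser's uid

def can_signal(sender_ruid, target_ruid):
    """x can signal y only if x is superuser or x.ruid = y.ruid."""
    return sender_ruid == ROOT or sender_ruid == target_ruid

print(can_signal(1000, 1000))   # True  -- same owner
print(can_signal(1000, 1001))   # False -- different non-root users
print(can_signal(0, 1001))      # True  -- superuser may signal anyone
```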
Directory entry's uids and permissions FFS
Every directory entry has three classes of users:
owner (aka "user")
group (owner need not be in this group)
others (users other than owner or group)
Each class's access is defined by three bits: r, w, x
For a file:
r: read the file
w: modify the file
x: execute the file
For a directory:
r: read the names (but not attributes) of entries in the directory
w: modify entries in the directory (create, delete, rename)
x: access an entry's contents and metainfo
When a directory entry is created: attributes are set according to
the creating process's attributes (euid, umask, etc)
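The class-then-bits check can be sketched as follows; exactly one class applies, chosen before the bits are tested. The uids, group membership, and mode in the example calls are invented:

```python
R, W, X = 4, 2, 1    # the r, w, x bits of one class

def allowed(mode, owner_uid, group_gid, euid, gids, want):
    """Pick exactly one class -- owner, group, or others -- then test its bits."""
    if euid == owner_uid:
        bits = (mode >> 6) & 7           # owner class
    elif group_gid in gids:
        bits = (mode >> 3) & 7           # group class
    else:
        bits = mode & 7                  # others
    return bits & want == want

# A 0o640 (rw- r-- ---) file owned by uid 1000, group 50:
print(allowed(0o640, 1000, 50, 1000, {50}, R | W))  # True  -- owner has rw
print(allowed(0o640, 1000, 50, 2000, {50}, R))      # True  -- group has r
print(allowed(0o640, 1000, 50, 2000, {60}, R))      # False -- others have nothing
```

Note the asymmetry this creates: an owner who is denied a bit is denied even if the group class would grant it, because only one class is ever consulted.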
Directory entry's setuid bit FFS
Each directory entry also has a "setuid" bit.
If an executable file has setuid set and a process (with execute
access) executes it, the process's euid changes to the file's
owner's uid while executing the file.
Typically, the executable file's owner is root, allowing a normal
user to get root privileges while executing the file
This is a high-level analog of system calls
Directory entry's sticky bit FFS
Each directory entry also has a sticky bit.
Executable file with sticky bit set: hint to the OS to retain the
text segment in swap space after the process executes
An entry x in a directory with sticky bit set:
a user with wx access to the directory can rename/delete an
entry x in the directory only if it is x's owner (or superuser)
Usually set on /tmp directory.
Directory entry's setgid bit FFS
Unix has the notion of groups of users
A group is identified by a group id, abbreviated gid
A gid defines a set of uids
A user account can be in different groups, i.e., have multiple gids
Process has effective gid (egid) and real gid (rgid)
play a similar role as euid and ruid
A directory entry has a setgid bit
plays a similar role to setuid for executables
plays an entirely different role for directories
Outline NTFS
Filesystem Interface
Persistent Storage Devices
Filesystem Implementation
FAT Filesystem
FFS: Unix Fast Filesystem
NTFS: Microsoft Filesystem
Copy-On-Write: Slides from OS:P&P text
NTFS Index Structure NTFS
Master File Table (MFT)
corresponds to FFS inode array
holds an array of 1KB MFT records
MFT Record: sequence of variable-size attribute records
Std info attribute: owner id, creation/mod/... times, security, ...
File name attribute: file name and number
Data attribute record
data itself (if small enough), or // resident
list of data "extents" (if not small enough) // non-resident
Attribute list
pointers to attributes in this or other MFT records
pointers to attribute extents
needed if attributes do not fit in one MFT record
eg, highly-fragmented and/or highly-linked
Example: Files with single MFT record NTFS
[Figure: the Master File Table (MFT). File A's single MFT record holds std info, a file name, and resident data; File B's record holds std info, a file name, and non-resident data located by two extents, each given as start/length. Unused records are marked free.]
Example: File with three MFT records NTFS
[Figure: File C spans three MFT records. The first holds std info, an attribute list, and file names; the attribute list points into the other records (and into an attr-list extent), whose data attributes list the data extents.]
Outline COW
Filesystem Interface
Persistent Storage Devices
Filesystem Implementation
FAT Filesystem
FFS: Unix Fast Filesystem
NTFS: Microsoft Filesystem
Copy-On-Write: Slides from OS:P&P text