CS519: Lecture 4

I/O and File Management

CS 519: Operating System Theory

I/O Devices

So far we have talked about how to abstract and manage the CPU and memory (processes, VM, etc.)
Now: I/O and file management
I/O devices are the computer’s interface to the outside world (I/O = Input/Output)
Example devices: display, keyboard, mouse, speakers, network interface, and disk


Basic Computer Structure

[Figure: block diagram with the CPU, memory, a bridge, a disk, and a NIC, connected by the memory bus (system bus) and the I/O bus]


Intel SR440BX Motherboard

[Figure: motherboard with labeled components: CPU; system bus and MMU/AGP/PCI controller; I/O bus; IDE disk controller; USB controller; another I/O bus; serial and parallel ports; keyboard and mouse]


Communication Between CPU and I/O Devices

How does the CPU communicate with I/O devices?
Memory-mapped communication
  Each I/O device is assigned a portion of the physical address space
  CPU → I/O device
    • The CPU writes to locations in this area to "talk" to the I/O device
  I/O device → CPU
    • Polling: the CPU repeatedly checks location(s) in the portion of the address space assigned to the device
    • Interrupt: the device sends an interrupt (on an interrupt line) to get the attention of the CPU
Programmed I/O (PIO), Interrupt-Driven (ID), Direct Memory Access (DMA)
  PIO and ID = a word at a time
  DMA = a block at a time
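A minimal sketch of memory-mapped polling as described above. The device, its three-register layout, and the "doubling" operation are all made up for illustration; real hardware defines its own registers and timing.

```python
class FakeDevice:
    """Toy device exposing status, command, and data registers (hypothetical layout)."""
    STATUS, COMMAND, DATA = 0, 1, 2   # offsets within the device's address range
    BUSY, READY = 0, 1

    def __init__(self, latency_polls=3):
        self.regs = [self.READY, 0, 0]
        self._latency = latency_polls
        self._remaining = 0

    def write(self, offset, value):
        # CPU -> device: a store to the command register starts an operation.
        self.regs[offset] = value
        if offset == self.COMMAND:
            self.regs[self.STATUS] = self.BUSY
            self._remaining = self._latency

    def read(self, offset):
        # Device -> CPU: each status read stands in for time passing on the device.
        if offset == self.STATUS and self.regs[self.STATUS] == self.BUSY:
            self._remaining -= 1
            if self._remaining <= 0:
                self.regs[self.DATA] = self.regs[self.COMMAND] * 2  # pretend result
                self.regs[self.STATUS] = self.READY
        return self.regs[offset]


def programmed_io(device, command):
    """Issue a command, then poll the status register until the device is ready."""
    device.write(FakeDevice.COMMAND, command)
    polls = 0
    while True:
        polls += 1
        if device.read(FakeDevice.STATUS) == FakeDevice.READY:
            break
    return device.read(FakeDevice.DATA), polls
```

The CPU does nothing useful while it spins in the polling loop, which is the cost that interrupts and DMA are meant to avoid.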


Programmed I/O vs. DMA

Programmed I/O is OK for sending commands, receiving status, and communicating a small amount of data
Inefficient for a large amount of data
  Keeps the CPU busy during the transfer
  Programmed I/O memory operations are slow
Direct Memory Access
  The device reads/writes directly from/to memory
  Transfer from memory to device is typically initiated by the CPU
  Transfer from device to memory can be initiated by the device or the CPU


Programmed I/O vs. DMA

[Figure: three diagrams of a CPU, memory, and disk joined by an interconnect, labeled Programmed I/O, DMA, and DMA]


Device Driver

OS module controlling an I/O device
Hides the device specifics from the layers above it in the kernel
Supports a common API
UNIX: block or character device
  Block: device communicates with the CPU/memory in fixed-size blocks
  Character/Stream: stream of bytes
Translates logical I/O into device I/O
  E.g., logical disk blocks into {head, track, sector}
Performs data buffering and scheduling of I/O operations
Structure
  Several synchronous entry points: device initialization, queue I/O requests, state control, read/write
  An asynchronous entry point to handle interrupts
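The translation of a logical disk block into {head, track, sector} mentioned above can be sketched with the conventional cylinder/head/sector arithmetic. The function name and the geometry values are assumptions for illustration; a real driver takes the geometry from the device.

```python
def block_to_hts(block_no, heads, sectors_per_track):
    """Map a logical block number to (head, track, sector).

    Assumes one block per sector, blocks numbered from 0, and sectors
    numbered from 1 within a track (the usual CHS convention).
    """
    track, rest = divmod(block_no, heads * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return head, track, sector_index + 1
```

For example, with 16 heads and 63 sectors per track, block 1071 maps to head 1, track 1, sector 1.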


Some Common Entry Points for UNIX Device Drivers

Attach: attach a new device to the system.
Close: note the device is not in use.
Halt: prepare for system shutdown.
Init: initialize driver globals at load or boot time.
Intr: handle device interrupt (not used).
Ioctl: implement control operations.
Mmap: implement memory-mapping (SVR4).
Open: connect a process to a device.
Read: character-mode input.
Size: return logical size of block device.
Start: initialize driver at load or boot time.
Write: character-mode output.


I/O Buffering

I/O transfer (DMA)
  After an I/O request is placed, the source/destination of the I/O transfer must be locked in memory
  To allow the user process to continue (when possible), data is often copied from user address space to kernel buffers (or vice versa), which are pinned in memory
  Copying is expensive → asynchronous I/O
Devices are typically slow compared to the CPU
  How do we speed up accesses? Caching, of course …
I/O buffering
  Buffer cache: a buffer in main memory for block devices
  Character queue: follows the producer/consumer model (characters in the queue are read once)


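User to Driver Control Flow

[Figure: control flow from user to kernel. User-level read, write, and ioctl calls reach either a special file or an ordinary file. Ordinary files go through the file system and the buffer cache to a block device (driver-strategy); special files reach a block device or a character device, the latter through a character queue (driver_read/write).]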


Buffer Cache

When an I/O request is made for a block, the buffer cache is checked first
If the block is missing from the cache, it is read into the buffer cache from the device
Exploits locality of reference like any other cache
Replacement policies similar to those for VM
UNIX
  Historically, UNIX has a buffer cache for the disk which does not share buffers with character/stream devices
  Adds overhead in a path that has become increasingly common: disk → NIC
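The check-then-fetch behaviour of the buffer cache can be sketched as follows. LRU is chosen here only as one familiar replacement policy; the class name, the `read_block` callback (standing in for the device driver), and the capacity are illustrative assumptions.

```python
from collections import OrderedDict


class BufferCache:
    """Tiny block cache with LRU replacement (a sketch, not a kernel design)."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block      # callable: block number -> data
        self.blocks = OrderedDict()       # least recently used entry first
        self.hits = 0
        self.misses = 0

    def get(self, block_no):
        if block_no in self.blocks:       # the cache is checked first
            self.hits += 1
            self.blocks.move_to_end(block_no)
            return self.blocks[block_no]
        self.misses += 1                  # on a miss, read from the device
        data = self.read_block(block_no)
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block
        self.blocks[block_no] = data
        return data
```

With a capacity of two blocks, the access sequence 1, 2, 1, 3, 2 produces one hit (the second access to block 1) and four misses.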


Disks

Seek time: time to move the disk head to the desired track
Rotational delay: time to reach the desired sector once the head is over the desired track
Transfer rate: rate at which data is read from or written to the disk
Some typical parameters:
  Seek: ~10-15 ms
  Rotational delay: ~4.17 ms on average (half a revolution) for 7200 rpm
  Transfer rate: 30 MB/s

[Figure: disk platter showing sectors and tracks]
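A small worked calculation using the typical parameters listed above. It assumes the average rotational delay is half a revolution and that 1 MB = 10^6 bytes; the function names are illustrative.

```python
def avg_rotational_delay_ms(rpm):
    """Average rotational delay: half of one revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm


def access_time_ms(seek_ms, rpm, transfer_mb_per_s, nbytes):
    """Seek + average rotational delay + transfer time for nbytes."""
    transfer_ms = nbytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + avg_rotational_delay_ms(rpm) + transfer_ms
```

At 7200 rpm one revolution takes 60000 / 7200 ≈ 8.33 ms, so the average delay is about 4.17 ms. Reading a single 512-byte sector at 30 MB/s takes roughly 0.017 ms, so with a 10 ms seek the total is about 14.18 ms, almost all of it positioning time.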


Disk Scheduling

Disks are at least four orders of magnitude slower than main memory
The performance of disk I/O is vital for the performance of the computer system as a whole
Access time (seek time + rotational delay) >> transfer time for a sector
Therefore the order in which sectors are read matters a lot
Disk scheduling
  Usually based on the position of the requested sector rather than on process priority
  Possibly reorders the stream of read/write requests to improve performance


Disk Scheduling Policies

Shortest-service-time-first (SSTF): pick the request that requires the least movement of the head
SCAN (back and forth over disk): good service distribution
C-SCAN (one way with fast return): lower service variability
Problem with SSTF, SCAN, and C-SCAN: arm may not move for a long time (due to rapid-fire accesses to the same track)
N-step SCAN: scan of N records at a time by breaking the request queue into segments of size at most N and cycling through them
FSCAN: uses two sub-queues; during a scan one queue is consumed while the other one is produced
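The first three policies can be compared on a fixed batch of requests with a short simulation. This is a sketch under simplifying assumptions: all requests are known up front, the head initially moves toward higher track numbers, and SCAN and C-SCAN reverse or wrap at the last pending request rather than at the physical edge of the disk. The request list is an arbitrary example.

```python
def sstf(start, requests):
    """Repeatedly serve the pending request closest to the current head position."""
    pending, pos, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda track: abs(track - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order


def scan(start, requests):
    """Sweep upward from the start position, then reverse and sweep downward."""
    up = sorted(t for t in requests if t >= start)
    down = sorted((t for t in requests if t < start), reverse=True)
    return up + down


def c_scan(start, requests):
    """Sweep upward, jump back to the lowest pending track, sweep upward again."""
    up = sorted(t for t in requests if t >= start)
    wrapped = sorted(t for t in requests if t < start)
    return up + wrapped


def head_movement(start, order):
    """Total number of tracks the head crosses when serving requests in this order."""
    total, pos = 0, start
    for track in order:
        total += abs(track - pos)
        pos = track
    return total
```

With the head at track 100 and requests for tracks 55, 58, 39, 18, 90, 160, 150, 38, 184, the total head movement is 248 tracks for SSTF, 250 for SCAN, and 322 for C-SCAN (counting the return sweep).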


RAID

Redundant Array of Inexpensive Disks (RAID)
  A set of physical disk drives viewed by the OS as a single logical drive
  Replace large-capacity disks with multiple smaller-capacity drives to improve I/O performance (at a lower price)
  Data are distributed across physical drives in a way that enables simultaneous access to data from multiple drives
  Redundant disk capacity is used to compensate for the increase in the probability of failure due to multiple drives
  Improves availability because there is no single point of failure
Six levels of RAID representing different design alternatives


RAID Level 0

Does not include redundancy
Data is striped across the available disks
  Total storage space across all disks is divided into strips
  Strips are mapped round-robin to consecutive disks
  A set of consecutive strips that maps exactly one strip to each disk in the array is called a stripe
Can you see how this improves the disk I/O bandwidth?
What access pattern gives the best performance?

  disk 0    disk 1    disk 2    disk 3
  strip 0   strip 1   strip 2   strip 3   (stripe 0)
  strip 4   strip 5   strip 6   strip 7
  ...
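The round-robin mapping of strips to disks can be written down directly. A minimal sketch, assuming strips and disks are numbered from 0; the function names are illustrative.

```python
def locate_strip(strip_no, num_disks):
    """Return (disk, stripe) for a strip under round-robin striping."""
    stripe, disk = divmod(strip_no, num_disks)
    return disk, stripe


def disks_touched(first_strip, count, num_disks):
    """Set of disks involved in reading `count` consecutive strips."""
    return {locate_strip(s, num_disks)[0] for s in range(first_strip, first_strip + count)}
```

With four disks, strip 5 lands on disk 1 in stripe 1, and a request for four consecutive strips touches all four disks, so the transfers can proceed in parallel.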


RAID Level 1

Redundancy achieved by duplicating all the data
Every disk has a mirror disk that stores exactly the same data
  A read can be serviced by either of the two disks that contain the requested data (improved performance over RAID 0 if reads dominate)
  A write request must be done on both disks but can be done in parallel
Recovery is simple but cost is high

[Figure: strips 0-3 laid out across the data disks, with each strip duplicated on a mirror disk]


RAID Levels 2 and 3

Parallel access: all disks participate in every I/O request
Small strips, since the size of each read/write = # of disks * strip size
RAID 2: an error-correcting code is calculated across corresponding bits on each data disk and stored on log(# data disks) parity disks
  Hamming code: can correct single-bit errors and detect double-bit errors
  Less expensive than RAID 1 but still pretty high overhead; not really needed in most reasonable environments
RAID 3: a single redundant disk that keeps parity bits
  $P(i) = X_2(i) \oplus X_1(i) \oplus X_0(i)$
  In the event of a failure, data can be reconstructed, e.g. $X_2(i) = P(i) \oplus X_1(i) \oplus X_0(i)$
  Can only tolerate a single failure at a time

[Figure: bits b0, b1, b2 on the data disks and the parity bit P(b) on the parity disk]
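The parity and reconstruction relations above are plain XOR, which the following sketch applies byte by byte. Function names and the sample byte values are illustrative.

```python
from functools import reduce


def parity(strips):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def reconstruct(surviving_strips, parity_strip):
    """Recover the single missing strip from the survivors and the parity."""
    return parity(list(surviving_strips) + [parity_strip])
```

Because XOR is its own inverse, XOR-ing the parity with the surviving strips yields the lost strip. Two simultaneous failures leave two unknowns in one equation, which is why only a single failure can be tolerated.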


RAID Levels 4 and 5

RAID 4
  Large strips with a parity strip like RAID 3
  Independent access: each disk operates independently, so multiple I/O requests can be satisfied in parallel
  Independent access → small write = 2 reads + 2 writes
  Example: if a write is performed only on strip 0:
    $P'(i) = X_2(i) \oplus X_1(i) \oplus X_0'(i)$
    $\quad\;\; = X_2(i) \oplus X_1(i) \oplus X_0'(i) \oplus X_0(i) \oplus X_0(i)$
    $\quad\;\; = P(i) \oplus X_0'(i) \oplus X_0(i)$
  Parity disk can become a bottleneck
RAID 5
  Like RAID 4, but parity strips are distributed across all disks

[Figure: RAID 5 layout with strips 0-5 and the parity strips P(0-2) and P(3-5) placed on different disks]
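The RAID 4 small-write rule above (read the old strip and old parity, write the new strip and new parity) can be checked with a few lines. A sketch with illustrative names and byte values:

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_parity, old_strip, new_strip):
    """New parity after overwriting one strip: P' = P xor X_old xor X_new."""
    return xor_bytes(xor_bytes(old_parity, old_strip), new_strip)
```

The result equals the parity recomputed from all data strips, but only the modified strip and the parity strip are read and written; the other data disks are not touched.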


File System

File system is an abstraction of the disk
  File → track/sector
To a user process
  A file looks like a contiguous block of bytes (Unix)
  A file system provides a coherent view of a group of files
  A file system provides protection
API: create, open, delete, read, write files
Performance: throughput vs. response time
Reliability: minimize the potential for lost or destroyed data
  E.g., RAID could be implemented in the OS as part of the disk device driver


Unix File System

Ordinary files (uninterpreted)
Directories
  File of files
  Organized as a rooted tree
  Pathnames (relative and absolute)
  Contains links to parent, itself
  Multiple links to files can exist
    Link: hard or symbolic


Unix File Systems (Cont’d)

Tree-structured file hierarchies

Mounted on existing space by using mount

No links between different file systems


File Naming

Each file has a unique name
  User-visible (external) name must be symbolic
  In a hierarchical file system, unique external names are given as pathnames (path from the root to the file)
  Internal names: i-node in UNIX, an index into an array of file descriptors/headers for a volume
Directory: translation from external to internal name
  May have more than one external name for a single internal name
Information about a file is split between the directory and the file descriptor: name, type, size, location on disk, owner, permissions, date created, date last modified, date last access, link count


Name Space

In UNIX, “devices are files”
  E.g., /dev/cdrom, /dev/tape
  User process accesses devices by accessing the corresponding file

[Figure: directory tree rooted at / with nodes usr, A, B, C, and D]


File Allocation

Contiguous: a contiguous set of blocks is pre-allocated to a file at the time of file creation
  Good for sequential files
  File size must be known at the time of file creation
  External fragmentation, like memory allocation when giving a contiguous block to each job
So what do we do?
  Dynamic allocation (new space allocated on demand)
    First fit (first chunk of sufficient size), best fit (smallest chunk of sufficient size), nearest fit (chunk of sufficient size that is closest to the previous allocation for the same file)
  Indexed allocation (contiguous and chained allocations are other options) with a file allocation table (FAT); the FAT includes file names and corresponding index block numbers
  Use a disk allocation table (bit map, chained, or indexed) to manage the free space
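The three placement rules named above (first fit, best fit, nearest fit) differ only in which sufficiently large free chunk they pick. A minimal sketch, assuming free space is given as a list of (start block, length) pairs; the representation and function names are illustrative.

```python
def first_fit(free_chunks, size):
    """Start of the first chunk that is large enough, or None."""
    for start, length in free_chunks:
        if length >= size:
            return start
    return None


def best_fit(free_chunks, size):
    """Start of the smallest chunk that is large enough, or None."""
    candidates = [(length, start) for start, length in free_chunks if length >= size]
    return min(candidates)[1] if candidates else None


def nearest_fit(free_chunks, size, previous_block):
    """Start of the large-enough chunk closest to the file's previous allocation."""
    candidates = [start for start, length in free_chunks if length >= size]
    if not candidates:
        return None
    return min(candidates, key=lambda start: abs(start - previous_block))
```

For free chunks (0, 5), (10, 3), (20, 8) and a request for 3 blocks, first fit picks block 0, best fit picks block 10, and nearest fit with a previous allocation at block 18 picks block 20.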


File Allocation Strategies

Contiguous allocation: find contiguous chunk for whole file

Chained allocation: pointer to next block allocated to file

Indexed: index block points to file blocks


Free Space Management

Bitmap: one bit for each block on the disk
  Good for finding a contiguous group of free blocks
  Small enough to be kept in memory
  Requires a sequential scan of bits
Chained free portions: pointer to the next one
Indexed: treats free space as a file
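The bitmap approach and its sequential scan can be sketched as follows. The bitmap is modeled as a list of 0/1 values, with 1 meaning "in use"; that encoding and the function name are assumptions for illustration.

```python
def find_free_run(bitmap, count):
    """Return the index of the first run of `count` free blocks, or None.

    bitmap[i] is 1 if block i is in use and 0 if it is free.
    """
    run_start, run_len = None, 0
    for i, used in enumerate(bitmap):
        if used:
            run_len = 0                   # the current run of free blocks is broken
            continue
        if run_len == 0:
            run_start = i                 # a new run of free blocks begins here
        run_len += 1
        if run_len == count:
            return run_start
    return None
```

The scan is linear in the number of blocks, which is the cost noted above; in exchange, a contiguous group of free blocks is easy to find.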


UNIX File

i-nodes


File System Buffer Cache

[Figure: three layers, top to bottom. Application: read/write files. OS: translate file to disk blocks, maintain the buffer cache, control disk accesses (read/write blocks). Hardware.]

Any problems?


File System Buffer Cache

Disks are “stable” while memory is volatile
  What happens if you buffer a write and the machine crashes before the write has been saved to disk?
  Can use write-through, but write performance will suffer
  In UNIX
    Use unbuffered I/O when writing i-nodes or pointer blocks
    Use buffered I/O for other writes and force a sync every 30 seconds
What about replacement?
How can we further improve performance?


Application-controlled caching

[Figure: three layers, top to bottom. Application: read/write files, replacement policy. OS: translate file to disk blocks, maintain the buffer cache, control disk accesses (read/write blocks). Hardware.]


Application-Controlled File Caching

Two-level block replacement: responsibility is split between the kernel and user level
A global allocation policy performed by the kernel, which decides which process will give up a block
A block replacement policy decided by the user:
  The kernel provides the candidate block as a hint to the process
  The process can overrule the kernel’s choice by suggesting an alternative block
  The suggested block is replaced by the kernel
Example of an alternative replacement policy: most-recently used (MRU)


Sound kernel-user cooperation

Oblivious processes should do no worse than under LRU
Foolish processes should not hurt other processes
Smart processes should perform better than LRU whenever possible and they should never perform worse
  If the kernel selects block A and the user chooses B instead, the kernel swaps the positions of A and B in the LRU list and places B in a “placeholder” which points to A (the kernel’s choice)
  If the user process misses on B (i.e., it made a bad choice), and B is found in the placeholder, then the block pointed to by the placeholder is chosen (prevents hurting other processes)


File System Consistency

File system almost always uses a buffer/disk cache for performance reasons
Two copies of a disk block (buffer cache, disk) → consistency problem if the system crashes before all the modified blocks are written back to disk
This problem is critical especially for the blocks that contain control information: i-node, free-list, directory blocks
Utility programs for checking block and directory consistency
Write critical blocks from the buffer cache to disk immediately
Data blocks are written to disk periodically: sync


More on File System Consistency

To maintain file system consistency the ordering of updates from buffer cache to disk is critical

Example: if the directory block (contains pointer to i-node) is written back before the i-node and the system crashes, the directory structure will be inconsistent

Similar case when free list is updated before i-node and the system crashes, free list will be incorrect

A more elaborate solution: use dependencies between blocks containing control data in the buffer cache to specify the ordering of updates
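One way to picture the dependency-based solution is as a topological sort of the dirty blocks. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the block names and the dependency table are illustrative, taken from the two examples above, and do not describe any particular kernel's mechanism.

```python
from graphlib import TopologicalSorter


def flush_order(must_precede):
    """Order in which dirty blocks may be written to disk.

    must_precede maps each block to the set of blocks that have to reach
    the disk before it.
    """
    return list(TopologicalSorter(must_precede).static_order())


# From the examples above: the i-node must be on disk before the directory
# block that points to it, and before the free-list update.
dependencies = {
    "directory block": {"i-node"},
    "free list": {"i-node"},
}
order = flush_order(dependencies)
```

Any order produced this way writes the i-node first; the relative order of the directory block and the free list is unconstrained. A cycle in the dependency table raises `graphlib.CycleError`, signalling that no safe ordering exists for that set of blocks.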


Protection Mechanisms

Files are OS objects: unique names and a finite set of operations that processes can perform on them
Protection domain is a set of {object, rights} pairs, where a right is the permission to perform one of the operations
At every instant in time, each process runs in some protection domain
In Unix, a protection domain is {uid, gid}
Protection domain in Unix is switched when running a program with SETUID/SETGID set or when the process enters kernel mode by issuing a system call
How to store all the protection domains?


Protection Mechanisms (cont’d)

Access Control List (ACL): associate with each object a list of all the protection domains that may access the object and how
  In Unix, the ACL is reduced to three protection domains: owner, group, and others
Capability List (C-list): associate with each process a list of objects that may be accessed along with the operations
  C-list implementation issues: where/how to store them (hardware, kernel, encrypted in user space) and how to revoke them
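The reduced Unix ACL (owner, group, others) amounts to selecting one of three 3-bit fields from the file mode. A simplified sketch: it ignores the superuser and any extended ACLs, and the function and parameter names are illustrative.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def may_access(mode, file_uid, file_gid, uid, gids, wanted):
    """Check `wanted` rights against rwx mode bits for owner, group, and others.

    mode is the usual 9-bit permission value (e.g. 0o640); gids is the set of
    groups the process belongs to.
    """
    if uid == file_uid:
        bits = (mode >> 6) & 7        # owner field
    elif file_gid in gids:
        bits = (mode >> 3) & 7        # group field
    else:
        bits = mode & 7               # others field
    return bits & wanted == wanted
```

For a file with mode 0o640, the owner may read and write, a member of the file's group may only read, and everyone else has no access.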


Log-Structured File System (LFS)

As memory gets larger, buffer cache size increases → a larger fraction of read requests can be satisfied from the buffer cache with no disk access
In the future, most disk accesses will be writes
  But writes are usually done in small chunks in most file systems (control data, for instance), which makes the file system highly inefficient
LFS idea: structure the entire disk as a log
  Periodically, or when required, all the pending writes being buffered in memory are collected and written as a single contiguous segment at the end of the log


LFS segment

Contains i-nodes, directory blocks, and data blocks, all mixed together
Each segment starts with a segment summary
Segment size: 512 KB - 1 MB
Two key issues:
  How to retrieve information from the log?
  How to manage the free space on disk?


File Location in LFS

The i-node contains the disk addresses of the file blocks, as in standard UNIX
But there is no fixed location for the i-node
An i-node map is used to maintain the current location of each i-node
i-node map blocks can also be scattered, but a fixed checkpoint region on the disk identifies the location of all the i-node map blocks
Usually i-node map blocks are cached in main memory most of the time, thus disk accesses for them are rare
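The role of the i-node map can be illustrated with a toy append-only log. This sketch keeps the map in memory only and leaves out segments, the checkpoint region, and cleaning; the class and method names are made up for illustration.

```python
class TinyLFS:
    """Toy log-structured store: every write appends; nothing is updated in place."""

    def __init__(self):
        self.log = []          # the disk, modeled as an append-only list of blocks
        self.inode_map = {}    # i-node number -> log address of its latest i-node

    def write_file(self, inode_no, data_blocks):
        addresses = []
        for block in data_blocks:
            self.log.append(("data", block))
            addresses.append(len(self.log) - 1)
        # The i-node has no fixed home: a new copy is appended after the data,
        # and the i-node map is pointed at it.
        self.log.append(("inode", addresses))
        self.inode_map[inode_no] = len(self.log) - 1

    def read_file(self, inode_no):
        _kind, addresses = self.log[self.inode_map[inode_no]]
        return [self.log[addr][1] for addr in addresses]
```

Rewriting a file leaves the old data blocks and the old i-node in the log as dead blocks; only the i-node map entry changes. Those dead blocks are what segment cleaning later has to reclaim.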


Segment Cleaning in LFS

LFS disk is divided into segments that are written sequentially
Live data must be copied out of a segment before the segment can be re-written
The process of copying data out of a segment: cleaning
A separate cleaner thread moves along the log, removes old segments from the end, and puts live data into memory for rewriting in the next segment
As a result, an LFS disk appears like a big circular buffer with the writer thread adding new segments to the front and the cleaner thread removing old segments from the end
Bookkeeping is not trivial: i-node must be updated when blocks are moved to the current segment


LFS Performance


LFS Performance (Cont’d)
