CS519: Lecture 4
I/O and File Management
CS 519 Operating System Theory
I/O Devices
So far we have talked about how to abstract and manage CPU and memory (processes, VM, etc.)
Now: I/O and file management
  I/O devices are the computer’s interface to the outside world (I/O = Input/Output)
  Example devices: display, keyboard, mouse, speakers, network interface, and disk
Basic Computer Structure
[Figure: CPU and memory on the memory bus (system bus); a bridge connects the memory bus to the I/O bus, which carries the disk and the NIC.]
Intel SR440BX Motherboard
[Figure: annotated motherboard. Labels: CPU; system bus and MMU/AGP/PCI controller; I/O bus; IDE disk controller; USB controller; another I/O bus; serial and parallel ports; keyboard and mouse.]
Communication Between CPU and I/O Devices
How does the CPU communicate with I/O devices?
Memory-mapped communication
  Each I/O device is assigned a portion of the physical address space
  CPU → I/O device
    CPU writes to locations in this area to "talk" to the I/O device
  I/O device → CPU
    Polling: CPU repeatedly checks location(s) in the portion of the address space assigned to the device
    Interrupt: device sends an interrupt (on an interrupt line) to get the attention of the CPU
Programmed I/O, Interrupt-Driven, Direct Memory Access
  PIO and ID = word at a time
  DMA = block at a time
Programmed I/O vs. DMA
Programmed I/O is OK for sending commands, receiving status, and communicating a small amount of data
Inefficient for a large amount of data
  Keeps the CPU busy during the transfer
  Programmed I/O memory operations are slow
Direct Memory Access
  Device reads/writes directly from/to memory
  Transfer from memory to device is typically initiated by the CPU
  Transfer from device to memory can be initiated by the device or the CPU
Programmed I/O vs. DMA
[Figure: three diagrams, each showing CPU, memory, and disk joined by an interconnect, labeled Programmed I/O, DMA, and DMA.]
Device Driver
OS module controlling an I/O device
Hides the device specifics from the layers above it in the kernel
Supports a common API
  UNIX: block or character device
    Block: device communicates with the CPU/memory in fixed-size blocks
    Character/Stream: stream of bytes
Translates logical I/O into device I/O
  E.g., logical disk blocks into {head, track, sector}
Performs data buffering and scheduling of I/O operations
Structure
  Several synchronous entry points: device initialization, queue I/O requests, state control, read/write
  An asynchronous entry point to handle interrupts
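As a rough illustration of translating a logical disk block into {head, track, sector}, the sketch below uses the conventional cylinder-head-sector arithmetic. The geometry values (heads, sectors per track) and the names `CHS`, `block_to_chs`, and `chs_to_block` are assumptions made up for this example; modern drives hide their physical geometry, so treat this as a model of the idea rather than what any particular driver does.

```python
from typing import NamedTuple


class CHS(NamedTuple):
    track: int   # track (cylinder) number, counted from 0
    head: int    # head (surface) number, counted from 0
    sector: int  # sector within the track, counted from 1 by convention


def block_to_chs(block: int, heads: int, sectors_per_track: int) -> CHS:
    """Map a logical block number to a (track, head, sector) triple."""
    if block < 0:
        raise ValueError("block number must be non-negative")
    track, rest = divmod(block, heads * sectors_per_track)
    head, sector0 = divmod(rest, sectors_per_track)
    return CHS(track, head, sector0 + 1)


def chs_to_block(addr: CHS, heads: int, sectors_per_track: int) -> int:
    """Inverse mapping, useful for checking the translation."""
    return (addr.track * heads + addr.head) * sectors_per_track + (addr.sector - 1)
```

With an assumed geometry of 16 heads and 63 sectors per track, block 63 is the first sector under head 1 of track 0, and block 1008 is the first sector of track 1.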
Some Common Entry Points for UNIX Device Drivers
Attach: attach a new device to the system.
Close: note the device is not in use.
Halt: prepare for system shutdown.
Init: initialize driver globals at load or boot time.
Intr: handle device interrupt (not used).
Ioctl: implement control operations.
Mmap: implement memory-mapping (SVR4).
Open: connect a process to a device.
Read: character-mode input.
Size: return logical size of block device.
Start: initialize driver at load or boot time.
Write: character-mode output.
I/O Buffering
I/O Transfer: DMA
  After an I/O request is placed, the source/destination of the I/O transfer must be locked in memory
  To allow the user process to continue (when possible), data is often copied from user address space to kernel buffers (or vice versa), which are pinned in memory
  Copying is expensive → asynchronous I/O
Devices are typically slow compared to the CPU
  How do we speed up accesses? Caching, of course …
I/O buffering
  Buffer cache: a buffer in main memory for block devices
  Character queue: follows the producer/consumer model (characters in the queue are read once)
User to Driver Control Flow
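[Figure: control flow from user level into the kernel. User-level read, write, and ioctl calls reach either a special file or an ordinary file. Remaining labels: file system, buffer cache, block device, character device, character queue, driver_read/write, driver-strategy.]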
Buffer Cache
When an I/O request is made for a block, the buffer cache is checked first
If the block is missing from the cache, it is read into the buffer cache from the device
Exploits locality of reference like any other cache
Replacement policies similar to those for VM
UNIX
  Historically, UNIX has a buffer cache for the disk which does not share buffers with character/stream devices
  Adds overhead in a path that has become increasingly common: disk → NIC
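The check-the-cache-first behavior can be sketched in a few lines. This is a minimal model, assuming LRU replacement (one of the VM-style policies mentioned above) and a caller-supplied `read_block` function standing in for the device; the class and attribute names are invented for the example.

```python
from collections import OrderedDict


class BufferCache:
    """Tiny LRU block cache placed in front of a slow block device."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block   # callable: block number -> bytes
        self.blocks = OrderedDict()    # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, block_no):
        if block_no in self.blocks:
            # Hit: no device access, just refresh the block's recency.
            self.hits += 1
            self.blocks.move_to_end(block_no)
            return self.blocks[block_no]
        # Miss: read from the device, evicting the LRU block if full.
        self.misses += 1
        data = self.read_block(block_no)
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)
        self.blocks[block_no] = data
        return data
```

Only misses reach `read_block`, so repeated accesses to a small working set cost one device read per block.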
Disks
Seek time: time to move the disk head to the desired track
Rotational delay: time to reach the desired sector once the head is over the desired track
Transfer rate: rate at which data is read from/written to the disk
Some typical parameters:
  Seek: ~10-15 ms
  Rotational delay: ~4.17 ms for 7200 rpm (average, half a revolution)
  Transfer rate: 30 MB/s
[Figure: disk platter showing sectors and tracks]
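A quick calculation shows how these parameters combine. The sketch below assumes the average rotational delay is half a revolution, a 512-byte sector, and 1 MB = 10^6 bytes; none of these assumptions come from the slides, and the function names are made up for the example.

```python
def avg_rotational_delay_ms(rpm: float) -> float:
    """Average wait for the sector: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def transfer_time_ms(num_bytes: int, rate_mb_per_s: float) -> float:
    """Time to move num_bytes at the given sustained rate, in milliseconds."""
    return num_bytes / (rate_mb_per_s * 1_000_000) * 1000.0


def access_time_ms(seek_ms: float, rpm: float, num_bytes: int,
                   rate_mb_per_s: float) -> float:
    """Seek + average rotational delay + transfer time for one request."""
    return (seek_ms
            + avg_rotational_delay_ms(rpm)
            + transfer_time_ms(num_bytes, rate_mb_per_s))
```

For a 7200 rpm disk the average rotational delay is about 4.17 ms, while transferring one 512-byte sector at 30 MB/s takes under 0.02 ms, so positioning dominates the cost of a single-sector access.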
Disk Scheduling
Disks are at least four orders of magnitude slower than main memory
  The performance of disk I/O is vital for the performance of the computer system as a whole
Access time (seek time + rotational delay) >> transfer time for a sector
  Therefore the order in which sectors are read matters a lot
Disk scheduling
  Usually based on the position of the requested sector rather than on process priority
  Possibly reorders the stream of read/write requests to improve performance
Disk Scheduling Policies
Shortest-service-time-first (SSTF): pick the request that requires the least movement of the head
SCAN (back and forth over disk): good service distribution
C-SCAN (one way with fast return): lower service variability
Problem with SSTF, SCAN, and C-SCAN: the arm may not move for a long time (due to rapid-fire accesses to the same track)
N-step SCAN: scan of N records at a time by breaking the request queue into segments of size at most N and cycling through them
FSCAN: uses two sub-queues; during a scan one queue is consumed while the other one is filled with new requests
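To make the first three policies concrete, here is a small sketch that orders a fixed batch of track requests. It is a simplification under stated assumptions: requests are plain track numbers, no new requests arrive during the sweep, and SCAN reverses at the last pending request rather than at the physical edge of the disk (often called LOOK). All function names are invented for the example.

```python
def sstf(start, requests):
    """Shortest-service-time-first: always serve the closest pending track."""
    pending, head, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda t: abs(t - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order


def scan(start, requests):
    """Sweep toward higher tracks first, then reverse and sweep back down."""
    up = sorted(t for t in requests if t >= start)
    down = sorted((t for t in requests if t < start), reverse=True)
    return up + down


def c_scan(start, requests):
    """Sweep toward higher tracks, jump back, and sweep upward again."""
    up = sorted(t for t in requests if t >= start)
    wrapped = sorted(t for t in requests if t < start)
    return up + wrapped


def total_movement(start, order):
    """Number of tracks the head crosses when serving requests in this order."""
    moved, head = 0, start
    for t in order:
        moved += abs(t - head)
        head = t
    return moved
```

Comparing `total_movement` for the same request list under each policy shows the trade-off: SSTF tends to minimize head travel, while C-SCAN pays for its return trip in exchange for more uniform waiting times.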
RAID
Redundant Array of Inexpensive Disks (RAID)
A set of physical disk drives viewed by the OS as a single logical drive
Replace large-capacity disks with multiple smaller-capacity drives to improve the I/O performance (at lower price)
Data are distributed across physical drives in a way that enables simultaneous access to data from multiple drives
Redundant disk capacity is used to compensate for the increase in the probability of failure due to multiple drives
  Improves availability because there is no single point of failure
Six levels of RAID representing different design alternatives
RAID Level 0
Does not include redundancy
Data is striped across the available disks
  Total storage space across all disks is divided into strips
  Strips are mapped round-robin to consecutive disks
  A set of consecutive strips that maps exactly one strip to each disk in the array is called a stripe
Can you see how this improves the disk I/O bandwidth?
What access pattern gives the best performance?
[Figure: four disks. Stripe 0 holds strips 0-3, one per disk; the next stripe holds strips 4-7; and so on.]
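The round-robin mapping reduces to a division and a remainder. The sketch below assumes disks and strips are numbered from 0; the function names are hypothetical.

```python
def locate_strip(strip: int, num_disks: int):
    """Return (disk, stripe) for a logical strip under round-robin striping."""
    stripe, disk = divmod(strip, num_disks)
    return disk, stripe


def disks_touched(first_strip: int, count: int, num_disks: int):
    """Disks involved in a request for `count` consecutive strips."""
    return sorted({locate_strip(s, num_disks)[0]
                   for s in range(first_strip, first_strip + count)})
```

A request covering a full stripe touches every disk once, so all drives can transfer in parallel, whereas a single-strip request keeps only one disk busy.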
RAID Level 1
Redundancy achieved by duplicating all the data
  Every disk has a mirror disk that stores exactly the same data
A read can be serviced by either of the two disks that contain the requested data (improved performance over RAID 0 if reads dominate)
A write request must be done on both disks but can be done in parallel
Recovery is simple but cost is high
[Figure: RAID 1 layout. Strips 0-3 each appear twice, once on a data disk and once on its mirror.]
RAID Levels 2 and 3
Parallel access: all disks participate in every I/O request
Small strips since size of each read/write = # of disks * strip size
RAID 2: error correcting code is calculated across corresponding bits on each data disk and stored on log(# data disks) parity disks
  Hamming code: can correct single-bit errors and detect double-bit errors
  Less expensive than RAID 1 but still pretty high overhead; not really needed in most reasonable environments
RAID 3: a single redundant disk that keeps parity bits
  $P(i) = X_2(i) \oplus X_1(i) \oplus X_0(i)$
  In the event of a failure, data can be reconstructed, e.g. $X_2(i) = P(i) \oplus X_1(i) \oplus X_0(i)$
  Can only tolerate a single failure at a time
[Figure: data disks b0, b1, b2 and parity disk P(b)]
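The parity relation for RAID 3 can be tried out directly on byte strings. This is only a sketch: it models each disk's contents as a Python `bytes` object of equal length, and the function names are invented here.

```python
from functools import reduce


def parity(strips):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def reconstruct(surviving_strips, parity_strip):
    """Rebuild the one missing strip from the survivors plus the parity."""
    return parity(list(surviving_strips) + [parity_strip])
```

Because XOR is its own inverse, the same routine that computes the parity also recovers a lost strip; losing two strips at once leaves too little information, which is why only a single failure can be tolerated.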
RAID Levels 4 and 5
RAID 4
  Large strips with a parity strip like RAID 3
  Independent access: each disk operates independently, so multiple I/O requests can be satisfied in parallel
  Independent access → small write = 2 reads + 2 writes
  Example: if write performed only on strip 0:
    $P'(i) = X_2(i) \oplus X_1(i) \oplus X_0'(i)$
    $= X_2(i) \oplus X_1(i) \oplus X_0'(i) \oplus X_0(i) \oplus X_0(i)$
    $= P(i) \oplus X_0'(i) \oplus X_0(i)$
  Parity disk can become bottleneck
RAID 5
  Like RAID 4 but parity strips are distributed across all disks
[Figure: RAID 5 layout with data strips 0-5 and parity strips P(0-2) and P(3-5) placed on different disks.]
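The small-write shortcut derived for RAID 4 can be checked numerically. In this sketch strips are modeled as equal-length `bytes` objects and the helper names are assumptions of the example; the two reads are the old data and old parity, the two writes are the new data and new parity.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length strips."""
    if len(a) != len(b):
        raise ValueError("strips must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_parity: bytes, old_data: bytes,
                       new_data: bytes) -> bytes:
    """New parity after overwriting one strip: P' = P xor X0 xor X0'."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)
```

The result matches a full recomputation over all data strips, yet the other data disks are never read.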
File System
File system is an abstraction of the disk
  File → Track/sector
To a user process
  A file looks like a contiguous block of bytes (Unix)
  A file system provides a coherent view of a group of files
  A file system provides protection
API: create, open, delete, read, write files
Performance: throughput vs. response time
Reliability: minimize the potential for lost or destroyed data
  E.g., RAID could be implemented in the OS as part of the disk device driver
Unix File System
Ordinary files (uninterpreted)
Directories
  File of files
  Organized as a rooted tree
  Pathnames (relative and absolute)
  Contains links to parent, itself
  Multiple links to files can exist
    Link: hard OR symbolic
Unix File Systems (Cont’d)
Tree-structured file hierarchies
Mounted on existing space by using mount
No links between different file systems
File Naming
Each file has a unique name
  User-visible (external) name must be symbolic
  In a hierarchical file system, unique external names are given as pathnames (path from the root to the file)
  Internal names: i-node in UNIX, an index into an array of file descriptors/headers for a volume
Directory: translation from external to internal name
  May have more than one external name for a single internal name
Information about a file is split between the directory and the file descriptor: name, type, size, location on disk, owner, permissions, date created, date last modified, date last access, link count
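The external-to-internal translation can be modeled with directories that are just name-to-number tables. The sketch below is a toy under stated assumptions: the `Inode` and `Volume` classes, their fields, and the convention that i-node 0 is the root are all invented for illustration and do not mirror any real on-disk format.

```python
class Inode:
    def __init__(self, kind):
        self.kind = kind        # "dir" or "file"
        self.link_count = 0     # number of external names pointing here
        self.entries = {}       # directories only: external name -> i-node number


class Volume:
    ROOT = 0

    def __init__(self):
        self.inodes = [Inode("dir")]   # i-node number = index into this array

    def create(self, parent, name, kind):
        """Allocate a new i-node and enter it in the parent directory."""
        self.inodes.append(Inode(kind))
        ino = len(self.inodes) - 1
        self.link(parent, name, ino)
        return ino

    def link(self, parent, name, ino):
        """Add one more external name (a hard link) for an existing i-node."""
        self.inodes[parent].entries[name] = ino
        self.inodes[ino].link_count += 1

    def resolve(self, path):
        """Walk an absolute pathname one component at a time."""
        ino = self.ROOT
        for name in filter(None, path.split("/")):
            ino = self.inodes[ino].entries[name]   # KeyError if the name is missing
        return ino
```

Two different pathnames can resolve to the same i-node number, which is exactly the "more than one external name for a single internal name" case, and the link count records how many such names exist.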
Name Space
In UNIX, “devices are files”
  E.g., /dev/cdrom, /dev/tape
User process accesses devices by accessing the corresponding file
[Figure: directory tree rooted at / with nodes usr, A, B, C, and D]
File Allocation
Contiguous: a contiguous set of blocks is pre-allocated to a file at the time of file creation
  Good for sequential files
  File size must be known at the time of file creation
  External fragmentation, like memory allocation when giving a contiguous block to each job
  So what do we do?
Dynamic allocation (new space allocated on demand)
  First fit (first chunk of sufficient size), best fit (smallest chunk of sufficient size), nearest fit (chunk of sufficient size that is closest to the previous allocation for the same file)
Indexed allocation (contiguous and chained allocations are other options) with file allocation table
  FAT includes file names and corresponding index block numbers
Use a disk allocation table (bit map, chained, and indexed) to manage the free space
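The three fit strategies differ only in which sufficiently large free chunk they pick. The sketch below assumes the free space is given as a list of `(start_block, length)` pairs in disk order; that representation and the function names are choices made for the example.

```python
def first_fit(free, size):
    """Start of the first chunk that is large enough, or None."""
    for start, length in free:
        if length >= size:
            return start
    return None


def best_fit(free, size):
    """Start of the smallest chunk that is large enough, or None."""
    fits = [(length, start) for start, length in free if length >= size]
    return min(fits)[1] if fits else None


def nearest_fit(free, size, previous):
    """Start of the large-enough chunk closest to a previous allocation."""
    fits = [start for start, length in free if length >= size]
    return min(fits, key=lambda s: abs(s - previous)) if fits else None
```

For the free list `[(0, 5), (10, 3), (20, 8), (40, 4)]` and a 4-block request, first fit picks block 0, best fit picks block 40, and nearest fit relative to block 22 picks block 20.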
File Allocation Strategies
Contiguous allocation: find contiguous chunk for whole file
Chained allocation: pointer to next block allocated to file
Indexed: index block points to file blocks
Free Space Management
Bitmap: one bit for each block on the disk
  Good to find a contiguous group of free blocks
  Small enough to be kept in memory
  Requires sequential scan of bits
Chained free portions: pointer to the next one
Indexed: treats free space as a file
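The sequential scan over a bitmap looks roughly like this. For readability the sketch stores one list element per block (1 = in use, 0 = free) instead of packing bits into words; the function names are hypothetical.

```python
def find_free_run(bitmap, count):
    """Index of the first run of `count` free blocks, or None if there is none."""
    if count < 1:
        raise ValueError("count must be at least 1")
    run_start, run_len = 0, 0
    for i, used in enumerate(bitmap):
        if used:
            run_len = 0                # run broken, start over
        else:
            if run_len == 0:
                run_start = i          # a new free run begins here
            run_len += 1
            if run_len == count:
                return run_start
    return None


def allocate(bitmap, count):
    """Mark the first suitable run as in use and return its start (or None)."""
    start = find_free_run(bitmap, count)
    if start is not None:
        for i in range(start, start + count):
            bitmap[i] = 1
    return start
```

Finding a contiguous group is a single pass, but that pass is linear in the number of blocks, which is the sequential-scan cost noted above.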
UNIX File
i-nodes
File System Buffer Cache
application: read/write files
OS: translate file to disk blocks
  maintains the buffer cache
  controls disk accesses: read/write blocks
hardware:
Any problems?
File System Buffer Cache
Disks are “stable” while memory is volatile
  What happens if you buffer a write and the machine crashes before the write has been saved to disk?
  Can use write-through but write performance will suffer
  In UNIX
    Use unbuffered I/O when writing i-nodes or pointer blocks
    Use buffered I/O for other writes and force sync every 30 seconds
What about replacement?
How can we further improve performance?
Application-controlled caching
application: read/write files, replacement policy
OS: translate file to disk blocks
  maintains the buffer cache
  controls disk accesses: read/write blocks
hardware:
Application-Controlled File Caching
Two-level block replacement: responsibility is split between kernel and user level
A global allocation policy performed by the kernel, which decides which process will give up a block
A block replacement policy decided by the user:
  Kernel provides the candidate block as a hint to the process
  The process can overrule the kernel’s choice by suggesting an alternative block
  The suggested block is replaced by the kernel
Examples of alternative replacement policy: most-recently used (MRU)
Sound kernel-user cooperation
Oblivious processes should do no worse than under LRU
Foolish processes should not hurt other processes
Smart processes should perform better than LRU whenever possible and they should never perform worse
If the kernel selects block A and the user chooses B instead, the kernel swaps the positions of A and B in the LRU list and places B in a “placeholder” which points to A (the kernel’s choice)
If the user process misses on B (i.e. it made a bad choice), and B is found in the placeholder, then the block pointed to by the placeholder is chosen (prevents hurting other processes)
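The swap-and-placeholder rule can be modeled for a single process as follows. This is a heavily simplified sketch, not the kernel mechanism itself: it ignores the global allocation among processes, represents the LRU list as a Python list, and takes the user policy as an `advisor` callable; every name here is an assumption of the example.

```python
class AppControlledCache:
    """Single-process model of LRU with user overrides and placeholders."""

    def __init__(self, capacity, advisor=None):
        self.capacity = capacity
        self.advisor = advisor     # callable(candidate, resident) -> block to evict
        self.lru = []              # index 0 is the least recently used block
        self.placeholders = {}     # evicted block B -> kernel's original choice A

    def access(self, block):
        """Reference a block; return True on a hit and False on a miss."""
        if block in self.lru:
            self.lru.remove(block)
            self.lru.append(block)
            return True
        if len(self.lru) >= self.capacity:
            victim = self._pick_victim(block)
            self.lru.remove(victim)
            # Placeholders that pointed at the victim no longer mean anything.
            self.placeholders = {b: k for b, k in self.placeholders.items()
                                 if k != victim}
        self.placeholders.pop(block, None)
        self.lru.append(block)
        return False

    def _pick_victim(self, missing):
        kept = self.placeholders.get(missing)
        if kept in self.lru:
            # The earlier override was a mistake: evict the kernel's choice now.
            return kept
        candidate = self.lru[0]
        choice = candidate
        if self.advisor is not None:
            choice = self.advisor(candidate, tuple(self.lru))
        if choice != candidate and choice in self.lru:
            j = self.lru.index(choice)
            self.lru[0], self.lru[j] = self.lru[j], self.lru[0]  # swap A and B
            self.placeholders[choice] = candidate                # B points to A
            return choice
        return candidate
```

With an MRU advisor such as `lambda candidate, resident: resident[-1]`, a later miss on the block the process gave up is charged to the block the kernel originally wanted to evict, so the process pays for its own bad choice.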
File System Consistency
File system almost always uses a buffer/disk cache for performance reasons
Two copies of a disk block (buffer cache, disk) → consistency problem if the system crashes before all the modified blocks are written back to disk
This problem is critical especially for the blocks that contain control information: i-node, free-list, directory blocks
Utility programs for checking block and directory consistency
Write critical blocks from the buffer cache to disk immediately
Data blocks are written to disk periodically: sync
More on File System Consistency
To maintain file system consistency the ordering of updates from buffer cache to disk is critical
Example: if the directory block (contains pointer to i-node) is written back before the i-node and the system crashes, the directory structure will be inconsistent
Similarly, if the free list is updated before the i-node and the system crashes, the free list will be incorrect
A more elaborate solution: use dependencies between blocks containing control data in the buffer cache to specify the ordering of updates
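One way to picture dependency-based ordering is as a topological sort over the dirty control blocks. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the block names and the `flush_order` helper are hypothetical, and a real buffer cache would track dependencies per buffer rather than by string name.

```python
from graphlib import TopologicalSorter


def flush_order(must_follow):
    """Return a write order that respects every dependency.

    must_follow maps a block to the set of blocks that have to reach
    the disk before it does.
    """
    return list(TopologicalSorter(must_follow).static_order())


# Assumed example: the i-node block must be on disk before the directory
# block that points to it and before the updated free list.
dependencies = {
    "directory block": {"i-node block"},
    "free list": {"i-node block"},
}
order = flush_order(dependencies)
```

Any order returned writes the i-node block first; a cyclic set of dependencies raises `graphlib.CycleError`, which corresponds to a situation the cache would have to break up some other way.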
Protection Mechanisms
Files are OS objects: unique names and a finite set of operations that processes can perform on them
Protection domain is a set of {object, rights} pairs, where a right is the permission to perform one of the operations
At every instant in time, each process runs in some protection domain
In Unix, a protection domain is {uid, gid}
Protection domain in Unix is switched when running a program with SETUID/SETGID set or when the process enters kernel mode by issuing a system call
How to store all the protection domains?
Protection Mechanisms (cont’d)
Access Control List (ACL): associate with each object a list of all the protection domains that may access the object and how
  In Unix, the ACL is reduced to three protection domains: owner, group, and others
Capability List (C-list): associate with each process a list of objects that may be accessed along with the operations
  C-list implementation issues: where/how to store them (hardware, kernel, encrypted in user space) and how to revoke them
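The reduced Unix ACL amounts to choosing one of three 3-bit fields in the file mode. The sketch below assumes the traditional octal layout (owner, group, others) and deliberately ignores the superuser and any extended ACLs; the function and constant names are made up for the example.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def allowed(mode, file_uid, file_gid, uid, gids, want):
    """Check `want` (a mask of READ/WRITE/EXECUTE) against a Unix-style mode.

    Exactly one class applies: owner if the uid matches, otherwise group
    if the file's gid is among the process's groups, otherwise others.
    """
    if uid == file_uid:
        bits = (mode >> 6) & 7
    elif file_gid in gids:
        bits = (mode >> 3) & 7
    else:
        bits = mode & 7
    return (bits & want) == want
```

For mode `0o640`, the owner may read and write, group members may only read, and everyone else is refused. Note that the classes are exclusive: an owner whose own bits deny access is refused even if the group bits would allow it.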
Log-Structured File System (LFS)
As memory gets larger, buffer cache size increases → a larger fraction of read requests can be satisfied from the buffer cache with no disk access
In the future, most disk accesses will be writes
  But writes are usually done in small chunks in most file systems (control data, for instance), which makes the file system highly inefficient
LFS idea: structure the entire disk as a log
  Periodically, or when required, all the pending writes being buffered in memory are collected and written as a single contiguous segment at the end of the log
LFS segment
Contains i-nodes, directory blocks and data blocks, all mixed together
Each segment starts with a segment summary
Segment size: 512 KB - 1 MB
Two key issues:
  How to retrieve information from the log?
  How to manage the free space on disk?
File Location in LFS
The i-node contains the disk addresses of the file blocks as in standard UNIX
But there is no fixed location for the i-node
An i-node map is used to maintain the current location of each i-node
i-node map blocks can also be scattered, but a fixed checkpoint region on the disk identifies the location of all the i-node map blocks
Usually i-node map blocks are cached in main memory most of the time, thus disk accesses for them are rare
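The lookup chain (i-node map, then i-node, then data blocks) can be shown with a toy append-only log. This sketch keeps the whole i-node map in memory and leaves out segments, the segment summary, and the checkpoint region; the `TinyLFS` class and its record layout are invented for illustration.

```python
class TinyLFS:
    """Append-only log with an in-memory i-node map."""

    def __init__(self):
        self.log = []    # the "disk": a record's index is its disk address
        self.imap = {}   # i-node number -> address of the newest i-node copy

    def _append(self, record):
        self.log.append(record)
        return len(self.log) - 1

    def write_file(self, ino, blocks):
        """Append the data blocks, then a fresh i-node that points at them."""
        addrs = [self._append(("data", b)) for b in blocks]
        self.imap[ino] = self._append(("inode", ino, addrs))

    def read_file(self, ino):
        """Follow i-node map -> i-node -> data block addresses."""
        _, _, addrs = self.log[self.imap[ino]]
        return [self.log[a][1] for a in addrs]
```

Rewriting a file never touches the old records: new blocks and a new i-node are appended, and only the map entry changes, which leaves dead data behind in the log.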
Segment Cleaning in LFS
LFS disk is divided into segments that are written sequentially
Live data must be copied out of a segment before the segment can be re-written
The process of copying data out of a segment: cleaning
A separate cleaner thread moves along the log, removes old segments from the end and puts live data into memory for rewriting in the next segment
As a result an LFS disk appears like a big circular buffer with the writer thread adding new segments to the front and the cleaner thread removing old segments from the end
Bookkeeping is not trivial: i-node must be updated when blocks are moved to the current segment
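A single cleaning step might look like the sketch below. It assumes each block in a segment is tagged with its owning i-node number and block index (standing in for the segment summary), that i-nodes are kept as a dictionary of block addresses, and that survivors go into a brand-new segment; these data structures and the function name are assumptions of the example, not the layout used by any real LFS.

```python
def clean_oldest(segments, inodes):
    """Copy live blocks out of the oldest segment and free that segment.

    segments: dict seg_id -> list of (ino, block_index, data); insertion
              order is taken as age, oldest first.
    inodes:   dict ino -> list of (seg_id, slot) addresses, one per block.
    Returns the id of the freed segment.
    """
    old_id = next(iter(segments))
    new_id = max(segments) + 1
    survivors = []
    for slot, (ino, idx, data) in enumerate(segments[old_id]):
        addrs = inodes.get(ino, [])
        # A block is live only if its i-node still points at this exact slot.
        if idx < len(addrs) and addrs[idx] == (old_id, slot):
            addrs[idx] = (new_id, len(survivors))   # update the i-node
            survivors.append((ino, idx, data))
    del segments[old_id]
    if survivors:
        segments[new_id] = survivors
    return old_id
```

Blocks that were overwritten later are simply dropped, while each surviving block forces an update to the i-node that references it, which is the bookkeeping cost mentioned above.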
LFS Performance
LFS Performance (Cont’d)