CMT 412 Distributed OS Notes



CMT 412: DISTRIBUTED OPERATING SYSTEMS – LECTURE NOTES

COURSE OUTLINE

1. Introduction to DOS: Introduction to OS; HW / SW Architectures; SW Concepts; NOS Systems; DOS Systems

2. Distributed Systems: Introduction to DS; Characteristics of DS; Design Issues of DS; Systems Models

3. Communications in DS: Process & Thread Creation; Networks and Protocols; Communication Models - Inter Process Communication, Remote Procedure Call, Message Passing

4. Distributed Processing: Synchronization; Distributed Shared Memory; Resource Management; Process Migration

5. System Naming Schemes: Basic Concepts; System Oriented Names; File Oriented Names; Name Spaces and Resolution

6. Distributed File Systems: File Service Models; Shared Semantics; Network File Systems; Name Caches & Schemes

7. Distributed Transactions: Transaction Implementation; Concurrency Control Approaches; Transaction Synchronization

8. Distributed Fault Tolerance: Basic Concepts; Failure Models & Masking; Replication Models; Communication Models; Distributed Commit

9. Distributed Security: Security Threats; Policies and Mechanisms; Security Design Issues; Layering Security Mechanisms; Distributing Security Mechanisms; Simplicity & Cryptography

REFERENCES
1. Operating Systems – Design and Implementation by Andrew Tanenbaum (2005), Prentice Hall
2. Distributed Computing: Principles and Applications by Liu M.L (2004), Pearson Addison-Wesley
3. Schaum’s Outlines of Operating Systems by Archer J. H (2002), McGraw Hill
4. Operating System Projects using Windows NT by Nutt Gary (2001), Addison Wesley
5. Distributed Operating Systems – Concepts and Design by Pradeep K. S (2001), Prentice Hall
6. Other Resources: The Internet, Papers, Handouts, Lecture Notes, etc.


INSTRUCTIONAL MATERIALS
Computers and Projectors
Writing Boards and Marker Pens
Well Ventilated Lecture Theaters

COURSE ASSESSMENT
a) Student Performance
Two Assignments contributing 10%
Three Sitting CATs contributing 20%
End Unit Exam contributing 70%

b) Lecturer Performance
Based on Student Evaluation, Head of Department Evaluation and Lecturer / Self Evaluation

TEACHING METHODOLOGY / DELIVERY:
Lectures, Tutorials and Tests
Readings on Relevant Materials
Group Work, Discussions and Reporting
Research in the field of study and others

LECTURER:
Name: Mr. Josphat K. Kyalo
Cells: 0724 577772, 0732 307 288
Email: [email protected], [email protected]

COURSE OBJECTIVES

1. To understand the software that implements networking

2. To explain the principles of a Network Operating System, e.g. Unix, Windows NT, etc.

3. To discuss DOS, process management and distributed file systems

4. To analyze distributed transactions, replication, naming and security

5. To apply the knowledge of DOS to sampled case studies


INTRODUCTION TO OPERATING SYSTEMS
Software makes a computer useful. With software a computer can store, process and retrieve information. Computer software can roughly be divided into two forms: system programs and application programs. System programs manage the operations of the computer itself, while application programs perform the work that the user wants. The most fundamental system program is the operating system, which controls all computer resources and provides the base upon which application programs run. Long ago, there was no such thing as an operating system. Computers ran one program at a time. Programmers would load the programs they had written and run them. If there was a bug in the program, the programmer had to start over. Even if the program did run correctly, the programmer probably never got to work on the machine directly: the program (punched cards) was fed into the computer by an operator, who then passed the printed output to the programmer later on.

As technology advanced, many such programs, or jobs, were all loaded onto a single tape. This tape was then loaded and manipulated by another program, which was the ancestor of today's operating systems. This program (also known as a monitor) would monitor the behavior of the running program, and if it misbehaved (crashed), the monitor could immediately load and run another. The process of loading and monitoring programs in the computer was cumbersome, and with time it became apparent that some way had to be found to shield programmers from the complexity of the hardware and allow for smooth sharing of the relatively vast computer resources. The way that has evolved gradually is to put a layer of software on top of the bare hardware, to manage all the components of the system and present the user with a virtual machine that is easier to understand and program. This layer of software is the operating system.

The concept of an operating system can be illustrated using the following diagram:

Figure: The layers of a computer system

    Application Programs:  Banking System | Airline Reservation | Web Browser
    System Programs:       Compilers | Editors | Command Interpreter
                           Operating System
    Hardware:              Machine Language
                           Microprogramming
                           Physical Devices

At the bottom layer is the hardware, which in many cases is composed of two or more layers. The lowest layer contains physical devices such as IC chips, wires, network cards, cathode ray tubes, etc. The next layer, which may be absent in some machines, is a layer of primitive software that directly controls the physical devices and provides a clean interface to the next layer. This software, called the micro-program, is normally located in ROM. It is an interpreter, fetching machine language instructions such as ADD, MOVE and JUMP and carrying them out as a series of little steps. The set of instructions that the micro-program can interpret defines the machine language. The machine language typically has between 50 and 300 instructions, mostly for moving data around the machine, doing arithmetic and comparing values. In this layer, I/O devices are controlled by loading values into special device registers.

A major function of the operating system is to hide all this complexity and give the programmer and users a more convenient set of instructions to work with. For example, COPY FILE1 FILE2 is conceptually simpler than having to worry about the location of the original file on disk, the location of the new file and the movement of the disk heads to effect the copying. On top of the operating system is the rest of the system software, for example compilers, command interpreters and editors. These are not part of the operating system. The operating system runs in kernel or supervisor mode, meaning it is protected from user tampering. Compilers and editors run in user mode, meaning that users are free to write their own compiler or editor if they so wish. Finally, above the system programs come the application programs. These are programs purchased or written by the users to solve particular problems, such as word-processing, spreadsheets, databases, etc.

FUNCTIONS OF AN OPERATING SYSTEM
Provision of a virtual machine - A virtual machine is software that creates an environment between the computer platform and the end-user. A programmer does not want to get too intimately involved with programming hardware devices like floppy disks, hard disks and memory. Instead, the programmer wants a simple, high-level abstraction to deal with. In the case of disks, a typical abstraction would be that the disk contains a collection of named files. Each file can be opened for reading or writing, then read or written, and finally closed.

The program that hides the truth about hardware from the programmer and presents a nice, simple view of named files that can be read and written is, of course, the operating system. The operating system also conceals a lot of unpleasant business concerning interrupts, timers, memory management and other low level features. In this view, the function of the operating system is to present the user with the equivalent of an extended machine or virtual machine that is easier than the underlying hardware.

Resource management - Modern computers consist of processors, memories, timers, disks, network interface cards, printers, etc. The job of the operating system is to provide for an orderly and controlled allocation of processors, memories and I/O devices among the various programs competing for them. For example, if different programs running on the same computer send print jobs to the same printer at the same time and the printing is not controlled, the printout will be interleaved, with say the first line belonging to the first program, the second line to the second program, and so on. The operating system brings order to such situations by buffering all output destined for the printer on disk. When one program is finished, the operating system can then copy its output from the disk to the printer. In this view, the operating system keeps track of who is using which resource, grants resource requests, accounts for usage and mediates conflicting requests from different programs and users.

OPERATING SYSTEM CONCEPTS
The interface between the operating system and user programs is defined by the set of ‘extended instructions’ that the operating system provides. These instructions are referred to as system calls. The calls available in the interface vary from one operating system to another, although the underlying concept is similar.

Processes - A process is basically a program in execution. Associated with each process is its address space: the memory locations which the process can read and write. The address space contains the executing program, its data and its stack. Also associated with each process is a set of registers, including the program counter, stack pointer and hardware registers, and all the other information needed to run the program. In a time-sharing system, the operating system periodically decides to stop running one process and start running another. When a process is suspended temporarily, it must later be restarted in exactly the same state it had when it was stopped. This means that the context of the process must be explicitly saved during suspension. In many operating systems, the information about each process, apart from the contents of its address space, is stored in a table called the process table.

Therefore, a suspended process consists of its address space, usually referred to as the core image, and its process table entry. The key process management system calls are those dealing with the creation and termination of processes. For example, a command interpreter (shell) reads commands from a terminal, for instance a request to compile a program. The shell must create a new process that will run the compiler, and when that process has finished the compilation, it executes a system call to terminate itself. A process can create other processes, known as child processes, and these processes can in turn create other child processes. Related processes that are cooperating to get some job done often need to communicate with one another and synchronize their activities. This communication is known as Inter-Process Communication (IPC). Other system calls are available to request more memory or release unused memory, wait for a child process to terminate, and overlay a process's program with a different one.
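To make the process-creation sequence above concrete, here is a minimal sketch of how a shell might create a child process to run a compiler, using the POSIX fork/exec/wait calls; the compiler name "cc" and the source file "prog.c" are placeholders:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();              /* create a child process                  */
        if (pid == 0) {
            /* Child: overlay its core image with the compiler program. */
            execlp("cc", "cc", "prog.c", (char *)NULL);
            perror("execlp");            /* reached only if the exec fails          */
            _exit(1);
        } else if (pid > 0) {
            int status;
            waitpid(pid, &status, 0);    /* parent waits for the child to terminate */
            printf("child exited with status %d\n", WEXITSTATUS(status));
        } else {
            perror("fork");
        }
        return 0;
    }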

Files - A file is a collection of related information defined by its creator. Commonly, files represent programs and data. Data files may be numeric, alphabetic or alphanumeric. System calls are needed to create, delete, move, copy, read and write files. Before a file can be read it must be opened, and after reading it should be closed; system calls are provided to do all these things. Files are normally organized into logical clusters, or directories, which make them easier to locate and access. For example, you can have directories for keeping all your program files, word-processing documents, database files, spreadsheets, electronic mail, etc. System calls are available to create and remove directories, to put an existing file into a directory, and to remove a file from a directory. Every file within a directory hierarchy can be specified by giving its path name from the root directory. Such absolute path names consist of the list of directories that must be traversed from the root directory to get to the file, with slashes separating the components.
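As a sketch of what these file system calls look like in practice, a command such as COPY FILE1 FILE2 ultimately reduces to something like the following (POSIX open/read/write/close, with error handling trimmed):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int in  = open("FILE1", O_RDONLY);                            /* source, read-only      */
        int out = open("FILE2", O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* create/truncate target */
        if (in < 0 || out < 0) { perror("open"); return 1; }

        char buf[4096];
        ssize_t n;
        while ((n = read(in, buf, sizeof(buf))) > 0)   /* read a block...      */
            write(out, buf, (size_t)n);                /* ...and write it out  */

        close(in);                                     /* files must be closed */
        close(out);
        return 0;
    }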

Batch Systems - The early operating systems were batch systems. The common input devices were card readers and tape drives. The common output devices were line printers, tape drives and card punches. The users did not interact with the system, but would rather prepare a job and submit it to the computer operator, who would feed the job into the computer; later on the output appeared. The major task of the operating system was to transfer control automatically from one job to the next. To speed processing, jobs with similar needs were batched together and run through the computer as a group. Programmers would leave their jobs with the operator, who would then sort them into batches and, as the computer became available, run each batch. The output would then be sent to the appropriate programmer. The delay between job submission and completion, also referred to as the turnaround time, was high in these systems. In this execution environment the CPU is often idle because of the disparity in speed between the I/O devices and the CPU. To reduce the turnaround time and CPU idle time in these systems, the spool (simultaneous peripheral operation on-line) concept was introduced. Spooling, in essence, uses the disk as a huge buffer, for reading as far ahead as possible on input devices and for storing output files until the output devices are able to accept them.

Multiprogramming - Spooling will result in several jobs that have already been read waiting on disk, ready to run. This allows the operating system to select which job to put in memory next, ready for execution; this is referred to as job scheduling. The most important aspect of job scheduling is the ability to multiprogram. The operating system keeps several jobs in memory at the same time, which is a subset of the jobs kept in the job spool. The operating system picks and starts executing one of the jobs in memory. Eventually, the job may have to wait for some task such as an I/O operation to complete. In multiprogramming, when this happens the operating system simply switches to and executes another job. If several jobs are ready to be brought from the job spool into memory and there is no room for all of them, then the system must choose among them; making this decision is job scheduling. Having several jobs in memory at the same time ready for execution also requires memory management. In addition, if several jobs in memory are ready to run, the system must choose one among them; making this decision is known as CPU scheduling.

Time Sharing Systems - Time-sharing, or multitasking, is a logical extension of multiprogramming. In time-sharing, multiple jobs are executed by the CPU switching between them, but the switches occur so frequently that the users may interact with each program while it is running. An interactive computer system provides on-line communication between the user and the system. Time-sharing systems were developed to provide interactive use of a computer system at a reasonable cost. A time-shared operating system uses CPU scheduling and multiprogramming to provide each user or program with a small portion of a time-shared computer. It allows many users to share the computer simultaneously. As the system switches rapidly from one user to the next, each user is given the impression of having their own computer, whereas one computer is being shared among many users.
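The time-slicing idea can be shown with a toy round-robin simulation; the process table and the quantum below are invented purely for illustration:

    #include <stdio.h>

    struct proc { const char *name; int remaining; };   /* ticks of work left */

    int main(void) {
        struct proc table[] = { {"A", 3}, {"B", 5}, {"C", 2} };
        int n = 3, quantum = 2, alive = 3;

        while (alive > 0) {                      /* keep cycling through the table   */
            for (int i = 0; i < n; i++) {
                if (table[i].remaining <= 0) continue;
                int slice = table[i].remaining < quantum ? table[i].remaining : quantum;
                table[i].remaining -= slice;     /* "run" this process for one slice */
                printf("%s runs for %d tick(s)\n", table[i].name, slice);
                if (table[i].remaining == 0) { printf("%s terminates\n", table[i].name); alive--; }
            }
        }
        return 0;
    }

Each pass of the outer loop gives every runnable process one quantum; no user waits for another's job to finish completely, which is what creates the illusion of a private computer.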

Parallel Systems - Most systems are single-processor systems, that is, they have only one main CPU. However, there is a trend towards multiprocessing systems. Such systems have more than one processor in close communication, sharing the computer bus, clock and sometimes memory and peripheral devices. These systems are referred to as tightly coupled systems. The motivation for having such systems is to improve the throughput and reliability of the system.

Real-Time Systems - A real time system is used when there are rigid time requirements on the operation of a processor or the flow of data and thus often used as a control device in a dedicated application. Sensors bring data to the computer. The computer must analyze the data and possibly adjust control to modify the sensor inputs. Systems that control scientific experiments, medical imaging systems, industrial control systems and some display systems are examples of real-time systems.


THE OPERATING SYSTEMS
Network Operating System - Assumes loosely coupled software on loosely coupled hardware, e.g. a network of workstations and servers. All commands are run locally on the workstation. To access a remote machine the user logs in using a remote login command. Servers are used in client/server functions: file storage, printer management. Examples: Solaris, Windows NT.

Distributed Operating System - Creates the illusion in the minds of the users that the entire network of computers is a single timesharing system, rather than a collection of distinct machines. No current system fulfills this requirement entirely yet. Distributed operating systems are often broadly classified into the two extremes of a spectrum, tightly coupled and loosely coupled systems, described below. Examples – Amoeba, Argus, Chorus and V-System.

Tightly Coupled Systems - Tightly coupled software on loosely coupled hardware. Components are processors, memory, bus and I/O, e.g. the Meiko Computing Surface. The operating system tries to maintain a single global view of the resources it manages. Single global inter-process communication mechanism: any process can talk to any other process (regardless of which processor the process is running on). Global protection scheme: the security system (e.g. passwords, access rights) must look the same everywhere. The file system must look the same everywhere: every file should be visible at every location (subject to protection/security constraints). All nodes run the same operating system.

Loosely Coupled Systems – These can be thought of as a collection of computers, each running its own OS. However, these OSs work together to make their services and resources available to the others (Network Operating Systems). Components are workstations, a LAN, and servers, e.g. V-System, BSD Unix.

Cooperative Autonomous System - In between a Distributed Operating System and a Network Operating System, with an interface for service integration and cooperating processes. Example: CORBA, the Common Object Request Broker Architecture.

Multiprocessor Timesharing System - Tightly coupled software on tightly coupled hardware; used with parallel processing. Example: a UNIX timesharing system with multiple CPUs instead of one. There is a single run queue for all processors and common memory (which may be a single memory) for all processors. Mutual exclusion is achieved by monitors and semaphores. A CPU can either process-switch or busy-wait for an I/O interrupt to occur, e.g. to preserve its cache contents. Traditional file system with a single unified block cache. Examples: Solaris, Windows NT.


DISTRIBUTED COMPUTING
Symptoms of a Distributed System:
Multiple processing elements that run independently
Interconnection hardware that connects them
Shared state, maintained in order to recover from failures
Processing elements that fail independently (partial failures)

HARDWARE AND SOFTWARE ARCHITECTURES
A key characteristic of our definition of distributed systems is that it includes both a hardware aspect (independent computers) and a software aspect (performing a task and providing a service). From a hardware point of view distributed systems are generally implemented on multi-computers. From a software point of view they are generally implemented as distributed operating systems or middleware.

Multiprocessors - Multiprocessor systems all share a single key property: all the CPUs have direct access to the shared memory. Bus-based multiprocessors consist of some number of CPUs all connected to a common bus, along with a memory module.

Multi-computers - A multicomputer consists of separate computing nodes connected to each other over a network.

Node Resources - This includes the processors, amount of memory, amount of secondary storage, etc. available on each node.

Network Connection - The network connecting the various nodes can have a large impact on the functionality and applications that such a system can be used for. A multicomputer with a very high bandwidth network is more suitable for applications that actively share data over the nodes and modify large amounts of that shared data. A lower bandwidth network, however, is sufficient for applications where there is less intense sharing of data.

Homogeneity - A homogeneous multicomputer is one where all the nodes are the same, that is, they are based on the same physical architecture (e.g. processor, system bus, memory, etc.). A heterogeneous multicomputer is one where the nodes are not expected to be the same. One common characteristic of all types of multi-computers is that the resources on any particular node cannot be directly accessed by any other node. All access to remote resources takes the form of requests sent over the network to the node where that resource resides.

PROCESSING CONFIGURATIONS
Processing configurations are classified by two characteristics: the number of instruction streams and the number of data streams.

SISD: Single Instruction stream, Single Data stream: all traditional uni-processor computers.

SIMD: Single Instruction, Multiple Data: array processors with one instruction unit that fetches an instruction and then commands many data units to carry it out in parallel, each with its own data. Good for vector processing.

MISD: Multiple Instruction, Single Data: pipelined computers that fetch and process multiple instructions simultaneously, operating on one data item at a time. (Book differs here.)

MIMD: Multiple Instruction, Multiple Data: a group of independent computers, each with its own program counter, program and data. All distributed systems are MIMD.


Processors can be subdivided further:
Multiprocessors: CPUs share memory. If one CPU writes to location 44, all will see the new value.
Multi-computers: CPUs do not share memory; each machine has its own private memory.

CONNECTION CONFIGURATIONS
There are two multiprocessor architecture types, based on the architecture of the interconnection network:

1. Bus: A single network, backplane or bus cable connects all machines (300 Mbps and faster for a backplane bus). When connecting 4-5 CPUs, the bus becomes overloaded and performance drops drastically. The solution is to add a cache between the CPU and the bus: if the cache is large, the hit rate will be high, bus traffic will drop dramatically, and many more CPUs can be added. This supports 32-64 CPUs.

Write-through cache: Cache hits for reads do not cause bus traffic, but cache misses for reads, and all writes, cause bus traffic.

Snoopy cache: When a snoopy cache sees a write to a memory address in its cache, it either removes or updates the entry from its cache. 'Snoops' on the bus.

2. Switch: Individual wires connect machines to other machines. Used to build a multiprocessor with more than 64 processors

Crossbar switch: Each CPU and each memory has a connection coming out of it. At every intersection is a cross-point switch that can be opened and closed in hardware. Processors can access memory simultaneously, but if two CPUs try to access the same memory simultaneously, one of them will have to wait.

Omega Switch: The network contains 2x2 switches, each having two inputs and two outputs. The switches can be set in nanoseconds or less. The omega network requires log2 n switching stages, each containing n/2 switches (for example, with n = 8 CPUs there are 3 stages of 4 switches each).

NUMA (Non-Uniform Memory Access): Each CPU can access its own local memory quickly, but accessing anybody else's memory is slower. A hierarchical system where some memory is associated with each CPU.

NB: Multi-computers are also configured in a bus or switch type configuration, but each CPU has a direct connection to its own local memory, so there is much less traffic on the interconnect.

SYSTEM COUPLING TYPES
Software and hardware can be loosely or tightly coupled.

Tightly coupled: the delay to send a message from one computer to another is short and the data rate is high. Associated with multiprocessors and parallel systems (which work on one problem).

Loosely coupled: the inter-machine message delay is large and the data rate is low. Associated with multi-computers and distributed processing (working on many unrelated problems).

Distributed system: designed to allow many users to work together.

Parallel system: the goal is to achieve maximum speedup on a single problem (e.g. 500,000 MIPS). It allocates multiple processors to a single problem and divides the work.

SOFTWARE CONCEPTS
Distributed Operating System - A distributed operating system (DOS) is an operating system that is built, from the ground up, to provide distributed services. As such, a DOS integrates key distributed services into its architecture. These services may include distributed shared memory, assignment of tasks to processors, masking of failures, distributed storage, inter-process communication, transparent sharing of resources, distributed resource management, etc. A key property of a distributed operating system is that it strives for a very high level of transparency, ideally providing a single system image. That is, with an ideal DOS users would not be aware that they are, in fact, working on a distributed system. Distributed operating systems generally assume a homogeneous multicomputer. They are also generally more suited to LAN environments than to wide-area network environments. In the earlier days of distributed systems research, distributed operating systems were the main topic of interest. Most research focused on ways of integrating distributed services into the operating system, or on ways of distributing traditional operating system services. Currently, however, the emphasis has shifted more toward middleware systems. The main reason for this is that middleware is more flexible (i.e., it does not require that users install and run a particular operating system), and is more suitable for heterogeneous and wide-area multi-computers. Multicomputer operating systems that do not provide a notion of shared memory can offer only message-passing facilities to applications. Unfortunately, the semantics of message-passing primitives may vary widely between different systems. It is easiest to explain their differences by considering whether or not messages are buffered. In addition, we need to take into account when, if ever, a sending or receiving process is blocked.
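As a concrete illustration of the blocking question, here is a sketch of a blocking versus a non-blocking receive using POSIX UDP sockets (port 6000 is an arbitrary example; real multicomputer message-passing primitives differ in detail):

    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(6000);
        bind(s, (struct sockaddr *)&addr, sizeof(addr));

        char buf[512];

        /* Non-blocking flavour: return immediately if nothing is queued. */
        ssize_t n = recv(s, buf, sizeof(buf), MSG_DONTWAIT);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("no message yet, doing other work...\n");

        /* Blocking flavour: wait until a message actually arrives. */
        n = recv(s, buf, sizeof(buf), 0);
        if (n >= 0)
            printf("got %zd bytes\n", n);

        close(s);
        return 0;
    }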

NETWORK OPERATING SYSTEMS
In contrast to distributed operating systems, network operating systems do not assume that the underlying hardware is homogeneous and that it should be managed as if it were a single system. Instead, they are generally constructed from a collection of uniprocessor systems, each with its own operating system, as shown in the figure below.

Figure: Network Operating System

Difference between DOS and NOS
The figure below shows a shared, global file system accessible from all the workstations in a Network Operating System. The file system is supported by one or more machines called file servers. The file servers accept requests from user programs running on the other machines, called clients, to read and write files.


THE MIDDLEWARE
Whereas a DOS attempts to create a specific system for distributed applications, the goal of middleware is to create system-independent interfaces for distributed applications.

Figure: Middleware

As shown in the figure above, middleware consists of a layer of services added between those of a regular network OS and the actual applications. These services facilitate the implementation of distributed applications and attempt to hide the heterogeneity of the underlying system architectures (both hardware and software). The principal aim of middleware, namely raising the level of abstraction for distributed programming, is achieved in three ways: communication mechanisms that are more convenient and less error prone than message passing; independence from OS, network protocol, programming language, etc.; and standard services (such as a naming service). To make the integration of these various services easier, and to improve transparency and system independence, middleware is usually based on a particular paradigm, or model, for describing distribution and communication. This often manifests itself in a particular programming model such as ‘everything is a file’, remote procedure call, and distributed objects. Providing such a paradigm automatically provides an abstraction for programmers to follow, and provides direction for how to design and set up distributed applications.

Although some forms of middleware focus on adding support for distributed computing directly into a language, middleware is generally implemented as a set of libraries and tools that enable retrofitting of distributed computing capabilities to existing programming languages. Such systems typically use a central mechanism of the host language (such as the procedure call and method invocation) and dress remote operations up so that they use the same syntax as that mechanism, resulting for example in remote procedure calls and remote method invocation. Because an important goal of middleware is to hide the heterogeneity of the underlying systems (and in particular of the services offered by the underlying OS), middleware systems often try to offer a complete set of services so that clients do not have to rely on underlying OS services directly. This provides transparency for programmers writing distributed applications using the given middleware. Unfortunately this ‘everything but the kitchen sink’ approach often leads to highly bloated systems. As such, current systems exhibit an unhealthy tendency to include more and more functionality in basic middleware and its extensions, which leads to a jungle of bloated interfaces.


INTRODUCTION TO DISTRIBUTED SYSTEMS
Definitions

A distributed system consists of a number of components, which are by themselves computer systems. These components are connected by some communication medium, usually a sophisticated network. Applications execute by using a number of processes in different component systems. These processes communicate and interact to achieve productive work within the application.

A distributed system in this context is simply a collection of autonomous computers connected by a computer network to enable resource sharing and co-operation between applications to achieve a given task.

A Distributed System is one that runs on a collection of machines that do not have shared memory, yet looks to its users like a single computer.

A Distributed System is a collection of independent computers that appear to the users of the system as a single computer.

Goals of Distributed Systems
A state-of-the-art distributed system:
combines the accessibility, coherence and manageability advantages of centralized systems;
has the sharing, growth, cost and autonomy advantages of networked systems;
has the added advantages of security, availability and reliability.
Distribution should be concealed, giving users the illusion that all available resources are located at their workstation.

NEED FOR A DISTRIBUTED SYSTEM
Resource sharing - people / processes can share scarce hardware and software resources.
Flexibility - new resource-sharing services can be added without disruption or duplication of existing services.
Concurrency - several processes can run simultaneously in different components of the system.
Scalability - the system can grow as requirements increase. System and application software should not need to change when the scale of the system increases; as demand for a resource grows, it should be possible to extend the system to meet it.

Reliability: If a machine goes down during processing, some other machine takes over the job. When one of the components in a distributed system fails, only the work that was using the failed component is affected.

Transparency: Concealment from the user and application programmer of the separation of components in a distributed system so that the system is perceived as a whole rather than as a collection of independent components.

REASONS FOR USE OF DISTRIBUTED SYSTEMS
The alternative to using a distributed system is usually to have a huge centralized system, such as a mainframe. For many applications there are a number of economic and technical reasons that make distributed systems much more attractive than their centralized counterparts.

Cost - Better price/performance as long as commodity hardware is used for the component computers


Performance - By using the combined processing and storage capacity of many nodes, performance levels can be reached that are beyond the range of centralized machines

Scalability - Resources such as processing and storage capacity can be increased incrementally.

Transparency - An important goal of a distributed system is to hide the fact that its processes and resources are physically distributed across multiple computers. A distributed system that is able to present itself to users and applications as if it were only a single computer system is said to be transparent. Figure below shows different forms of transparency in a distributed system.

Reliability - By having redundant components, the impact of hardware and software faults on users can be reduced. However, these advantages are often offset by the following problems encountered during the use and development of distributed systems:

Limited Software - As will become clear throughout this course, distributed software is harder to develop than conventional software; hence, it is more expensive to develop and there is less such software available.

New Components - Networks are needed to connect independent nodes and are subject to performance limitations. Besides these limitations, networks also constitute new potential points of failure

Security - Because a distributed system consists of multiple components there are more elements that can be compromised and must, therefore, be secured. This makes it easier to compromise distributed systems.

CHARACTERISTICS OF DISTRIBUTED SYSTEMS
Multiple autonomous processing elements – A distributed system is composed of several independent components, each with processing ability. There is no master-slave relationship between processing elements; thus, this excludes traditional centralized mainframe-based systems.

Information exchange over a network – the network connects autonomous processing elements that communicate using various protocols

Processes interact via non-shared local memory – A DS assumes a hybrid configuration involving separate computers with a distributed shared memory. Multiple-processor computer systems can be classified into those that share memory (multiprocessor computers) and those without shared memory (multi-computers).

Multiple Computers – there is more than one physical computer, each consisting of a processor, local memory, a stable storage module, and input-output paths that connect it with the other components in the distributed system environment.

Interconnections – there are mechanisms and configurations for communicating with the other nodes via the network

Shared state – subsets of the nodes cooperate in providing services, which are distributed or replicated among the participants or users.

Transparency – A distributed system is designed to conceal from the users the fact that they are operating over a widely spread geographical area and to provide the illusion of a single desktop environment. It should allow every part of the system to be viewed the same way regardless of the system size, and provide services the same way to every part of the system. Some aspects of transparency include:


Global names – the same name works everywhere. Machines, users, files, control groups and services have full names that mean the same thing regardless of where in the system the name is used.

Global access – the same functions are usable everywhere with reasonable performance. A program can run anywhere and get the same results. All the services and objects required by a program to run are available to the program regardless of where in the system the program is executing.

Global security – the same user authentication and access control work everywhere, e.g. the same mechanism lets the person next door and someone at another site read one's files; a user can authenticate to any computer in the system.

Global management – The same person can administer system components anywhere. System management tools perform the same actions, e.g. configuration of workstations.

ADVANTAGES OF DISTRIBUTED SYSTEMS
It can be more fault-tolerant – It can be designed so that if one component of the system fails then the others will continue to work. Such a system will provide useful work in the face of quite a large number of failures in individual component systems.

It is more flexible - A distributed system can be made up from a number of different components. Some of these components may be specialized for a specific task while others may be general purpose. Components can be added, upgraded, moved and removed without impacting upon other components.

It is easier to extend – more processing, storage or other capacity can be obtained by increasing the number of components.

It is easier to upgrade - A distributed system may be upgraded in increments by replacing individual components without a major disruption, or a large cash injection. When a single large computer system becomes obsolete all of it has to be replaced in a costly and disruptive operation.

Local autonomy – by allowing domains of control to be defined where decisions are made relating to purchasing, ownership, operating priorities, IS development and management, etc. Each domain decides where resources under its control are located.

Increased Reliability and Availability – In a distributed system, multiple components of the same type can be configured to fail independently. This aspect of replication of components improves the fault tolerance in distributed systems, consequently, the reliability and availability of the system is enhanced. In a centralized system, a component failure can mean that the whole system is down, stopping all users from getting services.

Improved Performance – A distributed system can have a service partitioned over many server computers, each supporting a smaller set of users, and users' access to local data and resources results in faster access. Another performance advantage is the support for parallel access to distributed data across the organization. Large centralized systems can be slow performers due to the sheer volume of data and transactions being handled.

Security breaches are localized – In distributed systems with multiple security control domains, a security breach in one domain does not compromise the whole system. Each security domain has varying degree of security authentication, access control and auditing.


DISADVANTAGES OF DISTRIBUTED SYSTEMS
It's more difficult to manage and secure – Centralized systems are inherently easier to secure and easier to manage because control is done from a single point. Distributed systems require more complex procedures for security, administration, maintenance and user support, due to the greater levels of co-ordination and control required.

Lack of skilled support and development staff – Since the equipment and software in a DS can be sourced from different vendors, unlike in traditional systems where everything is sourced from the same vendor, it's difficult to find personnel with a wide enough range of skills to offer comprehensive support.

They introduce problems of maintaining consistency of data.
They introduce problems of synchronization between processes.
They are significantly more complex in structure and implementation.
The communication network can lose messages and become overloaded.
Security can become a problem, since a computer is most secure if it minimizes network connections and is kept in a locked room.

DESIGN ISSUES OF DISTRIBUTED SYSTEMS
There are key design issues that people building distributed systems must deal with, with the goal of ensuring they are attained. These are:

Transparency – It is described as “the concealment from the user and the application programmer of the separation of components in a distributed system so that the system is perceived as a whole rather than a collection of independent components.” Transparency therefore involves hiding all the distribution from human users and application programs:

Human users – In terms of the commands issued from the terminal and the results displayed on it, the distributed system can be made to look just like a single-processor system.

Programs – At the lower level, the distribution should be hidden from programs. Transparency minimizes the difference, to the application developer, between programming for a distributed system and programming for a single machine. In other words, the system call interface should be designed such that the existence of multiple processors is not visible. A file should be accessed the same way whether it is local or remote. A system in which remote files are accessed by explicitly setting up a network connection to a remote server and then sending messages to it is not transparent, because remote services are being accessed differently from local ones.

Fault Tolerance – Since failures are inevitable, a computer system can be made more reliable by making it fault tolerant. A fault tolerant system is one designed to fulfill its specified purpose despite the occurrence of component failures (machine and network). Fault tolerant systems are designed to mask component failures, i.e. they attempt to prevent the failure of the system in spite of the failure of some of its components. Fault tolerance can be achieved through hardware and software. Although fault tolerance improves the system availability and reliability, it brings some overheads in terms of:

Cost - increased system costs
Software development - recovery mechanisms and testing
Performance - makes the system slower in updates of replicas


Consistency - maintaining data consistency is not trivial

Concurrency – Concurrency arises in a system when several processes run in parallel. If these processes are not controlled then inconsistencies may arise in the system. This is an issue in distributed systems because designers have to control the problems of inconsistency and conflicts carefully and keenly; the end result is to achieve a serial-access illusion. Concurrency control is important to achieve proper resource sharing and co-operation of processes. Uncontrolled interleaving of the sub-operations of concurrent transactions can lead to four main types of problems.

Openness – This is the ability of the system to accommodate different technology (hardware and software components) without changing the underlying structure of the system. For example, the ability to accommodate a 64-bit processor where a 32-bit processor was being used, or to accommodate a machine running a Mac OS in a predominantly Windows system, without changing the underlying system structure.

Scalability – Each component of a distributed system has a finite capacity. Designing for scalability involves calculating the capacity of each of these elements and the extent to which that capacity can be increased. Good distributed systems design minimizes the use of components that are not scalable. Also, the element that is weakest in terms of available capacity (and the extent to which the capacity can be increased) should be of prime importance in the design. There are four principal components to be considered when designing for scalability: client workstations, the LAN, servers and the WAN.

Performance – Common measures of performance for distributed systems include (a worked example follows this list):
Response time – the average elapsed time from the moment the user is ready to transmit until the entire response is received.
Throughput – the number of requests handled per unit time.
Latency – the delay between the start of a message's transmission from one process and the beginning of its receipt by another.
Bandwidth – the total amount of information that can be transmitted over a given time unit.
Jitter – the variation in time taken to deliver a series of messages.
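As a rough worked illustration of how latency and bandwidth combine (the numbers here are invented): sending a 1 MB (8 x 10^6 bit) message over a link with 5 ms latency and 100 Mbps bandwidth takes about 5 ms + (8 x 10^6 bits / 10^8 bits per second) = 5 ms + 80 ms = 85 ms. Over a 1 Gbps link the same message still pays the 5 ms latency but only 8 ms of transmission time.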

Performance improvements can be made in a distributed systems environment by migrating much of the processing onto a user's client workstation. This reduces the processing on the server per client request, which leads to faster and more predictable response times. Data-intensive applications can improve performance by avoiding I/O operations that read from disk storage; reading from buffer areas in memory is much faster. Applications invoking remote operations offered by remote servers can improve performance by avoiding the need to access a remote server to satisfy a request. A caching system reduces the performance cost of I/O and remote operations by storing the results of recently executed I/O or remote operations in memory and re-using the same data whenever the same operation is re-invoked, provided it can be ascertained that the data is still valid.

COMPUTING SYSTEM MODELS

Client Server Model – This is the most widely used paradigm for structuring distributed systems. A client requests a particular service. One or more processes called servers are responsible for the provision of services to clients. Services are accessed via a well-defined interface that is made known to the clients. On receipt of a request the server executes the appropriate operation and sends a reply back to the client. The interaction is known as request/reply or interrogation. Both clients and servers run as user processes. A single computer may run a single client or server process, or may run multiple client or server processes. A server process is normally persistent (non-terminating) and provides services to more than one client process. The main distinction from the master-slave model is that client and server processes are on an equal footing, with distinct roles rather than a control relationship.

Master-Slave Model – This may not be an appropriate model for structuring a distributed system. In this model, a master process initiates and controls any dialogue with the slave processes. Slave processes exhibit very little intelligence, responding to commands from a single master process and exchanging messages only when invited by the master process. The slave process merely complies with the dialogue rules set by the master. This is the model on which centralized systems were based; it has limited application in distributed systems because it does not make the best use of distributed resources and is a single point of failure.

Peer-to-peer Model – This model is quite similar to the client/server model. The use of a small, manageable number of servers (i.e. increased centralization of resources) simplifies system management compared to the case where potentially every computer can be configured as both client and server. This model is known as a peer-to-peer model because every process has the same functionality as a peer process.

Group Model – In many circumstances, a set of processes need to co-operate in such a way that one process may need to send a message to all other processes in the group and receive responses from one or more members. For example, in a video conference involving multiple participants and a whiteboard facility, when someone writes to the board, every other participant must receive the new image. In this model a set of group members are modeled conveniently to behave as a single unit called a group. When a message is sent to a group interface, all the members receive it.

There are different approaches to routing a ‘group’ message to every member (a minimal multicast sketch follows this list):
Unicasting: point-to-point sending of a message from a single sender to a single receiver.
Broadcasting: sending a message to all of the computers in a given network environment.
Multicasting: sending a message to the members of a specified group of processes. A single group send operation will (hopefully) result in a receive operation performed by each member of the process group.
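A minimal sketch of multicasting at the socket level (POSIX IP multicast; the group address 239.0.0.1 and port 5000 are arbitrary examples): a group member joins the group and then receives whatever any sender transmits to the group address with an ordinary sendto().

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int s = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5000);
        bind(s, (struct sockaddr *)&addr, sizeof(addr));

        /* Ask the kernel to deliver datagrams addressed to group 239.0.0.1. */
        struct ip_mreq mreq;
        mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        char buf[1024];
        ssize_t n = recvfrom(s, buf, sizeof(buf) - 1, 0, NULL, NULL);  /* one group message */
        if (n >= 0) { buf[n] = '\0'; printf("received: %s\n", buf); }

        close(s);
        return 0;
    }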

REASONS FOR MULTICASTING:
Locating an object - a client multicasts a message containing the name of a file directory to a group of file-server processes. Only the one which holds the relevant directory replies to the request.

Fault tolerance: A client multicasts its requests to a group of server processes, all of which process the requests identically and one or more of which reply to them.

Replicated data: data is replicated to increase the performance of a service


Multiple updates: an event such as 'the time is 18:01' can be multicast to interested processes.

PROBLEMS WITH MULTICASTING: What if some processes receive the message and some don't?

Atomic multicast: the message is received by all processes or else it is received by none of them. Acknowledgements are required and retransmissions occur; the originator assumes that messages for which no reply arrives are no longer in the network. Alternately, everyone sends a received message once to everyone else (unless retransmissions are required to get an acknowledgement).

Reliable multicast: a best effort to deliver the message to all members of a group, but with no guarantee.

What if the messages are not received at the same time and in the same order at all nodes?

Synchronous system: Events happen strictly sequentially, with each event taking essentially zero time to complete. Impossible to build

Loosely synchronous system: Events take a finite amount of time, but all events appear in the same order to all parties.

Virtually synchronous system: Since the ordering of messages is not so important, the ordering constraint has been relaxed. Example: ISIS is a set of programs for building distributed applications, from Cornell.

Client Server Application Example - supports a GUI which uses windows and a mouse:
Presentation Layer Services: GUI (Graphical User Interface)
Application Logic: data analysis / number crunching, e.g. SQL
Database Management System: search/sort/validate/access the database
Communications Software: the protocol between server and client

EXAMPLES OF DISTRIBUTED SYSTEMS
Probably the simplest and most well-known example of a distributed system is the collection of Web servers - or more precisely, servers implementing the HTTP protocol - that jointly provide the distributed database of hypertext and multimedia documents that we know as the World-Wide Web.

Other examples are the computers of a local network that provide a uniform view of a distributed file system, and the collection of computers on the Internet that implement the Domain Name Service (DNS).

Another example is the T3E series of parallel computers by Cray. These are high-performance machines consisting of a collection of computing nodes that are linked by a high-speed network.

The operating system, UNICOS, presents users with a standard UNIX environment upon login, but transparently schedules login sessions over a number of available login nodes.

Despite the fact that the systems in these examples are all similar (because they fulfill the definition of a distributed system), there are also many differences between them.

The World-Wide Web and DNS, for example, both operate on a Global scale. The distributed file system, on the other hand, operates on the scale of a LAN, while the Cray supercomputer operates on an even smaller scale making use of a specially designed high speed network to connect all of its nodes.


COMMUNICATION IN DISTRIBUTED SYSTEMS
Inter-process communication is at the heart of all distributed systems. Communication in distributed systems is always based on low-level message passing as offered by the underlying network. In this unit we will discuss the rules that communicating processes must adhere to, known as protocols, and concentrate on structuring those protocols in the form of layers.

Layered Protocols - Due to the absence of shared memory, all communication in distributed systems is based on exchanging (low-level) messages. When process A wants to communicate with process B, it first builds a message in its own address space. Then it executes a system call that causes the operating system to send the message over the network to B. To make it easier to deal with the numerous levels and issues involved in communication, the International Standards Organization (ISO) developed a reference model that clearly identifies the various levels involved, gives them standard names, and points out which level should do which job. This model is called the OSI model. The figure below shows the seven layers of OSI.

Figure: The seven layers of OSI

A message travels from the application layer down to the physical layer, with headers added at each layer, so that at the receiving end the message can be reconstructed.
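A toy sketch of this header nesting; the layer names and the bracketed headers are invented purely for illustration:

    #include <stdio.h>
    #include <string.h>

    #define MAX 256

    /* Prepend one layer's header to the front of the message buffer. */
    static void add_header(char *buf, const char *hdr) {
        char tmp[MAX];
        snprintf(tmp, sizeof(tmp), "%s%s", hdr, buf);
        snprintf(buf, MAX, "%s", tmp);
    }

    int main(void) {
        char msg[MAX] = "hello";            /* data handed down by the application */
        add_header(msg, "[TCP]");           /* transport layer adds its header     */
        add_header(msg, "[IP]");            /* network layer adds its header       */
        add_header(msg, "[ETH]");           /* data link layer adds its header     */
        printf("on the wire: %s\n", msg);   /* prints: [ETH][IP][TCP]hello         */
        return 0;
    }

At the receiving end the layers strip their headers off in the reverse order, which is how the original message is deduced.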

Figure: A typical message as it appears on the network

Client-Server TCP
Client-server interaction in distributed systems is often done using the transport protocols of the underlying network. With the increasing popularity of the Internet, it is now common to build client-server applications and systems using TCP. The benefit of TCP compared to UDP is that it works reliably over any network. The obvious drawback is that TCP introduces considerably more overhead, especially compared to those cases in which the underlying network is highly reliable, such as in local area systems. The figure below shows normal operation of TCP and UDP, and transactional TCP.

Middleware Protocols
Middleware is an application that logically lives in the application layer, but which contains many general-purpose protocols that warrant their own layers, independent of other, more specific applications.

Communication in a Distributed System - While the discussion of communication between processes has, so far, explicitly assumed a uniprocessor (or multiprocessor) environment, the situation for a distributed system (i.e., a multicomputer environment) remains similar. The main difference is that in a distributed system processes running on separate computers cannot directly access each other's memory. Nevertheless, processes in a distributed system can still communicate through either shared memory or message passing.

Distributed Shared Memory - Because distributed processes cannot access each other's memory directly, using shared memory in a distributed system requires special mechanisms that emulate the presence of directly accessible shared memory. This is called distributed shared memory (DSM). The idea behind DSM is that processes on separate computers all have access to the same virtual address space. The memory pages that make up this address space actually reside on separate computers. Whenever a process on one of the computers needs to access a particular page it must find the computer actually hosting that page and request the data from it.

Message Passing - Message passing in a distributed system is similar to communication using messages in a non-distributed system, the main difference being that the only mechanism available for the passing of messages is network communication. At its core, message passing involves two operations, send() and receive(). Although these are very simple operations, there are many variations on the basic model. For example, the communication can be connectionless or connection-oriented. Connection-oriented communication requires that the sender and receiver first create a connection before send() and receive() can be used. Communication operations can also be synchronous or asynchronous. In the first case the operations block until a message has been delivered (or received). In the second case the operations return immediately. Yet another possible variation involves the buffering of communication. In the buffered case, a message will be stored if the receiver is not able to pick it up right away. In the unbuffered case the message will be lost. There are also varying degrees of reliability of the communication. With reliable communication, errors are discovered and fixed transparently. This means that the processes can assume that a message that is sent will actually arrive at the destination (as long as the destination process is there to receive it).
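The sketch below illustrates these variations in miniature: a buffered channel between two threads, with an asynchronous send and both blocking and non-blocking forms of receive. It is an illustrative sketch only (the mailbox and helper names are made up), not a networked implementation:

    # Illustrative sketch (names are made up): buffered message passing between
    # two threads, with blocking and non-blocking receive variations.
    import queue
    import threading

    mailbox = queue.Queue()            # buffered channel: messages wait here

    def send(msg):
        mailbox.put(msg)               # asynchronous send: returns immediately

    def receive_blocking():
        return mailbox.get()           # blocks until a message is available

    def receive_nonblocking():
        try:
            return mailbox.get_nowait()    # returns immediately
        except queue.Empty:
            return None                    # no message waiting

    threading.Thread(target=send, args=("hello",)).start()
    print(receive_blocking())          # 'hello'
    print(receive_nonblocking())       # None: no further messages buffered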


Communication Models - There are numerous ways that communicating processes can be arranged. This section discusses some of the most common communication models, which are distinguished from each other by the roles that the communicating processes take on.

Client-Server - The client-server model is the most common and most widely used model for communication between processes. In this model one process takes on the role of a server, while all other processes take on the roles of clients. The server process provides a service (e.g., a time service, a database service, a banking service, etc.) and the clients are customers of that service. A client sends a request to a server, the request is processed at the server, and a reply is returned to the client. A typical client-server application can be decomposed into three logical parts: the interface part, the application logic part, and the data part. Implementations of the client-server model vary with regard to how the parts are separated over the client and server roles. A thin client implementation will provide a minimal user interface layer, and leave everything else to the server. A fat client implementation, on the other hand, will include all of the user interface and application logic in the client, and only rely on the server to store and provide access to data. Implementations in between will split up the interface or application logic parts over the clients and server in different ways.
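As a concrete illustration, the sketch below implements a toy time service in the client-server style over TCP; the port number and request format are invented for the example:

    # Illustrative sketch of the client-server model: a toy time service.
    # The port number and request format are invented for this example.
    import socket, threading, time

    def server():
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("localhost", 9090))
        srv.listen(1)
        conn, _ = srv.accept()                     # wait for a client request
        conn.recv(64)                              # read the request
        conn.sendall(str(time.time()).encode())    # reply with the current time
        conn.close(); srv.close()

    threading.Thread(target=server, daemon=True).start()
    time.sleep(0.1)                                # crude wait for server start

    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect(("localhost", 9090))
    cli.sendall(b"GET TIME")                       # the request
    print("server time:", cli.recv(64).decode())   # the reply
    cli.close()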

Vertical Distribution (Multi-Tier) - An extension of the client-server model, the vertical distribution, or multi-tier, model (see figure below) distributes the traditional server functionality over multiple servers. A client request is sent to the first server. During processing of the request this server will request the services of the next server, which will do the same, until the final server is reached. In this way the various servers become clients of each other.

Communication in a Multi-tier System

Each server is responsible for a different step (or tier) in the fulfillment of the original client request. Splitting up the server functionality in this way is beneficial to a system's scalability as well as its flexibility. Scalability is improved because the processing load on each individual server is reduced, and the whole system can therefore accommodate more users. With regard to flexibility, this model allows the internal functionality of each server to be modified as long as the interfaces provided remain the same.

Horizontally Distributed Web Server - While vertical distribution focuses on splitting up a server's functionality over multiple computers, horizontal distribution involves replicating a server's functionality over multiple computers. In this case each server machine contains a complete copy of all hosted Web pages, and client requests are passed on to the servers in round-robin fashion. The horizontal distribution model is generally used to improve scalability (by reducing the load on individual servers) and reliability (by providing redundancy). Note that it is also possible to combine the vertical and horizontal distribution models. For example, each of the servers in the vertical decomposition can be horizontally distributed. Another approach is for each of the replicas in the horizontal distribution model to themselves be vertically distributed.
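A round-robin front end is simple enough to sketch directly; the replica names below are made up:

    # Illustrative sketch of horizontal distribution: a front end hands client
    # requests to replica servers in round-robin fashion (replica names made up).
    import itertools

    replicas = ["web1", "web2", "web3"]        # each holds a full copy of the site
    next_replica = itertools.cycle(replicas)   # endless round-robin iterator

    def dispatch(request):
        server = next(next_replica)            # pick the next replica in turn
        return "%s handles %s" % (server, request)

    for req in ["GET /a", "GET /b", "GET /c", "GET /d"]:
        print(dispatch(req))                   # web1, web2, web3, web1, ...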

Peer to Peer - Whereas the previous models have all assumed that different processes take on different roles in the communication model, the peer-to-peer (P2P) model takes the opposite approach and assumes that all processes play the same role, and are therefore peers of each other. In the figure below each process acts as both a client and a server, both sending out requests and processing incoming requests.

Group Communication - The group communication model provides a departure from the point-to-point style of communication assumed so far. In this model of communication a process can send a single message to a group of other processes. Group communication is often referred to as broadcast (when a single message is sent out to everyone) and multicast (when a single message is sent out to a predefined group of recipients). Group communication can be applied in any of the previously discussed models. It is often used to send requests to a group of replicas, or to send updates to a group of servers containing the same data. It is also used for service discovery (e.g., broadcast a request saying "who offers this service?") as well as event notification (e.g., to tell everyone that the printer is on fire). Issues involved with implementing and using group communication are similar to those involved with regular point-to-point communication, including reliability and ordering. The issues are made more complicated because there are now multiple recipients of a message, and different combinations of problems may occur. A widely implemented (but not as widely used) example of group communication is IP multicast.

Communication Abstractions - In the previous topic it was assumed that all processes explicitly send and receive messages (e.g., using send() and receive()). Although this style of programming is effective and works, it is not always easy to write correct programs using explicit message passing. In this section we will discuss a number of communication abstractions that make writing distributed applications easier. In the same way that higher-level programming languages make programming easier by providing abstractions above assembly language, so do communication abstractions make programming in distributed systems easier. Some of the abstractions discussed attempt to completely hide the fact that communication is taking place; while other abstractions do not attempt to hide communication, all abstractions have in common that they hide the details of the communication taking place. For example, the programmers using any of these abstractions do not have to know what the underlying communication protocol is, nor do they have to know how to use any particular operating system communication primitives. The abstractions discussed below are often used as core foundations of most middleware systems. Using these abstractions, therefore, generally involves using some sort of middleware framework. This brings with it a number of the benefits of middleware, in particular the various services associated with the middleware that tend to make a distributed application programmer's life easier.

Communication Modes


Before discussing the details of the various abstractions, it is important to make a distinction between two modes of communication: data-oriented communication and control-oriented communication. In the first mode, communication serves solely to exchange data between processes. Although the data might trigger an action at the receiver, there is no explicit transfer of control implied in this mode. The second mode, control-oriented communication, explicitly associates a transfer of control with every data transfer. Data-oriented communication is clearly the type of communication used in communication via shared address space and shared memory, as well as message passing. Control-oriented communication is the mode used by abstractions such as remote procedure call, remote method invocation, active messages, etc. Note that low-level communication mechanisms are generally data-oriented while the higher-level ones (e.g., middleware) are control-oriented. This is not always the case, however: MPI is a data-oriented mode of communication implemented at a higher level, while some operating systems provide RPC (control-oriented) at a low level.

CLIENT-SERVER STUBS

Remote Procedure Call (RPC) - The idea behind a remote procedure call (RPC) is to replace the explicit message-passing model with the model of executing a procedure call on a remote node. A programmer using RPC simply performs a procedure call, while behind the scenes messages are transferred between the client and server machines. In theory the programmer is unaware of any communication taking place.

Client and Server Stubs - The figure below shows the steps taken when an RPC is invoked. The numbers in the figure correspond to the following steps:

1. Client program calls client stub routine (normal procedure call)
2. Client stub packs parameters into message data structure (marshalling)
3. Client stub performs send() syscall and blocks
4. Kernel transfers message to remote kernel
5. Remote kernel delivers to server stub procedure, blocked in receive()
6. Server stub unpacks message, calls service procedure (normal procedure call)
7. Service procedure returns to stub, which packs result into message
8. Server stub performs send() syscall
9. Kernel delivers to client stub, which unpacks and returns
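These steps are easiest to see in a working example. Python's standard xmlrpc module is one off-the-shelf RPC mechanism (not the specific system described above, but built on the same stub idea); in the sketch below the proxy object plays the role of the client stub, and marshalling and message transfer happen behind the scenes. The port number and the add procedure are invented for the example.

    # Server side: registers a service procedure that clients can call remotely.
    from xmlrpc.server import SimpleXMLRPCServer

    def add(a, b):
        return a + b                          # the service procedure (step 6)

    srv = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
    srv.register_function(add, "add")         # expose 'add' in the interface
    srv.serve_forever()                       # wait for incoming calls

    # Client side (run in a separate process): looks like a normal call.
    # import xmlrpc.client
    # proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    # print(proxy.add(2, 3))                  # marshalling happens behind the scenes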

A server that provides remote procedure call services defines the available procedures in a service interface. A service interface is generally defined in an interface definition language (IDL), which is a simplified programming language, sufficient for defining data types and procedure signatures but not for writing executable code. The IDL service interface definition is used to generate client and server stub code. The stub code is then compiled and linked in with the client program and service procedure implementations respectively.

An important part of marshalling is converting data into a format that can be understood by the receiver. Generally, differences in format can be handled by defining a standard network format into which all data is converted. However, this may be wasteful if two communicating machines use the same internal format, but that format differs from the network format. To avoid this problem, an alternative is to indicate the format used in the transmitted message and rely on the receiver to apply conversion where required. Because pointers cannot be shared between remote processes (i.e., addresses cannot be transferred verbatim as they are usually meaningless in another address space), it is necessary to flatten, or serialise, all pointer-based data structures when they are passed to the RPC client stub. At the server stub, these serialised data structures must be unpacked and recreated in the recipient's address space. Unfortunately this approach presents problems with aliasing and cyclic structures. Another approach to dealing with pointers involves the server sending a request for the referenced data to the client every time a pointer is encountered. In general the RPC abstraction assumes synchronous, or blocking, communication. This means that clients invoking RPCs are blocked until the procedure has been executed remotely and a reply returned. Although this is often the desired behaviour, sometimes the waiting is not necessary. For example, if the procedure does not return any values it is not necessary to wait for a reply. In this case it is better for the RPC to return as soon as the server acknowledges receipt of the message. This is called an asynchronous RPC.

a) Interaction between a client and server in a traditional RPC
b) The interaction using asynchronous RPC

It is also possible that a client does require a reply, but does not need it right away and does not want to block for it either. An example of this is a client that pre-fetches network addresses of hosts that it expects to contact later. The information is important to the client, but as it is not needed right away the client does not want to wait. In this case it is best if the server performs an asynchronous call to the client when the results are available. This is known as deferred synchronous RPC.

A Client and Server interaction through two Asynchronous RPCs

A final issue that has been silently ignored so far is how a client stub knows where to send the RPC message. In a regular procedure call the address of the procedure is determined at compile time, and the call is then made directly. In RPC this information is acquired from a binding service: a service that allows registration and lookup of services. A binding service typically provides an interface similar to the following:

Register(name, version, handle, UID)
Deregister(name, version, UID)
Lookup(name, version) -> (handle, UID)

Here handle is some physical address (IP address, process ID, etc.) and UID is used to distinguish between servers offering the same service. Moreover, it is important to include version information, as the flexibility requirement for distributed systems means we must deal with different versions of the same software in a heterogeneous environment.
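A toy in-memory binding service matching this interface might look as follows (the registry structure and example service name are invented):

    # Illustrative sketch of a binding service: register, deregister and look up
    # (handle, UID) pairs per (name, version). All names here are invented.
    registry = {}    # (name, version) -> list of (handle, uid)

    def register(name, version, handle, uid):
        registry.setdefault((name, version), []).append((handle, uid))

    def deregister(name, version, uid):
        entries = registry.get((name, version), [])
        registry[(name, version)] = [e for e in entries if e[1] != uid]

    def lookup(name, version):
        entries = registry.get((name, version))
        return entries[0] if entries else None     # one (handle, uid), if any

    register("time_service", "1.0", "10.0.0.5:9090", 42)
    print(lookup("time_service", "1.0"))   # ('10.0.0.5:9090', 42)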

Remote Method Invocation (RMI) - When using RPC, programmers must explicitly specify the server on which they want to perform the call (possibly using information retrieved from a binding service). Furthermore, it is complicated for a server to keep track of the different state belonging to different clients and their invocations. These problems with RPC led to the remote method invocation (RMI) abstraction. The transition from RPC to RMI is, at its core, a transition from the server metaphor to the object metaphor. When using RMI, programmers invoke methods on remote objects. The object metaphor associates all operations with the data that they operate on, meaning that state is encapsulated in the remote object and much easier to keep track of. Furthermore, the concept of a remote object improves location transparency: once a client is bound to a remote object, it no longer has to worry about where that object is located. Also, objects are first-class citizens in an object-based model, meaning that they can be passed as arguments or received as results in RMI. This helps to relieve many of the problems associated with passing pointers in RPC. Although, technically, RMI is a small evolutionary step from RPC, the model of remote and distributed objects is very powerful.

The Danger of Transparency - Unfortunately, the illusion of a procedure call is not perfect for RPCs, and that of a method invocation is not perfect for RMI. The reason for this is that an RPC or RMI can fail in ways that a "real" procedure call or method invocation cannot. This is due to problems such as not being able to locate a service (e.g., it may be down or have the wrong version), messages getting lost, servers crashing while executing a procedure, etc. As a result, the client code has to handle error cases that are specific to RPCs. Furthermore, RPC and RMI involve many more software layers than local system calls and also incur network latencies. Both form potential performance bottlenecks. The code must, therefore, be carefully optimized and should use lightweight network protocols. Moreover, as copying often dominates the overhead, hardware support can help. This includes DMA directly to/from user buffers and scatter-gather network interfaces that can compose a message from data at different addresses on the fly. Finally, issues of concurrency control can show up in subtle ways that, again, break the illusion of executing a local operation.

Synchronous communication - the sender of a message blocks until the message has been received by the intended recipient. Synchronous communication is usually even stronger than this, in that the sender often blocks until the receiver has processed the message and the sender has received a reply. In asynchronous communication, on the other hand, the sender continues execution immediately after sending a message.


Transient communication - a message will only be delivered if a receiver is active. If there is no active receiver process (i.e., no one interested in receiving messages) then an undeliverable message will simply be dropped. In persistent communication, however, a message will be stored in the system until it can be delivered to the intended recipient.

Message-Oriented Communication - Due to the dangers of RPC and RMI, and the fact that those models are generally limited to synchronous (and transient) communication, alternative abstractions are often needed. The message-oriented communication abstraction is one of these, and it does not attempt to hide the fact that communication is taking place. Instead its goal is to make the use of flexible message passing easier. Message-oriented communication is provided by message-oriented middleware (MOM). Besides providing many variations of the send() and receive() primitives, MOM also provides the infrastructure required to support persistent communication. The send() and receive() primitives offered by MOM also abstract from the underlying operating system or hardware primitives. As such, MOM allows programmers to use message passing without having to be aware of what platforms their software will run on, and what services those platforms provide. As part of this abstraction MOM also provides marshalling services. Furthermore, as with most middleware, MOM also provides other services that make building distributed applications easier.

Message-oriented communication is based around the model of processes sending messages to each other. Underlying message-oriented communication are two orthogonal properties: communication can be synchronous or asynchronous, and it can be transient or persistent. Whereas RPC and RMI are generally synchronous and transient, message-oriented communication systems make many other options available to programmers.

The Message Passing Interface (MPI) is an example of a MOM that is geared toward high-performance transient message passing. MPI is a message-passing library that was designed for parallel computing. It makes use of available networking protocols, and provides a huge array of functions that basically perform synchronous and asynchronous send() and receive(). Another example of MOM is MQSeries from IBM. This is an example of a message queuing system. Its main characteristic is that it provides persistent communication.

In a message queuing system, messages are sent to other processes by placing them in queues. The queues hold messages until an intended receiver extracts them from the queue and processes them. Communication in a message queuing system is largely asynchronous. The basic queue interface is very simple: there is a primitive to append a message onto the end of a specified queue, and a primitive to remove the message at the head of a specific queue. These can be blocking or non-blocking. All messages contain the name or address of a destination queue. Messages can only be added to and retrieved from local queues. Senders place messages in source queues, while receivers retrieve messages from destination queues. The underlying system is responsible for transferring messages from source queues to destination queues. This can be done simply by fetching messages from source queues and directly sending them to the machines responsible for the appropriate destination queues. Or it can be more complicated and involve relaying messages to their destination queues through an overlay network of routers. An example of such a system is shown in the figure below.
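A minimal sketch of the queue interface just described, with named queues and blocking/non-blocking variants (the queue names are invented; a real system would also transfer messages between machines):

    # Illustrative sketch of a message-queuing interface. Queue names are
    # invented; a real MOM would relay messages between machines.
    import queue

    queues = {"orders": queue.Queue(), "billing": queue.Queue()}

    def append(queue_name, msg, block=True):
        queues[queue_name].put(msg, block)        # add at the tail of the queue

    def remove(queue_name, block=True):
        return queues[queue_name].get(block)      # take from the head of the queue

    append("orders", {"item": "disk", "qty": 2})  # sender continues immediately
    print(remove("orders"))                       # receiver extracts it later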

Stream Abstraction - Whereas the previous communication abstractions dealt with discrete communication (that is, they communicated chunks of data), the stream abstraction deals with continuous communication, and in particular with the sending and receiving of continuous media. In continuous media, data is represented as a single stream of data rather than discrete chunks (for example, an email is a discrete chunk of data, while a live radio program is not). The main characteristic of continuous media is that, besides a spatial relationship (i.e., the ordering of the data), there is also a temporal relationship between the data. Film is a good example of continuous media: not only must the frames of a film be played in the right order, they must also be played at the right time, otherwise the result will be incorrect.

A stream is a communication channel that is meant for transferring continuous media. Streams can be set up between two communicating processes, or possibly directly between two devices (e.g., a camera and a TV). Streams of continuous media are examples of isochronous communication, that is, communication that has minimum and maximum end-to-end time delay requirements. When dealing with isochronous communication, quality of service is an important issue. In this case quality of service is related to the time-dependent requirements of the communication. These requirements describe what is required of the underlying distributed system so that the temporal relationships in a stream can be preserved. This generally involves timeliness and reliability.

Quality of service requirements are often specified in terms of the parameters of a token bucket model. In this model, tokens (permission to send a fixed number of bytes) are regularly generated and stored in a bucket. An application wanting to send data removes the required number of tokens from the bucket and then sends the data. If the bucket is empty the application must wait until more tokens are available. If the bucket is full, newly generated tokens are discarded.
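A minimal sketch of the token bucket model, with an invented rate and capacity:

    # Illustrative token bucket: tokens accumulate at 'rate' per second, up to
    # 'capacity'; sending n bytes consumes n tokens or must wait. Values invented.
    import time

    class TokenBucket:
        def __init__(self, rate, capacity):
            self.rate, self.capacity = rate, capacity
            self.tokens, self.last = capacity, time.monotonic()

        def _refill(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now

        def try_send(self, n):
            self._refill()
            if self.tokens >= n:          # enough permission to send n bytes
                self.tokens -= n
                return True
            return False                  # bucket empty: caller must wait

    bucket = TokenBucket(rate=1000, capacity=4000)   # 1000 bytes/s, 4000 burst
    print(bucket.try_send(3000))   # True: within the initial burst
    print(bucket.try_send(3000))   # False: must wait for tokens to regenerate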

It is often necessary to synchronize two or more separate streams. For example, when sending stereo audio it is necessary to synchronize the left and right channels; likewise, when streaming video it is necessary to synchronize the audio with the video. Formally, synchronization involves maintaining temporal relationships between sub-streams. There are two basic approaches to synchronization. The first is the client-based approach, where it is up to the client receiving the sub-streams to synchronize them. The client uses a synchronization profile that details how the streams should be synchronized. One possibility is to base the synchronization on timestamps that are sent along with the stream. A problem with client-side synchronization is that, if the sub-streams come in as separate streams, the individual streams may encounter different communication delays. If the difference in delays is significant, the client may be unable to synchronize the streams.


The other approach is for the server to synchronize the streams. By multiplexing the sub-streams into a single data stream, the client simply has to demultiplex them and perform some rudimentary synchronization.

Distributed processing can be loosely defined as the execution of co-operating processes which communicate by exchanging messages across an information network. It means that the infrastructure consists of distributed processors, enabling parallel execution of processes and message exchanges. Communication and data exchange can be implemented using shared memory, message exchange/passing, or remote procedure call (RPC).

DEADLOCKS IN PROCESSING

Centralized Deadlock Detection: a central coordinator maintains the resource graph. If a cycle is detected, the coordinator kills off a process to break the deadlock.

False deadlock: transmission delays in distributed systems can cause the system to think that a cycle exists when the resources have in fact already been released. With strict two-phase locking a transaction cannot release and then obtain more data items, so this cannot occur; the only situation in which it arises is if a transaction is aborted while in deadlock.

Edge Chasing: a distributed approach to deadlock detection. Process 0 sends a probe message when waiting on Process 1. If Process 1 is itself waiting on resource(s), it forwards the probe message(s) to the processes it is waiting on. If a probe message returns to its original sender, a cycle is detected.

A probe message contains: the process that just blocked, the process sending the probe message, and the process to whom the probe is sent. In practice this happens in two steps: the transaction coordinator indicates what a transaction waits for, and the server indicates who holds the data item. The oldest transaction has the highest priority, because it has run the longest.
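A sketch of edge chasing over an invented wait-for graph; the probe carries (blocked process, sender, receiver) and is forwarded along wait-for edges:

    # Illustrative edge-chasing sketch over an invented wait-for graph:
    # process 0 waits on 1, 1 waits on 2, and 2 waits on 0 (a cycle).
    wait_for = {0: [1], 1: [2], 2: [0]}

    def probe(blocked, sender, receiver, visited=None):
        visited = visited or set()
        if receiver == blocked:
            return True                    # probe returned to sender: deadlock
        if receiver in visited:
            return False                   # already forwarded here
        visited.add(receiver)
        # forward the probe to every process the receiver is waiting on
        return any(probe(blocked, receiver, nxt, visited)
                   for nxt in wait_for.get(receiver, []))

    print(probe(0, 0, 1))    # True: the cycle 0 -> 1 -> 2 -> 0 is detected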

Deadlock Prevention: wound-wait (where -> means 'waits on'): if old process -> young process, the young process is killed; if young process -> old process, the young process waits.

PROCESSES AND PROCESSORS

Microkernel Architecture - components include:

Process manager: creates and performs low-level operations upon processes. Can be enhanced by applications such as OS emulation and language support.

Thread manager: Thread creation, synchronization, scheduling (across processors). Scheduling can occur in user-level modules.

Communication manager: communication between threads. May include communication with threads on other processors; otherwise this is provided as an additional service.

Memory manager: manages memory management units and hardware caches.

Supervisor: handles interrupts, traps and exceptions.

RPC Threads - Local RPC call implementation: the kernel recognizes a local RPC during binding. Server and client share an argument stack, so no copying or marshalling is needed. The calling thread handles the system call, context switch, and upcalls into the server code.

Remote RPC call implementation: the kernel creates a pop-up thread when an RPC message is received. Less copying is required, and no context restoration is needed. In the original implementation the thread blocks waiting for the next call. Example: Sun Solaris threads.

SYNCHRONIZATION OF PROCESSES


There are two main reasons why synchronization mechanisms are needed:

Two or more processes may need to co-operate in order to accomplish a given task. This implies that the operating system must provide facilities for identifying co-operating processes and synchronizing them.

Two or more processes may need to compete for access to shared services or resources. The implication is that the synchronization mechanism must provide facilities for a process to wait for a resource to become available, and for another process to signal the release of that resource.

When processes are running on the same computer, synchronization is straightforward since all processes use the same physical clock and can share memory. This can be done using well-known techniques such as

1. Semaphores - used to provide mutually exclusive access to a non-sharable resource by preventing concurrent execution of the critical region of a program through which the non-sharable resource is accessed.

2. A Monitor is a collection of procedures which may be executed by a collection of concurrent processes. It protects its internal data from the users, and is a mechanism for synchronizing access to the resources the procedures use. Since only the monitor can access its private data, it automatically provides mutual exclusion between client processes: entry to the monitor by one process excludes entry by others. A minimal sketch of technique 1 (semaphores) appears below.
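The sketch below uses a binary semaphore to protect a critical region shared by four threads; without the acquire/release pair, updates to the shared counter could be lost:

    # Illustrative sketch: a binary semaphore guarding a critical region.
    import threading

    mutex = threading.Semaphore(1)     # binary semaphore (initially free)
    counter = 0                        # the shared, non-sharable resource

    def worker():
        global counter
        for _ in range(100000):
            mutex.acquire()            # wait (P): enter the critical region
            counter += 1               # exclusive access to the resource
            mutex.release()            # signal (V): leave the critical region

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)                     # always 400000: no lost updates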

Synchronization can either be synchronous (blocking) or asynchronous (non-blocking). A synchronous process is delayed until it receives a response from the destination process. A primitive is non-blocking if its execution never delays the invoking process; non-blocking primitives must buffer messages to maintain synchronization. This makes programs flexible but increases their complexity. When blocking versions of message passing are used, programs are easier to write and synchronization is easier to maintain: when the send() operation is invoked, the invoking process blocks until the message is received, and a subsequent receive() operation again blocks the invoking process until a message is actually received.

(a) Synchronous communication (b) Asynchronous communication


SYNCHRONIZATION OF CLOCKS

Logical vs. Physical Clocks: with logical clocks, processes must agree on the order in which events occur; it is not necessary that the clocks are synchronized. With physical clocks, all clocks must not deviate from real time by more than a given tolerance level.

Computer clock: counts oscillations occurring in a crystal and divides the count. Clock drift: oscillators' frequencies vary. A quartz crystal varies by roughly one part in 10**5 to 10**6 (on the order of 1 second every million seconds).

International Atomic Time (TAI): based on Cesium-133; accuracy is about one part in 10**13. January 1, 1958 is the beginning of TAI time. Universal Coordinated Time (UTC): based on atomic time, but adds leap seconds.

The National Institute of Standards and Technology (NIST) Automated Computer Time Service (ACTS) provides a dial-up modem time service; it is designed for infrequent access. Satellite sources include the Geostationary Operational Environmental Satellites (GOES) and the Global Positioning System (GPS). Propagation speed varies with atmospheric conditions, giving accuracy of about 0.1-10 ms.

External vs. Internal Clocks: external synchronization means synchronizing a computer clock with an authoritative, external source of time; internal synchronization means synchronizing a computer clock with other computers' clocks to a known degree of accuracy.

PHYSICAL CLOCK SYNCHRONIZATION ALGORITHMS

Cristian's Algorithm: uses a clock server synchronized with UTC. Each machine issues a message to the time server requesting the time, and the time server responds with the current time C. The propagation time is P = (Reply_time - Request_time) / 2; values exceeding a threshold are discarded and the others averaged. The actual time is then C + P. However, time can never go backwards, so rather than setting the clock back, the local clock rate is slowed down or increased as appropriate (e.g., instead of adding 10 msec to the clock at each interrupt, add slightly more or less until the times match). Problem: dependence on a single time server. Solution: have a number of time servers, broadcast the time request to all servers and take the first returned value.
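A sketch of the estimate itself, with a stand-in function in place of a real UTC time server:

    # Illustrative sketch of Cristian's algorithm. get_server_time() is a
    # stand-in for querying a real UTC time server over the network.
    import time

    def get_server_time():
        return time.time() + 5.0           # pretend the server is 5 s ahead

    t0 = time.time()                       # request sent
    server_time = get_server_time()        # server replies with its time C
    t1 = time.time()                       # reply received
    propagation = (t1 - t0) / 2            # P: assume symmetric network delay
    estimate = server_time + propagation   # estimated current server time
    offset = estimate - time.time()        # apply gradually; never step back
    print("clock offset: %.3f s" % offset)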

Berkeley Algorithm: the Berkeley UNIX time daemon polls every machine periodically. The time server computes the average time from the stable machines, taking propagation time into account, and returns to each machine the delta (+ or -) by which it should adjust its clock. It requires no interface to an external time source.

Network Time Protocol (NTP): synchronizes clocks on the Internet using a hierarchy called the synchronization subnet. Primary servers are connected to a UTC clock source; secondary servers synchronize from primary servers; stratum 3 servers synchronize from stratum 2 servers, and so on. Synchronization is most accurate at the higher levels (lower stratum numbers).

Decentralized Algorithm - all processors broadcast their current time every interval R. Each processor discards the extreme values it receives and averages the remaining ones to obtain the new time value.

Logical Clock Synchronization Algorithm - Lamport's Algorithm is based on the happens-before relation: if A and B are events in the same process and A occurs before B, then A happens before B (a -> b). If A is the sending of a message by one process and B is the receipt of that message by another process, then A happens before B as well. The relation is transitive: if a -> b and b -> c, then a -> c. If a message would be received before it was sent (according to the local clocks), the receiver fast-forwards its clock to the send time + 1. Concurrent: if two events happen in different processes that exchange no messages, their ordering is unknown. Between every two events in a process the clock must tick at least once. No two events ever occur at the same time: attach the process number to the timestamp to break ties if necessary.
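A minimal sketch of a Lamport logical clock following these rules (class and method names are our own):

    # Illustrative Lamport logical clock: tick on local events; on receipt,
    # fast-forward to max(local, message timestamp) + 1.
    class LamportClock:
        def __init__(self):
            self.time = 0

        def tick(self):                  # any local event
            self.time += 1
            return self.time

        def send(self):                  # timestamp attached to outgoing message
            return self.tick()

        def receive(self, msg_time):     # adjust the clock on message receipt
            self.time = max(self.time, msg_time) + 1
            return self.time

    a, b = LamportClock(), LamportClock()
    ts = a.send()             # A sends at logical time 1
    print(b.receive(ts))      # B's clock jumps to 2, preserving happens-before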

Mutual Exclusion (mutex) - at most one process may execute in the critical region at a time; a process requesting entry is eventually granted it (no starvation); entry to the critical region should happen in happened-before ordering.

Central Server Mutual Exclusion: a server grants permission to enter a critical section. A process sends a request message to the server and awaits a reply; a grant reply gives permission to enter the critical region managed by the server. If the token is held by another process, the server queues the request until the token becomes available. When the critical region is exited, the process sends a Release message to the central server. Problem: what if the server fails? A new server must be elected, and processes are left waiting for replies; a single server can also become a bottleneck. But the scheme is simple and efficient.

Ricart and Agrawala Distributed Algorithm: uses distributed agreement. A process sends a request to enter the critical region (CR) to all other processes. The request contains the name of the critical region, the process number and the time.

If the receiver does not want to enter the CR, it returns OK.
If the receiver is in the CR, it queues the request.
If the receiver also wants to enter the CR, it sends OK if the incoming request is earlier, or queues the request if it is later.
When a process exits the CR, it sends OK to all processes on its queue.

Problem: if any process crashes it will not respond to requests. Solution: the receiver always sends a reply, either granting or denying permission; the sender resends requests periodically and eventually assumes the destination is dead. There are many messages, every process must track all group members, and it is time consuming: less efficient than the centralized approach.

Token Ring Algorithm: processes take turns as the token circulates around the ring. When a process holds the token it has the option to enter the critical region; on exit it passes the token on. It is inefficient when no process wants to enter a critical region, yet the token message still circulates. Processes do not have to be in a ring configuration physically. Problem: the token is not obtained in happened-before order, and if the token is lost an election must occur so that the token is regenerated.

ELECTION ALGORITHMS - Many distributed algorithms require one process to act as coordinator, and an election selects that coordinator (e.g., electing a time server or a mutual exclusion server). In general, the elected coordinator is the process with the highest process number.

Bully Algorithm: the coordinator is always the process with the highest process number. Process P notices the coordinator is no longer responding to requests and sends an ELECTION message to all processes with higher numbers. Any higher process responds with OK (P then drops out) and sends ELECTION messages to all processes with numbers higher than its own. If no one responds, P becomes coordinator and sends a COORDINATOR message to all running processes. If a higher-numbered process ever boots, it starts a new election by sending an ELECTION message.

Ring Algorithm: let's see who has the highest number in the ring. Process P notices the coordinator is no longer responding to requests and sends an ELECTION message, containing its own process number, to the next process on the ring. Each process adds its process number to the ELECTION message. When P receives its ELECTION message back, it sends a COORDINATOR message announcing the highest process number in the list as the winner.
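A sketch of the bully algorithm's core rule (the liveness table is invented; real implementations exchange ELECTION/OK/COORDINATOR messages with timeouts):

    # Illustrative bully-election sketch: highest-numbered live process wins.
    # The liveness table is invented; process 4 (the old coordinator) crashed.
    alive = {1: True, 2: True, 3: True, 4: False}

    def election(starter):
        higher = [p for p in alive if p > starter and alive[p]]
        if not higher:
            return starter               # nobody higher answered: starter wins
        # each responding higher process takes over the election in turn
        return max(election(p) for p in higher)

    print("coordinator:", election(1))   # 3 becomes the new coordinator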

INTER-PROCESS COMMUNICATION (IPC)

When processes on the same local computer wish to interact they make use of an inter-process communication (IPC) mechanism that is usually provided by the OS. The most common mode of communication is via shared memory, since the processes reside in the same address space. A number of mechanisms are available:

Pipes / Named Pipes - perhaps the most primitive example is a synchronous filter mechanism, for example the pipe mechanism in UNIX: ls -l | more. The commands ls and more run as two concurrent processes, with the output of ls connected to the input of more; the overall effect is to list the contents of the current directory one screen at a time.

File sharing - An alternative mechanism is the use of a local file. This has the advantage that it can handle large volumes of data and is well understood; it is the basis on which on-line database systems are built. The major drawback is that there are no inherent synchronization mechanisms between communicating processes, so to avoid corruption of state data, synchronization mechanisms such as file and record locking are used to allow concurrent processes to communicate while preserving data consistency. Secondly, communication is inefficient since it uses a relatively slow medium.

Shared Memory - Since all processes are local, the computer's RAM can be used to implement a shared memory facility. A common region of memory, addressable by all concurrent processes, is used to define shared variables which are used to pass data or to synchronize. Processes must use semaphores, monitors or other techniques for synchronization. A good example of a shared memory mechanism is the clipboard facility.

Message Queuing - A common asynchronous linkage mechanism is a message queuing mechanism that provides the ability for any process to read/write from a named queue. Synchronization is inherent in the read/write operations, and together with the message queue this supports asynchronous communication between many different processes. Messages are identified by a unique identifier, and security is implemented by granting read/write permissions to processes.

IPC mechanisms can be broadly classified into:

Reliable communication - channels fail only with the end system; e.g., if a central computer bus fails, usually the entire machine (stable storage, memory, CPU access) fails.

Unreliable communication - channels exhibit various different types of fault. Messages may be lost, re-ordered, duplicated, changed into apparently correct but different messages, or even created as if from nowhere by the channel. All of these problems may have to be overcome by the IPC mechanism.

SYNCHRONIZATION MODELS

Unicasting – This involves sending a separate copy of the message to each member. An implicit assumption is that the sender knows the address of every member in the group, which may not be possible in some systems. In the absence of more sophisticated mechanisms, a system may resort to unicasting if member addresses are known. The number of network transmissions is proportional to the number of members in the group.

Multicasting – In this model a single message with a group address can be used for routing purposes. When a group is first created it is assigned a unique group address. When a member is added to the group, it is instructed to listen for messages stamped with the group address as well as for its own unique address. This is an efficient mechanism since the number of network transmissions is significantly less than for unicasting.

Broadcasting – Broadcast the message by sending a single message with a broadcast address. The message is sent to every possible entity on the network; every entity must read the message and determine whether to take action or discard it. This may be appropriate where the addresses of members are not known, since most network protocols implement a broadcast facility. However, if messages are broadcast frequently and there is no efficient network broadcast mechanism, the network becomes saturated. In some cases, either all group members must receive a group message or none at all; group communication in this case is said to be atomic. Achieving atomicity in the presence of failures is difficult, resulting in many more messages being sent. Another aspect of group communication is the ordering of group messages. For example, in a computer conferencing system a user would expect to receive the original news item before any response to that item. This is known as ordered multicast, and the requirement to ensure that all multicasts are received in the same order by all group members is common in distributed systems. Atomic multicasting does not guarantee that all messages will be received by the group members in the order they were sent.

REMOTE INTER-PROCESS COMMUNICATION (IPC)

In a distributed system, processes interact in a logical sense by exchanging messages across a communication network. This is referred to as remote IPC. As with local processes, remote processes are either co-operating to complete a defined task or competing for the use of a resource. Remote IPC can be implemented using the message passing, remote procedure call or shared memory paradigms. Remote IPC functions are:

Process registration for the purpose of identifying communicating processes
Hiding differences between local and remote communication
Establishing communication channels between processes
Routing messages to the destination process
Synchronizing concurrent processes
Shutting down communication channels
Enforcing a clean and simple interface, providing a natural environment for modular structuring of distributed applications


a) The interconnection between client and server in a traditional RPC
b) The interaction using asynchronous RPC

BINDING

At some point, a process needs to determine the identity of the process with which it is communicating. This is known as binding. There are two major ways of binding:

Static binding – destination processes are identified explicitly at program compile time. Static binding is the most efficient approach and is most appropriate when a client almost always binds to the same server, although in some systems it is often not possible to identify all potential destination processes.

Dynamic binding – source-to-destination bindings are created, modified and deleted at program run-time. Dynamic binding facilitates location and migration transparency when processes are referred to indirectly (by name) and mapped to the location address at run-time. This is normally facilitated by a service known as a directory service.

Binding a Client to a Server

RPC Semantics in the Presence of Failures - binding involves two steps: locate the server's machine, then locate the service on that machine. The failures that can occur are:

The client is unable to locate the server.
The request message from the client to the server is lost.
The server crashes after receiving a request.
The reply message from the server to the client is lost.
The client crashes after sending a request.


MESSAGE PASSING

Direct communication - message passing is a low-level form of remote IPC in which the developer is explicitly aware of the messages used in communication and of the underlying message transport mechanism used in message exchange. Processes interact directly, using send and receive (or equivalent) language primitives to initiate message transmission and reception, explicitly naming the recipient or sender, for example:

Send(message, destination_process)
Receive(message, source_process)

Message passing is the most flexible remote IPC mechanism: it can be used to support all types of process interaction, and the underlying transport protocols can be configured according to the needs of the application. The above example is known as direct communication.

Indirect communication - here the destination and source identifiers are not process identifiers; instead, a port (also known as a mailbox) is specified, which represents an abstract object at which messages are queued. Potentially, any process can write to or read from a port. To send a message to a process, the sending process simply issues a send operation specifying a well-known port number that is associated with the destination process. To receive the message, the recipient simply issues a receive specifying the same port number. For example:

Send(message, destination_port)
Receive(message, source_port)

Security constraints can be introduced by allowing the owning process to specify access control rights on a port. Messages are not lost provided the queue size is adequate for the rate at which messages are being queued and de-queued.

REMOTE PROCEDURE CALL

Many distributed systems have been based on explicit message exchange between processes. However, the procedures send and receive do not conceal communication, which is important to achieve access transparency in distributed systems. This interaction is very similar to the traditional procedure call in high-level programming languages except that the caller and the procedure to be executed are on different computers. A procedure call mechanism that allows the calling and the called procedures to be running on different computers is known as remote procedure call (RPC). When a process on machine A calls a procedure on machine B, the calling process on A is suspended, and execution of the called procedure takes place on B. Information can be transported from the sender to the recipient in the parameters and can come back in the procedure result. No message passing at all is visible to the programmer. While the basic idea sounds simple and elegant, subtle problems exist. To start with, because the calling and called procedures run on different machines, they execute in different address spaces, which causes complications. Parameters and results also have to be passed, which can be complicated, especially if the machines are not identical. Finally, both machines can crash and each of the possible failures causes different problems. Still, most of these can be dealt with, and RPC is a widely used technique that underlies many distributed systems.

RPC is popular for developing distributed systems because it looks and behaves like a well-understood, conventional procedure call in high-level languages. A procedure call is a very effective tool for implementing abstraction, since to use it all one needs to know is the name of the procedure and the arguments associated with it. Packing parameters into a message is called parameter marshalling. RPC is a remote operation with semantics similar to a local procedure call and can provide a degree of:

Access transparency – since a call to a remote procedure may be similar to a local procedure.

Location transparency – since the developer can refer to the procedure by name, unaware of where exactly the remote procedure is located.

Synchronization – since the process invoking the RPC remains suspended (blocked) until the remote procedure is completed, just as a call to a local procedure. A remote procedure call occurs in the following steps:

1. The client procedure calls the client stub in the normal way.
2. The client stub builds a message and calls the local operating system.
3. The client's OS sends the message to the remote OS.
4. The remote OS gives the message to the server stub.
5. The server stub unpacks the parameters and calls the server.
6. The server does the work and returns the result to the stub.
7. The server stub packs it in a message and calls its local OS.
8. The server's OS sends the message to the client's OS.
9. The client's OS gives the message to the client stub.
10. The stub unpacks the result and returns to the client.

Stub Generation - Once the RPC protocol has been completely defined, the client and server stubs need to be implemented. Fortunately, stubs for the same protocol but different procedures generally differ only in their interface to the applications. An interface consists of a collection of procedures that can be called by a client, and which are implemented by a server. An interface is generally available in the same programming language as the one in which the client or server is written (although this is, strictly speaking, not necessary). To simplify matters, interfaces are often specified by means of an Interface Definition Language (IDL). An interface specified in such an IDL is then compiled into a client stub and a server stub, along with the appropriate compile-time or run-time interfaces. Practice shows that using an interface definition language considerably simplifies client-server applications based on RPCs. Because it is easy to fully generate client and server stubs, all RPC-based middleware systems offer an IDL to support application development.

MARSHALLING

Marshalling is the process of converting data types from the machine's representation to a standard representation before transmission, and converting them at the other end from the standard back to the machine's internal representation. Marshalling is complicated by the use of global variables and pointers, as they only have meaning in the client's address space: client and server processes run in different address spaces on separate machines. One solution would be to pass the data values held by global variables or pointed to by the pointers. However, there are cases where this will not work, for example when a linked-list data structure is being passed to a procedure that manipulates the list. Differences in the representation of data can be overcome by use of an agreed language for representing data between client and server processes. For example, a common syntax for describing and encoding data, known as Abstract Syntax Notation One (ASN.1), has been defined as an international standard by the International Organization for Standardization (ISO). ASN.1 is similar to the data declaration statements in a high-level programming language.
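As a small illustration of marshalling into a standard network format, the sketch below packs a request into big-endian byte order with Python's struct module; the message layout (id, amount, name) is invented:

    # Illustrative marshalling sketch: pack values into a standard network
    # format (big-endian) so machines with different internal formats agree.
    import struct

    def marshal(request_id, amount, name):
        data = name.encode("utf-8")
        header = struct.pack("!IdI", request_id, amount, len(data))  # '!': network order
        return header + data

    def unmarshal(buf):
        request_id, amount, length = struct.unpack_from("!IdI", buf)
        offset = struct.calcsize("!IdI")
        name = buf[offset:offset + length].decode("utf-8")
        return request_id, amount, name

    msg = marshal(7, 99.5, "deposit")
    print(unmarshal(msg))    # (7, 99.5, 'deposit'), on any receiving machine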

FAILURE HANDLING

RPC failures can be difficult to handle. There are four generalized types of failure that can occur when an RPC call is made:

The client's request message is lost.
The client process fails while the server is processing the request.
The server process fails while servicing the request.
The reply message is lost.

If the client's message gets lost then the client will wait forever unless a time-out error detection mechanism is employed. If the client process fails then the server will carry out the remote operation unnecessarily; if the operation involves updating a data value then this can lead to a loss of data integrity. Furthermore, the server would generate a reply to a client process that no longer exists, which must be discarded by the client's machine. When the client restarts, it may send the request again, causing the server to execute the operation more than once. A similar situation arises when the server crashes: the server could crash just prior to the execution of the remote operation, or just after execution completes but before a reply to the client is generated. In this case, clients will time out and continually generate retries until either the server restarts or the retry limit is reached.

REMOTE METHOD INVOCATION (RMI) Remote method invocation allows applications to call object methods located remotely, sharing resources and processing load across systems. Unlike other systems for remote execution that require that only simple data types or defined structures be passed to and from methods, RMI allows any object type to be used - even if the client or server has never encountered it before. RMI allows both client and server to dynamically load new object types as required.

RMI Applications - RMI is the equivalent of RPC commonly used in middleware based on the distributed objects model. RMI applications often consist of two separate programs: a server and a client. A typical server application creates some remote objects, makes references to them accessible, and waits for clients to invoke methods on these remote objects. A typical client application gets a remote reference to one or more remote objects in the server and then invokes methods on them. RMI provides the mechanism by which the server and the client communicate and pass information back and forth. Such an application is sometimes referred to as a distributed object application.

THE DISTRIBUTED MEMORY

The main idea is to provide the mechanism for a set of networked workstations to share a single, paged virtual address space. A reference to local memory is done in hardware first, emulating multiprocessor caches using the MMU and OS. An attempt to reference an address that is not local causes a page fault and a trap to the OS, which sends a message to the remote node to fetch the page and then restarts the faulting instruction. The idea is similar to traditional virtual memory systems. Distributed Shared Memory (DSM) is an abstraction used for sharing data between processes in computers that do not share physical memory. Processes appear to access a single shared memory, which makes DSM a useful tool for parallel applications, since there is no explicit message passing and no marshalling of data. It is also scalable to large numbers of computers. The main approaches to DSM are hardware-based, page-based, and library- or object-based.
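The page-based approach can be caricatured in a few lines: each virtual page lives on some node, and a non-local access 'faults' and fetches the page from its current owner. The page table and node names below are invented:

    # Illustrative page-based DSM sketch: invented page table and node names.
    page_owner = {0: "nodeA", 1: "nodeB"}     # page -> node hosting the page
    node_memory = {"nodeA": {0: b"page-0 data"},
                   "nodeB": {1: b"page-1 data"}}

    def read(page, local_node):
        local = node_memory[local_node]
        if page in local:                     # hit: page already local
            return local[page]
        owner = page_owner[page]              # 'page fault': find hosting node
        local[page] = node_memory[owner][page]   # fetch a copy of the page
        return local[page]

    print(read(1, "nodeA"))    # nodeA faults and fetches page 1 from nodeB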

CONSISTENCY MODELS

Strict Consistency: the ideal programming model: any read of a memory location X returns the value stored by the most recent write operation to X. Nearly impossible to implement in a distributed system; easy on a parallel system or a single system with multiple threads/processes.

Sequential Consistency: any valid interleaving is acceptable, but all processes must see the same sequence of memory references.

Causal Consistency: based on happens-before order: all processors agree on the order of writes issued by a given processor X, and a read by processor Y that returns a value written by X must be ordered after that write. Concurrent (non-causal) writes may be seen in a different order on different machines.

Pipelined RAM (PRAM): writes from different processes may be seen in a different order; all processors agree only on the order of writes issued by the same processor X.

Weak Consistency: the programmer uses a synchronization method to update data. Synchronization methods may include a critical section, mutual exclusion (mutex), or a barrier (all processes must arrive at the barrier before any can continue).

Release Consistency: shared data are made consistent when a critical region is exited, and shared data are updated only within critical sections; multiple data items may be associated with one critical section. This exploits the fact that programmers use synchronization objects.

Entry Consistency: shared data are made consistent upon entering a critical region. One synchronization variable is associated with each data object, so multiple data items can be updated at a time by different processes.

OBJECT-BASED DSM - An object includes attributes such as the object state (internal data) and methods (operations), and uses information hiding. The shared memory is treated as a collection of separate objects instead of a linear address space.

MEMO - MEMO is a filing or organizational package which coordinates data and tasks between processes; it is well suited to 'job jar' work-allocation schemes.

Caching - Used when CPUs share the same physical memory. Since a cache is faster than main memory, caching reduces the CPUs' accesses to the shared bus; this approach works with fewer than about 64 CPUs.

NUMA (Non-Uniform Memory Access) Multiprocessors: All memories are glued together to create one real address space. Access to remote memory is possible, but slower than access to local memory, and remote memory may not be cached.

Distributed object applications need to:

Locate remote objects: Applications can use one of two mechanisms to obtain references to remote objects. An application can register its remote objects with RMI's simple naming facility, or the application can pass and return remote object references as part of its normal operation.

Communicate with remote objects: Details of communication between remote objects are handled by RMI; to the programmer, remote communication looks like a standard method invocation.

Load class byte codes for objects that are passed around: Because RMI allows a caller to pass objects to remote objects, RMI provides the necessary mechanisms for loading an object's code, as well as for transmitting its data.

One of the central and unique features of RMI is its ability to download the bytecodes (or simply code) of an object's class if the class is not defined in the receiver's virtual machine. The types and the behavior of an object, previously available only in a single virtual machine, can be transmitted to another, possibly remote, virtual machine. RMI passes objects by their true type, so the behavior of those objects is not changed when they are sent to another virtual machine. This allows new types to be introduced into a remote virtual machine, thus extending the behavior of an application dynamically.

Creating Distributed Applications Using RMI
When you use RMI to develop a distributed application, you follow these general steps:

1. Design and implement the components of your distributed application.
2. Compile sources and generate stubs.
3. Make classes network accessible.
4. Start the application.

IMPLEMENTING APPLICATION COMPONENTS
First, decide on your application architecture and determine which components are local objects and which ones should be remotely accessible. This step includes:

Defining the remote interfaces: A remote interface specifies the methods that can be invoked remotely by a client. Clients program to remote interfaces, not to the implementation classes of those interfaces. Part of the design of such interfaces is the determination of any local objects that will be used as parameters and return values for these methods; if any of these interfaces or classes do not yet exist, you need to define them as well.

Implementing the remote objects: Remote objects must implement one or more remote interfaces. The remote object class may include implementations of other interfaces (either local or remote) and other methods (which are available only locally). If any local classes are to be used as parameters or return values to any of these methods, they must be implemented as well.

Implementing the clients: Clients that use remote objects can be implemented at any time after the remote interfaces are defined, including after the remote objects have been deployed.
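To make these steps concrete, here is a minimal sketch of a Java RMI application. The Greeting interface, the GreetingServer class, the registry name "Greeting", and the host name "server-host" are illustrative choices, not part of any standard; error handling and security configuration are omitted.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    // Remote interface: clients program to this, never to the implementation class.
    interface Greeting extends Remote {
        String sayHello(String caller) throws RemoteException;
    }

    // Server: implements the remote interface, exports itself, and registers a stub.
    class GreetingServer implements Greeting {
        public String sayHello(String caller) { return "Hello, " + caller; }

        public static void main(String[] args) throws Exception {
            Greeting stub = (Greeting) UnicastRemoteObject.exportObject(new GreetingServer(), 0);
            Registry registry = LocateRegistry.createRegistry(1099);
            registry.rebind("Greeting", stub);   // make the reference accessible
        }
    }

    // Client: looks up the remote reference and invokes a method on it.
    class GreetingClient {
        public static void main(String[] args) throws Exception {
            Registry registry = LocateRegistry.getRegistry("server-host", 1099);
            Greeting g = (Greeting) registry.lookup("Greeting");
            System.out.println(g.sayHello("client"));   // looks like a local call
        }
    }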

DISTRIBUTED PROCESSING
Distributed processing can be loosely defined as the execution of co-operating processes which communicate by exchanging messages across an information network. It means that the infrastructure consists of distributed processors, enabling parallel execution of processes and message exchanges. Communication and data exchange can be implemented through shared memory, message exchange/passing, or Remote Procedure Call (RPC).

PROCESSES AND THREADS
A process is a logical representation of a physical processor that executes program code and has associated state and data; it is sometimes described as a virtual processor. A process is the unit of resource allocation and so is defined by the resources it uses and by the location at which it is executing. A process can run either in a separate (private) address space or may share the same address space with other processes. Processes are created either implicitly (e.g. by the operating system) or explicitly using an appropriate language or O/S construct such as fork(). In uni-processor computer systems the illusion of many programs running at the same time is created using the time-slicing technique, but in actual fact there is only one program utilizing the CPU at any given time. Processes are switched in and out of the CPU so rapidly that each process appears to be executing continuously. Switching involves saving the state of the currently active process and setting up the state of another process, and is known as context switching.

Threads - Some operating systems allow additional 'child processes' to be created, each competing for the CPU and other resources with the other processes. All resources belonging to the 'parent process' are duplicated, thus making them available to the 'child processes'. It is common for a program to create multiple processes that are required to share memory and other resources, and a process may wait for a particular event to occur. Some operating systems support this situation efficiently by allowing a number of processes to share a single address space. Processes in this context are referred to as threads, and the O/S is said to support multi-threading. The terms process and thread are then often used interchangeably, and such a system has layers as shown.
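As a small illustration of threads sharing a single address space, the following sketch runs two Java threads over one shared counter. The class and field names are illustrative; the synchronized keyword provides the mutual exclusion that sharing an address space makes necessary.

    public class SharedCounter {
        private int count = 0;

        // Both threads share this object's memory, so the increment
        // must be synchronized to avoid a race condition.
        public synchronized void increment() { count++; }
        public synchronized int value() { return count; }

        public static void main(String[] args) throws InterruptedException {
            SharedCounter c = new SharedCounter();
            Runnable work = () -> { for (int i = 0; i < 100_000; i++) c.increment(); };
            Thread t1 = new Thread(work);
            Thread t2 = new Thread(work);
            t1.start(); t2.start();        // the scheduler interleaves the two threads
            t1.join();  t2.join();
            System.out.println(c.value()); // always 200000 with synchronization
        }
    }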

(Figure: the general organization of an Internet search engine into three different layers.)

SYSTEM NAMES AND NAMING TECHNIQUES
Introduction
Most computer systems (in particular operating systems) manage wide collections of entities (such as files, users, hosts, networks, and so on). These entities are referred to by users of the system and by other entities using various kinds of names. Examples of names in UNIX systems include the following:

Devices: /dev/hda, /dev/ttyS1
Files: /boot/vmlinuz, /lectures/DS/notes/tex/naming.tex


For largely historical reasons, different entities are often named using different naming schemes. We say that they exist in different name spaces. From time to time a new system design attempts to integrate a variety of entities into a homogeneous name space, and then also attempts to provide a uniform interface to these entities. For example, a central concept of UNIX systems is the uniform treatment of files, devices, sockets, and so on. Some systems also introduce a /proc file system, which maps processes to names in the file system and supports access to process information through this file interface. In addition, Linux provides access to a variety of kernel data structures via the /proc file system.

BASIC CONCEPTS
A name is the fundamental concept underlying naming. We define a name as a string of bits or characters that is used to refer to an entity. An entity in this case is any resource, user, process, etc. in the system.

Entities are accessed by performing operations on them; the operations are performed at an entity’s access point. An access point is also referred to by a name, we call an access point’s name an address. Entities may have multiple access points and may therefore have multiple addresses. Furthermore an entity’s access points may change over time (that is an entity may get new access points or lose existing ones), which means that the set of an entity’s addresses may also change.

A pure name is a name that consists of an uninterpreted bit pattern that does not encode any of the named entity's attributes.

A non-pure name, on the other hand, does encode entity attributes (such as an access point address) in the name.

An identifier is a name that uniquely identifies an entity. An identifier refers to at most one entity and an entity is referred to by at most one identifier. Furthermore an identifier can never be reused, so that it will always refer to the same entity. Identifiers allow for easy comparison of entities; if two entities have the same identifier then they are the same entity. Pure names that are also identifiers are called pure identifiers.

Location independent names are names that are independent of an entity’s address. They remain valid even if an entity moves or otherwise changes its address. Note that pure names are always location independent, though location independent names do not have to be pure names.

SYSTEM NAMES VERSUS HUMAN NAMES
Related to the purity of names is the distinction between system-oriented and human-oriented names. Human-oriented names are usually chosen for their mnemonic value, whereas system-oriented names are a means for efficient access and identification of objects. Taking into account the desire for transparency, human-oriented names would ideally be pure. In contrast, system-oriented names are often non-pure, which speeds up access to repeatedly used object attributes. We can characterize these two kinds of names as follows:

System-Oriented Names
System-oriented names are usually implemented as one or more fixed-size numerals to facilitate efficient handling. Moreover, they typically need to be unique identifiers and may be sparse to convey access rights (e.g., capabilities). Depending on whether they are globally or locally unique, we also call them unstructured or structured: a globally unique integer is an unstructured name, whereas a pair of node identifier and locally unique identifier forms a structured name. The structuring may extend over multiple levels. Note that a structured name is not pure. Global uniqueness without further mechanism requires a centralized generator, with the usual drawbacks regarding scalability and reliability. In contrast, distributed generation without excessive communication usually leads to structured names. For example, a globally unique structured name can be constructed by combining the local time with a locally unique identifier: both values can be generated locally and do not require any communication.

Human-Oriented Names
In many systems, the most important attribute bound to a human-oriented name is the system-oriented name of the object. All further information about the entity is obtained via the system-oriented name. This enables the system to perform the usually costly resolution of the human-oriented name just once and implement all further operations on the basis of the system-oriented name (which is more efficient to handle). Often a whole set of human-oriented names is mapped to a single system-oriented name (symbolic links, relative addressing, and so on).

As an example of all this, consider the naming of files in UNIX. A pathname is a human-oriented name that, by means of the directory structure of the file system, can be resolved to an inode number, which is a machine-oriented name. All attributes of a file are accessible via the inode (i.e., the machine-oriented name). By virtue of symbolic and hard links, multiple human-oriented names may refer to the same inode, which makes equality testing of files merely by their human-oriented names impossible. The design space for human-oriented names is considerably wider than that for system-oriented names. As such, naming systems for human-oriented names usually require considerably greater implementation effort.

NAME SPACES
Names are grouped and organized into name spaces. A structured name space is represented as a labeled directed graph, with two types of nodes. A leaf node represents a named entity and stores information about the entity. The information could include the entity itself, or a reference to the entity (e.g., an address).

A directory node (also called a context) is an inner node and does not represent any single entity. Instead it stores a directory table, containing (node-id, edge-label) pairs, that describes the node's children. A leaf node only has incoming edges, while a directory node has both incoming and outgoing edges. A third kind of node, a root node, is a directory node with only outgoing edges.

A structured name space can be strictly hierarchical or can form a directed acyclic graph (DAG). In a strictly hierarchical name space a node will only have one incoming edge. In a DAG name space any node can have multiple incoming edges. It is also possible to have name spaces with multiple root nodes.

Scalable systems usually use hierarchically structured name spaces. A sequence of edge labels leading from one node to another is called a path name.

A path name is used to refer to a node in the graph. An absolute path name always starts from a root node; a relative path name is any path name that does not start at a root node.


Many name spaces support aliasing, in which case an entity may be reachable by multiple paths from a root node and will therefore be named by numerous path names. There are two types of aliases.

A hard link exists when two or more paths lead directly to the same entity. A soft link occurs when a leaf node holds a pathname that refers to another node.

In this case the leaf node implicitly refers to the file named by the pathname. Ideally we would have a global, homogeneous name space that contains names for all entities used. However, we are often faced with the situation where we already have a collection of name spaces that have to be combined into a larger name space. One approach is to simply create a new name that combines names from the other name spaces. For example, a Web URL http://www.raiuniversity.edu/~cs9243/naming-slides.ps globalizes the local name ~cs9243/naming-slides.ps by adding the context www.raiuniversity.edu. Unfortunately, this approach often compromises location transparency, as is the case in the example of URLs.

Another example of the composition of name spaces is mounting a name space onto a mount point in a different (external) name space. This approach is often applied to merge file systems (e.g., mounting a remote file system onto a local mount point). In terms of a name space graph, mounting requires one directory node to contain information about another directory node in the external name space. This is similar to the concept of soft linking, except that in this case the link is to a node outside of the name space. The information contained in the mount point node must, therefore, include information about where to find the external name space.

NAME RESOLUTION
The process of determining what entity a name refers to is called name resolution. Resolving a name results in a reference to the entity that the name refers to. Resolving a name in a name space often results in a reference to the node that the name refers to. Path name resolution is a process that starts with the resolution of the first element in the path name, and ends with resolution of the last element in the name. There are two approaches to this process: iterative resolution and recursive resolution.

Iterative Name Resolution
In iterative resolution the resolver contacts each node directly to resolve each individual element of the path name. (A code sketch of iterative resolution appears below, after the discussion of closure mechanisms.)

Recursive Name Resolution
In recursive resolution the resolver only contacts the first node and asks it to resolve the whole name. That node looks up the node referred to by the first element of the name and then passes the rest of the name on to that node. The process is repeated until the last element is resolved, after which the result is returned back through the nodes to the resolver.

Closure Mechanism
A problem with name resolution is how to determine which node to start resolution at. Knowing how and where to start name resolution is referred to as the closure mechanism. One approach is to keep an external reference (e.g., in a file) to the root node of the name space. Another approach is to keep a reference to the 'current' directory node for dealing with relative names.


Note that the actual closure mechanism is always implicit, that is it is never explicitly defined in a name. The reason for this is that if a closure mechanism was defined in a name there would have to be a way to resolve the name used for that closure mechanism. This would require the use of a closure mechanism to bootstrap the original closure mechanism. Because this could be repeated indefinitely, at a certain point an implicit mechanism will always be required.
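To make iterative resolution concrete, here is a minimal sketch over an in-memory naming graph. The Node and Resolver classes and the '/' separator are illustrative assumptions; a real resolver would contact a different name server for each directory node rather than follow local references.

    import java.util.HashMap;
    import java.util.Map;

    // A directory node holds a directory table; a leaf node holds an entity.
    class Node {
        final Map<String, Node> children = new HashMap<>(); // directory table
        Object entity;                                       // non-null at leaves
    }

    class Resolver {
        // Iterative resolution: the resolver itself resolves one path
        // element per step, following each directory node in turn.
        static Node resolve(Node root, String pathName) {
            Node current = root;
            for (String label : pathName.split("/")) {
                if (label.isEmpty()) continue;          // skip the leading '/'
                current = current.children.get(label);
                if (current == null)
                    throw new IllegalArgumentException("cannot resolve: " + label);
            }
            return current;
        }
    }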

NAMING SERVICE
A naming service is a service that provides access to a name space, allowing clients to perform operations on the name space. These operations include adding and removing directory or leaf nodes, modifying the contents of nodes, and looking up names. The naming service is implemented by name servers. Name resolution is performed on behalf of clients by resolvers. A resolver can be implemented by the client itself, in the kernel, by the name server, or as a separate service.

Distributed Naming Service
As with most other system services, naming becomes more involved in a distributed environment. A distributed naming service is implemented using multiple name servers over which the name space is partitioned and/or replicated. The goal of a distributed naming service is to distribute both the management and the name resolution load over these name servers. Before discussing implementation aspects of distributed naming services it is useful to split a name space up into several layers according to the role the nodes play in the name space. These layers help to determine how and where to partition and replicate that part of the name space.

The highest-level nodes belong to the global layer. A main characteristic of nodes in this layer is that they are stable, meaning that they do not change much. As such, replicating these nodes is relatively easy because consistency does not cause much of a problem. The next layer is the administrational layer. The nodes in this layer generally represent a part of the name space that is associated with a single organizational entity (e.g., a company or a university). They are relatively stable (but not as stable as the nodes in the global layer). Finally, the lowest layer is the managerial layer. This layer sees much change: nodes may be added or removed as well as have their contents modified. The nodes in the top layers generally see the most traffic and, therefore, require more effort to keep their performance at an acceptable level.

Typically, a client does not directly converse with a name server, but delegates this to a local resolver that may use caching to improve performance. Each of the name servers stores one or more naming contexts, some of which may be replicated. We call the name servers storing attributes of an object this object’s authoritative name servers.


(Figure: a comparison between name servers implementing nodes from a large-scale name space partitioned into a global layer, an administrational layer, and a managerial layer.)

Directory nodes are the smallest unit of distribution and replication of a name space. If they are all on one host, we have one central server, which is simple, but does not scale and does not provide fault tolerance. Alternatively, there can be multiple copies of the whole name space, which is called full replication. Again, this is simple and access may be fast. However, the replicas will have to be kept consistent and this may become a bottleneck as the system grows.

In the case of a hierarchical name space, partial subtrees (often called zones) may be maintained by a single server. In the case of the Internet Domain Name Service (DNS), this distribution also matches the physical distribution of the network. Each zone is associated with a name prefix that leads from the root to the zone. Now, each node maintains a prefix table (essentially, a hint cache for name servers corresponding to zones) and, given a name, the server corresponding to the zone with the longest matching prefix is contacted. If it is not the authoritative name server, the next zone's prefix is broadcast to obtain the corresponding name server (and update the prefix table). As an alternative to broadcasting, the contacted name server may be able to provide the address of the authoritative name server for this zone. This scheme can be efficiently implemented, as the prefix table can be relatively small and, on average, only a small number of messages are needed for name resolution. Consistency of the prefix table is checked on use, which removes the need for explicit update messages.

For smaller systems, a simpler structure-free distribution scheme may be used. In this scheme contexts can be freely placed on the available name servers (usually, however, some distribution policy is in place). Name resolution starts at the root and has to traverse the complete resolution chain of contexts. This is easy to reconfigure and is, for example, used in the standard naming service of CORBA.

IMPLEMENTATION OF NAMING SERVICES
In the following, we consider a number of issues that must be addressed by implementations of naming services. First, a starting point for name resolution has to be fixed. This essentially means that the resolver must have a list of name servers that it can contact. This list will usually not include the root name server, to avoid overloading it. Instead, physically close servers are normally chosen. For example, in the BIND (Berkeley Internet Name Domain) implementation of DNS, the resolver is implemented as a library linked to the client program. It expects the file /etc/resolv.conf to contain a list of name servers. Moreover, it facilitates relative naming in the form of the search option.


Name Caches
Name resolution is expensive. For example, studies found that a large proportion of UNIX system calls (and network traffic in distributed systems) is due to name-mapping operations. Thus, caching the results of name resolution on the client is attractive:

High degree of locality of name lookup thus a reasonably sized name cache can give good hit ratio.

Slow update of name information database; thus, the cost for maintaining consistency is low.

On-use consistency of cached information is possible; thus, no invalidation on update: stale entries are detected on use.

There are three types of name caches:

Directory cache: directory node data is cached. Directory caches are normally used with iterative name resolution. They require large caches, but are useful for directory listings etc.

Prefix cache: path name prefix and zone information is cached. Prefix caching is unsuitable with structure-free context distribution.

Full-name cache: full path name information is cached. Full-name caching is mostly used with structure-free context distribution and tends to require larger cache sizes than prefix caches.

A name cache can be implemented as a process-local cache, which lives in the address space of the client process. Such a cache does not need many resources, as it typically will be small in size, but much of the information may be duplicated in other processes. More seriously, it is a short-lived cache and incurs a high rate of start-up misses, unless a scheme such as cache inheritance is used, which propagates cache information from parent to child processes. The alternative is a kernel cache, which avoids duplicate entries and excessive start-up misses, but access to a kernel cache is slower and it takes up valuable kernel memory. Alternatively, a shared cache can be located in a user-space cache process that is utilized by clients directly or by redirection of queries via the kernel (the latter is used in the CODA file system).

ATTRIBUTE-BASED NAMING
Whereas the names described above encode at most one attribute of the named entity (e.g., a domain name encodes the entity's administrative or geographical location), in attribute-based naming an entity's name is composed of multiple attributes. An example of an attribute-based name is given below:

/C=AU/O=UNSW/OU=CSE/CN=WWW.server/Hardware=Sparc/OS=Solaris/Server=Apache

The name not only encodes the location of the entity (/C=AU/O=UNSW/OU=CSE, where C is the attribute country, O is organization, OU is organizational unit - these are standard attributes in X.500 and LDAP), it also identifies it as a Web server, and provides information about the hardware that it runs on, the operating system running on it, and the software used. Although an entity's attribute-based name contains information about all attributes, it is common to also define a distinguished name (DN), which consists of a subset of the attributes and is sufficient to uniquely identify the entity. In attribute-based naming systems the names are stored in directories, and each distinguished name refers to a directory entry. Attribute-based naming services are normally called directory services. Similar to a naming service, a directory service implements a name space that can be flat or hierarchical. With a hierarchical name space, its structure mirrors the structure of distinguished names. The structure of the name space (i.e., the naming graph) is defined by a directory information tree (DIT). The actual contents of the directory (that is, the collection of all directory entries) are stored in the directory information base (DIB).
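Directory services of this kind are typically queried through an API such as JNDI. The sketch below searches an LDAP directory by attributes rather than by a path name; the server URL ldap://ldap.example.com and the search base o=UNSW,c=AU are illustrative assumptions, not real endpoints.

    import java.util.Hashtable;
    import javax.naming.Context;
    import javax.naming.NamingEnumeration;
    import javax.naming.directory.*;

    public class DirectoryLookup {
        public static void main(String[] args) throws Exception {
            Hashtable<String, String> env = new Hashtable<>();
            env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
            env.put(Context.PROVIDER_URL, "ldap://ldap.example.com:389");
            DirContext ctx = new InitialDirContext(env);

            // Attribute-based lookup: match on attributes instead of a path.
            Attributes match = new BasicAttributes(true); // ignore attribute name case
            match.put(new BasicAttribute("ou", "CSE"));

            NamingEnumeration<SearchResult> results = ctx.search("o=UNSW,c=AU", match);
            while (results.hasMore())
                System.out.println(results.next().getNameInNamespace()); // each DN found
            ctx.close();
        }
    }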

DISTRIBUTED FILE SYSTEMS
Stateful: The file server maintains information about clients between requests. Information retained includes who opened which files and where to read from next in each file.
    Advantages:
        Shorter request messages: internal file name only required.
        Better performance: open-file information is retained in memory for each request.
        File locking possible: restrict access to one user.
Stateless: The file server replies to requests but does not keep client information between requests. Each request includes the full file name and the offset into the file.
    Advantages:
        Fault tolerance: the file server can simply reboot.
        No OPEN/CLOSE calls needed: fewer messages.
        No server space wasted on tables.
        No limit on the number of open files.
        No problems if a client crashes: no open files are left unclosed.

Dealing with Shared Files:
1. UNIX Semantics: Every operation on a file is instantly visible to all processes. Desired: when a READ follows a WRITE, the READ gets the value just written. Easy implementation: one file server and no cached files. Distributed system: a file server cache gives good performance, but a WRITE must be propagated immediately.
2. Session Semantics: No changes are visible to other processes until the file is closed; WRITEs are propagated when the file is closed.
3. Immutable Files: No updates are possible. Allowable operations include CREATE and READ; WRITEs are not allowed. This simplifies sharing and replication. Newer versions of the same file are always created with a new version number.
4. Atomic Transactions: All changes have the all-or-nothing property. BEGIN TRANSACTION and END TRANSACTION are executed indivisibly.

Caching / Buffering
There are four places to store files: server disk, server memory, client disk, and client memory.
Server disk - Advantages: plenty of space; files accessible to all clients; no consistency problems with one copy. Disadvantage: read time (transfer from server disk to client memory).


Server Cache:
Advantages: performance gain; one copy, so no consistency problems.
Implementation: main memory contains an array of blocks the size of disk blocks. Read from the cache if available, else from disk. Upon a read with a cache miss, release the least recently used block; this requires time-stamping of each read and write.
Handling updates: a dirty flag indicates whether a cached block has been updated and needs to be written to disk; a write-through cache writes updates to disk immediately.

Client Cache:
Advantage: reduces network traffic and the delays in accessing files.
Disadvantage: more complex; potential for different versions of files in client nodes.
Implementation: locations for caching include within the process (no shared cache, e.g. a database) and in the kernel (processes share the cache, but a kernel call is needed to access it).

REPLICATION
Goal: replication transparency, i.e., provide backup and split the workload.
Architecture:
Client program: performs reads and/or writes.
Front end: communicates with replica managers and hides the implementation of how replication is maintained from the client program. Implemented as a user package executed in each client, or as a separate process. Talks with one or multiple replica managers.
Replica manager: holds a copy of the data and performs direct reads/writes on it.

METHODS OF REPLICATION:
Explicit file replication: copy.
Lazy file replication (gossip): updates occur in the background.
Group communication: WRITEs occur simultaneously at all servers.
Primary copy: the primary updates the secondary replicated files. Advantage: simple for the programmer. Disadvantage: recovery upon primary failure. Implementation: read from secondary or primary, write to primary only, elect a new primary from the secondaries upon primary failure. Example: Network Information Service (NIS).
Totally ordered updates: solves the problem of updates arriving out of order. All requests are sent to a sequencer process, which assigns consecutive sequence numbers and forwards the requests to the replica managers; all replica managers process requests in the same order. Problem: sequencer failure or bottleneck.

Sun Network File System (NFS)
Sun NFS was introduced in 1985 and widely adopted in industry as a de facto standard; clients and servers can run different operating systems and different hardware. Implementation: uses Remote Procedure Calls; file and directory services are integrated; the client parses and controls path name translation. Stateless: authentication information is required on each request.
Server caching:
Read-ahead. Writes occur immediately to stable storage: disk, non-volatile memory (NVRAM), or an uninterruptible power supply (UPS). May use write gathering: delays and groups similar writes together.
Caches all requests and checks the cache before processing a new request. RequestID: contains clientID, transactionID, procedure number, state, and timestamp.


Network Information Service (NIS)
    Translates a key into a value used in authenticating parties.
    Translates user names to encrypted passwords.
    Maps machine names to network addresses.
    Supports the primary copy replication method.

ATOMIC TRANSACTIONS
Atomic Transaction - The effect of performing any single operation is free from interference from concurrent operations being performed in other threads. If the transaction does not complete, all previous operations within the transaction are backed out. Aspects of atomicity:

All-or-nothing: all operations in an atomic transaction are completed or rolled back to the initial state.
Failure atomicity: effects are atomic even when the server fails.
Durability: completed transactions are saved in permanent storage.
Isolation: each transaction is performed without interference from other transactions.

Need for Transactions: A banking operation to transfer money is done in two steps: Withdraw(amt, account); Deposit(amt, account). Windows NT: transactions are also used to ensure all file data is consistent so a disk is recoverable if a system crash occurs.

Problems with Simultaneous Transactions:
Lost Update: two writes happen for the same read.
Inconsistent Retrievals: process B reads midway through process A's transaction.
Dirty Read: transaction B reads after transaction A writes, but transaction A aborts.
Over-written Uncommitted Values: a later transaction backs out to an aborted earlier transaction.

Transaction Primitives include:
      BEGIN_TRANSACTION: marks the start of a transaction.
      END_TRANSACTION: commits the transaction.
      ABORT_TRANSACTION: kills the transaction and restores the old values.
      READ: reads data from a file (or other object).
      WRITE: writes data to a file.
Note: each transaction has an identifier, which is included on all of its operations.

Nested Transactions
       If the parent transaction aborts, all child transactions must abort.
       If a child transaction aborts, the parent transaction may decide to commit or abort.
       Child transactions may run concurrently on different servers.

TRANSACTION IMPLEMENTATION
Clients may use a server to share resources. Good design techniques include: the server holds requests for service until the resource becomes available; the server uses a new thread for each request; a thread that cannot continue execution uses the Wait operation; a thread causes a suspended thread to resume using the Signal operation.

Fault Tolerance - Transactions should survive server processor failures. Multiple replicas run on different computers and maintain recovery files. The system can recover from a disk block failure, or when not all replicas were updated before a processor failure.

Fault tolerant: Replicas monitor each other and may continue operation if one fails. Server may back out of partially completed transactions after restart.

Private Workspace: Uncommitted records are written in temporary location. File's index (UNIX i-node) is copied into private workspace. Private workspace index is updated with new/modified records. All other processes continue to see original file. If transaction aborts, private workspace is deleted and private blocks put on free list. If transaction commits, private workspace index replaces previous index and old blocks put on free list.

DISTRIBUTED CONCURRENCY CONTROL
Logical Clocks
For many purposes, it is sufficient that all machines agree on the same time. It is not essential that this time also agrees with the real time as announced on the radio every hour. For running make, for example, it is adequate that all machines agree that it is 10:00, even if it is really 10:02. Thus for a certain class of algorithms, it is the internal consistency of the clocks that matters, not whether they are particularly close to the real time. For these algorithms, it is conventional to speak of the clocks as logical clocks.

In a classic paper, Lamport (1978) showed that although clock synchronization is possible, it need not be absolute. If two processes do not interact, it is not necessary that their clocks be synchronized, because the lack of synchronization would not be observable and thus could not cause problems. Furthermore, he pointed out that what usually matters is not that all processes agree on exactly what time it is, but rather that they agree on the order in which events occur. In the make example above, what counts is whether input.c is older or newer than input.o, not their absolute creation times. In this section we will discuss Lamport's algorithm, which synchronizes logical clocks.

Lamport Timestamps
To synchronize logical clocks, Lamport defined a relation called "happens-before". The expression a → b is read "a happens before b" and means that all processes agree that first event a occurs, then afterward, event b occurs. The happens-before relation can be observed directly in two situations:

1. If a and b are events in the same process, and a occurs before b, then a → b is true.

2. If a is the event of a message being sent by one process, and b is the event of the message being received by another process, then a → b is also true.

A message cannot be received before it is sent, or even at the same time it is sent, since it takes a finite, nonzero amount of time to arrive. Happens-before is a transitive relation, so if a → b and b → c, then a → c. If two events, x and y, happen in different processes that do not exchange messages (not even indirectly via third parties), then x → y is not true, but neither is y → x. These events are said to be concurrent, which simply means that nothing can be said (or need be said) about when the events happened or which event happened first.

What we need is a way of measuring time such that for every event, a, we can assign it a time value C(a) on which all processes agree. These time values must have the property that if a → b, then C(a) < C(b). To rephrase the conditions we stated earlier, if a and b are two events within the same process and a occurs before b, then C(a) < C(b). Similarly, if a is the sending of a message by one process and b is the reception of that message by another process, then C(a) and C(b) must be assigned in such a way that everyone agrees on the values of C(a) and C(b) with C(a) < C(b). In addition, the clock time C must always go forward (increasing), never backward (decreasing). Corrections to time can be made by adding a positive value, never by subtracting one.

Global State
Determining global properties in a distributed system is often difficult, but crucial for some applications. For example, in distributed garbage collection, we need to be able to determine for some object whether it is referenced by any other objects in the system. Deadlock detection requires detection of cycles of processes infinitely waiting for each other. To detect the termination of a distributed algorithm we need to obtain simultaneous knowledge of all involved processes as well as take account of messages that may still traverse the network. In other words, it is not sufficient to check the activity of all processes: even if all processes appear to be passive, there may be messages in transit that, upon arrival, trigger further activity. In the following, we are concerned with determining stable global states or properties that, once they occur, will not disappear without outside intervention. For example, once an object is no longer referenced by any other object (i.e., it may be garbage collected), no reference to the object can appear at a later time.

Distributed Concurrency Control
Some of the issues encountered when looking at concurrency in distributed systems are familiar from the study of operating systems and multithreaded applications, in particular dealing with race conditions that occur when concurrent processes access shared resources. In non-distributed systems these problems are solved by implementing mutual exclusion using local primitives such as locks, semaphores, and monitors. In distributed systems, dealing with concurrency becomes more complicated due to the lack of directly shared resources (such as memory, CPU registers, etc.), the lack of a global clock, the lack of a single global program state, and the presence of communication delays.
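The mutual exclusion algorithm of Ricart & Agrawala below relies on Lamport clocks. As a reference point, here is a minimal sketch of the clock rules described above; the class name and the single-machine simplification are assumptions.

    // A minimal Lamport logical clock: tick on local events, attach the
    // clock value to outgoing messages, and advance past timestamps on receipt.
    class LamportClock {
        private long time = 0;

        // Rule 1: a local event (including a send) advances the clock.
        synchronized long tick() { return ++time; }

        // Rule 2: on receiving timestamp ts, jump past it so that
        // C(send) < C(receive) always holds.
        synchronized long onReceive(long ts) {
            time = Math.max(time, ts) + 1;
            return time;
        }
    }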

Distributed Mutual Exclusion
When concurrent access to distributed resources is required, we need to have mechanisms to prevent race conditions while processes are within critical sections. These mechanisms must fulfill the following three requirements:

Safety: at most one process may execute the critical section at a time.
Liveness: requests to enter and exit the critical section eventually succeed.
Ordering: requests are processed in happened-before ordering.

Method 1: Central Server


The simplest approach is to use a central server that controls the entering and exiting of critical sections. Processes must send requests to enter and exit a critical section to a lock server (or coordinator), which grants permission to enter by sending a token to the requesting process. Upon leaving the critical section, the token is returned to the server. Processes that wish to enter a critical section while another process is holding the token are put in a queue. When the token is returned the process at the head of the queue is given the token and allowed to enter the critical section. This scheme is easy to implement, but it does not scale well due to the central authority. Moreover, it is vulnerable to failure of the central server.

Method 2: Token RingMore sophisticated is a setup that organizes all processes in a logical ring structure, along which a token message is continuously forwarded. Before entering the critical section, a process has to wait until the token comes by and then retain the token until it exits the critical section. A disadvantage of this approach is that the ring imposes an average delay of N/2 hops, which again limits scalability. Moreover, the token messages consume bandwidth and failing nodes or channels can break the ring. Another problem is that failures may cause the token to be lost. In addition, if new processes join the network or wish to leave, further management logic is needed.

Method 3: Using Multicast and Logical Clocks
Ricart & Agrawala proposed an algorithm for distributed mutual exclusion that makes use of logical clocks. Each participating process pi maintains a Lamport clock and all processes must be able to communicate pairwise. At any moment, each process is in one of three states:

1. Released: outside of the critical section
2. Wanted: waiting to enter the critical section
3. Held: inside the critical section

If a process wants to enter a critical section, it multicasts a message and waits until it has received a reply from every other process. The processes operate as follows:

If a process is in the Released state, it immediately replies to any request to enter the critical section.
If a process is in the Held state, it delays replying until it is finished with the critical section.
If a process is in the Wanted state, it replies to a request immediately only if the requesting timestamp is smaller than the one in its own request.

The only hurdle to scalability is the use of multicasts (i.e., all processes have to be contacted in order to enter a critical section). More scalable variants of this algorithm require each individual process to only contact subsets of its peers when wanting to enter a critical section. Unfortunately, failure of any peer process can deny all other processes entry to the critical section.
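A sketch of the reply rule at the heart of this algorithm follows. Message transport, the multicast itself, and the Lamport clock are assumed to exist elsewhere; the tie-break on process identifiers for equal timestamps is the standard refinement of the algorithm, though the notes above do not spell it out.

    import java.util.ArrayDeque;
    import java.util.Queue;

    enum State { RELEASED, WANTED, HELD }

    class RicartAgrawalaProcess {
        State state = State.RELEASED;
        int myId;                 // this process's identifier
        long myRequestTimestamp;  // Lamport timestamp of our pending request
        final Queue<Integer> deferredReplies = new ArrayDeque<>();

        // Invoked when a request (ts, senderId) arrives from another process.
        void onRequest(long ts, int senderId) {
            boolean senderFirst = ts < myRequestTimestamp
                    || (ts == myRequestTimestamp && senderId < myId);
            if (state == State.HELD || (state == State.WANTED && !senderFirst)) {
                deferredReplies.add(senderId);   // reply after leaving the section
            } else {
                sendReply(senderId);             // Released, or the sender wins
            }
        }

        // Invoked when this process exits its critical section.
        void onExit() {
            state = State.RELEASED;
            while (!deferredReplies.isEmpty()) sendReply(deferredReplies.remove());
        }

        void sendReply(int toId) { /* network send omitted in this sketch */ }
    }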

TRANSACTIONS
A transaction can be regarded as a set of server operations that are guaranteed to appear atomic in the presence of multiple clients and partial failure. The concept of a transaction originates from the database community as a mechanism to maintain the consistency of databases. Transaction management is built around two basic operations:

Begin Transaction


End Transaction

An EndTransaction operation causes the whole transaction to either Commit or Abort. For this discussion, the operations performed in a transaction are Read and Write. Transactions have the ACID properties:

Atomic - All-or-nothing: once committed, the full transaction is performed; if aborted, there is no trace left.

Consistent - Concurrent transactions will not produce inconsistent results.
Isolated - Transactions do not interfere with each other, i.e., no intermediate state of a transaction is visible outside (this is also called the serialisable property).
Durable - The all-or-nothing property must hold even if the server or hardware fails.

TRANSACTION IMPLEMENTATION
Two general strategies exist for the implementation of transactions:

Private Workspace - All tentative operations are performed on a shadow copy of the server state, which is atomically swapped with the main copy on Commit or discarded on abort.

Write-ahead Log - Updates are performed in place, but all updates are logged and reverted when a transaction aborts.

Concurrency in Transactions
It is often necessary to allow transactions to occur simultaneously (for example, to allow multiple travel agents to simultaneously reserve seats on the same flight). Due to the consistency and isolation properties of transactions, concurrent transactions must not be allowed to interfere with each other. Concurrency control algorithms for transactions guarantee that multiple transactions can be executed simultaneously while providing a result that is the same as if they were executed one after another. A key concept when discussing concurrency control for transactions is the serialization of conflicting operations. Recall that conflicting operations are those operations that operate on the same data item and whose combined effect depends on the order they are executed in. We define a schedule of operations as an interleaving of the operations of concurrent transactions. A legal schedule is one that provides results that are the same as though the transactions were serialized (i.e., performed one after another). This leads to the concept of serial equivalence. A schedule is serially equivalent if all conflicting operations are performed in the same order on all data items. For example, given two transactions T1 and T2 in a serially equivalent schedule, of all the pairs of conflicting operations the first operation will be performed by T1 and the second by T2 (or vice versa: of all the pairs the first is performed by T2 and the second by T1). There are three types of concurrency control algorithms for transactions: those using locking, those using timestamps, and those using optimistic algorithms.

Locking
The locking algorithms require that each transaction obtains a lock from a scheduler process before performing a read or a write operation. The scheduler is responsible for granting and releasing locks in such a way that legal schedules are produced. The most widely used locking approach is two-phase locking (2PL). In this approach a lock for a data item is granted to a process if no conflicting locks are held by other processes (otherwise the process requesting the lock blocks until the lock is available again). A lock is held by a process until the operation it was requested for has been completed. Furthermore, once a process has released a lock, it can no longer request any new locks until its current transaction has been completed. This results in a growing phase of the transaction where locks are acquired and a shrinking phase where locks are released. While this approach results in legal schedules, it can also result in deadlock when conflicting locks are requested in reverse order. This problem can be solved either by detecting and breaking deadlocks or by adding timeouts to the locks (when a lock times out, the transaction holding the lock is aborted). Another problem is that 2PL can lead to cascaded aborts: if a transaction (T1) reads the results of a write of another transaction (T2) that is subsequently aborted, then the first transaction (T1) will also have to be aborted. The solution to this problem is called strict two-phase locking, which allows locks to be released only at commit or abort time.
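The following sketch shows the core of a strict 2PL lock manager: all locks are treated as exclusive, acquisition blocks until the lock is free, and everything is released only at commit or abort. The LockManager class is an illustration; a real scheduler would also distinguish read from write locks and handle deadlocks.

    import java.util.HashMap;
    import java.util.Map;

    class LockManager {
        // Maps each data item to the transaction currently holding its lock.
        private final Map<String, String> holder = new HashMap<>();

        // Growing phase: block until no conflicting lock is held, then acquire.
        synchronized void lock(String txn, String item) throws InterruptedException {
            while (holder.containsKey(item) && !txn.equals(holder.get(item)))
                wait();                       // conflicting lock held: block
            holder.put(item, txn);
        }

        // Strict 2PL shrinking phase: release everything at commit or abort.
        synchronized void releaseAll(String txn) {
            holder.values().removeIf(txn::equals);
            notifyAll();                      // wake transactions waiting on locks
        }
    }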

Timestamp Ordering
A different approach to creating legal schedules is to timestamp all operations and ensure that operations are ordered according to their timestamps. In this approach each transaction receives a unique timestamp and each operation receives its transaction's timestamp. Each data item also has three timestamps: the timestamp of the last committed write, the timestamp of the last read, and the timestamp of the last tentative (uncommitted) write. Before executing a write operation the scheduler ensures that the operation's timestamp is both greater than the data item's write timestamp and greater than or equal to the data item's read timestamp. For read operations the operation's timestamp must be greater than the data item's write timestamps (both committed and tentative). When scheduling conflicting operations the operation with the lower timestamp is always executed first.

Optimistic Control
Both locking and timestamping incur significant overhead. The optimistic approach to concurrency control assumes that no conflicts will occur, and therefore only tries to detect and resolve conflicts at commit time. In this approach a transaction is split into three phases: a working phase (using shadow copies), a validation phase, and an update phase. In the working phase operations are carried out on shadow copies with no attempt to detect or order conflicting operations. In the validation phase the scheduler attempts to detect conflicts with other transactions that were in progress during the working phase. If conflicts are detected, one of the conflicting transactions is aborted. In the update phase, assuming that the transaction was not aborted, all the updates made on the shadow copy are made permanent.

DISTRIBUTED TRANSACTIONS
In contrast to transactions in the sequential database world, transactions in a distributed setting are complicated because a single transaction will usually involve multiple servers. Multiple servers may involve multiple services and files stored on different servers. To ensure the atomicity of transactions, all servers involved must agree whether to Commit or Abort. Moreover, the use of multiple servers and services may require nested transactions, where a transaction is implemented by way of multiple other transactions, each of which can independently Commit or Abort.

Transactions that span multiple hosts include one host that acts as the coordinator, which is the host that handles the initial BeginTransaction. This coordinator maintains a list of workers, which are the other servers involved in the transaction. Each worker must be aware of the identity of the coordinator. The responsibility for ensuring the atomicity of the entire transaction lies with the coordinator, which needs to rely on a distributed commit protocol.

TWO PHASE COMMIT
This protocol ensures that a transaction commits only when all workers are ready to commit, which, for example, corresponds to validation in optimistic concurrency control. As a result a distributed commit protocol requires at least two phases:

1. Voting phase: all workers vote on commit; then the coordinator decides whether to commit or abort.

2. Completion phase: all workers commit or abort according to the decision of the coordinator.

This basic protocol is called two-phase commit (2PC).
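The decision logic at the coordinator can be sketched as follows. The Worker interface and its method names are illustrative; a production protocol also needs vote timeouts and a persistent log so the decision survives a coordinator crash.

    import java.util.List;

    interface Worker {
        boolean canCommit();   // phase 1: worker votes yes (true) or no (false)
        void doCommit();       // phase 2: make the transaction's effects permanent
        void doAbort();        // phase 2: roll the transaction back
    }

    class Coordinator {
        boolean runTwoPhaseCommit(List<Worker> workers) {
            // Voting phase: collect a vote from every worker.
            boolean allYes = true;
            for (Worker w : workers)
                if (!w.canCommit()) { allYes = false; break; }

            // Completion phase: commit only if every single vote was yes.
            for (Worker w : workers)
                if (allYes) w.doCommit(); else w.doAbort();
            return allYes;
        }
    }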

DISTRIBUTED NESTED TRANSACTIONS
Distributed nested transactions are realized by letting sub-transactions commit provisionally, whereby they report a provisional commit list, containing all provisionally committed sub-transactions, to the parent. If the parent aborts, it aborts all transactions on the provisional commit list. Otherwise, if the parent is ready to commit, it lets all sub-transactions commit. The actual transition from provisional to final commit needs to go via a 2PC protocol, as a worker may crash after it has already provisionally committed. Essentially, when a worker receives a CanCommit message, there are two alternatives:

If it has no recollection of the sub-transactions involved in the committing transaction, it votes abort, as it must have recently crashed.

Otherwise, it saves the information about the provisionally committed sub-transaction to a persistent store and votes yes.

COORDINATION ELECTIONS
Various algorithms require a set of peer processes to elect a leader or coordinator. In the presence of failure, it can be necessary to determine a new leader if the present one fails to respond. Provided that all processes have a unique identification number, leader election can be reduced to finding the non-crashed process with the highest identifier. Any algorithm to determine this process needs to meet the following two requirements:

Safety: A process either does not know the coordinator or it knows the identifier of the process with the largest identifier.

Liveness: Eventually, a process either crashes or knows the coordinator.

BULLY ALGORITHM
The following algorithm was proposed by Garcia-Molina and uses three types of messages:

Election: announces an election.
Answer: response to an Election message.
Coordinator: the elected coordinator announces itself.

A process begins an election when it notices through a timeout that the coordinator has failed, or when it receives an Election message. When starting an election, a process sends an Election message to all higher-numbered processes. If it receives no Answer within a predetermined time bound, the process that started the election decides that it must be the coordinator and sends a Coordinator message to all other processes. If an Answer arrives, the process that triggered the election waits a predetermined period of time for a Coordinator message. A process that receives an Election message can immediately announce that it is the coordinator if it knows that it is the highest-numbered process. Otherwise, it starts a sub-election by sending an Election message to the higher-numbered processes. This algorithm is called the bully algorithm because the highest-numbered process will always be the coordinator.
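The election-starting step can be sketched as below. Message delivery and timeout handling are assumed to live behind the sendElectionAndAwaitAnswer and announceCoordinator methods, which are placeholders rather than a real messaging API.

    class BullyProcess {
        final int myId;
        final int[] peerIds;   // identifiers of all peer processes

        BullyProcess(int myId, int[] peerIds) {
            this.myId = myId;
            this.peerIds = peerIds;
        }

        // Called on coordinator timeout or on receiving an Election message.
        void startElection() {
            boolean answered = false;
            for (int id : peerIds)
                if (id > myId)
                    answered |= sendElectionAndAwaitAnswer(id);
            if (!answered)
                announceCoordinator();   // no higher process is alive: bully wins
            // otherwise wait for a Coordinator message; restart on timeout
        }

        // Placeholder: send Election to process id; true if it Answers in time.
        boolean sendElectionAndAwaitAnswer(int id) { return false; }

        // Placeholder: broadcast Coordinator(myId) to all processes.
        void announceCoordinator() { }
    }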

RING ALGORITHM
An alternative to the bully algorithm is to use a ring algorithm. In this approach all processes are ordered in a logical ring and each process knows the structure of the ring. There are only two types of messages involved: Election and Coordinator. A process starts an election when it notices that the current coordinator has failed (e.g., because requests to it have timed out). An election is started by sending an Election message to the first neighbour on the ring. The Election message contains the node's process identifier and is forwarded around the ring, with each process adding its own identifier to the message. When the Election message reaches the originator, the election is complete. Based on the contents of the message, the originator process determines the highest-numbered process and sends out a Coordinator message specifying this process as the winner of the election.

ADVANCED OPERATING SYSTEMS
Workstation Models

Workstation-server model: The workstation's processor performance and memory capacity determine the largest task that can be performed on behalf of the user. A fixed amount of dedicated computing power with guaranteed response time.

Diskless workstation: low-cost computers with a processor, memory, and a network interface. Cheap: a few large disks are cheaper than many little ones. Ease of maintenance: centralized software installation. Flexibility: access from any node. Example: an X terminal running X11 server software.

Disk workstations can include: paging and temporary files; system binaries (software installations are broadcast to all machines that are up, or when they come up); explicit caches (a local working copy is copied back centrally when completed); a local file system (loss of transparency).

Processor pool model: dynamically allocates processors to users. A centralized file server and processor pool serve graphics workstations or dumb terminals; processors are assigned to users as needed. Supports incremental growth. Example uses: makes, simulations, processing-intensive applications.

Hybrid Model: personal workstation plus processor pool. More expensive, but a simple design: fast interactive response comes from the workstation, while heavy computing is performed using the processor pool.

Load Distribution: Transfer load from heavily loaded computers to idle or lightly loaded computers.


Load Balancing: Equalize load at all computers. Goal: Minimize response time or maximize CPU utilization. Design Issues: Deterministic versus heuristic (=dynamic):

Transfer Policy: When does a node become a sender? Use thresholds; swap load information with other machines (e.g. periodically).

Selection Policy: How does the sender choose a process to transfer? Prefer newly originated processes with the least transfer overhead: small and location independent.

Location Policy: Which node should be the target receiver? Polling is used to determine processor load:
      Solution 1: count the number of processes on each machine (running or in the ready state).
      Solution 2: an idle process or a periodic interrupt determines the amount of time the processor is busy. Goal: the fraction of time the CPU is busy.

PROCESS MIGRATION
Move a process already in progress to a remote site. Motivation: Load sharing: move work from a heavily to a lightly loaded system to improve performance. Communications performance: move the process to the data to minimize communications overhead. Availability: survive scheduled downtime. Utilize special capabilities: take advantage of unique h/w or s/w on a particular node. Commonly, the owner returns to the workstation; alternatively, lower the priority of the foreign process. The migration steps are: select a target machine; send part of the process image and open-file information; the receiving kernel forks a child with the passed information; the new process pulls over data, environment, register/stack information, and modified program text (the rest of the program is demand paged); the new process sends a migration-completed message; the old process destroys itself.

Characteristics of a real-time O.S.: fast process/thread switching; small size with minimum functionality; responds to external interrupts quickly; minimizes intervals during which interrupts are disabled; supports multitasking with inter-process communication tools (semaphores, signals, events); accumulates data in sequential files at a fast rate; uses preemptive scheduling based on priority; pauses/resumes tasks for fixed intervals using clocks/timers; supports special alarms and timeouts.

Multiprocessor Scheduling
Effects of scheduling in multiprocessors: in a multiprogrammed system, single applications run better, and traditional priority, FCFS, and round-robin algorithms matter less because other processes can be served by other processors. In a multithreaded system, threads run faster if scheduled together. Application speedup on a multiprocessor often exceeds expectations because threads share disk caches and share code (e.g., compiler code).

Classes of Multiprocessor OS
Separate Supervisor: each processor has its own copy of the kernel, data structures, I/O devices, and file systems, with minimal shared data structures (e.g. for semaphores). Disadvantage: it is difficult to perform parallel execution of a single task, and the per-processor replication is inefficient.

Master Slave: the master assigns work to the slaves. The master runs the O.S.; the slaves run applications. Advantage: a simplified O.S. Disadvantage: the master can fail or become a bottleneck.


Symmetric Multi-Processor (SMP): all processors are treated equally and have identical h/w capabilities; all h/w is available to all processors; one copy of the kernel can be executed by all processors concurrently.
Floating Master: the O.S. is treated as a critical section, and only one processor can be active in it at a time (one processor allowed in each segment of the O.S.). Advantage: most flexible. Disadvantage: most difficult to implement.

FAULT TOLERANCE AND FAILURE MODELS
A characteristic feature of distributed systems that distinguishes them from single-machine systems is the notion of partial failure. A partial failure may happen when one component in a distributed system fails. This failure may affect the proper operation of other components, while at the same time leaving yet other components totally unaffected. In contrast, a failure in non-distributed systems is often total in the sense that it affects all components, and may easily bring down the entire application. An important goal in distributed systems design is to construct the system in such a way that it can automatically recover from partial failures without seriously affecting the overall performance. In particular, whenever a failure occurs, the distributed system should continue to operate in an acceptable way while repairs are being made; that is, it should tolerate faults and continue to operate to some extent even in their presence.

BASIC CONCEPTS
To understand the role of fault tolerance in distributed systems we first need to take a closer look at what it actually means for a distributed system to tolerate faults. Being fault tolerant is strongly related to what are called "dependable" systems. Dependability is a term that covers a number of useful requirements for distributed systems, including the following.

Availability - is defined as the property that a system is ready to be used immediately. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is one that will most likely be working at a given instant in time.

Reliability - refers to the property that a system can run continuously without failure. In contrast to availability, reliability is defined in terms of a time interval instead of an instant in time. A highly reliable system is one that will most likely continue to work without interruption during a relatively long period of time. This is a subtle but important difference when compared to availability. If a system goes down for one millisecond every hour, it has an availability of over 99.9999 percent, but is still highly unreliable. Similarly, a system that never crashes but is shut down for two weeks every August has high reliability but only 96 percent availability. The two are not the same.

Safety - refers to the situation that when a system temporarily fails to operate correctly, nothing catastrophic happens. For example, many process control systems, such as those used for controlling nuclear power plants or sending people into space, are required to provide a high degree of safety. If such control systems temporarily fail for only a very brief moment, the effects could be disastrous. Many examples from the past (and probably many more yet to come) show how hard it is to build safe systems.

Maintainability - refers to how easily a failed system can be repaired. A highly maintainable system may also show a high degree of availability, especially if failures can be detected and repaired automatically. However, as we shall see later in this chapter, automatically recovering from failures is easier said than done. Often, dependable systems are also required to provide a high degree of security, especially when it comes to issues such as integrity. In particular, if a distributed system is designed to provide its users with a number of services, the system has failed when one or more of those services cannot be (completely) provided.

A distinction is made between preventing, removing, and forecasting faults. For our purposes, the most important issue is “fault tolerance”, meaning that a system can provide its services even in the presence of faults. Faults are generally classified as transient, intermittent, or permanent. “Transient faults” occur once and then disappear. If the operation is repeated, the fault goes away. A bird flying through the beam of a microwave transmitter may cause lost bits on some network (not to mention a roasted bird). If the transmission times out and is retried, it will probably work the second time.

An intermittent fault occurs, then vanishes of its own accord, then reappears, and so on. A loose contact on a connector will often cause an intermittent fault. Intermittent faults cause a great deal of aggravation because they are difficult to diagnose. Typically, whenever the fault doctor shows up, the system works fine.

A permanent fault is one that continues to exist until the faulty component is repaired. Burnt-out chips, software bugs, and disk head crashes are examples of permanent faults.

FAILURE MODELS
A system that fails is not adequately providing the services it was designed for. If we consider a distributed system as a collection of servers that communicate with each other and with their clients, not adequately providing services means that servers, communication channels, or possibly both, are not doing what they are supposed to do. However, a malfunctioning server itself may not always be the fault we are looking for. If such a server depends on other servers to adequately provide its services, the cause of an error may need to be searched for somewhere else.

Such dependency relations appear in abundance in distributed systems. A failing disk may make life difficult for a file server that is designed to provide a highly available file system. If such a file server is part of a distributed database, the proper working of the entire database may be at stake, as only part of its data may actually be accessible. To get a better grasp on how serious a failure actually is, several classification schemes have been developed. One such scheme is shown in the figure below.


Figure: Different types of failures

A crash failure occurs when a server prematurely halts, but was working correctly until it stopped. An important aspect with crash failures is that once the server has halted, nothing is heard from it anymore. A typical example of a crash failure is an operating system that comes to a grinding halt, and for which there is only one solution: reboot. Many personal computer systems suffer from crash failures so often that people have come to expect them to be normal. In this sense, moving the reset button from the back of a cabinet to the front was done for good reason. Perhaps one day it can be moved to the back again, or even removed altogether.

An omission failure occurs when a server fails to respond to a request. Several things might go wrong. In the case of a receive omission failure, the server perhaps never got the request in the first place. Note that it may well be the case that the connection between a client and a server has been correctly established, but that there was no thread listening to incoming requests. Also, a receive omission failure will generally not affect the current state of the server, as the server is unaware of any message sent to it.

State transition failure. This kind of failure happens when the server reacts unexpectedly to an incoming request. For example, if a server receives a message it cannot recognize, a state transition failure happens if no measures have been taken to handle such messages. In particular, a faulty server may incorrectly take default actions it should never have initiated.

Arbitrary / Byzantine failures. In effect, when arbitrary failures occur, clients should be prepared for the worst. In particular, it may happen that a server is producing output it should never have produced, but which cannot be detected as being incorrect. Worse yet, a faulty server may even be maliciously working together with other servers to produce intentionally wrong answers. This situation illustrates why security is also considered an important requirement when talking about dependable systems.

Failure Masking by Redundancy - If a system is to be fault tolerant, the best it can do is to try to hide the occurrence of failures from other processes. The key technique for masking faults is to use redundancy. Three kinds are possible: information redundancy, time redundancy, and physical redundancy. With information redundancy, extra bits are added to allow recovery from garbled bits. For example, a Hamming code can be added to transmitted data to recover from noise on the transmission line. With time redundancy, an action is performed and then, if need be, performed again; retransmitting a request after a timeout is an example. With physical redundancy, extra equipment or processes are added to make it possible for the system as a whole to tolerate the loss or malfunctioning of some components. Physical redundancy can thus be done either in hardware or in software.
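
As a concrete illustration of information redundancy, the sketch below implements the classic Hamming(7,4) code: three parity bits are added to four data bits so that any single garbled bit can be located and corrected. The function names are invented for the example.

    # Hamming(7,4): information redundancy that corrects one garbled bit.

    def hamming_encode(d):
        """d is a list of 4 data bits; returns the 7-bit codeword."""
        d3, d5, d6, d7 = d
        p1 = d3 ^ d5 ^ d7          # parity over positions 1,3,5,7
        p2 = d3 ^ d6 ^ d7          # parity over positions 2,3,6,7
        p4 = d5 ^ d6 ^ d7          # parity over positions 4,5,6,7
        return [p1, p2, d3, p4, d5, d6, d7]

    def hamming_decode(c):
        """Correct a single-bit error in codeword c; return the 4 data bits."""
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s4   # 0 = clean, else 1-based error position
        if syndrome:
            c[syndrome - 1] ^= 1          # flip the corrupted bit back
        return [c[2], c[4], c[5], c[6]]

    word = hamming_encode([1, 0, 1, 1])
    word[4] ^= 1                           # simulate one bit garbled in transit
    assert hamming_decode(word) == [1, 0, 1, 1]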

Process Resilience - Now that the basic issues of fault tolerance have been discussed, let us concentrate on how fault tolerance can actually be achieved in distributed systems. The first topic we discuss is protection against process failures, which is achieved by replicating processes into groups.

Design Issues - The key approach to tolerating a faulty process is to organize several identical processes into a group. The key property that all groups have is that when a message is sent to the group itself, all members of the group receive it. In this way, if one process in a group fails, hopefully some other process can take over for it. Process groups may be dynamic. New groups can be created and old groups can be destroyed. A process can join a group or leave one during system operation. A process can be a member of several groups at the same time. Consequently, mechanisms are needed for managing groups and group membership. Groups are roughly analogous to social organizations.

Agreement in Faulty Systems - Before considering the case of faulty processes, let us look at the “easy” case of perfect processes but where communication lines can lose messages. There is a famous problem, known as the two-army problem, which illustrates the difficulty of getting even two perfect processes to reach agreement about 1 bit of information.

CLIENT-SERVER COMMUNICATION
Reliable Client-Server Communication
In many cases, fault tolerance in distributed systems concentrates on faulty processes. However, we also need to consider communication failures. Most of the failure models discussed previously apply equally well to communication channels. In particular, a communication channel may exhibit crash, omission, timing, and arbitrary failures. In practice, when building reliable communication channels, the focus is on masking crash and omission failures. Arbitrary failures may occur in the form of duplicate messages, resulting from the fact that in a computer network messages may be buffered for a relatively long time, and are re-injected into the network after the original sender has already issued a retransmission.

Point-to-Point Communication
In many distributed systems, reliable point-to-point communication is established by making use of a reliable transport protocol, such as TCP. TCP masks omission failures, which occur in the form of lost messages, by using acknowledgements and retransmissions. Such failures are completely hidden from a TCP client. However, crash failures of connections are often not masked. A crash failure may occur when, for whatever reason, a TCP connection is abruptly broken so that no more messages can be transmitted through the channel. In most cases, the client is informed that the channel has crashed by raising an exception. The only way to mask such failures is to let the distributed system attempt to automatically set up a new connection.

RPC Semantics in the Presence of Failures
Let us now take a closer look at client-server communication when using high-level communication facilities such as Remote Procedure Calls (RPCs) or Remote Method Invocations (RMIs). In the following pages, we focus on RPCs, but the discussion is equally applicable to communication with remote objects. The goal of RPC is to hide communication by making remote procedure calls look just like local ones. With a few exceptions, so far we have come fairly close. Indeed, as long as both client and server are functioning perfectly, RPC does its job well. The problem comes about when errors occur. It is then that the differences between local and remote calls are not always easy to mask. To structure our discussion, let us distinguish between five different classes of failures that can occur in RPC systems, as follows:

1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.

Each of these categories poses different problems and requires different solutions.

Client Cannot Locate the Server
To start with, it can happen that the client cannot locate a suitable server. The server might be down, for example. Alternatively, suppose that the client is compiled using a particular version of the client stub, and the binary is not used for a considerable period of time. In the meantime, the server evolves and a new version of the interface is installed; new stubs are generated and put into use. When the client is finally run, the binder will be unable to match it up with a server and will report failure. While this mechanism is used to protect the client from accidentally trying to talk to a server that may not agree with it in terms of what parameters are required or what it is supposed to do, the problem remains of how this failure should be dealt with.

Lost Request Messages - The second item on the list is dealing with lost request messages. This is the easiest one to deal with: just have the operating system or client stub start a timer when sending the request. If the timer expires before a reply or acknowledgement comes back, the message is sent again. If the message was truly lost, the server will not be able to tell the difference between the retransmission and the original, and everything will work fine. Unless, of course, so many request messages are lost that the client gives up and falsely concludes that the server is down, in which case we are back to "Cannot locate server." If the request was not lost, the only thing we need to do is let the server be able to detect it is dealing with a retransmission. Unfortunately, doing so is not so simple, as we explain when discussing lost replies.

Server Crashes - The next failure on the list is a server crash. Assume that the server crashes and subsequently recovers. It announces to all clients that it has just crashed but is now up and running again. The problem is that the client does not know whether its request to print some text will actually be carried out. There are four strategies the client can follow. First, the client can decide to never reissue a request, at the risk that the text will not be printed. Second, it can decide to always reissue a request, but this may lead to its text being printed twice. Third, it can decide to reissue a request only if it did not yet receive an acknowledgement that its print request had been delivered to the server. In that case, the client is counting on the fact that the server crashed before the print request could be delivered. The fourth and last strategy is to reissue a request only if it has received an acknowledgement for the print request. With two strategies for the server, and four for the client, there are a total of eight combinations to consider. Unfortunately, no combination is satisfactory. To explain, note that there are three events that can happen at the server: send the completion message (M), print the text (P), and crash (C). These events can occur in six different orderings:

1. MPC: A crash occurs after sending the completion message and printing the text.
2. MC(P): A crash happens after sending the completion message, but before the text could be printed.
3. PMC: A crash occurs after printing the text and sending the completion message.
4. PC(M): The text was printed, after which a crash occurs before the completion message could be sent.
5. C(PM): A crash happens before the server could do anything.
6. C(MP): A crash happens before the server could do anything.

Lost Reply Messages
Lost replies can also be difficult to deal with. The obvious solution is just to rely on a timer again that has been set by the client's operating system. If no reply is forthcoming within a reasonable period, just send the request once more. The trouble with this solution is that the client is not really sure why there was no answer. Did the request or reply get lost, or is the server merely slow? It may make a difference. In particular, some operations can safely be repeated as often as necessary with no damage being done. A request such as asking for the first 1024 bytes of a file has no side effects and can be executed as often as necessary without any harm being done. A request that has this property is said to be idempotent. Now consider a request to a banking server asking to transfer a million dollars from one account to another. If the request arrives and is carried out, but the reply is lost, the client will not know this and will retransmit the message. The bank server will interpret this request as a new one, and will carry it out too. Two million dollars will be transferred. Heaven forbid that the reply is lost 10 times. Transferring money is not idempotent. One way of solving this problem is to try to structure all requests in an idempotent way.
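
Another common remedy is to have the server remember request identifiers and replay the cached reply for a retransmission. The sketch below illustrates this for the bank-transfer example; the class, field, and account names are invented for the example.

    # Making a non-idempotent transfer safe against retransmission by
    # tagging each request with a unique identifier (deduplication).

    import uuid

    class BankServer:
        def __init__(self):
            self.balances = {"alice": 100, "bob": 0}
            self.seen = {}                      # request-id -> cached reply

        def transfer(self, request_id, src, dst, amount):
            if request_id in self.seen:         # retransmission: replay the
                return self.seen[request_id]    # old reply, move no money
            self.balances[src] -= amount
            self.balances[dst] += amount
            reply = "ok: moved %d" % amount
            self.seen[request_id] = reply
            return reply

    server = BankServer()
    rid = str(uuid.uuid4())                     # client attaches a fresh id
    server.transfer(rid, "alice", "bob", 25)
    server.transfer(rid, "alice", "bob", 25)    # lost-reply retry: harmless
    assert server.balances == {"alice": 75, "bob": 25}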

Client Crashes
The final item on the list of failures is the client crash. What happens if a client sends a request to a server to do some work and crashes before the server replies? At this point a computation is active and no parent is waiting for the result. Such an unwanted computation is called an orphan. Orphans can cause a variety of problems. As a bare minimum, they waste CPU cycles. They can also lock files or otherwise tie up valuable resources. Finally, if the client reboots and does the RPC again, but the reply from the orphan comes back immediately afterward, confusion can result.

DISTRIBUTED COMMIT


The atomic multicasting problem discussed in the previous section is an example of a more general problem, known as distributed commit. The distributed commit problem involves having an operation being performed by each member of a process group, or none at all. In the case of reliable multicasting, the operation is the delivery of a message. With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction. Distributed commit is often established by means of a coordinator. In a simple scheme, this coordinator tells all other processes that are also involved, called participants, whether or not to (locally) perform the operation in question. This scheme is referred to as a one-phase commit protocol. It has the obvious drawback that if one of the participants cannot actually perform the operation, there is no way to tell the coordinator. For example, in the case of distributed transactions, a local commit may not be possible because this would violate concurrency control constraints. In practice, more sophisticated schemes are needed, the most common one being the two-phase commit protocol, which is discussed in detail below. The main drawback of this protocol is that it cannot efficiently handle the failure of the coordinator.

TWO-PHASE COMMIT
The original two-phase commit protocol (2PC) is due to Gray (1978). Without loss of generality, consider a distributed transaction involving the participation of a number of processes each running on a different machine. Assuming that no failures occur, the protocol consists of the following two phases, each consisting of two steps:

1. The coordinator sends a VOTE_REQUEST message to all participants.
2. When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator, telling the coordinator that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message.
3. The coordinator collects all votes from the participants. If all participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all participants. However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicasts a GLOBAL_ABORT message.
4. Each participant that voted for a commit waits for the final reaction by the coordinator. If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction. Otherwise, when receiving a GLOBAL_ABORT message, the transaction is locally aborted as well.
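
A minimal, failure-free sketch of these two phases, with participants modelled as plain Python objects; the class and method names are invented for the example, and no crashes or timeouts are simulated.

    # Two-phase commit, happy path only: vote collection, then decision.

    class Participant:
        def __init__(self, ok=True):
            self.ok = ok                       # whether a local commit is possible

        def vote(self):
            return "VOTE_COMMIT" if self.ok else "VOTE_ABORT"

        def do_commit(self):
            self.state = "committed"

        def do_abort(self):
            self.state = "aborted"

    def two_phase_commit(participants):
        # Phase 1 (voting): coordinator sends VOTE_REQUEST, collects votes.
        votes = [p.vote() for p in participants]
        # Phase 2 (decision): GLOBAL_COMMIT only if every vote was commit.
        if all(v == "VOTE_COMMIT" for v in votes):
            for p in participants:
                p.do_commit()
            return "committed"
        for p in participants:
            p.do_abort()
        return "aborted"

    assert two_phase_commit([Participant(), Participant()]) == "committed"
    assert two_phase_commit([Participant(), Participant(ok=False)]) == "aborted"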

THREE-PHASE COMMIT
A problem with the two-phase commit protocol is that when the coordinator has crashed, participants may not be able to reach a final decision. Consequently, participants may need to remain blocked until the coordinator recovers. Skeen (1981) developed a variant of 2PC, called the three-phase commit protocol (3PC), that avoids blocking processes in the presence of fail-stop crashes. Although 3PC is widely referred to in the literature, it is not applied often in practice, as the conditions under which 2PC blocks rarely occur. We discuss the protocol, as it provides further insight into solving fault-tolerance problems in distributed systems.


RECOVERY
So far, we have mainly concentrated on algorithms that allow us to tolerate faults. However, once a failure has occurred, it is essential that the process where the failure happened can recover to a correct state. In what follows, we first concentrate on what it actually means to recover to a correct state, and subsequently when and how the state of a distributed system can be recorded and recovered to, by means of checkpointing and message logging. Fundamental to fault tolerance is the recovery from an error. Recall that an error is that part of a system that may lead to a failure. The whole idea of error recovery is to replace an erroneous state with an error-free state. There are essentially two forms of error recovery.

In backward recovery, the main issue is to bring the system from its present erroneous state back into a previously correct state. To do so, it will be necessary to record the system's state from time to time, and to restore such a recorded state when things go wrong. Each time (part of) the system's present state is recorded, a checkpoint is said to be made.

In forward recovery, when the system has just entered an erroneous state, instead of moving back to a previous, checkpointed state, an attempt is made to bring the system into a correct new state from which it can continue to execute. The main problem with forward error recovery mechanisms is that it has to be known in advance which errors may occur. Only in that case is it possible to correct those errors and move to a new state.
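
Backward recovery can be sketched very simply: record the state, and roll back to the recording when an error is detected. The file name and state layout below are illustrative; real systems checkpoint far more carefully (and coordinate checkpoints across processes).

    # Toy backward recovery: checkpoint the state, roll back on error.

    import pickle

    CHECKPOINT = "service.ckpt"          # illustrative file name

    def take_checkpoint(state):
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)        # record the current (correct) state

    def recover():
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)        # restore the last recorded state

    state = {"jobs_done": 0}
    take_checkpoint(state)
    try:
        state["jobs_done"] += 1
        raise RuntimeError("simulated failure")  # erroneous state detected
    except RuntimeError:
        state = recover()                # backward recovery: roll back
    assert state == {"jobs_done": 0}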

STABLE STORAGE
To be able to recover to a previous state, it is necessary that the information needed to enable recovery is safely stored. Safely in this context means that recovery information survives process crashes and site failures, but possibly also various storage media failures. Stable storage plays an important role when it comes to recovery in distributed systems. Stable storage can be implemented with a pair of ordinary disks. Storage comes in three categories:

First, there is RAM memory, which is wiped out when power fails or a machine crashes. Next is disk storage, which survives CPU failures but which can be lost in disk head crashes. Finally, there is stable storage, which is designed to survive anything except major calamities such as floods and earthquakes.
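
The two-disk idea can be sketched with two files standing in for the disks: write and verify the first copy before touching the second, so that at any instant at least one copy is good. This is only a hedged sketch; real implementations add per-block checksums to tell a good copy from a bad one, which is only hinted at in a comment here.

    # Toy stable storage over a pair of "disks" (ordinary files).

    import os

    def stable_write(data):
        for disk in ("disk1.img", "disk2.img"):
            with open(disk, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())     # force the block out to the medium
            with open(disk, "rb") as f:  # verify before moving to the next copy
                assert f.read() == data

    def stable_read():
        with open("disk1.img", "rb") as f:
            d1 = f.read()
        with open("disk2.img", "rb") as f:
            d2 = f.read()
        if d1 != d2:                     # crash hit between the two writes:
            stable_write(d1)             # copy 1 was verified first, so it wins
        return d1                        # (real systems use checksums to decide)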

SECURITY IN DISTRIBUTED SYSTEMS
Security is by no means the least important principle. However, one could argue that it is one of the most difficult principles, as security needs to be pervasive throughout a system. A single design flaw with respect to security may render all security measures useless. We concentrate on the various mechanisms that are generally incorporated in distributed systems to support security. The security in distributed systems can roughly be divided into two parts. One part concerns the communication between users or processes, possibly residing on different machines. The principal mechanism for ensuring secure communication is that of a secure channel; secure channels provide, more specifically, authentication, message integrity, and confidentiality.

The other part concerns authorization, which deals with ensuring that a process gets only those access rights to the resources in a distributed system that it is entitled to. Authorization is covered in a separate section dealing with access control. In addition to traditional access control mechanisms, we also focus on access control when we have to deal with mobile code such as agents. Secure channels and access control require mechanisms to hand out cryptographic keys, but also mechanisms to add and remove users from a system. These topics are covered by what is known as security management.

SECURITY THREATS, POLICIES, AND MECHANISMS
Security in computer systems is strongly related to the notion of dependability. Informally, a dependable computer system is one that we justifiably trust to deliver its services. Dependability includes availability, reliability, safety, and maintainability. However, if we are to put our trust in a computer system, then confidentiality and integrity should also be taken into account. Confidentiality refers to the property of a computer system whereby its information is disclosed only to authorized parties.

Integrity is the characteristic that alterations to a system’s assets can be made only in an authorized way. In other words, improper alterations in a secure computer system should be detectable and recoverable. Major assets of any computer system are its hardware, software, and data. Another way of looking at security in computer systems is that we attempt to protect the services and data it offers against security threats. There are four types of security threats to consider

1. Interception refers to the situation that an unauthorized party has gained access to a service or data. A typical example of interception is where communication between two parties has been overheard by someone else. Interception also happens when data are illegally copied, for example, after breaking into a person’s private directory in a file system.

2. Interruption is when a file is corrupted or lost. In general, interruption refers to the situation in which services or data become unavailable, unusable, destroyed, and so on. In this sense, denial-of-service attacks, by which someone maliciously attempts to make a service inaccessible to other parties, are a security threat that classifies as interruption.

3. Modifications involve unauthorized changing of data or tampering with a service so that it no longer adheres to its original specifications. Examples of modifications include intercepting and subsequently changing transmitted data, tampering with database entries, and changing a program so that it secretly logs the activities of its user.

4. Fabrication refers to the situation in which additional data or activity are generated that would normally not exist. For example, an intruder may attempt to add an entry into a password file or database. Likewise, it is sometimes possible to break into a system by replaying previously sent messages.

Note that interruption, modification, and fabrication can each be seen as a form of data falsification. Simply stating that a system should be able to protect itself against all possible security threats is not the way to actually build a secure system. What is first needed is a description of security requirements, that is, a security policy. A security policy describes precisely which actions the entities in a system are allowed to take and which ones are prohibited. Entities include users, services, data, machines, and so on. Once a security policy has been laid down, it becomes possible to concentrate on the security mechanisms by which a policy can be enforced.

SECURITY MECHANISMS
1. Encryption is fundamental to computer security. Encryption transforms data into something an attacker cannot understand. In other words, encryption provides a means to implement confidentiality. In addition, encryption allows us to check whether data have been modified. It thus also provides support for integrity checks.

2. Authentication is used to verify the claimed identity of a user, client, server, and so on. In the case of clients, the basic premise is that before a service will do work for a client, the service must learn the client’s identity. Typically, users are authenticated by means of passwords, but there are many other ways to authenticate clients.

3. Authorization - After a client has been authenticated, it is necessary to check whether that client is authorized to perform the action requested. Access to records in a medical database is a typical example. Depending on who accesses the database, permission may be granted to read records, to modify certain fields in a record, or to add or remove a record.

4. Auditing - tools are used to trace which clients accessed what, and in which way. Although auditing does not really provide any protection against security threats, audit logs can be extremely useful for the analysis of a security breach, and subsequently for taking measures against intruders. For this reason, attackers are generally keen not to leave any traces that could eventually lead to exposing their identity. In this sense, logging accesses sometimes makes attacking a riskier business.

Design Issues - A distributed system, or any computer system for that matter, must provide security services by which a wide range of security policies can be implemented. There are a number of important design issues that need to be taken into account when implementing general-purpose security services. In the following pages, we discuss three of these issues: focus of control, layering of security mechanisms, and simplicity.

Focus of Control - When considering the protection of a (possibly distributed) application, there are essentially three different approaches that can be followed


Distribution of Security Mechanisms
Dependencies between services regarding trust lead to the notion of a Trusted Computing Base (TCB). A TCB is the set of all security mechanisms in a (distributed) computer system that are needed to enforce a security policy. The smaller the TCB, the better. If a distributed system is built as middleware on an existing network operating system, its security may depend on the security of the underlying local operating systems. In other words, the TCB in a distributed system may include the local operating systems at various hosts. Consider a file server in a distributed file system. Such a server may need to rely on the various protection mechanisms offered by its local operating system. Such mechanisms include not only those for protecting files against accesses by processes other than the file server, but also mechanisms to protect the file server from being maliciously brought down.

Middleware-based distributed systems thus require trust in the existing local operating systems they depend on. If such trust does not exist, then part of the functionality of the local operating systems may need to be incorporated into the distributed system itself. Consider a microkernel operating system, in which most operating-system services run as normal user processes. In this case, the file system, for instance, can be entirely replaced by one tailored to the specific needs of a distributed system, including its various security measures. Consistent with this approach is to separate security services from other types of services by distributing services across different machines depending on the required security. For example, for a secure distributed file system, it may be possible to isolate the file server from clients by placing the server on a machine with a trusted operating system, possibly running a dedicated secure file system. Clients and their applications are placed on untrusted machines.

This separation effectively reduces the TCB to a relatively small number of machines and software components. By subsequently protecting those machines against security attacks from the outside, overall trust in the security of the distributed system can be increased. Preventing clients and their applications from having direct access to critical services is the approach followed in Reduced Interfaces for Secure System Components (RISSC), as described in (Neumann, 1995). In the RISSC approach, any security-critical server is placed on a separate machine isolated from end-user systems using low-level secure network interfaces.

Simplicity - Another important design issue related to deciding in which layer to place a security mechanism is that of simplicity. Designing a secure computer system is generally considered a difficult task. Consequently, if a system designer can use a few simple mechanisms that are easily understood and trusted to work, so much the better.

Cryptography - Fundamental to security in distributed systems is the use of cryptographic techniques. The basic idea of applying these techniques is simple. Consider a sender S wanting to transmit message m to a receiver R. To protect the message against security threats, the sender first encrypts it into an unintelligible message m', and subsequently sends m' to R. R, in turn, must decrypt the received message into its original form. Encryption and decryption are accomplished by using cryptographic methods parameterized by keys.

INTRUDERS AND EAVESDROPPERS IN COMMUNICATION


To describe the various security protocols that are used in building security services for distributed systems, it is useful to have a notation to relate plaintext, cipher text, and keys. Following the common notational conventions, we will use C = EK(P) to denote that the cipher text C is obtained by encrypting the plaintext P using key K. Likewise, P = DK(C) is used to express the decryption of the cipher text C using key K, resulting in the plaintext P.
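
The notation can be made concrete with a toy cipher. The XOR stream below is deliberately trivial and insecure; it only demonstrates that DK(EK(P)) = P, not how a real cipher works.

    # Toy illustration of C = EK(P) and P = DK(C). NOT a secure cipher.

    def E(key, plaintext):
        # XOR every plaintext byte with the key, repeating the key as needed.
        return bytes(p ^ key[i % len(key)] for i, p in enumerate(plaintext))

    D = E      # with XOR, decryption is the same operation as encryption

    K = b"secret"
    P = b"attack at dawn"
    C = E(K, P)             # C = EK(P): unintelligible without the key
    assert D(K, C) == P     # P = DK(C): the original plaintext is recovered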

First, an intruder may intercept the message without either the sender or receiver being aware that eavesdropping is happening. Of course, if the transmitted message has been encrypted in such a way that it cannot be easily decrypted without having the proper key, interception is useless: the intruder will see only unintelligible data.

The second type of attack that needs to be dealt with is that of modifying the message. Modifying plaintext is easy; modifying cipher text that has been properly encrypted is much more difficult because the intruder will first have to decrypt the message before it can meaningfully modify it. In addition, he will also have to properly encrypt it again or otherwise the receiver may notice that the message has been tampered with.

The third type of attack is when an intruder inserts encrypted messages into the communication system, attempting to make R believe these messages came from S. Again, encryption can help protect against such attacks. Note that if an intruder can modify messages, he can also insert messages. There is a fundamental distinction between different cryptographic systems, based on whether or not the encryption and decryption key are the same.

In a symmetric cryptosystem, the same key is used to encrypt and decrypt a message. In other words, P = DK(EK(P)). Symmetric cryptosystems are also referred to as secret-key or shared-key systems, because the sender and receiver are required to share the same key, and to ensure that protection works, this shared key must be kept secret; no one else is allowed to see the key. We will use the notation KAB to denote a key shared by A and B.

In an asymmetric cryptosystem, the keys for encryption and decryption are different, but together form a unique pair. In other words, there is a separate key KE for encryption and one for decryption, KD, such that P = DKD(EKE(P)). One of the keys in an asymmetric cryptosystem is kept private, the other is made public. For this reason, asymmetric cryptosystems are also referred to as public-key systems. In what follows, we use the notation K+X to denote the public key belonging to principal X, and K-X as its corresponding private key.

Attacks
Passive attacks are mainly based on observation without altering data or compromising services; they represent the interception and interruption forms of security threats. The simplest form of attack is browsing, which implies the nondestructive examination of all accessible data. This leads to the need for confidentiality and the need-to-know principle. Related is the leaking of information via authorized accomplices, which leads to the confinement problem. More indirect are attempts to infer information from traffic analysis, code breaking, and so on. In contrast, active attacks alter or delete data and may cause service to be denied to authorized users. They represent the modification and fabrication forms of security threats. Typical active attacks attempt to modify or destroy files. Communication-related active attacks attempt to modify the data sent over a communication channel.

ATTACKS ON THE COMMUNICATION CHANNEL
Because of its networked nature, the communication channel presents a particularly important vulnerability in distributed systems. As such, many of the threats faced by distributed systems come in the form of attacks on their communication channels. We distinguish between five different types of attacks on communication channels.

Eavesdropping - Attacks involve obtaining copies of messages without authorization. This could, for example, involve sniffing passwords being sent over the network.

Masquerading - Attacks involve sending or receiving messages using the identity of another principal without their authority. In a typical masquerading attack the attacker sends messages to the victim with the headers modified so that it looks like the messages are being sent by a trusted third party.

Message Tampering - Attacks involve the interception and modification of messages so that they have a different effect than what was originally intended. One form of the message tampering attack is called the man-in-the-middle attack. In this attack the attacker intercepts the first message in an exchange of keys and is able to establish a secure channel with both the original sender and intended receiver. By placing itself in the middle the attacker can view and modify all communication over that channel.

Replay - Attacks involve resending intercepted messages at a later time in order to cause an action to be repeated. This kind of attack can be effective even if communication is authenticated and encrypted.

Denial of Service - Attacks involve the flooding of a channel with messages so that access is denied to others.

AUTHENTICATION
Authentication involves verifying the claimed identity of an entity (or principal). Authentication requires a representation of identity (i.e., some way to represent a principal's identity, such as a user name, a bank account, etc.) and some way to verify that identity (e.g., a password, a passport, a PIN, etc.). Depending on the system's requirements, different strengths of authentication may be required. For example, in some cases it is enough to simply present a user id, while in other cases a certificate signed by a trusted authority may be required to prove a principal's identity. A comprehensive logic of authentication has been developed by Lampson et al.

A verified identity is represented by a credential. A certificate signed by a trusted authority stating that the bearer of the certificate has been successfully authenticated is an example of a credential. A credential has the property that it speaks for a principal. In some cases it is necessary for more than one principal to authorize an action; in that case multiple credentials can be combined to speak for those principals. A credential can also be made to represent a role (e.g., a system administrator) rather than an individual. Roles can be assigned to specific principals as needed.

Authentication Based on a Shared Secret Key - This protocol no longer works: it can easily be defeated by what is known as a reflection attack.


The Reflection Attack - The second approach improves on this problem by storing all keys at a key distribution centre (KDC). The KDC stores a copy of each entity’s secret key, and can, therefore, communicate securely with each entity.

The Principle of Using a KDC - A drawback of the KDC approach is that it requires a centralized and trusted service (the KDC).

Using a Ticket and Letting Alice Set Up a Connection to Bob - The third approach makes use of public keys to securely authenticate a principal. By sending a message encrypted with its private key, a principal can prove its identity to the authenticator. A problem with this approach is that the authenticator must have the principal's public key (and trust that it does indeed belong to that principal).

Mutual Authentication in a Public-Key Cryptosystem - A different approach combines the public-key and shared secret key approaches. In this approach (which is used by the secure shell (ssh) protocol), two parties first establish a secure channel by exchanging a session key encrypted using their public keys, and then exchange their authentication information over this secure channel.
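
The flavour of shared-key challenge-response authentication can be sketched with HMAC, a standard keyed hash from the Python standard library. This is an illustration of the idea, not the exact protocol from the figures referred to above; the names are invented for the example.

    # Challenge-response with a shared secret key: prove knowledge of the
    # key without ever sending the key itself over the channel.

    import hmac, hashlib, secrets

    K_AB = secrets.token_bytes(32)          # key shared by Alice and Bob

    def respond(key, challenge):
        return hmac.new(key, challenge, hashlib.sha256).digest()

    # Alice challenges Bob with a fresh nonce (freshness defeats replay)...
    challenge_a = secrets.token_bytes(16)
    bob_reply = respond(K_AB, challenge_a)
    assert hmac.compare_digest(bob_reply, respond(K_AB, challenge_a))

    # ...and Bob must then issue his *own* fresh challenge to authenticate
    # Alice; letting a party reuse the other side's challenge is exactly
    # what the reflection attack mentioned above exploits.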

Protection (and Authorisation and Access Control) - Once a principal has been authenticated it is necessary to determine what actions that principal is allowed to perform and to enforce any restrictions placed on it. Restricting actions and enforcing those restrictions allows resources to be protected against abuse.

PROTECTION SYSTEM
Evaluate the implementations based on these design considerations:

Propagation of Rights: Can someone act as an agent’s proxy? That is, can one subject’s access rights be delegated to another subject?

Restriction of Rights: Can a subject propagate a subset of their rights (as opposed to all of their rights)?

Amplification of Rights: Can an unprivileged subject perform some privileged operations (i.e., (temporarily) extend their protection domain)?

Revocation of Rights: Can a right, once granted, be removed from a subject?
Determination of Object Accessibility: Who has which access rights on a particular object?
Determination of a Subject's Protection Domain: What is the set of objects that a particular subject can access?

Implementation of the Access Matrix
An efficient representation of the access matrix can be achieved by representing it either by column or by row. A column-wise representation that associates access rights with objects is called an access control list (ACL). Row-wise representations that associate access rights with subjects are based on capabilities.

Access Control Lists
Each object may be associated with an access control list (ACL), which corresponds to one column of the access matrix, and is represented as a list of subject-rights pairs. When a subject tries to access an object, the set of rights associated with that subject is used to determine whether access should be granted. This comparison of the accessing subject's identity with the subjects mentioned in the ACL requires prior authentication of the subject. ACL-based systems usually support a concept of group rights (granted to each agent belonging to the group) or domain classes. The properties of ACLs, with respect to the previously listed design considerations, are as follows:

Propagation: the owner of an object can add to or modify entries in the ACL.
Restriction: anyone who has the right to modify the ACL can restrict access.
Amplification: ACL entries can include protected invocation rights (e.g., setuid).
Revocation: access rights can be revoked by removing or modifying ACL entries.
Object Accessibility: the ACL itself represents an object's accessibility.
Protection Domain: hard (if not impossible) to determine, because the ACLs of all objects in the system must be inspected.
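
A toy ACL implementation makes several of these properties visible. The class, subject, and right names are invented for the example, and authentication of the subject is assumed to have already happened.

    # Each object carries one column of the access matrix as its ACL.

    class ProtectedObject:
        def __init__(self, owner):
            self.acl = {owner: {"read", "write", "own"}}  # owner: full rights

        def grant(self, granter, subject, rights):
            if "own" not in self.acl.get(granter, set()):
                raise PermissionError("only the owner may modify the ACL")
            self.acl.setdefault(subject, set()).update(rights)  # propagation

        def revoke(self, granter, subject):
            if "own" not in self.acl.get(granter, set()):
                raise PermissionError("only the owner may modify the ACL")
            self.acl.pop(subject, None)         # revocation: drop the entry

        def check(self, subject, right):
            return right in self.acl.get(subject, set())

    doc = ProtectedObject(owner="alice")
    doc.grant("alice", "bob", {"read"})         # restriction: a subset of rights
    assert doc.check("bob", "read") and not doc.check("bob", "write")
    doc.revoke("alice", "bob")
    assert not doc.check("bob", "read")

Note how revocation and object accessibility are easy here (the ACL is right there on the object), while computing bob's full protection domain would require scanning every object in the system, exactly as the property list above states.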

FIREWALLS
A different form of protection that can be employed in distributed systems is that offered by firewalls. A firewall is generally used when communicating with external untrusted clients and servers, and serves to disconnect parts of the system from the outside world, allowing inbound (and possibly outbound) communication only on predefined ports. Besides simply blocking communication, firewalls can also inspect incoming (or outgoing) communication and filter out suspicious messages. The two main types of firewalls are packet-filtering and application-level firewalls. Packet-filtering firewalls work at the packet level, filtering network packets based on the contents of their headers.

Application-level firewalls, on the other hand, filter messages based on their contents. They are capable of spotting and filtering malicious content arriving over otherwise innocuous communication channels (e.g., virus filtering email gateways).
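
The packet-filtering idea can be illustrated with a toy rule table; the addresses, ports, and rule format below are made up for the example, and real firewalls match on many more header fields.

    # Toy packet filter: decide from header fields only, first match wins.

    RULES = [
        # (source prefix, destination port, action)
        ("10.0.",         22, "accept"),   # internal hosts may ssh in
        ("",              22, "drop"),     # everyone else may not
        ("",              80, "accept"),   # the web server is public
    ]

    def filter_packet(src_ip, dst_port):
        for prefix, port, action in RULES:
            if src_ip.startswith(prefix) and dst_port == port:
                return action
        return "drop"                      # default-deny for unmatched traffic

    assert filter_packet("10.0.3.7", 22) == "accept"
    assert filter_packet("198.51.100.9", 22) == "drop"

An application-level firewall, by contrast, would look past these headers into the message body itself, as the virus-filtering mail gateway example above suggests.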
