Distributed Systems Notes

1 DISTRIBUTED SYSTEMS

1.1 Distributed System

A distributed system is a collection of independent computers that appears to its

users as a single coherent system.

A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages.

Characteristics of distributed System:

1. Programs are executed concurrently

2. There is no global time

3. Components can fail independently (isolation, crash)

By running distributed system software, the computers are able to:

Coordinate their activities

Share resources: hardware, software, data.

1.2 Examples of Distributed Systems

The Internet:

Collection of computer networks

Enables programs to communicate over arbitrary distance

Makes available services

– mail, file transfer, documents, telephony, ...

Communication via message passing according to Internet protocols

(IP, UDP, TCP, ICMP, SMTP, FTP, ...)

A backbone is a network link with high transmission capacity, employing satellite connections, fibre optic cables and other high-bandwidth circuits.

Infrastructure: backbones, routing, naming

Extensible (new services, new protocols)

Open communication channels (security!)

Technology applicable to other distributed systems

Multimedia services such as music, radio and TV, and video conferencing are available on the Internet.


Figure 1.1 A typical portion of the Internet

Intranets

– a single authority

– protected access

a firewall

total isolation

– may be worldwide

– typical services:

• infrastructure services

file service, name service

• application services


Mobile and ubiquitous Computing

Mobile: computing devices are being carried around

Ubiquitous: little computing devices are all over the place

- Portable devices

– laptops

– handheld devices: personal digital assistants (PDAs), mobile phones, pagers, video cameras and digital cameras

– wearable devices: smart watches with functionality similar to a PDA

– devices embedded in appliances such as washing machines, hi-fi systems, cars and refrigerators

Difference between mobile and ubiquitous computing:

Ubiquitous computing can benefit users while they remain in a single environment such as a home or hospital.

Mobile computing has advantages even when it involves only conventional devices such as laptops and printers.


Figure 1.3 Portable and handheld devices in a distributed system

1.3 Resource Sharing and the Web

• Hardware resources (reduce costs)

• Data resources (shared usage of information)

• Service resources

– search engines

– computer-supported cooperative working

• Service vs. server (node or process )

(Finnish terms: palvelu = service, palvelin = server, palvelija = servant)

Figure 1.4 Web servers and web browsers


Advantages of Distributed Systems

Performance: very often a collection of processors can provide higher performance (and better

price/performance ratio) than a centralized computer.

Distribution: many applications involve, by their nature, spatially separated machines (banking,

commercial, automotive system).

Reliability (fault tolerance): if some of the machines crash, the system can survive.

Incremental growth: as requirements on processing power grow, new machines can be added

incrementally.

Sharing of data/resources: shared data is essential to many applications (banking, computer

supported cooperative work, reservation systems); other resources can be also

shared (e.g. expensive printers).

Communication: facilitates human-to-human communication.

Disadvantages of Distributed Systems

Difficulties of developing distributed software:

What should operating systems, programming languages and applications look like?

Networking problems:

Several problems are created by the network infrastructure, which have to be dealt with: loss of

messages, overloading.

Security problems: Sharing generates the problem of data security.

1.4 Challenges

1. Heterogeneity

2. Openness

3. Security

4. Scalability

5. Failure handling

6. Concurrency

7. Transparency


Heterogeneity appears at several levels:

Network (Ethernet, token ring, ISDN,...)

Computing hardware (data representation)

Operating systems (different APIs to protocols)

Programming languages (data structures, APIs)

Applications by different developers (data exchange standards)

Middleware:

Software layer which abstracts from the above providing a uniform computational model

(CORBA, Java RMI, ODBC, Web Services, ...)

Openness

The degree to which a computer system can be extended and re-implemented.

IEEE = Institute of Electrical and Electronics Engineers

e.g., IEEE 802.11 WLAN, IEEE 802.3 Ethernet

W3C = World Wide Web Consortium

e.g., HTML Recommendations

Security

The resources are accessible to authorized users and used in the way they are intended.

Confidentiality

Protection against disclosure to unauthorized individuals.

E.g. ACLs (access control lists) to provide authorized access to information.

Integrity

Protection against alteration or corruption.

E.g. changing the account number or amount value in a money order

Availability

Protection against interference targeting access to the resources.

E.g. denial of service (DoS, DDoS) attacks

Non-repudiation

Proof of sending or receiving information

E.g. digital signature

Security Mechanism:

Encryption

E.g. Blowfish, RSA

Authentication

E.g. password, public key authentication


Authorization

E.g. access control lists

Scalability

System should work efficiently at many different scales, ranging from a small Intranet to

the Internet.

Remain effective when there is a significant increase in the number of resources and the

number of users.

Challenges of designing scalable distributed systems:

Cost of physical resources

Cost should linearly increase with system size

Performance Loss

For example, in hierarchically structured data, search performance loss due to data growth should not be beyond O(log n), where n is the size of data.

Preventing software resources running out:

Numbers used to represent Internet addresses (32 bits in IPv4, extended to 128 bits in IPv6); Y2K-like problems.

Avoiding performance bottlenecks:

Use decentralized algorithms (e.g., the move from a centralized name table to the decentralized DNS)
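The O(log n) bound mentioned above for hierarchically structured data can be made concrete with a small experiment. The following is a minimal sketch, assuming a sorted list searched by binary search; the function name `search_steps` and the data sizes are illustrative and do not come from these notes.

```python
def search_steps(sorted_items, target):
    """Binary search that also reports how many probes it needed."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return True, steps
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, steps


# Worst case: the target is absent, so the search runs until the range is empty.
small = list(range(1_000))
large = list(range(1_000_000))
_, steps_small = search_steps(small, -1)
_, steps_large = search_steps(large, -1)
```

Growing the data a thousandfold roughly doubles the number of probes, which is the kind of behaviour a scalable design aims for.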

Failure handling

Failure: an offered service no longer complies with its specification

Fault: cause of a failure (e.g. failure of a component)

Fault tolerance: no failure despite faults

Fault Tolerance mechanism:

Fault detection: checksums, heartbeats, …

Fault masking: retransmission of corrupt messages, redundancy, …

Fault toleration: exception handling, timeouts, …

Fault recovery: rollback mechanisms, …


Redundancy:

Services remain available despite failures through the use of redundant components.

Ex:

Two different routes between any two routers in the Internet.

In the Domain Name System, names are replicated in at least two servers.

Data replicated on several servers.

Concurrency

Shared access to resources must be possible.

This becomes a problem when two or more parties access the same resource at the same time.

Transparency

Transparency hides the separation/distribution of components from the user and the application programmer, so that the system is perceived as a whole rather than as a collection of independent components.

The ISO Reference Model for Open Distributed Processing (ODP) identifies a number of forms of transparency.


Software Architecture:


The same, looking at two distributed nodes:

1.5 System Models

Systems that are intended for use in real-world environments should be designed to function

correctly in the widest possible range of circumstances and in the face of many possible

difficulties and threats.

Resources in a distributed system are shared between users. They are normally encapsulated

within one of the computers and can be accessed from other computers by communication.

• Each resource is managed by a program, the resource manager; it offers a communication

interface enabling the resource to be accessed by its users.

• Resource managers can be in general modeled as processes.

If the system is designed according to an object oriented methodology, resources are

encapsulated in objects.

Difficulties and threats for distributed systems

–Widely varying modes of use.

–Wide range of system environments


–Internal problems: non-synchronized clocks, conflicting data updates, many modes of hardware

and software failure involving the individual components of a system.

Different types of Models

Architectural Models:

An architectural model defines the way in which the components of a system interact with one another and the way in which they are mapped onto an underlying network of computers.

Fundamental models

Fundamental models help to reveal key problems for the designers of distributed systems. Their purpose is to specify the design issues, difficulties and threats that must be resolved in order to develop distributed systems that fulfill their tasks correctly, reliably and securely. The fundamental models provide abstract views of just those characteristics of distributed systems that affect their dependability: correctness, reliability and security.

1.6 Architectural models

The architecture of a system is its structure in terms of separately specified components.

The architectural design of a building has similar aspects - it determines not only its appearance but also its general structure and architectural style (gothic, neo-classical, modern) and provides a consistent frame of reference for the design.

An architectural model of a distributed system first simplifies and abstracts the functions

of the individual components of a distributed system and then it considers:

o the placement of the components across a network of computers - seeking to define useful patterns for the distribution of data and workload;

o the interrelationships between the components, that is, their functional roles and the patterns of communication between them.

An initial simplification is achieved by classifying processes as server processes, client processes and peer processes.

This classification of processes identifies the responsibilities of each and hence helps us

to assess their workloads and to determine the impact of failures in each of them.


The results of this analysis can then be used to specify the placement of the processes in

a manner that meets performance and reliability goals for the resulting system.

Some more dynamic systems can be built as variations on the client-server model:

o The possibility of moving code from one process to another allows a process to

delegate tasks to another process: for example, clients can download code from

servers and run it locally. Objects and the code that accesses them can be moved

to reduce access delays and minimize communication traffic.

o Some distributed systems are designed to enable computers and other mobile

devices to be added or removed seamlessly, allowing them to discover the

available services and to offer their services to others.

Software layers

The term software architecture referred originally to the structuring of software as layers

or modules in a single computer and more recently in terms of services offered and

requested between processes located in the same or different computers.

A server is a process that accepts requests from other processes. A distributed service can

be provided by one or more server processes, interacting with each other and with client

processes in order to maintain a consistent system-wide view of the service's resources.

For example, a network time service is implemented on the Internet based on the

Network Time Protocol (NTP) by server processes running on hosts throughout the

Internet that supply the current time to any client that requests it and adjust their version

of the current time as a result of interactions with each other.

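Figure: software and hardware service layers, from top to bottom: applications and services; middleware; operating system; computer and network hardware. The operating system and the computer and network hardware together form the platform.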


The figure introduces the important terms platform and middleware, which we define as follows:

Platform:

The lowest-level hardware and software layers are often referred to as a platform for

distributed systems and applications. These low-level layers provide services to the layers above

them, which are implemented independently in each computer, bringing the system's

programming interface up to a level that facilitates communication and coordination between

processes.

Example: Intel x86/Windows, Sun SPARC/SunOS, Intel x86/Solaris, PowerPC/MacOS, Intel

x86/Linux.

Middleware :

Middleware is a layer of software whose purpose is to mask heterogeneity and to provide a

convenient programming model to application programmers. Middleware is represented

by processes or objects in a set of computers that interact with each other to implement

communication and resource sharing support for distributed applications.

Middleware is concerned with providing useful building blocks for the construction of

software components that can work with one another in a distributed system.

Middleware can also provide services for use by application programs. They are

infrastructural services, tightly bound to the distributed programming model that the

middleware provides.

For example, CORBA offers a variety of services that provide applications with facilities,

which include naming, security, transactions, persistent storage and event notification.

Limitations of middleware:

Many distributed applications rely entirely on the services provided by the available

middleware to support their needs for communication and data sharing.

For example, an application that is suited to the client-server model, such as a database of names and addresses, can rely on middleware that provides only remote method invocation.

Much has been achieved in simplifying the programming of distributed systems through the development of middleware support, but some aspects of the dependability of systems require support at the application level.

System Architectures

The main types of architectural model are

Client-server model

Peer to Peer

Client-server model

The system is structured as a set of processes, called servers, that offer services to the users,

called clients.

• The client-server model is usually based on a simple request/reply protocol,

implemented with send/receive primitives or using remote procedure calls (RPC) or

remote method invocation (RMI):

- the client sends a request (invocation) message to the server asking for some service;

- the server does the work and returns a result (e.g. the data requested) or an error

code if the work could not be performed.

A server can itself request services from other servers; thus, in this new relation, the

server itself acts like a client.
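The request/reply exchange described above can be sketched in a few lines. This is a minimal illustration, assuming a toy "add two integers" service and a connected socket pair inside one program in place of a real network; the names `serve_once` and `call` and the `OK`/`ERROR` reply format are hypothetical.

```python
import socket
import threading


def serve_once(conn):
    """Server side: read one request, do the work, reply with a result or an error code."""
    with conn:
        request = conn.recv(1024).decode()
        try:
            a, b = (int(x) for x in request.split())
            reply = f"OK {a + b}"
        except ValueError:
            # Malformed request: return an error code instead of a result.
            reply = "ERROR bad request"
        conn.sendall(reply.encode())


def call(request):
    """Client side: send a request message and block until the reply arrives."""
    client_end, server_end = socket.socketpair()
    worker = threading.Thread(target=serve_once, args=(server_end,))
    worker.start()
    with client_end:
        client_end.sendall(request.encode())
        reply = client_end.recv(1024).decode()
    worker.join()
    return reply


result = call("2 3")
failure = call("two three")
```

RPC and RMI middleware hide exactly this send/receive pair behind what looks like an ordinary procedure or method call.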

Clients invoke individual servers


Peer-to-Peer

All processes (objects) play a similar role.

• Processes (objects) interact without particular distinction between clients and servers.

• The pattern of communication depends on the particular application.

• A large number of data objects are shared; any individual computer holds only a small part of

the application database.

• Processing and communication loads for access to objects are distributed across many

computers and access links.

• This is the most general and flexible model.

Some problems with client-server:

• Centralisation of service leads to poor scaling

- Limitations:

capacity of server

bandwidth of network connecting the server

Peer-to-Peer tries to solve some of the above


Problems with peer-to-peer:

• High complexity due to the need to

- cleverly place individual objects

- retrieve the objects

- maintain a potentially large number of replicas.

Variations of the Basic Models

Client-server and peer-to-peer can be considered as basic models.

• Several variations have been proposed, considering factors such as:

- multiple servers and caches

- mobile code and mobile agents

- low-cost computers at the users' side

- mobile devices

Services provided by multiple servers

Services may be implemented as several server processes in separated host computers interacting

as necessary to provide a service to client processes.

The servers may partition the set of objects on which the service is based and distribute them

between themselves, or they may maintain replicated copies of them on several hosts.
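The two options just described, partitioning the objects between servers or replicating them on several hosts, can be sketched as follows. This is one possible scheme (hash partitioning with replicas on the next servers in the list), assumed here for illustration; the server names and the functions `owner` and `replicas` are hypothetical.

```python
import hashlib

SERVERS = ["server-a", "server-b", "server-c"]  # hypothetical host names


def owner(object_id, servers=SERVERS):
    """Partitioning: map each object to exactly one server by hashing its identifier."""
    digest = hashlib.sha256(object_id.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def replicas(object_id, copies=2, servers=SERVERS):
    """Replication: the owner plus the next servers in the list hold copies."""
    start = servers.index(owner(object_id, servers))
    return [servers[(start + i) % len(servers)] for i in range(copies)]
```

With this naive modulo scheme, adding or removing a server remaps most objects; practical systems often use more elaborate placement schemes for that reason.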


Proxy Server

A proxy server provides copies (replications) of resources which are managed by other servers.

Proxy servers are typically used as caches for web resources. They maintain a cache of recently

visited web pages or other resources.

When a request is issued by a client, the proxy server is checked first to see whether the requested object (information item) is available there.

Proxy servers can be located at each client, or can be shared by several clients.

The purpose is to increase performance and availability, by avoiding frequent accesses to

remote servers.
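The caching behaviour of a proxy server can be sketched as below. This is a minimal in-memory model, not a real HTTP proxy: the classes `OriginServer` and `CachingProxy` are hypothetical stand-ins, and the sketch assumes cached pages never change, so there is no expiry or invalidation.

```python
class OriginServer:
    """Stand-in for a remote web server; counts how often it is contacted."""

    def __init__(self, pages):
        self.pages = pages
        self.requests = 0

    def get(self, url):
        self.requests += 1
        return self.pages.get(url)


class CachingProxy:
    """Answers from its cache when it can, otherwise forwards to the origin."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, url):
        if url in self.cache:            # hit: no remote access needed
            return self.cache[url]
        page = self.origin.get(url)      # miss: fetch from the remote server
        if page is not None:
            self.cache[url] = page
        return page


origin = OriginServer({"/index.html": "<h1>home</h1>"})
proxy = CachingProxy(origin)
first = proxy.get("/index.html")
second = proxy.get("/index.html")
```

The second request is served from the cache, so the origin server is contacted only once.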

Mobile Code

Mobile code: code that is sent from one computer to another and run at the destination.

Advantage: remote invocations are replaced by local ones.

Typical example: Java applets.


Mobile Agents

Mobile agent: a running program that travels from one computer to another carrying out a task on

someone's behalf.

• A mobile agent is a complete program, code + data that can work (relatively) independently.

• The mobile agent can invoke local resources/data.

Typical tasks:

• Collect information

• Install/maintain software on computers

• Compare prices from various vendors by visiting their sites.

Network Computers

Network computers do not store operating system or application code locally. All code is loaded

from the servers and run locally on the network computer.

Advantages:

• The network computer can be simpler, with limited capacity; it does not need even a local hard

disk (if there exists one it is used to cache data or code).

• Users can log in from any computer.

• No user effort for software management/ administration.

Thin Clients

The thin client is a further step, beyond the network computer:

• Thin clients do not download code (operating system or application) from the server to run it

locally. All code is run on the server, in parallel for several clients.

• The thin client only runs the user interface!

Advantages:

• All those of network computers but the computer at the user side is even simpler (cheaper).

Strong servers are needed!

Mobile Devices

Mobile devices are hardware, computing components that move (together with their software)

between physical locations.

• This is opposed to software agents, which are software components that migrate.

• Both clients and servers can be mobile (clients more frequently).


Particular problems/issues:

• Mobility transparency: clients should not be aware if the server moves (e.g., the server keeps its

Internet address even if it moves between networks).

• Problems due to variable connectivity and bandwidth.

• The device has to explore its environment:

- Spontaneous interoperation: associations between devices (e.g. clients and servers) are

dynamically created and destroyed.

- Context awareness: available services are dependent on the physical environment in which the

device is situated.

Design requirements for distributed architectures

Performance issues

Use of caching and replication

Dependability issues

Performance issues

Responsiveness

–Users of interactive applications require a fast and consistent response to interaction.

Throughput

–The rate at which computational work is done.

Quality of service

–The ability to meet the deadlines that users need.

Balancing computer loads

–In some cases load balancing may involve moving partially-completed work as the loads on hosts change.

Use of caching and replication

The performance issues often appear to be major obstacles to the successful deployment of

DS, but much progress has been made in the design of systems that overcome them by the

use of data replication and caching.

Dependability issues

The dependability of computer systems is defined as correctness, security and fault tolerance.

–Fault tolerance: reliability is achieved through redundancy.


–Security: the architectural impact of the requirement for security concerns the need to locate

sensitive data and other resources only in computers that can be effectively secured against

attack.

1.7 Fundamental Models

Interaction model

Failure model

Security model

Interaction model

Performance of communication channels

Computer clocks and timing events

Two variants of the interaction model

Agreement in Pepperland

Event ordering

Performance of communication channels

Communication performance is often a limiting characteristic.

The delay between the sending of a message by one process and its receipt by

another is referred to as latency.

Bandwidth is the total amount of information that can be transmitted over a channel in a given time.

Jitter is the variation in the time taken to deliver a series of messages.
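The three channel characteristics can be combined in a back-of-the-envelope calculation. The sketch below assumes the common simplification that delivery time is latency plus message size divided by bandwidth, and it measures jitter as the standard deviation of delivery times; both choices, and the function names, are assumptions made for illustration.

```python
from statistics import pstdev


def transfer_time(latency_s, size_bits, bandwidth_bps):
    """Approximate delivery time: fixed latency plus time to push the bits through."""
    return latency_s + size_bits / bandwidth_bps


def jitter(delivery_times_s):
    """Jitter, taken here as the population standard deviation of delivery times."""
    return pstdev(delivery_times_s)


# A 1 MB (8,000,000 bit) message over a 100 Mbit/s link with 5 ms latency.
t = transfer_time(latency_s=0.005, size_bits=8_000_000, bandwidth_bps=100_000_000)

steady = jitter([0.020, 0.020, 0.020])     # identical delays: no jitter
variable = jitter([0.010, 0.020, 0.030])   # varying delays: non-zero jitter
```

For small messages the latency term dominates; for large ones the bandwidth term does.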

Computer clocks and timing events

It is impossible to maintain a single global notion of time.

There are several approaches to correcting the times on computer clocks, for example using time signals from GPS.

Two variants of the interaction model

–Synchronous distributed system

–The time to execute each step of a process has known lower and upper bounds.

–Each message transmitted over a channel is received within a known bounded time

–Each process has a local clock whose drift rate from real time has a known bound.

–Asynchronous distributed system

–No bound on process execution speeds

–No bound on message transmission delays

–No bound on clock drift rates.


Agreement in Pepperland

•The Pepperland divisions need to agree on which of them will lead the charge against the Blue Meanies, and when the charge will take place.

•In asynchronous Pepperland, the messengers are very variable in their speed.

•In synchronous Pepperland, the divisions know some useful constraints: every message takes at least min minutes and at most max minutes to arrive.

•The leading division sends a message 'charge!', then waits for min minutes, then it charges.

•The other division charges on receipt of the message; its charge is guaranteed to be no earlier than the leading division's, and no more than (max-min) minutes after it.
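The timing argument can be checked with a small simulation. This sketch assumes the follower charges the moment the message arrives and that the message delay lies within the known bounds; the function `charge_times` and the values of `MIN` and `MAX` are illustrative.

```python
def charge_times(delay, min_delay, max_delay, send_time=0.0):
    """Return (leader_charge, follower_charge) for one run of the protocol.

    The leader sends 'charge!' at send_time, waits min_delay, then charges.
    The follower charges as soon as the message arrives.
    """
    if not min_delay <= delay <= max_delay:
        raise ValueError("delay outside the synchronous bounds")
    leader = send_time + min_delay
    follower = send_time + delay
    return leader, follower


MIN, MAX = 5, 12  # minutes; illustrative values

gaps = []
for delay in range(MIN, MAX + 1):
    leader, follower = charge_times(delay, MIN, MAX)
    gaps.append(follower - leader)
```

Every gap falls between 0 and MAX - MIN, which is the bound stated above. Without the known bounds on message delay, no such guarantee could be given.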

Event ordering

–In many cases, we are interested in knowing whether an event (sending or

receiving a message) at one process occurred before, after or concurrently with

another event at another process. The execution of a system can be described in

terms of events and their ordering despite the lack of accurate clocks.

Example (figure): processes X, Y, Z and A exchange the messages m1, m2 and m3. Each process's send and receive events are plotted against physical time, with the instants t1, t2 and t3 marked.
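One well-known way to order events without accurate clocks is Lamport's logical clock, a technique these notes do not name but which illustrates the idea. The following is a minimal sketch, assuming two processes X and Y that exchange messages m1 and m2; the class and method names are hypothetical.

```python
class Process:
    """A process with a Lamport logical clock (a counter, not a real clock)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def send(self):
        # Sending is an event: advance the clock and stamp the message with it.
        self.clock += 1
        return self.clock

    def receive(self, msg_timestamp):
        # Receiving must be ordered after the matching send.
        self.clock = max(self.clock, msg_timestamp) + 1
        return self.clock


x, y = Process("X"), Process("Y")
t_send_m1 = x.send()                # X sends m1
t_recv_m1 = y.receive(t_send_m1)    # Y receives m1
t_send_m2 = y.send()                # Y replies with m2
t_recv_m2 = x.receive(t_send_m2)    # X receives m2
```

The timestamps respect the order in which the events causally happened, even though neither process consulted a physical clock.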


Failure model

Omission failures

A processor or communication channel fails to perform actions it is supposed to

do. This means that the particular action is not performed!

• We do not have an omission fault if:

- An action is delayed (regardless of how long) but finally executed.

- An action is executed with an erroneous result.

With synchronous systems, omission faults can be detected by timeouts.

• If we are sure that messages arrive, a timeout will indicate that the sending process

has crashed. Such a system has a fail-stop behaviour.
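Timeout-based detection can be sketched as follows. The example assumes each process periodically announces itself with a heartbeat message and that times are passed in explicitly rather than read from a real clock; the class `TimeoutDetector` and its methods are hypothetical.

```python
class TimeoutDetector:
    """Suspects a process once nothing has been heard from it for max_delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.last_heard = {}

    def heartbeat(self, process, now):
        """Record that a message from `process` arrived at time `now`."""
        self.last_heard[process] = now

    def suspected(self, now):
        """Processes whose last message is older than the timeout."""
        return sorted(
            p for p, t in self.last_heard.items() if now - t > self.max_delay
        )


detector = TimeoutDetector(max_delay=3.0)
detector.heartbeat("p", now=0.0)
detector.heartbeat("q", now=0.0)
detector.heartbeat("p", now=2.0)   # p is still alive; q has gone quiet
crashed = detector.suspected(now=4.0)
```

In a synchronous system with reliable delivery the suspicion is conclusive. In an asynchronous system the same timeout only indicates that a process might have failed, since the message may simply be slow.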

Arbitrary failures

–This is the most general and worst possible fault semantics.

–Intended processing steps or communications are omitted and/or unintended

ones are executed.

–Results may not come at all or may come but carry wrong values.

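Figure: processes and channels. Process p performs send m, which places message m in its outgoing message buffer; the communication channel carries it to the incoming message buffer of process q, which performs receive.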


Timing failures

o Timing faults can occur in synchronous distributed systems, where time limits are set on process execution, communication and clock drift.

o A timing fault occurs if any of these time limits is exceeded.

Masking failures

Each component in a distributed system is generally constructed from a collection of other components. It is possible to construct reliable services from components that exhibit failures.

A service masks a failure either by hiding it altogether or by converting it into a more acceptable type of failure.

E.g. checksums are used to mask corrupted messages, effectively converting an arbitrary failure into an omission failure.
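The checksum example can be made concrete. This is a minimal sketch, assuming a CRC-32 checksum appended to each message (one possible choice of checksum); the functions `frame` and `deliver` are hypothetical names.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Sender: append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def deliver(framed: bytes):
    """Receiver: return the payload, or None if the checksum does not match.

    Dropping a corrupted message turns an arbitrary failure (wrong contents)
    into an omission failure (no message at all).
    """
    payload, received = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != received:
        return None
    return payload


good = frame(b"transfer 100")
corrupted = bytes([good[0] ^ 0x01]) + good[1:]   # flip one bit in transit
```

The omission can then be masked in turn by retransmission, which is how reliable services are layered on unreliable components. A CRC guards against accidental corruption only, not against deliberate tampering.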

Reliability of one to one communication

A basic communication channel can exhibit omission failures. It is possible to use it to build a communication service that masks some of those failures.


The term reliable communication is defined in terms of validity and integrity as follows:

validity: any message in the outgoing message buffer is eventually delivered to the

incoming message buffer;

integrity: the message received is identical to one sent, and no messages are delivered

twice.

The threats to integrity come from two independent sources:

• Any protocol that retransmits messages but does not reject a message that arrives

twice. Protocols can attach sequence numbers to messages so as to detect those that

are delivered twice.

• Malicious users that may inject spurious messages, replay old messages or tamper

with messages. Security measures can be taken to maintain the integrity property in the

face of such attacks.
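The use of sequence numbers to reject duplicates can be sketched as below. The example assumes the sender numbers its messages and may retransmit any of them; the `Receiver` class is hypothetical, and it keeps every sequence number it has seen, which a real protocol would bound.

```python
class Receiver:
    """Delivers each sequence number at most once, discarding retransmitted copies."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def on_message(self, seq, payload):
        if seq in self.seen:
            return False              # duplicate: reject to preserve integrity
        self.seen.add(seq)
        self.delivered.append(payload)
        return True


receiver = Receiver()
# Messages 1 and 2 were retransmitted, so they arrive twice.
arrivals = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
accepted = [receiver.on_message(seq, payload) for seq, payload in arrivals]
```

Retransmission provides validity and duplicate rejection preserves integrity; together they give reliable communication over a channel that may drop messages.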

Security model

The security of a distributed system can be achieved by securing the processes and the

channels used for their interactions and by protecting the objects that they encapsulate

against unauthorized access

Protection is described in terms of objects, although the concepts apply equally well to

resources of all types.

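Figure: objects and principals. A client acting for a principal (user) sends an invocation across the network to a server acting for a principal (server); the server holds the object and its access rights, and returns the result.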


This shows a server that manages a collection of objects on behalf of some users. The users

can run client programs that send invocations to the server to perform operations on the objects.

The server carries out the operation specified in each invocation and sends the result to the client.

Objects are intended to be used in different ways by different users. For example, some

objects may hold a user's private data, such as their mailbox, and other objects may hold shared data

such as web pages. To support this, access rights specify who is allowed to perform the

operations of an object, for example, who is allowed to read or to write its state.

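Users must therefore be included in the model as the beneficiaries of access rights. Each invocation and each result is associated with the authority on whose behalf it is issued. Such an authority is called a principal. A principal may be a user or a process.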

The server is responsible for verifying the identity of the principal behind each invocation and

checking that they have sufficient access rights to perform the requested operation on the

particular object invoked, rejecting those that do not. The client may check the identity of the

principal behind the server to ensure that the result comes from the required server.
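The server-side check of access rights can be sketched as follows. This is a minimal illustration that assumes the principal's identity has already been verified; the table `ACCESS_RIGHTS`, the object names and the function `invoke` are all hypothetical.

```python
# Access rights: object -> principal -> set of permitted operations.
ACCESS_RIGHTS = {
    "mailbox:alice": {"alice": {"read", "write"}},               # private data
    "page:index": {"alice": {"read", "write"}, "bob": {"read"}}, # shared data
}


def invoke(principal, obj, operation):
    """Perform `operation` on `obj` only if `principal` holds that right."""
    allowed = ACCESS_RIGHTS.get(obj, {}).get(principal, set())
    if operation not in allowed:
        raise PermissionError(f"{principal} may not {operation} {obj}")
    return f"{operation} on {obj} performed for {principal}"
```

For example, `invoke("bob", "page:index", "read")` succeeds, while `invoke("bob", "mailbox:alice", "read")` is rejected. The check is only as trustworthy as the identity passed in, which is why authentication of the principal matters.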

Securing processes and their interactions

Processes interact by sending messages. The messages are exposed to attack because the

network and the communication service that they use are open, to enable any pair of processes to

interact. Servers and peer processes expose their interfaces, enabling invocations to be sent to

them by any other process.

Distributed systems are often deployed and used in tasks that are likely to be subject to

external attacks by hostile users. This is especially true for applications that handle financial

transactions, confidential or classified information or any other information whose secrecy or

integrity is crucial. Integrity is threatened by security violations as well as communication failures.


The enemy

To model security threats, we postulate an enemy that is capable of sending any message to

any process and reading or copying any message between a pair of processes, as shown in the

following figure.

Such attacks can be made simply by using a computer connected to a network to run a program

that reads network messages addressed to other computers on the network, or a program that

generates messages that make false requests to services, purporting to come from authorized users.

The attack may come from a computer that is legitimately connected to the network or from one

that is connected in an unauthorized manner.

Threats to processes:

A process that is designed to handle incoming requests may receive a message from any other

process in the distributed system, and it cannot necessarily determine the identity of the sender.

Communication protocols such as IP do include the address of the source computer in each

message, but it is not difficult for an enemy to generate a message with a forged source address.

This lack of reliable knowledge of the source of a message is a threat to the correct functioning

of both servers and clients.

Servers:

Since a server can receive invocations from many different clients, it cannot necessarily determine

the identity of the principal behind any particular invocation. Even if a server requires the inclusion

of the principal's identity in each invocation, an enemy might generate an invocation with a false

identity.


Without reliable knowledge of the sender's identity, a server cannot tell whether to

perform the operation or to reject it. For example, a mail server would not know whether the user

behind an invocation that requests a mail item from a particular mailbox is allowed to do so or

whether it was a request from an enemy.

Clients:

When a client receives the result of an invocation from a server, it cannot necessarily tell

whether the result message comes from the intended server or from an enemy, perhaps

'spoofing' the mail server. Thus the client could receive a result that was unrelated to the original

invocation, such as a false mail item (one that is not in the user's mailbox).

Threats to communication channels:

An enemy can copy, alter or inject messages as they travel across the network and its

intervening gateways. Such attacks present a threat to the privacy and integrity of information as

it travels over the network and to the integrity of the system.

For example, a result message containing a user's mail item might be revealed to another

user or it might be altered to say something quite different. Another form of attack is the attempt

to save copies of messages and to replay them at a later time, making it possible to reuse the

same message over and over again.

For example, someone could benefit by resending an invocation message requesting a

transfer of a sum of money from one bank account to another. All these threats can be defeated

by the use of secure channels, which are described below and are based on cryptography and

authentication.


Defeating security threats :

Cryptography and shared secrets: Suppose that a pair of processes (for example a

particular client and a particular server) share a secret; that is they both know the secret but no

other process in the distributed system knows it. Then, if a message exchanged by that pair of processes includes information that proves the sender's knowledge of the shared secret, the recipient knows for sure that the sender was the other process in the pair.

Cryptography is the science of keeping messages secure, and encryption is the process of

scrambling a message in such a way as to hide its contents.

Modern cryptography is based on encryption algorithms that use secret keys - large numbers that are difficult to guess - to transform data in a manner that can only be reversed with knowledge of the corresponding decryption key.

Authentication:

The use of shared secrets and encryption provides the basis for the authentication of

messages - proving the identities supplied by their senders. The basic authentication technique is

to include in a message an encrypted portion that contains enough of the contents of the message

to guarantee its authenticity.

The authentication portion of a request to a file server to read part of a file, for example,

might include a representation of the requesting principal's identity, the identity of the file and the

date and time of the request, all encrypted with a secret key shared between the file server and

the requesting process. The server would decrypt this and check that it corresponds to the

unencrypted details specified in the request.
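
A minimal Python sketch of this check follows. It is an illustration under stated assumptions, not the scheme of any particular file server: instead of the encrypted portion described above it uses an HMAC tag (a keyed hash), which likewise proves knowledge of the shared key, and the names SHARED_KEY, make_request and server_accepts are hypothetical. The 30-second freshness window is an arbitrary choice.

```python
import hashlib
import hmac

# Hypothetical secret known only to the client and the file server.
SHARED_KEY = b"demo-key-known-only-to-client-and-server"


def make_request(principal: str, filename: str, timestamp: int, key: bytes) -> dict:
    """Build a request whose authentication portion covers its visible details."""
    details = f"{principal}|{filename}|{timestamp}".encode()
    tag = hmac.new(key, details, hashlib.sha256).hexdigest()
    return {"principal": principal, "file": filename, "time": timestamp, "auth": tag}


def server_accepts(request: dict, key: bytes, now: int, max_age: int = 30) -> bool:
    """Recompute the tag from the plain details and compare it with the one sent."""
    details = f"{request['principal']}|{request['file']}|{request['time']}".encode()
    expected = hmac.new(key, details, hashlib.sha256).hexdigest()
    # Assumed freshness rule: requests older than max_age seconds are refused,
    # which limits the replay of saved messages.
    fresh = 0 <= now - request["time"] <= max_age
    return fresh and hmac.compare_digest(expected, request["auth"])
```

A request altered in transit (for example, with a different file name) or replayed after the freshness window no longer matches its tag and is rejected.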

Secure channels: Encryption and authentication are used to build secure channels as a

service layer on top of existing communication services. A secure channel is a communication

channel connecting a pair of processes, each of which acts on behalf of a principal, as shown in

the following figure.


A secure channel has the following properties:

• Each of the processes knows reliably the identity of the principal on whose behalf

the other process is executing. Therefore if a client and server communicate via a

secure channel, the server knows the identity of the principal behind the

invocations and can check their access rights before performing an operation. This

enables the server to protect its objects correctly and allows the client to be sure

that it is receiving results from a bona fide server.

• A secure channel ensures the privacy and integrity (protection against tampering)

of the data transmitted across it.

• Each message includes a physical or logical time stamp to prevent messages from

being replayed or reordered.

Other possible threats from an enemy:

Denial of service:

This is a form of attack in which the enemy interferes with the activities of authorized

users by making excessive and pointless invocations on services or message transmissions in a

network, resulting in overloading of physical resources (network bandwidth, server processing

capacity).

Such attacks are usually made with the intention of delaying or preventing actions by

other users. For example, the operation of electronic door locks in a building might be disabled

by an attack that saturates the computer controlling the electronic locks with invalid requests.

Mobile code:

Mobile code raises new and interesting security problems for any process that receives

and executes program code from elsewhere, such as the email attachment mentioned. Such code can play a Trojan

horse role, purporting to fulfill an innocent purpose but in fact including code that accesses or

modifies resources that are legitimately available to the host process but not to the originator of

the code.


The methods by which such attacks might be carried out are many and varied, and the

host environment must be very carefully constructed in order to avoid them. Many of these issues

have been addressed in Java and other mobile code systems, but the recent history of this topic

has included the exposure of some embarrassing weaknesses. This illustrates well the need for

rigorous analysis in the design of all secure systems.

The uses of security models:

The use of security techniques such as encryption and access control incurs substantial

processing and management costs. The security model outlined above provides the basis for the

analysis and design of secure systems in which these costs are kept to a minimum, but threats to a

distributed system arise at many points, and a careful analysis of the threats that might arise from

all possible sources in the system's network environment, physical environment and human

environment is needed.

This analysis involves the construction of a threat model listing all the forms of attack to

which the system is exposed and an evaluation of the risks and consequences of each. The

effectiveness and the cost of the security techniques needed can then be balanced against the

threats.

1.8 Networking and Internetworking

The networks used in distributed systems are built from a variety of transmission media,

including wire, cable, fiber and wireless channels; hardware devices, including routers, switches,

bridges, hubs, repeaters and network interfaces; and software components, including protocol

stacks, communication handlers and drivers.

The resulting functionality and performance available to distributed system and

application programs is affected by all of these. The computers and other devices that use the

network for communication purposes are referred to as hosts. The term node is used to refer to

any computer or switching device attached to a network.

The Internet is a single communication subsystem providing communication between all

of the hosts that are connected to it. The Internet is constructed from many subnets. A subnet is a

set of interconnected nodes, all of which employ the same technology to communicate amongst

themselves.


The Internet's infrastructure includes an architecture and hardware and software

components that effectively integrate diverse subnets into a single data communication service.

The design of a communication subsystem is strongly influenced by the characteristics of the

operating systems used in the computers of which the distributed system is composed as well as

the networks that interconnect them.

Networking issues for distributed systems

Performance:

The network performance parameters of primary interest are latency and data transfer rate,

which affect the speed with which individual messages can be transferred between two interconnected

computers.

Latency is the delay that occurs after a send operation is executed before data starts to

become available at the destination. It can be measured as the time required to transfer an empty

message.

Data transfer rate is the speed at which data can be transferred between two computers

in the network once transmission has begun, usually quoted in bits per second. Following from

these definitions, the time required for a network to transfer a message containing length bits

between two computers is:

Message transmission time = latency + length/data transfer rate

The above equation is valid for messages whose length does not exceed a maximum that

is determined by the underlying network technology. Longer messages have to be segmented and

the transmission time is the sum of the times for the segments.
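
The formula can be checked numerically. The sketch below assumes that the segments of a long message are sent one after another, each paying the full latency, which is one reading of "the sum of the times for the segments"; the function name and the example figures are illustrative only.

```python
def transmission_time(length_bits, latency_s, rate_bps, max_segment_bits=None):
    """Message transmission time = latency + length / data transfer rate.

    If max_segment_bits is given and the message is longer, the message is
    split into segments and the per-segment times are added up.
    """
    if max_segment_bits is None or length_bits <= max_segment_bits:
        return latency_s + length_bits / rate_bps
    full_segments, remainder = divmod(length_bits, max_segment_bits)
    total = full_segments * (latency_s + max_segment_bits / rate_bps)
    if remainder:
        total += latency_s + remainder / rate_bps
    return total
```

For a 1000-bit message on a link with 5 ms latency and a transfer rate of 1 Mbit/s, the result is about 0.006 s, of which 0.005 s is latency. This is why latency dominates for small messages.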

The transfer rate of a network is determined primarily by its physical characteristics,

whereas the latency is determined primarily by software overheads, routing delays and a load-

dependent statistical element arising from conflicting demands for access to transmission

channels.

Many of the messages transferred between processes in distributed systems are small in

size; latency is therefore often of equal or greater significance than transfer rate in determining

performance.


The total system bandwidth of a network is a measure of throughput: the total volume of

traffic that can be transferred across the network in a given time. The performance of networks

deteriorates in conditions of overload, when there are too many messages in the network at the same

time.

Scalability:

Computer networks are an indispensable part of the infrastructure of modern societies. The

potential future size of the Internet is commensurate with the population of the planet. It is realistic to

expect it to include several billion nodes and hundreds of millions of active hosts.

These figures indicate the huge changes in size and load that the Internet must handle.

The network technologies on which it is based were not designed to cope with even the Internet's

current scale; but they have performed remarkably well. Some substantial changes to the

addressing and routing mechanisms are planned in order to handle the next phase of the Internet's

growth.

Reliability :

Many applications are able to recover from communication failures and hence do not

require guaranteed error-free communication. The end-to-end argument further supports the view

that the communication subsystem need not provide totally error-free communication; the

detection of communication errors and their correction is often best performed by application-

level software.

The reliability of most physical transmission media is very high. When errors occur they

are usually due to timing failures in the software at the sender or receiver (for example, failure by

the receiving computer to accept a packet) or buffer overflow rather than errors in the network.

Security

A firewall creates a protection boundary between the organization's intranet and the rest

of the Internet. The purpose of the firewall is to protect the resources in all of the computers

inside the organization from access by external users or processes and to control the use of

resources outside the firewall by users inside the organization.

A firewall runs on a gateway, a computer that stands at the network entry point to an

organization's intranet. The firewall receives and filters all of the messages traveling into and out

of an organization. It is configured according to the organization's security policy to allow certain

incoming and outgoing messages to pass through it and to reject all others.
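
The "allow certain messages, reject all others" behaviour can be sketched as a lookup against a policy. This is a simplified illustration, not how any particular firewall product is configured; the Rule fields and the entries in ALLOWED are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    direction: str  # "in" or "out", relative to the intranet
    protocol: str   # e.g. "tcp"
    port: int       # destination port of the message


# Hypothetical security policy: anything not listed here is rejected.
ALLOWED = {
    Rule("in", "tcp", 25),    # incoming mail
    Rule("in", "tcp", 443),   # incoming HTTPS
    Rule("out", "tcp", 443),  # outgoing HTTPS
}


def firewall_allows(direction: str, protocol: str, port: int, policy=ALLOWED) -> bool:
    """Return True only if the message matches an entry in the policy."""
    return Rule(direction, protocol, port) in policy
```

The default is rejection: a message passes only when the policy names it explicitly.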

Security can be achieved through the use of cryptographic techniques. It is usually

applied at a level above the communication subsystem. Exceptions include the need to protect

network components such as routers against unauthorized interference with their operation and

the need for secure links to mobile devices and other external nodes to enable them to participate

in a secure intranet, an arrangement known as a Virtual Private Network (VPN).

Mobility:

Distributed systems must support portable computers and handheld digital devices, which

creates the need for wireless networks in order to support continuous communication with

such devices. But the consequences of mobility extend beyond the need for wireless networking.

Mobile devices are frequently moved between locations and reconnected at convenient network

connection points.

The addressing and routing schemes of the Internet and other networks were developed

before the advent of mobile devices. Although the current mechanisms have been adapted and

extended to support them, the expected future growth in the use of mobile devices will require

further extensions.

Quality of service:

Quality of service is the ability to meet deadlines when transmitting and processing

streams of real-time multimedia data. This imposes major new requirements on computer

networks.

Applications that transmit multimedia data require guaranteed bandwidth and bounded

latencies for the communication channels that they use. Some applications vary their demands

dynamically and specify both a minimum acceptable quality of service and a desired optimum.

Multicasting

Most communication in distributed systems is between pairs of processes, but there often

is also a need for one-to-many communication. While this can be simulated by sends to several

destinations, this is more costly than necessary, and may not exhibit the fault-tolerance


characteristics required by applications. For these reasons many network technologies support the

simultaneous transmission of messages to several recipients.

1.9 Types of network

1.10 Networking Principles

Packet switching was a radical step beyond the switched telecommunication networks that

had been used for telephone and telegraph communication, exploiting the capability of computers to store data

while it is in transit.

This enables packets addressed to different destinations to share a single communications

link. Packets are queued in a buffer and transmitted when the link is available. Communication is

asynchronous - messages arrive at their destination after a delay that varies depending upon the

time that packets take to travel through the network.

Packet transmission

In most applications of computer networks the requirement is for the transmission of

logical units of information, or messages: sequences of data items of arbitrary length. But before

a message is transmitted it is subdivided into packets.


The simplest form of packet is a sequence of binary data (an array of bits or bytes) of

restricted length, together with addressing information sufficient to identify the source and

destination computers.

Packets of restricted length are used:

• so that each computer in the network can allocate sufficient buffer storage to hold the

largest possible incoming packet;

• to avoid the undue delays that would occur in waiting for communication channels to

become free if long messages were transmitted without subdivision.

Data streaming

There are major exceptions to the rule that message-based communication meets most

application needs. Multimedia applications rely upon the transmission of streams of audio and

video data elements at guaranteed rates and with bounded latencies. Such streams differ

substantially from the message-based type of traffic for which packet transmission was designed.

The streaming of audio and video requires much higher bandwidths than most other forms

of communication in distributed systems. The transmission of a video stream for real-time

display requires a bandwidth of about 1.5 Mbps if the data is compressed or 120 Mbps if

uncompressed.

The play time of a multimedia element is the time at which it must be displayed (for a

video element) or converted to audio (for a sound sample). For example, in a stream of video

frames that has a frame rate of 24 frames per second, frame N has a play time that is N/24

seconds after the stream's start time. Elements that arrive at their destination later than their play

time are no longer useful and will be dropped by the receiving process.
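
The play-time rule translates directly into code. The following sketch uses hypothetical function names and assumes arrival times are measured in seconds on the same clock as the stream's start time.

```python
def play_time(frame_number: int, start_time: float, frame_rate: float = 24.0) -> float:
    """Frame N is due N / frame_rate seconds after the stream's start time."""
    return start_time + frame_number / frame_rate


def usable_frames(arrivals, start_time: float, frame_rate: float = 24.0):
    """Keep frames that arrived no later than their play time.

    arrivals is a list of (frame_number, arrival_time) pairs; frames that
    arrive after their play time are dropped by the receiver.
    """
    return [n for n, arrived in arrivals
            if arrived <= play_time(n, start_time, frame_rate)]
```

With a start time of 0, frame 48 is due at 2.0 seconds, so a copy arriving at 2.5 seconds is discarded.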

The timely delivery of such data streams depends upon the availability of connections

with guaranteed quality of service: bandwidth, latency and reliability must all be guaranteed.

This requires a channel from the source to the destination of the stream, with a predefined route through the network, a reserved set of resources at each node

through which it will travel and buffering where appropriate to cushion any irregularities in the

flow of data through the channel. Data can then be passed through the channel from sender to

receiver at the required rate.


Switching schemes

A network consists of a set of nodes connected together by circuits. To transmit

information between two arbitrary nodes, a switching system is required.

The four types of switching that are used in computer networking are:

Broadcast :

Broadcasting is a transmission technique that involves no switching. Everything is

transmitted to every node, and it is up to potential receivers to notice transmissions addressed to

them.

Some LAN technologies, including Ethernet, are based on broadcasting. Wireless

networking is necessarily based on broadcasting, but in the absence of fixed circuits the

broadcasts are arranged to reach nodes grouped in cells.

Circuit switching:

At one time telephone networks were the only telecommunication networks. Their

operation was simple to understand: when a caller dialed a number, the pair of wires from her

phone to the local exchange was connected by an automatic switch at the exchange to the pair of

wires connected to the other party's phone.

For a long-distance call the process was similar but the connection would be switched

through a number of intervening exchanges to its destination. This system is sometimes referred

to as the plain old telephone system, or POTS. It is a typical circuit-switching network.

Packet Switching:

This new type of communication network is called a store-and-forward network. There is

a computer at each switching node (wherever several circuits need to be interconnected). Packets

arriving at a node are first stored in the memory of the computer at the node and then processed

by a program that forwards them towards their destination by choosing an outgoing circuit that will

transfer each packet to another node that is closer to its ultimate destination.

Frame relay

The store-and-forward transmission of packets is not instantaneous. It typically takes

anything from a few tens of microseconds to a few milliseconds to switch a packet through each

network node, depending on packet size, hardware speeds and the quantity of other traffic.

Packets may be routed through many nodes before they reach their destination.


The Internet is based on store-and-forward switching, and even short Internet packets typically

take around 200 milliseconds to reach their destinations. Delays of this magnitude are much too

long for applications such as telephony, where delays of less than 50 milliseconds are needed to

sustain a telephone conversation without interference.

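Since the delay is additive (the more nodes a packet passes through, the more it is

delayed) and since much of the delay at each node arises from factors that are inherent to the

packet-switching technique, it cannot be removed within store-and-forward switching. Frame relay

reduces it by switching small packets (called frames) on the fly.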

Protocols

The term protocol is used to refer to a well-known set of rules and formats to be used for

communication between processes in order to perform a given task. The definition of a protocol

has two important parts to it:

• a specification of the sequence of messages that must be exchanged;

• a specification of the format of the data in the messages.

A protocol is implemented by a pair of software modules located in the sending and

receiving computers. For example, a transport protocol transmits messages of any length from a

sending process to a receiving process.

A process wishing to transmit a message to another process issues a call to a transport

protocol module, passing it a message in the specified format. The transport software then

concerns itself with the transmission of the message to its destination, subdividing it into packets

of some specified size and format that can be transmitted to the destination via the network

protocol (another, lower-level protocol).

The corresponding transport protocol module in the receiving computer receives the

packet via the network-level protocol module and performs inverse transformations to regenerate

the message before passing it to a receiving process.


Protocol layers

Network software is arranged in a hierarchy of layers. Each layer presents an interface to

the layers above it that extends the properties of the underlying communication system. A layer

is represented by a module in every computer connected to the network. The following figure

illustrates the structure and the flow of data when a message is transmitted using a layered

protocol.

Each module appears to communicate directly with a module at the same level in another

computer in the network, but in reality data is not transmitted directly between the protocol

modules at each level. Instead, each layer of network software communicates by local procedure

calls with the layers above and below it.

On the sending side, each layer (except the topmost, or application layer) accepts items of

data in a specified format from the layer above it and applies transformations to encapsulate the

data in the format specified for that layer before passing it to the layer below for further

processing.


The following figure illustrates this process as it applies to the top four layers of the OSI

protocol suite. The figure shows the packet headers that hold most network-related data items, but for

clarity it omits the trailers that are present in some types of packet; it also assumes that the application-

layer message to be transmitted is shorter than the underlying network's maximum packet size.

On the receiving side, the converse transformations are applied to data items received from the

layer below before they are passed to the layer above. The protocol type of the layer above is included

in the header of each layer, to enable the protocol stack at the receiver to select the correct software

components to unpack the packets.
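
Encapsulation and its converse can be sketched as repeated wrapping and unwrapping. The example below is a toy model only: real protocol headers are binary structures rather than JSON, and the header fields "layer" and "upper" are hypothetical names standing for the layer's own identity and the protocol type of the layer above.

```python
import json

# The top four OSI layers, from the topmost downwards.
LAYERS = ["application", "presentation", "session", "transport"]


def send_down(message: str) -> str:
    """Wrap the message once for each layer below the application layer."""
    data = message
    upper = LAYERS[0]
    for layer in LAYERS[1:]:
        # Each header records which layer above must receive the payload.
        header = {"layer": layer, "upper": upper}
        data = json.dumps({"header": header, "payload": data})
        upper = layer
    return data


def receive_up(data: str):
    """Apply the converse transformations, outermost header first."""
    path = []
    for _ in LAYERS[1:]:
        unit = json.loads(data)
        path.append(unit["header"]["upper"])  # selects the next component up
        data = unit["payload"]
    return data, path
```

On the receiving side the "upper" field of each header is what tells the stack where to hand the payload next, ending at the application layer.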

Protocol suites :

A complete set of protocol layers is referred to as a protocol suite or a protocol stack,

reflecting the layered structure. The following figure shows a protocol stack that conforms to the

seven-layer Reference Model for open systems interconnection (OSI) adopted by the International

Organization for Standardization (ISO) [ISO 1992]. The OSI

Reference Model was adopted in order to encourage the development of protocol standards that

would meet the requirements of open systems.


Protocol layering brings substantial benefits in simplifying and generalizing the software

interfaces for access to the communication services of networks, but it also carries significant

performance costs. The transmission of an application-level message via a protocol stack with N

layers typically involves N transfers of control to the relevant layer of software in the protocol

suite, at least one of which is an operating system entry, and taking N copies of the data as a part

of the encapsulation mechanism. All of these overheads result in data transfer rates between

application processes that are much lower than the available network bandwidth.

The figure includes examples from protocols used in the Internet, but the implementation of the

Internet does not follow the OSI model in two respects

Packet assembly

The task of dividing messages into packets before transmission and reassembling them at

the receiving computer is usually performed in the transport layer.

The network-layer protocol packets consist of a header and a data field. In most network

technologies, the data field is variable in length, but with a limit called the maximum transfer unit

(MTU).


If the length of a message exceeds the MTU of the underlying network layer, it must be

fragmented into chunks of the appropriate size, with sequence numbers for use on reassembly,

and transmitted in multiple packets. For example, the MTU for Ethernets is 1500 bytes; no more

than that quantity of data can be transmitted in a single Ethernet packet.
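
Fragmentation and reassembly with sequence numbers can be sketched as follows. The function names are hypothetical, and packet headers are reduced to a bare sequence number for brevity.

```python
MTU = 1500  # bytes of data per packet, the Ethernet figure quoted above


def fragment(message: bytes, mtu: int = MTU):
    """Split a message into (sequence_number, chunk) packets of at most mtu bytes."""
    offsets = range(0, len(message), mtu)
    return [(seq, message[off:off + mtu]) for seq, off in enumerate(offsets)]


def reassemble(packets) -> bytes:
    """Rebuild the message; the sequence numbers restore the original order."""
    return b"".join(chunk for _, chunk in sorted(packets))
```

Because each chunk carries its sequence number, the receiver can rebuild the message even when the packets arrive out of order.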

Although the IP protocol stands in the position of a network layer protocol in the Internet

suite of protocols, its MTU is unusually large at 64 kbytes (8 kbytes is often used in practice

because some nodes are unable to handle such large packets). Whichever MTU value is adopted

for IP packets, packets larger than the Ethernet MTU can arise and they must be fragmented for

transmission over Ethernets.

Ports :

The transport layer's task is to provide a network-independent message transport service

between pairs of network ports. Ports are software-definable destination points for

communication within a host computer.

Ports are attached to processes, enabling them to communicate in pairs. The specific

details of the port abstraction may be varied to provide additional useful properties. Here we shall

describe the addressing of ports as it occurs in the Internet and most other networks.

Addressing :

The transport layer is responsible for delivering messages to destinations with transport

addresses that are composed of the network address of a host computer and a port number. A

network address is a numeric identifier that uniquely identifies a host computer and enables it to

be located by nodes that are responsible for routing data to it.

In the Internet every host computer is assigned an IP number, which identifies it and the

subnet to which it is connected, enabling data to be routed to it from any other node as described

in the following sections. In Ethernets there are no routing nodes; each host is responsible for

recognizing and picking up packets addressed to it.

A distributed system generally has a multiplicity of servers, which differs from one time

to another and from one organization to another. Clearly, the allocation of fixed hosts or fixed

port numbers to these services is not feasible.
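
A transport address, that is a network address paired with a port number, can be seen directly in the socket interface. The sketch below sends one UDP message over the loopback interface using Python's standard socket module; the function name loopback_demo is hypothetical. Binding to port 0 asks the operating system for any free port, one simple way to avoid a fixed port number.

```python
import socket


def loopback_demo(payload: bytes = b"hello"):
    """Send one UDP datagram to a port on this host and receive it."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(2.0)          # do not wait forever if delivery fails
        transport_address = receiver.getsockname()  # (network address, port)

        sender.sendto(payload, transport_address)
        data, source_address = receiver.recvfrom(1024)
        return data, transport_address, source_address
    finally:
        sender.close()
        receiver.close()
```

The receiver learns the sender's transport address from the datagram, so it could reply without knowing the sender's port in advance.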


Packet delivery :

There are two approaches to the delivery of packets by the network layer:

Datagram packet delivery:

The term 'datagram' refers to the similarity of this delivery mode to the way in which

letters and telegrams are delivered. The essential feature of datagram networks is that the delivery

of each packet is a `one-shot' process; no setup is required and once the packet is delivered the

network retains no information about it.

In a datagram network a sequence of packets transmitted by a single host to a single

destination may follow different routes (if, for example, the network is capable of adaptation to

handle failures or to mitigate the effects of localized congestion) and when this occurs they may

arrive out of sequence.

Every datagram packet contains the full network address of the source and destination hosts.

Datagram delivery is the concept on which packet networks were originally based and it can be

found in most of the computer networks in use today. The Internet's network layer (IP),

Ethernet and most wired and wireless local network technologies are based on datagram delivery.

Virtual circuit packet delivery

A virtual circuit must be set up before packets can pass from a source host A to

destination host B. The establishment of a virtual circuit involves the identification of a route

from the source to the destination, possibly passing through several intermediate nodes. At each

node along the route a table entry is made, indicating which link should be used for the next stage

of the route.

Once a virtual circuit has been set up, it can be used to transmit any number of packets.

Each network-layer packet contains only a virtual circuit number in place of the source and

destination addresses. The addresses are not needed, because packets are routed at intermediate

nodes by reference to the virtual circuit number.
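
The table entries made during circuit setup can be sketched with one dictionary per node. This is a simplification under stated assumptions: a single circuit number is used along the whole route (real networks may relabel it on each link), and the node names and function names are hypothetical.

```python
def setup_circuit(tables: dict, route: list, vc: int) -> None:
    """At each node on the route, record the next hop for circuit number vc."""
    for node, next_hop in zip(route, route[1:]):
        tables.setdefault(node, {})[vc] = next_hop


def forward(tables: dict, start: str, vc: int) -> list:
    """Follow the table entries for vc from start; return the nodes visited."""
    path = [start]
    # The packet carries only vc; each node looks up its own entry for it.
    while vc in tables.get(path[-1], {}):
        path.append(tables[path[-1]][vc])
    return path
```

After setup_circuit(tables, ["A", "R1", "R2", "B"], 7), any packet labelled 7 that enters at A is delivered to B without carrying either address.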

When a packet reaches its destination the source can be determined from the virtual

circuit number. In the POTS a telephone call results in the establishment of a physical circuit

from the caller to the callee, and the voice links from which it is constructed are reserved for their

exclusive use.


In virtual circuit packet delivery the circuits are represented only by table entries in

routing nodes, and the links along which the packets are routed are used only for the time taken

to transmit a packet; they are free for other uses for the rest of time.

Routing:

Routing is a function that is required in all networks except those LANs, such as the

Ethernet, that provide direct connections between all pairs of attached hosts. In large networks,

adaptive routing is employed: the best route for communication between two points in the

network is re-evaluated periodically, taking into account the current traffic in the network and

any faults such as broken connections or routers.

The delivery of packets to their destinations in a network such as the one shown in the

following figure is the collective responsibility of the routers located at connection points. The

determination of routes for the transmission of packets to their destinations is the responsibility

of a routing algorithm implemented by a program in the network layer at each node.

A routing algorithm has two parts:

1. It must make decisions that determine the route taken by each packet as it travels

through the network. In circuit-switched network layers such as X.25 and frame

relay networks such as ATM the route is determined whenever a virtual circuit or

connection is established.


In packet-switched network layers such as IP it is determined separately for each

packet, and the algorithm must be particularly simple and efficient if it is not to degrade

network performance.

2. It must dynamically update its knowledge of the network based on traffic

monitoring and the detection of configuration changes or failures. This activity is

less time-critical; slower and more computation-intensive techniques can be used.

Routing tables for the network

A router exchanges information about the network with its neighbouring nodes by

sending a summary of its routing table using a router information protocol (RIP). The RIP actions

performed at a router are described informally as follows:

1. Periodically, and whenever the local routing table changes, send the table (in a

summary form) to all accessible neighbours. That is, send an RIP packet containing a copy of the

table on each non-faulty outgoing link.

2. When a table is received from a neighbouring router, if the received table shows

a route to a new destination, or a better (lower cost) route to an existing destination, then update


the local table with the new route. If the table was received on link n and it gives a different

cost than the local table for a route that begins with link n, then replace the cost in the local

table with the new cost.

This is done because the new table was received from a router that is closer to the

relevant destination and is therefore always more authoritative for routes that pass through it.

Pseudo-code for RIP routing algorithm
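
The pseudo-code itself is not reproduced in these notes. The following Python sketch is a reconstruction of the receive step from the two informal rules above, not the original figure. It assumes that a routing table maps each destination to a (link, cost) pair, that cost counts hops, that links carry network-wide labels, and that a failed route is advertised with infinite cost; the function name rip_receive is hypothetical.

```python
def rip_receive(local: dict, received: dict, n) -> bool:
    """Merge a table received on link n into the local table.

    Both tables map destination -> (link, cost). Returns True if the local
    table changed, in which case it should be sent on to the neighbours.
    """
    changed = False
    for dest, (r_link, r_cost) in received.items():
        if r_link == n:
            continue  # the neighbour reaches dest over link n, i.e. through us
        candidate = (n, r_cost + 1)  # one extra hop to reach the neighbour
        current = local.get(dest)
        if (current is None                    # route to a new destination
                or candidate[1] < current[1]   # better (lower cost) route
                or (current[0] == n and candidate != current)):
            # Last case: our route already starts with link n, so the
            # neighbour's figure is more authoritative, even if it is worse.
            local[dest] = candidate
            changed = True
    return changed
```

The last condition is what lets bad news propagate: when a neighbour reports an infinite cost for a route that the local table sends through it, the local entry is overwritten rather than kept.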

Congestion control

The capacity of a network is limited by the performance of its communication links

and switching nodes. When the load at any particular link or node approaches its capacity,

queues will build up at hosts trying to send packets and at intermediate nodes holding

packets whose onward transmission is blocked by other traffic.

If the load continues at the same high level, the queues will continue to grow until

they reach the limit of available buffer space. Once this state is reached at a node, the node

has no option but to drop further incoming packets. If packets are dropped at intermediate

nodes, the network resources that they have already consumed are wasted and the resulting

retransmissions will require a similar quantity of resources to reach the same point in

the network. As a rule of thumb, when the load on a network exceeds 80% of its capacity,

the total throughput tends to drop as a result of packet losses unless usage of heavily loaded links is

controlled.
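
The build-up of queues and the resulting packet loss can be reproduced with a small simulation of one node. The model is deliberately crude and all of its parameters are hypothetical: time advances in ticks, a fixed number of packets arrives and can be forwarded per tick, and the buffer holds a fixed number of packets.

```python
from collections import deque


def simulate_node(arrivals_per_tick: int, sends_per_tick: int,
                  buffer_size: int, ticks: int):
    """Return (forwarded, dropped) packet counts for one node."""
    queue = deque()
    forwarded = dropped = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if len(queue) < buffer_size:
                queue.append(1)   # room in the buffer: queue the packet
            else:
                dropped += 1      # buffer full: the packet must be dropped
        for _ in range(min(sends_per_tick, len(queue))):
            queue.popleft()       # forward on the outgoing link
            forwarded += 1
    return forwarded, dropped
```

When arrivals match the outgoing capacity nothing is lost. When they exceed it, the queue fills within a few ticks and every excess packet after that is dropped.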

In general, congestion control is achieved by informing nodes along a route that congestion

has occurred, and their rate of packet transmission should therefore be reduced. For

intermediate nodes, this will result in the buffering of incoming packets for a longer period. For

hosts that are sources of the packets, the result may be to queue packets before transmission or to

block the application process that is generating them until the network can handle them.

All datagram-based network layers, including IP and Ethernets, rely on the end-to-end

control of traffic. That is, the sending node must reduce the rate at which it transmits packets

based only on information that it receives from the receiver. Congestion information may be

supplied to the sending node by explicit transmission of special messages (called choke packets)

requesting a reduction in transmission rate, or by the implementation of a specific transmission

control protocol.

Internetworking

There are many network technologies with different network, link and physical layer

protocols. Local networks are built from Ethernet and ATM technologies, wide area networks are

built over analogue and digital telephone networks of various types, satellite links and wide-area

ATM networks. Individual computers and local networks are linked to the Internet or intranets by

modems, ISDN links and DSL connections.

To build an integrated network (an internetwork) we must integrate many subnets, each of which

is based on one of these network technologies. To make this possible, the following are

needed:

1. A unified internetwork addressing scheme that enables packets to be addressed to any

host connected to any subnet.

2. A protocol defining the format of internetwork packets and giving rules according to

which they are handled.

3. Interconnecting components that route packets to their destinations in terms of

internetwork addresses, transmitting the packets using subnets with a variety of

network technologies.


For the Internet, (1) is provided by IP addresses, (2) is the IP protocol and (3) is performed

by the components called Internet Routers. The following figure shows a small part of the

intranet located at Queen Mary and Westfield College (QMW), University of London.

Routers

Routing is required in all networks except those such as Ethernets and wireless networks, in

which all of the hosts are connected by a single transmission medium. In an internetwork, the

routers may be linked by direct connections or they may be interconnected through subnets. In

both cases, the routers are responsible for forwarding the internetwork packets that arrive on any

connection to the correct outgoing connection.

Bridges: Bridges link networks of different types. Some bridges link several networks, and

these are referred to as bridge/routers because they also perform routing functions. For

example, the campus network at QMW includes a Fibre Distributed Data Interface (FDDI) backbone.

Hubs:

Hubs are simply a convenient means of connecting hosts and extending segments of Ethernet and other broadcast local network technologies. They have a number of sockets (typically 4-64), to each of which a host computer can be connected. They can also be used to overcome the distance limitations on single segments and provide a means of adding additional hosts.

Switches:

Switches perform a similar function to routers, but for local networks (normally

Ethernets) only. That is, they interconnect several separate Ethernets, routing the incoming

packets to the appropriate outgoing network.

They perform their task at the level of the Ethernet network protocol. They start with no

knowledge of the wider internetwork and build up routing tables by the observation of traffic,

supplemented by broadcast requests when they lack information.

The advantage of switches over hubs is that they separate the incoming traffic and transmit it only

on the relevant outgoing network, reducing congestion on the other networks to which they are

connected.
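The way a switch builds its table by observing traffic can be sketched as follows. This is a minimal illustration, not the behaviour of any particular product: the class name LearningSwitch, the string form of the addresses and the FLOOD marker are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a learning switch: remembers the port on which each source address was seen. */
final class LearningSwitch {
    /** Hypothetical marker meaning "send on every port except the incoming one". */
    static final int FLOOD = -1;

    private final Map<String, Integer> table = new HashMap<>();

    /** Learns the sender's port, then returns the outgoing port for the frame. */
    int handleFrame(String srcAddress, String dstAddress, int inPort) {
        table.put(srcAddress, inPort);          // learn by observing traffic
        Integer out = table.get(dstAddress);
        return (out == null) ? FLOOD : out;     // unknown destination: broadcast
    }

    public static void main(String[] args) {
        LearningSwitch sw = new LearningSwitch();
        System.out.println(sw.handleFrame("aa", "bb", 1)); // "bb" not yet seen
        System.out.println(sw.handleFrame("bb", "aa", 2)); // "aa" was learned on port 1
    }
}
```

The first frame is flooded because nothing is known about its destination. The reply can be sent on a single port, which is the reduction in congestion described above.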

Tunnelling:

Bridges and routers transmit internetwork packets over a variety of underlying networks, but

there is one situation in which the underlying network protocol can be hidden from those

above without the use of a special internetwork protocol.

When a pair of nodes connected to two separate networks need to communicate through another

type of network or over an `alien' protocol, they can do so by constructing a protocol `tunnel'.

The following figure illustrates the proposed use of tunnelling to support the migration of the

Internet to the recently approved IPv6 protocol. IPv6 is intended to replace the version of IP

currently in use, IPv4, and is incompatible with it.


The intervening network nodes do not need to be modified to handle the MobileIP protocol. The IP

multicast protocol is handled in a similar way, relying on a few routers that support IP multicast

routing to determine the routes, but transmitting IP packets.

1.11 Internet protocols

The Internet Protocol (IP) is the method or protocol by which data is sent from one

computer to another on the Internet. Each computer (known as a host) on the Internet has at least

one IP address that uniquely identifies it from all other computers on the Internet. When you send

or receive data (for example, an e-mail note or a Web page), the message gets divided into little

chunks called packets.

Each of these packets contains both the sender's Internet address and the receiver's address. Any

packet is sent first to a gateway computer that understands a small part of the Internet. The

gateway computer reads the destination address and forwards the packet to an adjacent

gateway that in turn reads the destination address and so forth across the Internet until one

gateway recognizes the packet as belonging to a computer within its immediate

neighborhood or domain.

That gateway then forwards the packet directly to the computer whose address is specified.

Because a message is divided into a number of packets, each packet can, if necessary, be sent by

a different route across the Internet. Packets can arrive in a different order than the order they

were sent in. The Internet Protocol just delivers them. It's up to another protocol, the

Transmission Control Protocol (TCP) to put them back in the right order.

IP is a connectionless protocol, which means that there is no continuing connection between

the end points that are communicating. Each packet that travels through the Internet is treated as

an independent unit of data without any relation to any other unit of data. (The reason the packets

do get put in the right order is because of TCP, the connection-oriented protocol that keeps track

of the packet sequence in a message.) In the Open Systems Interconnection (OSI)

communication model, IP is in layer 3, the Network layer.

The most widely used version of IP today is Internet Protocol Version 4 (IPv4). However, IP

Version 6 (IPv6) is also beginning to be supported. IPv6 provides for much longer addresses and

therefore for the possibility of many more Internet users. IPv6 provides the capabilities of IPv4,

and hosts that support IPv6 commonly also support IPv4 by running both protocols side by side.

TCP/IP Layers

Addressing

The scheme used for assigning host addresses to networks and the computers connected to them

had to satisfy the following requirements:

It must be universal: any host must be able to send packets to any other host in the

Internet.

It must be efficient in its use of the address space: it is impossible to predict the

ultimate size of the Internet and the number of network and host addresses likely to

be required. The address space must be carefully partitioned to ensure that addresses will

not run out.

The addressing scheme must lend itself to the development of a flexible and

efficient routing scheme, but the addresses themselves cannot contain very much of

the information needed to route a packet to its destination.






The design adopted for Internet address space is shown in the following figure. There

are four allocated classes of Internet address - A, B, C and D. Class D is reserved for

Internet multicast communication. Class E contains a range of unallocated addresses, which

are reserved for future requirements.

These 32-bit Internet addresses containing a network identifier and host identifier are usually

written as a sequence of four decimal numbers separated by dots. Each decimal number

represents one of the four bytes, or octets of the IP address. The permissible values for each

class of network address are shown in the following figure.



Three classes of address were designed to meet the requirements of different types of

organization. The Class A addresses, with a capacity for 2^24 hosts on each subnet, are reserved for

very large networks such as the US NSFNet and other national wide area networks. Class B

addresses are allocated to organizations that operate networks likely to contain more than 255

computers, and Class C addresses are allocated to all other network operators.
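As a rough illustration of how the class of an address follows from its leading bits, the sketch below classifies a dotted-decimal address by its first octet. The boundaries used (128, 192, 224, 240) are the standard classful ones and are assumed here because the figure is not reproduced in these notes. The class name AddressClass is hypothetical.

```java
/** Sketch: derive the class (A to E) of a 32-bit Internet address from its first octet. */
final class AddressClass {
    static char classify(String dotted) {
        int first = Integer.parseInt(dotted.split("\\.")[0]);
        if (first < 0 || first > 255) {
            throw new IllegalArgumentException("not an octet: " + first);
        }
        if (first < 128) return 'A';   // leading bit 0
        if (first < 192) return 'B';   // leading bits 10
        if (first < 224) return 'C';   // leading bits 110
        if (first < 240) return 'D';   // leading bits 1110, multicast
        return 'E';                    // leading bits 1111, reserved
    }

    public static void main(String[] args) {
        System.out.println(classify("138.37.95.240")); // B
        System.out.println(classify("194.0.0.1"));     // C
    }
}
```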

The main difficulty is that network administrators in user organizations cannot easily predict

future growth in their need for host addresses and they tend to overestimate, requesting Class B

addresses when in doubt. Around 1990 it became evident that based on the rate of allocation at

the time, the NIC was likely to run out of IP addresses to allocate around 1996.

Two steps were taken. The first was to initiate the development of a new IP protocol and

addressing scheme, the result of which was the specification of IPv6. The second step was to

radically modify the way in which IP addresses were allocated: a new address allocation and

routing scheme, classless interdomain routing (CIDR), was designed to make more effective use of the IP address space.

The IP protocol

The IP protocol transmits datagrams from one host to another, if necessary via intermediate

routers. There are several header fields that are used by the transmission and routing algorithms.

IP provides a delivery service that is described as offering unreliable or best-effort delivery

semantics, because there is no guarantee of delivery. Packets can be lost, duplicated, delayed or

delivered out of order, but these errors arise only when the underlying networks fail or buffers at

the destination are full.



The only checksum in IP is a header checksum, which is inexpensive to calculate and

ensures that any corruptions in the addressing and packet management data will be detected.

There is no data checksum, which avoids overheads when crossing routers, leaving the

higher-level protocols (TCP and UDP) to provide their own checksums, a practical instance of the

end-to-end argument.
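To show why the header checksum is inexpensive, the following sketch computes it as the one's-complement sum of the 16-bit words of the header, which is the standard Internet checksum method. The class name HeaderChecksum and the sample 20-byte header are illustrative assumptions; the header length is assumed to be even.

```java
/** Sketch of the Internet header checksum: one's-complement sum of 16-bit words. */
final class HeaderChecksum {
    static int compute(byte[] header) {
        long sum = 0;
        for (int i = 0; i < header.length; i += 2) {
            sum += ((header[i] & 0xFF) << 8) | (header[i + 1] & 0xFF);
        }
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16);   // fold the carries back in
        }
        return (int) (~sum & 0xFFFF);
    }

    public static void main(String[] args) {
        // Sample header with the checksum field (bytes 10 and 11) set to zero.
        int[] words = {0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00, 0x40, 0x11,
                       0x00, 0x00, 0xC0, 0xA8, 0x00, 0x01, 0xC0, 0xA8, 0x00, 0xC7};
        byte[] header = new byte[words.length];
        for (int i = 0; i < words.length; i++) header[i] = (byte) words[i];
        System.out.println(Integer.toHexString(compute(header))); // b861
    }
}
```

A receiver can run the same computation over the header with the checksum field filled in; a result of zero indicates that no corruption was detected.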

The IP layer puts IP datagrams into network packets suitable for transmission in the underlying

network (which might, for example, be an Ethernet). When an IP datagram is longer than the MTU

of the underlying network, it is broken into smaller packets at the source and reassembled at its

final destination. Packets can be further broken up to suit the underlying networks encountered

during the journey from source to destination.
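The splitting of a datagram to fit an MTU can be sketched as below. The example assumes a 20-byte IPv4 header without options and relies on the IPv4 rule that fragment offsets are counted in units of 8 bytes. The class name Fragmenter and the representation of a fragment as an {offset, length} pair are choices made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: compute the {offset, length} of each fragment of a datagram payload. */
final class Fragmenter {
    static final int HEADER = 20; // assumed IPv4 header size without options

    static List<int[]> fragment(int payloadLength, int mtu) {
        // Each fragment except the last must carry a multiple of 8 data bytes.
        int maxData = ((mtu - HEADER) / 8) * 8;
        if (maxData <= 0) {
            throw new IllegalArgumentException("MTU too small: " + mtu);
        }
        List<int[]> fragments = new ArrayList<>();
        for (int offset = 0; offset < payloadLength; offset += maxData) {
            int length = Math.min(maxData, payloadLength - offset);
            fragments.add(new int[] {offset, length});
        }
        return fragments;
    }

    public static void main(String[] args) {
        for (int[] f : fragment(4000, 1500)) {
            System.out.println("offset " + f[0] + ", length " + f[1]);
        }
    }
}
```

With a 4000-byte payload and an MTU of 1500 this yields three fragments of 1480, 1480 and 1040 data bytes.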

The IP layer must also insert a `physical' network address of the message destination to the

underlying network. It obtains this from the address resolution module in the Internet Network

Interface layer.

Address Resolution

The address resolution module is responsible for converting Internet addresses to network

addresses for a specific underlying network (sometimes called physical addresses). For example, if

the underlying network is an Ethernet, the Address Resolution module converts 32-bit Internet

addresses to 48-bit Ethernet addresses.

This translation is network technology-dependent:

• Some hosts are connected directly to Internet packet switches; IP packets can be routed to them

without address translation.

• Some local area networks allow network addresses to be assigned to hosts dynamically, and the

addresses can be conveniently chosen to match the host identifier portion of the Internet address

- translation is simply a matter of extracting the host identifier from the IP address.

• For Ethernets and some other local networks the network address of each

computer is hard-wired into its network interface hardware and bears no direct relation to its

Internet address - translation depends upon knowledge of the correspondence between IP

addresses and Ethernet addresses for the hosts on the local Ethernet.

IP spoofing:

We have seen that IP packets include a source address: the IP address of the sending

computer. This, together with a port address encapsulated in the data field (for UDP and TCP



packets), is often used by servers to generate a return address. Unfortunately, it is not possible to

guarantee that the source address given is in fact the address of the sender.

A malicious sender can easily substitute an address that is different from its own. This

loophole has been the source of several well-known attacks, including the distributed denial of

service attacks of February 2000. In those attacks, many ping requests were issued to a large

number of computers. These malicious ping requests all contained the IP address of a

target computer in their sender address field. The ping responses were therefore all directed to the

target, whose input buffers were overwhelmed, preventing any legitimate IP packets from getting through.

IP Routing:

The IP layer routes packets from their source to their destination. Each router in the Internet

implements IP-layer software to provide a routing algorithm.

Backbones

The topological map of the Internet is partitioned conceptually into autonomous systems

(ASs), which are subdivided into areas. The intranets of most large organizations such as

universities and large companies are regarded as ASs, and they will usually include several areas.

The collection of routers that connect non-backbone areas to the backbone and the links that

interconnect those routers are called the backbone of the network. The links in the backbone are

usually of high bandwidth and are replicated for reliability.

Routing protocols

RIP-1, the first routing algorithm used in the Internet, is a version of the distance-vector

algorithm. RIP-2 was subsequently developed from it to accommodate several additional requirements, including classless

interdomain routing, better multicast routing and the need for authentication of RIP packets to

prevent attacks on the routers.

As the scale of the Internet has expanded and the processing capacity of routers has

increased, there has been a move towards the adoption of algorithms that do not suffer from the

slow convergence and potential instability of distance-vector algorithms.

We should note that the adoption of new routing algorithms in IP routers can proceed

incrementally. A change in routing algorithm results in a new version of the RIP protocol, and a

version number is carried by each RIP packet. The IP protocol does not change when a new RIP

protocol is introduced.

Any IP router will correctly forward incoming IP packets on a reasonable, if not optimum route,

whatever version of RIP they use. But for routers to cooperate in the updating of their routing

tables, they must share a similar algorithm.



For this purpose the topological areas defined above are used. Within each area a single

routing algorithm applies and the routers within an area cooperate in the maintenance of their

routing tables.

Default routes

The discussion of routing algorithms so far has suggested that every router maintains a full routing table showing the

route to every destination (subnet or directly connected host) in the Internet. At the current scale

of the Internet this is clearly infeasible.

Two possible solutions to this problem:

The first solution is to adopt some form of topological grouping of IP addresses.

The decision was taken that for future allocations, the following regional locations would be

applied:

Addresses 194.0.0.0 to 195.255.255.255 are in Europe.

Addresses 198.0.0.0 to 199.255.255.255 are in North America.

Addresses 200.0.0.0 to 201.255.255.255 are in Central and South America.

Addresses 202.0.0.0 to 203.255.255.255 are in Asia and the Pacific.

Because these geographical regions also correspond to well-defined topological regions in the

Internet and just a few gateway routers provide access to each region, this enables a substantial

simplification of routing tables for those address ranges. For example, a router outside

Europe can have a single table entry for the range of addresses 194.0.0.0 to

195.255.255.255 that sends all IP packets with destinations in that range on the same route to the

nearest European gateway router.

The second solution to the routing table size explosion is simpler and very effective. It is based on

the observation that the accuracy of routing information can be relaxed for most routers as long as

some key routers, those closest to the backbone links, have relatively complete routing tables. The

relaxation takes the form of a default destination entry in routing tables. The

default entry specifies a route to be used for all IP packets whose destination is not included in

the routing table.

Routing on a local subnet

Packets addressed to hosts on the same network as the sender are

transmitted to the destination host in a single hop, using the host identifier

part of the address to obtain the address of the destination host on the underlying network. The IP

layer simply uses ARP to get the network address of the destination and

then uses the underlying network to transmit the packets.



If the IP layer in the sending computer discovers that the destination is on a different network,

it must send the message to a local router. It uses ARP to get the network address of the gateway

or router and then uses the underlying network to transmit the packet to it. Gateways and routers

are connected to two or more networks and they have several Internet addresses, one for each

network to which they are attached.

Classless interdomain routing (CIDR)

The main problem was a scarcity of Class B addresses - those for subnets with more than 255

hosts connected. Plenty of Class C addresses were available. The CIDR solution for this

problem is to allocate a batch of contiguous class C addresses to a subnet requiring more than

255 addresses.

The CIDR scheme also makes it possible to subdivide a Class B address space for allocation

to multiple subnets. The mask is a bit pattern that is used to select the portion of an IP address

that is compared with the routing table entry. This effectively enables the host/subnet address

to be any portion of the IP address, providing more flexibility than the classes A, B and C.

Once again, these changes to routers are made on an incremental basis, so some routers perform

CIDR and others use the old class-based algorithms. This works because the newly allocated

ranges of Class C addresses are assigned modulo 256, so each range represents an integral

number of Class C-sized subnet addresses. If a collection of subnets is connected to the rest of the

world entirely by CIDR routers, then the ranges of IP addresses used within the collection can be

allocated to individual subnets in chunks determined by a binary mask of any size.

For example, a Class C address space can be subdivided into 32 groups of 8. Figure 3.10 contains

an example of the use of the CIDR mechanism to split the 138.37.95 Class C-sized subnet

into several groups of eight host addresses that are routed differently. The separate groups are

denoted by notations 138.37.95.232/29, 138.37.95.248/29 and so on.
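The use of a mask to select the portion of an address that is compared with a routing table entry can be sketched as follows. This is a simplified illustration, not a router implementation: the class CidrTable, its linear search and the route names are assumptions. The entry 0.0.0.0/0 plays the part of a default route, since it matches every address but is chosen only when no longer prefix matches.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a routing table that matches addresses against prefix/length entries. */
final class CidrTable {
    private static final class Entry {
        final int network, mask, length;
        final String route;
        Entry(int network, int mask, int length, String route) {
            this.network = network; this.mask = mask; this.length = length; this.route = route;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Converts dotted-decimal notation to a 32-bit value. */
    static int toInt(String dotted) {
        int value = 0;
        for (String octet : dotted.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    /** Adds an entry such as "138.37.95.232/29" with a (hypothetical) route name. */
    void add(String cidr, String route) {
        String[] parts = cidr.split("/");
        int length = Integer.parseInt(parts[1]);
        int mask = (length == 0) ? 0 : -1 << (32 - length); // top 'length' bits set
        entries.add(new Entry(toInt(parts[0]) & mask, mask, length, route));
    }

    /** Returns the route of the longest matching prefix, or null if none matches. */
    String lookup(String address) {
        int addr = toInt(address);
        Entry best = null;
        for (Entry e : entries) {
            boolean matches = (addr & e.mask) == e.network;
            if (matches && (best == null || e.length > best.length)) best = e;
        }
        return (best == null) ? null : best.route;
    }

    public static void main(String[] args) {
        CidrTable table = new CidrTable();
        table.add("138.37.95.232/29", "router-a");
        table.add("138.37.95.248/29", "router-b");
        table.add("0.0.0.0/0", "default");
        System.out.println(table.lookup("138.37.95.233")); // router-a
        System.out.println(table.lookup("138.37.95.250")); // router-b
        System.out.println(table.lookup("194.0.0.1"));     // default
    }
}
```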

Firewalls

The purpose of a firewall is to monitor and control all communication into and out of an intranet.

A firewall is implemented by a set of processes that act as a gateway to an intranet (Figure

3.20(a)), applying a security policy determined by the organization.

The aims of a firewall security policy may include any or all of the following:

Service control:

To determine which services on internal hosts are accessible for external access and to reject all

other incoming service requests. Outgoing service requests and the responses to them may also be

controlled. These filtering actions can be based on the contents of IP packets and the TCP and



UDP requests that they contain. For example, incoming HTTP requests may be rejected unless

they are directed to an official web server host.

Behaviour control:

To prevent behaviour that infringes the organization's policies, is anti-social or has no

discernible legitimate purpose and is hence suspected of forming part of an attack. Some of these

filtering actions may be applicable at the IP or TCP level, but others may require interpretation of

messages at a higher level. For example, filtering of email `spam' attacks may require

examination of the sender's email address in message headers or even the message contents.

User control:

The organization may wish to discriminate between its users, allowing some access to

external services but inhibiting others from doing so. An example of user control that is perhaps

more socially acceptable than some is to prevent the acquiring of software except by users

who are members of the system administration team, in order to prevent virus infection or to

maintain software standards. This particular example would in fact be difficult to implement

without inhibiting the use of the Web by ordinary users.

The policy has to be expressed in terms of filtering operations that are performed by filtering

processes operating at several different levels:

IP packet filtering:

This is a filter process examining individual IP packets. It may make decisions based on

the destination and source addresses. It may also examine the service type field of IP packets and

interpret the contents of the packets based on the type. For example, it may filter TCP packets

based on the port number to which they are addressed, and since services are generally located at

well-known ports, this enables packets to be filtered based on the service requested. For example,

many sites prohibit the use of NFS servers by external clients.
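A port-based filtering rule of this kind might look like the sketch below. The class PacketFilter and its two-argument decision are simplifications introduced for illustration; a real filter would examine many more header fields. Port 2049 is assumed here as the well-known port of NFS.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch of an IP packet filter that blocks external access to selected TCP ports. */
final class PacketFilter {
    private final Set<Integer> blockedPorts;

    PacketFilter(Integer... ports) {
        blockedPorts = new HashSet<>(Arrays.asList(ports));
    }

    /** Returns true if a TCP packet addressed to dstPort may pass the firewall. */
    boolean allow(boolean fromExternalHost, int dstPort) {
        return !(fromExternalHost && blockedPorts.contains(dstPort));
    }

    public static void main(String[] args) {
        PacketFilter filter = new PacketFilter(2049);   // assumed NFS port
        System.out.println(filter.allow(true, 2049));   // false: external NFS client
        System.out.println(filter.allow(false, 2049));  // true: internal client
        System.out.println(filter.allow(true, 80));     // true: port is not blocked
    }
}
```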

TCP gateway:

A TCP gateway process checks all TCP connection requests and segment transmissions. When a

TCP gateway process is installed, the setting-up of TCP connections can be controlled and TCP

segments can be checked for correctness (some denial of service attacks use malformed TCP

segments to disrupt client operating systems). When desired, they can be routed through an

application-level gateway for content checking.

Application-level gateway:

An application-level gateway process acts as a proxy for an application process. For



example, a policy may be desired that allows certain internal users to make Telnet connections to

certain external hosts. When a user runs a Telnet program on his local computer, it attempts to

establish a TCP connection with a remote host.

The request is intercepted by the TCP gateway. The TCP gateway starts a Telnet proxy process

and the original TCP connection is routed to it. If the proxy approves the Telnet operation (the

user is authorized to use the requested host) it establishes another connection to the requested host

and then it relays all of the TCP packets in both directions. A similar proxy process would

run on behalf of each Telnet client, and similar proxies might be employed for FTP and other

services.

1.12 Question Bank

1. Define distributed systems.
2. Give examples of distributed systems.
3. Write short notes on the following: (i) HTTP (ii) HTML (iii) URL
4. What are the uses of web services?
5. Define heterogeneity.
6. What are the characteristics of heterogeneity?
7. What is the purpose of heterogeneity mobile code?
8. Why do we need openness?
9. How do we provide security?
10. Define scalability.
11. What are the types of transparencies?
12. Define transparencies.
13. Define system model.
14. What is the architectural model?
15. What is the fundamental model?
16. What are the difficulties and threats for distributed systems?
17. Define middleware.
18. What are the different types of model?
19. Which type of network can be used by a distributed system?
20. What are the different types of network?
21. Define latency.
22. What is the difference between networking and internetworking?
23. What is meant by networking?
24. What is meant by internetworking?
25. What are the different types of switching used in computer networking?
26. Define protocol.
27. What is the function of a router?
28. What is meant by internet protocol?
29. Define domain name.
30. Define mobile IP.

PART-B

1. a. Explain the differences between intranet and internet. (8)
   b. Write in detail about WWW. (8)
2. Explain the various challenges of distributed systems. (16)
3. Write in detail about the characteristics of inter process communication. (16)
4. a. Explain in detail about marshalling. (8)
   b. Explain about the networking principles. (8)
5. Describe in detail about client-server communication. (16)
6. Write in detail about group communication. (16)
7. Explain in detail about the various system models. (16)



2.1 INTRODUCTION - INTER PROCESS COMMUNICATION

Inter process communication is concerned with the communication between processes

in a distributed system, both in its own right and as support for communication between

distributed objects.

The Java API for inter process communication in the internet provides both

datagram and stream communication.

The Application Program Interface (API) to UDP provides a message passing abstraction

- the simplest form of interprocess communication. This enables a sending process to transmit

a single message to a receiving process. The independent packets containing these messages

are called datagrams. In the Java and UNIX APIs, the sender specifies the destination using a

socket, an indirect reference to a particular port used by the destination process at a

destination computer.

The API to TCP provides the abstraction of a two-way stream between pairs of processes.

The information communicated consists of a stream of data items with no message boundaries.

Request-reply protocols are designed to support client-server communication in the form

of either Remote Method Invocation (RMI) or Remote Procedure Call (RPC). Group

multicast protocols are designed to support group communication. Group multicast is a form of

interprocess communication in which one process in a group of processes transmits the same

message to all members of the group.

2.2 The API for the Internet Protocol

The characteristics of interprocess communication

Message passing between a pair of processes can be supported by two message

communication operations: send and receive.

In order for one process to communicate with another, one process sends a message to a

destination and another process at the destination receives the message. This activity

involves the communication of data from the sending process to the receiving process and may

involve the synchronization of the two processes.

A queue is associated with each message destination. Sending processes cause messages to be

added to remote queues and receiving processes remove messages from local queues.

Communication between sending and receiving processes may be either synchronous

or asynchronous.



In the synchronous form of communication, the sending and receiving processes synchronize at

every message. In this case, both send and receive are blocking operations. Whenever a send is

issued the sending process is blocked until the corresponding receive is issued. Whenever

receive is issued, the process blocks until a message arrives.

In the asynchronous form of communication, the use of the send operation is non-blocking in

that the sending process is allowed to proceed as soon as the message has been copied to a

local buffer and the transmission of the message proceeds in parallel with the sending

process. The receive operation can have blocking and non-blocking variants.
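The difference between the two forms can be imitated inside a single Java program by using queues from java.util.concurrent as message destinations. This is only an analogy under stated assumptions, not the socket API: a LinkedBlockingQueue stands for a destination with a local buffer (asynchronous send), and a SynchronousQueue stands for a destination with no buffer, so that a send can complete only when a receive is issued. The class SendModes and the 100 ms limit are choices made for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

/** Sketch: a send with no receiver waiting, against buffered and unbuffered destinations. */
final class SendModes {
    /** Returns true if the send completed within 100 ms although nobody is receiving. */
    static boolean sendWithoutReceiver(BlockingQueue<String> destination) {
        try {
            return destination.offer("hello", 100, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Asynchronous: the message is copied to a buffer and the sender proceeds.
        System.out.println(sendWithoutReceiver(new LinkedBlockingQueue<String>()));
        // Synchronous: the sender stays blocked until a matching receive is issued.
        System.out.println(sendWithoutReceiver(new SynchronousQueue<String>()));
    }
}
```

The first call returns true at once; the second gives up after the time limit because no receive was issued.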

Messages are sent to (Internet address, local port) pairs. A local port is a message

destination within a computer, specified as an integer. A port has exactly one receiver but

can have many senders. Processes may use multiple ports from which to receive

messages. Servers generally publicize their port numbers for use by clients.

Sockets:

Both forms of communication (UDP and TCP) use the socket abstraction, which

provides an end point for communication between processes. Inter-process

communication consists of transmitting a message between a socket in one

process and a socket in another process.

For a process to receive messages, its socket must be bound to a local port and the

Internet address of the computer on which it runs. Processes may use the same

socket for sending and receiving messages.

Any process may make use of multiple ports to receive messages, but a process

cannot share ports with other processes on the same computer. Processes using IP

multicast are an exception in that they do share ports.

UDP datagram communication



A datagram sent by UDP is transmitted from a sending process to a receiving process

without acknowledgement or retries. If a failure occurs, the message may not arrive. To send or

receive messages, a process must first create a socket bound to an Internet address of the local

host and a local port. A server will bind its socket to a server port - one that it makes known to

clients so that they can send messages to it. A client binds its socket to any free local port.

Some issues relating to datagram communication are:

Message size - The receiving process needs to specify an array of bytes of a particular size.

If the message is too big for the array, it is truncated on arrival. Any application requiring

messages larger than the maximum must fragment them into chunks of that size.

Blocking - Sockets normally provide non-blocking sends and blocking receives for

datagram communication.

Timeouts - The receive that blocks for ever is suitable for use by a server that is waiting to

receive requests from its clients. But in some programs, it is not appropriate that a process

that has used a receive operation should wait indefinitely in situations where the potential

sending process has crashed or the expected message has been lost. To meet such

requirements, timeouts can be set on sockets.

Receive from any - The receive method does not specify an origin for messages. Instead

an invocation of receive gets a message addressed to its socket from any origin.
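The timeout issue above can be illustrated with a short sketch using java.net.DatagramSocket. The class ReceiveWithTimeout and the return codes -1 and -2 are hypothetical conventions for this example. In current Java the exception thrown on expiry is SocketTimeoutException, a subclass of InterruptedIOException.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketTimeoutException;

/** Sketch: a receive that gives up after a timeout instead of blocking for ever. */
final class ReceiveWithTimeout {
    /** Returns the length of the datagram received, -1 on timeout, -2 on other I/O errors. */
    static int receiveOrTimeout(int timeoutMillis) {
        try (DatagramSocket socket = new DatagramSocket()) {   // any free local port
            socket.setSoTimeout(timeoutMillis);
            byte[] buffer = new byte[1000];
            DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
            socket.receive(packet);                            // blocks for at most the timeout
            return packet.getLength();
        } catch (SocketTimeoutException e) {
            return -1;                                         // nothing arrived in time
        } catch (IOException e) {
            return -2;
        }
    }

    public static void main(String[] args) {
        System.out.println(receiveOrTimeout(200));
    }
}
```

Since no process sends to the freshly created socket, the call normally returns -1 once the timeout expires.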

Failure model: UDP datagrams suffer from the following failures:

Omission failures - messages may be dropped occasionally.

Ordering - messages can sometimes be delivered out of sender order.

Use of UDP:

The Domain Name Service (DNS), which looks up DNS names in the Internet, is

implemented over UDP. UDP datagrams are sometimes an attractive choice because they do not

suffer from overheads associated with guaranteed message delivery.

Java API for UDP datagrams:

The Java API provides datagram communication by means of two classes:

DatagramPacket

DatagramSocket

DatagramPacket: This class provides a constructor that makes an instance out of an array of

bytes comprising a message, the length of the message and the Internet address and local port



number of the destination socket as shown in the following Figure. This class provides another

constructor for receiving a message. Its arguments specify an array of bytes to receive the message

and its length.

Array of bytes containing message | Length of message | Internet address | Port number

Figure: Datagram packet

DatagramSocket: This class supports sockets for sending and receiving UDP datagrams. It

provides a constructor that takes a port number as argument, for use by processes that need

to use a particular port. It also provides a no-argument constructor that allows the system to

choose a free local port.

The class DatagramSocket provides the following methods:

send and receive: These methods are for transmitting datagrams between a pair of sockets.

setSoTimeout: This method allows a timeout to be set. With a timeout set, the receive

method will block for the time specified and then throw an InterruptedIOException.

connect: This method is used for connecting the socket to a particular remote port and Internet

address.

Program: UDP client

import java.net.*;
import java.io.*;

public class UDPClient {
    public static void main(String args[]) {
        // args[0] is the message, args[1] is the DNS hostname of the server
        DatagramSocket aSocket = null;
        try {
            aSocket = new DatagramSocket();          // bound to any free local port
            byte[] m = args[0].getBytes();
            InetAddress aHost = InetAddress.getByName(args[1]);
            int serverPort = 6789;
            DatagramPacket request = new DatagramPacket(m, m.length, aHost, serverPort);
            aSocket.send(request);
            byte[] buffer = new byte[1000];
            DatagramPacket reply = new DatagramPacket(buffer, buffer.length);
            aSocket.receive(reply);                  // blocks until the reply arrives
            System.out.println("Reply: " + new String(reply.getData(), 0, reply.getLength()));
        } catch (SocketException e) {
            System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO: " + e.getMessage());
        } finally {
            if (aSocket != null) aSocket.close();
        }
    }
}

In the above program, the client creates a socket, sends a message to a server at port 6789

and then waits to receive a reply. The arguments of the main method supply a message and the

DNS hostname of the server.

The code that follows is the corresponding server program, which creates a socket bound to

its server port (6789) then repeatedly waits for the request message from a client, to which it

replies by sending back the same message.

Program - UDP Server:

import java.net.*;
import java.io.*;

public class UDPServer {
    public static void main(String args[]) {
        DatagramSocket aSocket = null;
        try {
            aSocket = new DatagramSocket(6789);      // bound to the server port
            byte[] buffer = new byte[1000];
            while (true) {
                DatagramPacket request = new DatagramPacket(buffer, buffer.length);
                aSocket.receive(request);            // blocks until a request arrives
                // send the same bytes back to the address and port of the sender
                DatagramPacket reply = new DatagramPacket(request.getData(),
                        request.getLength(), request.getAddress(), request.getPort());
                aSocket.send(reply);
            }
        } catch (SocketException e) {
            System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO: " + e.getMessage());
        } finally {
            if (aSocket != null) aSocket.close();
        }
    }
}



TCP stream Communication.

The API to the TCP protocol provides the abstraction of a stream of bytes to which data may be

written and from which data may be read. The following characteristics of the network are

hidden by the stream abstraction.

Message size: The application can choose how much data it writes to a stream or reads

from it.

Lost messages: The TCP uses an acknowledgement scheme. If the sender does not

receive an acknowledgement within a timeout, it retransmits the message.

Flow control: TCP attempts to match the speed of the processes that read from and write

to a stream.

Message ordering and duplication: Message identifiers are associated with each IP packet,

which enables the recipient to detect and reject duplicates, or to reorder messages that do

not arrive in sender order.

Message destinations: A pair of communicating processes establishes a connection before

they can communicate over a stream. Once a connection is established, the processes simply

read from and write to the stream without needing to use Internet addresses and ports.

The API for stream communication assumes that when a pair of processes are establishing a

connection, one of them plays the client role and the other plays the server role, but thereafter

they would be peers.

The client role involves creating a stream socket bound to any port and then making a

connect request asking for a connection to a server at its server port. The server role involves

creating a listening socket bound to a server port and waiting for clients to request connections.

The listening socket maintains a queue of incoming connection requests. When the server

accepts a connection, a new stream socket is created for the server to communicate with a client,

meanwhile retaining its socket at the port for listening to other clients.

Some outstanding issues related to stream communication are

Matching of data items: Two communicating processes need to agree as to the contents of

the data transmitted over a stream. E.g., if one process writes an 'int' followed by a

'double' then the reader at the other end must read an 'int' followed by a 'double'.

Blocking: When a process attempts to read data from an input channel, it will get data

from the queue or it will block until data becomes available.



The process that writes data to a stream may be blocked by the TCP flow control

mechanism if the socket at the other end is queueing as much data as the protocol allows.

Threads: When a server accepts a connection, it generally creates a new thread to

communicate with the new client.
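
The matching of data items can be illustrated with a small sketch. It assumes an in-memory byte array in place of a real TCP stream, and the class name MatchingDataItems is made up for this example. The writer sends an int followed by a double, and the reader must read them back in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class MatchingDataItems {
    // Writer side: an int followed by a double.
    static byte[] write(int i, double d) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(i);       // 4 bytes
            out.writeDouble(d);    // 8 bytes
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader side: must read an int first and then a double.
    static String read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int i = in.readInt();
            double d = in.readDouble();
            return i + " " + d;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(write(7, 2.5)));
    }
}
```

If the reader called readDouble before readInt, it would consume the wrong bytes for each item and produce meaningless values, which is why both ends must agree on the contents of the stream.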

Failure Model:

To satisfy the integrity property, TCP uses checksums to detect and reject corrupt packets

and sequence numbers to detect and reject duplicate packets. In order to satisfy the validity

property, timeouts and retransmissions are used by TCP.

Use of TCP: Many frequently used services run over TCP connections with reserved port

numbers. These include HTTP, FTP, Telnet and SMTP.

Java API for TCP streams: The Java interface to TCP streams is provided in the classes

ServerSocket and Socket.

ServerSocket: This class is intended for use by a server to create a socket at a server port for

listening for connect requests from clients. Its accept method gets a connect request from the

queue, or if the queue is empty, it blocks until one arrives.

Socket: The client uses a constructor of this class to create a socket, specifying the DNS hostname

and port of a server. The Socket class provides the methods getInputStream and getOutputStream for

accessing the two streams associated with a socket.

TCP client program:

import java.net.*;
import java.io.*;

public class TCPClient {
    public static void main(String args[]) {
        // args[0] is the message, args[1] is the DNS hostname of the server
        Socket c = null;
        try {
            int serverPort = 7896;
            c = new Socket(args[1], serverPort);
            DataInputStream in = new DataInputStream(c.getInputStream());
            DataOutputStream out = new DataOutputStream(c.getOutputStream());
            out.writeUTF(args[0]);
            String data = in.readUTF();   // blocks until the reply arrives
            System.out.println("Received: " + data);
        } catch (UnknownHostException e) {
            System.out.println("Sock: " + e.getMessage());
        } catch (EOFException e) {
            System.out.println("EOF: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO: " + e.getMessage());
        } finally {
            if (c != null) {
                try { c.close(); } catch (IOException e) { System.out.println("Close: " + e.getMessage()); }
            }
        }
    }
}

In the client program, the arguments of the main method supply a message and the DNS

hostname of the server. The client creates a socket bound to the hostname and server port 7896. It

makes a DataInputStream and a DataOutputStream, then writes the message to its output stream and

waits to read a reply from its input stream. UTF is an encoding that represents strings in a

particular format.

The server program opens a server socket on its server port (7896) and listens for connect

requests. When one arrives, a full server would make a new thread in which to communicate with the client; the simplified program below serves each client in turn on the main thread.

TCP server Program:

import java.net.*;
import java.io.*;

public class TCPServer {
    public static void main(String args[]) {
        ServerSocket lissoc = null;
        try {
            int serverPort = 7896;
            lissoc = new ServerSocket(serverPort);
            while (true) {
                Socket s = lissoc.accept();   // blocks until a client connects
                DataInputStream in = new DataInputStream(s.getInputStream());
                DataOutputStream out = new DataOutputStream(s.getOutputStream());
                String line = in.readUTF();
                out.writeUTF(line);           // echo the message back
                s.close();                    // finished with this client
            }
        } catch (EOFException e) {
            System.out.println("EOF: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO: " + e.getMessage());
        } finally {
            if (lissoc != null) {
                try { lissoc.close(); } catch (IOException e) { System.out.println("Close: " + e.getMessage()); }
            }
        }
    }
}
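
The thread-per-connection arrangement mentioned under "Threads" can be sketched as follows. This is only an illustration under stated assumptions: the class name ThreadedEchoSketch and its helper methods are made up, the server listens on a free loopback port chosen by the system rather than on 7896, and each connection carries a single message.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class ThreadedEchoSketch {
    // Runs on its own thread: reads one UTF string from the client and echoes it.
    static void handle(Socket client) {
        try (Socket s = client;
             DataInputStream in = new DataInputStream(s.getInputStream());
             DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
            out.writeUTF(in.readUTF());
        } catch (IOException e) {
            System.out.println("Connection: " + e.getMessage());
        }
    }

    // Accepts connections and hands each one to a new thread.
    static void serve(ServerSocket listener) {
        try {
            while (true) {
                Socket client = listener.accept();
                new Thread(() -> handle(client)).start();
            }
        } catch (IOException e) {
            // accept() fails once the listening socket is closed, which ends the loop
        }
    }

    // Starts a server, sends one message as a client and returns the echoed reply.
    static String roundTrip(String message) {
        try (ServerSocket listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread acceptor = new Thread(() -> serve(listener));
            acceptor.setDaemon(true);
            acceptor.start();
            try (Socket c = new Socket(listener.getInetAddress(), listener.getLocalPort());
                 DataOutputStream out = new DataOutputStream(c.getOutputStream());
                 DataInputStream in = new DataInputStream(c.getInputStream())) {
                out.writeUTF(message);
                return in.readUTF();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Received: " + roundTrip("hello"));
    }
}
```

Because each client is served on its own thread, a slow client delays only its own connection while the listening socket keeps accepting others.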

2.3 EXTERNAL DATA REPRESENTATION AND MARSHALLING

The information stored in running programs is represented as data structures whereas the

information in messages consists of sequences of bytes. Irrespective of the form of

communication used, the data structures must be flattened (converted to a sequence of bytes)

before transmission and rebuilt on arrival. The representation of data items differs between

architectures. Another issue is the set of codes used to represent characters: for example, UNIX

systems use ASCII character coding, taking one byte per character, whereas the Unicode standard

allows for the representation of text in many languages and takes two bytes per character in its basic 16-bit encoding.

One of the following methods can be used to enable any two computers to exchange data

values.

• The values are converted to an agreed external format before transmission and converted to

the local form on receipt.

• The values are transmitted in the sender's format, together with an indication of the format

used, and the recipient converts the values if necessary.

To support RMI (Remote Method Invocation) or RPC (Remote Procedure Call) any data

type that can be passed as an argument or returned as a result must be able to be flattened and the

individual primitive data values represented in an agreed format. An agreed standard for the

representation of data structures and primitive values is called an external data representation.

Marshalling and Unmarshalling :

Marshalling is the process of taking a collection of data items and assembling them into a

form suitable for transmission in a message. Unmarshalling is the process of disassembling them

on arrival to produce an equivalent collection of data items at the destination. Thus marshalling

consists of the translation of structured data items and primitive values into an external data

representation. Similarly, unmarshalling consists of the generation of primitive values from their

external data representation and the rebuilding of the data structures.

Two alternative approaches to external data representation and marshalling are:

CORBA's common data representation

Java's object serialization



CORBA's common Data Representation (CDR)

CORBA CDR is the external data representation defined with CORBA 2.0. CDR can

represent all of the data types that can be used as arguments and return values in remote

invocations in CORBA. It consists of 15 primitive types that include short (16-bit), long (32-bit),

unsigned short, unsigned long, float (32-bit), double (64-bit), char, boolean (TRUE or FALSE),

octet (8-bit) and any, together with the constructed types shown in the following figure.

Type          Representation
sequence      length (unsigned long) followed by elements in order
string        length (unsigned long) followed by characters in order
array         array elements in order
struct        in the order of declaration of the components
enumerated    unsigned long
union         type tag followed by the selected member

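Index in byte sequence    4 bytes       Notes
0-3                       5             length of string
4-7                       "Smit"        'Smith'
8-11                      "h" + pad     three padding bytes
12-15                     6             length of string
16-19                     "Lond"        'London'
20-23                     "on" + pad    two padding bytes
24-27                     1934          unsigned long

Figure: CORBA CDR message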

The Figure shows a message in CORBA CDR that contains three fields of a structure whose

respective types are string, string and unsigned long. The representation of each string consists of an

unsigned long representing its length followed by the characters in the string. Variable length data is

padded with zeros. Each unsigned long occupies four bytes, so it always starts at an index that is a multiple of four.
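
The layout in the figure can be reproduced with a short sketch. It follows the simplified layout described above (a four-byte length, the characters, then zero padding up to a multiple of four) rather than the complete CDR rules. The class name CdrLayoutSketch and its methods are assumptions for illustration, and the strings are assumed to contain only ASCII characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class CdrLayoutSketch {
    // Rounds n up to the next multiple of four.
    static int padTo4(int n) {
        return (n + 3) & ~3;
    }

    // Writes a string as a 4-byte length, the characters, then zero padding
    // so that the next item again starts at a multiple of four.
    static void putString(ByteBuffer buf, byte[] chars) {
        buf.putInt(chars.length);
        buf.put(chars);
        for (int i = chars.length; i < padTo4(chars.length); i++) {
            buf.put((byte) 0);
        }
    }

    // Marshals the struct {string name; string place; unsigned long year}.
    static byte[] marshalPerson(String name, String place, int year) {
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        byte[] p = place.getBytes(StandardCharsets.US_ASCII);
        int size = 4 + padTo4(n.length) + 4 + padTo4(p.length) + 4;
        ByteBuffer buf = ByteBuffer.allocate(size);   // ByteBuffer is big-endian by default
        putString(buf, n);
        putString(buf, p);
        buf.putInt(year);
        return buf.array();
    }

    // Reads the 4-byte integer that starts at the given index.
    static int intAt(byte[] message, int index) {
        return ByteBuffer.wrap(message).getInt(index);
    }

    public static void main(String[] args) {
        byte[] message = marshalPerson("Smith", "London", 1934);
        System.out.println(message.length + " bytes, year at index 24 = " + intAt(message, 24));
    }
}
```

For the Person example the message is 28 bytes long, with the two lengths at indexes 0 and 12 and the year at index 24, matching the figure.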



The CORBA interface compiler generates appropriate marshalling and

unmarshalling operations for the arguments and results of remote invocations.

Java object serialization

In Java, the term serialization refers to the activity of flattening an object or a connected set of

objects into a serial form that is suitable for storing on disk or transmitting in a message, for example as

an argument and result of an RMI. Deserialization consists of restoring the state of an object or set

of objects from their serialized form.

Java objects can contain references to other objects. When an object is serialized, all the

objects that it references are serialized together with it. References are serialized as handles. The

handle is a reference to an object within the serialized form.

To serialize an object, its class information is written out, followed by the types and names

of its instance variables. If the instance variables belong to new classes, then their class

information must also be written out followed by the types and names of their instance variables.

This recursive procedure continues until the class information and instance variables of the

necessary classes have been written out. Each class is given a handle, and no class is written

more than once to the stream of bytes- the handles being written where necessary.

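E.g.: Person p = new Person("Smith", "London", 1934);

The serialized form of this example is shown in the following figure, in which h0 and h1 are handles.

Serialized values                              Explanation
Person   8-byte version number   h0            class name, version number
3   int year   String name   String place      number, type and name of instance variables
1934   5 Smith   6 London   h1                 values of instance variables

Figure: Serialized form of the Person object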

• Primitive types are written in a portable format using methods of the ObjectOutputStream class.

• Strings and characters are written by its writeUTF method (UTF: Universal Transfer Format).

• To serialize an object (e.g. Person), create an instance of the class ObjectOutputStream and

invoke its writeObject method, passing the Person object as the argument.

• To deserialize an object from a stream of data, open an ObjectInputStream on the

stream and use its readObject method to reconstruct the original object.
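
The two steps can be tried out with a minimal sketch such as the following. The Person class shown here is an assumed definition (the notes give only its constructor call), the class name SerializationSketch is made up, and a byte array stands in for the message or file that would normally carry the serialized form.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSketch {
    // A class must implement Serializable for writeObject to accept it.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final String place;
        final int year;

        Person(String name, String place, int year) {
            this.name = name;
            this.place = place;
            this.year = year;
        }
    }

    // Serialize: ObjectOutputStream.writeObject flattens the object into bytes.
    static byte[] serialize(Person p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize: ObjectInputStream.readObject rebuilds an equivalent object.
    static Person deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Person) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    static String roundTrip(String name, String place, int year) {
        Person copy = deserialize(serialize(new Person(name, place, year)));
        return copy.name + " " + copy.place + " " + copy.year;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Smith", "London", 1934));
    }
}
```

The deserialized Person is a new object with the same state as the original, not the original object itself.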



Serialization and deserialization of the arguments and results of remote invocations are

generally carried out automatically by the middleware, without any participation by the

application programmer.

Remote object References:

A remote object reference is an identifier for a remote object that is valid throughout a

distributed system. A remote object reference is passed in the invocation message to specify

which object is to be invoked. Even after the remote object associated with a given remote object

reference is deleted, it is important that the remote object reference is not reused.

Remote object reference can be constructed by concatenating the Internet address of its computer,

port number of the process that created it with the time of its creation and a local object number.

The local object number is incremented each time an object is created in that process.

32 bits             32 bits        32 bits    32 bits
Internet address    port number    time       object number    interface of remote object

Figure: Representation of a remote object reference
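
A remote object reference built this way might be sketched as below. The class name RemoteObjectRefSketch and its fields are illustrative assumptions, the interface field is left out, and the Internet address is assumed to be a 32-bit IPv4 address held in an int.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

class RemoteObjectRefSketch {
    // Incremented each time an object is created in this process,
    // so an object number is never reused.
    private static final AtomicInteger nextObjectNumber = new AtomicInteger(0);

    final int internetAddress;   // 32-bit IPv4 address of the computer
    final int port;              // port number of the creating process
    final int time;              // creation time of the process
    final int objectNumber;      // local object number

    RemoteObjectRefSketch(int internetAddress, int port, int time) {
        this.internetAddress = internetAddress;
        this.port = port;
        this.time = time;
        this.objectNumber = nextObjectNumber.incrementAndGet();
    }

    // Concatenates the four 32-bit fields into 16 bytes.
    byte[] toBytes() {
        return ByteBuffer.allocate(16)
                .putInt(internetAddress)
                .putInt(port)
                .putInt(time)
                .putInt(objectNumber)
                .array();
    }

    public static void main(String[] args) {
        int now = (int) (System.currentTimeMillis() / 1000);
        RemoteObjectRefSketch ref = new RemoteObjectRefSketch(0x7F000001, 6789, now);
        System.out.println("object number " + ref.objectNumber + ", " + ref.toBytes().length + " bytes");
    }
}
```

Including the creation time means that references issued before a process restart differ from those issued afterwards, even if the object numbers start again from the beginning.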

2.4 CLIENT-SERVER COMMUNICATION

This form of communication is designed to support the roles and message exchanges in typical

client-server interactions. In general, request-reply communication is synchronous because the

client process blocks until the reply arrives from the server. It can also be reliable because reply

is effectively an acknowledgement to the client.

The client and server exchange messages using the send and receive operations of the Java

API for UDP datagrams. A protocol built over datagrams avoids unnecessary overheads associated with the TCP

stream protocol.

The Request-Reply Protocol:

This protocol is based on three primitives: doOperation, getRequest and sendReply. It may

be designed to provide certain delivery guarantees. If UDP datagrams are used, the delivery

guarantees must be provided by the request-reply protocol, which may use the server reply

message as an acknowledgement of the client request message.



The doOperation method is used by client to invoke remote operations. Its arguments

specify the remote object and which method to invoke, together with additional information

required by the method. It is assumed that the client calling doOperation marshals the arguments

into an array of bytes and unmarshals the results from the array of bytes that is returned.

Syntax:

public byte[ ] doOperation(RemoteObjectRef o, int methodid, byte[ ] arguments)

The request-reply message structure begins with the field:

Message Type        int (0 = request, 1 = reply)

The doOperation method sends a request message to the server whose Internet address and

port are specified in the remote object reference given as argument. After sending the request

message, doOperation invokes receive to get a reply message, from which it extracts the result

and returns it to the caller.

The getRequest method is used by a server process to acquire service requests. When the server has

invoked the method in the specified object it then uses sendReply to send the reply message to

the client. When the reply message is received by the client, the doOperation is unblocked and

execution of the client program continues.

Syntax: public byte[ ] getRequest( );

public void sendReply(byte[ ] reply, InetAddress clientHost, int clientPort);

Figure: Request-reply message structure (fields following the message type)

Request Id          int
Object Reference    RemoteObjectRef
Method Id           int or Method
Arguments           array of bytes

Failure model of the request-reply protocol:

If the three primitive operations are implemented over UDP datagrams, they suffer from the

following communication failures:

They suffer from omission failures

Messages are not guaranteed to be delivered in sender order

The protocol can suffer from the failure of processes

To allow for occasions when a server has failed or a request or reply message is dropped,

doOperation uses a timeout when it is waiting to get the server's reply message. The action taken

when a timeout occurs depends upon the delivery guarantees to be offered.

The protocol is designed to recognize successive messages with the same request identifier

and to filter out duplicates. If the server has already sent the reply when it receives a duplicate request

it will need to execute the operation again to obtain the result, unless it has stored the result of the original execution. Some servers can execute their

operations more than once and obtain the same results each time. An idempotent operation is one that

can be performed repeatedly with the same effect as if it had been performed exactly once.

For servers that require retransmission of replies without re-execution of operations, a history may

be used. The term 'history' refers to a structure that contains a record of reply messages that

have been transmitted. An entry in a history contains a request identifier, a message and an

identifier of the client to which it was sent. Its purpose is to allow the server to retransmit reply

messages when client processes request them. A problem associated with the use of a history is its

memory cost.

RPC exchange protocols :

The following three protocols are used for implementing various types of RPC. These three

protocols produce differing behaviors in the presence of communication failures.

The request (R) protocol: It may be used when there is no value to be returned from the

procedure and the client requires no confirmation that the procedure has been executed.

The request-reply (RR) protocol: It is useful for most client-server exchanges. Special

acknowledgement messages are not required, because a server's reply message is regarded



as an acknowledgement.

The request-reply-acknowledgement (RRA) protocol: It is based on the exchange of three

messages: request, reply and acknowledgement. The acknowledgement contains the

requestId, which enables the server to discard entries from its history.
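
The use of a history to filter duplicate requests, and of RRA acknowledgements to discard entries, can be sketched as follows. Everything here is an illustrative assumption: the class name ReplyHistorySketch, the string key made of client identifier and request identifier, and the non-idempotent operation (incrementing a counter) that stands in for a real remote operation.

```java
import java.util.HashMap;
import java.util.Map;

class ReplyHistorySketch {
    // History: (client id, request id) -> reply already sent.
    private final Map<String, Integer> history = new HashMap<>();
    private int counter = 0;      // server state changed by the operation
    private int executions = 0;   // how many times the operation really ran

    private static String key(String clientId, int requestId) {
        return clientId + "#" + requestId;
    }

    // A non-idempotent operation: running it twice gives a different result.
    private int execute() {
        executions++;
        return ++counter;
    }

    // Handles a request, re-sending the saved reply if it is a duplicate.
    int handleRequest(String clientId, int requestId) {
        String k = key(clientId, requestId);
        Integer saved = history.get(k);
        if (saved != null) {
            return saved;             // duplicate: retransmit without re-executing
        }
        int reply = execute();
        history.put(k, reply);
        return reply;
    }

    // RRA: the acknowledgement lets the server discard the history entry.
    void acknowledge(String clientId, int requestId) {
        history.remove(key(clientId, requestId));
    }

    int executions() { return executions; }

    int historySize() { return history.size(); }

    public static void main(String[] args) {
        ReplyHistorySketch server = new ReplyHistorySketch();
        System.out.println(server.handleRequest("clientA", 1));   // executes the operation
        System.out.println(server.handleRequest("clientA", 1));   // duplicate, same reply
        server.acknowledge("clientA", 1);
        System.out.println("history entries: " + server.historySize());
    }
}
```

Without acknowledgements, the history in this sketch would grow with every request, which is the memory cost mentioned above.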

Use of TCP streams to implement the request-reply protocol

The desire to avoid implementing multi-packet protocols is one of the reasons for choosing

TCP streams, which allow arguments and results of any size to be transmitted. If TCP is used, it

ensures that the messages are delivered reliably, so the request-reply protocol has no need to retransmit

messages, filter duplicates or keep histories. The overhead due to acknowledgement

messages is reduced when a reply message follows soon after a request message.

HTTP : an example of a request-reply protocol :

HTTP (Hyper Text Transfer Protocol) is a protocol that specifies the messages involved in a

request-reply exchange, the methods, arguments and results and the rules for representing them in

the messages. It supports a fixed set of methods (GET, PUT, POST, etc) that are applicable to all

of its resources. In addition to invoking methods on web resources, the protocol allows for

content negotiation and password-style authentication.

Content negotiation: Client's requests can include information as to what data

representation they can accept, enabling the server to choose the representation that is

most appropriate for the user.

Authentication : Password style authentication is provided to prove the identity of the

source

HTTP is implemented over TCP. Each client server interaction consists of the following

steps:-

1. The client requests and the server accepts a connection at the default server port or at a

port specified in the URL.

2. The client sends a request message to the server.

3. The server sends a reply message to the client.

4. The connection is closed.

However, the need to establish and close a connection for every request-reply exchange is

expensive, both in overloading the server and in sending too many messages over the network. In

order to overcome it, a later version of the protocol uses persistent connections - connections that

remain open over a series of request-reply exchanges between client and server.



Requests and replies are marshaled into messages as ASCII text strings, but resources can be

represented as byte sequences and may be compressed. Resources implemented as data are supplied

as Multipurpose Internet Mail Extension (MIME) like structures in arguments and results. MIME is a

standard for sending multipart data containing, text, images and sound in e-mail messages. Data is

prefixed with its MIME type so that the recipient will know how to handle it.

HTTP methods:

GET - requests the resources whose URL is given as argument.

HEAD - identical to GET, but it does not return any data. However it does return all the

information about the data such as the time of last modification, its type & size.

POST - specifies the URL of a resource that can deal with the data supplied with the request. The

processing carried out on the data depends on the function of the program specified in the URL.

PUT - requests that the data supplied in the request is stored with the given URL as its

identifier.

DELETE - The server deletes the resource identified by the given URL. Servers may not

always allow this operation, in which case the reply indicates failure.

OPTIONS - The server supplies the client with a list of methods it allows to be applied to

the given URL.

TRACE - used for diagnostic purposes.

Message Contents :

The Request Message specifies the name of a method, the URL of a resource, the protocol version,

some headers and an optional message body. Fig (a) below shows the contents of an HTTP Request

message whose method is GET.

A Reply message specifies the protocol version, a status code, and 'reason', some headers

and an optional message body.

HTTP request message:

method    URL or pathname          HTTP version    headers    message body
GET       http://www.yahoo.com/    HTTP/1.1

HTTP reply message:

HTTP version    status code    reason    headers    message body
HTTP/1.1        200            OK                   resource data



Status code and reason provide a report on the success or otherwise in carrying out the

request: status code is a 3 digit number for interpretation by a program and reason is a

textual phrase understood by a person.
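
The message layouts above can be illustrated with a small sketch that builds a GET request as text and picks the status code and reason out of a reply. The class name HttpMessageSketch and its methods are assumptions for illustration. Only string handling is shown, with no network access, and the request carries a Host header with the request line giving just the pathname.

```java
class HttpMessageSketch {
    // Request line (method, pathname, version), one header, then an empty line.
    static String buildGet(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }

    // Splits the status line of a reply into version, status code and reason.
    private static String[] statusLineParts(String reply) {
        String statusLine = reply.split("\r\n", 2)[0];
        return statusLine.split(" ", 3);
    }

    // The 3-digit status code, intended for interpretation by a program.
    static int statusCode(String reply) {
        return Integer.parseInt(statusLineParts(reply)[1]);
    }

    // The textual reason, intended for a person.
    static String reason(String reply) {
        return statusLineParts(reply)[2];
    }

    public static void main(String[] args) {
        System.out.print(buildGet("www.yahoo.com", "/"));
        String reply = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>";
        System.out.println(statusCode(reply) + " " + reason(reply));
    }
}
```

A program would normally branch on the numeric status code and show the reason only in messages to the user.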

2.5 GROUP COMMUNICATION

The pairwise exchange of messages is not the best model for communication from one

process to a group of other processes. A multicast operation is more appropriate, i.e., an

operation that sends a single message from one process to each of the members of a group of

processes, usually in such a way that the membership of the group is transparent to the

sender.

Multicast messages provide a useful infrastructure for constructing distributed systems

with the following characteristics:

• Fault tolerance based on replicated services

• Finding the discovery servers in spontaneous networking

• Better performance through replicated data

• Propagation of event notifications

IP multicast is built on top of the Internet protocol IP. A multicast group is specified by a

class D Internet address. The membership is dynamic, allowing computers to join or leave at any

time. The Java API provides a datagram interface to IP multicast through the class

MulticastSocket, which is a subclass of DatagramSocket with the additional capability of being

able to join multicast groups. A process can join a multicast group with a given multicast address

by invoking the joinGroup method of its MulticastSocket. A process can leave a specified group

by invoking the leaveGroup method of its MulticastSocket.

Program :

import java.net.*;
import java.io.*;

public class MulticastPeer {
    public static void main(String args[]) {
        // args[0] is the message, args[1] is the multicast address of the group
        MulticastSocket s = null;
        try {
            InetAddress group = InetAddress.getByName(args[1]);
            s = new MulticastSocket(6789);
            s.joinGroup(group);
            byte[] m = args[0].getBytes();
            DatagramPacket message = new DatagramPacket(m, m.length, group, 6789);
            s.send(message);
            byte[] buffer = new byte[1000];
            int n = 3;   // n: number of members in the group (3 is an assumed value)
            for (int i = 0; i < n; i++) {
                DatagramPacket messageIn = new DatagramPacket(buffer, buffer.length);
                s.receive(messageIn);
                System.out.println("Received: " + new String(messageIn.getData(), 0, messageIn.getLength()));
            }
            s.leaveGroup(group);
        } catch (SocketException e) {
            System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO: " + e.getMessage());
        } finally {
            if (s != null) s.close();
        }
    }
}

In the above program, the arguments to the main method specify a message to be multicast and the

multicast address of a group. After joining that multicast group, the process makes an instance of

DatagramPacket containing the message and sends it through its multicast socket to the multicast

group. After that, it attempts to receive 'n' multicast messages from its peers via its socket, which also

belongs to the group on the same port.

2.6 CASE STUDY: INTERPROCESS COMMUNICATION IN UNIX

The IPC primitives in BSD 4.x versions of UNIX are provided as system calls that are

implemented as a layer over the Internet TCP and UDP protocols. Message destinations are specified

as socket addresses - a socket address consists of an Internet address and a local port number. The

interprocess communication operations are based on the socket abstractions. Messages are queued

at the sending socket until the networking protocol has transmitted them, and until an

acknowledgement arrives, if the protocol requires one. When messages arrive they are queued at the

receiving socket until the receiving process makes an appropriate system call to receive them.

Any process can create a socket to communicate with another process. This is done by invoking

socket system call, whose arguments specify the communication domain, the type and sometimes a



particular protocol. The protocol is usually selected by the system according to whether the

communication is datagram or stream.

Datagram communication

In order to send datagrams, a socket pair is identified each time a communication is made.

This is achieved by the sending process using its local socket descriptor and the socket address of the

receiving socket each time it sends a message. Figure 1.16 illustrates the sockets used for

datagrams. ClientAddress and ServerAddress are socket addresses.

Sending a message                         Receiving a message

s = socket(AF_INET, SOCK_DGRAM, 0)        s = socket(AF_INET, SOCK_DGRAM, 0)
bind(s, ClientAddress)                    bind(s, ServerAddress)
sendto(s, "message", ServerAddress)       amount = recvfrom(s, buffer, from)

Figure: Sockets used for datagrams

Both processes use the socket call to create a socket and get a descriptor for it. The first

argument of socket specifies the communication domain as the Internet domain and the second

argument indicates that datagram communication is required. The last argument to the socket

call may be used to specify a particular protocol, but setting it to zero causes the system to

select a suitable protocol-UDP in this case.

Both processes use the bind call to bind their sockets to socket addresses. The sending

process binds its socket to a socket address referring to any available local port number. The

receiving process binds its socket to a socket address that contains its server port and must

be made known to the sender.

The sending process uses sendto call with arguments specifying the socket through which

the message is to be sent, the message itself and the socket address of the destination. The

sendto call hands the message to the underlying UDP and IP protocols and returns the actual

number of bytes sent. As datagram service is requested the message is transmitted to its

destination without an acknowledgement. If the message is too long to be sent, there is an

error return.

The receiving process uses the recvfrom call with arguments specifying the local socket

on which to receive a message and memory locations in which to store the message and



the socket address of the sending socket. The recvfrom call collects the first message in

the queue at the socket, or if the queue is empty it will wait until a message arrives.

Communication occurs only when a sendto in one process addresses its message to the

socket used by a recvfrom in another process. In client-server communication there is no need for

servers to have prior knowledge of clients' socket addresses, because the recvfrom operation

supplies the sender's address with each message it delivers. The properties of datagram

communication in UNIX are the same as those described in Section 1.6.

Stream communication

In order to use the stream protocol, two processes must first establish a connection between

their pair of sockets. The arrangement is asymmetric because one of the sockets will be

listening for a request for connection and the other will be asking for a connection. Once a pair of

sockets has been connected, they may be used for transmitting data in both or either direction.

That is, they behave like streams in that any available data is read immediately in the same order

as it was written and there is no indication of the boundary of the messages. However there is a

bounded queue at the receiving socket and the receiver blocks if the queue is empty; the sender

blocks if it is full. Figure 1.17 illustrates stream communication, in which the details of the

arguments are simplified; it does not show the server closing the socket on which it listens.

Normally a server would first listen and accept a connection and then fork a new process to

communicate with the client. Meanwhile, it will continue to listen in the original process. The

properties of stream communication in UNIX are the same as those described in Section 1.6.

The server or listening process first uses the socket operation to create a stream socket and

the bind operation to bind its socket to the server's socket address. The second argument

to the socket system call is given as SOCK_STREAM, to indicate that stream

communication is required. If the third argument is left as zero, the TCP/IP protocol will

be selected automatically. It uses the listen operation to listen on its socket for client

requests for connections. The second argument to the listen system call specifies the

maximum number of requests for connections that can be queued at this socket.

The server uses accept system call to accept a connection requested by a client and obtain

a new socket for communication with that client. The original socket may still be used to

accept further connections with other clients.

The client process uses the socket operation to create a stream socket and then uses the

connect system call to request a connection via the socket address of the listening process.



As the connect call automatically binds a socket name to the caller's socket, prior binding is

unnecessary.

After a connection has been established, both processes may then use the write and read

operations on their respective sockets to send and receive sequences of bytes via the

connection. The write operation is similar to the write operation for files. It specifies a message to be

sent to a socket. It hands the message to the underlying TCP/IP protocol and returns the actual

number of characters sent. The read operation receives some characters in its buffer and returns

the number of characters received.

Requesting a connection                   Listening and accepting a connection

s = socket(AF_INET, SOCK_STREAM, 0)       s = socket(AF_INET, SOCK_STREAM, 0)
                                          bind(s, ServerAddress);
                                          listen(s, 5);
connect(s, ServerAddress)                 sNew = accept(s, ClientAddress);
write(s, "message", length)               n = read(sNew, buffer, amount)

Figure: Sockets used for streams



DISTRIBUTED TRANSACTION PROCESSING

5.1-Transactions

Transactions protect a shared resource against simultaneous access by several

concurrent processes. In particular, transactions are used to protect shared data. They

allow a process to access and modify multiple data items as a single atomic operation. If

the process backs out halfway during the transaction, everything is restored to the point just

before the transaction started.

The goal of transactions is to ensure that all of the objects managed by a server

remain in a consistent state when they are accessed by multiple transactions and in the

presence of server crashes. A transaction is specified by the client as a set of operations

on objects to be performed as an indivisible unit by the servers managing those objects.

Operations that are free from interference from concurrent operations being performed in

other threads are called atomic operations.

In some situations, clients require a sequence of separate requests to a server to be

atomic in the sense that:

They are free from interference by operations being performed on behalf of other

concurrent clients (Isolation).

Either all of the operations must be completed successfully or they must have no

effect at all in the presence of server crashes.

This all-or-nothing effect has two further aspects of its own:

Failure atomicity: The effects are atomic even when the server crashes.

Durability: after a transaction has completed successfully all its effects are saved in

permanent storage. Data saved in a file will survive if the server process crashes.

To support the requirement for failure atomicity and durability, the objects must be

recoverable; when a server process crashes unexpectedly due to a hardware fault or a

software error, the changes due to all completed transactions must be available in

permanent storage so that when the server is replaced by a new process, it can recover



the objects to reflect the all-or-nothing effect. Each transaction is created and managed

by a coordinator, which implements the Coordinator interface shown in the Figure 4.1.

The coordinator gives each transaction an identifier, or TID.

openTransaction ( ) → trans;

Starts a new transaction and delivers a unique TID trans. These identifiers will

be used in the other operations in the transaction.

closeTransaction (trans) → (commit, abort);

ends a transaction: a commit return value indicates that the transaction has

committed; an abort return value indicates that it has aborted.

abortTransaction(trans);

Aborts the transaction.

Figure: Operations in Coordinator interface
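The three coordinator operations can be sketched in Java as follows. This is a minimal in-memory illustration, not a durable or distributed implementation: the class name SimpleCoordinator, the Outcome enum and the use of int TIDs are assumptions made for the example, and a real coordinator would also have to write its state to permanent storage.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory sketch of the Coordinator operations (hypothetical names, no persistence). */
class SimpleCoordinator {
    /** Value reported by closeTransaction. */
    enum Outcome { COMMIT, ABORT }

    private enum State { ACTIVE, COMMITTED, ABORTED }

    private final Map<Integer, State> transactions = new HashMap<>();
    private int nextTid = 1;

    /** Starts a new transaction and returns its unique TID. */
    synchronized int openTransaction() {
        int tid = nextTid++;
        transactions.put(tid, State.ACTIVE);
        return tid;
    }

    /** Ends a transaction and reports whether it committed or aborted. */
    synchronized Outcome closeTransaction(int tid) {
        State state = transactions.get(tid);
        if (state == State.ACTIVE) {
            transactions.put(tid, State.COMMITTED);
            return Outcome.COMMIT;
        }
        // Already finished (or unknown): report the recorded outcome.
        return state == State.COMMITTED ? Outcome.COMMIT : Outcome.ABORT;
    }

    /** Aborts the transaction unless it has already committed. */
    synchronized void abortTransaction(int tid) {
        if (transactions.get(tid) == State.ACTIVE) {
            transactions.put(tid, State.ABORTED);
        }
    }
}
```

A client would call openTransaction once, pass the returned TID with every operation on a recoverable object, and finish with closeTransaction or abortTransaction.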

Transactions are:

1. Atomic: To the outside world, the transaction happens indivisibly.

2. Consistent: The transaction does not violate system invariants.

3. Isolated: Concurrent transactions do not interfere with each other.

4. Durable: Once a transaction commits, the changes are permanent.

These properties are often referred to by their initial letters, ACID. A transaction is

achieved by cooperation between a client program, some recoverable objects and a

coordinator. The client specifies the sequence of invocations on recoverable objects that

are to comprise a transaction. To achieve this, the client sends with each invocation

the transaction identifier returned by openTransaction. One way to make this possible is

to include an extra argument in each operation of a recoverable object to carry the TID.

Normally, a transaction completes when the client makes a closeTransaction request. If

the transaction has progressed normally the reply states that the transaction is committed;

this constitutes an undertaking to the client that all of the changes requested in the

transaction are permanently recorded and that any future transactions that access the

same data will see the results of all of the changes made during the transaction.

Alternatively, when a transaction is aborted the parties involved (the recoverable objects

and the coordinator) must ensure that none of its effects is visible to future transactions,

either in the objects or in their copies in permanent storage.

Concurrency control:



In this section we illustrate two well- known problems of concurrent transactions in

the context of the banking example - the 'lost update' problem and the 'inconsistent

retrievals' problem. Then we discuss how both of these problems can be avoided by

using serially equivalent executions of transactions.

The lost update problem: The lost update problem is illustrated by the following

pair of transactions on bank accounts A, B and C, whose balances are $100, $200 and

$300, respectively. Transaction T transfers an amount from account A to account B.

Transaction U transfers an amount from account C to account B. In both cases, the

amount transferred is calculated to increase the balance of B by 10%. The net effects on

account B of executing the transactions T and U should be to increase the balance of

account B by 10% twice, so its final value is $242.

Let us consider the effects of allowing the transaction T and U to run concurrently, as

in Figure 4.2. Both transactions get the balance of B as $200 and then deposit $20. The result

is incorrect, increasing the balance of account B by $20 instead of $42. This is an illustration

of the 'lost update' problem. U's update is lost because T overwrites it without seeing it.

Both transactions have read the old value before either writes the new value.

Transaction T:                          Transaction U:

balance = b.getBalance();               balance = b.getBalance();
b.setBalance(balance*1.1);              b.setBalance(balance*1.1);
a.withdraw(balance/10)                  c.withdraw(balance/10)

balance = b.getBalance();      $200
                                        balance = b.getBalance();      $200
                                        b.setBalance(balance*1.1);     $220
b.setBalance(balance*1.1);     $220
a.withdraw(balance/10)         $80
                                        c.withdraw(balance/10)         $280

Figure: The lost update problem
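The arithmetic of the lost update can be replayed on plain variables. The sketch below is only an illustration under stated assumptions: the class LostUpdateDemo and its method names are invented for the example, there is no real concurrency (the interleaving is written out by hand), and integer arithmetic (balance * 11 / 10) stands in for balance*1.1 to avoid floating-point rounding.

```java
/** Replays two schedules of T and U on account B by hand (hypothetical demo class). */
class LostUpdateDemo {
    /** Interleaved schedule: both transactions read B before either writes it. */
    static long interleaved() {
        long b = 200;
        long balanceT = b;          // T: balance = b.getBalance()   reads 200
        long balanceU = b;          // U: balance = b.getBalance()   reads 200
        b = balanceU * 11 / 10;     // U: b.setBalance(balance*1.1)  writes 220
        b = balanceT * 11 / 10;     // T: b.setBalance(balance*1.1)  writes 220, overwriting U
        return b;
    }

    /** Serial schedule: T finishes its work on B before U reads it. */
    static long serial() {
        long b = 200;
        long balanceT = b;          // T reads 200
        b = balanceT * 11 / 10;     // T writes 220
        long balanceU = b;          // U reads T's result, 220
        b = balanceU * 11 / 10;     // U writes 242
        return b;
    }
}
```

Here interleaved() leaves B at 220 while serial() leaves it at 242, which is the $20 versus $42 difference described above.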

Inconsistent retrievals: The figure below shows another example related to bank accounts, in which

transaction V transfers a sum from account A to B and transaction W invokes the

branchTotal method to obtain the sum of the balances of all the accounts in the bank. The balances



of the two bank accounts, A and B, are both initially $200. The result of branchTotal includes the

sum of A and B as $300, which is wrong. This is an illustration of the 'inconsistent retrievals'

problem. W's retrievals are inconsistent because V has performed only the withdrawal

part of a transfer at the time the sum is calculated.

Transaction V:                          Transaction W:

a.withdraw(100)                         aBranch.branchTotal()
b.deposit(100)

a.withdraw(100);               $100
                                        total = a.getBalance()             $100
                                        total = total + b.getBalance()     $300
                                        total = total + c.getBalance()
b.deposit(100)                 $300
                                        ...

Figure: The inconsistent retrievals problem

Serial equivalence:

An interleaving of the operations of transactions in which the combined effect is the

same as if the transactions had been performed one at a time in some order is a serially

equivalent interleaving. The use of serial equivalence as a criterion for correct concurrent

execution prevents the occurrence of lost updates and inconsistent retrievals.

The lost update problem occurs when two transactions read the old value of a

variable and then use it to calculate the new value. This cannot happen if one transaction

is performed before the other, because the later transaction will read the value written by

the earlier one. As a serially equivalent interleaving of two transactions produces the

same effect as a serial one, we can solve the lost update problem by means of serial

equivalence. Figure 4.4 shows one such interleaving in which the operations that affect

the shared account, B, are actually serial, for transaction T does all its operations on B

before transaction U does. Another interleaving of T and U that has this property is one

in which transaction U completes its operations on account B before transaction T starts.

The inconsistent retrievals problem can occur when a retrieval transaction runs

concurrently with an update transaction. It cannot occur if the retrieval transaction is

performed before or after the update transaction. A serially equivalent interleaving of a



retrieval transaction and an update transaction, for example as in Figure 4.5, will prevent

inconsistent retrievals occurring.

Transaction T:                          Transaction U:

balance = b.getBalance()                balance = b.getBalance()
b.setBalance(balance*1.1)               b.setBalance(balance*1.1)
a.withdraw(balance/10)                  c.withdraw(balance/10)

balance = b.getBalance()       $200
b.setBalance(balance*1.1)      $220
                                        balance = b.getBalance()       $220
                                        b.setBalance(balance*1.1)      $242
a.withdraw(balance/10)         $80
                                        c.withdraw(balance/10)         $278

Figure: A serially equivalent interleaving of T and U

Transaction V:                          Transaction W:

a.withdraw(100);                        aBranch.branchTotal()
b.deposit(100)

a.withdraw(100);               $100
b.deposit(100)                 $300
                                        total = a.getBalance()             $100
                                        total = total + b.getBalance()     $400
                                        total = total + c.getBalance()
                                        ...

Figure: A serially equivalent interleaving of V and W

Conflicting operations: When we say that a pair of operations conflicts we mean that

their combined effect depends on the order in which they are executed. Consider a pair

of operations read and write. Read accesses the value of an object and write changes its

value. The effect of an operation refers to the value of an object set by a write operation



and the result returned by a read operation. The conflict rules for read and write

operations are given in the figure below. Serial equivalence can be defined in terms of operation

conflicts as follows:

For two transactions to be serially equivalent it is necessary and sufficient that all

pairs of conflicting operations of the two transactions be executed in the same order at all

of the objects they both access.

Operations of different    Conflict    Reason
transactions

read     read              No          Because the effect of a pair of read operations does
                                       not depend on the order in which they are executed

read     write             Yes         Because the effect of a read and write operation
                                       depends on the order of their execution

write    write             Yes         Because the effect of a pair of write operations
                                       depends on the order of their execution

Figure: Read and write operation conflict rules
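The conflict rules and the definition of serial equivalence for two transactions can be turned into a small check. This is a sketch under stated assumptions: the class ConflictCheck, its nested Op type and the encoding of operations as 'r' or 'w' on a named object are all invented for the example, and the check only covers a schedule of exactly two transactions.

```java
import java.util.List;

/** Checks the conflict rules for read/write operations (hypothetical helper class). */
class ConflictCheck {
    /** One read ('r') or write ('w') by a transaction on a named object. */
    static final class Op {
        final String tx;
        final char kind;
        final String object;

        Op(String tx, char kind, String object) {
            this.tx = tx;
            this.kind = kind;
            this.object = object;
        }
    }

    /** Operations conflict if they belong to different transactions, use the same object, and at least one writes. */
    static boolean conflicts(Op a, Op b) {
        return !a.tx.equals(b.tx)
                && a.object.equals(b.object)
                && (a.kind == 'w' || b.kind == 'w');
    }

    /** True if every conflicting pair of operations of t and u occurs in the same order in the schedule. */
    static boolean seriallyEquivalent(List<Op> schedule, String t, String u) {
        boolean tBeforeU = false;
        boolean uBeforeT = false;
        for (int i = 0; i < schedule.size(); i++) {
            for (int j = i + 1; j < schedule.size(); j++) {
                Op first = schedule.get(i);
                Op second = schedule.get(j);
                if (!conflicts(first, second)) {
                    continue;
                }
                if (first.tx.equals(t) && second.tx.equals(u)) tBeforeU = true;
                if (first.tx.equals(u) && second.tx.equals(t)) uBeforeT = true;
            }
        }
        // Both directions present means the conflicts are not ordered consistently.
        return !(tBeforeU && uBeforeT);
    }
}
```

Applied to the lost update schedule (T reads B, U reads B, U writes B, T writes B), the check fails because T's read precedes U's write while U's write precedes T's write.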

Recoverability from aborts:

Servers must record the effects of all committed transactions and none of the effects

of aborted transactions. They must therefore allow for the fact that a transaction may

abort by preventing it affecting other concurrent transactions if it does so. In this section

we illustrate two problems that are associated with aborting transactions in the context of

the banking example. These problems are called 'dirty reads' and 'premature writes', and

both of them can occur in the presence of serially equivalent executions of transactions.

Dirty reads: The 'dirty read' problem is caused by the interaction between a read

operation in one transaction and an earlier write operation in another transaction on the

same object. Consider the executions illustrated in Figure 4.7 in which T gets the balance

of account A and sets it to $10 more, then U gets the balance of account A and sets it to

$20 more, and the two executions are serially equivalent. Now suppose that the

transaction T aborts after U has committed. Then the transaction U will have seen a value

that never existed, since A will be restored to its original value. We say that the

transaction U has performed a dirty read. As it has committed, it cannot be undone.



Recoverability of transaction: The strategy for recoverability is to delay commits until

after the commitment of any other transaction whose uncommitted state has been observed.

In our example, U delays its commit until after T commits. In the case that T aborts, then

U must abort as well.

Cascading aborts: The aborting of one transaction may cause still further transactions to be

aborted. Such situations are called cascading aborts.

Transaction T:                          Transaction U:

a.getBalance()                          a.getBalance()
a.setBalance(balance + 10)              a.setBalance(balance + 20)

balance = a.getBalance()       $100
a.setBalance(balance + 10)     $110
                                        balance = a.getBalance()       $110
                                        a.setBalance(balance + 20)     $130
                                        commit transaction
abort transaction

Figure: A dirty read when transaction T aborts

Premature writes: This one is related to the interaction between write operations on the

same object belonging to different transactions. For an illustration, we consider two

setBalance transactions T and U on account A, as shown in Figure 4.8. Before the

transactions, the balance of account A was $100. The two executions are serially

equivalent, with T setting the balance to $105 and U setting it to $110. If the transaction U

aborts and T commits, the balance should be $105. Some database systems implement the

action of abort by restoring 'before images' of all the writes of a transaction. In our

example, A is $100 initially, which is the 'before image' of T's write; similarly $105 is the

'before image' of U's write. Thus if U aborts, we get the correct balance of $105.

Now consider the case when U commits and then T aborts. The balance should be

$110, but as the 'before image' of T's write is $100, we get the wrong balance of $100.

Similarly if T aborts and then U aborts, the 'before image' of U's write is $105 and we get the

wrong balance of $105 - the balance should revert to $100. To ensure correct results in a

recovery scheme that uses before images, write operations must be delayed until earlier

transactions that updated the same objects have either committed or aborted.



Transaction T:                          Transaction U:

a.setBalance(105)                       a.setBalance(110)

                               $100
a.setBalance(105)              $105
                                        a.setBalance(110)              $110

Figure: Overwriting uncommitted values
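The three abort orders discussed for premature writes can be replayed with a tiny before-image model. This is a sketch, assuming a single account whose abort is implemented naively by restoring the before image of the aborting transaction's write; the class BeforeImageDemo and its method names are invented for the example.

```java
/** Models abort by restoring 'before images' on one account (hypothetical demo class). */
class BeforeImageDemo {
    private long balance = 100;     // account A before either transaction runs

    /** Performs a write and returns the before image the writer keeps in case it aborts. */
    long setBalance(long newValue) {
        long beforeImage = balance;
        balance = newValue;
        return beforeImage;
    }

    /** Naive abort: put back the before image of the aborting transaction's write. */
    void abort(long beforeImage) {
        balance = beforeImage;
    }

    long getBalance() {
        return balance;
    }

    /** U aborts and T commits: restoring 105 is correct. */
    static long uAbortsThenTCommits() {
        BeforeImageDemo a = new BeforeImageDemo();
        a.setBalance(105);                  // T writes; T commits later
        long beforeU = a.setBalance(110);   // U writes; before image is 105
        a.abort(beforeU);
        return a.getBalance();
    }

    /** U commits and then T aborts: restoring 100 wipes out U's committed write. */
    static long uCommitsThenTAborts() {
        BeforeImageDemo a = new BeforeImageDemo();
        long beforeT = a.setBalance(105);   // T writes; before image is 100
        a.setBalance(110);                  // U writes and commits
        a.abort(beforeT);
        return a.getBalance();
    }

    /** T aborts and then U aborts: the final restore uses U's stale before image. */
    static long tAbortsThenUAborts() {
        BeforeImageDemo a = new BeforeImageDemo();
        long beforeT = a.setBalance(105);
        long beforeU = a.setBalance(110);
        a.abort(beforeT);                   // balance back to 100
        a.abort(beforeU);                   // balance set to 105, which is wrong
        return a.getBalance();
    }
}
```

The last two methods return 100 and 105 where 110 and 100 are expected, which is why writes are delayed until earlier writers of the same object have committed or aborted.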

Strict executions of transactions: Generally it is required that transactions delay both their

read and write operations so as to avoid 'dirty reads' and 'premature writes'. The executions

of transactions are called strict if the service delays both read and write operations on an

object until all transactions that previously wrote that object have either committed or

aborted. The strict execution of transactions enforces the desired property of isolation.

5.2 NESTED TRANSACTIONS

A nested transaction is constructed from a number of subtransactions. Transactions

other than the top-level transaction are called subtransactions. Since transactions can

be nested arbitrarily deeply, considerable administration is needed to get everything right.

Several transactions may be started from within a transaction, allowing transactions

to be regarded as modules that can be composed as required. The outermost transaction

in a set of nested transactions is called the top-level transaction. For example in Figure

4.9, T is a top-level transaction, which starts a pair of subtransactions T1 and T2. The

subtransaction T1 starts its own pair of subtransactions T11 and T12. Also, subtransaction T2

starts its own subtransaction T21, which starts another subtransaction T211. A

subtransaction appears atomic to its parent with respect to transaction failures and to

concurrent access.

Nested transactions have the following main advantages:

Sub transactions at one level may run concurrently with other sub

transactions at the same level in the hierarchy. This can allow additional

concurrency in a transaction. When sub transactions run in different

servers, they can work in parallel.

Subtransactions can commit or abort independently.

The rules for commitment of nested transactions are rather subtle:

A transaction may commit or abort only after its child transactions

have completed.



When a subtransaction completes it makes an independent decision either to

commit provisionally or to abort. Its decision to abort is final.

When a parent aborts, all of its sub transactions are aborted. For example, if T2

aborts then T21 and T211 must also abort, even though they may have provisionally

committed.

When a subtransaction aborts the parent can decide whether to abort or not. In our

example, T decides to commit although T2 has aborted.

If the top level transaction commits, then all of the sub transactions that have

provisionally committed can commit too, provided that none of their ancestors

has aborted. In our example, T's commitment allows T1, T11 and T12 to commit, but

not T21 and T211, since their parent T2 aborted. Note that the effects of a

subtransaction are not permanent until the top-level transaction commits.
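The commitment rules can be illustrated with a small tree of transactions. This is a sketch under stated assumptions: the class NestedTx and its methods are invented for the example, each transaction only tracks a status, and the rule that a parent finishes only after its children have completed is left to the caller rather than enforced.

```java
import java.util.ArrayList;
import java.util.List;

/** A node in a tree of nested transactions (hypothetical class, status tracking only). */
class NestedTx {
    enum Status { ACTIVE, PROVISIONAL, ABORTED, COMMITTED }

    final String name;
    final NestedTx parent;
    final List<NestedTx> children = new ArrayList<>();
    Status status = Status.ACTIVE;

    NestedTx(String name, NestedTx parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    /** A subtransaction's own decision to commit is only provisional. */
    void provisionalCommit() {
        if (status == Status.ACTIVE) {
            status = Status.PROVISIONAL;
        }
    }

    /** Abort is final and also aborts all descendants, even provisionally committed ones. */
    void abort() {
        status = Status.ABORTED;
        for (NestedTx child : children) {
            child.abort();
        }
    }

    /** Commits the top-level transaction and every provisionally committed descendant without an aborted ancestor. */
    void commitTopLevel() {
        if (parent != null) {
            throw new IllegalStateException("only the top-level transaction can commit finally");
        }
        status = Status.COMMITTED;
        for (NestedTx child : children) {
            child.commitBelow();
        }
    }

    private void commitBelow() {
        if (status != Status.PROVISIONAL) {
            return;     // aborted subtrees are skipped entirely
        }
        status = Status.COMMITTED;
        for (NestedTx child : children) {
            child.commitBelow();
        }
    }
}
```

With the tree T, T1, T2, T11, T12, T21, T211 from the example, aborting T2 and then committing T leaves T1, T11 and T12 committed and T2, T21 and T211 aborted.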

5.3 LOCKS

When a process needs to read or write a data item as a part of a transaction, it

requests the scheduler to grant it a lock for that data item. Likewise, when a data item is no

longer needed, the scheduler is requested to release the lock. The task of the scheduler is to

grant and release locks in such a way that only valid schedules result. In other words, it needs

to apply an algorithm that provides only serializable schedules. One such algorithm is

two-phase locking.

Two-phase locking

In two-phase locking, the scheduler first acquires all the locks it needs during the

growing phase, and then releases them during the shrinking phase.

Transactions must be scheduled so that their effect on shared data is serially

equivalent. A server can achieve serial equivalence of transactions by serializing access to

the objects. A simple example of a serializing mechanism is the use of exclusive locks. In

this locking scheme the server attempts to lock any object that is about to be used by any

operation of a client's transaction. If a client requests access to an object that is already

locked due to another client's transaction, the request is suspended and the client must wait

until the object is unlocked.

In the example given in Figure 4.10, it is assumed that when transactions T and U

start the balances of the accounts A,B and C are not yet locked. When transaction T is about



to use account B, it is locked for T. Subsequently, when transaction U is about to use B it

is still locked for T and transaction U waits. When transaction T is committed, B is

unlocked, whereupon transaction U is resumed. The use of the lock on B effectively

serializes the access to B. Note that if, for example, T had released the lock on B

between its getBalance and setBalance operations, transaction U's getBalance operation

on B could be interleaved between them.

Transaction T:                              Transaction U:

bal = b.getBalance()                        bal = b.getBalance()
b.setBalance(bal*1.1)                       b.setBalance(bal*1.1)
a.withdraw(bal/10)                          c.withdraw(bal/10)

Operations               Locks              Operations               Locks

openTransaction
bal = b.getBalance()     lock B
b.setBalance(bal*1.1)                       openTransaction
a.withdraw(bal/10)       lock A             bal = b.getBalance()     waits for T's
                                                                     lock on B
closeTransaction         unlock A, B        ...
                                                                     lock B
                                            b.setBalance(bal*1.1)
                                            c.withdraw(bal/10)       lock C
                                            closeTransaction         unlock B, C

Figure: Transactions T and U with exclusive locks

To ensure serial equivalence, a transaction is not allowed any new locks after it has released a lock. The first phase of

each transaction is a 'growing phase', during which new locks are acquired. In the

second phase, the locks are released (a 'shrinking phase'). This is called two-phase

locking. Under a strict execution regime, a transaction that needs to read or write an

object must be delayed until other transactions that wrote the same object have committed

or aborted. To enforce this rule, any locks applied during the progress of a transaction

are held until the transaction commits

or aborts. This is called strict two-phase locking. The presence of the locks prevents

other transactions reading or writing the objects.

It is preferable to adopt a locking scheme that controls the access to each object



so that there can be several concurrent transactions reading an object, or a single

transaction writing an object, but not both. This is commonly referred to as a 'many

readers/single writers' scheme. Two types of locks are used: read locks and write locks.

Before a transaction's read operation is performed, a read lock should be set on the

object. Before a transaction's write operation is performed, a write lock should be set on

the object. Whenever it is impossible to set a lock immediately, the transaction (and the

client) must wait until it is possible to do so - a client's request is never rejected. As

pairs of read operations from different transactions do not conflict, an attempt to set a

read lock on an object with a read lock is always successful. All the transactions

reading the same object share its read lock - for this reason, read locks are sometimes

called shared locks.

The operation conflict rules tell us that:

If a transaction T has already performed a read operation on a particular object,

then a concurrent transaction U must not write that object until T commits or

aborts

If a transaction T has already performed a write operation on a particular object,

then a concurrent transaction U must not read or write that object until T

commits or aborts.

The figure below shows the compatibility of read locks and write locks on any particular

object. The entries in the first column in the table show the type of lock already set,

if any. The entries in the first row show the type of lock requested. The entry in each

cell shows the effect on a transaction that requests the type of lock given above

when the object has been locked in another transaction with the type of lock on the

left.

For one object              Lock requested
                            read        write

Lock already set   none     OK          OK
                   read     OK          wait
                   write    wait        wait

Figure: Lock compatibility

The rules for the use of locks in a strict two-phase locking implementation are

summarized below:



1. When an operation accesses an object within a transaction:

(a) If the object is not already locked, it is locked and the operation proceeds.

(b) If the object has a conflicting lock set by another transaction, the transaction

must wait until it is unlocked.

(c) If the object has a non-conflicting lock set by another transaction, the lock is

shared and the operation proceeds.

(d) If the object has already been locked in the same transaction, the lock will be

promoted if necessary and the operation proceeds. (Where promotion is prevented by

a conflicting lock, rule (b) is used).

2. When a transaction is committed or aborted, the server unlocks all objects it locked

for the transaction.
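Rules 1(a) to 1(d) and rule 2 can be sketched as a small lock table. This is an illustration under stated assumptions, not a full lock manager: the class LockTable and its methods are invented for the example, and instead of suspending the caller the request method simply returns false wherever the rules say the transaction must wait.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Non-blocking sketch of the strict two-phase locking rules (hypothetical class). */
class LockTable {
    enum Mode { READ, WRITE }

    private static final class Lock {
        Mode mode;
        final Set<String> holders = new HashSet<>();
    }

    private final Map<String, Lock> locks = new HashMap<>();

    /** Returns true if the lock is granted, false if the transaction would have to wait. */
    synchronized boolean request(String tx, String object, Mode wanted) {
        Lock lock = locks.get(object);
        if (lock == null) {                                 // rule 1(a): not yet locked
            lock = new Lock();
            lock.mode = wanted;
            lock.holders.add(tx);
            locks.put(object, lock);
            return true;
        }
        if (lock.holders.contains(tx)) {                    // rule 1(d): already locked by tx
            if (wanted == Mode.WRITE && lock.mode == Mode.READ) {
                if (lock.holders.size() > 1) {
                    return false;                           // promotion blocked: rule 1(b)
                }
                lock.mode = Mode.WRITE;                     // promote read lock to write lock
            }
            return true;
        }
        if (lock.mode == Mode.READ && wanted == Mode.READ) {
            lock.holders.add(tx);                           // rule 1(c): share the read lock
            return true;
        }
        return false;                                       // rule 1(b): conflicting lock
    }

    /** Rule 2: on commit or abort, release every lock held for the transaction. */
    synchronized void releaseAll(String tx) {
        Iterator<Lock> it = locks.values().iterator();
        while (it.hasNext()) {
            Lock lock = it.next();
            lock.holders.remove(tx);
            if (lock.holders.isEmpty()) {
                it.remove();
            }
        }
    }
}
```

Because locks are only released in releaseAll, which is meant to be called at commit or abort, a transaction using this table never acquires a lock after releasing one.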

Locking rules for nested transactions: The aim of a locking scheme for nested

transactions is to serialize access to objects so that:

1. Each set of nested transactions is a single entity that must be prevented

from observing the partial effects of any other set of nested transactions.

2. Each transaction within a set of nested transactions must be prevented

from observing the partial effects of the other transactions in the set.

The first rule is enforced by arranging that every lock that is acquired by a successful

subtransaction is inherited by its parent when it completes. Inherited locks are also

inherited by ancestors. Note that this form of inheritance passes from child to parent.

The second rule is enforced as follows:

Parent transactions are not allowed to run concurrently with their child

transactions. If a parent transaction has a lock on an object, it retains the lock

during the time that its child transaction is executing. This means that the child

transaction temporarily acquires the lock from its parent for its duration.

Sub transactions at the same level are allowed to run concurrently, so when they

access the same objects, the locking scheme must serialize their access.

The following rules describe lock acquisition and release:

For a sub transaction to acquire a read lock on an object, no other active

transaction can have a write lock on that object, and the only retainers of a write

lock are its ancestors.

For a subtransaction to acquire a write lock on an object, no other active



transaction can have a read or write lock on that object, and the only retainers of

read and write locks on that object are its ancestors.

When a subtransaction commits, its locks are inherited by its parent, allowing the

parent to retain the locks in the same mode as the child.

When a subtransaction aborts, its locks are discarded. If the parent already retains

the locks it can continue to do so.

Deadlocks

The use of locks can lead to deadlock. Consider the use of locks shown in the figure

below. Each of the transactions T and U acquires a lock on one account and then gets blocked when it tries

to access the account that the other one has locked. This is a deadlock situation - two

transactions are waiting, and each is dependent on the other to release a lock so it can

resume.

Transaction T                               Transaction U

Operations           Locks                  Operations           Locks

a.deposit(100)       write lock A
                                            b.deposit(200)       write lock B
b.withdraw(100)      waits for U's
                     lock on B
                                            a.withdraw(200)      waits for T's
                                                                 lock on A
...                                         ...

Figure: Deadlock with write locks

Definition of deadlock: Deadlock is a state in which each member of a group of

transactions is waiting for some other member to release a lock. A wait-for graph can be

used to represent the waiting relationship between current transactions. In a wait-for

graph the nodes represent transactions and the edges represent wait-for relationships

between transactions.

Deadlock prevention: An apparently simple but not very good way to overcome deadlock

is to lock all of the objects used by a transaction when it starts. This would need to be

done as a single atomic step so as to avoid deadlock at this stage. Such a transaction

cannot run into deadlock with other transactions, but it unnecessarily restricts access to

shared resources. In addition, it is sometimes impossible to predict at the start of a



transaction which objects will be used.

Deadlock detection: Deadlocks may be detected by finding cycles in the wait-for graph.

Having detected a deadlock, a transaction must be selected for abortion to break the

cycle. The choice of the transaction to abort is not simple. Some factors that may be taken

into account for aborting a transaction are the age of the transaction and the number of

cycles it is involved in.
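Cycle detection in a wait-for graph can be sketched with a depth-first search. The class WaitForGraph and its methods are assumptions made for this example; the sketch only reports whether a deadlock exists and leaves the choice of which transaction to abort to the caller.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Wait-for graph with depth-first cycle detection (hypothetical class). */
class WaitForGraph {
    // Edge waiter -> holder means "waiter waits for holder to release a lock".
    private final Map<String, Set<String>> waitsFor = new HashMap<>();

    void addEdge(String waiter, String holder) {
        waitsFor.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    /** Removes a transaction, for example after it has been aborted to break a cycle. */
    void removeTransaction(String tx) {
        waitsFor.remove(tx);
        for (Set<String> holders : waitsFor.values()) {
            holders.remove(tx);
        }
    }

    /** Returns true if the graph contains a cycle, that is, a deadlock. */
    boolean hasDeadlock() {
        Set<String> finished = new HashSet<>();
        for (String start : waitsFor.keySet()) {
            if (onCycle(start, new HashSet<>(), finished)) {
                return true;
            }
        }
        return false;
    }

    private boolean onCycle(String tx, Set<String> path, Set<String> finished) {
        if (path.contains(tx)) {
            return true;            // reached a transaction already on the current path
        }
        if (finished.contains(tx)) {
            return false;           // already explored without finding a cycle
        }
        path.add(tx);
        for (String next : waitsFor.getOrDefault(tx, Collections.<String>emptySet())) {
            if (onCycle(next, path, finished)) {
                return true;
            }
        }
        path.remove(tx);
        finished.add(tx);
        return false;
    }
}
```

For the two-transaction example, adding the edges T waits for U and U waits for T makes hasDeadlock() return true; removing either transaction breaks the cycle.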

Increasing concurrency in locking schemes

Even when locking rules are based on the conflicts between read and write

operations and the granularity at which they are applied is as small as possible, there is

scope for increasing concurrency. In the first approach (two-version locking), the setting

of exclusive locks is delayed until a transaction commits. In the second approach (hierarchic locks), mixed-granularity locks are used.

Two-version locking: This is an optimistic scheme that allows one transaction to

write tentative versions of objects while other transactions read from the committed

version of the same objects. Read operations only wait if another transaction is currently

committing the same object. Transactions cannot commit their write operations

immediately if other uncompleted transactions have read the same objects. Therefore,

transactions that request to commit in such a situation are made to wait until the reading

transactions have completed. Deadlock may occur when transactions are waiting to

commit. Therefore transactions may need to be aborted when they are waiting to

commit, to resolve deadlocks.

This variation on strict two-phase locking uses three types of lock: a read lock, a

write lock and a commit lock. Before a transaction's read operation is performed, a read

lock must be set on the object - the attempt to set a read lock is successful unless the

object has a commit lock, in which case the transaction waits. Before a transaction's

write operation is performed, a write lock must be set on the object - the attempt to set a

write lock is successful unless the object has a write lock or a commit lock, in which case

the transaction waits.

When the transaction coordinator receives a request to commit a transaction, it

attempts to convert all that transaction's write locks to commit locks. If any of the objects

have outstanding read locks, the transaction must wait until the transactions that set these

locks have completed and the locks are released. The compatibility of read, write and



commit locks is shown in the figure below.

For one object              Lock to be set
                            read        write       commit

Lock already set   none     OK          OK          OK
                   read     OK          OK          wait
                   write    OK          wait
                   commit   wait        wait

Hierarchic Locks: At each level, the setting of a parent lock has the same effect as

setting all the equivalent child locks. This economizes on the number of locks to be set.

In the banking example which we consider, the branch is the parent and the accounts are

its children. In a diary organised into weeks, days and time slots, the operation to view a week would cause a read lock to be set at the top of this

hierarchy, whereas the operation to enter an appointment would cause a write lock to be

set on a time slot. The effect of a read lock on a week would be to prevent write

operations on any of the substructures, for example the time slots for each day in that

week.

Each node in the hierarchy can be locked - giving the owner of the lock explicit

access to the node and implicit access to its children. Before a child node is granted a

read/write lock, an intention to read/write lock is set on the parent node and its

ancestors, if any. An intention lock is compatible with other intention locks but

conflicts with read and write locks according to the usual rules. Figure 4.17 gives the

compatibility table for hierarchic locks. Hierarchic locks have the advantage of

reducing the number of locks when mixed-granularity locking is required. The

compatibility tables and the rules for promoting locks are more complex.

For one object                Lock to be set
                              read      write     I-read    I-write

Lock already set   none       OK        OK        OK        OK
                   read       OK        wait      OK        wait
                   write      wait      wait      wait      wait
                   I-read     OK        wait      OK        OK
                   I-write    wait      wait      OK        OK

Figure: Lock compatibility table for hierarchic locks
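The hierarchic lock compatibility table can be encoded as a lookup matrix. This is a sketch: the class HierarchicLocks, the Mode enum and the helper intentionFor are names invented for the example, and the matrix covers only the rows for locks that are already set (the "none" row is always OK).

```java
/** Compatibility lookup for read, write and intention locks (hypothetical class). */
class HierarchicLocks {
    enum Mode { READ, WRITE, I_READ, I_WRITE }

    // COMPATIBLE[held][requested]; rows and columns follow the order of Mode.
    private static final boolean[][] COMPATIBLE = {
        /* held READ    */ { true,  false, true,  false },
        /* held WRITE   */ { false, false, false, false },
        /* held I_READ  */ { true,  false, true,  true  },
        /* held I_WRITE */ { false, false, true,  true  },
    };

    /** True if the requested lock can be set while another transaction holds the given lock. */
    static boolean compatible(Mode held, Mode requested) {
        return COMPATIBLE[held.ordinal()][requested.ordinal()];
    }

    /** Intention lock to set on the ancestors before locking a child node in the given mode. */
    static Mode intentionFor(Mode childLock) {
        switch (childLock) {
            case READ:
                return Mode.I_READ;
            case WRITE:
                return Mode.I_WRITE;
            default:
                return childLock;   // intention modes map to themselves
        }
    }
}
```

For instance, a read lock held on a week makes compatible(READ, intentionFor(WRITE)) false, so a transaction that wants to write a time slot in that week has to wait.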

5.4 OPTIMISTIC CONCURRENCY CONTROL

The drawbacks of locking are:

* Lock maintenance represents an overhead

* The use of locks can result in deadlock

* To avoid cascading aborts, locks cannot be released until the end of the

transaction. This may reduce significantly the potential for concurrency.

The alternative approach proposed by Kung and Robinson is 'optimistic' because it is

based on the observation that, in most applications, the likelihood of two clients'

transactions accessing the same object is low. Transactions are allowed to proceed as

though there were no possibility of conflict with other transactions until the client

completes its task and issues a closeTransaction request. When a conflict arises, some

transaction is generally aborted and will need to be restarted by the client. Each

transaction has the following phases:

Working Phase: during this phase, each transaction has a tentative version of

each of the objects that it updates. This is a copy of the most recently committed

version of the object. Read operations are performed immediately- if a tentative

version for that transaction already exists, a read operation accesses it; otherwise

it accesses the most recently committed value of the object. Write operations record

the new values of the objects as tentative values (which are invisible to other

transactions). When there are several concurrent transactions, several different tentative

values of the same object may coexist. In addition, two records are kept of the objects

accessed within a transaction: a read set containing the objects read by the transaction, and

a write set containing the objects written by the transaction.

Validation phase: When the closeTransaction request is received, the transaction is

validated to establish whether or not its operations on objects conflict with operations of



other transactions on the same objects. If the validation is successful then the transaction

can commit. If the validation fails, then some form of conflict resolution must be used

and either the current transaction, or in some cases those with which it conflicts, will

need to be aborted.

Update phase: If a transaction is validated, all of the changes recorded in its tentative

versions are made permanent.

Validation of transactions: To assist in performing validation each transaction is

assigned a transaction number when it enters the validation phase. Transaction numbers

are integers assigned in ascending sequence; the number of a transaction therefore

defines its position in time - a transaction always finishes its working phase after all

transactions with lower numbers. That is, a transaction with the number Ti always

precedes a transaction with the number Tj if i < j. The validation test on transaction Tv is

based on conflicts between operations in pairs of transactions Ti and Tv. For a transaction Tv

to be serializable with respect to an overlapping transaction Ti, their operations must

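conform to the following rules:

Tv        Ti        Rule

write     read      1. Ti must not read objects written by Tv.

read      write     2. Tv must not read objects written by Ti.

write     write     3. Ti must not write objects written by Tv and Tv must not write objects written by Ti.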

To prevent overlapping, the entire validation and update phases can be implemented

as a critical section so that only one client at a time can execute it.

Backward validation: As all the read operations of earlier overlapping transactions were

performed before the validation of Tv started, they cannot be affected by the writes of the

current transaction. The validation of transaction Tv checks whether its read set overlaps with any

of the write sets of earlier overlapping transactions Ti. If there is any overlap, the

validation fails.

Let start Tn be the biggest transaction number assigned at the time when transaction

1 started its working phase and finish Tn be the biggest transaction number assigned at the

when Tv entered the validation phase. The following program describes the algorithm 1

validation of Tv :

boolean valid = true;
for (int Ti = startTn + 1; Ti <= finishTn; Ti++) {
    if (read set of Tv intersects write set of Ti) valid = false;
}
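The backward validation loop above can be sketched in Python. This is only an illustration under stated assumptions: the Txn record, the dictionary of committed transactions and the object names are hypothetical and do not come from the notes.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    """Hypothetical record of one transaction's read and write sets."""
    number: int
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def backward_validate(tv, committed, start_tn, finish_tn):
    """Return True if Tv's read set avoids every overlapping write set."""
    for ti in range(start_tn + 1, finish_tn + 1):
        if tv.read_set & committed[ti].write_set:
            return False  # Tv read an object that an overlapping Ti wrote
    return True


# T1 committed before Tv started; T2 and T3 overlapped with Tv.
committed = {
    1: Txn(1, write_set={"a"}),
    2: Txn(2, write_set={"b"}),
    3: Txn(3, write_set={"c"}),
}
tv_ok = Txn(4, read_set={"a", "d"})   # only overlaps T1's writes, which is allowed
tv_bad = Txn(4, read_set={"c"})       # overlaps T3's write set
result_ok = backward_validate(tv_ok, committed, start_tn=1, finish_tn=3)
result_bad = backward_validate(tv_bad, committed, start_tn=1, finish_tn=3)
```

Only the write sets of committed transactions need to be retained for this check, and only until no unvalidated transaction overlaps with them.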

Figure 4.18 shows overlapping transactions that might be considered in the
validation of a transaction Tv. Time increases from left to right. The earlier committed
transactions are T1, T2 and T3. T1 committed before Tv started. T2 and T3 committed
before Tv finished its working phase. startTn + 1 = T2 and finishTn = T3. In backward
validation, the read set of Tv must be compared with the write sets of T2 and T3. In
backward validation, the read set of the transaction being validated is compared with the
write sets of other transactions that have already committed. Therefore, the only way to
resolve any conflicts is to abort the transaction that is undergoing validation.

Forward validation: In forward validation of the transaction Tv, the write set of Tv is
compared with the read sets of all overlapping active transactions, those that are still
in their working phase (rule 1). Rule 2 is automatically fulfilled because the active
transactions do not write until after Tv has completed. Let the active transactions have
(consecutive) transaction identifiers active1 to activeN. The following program
describes the algorithm for the forward validation of Tv:

boolean valid = true;
for (int Tid = active1; Tid <= activeN; Tid++) {
    if (write set of Tv intersects read set of Tid) valid = false;
}
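A minimal Python sketch of the forward validation test follows. The function name, the dictionary of active read sets and the object names are assumptions made for illustration. Instead of a single boolean, the sketch returns the conflicting identifiers, since forward validation leaves a choice about which transaction to abort.

```python
def forward_validate(tv_write_set, active_read_sets):
    """Return ids of active transactions whose reads conflict with Tv's writes."""
    return [tid for tid, reads in active_read_sets.items()
            if tv_write_set & reads]


# Read sets gathered so far by two transactions still in their working phase.
active = {"active1": {"a", "b"}, "active2": {"c"}}
conflicts = forward_validate({"b", "x"}, active)   # Tv wrote b, which active1 read
no_conflicts = forward_validate({"x"}, active)     # nothing Tv wrote has been read
```

An empty result means Tv is valid. A non-empty result lets the caller abort Tv, abort the listed transactions, or retry validation later.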

In Figure 4.18, the write set of transaction Tv must be compared with the read
sets of the transactions with identifiers active1 and active2. As the transactions being
compared with the validating transaction are still active, we have a choice of whether
to abort the validating transaction or to take some alternative way of resolving the
conflict.

5.5 TIME STAMP ORDERING

In concurrency control schemes based on timestamp ordering, each operation in a
transaction is validated when it is carried out. If the operation cannot be validated, the
transaction is aborted immediately and can then be restarted by the client. Each
transaction is assigned a unique timestamp value when it starts. The basic
timestamp ordering rule is based on operation conflicts and is very simple:

A transaction's request to write an object is valid only if that object was last
read and written by earlier transactions. A transaction's request to read an
object is valid only if that object was last written by an earlier transaction.

In timestamp ordering, each request by a transaction for a read or write
operation on an object is checked to see whether it conforms to the operation
conflict rules. A request by the current transaction Tc can conflict with previous
operations done by other transactions, Ti, whose timestamps indicate that they should
be later than Tc. These rules are shown in Figure 4.19, in which Ti > Tc means Ti is
later than Tc and Ti < Tc means Ti is earlier than Tc.

Rule  Tc     Ti
1.    write  read    Tc must not write an object that has been read by any Ti
                     where Ti > Tc; this requires that Tc >= the maximum read
                     timestamp of the object.
2.    write  write   Tc must not write an object that has been written by any Ti
                     where Ti > Tc; this requires that Tc > the write timestamp
                     of the committed object.
3.    read   write   Tc must not read an object that has been written by any Ti
                     where Ti > Tc; this requires that Tc > the write timestamp
                     of the committed object.

Figure: Operation conflicts for timestamp ordering

Timestamp ordering write rule: By combining rules 1 and 2 we have the following rule
for deciding whether to accept a write operation requested by transaction Tc on object D:

if (Tc >= maximum read timestamp on D &&
    Tc > write timestamp on committed version of D)
    perform write operation on tentative version of D with write timestamp Tc
else
    abort transaction Tc

If a tentative version with write timestamp Tc already exists, the write operation is
addressed to it; otherwise a new tentative version is created and given write timestamp Tc.
The figure illustrates the action of a write operation by a transaction T3 in cases where T3 >=
maximum read timestamp on the object (the read timestamps are not shown). In cases (a) to
(c), T3 > write timestamp on the committed version of the object and a tentative version with
write timestamp T3 is inserted at the appropriate place in the list of tentative versions ordered by
their transaction timestamps. In case (d), T3 < write timestamp on the committed version of
the object and the transaction is aborted.
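The write rule can be expressed as a short Python sketch. The class TimestampedObject, the Abort exception and the use of plain integers as timestamps are assumptions made for illustration, not part of the notes.

```python
from dataclasses import dataclass, field


class Abort(Exception):
    """Raised when the write rule rejects an operation."""


@dataclass
class TimestampedObject:
    """Hypothetical object state: timestamps plus its tentative versions."""
    committed_write_ts: int
    max_read_ts: int = 0
    tentative: dict = field(default_factory=dict)  # write timestamp -> value


def write(obj, tc, value):
    """Apply the timestamp ordering write rule for transaction timestamp tc."""
    if tc >= obj.max_read_ts and tc > obj.committed_write_ts:
        # Creates Tc's tentative version, or overwrites it if it already exists.
        obj.tentative[tc] = value
    else:
        raise Abort(f"write by T{tc} arrived too late")


d = TimestampedObject(committed_write_ts=2)
write(d, 3, "v3")                     # accepted: T3 is later than committed T2

late = TimestampedObject(committed_write_ts=4)
try:
    write(late, 3, "v3")              # rejected: T4 has already committed
    aborted = False
except Abort:
    aborted = True
```

Keying the tentative versions by write timestamp means that sorted(d.tentative) yields them in transaction timestamp order, which is the order in which they would have to be committed.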

Timestamp ordering read rule: By using rule 3 we have the following rule for deciding
whether to accept immediately, to wait or to reject a read operation requested by transaction Tc
on object D:

if (Tc > write timestamp on committed version of D) {
    let Dselected be the version of D with the maximum write timestamp <= Tc
    if (Dselected is committed)
        perform read operation on the version Dselected
    else
        wait until the transaction that made version Dselected commits or aborts,
        then reapply the read rule
} else
    abort transaction Tc

Figure illustrates the timestamp ordering read rule. It includes four cases labeled (a)

to (d), each of which illustrates the action of a read operation by transaction T3. In each case,

a version whose write timestamp is less than or equal to T3 is selected. If such a version exists,

it is indicated with a line. In cases (a) and (b) the read operation is directed to a committed

version - in (a) it is the only version, whereas in (b) there is a tentative version belonging to

a later transaction. In case (c) the read operation is directed to a tentative version and must wait

until the transaction that made it commits or aborts. In case (d) there is no suitable version to

read and transaction T3 is aborted.
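The read rule and the four kinds of case just described can be sketched as follows. The tuple and dictionary representation, the returned action labels and the concrete timestamps are assumptions chosen for illustration. The sketch also assumes every tentative version is later than the committed one, and it does not model a transaction reading its own tentative write.

```python
def read(committed, tentative, tc):
    """Apply the timestamp ordering read rule for transaction timestamp tc.

    committed: (write_ts, value) of the committed version.
    tentative: dict mapping write timestamp -> value of tentative versions.
    Returns ("read", value), ("wait", write_ts) or ("abort", None).
    """
    committed_ts, committed_value = committed
    if tc <= committed_ts:
        return ("abort", None)   # a later transaction has already committed a write
    # Select the version with the maximum write timestamp <= Tc.
    earlier_tentative = [ts for ts in tentative if ts <= tc]
    if not earlier_tentative:
        return ("read", committed_value)
    # The selected version is tentative: wait for its writer, then reapply the rule.
    return ("wait", max(earlier_tentative))


case_a = read((2, "x2"), {}, 3)           # only a committed version exists
case_b = read((2, "x2"), {4: "x4"}, 3)    # tentative version belongs to a later txn
case_c = read((1, "x1"), {2: "x2"}, 3)    # must wait for the writer of version T2
case_d = read((4, "x4"), {}, 3)           # no suitable version: abort
```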

5.6 COMPARISON OF METHODS FOR CONCURRENCY CONTROL

All of the methods carry some overheads in the time and space they require, and they

all limit to some extent the potential for concurrent operation.

The timestamp ordering method is similar to two-phase locking in that both use

pessimistic approaches in which conflicts between transactions are detected as each object is

accessed.

On the one hand, timestamp ordering decides the serialization order statically- when a

transaction starts. On the other hand, two-phase locking decides the serialization order

dynamically- according to the order in which objects are accessed.



Timestamp ordering, and in particular multiversion timestamp ordering, is better than
strict two-phase locking for read-only transactions. Two-phase locking is better when the
operations in transactions are predominantly updates.

The pessimistic methods differ in the strategy used when a conflicting access to

an object is detected. Timestamp ordering aborts the transaction immediately,

whereas locking makes the transaction wait - but with a possible later penalty of

aborting to avoid deadlock.

When optimistic concurrency control is used, all transactions are allowed to

proceed, but some are aborted when they attempt to commit, or in forward validation,

transactions are aborted earlier.

This results in relatively efficient operation when there are few conflicts, but a

substantial amount of work may have to be repeated when a transaction is aborted.

Locking has been in use for many years in database systems, but timestamp

ordering has been used in the SDD-1 database system. Both methods have been used

in file servers.

5.7 INTRODUCTION TO DISTRIBUTED TRANSACTIONS

In the general case, a transaction, whether flat or nested, will access objects
located in several different computers. We use the term distributed transaction to
refer to a flat or nested transaction that accesses objects managed by multiple
servers.

When a distributed transaction comes to an end, the atomicity property of
transactions requires that either all of the servers involved commit the transaction or
all of them abort it. To achieve this, one of the servers takes on a coordinator
role, which involves ensuring the same outcome at all of the servers. The manner
in which the coordinator achieves this depends on the protocol chosen.

A protocol known as the two-phase commit protocol is the one most commonly

used. This protocol allows the servers to communicate with one another to reach a

joint decision as to whether to commit or abort.

Concurrency control in distributed transactions is based on the methods

discussed previously. Each server applies local concurrency control to its own objects,

which ensures that transactions are serialized locally.

Distributed transactions must be serialized globally. How this is achieved varies

as to whether locking, timestamp ordering or optimistic concurrency control is in use.



In some cases, the transactions may be serialized at the individual servers, but at the
same time a cycle of dependencies between the different servers may occur and a
distributed deadlock may arise.

Transaction recovery is concerned with ensuring that all the objects involved in

transactions are recoverable. In addition to that, it guarantees that the values of the

objects reflect all the changes made by committed transactions and none of those

made by aborted ones.

5.8 FLAT AND NESTED DISTRIBUTED TRANSACTIONS

A client transaction becomes distributed if it invokes operations in several

different servers. There are two different ways that distributed transactions can be

structured: as flat transactions and as nested transactions. In a flat transaction, a client

makes requests to more than one server. For example, in Figure 4.22(a), transaction T

is a transaction that invokes operations on objects in servers X,Y and Z. A flat client

transaction completes each of its requests before going on to the next one. Therefore,

each transaction accesses servers' objects sequentially. When servers use locking, a

transaction can only be waiting for one object at a time.

In a nested transaction, the top-level transaction can open subtransactions, and each
subtransaction can open further subtransactions down to any depth of nesting. For example,
a client's transaction T opens two subtransactions T1 and T2, which access objects at servers
X and Y. The subtransactions T1 and T2 open further subtransactions T11, T12, T21 and T22,
which access objects at servers M, N, and P. In the nested case, subtransactions at the same level
can run concurrently, so T1 and T2 are concurrent, and as they invoke objects in different
servers, they can run in parallel. The four subtransactions T11, T12, T21 and T22 also run
concurrently.

The coordinator of a distributed transaction

Servers that execute requests as part of a distributed transaction need to be able to
communicate with one another to coordinate their actions when the transaction commits. A
client starts a transaction by sending an openTransaction request to a coordinator in any
server. The coordinator that is contacted carries out the openTransaction and returns the
resulting transaction identifier to the client. Transaction identifiers for distributed transactions
must be unique within a distributed system.

The coordinator that opened the transaction becomes the coordinator for the distributed
transaction and at the end is responsible for committing or aborting it. Each of the servers
that manage an object accessed by a transaction is a participant in the transaction. During the
progress of the transaction, the coordinator records a list of references to the participants,
and each participant records a reference to the coordinator. The method join is used whenever a
new participant joins the transaction:

join(Trans, reference to participant)
    Informs a coordinator that a new participant has joined the transaction Trans.
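The bookkeeping described above can be sketched in Python. The Coordinator class, its method names and the way identifiers are formed (server id paired with a local counter) are assumptions for illustration; the notes only require that identifiers be unique within the distributed system.

```python
import itertools


class Coordinator:
    """Hypothetical coordinator that tracks the participants of each transaction."""

    def __init__(self, server_id):
        self.server_id = server_id
        self._counter = itertools.count(1)
        self.participants = {}  # transaction id -> list of participant references

    def open_transaction(self):
        # Pairing the server id with a local counter keeps ids unique system-wide,
        # provided the server ids themselves are unique.
        trans = (self.server_id, next(self._counter))
        self.participants[trans] = []
        return trans

    def join(self, trans, participant):
        """Record that participant has joined transaction trans."""
        if participant not in self.participants[trans]:
            self.participants[trans].append(participant)


coordinator = Coordinator("X")
t = coordinator.open_transaction()
coordinator.join(t, "server Y")
coordinator.join(t, "server Z")
coordinator.join(t, "server Y")   # joining twice has no further effect
```

The recorded participant list is what the coordinator later needs in order to contact every server when the transaction commits or aborts.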

5.9 ATOMIC COMMIT PROTOCOLS

In the case of a distributed transaction, the client has requested the operations at
more than one server. A transaction comes to an end when the client requests that a
transaction be committed or aborted. A simple way to complete the transaction in an
atomic manner is for the coordinator to communicate the commit or abort request to all of
the participants in the transaction and to keep on repeating the request until all of them
have acknowledged that they have carried it out. This is an example of a one-phase atomic
commit protocol. This simple one-phase atomic commit protocol is inadequate because,
in the case when the client requests a commit, it does not allow a server to make a
unilateral decision to abort a transaction. The two-phase commit protocol is designed to
allow any participant to abort its part of a transaction. Due to the requirement for
atomicity, if one part of a transaction is aborted, then the whole transaction must also
be aborted.

The two-phase commit protocol

In the first phase of the two-phase commit protocol the coordinator asks all the

participants if they are prepared to commit; and in the second, it tells them to commit (or

abort) the transaction. If a participant can commit its part of a transaction, it will agree as

soon as it has recorded the changes and its status in permanent storage- and is prepared

to commit. The coordinator in a distributed transaction communicates with the

participants to carry out the two-phase commit protocol by means of the operations

summarized below:

canCommit?(trans) → Yes / No
    Call from coordinator to participant to ask whether it can commit a transaction.
    Participant replies with its vote.

doCommit(trans)
    Call from coordinator to participant to tell participant to commit its part of a
    transaction.

doAbort(trans)
    Call from coordinator to participant to tell participant to abort its part of a
    transaction.

haveCommitted(trans, participant)
    Call from participant to coordinator to confirm that it has committed the transaction.

getDecision(trans) → Yes / No
    Call from participant to coordinator to ask for the decision on a transaction after it has
    voted Yes but has still had no reply after some delay. Used to recover from server crash
    or delayed messages.

The two-phase commit protocol consists of a voting phase and a completion phase.
The steps involved are summarized below. At the end of the second step the coordinator
and all the participants that voted Yes are prepared to commit. By the end of the third step
the transaction is effectively completed. At step 3(a) the coordinator and the participants
are committed, so the coordinator can report a decision to commit to the client. At step 3(b)
the coordinator reports a decision to abort to the client. At step 4 participants confirm that
they have committed so that the coordinator knows when the information it has recorded about
the transaction is no longer needed.

Phase 1 (voting phase):

1. The coordinator sends a canCommit request to each of the participants in the
   transaction.

2. When a participant receives a canCommit request it replies with its vote (Yes
   or No) to the coordinator. Before voting Yes, it prepares to commit by saving
   objects in permanent storage. If the vote is No the participant aborts
   immediately.

Phase 2 (completion according to outcome of vote):

3. The coordinator collects the votes (including its own).

   (a) If there are no failures and all the votes are Yes the coordinator decides to
       commit the transaction and sends a doCommit request to each of the participants.

   (b) Otherwise the coordinator decides to abort the transaction and sends doAbort
       requests to all participants that voted Yes.

4. Participants that voted Yes are waiting for a doCommit or doAbort request
   from the coordinator. When a participant receives one of these messages it
   acts accordingly and, in the case of commit, makes a haveCommitted call as
   confirmation to the coordinator.
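The voting and completion phases can be sketched in a few lines of Python. This is a failure-free, in-memory illustration: the Participant class and its will_commit flag are hypothetical, and the sketch omits permanent storage, timeouts, and the haveCommitted and getDecision calls.

```python
class Participant:
    """Hypothetical in-memory participant; a real one writes to permanent storage."""

    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "working"

    def can_commit(self, trans):
        # Vote Yes only after preparing; a No vote aborts immediately.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def do_commit(self, trans):
        self.state = "committed"

    def do_abort(self, trans):
        self.state = "aborted"


def two_phase_commit(trans, participants):
    """Run both phases and return the coordinator's decision."""
    # Phase 1: collect the votes.
    votes = {p: p.can_commit(trans) for p in participants}
    # Phase 2: decide, then inform the participants.
    if all(votes.values()):
        for p in participants:
            p.do_commit(trans)
        return "commit"
    for p, voted_yes in votes.items():
        if voted_yes:          # participants that voted No have already aborted
            p.do_abort(trans)
    return "abort"


ok = [Participant("X"), Participant("Y")]
mixed = [Participant("X"), Participant("Y", will_commit=False)]
decision_ok = two_phase_commit("T1", ok)
decision_mixed = two_phase_commit("T2", mixed)
```

A single No vote is enough to abort the whole transaction, which is exactly the unilateral abort that the one-phase protocol cannot offer.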

There are various stages in the protocol at which the coordinator or a participant

cannot progress its part of the protocol until it receives another request or reply from one

of the others.

Consider the situation where a participant has voted Yes and is waiting for the

coordinator to report on the outcome of the vote by telling it to commit or abort the

transaction. As shown in Figure 4.23, such a participant is uncertain of the outcome and

cannot proceed any further until it gets the outcome of the vote from the coordinator.

The participant cannot decide unilaterally what to do next, and meanwhile the

objects used by its transaction cannot be released for use by other transactions. The

participant makes a getDecision request to the coordinator to determine the outcome of

the transaction. When it gets the reply it continues the protocol at step 4 mentioned

above. If the coordinator has failed, the participant will not be able to get the decision

until the coordinator is replaced, which can result in extensive delays for participants in

the uncertain state.

Two phase commit protocol for nested transactions:

A coordinator for a subtransaction will provide an operation to open a
subtransaction, together with an operation enabling the coordinator of a subtransaction
to enquire whether its parent has yet committed or aborted, as shown in Figure 4.24. A
client starts a set of nested transactions by opening a top-level transaction with an
openTransaction operation, which returns a transaction identifier for the top-level
transaction. The client starts a subtransaction by invoking the openSubTransaction
operation, whose argument specifies its parent transaction. The new subtransaction
automatically joins the parent transaction, and a transaction identifier for the
subtransaction is returned.

openSubTransaction(trans) → subTrans
    Opens a new subtransaction whose parent is trans and returns a
    unique subtransaction identifier.

getStatus(trans) → committed, aborted, provisional
    Asks the coordinator to report on the status of the transaction trans. Returns
    values representing one of the following: committed, aborted, provisional.

Figure: Operations in coordinator for nested transactions

Each of the nested transactions carries out its operations. When they are finished,
the server managing a subtransaction records information as to whether the
subtransaction committed provisionally or aborted. Note that if its parent aborts, then
the subtransaction will be forced to abort too.

A parent transaction, including a top-level transaction, can commit even if one
of its child subtransactions has aborted. Consider the top-level transaction T and its
subtransactions shown in Figure 4.25, which is based on the nested example above. Each
subtransaction has either provisionally committed or aborted. For example, T12 has
provisionally committed and T11 has aborted, but the fate of T12 depends on its parent T1
and eventually on the top-level transaction, T. Although T21 and T22 have both
provisionally committed, T2 has aborted and this means that T21 and T22 must also
abort. Suppose that T decides to commit in spite of the fact that T2 has aborted, and also
that T1 decides to commit in spite of the fact that T11 has aborted.

When a top-level transaction completes, its coordinator carries out a two-phase

commit protocol. The top-level transaction plays the role of coordinator in the two-phase

commit protocol.

The two-phase commit protocol may be performed in either a hierarchic manner or

in a flat manner. The second phase of the two-phase commit protocol is the same as for

the non-nested case. The coordinator collects the votes and then informs the participants

as to the outcome. When it is complete, coordinator and participants will have committed

or aborted their transactions.

Hierarchic two-phase commit protocol: In this approach, the two-phase commit

protocol becomes a multi-level nested protocol. The coordinator of the top-level

transaction communicates with the coordinators of the subtransactions for which it is the

immediate parent.



It sends canCommit messages to each of the latter, which in turn pass them on to
the coordinators of their child transactions (and so on down the tree). Each participant
collects the replies from its descendants before replying to its parent. In our example, T
sends canCommit messages to the coordinator of T1 and then T1 sends canCommit
messages to T12 asking about descendants of T1.

The protocol does not include the coordinators of transactions such as T2, which
has aborted. The participant receiving the call looks in its transaction list for any
provisionally committed transaction or subtransaction. If it finds one, it replies with a
Yes vote; otherwise it replies with a No vote.

Flat two-phase commit protocol: In this approach, the coordinator of the top-level
transaction sends canCommit messages to the coordinators of all of the subtransactions
in the provisional commit list, in our example the coordinators of T1 and T12. Each
canCommit request carries the abort list, the list of subtransactions that have aborted.

When a participant receives a canCommit request, it does the following:

If the participant has any provisionally committed transactions that are
descendants of the top-level transaction, trans:

- check that they do not have aborted ancestors in the abort list, then prepare to
  commit (by recording the transaction and its objects in permanent storage);

- those with aborted ancestors are aborted;

- send a Yes vote to the coordinator.

If the participant does not have a provisionally committed descendant of the top-
level transaction, it must have failed since it performed the subtransaction and it
sends a No vote to the coordinator.
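The participant's handling of a canCommit request in the flat protocol can be sketched as below. The function signature, the parent map used to find ancestors and the idea of one participant holding both T12 and T21 are assumptions for illustration; the notes do not say how ancestry is represented.

```python
def can_commit(provisional, parent, abort_list):
    """Return (vote, to_commit, to_abort) for one participant.

    provisional: ids of this participant's provisionally committed subtransactions.
    parent: maps each subtransaction id to its parent's id (top level has no entry).
    abort_list: ids of aborted subtransactions supplied by the coordinator.
    """
    if not provisional:
        return ("No", [], [])    # must have failed since doing the work
    to_commit, to_abort = [], []
    for sub in provisional:
        ancestor, doomed = parent.get(sub), False
        while ancestor is not None:       # walk up towards the top level
            if ancestor in abort_list:
                doomed = True
                break
            ancestor = parent.get(ancestor)
        (to_abort if doomed else to_commit).append(sub)
    return ("Yes", to_commit, to_abort)


parent = {"T1": "T", "T2": "T", "T11": "T1", "T12": "T1", "T21": "T2", "T22": "T2"}
abort_list = ["T11", "T2"]
vote = can_commit(["T12", "T21"], parent, abort_list)
vote_after_crash = can_commit([], parent, abort_list)
```

T12 is prepared because neither T1 nor T appears in the abort list, while T21 is aborted because its parent T2 does.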

5.10 CONCURRENCY CONTROL IN DISTRIBUTED TRANSACTIONS

Locking

In a distributed transaction, the locks on an object are held locally. The local lock
manager can decide whether to grant a lock or make the requesting transaction wait.
However, it cannot release any locks until it knows that the transaction has been
committed or aborted at all the servers involved in the transaction. As lock managers in
different servers set their locks independently of one another, it is possible that different
servers may impose different orderings on transactions. Consider the following interleaving
of transactions T and U at servers X and Y:



T                                U
Write(A) at X   locks A
                                 Write(B) at Y   locks B
Read(B) at Y    waits for U
                                 Read(A) at X    waits for T

Timestamp ordering concurrency control

In distributed transactions, a globally unique transaction timestamp is issued to the

client by the first coordinator accessed by a transaction. The transaction timestamp is

passed to the coordinator at each server whose objects perform an operation in the

transaction. The servers of distributed transactions are jointly responsible for ensuring

that they are performed in a serially equivalent manner. For example, if the version of an

object accessed by transaction U commits after the version accessed by T at one server,



then if T and U access the same object as one another at other servers, the coordinators

must agree as to the ordering of their timestamps. A timestamp consists of a pair <local

timestamp, server-id>. The agreed ordering of pairs of timestamps is based on a

comparison in which the server-id part is less significant.
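As a small illustration of this ordering, Python tuples compare element by element, so placing the local timestamp first makes the server-id the less significant part. The concrete timestamps and server names below are made up for the example.

```python
# A timestamp is a pair (local timestamp, server id). Tuples compare element
# by element, so the server id is consulted only when local timestamps tie.
t_from_x = (10, "X")
t_from_y = (10, "Y")
t_later = (11, "A")

ordered = sorted([t_later, t_from_y, t_from_x])
```

Because every server applies the same comparison, all of them agree on the relative order of any two transactions, even when two coordinators issue the same local timestamp.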

When timestamp ordering is used for concurrency control, conflicts are resolved as

each operation is performed. If the resolution of a conflict requires a transaction to be

aborted, the coordinator will be informed and it will abort the transaction at all the

participants. Therefore any transaction that reaches the client request to commit should

always be able to commit.

Optimistic concurrency control

A distributed transaction is validated by a collection of independent servers, each of
which validates transactions that access its own objects. The validation at all of the servers
takes place during the first phase of the two-phase commit protocol. Consider the
following interleaving of transactions T and U, which access objects A and B at servers X
and Y, respectively.

T                         U
Read(A)    at X           Read(B)    at Y
Write(A)                  Write(B)
Read(B)    at Y           Read(A)    at X
Write(B)                  Write(A)

The transactions access the objects in the order T before U at server X and in the

order U before T at server Y. Now suppose that T and U start validation at about the

same time, but server X validates T first and server Y validates U first. Each server will

be unable to validate the other transaction until the first one has completed. This is an

example of commitment deadlock. In distributed optimistic transactions, each server

applies a parallel validation protocol. This is an extension of either backward or forward

validation to allow multiple transactions to be in the validation phase at the same time. If

parallel validation is used transactions will not suffer from commitment deadlock.



5.11 DISTRIBUTED DEADLOCKS

In a distributed system involving multiple servers being accessed by multiple
transactions, a global wait-for graph can in theory be constructed from the local ones. There
can be a cycle in the global wait-for graph that is not in any single local one; that is, there
can be a distributed deadlock. The table below shows the interleaving of the transactions U, V
and W involving the objects A and B managed by servers X and Y and objects C and D
managed by server Z.

U                              V                              W
d.deposit(10)   lock D
                               b.deposit(10)   lock B at Y
a.deposit(20)   lock A at X
                                                              c.deposit(30)   lock C at Z
b.withdraw(30)  wait at Y
                               c.withdraw(20)  wait at Z
                                                              a.withdraw(20)  wait at X

The complete wait-for graph shows that a deadlock cycle consists of alternate edges,
which represent a transaction waiting for an object and an object held by a transaction.



Phantom deadlocks: A deadlock that is 'detected' but is not really a deadlock is called a
phantom deadlock. In distributed deadlock detection, information about wait-for
relationships between transactions is transmitted from one server to another. If there is a
deadlock, the necessary information will eventually be collected in one place and a
cycle will be detected. As this procedure will take some time, there is a chance that one
of the transactions that holds a lock will meanwhile have released it, in which case the
deadlock will no longer exist.

Edge chasing: A distributed approach to deadlock detection uses a technique called
edge chasing or path pushing. In this approach, the global wait-for graph is not
constructed, but each of the servers involved has knowledge about some of its edges. The
servers attempt to find cycles by forwarding messages called probes, which follow the edges
of the graph throughout the distributed system. A probe message consists of transaction
wait-for relationships representing a path in the global wait-for graph.

Edge-chasing algorithms have three steps - initiation, detection and resolution.

Initiation: When a server notes that a transaction T starts waiting for another
transaction U, where U is waiting to access an object at another server, it initiates
detection by sending a probe containing the edge <T → U> to the server of the object at
which transaction U is blocked. If U is sharing a lock, probes are sent to all the holders of
the lock. Sometimes further transactions may start sharing the lock later on, in which
case probes can be sent to them too.

Detection: Detection consists of receiving probes and deciding whether deadlock

has occurred and whether to forward the probes. For example:

When a server of an object receives a probe <T → U> (indicating that T is
waiting for a transaction U that holds a local object), it checks to see whether U is also
waiting. If it is, the transaction it waits for (for example, V) is added to the probe
(making it <T → U → V>), and if the new transaction (V) is waiting for another
object elsewhere, the probe is forwarded. In this way, paths through the global wait-for
graph are built one edge at a time. Before forwarding a probe, the server checks to
see whether the transaction it has just added has caused the probe to contain a cycle
(for example, <T → U → V → T>). If this is the case, it has
found a cycle in the graph and deadlock has been detected.



Resolution: When a cycle is detected, a transaction in the cycle is aborted to

break the deadlock.

In our example, the following steps describe how deadlock detection is
initiated and the probes that are forwarded during the corresponding detection phase.

Server X initiates detection by sending probe <W → U> to the server of B
(server Y).

Server Y receives probe <W → U>, notes that B is held by V and appends V
to the probe to produce <W → U → V>. It notes that V is waiting for C at
server Z. This probe is forwarded to server Z.

Server Z receives probe <W → U → V>, notes that C is held by W and
appends W to the probe to produce <W → U → V → W>.

This path contains a cycle. The server detects a deadlock. One of the transactions in
the cycle must be aborted to break the deadlock. The transaction to be aborted can be
chosen according to transaction priorities. Transaction priorities could also be used to
reduce the number of probes that are forwarded.
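The probe forwarding in this example can be simulated with a short Python sketch. The function chase and the single waits_for dictionary are assumptions made for illustration: in a real system each edge is known only to one server, each loop iteration corresponds to a message sent to the next server, and a transaction may wait for several lock holders rather than one.

```python
def chase(initial_probe, waits_for):
    """Extend a probe one edge at a time; return the path if it closes a cycle.

    waits_for: maps each blocked transaction to the transaction it waits for.
    This one dictionary stands in for knowledge spread over several servers.
    """
    probe = list(initial_probe)
    while True:
        last = probe[-1]
        if last not in waits_for:
            return None            # last transaction is not waiting: no deadlock
        nxt = waits_for[last]
        probe.append(nxt)          # the receiving server appends the lock holder
        if nxt in probe[:-1]:
            return probe           # a transaction repeats: cycle detected


# Edges from the example: U waits for V (object B), V for W (C), W for U (A).
deadlocked = chase(["W", "U"], {"U": "V", "V": "W", "W": "U"})
not_deadlocked = chase(["W", "U"], {"U": "V"})
```

The first call reproduces the probe <W → U → V → W> built by servers Y and Z above; the second ends without a cycle because V is not waiting for anything.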

5.12 TRANSACTION RECOVERY

Recovery is concerned with ensuring that a server's objects are durable and that

the service provides failure atomicity.

The tasks of a recovery manager are:

• to save objects in permanent storage (in a recovery file) for committed
  transactions

• to restore the server's objects after a crash

• to reorganize the recovery file to improve the performance of recovery

• to reclaim storage space (in the recovery file)

Intentions list: Any server that provides transactions needs to keep track of the objects
accessed by clients' transactions. At each server, an intentions list is recorded for all of its
currently active transactions. An intentions list of a particular transaction contains a list
of the references and the values of all the objects that are altered by that transaction. When
a transaction is committed, that transaction's intentions list is used to identify the objects
it affected. The committed version of each object is replaced by the tentative version
made by that transaction, and the new value is written to the server's recovery file. When a
transaction aborts, the server uses the intentions list to delete all the tentative versions of
objects made by that transaction.

Logging: In the logging technique, the recovery file represents a log containing the
history of all the transactions performed by a server. The history consists of values
of objects, transaction status entries and intentions lists of transactions. The recovery
file will contain a recent snapshot of the values of all the objects in the server
followed by a history of transactions after the snapshot.

During the normal operation of a server, its recovery manager is called whenever
a transaction prepares to commit, commits or aborts. When the server
is prepared to commit a transaction, the recovery manager appends all the objects
in its intentions list to the recovery file, followed by the current status of that transaction
(prepared) together with its intentions list.
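The append-only behaviour described above can be sketched in Python. The RecoveryLog class, the tuple layout of its entries and the object values are hypothetical; a real recovery file lives in permanent storage, and its entries are positioned by file offset rather than list index.

```python
class RecoveryLog:
    """Hypothetical append-only recovery file held in a Python list."""

    def __init__(self):
        self.entries = []

    def _append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1      # position of the entry in the log

    def prepare(self, trans, tentative_values):
        # Write each object's value, then a prepared status entry whose
        # intentions list pairs every object id with its position in the log.
        intentions = [(obj, self._append(("object", obj, value)))
                      for obj, value in tentative_values.items()]
        return self._append(("status", trans, "prepared", intentions))

    def commit(self, trans):
        return self._append(("status", trans, "committed"))


log = RecoveryLog()
prepared_at = log.prepare("T", {"A": 80, "B": 220})   # made-up object values
committed_at = log.commit("T")
```

Because the intentions list stores log positions rather than values, recovery can find the values of a committed transaction's objects by following those positions back into the log.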

The log was recently reorganized, and entries to the left of the double line represent
a snapshot of the values of A, B and C before transactions T and U started. In this example we
use the names A, B and C as unique identifiers for objects. We show the situation
when transaction T has committed and transaction U has prepared but not
committed. When transaction T prepares to commit, the values of objects A and B are
written at positions P1 and P2 in the log, followed by a prepared transaction status
entry for T with its intentions list (<A, P1>, <B, P2>). When transaction T
commits, a committed transaction status entry for T is put at position P4. Then when

transaction U prepares to commit, the values of objects C.
