DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING

Chapter 9 : Distributed Computing vs Distributed High Performance Computing

What is Distributed Computing

Distributed computing is a field of computer science that studies distributed systems.

A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.

A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs.

What is Distributed High Performance Computing

Distributed High-Performance Computing (DHPC), also known as High-Performance Distributed Computing (HPDC), combines advances in research and technology in high-speed networks, software, distributed computing and parallel processing to deliver high-performance, large-scale and cost-effective computational, storage and communication capabilities to a wide range of applications.

Issues in Distributed Systems

1. Architectures

2. Processes

3. Communication

4. Naming

5. Synchronization

6. Consistency and Replication

7. Fault Tolerance

8. Security

1. Architecture Models

Three basic architectural models for distributed systems: workstations/servers model; processor pool (thin client) model; integrated model.

Example 1: Internet

Example 2 : Intranets

Example 5: Distributed Multimedia System

1. DHPC : Example of Architecture for Distributed High Performance Computing

2. Processes : Interprocess Communication

Distributed processes or tasks need to communicate

For distributed computing we usually do not have shared memory, so we need to use message passing

process A sends a message to process B

process B receives it

send/receive may be synchronous (A blocks until B receives the message) or asynchronous (some buffering mechanism allows A to proceed as soon as it has sent data)

simple ideas: send/receive, plus some startup interrogation to find out process identities, form the basis for distributed and parallel computation mechanisms

pairing of receive and send together into a single unit forms a transaction
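A minimal sketch of the two send styles, assuming two threads stand in for processes A and B and an in-memory queue stands in for the network. The `Channel` class and its method names are hypothetical, chosen only for illustration.

```python
import queue
import threading


class Channel:
    """One-way channel from a sender to a receiver (hypothetical helper)."""

    def __init__(self):
        self._messages = queue.Queue()  # buffer that decouples sender and receiver

    def send_async(self, message):
        # Asynchronous send: hand the message to the buffer and continue at once.
        self._messages.put((message, None))

    def send_sync(self, message):
        # Synchronous send: block until the receiver has taken the message.
        received = threading.Event()
        self._messages.put((message, received))
        received.wait()

    def receive(self):
        # Blocking receive: wait for the next message, then acknowledge it.
        message, received = self._messages.get()
        if received is not None:
            received.set()
        return message


def run_demo():
    channel = Channel()
    log = []

    def process_b():
        for _ in range(2):
            log.append(channel.receive())

    b = threading.Thread(target=process_b)
    b.start()
    channel.send_async("hello")  # process A continues immediately
    channel.send_sync("world")   # process A waits until B has received it
    b.join()
    return log
```

The only difference between the two sends is whether A waits for B's acknowledgement; the buffer is what makes the asynchronous form possible.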

3. Communication : Remote Procedure Calls

Need a standard mechanism for invoking some processing on a remote machine

Remote Procedure Calls (RPC) enable this for procedural languages

C-based precursor to remote method invocation in object-oriented systems like Java and CORBA

RPCs look like normal procedure calls – relatively transparent API

When an RPC happens, input parameters are copied to the destination process

Body of the procedure is executed in the context of the remote process

Output parameters are copied back and the call returns

RPCs are implemented using a structured form of message passing

RPC transparency does break down however, e.g. timeouts on RPC calls are sometimes desirable

Also call by value (copy) semantics are necessary, so cannot transparently pass pointer types over RPC

Cost of the remote call can be orders of magnitude greater than a local call, unless computation required for the call is much larger than time to initiate the RPC
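A small sketch of copy semantics in an RPC, using Python's standard `xmlrpc` modules rather than the C-based RPC the slide refers to. The procedure name `append_item` and the loopback address are assumptions made for the example; the server runs in a background thread of the same program only to keep the sketch self-contained.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def append_item(items, item):
    # Runs in the callee: it mutates its own copy of the list.
    items.append(item)
    return items


def start_server():
    # Port 0 asks the operating system for any free port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(append_item)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def run_demo():
    server = start_server()
    port = server.server_address[1]
    try:
        with ServerProxy(f"http://127.0.0.1:{port}") as remote:
            local_items = [1, 2]
            # Looks like a local call, but the arguments are marshalled and copied.
            result = remote.append_item(local_items, 3)
        return local_items, result
    finally:
        server.shutdown()
        server.server_close()
```

After the call, the caller's list is unchanged and only the returned copy contains the new item, which is why pointer-like arguments cannot be passed transparently.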

3. Communication : Client/Server using RPC

RPC callee lifetime is almost always longer than the call

Callee is usually some kind of server

The callee never terminates (in practical terms)

For example:

loop
    accept_call(...);
    process_this_call(...);
    complete_call(...);
end loop;

Hence the RPCs have a sort of local data persistence

RPC calls of this sort are a form of generator

Interesting set of problems in controlling long lived data and resource allocation and access at the server end
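A sketch of the server loop with state that persists between calls, assuming a thread stands in for the server process and queues stand in for the call transport. The ticket counter and the `None` shutdown signal are hypothetical details added so the example can run and stop.

```python
import queue
import threading


def server_loop(calls):
    next_ticket = 1  # server-side state that outlives every individual call
    while True:
        reply_box = calls.get()      # accept_call
        if reply_box is None:        # shutdown signal, only so the demo can end
            break
        reply_box.put(next_ticket)   # process_this_call and complete_call
        next_ticket += 1


def call_server(calls):
    reply_box = queue.Queue(maxsize=1)
    calls.put(reply_box)             # send the request
    return reply_box.get()           # block until the reply arrives


def run_demo():
    calls = queue.Queue()
    server = threading.Thread(target=server_loop, args=(calls,))
    server.start()
    tickets = [call_server(calls) for _ in range(3)]
    calls.put(None)
    server.join()
    return tickets
```

Each call returns the next value in a sequence, so the server behaves like a generator whose state lives at the callee.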

4. Naming

Names are used to share resources, to uniquely identify entities, to refer to locations in computer systems.

An important issue with naming is that a name can be resolved to the entity it refers to. Name resolution allows a process to access the named entity.

To resolve names, it is necessary to implement a naming system. The difference between naming in DSs and non-DSs lies in the way naming systems are implemented. In a DS, the implementation of a naming system is itself often distributed across multiple machines.

Two major issues in designing naming systems in DS: efficiency and scalability.

4. Naming : Name Space Distribution

An example partitioning of the DNS name space, including Internet-accessible files, into three layers.

A naming service is implemented by name servers. In large DSs with many entities it is necessary to distribute the implementation of a name space over multiple name servers.

To efficiently implement a name space for a large-scale, possibly worldwide, DS, it is usually organized hierarchically and may be partitioned into logical layers:

* global layer: formed by the highest-level nodes, e.g., root and other directory nodes logically close to the root. The directory tables in these nodes are rarely changed.

* administrational layer: formed by the directory nodes managed within a single organization. The nodes in this layer are relatively stable, although less stable than those in the global layer.


* managerial layer: formed by the nodes that may change regularly, e.g., nodes representing hosts in the LAN. The nodes in this layer are also maintained by end users of a DS.

The distribution of a name space across multiple name servers affects the implementation of name resolution.

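Iterative name resolution: The client's name resolver contacts the name servers one at a time, starting at the root. Each server resolves as much of the name as it can and returns the address of the next server to contact.

Recursive name resolution: A name server that cannot fully resolve the name passes the request on to the next name server itself, and the result is returned along the same chain of servers to the client.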
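A toy sketch of the two resolution styles, assuming three hypothetical name servers held in a dictionary and a name given as a list of labels. The server names and the address 192.0.2.10 (a documentation-only address) are illustrative, not real DNS data.

```python
# Hypothetical name servers: each maps the next label either to another
# server (a referral) or to a final address.
SERVERS = {
    "root": {"org": "ns-org"},
    "ns-org": {"example": "ns-example"},
    "ns-example": {"www": "192.0.2.10"},
}


def lookup(server, label):
    """Ask one name server to resolve one label."""
    return SERVERS[server][label]


def resolve_iterative(labels):
    # The client's resolver contacts every server itself and follows referrals.
    server, contacted = "root", []
    for label in labels:
        contacted.append(server)
        server = lookup(server, label)
    return server, contacted


def resolve_recursive(labels, server="root"):
    # Each server resolves one label, then forwards the rest to the next
    # server and passes the final answer back along the same chain.
    answer = lookup(server, labels[0])
    if len(labels) == 1:
        return answer
    return resolve_recursive(labels[1:], server=answer)
```

Both functions return the same address for `["org", "example", "www"]`; they differ in who does the forwarding, which affects the load on the higher-level servers and the number of long-distance messages the client sends.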


DHPC : 4. Naming System

Refer to article

5. Synchronization

We need to measure time accurately: to know the time an event occurred at a computer

To do this we need to synchronize its clock with an authoritative external clock

Algorithms for clock synchronization are useful for concurrency control based on timestamp ordering

There is no global clock in a distributed system

Logical time is an alternative

It gives ordering of events - also useful for consistency of replicated data
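One well-known form of logical time is Lamport's logical clock. The sketch below is a minimal version, assuming each process keeps one integer counter and attaches it to every message it sends; the class and method names are illustrative.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the timestamp travels with the message.
        return self.tick()

    def receive(self, message_time):
        # Jump past the sender's timestamp so the receive is ordered after the send.
        self.time = max(self.time, message_time) + 1
        return self.time


def run_demo():
    a, b = LamportClock(), LamportClock()
    a.tick()                        # a: 1
    a.tick()                        # a: 2
    stamp = a.send()                # a: 3
    b.tick()                        # b: 1
    received_at = b.receive(stamp)  # b: max(1, 3) + 1 = 4
    return stamp, received_at
```

The counters say nothing about real time, but a send always carries a smaller timestamp than the matching receive, which is the ordering property needed for timestamp-based concurrency control.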

6. Consistency and Replication

• Two primary reasons for replicating data in DS: reliability and performance.

• Reliability: The system can continue working after one replica crashes by simply switching to one of the other replicas; it also becomes possible to provide better protection against corrupted data.

• Performance: When the number of processes to access data managed by a server increases, performance can be improved by replicating the server and subsequently dividing the work; Also, a copy of data can be placed in the proximity of the process using them to reduce the time of data access.

• Consistency issue: keeping all replicas up-to-date.

6.1 Distribution Protocols: Replica Placement

Several ways of distributing (propagating) updates to replicas, independent of the supported consistency model, have been proposed.

Replica Placement: deciding where, when, and by whom copies of the data store are to be placed.

Three different types of copies (permanent replicas, server-initiated replicas, and client-initiated replicas) can be distinguished, and are logically organized as shown in the next slide.

Permanent replicas: the initial set of replicas constituting a distributed data store.

Server-initiated replicas: copies of a data store for enhancing performance. They are created at the initiative of the (owner of the) data store.

For example, it may be worthwhile to install a number of such replicas of a Web server in regions where many requests are coming from.

One of the major problems with such replicas is to decide exactly where and when the replicas should be created or deleted.

Server-initiated replication is gradually increasing in popularity, especially in the context of Web hosting services. Such hosting services can dynamically replicate files to servers close to demanding clients.
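One simple way to decide where and when to create or delete server-initiated replicas is to count requests per region and compare the counts with two thresholds. The sketch below assumes this policy; the threshold values, region names and function name are hypothetical.

```python
from collections import Counter

REPLICATE_ABOVE = 100  # hypothetical: requests per period that justify a replica
DELETE_BELOW = 10      # hypothetical: requests per period below which one is dropped


def plan_replicas(requests_by_region, current_replicas):
    """Return (regions to create a replica in, regions to delete one from)."""
    counts = Counter(requests_by_region)  # missing regions count as 0
    to_create = sorted(region for region, n in counts.items()
                       if n > REPLICATE_ABOVE and region not in current_replicas)
    to_delete = sorted(region for region in current_replicas
                       if counts[region] < DELETE_BELOW)
    return to_create, to_delete
```

For example, `plan_replicas({"eu": 250, "us": 40, "asia": 3}, {"us", "asia"})` proposes a new replica in "eu" and removal of the one in "asia". Keeping the two thresholds apart avoids creating and deleting the same replica repeatedly.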

6.1 Server Initiated Replicas

6.1 Client-initiated replicas

Client-initiated replicas: copies created at the initiative of clients, known as caches.

In principle, managing the cache is left entirely to the client, but there are many occasions in which the client can rely on participation from the data store to inform it when the cached data has become stale.

Placement of client caches is relatively simple: a cache is normally placed in the same machine as its client, or on a machine shared by clients in the same LAN.

Data are generally kept in a cache for a limited amount of time to prevent extremely stale data from being used, or simply to make room for other data.
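A minimal sketch of a client cache with a time limit on entries and an invalidation hook for the data store. The `ClientCache` class, the `fetch` callback and the injectable `clock` are assumptions made for the example.

```python
import time


class ClientCache:
    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # function that reads from the data store
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the sketch can be tested
        self._entries = {}       # key -> (value, time stored)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[1] < self._ttl:
            return entry[0]      # fresh enough: serve the local copy
        value = self._fetch(key) # missing or expired: go back to the store
        self._entries[key] = (value, self._clock())
        return value

    def invalidate(self, key):
        # Called when the data store reports that the cached copy is stale.
        self._entries.pop(key, None)
```

The time limit bounds how stale a served value can be without any help from the server; `invalidate` covers the case where the data store takes part and tells the client earlier.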

Replica Placement

DHPC : 6. Replica consistency in a Data Grid

A Data Grid is a wide area computing infrastructure that employs Grid technologies to provide storage capacity and processing power to applications that handle very large quantities of data. Data Grids rely on data replication to achieve better performance and reliability by storing copies of data sets on different Grid nodes. When a data set can be modified by applications, the problem of maintaining consistency among existing copies arises.

DHPC : 6. Replica consistency in a Data Grid

The consistency problem also concerns metadata, i.e., additional information about application data sets such as indices, directories, or catalogues. This kind of metadata is used both by the applications and by the Grid middleware to manage the data. For instance, the Replica Management Service (the Grid middleware component that controls data replication) uses catalogues to find the replicas of each data set. Such catalogues can also be replicated and their consistency is crucial to the correct operation of the Grid.

Therefore, metadata consistency generally poses stricter requirements than data consistency. In this paper we report on the development of a Replica Consistency Service based on the middleware mainly developed by the European Data Grid Project. The paper summarises the main issues in the replica consistency problem, and lays out a high-level architectural design for a Replica Consistency Service. Finally, results from simulations of different consistency models are presented.

7. Fault Tolerance

Failure: When a component is not living up to its specifications, a failure occurs

Error: That part of a component's state that can lead to a failure

Fault: The cause of an error

Fault prevention: prevent the occurrence of a fault

Fault tolerance: build a component in such a way that it can meet its specifications in the presence of faults

7.1 Failure Models

Different types of failures.

Type of failure              Description
Crash failure                A server halts, but is working correctly until it halts
Omission failure             A server fails to respond to incoming requests
  Receive omission           A server fails to receive incoming messages
  Send omission              A server fails to send messages
Timing failure               A server's response lies outside the specified time interval
Response failure             The server's response is incorrect
  Value failure              The value of the response is wrong
  State transition failure   The server deviates from the correct flow of control
Arbitrary failure            A server may produce arbitrary responses at arbitrary times
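A rough sketch of how a client might map what it observes for a single call onto some of these failure types. It assumes the client only sees the reply (or its absence) and the elapsed time, so it cannot tell a crash from an omission, and it only checks the upper bound of the time interval. All names are hypothetical.

```python
import enum


class Failure(enum.Enum):
    NONE = "no failure observed"
    OMISSION = "omission (or crash): no response arrived"
    TIMING = "timing failure: response outside the specified interval"
    RESPONSE = "response failure: the value is wrong"


def classify(reply, elapsed, deadline, is_valid):
    """Classify one call from the client's point of view.

    reply is None when nothing came back; is_valid is a caller-supplied check.
    """
    if reply is None:
        return Failure.OMISSION
    if elapsed > deadline:
        return Failure.TIMING
    if not is_valid(reply):
        return Failure.RESPONSE
    return Failure.NONE
```

Arbitrary failures are not covered: a server that produces plausible but wrong answers at plausible times passes all three checks unless `is_valid` can detect the error.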

DHPC : 7. Fault Tolerance in Grid

Refer to article

8. Security

In distributed systems, security is the combination of availability, integrity, and confidentiality. A dependable distributed system is thus fault tolerant and secure.

Property          Description
Availability      Accessible and usable upon demand for authorized entities
Reliability       Continuity of service delivery
Safety            Very low probability of catastrophes
Confidentiality   No unauthorized disclosure of information
Integrity         No accidental or malicious alterations of information have been performed (even by authorized entities)

8.1 Types of Threats

Threat         Channel                                       Object
Interruption   Preventing message transfer                   Denial of service
Inspection     Reading the content of transferred messages   Reading the data contained in an object
Modification   Changing message content                      Changing an object's encapsulated data
Fabrication    Inserting messages                            Spoofing an object

8.2 Security Mechanisms

Encryption: hiding message content; checking for message modification

Authentication: verifying subject identity

Authorization

Auditing: closing the barn door
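A short sketch of one way to check for message modification: a keyed hash (HMAC) from Python's standard library. This is an illustrative technique, not necessarily the mechanism the slide has in mind, and it assumes sender and receiver already share a secret key. It detects changes but does not hide the message content.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    # Tag that depends on both the shared key and the message content.
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(sign(key, message), tag)
```

The sender transmits the message together with `sign(key, message)`; the receiver calls `verify` and rejects the message if it returns False, which covers both modified and fabricated messages from anyone who does not hold the key.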

DHPC : 8. Security in Grid

Refer to article

Difference Between Grid Computing and Distributed Computing

Definition of Distributed Computing

Distributed computing is an environment in which a group of independent and geographically dispersed computer systems take part in solving a complex problem, each solving a part of it, with the results from all computers then combined. These systems are loosely coupled and work in coordination toward a common goal. It can be defined as:

A computing system in which services are provided by a pool of computers collaborating over a network.

A computing environment that may involve computers of differing architectures and data representation formats that share data and system resources.
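The idea of splitting a problem into parts and combining the partial results can be sketched as follows. Worker threads stand in for the separate computers, and the sum-of-squares task and function names are assumptions chosen only to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    # The part of the problem solved by one worker.
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(numbers, workers=4):
    # Split the input into one chunk per worker.
    chunks = [numbers[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker solves its part; the partial results are then combined.
        return sum(pool.map(partial_sum, chunks))
```

In a real distributed system the chunks would be sent over the network, so the cost of moving the data has to be small compared with the work done on each chunk.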


Definition of Grid Computing

The basic idea behind grid computing is to utilize the idle CPU cycles and storage of millions of computer systems across a worldwide network, functioning as a flexible, pervasive, and inexpensive accessible pool that could be harnessed by anyone who needs it, similar to the way power companies and their users share the electrical grid. There are many definitions of the term grid computing:

1. A service for sharing computer power and data storage capacity over the Internet.

2. An ambitious and exciting global effort to develop an environment in which individual users can access computers, databases and experimental facilities simply and transparently, without having to consider where those facilities are located. [RealityGrid, Engineering & Physical Sciences Research Council, UK 2001] http://www.realitygrid.org/information.html

3. Grid computing is a model for allowing companies to use a large number of computing resources on demand, no matter where they are located. www.informatica.com/solutions/resource_center/glossary/default.htm

Since 1980, two advances in technology have made distributed computing a more practical idea: computer CPU power and communication bandwidth. As a result, it is not only feasible but easy to put together large numbers of computer systems to meet complex computational or storage requirements. But the number of real distributable applications is still somewhat limited, and the challenges are still significant (standardization, interoperability, etc.).

As is clear from the definitions, traditional distributed computing can be characterized as a subset of grid computing. Some of the differences between the two are:


1. Distributed computing normally refers to managing or pooling hundreds or thousands of computer systems which individually are more limited in their memory and processing power. Grid computing, on the other hand, has some extra characteristics. It is concerned with the efficient utilization of a pool of heterogeneous systems with optimal workload management, utilizing an enterprise's entire computational resources (servers, networks, storage, and information) acting together to create one or more large pools of computing resources. There is no limitation on users, departments or organizations in grid computing.

2. Grid computing is focused on the ability to support computation across multiple administrative domains, which sets it apart from traditional distributed computing. Grids offer a way of using information technology resources optimally inside an organization, involving virtualization of computing resources. Its support for multiple administrative policies and for security authentication and authorization mechanisms enables it to be distributed over a local, metropolitan, or wide-area network.

Case Study : Distributed Computing Air Traffic Management System

The Air Traffic Management System is an example of a distributed problem-solving system.

It has elements of both cooperative and competitive problem-solving.

It includes complex organizations such as Flight Operations Centers, the FAA Air Traffic Control Systems Command Center (ATCSCC), and traffic management units at en route centers that focus on daily strategic planning, as well as individuals concerned more with immediate tactical decisions (such as air traffic controllers and pilots).

The design of this system has evolved over time to rely heavily on the distribution of tasks and control authority in order to keep cognitive complexity manageable for any one individual operator, and to provide redundancy (both human and technological) to serve as a safety net to catch the slips or mistakes that any one person or entity might make.

Within this distributed architecture, a number of different conceptual approaches have been applied to deal with cognitive complexity and to provide redundancy.

These approaches can be characterized in terms of the strategy for distributing: (1) control or responsibility, (2) knowledge or expertise, (3) access to data, (4) processing capacity, and (5) goals and priorities.

This paper will provide an abstract characterization of these alternative strategies for distributing work in terms of these 5 dimensions, and will illustrate and evaluate their effectiveness in terms of concrete realizations found within the National Airspace System.

Case Study : Distributed High Performance Computing

1. ATM-based Distributed High Performance Computing System

2. DISCWorld

An ATM-based Distributed High Performance Computing System

We describe the distributed high performance computing system we have developed to integrate a heterogeneous set of high performance computers, high capacity storage systems and fast communications hardware. Our system is based upon Asynchronous Transfer Mode (ATM) communications technology, and we routinely operate between the geographically distant sites of Adelaide and Canberra (separated by some 1100 km), using Telstra's ATM-based Experimental Broadband Network (EBN). We discuss some of the latency and performance issues that result from running day-to-day operations across such a long distance network.

DISCWorld: A Distributed High Performance Computing Environment

An increasing number of science and engineering applications require distributed and parallel computing resources to satisfy user response time requirements.

Distributed science and engineering applications require a high performance "middleware" which will both allow the embedding of legacy applications as well as enable new distributed programs, and which allows the best use of existing and specialised (parallel) computing resources.

We are developing a distributed information systems control environment which will meet the needs of a middleware for scientific applications.

We describe our DISCWorld system and some of its key attributes. A critical attribute is architecture scalability. We discuss DISCWorld in the context of some existing middleware systems such as CORBA and other distributed computing research systems such as Legion and Globus.

Our approach is to embed applications in the middleware as services, which can be chained together.

User interfaces are provided in the form of Java Applets downloadable across the World Wide Web.

These form a gateway for user-requests to be transmitted into a semi-opaque "cloud" of high-performance resources for distributed execution.
