DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING
Chapter 9 : Distributed Computing vs Distributed High Performance Computing
What is Distributed Computing
Distributed computing is a field of computer science that studies distributed systems.
A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.
A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs.
What is Distributed High Performance Computing
Distributed High-Performance Computing (DHPC) combines advances in research and technology in high-speed networks, software, distributed computing and parallel processing to deliver high-performance, large-scale and cost-effective computational, storage and communication capabilities to a wide range of applications.
Issues in Distributed Systems
1. Architectures
2. Processes
3. Communication
4. Naming
5. Synchronization
6. Consistency and Replication
7. Fault Tolerance
8. Security
1. Architecture Models
Three basic architectural models for distributed systems: workstations/servers model; processor pool (thin client) model; integrated model.
Example 1: Internet
Example 2 : Intranets
Example 5: Distributed Multimedia System
1. DHPC : Example of Architecture for Distributed High Performance Computing
2. Processes : Interprocess Communication
Distributed processes or tasks need to communicate
For distributed computing we usually do not have shared memory, so we need to use message passing
process A sends a message to process B
process B receives it
send/receive may be synchronous (A blocks until B receives the message) or asynchronous (some buffering mechanism allows A to proceed as soon as it has sent the data)
simple ideas: send/receive, plus some startup interrogation to find out process identities, form the basis for distributed and parallel computation mechanisms
pairing a send and a receive together into a single unit forms a transaction
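The send/receive ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: two threads stand in for processes A and B, a `queue.Queue` stands in for the buffering mechanism, and the names `receiver`, `send_async` and `send_sync` are hypothetical, not part of any particular message-passing library.

```python
import queue
import threading

def receiver(mailbox, received):
    """Process B: receive messages until a None payload arrives."""
    while True:
        payload, delivered = mailbox.get()   # blocking receive
        if payload is None:
            break
        received.append(payload)
        if delivered is not None:
            delivered.set()                  # tell a blocked sender we have it

def send_async(mailbox, payload):
    """Asynchronous send: the mailbox buffers the message, A proceeds at once."""
    mailbox.put((payload, None))

def send_sync(mailbox, payload):
    """Synchronous send: A blocks until B has received the message."""
    delivered = threading.Event()
    mailbox.put((payload, delivered))
    delivered.wait()

def demo():
    mailbox = queue.Queue()                  # buffer between A and B
    received = []
    b = threading.Thread(target=receiver, args=(mailbox, received))
    b.start()
    send_async(mailbox, "hello")             # returns immediately
    send_sync(mailbox, "world")              # returns only after B received it
    seen_after_sync = list(received)
    mailbox.put((None, None))                # ask B to stop
    b.join()
    return seen_after_sync
```

Because the mailbox is first-in, first-out, the synchronous send also guarantees that the earlier asynchronous message has been received by the time it returns.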
3. Communication : Remote Procedure Calls
Need a standard mechanism for invoking some processing on a remote machine
Remote Procedure Calls (RPC) enable this for procedural languages
C-based precursor to remote method invocation in object-oriented systems like Java and CORBA
RPCs look like normal procedure calls – relatively transparent API
When an RPC happens, input parameters are copied to the destination process
Body of the procedure is executed in the context of the remote process
Output parameters are copied back and the call returns
RPCs are implemented using a structured form of message passing
RPC transparency does break down however, e.g. timeouts on RPC calls are sometimes desirable
Also call by value (copy) semantics are necessary, so cannot transparently pass pointer types over RPC
The cost of a remote call can be orders of magnitude greater than that of a local call, so an RPC pays off only when the computation required for the call is much larger than the time to initiate the RPC
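A minimal sketch of the RPC mechanics described above: parameters are copied (marshalled) to the callee, the body runs in the callee's context, and the result is copied back. Everything here is an assumption made for illustration: JSON is used as the wire format, the "network" is a plain in-process function call, and `scale`, `PROCEDURES`, `server_dispatch` and `Stub` are hypothetical names.

```python
import json

def scale(values, factor):
    """A procedure that lives on the (hypothetical) remote server."""
    return [v * factor for v in values]

PROCEDURES = {"scale": scale}

def server_dispatch(request_bytes):
    """Unmarshal the request, run the procedure on the server, marshal the reply."""
    request = json.loads(request_bytes.decode("utf-8"))
    result = PROCEDURES[request["proc"]](*request["args"])
    return json.dumps({"result": result}).encode("utf-8")

class Stub:
    """Client-side stub: makes the remote call look like a local one."""

    def __init__(self, transport):
        self._transport = transport          # callable: bytes -> bytes

    def __getattr__(self, proc):
        def call(*args):
            # Input parameters are copied into a message (call by value).
            request = json.dumps({"proc": proc, "args": args}).encode("utf-8")
            reply = self._transport(request)
            # Output is copied back out of the reply message.
            return json.loads(reply.decode("utf-8"))["result"]
        return call

remote = Stub(server_dispatch)               # in-process stand-in for a network link
data = [1, 2, 3]
result = remote.scale(data, 10)
```

Since only a serialized copy of `data` crosses the boundary, the server cannot modify the caller's list, which is the copy semantics that rules out passing pointers transparently.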
3. Communication : Client/Server using RPC
RPC callee lifetime is almost always longer than the call
Callee is usually some kind of server
The callee never terminates (in practical terms)
For example:
loop
    accept_call(...);
    process_this_call(...);
    complete_call(...);
end loop;
Hence the RPCs have a sort of local data persistence
RPC calls of this sort are a form of generator
Interesting set of problems in controlling long lived data and resource allocation and access at the server end
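One way to picture the long-lived callee with persistent local data is a Python generator, used here purely as an in-process stand-in for a server. The running-total service and the name `counter_server` are hypothetical; the comments map each step to the accept/process/complete loop.

```python
def counter_server():
    """Long-lived callee whose state persists between calls."""
    total = 0                      # server-side data that outlives each call
    reply = None
    while True:                    # in practical terms, never terminates
        request = yield reply      # complete the previous call, accept the next
        total += request           # process this call
        reply = total

server = counter_server()
next(server)                       # start the server; run up to the first accept
replies = [server.send(n) for n in (5, 7, 1)]
```

Each call sees the effect of the earlier ones, which is why managing long-lived data and resources at the server end becomes a design problem of its own.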
Names are used to share resources, to uniquely identify entities, to refer to locations in computer systems.
An important issue with naming is that a name can be resolved to the entity it refers to. Name resolution allows a process to access the named entity.
To resolve names, it is necessary to implement a naming system. The difference between naming in DSs and non-DSs lies in the way naming systems are implemented. In a DS, the implementation of a naming system is itself often distributed across multiple machines.
Two major issues in designing naming systems in DS: efficiency and scalability.
4. Naming : Name Space Distribution
An example partitioning of the DNS name space, including Internet-accessible files, into three layers.
A naming service is implemented by name servers. In large DSs with many entities it is necessary to distribute the implementation of a name space over multiple name servers.
To efficiently implement a name space for a large-scale, possibly worldwide, DS, it is usually organized hierarchically and may be partitioned into logical layers:
* global layer: formed by the highest-level nodes, e.g., root and other directory nodes logically close to the root. The directory tables in these nodes are rarely changed.
* administrational layer: formed by the directory nodes managed within a single organization. The nodes in this layer are relatively stable, although less stable than those in the global layer.
* managerial layer: formed by the nodes that may change regularly, e.g., nodes representing hosts in the LAN. The nodes in this layer are also maintained by end users of a DS.
The distribution of a name space across multiple name servers affects the implementation of name resolution.
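Iterative name resolution: The client's name resolver contacts the name servers one after another, starting at the root. Each server resolves as much of the name as it can and returns the address of the next server to contact.
Recursive name resolution: The client's name resolver contacts only the root name server. Each name server passes the rest of the name on to the next server itself, and the result is returned back along the same chain.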
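The difference between the two styles is who contacts each name server. The sketch below makes that visible by logging (caller, server) pairs. The three-level name space in `SERVERS`, the server names and the final `"host-address"` value are all hypothetical.

```python
# Hypothetical name space: each server maps one label either to the next
# name server or, at the last level, to the address of the named entity.
SERVERS = {
    "root":  {"nl": "ns-nl"},
    "ns-nl": {"vu": "ns-vu"},
    "ns-vu": {"cs": "host-address"},
}

def resolve_iterative(labels, log):
    """The client contacts every name server itself, one after another."""
    current = "root"
    for label in labels:
        log.append(("client", current))
        current = SERVERS[current][label]    # server returns a referral or the answer
    return current

def resolve_recursive(labels, log, server="root", caller="client"):
    """Each name server forwards the rest of the name to the next server."""
    log.append((caller, server))
    nxt = SERVERS[server][labels[0]]
    if len(labels) == 1:
        return nxt                           # answer travels back up the chain
    return resolve_recursive(labels[1:], log, server=nxt, caller=server)

iter_log, rec_log = [], []
iter_result = resolve_iterative(["nl", "vu", "cs"], iter_log)
rec_result = resolve_recursive(["nl", "vu", "cs"], rec_log)
```

Both calls return the same address, but in the iterative log every request originates from the client, while in the recursive log each server is contacted by the server above it.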
DHPC : 4. Naming System
Refer to article
5. Synchronization
We need to measure time accurately:
to know the time an event occurred at a computer
to do this we need to synchronize its clock with an authoritative external clock
Algorithms for clock synchronization are useful for concurrency control based on timestamp ordering
There is no global clock in a distributed system
Logical time is an alternative
It gives ordering of events - also useful for consistency of replicated data
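As one concrete form of logical time, here is a minimal sketch of Lamport's logical clocks. The slides do not name a specific scheme, so the choice of Lamport clocks is an assumption, and the class and method names are hypothetical.

```python
class LamportClock:
    """Logical clock: counts events instead of measuring physical time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event advances the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the message carries the new timestamp."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, jump past both the local time and the message timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()                         # a: 1
a.tick()                         # a: 2
stamp = a.send()                 # a: 3, message stamped 3
b.tick()                         # b: 1
recv_time = b.receive(stamp)     # b: max(1, 3) + 1 = 4
```

The receive event always gets a larger timestamp than the matching send, so the timestamps respect the order in which causally related events happened, without any shared global clock.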
6. Consistency and Replication
• Two primary reasons for replicating data in DS: reliability and performance.
• Reliability: The system can continue working after one replica crashes by simply switching to one of the other replicas. It also becomes possible to provide better protection against corrupted data.
• Performance: When the number of processes that need to access data managed by a server increases, performance can be improved by replicating the server and dividing the work. Also, a copy of the data can be placed close to the process using it, to reduce data access time.
• Consistency issue: keeping all replicas up-to-date.
6.1 Distribution Protocols: Replica Placement
Several ways of distributing (propagating) updates to replicas, independent of the supported consistency model, have been proposed.
Replica Placement: deciding where, when, and by whom copies of the data store are to be placed.
Three different types of copies, permanent replicas, server-initiated replicas, and client-initiated replicas, can be distinguished and logically organized as shown on the next slide.
Permanent replicas: the initial set of replicas constituting a distributed data store.
6.1 Server-Initiated Replicas
Server-initiated replicas: copies of a data store for enhancing performance. They are created at the initiative of the (owner of the) data store.
For example, it may be worthwhile to install a number of such replicas of a Web server in regions where many requests are coming from.
One of the major problems with such replicas is to decide exactly where and when the replicas should be created or deleted.
Server-initiated replication is gradually increasing in popularity, especially in the context of Web hosting services. Such hosting services can dynamically replicate files to servers close to demanding clients.
6.1 Client-Initiated Replicas
Client-initiated replicas: copies created at the initiative of clients, known as caches.
In principle, managing the cache is left entirely to the client, but there are many occasions in which the client can rely on participation from the data store to inform it when the cached data has become stale.
Placement of client caches is relatively simple: a cache is normally placed in the same machine as its client, or on a machine shared by clients in the same LAN.
Data are generally kept in a cache for a limited amount of time to prevent extremely stale data from being used, or simply to make room for other data.
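A client cache with a limited lifetime per entry, plus a hook the data store could use to report stale data, might look like the sketch below. The `TTLCache` class, its parameters and the injected `clock` are hypothetical; the clock is passed in so that the example does not depend on real time.

```python
class TTLCache:
    """Client-side cache: entries expire after ttl time units."""

    def __init__(self, fetch, ttl, clock):
        self._fetch = fetch          # callable that reads from the data store
        self._ttl = ttl
        self._clock = clock          # callable returning the current time
        self._entries = {}           # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]          # fresh enough: serve the cached copy
        value = self._fetch(key)     # missing or expired: go to the data store
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Called when the data store reports that the cached copy is stale."""
        self._entries.pop(key, None)

store = {"page": "v1"}
fetches = []

def fetch(key):
    fetches.append(key)
    return store[key]

now = [0.0]
cache = TTLCache(fetch, ttl=10.0, clock=lambda: now[0])
first = cache.get("page")            # miss: fetched from the store
store["page"] = "v2"
second = cache.get("page")           # hit: stale copy, still within its lifetime
now[0] = 11.0
third = cache.get("page")            # expired: fetched again
store["page"] = "v3"
cache.invalidate("page")             # the store tells the client the copy is stale
fourth = cache.get("page")
```

The second read shows the consistency cost of caching: the client sees the old value until the entry expires or the store invalidates it.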
Replica Placement
DHPC : 6. Replica consistency in a Data Grid
A Data Grid is a wide area computing infrastructure that employs
Grid technologies to provide storage capacity and processing
power to applications that handle very large quantities of data.
Data Grids rely on data replication to achieve better performance
and reliability by storing copies of data sets on different Grid
nodes. When a data set can be modified by applications, the
problem of maintaining consistency among existing copies
arises.
The consistency problem also concerns metadata, i.e., additional information
about application data sets such as indices, directories, or catalogues. This kind
of metadata is used both by the applications and by the Grid middleware to
manage the data. For instance, the Replica Management Service (the Grid
middleware component that controls data replication) uses catalogues to find the
replicas of each data set. Such catalogues can also be replicated and their
consistency is crucial to the correct operation of the Grid.
Therefore, metadata consistency generally poses stricter requirements than data
consistency. In this paper we report on the development of a Replica
Consistency Service based on the middleware mainly developed by the
European Data Grid Project. The paper summarises the main issues in the
replica consistency problem, and lays out a high-level architectural design for a
Replica Consistency Service. Finally, results from simulations of different
consistency models are presented.
7: Fault Tolerance
Failure: When a component is not living up to its specifications, a failure occurs
Error: That part of a component's state that can lead to a failure
Fault: The cause of an error
Fault prevention: prevent the occurrence of a fault
Fault tolerance: build a component in such a way that it can meet its specifications in the presence of faults
7.1 Failure Models
Different types of failures.
Type of failure              Description
Crash failure                A server halts, but is working correctly until it halts
Omission failure             A server fails to respond to incoming requests
  Receive omission           A server fails to receive incoming messages
  Send omission              A server fails to send messages
Timing failure               A server's response lies outside the specified time interval
Response failure             The server's response is incorrect
  Value failure              The value of the response is wrong
  State transition failure   The server deviates from the correct flow of control
Arbitrary failure            A server may produce arbitrary responses at arbitrary times
DHPC : 7. Fault Tolerance in Grid
Refer to article
8. Security
In distributed systems, security is the combination of availability, integrity, and confidentiality. A dependable distributed system is thus fault tolerant and secure.
Property          Description
Availability      Accessible and usable upon demand for authorized entities
Reliability       Continuity of service delivery
Safety            Very low probability of catastrophes
Confidentiality   No unauthorized disclosure of information
Integrity         No accidental or malicious alterations of information have been performed (even by authorized entities)
8.1 Types of Threats
Threat         Channel                                       Object
Interruption   Preventing message transfer                   Denial of service
Inspection     Reading the content of transferred messages   Reading the data contained in an object
Modification   Changing message content                      Changing an object's encapsulated data
Fabrication    Inserting messages                            Spoofing an object
8.2 Security Mechanisms
Encryption Hiding message content Check for message modification
Authentication Verifying subject identity
Authorization Auditing
Closing the barn door
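As an illustration of checking for message modification, the sketch below attaches a keyed hash (HMAC) to a message using Python's standard `hmac` and `hashlib` modules. The choice of HMAC-SHA256 is an assumption, since the slides do not name a mechanism, and the key and messages are made up for the example.

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag that travels with the message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag)

key = b"shared-secret-for-illustration-only"     # hypothetical shared key
message = b"transfer 10 units to account 42"
tag = sign(key, message)

ok = verify(key, message, tag)                                   # unmodified
tampered = verify(key, b"transfer 99 units to account 42", tag)  # modified in transit
```

A tag like this detects modification and fabrication on the channel, but it does not hide the message content; that would need encryption as well.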
DHPC : 8. Security in Grid
Refer to article
Difference Between Grid Computing Vs. Distributed Computing
Definition of Distributed Computing
Distributed computing is an environment in which a group of independent and geographically dispersed computer systems take part in solving a complex problem, each solving a part of it, with the results from all computers then combined. These are loosely coupled systems working in coordination toward a common goal. It can be defined as:
A computing system in which services are provided by a pool of computers collaborating over a network.
A computing environment that may involve computers of differing architectures and data representation formats that share data and system resources.
Definition of Grid Computing
The basic idea behind grid computing is to use the idle CPU cycles and storage of millions of computer systems across a worldwide network as a flexible, pervasive, and inexpensive pool that can be harnessed by anyone who needs it, similar to the way power companies and their users share the electrical grid. There are many definitions of the term grid computing:
1. A service for sharing computer power and data storage capacity over the Internet.
2. An ambitious and exciting global effort to develop an environment in which individual users can access computers, databases and experimental facilities simply and transparently, without having to consider where those facilities are located. [RealityGrid, Engineering & Physical Sciences Research Council, UK 2001] http://www.realitygrid.org/information.html
3. Grid computing is a model for allowing companies to use a large number of computing resources on demand, no matter where they are located. www.informatica.com/solutions/resource_center/glossary/default.htm
Since 1980, two advances in technology have made distributed computing a more practical idea: computer CPU power and communication bandwidth.
As a result, it is now not only feasible but easy to put together large numbers of computer systems to meet complex computational or storage requirements. But the number of real distributable applications is still somewhat limited, and the challenges are still significant (standardization, interoperability, etc.).
As is clear from the definitions, traditional distributed computing can be characterized as a subset of grid computing. Some of the differences between the two are:
1. Distributed computing normally refers to managing or pooling hundreds or thousands of computer systems, each of which is individually limited in memory and processing power. Grid computing, on the other hand, has some extra characteristics. It is concerned with the efficient utilization of a pool of heterogeneous systems with optimal workload management, using an enterprise's entire computational resources (servers, networks, storage, and information) acting together to create one or more large pools of computing resources. There is no limitation on users, departments, or organizations in grid computing.
2. Grid computing is focused on the ability to support computation across multiple administrative domains, which sets it apart from traditional distributed computing. Grids offer a way of using information technology resources optimally inside an organization through virtualization of computing resources. Their support for multiple administrative policies and for security authentication and authorization mechanisms enables a grid to be distributed over a local, metropolitan, or wide-area network.
Case Study : Distributed Computing Air Traffic Management System
The Air Traffic Management System is an example of a distributed problem-solving system.
It has elements of both cooperative and competitive problem-solving.
It includes complex organizations such as Flight Operations Centers, the FAA
Air Traffic Control Systems Command Center (ATCSCC), and traffic management
units at en route centers that focus on daily strategic planning, as well as individuals
concerned more with immediate tactical decisions (such as air traffic controllers and
pilots).
The design of this system has evolved over time to rely heavily on the distribution of
tasks and control authority in order to keep cognitive complexity manageable for any
one individual operator, and to provide redundancy (both human and technological) to
serve as a safety net to catch the slips or mistakes that any one person or entity might
make.
Within this distributed architecture, a number of different conceptual approaches
have been applied to deal with cognitive complexity and to provide redundancy.
These approaches can be characterized in terms of the strategy for distributing: (1) control or responsibility, (2) knowledge or expertise, (3) access to data, (4) processing capacity, and (5) goals and priorities.
This paper will provide an abstract characterization of these alternative strategies for distributing work in terms of these 5 dimensions, and will illustrate and evaluate their effectiveness in terms of concrete realizations found within the National Airspace System.
Case Study : Distributed High Performance Computing
1. ATM-based Distributed High Performance Computing System
2. DISCWorld
An ATM-based Distributed High Performance Computing System
We describe the distributed high performance computing system we have developed to integrate a heterogeneous set of high performance computers, high capacity storage systems and fast communications hardware. Our system is based upon Asynchronous Transfer Mode (ATM) communications technology, and we routinely operate between the geographically distant sites of Adelaide and Canberra (separated by some 1100 km), using Telstra's ATM-based Experimental Broadband Network (EBN). We discuss some of the latency and performance issues that result from running day-to-day operations across such a long distance network.
DISCWorld: A Distributed High Performance Computing Environment
An increasing number of science and engineering applications require distributed and parallel computing resources to satisfy user response time requirements.
Distributed science and engineering applications require a high performance "middleware" which will both allow the embedding of legacy applications as well as enable new distributed programs,
and which allows the best use of existing and specialised (parallel) computing resources.
We are developing a distributed information systems control environment which will meet the needs of a middleware for scientific applications.
We describe our DISCWorld system and some of its key attributes. A critical attribute is architecture scalability. We discuss DISCWorld in the context of some existing middleware systems such as
CORBA and other distributed computing research systems such as Legion and Globus.
Our approach is to embed applications in the middleware as services, which can be chained together.
User interfaces are provided in the form of Java Applets downloadable across the World Wide Web.
These form a gateway for user-requests to be transmitted into a semi-opaque "cloud" of high-performance resources for distributed execution.