
Distributed systems and Distributed databases design

Enterprise systems DT211 4

• A distributed system is a collection of computers that communicate by means of some networked media. There are a number of key issues which must be considered when discussing distributed system architectures:

1. The principle of locality. This means that parts of a system which are associated with each other should be in close proximity: ideally, be on the same computer or, less ideally, the same local area network.

2. The principle of sharing. This means that ideally resources (memory, file space, processor power) should be carefully shared in order to minimise the load on some of the elements of a distributed system.

3. The parallelism principle. This means that maximum use should be made of the multiple elements of a distributed system, which work in parallel to complete tasks.

• Principle of locality: the entities that should be close together are:

– (1) Keeping data together:

• Probably the best-known example of the locality principle is that related data should be grouped together. One example is where two tables that are often accessed together are moved onto the same server.

– (2) Keeping programs together:

• The idea behind this is that if two programs communicate with each other in a distributed system then, ideally, they should be located on the same computer or, if that is not possible, on the same local area network. The worst case is where programs communicate by passing data over a slow communication medium, such as a wide area network.

• (3) Bringing users and data close together. There are two popular ways of implementing this principle. The first is the use of replicated data and the second is caching.

– Replicated data:

• Data is duplicated at various locations in a distributed system, so that a copy can be moved from across a WAN onto the user's LAN or onto the same server.

• However, replicating data requires synchronisation of the copies, which can be counterproductive if many updates (changes to the data) take place.

– Caching is the storing of a copy of frequently used data in fast memory. It is an excellent way of speeding up a system for data which is not subject to much change. When the data does change, the same methods as for replication have to be used to synchronise all the cached copies with the original data.

• (4) Keeping programs and data together:

– The same principles apply as for users and data.

• The principle of sharing

– This principle is concerned with the sharing of resources: memory and processing.

• (1) Sharing amongst servers: A major decision to be made about the design of a distributed system is how the work performed by the system is to be partitioned among its servers. The main rationale for sharing work amongst servers is to avoid bottlenecks where servers are overloaded with work which could be reallocated to other servers.

• (2) Sharing memory: A distributed system will have, as a given, the fact that data should be shared between users. The two main decisions are where to situate the tables that make up a relational application and what locking strategy to adopt.

– Where to store tables has already been covered.

– Locking ensures the integrity of data when transactions run concurrently (at the same time but not in parallel).

• The parallelism principle

– A key idea behind the parallelism principle is that of load balancing:

• Should a program be split up into different parts which execute in parallel, either on a single server or on a number of distributed servers?

• Partition a database into tables and files so that transactions are evenly spread around the file storage devices found on a distributed system.

• Input/output parallelism, e.g. the use of a redundant array of independent disks (RAID), which allows writes and reads in parallel.
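One possible way to spread records evenly over storage devices is hash partitioning. The sketch below is only an illustration: the function name `assign_device`, the key format and the choice of four devices are assumptions, not part of the course material.

```python
import zlib
from collections import Counter


def assign_device(key: str, n_devices: int) -> int:
    """Map a record key to a device number in the range 0..n_devices-1.

    zlib.crc32 is used because it is deterministic across runs,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % n_devices


# Hypothetical record keys P0 .. P999 spread over 4 hypothetical devices.
keys = [f"P{i}" for i in range(1000)]
load = Counter(assign_device(k, 4) for k in keys)

# Show how many records landed on each device.
for device in sorted(load):
    print(device, load[device])
```

Because the device is computed from the key alone, any server can work out where a record lives without consulting a central directory.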

Parallel Data Management

• The argument goes:

– If your main problem is that your queries run too slowly, use more than one machine at a time to make them run faster (parallel processing).

• SMP (symmetric multiprocessing): all the processors share the same memory, and the O.S. runs and schedules tasks on more than one processor without distinction.

– In other words, all processors are treated equally in an effort to get the list of jobs done.

– However, SMP can suffer from bottleneck problems when all the CPUs attempt to access the same memory at once.

• MPP (massively parallel processing): more varied in its design, but essentially consists of multiple processors, each running its own program in its own memory, i.e. memory is not shared between processors.

– The problem with MPP is to harness all these processors to solve a single problem.

– But MPP systems do not suffer from the shared-memory bottleneck.

Distributed database design

Distributed Database: a logically interrelated collection of shared data (and a description of this data), physically distributed over a computer network.

Distributed DBMS: the software system that permits the management of the distributed database and makes the distribution transparent to users.

Concepts of Distributed Databases

• Collection of logically-related shared data.
• Data split into fragments.
• Fragments may be replicated.
• Fragments/replicas allocated to sites.
• Sites linked by a communications network.
• Each DBMS participates in at least one global application.

Advantages of DDBMSs

• Reflects organizational structure
• Improved shareability and local autonomy
• Improved availability
• Improved reliability
• Improved performance

Disadvantages of DDBMSs

• Complexity
• Cost
• Security of network
• Integrity control (concurrency and recovery) more difficult
• Database design more complex

Types of DDBMS

• Homogeneous DDBMS
• Heterogeneous DDBMS

– Sites may run different DBMS products, with possibly different underlying data models.

– Occurs when sites have implemented their own databases and integration is considered later: ad hoc planning. Enterprise resource planning (ERP) is the new approach that attempts to overcome this problem.

Distributed Database Design criteria

• Three key issues:

Fragmentation: a relation may be divided into a number of sub-relations, which are then distributed.

Allocation: each fragment is stored at the site with "optimal" distribution (see principles of distribution design).

Replication: a copy of a fragment may be maintained at several sites.

Fragmentation

• Quantitative information (used for replication) may include:
– frequency with which an application is run;
– site from which an application is run;
– performance criteria for transactions and applications.

• Qualitative information (used for fragmentation) may include the transactions that are executed by an application and the relations, attributes and tuples they access.

Comparison of Strategies for Data Distribution

Correctness of Fragmentation

• Three correctness rules:

Completeness
• If relation R is decomposed into fragments R1, R2, ..., Rn, each data item that can be found in R must appear in at least one fragment.

Reconstruction
• It must be possible to define a relational operation that will reconstruct R from the fragments.
• Reconstruction is the Union operation for horizontal fragmentation and the Join operation for vertical fragmentation.

Disjointness
• If data item di appears in fragment Ri, then it should not appear in any other fragment. Exception: vertical fragmentation, where primary key attributes must be repeated to allow reconstruction.
• For horizontal fragmentation, a data item is a tuple (row).
• For vertical fragmentation, a data item is an attribute.
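The three rules can be checked mechanically for a horizontal fragmentation. The following is a minimal sketch that models a relation as a Python set of rows; the sample rows and the function names are hypothetical and chosen only for illustration.

```python
# Rows are (propNo, type) tuples. R and its fragments are made-up sample data.
R = {("P1", "House"), ("P2", "Flat"), ("P3", "House")}
R1 = {row for row in R if row[1] == "House"}
R2 = {row for row in R if row[1] == "Flat"}


def is_complete(relation, fragments):
    """Completeness: every row of the relation is in at least one fragment."""
    return relation <= set().union(*fragments)


def is_disjoint(fragments):
    """Disjointness: no row appears in more than one fragment."""
    total_rows = sum(len(f) for f in fragments)
    return total_rows == len(set().union(*fragments))


def reconstructs(relation, fragments):
    """Reconstruction: the union of the fragments is exactly the relation."""
    return set().union(*fragments) == relation


print(is_complete(R, [R1, R2]), is_disjoint([R1, R2]), reconstructs(R, [R1, R2]))
```

For vertical fragments the same idea applies with attributes as the data items, except that the primary key is allowed to appear in every fragment.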

Horizontal Fragmentation

• Consists of a subset of the tuples of a relation.
• Defined using the Selection operation of relational algebra:

σp(R)

• For example:

P1 = σtype='House'(PropertyForRent)
P2 = σtype='Flat'(PropertyForRent)

Result (PNo., St, City, postcode, type, room, rent, ownerno., staffno., branchno.)

• This strategy is determined by looking at predicates used by transactions.

• Reconstruction involves using a union, e.g. R = r1 ∪ r2
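Horizontal fragmentation by selection, followed by reconstruction by union, might be sketched as below. The rows in `property_for_rent` are invented sample data, and `select` is a hypothetical helper standing in for the relational Selection operation.

```python
# Invented sample rows for PropertyForRent (only a few attributes shown).
property_for_rent = [
    {"propNo": "PA14", "type": "House", "city": "Aberdeen"},
    {"propNo": "PL94", "type": "Flat", "city": "London"},
    {"propNo": "PG21", "type": "House", "city": "Glasgow"},
    {"propNo": "PG4", "type": "Flat", "city": "Glasgow"},
]


def select(relation, predicate):
    """Selection: keep only the rows for which the predicate holds."""
    return [row for row in relation if predicate(row)]


# P1 and P2 are the two horizontal fragments, defined by the 'type' predicate.
p1 = select(property_for_rent, lambda row: row["type"] == "House")
p2 = select(property_for_rent, lambda row: row["type"] == "Flat")

# The fragments are disjoint, so concatenating them acts as a union.
reconstructed = p1 + p2
print(len(p1), len(p2), len(reconstructed))
```

The union restores every row, although not necessarily in the original order, which is acceptable because a relation is an unordered set of tuples.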

Vertical Fragmentation

• Consists of a subset of attributes of a relation.
• Defined using the Projection operation of relational algebra:

Πa1, ..., an(R)

• For example:

S1 = ΠstaffNo, position, sex, DOB, salary(Staff)
S2 = ΠstaffNo, fName, lName, branchNo(Staff)

• Determined by establishing the affinity of one attribute to another.

• For vertical fragments, reconstruction involves the Join operation; the fragments are disjoint except for the primary key.
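Vertical fragmentation and its reconstruction by a join on the primary key could look like the sketch below. The `staff` rows are invented, only some of the attributes are included, and `project` and `join_on_key` are hypothetical helpers.

```python
# Invented sample rows for Staff; staffNo is the primary key.
staff = [
    {"staffNo": "SL21", "position": "Manager", "salary": 30000,
     "fName": "John", "lName": "White", "branchNo": "B005"},
    {"staffNo": "SG37", "position": "Assistant", "salary": 12000,
     "fName": "Ann", "lName": "Beech", "branchNo": "B003"},
]


def project(relation, attributes):
    """Projection: keep only the named attributes of every row."""
    return [{a: row[a] for a in attributes} for row in relation]


def join_on_key(left, right, key):
    """Join two fragments on a shared key attribute."""
    right_by_key = {row[key]: row for row in right}
    return [{**l, **right_by_key[l[key]]} for l in left if l[key] in right_by_key]


# Both fragments repeat the primary key so that rows can be matched up again.
s1 = project(staff, ["staffNo", "position", "salary"])
s2 = project(staff, ["staffNo", "fName", "lName", "branchNo"])

rebuilt = join_on_key(s1, s2, "staffNo")
print(rebuilt == staff)
```

Dropping `staffNo` from either fragment would make the join impossible, which is why the disjointness rule makes an exception for primary key attributes.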

Mixed Fragmentation

• Consists of a horizontal fragment that is vertically fragmented, or a vertical fragment that is horizontally fragmented.

• Defined using the Selection and Projection operations of relational algebra:

σp(Πa1, ..., an(R)) or Πa1, ..., an(σp(R))

Essential Transaction criteria in a DDBMS

• The DDBMS must have the following abilities:
– Allow transactions to run concurrently
– Allow for transaction failure and subsequent recovery
– Improve transaction performance (query optimisation)

Transaction (local/global) Concurrency

• All transactions must execute independently and produce results that are logically consistent with those obtained if the transactions were executed one at a time, in some arbitrary serial order.

• Same fundamental principles as for centralized DBMS.

• Replication makes concurrency more complex.
– If a copy of a replicated data item is updated, the update must be propagated to all copies.
– However, if one site holding a copy is not reachable, then the transaction is delayed until the site is reachable.

Failure and recovery of (local/global) transactions

• DDBMS must ensure atomicity and durability of global transaction.

• Means ensuring that sub-transactions of global transaction either all commit or all abort.

• Thus, DDBMS must synchronize global transaction to ensure that all sub-transactions have completed successfully before recording a final COMMIT for global transaction.

• Must do this in the presence of site and network failures.
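The "all commit or all abort" requirement is commonly met with a voting protocol in the style of two-phase commit. The slides do not name a protocol, so the sketch below is an assumption about one way to do it: the `Site` class, its `can_commit` and `reachable` flags, and `run_global_transaction` are all hypothetical, and recovery after a coordinator crash is left out.

```python
class Site:
    """A hypothetical site running one sub-transaction of a global transaction."""

    def __init__(self, name, can_commit=True, reachable=True):
        self.name = name
        self.can_commit = can_commit
        self.reachable = reachable
        self.state = "active"

    def prepare(self):
        """Phase 1: vote on whether the local sub-transaction can commit."""
        if not self.reachable:
            raise ConnectionError(f"site {self.name} is not reachable")
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_global_transaction(sites):
    """Record a global COMMIT only if every site votes yes; otherwise ABORT."""
    votes = []
    for site in sites:
        try:
            votes.append(site.prepare())
        except ConnectionError:
            # An unreachable site is treated as a "no" vote.
            votes.append(False)
    if all(votes):
        for site in sites:
            site.commit()
        return "COMMIT"
    for site in sites:
        # In a real system an unreachable site would abort during its own recovery.
        site.abort()
    return "ABORT"


print(run_global_transaction([Site("London"), Site("Glasgow")]))
print(run_global_transaction([Site("London"), Site("Glasgow", reachable=False)]))
```

The key design point is that no site commits until every vote is in, so a single failed or unreachable site forces every sub-transaction to abort.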

Transaction (local/global) Performance or distributed query processing

• Must consider:
– fragmentation,
– replication,
– allocation schemas.

• DQP has to decide, e.g.:
– which fragment to access;
– which copy of a fragment to use;
– which location to use.

Performance Transparency

• DQP produces an execution strategy optimized with respect to some cost function.

• Typically, costs associated with a distributed request include:

– I/O cost;
– Communication cost: WAN….

DQP - Example

Property(propNo, city) 10000 records in London
Client(clientNo, maxPrice) 100000 records in Glasgow
Viewing(propNo, clientNo) 1000000 records in London

SELECT p.propNo
FROM Property p INNER JOIN (Client c INNER JOIN Viewing v ON c.clientNo = v.clientNo)
ON p.propNo = v.propNo
WHERE p.city = 'Aberdeen' AND c.maxPrice > 200000;

• This query selects the properties in Aberdeen that have been viewed by clients whose maximum price is greater than £200,000.

Performance Transparency - Example

Assume:
• Each tuple in each relation is 100 characters long.
• 10 renters with maximum price greater than £200,000.
• 100,000 viewings for properties in Aberdeen.
• In addition, the data transmission rate is 10,000 characters per second and there is a 1 second access delay to send a message.
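Under these assumptions, the communication time for a transfer can be estimated as the access delay per message plus the data volume divided by the transmission rate. The sketch below applies that estimate to two illustrative transfers chosen here for comparison (they are not taken from the slides): shipping the whole Client relation, and shipping only the 10 qualifying client tuples.

```python
TUPLE_SIZE = 100        # characters per tuple
RATE = 10_000           # characters per second
ACCESS_DELAY = 1.0      # seconds per message


def transfer_time(n_tuples, n_messages=1):
    """Estimated seconds to send n_tuples using n_messages messages."""
    return n_messages * ACCESS_DELAY + n_tuples * TUPLE_SIZE / RATE


# Illustrative transfer 1: send all 100,000 Client tuples in one message.
whole_client = transfer_time(100_000)

# Illustrative transfer 2: send only the 10 clients with maxPrice > 200,000.
qualifying_clients = transfer_time(10)

print(f"whole Client relation: {whole_client:.1f} s")
print(f"qualifying clients only: {qualifying_clients:.1f} s")
```

Applying the selection at the remote site before shipping any data cuts the estimate from 1001 seconds (about 16.7 minutes) to about 1.1 seconds, which is the kind of difference the distributed query processor tries to exploit.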

Performance Transparency - Example

• Derive the following :

Question

• Three important characteristics of distributed databases are: fragmentation, replication and allocation.
– Explain what is meant by each term. (8 marks)
– Explain when one would decide to implement a centralised system or a fragmented distributed system. (10 marks)
– Discuss the relationship between the principles of database design and the above characteristics of distributed databases. (12 marks)