Distributed Databases. Outline Distributed DBMS (DBMS) Features Detailed insight of its Objectives.

Distributed Databases

Outline

Distributed DBMS (DBMS) Features Detailed insight of its Objectives

Distributed DBMS (1)

A distributed DBMS coordinates the data access at various nodes for a distributed database.

Although each site will have a DBMS to manage local database at that site, a distributed DBMS will perform the following:

Keep track of data location in a data dictionary.

Determine location from which requested data can be retrieved.

Determine where each part of a distributed query can be processed

Distributed DBMS (2) If required translate request at one node (using a local

DBMS) into proper request to another node (using a different DBMS+data model) and return data to requesting node in its acceptable format.

Provide data management functions like security, dead lock control, concurrency, recovery, etc.

Provide consistency among data copies.

Use of global primary keys.

Scalability, replication and transparency.

Permit different DBMS on different nodes.

A Popular DDBMS Architecture

Distributed/ data repository

Database Site n

Database site2

Distributed/ data repository

Distributed DBMS

Application Programs

Local Database

Distributed DBMS

Application Programs

Local Database

Comm. controller

Comm. controller

User 1

User 2

User n

User 2

User 1

User n

DDBMS Architecture Description (1) The architecture shown on previous slide shows a

system in which each site has a local DBMS, a copy of DDBMS and associated DD/D or data repository

The DD/D has data definitions and info about location of distributed data.

User requests for data are first processed by DDBMS which determines whether a local* or global* transaction is involved.

DDBMS Architecture Description(2)

For local transactions, DDBMS passes the local requests to local DBMS.

For global transactions, DDBMS routes the request to other sites.

In this architecture, it is assumed that copies of distributed DBMS and DD/D exist at each site.

In another setup, these may be at a central site, but this setup will be vulnerable to failures.

Other strategies are also possible

Key Objectives of DDBMS

Location Transparency Replication Transparency Failure Transparency

Commit protocol Concurrency Transparency Query Optimization

Location Transparency

It implies that User of the data need not to know where the data are physically located.

User being unaware of data distribution assumes one single database (physically).

Example: User at region A wants to see data of customers whose

sales have exceeded $10,000 Select * from customer where total_sales>10000 Now distributed DBMS at local site (region A) consults the

distributed DD/D and determines that this request has to be routed to Region C.

On display of results to user, it appears as if data were locally retrieved (unless there is long comm. delay)

Data Distribution Example

PriceList PriceList

Account ing

RegionB parts

Engg Parts

RegionA parts

PriceList

Customers

RegionC parts

RegionA RegionB RegionC

Replication Transparency(1)

Although a given data item may be replicated at several nodes in the network, with replication transparency, a user may treat it as if it were a single item at a single node.

Also called fragmentation transparency. Example:

User wants to view the Price List file. An identical copy of this file exists at all three nodes (three

regions) – case of full replication. Now distributed DBMS will consult DD/D and determine

that this is a local transaction. User need not be aware that same data is stored at other

sites.

Replication Transparency(2)

In case requested data are not at the local site, the distributed DBMS decides the optimum route to the target site.

Consider another situation when one or more users try to update replicated data. user at region C wants to change price of a part. This update must be reflected at all replicated

sites for data integrity. With replication transparency this is easily

handled.

Failure Transparency (1) Each site in a distributed environment is prone to

same type of failures as in centralized systems (like disk failure, erroneous data,etc.)

An additional risk is communication link failure A robust system must be able to

Detect failure Reconfigure system to continue computation Recover when processor or link is repaired

The 1st & 2nd tasks are jobs of comm. controller or processor and 3rd one is job of DDBMS

DDBMS has a software module – Transaction Manager

Failure Transparency (2) Transaction Manager:

Maintain a log of transactions and “before & after images” of database Maintain concurrency control scheme to ensure data integrity during

parallel execution of transactions at that site For global transactions, transaction managers at each

participating site cooperate to ensure all update operations are synchronized

Example A person at site1 wants to change a part’s price in the

StandardPriceList (copies are present at three different sites) This global transaction must update every copy of record for that part

at the three sites Suppose, site1 and site2 are successfully updated but due to

transmission failure, update does not occur at site3 With failure transparency, either all actions of a transaction are

committed or none of them are committed

Commit protocol Commit Protocol

Ensures data integrity for distributed update operations, the cooperating transaction managers execute a commit protocol

Ensures that a global transaction is successfully completed at all sites or else aborted

Two-phase Commit (an algorithm) Most widely used protocol Coordinates updates in a distributed environment Ensures that concurrent transactions at multiple sites are

processed as though they were executes in same serial order at all sites

How does it work?

How does the protocol work?(1)

Site originating the global transaction sends a request to each of the sites that will process some portion of the transaction.

Each site processes sub-transaction but does not commit the result to database (holds in temporary file).

Each site locks its portion of database being updated. Each site notifies originating site about completion of

sub-transaction. When all sites have responded, originating site initiates

two-phase commit protocol Preparation Phase Final Commit Phase


Preparation Phase

Message is broadcasted to every participating sites, asking whether it is willing to commit its portion of transaction at that site

Each site returns “OK” or “not OK”

Originating site collects all messages.


Final Commit Phase

If all are “OK”, it broadcasts a message to all sites to commit the portion of transaction handled at each site

If one or more sites respond with “not OK”, it broadcasts a message to all sites to abort the transaction

A transaction can fail during the commit phase

Such transaction will be in limbo


A limbo transaction can be identified by polling or by a timeout. With a timeout (no confirmation of commit for a specified time period) its not possible to distinguish between a failed or a busy site.

Polling is expensive in terms of network load and processing time

In case a commit confirmation is not received from one or more sites, the originating sites forces all sites to undo changes by abort message

Improvement Strategies for two-phase commit Two-phase commit protocol is slow due to delays caused by

extensive coordination among sites. Some improvements develop –ed are:

Read-only optimization This approach identifies read-only portions of a transaction and

eliminates need of confirmation messages on them, e.g., a transaction may include reading of data before inserting new data. (inventory balance check before creating new order, so data read can happen without callback confirmation)

Lazy commit optimization This approach allows those sites which can update to proceed to

update and the ones which can not update are allowed to “catch up” later

Linear commit optimization This approach allows each part of transaction (sub-transaction) to be

committed in sequence rather than holding up a whole transaction when sub-transaction parts are delayed from being processed

Concurrency Transparency Concurrency control

When multiple users access and update a database, data integrity may be lost unless locking mechanisms are used to protect data from effects of concurrent updates

Concurrency control is more complex in a distributed database as users are geographically distributed at sites Data are often replicated at several sites

Concurrency Transparency Allows a distributed system to run many transactions Allows each transaction to appear as if it were the only activity in

the system When several transactions are processed concurrently, results

must be same as if each transaction were processed in serial order

Transaction managers at each site must cooperate to provide concurrency control in distributed Database

Concurrency Control Approaches

1st approach - Locking Data retrieved by a user for updating must be locked or denied to

other users until that update is completed or aborted. Transactions can place locks on data resources.

Locks are implemented at the following levels Database Table Block or Page Record level Field level

Lock Types Shared locks (S lock or read lock) Exclusive locks (X lock or write lock) Details on page 474 (Hoffer - 6/e)


2nd approach – Versioning Uses optimistic approach (opposite to pessimistic locking

approach) that most of the time other users do not want the same record

Each transaction is restricted to a view of database as of the time that transaction started and when a transaction modifies a record, the DBMS creates a new record version instead of overwriting the old record.

When there is no conflict (only one user transaction changes database records) the changes are merged directly to database

If two users make conflicting changes, changes made by one of the users are committed to database – usually the earlier time stamped transaction gets priority. Other user is told about the conflict.


3rd approach - Timestamping Every transaction gets a globally unique timpestamp (clock

time when transaction occurred) Ensures that even if two simultaneous events/transactions

occur at different sites, each will get a unique timestamp Timestamp ensures serial order processing of transactions Every record carries timestamp of transaction that last

updated it. A transaction can not process a record until its timestamp

is later than that carried in the record If a transaction timestamp is earlier than that carried in

record, it is assigned a new timestamp and restarted

Query Optimization The decision of DDBMS of how to process a query is affected by

the way user formulates a query and intelligence of the distributed DBMS to develop a sensible plan

A distributed DBMS typically uses the following 3 steps to develop a query processing plan Query decomposition

Query is simplified and rewritten into structured relational algebra form

Data localization Query is transformed from a query referencing data across the

network into one or more fragments which each explicitly reference data at only one site

Global optimization In this final step, decisions are made about the order in which to

execute query fragments, where to move data between sites and where parts of the query will be executed

Query Optimization

One of the techniques of processing a distributed query efficiently is to use semijoin operation. In Semijoin, only the joining attribute is sent to sent from

one site to another rather than sending all selected attributes from every qualified row.

After that then only the required rows are returned Hence there is minimal transfer of data over the network.

A DDBMS also uses a cost model to predict execution time of alternative execution plans.

Reading assignment

Evolution of DDBMS

That’s it for today…!

Distributed Databases. Outline Distributed DBMS (DBMS) Features Detailed insight of its Objectives.

Documents

Transcript of Distributed Databases. Outline Distributed DBMS (DBMS) Features Detailed insight of its Objectives.